Okay, so the data analysis step is over, and now we move on to feature engineering. As soon as this step is complete, we can see the two generated notebooks. Let's take a look. We'll start with the data exploration notebook. As the name implies, this generated notebook provides information about the dataset. We see how many rows were found in the training set. We told SageMaker Autopilot to use the y column as the target, and the problem was identified as binary classification, which is correct. Here's a data sample, very similar to what we saw in my own notebook. We also see whether any columns have missing values. Those bluish boxes suggest what you should do, or could do, to improve the quality of the dataset. In this case, it's a very simple and clean dataset, so there isn't much to do. However, you could certainly be working with much messier datasets. For example, you could add a step to the feature preprocessing pipeline to fill missing values using `sklearn.impute.SimpleImputer`. All those blue boxes are there to draw your attention to techniques and tips that could improve the dataset. We also get extra statistics, such as the number of unique entries per column. Again, this is information you would typically gather manually, but it's neat to have it done automatically. Basic stats like mean, median, and max are provided as well; these are exactly what a data scientist or machine learning engineer would review to understand a dataset. This is just the beginning, and more will likely be added over time.
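To make that missing-value suggestion concrete, here is a minimal sketch of filling NaNs with `SimpleImputer` from scikit-learn (the toy matrix below is made up for illustration; note the module path is `sklearn.impute`):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with one missing value in each column.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

# Replace each NaN with the per-column median learned from the data.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)
print(X_filled)  # no NaNs remain
```

In a real pipeline you would `fit` the imputer on the training set only and `transform` the validation and test sets with the same learned medians, to avoid leaking information.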
Let's look at the other notebook, the candidate generation notebook. This one shows the different candidates designed by SageMaker Autopilot to work on the problem. Again, you'll see lots of blue boxes with tips. We can see here that we can download the generated data transformation modules, which are the code SageMaker Autopilot generated to transform the data and improve feature quality, hopefully leading to a better model. These are stored in S3, and you can run all of this yourself. We see candidates with names like DPP0, which likely stands for data preprocessing. This first candidate is based on preprocessing script 0 and the XGBoost algorithm. We can see the data transformation steps, such as a robust imputer and a threshold one-hot encoder, which are SageMaker extensions built on top of scikit-learn objects. All of this is open source and available on GitHub, so you can check exactly what each object does.
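The idea behind a "threshold" one-hot encoder is that it only creates columns for categories that appear at least some minimum number of times, so rare values don't blow up the feature space. This is a rough, hand-rolled sketch of the concept in plain pandas, not the actual SageMaker extension class (see the sagemaker-scikit-learn-extension repository on GitHub for the real implementation):

```python
import pandas as pd

def threshold_one_hot(series: pd.Series, min_count: int = 2) -> pd.DataFrame:
    """One-hot encode only the categories seen at least `min_count` times."""
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    # Categories below the threshold are masked out: their rows get all zeros.
    filtered = series.where(series.isin(frequent))
    return pd.get_dummies(filtered, prefix=series.name)

colors = pd.Series(["red", "blue", "red", "green", "red", "blue"], name="color")
encoded = threshold_one_hot(colors, min_count=2)
print(encoded.columns.tolist())  # 'green' appears once, so it gets no column
```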
We see this for all the candidates. For example, one candidate uses a robust imputer, one-hot encoding, robust PCA, and a robust standard scaler, trying different transformation techniques to find a set of features that trains a better model, again with XGBoost. Another candidate is based on Linear Learner, another SageMaker built-in algorithm that can handle binary classification. There's a lot to read here, and we can see that Autopilot will fire up 250 tuning jobs to tune these different algorithms and candidates, aiming to build the best possible model. SageMaker Autopilot uses advanced settings in the hyperparameter tuning module, including multi-algorithm tuning. You can run this code to see how it works.
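Conceptually, each candidate is a preprocessing chain feeding a model. The sketch below approximates that shape with a plain scikit-learn `Pipeline` on synthetic data; `GradientBoostingClassifier` stands in for XGBoost, and the standard imputer, PCA, and scaler stand in for SageMaker's "robust" variants (everything here is a placeholder, not the generated Autopilot code):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the real dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Imputer -> PCA -> scaler -> model: the same shape as one Autopilot candidate.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("pca", PCA(n_components=5)),
    ("scale", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=42)),
])
pipeline.fit(X, y)
print(f"training accuracy: {pipeline.score(X, y):.2f}")
```

The point of trying several such chains, as Autopilot does, is that the best preprocessing depends on the data and the algorithm, so each candidate pairs one transformation recipe with one model family.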
We use SageMaker Experiments to collect training information and metadata on all SageMaker jobs, whether they are training jobs or data preprocessing jobs. There's an SDK for this, and we use it to find the best model, which is actually the best pipeline: a combination of a preprocessing job and the actual model. The preprocessing runs on SageMaker Processing, a newer capability that lets you run feature engineering or model evaluation jobs. The pipeline thus includes the preprocessing job plus the actual model training and optimization. We can deploy this model, and this is what SageMaker Autopilot does under the hood. While you could ignore this notebook and trust that Autopilot is doing the right thing, for some developers it's critical to understand how the model was built and how the data was preprocessed. This lets them judge whether they would have done the same thing and whether it makes sense to use this model. Developers can also tweak the preprocessing jobs based on their domain knowledge and expertise, focusing on promising candidates and keeping the process efficient.
Now, as we can see, we're in the feature engineering step. SageMaker Autopilot is applying those preprocessing scripts to our dataset, preparing 10 different versions of the dataset that will be passed on to model training and tuning. This might take a little while, so let's wait for feature engineering to complete. I'll see you in a moment.