In this video, I'll take you through an end-to-end example of using XGBoost on Amazon SageMaker, and I'll include many of the new features we recently added at re:Invent. Let's get started.
First, grab my code from this GitLab repo; you'll find the notebook there. Just go to the URL, clone the repo, and open the notebook in your favorite Jupyter tool. I'm using a notebook instance on Amazon SageMaker. The first few cells make sure we have the latest SDKs installed, in particular the latest SageMaker SDK, which we need for the newer features. We then restart the notebook kernel so the upgrades take effect in our environment.
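Here's roughly what those first cells look like; a sketch, and you can pin versions if you need reproducibility:

```python
import sys

# Upgrade pip and the AWS SDKs inside the notebook environment.
# Restart the kernel after running this cell so the new versions load.
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install --upgrade boto3 sagemaker
```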
Next, I import the SDKs, including SageMaker, and a few other dependencies. I grab the default bucket to store my data and the region name—standard SageMaker setup. I'll use Boto3 to show you the low-level APIs, so you understand what's happening. You can rewrite this notebook using only the SageMaker SDK, and I might do that later. For now, I want to highlight the low-level APIs and the specific aspects of the newer services.
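The setup boils down to a few lines like these (the `prefix` value is just my naming choice for this demo):

```python
import boto3
import sagemaker

# Standard SageMaker setup: execution role, default S3 bucket, region.
session = sagemaker.Session()
role    = sagemaker.get_execution_role()
bucket  = session.default_bucket()
region  = boto3.Session().region_name

# Low-level clients: control plane (jobs, endpoints) and runtime (inference).
sm   = boto3.client('sagemaker', region_name=region)
smrt = boto3.client('sagemaker-runtime', region_name=region)

prefix = 'sagemaker/autopilot-dm'  # S3 prefix for everything in this demo
```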
Let's grab a dataset. It's a direct marketing dataset with features describing customers and a label indicating whether each customer accepted the marketing offer. It mixes integers, strings, categorical variables, and so on, with over 41,000 rows and 20 features, plus the label.
I'll use SageMaker Autopilot to build a model automatically. You could pass the entire dataset to Autopilot, but I prefer to keep a small part aside for extra validation and testing. I'll split the dataset, using 95% for Autopilot and keeping 5% outside the process. An important note: SageMaker Autopilot currently works with CSV files, so features must be comma-separated. Ensure you save to CSV.
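A minimal version of that split looks like this; the local file names and the semicolon separator are assumptions based on the public version of this dataset, and `y` is its label column:

```python
import pandas as pd

# Load the direct marketing dataset (file name and separator assumed).
data = pd.read_csv('bank-additional-full.csv', sep=';')

# Shuffle, then keep 95% for Autopilot and 5% for our own testing.
train = data.sample(frac=0.95, random_state=123)
test  = data.drop(train.index)

# Autopilot expects comma-separated CSV, so save with to_csv().
train.to_csv('automl-train.csv', index=False, header=True)
test.drop(columns=['y']).to_csv('automl-test.csv', index=False, header=True)
test[['y']].to_csv('automl-test-labels.csv', index=False, header=True)
```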
After splitting, I'll upload the 95% training set to S3 and set up the Autopilot job. I can specify completion criteria, such as the maximum time a candidate can run and the number of candidates to evaluate. I'll stick with the default values, but you can limit the job to fewer candidates for quicker testing. You can also set a maximum job duration to keep it within time constraints.
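Continuing from the setup cells above, uploading the data and declaring the completion criteria might look like this (the caps shown are examples, not service defaults):

```python
# Upload the 95% training split to S3.
train_uri = session.upload_data('automl-train.csv', bucket=bucket,
                                key_prefix=prefix + '/input')

# Optional completion criteria: number of candidates, runtime per
# candidate, and overall job duration (values here are examples).
automl_job_config = {
    'CompletionCriteria': {
        'MaxCandidates': 250,
        'MaxRuntimePerTrainingJobInSeconds': 3600,
        'MaxAutoMLJobRuntimeInSeconds': 4 * 3600
    }
}
```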
I need to define the data location and specify the target column, which is the "y" attribute indicating whether the customer accepted the offer. You can also specify the problem type, such as binary classification, and the evaluation metric, like F1, which is a good choice for imbalanced datasets. I'll use F1 because this dataset is heavily skewed towards the negative class.
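In low-level API terms, that configuration looks like this:

```python
# Input location and target column for the Autopilot job.
input_data_config = [{
    'DataSource': {
        'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://{}/{}/input'.format(bucket, prefix)
        }
    },
    'TargetAttributeName': 'y'   # the label column
}]

output_data_config = {'S3OutputPath': 's3://{}/{}/output'.format(bucket, prefix)}

# Optional: pin the problem type and the metric to optimize.
problem_type = 'BinaryClassification'
automl_job_objective = {'MetricName': 'F1'}  # F1 suits this imbalanced dataset
```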
I'll start the job using the `CreateAutoMLJob` API, passing the job name, input config, output config, job config, objective, and role. The job begins, and we can track its progress with the `DescribeAutoMLJob` API. The job analyzes the data, defines candidate pipelines, applies feature engineering, and performs hyperparameter optimization.
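Here's a sketch of the launch-and-poll pattern, reusing the configuration objects from the previous cells:

```python
import time

auto_ml_job_name = 'automl-dm-' + time.strftime('%d-%H-%M-%S', time.gmtime())

sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      AutoMLJobConfig=automl_job_config,
                      AutoMLJobObjective=automl_job_objective,
                      ProblemType=problem_type,
                      RoleArn=role)

# Poll until the job finishes; the secondary status moves through
# AnalyzingData, FeatureEngineering, ModelTuning, and so on.
while True:
    job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    status = job['AutoMLJobStatus']
    print(status, '-', job['AutoMLJobSecondaryStatus'])
    if status in ('Completed', 'Failed', 'Stopped'):
        break
    time.sleep(60)
```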
Once the data analysis is complete, you can explore the generated notebooks. The Data Exploration Notebook provides information and stats on the dataset, while the Candidate Definition Notebook shows the evaluated pipelines. These notebooks help you understand the model's design and training process.
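The S3 locations of both notebooks come straight out of the describe call:

```python
# Available once data analysis is complete.
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
artifacts = job['AutoMLJobArtifacts']
print(artifacts['DataExplorationNotebookLocation'])
print(artifacts['CandidateDefinitionNotebookLocation'])
```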
After feature engineering, Autopilot moves to model tuning, where it optimizes hyperparameters for the top pipelines. You can use SageMaker Experiments to monitor the ongoing tuning jobs and view their metadata and hyperparameters.
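For example, you can pull the trial data into a dataframe with the SDK's `ExperimentAnalytics` class; the experiment name below is an assumption, derived from the Autopilot job name:

```python
from sagemaker.analytics import ExperimentAnalytics

# Autopilot logs its trials to SageMaker Experiments.
analytics = ExperimentAnalytics(
    experiment_name=auto_ml_job_name + '-aws-auto-ml-job')  # assumed naming
df = analytics.dataframe()  # one row per trial: metadata, hyperparameters, metrics
print(df.shape)
df.head()
```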
Once model tuning is complete, we can list the candidates, sorted by the objective metric, the F1 score in this case. The top model reaches an F1 score of 0.79, which is quite good. Next, I'll deploy this model and use SageMaker Model Monitor to capture incoming data and detect data drift or quality issues.
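Listing the candidates and grabbing the best one is a couple of API calls:

```python
# Candidates sorted by objective metric, best first.
candidates = sm.list_candidates_for_auto_ml_job(
    AutoMLJobName=auto_ml_job_name,
    SortBy='FinalObjectiveMetricValue',
    SortOrder='Descending',
    MaxResults=10)['Candidates']

for c in candidates:
    print(c['CandidateName'], c['FinalAutoMLJobObjectiveMetric']['Value'])

# The best candidate is also returned directly.
best = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']
```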
First, I register the best candidate as a SageMaker model. I then set up a data capture location in S3, define the capture configuration, create an endpoint configuration that includes it, create the endpoint, and wait for it to be in service. Finally, I send some data to the endpoint to check that it works and that data capture kicks in.
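As a sketch, with the instance type and resource names being my picks rather than requirements:

```python
# Register the best candidate as a SageMaker model.
model_name = auto_ml_job_name + '-model'
sm.create_model(ModelName=model_name,
                Containers=best['InferenceContainers'],
                ExecutionRoleArn=role)

# Endpoint configuration with Model Monitor data capture enabled.
epc_name = auto_ml_job_name + '-epc'
sm.create_endpoint_config(
    EndpointConfigName=epc_name,
    ProductionVariants=[{'VariantName': 'variant-1',
                         'ModelName': model_name,
                         'InstanceType': 'ml.m5.xlarge',  # assumed
                         'InitialInstanceCount': 1}],
    DataCaptureConfig={
        'EnableCapture': True,
        'InitialSamplingPercentage': 100,   # capture every request
        'DestinationS3Uri': 's3://{}/{}/capture'.format(bucket, prefix),
        'CaptureOptions': [{'CaptureMode': 'Input'},
                           {'CaptureMode': 'Output'}]
    })

# Create the endpoint and wait until it's in service.
ep_name = auto_ml_job_name + '-ep'
sm.create_endpoint(EndpointName=ep_name, EndpointConfigName=epc_name)
sm.get_waiter('endpoint_in_service').wait(EndpointName=ep_name)
```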
I'll use the test dataset to compute the F1 score manually. I'll loop through the samples, invoke the endpoint, and score each prediction. This process takes about 30 seconds, and I'll print a confusion matrix. The results show a precision of 63%, recall of 95%, and an F1 score of 76%, which is close to the 0.79 from Autopilot.
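The scoring loop is straightforward; the file names come from the split we saved earlier, and the `'yes'` positive label matches this dataset:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, f1_score

samples = pd.read_csv('automl-test.csv')              # features only
labels  = pd.read_csv('automl-test-labels.csv')['y']  # ground truth

predictions = []
for _, row in samples.iterrows():
    # One CSV line per request; the endpoint returns the predicted label.
    response = smrt.invoke_endpoint(EndpointName=ep_name,
                                    ContentType='text/csv',
                                    Body=','.join(map(str, row.values)))
    predictions.append(response['Body'].read().decode().strip())

print(confusion_matrix(labels, predictions))
print(f1_score(labels, predictions, pos_label='yes'))  # 'yes' = accepted offer
```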
I can inspect the captured data to make sure capture is working. The capture files contain both the input and output of each request, and you can configure which requests get captured and at what sampling percentage. For more information on SageMaker Model Monitor, check the extra notebooks linked in the video description.
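Peeking at a capture file is just an S3 read; the keys under the capture prefix are organized by endpoint, variant, and date:

```python
# Capture files are JSON Lines: one record per request, holding the
# request input, the model output, and event metadata.
s3 = boto3.client('s3')
objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix + '/capture')
key = objects['Contents'][0]['Key']
body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode()
print(body.split('\n')[0])   # first captured record
```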
That's it for this end-to-end example, from dataset to a trained and monitored model using SageMaker Autopilot, Experiments, and Model Monitor. If you have questions, leave them in the comments. Don't forget to subscribe to the channel, and I'll see you soon with another video. Bye-bye.