AWS DevDays 2020: Deep Dive on Amazon SageMaker Debugger and Amazon SageMaker Model Monitor
March 26, 2020
In this code-only session, we'll explore advanced features of Amazon SageMaker that will help you boost your productivity, as well as the quality of your models. First, we'll see how to detect and fix training issues with SageMaker Debugger. Then, we'll also show you how to monitor models in production, and identify prediction issues such as missing features or data drift. Finally, we'll share some cost optimization tips to help you make the most of your machine learning budget.
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️
Code: https://github.com/juliensimon/awsdevdays2020/tree/master/mls1
For more content, follow me on :
* Medium: https://medium.com/@julsimon
* Twitter: https://twitter.com/juliensimon
Transcript
Hi everyone, welcome to this new session on Amazon SageMaker. If you have any questions, please submit them via the questions pane on the control panel, and I will answer them at the end. A copy of the slides can be found in the handout tab on the control panel, and you will get a copy of the recording in a follow-up email after the event. In this session, I'm going to dive even deeper into SageMaker, with almost zero slides this time. We're going to talk about Amazon SageMaker Debugger and how it helps inspect what's going on during model training. Then we'll look at Amazon SageMaker Model Monitor, which helps you find data quality and prediction quality issues once your models have been deployed.
Okay, so let's jump straight into the notebook. This notebook is available on GitHub, and here I'm going to use the XGBoost algorithm to build a classification model on the direct marketing dataset. I won't dive too deep into the dataset and the problem we're trying to solve, as I'll cover that more in the next session, which focuses on performance and accuracy. Here, we want to inspect and monitor, so it's not so much about getting great accuracy.
In a nutshell, this is a direct marketing dataset. It's a supervised learning problem, classifying customers into two classes: those who accept an offer and those who don't. So it's a yes or no problem. The first step is to download the dataset, extract it, and take a look with pandas. It's a CSV file with a bunch of features and a Y column indicating whether the customer accepted the offer. Then I do some basic preprocessing, but I won't go into detail now because it's not the focus. After preprocessing, I split the dataset and upload everything to S3. So I have a training set, a validation set, and a test set, each with its own location in S3.
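The split itself is plain pandas. Here's a minimal sketch of the shuffle-and-split pattern used in notebooks like this one; the 70/20/10 fractions and the seed are illustrative, not taken from the session:

```python
import pandas as pd

def split_dataset(data, train_frac=0.7, val_frac=0.2, seed=123):
    """Shuffle a DataFrame, then split it into train/validation/test sets."""
    shuffled = data.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled.iloc[:n_train]
    validation = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]
    return train, validation, test
```

Each split is then saved as CSV and uploaded to its own S3 prefix, for example with `sagemaker.Session().upload_data()`.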
Now we want to train a model. If you've worked with SageMaker before, you know this means using an estimator. I'm using a built-in algorithm here, so I'm using the estimator object, which is the generic object for built-in algorithms. First, I grab the name of the container for XGBoost in the region I'm running in, and I configure the estimator. It's quite a mouthful because I tried to fit as much as I could in there, so let's take it step by step.
The bits we probably already know are these: we need to pass the name of the container, selecting the algorithm we want to use. We pass an IAM role to give SageMaker permissions to access S3, pull Docker containers, etc. The session is a technical object. We use file mode to say we want to copy the dataset to the instance before training. We define the output location for the model and specify how much infrastructure we want. Here, we're training on one ml.m4.xlarge instance.
The next bit is about using spot instances. Spot instances are a great way to save money. They've been available on EC2 for a long time and are now available on SageMaker. So, we can say, "Hey, I want to use spot instances for training." We set the maximum training time and the total time, including waiting for spot instances, to control how much time we're willing to wait if spot instances are in high demand.
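Putting these pieces together, the estimator configuration can be sketched as follows. The bucket name and timeouts are placeholders, and the parameter names follow the SageMaker Python SDK v1, which was current at the time of this session:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.amazon.amazon_estimator import get_image_uri

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Name of the built-in XGBoost container in the current region
container = get_image_uri(session.boto_region_name, 'xgboost',
                          repo_version='0.90-2')

estimator = Estimator(
    container,
    role=role,
    sagemaker_session=session,
    input_mode='File',                    # copy the dataset before training
    output_path='s3://my-bucket/output',  # placeholder bucket
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    # Managed spot training
    train_use_spot_instances=True,
    train_max_run=3600,                   # max training time, in seconds
    train_max_wait=7200,                  # max total time, incl. waiting for spot
)
```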
The next bits are new. Let's look at the debugger configuration first. This is an object from the SageMaker Debugger SDK, and it enables SageMaker Debugger for this training job. SageMaker Debugger saves model information as tensors, the multi-dimensional arrays that hold the state of the model: parameters, gradients, and weights. This model state is saved periodically during the training job and stored in S3. Later, we can look at it to understand what happened in the training job, potentially looking for bizarre conditions or plotting metrics.
We define collections for different tensors. We have metrics, and predefined names like feature importance, which tells us which features in the dataset contribute the most to the predicted outcome by XGBoost. We set the save interval, and here I'm saving all steps, literally saving everything. This is how you configure the saving part of the debugger. But it's not just about saving; we can define rules to check for unwanted conditions during the training job. There's a list of built-in rules in the documentation, and you can add your own. For example, I'm using a built-in rule for class imbalance because this dataset is very imbalanced, about 8 to 1. Building classifiers for imbalanced datasets is more difficult, so I want to ensure the training job isn't affected by this imbalance.
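A sketch of what this hook and rule configuration can look like with the SageMaker Python SDK; the collection names and the save interval mirror what's described above, and the S3 path is a placeholder:

```python
from sagemaker.debugger import (CollectionConfig, DebuggerHookConfig,
                                Rule, rule_configs)

# Save metrics and feature importance tensors at every step
hook_config = DebuggerHookConfig(
    s3_output_path='s3://my-bucket/debug',   # placeholder
    collection_configs=[
        CollectionConfig(name='metrics', parameters={'save_interval': '1'}),
        CollectionConfig(name='feature_importance',
                         parameters={'save_interval': '1'}),
    ],
)

# Built-in rule: flag class imbalance during training
rules = [Rule.sagemaker(rule_configs.class_imbalance())]

# Both objects are then passed to the estimator:
#   Estimator(..., debugger_hook_config=hook_config, rules=rules)
```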
Once the job is configured, I set some hyperparameters. I'm being reasonable here, not setting any crazy ones because I'm not trying to optimize for a high-performance model. I'm just showing the new capabilities. If you're curious about optimizing hyperparameters, that's the next session. Then I call fit, and the training job starts. We see the usual log output, including the start of the training job, launching instances, and new stuff like debugger rule status, such as class imbalance in progress.
Based on the configuration, SageMaker fires up a parallel job for the debugging rule. If we configured multiple debugging rules, we would see multiple debugging jobs. These jobs look at the tensors saved in S3 in real-time and check for the conditions they're set up for. If a rule is triggered, the debugging job stops, and the training job stops as well because there's no point in continuing if something is wrong. This can save time and money, especially for deep learning jobs that train for days.
Once the job is over, whether it completed successfully or not, you can explore the tensors in S3 using the SageMaker Debugger SDK. You find the path where the tensors were saved, create a trial from that, and start exploring your data. You can access specific tensors by name, get all the steps, and plot metrics using matplotlib. For example, we can plot the area under the curve for the training and validation sets over time.
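Exploring the saved tensors is done with the smdebug library. A minimal sketch, assuming the estimator from the training job above; the tensor names depend on the algorithm, and 'train-auc'/'validation-auc' are the typical names for XGBoost:

```python
import matplotlib.pyplot as plt
from smdebug.trials import create_trial

# Path where the debugger stored tensors for the latest job
s3_output_path = estimator.latest_job_debugger_artifacts_path()
trial = create_trial(s3_output_path)

print(trial.tensor_names())   # everything that was saved

# Plot AUC on the training and validation sets over time
for name in ['train-auc', 'validation-auc']:
    tensor = trial.tensor(name)
    steps = tensor.steps()
    plt.plot(steps, [tensor.value(s) for s in steps], label=name)
plt.legend()
plt.show()
```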
We also saved feature importance, which tells us which features contribute most to the prediction. Our dataset has over 60 features due to one-hot encoding. Plotting feature importance shows that features like job and housing are important, which makes sense because they affect how much money a person can spend. This is a great example of model explainability, helping you understand what's going on inside the model.
For more advanced examples, I recommend checking the Amazon SageMaker examples repository on GitHub. It has hundreds of notebooks showing how to use SageMaker in various ways, including deep learning and model pruning. These notebooks can help you understand your models deeply and optimize them.
Now, let's talk about another capability: SageMaker Model Monitor. Model Monitor captures the data sent to your model in production, such as requests to a real-time endpoint, along with the predictions the model returns, and saves everything to S3. We can then run analytics on this data. To enable data capture, we pass a data capture config object when deploying the model. Here, the endpoint is configured to capture 100% of the data, though you can sample down for high-traffic endpoints. We then send traffic to the endpoint with the invoke_endpoint API from boto3, and the captured requests and responses land in S3 for inspection.
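Enabling capture at deployment time can be sketched like this; the destination URI and instance type are placeholders:

```python
from sagemaker.model_monitor import DataCaptureConfig

# Capture 100% of requests and responses (sample down for busy endpoints)
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri='s3://my-bucket/capture',  # placeholder
)

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    data_capture_config=capture_config,
)
```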
In the same repository, there are examples for Model Monitor. One example uses a churn prediction model. We deploy the model, set up data capture, and send it some data. We can then inspect the captured data in S3, which is stored in JSON lines format. We can go further by generating a baseline using the training set, which computes statistics and constraints. This baseline helps us compare incoming data to the training data, detecting issues like missing features, mistyped features, or drifting features.
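To give a feel for the JSON Lines format, here's a simplified parser over a synthetic capture record. The record shape below is modeled on Model Monitor's layout but trimmed down; real records carry more metadata:

```python
import json

# Two simplified capture records in JSON Lines format (illustrative shape)
sample_capture = '\n'.join([
    json.dumps({
        "captureData": {
            "endpointInput": {"data": "42,0,1,0.5", "mode": "INPUT"},
            "endpointOutput": {"data": "0.93", "mode": "OUTPUT"},
        },
        "eventMetadata": {"inferenceTime": "2020-03-26T10:00:00Z"},
    }),
    json.dumps({
        "captureData": {
            "endpointInput": {"data": "17,1,0,0.1", "mode": "INPUT"},
            "endpointOutput": {"data": "0.08", "mode": "OUTPUT"},
        },
        "eventMetadata": {"inferenceTime": "2020-03-26T10:00:05Z"},
    }),
])

def parse_capture(jsonl):
    """Yield (request, response) pairs from a JSON Lines capture file."""
    for line in jsonl.splitlines():
        record = json.loads(line)
        data = record["captureData"]
        yield data["endpointInput"]["data"], data["endpointOutput"]["data"]

pairs = list(parse_capture(sample_capture))
```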
We create a monitoring schedule that periodically checks the captured data against the baseline, looking for discrepancies. If issues are detected, the monitoring schedule alerts us. For example, we can break the data on purpose to test the system, and the monitoring schedule will detect violations, such as incorrect data types. This capability runs in the background, catching issues and pointing you to the problem, saving you time and frustration.
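The baseline and schedule steps can be sketched with the SageMaker Python SDK like this; the S3 paths, schedule name, and instance type are placeholders, and `role` and `predictor` come from the earlier steps:

```python
from sagemaker.model_monitor import (CronExpressionGenerator,
                                     DefaultModelMonitor)
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role=role, instance_count=1, instance_type='ml.m5.xlarge')

# Compute statistics and constraints from the training set
monitor.suggest_baseline(
    baseline_dataset='s3://my-bucket/train/training.csv',  # placeholder
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri='s3://my-bucket/baseline',               # placeholder
)

# Check captured data against the baseline every hour
monitor.create_monitoring_schedule(
    monitor_schedule_name='demo-monitoring-schedule',
    endpoint_input=predictor.endpoint,
    output_s3_uri='s3://my-bucket/reports',                # placeholder
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```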
For cost optimization, I recommend a blog post on my Medium blog that walks through all the steps: from preparing data with managed services like EMR or Glue, to stopping notebook instances when not in use, to using local mode and pipe mode for large datasets. Spot training can save you 60-70% on training jobs, and Elastic Inference can significantly reduce prediction costs compared to full GPU instances. Inferentia is another great option for high-throughput prediction.
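As a quick sanity check on those spot numbers: at the end of a managed spot job, SageMaker logs the training seconds (wall clock) and the billable seconds, and the savings figure is simply the ratio between the two. A tiny illustration with made-up numbers:

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Savings figure reported at the end of a managed spot training job:
    the fraction of training time you were not billed for, as a percentage."""
    return round(100.0 * (1.0 - billable_seconds / training_seconds))

# Illustrative: 1000 s of training billed as 320 s of spot capacity
savings = spot_savings_percent(1000, 320)   # in the 60-70% range noted above
```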
If you want more content, check the SageMaker documentation, AWS blog, my Medium blog, and my YouTube channel. My audio podcast is on Buzzsprout, and I'm always happy to chat and answer questions on Twitter. Feel free to ping me if you need help or resources.
Thanks again for attending. I hope you learned a lot about SageMaker Debugger and Model Monitor. Now we're available to answer your questions. Thank you very much, and I hope you're safe wherever you are. See you soon. Bye-bye.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.