End-to-end demo with Keras and Amazon SageMaker

January 29, 2020
In this video, we walk through an end to end demo where we first build an image classification model using Keras. We first train it locally, and then on a managed GPU instance. In the process, we use Amazon SageMaker Debugger to identify possible training issues. Next, we use hyperparameter optimization to improve model accuracy. Finally, we deploy the top model to a real time endpoint, and we configure data capture with Amazon SageMaker Model Monitor. ⭐️⭐️⭐️ Don't forget to subscribe and to enable notifications ⭐️⭐️⭐️ * My notebook: https://github.com/juliensimon/reinvent-workshops/tree/main/aim410 (TF1.15 and TF2.0) * SageMaker Model Monitor notebooks: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker_model_monitor * Script mode in detail: https://youtu.be/x94hpOmKtXM For more content, follow me on: * Medium: https://medium.com/@julsimon * Twitter: https://twitter.com/julsimon

Transcript

Hi everybody, this is Julien from Arcee. In this video, I want to take you through an end-to-end demo of using Keras with Amazon SageMaker, and I will use some of the latest features announced at AWS re:Invent, like SageMaker Debugger, SageMaker Model Monitor, and a few more things. Let's get started.

The first step is to grab the code for this demo, which is on GitHub: open a terminal and clone the repo. I'm using a notebook instance, and I've already cloned it, so I can go into that directory and open the notebook. Next, we make sure we have the latest SDKs, especially the latest SageMaker SDK, as well as `smdebug` and `smdebug_rulesconfig`. Restart the kernel so that the new packages are taken into account, then import the SageMaker SDK to get started.

The purpose of this Keras script is to train a convolutional neural network (CNN) to classify images from the Fashion-MNIST dataset, which has 10 classes and 60,000 training samples. The first thing is to download the dataset, which Keras provides a built-in API for, and save the training set and validation set in a local directory.

Next, let's look at the Keras code. It's vanilla Keras code. I'm grabbing hyperparameters from the command line, which is important for script mode; I'll get back to that in a few minutes. SageMaker also passes environment variables to this code to define the location of the training set, the validation set, where to save the model, and how many GPUs are on the machine. This is tied to script mode, and I have another detailed video on script mode, which is linked in the video description. The script extracts those parameters, loads the dataset, and does basic processing, such as normalizing pixel values to be between 0 and 1 and one-hot encoding the class identifiers. It then builds the convolutional neural network, compiles it with the SGD optimizer, trains, evaluates, and saves the model in TensorFlow Serving format, which SageMaker uses to deploy models.

We can try running this script on the local machine. For me, this means running it on the notebook instance, but it's exactly like running it on your laptop. We train for one epoch, just to verify that the code works.

The next step is to train in local mode using the TensorFlow container. We're still running on the local machine, but this time SageMaker pulls the TensorFlow container to the local machine, loads the Keras script inside it, and trains there. This validates that the code runs fine inside the TensorFlow container before we train on managed infrastructure, which takes time and incurs costs. Again, we train for one epoch for validation, and you can see the script being invoked inside the container.

The next step is to train on managed infrastructure. We upload the training set and validation set to S3 and set up the training job. It's the same TensorFlow estimator as before, but now with a proper instance type, such as a GPU instance; I'm even using a spot instance for a discount. We also set up debugging with SageMaker Debugger by specifying rules to check the training job against, such as loss not decreasing and overfitting. When training starts, debugging jobs fire up as well. The training job saves information in S3, including metrics and tensor data, which the debugging rules inspect on the fly. If a rule is triggered, the debugging job stops, and the training job is halted.
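To make the script-mode conventions concrete, here is a minimal sketch of what such an entry point can look like with TF 2.x. The file names, argument names, and network layout are illustrative assumptions, not the exact script from the repo.

```python
# fmnist-keras.py -- minimal sketch of a SageMaker script-mode entry point for Fashion-MNIST.
# File names, argument names, and the network layout are assumptions, not the repo's exact script.
import argparse
import os

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Hyperparameters are passed by SageMaker on the command line
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--learning-rate', type=float, default=0.01)
    parser.add_argument('--batch-size', type=int, default=128)
    # Data and model locations come from SageMaker environment variables
    parser.add_argument('--training', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
    parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION'))
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    args, _ = parser.parse_known_args()

    # Load the numpy files saved by the notebook (file names are assumptions)
    x_train = np.load(os.path.join(args.training, 'x_train.npy'))
    y_train = np.load(os.path.join(args.training, 'y_train.npy'))
    x_val = np.load(os.path.join(args.validation, 'x_val.npy'))
    y_val = np.load(os.path.join(args.validation, 'y_val.npy'))

    # Normalize pixel values to [0, 1] and one-hot encode the 10 classes
    x_train = x_train[..., np.newaxis] / 255.0
    x_val = x_val[..., np.newaxis] / 255.0
    y_train = tf.keras.utils.to_categorical(y_train, 10)
    y_val = tf.keras.utils.to_categorical(y_val, 10)

    # A small CNN, compiled with SGD
    model = tf.keras.Sequential([
        Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        MaxPooling2D(),
        Conv2D(64, 3, activation='relu'),
        MaxPooling2D(),
        Flatten(),
        Dense(128, activation='relu'),
        Dense(10, activation='softmax'),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=args.learning_rate),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    model.fit(x_train, y_train,
              validation_data=(x_val, y_val),
              batch_size=args.batch_size,
              epochs=args.epochs)
    model.evaluate(x_val, y_val)

    # Save in TensorFlow Serving (SavedModel) format under a numbered version directory
    model.save(os.path.join(args.model_dir, '1'))
```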
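And here is a rough sketch of the notebook side, using SageMaker Python SDK v2 syntax: a TensorFlow estimator on a managed spot GPU instance with two built-in Debugger rules. The entry point name, S3 paths, and hyperparameter values are assumptions.

```python
# Notebook-side sketch: managed spot training with two built-in SageMaker Debugger rules.
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, rule_configs

role = sagemaker.get_execution_role()

estimator = TensorFlow(
    entry_point='fmnist-keras.py',      # hypothetical script name
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',      # earlier steps used 'local' to validate in the container
    framework_version='2.1',            # pick the TF version matching your script
    py_version='py3',
    hyperparameters={'epochs': 20, 'learning-rate': 0.01, 'batch-size': 256},
    use_spot_instances=True,            # managed spot training for a discount
    max_run=3600,
    max_wait=7200,                      # must be >= max_run when using spot
    rules=[
        Rule.sagemaker(rule_configs.loss_not_decreasing()),
        Rule.sagemaker(rule_configs.overfit()),
    ],
)

# Channels point at the datasets previously uploaded to S3 (prefixes are assumptions)
estimator.fit({
    'training': 's3://my-bucket/fashion-mnist/training',
    'validation': 's3://my-bucket/fashion-mnist/validation',
})
```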
We can describe what's going on and look at the debugging rules. For example, the "loss not decreasing" job has been stopped, indicating that the rule was triggered. The "overfit" rule has also been triggered, and the job has stopped. To understand what went wrong, we look at the output information for the training job: list the saved tensor data and use the SageMaker Debugger SDK (`smdebug`) to create a trial from it. We can then inspect the loss values and identify the specific steps where issues occurred, which helps pinpoint problems during the training job.

Next, we run automatic model tuning to explore hyperparameter ranges. We define ranges for parameters like the number of epochs, the learning rate, and architectural parameters, and we specify the metric to optimize, which is validation accuracy. We define the TensorFlow estimator, use script mode, and use spot instances to save money. We set up the tuning job with the metric and the hyperparameter ranges, and train multiple jobs in parallel; SageMaker applies machine learning optimization to pick the next set of hyperparameters to try. While the tuning job runs, we can use SageMaker Experiments to monitor progress and export the tuning jobs to a Pandas DataFrame to view the data, including the hyperparameters that were tried. Once the tuning job is over, we grab the best job, which reached a validation accuracy of a little more than 92%.

For deployment, we define where to capture data using SageMaker Model Monitor, so that inputs and outputs sent to the endpoint are captured, and we deploy the model using the SageMaker SDK. We send 10 random images from the validation set to the endpoint, compare the real labels with the predicted labels, and keep sending traffic to the endpoint to capture data. We can build a confusion matrix to evaluate performance: for example, class 6 has mismatches with class 0, indicating areas for improvement. The captured data is stored as JSON Lines files in S3; opening one of these files shows the input data and the predicted class. The next step would be to train a baseline on the training set and compare real-time data to that baseline to check for data quality issues like data drift. Once you're done, delete the endpoint to stop incurring costs.

That's a complete TensorFlow Keras demo on SageMaker, showing how to train with local mode and on managed infrastructure, with hyperparameter tuning, model debugging, and model monitoring. I hope it was informative. If you have comments or questions, please ask them. Don't forget to subscribe to be notified of future videos. Until next time, bye-bye.
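For reference, here is a minimal sketch of the tuning setup described in the transcript. The ranges, the metric regex, and the job counts are assumptions rather than the exact values from the notebook.

```python
# Sketch of automatic model tuning over the estimator defined earlier.
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

hyperparameter_ranges = {
    'epochs': IntegerParameter(10, 60),
    'learning-rate': ContinuousParameter(0.001, 0.2, scaling_type='Logarithmic'),
    'batch-size': IntegerParameter(128, 1024),
}

# The objective metric is scraped from the Keras training log with a regular expression
objective_metric_name = 'val_acc'
metric_definitions = [{'Name': 'val_acc', 'Regex': 'val_accuracy: ([0-9\\.]+)'}]

tuner = HyperparameterTuner(
    estimator,                      # the TensorFlow estimator defined earlier
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions=metric_definitions,
    objective_type='Maximize',
    max_jobs=20,
    max_parallel_jobs=2,            # SageMaker's Bayesian search picks the next candidates
)
tuner.fit({
    'training': 's3://my-bucket/fashion-mnist/training',
    'validation': 's3://my-bucket/fashion-mnist/validation',
})

# Inspect all jobs as a Pandas DataFrame, then grab the best one
df = tuner.analytics().dataframe()
print(df.sort_values('FinalObjectiveValue', ascending=False).head())
print(tuner.best_training_job())
```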
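And a minimal sketch of the deployment step with Model Monitor data capture enabled. The bucket, prefix, instance type, and variable names (`x_val`, `y_val`, as loaded in the notebook) are assumptions.

```python
# Sketch: deploy the best model from the tuning job with data capture, then send traffic.
import numpy as np
from sagemaker.model_monitor import DataCaptureConfig

capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,          # capture every request and response
    destination_s3_uri='s3://my-bucket/fashion-mnist/datacapture',
)

# Deploy the best model from the tuning job to a real-time endpoint
predictor = tuner.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    data_capture_config=capture_config,
)

# Send 10 random validation images, preprocessed the same way as at training time
indices = np.random.choice(len(x_val), 10)
payload = x_val[indices][..., np.newaxis] / 255.0
response = predictor.predict(payload)
predicted = np.argmax(response['predictions'], axis=1)
print('labels     :', y_val[indices])
print('predictions:', predicted)

# Delete the endpoint when you're done to stop incurring costs
predictor.delete_endpoint()
```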

Tags

Keras, SageMaker, Model Debugger, Hyperparameter Tuning, Model Monitor