NEW Amazon SageMaker Studio: Debug Models with Amazon SageMaker Debugger

December 13, 2019
In this video, I show you how to use Amazon SageMaker Debugger to capture and inspect debugging information for a Keras model.

⭐️⭐️⭐️ Don't forget to subscribe and to enable notifications ⭐️⭐️⭐️

* Notebook: https://gitlab.com/juliensimon/amazon-studio-demos/blob/master/debugger.ipynb
* Blog post: https://aws.amazon.com/blogs/aws/amazon-sagemaker-debugger-debug-your-machine-learning-models/
* SageMaker Debugger docs: https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html
* SageMaker Debugger examples: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger
* SageMaker SDK: https://sagemaker.readthedocs.io/en/stable/estimators.html

Follow me on:
* Medium: https://medium.com/@julsimon
* Twitter: https://twitter.com/julsimon

Transcript

Hi, this is Julien from Arcee. In this video, I'd like to talk about one of the new capabilities we launched at re:Invent 2019, Amazon SageMaker Debugger, which is one of the most exciting ones if you ask me. SageMaker Debugger gives you the ability to pinpoint and understand strange training issues that may be happening during your training jobs on SageMaker. It is compatible with the built-in XGBoost algorithm as well as the TensorFlow, Apache MXNet, and PyTorch built-in frameworks. There's a whole lot of information in the documentation, and we'll get back to that in a minute, but let's not waste any time and look at some code. Here, I'm going to show you an example based on Keras, the high-level API in TensorFlow. What I'm going to show you is very similar to what you would do with the other algorithms and frameworks.

Okay, so here I'm trying to classify the Fashion-MNIST dataset using a simple convolutional neural network. I will put the link to this notebook in the video description. Here's the code, and if you're used to Keras, you know this is really vanilla Keras. The only change is the few lines that implement script mode, which is the default way of training framework code on SageMaker. I have another video out there on script mode, so look for that one if you want to know more, but it's not really important here. Then I just grab the dataset, do some very basic normalization and one-hot encoding, build my model, train it, and save it. This is really vanilla Keras code; there's nothing specific here with respect to debugging. I'm using this exact same code, and we'll see how it works with Debugger. If I were using a lower-level API in TensorFlow, working with the `tf.estimator` object or the low-level API, I would have to add some extra code. I show how to do this in the blog post that I wrote for the launch; I will put the URL in the video description. It walks through a lower-level example where you can customize exactly how you want Debugger to work. But here, I want to keep it super simple. It's an introduction, and this is really unmodified Keras code.

Okay, so I upload the dataset to S3 and then, as usual, configure the SageMaker estimator for TensorFlow. I think by now you're getting used to this: pass the script, define how much infrastructure I need. The only change is that I'm going to pass some rules. Rules are built-in, or custom rules you can write yourself, and each rule checks for a specific condition during the training job. Here, for example, I want SageMaker Debugger to watch for loss not decreasing, which would probably mean that my training job is not learning properly, and I want to check for overfitting, meaning it's probably learning too well and won't generalize to new data. To use this, you just need to import those new objects, `Rule` and `rule_configs`, from `sagemaker.debugger`. Make sure you have the latest SageMaker SDK; otherwise, you might be missing them. Then I call `fit` to get the training job going. If I look at the training log, I see the training job starting; I would also see it in the SageMaker console if I looked for it. I also see extra lines telling me Debugger has launched debug jobs, one for each rule that I configured. These actually run on another SageMaker capability called SageMaker Processing, which lets you run feature engineering or model evaluation jobs, and which we also use internally for debugging jobs.
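As a rough illustration of that estimator configuration, here is a minimal sketch using the SageMaker Python SDK. The script name, S3 path, instance type, and framework versions are placeholders, and exact parameter names vary between SDK versions (for example, SDK v1 uses `train_instance_type` instead of `instance_type`), so treat this as a sketch rather than the exact code shown in the video.

```python
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, rule_configs

# Built-in rules: each one launches its own rule evaluation job
# (a SageMaker Processing job) alongside the training job.
rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.overfit()),
]

estimator = TensorFlow(
    entry_point="fmnist-keras.py",        # placeholder script name
    role=sagemaker.get_execution_role(),
    instance_count=1,                     # SDK v2 parameter names
    instance_type="ml.p3.2xlarge",        # placeholder instance type
    framework_version="1.15",             # adjust to your TensorFlow version
    py_version="py3",
    rules=rules,                          # Debugger rules attached to the job
)

# Placeholder S3 location for the Fashion-MNIST data uploaded earlier
estimator.fit({"training": "s3://my-bucket/fashion-mnist"})
```

While the job runs (or once it finishes), `estimator.latest_training_job.rule_job_summary()` reports the status of each rule evaluation job, which matches what the Studio experiments pane shows.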
So how does that work? The training job automatically saves all kinds of tensors to Amazon S3: metrics, weights, and more generally the parameters of the model as it trains. For Keras, a number of things are saved by default. If you work with lower-level APIs, you can specify the tensors and the collections of tensors that you want to save, for example all weights, all gradients, and extra parameters; you can configure this very finely. The training job saves this information to S3, and those two debug jobs read the tensors as they become available, apply the rule they've been configured for, and start looking for issues like the loss not decreasing or overfitting. There are many more rules available, and I can show you the list here. These are the built-in rules: dead ReLU, exploding tensor, poor weight initialization, saturated activation, vanishing gradient, all zero, class imbalance, confusion, loss not decreasing, and so on. For each of them, you get lots of information. For example, if we click on vanishing gradient, this rule looks for gradients going to zero, meaning your weights are hardly getting updated anymore and your training process is living dead: it's still running but not updating anything, so you're just wasting CPU or GPU time. You have examples and documentation, and there are cool notebooks in the Amazon SageMaker examples repository on GitHub showing different libraries, configurations, and even an example of custom rules.

So, how did we do here? We wait a few minutes for the job to complete. If I go to the Experiments pane and refresh, that's the last job, so I can open its details. I can see metrics, training parameters, artifacts showing where the data and model are stored, and extra AWS settings. I can also see the debugging status here. The loss-not-decreasing debug job did not find any problems, so the loss steadily decreased over the epochs. Overfitting did not happen either, so no worries. I can see extra information, including the ARNs of the actual SageMaker Processing jobs that were run. That's all good.

Now, let's keep exploring a bit. I could describe the training job and find the same information we just saw in SageMaker Studio: no issues found by those two debug jobs. But let's keep looking. This is the location where the debugging information has been stored. As the training job ran, it saved metrics and debug information to S3. If we list that location, we see a lot of stuff. It's not human-readable, but with the new `smdebug` SDK we can easily create a trial from that collection of debug information. We can see the location of the debug information, the steps, and the tensor names. We see the collections: weights, biases, gradients, losses, metrics, and so on. All of that is configurable and available. We can look at metrics, specifically the validation loss, and print its values over the different steps. We can see the validation loss decreasing, so the training job looks fine. We could keep exploring, but this is the bulk of the service: configuring rules, passing them to the estimator, and using the `smdebug` SDK to load and inspect the debug information. For example, if you configure the exploding tensor rule and it's triggered, it emits a CloudWatch Events event, so you could react automatically. You would then use the `smdebug` SDK to look at specific tensors and find out which layer or operator is causing the problem.
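Here is a rough sketch of that inspection step with the `smdebug` library, assuming the `estimator` from the earlier sketch. The `val_loss` tensor name is the usual Keras metric name and is an assumption; check what `tensor_names()` actually returns for your job.

```python
from smdebug.trials import create_trial

# S3 prefix where the training job wrote its debug output;
# the SageMaker SDK can return it directly from the estimator.
s3_output_path = estimator.latest_job_debugger_artifacts_path()

trial = create_trial(s3_output_path)

print(trial.tensor_names())                      # every tensor that was saved
print(trial.tensor_names(collection="metrics"))  # just the metrics collection

# Print the validation loss at each saved step
# ("val_loss" is assumed here; adjust to the names listed above).
val_loss = trial.tensor("val_loss")
for step in val_loss.steps():
    print(step, val_loss.value(step))
```

The same trial object works whether the data sits in S3 or in a local directory, which is handy for poking at a problematic job from a notebook after the fact.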
Once again, take a look at the examples, the documentation, the SDK, and the blog post, which walks through a lower-level TensorFlow example. That's it for SageMaker Debugger, again one of my favorite launches from re:Invent this year. See you around!

Tags

Amazon SageMaker Debugger, Machine Learning Debugging, TensorFlow Keras Example, SageMaker Processing Jobs, Built-in Debugging Rules