AWS DevDays 2020 Build Models Automatically with Amazon SageMaker
March 26, 2020
In this session, we'll show you how to build high performance models. First, we'll show you how Automatic Model Tuning eliminates the need to manually pick the "right" training parameters (aka hyperparameters), a critical step in training high accuracy models. Going one step further, you'll learn about Amazon SageMaker AutoPilot, a recent capability that automatically creates the best model for your dataset (data preprocessing, feature engineering, algorithm selection, and model tuning).
Code: https://github.com/juliensimon/awsdevdays2020/tree/master/mls2
For more content, follow me on:
* Medium: https://medium.com/@julsimon
* Twitter: https://twitter.com/juliensimon
Transcript
Hi everyone! Welcome to this session on building models automatically with SageMaker. If you have any questions, you can submit them in the questions pane on the control panel, and I will answer them at the end. You can find a copy of the slides in the handout tab on the control panel as well, and you will get a copy of the webinar recording in a follow-up email after the event. Okay, let's get started!
As you probably know by now, SageMaker is a fully managed machine learning service that lets you go quickly from experimentation to production using a modular service and modular APIs. By modular, we mean you pick whatever you like. If you really want to go from A to Z using SageMaker, that's perfectly okay. That works fine. That's what I'm going to do in the demo later on. But maybe you already have a model and you only need to deploy it on SageMaker. So that's fine. You can just call the deployment API and get that done. Maybe you want to train on SageMaker and deploy on your local machine. So fine, you can do that as well. It's not a siloed service where you have to go from A to Z and be locked in. You can really use whatever part of the service you like best. Okay? And if you look at this slide, there's a bit here that says optimization, right? Optimizing models. This is one of the two things we're going to cover today: how to build models and tune them automatically for the best performance, exploring hyperparameter ranges automatically. The second thing we're going to look at, which is even more powerful, is building and tuning models completely from scratch. Just provide your dataset, and SageMaker, specifically SageMaker Autopilot, will go and train and optimize that model for you. Let's get started.
The first thing I want to cover is model tuning. So how do you extract as much performance as possible from your model? Well, I already mentioned hyperparameters, so I guess I need to explain. Hyperparameters are training parameters, algorithm parameters that you need to set. If you work with XGBoost, which is what I'm going to use in my demo, a popular open-source library, you have plenty of hyperparameters such as tree depth, max leaf nodes, and then really weird ones like gamma, lambda, alpha. All of them have an influence on the outcome, so you can't just set them at random, right? And it's not always easy to understand which ones are the really important ones, the few you should really focus on while the others just use default values. If you work with neural networks, it's probably worse: how many layers do you need? How wide should these layers be? What's the learning rate for your job? And if you use embeddings for maybe natural language processing, you have so many extra parameters to worry about. And of course, the combinations get crazy. There's no way you're going to be able to explore all of it. The chances that you're going to train that one golden job with the right set of hyperparameters are very, very slim. So we need a proper strategy to find those hyperparameters.
Before we talk about that, let me remind you how we set hyperparameters in SageMaker. If you work with built-in algos like LinearLearner or KMeans, then you have a dedicated estimator for those built-in algos. The only thing you have to do is call the `set_hyperparameters` API on that estimator. You have an example here for XGBoost as well. So super simple. If you work with a built-in framework like TensorFlow, MXNet, etc., then you pass a Python dictionary with the hyperparameters to the estimator. So if you use TensorFlow, that's the example here, you will pass that hyperparameters Python dictionary to the TensorFlow estimator. If you're using a built-in framework, you're probably using script mode as well, and that's really the preferred way to use those frameworks. This means the code, so let's say the TensorFlow code that you're training here, must be able to accept those hyperparameters as command-line arguments. That's the interface between the estimator and your code. Finally, if you use your own container, so anything else, your own code, then again, the hyperparameters will be passed to the estimator as a Python dictionary, and that dictionary is going to be copied inside your container automatically as a JSON file, so your code just has to read that `hyperparameters.json` file to grab the hyperparameters and then apply them to your own code. So it's not difficult to access those hyperparameters in your code.
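To make this concrete, here is a minimal sketch of the three cases. It uses SageMaker Python SDK v1-style parameter names (the version current when this session was recorded); the script name, instance types, and hyperparameter values are just placeholders, not the exact code from the demo.

```python
import json
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()

# Case 1 - built-in algorithm: call set_hyperparameters() on the estimator.
# ('xgb' stands for an Estimator built from the XGBoost container, as in the demo later on.)
# xgb.set_hyperparameters(objective='binary:logistic', num_round=100)

# Case 2 - built-in framework in script mode: pass a plain Python dictionary;
# the values are handed to train.py as command-line arguments (parse them with argparse).
tf_estimator = TensorFlow(
    entry_point='train.py',               # your own training script
    role=role,
    script_mode=True,
    framework_version='2.1.0',
    py_version='py3',
    train_instance_count=1,               # v1-style names ('instance_count' in v2)
    train_instance_type='ml.p3.2xlarge',
    hyperparameters={'epochs': 10, 'learning-rate': 1e-3, 'batch-size': 256},
)

# Case 3 - your own container: the same dictionary is copied into the container
# as a JSON file, which your code simply reads back.
def read_hyperparameters(path='/opt/ml/input/config/hyperparameters.json'):
    with open(path) as f:
        return json.load(f)
```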
Now let's talk about the different tactics to find the right parameters. The first one is manual search, aka "I know what I'm doing." So I'm just going to go and select those hyperparameters myself and voilà. Unfortunately, most of us, including myself, don't really know what we're doing. Picking hyperparameters just like that is not a guaranteed way to get high-performance models. It's very easy to convince yourself of that. Just run a few training jobs, see what accuracy you get, and then try the other tactics, and you'll see that manual search is slow and really, you're guessing more than anything else. The second option is called grid search. Grid search splits the hyperparameter space into different areas and explores those areas systematically. If we try them all, then there's a good chance we're going to hit the area where models perform best. So that does work. Unfortunately, you end up training hundreds and hundreds of models, which is slow and expensive. If you have a very short training job, then sure, you can run 1,000 or 10,000. But even then, it's probably not the best option. The third option is random search, which I call "spray and pray" because random is exactly that. You might think, "How could random be more efficient than grid? Grid search is scientific. Random search is just rolling dice." Well, yes, but in fact, it does perform better. You don't have to believe me; you can read this research paper by Bergstra and Bengio, and Bengio is a Turing Award winner. They showed that this actually works better than grid search, so you just get to high-performance models faster and cheaper. The problem I have with random is that if I have to convince a customer that this is the way to go, it's a long shot. It's a little difficult to explain that this is the model they should deploy because you randomly selected hyperparameters.
The fourth option, and my preferred one, is to use hyperparameter optimization, where we literally use machine learning to predict hyperparameters. Initially, we'll pick some random values and then observe the results from those models. We'll apply two machine learning algorithms called Gaussian process regression and Bayesian optimization to predict the set of hyperparameters that should be tried next. We're not guessing; we're really applying statistical analysis to decide where to look next. This will converge faster to high-performance models, so you end up training fewer models and spending less money. If you want to know more about those algorithms, we have details at this URL. But again, you don't really need to understand the details here. How do we use the API? Well, we create an estimator the usual way and then define the metric to tune on. If you use a built-in algo or built-in framework, we have predefined metrics like accuracy, area under curve, F1, that kind of thing. But if you have a custom metric in your training log, you can absolutely use it; you just have to provide a regular expression that will help SageMaker extract the actual value. Anything that is logged to the training log can be used as an optimization metric. The next step is to define parameter ranges to explore, so basically, you know what to look for. Then you put everything together: the estimator, the metric, the parameters, and tell SageMaker how many jobs you want to run and how many to run in parallel. You set the strategy. Bayesian is the default one and the one we recommend, but you can also use random search. Customers usually do that as a baseline to see that Bayesian is actually better. Then you just call `fit` on this hyperparameter tuner, so this isn't really hard, and we'll see in a demo in a few minutes.
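As a sketch of what that looks like with the SDK, here is a tuner set up for a hypothetical custom metric: the only extra work is the regular expression that pulls the value out of the training log. The `estimator` variable, the metric name, the log format, and the ranges are all illustrative assumptions.

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# The training script is assumed to print a line like:  val_f1: 0.8731
metric_definitions = [{'Name': 'val_f1', 'Regex': 'val_f1: ([0-9\\.]+)'}]

tuner = HyperparameterTuner(
    estimator,                                # any SageMaker estimator
    objective_metric_name='val_f1',
    objective_type='Maximize',
    metric_definitions=metric_definitions,
    hyperparameter_ranges={'learning-rate': ContinuousParameter(1e-5, 1e-1)},
    max_jobs=30,
    max_parallel_jobs=2,
    strategy='Bayesian',                      # the default; 'Random' is also available
)
# tuner.fit({'train': train_channel, 'validation': validation_channel})
```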
Summing things up, this is the workflow: you start from your client. I'm going to use notebooks, but you could use your IDE or even the console if you want. You create a hyperparameter tuning job with a certain strategy, and it fires up a number of initial jobs. Those jobs log metrics, so this is the accuracy, let's say, that each job achieved with that set of hyperparameters. Then the tuning strategy applies optimization to those results and predicts the next set of hyperparameters that need to be explored. It does this again and again until it has hit the number of jobs that you specified. You can see the metrics and, of course, the hyperparameters that were set for each job. While this is running, you can actually view it in the AWS console. You can list all the jobs, inspect them, see the best training job. You can query the job status with the SageMaker SDK and deploy the best job simply by calling `deploy` on the hyperparameter tuner. If the tuning job is over, this deploys the best job overall; if it's still running, it deploys the best job so far, which is sometimes useful when you have a very long-lasting tuning job and still want to run some tests before it completes. So that could be a shortcut if you wanted to take a quick look.
All right, let's do a demo. So let's move on to this notebook. You can find this notebook on GitHub; this is the URL here. It's my beloved direct marketing example, which I've used again and again, but I think it's quite simple to understand. It's a supervised learning problem where we're trying to classify customers into two classes: customers who accept a marketing offer and customers who don't. So it's a yes or no kind of thing. We're using XGBoost here to build this classifier. First, I'm updating my SDKs, making sure I have the latest version for everything. Then I download the dataset, extract it, and I can see it's a CSV file. So we see this is the first sample. We have, I think, 20 features, and we have a label here. That column is called `y`, and it says yes or no, did this customer accept the offer? We have many more no's than yes's. So that's a CSV file. We'd rather look at it using pandas, a Swiss Army knife for machine learning. So that's what I'm doing here, loading it in pandas, and I can view it in a more pleasant way. We see all the features, with that `y` label at the end. The shape shows I have a little more than 41,000 samples and 21 columns. So 20 features and the label. I can count the number of yeses and nos, and I can see the dataset is unbalanced, about 8 to 1. So 8 times more no's than yes's, which makes sense because most people are not interested in that marketing offer.
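In case you want to follow along, that first look at the data boils down to a few lines of pandas. The file path below is where the standard direct marketing sample usually extracts to, so treat it as an assumption.

```python
import pandas as pd

data = pd.read_csv('./bank-additional/bank-additional-full.csv')
print(data.shape)                  # a little more than 41,000 rows, 21 columns
print(data['y'].value_counts())    # roughly 8x more 'no' than 'yes'
```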
Now I need to do some simple transformations on the dataset. I'm going to go really quick here because this isn't really the bulk of the session. We have a column called `pdays` that tells us the last time we talked to that customer, and 999 is a placeholder value that says we never talked to that person. We need to remove that because 999 is not 999 days; it's really a placeholder value that could fool the algorithm into thinking it means something else. So we're going to drop this value and replace it with a column called "no previous contact" set to 1 or 0. We're also going to aggregate low-cardinality classes. For example, student, unemployed, and retired are kind of low numbers here. They don't have a job, basically, so we could put all of them in the same "not working" category. So again, creating a column here to cluster those three classes. Then I'm one-hot encoding all the categorical variables. There are plenty, right? You can see job, marital status, education, etc. All of these are categories, so I need to get rid of that. I use one-hot encoding, which basically creates different dimensions for all these categorical variables. Let me scroll back a bit. So, for example, we see days of the week have become categories, and jobs have become categories, etc., etc. Basically, creating new dimensions for all those different categories. All right, so now I end up having 66 columns instead of 21. But this is actually helping the model by understanding the different dimensions of the problem. All right, now I need to split the dataset for training, validation, and testing. There's a nice API in NumPy to do that. So 70% goes to training, 20% goes to validation, and 10% goes to testing. So now I end up with three CSV files, which I upload to S3 and define their locations as S3 prefixes because that's what I'm going to pass to the estimator. Okay, so basic preprocessing. You can read the notebook again for details. But okay, now we have a dataset to work with.
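Here's a condensed sketch of those transformations, assuming the `data` DataFrame loaded above and the column names from the standard direct marketing dataset; the bucket, prefix, and random seed are placeholders rather than the exact values from the notebook.

```python
import numpy as np
import pandas as pd
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'xgboost-direct-marketing'            # hypothetical S3 prefix

# 999 in 'pdays' means "never contacted": drop the magic value, keep an indicator instead.
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)

# Aggregate low-cardinality job categories into a single 'not_working' indicator.
data['not_working'] = np.where(data['job'].isin(['student', 'retired', 'unemployed']), 1, 0)

# One-hot encode the remaining categorical variables (job, marital, education, ...).
model_data = pd.get_dummies(data.drop(['pdays'], axis=1))

# 70% / 20% / 10% split for training, validation, and test.
train_data, validation_data, test_data = np.split(
    model_data.sample(frac=1, random_state=123),
    [int(0.7 * len(model_data)), int(0.9 * len(model_data))],
)

# Built-in XGBoost expects CSV with the label first and no header; upload to S3.
# (test_data is kept aside for scoring later.)
for name, df in [('train', train_data), ('validation', validation_data)]:
    cols = ['y_yes'] + [c for c in df.columns if c not in ('y_yes', 'y_no')]
    df[cols].to_csv(f'{name}.csv', index=False, header=False)
    sess.upload_data(f'{name}.csv', bucket=bucket, key_prefix=f'{prefix}/{name}')
```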
So how do we launch that tuning job? First, I'm going to create my estimator. I'm going to grab the name of the XGBoost container for the region I'm running in. This container parameter is really a Docker image name. I'm going to create my estimator, passing the container, the role so that SageMaker can access and pull the container and do some other things. I'm passing some information on the data. I want to use File input mode, so basically, I want to copy the dataset to the training instance. The alternative would be Pipe mode, which would stream data, but this is a really tiny dataset, so streaming doesn't make sense. This is where to save the model. This is how much infrastructure I want to train on, so one ml.m4.xlarge instance. And I want to use a Spot instance to save some money, so I also enable Spot instances here. This is a good way to save 60 to 70%. So basically, I'm saying, "Hey, I want to use Spot. My job should not run for more than five minutes. It's a very fast one." And I don't want to wait for more than 600 seconds for Spot instances plus the actual training time. So this parameter lets you specify how long you're ready to wait for Spot instances. I set some basic hyperparameters, the static ones I should say. The ones that say, "Hey, I just want to build a classification model and I want to train for 100 rounds." That's about it. So you can set some static ones like that. And then, of course, you can set the ones that you want to explore. So here I want to explore `eta`, `min_child_weight`, `alpha`, and `max_depth`. Some of them are floating-point values, between 0 and 1, between 1 and 10, or between 0 and 2, and this one is an integer. So you can have continuous parameters, integer parameters, or categorical parameters, but we don't recommend those; they tend to make optimization a little more difficult.
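Here is roughly what that estimator looks like, again with SDK v1-style parameter names; the container version tag, output path, and timeouts are assumptions, not the exact values from the notebook.

```python
import boto3
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

session = sagemaker.Session()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

# The "container" is really a Docker image name, resolved per region.
container = get_image_uri(region, 'xgboost', repo_version='0.90-2')

xgb = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    input_mode='File',                     # copy the (tiny) dataset to the instance
    output_path=f's3://{session.default_bucket()}/xgboost-direct-marketing/output',
    train_use_spot_instances=True,         # save 60-70% with Spot
    train_max_run=300,                     # the job itself shouldn't exceed 5 minutes
    train_max_wait=600,                    # max wait for Spot capacity + training time
    sagemaker_session=session,
)

# Static hyperparameters: binary classification, 100 boosting rounds.
xgb.set_hyperparameters(objective='binary:logistic', num_round=100)
```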
Okay, so these are the ranges I want to explore. So you could say, "How do we know `eta` should be between 0 and 1?" Well, some of these parameters have default ranges. So `eta` by default is between 0 and 1. So there's no guessing what to explore. For other parameters, you can come up with reasonable ranges. The SageMaker documentation will actually point you at that, and so will the literature. So it's difficult to find the right parameters, but it's not too difficult to find the right ranges to explore. Next, we want to define the metric. So here, let's say I want to optimize for the maximum area under the curve on my validation dataset. Again, this is a built-in metric because here I'm using a built-in algo. I want to maximize this value. Then I put everything together in the hyperparameter tuner: the estimator, the metric, the ranges I want to explore, the objective, all that stuff comes from here. And I'm going to run 40 jobs, 4 in parallel. For best results, you should actually disable parallelism because you will get more opportunities to apply optimization. But if you want to speed things up a bit, then you can run some parallelism. So here I will still get nine shots at optimizing. The first four will be randomly picked, and then the next 36 will be optimized. So I'm going to get plenty of chances to explore, so that should be fine. And then I call `fit` to train. My job fires up, and I could see it in the SageMaker console here. So it's already over, but you will see it here. This is the one. So I see all my jobs and the best training job, etc. And you can also use the SageMaker Experiments SDK to look at those jobs. SageMaker Experiments is one of those capabilities that was launched at re:Invent, and you can very easily query the results even while the job is running, extract the information into a pandas data frame, and then visualize it. If you use SageMaker Studio, you can visualize all the nice graphs as well.
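Putting the pieces together, the tuner for this demo looks roughly like this (same caveats as above: v1-style SDK, and the exact ranges are illustrative).

```python
import sagemaker
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

bucket = sagemaker.Session().default_bucket()
prefix = 'xgboost-direct-marketing'            # same prefix as in the preprocessing sketch

hyperparameter_ranges = {
    'eta':              ContinuousParameter(0, 1),
    'min_child_weight': ContinuousParameter(1, 10),
    'alpha':            ContinuousParameter(0, 2),
    'max_depth':        IntegerParameter(1, 10),
}

tuner = HyperparameterTuner(
    xgb,                                       # the estimator defined above
    objective_metric_name='validation:auc',    # built-in metric for built-in XGBoost
    objective_type='Maximize',
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=40,
    max_parallel_jobs=4,
    strategy='Bayesian',
)

# Training and validation channels (CSV content type for the built-in algorithm).
train_channel = sagemaker.s3_input(f's3://{bucket}/{prefix}/train', content_type='csv')
val_channel   = sagemaker.s3_input(f's3://{bucket}/{prefix}/validation', content_type='csv')
tuner.fit({'train': train_channel, 'validation': val_channel})

# While (or after) the job runs, pull the results into a pandas DataFrame.
results = tuner.analytics().dataframe()
print(results.sort_values('FinalObjectiveValue', ascending=False).head(3))
```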
Once the job is over, I say, "Hey, just give me the top three jobs," sorted by descending objective. The top job achieved 0.951284 area under the curve. That's not too bad. And let's deploy it. So the only thing I have to do is call `deploy` on the tuner. And it will print out the training log for that job. I can see I saved 66.7% on my training job thanks to Spot instances. So Spot is really, really a good way to save a lot of money. So make sure you use it. And after a few minutes, the model is deployed, and I can just predict with it. Okay, so I can load some data from my test set, read the first hundred samples, and use the `invoke_endpoint` API from the SageMaker runtime to push those 100 samples to my endpoint and read the results for all those 100 predictions. Then I can delete my endpoint if I'm done. So that's a quick example of running hyperparameter tuning with SageMaker.
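A minimal version of that deployment and prediction step could look like this; the test file name is hypothetical, and the samples are assumed to have the label column already stripped.

```python
import boto3

# Deploy the best training job found by the tuner (or the best so far, if it's still running).
xgb_predictor = tuner.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

# Push the first 100 test samples (CSV rows, no label) to the endpoint.
runtime = boto3.client('sagemaker-runtime')
with open('test_features.csv') as f:
    payload = ''.join(f.readlines()[:100])

response = runtime.invoke_endpoint(
    EndpointName=xgb_predictor.endpoint,   # 'endpoint_name' in SDK v2
    ContentType='text/csv',
    Body=payload,
)
scores = response['Body'].read().decode('utf-8').split(',')
print(scores[:10])                          # one probability per sample

# Clean up when done.
xgb_predictor.delete_endpoint()
```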
Some tips: Bayesian is really the way to go. It's better, faster, and cheaper. Random search is good as a baseline, but Bayesian will work better. Don't run too many jobs in parallel because you get fewer opportunities to predict. The more data points you have, the better job Bayesian will do. So don't parallelize too much. And you may have instance limits anyway, so you're not going to be able to fire up too many instances in parallel. Don't run too many jobs either. Bayesian is typically 10x more efficient than random. So if you were used to running maybe 500 jobs with random search, you should definitely consider running 50 with Bayesian. And of course, the cost of training and the cost of running those jobs might not be worth it. So you need to be careful of diminishing returns. If you end up running hundreds of jobs for negligible accuracy gains, and negligible meaning you don't see any business impact, well, then, it's a waste of time and money, so don't do that.
We have plenty of resources on automatic model tuning. Of course, documentation, plenty of notebooks, some really great blog posts. So, again, you'll get the slides and you can go and learn even more. Now, let's switch to the second part of the presentation, which is building models automatically with SageMaker Autopilot. So, Autopilot is an end-to-end capability. The purpose of AutoML is to solve model building completely. The first step is identifying the type of problem you're trying to solve: are you trying to build a regression model? Are you trying to build a classification model? Then, based on this, select the best algorithm or algorithms for that problem. The third step is preprocessing the data for the candidate algorithms that were selected. Of course, data, as we saw in the previous example, is not always in the perfect format. You have text strings, categorical variables, placeholder values, and maybe missing values. So all that stuff needs to be handled correctly if you want to have good results. Finally, you're going to train all those different jobs, apply the candidate algorithms to the processed datasets, and apply hyperparameter tuning just like we saw before to extract every bit of accuracy.
These are the different steps you want to have in an AutoML capability. You'll find two types of services out there: blackbox versus whitebox. Blackbox services will train and give you the model, but they give you little or even no information on how the model was built. So it's difficult to understand where that comes from. Even if it performs very well, you can't really reproduce it yourself. So that's okay, but I guess whitebox is more desirable, because if you get a high-performance model and understand how it was built and can reproduce everything yourself and tweak it even further, well, I guess that's better, right? So SageMaker Autopilot is whitebox AutoML. It covers all four steps I mentioned: problem identification, algorithm selection, data preprocessing, and hyperparameter tuning, and it's going to show you what it did. It's going to generate notebooks that show you exactly what was done, and you can run them again. SageMaker Autopilot was launched at re:Invent a few months ago. For now, it supports regression and classification problems using built-in algorithms on SageMaker. The workflow is very simple. You upload your unprocessed dataset to S3. The keyword here is "unprocessed"; you'll see what that means. You configure the AutoML job: where the data is and how much tuning to run. When you launch the job, it's going to run for a while, and while it runs, you'll be able to read those auto-generated notebooks showing you the candidates and the preprocessing steps, and then at the end, you can simply deploy to a real-time endpoint or use batch transform. So this is really, really simple to use. Let's do a quick demo.
I'm using the same dataset here. Again, that notebook is in the same repo. It starts the same: install the SDK, grab the dataset, load it in pandas, take a look, blah, blah, blah. Nothing new here, so let's go fast. Here I'm splitting the dataset. The reason I'm doing this is because I want to save 5% of the dataset to score the model myself. You don't have to do that. I wanted to show you how to do it, but you could very well pass 100% of the data to SageMaker Autopilot. SageMaker Autopilot will internally split your training set for training and validation, etc. So of course, it is applying those best practices, but I just want to keep 5% of that data for myself. And that's it. And I upload that 95% training/validation data to S3. As you saw, I did not modify anything here. I just passed the raw data, no one-hot encoding, no filtering, no nothing. So it just makes my life simpler.
I set up the Autopilot job. I use the `AutoML` object from the SDK. Pass a role. This is really important: point SageMaker Autopilot at the target attribute I want to predict. Remember in the dataset we had this `y` column at the end that says yes or no, did the customer accept the offer? So this is the target we want to learn. So all you have to do is tell SageMaker, "Hey, this is the attribute you need to focus on." Then where to save models, notebooks, etc. How many tuning jobs you want to run. The default is 500 jobs. That's going to take a few hours. To make my demo run a little faster, I reduced this to 200. But I would recommend running all of it. It will just give SageMaker more opportunities to optimize and build even better models. I can give a max running time per training job, and I can also cap the total AutoML job to a certain amount of time. So here, it's going to be an hour. So the job will run until we've tuned 200 candidates or run for an hour, whichever comes first. And that's all there is to it. Now we call `fit`, passing the location of the input data, and off it goes.
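Here is a sketch of that configuration with the SDK's `AutoML` object; the S3 prefix, input file name, and per-job timeout below are assumptions, not the exact values from the notebook.

```python
import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
prefix = 'autopilot-direct-marketing'            # hypothetical S3 prefix

automl = AutoML(
    role=role,
    sagemaker_session=session,
    target_attribute_name='y',                   # the yes/no column we want to predict
    output_path=f's3://{bucket}/{prefix}/output',
    max_candidates=200,                          # default is 500; reduced for the demo
    max_runtime_per_training_job_in_seconds=600,
    total_job_runtime_in_seconds=3600,           # cap the whole AutoML job at one hour
)

# Point it at the raw, unprocessed CSV in S3 and launch the job.
automl.fit(f's3://{bucket}/{prefix}/input/automl-train.csv', wait=False)
```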
It's going to go through three stages. The first one is data analysis. You can see it here. Here I'm basically describing the AutoML job and looping on its status, so it's going to loop until the data analysis stage is over. Analyzing data is exactly what the name means. Autopilot will look at the dataset you passed and the target attribute you defined. It's going to figure out what kind of problem you're solving here, so it's going to figure out this is a binary classification model. It's going to compute statistics on the dataset and, based on that, decide which data preprocessing to apply. This is typically manual work that a data scientist would do, and here it's completely automated. Once the data analysis step is over, SageMaker has generated notebooks. One notebook called Data Exploration gives you statistics on the dataset, and the other one called Candidate Definition lists the pipelines that have been designed by Autopilot. A pipeline is a combination of a preprocessing step and a training step. So you'll see the actual algorithms that have been selected. You can run that code, tweak it, and this is really the whitebox AutoML part of Autopilot. Everything that Autopilot does is visible in those notebooks. They're available in S3. You can copy them to your local machine and open them. You can see them here: this is the candidate definition notebook and the data exploration notebook. Okay? So I'm not going to go through them in the interest of time, but as you can see, all those pipeline definitions are here.
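The status loop itself is just polling `describe_auto_ml_job()`; once the analysis stage is done, the response also carries the S3 locations of the two generated notebooks. A minimal sketch:

```python
import time

job = automl.describe_auto_ml_job()
while job['AutoMLJobSecondaryStatus'] == 'AnalyzingData':
    print(job['AutoMLJobStatus'], '-', job['AutoMLJobSecondaryStatus'])
    time.sleep(30)
    job = automl.describe_auto_ml_job()

# S3 locations of the auto-generated notebooks.
print(job['AutoMLJobArtifacts']['DataExplorationNotebookLocation'])
print(job['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation'])
```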
And then Autopilot moves on to feature engineering, as you can see here. Based on the pipelines that have been defined, Autopilot will transform the data, your input dataset, according to the preprocessing steps and store everything in S3. So if I have 10 pipelines, that means I have 10 processed input datasets in S3. Once feature engineering is complete, Autopilot will launch model tuning. For each of the 10 pipelines, trained on its processed dataset, it's going to launch a large number of tuning jobs, trying to extract maximum accuracy from the different candidates. This part is going to last for a while, depending on how many tuning jobs you want to run. And then, once again, you could use SageMaker Experiments while the tuning job is running to look at ongoing jobs, look at the top ones, etc. If you use SageMaker Studio, you can see that stuff in Studio and visualize it as well, and build all those nice graphs.
Model tuning does last for a while. And then it stops because here, probably, I hit my one-hour limit. So I could check in the console, actually. Yeah, so we only ran 97 jobs. See, so I asked it to run 200 max, but it only ran 97 because it probably hit that one-hour limit. All right, and now if I want to deploy the top job, then again, I simply call `deploy` on my AutoML object, and I can build a predictor, real-time predictor from that. And of course, then I can predict with it. So here I'm using the `predict` API from the SageMaker SDK, and I can do that because I am using that predictor object above. I'm sending my test set, so the 5% data that I kept on the side is used here to compute additional metrics. I'm predicting all those test samples and then checking if they're true positives, false positives, true negatives, false negatives, keeping track of everything, comparing predictions and labels. That gives me a homemade confusion matrix so I can see, for example, 71 false positives and 99 false negatives, and I can compute accuracy, precision, recall, F1, etc. So just an easy way to do that. And finally, you can delete everything. It's not just deleting the endpoints. We have quite a bit of stuff in S3 by now. We have all those temporary datasets, etc. So you might want to clean that up as well in your S3 bucket.
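For reference, a stripped-down version of that scoring loop might look like this; it assumes the held-out 5% lives in a local CSV with the `y` label still attached, that the endpoint name below is one you chose yourself, and that the endpoint returns the predicted label as plain text.

```python
import boto3
import pandas as pd

# Deploy the best candidate found by Autopilot to a named endpoint.
endpoint_name = 'automl-direct-marketing'        # hypothetical endpoint name
automl.deploy(initial_instance_count=1,
              instance_type='ml.m4.xlarge',
              endpoint_name=endpoint_name)

runtime = boto3.client('sagemaker-runtime')
test = pd.read_csv('automl-test.csv')            # hypothetical 5% hold-out file, label included

tp = fp = tn = fn = 0
for _, row in test.iterrows():
    label = row['y']
    payload = ','.join(str(v) for v in row.drop('y').values)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name, ContentType='text/csv', Body=payload)
    predicted = response['Body'].read().decode('utf-8').strip()
    if   predicted == 'yes' and label == 'yes': tp += 1
    elif predicted == 'yes' and label == 'no':  fp += 1
    elif predicted == 'no'  and label == 'no':  tn += 1
    else:                                       fn += 1

print(f'TP={tp} FP={fp} TN={tn} FN={fn}')
print('accuracy :', (tp + tn) / (tp + fp + tn + fn))
print('precision:', tp / (tp + fp))
print('recall   :', tp / (tp + fn))

# Clean up the endpoint (and remember to clean up the intermediate artifacts in S3 too).
boto3.client('sagemaker').delete_endpoint(EndpointName=endpoint_name)
```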
So this is how you use Autopilot. Pretty easy, I think. Okay, plenty of resources on Autopilot as well. Docs, plenty of nice notebooks, some blog posts. And this is the end of the session. So thank you very much for listening. If you want more content from me, then, well, this is all of it, I guess. Blog posts on the AWS blog, my own blog on Medium, my YouTube channel with tons of SageMaker videos. I have a podcast as well in audio and video on YouTube and Buzzsprout. And of course, you can always ping me on Twitter, ask me questions, share cool stuff that you built. It's always a pleasure. Okay? All right, that's it for this session. I hope you learned a few things. If you have questions, please ask all your questions. We're ready to answer them now. And thank you very much.