SageMaker Fridays Season 3 Episode 7: Building models automatically with AutoML

May 17, 2021
Broadcast live on 14/05/2021. Join us for more episodes at https://amazonsagemakerfridays.splashthat.com/

In this episode, we'll dive into SageMaker Autopilot, an AutoML capability. Starting from a tabular dataset, we'll launch an AutoML job in just a few clicks (or just a few lines of code). Then, we'll explore in detail the different steps in Autopilot, such as automatic feature engineering and model tuning. We'll show you the auto-generated notebooks, and how you can run them yourself for further optimization. Finally, we look at AutoGluon, an open-source library for AutoML.

Notebook: https://github.com/juliensimon/amazon-studio-demos/tree/main/sagemaker_fridays

Blog posts:
https://aws.amazon.com/blogs/aws/amazon-sagemaker-autopilot-fully-managed-automatic-machine-learning/
https://aws.amazon.com/blogs/opensource/machine-learning-with-autogluon-an-open-source-automl-library/

AutoGluon:
https://auto.gluon.ai
https://github.com/awslabs/autogluon
https://arxiv.org/abs/2003.06505

Transcript

Welcome, everyone. Good morning, good afternoon, and welcome to this new episode of season three. I think it's episode seven already. My name is Julian. I'm a developer advocate focusing on AI and machine learning. And by now, I think you know my co-presenter, Sego. Welcome. Thank you, Julian. Hello, everyone. My name is Segelen, and I'm a senior data scientist working with the AWS Machine Learning Solutions Lab. My role is to help customers get their ML projects on the right track to create business value as fast as possible. Absolutely. Thank you again for joining us today. As you know, all episodes are live. We're still in the Paris office, and you won't get any slides except the final slide. Screenshot time. And we'll go through another demo today. Please ask all your questions in the chat. Anything you want to know, just ask and make sure you learn as much as possible. All right. We have another very full plate today. Let's get started. So, what are we talking about today, Sego? This week, we are going to discuss a very exciting topic in machine learning, which is AutoML. We will introduce you to what AutoML is and when you should use it. Then, starting from a solution available in SageMaker JumpStart, we are going to train a model for credit risk prediction. First, we will do it in the usual way, and then we will launch an AutoML job with SageMaker Autopilot. Finally, we will take a look at an open-source alternative named AutoGluon. Oh, open source. I love open source. Yeah, AutoGluon. Before we dive into running the code, let's try to understand what problems we're trying to solve. So, what's the business problem we're trying to model today? Today, we are building a model to predict credit risk. If you remember, we worked on this problem in episode four of season two. Oh, yeah. Long time ago. Where we used the LightGBM algorithm to train a binary classification model on the German credit dataset.
We also explained its predictions using a game theory approach called Shapley Additive Explanations. Yeah, I remember that one. And you can actually go and look at this episode, which is on Twitch and YouTube. It's a binary classification problem, right? So yes or no, should this individual's credit be approved? I know it's a good thing to recycle, but I hope we're not running out of ideas, right? So we're not running the same thing again and again. No. Okay, good. Since last season, a number of end-to-end solutions have been introduced in SageMaker JumpStart, a beginner-friendly capability launched at re:Invent 2020. We mentioned JumpStart a few times, but I don't think we actually covered it. Let's share my screen, and we can take a quick look at JumpStart and the solution we're going to use today. This is obviously SageMaker Studio. If you start from the launcher, you see this first box is about JumpStart. Inside JumpStart, we see solutions, and we also see quite a few natural language processing models, computer vision models, plus additional resources on built-in algorithms and sample notebooks, blogs, and video tutorials. It's a great place to start if you're just beginning with SageMaker. Looking at solutions, we can see we have 16 solutions as of today. Solutions are really end-to-end architectures to solve a business problem. Go and explore all of those. There is one for credit decisions, which is the one we're focusing on today. It trains a LightGBM model on SageMaker with the German credit dataset, adds explainability, and deploys the whole architecture. You might think, oh, no, I don't want to create all that stuff; it's too complicated, but you don't have to. All you have to do is click on launch. So now you see why I like the service. What this does is it launches an AWS CloudFormation template. If you're not familiar with CloudFormation, it's an infrastructure as code service that helps you automate the provisioning and cleaning up of AWS resources.
You just click on this, wait for a couple of minutes, and everything gets deployed. You see the solution is ready here. If you click on this, it takes you to a screen, and you click on Open the Notebook, and then you jump into the first notebook that explains everything, including a summary of the architecture and additional notebooks. This solution has actually six or seven different notebooks. We're not going through everything, just data processing and training. The first step is to use AWS Glue to build the dataset. It's a mix of the German credit dataset and fake customer data generated with Faker. We covered all these steps in detail in episode four, season two. We store the dataset in S3 and then train with a custom container that implements LightGBM. If we move on to the next notebook, we'll really see the dataset preparation here. It's a mix of the German credit dataset CSV file and fake data. We see a Glue job being launched and moving along. Then we get our data in S3. Perfect. Not going through this; we've seen this before. Let me close those notebooks to clean up. Then we move on to model training. It's what you would expect. It's a good example of building a custom container if you're interested in that. We build a custom container with Scikit-Learn and LightGBM and push it to Amazon ECR, which is the Docker registry service on AWS. Once we have this image, we train. It's business as usual. It took me eight minutes when I ran it earlier. We use the Scikit-Learn estimator and pass the actual image. Let me zoom in a bit. We use the Docker image we built and pass our training script, hyperparameters, and data location. We've seen this a million times. Quickly looking at the training script, it's what we would expect. Lots of Scikit-Learn goodness, preprocessing objects from Scikit-Learn for numerical and categorical columns. Then, building a simple pipeline with preprocessing and training.
Loading data, creating the LightGBM objects. Cool stuff. We've seen this a lot of times. And of course, all of this is part of the solution. You can go and read it. It's really machine learning as all of us do it. It's good stuff. We train this, passing the location of data in S3. We see lots of log information. The test or validation metric, which is the area under the curve, is about 76.8%. Write that down because the game today is to do better with AutoML. Otherwise, what's the point? This is what we've done on SageMaker many times. We pre-processed data in a notebook or in a SageMaker Processing job, and then trained with the SageMaker SDK. It's all good. There's nothing wrong with that. We get full control over everything. But we need two things for this: machine learning skills and experience. Lots of people out there do not have ML skills or experience. Are they excluded from the party? No, they shouldn't be. And then there's another use case where you could be an experienced machine learning practitioner, but you might have 500 datasets and 500 models or thousands of models because you want to try different algorithms. You need to deliver, and you can't do manual feature engineering and tweaking on all those things. It's just not possible. Or maybe you have hundreds of datasets and you want to focus on the top 10 that lead to high-performance models because you have limited resources. So you solve those 10 problems first. I think these are the two areas where AutoML would make sense. But what is AutoML really? How do we define it? It's a buzzword. Sego, tell us a little bit about what AutoML is. Building machine learning models requires manually preparing data, engineering features, testing multiple algorithms, and optimizing hundreds of model parameters to find the best model for your data. This approach requires deep ML expertise. If you don't have that expertise, you could use an automated approach, the AutoML approach.
In a nutshell, AutoML automatically prepares your dataset, tries different machine learning approaches, and combines their results to deliver high-quality models. It should be a one-click thing. Here's my data. Go and figure it out. Grab a paper and go play video games. I love that. So I guess we have different steps inside the process. Do we find the same steps that we would do manually? Exactly. We've got problem discovery, model generation, feature engineering and data processing, model training, and evaluation. If you're new to machine learning, yes, as you will see, one click or one API call and you're good to go. Or if you're already experienced, you can still do one click or one API call, go play video games. But as we will see, you also get to see exactly how the model was built, how data was processed, etc. All the best practices are implemented in the AutoML tool. The tool should tell you what it's done because you need to understand how the model was built for explainability and compliance. If you have expert domain knowledge, you can add your own feature engineering steps to get even more accuracy. These are the two scenarios. SageMaker has a capability called SageMaker Autopilot, which we can use in two different ways. The first one, which we're going to show you in a second, is using SageMaker Studio and the graphical interface. You don't have to code anything. That's cool. The second way is to use the SageMaker SDK in a notebook, which is really dead simple. It's one line of code, one API call to get everything going. What problem classes can we solve with Autopilot? Can we solve everything or very specific problems? At this moment, we can solve regression problems, binary classification, and multi-class classification. That's good because it covers more than 80% of all problems out there. Classifying or predicting numerical values is a very common use case. Autopilot is simple. It's one click, really.
But we still want to know what algorithms are available. What models will actually be trained? You've got the built-in algorithms from SageMaker: XGBoost, the linear learner, and the multi-layer perceptron, a neural network. That last one is a recent addition, added a few months ago. Now we can train and tune neural networks, which is pretty cool. How do I know, once the model has been trained and I get the artifact and metrics, how it was trained? How do I trust it and explain this model to my customers? You get full visibility with Autopilot. You will know how the data was wrangled, how the models were selected, trained, and tuned. It's not a black box. We have autogenerated notebooks. Another recent addition is model explainability. Autopilot will automatically use SHAP to show you the most important features. We get a report and some cool stuff in the Studio UI. It's not just, "Click here, here's the model, here's the metric, and I won't tell you anything else." You get the initial dataset, the processed dataset, and the feature engineering code. You can see everything, including feature importance. Let's start running things. I can put my demo glasses on. I can't see the screen. I will be putting on the wrong stuff here. Okay. And yeah, this is pretty good. I can close this one too, and this one too, and this one too. All right. Remember AUC was 76.8. Now I closed it, but okay, anyway. I remember. You remember. Good, good, good. SageMaker Autopilot expects CSV data. It works on tabular data, which is the type of data you would use for regression and classification. As a first step, we just need to convert the data from the solution because the Glue job actually builds a dataset in JSON format. The LightGBM script loads that JSON data. Here, just a very simple notebook. It's just grabbing that JSON data, loading it in a Pandas DataFrame, and saving it to CSV. The one thing I'm doing is concatenating the training and test data because Autopilot will automatically split.
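That conversion notebook boils down to a few lines of Pandas. This is a minimal sketch with a tiny in-memory stand-in for the Glue output; the real notebook reads the JSON files from S3, and the column names here are illustrative.

```python
import pandas as pd

# Tiny stand-in for the JSON data produced by the Glue job
# (the real notebook loads the actual files from S3).
train = pd.DataFrame({"credit_amount": [1000, 5000], "credit_default": [0, 1]})
test = pd.DataFrame({"credit_amount": [2500], "credit_default": [0]})

# Autopilot splits the data itself, so we concatenate training and test data
# and hand it everything as a single CSV file.
full = pd.concat([train, test], ignore_index=True)
full.to_csv("credit.csv", index=False)
```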
You don't need to provide a training and validation set. You can provide as much data as you have, and Autopilot will split. I'm just doing this and uploading the data to S3. This is the URI for the full dataset, which we're going to use. It looks like a bunch of features, numerical and categorical features, and the label, which is called credit default. We can see this is a true or false value. I'm going to predict if a certain person is likely to default on credit or not. That's the starting point, a very basic notebook. Now, the only thing we have to do is grab this and click on the experiments icon, the triangle here. Then I'm just going to click on create experiment. This takes me to the screen where I can enter a few things: give a name to my Autopilot job, enter the location of the dataset in S3, name the target column, enter the output location to store all the artifacts, and select the machine learning problem. I could say, "Figure it out," but if we want to compare the area under the curve for Autopilot to the solution, we want to say, "Please use AUC." So that's why I'm saying binary classification. We want to run a complete experiment, which means tuning all the way. This will run 250 tuning jobs. If you only wanted to generate the candidates and notebooks, you would say, "No, just create the notebooks and show me what you would try." You could decide to deploy the best model automatically, but let's not deploy for now. So that wasn't hard, right? Give it a name, where's the dataset in S3, CSV format, what's the column you want to predict, and where to put all the stuff we're going to generate. Click on Create, and off it goes with a fancy AWS animation. Beautiful. I'm sure that probably 20 engineers worked about six weeks on this. At least it's something. What we have done here is very interesting. We can see the different steps on the job. Pre-processing means analyzing the dataset.
It's looking at that column and figuring out the distribution of values, understanding if it's predicting numbers or categories. This will run for a few minutes. Once we understand the problem, we can start analyzing the other columns to figure out their type and distributions. Based on that, we can recommend feature engineering steps and algorithms. It's the analysis phase that a machine learning specialist would do. What am I going to try here? Once this is done, we generate the candidates. A candidate is a pipeline with a combination of feature engineering, transforming the dataset with the feature engineering steps, training a model on that, and tuning. We're going to generate several pipelines. Then, once we have the pipelines, we transform the dataset in different ways according to those pipelines and launch tuning jobs to optimize hyperparameters. We're going to train 250 models. This particular example runs for anywhere between an hour and an hour and a half. Even an expert would spend more than an hour and a half doing all of this. Finally, we generate an explainability report for the best model. This is going to run for a little while. We can keep it running and see it moving through the first few steps. I've run this before. Once we're done, you'll see it here. You can right-click and say, "Describe AutoML Job." We see the job profile, which lists all the parameters we used. It's a binary classification problem. We used the AUC metric. The list of trials shows 250 jobs. You can sort them according to the metric. There's a little star that tells you this is the best job. That's a very good AUC, 85. Remember how much we had? 76.8. That's about an 11 percent relative improvement. We can deploy that model. If we select it and click on deploy, we can go and deploy to an endpoint. Still no code, right? If we're curious about this job, we can right-click and say, "Open in model details" and see additional information.
We can see all the artifacts, the input dataset, the splits, the feature engineering code, the trained model, and explainability. It's open book. We can go and look at everything. For the record, this is an XGBoost model. Let's look at the feature engineering code. I actually did this already. Here it is. This is the code transforming the data. We're using objects from a package called sagemaker-scikit-learn-extension, which should look a lot like Scikit-Learn objects. We build a pipeline for numerical columns, do robust imputing to fill missing values, one-hot encoding on categorical columns, and apply scaling. All that code is generated. You can see exactly how data is processed. For model explainability, you may need to explain the model to your compliance body or customers, and you need to be able to trust it. We see all the artifacts and model explainability. In this case, we see global SHAP values. The top values are credit duration, credit amount, employment duration. These are reasonable. Credit duration is very important. If it's a 25-year credit, there's a stronger chance to default than on a three-month credit. The amount, the more money taken from the bank, the higher the chance of not paying back. Employment duration, if you have a very stable job, there's a better chance of repaying the loan. Pretty interesting. You can export this to a PDF report and download the data to build your own reports. Here's our model. It was really one click to do all of this. It would be just one click to deploy. Now, let's understand why this is the best one. What are the other things we tried? This is where we can open the candidate generation. The data exploration notebook has basic stats on the dataset. The candidate generation notebook, I'm going to import it. We can run it with the data science kernel. Oh, let's take a look at the one we created. It's moving along. Perfect. Preprocessing is done. We can see those notebooks have been generated. Now it's doing the feature engineering step.
Model tuning is the longest step. Here's the actual notebook. This is the golden piece in Autopilot. It's totally okay if you just want a high-performance model and deploy it, but if you want to understand what happened under the hood, this is the one to read. There's a little bit of setup. We could try running some of those things, see what happens. These are setup parameters. Where do I store all those things? Transformed data, etc. What I like here is the "available" cell throughout the notebook, which tells you, "Hey, if you want to tweak, this is the place to start." For example, we see this first pipeline, DPP zero. I'm guessing DPP means data processing pipeline. Here we see the first candidate. This data transformation strategy first transforms numeric features using a robust imputer and categorical features using a threshold one-hot encoder. This could be the winning pipeline. It merges all the generated features and applies a robust standard scaler. The transformed data will then be used by the tuning job for this model. Here's the definition. You could try larger instance types or your own settings. So we say, "Yes, this is an interesting pipeline. I want to try it." Here's another one. This time we do robust imputer and threshold one-hot encoder, then robust PCA to reduce dimensions, and then standard scaler and XGBoost again. You could say, "This is a cool one," but you could also exclude it if you want. You could say, "No, I want to run the pipeline myself and skip this candidate." Let's add this candidate anyway. Here's another one with Linear Learner. This one goes a little crazier on the data prep and uses the built-in linear learner. A quantile extreme values transform. Not sure what that is, but I'll take a look. Sounds cool. I want to have that. Here's another one with XGBoost. Why not? And here's one that uses some feature engineering steps and a multilayer perceptron, a neural network. Let's add neural networks to the mix.
Now we can see, I have nine pipelines. You could run the job again and say, "I want to keep tweaking." You could take the winning pipeline and keep tweaking. Pipeline zero is the winner. You could exclude the other ones and keep tweaking this one. Then, of course, we're going to start running things. We're going to start transforming data. This will process the dataset and store the processed artifacts in S3. We will have nine different versions of the dataset. Then we can go and launch model tuning. You could go extremely crazy on hyperparameter ranges. You could say, "I'm going to tweak XGBoost in a slightly different way than Autopilot did because I think this could help or I want to add a new parameter." You can reproduce the exact experiment and tweak it. This uses a very cool feature in hyperparameter optimization where we tune across multiple algorithms in a single job. This is a very good example of that. Then we launch that. At the end of the day, we could deploy the model. This gets deployed on an endpoint, and you can start predicting with it. It's completely open. You can tweak a lot. You can go and have coffee or play video games. Makes sense? Now, let me show you how to do the same thing programmatically. Here's how we do exactly the same with code. We use the SDK. We grab the same data. Here's the one line I was mentioning. Import the AutoML object. Pass the SageMaker role, the problem type, the metric. These are optional. The only reason I'm running it like that is to make sure we use AUC. Max candidates is optional. This is the default value. But if you want fewer jobs or longer jobs, why not? You can tweak this. Then you just call fit, just like a normal SageMaker job, and pass the location of the data. The simplest version could really be AutoML, target attribute name, the role, and call fit. Super nice. We have a simple function to wait for the different steps.
You could say wait equals true and come back two hours later, or wait equals false and just watch it run, which is extremely boring. We query the AutoML job and look at the status. We wait for data analysis to be complete. Once this is complete, we have those generated notebooks. Then we wait for feature engineering, and then model tuning. Each line is 60 seconds. Data analysis is one, two, three, five, six minutes. Feature engineering about the same. Model tuning is when you can play video games. It generates the explainability report. We can get lots of very informative JSON. We can get the same information on the best candidate. In this case, it was pipeline two. The AUC was 84.88. Pretty close to the previous one. We can see the container and know which algo this is. But that's too much JSON for a Friday night. I'm bored already, so let's move on. I'm pretty sure it's actually XGBoost again. We can deploy. Call deploy, pass the instance type and endpoint name, and wait for two minutes. We create a Boto3 SageMaker client and invoke the endpoint. Is that endpoint still up? Oh, yeah. In service. So I'm going to try it. Notebook died. Let me try again. Endpoint name not defined. Where is it? Oh, yeah. I'll fix it. Come on, come on. All right. Now we pass this sample to the endpoint, and it's obviously CSV. In this case, this brave customer is not going to default. It actually says false, not 0.12345, because this endpoint is an inference pipeline. What we deployed here is the sequence of feature engineering, preprocessing, prediction, and post-processing. If you want to dive into the details, we actually have several containers chained in an inference pipeline. We can pass the data in the exact same format we trained on and get a label. It's not just the raw prediction; it's the full process: preprocessing, prediction, post-processing. This is what Autopilot does. It's super friendly. It's pretty good for beginners and advanced users. We have one more thing, of course.
There's an open-source project called AutoGluon. It lets you do the same kind of thing. The AutoGluon library was open-sourced by AWS at re:Invent 2019. It is an open-source AutoML framework that helps you train state-of-the-art machine learning models for image classification, object detection, text classification, and tabular data prediction with little to no prior experience in machine learning. Same kind of thing. Put your data, bring your data, one line of code, build me something. Exactly. Pretty cool. There's a research paper. You'll get that on the final slide. It compares AutoGluon to other frameworks and goes into some of the details. What's specific about AutoGluon? It uses different machine learning techniques, like ensembling, etc. Can you tell us a little more about it? Some key aspects of AutoGluon, in our case AutoGluon Tabular, include robust data processing to handle heterogeneous datasets, modern neural network architectures, and powerful model ensembling based on novel combinations of multi-layer stacking and repeated bagging. It focuses more on combining models versus having a limited set of algos optimized to the max using clever model tuning. AutoGluon is more about trying many different models and combining them in clever ways. Exactly. Okay, interesting. Tell us a little bit about those techniques: bagging, stacking, and ensembling. Ensemble models combine predictions from multiple models to outperform individual models. We take weak learners and combine them. It's very common. All the best-performing AutoML frameworks today rely on some form of model ensembling, such as bagging, boosting, stacking, or weighted combination. In the case of AutoGluon on tabular data, a collection of individual base models is trained on the dataset. Then, you have a stacker model trained using the aggregated predictions of the base models as its features.
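Stacking itself is easy to illustrate with plain Scikit-Learn. This toy example (synthetic data, arbitrary base models) trains a stacker on the base models' out-of-fold predictions; AutoGluon's own ensembling is considerably more elaborate, with multiple stacking layers and repeated bagging.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, just for illustration.
X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    # The final estimator is trained on the base models'
    # cross-validated (out-of-fold) predictions.
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
```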
Thanks to this stacker model, you improve on the shortcomings of individual base predictions and exploit interactions between base models for enhanced predictive power. Let's look at an example. This is how you install AutoGluon. I need some extra stuff because they have some widgets, and s3fs because I'm loading the same dataset from S3. The label is still called credit default, and the metric is still AUC. In this case, we have tabular data, so we import the appropriate objects, create a tabular dataset, and a tabular predictor. We call fit, passing the location of the training data and the time limit. Here, I fire it up for an hour, but you can use any amount of time. AutoGluon is pretty clever about how it uses time. It tries high-performing algos first. Even if you try for a limited amount of time, you still get good results. Then it goes into more exotic stuff. We have presets for best quality, which will take more time but are worth it if you want production-grade models. It just runs, and we see feature engineering and lots of models being trained. We see KNN, LightGBM, Random Forest, CatBoost, Extra Trees, Neural Net FastAI, XGBoost, Neural Net MXNet, and more. It trains for an hour. It saves all those models locally. You see the different runs. These are pickle objects, so you can load them and tweak them more if you want. It trains again and again. At the end of the hour, you can see the leaderboard. The top model achieved an AUC of 82.34, which is still much better than the initial job we ran in the solution. It's a little less than the Autopilot job, but I didn't apply model tuning here. You can add model tuning to AutoGluon. You can see the weighted ensemble, and this combination of models is available for inspection. You can see stack level two, so two layers of models. Of course, I can predict. Here, I'm just grabbing my test set, dropping the label, predicting, and getting feature importance.
We see checking balance, credit duration, employment duration, credit amount, and repayment history. This is really interesting because we see the same features that we saw in Autopilot with completely different algos. If these two frameworks come to the same conclusion about which features are important, you would say, "Okay, yes, we're really looking at the right things here." You can explore more. We just scratched the surface. This is AutoGluon, a very interesting alternative, good for benchmarking. Go and try both and see what you can do and learn. That's really what we wanted to tell you today. A little trip to AutoML. If you're completely new, this should show you that there is an easy way into machine learning. It's not intimidating. As you dive deeper, you get into algos, model tuning, and hyperparameters, which get a little more complicated. But this is a great way to start. Train some models on simple datasets and start reading about the algos and data processing steps. Just get familiar with this. Start from real examples. Don't start from math and equations. Just start from this. It's a beginner-friendly way. For experts, it's a great way to train tons of models automatically and quickly figure out which ones to keep tweaking and which ones to discard. Screenshot time? Two minutes. Perfect. The notebook will be posted very shortly at the same location. There's a blog post I wrote on Autopilot. There's another great blog post on AutoGluon by one of my colleagues, Shashank. Screenshot time again. One Autopilot, one AutoGluon, the AutoGluon website, the GitHub repo, and the research paper, which is very interesting. We're on time. Amazing. Sego, thank you very much. Thanks for telling us about ensembling and stacking and bagging. Thanks, everyone, for watching. Thanks to our colleagues who helped organize all of it. Much appreciated. We'll see you in two weeks. We're not quite sure what the topic will be.
We're not running out of ideas, but we need to find a good topic. We'll find one. Trust us. Thank you very much. Stay safe. Have a great weekend. Until then, keep rocking. Okay, good machine learning. Perfect. Bye.

Tags

AutoML, SageMaker, Credit Risk Prediction

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.