SageMaker Fridays Season 4 Episode 9 Using AutoML to build a click prediction model

October 04, 2021
Broadcast on October 1, 2021. Join us for more episodes at https://pages.awscloud.com/SageMakerFridays. In this episode, we use SageMaker Autopilot and AutoGluon to automatically build a click prediction model. *** Notebook: https://github.com/juliensimon/amazon-studio-demos/tree/main/sagemaker_fridays/season4/s04e09

Transcript

Hi everybody, and welcome to this new episode of SageMaker Fridays season 4. I think this is episode number 9. My name is Julien, and I'm a dev advocate focusing on AI and machine learning. As usual, please meet my co-presenter. Hi everyone, my name is Siguren, and I'm a senior data scientist working with the AWS Machine Learning Solutions Lab. My role is to help customers get their ML projects on the right track and create business value as fast as possible. Thank you, Siguren, for being with us. Thank you.

So where are we in this season? Last week, we completed the automation segment. This week, we are embarking on the last four episodes, which are dedicated to a really cool topic: we're going to dive into AutoML and learn what we can do with this technology. Today, we are starting with a media, entertainment, and web use case where we'll try to predict clicks. Before I forget, this is where you'll find the code for these last four episodes. Hopefully, by the time you watch this, the code will be there. I still need to clean it up a little, but it will be there. If not, you can ping me.

Let's start looking at our problem for today. Click prediction is a very common problem in retail, media, and anywhere you want to predict whether a customer will click on an item. It could be an ad, and the ad world is very eager to predict the right clicks. Generally, you have some content displayed, such as ads, products, movies, or songs, and you want to figure out whether a certain item is going to get a click, so that you only show users the items with the highest probability of being clicked. This is a binary classification problem, where the label is either 0 or 1, indicating no click or click, and the model outputs a probability between 0 and 1.

We need a dataset for this, and we found an interesting one on Kaggle. You can see the URL here. It's a public domain dataset, so thank you to the user who uploaded it.
It's relatively large compared to our examples so far, with over 450,000 rows. However, it's not big compared to real ad click datasets, which usually have millions or billions of lines. It's large enough to give us a challenge and small enough to work with today.

Let's take a quick look at the data. It's a CSV file, and there is a training set and a test set, but we're only using the training set here. We have about 460,000 rows and 15 columns. The data includes session ID, timestamp, user ID, product ID, campaign ID, webpage ID, and other features like product category, user group ID, gender, age range, and more. The label, is_click, tells us whether there was a click (1) or no click (0). As expected, the dataset is unbalanced, with about 7% clicks and 93% no clicks, which is actually quite high compared to typical ad tech scenarios, where click-through rates can be as low as 0.1%.

An interesting thing about the columns is that campaign IDs and webpage IDs are stored as integers but are actually categorical variables. If we were doing machine learning the usual way, we would look closely at each column to determine its usefulness. Session ID is a unique identifier and probably not very useful. Campaign ID, webpage ID, and product category are all categorical variables, and data scientists would typically start engineering features from this data. We could then use our favorite algorithms, like XGBoost or other gradient-boosted trees, to build a classification model.

However, today we're being lazy and will look at a technique called AutoML. Siguren, can you describe a little what AutoML is, why we care, and what the use cases are? Then we'll talk about how to do it on SageMaker. Yes, building machine learning models usually requires manually preparing features, testing multiple algorithms, and optimizing hundreds of model parameters to find the best model for your data. This approach requires deep ML expertise.
If you don't have this expertise, you can use an automated approach called AutoML. AutoML automatically prepares your dataset, tries different machine learning approaches, and combines their results to deliver high-quality models. Essentially, you just bring the data, and AutoML figures out the rest, giving you a model in the end. Laziness is a virtue, and AutoML is perfect for that.

But there are real-life use cases for AutoML too. If you don't have the ML skills to build quality models, you can use AutoML. For experienced practitioners, AutoML helps analyze data and train models at scale. It lets you implement ML best practices once and deploy them repeatedly, which is especially useful if you have a small team and many problems to solve. AutoML helps you quickly identify the most promising problems, so you can focus on the best ones and improve them further.

SageMaker has an AutoML capability called SageMaker Autopilot, which we introduced last season and will explore in these four episodes. We'll also look at an open-source library called AutoGluon, which we'll cover in detail in the next few episodes. Today, we'll focus on Autopilot and give you a quick taste of AutoGluon.

There are two ways to use Autopilot: the SageMaker Studio UI and the SageMaker SDK. Let's start with the UI. We go to Experiments and trials, click on Create Autopilot experiment, and name it something like "autopilot-click-prediction". We need to select the dataset, which lives in an S3 bucket, and specify an output location for the job artifacts. We can leave the problem type on Auto, which will figure out whether it's binary classification, regression, or multi-class. Alternatively, we can set it explicitly to binary classification and use AUC as the metric. We can run a complete experiment, which includes preprocessing, training, and tuning. We could also auto-deploy the best model, but we'll skip that for now. There are extra settings for permissions, encryption, and VPC, which we won't touch.
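AUC is a good fit here because it measures how well the model ranks clicks above non-clicks, independently of the 7%/93% class balance (an accuracy of 93% could be achieved by always predicting "no click"). Here is a minimal pure-Python sketch of the metric's definition, the probability that a randomly chosen positive outranks a randomly chosen negative; this is just an illustration, not how SageMaker computes it internally:

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive example gets a higher
    score than a random negative example (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and model scores: one positive (0.35) is ranked below one negative (0.4).
print(roc_auc([0, 0, 1, 1, 0], [0.1, 0.4, 0.35, 0.8, 0.05]))
```

A random model scores 0.5 on this metric and a perfect ranker scores 1.0, which gives context for the numbers we'll see later in the episode.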
Finally, we click on Create experiment, and the job starts running. The process includes preprocessing, where Autopilot looks at the dataset, figures out feature engineering steps, and generates different candidate pipelines. It then applies these pipelines to the dataset, trains and tunes models, and generates an explainability report with SageMaker Clarify.

Let's time travel toward the end of the job. Once preprocessing is done, we get access to two auto-generated notebooks, one for data exploration and one for candidate generation. The data exploration notebook provides basic stats on the dataset, which is useful as a sanity check. The candidate generation notebook shows the decisions Autopilot has taken, including the feature engineering and training code. We can see different pipelines, such as transforming numeric features with a robust imputer, encoding categorical features with a threshold one-hot encoder, and scaling everything before training with XGBoost. We can also see pipelines that apply PCA before scaling, use Linear Learner, or try multi-layer perceptrons (MLP). Each pipeline is a combination of feature engineering and model training. We can run individual pipelines, launch hyperparameter tuning, and use multi-algorithm hyperparameter tuning. All of this code is generated and can be replicated, allowing us to understand and tweak the process.

Let's check the status of our job. It's still preprocessing, which is the most time-consuming part. We can see the code for the data preparation steps in the S3 output location. The code is based on transforms similar to those in scikit-learn and is available on GitHub. We can inspect the feature engineering and training code, and even modify it if needed. As the job progresses, tuning jobs start to appear. Each one is a training job with specific feature engineering and model training steps. We can inspect individual jobs to see the input data, the splits, and the actual models trained.
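To make those generated pipelines a bit more concrete, here is a rough pure-Python sketch of what the two transforms do: impute missing numeric values with the column median, and one-hot encode categoricals. Autopilot's actual generated code uses its own scikit-learn-style transforms, and the column names below are hypothetical stand-ins for the dataset's columns:

```python
from statistics import median

def fit_pipeline(rows, numeric_cols, categorical_cols):
    """Learn column medians (for imputation) and category vocabularies."""
    medians = {c: median(r[c] for r in rows if r[c] is not None)
               for c in numeric_cols}
    vocab = {c: sorted({r[c] for r in rows}) for c in categorical_cols}
    return medians, vocab

def transform(row, medians, vocab):
    """Impute missing numerics with the median, one-hot encode categoricals."""
    feats = [row[c] if row[c] is not None else medians[c] for c in medians]
    for c, values in vocab.items():
        feats.extend(1.0 if row[c] == v else 0.0 for v in values)
    return feats

rows = [
    {"age_level": 2, "gender": "M"},
    {"age_level": None, "gender": "F"},  # missing numeric value
    {"age_level": 4, "gender": "M"},
]
medians, vocab = fit_pipeline(rows, ["age_level"], ["gender"])
print([transform(r, medians, vocab) for r in rows])
```

The point of Autopilot generating this kind of code explicitly, rather than hiding it, is that you can read it, rerun it, and tweak it.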
Once the job is complete, the top candidate reaches an AUC of 0.60, which is a reasonable baseline. We can inspect it, an XGBoost model, and look at the feature importance report from SageMaker Clarify. We can then deploy the model on an endpoint, configure data capture, and test it.

Using the SageMaker SDK, we can create an AutoML job with just a few lines of code. We point it at the dataset in S3, create an AutoML job with the necessary configuration, and call fit to start it. This is a very simple and efficient way to build and deploy models.

At the beginning of the episode, we mentioned AutoGluon, an open-source library for AutoML. AutoGluon supports AutoML on tabular data, text, and computer vision. It covers a wide range of algorithms, including linear models, XGBoost, LightGBM, CatBoost, and deep learning, and it offers powerful ensembling techniques, such as bagging and stacking, which combine multiple models to improve predictions. We install AutoGluon, import the TabularDataset object, and create a TabularPredictor. We specify the label and evaluation metric, and call fit to start training. AutoGluon can infer column types, and also lets us specify our own. It provides training options such as time limits and quality versus inference speed trade-offs, and we can exclude specific algorithms. During training, we can watch the bagging and stacking process at work. On this dataset, AutoGluon achieves significantly better results than Autopilot, with an AUC of around 0.65 to 0.7, likely thanks to its wider range of algorithms and its ensembling techniques.

In summary, AutoML is powerful and interesting. It's great for those who don't like writing machine learning code and want to experiment with different algorithms quickly. With full visibility and the ability to tweak, it's a valuable tool. If you prefer a managed service, SageMaker Autopilot is the way to go. If you want more control, AutoGluon is a great option. Thank you, Siguren, for being with us.
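For reference, the SDK flow mentioned above looks roughly like the following. This is a hedged sketch, not the exact notebook code: the S3 paths and label column name are placeholders you would adapt to your own bucket and CSV header, and running it requires AWS credentials plus a SageMaker execution role:

```python
from sagemaker import get_execution_role
from sagemaker.automl.automl import AutoML

# Placeholder paths and label column; adjust to your own bucket and CSV header.
automl = AutoML(
    role=get_execution_role(),                 # works inside SageMaker; pass a role ARN elsewhere
    target_attribute_name="is_click",          # label column in the training CSV
    problem_type="BinaryClassification",
    job_objective={"MetricName": "AUC"},
    max_candidates=50,
    output_path="s3://my-bucket/autopilot-output/",
)

# Launch the AutoML job on the training data in S3.
automl.fit(inputs="s3://my-bucket/train.csv", wait=False)

# Once the job completes, the best candidate can be deployed to an endpoint:
# predictor = automl.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```

This is essentially the configuration equivalent of the Studio UI walkthrough: same problem type, same objective metric, same S3 locations.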
I hope you found this first episode on AutoML interesting. We have three more episodes, and they'll get even more exciting. Stay with us until then, and have a great week. See you soon.

Tags

AutoML, SageMaker, Click Prediction, Machine Learning, Data Science

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.