Good morning, everyone, and welcome to this new season of SageMaker Fridays. It's already season four. My name is Julien, and I'm a principal developer advocate focusing on AI and machine learning. Please meet my co-presenter. Hi, everyone. My name is Ségolène, and I'm a senior data scientist working with the AWS Machine Learning Solutions Lab. My role is to help customers get their ML projects on the right track to create business value as fast as possible. Thank you for joining us. We will definitely need your expertise. If you're new to SageMaker Fridays, you'll quickly find out that this is 100% demo. We have just a few intro slides to let you know about this new season, but then we'll be running code and diving into problems. If you have questions, please ask. We have moderators who are ready to answer everything, so don't be shy: ask anything you'd like to know, and make sure you learn as much as possible. Okay? So here's the agenda for this new season, in three parts. The first four episodes focus on the data science part of machine learning: preparing data, training models, explaining models, and so on. The next four episodes cover ops, with a strong focus on automation; we'll probably revisit some of those early examples and add automation to them. Finally, we have four episodes where we look at AutoML, trying to build models automatically. So quite a lot of episodes ahead, and we hope you find them interesting all the way to October 22.
What are we going to talk about today? Today, we are going to work on a music recommendation problem, focusing on the data science and machine learning angle: data preparation, training, deployment, prediction, and explainability. Let's look at the pipeline. Don't worry, we'll show you the URL again if you didn't catch it the first time around. This is the end-to-end workflow we're covering, and it's a great opportunity to learn about many of SageMaker's capabilities. We'll start from a dataset, use SageMaker Data Wrangler to process it and run feature engineering, ingest the engineered features into SageMaker Feature Store, and use those features to create a training dataset; we'll use Amazon Athena to query the Feature Store and retrieve the data. Then, we'll use that dataset to train a model with XGBoost, and use SageMaker Debugger during training to capture the model state, particularly for explainability and feature importance. We'll stop here for this episode because there's so much more to cover. We'll discuss the rest, mostly automation, the model registry, model monitor, and other ops topics, in a September episode. For now, we'll cover data prep, the Feature Store, training, and explainability with feature importance.
That's about as much of the slides as you'll get today. Let's save them for later and start talking about the problem: music recommendation with XGBoost. It's a bit of a surprise, because I thought recommendation was a very hard problem that required fancy deep learning models, yet here we're using XGBoost. So how did we frame the problem so that XGBoost can learn it? We framed it as a regression problem. Each track has features such as energy, speechiness, tempo, and genre, and from those features we try to predict a numerical rating from one to five, instead of reaching for more advanced deep learning techniques. So, it's a regression problem.
Looking at the dataset, we have the tracks.csv file with features like length, energy, speechiness, instrumentalness, tempo, and genre. Most of these are values between zero and one: for speechiness, zero means no speech and one means the track is entirely spoken, and similarly for how instrumental the track is, whether there's a live audience, and so on. I should mention that there's no heavy metal or hard rock, but I guess I don't need machine learning to tell me what I like. Heavy metal, nothing else. Let's not discuss my musical taste.
In the ratings file, we have a user ID, a track ID that matches the track ID in the other file, the session ID, the position of the song in that session, and the rating. These are individual reviews. Assuming I were in this dataset and had rated 200 tracks, how would you know what I like? You would aggregate these features and build the recommendation system on top of that. For a given user ID, you would look at all the ratings for that user, find the five-star ratings, and understand their preferences; for example, they might like very energetic, up-tempo tracks. The five-star reviews are a good indicator, so you compute aggregated stats over the ratings. We need to do this ourselves because it's not in the initial dataset; it's part of our data prep workflow. These aggregates become new features that we can inject, as sketched below.
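To make that concrete, here is a minimal sketch of the five-star aggregation in pandas, with illustrative file and column names; in the actual workflow this logic runs inside Data Wrangler, but the idea is the same.

```python
import pandas as pd

# Illustrative file and column names, matching the schema described above.
ratings = pd.read_csv("ratings.csv")   # userId, trackId, sessionId, rating
tracks = pd.read_csv("tracks.csv")     # trackId, energy, tempo, speechiness, ...

# Keep only the five-star reviews and join in the track features.
five_star = ratings[ratings["rating"] == 5].merge(tracks, on="trackId")

# Average the track features per user: these aggregates become
# the new user-preference features we inject into the dataset.
user_prefs = five_star.groupby("userId")[["energy", "tempo", "speechiness"]].mean()
```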
How big is this dataset? It's based on the Last.fm Million Song Dataset with additional processing. The dataset we're working on today has about 140,000 tracks, 258 users, and around 700,000 ratings. It's reasonably large, with many ratings per user, so we should be able to do a good job. The tracks have their properties and a track ID; the ratings reference that track ID and carry the user ID and the rating itself.
Let's look at data prep now. We're using SageMaker Data Wrangler, where you can visually import datasets and add transforms. We've applied a few simple transforms. For tracks, we insert a new column called event time with a timestamp, and we compute a new column called danceability, which I don't like, so I should rename it headbangability. We also encode categorical features, like the genre. For ratings, we probably insert a timestamp as well, and we set data types for the columns. We can export this workflow. For the two tables, we do limited work individually, but then join them on the track ID. This gives us the track features for each review in the same row, which is exactly what we need to compute the aggregates. Finally, we apply a custom PySpark transform to compute averages of the numerical columns and one-hot encoded genres for five-star reviews; these become the user preferences. The sketch after this paragraph shows what such a custom transform can look like.
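Inside Data Wrangler, a custom PySpark transform receives the current table as a DataFrame named df and writes the result back to df. Here is a sketch of the event-time and genre-encoding steps under that convention, with illustrative column names; Data Wrangler's visual transforms produce equivalent logic.

```python
import time
from pyspark.sql import functions as F

# Data Wrangler exposes the input table as `df`.
# Insert the event-time column that Feature Store will use as a timestamp.
df = df.withColumn("eventTime", F.lit(time.time()))

# One-hot encode the genre column (the visual "encode categorical"
# step does this for you; this is the equivalent Spark logic).
genres = [row["genre"] for row in df.select("genre").distinct().collect()]
for g in genres:
    df = df.withColumn(f"genre_{g}", (F.col("genre") == g).cast("int"))
```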
To run the transforms, we go to export. We can save to S3, which means running a SageMaker Processing job. SageMaker Processing is easy to work with and lets you run batch jobs for data prep, model evaluation, and so on: you run a job on managed infrastructure with a few lines of code and get the processed data in S3. Another option is to export the Python code for the feature engineering workflow, which you can put in your repository. We can also generate a notebook that exports the data to the Feature Store, which is what we'll do here. Finally, we can set up a pipeline for automation, which we'll cover later.
We have a flow file in JSON format, and we run it three times to process tracks, ratings, and user preferences: three notebooks that do the same thing over different sources. For tracks, we need a schema for the Feature Store, that is, the name of each feature and its type. We then create a feature group, the object that stores features from a data source. A feature group has a name, a schema, a unique identifier (the record ID), and an event time: the record ID uniquely identifies each row, and the event time is the timestamp for the engineered features. We create the feature group, wait for it to be fully created, and then run our workflow to push data to the Feature Store, as in the sketch below.
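Here is a minimal sketch of that step with the SageMaker Python SDK, assuming a pandas DataFrame tracks_df holding the processed tracks; the feature group and column names are illustrative.

```python
import time
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = sagemaker.get_execution_role()

tracks_fg = FeatureGroup(name="track-features", sagemaker_session=session)

# Build the schema (feature names and types) from the DataFrame.
tracks_fg.load_feature_definitions(data_frame=tracks_df)

tracks_fg.create(
    s3_uri=f"s3://{session.default_bucket()}/feature-store",  # offline store
    record_identifier_name="trackId",      # unique identifier for each row
    event_time_feature_name="eventTime",   # timestamp added during data prep
    role_arn=role,
    enable_online_store=True,
)

# Creation is asynchronous: poll until the group is fully created.
while tracks_fg.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)

# Push the processed rows into the feature group.
tracks_fg.ingest(data_frame=tracks_df, max_workers=3, wait=True)
```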
We use SageMaker Processing, a batch job with inputs and outputs. The input is the tracks.csv file in S3, and the output is the processed data from a specific node in the workflow graph; we pick the node identifier from the JSON flow file to get the processed data at different stages. We run the processing job on the specified instance size and count, and it saves the data to S3. We do this for user preferences and ratings as well.
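The exported notebook builds this job for you with Data Wrangler's own container. As a generic illustration of the SageMaker Processing pattern (inputs pulled from S3, outputs written back to S3), here is a sketch with a stock scikit-learn container, a hypothetical script name, and hypothetical S3 paths.

```python
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

role = sagemaker.get_execution_role()

processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",   # instance size
    instance_count=1,               # instance count
)

processor.run(
    code="prepare_tracks.py",       # hypothetical script holding the transforms
    inputs=[ProcessingInput(
        source="s3://my-bucket/raw/tracks.csv",
        destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(
        source="/opt/ml/processing/output",
        destination="s3://my-bucket/processed/tracks")],
)
```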
Now that we've processed our data, we can train our model. We build the dataset by querying the three feature groups in the offline feature store using Amazon Athena. We can query the data directly in S3 because creating a feature group automatically creates an Athena table. We have tables for ratings, tracks, and user preferences. We can preview the data and see the five-star aggregated preferences, such as user 11063 liking high-energy songs and disliking acoustic and instrumental songs, with a favorite genre of rap.
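The SDK wraps this nicely: each feature group can hand you an Athena query object, and the table name is resolved for you. A sketch, reusing the session from earlier and assuming a ratings_fg feature group created the same way as tracks_fg:

```python
# The Athena table was created automatically along with the feature group.
ratings_query = ratings_fg.athena_query()
ratings_table = ratings_query.table_name

ratings_query.run(
    query_string=f'SELECT * FROM "{ratings_table}"',
    output_location=f"s3://{session.default_bucket()}/athena-results",
)
ratings_query.wait()

# Retrieve the query results as a pandas DataFrame.
ratings_df = ratings_query.as_dataframe()
```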
We merge the ratings with the track and user preference data to create the final dataset: each row has the rating, review information, user preferences, and track information. We split the dataset for training and validation, save the splits to CSV files, and upload them to S3. We then configure our training inputs and set up the training job with infrastructure requirements, the XGBoost container, and hyperparameters. We use SageMaker Debugger to save model state during training, including metrics, feature importance, and SHAP values, and we set rules to check for unwanted conditions, like the loss not decreasing, which could indicate a training problem such as overfitting.
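A sketch of that training setup, with illustrative instance types and hyperparameters; the Debugger collections named here are the built-in ones for XGBoost (metrics, feature importance, SHAP values).

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.debugger import (CollectionConfig, DebuggerHookConfig,
                                Rule, rule_configs)

session = sagemaker.Session()
container = sagemaker.image_uris.retrieve(
    "xgboost", session.boto_region_name, version="1.2-1")

xgb = Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",
    # Tell Debugger which tensor collections to capture during training.
    debugger_hook_config=DebuggerHookConfig(
        collection_configs=[
            CollectionConfig(name="metrics"),
            CollectionConfig(name="feature_importance"),
            CollectionConfig(name="full_shap"),
        ]
    ),
    # Built-in rule that flags a loss that stops decreasing.
    rules=[Rule.sagemaker(rule_configs.loss_not_decreasing())],
)
xgb.set_hyperparameters(objective="reg:squarederror", num_round=100)
```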
We train the model with a one-liner, passing the location of the training and validation sets. The job runs on the specified infrastructure, pulls the container, loads the data, and starts training. Once complete, the infrastructure is shut down, and the model is saved in S3. We can then load the data saved by SageMaker Debugger and plot metrics like the root mean square error, as well as feature importance and SHAP values, to understand how each feature contributes to the predicted rating. For example, we see that some features push the prediction up, while others push it down, and we can analyze the SHAP values for each instance in the training set to explain individual predictions.
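The one-liner and the Debugger read-back could look like this, with hypothetical S3 URIs for the CSV channels; smdebug is the open-source library that reads the saved tensors, and the tensor name follows XGBoost's metric naming.

```python
from sagemaker.inputs import TrainingInput
from smdebug.trials import create_trial

# The "one-liner": launch training on the train and validation channels.
xgb.fit({
    "train": TrainingInput(train_s3_uri, content_type="text/csv"),
    "validation": TrainingInput(val_s3_uri, content_type="text/csv"),
})

# Open the tensors Debugger saved during the job.
trial = create_trial(xgb.latest_job_debugger_artifacts_path())
print(trial.tensor_names())

# Validation RMSE at each saved step, ready to plot.
rmse = [trial.tensor("validation-rmse").value(s) for s in trial.steps()]
```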
This is a good example of a data scientist's daily job, from processing data to training and explaining the model. We'll come back to downstream tasks like the model registry and automation with pipelines in future episodes. The repository for this example is available, and you can run the code yourself. Thank you for joining us. Next time, we'll cover fraud detection, including bias analysis and mitigation. Thank you, Ségolène, and thank you everyone for listening. See you next week for another episode. Bye-bye.