Hi everybody and welcome to this new episode of SageMaker Fridays, Season 4. My name is Julien and I'm a Principal Developer Advocate focusing on AI and Machine Learning. Once again, I sit with my co-presenter. Hi everyone, my name is Ségolène and I'm a Senior Data Scientist working with the AWS Machine Learning Solutions Lab. My role is to help customers get their ML project on the right track to create business value as fast as possible. Great. Thanks again for helping us prepare this new episode, Sego. So once again, this episode is going to be demo-based. If you have any questions, ask away. We have friendly moderators who are waiting to help. So don't be shy. There are no silly questions. Ask everything you'd like to know and make sure you learn a lot.
In previous weeks, we've covered different use cases on model building and model tuning. This is actually the last episode focusing on this aspect of machine learning. Starting next week, we will start looking at automation. We will probably revisit some of our previous use cases with an automation angle. Okay. But for now, we are going to look at a new use case. So Sego, what are we looking at this week?
So this week, Julien, we are going to work on a recommendation use case specialized for retail application. Starting from an online retail data set, we are going to train a model that predicts the quantity of items that a customer is likely to buy. Okay, interesting. So another recommendation example, but this time for retail. Here's the notebook we're going to work with. You can get it right now or you can get it later. I'll show this again at the end of the episode. Don't worry if you didn't have time to grab it. Everything we're doing today you can actually replicate. Let's take a quick look at the architecture.
We're going to start from a data set, do some processing on it, some cleaning, and some heavy-duty preparation for the algorithm we're using. Today, we're using factorization machines. The data needs to be organized in a very precise way, as we'll see. We'll spend some time discussing that, and then, of course, we'll train the model and see how that works. And I guess we'll talk about deployment and automation, starting in future episodes, right? Starting next week.
Okay, so recommendation. Let's talk a little bit about that. It's retail, so we have users, customers, items, and certain customers buy or interact with certain items. What's the name of the game, really? How can we represent that problem? We can represent this problem as a very large matrix showing the interaction between users and items. Imagine we're trying to build a recommendation model for movies. We have users and movies. Potentially lots of users and lots of movies. Here we have a very small number, obviously, to put that stuff on the slide. Certain users are watching certain movies and giving them a rating. One, they really didn't like the movie. Five, they loved it. Kind of like Amazon stars. For example, user one has watched three movies, etc.
That's the starting point. That's ground truth. But would user 1 like movie 1? We don't know yet. Imagine we had 10,000 movies. User one has not watched 10,000 movies. Maybe they watched 15 or 20 or 100. But what about all the other ones? We're trying to get a score, a rating for those empty cells. That's the game. We're trying to fill all those cells and then find the ones with the highest score and recommend those. No one wants to be recommended a thousand movies, but out of a thousand movies, which are the 10 highest scores, so to speak, right? The 10 movies that we think you're going to like.
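To make that "top 10" idea concrete, here's a minimal sketch (with made-up item names and scores) of picking the k highest-scoring items once the model has filled in the empty cells:

```python
import heapq

# Predicted scores for items the user has NOT interacted with yet
# (illustrative values, not model output).
predicted = {"movie_a": 4.7, "movie_b": 1.2, "movie_c": 4.9,
             "movie_d": 2.8, "movie_e": 3.1}

# Recommend the k highest-scoring items, not the whole catalog.
k = 3
recommendations = heapq.nlargest(k, predicted, key=predicted.get)
print(recommendations)  # ['movie_c', 'movie_a', 'movie_e']
```

`heapq.nlargest` avoids sorting the full catalog when you only need the top few scores.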
In this case, as you mentioned, we're doing things a little bit differently. We're trying to predict the quantity of items that a user is going to buy. It's a buy signal: we think you would be interested in buying those items. It's a different spin on the recommendation problem, but it's still a recommendation problem. Obviously, we're not using movies here. What data set are we working with?
The game is to fill those empty cells, and we'll illustrate that with the data set we are going to use. Today, the data set comes from the UCI Machine Learning Repository, the famous one, and it contains all the transactions occurring between 2010 and 2011 for a UK-based online retail store. It's one year of historical data with about 540,000 transactions, if I remember correctly. So, reasonably large. We're going to use this data set to create our recommendation system.
Here's the actual data set. It's a CSV file, and we see the invoice number, the stock code (serial codes for the products), a description, the quantity that the customer bought, the date, the unit price, the customer ID, and the country, which is the UK here. For example, all these rows belong to a single invoice: the same UK-based customer bought several different products in one order. If you're wondering why someone would buy so many items, like 32 bird ornaments, this data set actually contains information for wholesale customers. That's why predicting the quantity of items makes sense. Individual customers would buy one piece of pretty much everything, but for wholesale customers, the quantities are much higher.
It's a pretty easy, simple data set, quantities, text. We'll need to do something about that. We might do a bit of NLP here. It's 500,000 rows. We're going to go all the way. But in fact, it's quite small when we talk about recommendation. What about Amazon? What about Netflix? What about Spotify? What about even large e-commerce websites? Millions of users. Millions of items. Millions of items, maybe even more. But let's say, okay, millions times millions.
Imagine millions of rows, millions of columns. That's thousands of billions of cells. And, of course, most of them are empty. So it's a very sparse matrix. I've been buying stuff on Amazon for about 20 years. I think my first orders were in 2001. So let's say I've bought one product per week for 20 years. 50 weeks per year, 20 years, it's a thousand weeks. My gut feeling is I bought a thousand products from Amazon, something in that range. Assuming there are 10 million products on Amazon right now, and I'm thinking it's probably more, but let's stick with the 10 million. If I built a row for me with the 10 million products and flagged the thousand that I bought, that row would be 99.99% empty. That's why we give the 99.99% sparsity number, and it makes sense.
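The back-of-the-envelope math above is easy to check in a couple of lines:

```python
# Sparsity check for the example in the text:
# ~1,000 purchased products out of a ~10,000,000-product catalog.
purchased = 1_000
catalog_size = 10_000_000

filled_fraction = purchased / catalog_size
sparsity = 1 - filled_fraction

print(f"filled: {filled_fraction:.4%}")  # 0.0100%
print(f"sparse: {sparsity:.2%}")         # 99.99%
```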
That's pretty bad because if we want to store that data set in a matrix like that, 99.99% of the cells are empty. They hold zero values, but each one still takes 32 bits. When we start working with the matrix, multiplying things, we end up mostly multiplying zero by zero. So it's horribly inefficient. That's the first problem we have to solve. Let's keep that in mind for now.
Let's talk about the algo we're going to use. Based on what we said and this issue of sparse data, we're going to replace a very large, very sparse matrix by projecting it into two much smaller, dense matrices. We're going to explain this. We want to approximate existing instances as closely as possible and predict new instances. We are going to use the factorization machines algorithm, which can be seen as a generalized linear regression algorithm. It's a really cool algorithm; I put a link to the research paper. It was actually invented in 2010 and is still heavily used today. It is built into SageMaker, so we don't have to write the code, which is great.
Imagine on the left here, you have your matrix with users and items and either a rating or whatever value you're trying to predict. For us, it's going to be quantities. Imagine this thing is 1 million rows by 1 million columns, maybe more. Hugely inefficient. What factorization machines do is compute two smaller, much smaller matrices. You can see they have a common dimension, which is the number of factors: this dimension here and this dimension here. We'll talk about this again. They are dense matrices. The magic is that when you multiply those two matrices, not only do you closely approximate the existing values, the ground truth values that you have, but you also compute values for all the empty cells.
That's pretty cool because this means your model is really built from those two much smaller matrices. You have a really small model that predicts the right value, predicts closely to ground truth, and predicts all the other cells. That's what we're trying to do here. Really, really cool. So that's what we're using, factorization machines. And as you mentioned, it's built into SageMaker. So let me jump to this.
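Here's a minimal NumPy sketch of the idea (random factors, purely illustrative): two small dense matrices whose product yields a score for every user-item pair, including the cells that were empty in the ground truth:

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items, n_factors = 6, 8, 3

# Two small dense factor matrices -- this is what the model actually learns.
user_factors = rng.normal(size=(n_users, n_factors))
item_factors = rng.normal(size=(n_items, n_factors))

# Multiplying them back fills EVERY cell of the big user x item matrix,
# including the ones that were empty in the ground truth.
predictions = user_factors @ item_factors.T
print(predictions.shape)  # (6, 8)
```

At scale, this is where the savings come from: 1 million users times 1 million items is 10^12 cells, but with 64 factors the two factor matrices hold only about 1.3 x 10^8 numbers.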
Here's the research paper. Oh, it's actually 2010. Yeah, it's the end of 2010. I was close. For Friday, it's not too bad. You can go and read the cool paper. If you look at the SageMaker documentation, you see it's supported, and you get plenty of interesting information. We'll look at some of the hyperparameters, etc. It's generally an easy algorithm to work with because it doesn't have so many ultra-weird parameters. But preparing data for it is a little different from what we usually do. So I think it's good that we look at this one now.
Let's go and take a look at the notebook. First, we're doing basic cleaning, basic prep, and then we're doing the actual formatting for the algo. The repository actually includes a SageMaker Data Wrangler workflow. We've covered Data Wrangler very extensively in the first two episodes of the season, so I'm going to skip it. But if you want to go and run Data Wrangler and process the data with it and add more transforms, it's in the repo. The flow file is here, and you can go and do this.
For a change, we're going to run the data prep with Python code in the notebook. It's very interesting because you can very easily replicate. Typically, some people would start with manual work in the UI, some people would work with Python code, whatever floats your boat. But it's easy to do one and the other. If you write Python code and then want to automate with a workflow, it's easy to apply the same transforms in Data Wrangler. And if you did the work manually in Data Wrangler, you can easily export to Python code.
So let's do this in a notebook and do some pandas. Load the data set. We see the same thing, of course. First, let's try and find if we have missing values. We have some empty descriptions, which we can live with, but we have lots of missing customer IDs. That's very bad because how could you recommend something to someone you don't know? We're going to be very conservative and just drop the rows with no customer ID. We're stripping the descriptions, removing any left and right spaces.
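The two cleaning steps can be sketched in pandas on a toy frame (the column names mirror the UCI data set; the values are made up):

```python
import pandas as pd

# Toy frame mimicking the UCI online retail columns used in the episode.
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366"],
    "Description": ["  WHITE HANGING HEART ", "BIRD ORNAMENT", " RED MUG "],
    "Quantity": [6, 32, 4],
    "CustomerID": [17850.0, None, 13047.0],
})

# Drop rows with no customer ID -- we can't recommend to an unknown customer.
df = df.dropna(subset=["CustomerID"])

# Strip leading and trailing whitespace from the descriptions.
df["Description"] = df["Description"].str.strip()

print(len(df))  # 2 rows survive
```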
In the process, we did lose a bit of data, which is a shame, but customer ID is key. We can compute some stats on those numerical fields. We're going to look at the quantity, the unit price. Everything looks okay. Some products have a zero unit price. I don't want to zoom in on this because there could be freebies, but in real life, I would investigate. Negative quantities, no. Something is wrong. We can find all the rows with negative quantities. There are about 9,000 of them. We'll take a deeper look, but for now, we just drop them. Now we drop them, and the minimum quantity is one. You can order free stuff, but you've got to order at least one thing.
As we saw in the data set, we have invoices with multiple items. The same customer will have multiple invoices, and they could be repeat orders for the same products. To get the actual quantity, we want to group rows that have the same product and the same customer. It looks like some people may want to order inflatable political globes again and again. A Swiss customer did order 12. It could be one single order or multiple orders. As we're trying to predict the quantity for each individual item, we want to reduce that. We use Pandas. You could use Athena on that CSV file if you prefer SQL. Pandas is very nice.
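The grouping step is essentially a one-liner in pandas; here's a sketch on made-up rows:

```python
import pandas as pd

# Made-up repeat orders: the same customer orders the same item twice.
df = pd.DataFrame({
    "CustomerID": [12680, 12680, 13047, 12680],
    "StockCode": ["22629", "22629", "22629", "21754"],
    "Quantity": [4, 8, 2, 3],
})

# Collapse repeat orders: one row per (customer, item), quantities summed.
grouped = (
    df.groupby(["CustomerID", "StockCode"], as_index=False)["Quantity"].sum()
)
print(grouped)
```

Customer 12680 ends up with a single row for item 22629 with quantity 12, which is the total we want to predict.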
Now we end up with about 200,000 rows: unique customer-item pairs with the quantity. Customer IDs, item IDs, and quantities in the middle. Fine. Now we need to do a little more because we have categorical values: countries, customer IDs, and items are all categorical. Each is a different dimension, either a row or a column. They need to be encoded as such. They're not numerical values.
We do that. Then we have the description. We could go extremely fancy here, but the descriptions are pretty short. If we look at the data set, there are no extensive descriptions; they're five or six words. So we use TF-IDF vectorization (term frequency-inverse document frequency) to score how often each word appears. That's going to give us quite a few additional columns because the vocabulary in the descriptions is probably a few hundred words, maybe more. So that's going to create lots of different values.
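A minimal TF-IDF sketch with scikit-learn, using a few made-up descriptions (the vectorizer settings here are defaults, not necessarily the notebook's exact configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Short product descriptions, like the ones in the retail data set.
descriptions = [
    "WHITE HANGING HEART T-LIGHT HOLDER",
    "RED WOOLLY HOTTIE WHITE HEART",
    "INFLATABLE POLITICAL GLOBE",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(descriptions)  # sparse matrix

# One row per description, one column per word in the vocabulary.
print(tfidf.shape)
```

Each description becomes a sparse row of word weights, which is exactly the kind of extra column block we can stack next to the one-hot customer and item columns.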
Now we get to the slightly complicated bit where we need to build the real matrix that the algo will work on. It needs to look like this. Let me go full screen because I need to explain. All dimensions in the problem need to be columns. We have as many columns as we have customers and as many columns as we have items. I'm still using the movie example, but it could be products. Each row flags or codes the actual user and the actual item. Each row is going to be customer one, two, three, and item four, five, six. We would have a one for the customer cell, one for the product cell, and the label would be the quantity they bought.
As you can imagine, this is going to get really big. If we have 1,000 customers, then we have 1,000 columns. If we have 10,000 products, then we have 10,000 columns. We also injected the vector with the description, so there are even more columns. It's going to be a very big, very sparse matrix, even sparser than before because we added so many columns.
This is the pretty specific input format that the factorization machines algo needs. That's something that threw me off when I started working with this algorithm, because I had this row-versus-column thing in my head and that didn't work. Then I realized, no, it's really all columns. Rows are the different instances, and columns are really all the features, all the dimensions.
This is what we need to build, and we need to save it in an efficient format. Not CSV. Not even NumPy. It's going to be a compressed sparse matrix, and we can actually see how sparse this thing is. The hstack call here is a horizontal stack. It just takes the one-hot encoded columns and the TF-IDF columns for the description and puts everything side by side. The quantity is the label. We have all our columns next to each other, plus the labels.
We do this, and then if we compute the sparsity, NNZ gives you the number of non-zero cells, and we can see it's actually 99.9% sparse. We have a huge matrix, and only a tiny fraction of it carries useful non-zero values. So in this case, imagine millions of items, millions of users. When you start one-hot encoding and processing, it could be tens of millions, hundreds of millions of columns. It's literally huge. Plus, if you have millions of actual ratings, you can't work with that.
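Here's a small illustration of that layout with SciPy (tiny made-up dimensions; the notebook's real matrix has thousands of columns): each row one-hot flags one customer and one item, `hstack` glues the blocks side by side, and `nnz` gives the sparsity:

```python
import numpy as np
from scipy import sparse

n_customers, n_items = 5, 7
# (customer index, item index, quantity) -- illustrative ground truth.
rows = [(0, 3, 6.0), (2, 1, 32.0), (4, 3, 4.0)]

customer_onehot = np.zeros((len(rows), n_customers))
item_onehot = np.zeros((len(rows), n_items))
labels = np.zeros(len(rows), dtype=np.float32)
for i, (c, it, qty) in enumerate(rows):
    customer_onehot[i, c] = 1.0  # flag the customer column
    item_onehot[i, it] = 1.0     # flag the item column
    labels[i] = qty              # the quantity is the label

# hstack glues the column blocks side by side: one column per customer
# plus one column per item (plus, in the notebook, the TF-IDF columns).
X = sparse.hstack([sparse.csr_matrix(customer_onehot),
                   sparse.csr_matrix(item_onehot)], format="csr")

# nnz is the number of non-zero cells; from it we get the sparsity.
sparsity = 1.0 - X.nnz / (X.shape[0] * X.shape[1])
print(X.shape, f"{sparsity:.1%} sparse")  # shape (3, 12)
```

Only two cells per row are non-zero, so the sparsity grows with every customer and item we add.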
We're going to split it for training and validation and save that for archive purposes to NumPy. But the real data formatting step is this. We're taking that hugely empty and inefficient NumPy array, converting it to a compressed sparse row (CSR) matrix, and saving it to protobuf format. Protobuf is a binary serialization format, which is pretty efficient and compact. So not only do we have this CSR object, which is optimized not to store zeros, but we're taking this and saving it to a compact serialization format.
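To see why CSR matters before we even get to protobuf, here's a sketch comparing the in-memory footprint of a dense NumPy array with the same data stored as a SciPy CSR matrix (the dimensions and density are illustrative):

```python
import numpy as np
from scipy import sparse

# A 1,000 x 9,300 float32 matrix with roughly 0.1% non-zero cells,
# standing in for the one-hot/TF-IDF design matrix from the notebook.
rng = np.random.default_rng(42)
dense = np.zeros((1000, 9300), dtype=np.float32)
idx = rng.integers(0, dense.size, size=9300)
dense.flat[idx] = 1.0

csr = sparse.csr_matrix(dense)

# CSR stores only the non-zero values plus their column indices
# and per-row offsets -- no zeros at all.
dense_bytes = dense.nbytes
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(f"dense: {dense_bytes / 1e6:.1f} MB, CSR: {csr_bytes / 1e6:.3f} MB")
```

The dense array weighs in around 37 MB while the CSR version is well under 1% of that, which is the same effect the notebook gets before serializing to protobuf with the SageMaker SDK utility.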
Eventually, this is actually pretty small and efficient. You could take a huge initial data set and compress it into a very efficient set of files. We have a protobuf file for the training set and a protobuf file for the test set. All the magic happens in this utility function, which is part of the SageMaker SDK. It takes your feature matrix and labels and writes that stuff to a memory buffer. Then that memory buffer gets pushed to S3 as an object. You write to memory, and then you push that to S3.
This is completely generic. I've used this function in many notebooks. As long as you pass the right types in X and Y, that works. We do this for the train set, the test set, and we have our friendly protobuf files in S3. Here we're doing it in a notebook, but at scale in production, you would automate this, maybe run it in SageMaker processing or on other services. These are the steps.
Just to make sure it's clear: this is the naive, intuitive view, and this is what factorization machines are going to do, split that thing. And this is the technical input format for factorization machines. Same thing, but different formats, different views. We're structuring the problem in a way that the algo can learn from. Hopefully, that makes sense.
Now we can train. Perfect. Let's jump to the good bit. Once again, it's a built-in algo. We've seen this in the doc. The doc is nice until you hit the formula, so let's go through it in a simpler way: the hyperparameters. Let's look at the required ones. Feature dimension is the dimension of the input feature space. In plain speak, that means the number of columns. In our case, we have the actual number: over 9,300 columns because of the number of users, items, and the vectorization of the description.
Number of factors is the common dimension of those two matrices. The doc says 64 typically generates good outcomes and is a good starting point. We're going to try 64 and then discuss other options. Predictor type. Factorization machines can be used for regression, which is what we're doing here, trying to predict the numerical value. It can also be used for binary classification. We can predict in the 0, 1 range and say, lower than 0.5 is no or 0, and higher than 0.5 is 1, yes, true, whatever. Here, we're definitely using regression.
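The binary-classification reading described above boils down to thresholding the score; a trivial sketch (the function name is ours, not the algorithm's):

```python
def to_class(score, threshold=0.5):
    """Binary-classifier reading of a model score: below the threshold
    is class 0 (no), at or above is class 1 (yes)."""
    return 1 if score >= threshold else 0

print(to_class(0.3), to_class(0.8))  # 0 1
```

In this episode we use the regression mode instead, so the raw score itself is the predicted quantity.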
All the other ones are optional and scary. Unless you read the research paper and figure out if bias, weight decay should be decreased, I'm happy to live with default values. Number of epochs. This is optional, but generally, you don't need a large number of epochs to get okay results. In this case, we're going to ignore them.
What are we doing here? Once again, we retrieve the built-in algo container. If you're new to SageMaker, training and deployment on SageMaker is always container-based. In this case, as we're using a built-in algo, we just retrieve the name of the container for factorization machines. You don't need to worry about Docker containers or writing that algo. It's already there. Just go and grab it.
The estimator is the central object. It's where we configure the training job. Which algo to use, meaning which container. The role, that's permissions, allowing SageMaker to grab the container and read and write from S3. The role is attached to your SageMaker Studio instance, so usually that just works. Infrastructure requirements: the built-in algos have instance recommendations, and CPU instances are recommended for training and inference here. Training with one or more GPUs on dense data might provide some benefit, but we have very sparse data, so we don't want the GPU to multiply zeros just to tell us the result is zero. We'll stick with CPU instances. A c5.xlarge is a reasonable choice.
Where to save the model. Now, the hyperparameters. Number of dimensions: we have that value, the 9,300-something number of columns, minus the label. The predictor type: it's a regression problem. Batch size, 1,000. Why not? Number of factors, 64, because the doc says it generates good outcomes and is a good starting point. We'll train with 20 epochs. We could monitor the metrics. Does this one support early stopping? No. Or patience? No. Okay. We'd have to be careful that we're not overfitting. We can look at the log.
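Collected as a plain dictionary, the settings discussed above look like this (the hyperparameter names follow the SageMaker factorization machines documentation; the feature_dim value here is an illustrative placeholder, since the real one comes from the design matrix, e.g. X_train.shape[1]):

```python
# Illustrative placeholder -- in the notebook this is the actual column
# count of the design matrix, a bit over 9,300.
num_columns = 9300

hyperparameters = {
    "feature_dim": num_columns,     # total columns: users + items + TF-IDF
    "predictor_type": "regressor",  # we predict a quantity, not a class
    "num_factors": 64,              # common dimension of the factor matrices
    "epochs": 20,
    "mini_batch_size": 1000,
}
print(hyperparameters)
```

These are the values you'd pass to the estimator before calling fit.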
Just configure the estimator, pass the location of the training data and the test data, and call fit. We create that instance, a C5 instance. We pull the factorization machines container to it. We load the data. We are copying the data because we're using file mode. In the previous episode, we used pipe mode, which can stream data, saving the copying and not requiring lots of storage on the training instances. For recommendation, if you had a huge data set, like gigabytes, tens of gigabytes, hundreds of gigabytes, pipe mode would be good. But here, it's not so big.
It trains, and we have the training log. The first epoch is not so great; we see an RMSE loss equal to 251. Then it keeps training, and we hopefully get to a lower value. We get 60, which is much better. We're learning stuff. It's always a little bit difficult to interpret that value. For linear regression, it's easier because you can compare the scale of the loss to the scale of the values you're trying to predict, but here it's a little more complicated. At least we're learning, and good stuff is happening.
We train for only two minutes. The instance shuts down, and we only pay for that. We've not discussed cost optimization so far. It's a good topic for the automation episodes. Spot instances could give us a discount. We'll save you money in the other episodes. Just an incentive to keep you watching us.
We've trained, and we can see the training job in Studio. It's a very short training job, only two minutes, so we don't have a lot of data points, but if you have longer jobs, you can plot your metrics right there. You can plot line charts, time-based charts, etc. Here we just get the final value. If you have training jobs that last longer, you can easily see that nice curve. We see all the parameters, features, etc. It's all in there, and we see it in Studio.
Of course, you can query that. You can describe the jobs. We've seen some of those APIs over time. We didn't do any debugging stuff. We showed debugger early on. We showed explainability in the fraud detection example two weeks ago. Bias, we also showed two weeks ago. If you're curious about those, go and watch those previous episodes.
We trained, and we have a model. The next step would be to deploy it and predict with it. We'll talk about that in the ops/automation episodes in September. Stick with us. This is pretty much what we wanted to show you today.
Quick recap. We started from a reasonable CSV data set. We could pre-process it with SageMaker Data Wrangler, which makes Sego very happy, or we can process it in the notebook. Generally, you're going to do one or the other. I kind of like writing Python code. We did some pre-processing here, but we studied in detail how we need to format that data for factorization machines, which is a little confusing the first time you see it, but hopefully, you understand that now.
I'll give you a second to take a screenshot of this because I wish I'd found something like this on the web a few years ago. All right, and then we train, which is really the simplest thing, right? Just create the estimator and call fit. It really isn't complicated.
I think that's the end of this episode. Let me show you the notebook URL again. Here it is. Go and grab that. You can find lots of other examples. The examples we've used so far are located in end-to-end. Music rec, which was the first one we did. The fraud detection was number two. Then use cases with computer vision and retail reco, and there are more. You have more technical examples if you want to zoom in on particular capabilities. If you're interested in PyTorch, TensorFlow, and all that good stuff, it's the Python SDK. It's a big repo. There's a good chance you're going to find something that looks like the problem you're trying to solve. It gets updated all the time.
I hope this was useful and a little bit fun because that's important as well. Next week, automation. We're done with those first four episodes on ML, data science, and model building. Next week, we're starting to look at pipelines. We'll revisit some of these examples, look at registering those models in the model registry, managing model versions, approving or rejecting models, tracking model lineage, looking at all the artifacts that go into building a model, automation with pipelines, and really cool workflows and blinking stuff in Studio. It's going to be amazing.
We have four amazing automation episodes in September. Stay with us. Take care, Sego. Thank you so much. Thanks for your help and thanks for watching everybody. We'll see you very soon. Bye.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.