SageMaker Fridays Season 3 Episode 3 Managing engineered features with SageMaker Feature Store

March 22, 2021
Broadcasted live on 19/03/2021. Join us for more episodes at https://amazonsagemakerfridays.splashthat.com/ In this episode, we build a sentiment analysis model starting from the Amazon Customer Reviews dataset. First, we import the dataset in Parquet format in Amazon Athena. Then, we import it from Athena to SageMaker Data Wrangler for a quick look. Then, we move to a Jupyter notebook and we start engineering features using popular open source libraries (nltk and spaCy), and we automate them with SageMaker Processing. Next, we load the processed dataset in SageMaker Feature Store, both offline and online. Next, we run Athena queries on the offline store in order to build a training set, which we use to train and deploy a sentiment analysis model with the built-in BlazingText algorithm. Finally, we see how to update and delete individual features in the online store, and how to use timestamps for feature versioning. 100% live, no slides :) https://aws.amazon.com/sagemaker/feature-store/ https://aws.amazon.com/blogs/aws/new-store-discover-and-share-machine-learning-features-with-amazon-sagemaker-feature-store https://github.com/juliensimon/amazon-studio-demos/tree/main/sagemaker_fridays/season3/s03e03

Transcript

Hey, good morning, everybody, and welcome to this new episode for season three of SageMaker Fridays. My name is Julien, and I'm a dev advocate focusing on AI and machine learning. And once again, please meet my co-presenter. Hi, everyone. My name is Ségolène, and I'm a senior data scientist working with the AWS Machine Learning Solutions Lab team. My role is to help customers get their ML projects on the right track in order to create business value as fast as possible. Yes, indeed. Thank you very much for being with us today. So as you know, SageMaker Fridays is a bi-monthly event, and we focus on real-life machine learning use cases, which we try to solve using AWS services and Amazon SageMaker. And we try to focus a little bit on the new capabilities that were launched at re:Invent just a few months ago. So we are absolutely live. We're in the Paris office. No slides, discussion and demo only. So please ask all your questions in the chat. We have friendly moderators to answer all of them. And remember, there are no silly questions. Don't be shy. Make sure you ask all your questions and learn as much as possible. Okay. All right, let's get started. We have a very busy episode today. So what is this one about? So two weeks ago, we dove deep into data preparation and feature engineering with SageMaker Data Wrangler. We briefly discussed that we could export engineered features to SageMaker Feature Store, another new launch at AWS re:Invent in 2020. And this is precisely what we are going to discuss in this episode: processing data, storing features, and reusing them again and again. Okay, let's get started. So we like to focus on real problems, not on services. So what's the actual problem, the machine learning problem we're trying to solve today? So today we are going to work on a natural language processing task called sentiment analysis. Starting from a piece of text, we would like to know if it expresses positive, neutral, or negative sentiment. Okay. 
We can view this problem as a classification problem with three classes, and the predicted output will contain three probabilities. Sure. Ideally, one of these should be much higher than the other two, telling us what the strongest sentiment is. Okay, so that's a pretty typical problem. How does it relate to actual projects that customers may want to build? So sentiment analysis is a very popular use case to understand your customers. For example, you could use it to analyze emails, social media content, comments on your blog posts, or product reviews on your website. This would let you measure how your customers feel about your products and services instantaneously and at any scale. So these are very valuable insights. Okay, I understand what we're trying to do, but don't we have a service for this? We have an AI service called Amazon Comprehend, which is super easy to use. So are you trying to make me work on a Friday afternoon for no good reason at all? Oh, Julien, you're complaining already. Yes, I am. You are right about Amazon Comprehend. It does have an API for sentiment analysis. However, data scientists may still want to build a model trained on specialized content. This is exactly what we are going to do today, as we are going to train a model on a camera review dataset. As you can imagine, photography has a lot of domain-specific vocabulary, and it is important that our model understands all these nuances. Okay. With Amazon Comprehend, of course, you can train your own text classifier, so we could definitely give that a try. But today, we would like to have full control over the machine learning process and use our own code instead. Okay, makes sense. I'm convinced. So thank you. You said this was a classification problem, and we've done this quite a few times already in previous episodes. So I'm guessing we're not repeating ourselves. So how is this one different? So you're right. 
In a previous episode, we built a fraud detection model based on tabular data to classify legitimate transactions and fraudulent ones. The huge difference here is that we are working with natural language data, which is what we call unstructured data. And we can't just throw that data into an XGBoost model and hope it will work. Because unstructured data, like natural language, needs time-consuming processing and engineering, especially since natural language datasets are usually quite large. Also, each algorithm usually requires a specific input format. So, we will have to process the same dataset in different ways during the course of the project. Yeah, so I see why we'd want to do the feature engineering once and then store the features instead of computing them again and again. So, of course, once we've stored them, we can share them with other team members or maybe other teams inside the company and save even more time for everybody. So you mentioned algorithms. I guess we're not using XGBoost today. What are we using? So we could definitely go with deep learning, maybe even use advanced models for natural language processing, such as the Hugging Face transformer models. Oh yes, maybe we will do that in a future episode. Who knows? But as you know, it is important to try the simple thing first. I've told you 100 times, try the simple thing first. So what's a reasonable first step today? So we are going to use a built-in algorithm in SageMaker called BlazingText. Ah, I love this one. Yes, definitely. BlazingText has been invented by Amazon, and as the name implies, it scales very well thanks to GPU training. It provides a highly optimized implementation of the Word2Vec and text classification algorithms. The input format for training data is also very simple. As it is compatible with the popular FastText algorithm. So that's what we are going to do. Okay, it's starting to make sense. 
So I guess the game today is to start from unstructured data, camera reviews, transform that dataset into something that BlazingText can understand and train on, store it somewhere we can use it again and again for model training and prediction, right? Okay. So let's take a look at the dataset. What dataset are we using today? So as mentioned earlier, we would like to build a model specialized for camera reviews. For this purpose, we are going to use the Amazon Customer Reviews dataset hosted on AWS. Pretty cool. So real reviews from real Amazon customers. Yes. How big is it? It includes over 130 million Amazon customer reviews, broken down by product category. Data is available in TSV format and Parquet format as well. Oh, Parquet. Okay, good. Yeah, Parquet. All right, 130 million. I think we can do that. So let's share my screen, please. And we can start looking at the dataset and exploring it. So this is the homepage, so to speak, for the Amazon Customer Reviews dataset. It's hosted in S3, as you would expect. And you can find a list of all files. So either you get the full dataset or you get the dataset broken down into categories. But we don't really want to look at CSV files. So we're going to load this dataset in Amazon Athena and try to see what this thing is. So switching to Athena, the first thing I would do, and let me zoom in a little bit, is to create an external table based on the Parquet files in S3. Okay, so don't worry, that create statement is on the page I just showed you, so you don't need to write this yourself. Create the table, and then you run that MSCK REPAIR TABLE statement, which, even though it says repair, is going to load the partition data into the table we created. Okay, and now we have this table here. I've done this previously. It's in the Athena test database. And of course, we can run some queries. So let's look maybe at the different categories available there. Okay, so run the distinct on product categories. 
And we see pretty much all the stuff that's for sale on Amazon.com, right? So video games, baby products, beauty products, automotive, and yes, camera, right? So we're interested in camera reviews. So we could say, all right, camera reviews, show me camera reviews then, right? What does this look like? Let's close this. Okay, so we see pretty much what we would expect: customer IDs, review IDs, product IDs, product title, star rating from one to five, votes, review headline, review body, the date, the year, etc., and of course, we only have cameras here. Pretty cool, we have a lot of those. Actually, we could count how many we have. So we have a little more than 1.8 million camera reviews. So I'm guessing that's enough, right? That's enough for our purpose. We could look at how many different items we have. And can you tell I'm madly in love with Athena? One of my favorite AWS services. So we have 168,000-something cameras here, okay? And we could do more, but we'll get to that later on, okay? So you can see Athena makes it super easy to explore that data. But remember last week, we talked about a SageMaker capability called Data Wrangler, right? So why don't we try and load this stuff in Wrangler? So I'm jumping to SageMaker Studio now. Okay. And if you remember last week, we saw that you could import from Athena. So let's do that. Okay. Select database. And here I just need to provide a SQL query that's going to select the data. So I'm just going to go back here. And we're going to take only the camera reviews. So let's remove the limit. And we can run this. So we should see a preview here. Let me zoom out a little bit. So we see exactly the same thing. And we confirm that this is what we want. We can import the dataset and let's call this Amazon reviews. Camera. Okay. So now it's imported, right? So we're not going to do the Data Wrangler episode again. That's last week, go and see that one. Let's just do a quick analysis on this one and why not build a histogram for star rating? 
So let's wait for a second for Wrangler to fetch column names. Take a second. And we're going to build a histogram with stars and star ratings, so we can count them. So in case you're wondering, Data Wrangler is not loading the 1.8 million reviews. It's actually sampling the table, and I think it's loading 50,000. So let's just say, all right, I want to do star rating. And yeah, that's it. And yeah, just preview. And we should see the histogram. Okay. Right. So we have lots of five-star products. Because it's Amazon, right? Of course. And we see a distribution here. So, okay, just a quick review of what we did with Data Wrangler last week. And I guess we could start doing some feature engineering here. But as Ségolène told us, we want to be fully in control. We want to do proper machine learning. And we want to write code. Yes, yes, okay. All right. So I'm not going to spend too much time on Data Wrangler. Let's close it. And we're going to move to feature engineering. Okay. So the first step would be to load the dataset, right? Of course. So there are different ways to do this. Of course, we could copy a file from S3 and load it. But it's cheap. The cooler option is to load from the CSV file that's automatically created when you run Athena queries. Okay, let me show you how to do this. So when you're in Athena, if you go to settings, yes, you can specify a location for your query outputs, which I've already done. So anytime now you run a query, you're going to find CSV files in there, organized by day. So those CSV files are the actual output of the query. And if you're wondering which one to pick, you can go into query history and you see here, click on this, you get the query ID, right? Which is what you need to look for to get results. Okay. All right. So we could do this. And there's a third way, which we'll discuss later, right? Either way, now we've loaded our data. So what are we doing? Dropping any line with null values, which is reasonable. 
With 1.8 million, I think we can afford to waste a few reviews if we need to. So we see 1.8 million rows, 16 columns. And this is what they look like. We've seen this already. No big surprise. So one thing I'm going to do: the text here is pretty much the headline and the body. So to keep it simple, I'm going to merge them. You could say, well, maybe the title has more importance and you need to build a more clever model, but I don't do that. I do the simple ones. She does the clever ones. So I'm just going to put those two things together and I'm going to keep only four columns: the review ID, the product ID, the star rating, and the review body that I just created. Okay, so we'll see why we need those things later on. Okay, so this is what the data looks like. Now, remember, we need BlazingText. Tell us a little bit about the label that BlazingText needs. It doesn't work with one, two, three, four, five, right? No, no. So the BlazingText training format expects a single preprocessed text file with space-separated tokens. Yeah, you can see it here. These are examples. Exactly. Yeah, a text label that actually says label and then those weird-looking sentences. Why do we have all those spaces? Explain that to me and to everyone else. So this is tokenization, in the context of NLP. So remember what we said at the beginning of the episode: text data is unstructured data. Yes. So in order to work with it, you need to clean and structure it before you can apply NLP techniques. Tokenization is really a key concept in NLP and one of the most common preprocessing tasks when you work with text. A tokenizer will divide strings into a list of substrings, in other words, it will split sentences or documents into smaller tokens. Okay, which we see here, right? So we have each word, and even punctuation signs. Even punctuation, exactly. Numbers. And numbers, and it's, okay, one thing and, you know, one thing only, and then spaces around it. 
Exactly, and thanks to this, everything is delimited. Exactly, and it will help you, for instance, to identify the words of interest within a string of characters. So it's really important. Okay, so basically we're telling the algo where each word and each sign is. So I understand that. Now, obviously, if we look at our text here, it doesn't look like that, right? Punctuation signs are next to words and so on. So tell me we're not going to write any code for this, right? No, no, no, we definitely don't need to do that, because we can use nice open source libraries like NLTK or SpaCy. Nice. Thank you guys, exactly. So NLTK, which means Natural Language Toolkit, was originally created 20 years ago, I think, in the Department of Computer and Information Science at the University of Pennsylvania. And it is a suite of Python libraries providing modules for the main language processing tasks, like tokenization. And we're going to use it. Sure. And then you also have the possibility to use SpaCy, which is another popular open source library for advanced natural language processing, providing software for production usage. We'll look at both. Exactly. We love open source. So okay, so we need to get rid of those one, two, three, four, five integer star ratings because we want text labels. So here I've decided to go with three classes, right? So negative sentiment, neutral sentiment, positive sentiment. So one and two stars are negative, three is neutral, four and five are positive. But you could have five if you really want to. You could have, you know, very negative, negative, neutral, positive, very positive. I mean, you'll get the notebook, right? Don't worry. So you can go and tweak. So I'm using this really cool API in Pandas called map to turn the integers into a new column, right, called label. And that's what it looks like now, okay? So now we tokenize, right? And it's not scary at all. We use NLTK, like you said, import NLTK. 
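As a minimal sketch of the star-rating-to-label step just described, the mapping can be written as a plain dictionary. The label strings ("negative", "neutral", "positive") and the names STAR_TO_LABEL and star_rating_to_label are assumptions for illustration; the notebook may use different ones.

```python
# Hypothetical label names: the episode describes three classes,
# but the exact strings used in the notebook are an assumption here.
STAR_TO_LABEL = {
    1: "negative",
    2: "negative",
    3: "neutral",
    4: "positive",
    5: "positive",
}


def star_rating_to_label(star_rating: int) -> str:
    """Return the sentiment label for a 1-5 star rating."""
    return STAR_TO_LABEL[star_rating]


labels = [star_rating_to_label(stars) for stars in [1, 3, 5]]
```

With a Pandas data frame, the same dictionary could be passed to the map method of the star rating column, for example df["star_rating"].map(STAR_TO_LABEL), to build the label column in one call.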
We download an NLTK module called punkt. So maybe it's a German developer. I don't know. But anyway, it's a really cool download for this. It's a good one. And we can apply tokenization to our review body column, just like that. You notice here, it takes 17 minutes. Right? Remember, we have 1.8 million rows. Okay? So I'm always a little bit nervous to run very long cells in a Jupyter Notebook. Because first, you know, it can always fail in bizarre ways. And second, you know, it stops me from getting anything else done. So we'll leave it at that for now, but don't worry. Yeah. We'll do better in a few minutes. Okay? So 17 minutes. And then we just, so tokenization will, as you said, split the sentences and return an array of tokens. So we have to join them again. We don't want an array. We want a string, but a space-separated string, right? So we join those things again. And now it looks like that, right? And we can see if you zoom in a little more. Yeah, you can see here we have spaces after and before punctuation, so it's all good. Okay. So, we could continue training, but remember I said there was a third option? There's always more, right? And yeah, now we're awesome because what we really want to do instead of waiting for 17 minutes here is we want to offload that to a SageMaker processing job. Okay. And SageMaker processing is another service that I love because you can just throw everything away, run that stuff in a different notebook, and continue experimenting. So we've covered SageMaker processing again and again and again. And SageMaker processing, and I said, yes, again, and you will see it in future episodes because I think it's a great capability. So how do we do this? So we've seen this many times, just a quick recap. So what I'm doing basically is I'm moving my preprocessing code to a Python script, which is exactly what you saw in the notebook, except I decided that you could choose between NLTK and SpaCy, right? Yeah, yeah, yeah. 
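The tokenize-then-join step can be sketched as follows. To keep the example self-contained, a small regular expression stands in for the NLTK or SpaCy tokenizer, so its output will differ from theirs on contractions and other edge cases. The function names are hypothetical; the `__label__` prefix is the one BlazingText expects in supervised mode.

```python
import re
from typing import List

# Stand-in tokenizer (an assumption, not NLTK): a token is either a run of
# word characters (optionally with inner apostrophes) or a single
# punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def to_blazingtext_line(label: str, text: str) -> str:
    """Build one training line: the label, then space-separated tokens."""
    return f"__label__{label} " + " ".join(tokenize(text))


example = to_blazingtext_line("positive", "Great camera, love it!")
```

Joining the token list with single spaces is what puts spaces before and after punctuation, which is the layout shown on screen.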
And we'll see why in a second. And just put your code in the script. And the only stuff you need to add is, first, you need to receive your parameters as command line arguments. That's how SageMaker Processing will pass them to your code. Okay. You need to read the dataset from a well-known place and you need to save the results in a well-known place. Speaking of which, here, I'm actually saving two files. I'm saving a final BlazingText dataset, which we could use exactly like that and train. And I'm also saving kind of an intermediate shape because, of course, we're going to push this stuff to Feature Store. So this is why we have two outputs here. So this is my code. And then I just need to run that stuff in SageMaker Processing. And what it really means is I create this SKLearnProcessor. And I run it, passing the location of my input data and defining my outputs and passing my parameters. So here I'm using SpaCy, and this runs for a little bit, right? I can see SpaCy being installed, SpaCy running, and the complete job ran for nine minutes, which means the SpaCy bit probably ran for four or five minutes, right? So SpaCy is much faster than NLTK. No offense to NLTK, it's brilliant. But if you're looking for performance, SpaCy is fast, right? You know why? Yes. All right, tell us why. Because SpaCy is based on Cython, the C language extension for Python. And there you go. You cannot beat C, right? The old C developer speaking to you today loves this, right? So more power to C. And there you go, right? With SageMaker Processing, you can run different versions. You can just get these long-running steps out of your notebook and run them programmatically again and again and again. Okay. And so we get an output. We saw the output here, right? And yeah, I guess we should take a look. So do I have a terminal here? Yes. Okay. So let's see what this looks like. Okay, so if we look at the BlazingText data, it looks exactly like it should, right? 
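Here is a hedged sketch of the argument handling such a processing script might contain. The argument names (--filename, --library), their defaults, and the directory constants are assumptions for illustration. The paths follow the /opt/ml/processing convention commonly used for SageMaker Processing inputs and outputs, but the actual locations are whatever the job configuration maps them to.

```python
import argparse

# Conventional container paths for a processing job (assumed here; the real
# ones are set by the input and output definitions passed to the processor).
INPUT_DIR = "/opt/ml/processing/input"
BT_OUTPUT_DIR = "/opt/ml/processing/output/bt"  # BlazingText-ready file
FS_OUTPUT_DIR = "/opt/ml/processing/output/fs"  # intermediate file for Feature Store


def parse_args(argv=None):
    """Parse the parameters passed to the script on the command line."""
    parser = argparse.ArgumentParser(description="Tokenize camera reviews")
    # Hypothetical parameters: input file name and tokenizer choice.
    parser.add_argument("--filename", type=str, default="reviews.tsv")
    parser.add_argument("--library", type=str, choices=["nltk", "spacy"],
                        default="nltk")
    # parse_known_args ignores any extra arguments instead of failing.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Calling parse_args(["--library", "spacy"]) returns a namespace with library set to "spacy", which the script can then use to pick the tokenizer.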
So the label thing and the text, it's nicely separated. Good format. Yes. And what about that TSV thing? So what's in there? So we see the four columns or five columns, right? Review ID, product ID, star rating, review body, and label. Because we're going to push those features to Feature Store. So that's why I want to keep them. All right, back to this. So we use the awesome option three, which is we load data processed by our fancy processing job. So we just load directly into Pandas, this thing here. And we see exactly what we saw in the TSV file. And now we can move on to Feature Store. So we need to set up a client for Feature Store. And this is very generic code. You can just copy exactly from this. And this is what we're going to use to invoke Feature Store APIs. OK. So obviously, we need to define a name for the feature group. So let's call it Amazon Reviews Feature Group with a unique timestamp. And now we get to the important stuff. So let me show you the data again, right? Now you will understand why we kept that review ID thing. So let's explain what a feature group is. So a feature group is an object that stores records, okay? And those records represent the processed rows from the original dataset. Okay? So each row in the CSV file or the Parquet file has been processed by our script and becomes a record in the feature group. Okay? So it's just vocabulary, right? If you want to call them rows, it's okay. So just like a row in the original dataset has columns and values inside the columns, a record has key-value pairs. OK. So for example, in each record for this data, we'll have a review ID key, a product ID key, a star rating key, etc., and individual values. And of course, you see where I'm getting at, because if you have key-values, then you need some kind of unique identifier to say, give me the value for the product ID key for record blah, right? And of course, this unique identifier is going to be review ID. Okay. 
So this is a really important concept. When you create your feature groups, you need to come up with a unique identifier that you can use to grab records and read features. Okay? So in our case, it's super simple. We have this review ID thing, it's unique, and we're using that. If we pull data from a relational database, we could use a primary key or something unique, okay? All right, so that's what we do here. We say, hey, the record ID name is going to be review ID, right? Okay. Now there's a second column we need, and it's a column to store timestamps. So we'll see at the end of the session what we can use those timestamps for. Now we're just going to say, hey, okay, please create a new column. It's called event time, and please fill it with whatever timestamp is now, okay? And you can easily do this with the assign API in Pandas. Right, create a new column and assign a unique value to it, a single value to it, okay? So once we've done this, this is what the data looks like. Okay, we created the new column with the timestamp and obviously it's the same timestamp because I've done all of that in the same run, so to speak. Make sense? Makes sense. All right. Not too difficult. And maybe in your data, you already have some kind of timestamp. It's just we didn't have that. All right. So we're going to ingest this into Feature Store. So data types are always going to be important. Right. So you have two options. You can define a very nice JSON dictionary, which is called Feature Definitions and pass it to the feature group create API. Who loves JSON? Not me. No? No. Okay, no one here, right? So we're not gonna do that. The second option is to just make sure your Pandas columns have the right type and let feature store infer data types. And this is much easier for me, right? So I'm just making sure all my columns have the proper type. And that's it. 
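To make the record identifier and event time ideas concrete, here is a small sketch using plain dictionaries instead of a data frame and instead of the real Feature Store API. The feature names review_id and event_time follow the episode; the helper names and sample values are hypothetical.

```python
import time

RECORD_ID_NAME = "review_id"    # unique identifier used to look up a record
EVENT_TIME_NAME = "event_time"  # timestamp feature


def add_event_time(rows, now=None):
    """Return copies of the rows with the same float timestamp added to each."""
    timestamp = float(time.time() if now is None else now)
    return [{**row, EVENT_TIME_NAME: timestamp} for row in rows]


def index_by_record_id(records):
    """Map each record identifier to its key-value pairs."""
    return {record[RECORD_ID_NAME]: record for record in records}


rows = [
    {"review_id": "R1", "product_id": "B001", "star_rating": 5},
    {"review_id": "R2", "product_id": "B002", "star_rating": 2},
]
records = add_event_time(rows, now=1616140800)
store = index_by_record_id(records)
```

In the notebook, the equivalent of add_event_time is a single call to the Pandas assign method with one timestamp value. The lookup by identifier is what the online store does when you ask for the features of one review.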
The only weirdness, so to speak, that kind of caught me off guard is that the event time needs to be float64, even though it's not really a float. But OK. So yeah, if you use anything else, it's going to complain when you ingest. So remember, the event time should be float64. We could also use the ISO 8601 date and time format. So it's a string with a date and a time. And of course, we'd set the type to string if we did. OK, so we load the feature definitions from the data frame. We can see it's all good. And now we can actually create the group. And it's pretty simple. So features will be stored offline, which means in S3, right? So you need to pass an S3 location. We pass the unique ID for records. We pass the timestamp name, the SageMaker role. We can also store them online, which we'll use again at the end of the demo so that we can query, put, and delete individual features with a very simple API, so that's useful. We could use this API actually at prediction time to fetch features with low latency, so pretty good. A description, and you should absolutely fill this because that's how people will find the feature groups that you created, and tags, again to explain what you're doing, because one of the key things in SageMaker Feature Store is you want to share the features, right? Yeah, so if you don't put any information, it's just gonna have a name and it's not very clear, right? And actually once we've created it, we can see it in Studio and it looks like this, right? Okay, so we see the description, we can double click on this, yeah, we can see the description, all the parameters. We can see feature definitions. We can see tags. We can add more tags. So please make sure you fill that stuff in, and you can even click on this thing, which is very cool. Oh, yes. Oh, let's do that. Yes. Eye candy. And you can see very quickly, OK, what is this feature group? Am I interested in it or not? So tags are great, and you should use them here as well. Yeah. OK. 
So creating, let me close this, creating the group is just create, right? Create API. We have to wait for a few seconds. And yes, here's where I'm going to complain. So Boto3 has this cool mechanism of waiters. So you have an API to wait for some resource to be ready, some AWS resource to be ready. There isn't any waiter for Feature Store. So I've created an issue on GitHub, and I would very much appreciate it if you could say, yeah, we need this. Yeah, we need this. So all of you go now to this and say, yes, we need it. Thank you. Right? And the Boto3 maintainers will hate me for this. Exactly. But anyway, we can just, you know, write a silly loop to wait for the group to be ready, but we would love to have a waiter, please. Okay. And then we run this ingest API. Okay. So we could store individual records, as we'll see. We could call put, which is very similar to what you would do with DynamoDB, for example. So put an individual record. I'll show you how to do this later. But here we want to bulk ingest, right? So just load those 1.8 million things into the feature group. And you can do this passing the data frame and a number of workers, right? So 64 is probably too much. But I guess I had to try it. So it's probably more reasonable to do four or something like that. And you can make this a synchronous or asynchronous operation. So after a little while, right, because it's a rather large dataset, we have this stuff in the offline store. And well, the next cell kind of gives it away. How are we going to query this? We're going to use Athena again. So we're gonna go back to Athena again, yep, okay. So let's go back to queries and, okay, now if I move to the SageMaker Feature Store database, I see this table, right? So if I preview it, aha, this is the equivalent of the data that we just ingested, right? And it has some extra columns at the end, but hey, we'll pretend we didn't see those for now, okay? So of course we can query that, okay? 
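The "silly loop" mentioned above for waiting until the feature group is ready could look like the sketch below. It is written against a generic get_status callable so it can run without AWS access. The status strings "Creating" and "Created" match the feature group statuses reported by the service, while the function name, parameters, and timeout policy are assumptions.

```python
import time


def wait_for_status(get_status, target="Created", in_progress="Creating",
                    poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll get_status() until it returns target, fails, or times out."""
    waited = 0.0
    while True:
        status = get_status()
        if status == target:
            return status
        if status != in_progress:
            # Any other status (for example a failure status) is an error.
            raise RuntimeError(f"Unexpected status: {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the SageMaker Python SDK, get_status would typically be something like lambda: feature_group.describe()["FeatureGroupStatus"], where feature_group is the object created earlier in the notebook. That usage is an assumption about the notebook and is not shown in the episode.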
You could say, for example, let's find the cameras with the largest number of reviews and their average rating. So standard SQL that we're running on this table that was created in the Feature Store. So let's run this query. OK, and we can see pretty much the most popular items and their ratings. So now we could say find cameras that have at least 1,000 reviews. And the reason for this is maybe I want to build a specific model for my best-selling cameras. Maybe I want to do that. So I've done all the feature engineering work already. Why would I do that again and again? It's there, it's stored now. I can query and say, okay, I'm going to pull some data, some engineered data, and build my dataset from that, okay, and we could say all right, let's find cameras that have at least 1,000 reviews, right, and if we run this, let's see what we get. All right. Aha! Magic. We see exactly what BlazingText would need. Okay. Right. So here it's just the two columns because I selected the two columns, right? So this is just an example here, but imagine this is a little bit different from how we usually do things, where usually we would, you know, we'd have a CSV file and say, okay, maybe we want to keep only the rows. So we'd process that and save it again to CSV and then maybe load that somewhere and train on it. And then we'd say, okay, now I want to train maybe on a different subset. Okay, so let's run feature engineering again on a different subset, which honestly tends to be a waste of time, right? If you do that again and again. So here, I think it's, you know, it's probably a better option to just process everything, dump it in the offline store, and then say, okay, that's done. And other people could be working on that, right? You could be training a model for mid-priced portable whatever, and we would work on the same feature store and we wouldn't have to run any additional processing because it's already done. It's great, right? 
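The query for labeled reviews of cameras with at least 1,000 reviews can be sketched as a Python string builder. The column names (label, review_body, product_id) come from the episode; the function name and the example table name are hypothetical, and the exact SQL used in the notebook may differ.

```python
def build_training_query(table_name: str, min_reviews: int = 1000) -> str:
    """Build a SQL query selecting labeled reviews for popular products."""
    # The table name is double-quoted because generated table names may
    # contain characters that are not valid in a bare SQL identifier.
    return (
        f'SELECT label, review_body FROM "{table_name}" '
        f"WHERE product_id IN ("
        f'SELECT product_id FROM "{table_name}" '
        f"GROUP BY product_id HAVING COUNT(*) >= {int(min_reviews)})"
    )


# Hypothetical table name, for illustration only.
query = build_training_query("amazon-reviews-feature-group-example")
```

Changing min_reviews, or adding another filter, yields a different training subset from the same stored features without running the tokenization again.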
And we can very easily query with Athena and find what we need. Okay, so fine, and this is exactly how I've done it. I mean, I've written my queries in Athena because it's so friendly. And now I say, okay, this is what I want. Okay, the reviews for the cameras that have at least a thousand reviews. Okay. And I could say, all right, I can take this query and go back to my notebook. And I could just copy-paste that query and run it from Python, from my notebook, in Athena, right? So yeah, just find the table name for the feature group, which is this. And these are the same queries you just saw. And this is the final one, the one that extracts labeled reviews for the cameras with at least 1,000 reviews. You can see it's exactly the same. You just have to pay attention to escaping characters. That's why you see quotes here, because we have characters in the table name that could be messing with Python. So we need to be careful here, but that's about it. And so we see the query, we run this on Athena, and we get the result in a data frame, which is very nice. And this is exactly what we want. OK. And we have 82,000-something reviews. Right. And a lot of them are positive, as we saw. That dataset is biased towards positive reviews. I guess our customers are too nice. Right. We have a few negative and a few neutral. So you could say, well, it's imbalanced. Okay. It's a little bit imbalanced. It's not a severe imbalance, but it's probably something we would want to fix. So maybe we would remove some positive reviews, do some sampling. We're not going to do that today, but something to keep in mind. We're going to split for training and validation. As you want. Yes, 90%, 10%. Which is a good ratio in this case. Save to TXT files. And again, you know, just to reassure ourselves, this is exactly what we want, right? This is what the doc says, underscore, underscore, label, underscore, underscore, something. And then, yes. Not going to check all of it, but it looks good to me. Okay. 
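The 90/10 split and the save to text files can be sketched with the standard library alone. The function names, the fixed seed, and the sample lines are assumptions for illustration; the notebook works on a data frame instead.

```python
import random


def split_lines(lines, train_fraction=0.9, seed=42):
    """Shuffle the lines reproducibly and split them into train and validation."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def write_lines(path, lines):
    """Write one example per line, the layout BlazingText reads."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


# Hypothetical, already tokenized and labeled examples.
lines = [f"__label__positive review number {i}" for i in range(100)]
train, validation = split_lines(lines)
```

Since the classes are imbalanced, as noted above, a stratified split or some downsampling of positive reviews would be a reasonable refinement; this sketch does neither.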
And honestly, we could leave you now and go have a drink or a beer or something, because the rest is, and I'm gonna say my favorite sentence (and we agreed with Segelen, I'm gonna have a t-shirt made with this), SageMaker business as usual. Yeah, you can start making fun of me. That's okay. It's in my job description. And it is really SageMaker business as usual because, well, we're uploading the data to S3. We're grabbing the container for BlazingText, creating a SageMaker estimator. I'm training on GPU here because BlazingText, as you said, is compatible with FastText, but the key thing is that BlazingText actually runs on GPU, right? And it can run on multiple GPUs when it's used in Word2Vec mode. I think in classification mode, like here, it can only run on a single GPU. But it's quite fast indeed. So we only set the one mandatory hyperparameter, which is supervised mode, which means text classification. Again, unsupervised is Word2Vec. Like today. Word vectors. We define the training channel, the validation channel, and we call fit. We've seen this a million times. So we have 6 million words. We have 15K something unique words. Vocabulary size is 15K. And we train at 31 million words per second. So you'd say, yeah, is that fast? Yes, it's fast. It's quite fast. And the funny thing is this, right? The actual training time of the algo is one second. So BlazingText learned those 82,000 reviews in just one second. It's blazing. Yeah, it's blazing fast. And validation accuracy is not too bad. It's over 91%, which is pretty cool given that I've only done very basic feature engineering. I haven't touched any of the hyperparameters. We could do hyperparameter tuning and get even better accuracy, but that's already pretty good. Yeah, definitely. So we can deploy, wait for a few minutes, and now we can test. Here are three examples. I really love this camera. It takes amazing pictures. Sounds positive to me. And this camera is OK.
It gets the job done. Nothing fancy. And the third one is poor quality. The camera stopped working after a couple of days. Add your favorite profanity. So yeah, I wrote those, right? I wrote those myself, so you can see how imaginative I am. Now, we run those samples on the model, calling the predict API. Again, SageMaker business as usual. We can see the first one is super positive. The second one is mixed, as it should be, right? It's equally positive and neutral, right? And you could say, yeah, I mean, that customer is not unhappy, but they're not thrilled. Okay, there. It's like, okay, fine. It's not negative. It's not negative. It's okay, could be better, I guess. And the last one is strongly negative, as it should be. Right? Cool. So this model works, right? So don't forget to clean up, delete the endpoint, delete the feature group, and this is how you remove all feature groups. And everything goes away, right? So summing things up, I think if you use the offline store like this, you can build datasets on the fly using standard SQL, which is nice. I mean, I find it much easier to write SQL queries to build a dataset than to write complex pandas or PySpark stuff. But that's just me, right? You may think, no, I hate SQL, I want to do PySpark instead. That's great. You can do that. But I think for maybe business analysts, people who are not software engineers, it's a great capability that you can build and extract data using SQL. Everybody knows SQL, right? Everybody should know SQL. And yeah, you can use your favorite SQL tools, whatever you like. It's great. It's a good way to build datasets, right? And share them. Okay. There's one more thing, right? We have a few more minutes. Fine. Remember timestamps? Ah, timestamps. Okay. Yeah, nothing here deals with that. So what about the timestamp? So we saw that each individual record has a timestamp that we create when we put the record in the store. So let me show you that API.
And here it is. Okay, so it starts the same: create a feature store client and access the feature group. Okay, so exact same thing as before. Now I could say, okay, I want to see one individual record. Okay, so here I'm using the unique ID, right? And I'm calling the get record API to fetch the full record. Optionally, you can pass feature names. So if you only want the review body or the label, you could just get that. Okay. So here I'm getting the full record and remember I said key-value pairs? Well, this is what we see here, right? So it's an array of key-value pairs, pretty much, right? And we see, of course, the timestamp. So you can get those values, you can retrieve them. Like I said, here, it's not a good example of that, but imagine you had a very complex feature that you created, you know, transforming raw data into some complex feature for whatever, life sciences or something like that. And you would want to use that at prediction time, right? So you know, maybe you get a molecule name in the prediction request, and you don't want to use the molecule name, you want to use the molecule property, blah, blah, blah, that you computed and stored in the feature store. You could do that. You could say, okay, get record for molecule name, blah, blah, and retrieve the complex features that took two weeks to engineer and inject them in your prediction request. And it would just take a few milliseconds to retrieve. Okay. So again, this dataset is not a good example of that, but this is one way you could use the online store. Okay. I'm sure we'll show that in another episode. Maybe not with molecules, because I'm not really a chemist. Oh, we should call my friend Francesco. He's the chemist. OK. So obviously, we can add new records. So let's say this particular customer decides to update their review. So initially, they gave a five-star rating to the product. And then they go and say, no, actually it's four. It's four. It's a downgrade. OK.
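To make the get record call above more concrete, here is a small sketch. It assumes `client` is a boto3 `sagemaker-featurestore-runtime` client, whose `get_record` call returns the record as a list of `FeatureName` / `ValueAsString` pairs. The helper name `get_features` and the feature group and feature names in the usage note are made up for illustration.

```python
def get_features(client, feature_group_name, record_id, feature_names=None):
    """Fetch one record from the online store and return it as a plain dict.

    `client` is expected to behave like a boto3
    'sagemaker-featurestore-runtime' client.
    """
    kwargs = {
        "FeatureGroupName": feature_group_name,
        "RecordIdentifierValueAsString": str(record_id),
    }
    if feature_names:
        # Optional: only retrieve a subset of the features.
        kwargs["FeatureNames"] = list(feature_names)
    response = client.get_record(**kwargs)
    # The response carries no "Record" entry when nothing is found.
    pairs = response.get("Record", [])
    return {pair["FeatureName"]: pair["ValueAsString"] for pair in pairs}
```

With real credentials, usage would look something like `get_features(boto3.client("sagemaker-featurestore-runtime"), "my-reviews-feature-group", "R123", ["review_body", "label"])`, where the group name, identifier and feature names are placeholders.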
So somewhere in your data, you see, well, it's now four. And you want to ingest this again. You want to update the feature to four. So you could say, OK, that's fine. I'm going to grab that review. And I'm going to change the star rating to four. And I'm going to update the event time to now, right, the new time for this update. Okay, so now my record looks like, okay, it's a four and the timestamp has changed. Okay, and I put the record back into the store. Okay, super simple. Right? Now if I get that record again, well, no surprise. I see star rating is four and the timestamp is whatever timestamp I have for this. So when you update a record, so to speak, and you get it again, obviously you're going to get the latest, which is what you want, right? Which is what you want. If you update your features for prediction, you don't want to get the old stuff. You want the up-to-date stuff. Now, these updates also go to the offline store, which is great because you don't want any discrepancy. You want the same features at prediction time and training time. So you just need to wait for a few minutes. It's not instantaneous. You need to wait for a few minutes for the online changes to propagate to the offline store. So now if I query the feature store, the offline store, again for that record identifier, okay, right, I see two records. Okay, I see the initial one and I see the new one. Okay, and it's the same record identifier. So although you may want to use a primary key for this, the behavior is not the behavior of a primary key. You can have duplicated entries for your records. And you could say, oh, that's awful. But absolutely not. It's exactly what we want because we can keep track of the different versions, right, of our features. Okay? When we get the record, we'll get the latest one. So there's no, I mean, it's deterministic. We know what we're gonna get, the latest one, okay? And of course, the main application for this is we can time travel, okay?
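The update flow just described (change the star rating, set the event time to now, put the record back) could be sketched as follows. This assumes `client` is a boto3 `sagemaker-featurestore-runtime` client and that the event time is stored as an ISO 8601 string feature. The feature names `star_rating` and `event_time` and the helper names are assumptions for this example.

```python
from datetime import datetime, timezone


def updated_record(features, changes, event_time_feature="event_time", now=None):
    """Return a copy of `features` with `changes` applied and a fresh event time."""
    now = now or datetime.now(timezone.utc)
    new_features = dict(features)  # leave the original dict untouched
    new_features.update({name: str(value) for name, value in changes.items()})
    new_features[event_time_feature] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return new_features


def to_record(features):
    """Convert a dict into the list of name/value pairs that put_record expects."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in features.items()
    ]


def put_features(client, feature_group_name, features):
    """Write the record to the feature group (a new version, not an in-place edit)."""
    client.put_record(
        FeatureGroupName=feature_group_name,
        Record=to_record(features),
    )
```

Because the write carries a newer event time, a later get record call returns this version, while the earlier one stays queryable in the offline store.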
So we can query the feature store, the offline feature store, and say, show me my features at this point in time, right? So a lot of people have issues with, you know, versioning datasets and keeping track of the different iterations. Well, here's how you could do it, right? You could say, well, show me my data as of, you know, an hour ago, two hours ago, etc. And the only thing you would do is just, you know, use the event time to do that, right? Say, show me this thing before a given time, whatever. So here, for example, I could go back to the initial version of my feature. So this is a very simple way to manage different versions of features. Now, we saw this extra column here called is_deleted. So of course, there's a delete API, which is what you would think. Delete, pass the record identifier, and an event time that will be stored. So if I delete this record and I get it again, I get an empty response. So that record is gone from the online store, which again is what we want. Because we could say, oh, these features are obsolete or buggy, and I want to make sure they never get used for prediction, right? They should be ignored. They should be unavailable. So here, if I get this record ID and try to inject it into a prediction request, things are going to fail because it's gone. But, and that's the last thing I want to show you today, if we query the offline store again, now we see three events, three versions of the record. We see the original one, star rating five. We see the updated one, star rating four. And we see an empty one, right, which has NaN values, the timestamp that we passed, and is_deleted set to true. And this is called the tombstone record, because it kind of seals that record. As long as we have this more recent deleted event, the store will never be able to grab those online. We can still query them offline and we could say we wanna use those for training sets or just for historical purposes. Yeah.
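To illustrate the time travel and tombstone behavior discussed above, here is a small in-memory sketch. It is not how Feature Store works internally, and in practice this would be a SQL query on the offline store in Athena. It only mimics the logic: keep the latest version of each record at or before a cutoff, and drop records whose latest version is a tombstone. The column names `review_id` and `event_time` are assumptions, and timestamps are assumed to be ISO 8601 strings in one consistent format, so that string comparison matches time order.

```python
def as_of(rows, cutoff, id_key="review_id", time_key="event_time"):
    """Return {record id: row} as the data looked at `cutoff`."""
    latest = {}
    for row in rows:
        if row[time_key] > cutoff:
            continue  # this version did not exist yet at the cutoff
        current = latest.get(row[id_key])
        if current is None or row[time_key] >= current[time_key]:
            latest[row[id_key]] = row
    # A tombstone as the latest version means the record was deleted.
    return {rid: row for rid, row in latest.items() if not row["is_deleted"]}


# Three versions of one hypothetical review: created, updated, then deleted.
history = [
    {"review_id": "R1", "star_rating": 5, "event_time": "2021-03-19T10:00:00Z", "is_deleted": False},
    {"review_id": "R1", "star_rating": 4, "event_time": "2021-03-19T11:00:00Z", "is_deleted": False},
    {"review_id": "R1", "star_rating": None, "event_time": "2021-03-19T12:00:00Z", "is_deleted": True},
]

print(as_of(history, "2021-03-19T10:30:00Z")["R1"]["star_rating"])  # prints 5
print(as_of(history, "2021-03-19T11:30:00Z")["R1"]["star_rating"])  # prints 4
print(as_of(history, "2021-03-19T12:30:00Z"))  # prints {}
```

All three rows remain in the history, which is why earlier versions stay available for training sets even after the delete.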
Those deleted records can also be kept just, you know, for versioning, but they can't be used online again. So I really, really like the feature. And I think, you know, it's a really cool way to kind of automatically manage features without going through, you know, code, committing files to GitHub or tagging, et cetera. You just put everything in the store and the timestamp takes care of everything. And you can have those tombstone records to delete whatever you don't wanna see anymore. And still keep everything and query everything with Athena. Okay. Really good feature. Yes, that's a pretty cool feature for versioning, I think. All right, my friends, I think this is the end of the demo. It was a lot today. It was a lot. So if you have questions, you have a few more minutes for your questions. And so Segelen, let's wrap up. So what did we cover today? So many things. Lots of things. Yeah, yeah, yeah. BlazingText, SageMaker Feature Store, Athena queries, NLTK, spaCy. So, yeah, it's, you know, it's a pretty complete example. Exactly what happens in real life. Yeah. And, you know, you can just go and grab this. Let me show you the most important slide. It is the most important because it's the only one. Okay. Okay, screenshot time, right? So I'll leave this one on so that you can grab it. So the Amazon reviews dataset, the notebook or notebooks. So all the code that you saw is on GitLab. The SageMaker docs, so the feature store, the service docs and the SDK docs, right? Although, I have to say these are pretty simple APIs. Reading the code should be more than enough, but there are some options that you may want to look at. We have a couple of blog posts. So the launch blog post, if you'd like a quick recap. And there's a very, very cool blog post by my colleagues on streaming ingestion in near real time. So very, very cool one. And in case you've missed the event, a few weeks ago we ran AWS Innovate for AI and Machine Learning.
And this had plenty of sessions, including Feature Store and Clarify and Data Wrangler and pretty much all the new SageMaker stuff. I think the sessions are still available on demand. So just go to this URL. You can register quickly and start watching all the content for the event. There should be good stuff there. Okay. We're on time. Almost. Thank you very, very much. Thank you for watching this. I hope you learned a lot. I'm not quite sure what the next episode will be about. We'll see. Maybe, say, SageMaker Pipelines. Thank you to my colleagues who helped us organize and moderate the event. We really appreciate it. Segelen, thank you very much. Thank you very much. And yeah, we hope you had a good time and we'll see you in two weeks. Two weeks, yeah. Until then, it's gonna be SageMaker business as usual. And keep rocking with machine learning. Bye.

Tags

SageMaker, Feature Store, BlazingText

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.