Hi everybody, this is Julien from Arcee and in this video I would like to introduce you to another SageMaker capability that was launched yesterday at re:Invent: SageMaker Feature Store. As the name suggests, SageMaker Feature Store lets you store the machine learning features produced by the feature engineering steps on your raw data, for both online and offline use. Let me show you how to get started.
This is fully integrated into SageMaker Studio, and you can use the UI to create and store your features. Of course, you can use the SageMaker SDK, and I'll show you a little bit of both. Before we do this, I need to introduce some terminology. Features are grouped in a feature group. A feature group is an object that represents engineered features coming from a data source. How do you define these groups? It's really up to you. You could have one feature group per CSV file, per relational database table, anything goes as long as you can define a common schema for the features in there. Here, we're going to work with CSV files and have one feature group per CSV file. Inside these feature groups, we have the actual feature records. Feature records are the engineered version of your dataset rows. Let's say we work with CSV data to begin with. We have a CSV file with rows and columns. Each row in the CSV file goes through data processing and feature engineering steps, and each row becomes a record stored in a feature group. Inside a record, you have key-value pairs for each column name and column value. So, we go from columns and rows in a dataset to a feature group with records and feature names and values inside each record. That's the terminology you need to know before we get started.
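To make the mapping from rows to records concrete, here is a minimal sketch of what one engineered row looks like when written with the put record API; the feature group name, column names, and values are hypothetical, not taken from the notebook.

```python
import boto3

# Runtime client for reading and writing individual records in the online store
featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

# A hypothetical engineered CSV row: a unique ID, one feature value, one timestamp
row = {"TransactionID": "2987000", "Amount": "68.5", "EventTime": "1607040000.0"}

# Each column name/value pair becomes one feature inside the record
record = [{"FeatureName": name, "ValueAsString": value} for name, value in row.items()]

featurestore_runtime.put_record(
    FeatureGroupName="transactions-feature-group",  # hypothetical feature group name
    Record=record,
)
```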
Now, let's see how we can do this. This is integrated with SageMaker Studio, and you can click through the UI to create your feature groups. But I'd rather use the API. Once you understand the finer details, it's super easy to use the UI. Here, I'm using one of the sample notebooks in the GitHub repository for SageMaker examples. I will put the URL in the video description. We're trying to build a fraud detection model using transaction information and customer information, so it's a binary classification problem, and we'll train a model with XGBoost. The steps here are what you would expect: upload the processed data, create one feature group for identity features, one feature group for transaction features, write both to the offline store for model training, and to the online store for prediction.
So, import the SageMaker SDK, do a lot of setup, and grab an S3 bucket for the offline data store. We can take a quick look at the dataset: two CSV files, one with identity information and one with transaction information. Read those CSV files and do very basic processing, like encoding categorical features. This is the transformed data we want to push to the feature store. Let's see how we can create feature groups and ingest that data. Two feature groups: one for identity features, one for transaction features. We give each group a name, and the first thing we need to do is create feature definitions. There's a bit of code here to do this, but it's easier to understand if we check the UI. Let's take the feature group for transactions; I've already run this part to save some time. Feature definitions are what you would expect: feature names and feature types. It's a schema for your data source. We need to build these feature definitions. If you have a small dataset, you can write them manually. For a larger, more dynamic dataset, you can use code similar to this to build a list of Python key-value pairs with feature names and types.
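As a rough sketch of that step, assuming the transformed transaction data lives in a pandas DataFrame (the file name and group name below are illustrative), the SageMaker SDK can also infer the feature definitions directly from the DataFrame's dtypes:

```python
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

sagemaker_session = sagemaker.Session()

# Hypothetical transformed transaction data produced by the feature engineering step
transaction_df = pd.read_csv("transaction_transformed.csv")

# Feature Store supports string, integral, and fractional types, so object
# columns are cast to the pandas string dtype before inferring definitions
for col in transaction_df.select_dtypes(include="object").columns:
    transaction_df[col] = transaction_df[col].astype("string")

transaction_feature_group = FeatureGroup(
    name="transactions-feature-group", sagemaker_session=sagemaker_session
)

# One FeatureDefinition (feature name + feature type) per DataFrame column
transaction_feature_group.load_feature_definitions(data_frame=transaction_df)
```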
In addition to this, you need to pass the names of two columns, which are super important. The first is the record identifier feature name: remember, records are the equivalent of rows in the original source data, and we need one column with a unique ID, because this is how we access a record's features when we query the stores. If you have a unique identifier, like a primary key or a transaction ID, that works, as long as it's unique to a record. The second is a timestamp column, because the store keeps a timestamp for each feature value, and you can have multiple versions over time. If you've iterated on your feature engineering workflow and have different versions, the timestamp lets you go back in time and retrieve the state of the dataset at any point.
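Continuing the same hypothetical DataFrame, identifying those two columns might look like the sketch below; TransactionID is assumed to already be unique per row, and EventTime is a column we add ourselves.

```python
import time

import pandas as pd

# Same hypothetical transformed data as in the previous sketch
transaction_df = pd.read_csv("transaction_transformed.csv")

# The record identifier must be unique per record; here we assume the
# dataset already has a TransactionID column that plays that role
record_identifier_feature_name = "TransactionID"

# The event time column lets the store version features over time;
# here every row is tagged with the ingestion time as a Unix timestamp
event_time_feature_name = "EventTime"
current_time_sec = round(time.time())
transaction_df[event_time_feature_name] = pd.Series(
    [current_time_sec] * len(transaction_df), dtype="float64"
)
```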
Once we've identified those two important columns and have the feature definitions, we can create the feature groups. Creating a feature group is as simple as specifying the location for the offline store, the name of the unique ID column, the name of the timestamp column, and whether you want online storage for that group. We run those two API calls, wait for the feature groups to be ready, and can describe them to see the feature definitions and S3 locations. Now we've created our groups, and we can see them in Studio. It's time to store something in there. Let's go back to the notebook and ingest the data. Just pass the dataset itself, and it loads that data into your feature group. You can parallelize this with multiple workers, or use the put record API to store one record at a time. Here, we're doing the initial load, so we're just loading everything. We do the same for the other feature group. At this point, we have data in the online store, and it gets propagated to the offline store a few minutes later.
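Reusing the hypothetical transaction_feature_group and transaction_df from the sketches above, those two calls might look like this; the S3 prefix is made up, and the bucket and role are assumptions about your own environment.

```python
import sagemaker

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()  # IAM role with Feature Store permissions

# Create the group: offline store in S3, plus an online store for low-latency reads
transaction_feature_group.create(
    s3_uri=f"s3://{bucket}/fraud-detection-feature-store",  # hypothetical prefix
    record_identifier_name="TransactionID",
    event_time_feature_name="EventTime",
    role_arn=role,
    enable_online_store=True,
)

# Once the group reaches the 'Created' status, bulk-load the DataFrame;
# ingestion can be parallelized across several workers
transaction_feature_group.ingest(data_frame=transaction_df, max_workers=3, wait=True)
```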
We can quickly check that the data has been ingested by grabbing one particular record using its unique ID; we get back the full record with its key-value pairs. If you're working with Hive or Hadoop, you can also get automatically generated DDL statements to create external tables in Hive and access your features directly in S3. A few minutes later, your offline store is ready. At this point, we have our engineered features both in S3 and in the online store. Now we can use the offline store to build our training dataset. In this notebook, we use Athena to join the customer information and transaction information: we run a SQL query in Athena on the S3 data, visualize the result, filter out the columns we don't want, and write it back to a CSV file. Now we have a dataset ready for training. We create an estimator, use the XGBoost algorithm, define the training input, and train the model. Then we deploy the model. Once the endpoint is up, we can use the online store for prediction: for each prediction request, we call the get record API to pull the engineered features we need, then call predict on the endpoint to get a result. Finally, we should clean up by deleting the endpoint and the feature groups.
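Here is a sketch of those two access patterns, again using the hypothetical names from the earlier sketches; the record identifier value and output prefix are made up, and the SQL is just a select-everything placeholder rather than the join from the notebook.

```python
import boto3
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

# Same hypothetical feature group as in the previous sketches
transaction_feature_group = FeatureGroup(
    name="transactions-feature-group", sagemaker_session=sagemaker_session
)

# Online store: fetch the engineered features for one record by its unique ID
featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")
response = featurestore_runtime.get_record(
    FeatureGroupName="transactions-feature-group",
    RecordIdentifierValueAsString="2987000",  # hypothetical TransactionID
)
features = {f["FeatureName"]: f["ValueAsString"] for f in response["Record"]}

# Offline store: run an Athena query against the S3 data to build a training set
query = transaction_feature_group.athena_query()
query.run(
    query_string=f'SELECT * FROM "{query.table_name}"',
    output_location=f"s3://{bucket}/athena-query-results",
)
query.wait()
training_df = query.as_dataframe()
```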
To sum things up: start from raw data, apply preprocessing (for example, using Data Wrangler), then create feature groups with feature definitions, a unique identifier column, and a timestamp column, for offline and optionally online usage. Push your features using the ingest API or the put record API, and you have your offline store in S3 and your online features available in the feature store, where you can query them at very low latency for real-time prediction. Go and try it out, and let me know what you think. I'm happy to hear your feedback and answer any questions you may have. Thanks.