Standardize and automate your feature engineering workflows

February 24, 2021
Session from AWS Innovate AI & Machine Learning EMEA: https://aws.amazon.com/events/aws-innovate/machine-learning/online/emea/

As a data scientist, you certainly spend a lot of time crafting feature engineering code. Given the experimental nature of this work, even a small project can lead to many iterations, so you often run the same feature engineering code again and again, wasting time and compute resources on repeating the same operations. In large organizations, the loss of productivity is even greater: different teams often run identical jobs, or write duplicate feature engineering code because they have no knowledge of prior work.

As models are trained on engineered datasets, it's also imperative that you apply the same transformations to the data used for prediction. This often means rewriting your feature engineering code (sometimes in a different language), integrating it into your prediction workflow, and running it at prediction time. This whole process is not only time-consuming, it can also introduce inconsistencies, as even the tiniest variation in your data transforms can have a large impact on predictions.

In this hands-on session, you'll learn how to solve all these problems with Amazon SageMaker Feature Store, and how to use it with both the SageMaker Studio user interface and the SageMaker SDK. You'll also see how it works together with SageMaker Data Wrangler to simplify your end-to-end data preparation workflows.

For more content:
* AWS blog: https://aws.amazon.com/blogs/aws/
* Medium blog: https://julsimon.medium.com/
* YouTube: https://youtube.com/juliensimonfr
* Podcast: http://julsimon.buzzsprout.com
* Twitter: https://twitter.com/@julsimon

Transcript

Hi, my name is Julien and I'm a dev advocate focusing on AI and machine learning. In this session, we're going to discuss feature engineering workflows. As you certainly know, feature engineering is a critical step in machine learning projects. We usually start from raw data and apply specific transforms to that data to build more expressive features. These more expressive features help the machine learning algorithm pick up the patterns in the data, and hopefully, this yields a higher quality model.

Given the very iterative nature of machine learning, even a small project can require tens or hundreds of attempts. It's quite likely you'll be running your feature engineering code again and again just because you're trying out different algorithms, different hyperparameter combinations, and so on. If you work with large datasets or multiple datasets, this can be a significant waste of time and resources. So, it's definitely something we want to address: the ability to reuse features that we've already built.

Another problem is the difficulty for different machine learning teams to share and discover features that they've built. The last thing you want is different teams rewriting the same feature engineering code because they simply didn't know that the team next door had already done it. A third problem is that if you want to use your engineered features at prediction time, you often need to rewrite code embedded in your prediction app to apply the same transforms to the incoming data that you applied to your training set. There's always a chance that you get it slightly wrong, and this can have a huge impact on prediction quality. And even if you get it exactly right, why would you write that code twice, possibly in different languages?

So, these are the three problems we want to address today: avoiding running the same code again and again by reusing features instead, making it easier for teams to share the features they've built, and using the exact same features at training time and at prediction time without having to rewrite any code.

To illustrate this discussion, I'm going to run a demo: we're going to build a sentiment analysis model on the Amazon reviews dataset using a built-in SageMaker algorithm called BlazingText. Let's get started. This is the notebook we're going to run, available on GitLab; I'll share the URL at the end so you can run it yourself.

Before we start, a few words about the Amazon reviews dataset. It's a pretty large dataset, as you would expect, hosted in Amazon S3, and it contains actual customer reviews for different product categories. Here, I'll be using the camera reviews. There are about 1.8 million of those, which should be enough for our purpose. You could work with video game reviews or television reviews, whatever you'd like to use.

As far as the algorithm is concerned, we're going to use BlazingText, an algorithm invented by Amazon and implemented in SageMaker. We'll use it in supervised learning mode to build a text classification model. We need to understand the format that the algorithm expects: one instance per line in a text file, with the label first, followed by the text you want to classify. We'll have three labels: positive, negative, and neutral. The Amazon reviews dataset looks nothing like that initially, so we'll run through simple feature engineering steps to transform the raw data into that format.
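To make that target format concrete, here are a few lines the training file could contain. The reviews themselves are made up for illustration; the `__label__` prefix is BlazingText's default convention for marking the label:

```
__label__positive great camera ! the autofocus is fast and the pictures are sharp .
__label__negative stopped working after two weeks , very disappointed .
__label__neutral does the job , nothing special about it .
```

Each line is one training instance: the label comes first, followed by the tokenized, space-separated text.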
Let's go to the notebook now and start working on this. The high-level steps are to take a look at the data, use pandas to process it, and convert it to a format that BlazingText can understand. There are lots of libraries for this. I'm going to use NLTK, a popular open-source library. You could use spaCy or your own code; I'm using NLTK here because I think it's fun and a good way to learn.

The alternative would be to use SageMaker Data Wrangler, where you can visually build a transform pipeline using built-in and custom transforms, and later export it directly to a Jupyter notebook that stores your engineered features in SageMaker Feature Store. Another option would be to run a predefined feature engineering script as a batch job on SageMaker Processing, using managed infrastructure. But here, I'm pretending to experiment with code, and I can worry about automation later with SageMaker Processing and other capabilities.

Once we've looked at and processed the data, we'll use SageMaker Feature Store. We'll create a feature group, an abstraction that holds the engineered version of our features. We'll store them offline in Amazon S3, where we can run queries and build datasets, and online, where we can retrieve them at prediction time without rewriting any code.

So, let's get started. The first step is to grab the dataset from S3; I'm using the camera reviews here. Then I import pandas, NumPy, and a few other libraries I'll need. I use pandas to load the reviews, which come as a compressed TSV file. I'm ignoring any error lines, because there are over 1.8 million reviews, so we can afford to lose a few; we drop anything that causes an error or has undefined values. We can see there are a little more than 1.8 million reviews and 15 columns, which is what you would expect if you're looking at product reviews on Amazon.com.

If we display a few lines, we see the review ID, which is a unique identifier and will be useful later. We also see the product title, the review headline, the review body, and the star rating from one to five. We'll need the star rating for our sentiment analysis classifier. I could work with all 1.8 million reviews, but it makes everything a bit longer, and I don't think it's necessary. So, I'll stick to 100,000 reviews, but feel free to train on a smaller or larger dataset. If you want to go really big, you might need a larger SageMaker Studio instance.

The first transform is to concatenate the review headline and the review body, because I want all the text in a single place. There could be good information in the title, and I don't want to lose it. Next, I keep four columns: the review ID (a unique identifier), the product ID (useful for building models for specific products), the star rating, and the review body. The star rating is an integer from 1 to 5, but BlazingText needs text labels. So, I map one- and two-star reviews to negative, three-star reviews to neutral, and four- and five-star reviews to positive. I add this to a new column called "label" and drop the integer rating. Now, my data looks like this: review ID, product ID, review body, and label.

One more important data prep step is tokenization. BlazingText wants samples that are tokenized, with everything neatly space-separated, including punctuation. We need to run a tokenizer on the reviews. We could write one, but why bother? This is where NLTK comes into play: we use a built-in tokenizer on the review body column. This is a compute-intensive step; with 100,000 reviews, it takes about 37 seconds.
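As a rough sketch, here's what these preparation steps can look like with pandas and NLTK. The S3 path and column names below follow the public Amazon Customer Reviews dataset as it was hosted at the time; treat them as assumptions and adjust to wherever your copy of the data lives:

```python
import pandas as pd
import nltk

nltk.download('punkt')  # tokenizer models used by word_tokenize

# Load the camera reviews (compressed TSV), skipping malformed lines.
# Assumed path: the public Amazon Customer Reviews dataset bucket.
data = pd.read_csv(
    's3://amazon-reviews-pds/tsv/amazon_reviews_us_Camera_v1_00.tsv.gz',
    sep='\t', compression='gzip', on_bad_lines='skip'
)
data.dropna(inplace=True)
data = data[:100000]  # 100,000 reviews are enough for this demo

# Merge the headline into the body so no text is lost, then keep four columns
data['review_body'] = data['review_headline'] + ' ' + data['review_body']
data = data[['review_id', 'product_id', 'star_rating', 'review_body']]

# Map the 1-5 star rating to the three text labels, using BlazingText's
# default '__label__' prefix, then drop the integer rating
data['label'] = data['star_rating'].map(
    {1: '__label__negative', 2: '__label__negative', 3: '__label__neutral',
     4: '__label__positive', 5: '__label__positive'}
)
data = data.drop(columns=['star_rating'])

# Tokenize each review, then re-join the tokens as a space-separated string
data['review_body'] = data['review_body'].apply(
    lambda text: ' '.join(nltk.word_tokenize(str(text)))
)
```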
With 1.8 million reviews, it would take 10 to 12 minutes or more, which shows why you don't want to do this repeatedly. The tokenizer returns an array of strings, but I need a single string with everything space-separated, so I join the elements of the array with spaces. Now, if I look at my data, I'm getting close to what BlazingText wants: I still have my review ID, product ID, space-separated review, and label. For BlazingText, we just need to grab the label and review columns and put the label first.

Now, it's time to stop engineering features and move on to storing them. This is where we start using SageMaker Feature Store. First, I grab the usual stuff: the SageMaker session, the S3 bucket, the region name, and so on. Next, I define a name for my feature group, adding a timestamp for uniqueness.

The first thing I need to do is define the column that will act as the unique identifier. The feature group will store the engineered version of the rows in my dataset, called records. Each record has key-value pairs representing the different columns in your initial data. To retrieve records, I need a unique ID, so I use the review ID. Another nice capability of Feature Store is support for timestamps, which lets you store different versions of your features over time. I add a new column called "event_time" and fill it with the current timestamp. I recommend checking that timestamps are correctly set, to avoid ingestion failures.

Before ingesting, we need to take care of feature definitions, basically typing. We have two options: provide a JSON dictionary of feature definitions, or let Feature Store infer the types from the pandas types. I'll let Feature Store infer the types, making sure my four columns are typed as strings and the timestamp as a float. We load the feature definitions from the DataFrame, and we see confirmation that the string and fractional columns have been correctly picked up.

Now, we create the actual feature group, passing the S3 location for the offline store, the name of the column holding the unique ID, the name of the column holding the timestamp, and enabling the online store. We can query the group status, and after a few seconds, it's created. In SageMaker Studio, you can see the feature group listed as active, meaning it's ready to go: you can ingest and retrieve data. If you click on it, you get summary information, including feature definitions and tags. Studio also provides sample queries for using offline features with Athena. You can create feature groups from Studio too, similar to the API, by passing the name, enabling online storage, and setting the S3 bucket location.

Now, we can ingest in bulk mode, passing the pandas DataFrame and setting the number of workers to parallelize ingestion. This loads the DataFrame into the offline store and propagates it to the online store if enabled. Depending on your data and the number of workers, this could take a few minutes. While ingesting, you can go to the Athena console and see that the table has been created.

Once ingestion is complete, you can run queries to build your dataset. For example, we want the label column in the first position, followed by the tokenized review. We can create an Athena query and retrieve the results as a pandas DataFrame. Now, our DataFrame has the right format for BlazingText, and we can move on to training the model. We split the dataset for training and validation (90% training, 10% validation), save both splits as text files, and upload them to Amazon S3.
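Condensed into code, the Feature Store steps above look roughly like this with the SageMaker Python SDK. This is a sketch that assumes the `data` DataFrame from the previous step and an execution role with Feature Store and S3 permissions:

```python
import time
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
bucket = session.default_bucket()

# Type the four columns as strings and add a float timestamp column,
# so that type inference from the pandas dtypes works
for col in ['review_id', 'product_id', 'review_body', 'label']:
    data[col] = data[col].astype('string')
data['event_time'] = float(time.time())

feature_group_name = 'amazon-reviews-' + time.strftime('%d-%H-%M-%S')
feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=data)

feature_group.create(
    s3_uri=f's3://{bucket}/{feature_group_name}/offline',  # offline store location
    record_identifier_name='review_id',                    # unique record ID
    event_time_feature_name='event_time',                  # timestamp column
    role_arn=sagemaker.get_execution_role(),
    enable_online_store=True
)
# ... poll feature_group.describe() until FeatureGroupStatus is 'Created' ...

# Bulk ingestion: loads the offline store and propagates to the online store
feature_group.ingest(data_frame=data, max_workers=4, wait=True)

# Build the training set from the offline store with an Athena query
query = feature_group.athena_query()
query.run(
    query_string=f'SELECT label, review_body FROM "{query.table_name}"',
    output_location=f's3://{bucket}/{feature_group_name}/query_results'
)
query.wait()
dataset = query.as_dataframe()  # label first, then the tokenized review
```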
We grab the name of the container that implements BlazingText in the region we're running in. We configure a training estimator, passing the name of the container, our infrastructure requirements, and where to store the model. We set the BlazingText mode to supervised, meaning we want to build a text classifier. We could tweak other hyperparameters or use model tuning, but we won't go into that. We put everything together, specifying the training and validation channels, and fire up the training job. The training job creates a managed instance, downloads the dataset, and trains the model. We achieve 91% accuracy, which is pretty good given that we did no hyperparameter tuning. The training was much faster than the feature engineering, which is one more reason not to repeat feature engineering.

We deploy the model to a SageMaker endpoint, and after a few minutes, we have a prediction API. I'll show you how to retrieve a row from the online store and predict it directly. In more elaborate use cases, you might retrieve pieces of the dataset to convert raw features into their engineered equivalents, or to add additional features to the prediction request. I grab one record from the feature store using the simple get_record API, passing the group name and the unique ID. There is also a put_record API for storing individual records. This is what my record looks like: key-value pairs with feature names and values. I'm only interested in the text, so I extract the review and pass it in my prediction request, using JSON for serialization and deserialization. Finally, I invoke the endpoint using the predict API and see the probabilities for the three labels. This is an example of going from A to Z.

There's another option to store data in Feature Store, using the integration between SageMaker Data Wrangler and SageMaker Feature Store. Data Wrangler is a visual tool that helps you prepare datasets: you apply built-in and custom transforms, build a transform pipeline, and export it to different destinations, including a Jupyter notebook that stores your engineered features in SageMaker Feature Store. You don't have to write all that code yourself. The two things you need to take care of are enabling the online store and providing a record ID column name and a timestamp column name. You can then run the notebook and push your features directly to Feature Store.

A few resources to get started: the product page, the documentation, the Feature Store SDK, and the location of the notebook I used. We've made it to the end of this session, where we saw how to use Feature Store to avoid running feature engineering code again and again, make it easier to discover and share features, and use the exact same features for training and prediction without rewriting any code. I hope this was helpful. Go and try it, send me feedback, and let me know what you like and don't like. Please feel free to connect on Twitter, Medium, YouTube, or anywhere else. I hope this was good, and I hope you enjoy the rest of the conference. Bye.
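And here's a sketch of the training and prediction side. The S3 channel paths (`s3_train_path`, `s3_val_path`), the instance types, and the review ID are placeholders made up for illustration; `feature_group_name` comes from the previous step:

```python
import json
import boto3
import sagemaker
from sagemaker import image_uris

session = sagemaker.Session()

# Grab the BlazingText container for the current region
container = image_uris.retrieve('blazingtext', session.boto_region_name)

bt = sagemaker.estimator.Estimator(
    container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.c5.2xlarge',  # placeholder instance type
    output_path=f's3://{session.default_bucket()}/blazingtext-output'
)
bt.set_hyperparameters(mode='supervised')  # text classification mode

# s3_train_path / s3_val_path point at the text files uploaded earlier
bt.fit({'train': s3_train_path, 'validation': s3_val_path})

predictor = bt.deploy(initial_instance_count=1, instance_type='ml.m5.large')

# Fetch one engineered record from the online store by its unique ID
fs_runtime = boto3.client('sagemaker-featurestore-runtime')
record = fs_runtime.get_record(
    FeatureGroupName=feature_group_name,
    RecordIdentifierValueAsString='R123EXAMPLE'  # hypothetical review ID
)
review = next(f['ValueAsString'] for f in record['Record']
              if f['FeatureName'] == 'review_body')

# Send the tokenized review to the endpoint and read the label probabilities
response = predictor.predict(
    json.dumps({'instances': [review]}),
    initial_args={'ContentType': 'application/json'}
)
print(json.loads(response))
```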

Tags

FeatureEngineering, MachineLearning, SageMaker, FeatureStore, DataPreparation, BlazingText