Hi everybody, this is Julien from Arcee and in this video I would like to introduce you to a new SageMaker capability that was launched yesterday: SageMaker Data Wrangler. This is going to help you a lot with data preparation in the early phase of your machine learning project. Let's get to work.
Obviously, you first need an AWS account. Once you've created it, go to the SageMaker console, then to SageMaker Studio, and create a user. It only takes a minute, and then you can open Studio directly; you can do all of this in five minutes. The one place you want to go is the new icon on the left; we'll look at the other new capabilities in other videos. For now, let's focus on Data Wrangler. Click on New Flow. We can give the flow a name, because "Untitled" feels a little generic. Let's call this "Wrangler Demo."
We need to provide some data. I'm going to use the Titanic passenger survival dataset, which I'm sure you've seen before; if not, don't worry, it's super simple. I copied it to one of my S3 buckets, so I'll click on S3 to import it. The first time you do this, it fires up the Data Wrangler image in Studio, which can take a few minutes. Just hang in there; it will show up. Let's find my bucket. There's a lot of stuff here, so I need to clean up. Okay, and here's the Titanic dataset. I can close this; we don't need it anymore. As I select it, I see a preview, so I can make sure this is the right data. I don't want sampling; I want the full dataset. You can import CSV or Parquet data. Let's import this.
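By the way, if you want to sanity-check the file yourself before importing it, here's a minimal pandas sketch. The bucket and key are hypothetical, and reading s3:// URLs from pandas assumes the s3fs package is installed:

```python
import pandas as pd

# Hypothetical location: replace with your own bucket and key.
# Reading s3:// URLs from pandas requires the s3fs package.
df = pd.read_csv("s3://my-bucket/titanic/train.csv")

print(df.shape)   # rows x columns
print(df.head())  # quick preview, like the one Data Wrangler shows
print(df.dtypes)  # inferred column types
```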
I'm presented with a graphical view, and as we add preparation steps, they will be reflected here. The first thing we probably want to do is check that the data types are correct for each column. It's a simple dataset, so there shouldn't be any issues; the columns look good to me, so I don't need to change anything, and I can go back to the analyze view. The next thing you want to do is visualize the data to understand what's in there, so let's add an analysis. We can build histograms and scatter plots. A table summary is probably a good place to start for basic stats on our different columns. We can save this visualization and come back to it anytime we want. Let's do another one: a histogram of passenger age, colored by passenger class and faceted by survival. It looks like if you were in third class and between 10 and 40, your survival odds weren't that good. We can save this one as well, calling it "Age vs. Survived."
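For reference, here's roughly the same summary and plot done in plain pandas and matplotlib. This is just my sketch of the idea, not what Data Wrangler runs internally, and the filename is a hypothetical local copy:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("titanic.csv")  # hypothetical local copy of the dataset

# Basic per-column stats, similar to Data Wrangler's table summary
print(df.describe(include="all"))

# Age histogram split by survival (0 = died, 1 = survived)
for survived, group in df.groupby("Survived"):
    plt.hist(group["Age"].dropna(), bins=20, alpha=0.5, label=f"Survived={survived}")
plt.xlabel("Age")
plt.legend()
plt.show()
```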
Once you have a better sense of what the data looks like, you can start preparing it by adding transforms. We see a long list of built-in transforms, such as encoding categorical variables, managing columns, and more. Let's try a few things. First, I want to drop the Name column; I don't need names here. I can preview the change, add it, and move on. Next, I want to one-hot encode the passenger class, Pclass, which is a categorical variable, not an ordinal one. So I'll pick the "Encode categorical" transform, select Pclass as the input column, choose one-hot encoding, and give the new columns a prefix. I want the column style, not the vector style. I can preview and see my encoded columns. I want to keep these, and drop the original Pclass column. Preview, add, and we're good.
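Here are those same two transforms sketched in plain pandas for comparison; the column names are from the standard Titanic CSV, and the filename is hypothetical:

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # hypothetical local copy

# Drop the Name column
df = df.drop(columns=["Name"])

# One-hot encode Pclass into prefixed columns; get_dummies
# also drops the original column for us.
df = pd.get_dummies(df, columns=["Pclass"], prefix="class")

print(df.head())
```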
If you have custom transforms, you can add your own code using PySpark, Pandas, or PySpark SQL, and you can also use custom formulas written in Spark SQL. What else do I want to do? Probably move the label to the first column, which is what the SageMaker XGBoost algorithm expects for CSV data. So, let's move the "Survived" column to the start. Preview, add, and you get the idea; you can explore all of these transforms and add your own. Now we see our pipeline of preparation steps. You can create multiple groups of steps, for example one for data cleaning and one for feature engineering, to keep everything organized.
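If you were doing that reordering in your own code, it's a one-liner. Here's a small self-contained sketch; the intermediate filename and helper function are mine, not anything Data Wrangler generates:

```python
import pandas as pd

def label_first(df: pd.DataFrame, label: str = "Survived") -> pd.DataFrame:
    """Reorder columns so the label comes first, as SageMaker XGBoost expects for CSV input."""
    return df[[label] + [c for c in df.columns if c != label]]

df = pd.read_csv("titanic_processed.csv")  # hypothetical intermediate file
df = label_first(df)
```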
We can delete or revert steps if needed. Let's say we're happy with this and want to see how it would perform. We can go back to analysis and use the Quick Model feature to train a model right there. The label is "Survived," and this immediately trains a model on the pre-processed data. This is nice because you can see the impact of your data preparation steps. It's a binary classification problem, and I get an F1 score of 0.718, which is okay but could be better with more processing. I also see feature importances: sex, cabin, and being a first-class passenger matter most. By running this again after each change, you can see whether your transforms are helping the model.
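To get the same kind of signal outside Studio, you could train a quick baseline yourself. This is my own approximation with scikit-learn, not the actual Quick Model implementation, and the input file is hypothetical:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic_processed.csv")  # hypothetical output of the flow
X = df.drop(columns=["Survived"]).select_dtypes("number").fillna(0)
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

print("F1:", f1_score(y_test, model.predict(X_test)))
for name, imp in sorted(zip(X.columns, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```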
Once you're happy with the accuracy, you want to export your data processing steps for use in your own code. This is super simple. Just go to export, select the steps you want, and click on export step. You get different options, such as exporting to Python code, which is ready to run and integrate into your machine learning project. This ensures you don't need to recode or replicate what you've done interactively. Another option is to export to a Data Wrangler job, which is a SageMaker processing job that applies those transformations to your dataset and outputs a processed version to S3.
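To give you a feel for that second option, here's a heavily simplified sketch of a processing job using the standard SageMaker Processing API. The container URI and S3 paths are placeholders; the real exported notebook fills in the Data Wrangler image and flow wiring for you:

```python
import sagemaker
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

role = sagemaker.get_execution_role()

# Placeholder values: the exported notebook generates the real ones.
DATA_WRANGLER_IMAGE = "<data-wrangler-container-uri-for-your-region>"
FLOW_S3_URI = "s3://my-bucket/flows/wrangler-demo.flow"

processor = Processor(
    role=role,
    image_uri=DATA_WRANGLER_IMAGE,
    instance_count=1,
    instance_type="ml.m5.4xlarge",
)

# Feed the flow file in, write the transformed dataset out to S3.
processor.run(
    inputs=[ProcessingInput(source=FLOW_S3_URI,
                            destination="/opt/ml/processing/flow")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/wrangler-output/")],
)
```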
Option three is to create a SageMaker pipeline, using SageMaker Pipelines, another capability announced yesterday. This lets you wrap all those transforms in a workflow that you can run automatically, which is handy if you want to automate and replicate this. The last option is to export to another notebook that pushes your engineered features to the SageMaker Feature Store, yet another capability announced yesterday. I'll cover that in another video, using the same example for continuity.
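And here's a minimal sketch of what wrapping that processing step in a pipeline looks like, again with placeholder names; the exported notebook generates the real definition:

```python
import sagemaker
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

role = sagemaker.get_execution_role()

# Placeholder values, as in the previous sketch.
processor = Processor(
    role=role,
    image_uri="<data-wrangler-container-uri-for-your-region>",
    instance_count=1,
    instance_type="ml.m5.4xlarge",
)

step_prepare = ProcessingStep(
    name="WranglerDemoPrepare",
    processor=processor,
    inputs=[ProcessingInput(source="s3://my-bucket/flows/wrangler-demo.flow",
                            destination="/opt/ml/processing/flow")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/wrangler-output/")],
)

pipeline = Pipeline(name="WranglerDemoPipeline", steps=[step_prepare])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # run the workflow
```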
As you can see, it's really simple to explore your datasets interactively, with a long list of built-in transforms and the ability to add your own. You can evaluate their impact with the Quick Model feature, and once you're happy, you can export in various ways to replicate the exact same steps in your machine learning project, whether that's Python code, a SageMaker processing job, a SageMaker pipeline, or the Feature Store.
That's the whirlwind tour of Data Wrangler. It's available in all SageMaker regions today. Go try it and let me know what you think. Send me your feedback, and I'm more than happy to read and answer your questions. Thank you.
Tags
SageMaker Data Wrangler, Data Preparation, Machine Learning Workflow, AWS SageMaker, Feature Engineering