SageMaker Fridays Season 3 Episode 2 Easy data preparation with SageMaker Data Wrangler

March 07, 2021
Broadcast live on 05/03/2021. Join us for more episodes at https://amazonsagemakerfridays.splashthat.com/ ⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️ In this episode, we start from the popular Titanic survivor dataset. We import it in SageMaker Data Wrangler, where we build visualizations and apply built-in transforms (column operations, imputing missing values, one-hot encoding, normalization). Then, we export these transforms to a Jupyter notebook running a SageMaker Processing job. We run the notebook and take a look at the processed dataset, before training a model with XGBoost. We also take a quick look at other export options (Python code, SageMaker Pipelines, SageMaker Feature Store). 100% live, no slides :) https://aws.amazon.com/sagemaker/data-wrangler/

Transcript

Hi, everybody. Welcome to season three of SageMaker Fridays, and this is episode number two. My name is Julien. I'm a dev advocate focusing on AI and machine learning. And once again, please meet my co-presenter. Hi, everyone. My name is Sigelen, and I am a senior data scientist working with the AWS Machine Learning team. My role is to help customers get their ML projects on the right track in order to create business value as fast as possible. Thank you for being with us again. SageMaker Fridays is a bi-monthly event where we focus on real-life machine learning use cases that we try to solve using SageMaker, and particularly all the new capabilities that were launched a few months ago at AWS re:Invent. All episodes are live. We're in the Paris office again. We don't use any slides. It's going to be, as always, 100% discussion and demo. So please ask all your questions in the chat. We'll try to keep an eye on the questions, and we'll try to do live Q&A for a change. We also have colleagues helping us with answers. So if we don't answer your question, I'm sure someone else will pick it up. But we'll try to do some live Q&A. Why not? And as always, there are no silly questions. So don't be shy. Don't feel embarrassed. Ask anything that you'd like, and make sure you learn as much as possible about machine learning and SageMaker. So today we are talking about data preparation, and we'll focus on a new capability called SageMaker Data Wrangler. Data preparation is the beginning of the machine learning lifecycle. Before we dive into the service and the demo, Sigelen, can you tell us a little bit about data prep? What is it? What are the main things we need to focus on? What are some of the pain points? Based on your experience, what should we know about data preparation? Data preparation takes a lot of time for any kind of data scientist; sometimes it can turn out to be a nightmare.
When you start any kind of ML project, you need to understand your data, where it comes from, and familiarize yourself with it to do a good ML project. Data prep often involves cleaning data, filling missing values, and transforming data to make it more expressive, which makes it easier for the algorithm to learn. This is what we call feature engineering. It's a big part of the job of data scientists and ML practitioners, because creating good features is crucial for building a good ML model. For example, you might need to clean, encode, or drop columns. Visualization is also important. Histograms, scatter plots, and other types of visualization help make your data speak. Each kind of data has an optimal way of being visualized, and it's important to try different visualizations. For all these transformations, you usually write a lot of code. Today, we'll see that Data Wrangler lets us do all of this without writing any code, which is great because, as you know, I'm very lazy. I'd rather focus on understanding the problem than writing Python code to encode stuff. Once we've done that, how do we use it again and again? Can we export the transforms? Can we automate? What are the typical steps there? Exactly. During the demo, we'll see that we want to reuse the work we've already done. There are ways to store the data preparation steps and reuse them on other data. You can run them programmatically, and so on. Okay, should we start the demo? Yes. Okay. Because I know that's what you want to see: demos. You don't want to see us; you want to see interesting demos and code. Let me share my screen. We start once again from SageMaker Studio. From the launcher, we see an option called new data flow, which is what we want: prepare and visualize data with SageMaker Data Wrangler. Clicking on this opens a window where you can see the different steps. We need to import data, analyze it, prepare it, apply transforms, do some visualization, and then export it.
Here, I could pull data from Athena, but we're going to use Amazon S3. So we click on this; make sure your bucket is in the same region as SageMaker Studio. On my screen, I'm using the Ireland region (eu-west-1). I'm going to get my bucket here, which is in the same region. It has a file called titanic.csv. Hopefully, my demo won't sink like the Titanic. It's a well-known toy dataset. It's a very dramatic topic, but we'll try to make it fun. We have information on all the passengers. As soon as I select this, we see a preview of the headers. We could enable sampling or take everything. We'll leave sampling enabled here. We can decide the format. This is a CSV dataset, but you can also use Parquet data, which is very efficient for large datasets and supported by Redshift and Athena. The only thing left to do here is import it. So let's do that. We get to the preparation screen. We see our data source, and we're going to use just one for simplicity. If you add multiple files or sources, you could join them in the SQL sense. Data types are automatically detected. Let's edit data types and check that the data types for all the columns are correct. They look okay, but you could change them and apply the change. We don't need to do this, so we can go back to the data flow. We want to see what's inside the dataset, so we want to build graphs. Good. We add an analysis and see the different types: histogram, quick model, scatter plot, table summary. Table summary looks like a good start. I'm going to call this one summary. We get basic stats, such as the number of values. We can see we have some missing values. Age is not available for all samples, and cabin is very empty. We'll worry about that later. We can also see mean values, standard deviation, and max. It's a quick way to see what's happening. We create this, and it's saved. Let's try a histogram. We'll plot survivors. On the X axis, we put the survived column. We see about 500 people who didn't survive and 300 who did.
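The table summary that Data Wrangler computes here is essentially what pandas gives you with `describe()` and `isna()`. A minimal local sketch, using a toy DataFrame with Titanic-style column names (the values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy rows mimicking the Titanic dataset (column names come from the
# public dataset; the values here are illustrative).
df = pd.DataFrame({
    "survived": [0, 1, 1, 0, 0],
    "pclass":   [3, 1, 3, 1, 3],
    "age":      [22.0, 38.0, np.nan, 54.0, np.nan],
    "cabin":    [None, "C85", None, "E46", None],
    "fare":     [7.25, 71.28, 7.92, 51.86, 8.05],
})

# Equivalent of Data Wrangler's table summary: count, mean, std, min, max
# for the numeric columns.
summary = df.describe()

# Missing-value counts per column, like the summary analysis showed
# for "age" and "cabin".
missing = df.isna().sum()
print(missing["age"], missing["cabin"])  # 2 missing ages, 3 missing cabins
```

On the real dataset, this is exactly how you would spot that age is partially missing and cabin is mostly empty before deciding what to do about them.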
We can add some color. Let's try age. Not so bad. We could try something else. Do we have the passenger class? These are integer values, so using a float scale here is not great. We could use facets to build those graphs. We have built-in types like histogram and scatter plot, and you can also write your own visualization code here. There's another one I like, which is the quick model. It trains a quick model in place and gives you a sense of the predictive power in the dataset. As you run feature engineering steps, you can see if you're increasing your metric. It's going to be your baseline. I just have to say: here's my label, which is called survived. It's a binary classification, so we use the F1 score. 1 is perfect, 0 is awful. 0.79 is not too bad. We can also see feature importance. Oddly enough, we see the name as an important feature, which is probably not right because it's only strings. We see sex, cabin, ticket, passenger class, etc. Let's start preparing data. We go back to the preparation view and click add transform. This is where the good stuff starts, because this is where you find all the transforms. You have over 300 of them, and if that's not enough, you can run your own PySpark, pandas, or PySpark SQL code. Let's look at some typical transforms. We want to train with XGBoost, which is a good all-round choice. XGBoost has one requirement: the label needs to be the first column. We see it's not. We also have a passenger ID column, which we probably need to drop. Let's move the label to the start. We just go to manage columns, move column, move to start. We can preview to make sure that's what we want. This is really what I want to do. I want to see this column here, so I'm going to add this transform. Then we can say, hmm, passenger ID, is that a feature? It's just the index of the CSV line. Let's remove this. We're going to drop column, passenger ID, preview, and get rid of it.
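In pandas terms, these two column operations (move the label to the start for XGBoost's CSV format, then drop the row-index column) might look like the following sketch; the column names are assumptions for illustration:

```python
import pandas as pd

# Illustrative frame: "survived" is the label, "passenger_id" is just the
# CSV row index (column names are assumptions for this sketch).
df = pd.DataFrame({
    "passenger_id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "survived": [0, 1, 1],
})

# "Manage columns / move to start": XGBoost's built-in CSV format expects
# the label in the first column.
cols = ["survived"] + [c for c in df.columns if c != "survived"]
df = df[cols]

# "Drop column": passenger_id carries no predictive information.
df = df.drop(columns=["passenger_id"])

print(list(df.columns))  # ['survived', 'name']
```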
P class is the passenger class. If Leonardo had been in first class, he might still be alive today, right? Did you cry? Yes, of course. I didn't. Oh, you're not romantic. No, I'm not very sensitive to those stories. But these values are not really integers; they are categories. One, two, three: there's no sense of scale. Third class is not three times first class. This encoding doesn't work for me, so I'm going to say this is more of a categorical variable. I'm going to one-hot encode P class, and the output style can be either a vector or separate columns. I prefer separate columns, so I'm going to prefix them with P class. I preview and see P class 1, 2, 3. Now I have my three dimensions. That's great. I'm going to add this and drop P class. It's important to drop P class to avoid multicollinearity. I'm also going to drop names because those are strings. Maybe there is predictive power there, but I'm not so sure. Sex is definitely a category. XGBoost is pretty good at handling category strings, so let's leave it like that. Age has lots of missing values. Should we try and fill in the missing values? We could drop all the rows with missing values, fill them with a placeholder value, or impute. We could impute the mean or the median. I'll use the median. We preview this, and it helps. I'm going to add that. Now we have values for age. Regarding missing values, this is really important because most of the time your data will have some. It's important to have domain expertise to know how to replace missing values. It's super important to think about why you have these missing values and how to overcome them. Here, we just use the median because we're having fun with the data. In real life, I would also try dropping all the rows without age, build both models, and see which one works best. Maybe do an ensemble, because why not? There is no right or wrong answer; you have to experiment. Tickets have some weird characters.
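Outside of Data Wrangler, the same two transforms (one-hot encoding the class and imputing the median age) can be sketched in pandas; `pd.get_dummies` conveniently drops the original column as it expands it, which also avoids the multicollinearity issue of keeping both:

```python
import pandas as pd
import numpy as np

# Illustrative values only; "pclass" and "age" are the Titanic columns
# discussed above.
df = pd.DataFrame({
    "pclass": [1, 3, 3, 2],
    "age": [38.0, np.nan, 26.0, 35.0],
})

# One-hot encode pclass into separate columns prefixed "pclass_";
# get_dummies replaces the original column with the new ones.
df = pd.get_dummies(df, columns=["pclass"], prefix="pclass")

# Impute missing ages with the column median, as in the demo.
df["age"] = df["age"].fillna(df["age"].median())

print(sorted(c for c in df.columns if c.startswith("pclass_")))
# ['pclass_1', 'pclass_2', 'pclass_3']
```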
It's mostly numbers, so I'm going to drop ticket numbers. Fare is probably interesting. Cabin is mostly empty, and I don't have a good idea for filling values here, so I'm going to drop cabins. We still haven't written a line of code. Embarked is a string, so it's a category. The value brought by Data Wrangler is that all the things we are currently doing normally require lines of code. Here, it's just a UI, and you don't lose yourself in lines of code. The way I used to do it before Data Wrangler was copy-pasting from Stack Overflow or old notebooks. I trust that this list of transforms is based on code that works. I'm not so interested in writing the Python code to remove outliers. I'm really interested in getting rid of them and moving on to the next task. It's super important for newcomers to machine learning, because these transformations are the core of the data preparation process. Let's do a couple more. We have numerical values. Maybe we should normalize. We want to show you how to do this. We have normalization in there: scale values, standard scaler, min-max. I love min-max. I'm going to scale age between 0 and 1. This is probably not needed because XGBoost knows how to deal with this. Now I should do fare as well. Standardization might be interesting for data with high dispersion or high standard deviation. In this case, the scale of these two features is not so different, but let's do this one anyway. It's really easy to do. If we go back to analyze, I see my 10 transforms. If I want to add another one, I just say add, and I see all my previous steps. I can remember what happened there. Now, let's say we're happy with this. We want to export, so let's go and export this. We get to pick the transforms we want to export. This is an easy way to try different combinations: you can just click and select the ones you want. Here, we'll just export everything. Now we click on the magic button and get four options.
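Min-max scaling is just the formula x' = (x − min) / (max − min), which maps each column onto [0, 1]. A hand-rolled pandas sketch with made-up values:

```python
import pandas as pd

# Illustrative values; on the real dataset these would be the
# "age" and "fare" columns.
df = pd.DataFrame({"age": [10.0, 20.0, 40.0], "fare": [5.0, 10.0, 25.0]})

# Min-max scaling rescales each column to [0, 1]:
#   x' = (x - min) / (max - min)
for col in ["age", "fare"]:
    lo, hi = df[col].min(), df[col].max()
    df[col] = (df[col] - lo) / (hi - lo)

print(df["age"].tolist())  # runs from 0.0 to 1.0
```

As noted above, tree-based models like XGBoost don't strictly need this, but it is harmless and matters a lot for distance-based or gradient-descent-based models.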
The first one is a Data Wrangler job, which is a SageMaker Processing job. SageMaker Processing makes it easy to run batch jobs on managed infrastructure. Machine learning is not just training and deploying; it's also data preparation, model evaluation, etc. SageMaker Processing makes it very easy: just provide your script, where the data is, and where to write the output. We can also export to SageMaker Pipelines, which lets you build end-to-end automated, repeatable, traceable workflows. Python code is just Python code: if you want code without SDK dependencies, you can take it and run it in your own machine learning code. Feature Store is what the name says. We'll take a quick look afterwards; we automatically generate a notebook to export our engineered features to SageMaker Feature Store. Let's just run the Data Wrangler job. This notebook is completely generated; I just have to run the cells. Let's get a kernel. By default in Studio, if you don't know which kernel to pick, take the Data Science kernel. We can take some questions as well. Let's just run this. This imports what we need. These are some definitions: where the dataset and the flow file are. The flow is still called untitled; it's a JSON file with the list of transforms and their parameters. You can actually edit the file. If I open it with the editor, we can see the actual nodes in the graph and all the different transforms. This is very good because you can put this in Git and manage it. You can open it in Studio again and go back and forth. It's not like, oh, I do all this clicking in the UI and then what? Well, then you get this file, and it's the artifact that traces everything. Next come the container for SageMaker Processing and some infrastructure requirements. We can leave all this stuff at its defaults. The output should be CSV. We need to upload the flow file to S3; all data for SageMaker lives in S3, so that's where we put it. Then we can run the SageMaker Processing job.
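To give an idea of why the flow file is Git-friendly, here is a toy, heavily simplified stand-in for such a JSON file. The real schema produced by Data Wrangler is richer, so treat the keys below as illustrative assumptions rather than the actual format:

```python
import json

# A minimal stand-in for an exported .flow file. The real file is richer;
# the point is that it's a plain JSON graph of nodes, so it can be
# inspected, diffed, and versioned like any other text artifact.
flow_json = """
{
  "nodes": [
    {"type": "SOURCE",    "name": "titanic.csv"},
    {"type": "TRANSFORM", "name": "Move column"},
    {"type": "TRANSFORM", "name": "Drop column"},
    {"type": "TRANSFORM", "name": "One-hot encode"}
  ]
}
"""

flow = json.loads(flow_json)
# List the transform steps recorded in the graph.
transforms = [n["name"] for n in flow["nodes"] if n["type"] == "TRANSFORM"]
print(transforms)
```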
The inputs for this are the flow file we just uploaded and the dataset. We have an output, which is an S3 prefix where we're going to store the data. We run the cell. We use this processor object, which says: please use one ml.m5.xlarge instance, or whatever we picked, and just one of them. If you don't like to manage EC2 instances or any infrastructure, that's perfect. Then you just run the job, pass your inputs, pass your output, and wait for the job to complete. When you get to the end of this, after a few minutes, you have your processed data in S3. Let me show you the one I ran a few minutes ago. It's probably a different list of transforms; I didn't tweak it as much as we did here. We got to the end of it, and it typically runs for maybe five minutes. Then, of course, we can go and train with this. This is also generated in the notebook. It's a default example where we train XGBoost on classification data. We grab the path of the processed data. If you open this file, it's a CSV file with lots of zeros and ones. It's pretty much the original Titanic dataset processed according to our pipeline. We can use this as a training input. It's a basic example with just a training set. You would want to load this and split it into training and validation sets, do it properly, and then use your SageMaker estimator to configure a training job, set hyperparameters for binary classification, use the AUC metric, and train. We get a high training accuracy, but that's meaningless; we want to look at validation AUC. All you have to do is start from there and grab the output of your SageMaker Processing job. You configure the output location, so you know where to get your output. Let's check the other one to see if it's done. It's not over yet, so I should keep talking for a few more minutes. It's a good example, and it's easy to automate.
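The point about splitting into training and validation sets and watching validation AUC rather than training accuracy can be illustrated locally. This sketch uses synthetic data and scikit-learn's LogisticRegression as a lightweight stand-in for XGBoost; everything here is an assumption for illustration, not the generated notebook's code:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the processed Titanic CSV: engineered features
# (zeros/ones and scaled floats) and a binary label.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] + 0.1 * rng.standard_normal(200) > 0.5).astype(int)

# The generated notebook only uses a training set; in practice, hold out
# a validation set so validation AUC (not training accuracy) guides you.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic regression as a lightweight local stand-in for XGBoost.
model = LogisticRegression().fit(X_train, y_train)
auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(auc)
```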
Here, it's just a notebook, and we click through the cells because it's a demo, but in a real application we would have a Lambda function triggered by an object being written to S3, or maybe you'd use the Pipelines export. Let's take a quick look at this. Export to Pipelines, just to show you what it looks like. We see the different steps, processing, training, etc., and then you chain them and run the pipeline. Here we have a processing step and a training step. We put them together, create a pipeline, and run the pipeline. That's a very good way to automate things. How am I doing on this notebook? Yes, it worked. Wow. A live demo that worked. I should play the lottery tonight. Lucky you. Because we rehearsed this for about 29 seconds. There's a bit of a story this week, but I'll spare you the story. I want to take a look at the file. Can I take a look at the file? Let's check what you did. We have time. We have more prefixes. So I'm going to give you my magic recipe. We see the CSV file. I'm going to copy this and call it data.titanic. We should do more live demos. They work. Amazing. So, perfect. This is what your data looks like, and you can load it on your own laptop and do whatever you want with it. Now, let's go back to this. This worked, and we could train a model; I already showed you how to do that. We have a few more minutes, so maybe we can quickly look at the other export options. We looked at this one and quickly looked at that one. We can look at Python code. The Python code is exactly this; it doesn't use the SageMaker SDK. You can just take it, put it in your own code, and it will work. There's quite a bit of code because I added quite a lot of transforms, but you can run this and see all the different transforms we added. It's great to have this Python code generated automatically for you. It's nice.
You have lots of APIs in scikit-learn and pandas, and it's not like we're inventing new transforms here, but it's great not to have to worry about this. If you have a machine learning engineering team that loves to write this, that's okay. But if you're on your own, you can do it this way. Or if you just want to cut corners and go fast, that's fine. I'm very impatient and love to go fast. We're missing one, right? Yes, we're missing Feature Store. We'll look at Feature Store next time, but it's exactly what you think it is: the ability to store the engineered features. We're going to store those rows offline, so that's S3. We can also store them online, which is interesting because then we can query them at very low latency and inject them into prediction requests. This means we don't have to rewrite feature engineering code at prediction time; we can use exactly the same features. Here it's a simple example, but we'll try to show you something more complex for Feature Store. As you can see, we covered quite a bit of ground with Data Wrangler. We imported the dataset from S3, applied a whole bunch of different transforms, and honestly, we just scratched the surface. There are bigger transform groups, like handling outliers and managing vectors, and search-and-edit is actually pretty cool. One good example is a dataset where missing values are replaced by a question mark: you can easily get rid of the question mark, replace it with a null value, and then drop all the lines that have a null value. It has a crazy number of options, like regular expressions. You can vectorize text and use custom transforms. You build your pipeline and can export it in different ways: Python code, processing job, feature store, and pipelines. You can select the transforms and export them. It works well if you're experimenting. It's quite fast, and it just works. You can interactively build your transforms and still get all the code, the pipelines notebook, and the feature store notebook.
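The exported Python code is, in spirit, a plain script chaining all the transforms. A condensed, hypothetical pandas version of such a chained pipeline for the demo's steps (column names assumed from the Titanic dataset; this is not the generated code itself):

```python
import pandas as pd
import numpy as np

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Chain the demo's transforms: drop unused columns, label first,
    one-hot encode class, impute median age, min-max scale numerics."""
    df = df.drop(columns=["passenger_id", "name", "ticket", "cabin"])
    df = df[["survived"] + [c for c in df.columns if c != "survived"]]
    df = pd.get_dummies(df, columns=["pclass"], prefix="pclass")
    df["age"] = df["age"].fillna(df["age"].median())
    for col in ["age", "fare"]:
        lo, hi = df[col].min(), df[col].max()
        df[col] = (df[col] - lo) / (hi - lo)
    return df

# Tiny illustrative input, mimicking the raw CSV's columns.
raw = pd.DataFrame({
    "passenger_id": [1, 2, 3],
    "survived": [0, 1, 1],
    "pclass": [3, 1, 2],
    "name": ["A", "B", "C"],
    "ticket": ["t1", "t2", "t3"],
    "cabin": [None, "C85", None],
    "age": [22.0, np.nan, 26.0],
    "fare": [7.25, 71.28, 7.92],
})
clean = prepare(raw)
print(clean.columns[0])  # survived
```

The exported script is longer, of course, but the idea is the same: a reproducible function you can run anywhere pandas runs, with no SageMaker SDK dependency.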
If you want traceability, you have the flow file, which is very important. Feel free to experiment and send us feedback. You can always ping me or Sigelen on LinkedIn or anywhere. If you tried the service and are missing something or really disliked something, or if you find something broken, that's fine. You can yell at me on Twitter or LinkedIn. I'm happy when people yell at me; it's a good thing. Next time, we'll discuss Feature Store. We want to continue the logical progression: prepare, build models, train, deploy, debug, profile, optimize. We have a lot to cover. So, we'll see you in two weeks for, probably, Feature Store. I'll try to build something bigger and show you how to build training sets with your offline features and use them at prediction time online. It's two weeks away, so plenty of time to build something cool. And you'll help me, right? With pleasure. Well, I think we're going to call it a day. Thank you, everybody, for watching. Thank you to all the colleagues involved in setting this up. We'll see you in two weeks. Thank you, Sigelen. Thank you, Julien. Until then, keep rocking with machine learning. Bye bye.

Tags

SageMaker Data Wrangler, Machine Learning Lifecycle, Data Preparation Techniques, Live Demo AWS Services, Feature Engineering Tools

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.