Amazon SageMaker Studio Deep Dive AWS Online Tech Talks
February 27, 2020
Traditional machine learning (ML) development is a complex, expensive, iterative process made even harder because there are no integrated tools for the entire machine learning workflow. You need to stitch together tools and workflows, which is time-consuming and error-prone. Amazon SageMaker Studio solves this challenge by providing all of the components used for machine learning in a single, web-based visual interface. SageMaker Studio gives you complete access, control, and visibility into each step required to build, train, and deploy models. You can quickly upload data, create new notebooks, train and tune models, move back and forth between steps to adjust experiments, compare results, and deploy models to production all in one place, making you much more productive. In this tech talk, we will explain how it works, including a demo.
Learning Objectives:
*Understand how Amazon SageMaker Studio helps complete all steps of the ML workflow
*Learn how Amazon SageMaker Studio helps you improve model quality
*Understand how you can improve model performance with Amazon SageMaker Studio
***To learn more about the services featured in this talk, please visit: https://aws.amazon.com/sagemaker/
Subscribe to AWS Online Tech Talks on AWS:
https://www.youtube.com/@AWSOnlineTechTalks?sub_confirmation=1
Follow Amazon Web Services:
Official Website: https://aws.amazon.com/what-is-aws
Twitch: https://twitch.tv/aws
Twitter: https://twitter.com/awsdevelopers
Facebook: https://facebook.com/amazonwebservices
Instagram: https://instagram.com/amazonwebservices
☁️ AWS Online Tech Talks cover a wide range of topics and expertise levels through technical deep dives, demos, customer examples, and live Q&A with AWS experts. Builders can choose from bite-sized 15-minute sessions, insightful fireside chats, immersive virtual workshops, interactive office hours, or watch on-demand tech talks at your own pace. Join us to fuel your learning journey with AWS.
#AWS
Transcript
Welcome everybody, this is Julien from AWS, and I'm super happy to be back with a new webinar. Today, we're going to talk about Amazon SageMaker Studio, one of the most exciting services launched at AWS re:Invent just a few months ago. Before we dive into the service itself, let's look at what it means to build machine learning workflows today, and unfortunately, it is a pretty difficult task. You have to go through a number of different steps, and there are so many different tools you could use for each of them. Everything needs to line up perfectly if you want to build a successful model and deploy it.
Of course, first, you need to prepare your dataset. You need to collect data, annotate and clean it, transform it, and get your training set ready. Once you have that, you can start experimenting. You can start trying out different algorithms, from statistical machine learning to deep learning to even more exotic techniques. There are plenty of things you could try. You need an environment for experimentation, basically a sandbox you can use. Then, once you start to figure out which algorithm is going to work for you, you need to train, try out different dataset combinations, and different parameter combinations. This requires a bunch of infrastructure, and no one really wants to manage that, but you have to. You need to power those training jobs and debug them because sometimes they don't go the way they're supposed to. Understanding what's really happening during the training process isn't easy. You need to tune them, extract every bit of accuracy from that training job to get the best possible answers for your predictions and the best business outcome.
Over the time of the project, you'll end up training hundreds, maybe thousands of jobs. It's not easy to remember what you did last week, where that high-performance model you tried is, and what the parameters for it were and what dataset you trained it on. Managing that stuff becomes a project in itself, and a lot of customers actually build tooling around that. Once you have a model you really like, the hardest part is deploying that model in production, bringing it into your production environment where it can start predicting from samples that your business apps send to it. You need to monitor that and make sure your model is performing the right way. You need to reproduce in the production environment the performance and accuracy you saw in your dev and test environments. As your app becomes successful, you need to scale your prediction infrastructure and manage everything from high availability to security.
Although people tend to focus a lot on algorithms and data science and neural networks, there is quite a lot of stuff that needs to be done if you want your machine learning project to be successful. This stands in the way of achieving the expected outcome, which is why a couple of years ago we launched a service called Amazon SageMaker. At AWS re:Invent just a few months ago, we expanded on SageMaker and added a bunch of capabilities, and I'll demo some of them in just a few minutes.
We still see the same workflow: preparing, building, training, tuning, deploying, managing. These problems are still there, but we added a whole bunch of capabilities to let you go from experimentation all the way to production using the same modular service and APIs, trying to give you an end-to-end solution. Capabilities like SageMaker Debugger to debug your models, SageMaker Experiments to help you track thousands or tens of thousands of different models, and SageMaker Model Monitor to help you check your models in production and detect unwanted problems like data drift or missing features. The whole collection of additional features is now available.
On top of this, we launched SageMaker Studio, a web-based IDE for machine learning. You can still run your notebooks and work in the usual way, but we're also plugging all those different capabilities into SageMaker Studio so that you can have a single pane of glass where you just switch from one tab to the next, run your notebooks, and jump from debugging to monitoring, etc. Let's look at the main features of SageMaker Studio. Some parts of SageMaker Studio are still in preview, but you can absolutely go to the AWS console today, just make sure you go to the Ohio region (us-east-2), where Studio is available. Go to the SageMaker console, and you will see the link to SageMaker Studio. There is a simple setup procedure, just a few steps to get access, and you can start using it in just a few minutes.
In Studio, you'll find notebooks, but you don't have to create notebook instances anymore. If you've used SageMaker before, you probably know about notebook instances—fully managed instances that come pre-installed with Jupyter environments and Conda environments and all your favorite machine learning and deep learning libraries. These are still there, but one of the main new things in Studio is the ability to just open Studio and get access to serverless notebooks. You don't have to start instances any longer. Fully integrated in Studio is experiment management, the SageMaker Experiments capability, and we'll look at it. You can also run AutoML, automatically create a machine learning model, and this is a no-code experience. I'll show you how to do that. You have quality and productivity capabilities, such as the model debugger integrated into SageMaker, so you can easily look at any issue that's happening. Model Monitor is also integrated into SageMaker Studio, so you can quickly and easily enable data monitoring on an endpoint that has already been deployed.
The philosophy of Studio is that it's an IDE based on notebooks, a high-level IDE on top of all these new capabilities. You can still use notebook instances or your own local Jupyter notebooks and use the experiments SDK and the monitoring SDK, but we're trying to make it simple with SageMaker Studio. Enough talk. Let's go and look at a demo.
So, if you go to the us-east-2 region, the Ohio region, and go to the SageMaker console, you will see the link to Studio. Follow the setup procedure, and you get a link which you can see in my browser here. Click on this, and it opens SageMaker Studio. It looks very familiar, and we can create notebooks and open terminals, etc. On the left, I can see my files, and I've already cloned a few repos with some sample notebooks. Let's start looking at a notebook. This is one of my examples, and here we're training an XGBoost model. We're building a binary classification model on a dataset for direct marketing. It's a simple problem: we have customer records telling us if a customer has accepted a marketing offer. We want to build a classification model using this data.
First, I start by installing a few libraries that I need, like the SageMaker SDK, and then I download my dataset and extract it. This is a CSV file, and we see a bunch of features here, including employment status, marital status, and other information. The last column tells us if a specific customer accepted the offer. We use the Pandas library, which I call the Swiss Army knife for data science, to open the CSV file and display the first 10 lines. We see features and the label, and the column is called 'Y'. Most people say no to a marketing offer, but we have some yeses. We have more than 41,000 samples and 21 columns, 20 features, and the yes or no label. One interesting thing about this dataset is that it is very unbalanced, with an 8 to 1 ratio of no to yes, making it difficult to build a good classifier.
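The inspection step above boils down to a couple of lines of pandas. Here is a minimal sketch using a tiny synthetic stand-in for the direct-marketing CSV (the real dataset has 41,000+ rows and 21 columns; the column names other than the 'y' label are illustrative):

```python
import io
import pandas as pd

# Tiny stand-in for the direct-marketing CSV described above.
csv_data = io.StringIO(
    "age,job,marital,y\n"
    "35,services,married,no\n"
    "41,admin.,single,no\n"
    "29,student,single,yes\n"
    "52,retired,married,no\n"
)
df = pd.read_csv(csv_data)

# Peek at the first rows, as done in the notebook.
print(df.head())

# Check class balance on the 'y' label column; the real dataset is
# roughly 8-to-1 'no' to 'yes', which makes classification harder.
print(df["y"].value_counts())
```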
We do a little bit of transformation, removing fields or columns that don't make sense, and we group students, retired people, and unemployed people into a new column called 'not working'. We convert all categorical variables using one-hot encoding, which helps the model understand the different dimensions. We do the same for marital status and all categorical variables. Once we've done that, we still have the same number of lines but many more columns due to the extra categories.
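The one-hot encoding step can be sketched with `pd.get_dummies`: each category value becomes its own 0/1 column, which is why the row count stays the same while the column count grows (column names here are assumptions, not the dataset's exact schema):

```python
import pandas as pd

# Illustrative frame with one categorical column.
df = pd.DataFrame({
    "age": [35, 41, 29],
    "job": ["services", "admin.", "student"],
})

# One-hot encode the categorical variable: 'job' is replaced by one
# 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["job"])
print(encoded.columns.tolist())
```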
Next, we need to split the dataset into training, validation, and testing. We split the data set three ways and save three CSV files: test, train, and validation. We upload these files to S3 because that's where SageMaker needs them. We define their location in a Python dictionary and are done with the preprocessing step.
Now, let's train the model. The dataset is in S3, and we can use the XGBoost built-in algorithm from SageMaker. We use the latest version and configure the estimator object from the SageMaker SDK. We set the container, the algorithm (XGBoost), permissions, and how to access the data. We use file mode to copy the dataset to the training infrastructure. We save the model and set the training infrastructure requirements, such as one ml.m4.2xlarge instance. We can use Spot instances, which are unused capacity at a deep discount. We configure debugging, which saves the internal state of the model (tensors) to S3. We can check for class imbalance using a built-in rule in SageMaker Debugger.
We set some hyperparameters, such as the area under the curve metric for binary classification, and call fit to start the training job. We see the training job starting, and the debugger job runs in parallel, inspecting the tensors saved to S3. After 74 seconds, the training is done, and we saved 79.7% on the training cost using Spot instances.
The model and tensors are saved, and we can check the debugger job. Running the notebook in sequence, the debugging job was in progress, but running it again shows no issues found, meaning the class imbalance rule was not triggered. We can use the SageMaker Debugger SDK to inspect the tensors and plot metrics using matplotlib.
The last step is deploying the model. SageMaker makes it easy with the `deploy` method in the SageMaker SDK. We set the endpoint name, initial instance count, instance type, and data capture configuration to capture data sent to the endpoint. We send some data to the endpoint using the first 100 samples from the test set, invoke the endpoint, and read the answers. We see probabilities between 0 and 1, and we can capture and inspect the input and output data in S3.
For model tuning, we use the same dataset and estimator, and we explore different hyperparameters for XGBoost. We create a hyperparameter tuner object, run 40 jobs, and use the SageMaker SDK to analyze the results. We can sort the jobs by the area under the curve metric and deploy the best model.
Now, let's use AutoML to simplify the process. We download the dataset, split it, and upload the training set to S3. We create an Amazon SageMaker Autopilot experiment, give it a name, pass the location of the data in S3, and specify the target attribute. We select the machine learning problem type, such as binary classification, and run a complete experiment. The autopilot job goes through data analysis, feature engineering, and hyperparameter tuning, generating notebooks that show the steps and transformations. We can explore the top models, deploy the best one, and monitor the results.
This is a whirlwind tour of SageMaker Studio and its capabilities. Even if you're just using the SDK, it's easy to set up and use. For a simpler experience, AutoML can automatically handle the entire process, from data preparation to model deployment. Thank you for joining me today, and I hope you found this demo helpful. This is still in preview, so expect some changes, but the direction is to put everything in the same place to simplify experimenting, building, training, deploying, and using advanced capabilities in the same tool.
We're almost done, and I just want to share some resources. These are some links I recommend. SageMaker is part of the free tier, so you can experiment at a tiny scale. If you start training, especially with SageMaker Autopilot, this won't be in the free tier, so understand the pricing before running hundreds of jobs. The second URL is the SageMaker service page, where you'll find feature descriptions and customer stories. The next two URLs are the SageMaker SDK, which I use in those notebooks, and the SageMaker examples repository, where you'll find a ton of notebooks showing how to use SageMaker in various configurations. I cannot recommend them enough. A couple of links to my own resources: my YouTube channel, where you'll find videos on SageMaker, and my blog on Medium, where you'll find technical posts. I also have a podcast on machine learning.
I hope you learned a few things and are excited about SageMaker Studio, SageMaker Debugger, SageMaker Model Monitor, and all the other tools. It's a growing family, and we're trying to make your machine learning workflows as simple as possible. Please send us feedback through the AWS forums, your AWS contacts, or by pinging me on Twitter. Thank you for listening, and I'll see you soon. Bye-bye.