Sagify: an open source CLI for Amazon SageMaker

January 10, 2020
In this video, Pavlos Mitsoulis-Ntompos, a Data Scientist for the Expedia Group and an AWS Machine Learning Hero, demos his open source project Sagify (https://kenza-ai.github.io/sagify/). Sagify is a CLI tool that makes it extremely easy to train and deploy models, letting you focus completely on the problem you want to solve. Here, we train and deploy a scikit-learn model on SageMaker with just a few simple CLI calls. ⭐️⭐️⭐️ Don't forget to subscribe to be notified of future episodes ⭐️⭐️⭐️ For more content, follow me on: * Medium: https://medium.com/@julsimon * Twitter: https://twitter.com/julsimon

Transcript

So Pavlos, what are we looking at? Sagify, a tool I open-sourced that makes it very easy to train, deploy, and perform hyperparameter optimization on SageMaker using just a few commands. Okay, pretty cool. What's the elevator pitch for the demo? What are we going to run? We're going to build a classifier on the iris dataset. No MNIST, thank you. Another boring dataset. We'll train it on the cloud using SageMaker and then deploy it as a RESTful endpoint on SageMaker. We're going to do this in about 10 minutes. All right, let's go. The floor is yours. I will just ask silly questions.

Let's have a quick look at the dataset. It's called the iris dataset. Many of you are probably familiar with it. The task is to identify which class of iris flower each sample belongs to. It consists of four attributes, or features, and has three labels. So, three types of flowers, right? Exactly. Here's the dataset: four features and the label. The dataset is on S3. I'm trying to keep this realistic, as most companies store their data on S3. It's in the sagify-demo bucket, and here is the iris data. Just a text file.

I'm going to GitHub to create a new repository. We'll call it sagify-demo. Let's add a Python .gitignore. That's it. Let's clone it. Cloning the repository, getting into the repository. We need to create a virtual environment, as with most Python projects. I usually use Anaconda, so it's `conda create`. You could do this on a notebook instance as well, as Conda is pre-installed there. It looks like I already have the environment, but I'll create it from scratch. If you're not familiar with Conda, it's a Python package manager that allows you to create different environments. For example, you can have a self-contained environment for TensorFlow, and another for MXNet or a different TensorFlow version, isolating your environments to avoid dependency issues. So, I created the environment. We're going to activate it. Let's clear the terminal and install the dependencies. We need Sagify, scikit-learn, pandas, and the AWS CLI. Did I forget anything? I think we're good. Install them. Of course, in a real scenario, you'd have a script to do this automatically. All right, almost there. Oh, botocore, yeah, of course. That's actually the important one; it's required by the AWS CLI. Almost done. This should finish in a few seconds. Live demos, you'll see everything.

Cool, the Python environment is installed. We're going to run the `sagify` help and see all the commands. We see `build` and `cloud`, where we can upload data, train the model, and deploy it on SageMaker. But first, we need to do `sagify init`. Essentially, we're creating a new project, and it asks for the name of the project. We'll name it sagify-demo. It's a new project. Python 3, of course. It then asks which of the AWS profiles you have locally you want to use. I'll use the ML hero profile. These are AWS profiles defined in the AWS CLI config file, right? Exactly, in `.aws/config` in your home directory. The point is, you get the credentials you need to get the job done, and you can change them whenever you want. Then your desired region name, us-east-1. You also have to point it at your requirements file, so that's `requirements.txt`. Can we pause here? Sure. I just realized I forgot to create that. Perfect. We've created the project.
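The setup narrated above boils down to a handful of terminal commands. Here is a minimal sketch, assuming the environment name `sagify-demo` and Python 3.7 (both illustrative); the interactive prompts of `sagify init` and the exact package versions may differ across Sagify releases.

```bash
# Create and activate an isolated Python environment (name and version are illustrative)
conda create -n sagify-demo python=3.7 -y
conda activate sagify-demo

# Install the dependencies used in the demo
pip install sagify scikit-learn pandas awscli

# One way to produce the requirements.txt that sagify init asks for
pip freeze > requirements.txt

# Scaffold the Sagify project; this prompts for the project name, Python version,
# AWS profile, region, and the path to requirements.txt
sagify init
```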
Now, we'll use PyCharm to open the project and write the training and prediction logic. Let's use PyCharm. We'll open a new project. Here we are. It has created the source directory. Under this, we have the sagify module, with all the boilerplate code, like the Dockerfile, everything you need to train your model on SageMaker. Go to the `train` file. If you scroll a bit, and zoom in because it's probably hard to read, there's a `train` function. You get the input data path, where your training data will live, the path where you'll save your model, an optional path to hyperparameters passed in through a JSON file, and the place where you'll report any failures. You see the todos. Replace these todos with your modeling logic.

First, we'll import pandas and read the training data. So, `dataframe = pd.read_csv`, since it was a CSV, joined with the input data path. Remember, the iris data file was called `iris.data`. This is the full path to our training data. The next step is to say there was no header, yes, no column names, and to pass the new column names. I'll call them `feat1`, `feat2`, `feat3`, `feat4`, and the last one is the label. X will hold the features only, and y will hold only the labels. We call `.values` to get the NumPy arrays. It's good practice to split the dataset into train and test sets. From `sklearn.model_selection`, we import `train_test_split`. Before we go to the modeling side, let's split the data: `train_test_split`, and we give it `x`, `y`, then the test size, so 30% of the data will be used as the test dataset. It's also good practice to set the random state so the split is reproducible. We'll use support vector machines.

While Pavlos is doing that: we're skipping steps in the sense that, of course, you'd probably write and experiment with this code in a Jupyter notebook, either locally or on a notebook instance, with a fraction of the dataset. Here, it's a very small dataset, but imagine you have 100 gigabytes of data. You might grab one gigabyte, write this code, make it work, and then train it on SageMaker to scale. If you're new to machine learning: write this code first, debug it, test it on a subset of the dataset, and then, once you're happy, train at full scale with your working code.

These two lines, and we've trained an SVM. So simple. Scikit-learn is brilliant; all the math is hidden. Now, we need to evaluate it. To evaluate, we make predictions on the test dataset and compare them with the labels. From `sklearn.metrics`, import `accuracy_score`. The accuracy equals `accuracy_score`, where `y_true` is `y_test`, against the predictions. Perfect. We need to save the model, and the accuracy in another file, like a report. From `sklearn.externals`, we import `joblib` to serialize the model and deserialize it later. We need to construct the model path, which is a concatenation starting from the model save path. You have to save the model in a very specific place: SageMaker tells you where to save it, and you have to use that path, which is local to your container. At the end of the training job, SageMaker will take the model from that location and copy it to S3, under the S3 prefix you define in your training job. Many people say their training job failed when the job itself probably worked; they just saved the model in the wrong place. Pay attention to where you save the model. Here, Sagify passes you that location, so you just save the model under it. Use `joblib.dump` to save the trained classifier to a file called `model.pkl`. We also need to save the accuracy in a file. Define the report path in the same location where the model is saved; we'll call it `report.txt`. Open the report path and write the accuracy, which is a float. It's almost done. We hit the usual indentation problem in Python. Tabs or spaces? We'd need a full conference on that one. It's reformatted. Okay, cool, perfect. We're ready. We've loaded the dataset, split it, trained a support vector machine, scored it, and saved the model and the score.
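Putting those pieces together, here is a hedged sketch of what the completed training logic might look like. The function signature is simplified and the `iris.data` file name and `feat1`–`feat4` column names follow the demo; the exact arguments come from Sagify's generated template, so treat them as illustrative.

```python
import os

import pandas as pd
from sklearn.externals import joblib  # older scikit-learn; newer versions use the standalone joblib package
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def train(input_data_path, model_save_path):
    # Read the iris CSV; the file has no header row, so supply column names.
    df = pd.read_csv(
        os.path.join(input_data_path, "iris.data"),
        header=None,
        names=["feat1", "feat2", "feat3", "feat4", "label"],
    )

    x = df[["feat1", "feat2", "feat3", "feat4"]].values
    y = df["label"].values

    # Hold out 30% for evaluation; a fixed random_state keeps the split reproducible.
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.3, random_state=42
    )

    # Train a support vector machine classifier.
    clf = SVC()
    clf.fit(x_train, y_train)

    # Score it on the held-out set.
    accuracy = accuracy_score(y_true=y_test, y_pred=clf.predict(x_test))

    # Save the model where SageMaker expects it, plus a small accuracy report.
    joblib.dump(clf, os.path.join(model_save_path, "model.pkl"))
    with open(os.path.join(model_save_path, "report.txt"), "w") as f:
        f.write(str(accuracy))
```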
If you're not familiar with scikit-learn, we had a session today on that. Go watch it, and you'll understand everything in there. That's perfect. Now, we need to do `sagify build`. It builds the container based on the boilerplate code and the code we put in it. The Docker image is built. Now we're ready to train the model on the cloud. It's really easy: `sagify cloud train`. Specify where on S3 the data lives, so the sagify-demo bucket. Specify where to save the model and the report, in the same bucket but under `output`. Specify the instance type, and we'll use `ml.m5.large`. It's probably too big for this task. Do you support Spot Instances? They are supported in SageMaker, but does the tool allow for that? Not at the moment. Pull request, please. That would be really nice. It would save me a lot of money. Here it says no IAM role provided. It's okay; it uses the profile we chose in the beginning, the ML hero one. If you've changed profiles, you can pass in a different one here. It's completely dynamic.

We get the logs here, so it's starting the training job. If we go to the AWS console and into SageMaker, we should see the training job. Hopefully, we see that there is one training job in progress. The training has completed, and the logs in red are from our training logic. The training is successful, and we can see the S3 location. If you know SageMaker, you know this: we get the model artifact as a gzipped tar file.

Now, we need to deploy that model as a RESTful endpoint. It's really simple: `sagify cloud deploy`. Specify the S3 model location. Let's use a small instance, `ml.t2.medium`. The number of instances, let's use one. We can give the endpoint a name; let's call it `my-awesome-endpoint`. The endpoint is deployed; we can see that from the logs. We can quickly go to the console and see `my-awesome-endpoint`. It's in service. Nice.

Now, we can go to our favorite tool, Postman, and send the features. We pretend we drove somewhere, found an iris flower, measured those four dimensions, and want to know what type of iris it is. Let's go. We'll call the endpoint, and here you are: it's a setosa. So simple. An HTTP POST to the endpoint.

Three commands, and you've trained and deployed the model. The good thing is that you can use anything you want. It's not tied to scikit-learn; you can install TensorFlow, MXNet, PyTorch. It's open to you. To sum things up, the typical workflow would be: experiment in Jupyter, come up with a first model that you train locally or on a notebook instance, debug your code, then create your Sagify project, use the scaffolding code to inject your own code, and then `sagify cloud train` and `sagify cloud deploy`. You can use local mode too if you want. It also supports batch transform and hyperparameter optimization. If you find the SageMaker SDK a bit too low-level or verbose, or if you just don't want to work with the SageMaker SDK directly, this tool is great. It couldn't be simpler, honestly. Exactly. A super simple CLI for SageMaker.
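As an addendum to the Postman step above, here is a hedged sketch of calling the deployed endpoint from Python with boto3's SageMaker runtime API instead. The endpoint name matches the one given at deployment time, but the JSON payload shape is purely an assumption; send whatever format the predictor module generated in your Sagify project actually parses.

```python
import boto3

# SageMaker runtime client in the region used for the demo.
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

# The four iris measurements we "collected in the field".
# The JSON shape is illustrative; match your project's predict logic.
payload = '{"features": [[5.1, 3.5, 1.4, 0.2]]}'

response = runtime.invoke_endpoint(
    EndpointName="my-awesome-endpoint",  # name given at deployment time
    ContentType="application/json",
    Body=payload,
)

# Prints the prediction returned by the endpoint, e.g. the class "setosa".
print(response["Body"].read().decode("utf-8"))
```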

Tags

SageMaker, Machine Learning, Python, CLI, Deployment