Introducing Amazon SageMaker Pipelines (AWS re:Invent 2020)

December 02, 2020
In this video, I give you a quick tour of Amazon SageMaker Pipelines, a new capability to build and execute fully automated end-to-end machine learning workflows. This video covers both the data science angle and the MLOps angle.
https://aws.amazon.com/sagemaker/pipelines/
https://aws.amazon.com/blogs/aws/amazon-sagemaker-pipelines-brings-devops-to-machine-learning-projects/
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future episodes ⭐️⭐️⭐️
For more content:
* AWS blog: https://aws.amazon.com/blogs/aws/author/julsimon/
* Medium blog: https://julsimon.medium.com/
* YouTube: https://youtube.com/juliensimonfr
* Podcast: http://julsimon.buzzsprout.com
* Twitter: https://twitter.com/@julsimon

Transcript

Hi everybody, this is Julien from Arcee. In this video, I'd like to introduce you to another SageMaker capability that was launched yesterday at re:Invent: SageMaker Pipelines. SageMaker Pipelines is integrated with SageMaker and SageMaker Studio, as you would expect. I'm going to walk you through the creation and deployment of an end-to-end machine learning pipeline, from preprocessing to training to deployment. This capability should be interesting to data scientists and machine learning engineers as well as to ops teams. It makes it easy to create your machine learning pipelines in Studio, where you can easily automate and replicate your sequence of steps. It also makes it easy for ops teams to guarantee that only approved models are deployed in production, with the right level of automation. In this video, I'm going to wear those two hats, data science and ops, and I'll make sure to point out when I'm playing each one of those roles. So let's get started. In Studio, you can find the pipelines here, of course, and you can also see them in the launcher. If we click on New Project, this should work for us. Let me zoom in a bit. The first thing we immediately notice is that we have templates: built-in templates for model deployment, for model building and training, and for model building, training, and deployment. These are built in, and you can also add your own. Here I don't have any, but you can write your own templates and make them available to your teams. Let's go and select the building, training, and deployment template. I'm going to do this one in real time; I'll pause to skip waiting times, but let's try and do this end to end. So I'm selecting a template, let's give it a name, all right, create a project. Now it's creating the project. If you're a data scientist, you just have to wait; it's completely automated, and after a few minutes, we will see a new project available in Studio. After a few minutes, the project is ready, and we can see some of the artifacts. We see two repositories, the CodeCommit repos here. Let's clone them; we'll need them locally to inspect the code. There's one for model building and one for model deployment. The rationale is that the model building part is the data science part, where you experiment and train again and again. Of course, you want automation there. At some point, you're going to train a model, it's going to get published, and there needs to be some quality gate before it makes it to production. That's why we have a second repository with the artifacts to actually deploy the model in production, and you would potentially have different permissions on those Git repos, one for data science teams and one for ops teams. It makes sense to have those two things in different repos. So we have those two repos; we'll take a look in a minute. We also see the pipelines, so let's see what's going on here. We can see the pipeline is already executing automatically, so let's come back to this later on. We see automation at work. So what is actually going on here? Let me show you the two repositories. First, let's look at the model building repository. This contains all the artifacts needed to build your machine learning pipeline with the different steps, run them automatically, and register the trained model once your training process is done. Then you can deploy it. Here, I will deploy it in this account for testing or staging, and then we would need manual approval to deploy it to production.
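For readers who prefer the SDK to the Studio UI, the pipeline that the project kicked off can also be inspected with boto3. This is a minimal sketch; the pipeline name is an assumption, since the template derives it from the project name.

```python
# Minimal sketch: inspect the pipeline created by the project with boto3.
# The pipeline name below is a placeholder; check list_pipelines() for the real one.
import boto3

sm = boto3.client("sagemaker")

# List the pipelines in the account and find the one created by the project.
for p in sm.list_pipelines()["PipelineSummaries"]:
    print(p["PipelineName"], p["PipelineArn"])

# List the executions of a given pipeline (name assumed for illustration).
executions = sm.list_pipeline_executions(PipelineName="myPipelinesDemo-pipeline")
for e in executions["PipelineExecutionSummaries"]:
    print(e["PipelineExecutionArn"], e["PipelineExecutionStatus"])
```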
But let's look at the data science part first. We have a sample notebook here that shows us the different steps required to build this pipeline, and we can see all the scaffolding code and the artifacts that are available. The sample project included here is a regression model based on the Abalone dataset. The abalone is a kind of shellfish, and we try to predict the age of individual shellfish from their physical dimensions. It's a toy example, but it's enough for our purpose. The rest is built-in code that you probably don't need to look at. So let's see what the pipeline looks like. You need to go to pipelines/abalone/pipeline.py. This is the code you would write to build the sequence of steps in your machine learning pipeline, the one you want to follow and automate. It's based on a pretty simple Python SDK; if you have worked with scikit-learn pipelines or Spark pipelines before, it's very similar. You define each individual step, chain them into a pipeline, and then you can pass parameters to the pipeline and execute it. Let's take a look at the different steps. First, we have a processing step: preprocessing the dataset, maybe normalizing values, etc. Here's the actual step, and we use the SKLearnProcessor object from SageMaker Processing, so the first step runs a scikit-learn script with SageMaker Processing. Then we have a model training step using XGBoost, and we can see that we use the normal SageMaker estimator, hyperparameters, etc., and then add a training step to the pipeline. The third step is another processing step, where we use a Python script to run model evaluation: we build a ScriptProcessor object from SageMaker Processing and add it as a pipeline step. Then we have a conditional step, which we can see here: we have the ability to define conditions. If the model accuracy, in this case the mean squared error, is lower than a threshold, we register the model in the SageMaker model registry. If the model has a higher error, we consider it not accurate enough, we don't register it, and we don't go forward. So we define the condition and the registration step, and we put everything together: processing the dataset, training the model, evaluating the model, and a conditional step leading to model registration if the model is accurate enough. As you can see, this is really simple; if you're used to working with scikit-learn pipelines, this is very, very similar. This is the machine learning pipeline we want to run. We can also see the preprocessing script and the evaluation script, which are vanilla code, just adapted to run on SageMaker Processing, nothing complicated. This is what you would need to write and provide in your project. Once you have that, you can start running your workflow. If we go to our pipeline here, we can see it's executing; it started automatically when we created the project. We can actually inspect it. It's all green, so that's good news. We can see each of the steps, training metrics, and validation metrics. The condition checks that the validation error is less than 6, and it looks like we passed: the conditional step evaluated to true, so we registered the model. The mean squared error was 5.47, which is under the threshold we defined. The pipeline completed successfully and the model was registered, but it wasn't deployed.
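To make the structure concrete, here is a heavily condensed sketch of what pipelines/abalone/pipeline.py contains, written against the SageMaker Python SDK as it was at launch (newer versions moved JsonGet to sagemaker.workflow.functions). The bucket, script names, instance types, and model package group name are placeholders, and the real template defines more parameters, an evaluation report attached as model metrics, and more configuration; this is a sketch, not the template itself.

```python
# Condensed sketch of a SageMaker Pipelines definition in the spirit of the abalone template.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.condition_step import ConditionStep, JsonGet
from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

session = sagemaker.session.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# Pipeline parameters: input data location and default approval status.
input_data = ParameterString(name="InputDataUrl",
                             default_value=f"s3://{bucket}/abalone/abalone.csv")
model_approval_status = ParameterString(name="ModelApprovalStatus",
                                        default_value="PendingManualApproval")

# Step 1: preprocess the dataset with a scikit-learn script on SageMaker Processing.
sklearn_processor = SKLearnProcessor(framework_version="0.23-1", role=role,
                                     instance_type="ml.m5.xlarge", instance_count=1)
step_process = ProcessingStep(
    name="PreprocessAbaloneData", processor=sklearn_processor,
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
             ProcessingOutput(output_name="test", source="/opt/ml/processing/test")],
    code="preprocess.py")

# Step 2: train an XGBoost regressor on the preprocessed data.
image_uri = sagemaker.image_uris.retrieve("xgboost", region=session.boto_region_name,
                                          version="1.0-1")
xgb = Estimator(image_uri=image_uri, role=role, instance_count=1,
                instance_type="ml.m5.xlarge",
                output_path=f"s3://{bucket}/abalone/models")
xgb.set_hyperparameters(objective="reg:linear", num_round=50)
step_train = TrainingStep(
    name="TrainAbaloneModel", estimator=xgb,
    inputs={"train": TrainingInput(
        s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
        content_type="text/csv")})

# Step 3: evaluate the trained model with a Python script, writing metrics to a property file.
evaluation_report = PropertyFile(name="EvaluationReport", output_name="evaluation",
                                 path="evaluation.json")
script_processor = ScriptProcessor(image_uri=image_uri, command=["python3"], role=role,
                                   instance_type="ml.m5.xlarge", instance_count=1)
step_eval = ProcessingStep(
    name="EvaluateAbaloneModel", processor=script_processor,
    inputs=[ProcessingInput(source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
                            destination="/opt/ml/processing/model"),
            ProcessingInput(source=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
                            destination="/opt/ml/processing/test")],
    outputs=[ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation")],
    code="evaluate.py", property_files=[evaluation_report])

# Step 4: register the model only if the MSE reported by the evaluation step is low enough.
step_register = RegisterModel(
    name="RegisterAbaloneModel", estimator=xgb,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"], response_types=["text/csv"],
    inference_instances=["ml.t2.medium", "ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="AbaloneModelPackageGroup",
    approval_status=model_approval_status)
step_cond = ConditionStep(
    name="CheckMSEAbaloneEvaluation",
    conditions=[ConditionLessThanOrEqualTo(
        left=JsonGet(step=step_eval, property_file=evaluation_report,
                     json_path="regression_metrics.mse.value"),
        right=6.0)],
    if_steps=[step_register], else_steps=[])

# Chain everything into a pipeline, create or update it, and start an execution.
pipeline = Pipeline(name="AbalonePipeline",
                    parameters=[input_data, model_approval_status],
                    steps=[step_process, step_train, step_eval, step_cond],
                    sagemaker_session=session)
pipeline.upsert(role_arn=role)
pipeline.start()
```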
One of the parameters of this pipeline is called ModelApprovalStatus, and it's set to PendingManualApproval. Let's find this in the code just to make sure we understand what it does. Going back to the model build repo, in pipelines/abalone/pipeline.py, if we find the model registration step, we see an approval status parameter. It is set to ModelApprovalStatus, which is one of the parameters passed to the workflow. If you pass PendingManualApproval, the default value here, then the model is registered but not deployed. So how do we deploy it? Of course, we want to test it. Starting back from the project here, myPipelinesDemo, if we go to model groups, we can see the models associated with this pipeline. There's only one, because we only trained once. If we click on it, it says the status was updated to pending manual approval, which is exactly what our Python pipeline did. I can update the status and say yes, I would like this model to be deployed to staging. Now that the model has been approved, it's going to be deployed automatically. Wearing my ops hat: once the model status has been updated to approved, deployment kicks off automatically, and we just have to wait a few minutes to see the endpoint in Studio. Still wearing that ops hat, we can see in CloudFormation that a new template is executing and creating the endpoint. This is part of the second repository, which holds all the deployment artifacts. We can see a deployment CloudFormation template, and it uses either a staging configuration or a prod configuration. The staging environment is simply called 'staging' and deploys on an M4 instance; you could add extra parameters. This repo would be controlled by the ops team, because they're in charge of deployment and production infrastructure. It's a collaboration, but they would want to know what's going on here. If we go back to projects and open this one, we've got our repositories, pipelines, models, and endpoints. After a few minutes, the endpoint is in service, and I can use it for testing. This is just a normal SageMaker endpoint, so I can use my testing script, send predictions, and so on. So that's fine: our pipeline ran end to end, from data processing to training to model evaluation to model registration. We had this approval step because we wanted a first quality gate, to avoid deploying models that are not good enough, but the gate was still within our control. We could run our testing scripts on the model and say, okay, it's good for staging deployment: click, set the model to approved, and it gets deployed. Now I would run more tests, and at some point I would want to deploy this to production, for real traffic. This is where you need proper approval from ops, or whoever is giving you the green light to deploy. You cannot do this in Studio, which is fine; there needs to be another quality gate where someone says, OK, go for it. Whatever communication medium you use, maybe you ping somebody on Slack, maybe you create a Jira ticket, maybe you have to do something else, it's up to you: you can define your own process. But what needs to happen now is that someone with the appropriate AWS permissions manually approves the model for deployment to prod. Here, for simplicity, I'm doing it in the same account, but in real life this would certainly happen in different accounts. You probably have dev accounts, staging accounts, and prod accounts, and that's okay, because this service actually supports cross-account deployment.
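The staging approval above is done in the Studio UI, but the same status change can be scripted with the AWS SDK, and once the endpoint is in service you can test it the usual way. Here is a hedged sketch with boto3; the model package group name, endpoint name, and the sample feature row are placeholders, not values taken from the project.

```python
# Sketch: approve the latest registered model, then test the staging endpoint with boto3.
# Model package group name, endpoint name, and the sample payload are placeholders.
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Find the latest model package in the group and flip it to Approved; the deployment
# pipeline reacts to this status change and creates or updates the staging endpoint.
packages = sm.list_model_packages(ModelPackageGroupName="AbaloneModelPackageGroup",
                                  SortBy="CreationTime", SortOrder="Descending")
latest_arn = packages["ModelPackageSummaryList"][0]["ModelPackageArn"]
sm.update_model_package(ModelPackageArn=latest_arn, ModelApprovalStatus="Approved")

# Once the endpoint is in service, send a CSV feature row for prediction
# (illustrative values, formatted the way the preprocessing script expects).
response = runtime.invoke_endpoint(
    EndpointName="myPipelinesDemo-staging",
    ContentType="text/csv",
    Body="0.455,0.365,0.095,0.514,0.2245,0.101,0.15,1,0,0")
print(response["Body"].read().decode())
```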
You don't need to copy models around or do anything complex to get the model to prod; just use cross-account deployment. In this workflow, by default, I have a manual approval step, so someone reviewing the testing reports and so on would say, okay, the model is good for production, and approve it. This unlocks the last stage of the CodePipeline pipeline, which is deploying to prod. It takes a few seconds to kick off. Here we go, and you can see it's in progress. Now we are deploying the model to prod. Again, I'm doing it in the same account, but we could, and should, do this in a different account. Be friendly to your ops team, because they're probably the ones who will run this. If I refresh, hooray, I can see the model being deployed to production. Again, we have to wait a few minutes for this to happen, and then you have your live endpoint in production. So congratulations, you've made it to the end of the pipeline. To sum things up, SageMaker Pipelines is for both data science teams and ops teams. Data science teams will use the Python SDK to create pipelines and automate their machine learning workflows in a very simple way; it's really similar to scikit-learn pipelines and other pipelines, and as you saw, the code is really easy. Then, collaborating with the ops team, you can work on model deployment, providing a simple CloudFormation template. You can see this one is really simple: it's all about the endpoint configuration and the endpoint. It's good practice to have different environments, staging and prod, so you can work with them and put all those pieces together. You can use the built-in project templates to start from; you don't need to write this stuff from scratch, and they're a good starting point for your own workflows. I think this is a really nice service. A lot of customers have been waiting for something like this: it gives data science teams flexibility, automation, and robustness. Just execute your machine learning pipelines automatically instead of running steps manually in notebooks, which is always error-prone. It also works for ops teams, because they know exactly what's going on and they can add approval gates wherever they need them. Hopefully, it's going to work for everybody. I hope this was useful. Again, happy to hear your feedback; feel free to get in touch and ask me questions. I'm always happy to learn how you're using these services. Well, that's it for this one. I think there will be more next week, but we'll see about that. This is definitely not the end of the SageMaker story for re:Invent 2020. I'll see you soon. Thank you.
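As a side note, the production approval in the video is done in the CodePipeline console; the same manual-approval action could also be completed with boto3. The following is a hedged sketch, with the pipeline name assumed rather than taken from the template (check the CodePipeline console for the actual names).

```python
# Sketch: approve the manual-approval action of the deployment pipeline with boto3.
# The pipeline name is an assumption; stage and action names are discovered at runtime.
import boto3

cp = boto3.client("codepipeline")
pipeline_name = "sagemaker-myPipelinesDemo-modeldeploy"  # placeholder

# Find any pending approval token and approve it; only manual-approval actions
# that are waiting for a decision expose a token in their latest execution.
state = cp.get_pipeline_state(name=pipeline_name)
for stage in state["stageStates"]:
    for action in stage["actionStates"]:
        latest = action.get("latestExecution", {})
        if "token" in latest:
            cp.put_approval_result(
                pipelineName=pipeline_name,
                stageName=stage["stageName"],
                actionName=action["actionName"],
                result={"summary": "Staging tests passed, approved for prod",
                        "status": "Approved"},
                token=latest["token"])
```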

Tags

SageMaker Pipelines, Machine Learning Automation, Data Science and Ops Collaboration, Model Deployment, AWS re:Invent 2020