Hi everybody. Welcome to this new episode of SageMaker Fridays. My name is Julien, and I'm a principal developer advocate focusing on AI and machine learning. Once again, please meet my co-presenter. Hi everyone. My name is Ségolène, and I'm a senior data scientist working with the AWS Machine Learning Solutions Lab. My role is to help customers get their ML projects on the right track to create business value as fast as possible. All right. Thanks again for being with us. So where are we in this season? We are at the second episode of our automation journey. Last week, we started discussing model deployment and model monitoring, and this week we are going to dive into pipelines. The whole episode is dedicated to building an end-to-end pipeline. If that's your thing, stay around. Of course, we have more episodes coming in the next few weeks on automation, and then we'll dive into AutoML. It should be fun.
So, Ségolène, we have moderators with us. If you have questions, please ask away: we have very friendly and expert colleagues waiting to answer them. Thanks for joining us; we appreciate it. All your questions are welcome, so ask as many as you need to learn as much as you can. Now that we've got this covered, we can talk about the content for this week. Like last week, we are revisiting an example that we covered earlier in the season, but this time we're going to focus on automation. Ségolène, can you give us a quick recap of this use case?
Of course, Julien. This week, we are going to work on the fraud detection use case. In episode two of this new season of SageMaker Fridays, we trained a model to figure out if an auto insurance claim is fraudulent or not. We also analyzed bias in the dataset and applied mitigation techniques to build a better model. If you haven't watched episode 2, it covers those bias mitigation techniques, which are very cool. Now, in this episode, we are going to focus on automation again. We are going to build an end-to-end pipeline that includes all the steps from episode 2. We will automate everything, from data prep and the feature store to training, bias analysis, and deployment, getting rid of all those complicated notebooks we saw earlier. That's good news, right? Keep it simple.
Here is the notebook or the set of notebooks we're actually going to use. Screenshot time. As usual, Ségolène will remind me to show this before we end the episode. Don't worry if you didn't catch it. That's about as many slides as you will see today. Now let's jump into our example.
So, just a quick recap in case you didn't see that episode. What did we do there? We ran through notebooks 1, 2, and 3, right? Exactly. We prepared data with SageMaker Data Wrangler, trained the model, worked on bias, and so on. Today we are picking up at notebook 5. If you're following along, that means we did not run notebook 4, because last week's episode focused on deploying endpoints, model monitoring, and all that good stuff. If you're interested specifically in that, go and watch last week's episode. Today, we're going straight to full automation, but we'll still discuss endpoints a little bit. Don't worry, you won't be confused. That's where we're starting. If you haven't gone through those notebooks before, you can find lots of information in the first one. Our goal is to automate all of it today.
Maybe just a quick word about the dataset. What does it look like? And what kind of model did you train? For this example, we have two datasets. The goal of this episode is to see if an insurance claim is fraudulent or not, so we need data about the claims and data about the customers. We have two datasets: one for claims and one for customers, and we are going to do binary classification with XGBoost to determine if the claim is fraudulent or not. We have a CSV file describing the claims, with a label indicating if it's a fraud (yes or no). It has a column named policy ID, which we also find in the customer dataset. This is the key, the join column. In the customer dataset we see customer age, how long they've been a customer, zip code, education, and so on. Simple CSV datasets. There's a bit of preparation we can do with SageMaker Data Wrangler, but we also provide pre-processed data, so if you want to jump straight to that, you can.
Once we've processed the data, what do we do next? We store our data in SageMaker Feature Store. One feature group for claims, one feature group for customers. We store the transformed dataset in SageMaker Feature Store, and each one goes into its own feature group. We use Athena to query the offline feature store and join both datasets with the policy ID. As usual, we split the datasets for training and validation. Finally, we upload the files to S3 and train the model using XGBoost for binary classification. We'll show you the scripts and the Python code we inject into the pipeline. You'll see how to create a dataset and train the model. We'll look at script mode and some features we covered earlier in the season.
Now let's dive into pipelining. The purpose is simple: take every single step we just discussed and define it as a pipeline step with inputs, outputs, and connect them all. It's like Legos. The outputs of one step become the input of the next. Once we have those steps as a pipeline, we just run the pipeline. We'll see a nice visual representation of the pipeline in Studio, including executions and logs for each step. We might show some failed executions for fun. As promised last week, we'll also show the pipeline execution for the music recommendation example to give you another perspective.
Here, we're going to work in a single notebook; it all fits in one notebook. High-level overview: define each step, connect them implicitly or explicitly, and then execute the pipeline to see things happening in Studio. First, we load some variables from previous notebooks and import objects from the SageMaker SDK, including all the pipelining and step objects. We need a few technical objects to call the SageMaker APIs, and potentially Boto3 APIs as well. We have S3 output paths to store bias reports, training sets, and artifacts, and instance types to parameterize the pipeline. We can set parameters, such as instance types for processing steps.
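To give you an idea of what that setup looks like, here is a minimal sketch of an opening cell, assuming the SageMaker Python SDK v2; the variable names and the S3 prefix are illustrative, not copied from the actual notebook.

```python
import sagemaker
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

# Session, role and default bucket used by every step in the pipeline.
session = sagemaker.session.Session()
region = session.boto_region_name
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
prefix = "fraud-detect-demo"  # illustrative S3 prefix for all outputs
```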
This is the whole workflow, and we're going to work on each part. First, pipeline parameters. Why do we need them? You could use hard-coded values, but you want the pipelines to be reasonably generic. By reasonably generic, I mean there's a balance. Trying to be too generic can lead to nonsense with 500 parameters for stuff that should be hard-coded. Be reasonable, like with AWS CloudFormation for infrastructure as code. We have just the instance type and the model approval status. We'll come back to this later. The status is set to pending manual approval, meaning it's not approved yet. Someone will need to manually review the model and approve it for deployment.
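Here is a minimal sketch of what those two pipeline parameters can look like; the parameter names and default instance type are assumptions, not values from the notebook.

```python
from sagemaker.workflow.parameters import ParameterString

# Instance type used by the training step; can be overridden for each execution.
train_instance_param = ParameterString(
    name="TrainingInstance",
    default_value="ml.m5.xlarge",
)

# New model versions are registered as "PendingManualApproval" by default, so a
# human has to review and approve them before any deployment happens.
model_approval_status = ParameterString(
    name="ModelApprovalStatus",
    default_value="PendingManualApproval",
)
```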
The first step is the Data Wrangler step, the preprocessing step. If you remember the past episode, how did we run this programmatically? We're running a SageMaker Processing job and passing the actual flow definition of the preprocessing flow. There's a flow for claims processing and a flow for customer processing. These are JSON files listing all the transforms you manually created in Data Wrangler. We upload the flow file to S3 and define it as an input for the processing step using a SageMaker Processing `ProcessingInput` object. We also need to find the node in the flow where we want to output data; we can only grab the output after certain transforms, not all of them. There's code to find this node ID, which is the name of the node in the data prep flow where we want the output. So the inputs are the flow file and the node output. We store the transformed data in SageMaker Feature Store, in a defined feature group. This is the first step. We do the same for customers: upload the flow file to S3, define it as an input, find the correct node, and store the output in another feature group. These feature groups are pre-existing and can be reused with versioning.
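As an illustration, here is roughly what the claims step can look like; the Data Wrangler image URI, the output node ID, and the feature group name are placeholders for values computed earlier in the notebook, and the exact classes may vary with SDK versions.

```python
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput, FeatureStoreOutput
from sagemaker.workflow.steps import ProcessingStep

# Processor built on the Data Wrangler container image for the region
# (data_wrangler_image_uri is an assumption).
claims_flow_processor = Processor(
    role=role,
    image_uri=data_wrangler_image_uri,
    instance_count=1,
    instance_type="ml.m5.4xlarge",
)

claims_flow_step = ProcessingStep(
    name="ClaimsDataWranglerProcessingStep",
    processor=claims_flow_processor,
    inputs=[
        # The .flow file itself, uploaded to S3 beforehand,
        # plus (not shown) a ProcessingInput for the raw claims CSV it references.
        ProcessingInput(
            input_name="flow",
            source=f"s3://{bucket}/{prefix}/dataprep/claims.flow",
            destination="/opt/ml/processing/flow",
        ),
    ],
    outputs=[
        # Output of the chosen node in the flow, written straight to the feature group.
        ProcessingOutput(
            output_name=f"{claims_output_node_id}.default",
            app_managed=True,
            feature_store_output=FeatureStoreOutput(
                feature_group_name=claims_feature_group_name
            ),
        )
    ],
)
# A twin step, customers_flow_step, is defined the same way for the customer flow.
```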
Now we have processed data for customers and claims. The next step is to create the dataset by running an Athena query, joining the datasets on policy ID, and saving the training and validation sets. We do this with a Python script, which we upload to S3. We create our compute using an SKLearn processor, define the framework version, instance requirements, and name. We pass input data as command line arguments, such as the Athena table names and feature group names. We have two outputs: the training dataset and the validation dataset. We pass the location of the code. The outputs of a step become the input of the next step. We use the `depends_on` parameter to explicitly chain steps. This creates the dependencies between steps.
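A hedged sketch of that dataset-creation step follows; the script name `create_dataset.py`, the command-line argument names, and the table variables are illustrative assumptions.

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

create_dataset_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    base_job_name="fraud-detect-create-dataset",
)

create_dataset_step = ProcessingStep(
    name="CreateDataset",
    processor=create_dataset_processor,
    code="create_dataset.py",  # uploaded to S3 by the SDK
    job_arguments=[
        "--claims-table", claims_table_name,        # Athena table names from the
        "--customers-table", customers_table_name,  # offline feature store (assumed)
        "--athena-database-name", athena_database_name,
        "--bucket-name", bucket,
        "--region", region,
    ],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/output/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/output/test"),
    ],
    # Explicit chaining: wait for both Data Wrangler steps to finish first.
    depends_on=[claims_flow_step.name, customers_flow_step.name],
)
```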
Let's take a quick look at the script. We grab command-line arguments, build paths, wait for the feature store to be ready, create an Athena client, and define the query string to join the claims and customers tables. We start the query execution, get the results, and download the CSV file. We split the data 80-20 for training and validation and save it to a well-known location inside the container. SageMaker automatically copies this to the S3 output path. This is generic code for feature groups and Athena that you can use in your projects.
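For reference, here is a condensed sketch of what such a script can look like; column and table names are assumptions, and the real script also waits for the offline feature store to be ready and handles query failures.

```python
# create_dataset.py - simplified sketch of the dataset-creation script
import argparse
import time
import boto3
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--claims-table", type=str)
parser.add_argument("--customers-table", type=str)
parser.add_argument("--athena-database-name", type=str)
parser.add_argument("--bucket-name", type=str)
parser.add_argument("--region", type=str)
args = parser.parse_args()

athena = boto3.client("athena", region_name=args.region)
s3 = boto3.client("s3", region_name=args.region)

# Join claims and customers on the policy ID.
query = f"""
SELECT * FROM "{args.claims_table}" c
LEFT JOIN "{args.customers_table}" cu ON c.policy_id = cu.policy_id
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": args.athena_database_name},
    ResultConfiguration={"OutputLocation": f"s3://{args.bucket_name}/athena/results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query completes.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

# Athena writes its result as a CSV named after the query execution id.
s3.download_file(args.bucket_name, f"athena/results/{query_id}.csv", "dataset.csv")
df = pd.read_csv("dataset.csv")

# 80/20 split, saved to the well-known paths that SageMaker Processing uploads to S3.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
train.to_csv("/opt/ml/processing/output/train/train.csv", index=False, header=False)
test.to_csv("/opt/ml/processing/output/test/test.csv", index=False, header=False)
```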
The next step is to train the model using the XGBoost estimator. We set hyperparameters, specify the S3 location for results, and define infrastructure requirements. The training step receives the estimator and inputs, which are the S3 locations of the training and validation sets. We reuse the output from the previous step to chain the steps implicitly.
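Here is a sketch of that training step, assuming the built-in XGBoost container; the container version, hyperparameters, and output path are illustrative.

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

# Built-in XGBoost container for the region (version is illustrative).
xgb_image = sagemaker.image_uris.retrieve("xgboost", region, version="1.2-1")

xgb_estimator = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_type=train_instance_param,  # pipeline parameter defined earlier
    instance_count=1,
    output_path=f"s3://{bucket}/{prefix}/training-output",
    hyperparameters={"objective": "binary:logistic", "num_round": 100, "max_depth": 5},
)

# Implicit chaining: the dataset locations are properties of the previous step.
train_s3_uri = create_dataset_step.properties.ProcessingOutputConfig.Outputs["train_data"].S3Output.S3Uri
test_s3_uri = create_dataset_step.properties.ProcessingOutputConfig.Outputs["test_data"].S3Output.S3Uri

train_step = TrainingStep(
    name="XgboostTrain",
    estimator=xgb_estimator,
    inputs={
        "train": TrainingInput(s3_data=train_s3_uri, content_type="text/csv"),
        "validation": TrainingInput(s3_data=test_s3_uri, content_type="text/csv"),
    },
)
```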
There are two ways to proceed with deployment. In a dev/test environment, we can create a model object and deploy it. This step creates a SageMaker model that can be deployed. In a production setting, we register the trained model in the model registry, a proper entity in SageMaker. The approval status parameter comes into play here. When you register a model, it's in the registry, and someone in the ops team can run tests and switch the status to approved for deployment. This can be done using a Python script or CloudFormation template.
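For the dev/test path, the model creation step can look like this sketch, where the container image and model artifacts come from the training step above; the step and model names are assumptions.

```python
from sagemaker.model import Model
from sagemaker.inputs import CreateModelInput
from sagemaker.workflow.steps import CreateModelStep

model = Model(
    image_uri=xgb_image,  # same XGBoost container used for training
    model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
    sagemaker_session=session,
)

create_model_step = CreateModelStep(
    name="FraudDetectCreateModel",
    model=model,
    inputs=CreateModelInput(instance_type="ml.m5.xlarge"),
)
```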
We also run bias metrics using the Clarify processor. We run pre-training bias analysis on the dataset, not the trained model. We can do post-training bias analysis as well. We run this analysis on customer gender, specifically for female drivers, who are underrepresented in the dataset. We run the processing step with inputs and outputs, where the dataset comes from the output of the training step.
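As a sketch of what the Clarify configuration involves, here is the standalone version of a pre-training bias analysis; the label and facet column names are assumptions, and in the pipeline the same analysis runs as a processing step whose dataset input is wired to the create-dataset step's output.

```python
from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    sagemaker_session=session,
)

bias_data_config = clarify.DataConfig(
    s3_data_input_path=train_data_uri,  # training set produced by the create-dataset step
    s3_output_path=f"s3://{bucket}/{prefix}/clarify-output/bias",
    label="fraud",                      # assumed label column name
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[0],        # assumed label value of interest
    facet_name="customer_gender_female",  # assumed facet column for female customers
    facet_values_or_threshold=[1],
)

clarify_processor.run_pre_training_bias(
    data_config=bias_data_config,
    data_bias_config=bias_config,
)
```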
We then register the model in the model registry. This creates a model package with the model, content types, response types, instance types, and approval status. We can also specify model metrics. This is more powerful than just creating a model, which only makes the model visible in SageMaker for deployment.
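A sketch of the registration step is shown below; the model package group name and the bias report location are assumptions, and the approval status comes from the pipeline parameter defined earlier.

```python
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.model_metrics import ModelMetrics, MetricsSource

# Optional metrics attached to the model package (report location is illustrative).
model_metrics = ModelMetrics(
    bias=MetricsSource(
        s3_uri=f"s3://{bucket}/{prefix}/clarify-output/bias/analysis.json",
        content_type="application/json",
    ),
)

register_step = RegisterModel(
    name="FraudDetectRegisterModel",
    estimator=xgb_estimator,
    model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name="fraud-detect-xgboost",  # assumed group name
    approval_status=model_approval_status,            # PendingManualApproval by default
    model_metrics=model_metrics,
)
```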
We deploy the model the old-fashioned way using a script because there isn't a deploy step in the pipeline. We use the Boto3 API to create an endpoint config and an endpoint. This is fine for testing, but for production, we should use the model registry.
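A hedged sketch of that deployment script, using plain Boto3 calls to stand up an endpoint from the model created in the pipeline; the resource names are illustrative.

```python
import boto3

sm = boto3.client("sagemaker")

model_name = "fraud-detect-model"  # the model created by the CreateModel step
endpoint_config_name = "fraud-detect-endpoint-config"
endpoint_name = "fraud-detect-endpoint"

# One endpoint config with a single production variant.
sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
)

sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

# Wait until the endpoint is InService before sending test traffic.
waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
```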
Now we have defined all the steps. We combine them into a pipeline with a name, parameters, and steps. SageMaker figures out the dependencies, but it's better to list them in a logical order. We upsert the pipeline definition, view it, and start it. It runs for a while, and we see the execution in Studio. We can click on any step to see inputs, outputs, logs, and lineage information.
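Putting it together, the final cell looks roughly like this sketch; the step variables come from the earlier sketches, and `bias_step` and `deploy_step` stand in for the Clarify processing step and the script-based deployment step.

```python
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="FraudDetectDemoPipeline",
    parameters=[train_instance_param, model_approval_status],
    steps=[
        claims_flow_step,
        customers_flow_step,
        create_dataset_step,
        train_step,
        bias_step,
        create_model_step,
        register_step,
        deploy_step,
    ],
)

# Upsert creates the pipeline, or updates it if it already exists, then starts a run.
pipeline.upsert(role_arn=role)
execution = pipeline.start()
execution.wait()
print(execution.list_steps())
```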
The model registry shows different versions of the model with their status. We can approve a model, and downstream logic can be triggered to deploy it. We can use APIs to query the model registry and build manual or automated deployment processes.
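For example, a simple approval script can query the registry and flip the status with Boto3; the model package group name is an assumption matching the registration sketch above.

```python
import boto3

sm = boto3.client("sagemaker")
group_name = "fraud-detect-xgboost"  # assumed model package group name

# List the versions in the group, newest first.
packages = sm.list_model_packages(
    ModelPackageGroupName=group_name,
    SortBy="CreationTime",
    SortOrder="Descending",
)["ModelPackageSummaryList"]

latest_arn = packages[0]["ModelPackageArn"]

# After reviewing the metrics, approve the model; downstream deployment logic
# (manual or automated) can key off this status change.
sm.update_model_package(
    ModelPackageArn=latest_arn,
    ModelApprovalStatus="Approved",
)
```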
This is how you build workflows using the SageMaker SDK to combine steps, chain them, run pipelines, track executions, and debug. The model registry stores different versions, and you can approve models for deployment in different accounts. That's it for this episode. Here's the notebook we used. Go and grab it, start running stuff, and have some fun. Ségolène, thanks again. I hope you learned a lot today, and we'll see you next week with more. Bye-bye.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.