So, automation, automation, automation—why is it so important? Good question. Automation has a lot of benefits. Obviously, it cuts down on manual work, allowing you to focus on more interesting and important tasks, such as understanding the business problem or exploring data. Less manual work also means fewer mistakes, especially when you automate a workflow, which will run the same every single time. Most importantly, automation standardizes your workflows, making it easier to test, document, and share things. Automation also helps you scale, especially when using on-demand infrastructure in the cloud. Automation and cloud infrastructure go hand-in-hand. There is an API for everything, and you can schedule and fire up as many jobs as you need, as often as you need. Finally, traceability, auditing, and maintenance are improved thanks to automation, as runs and logs can easily be stored in a well-known place. As you can see, automation has a lot to offer, which we would love to add to our ML workflows for data prep, model training, model deployment, and so on. Automation is central to any kind of workload and is beneficial for machine learning projects, even at a small scale. People often think, "I'm doing small-scale work or working with a small team. Why do I really need automation?" In fact, small teams need automation even more because they have fewer resources and no time to waste on manual tasks.
Let's zoom in on automation a little bit. In my experience, there are really two angles: the development angle and the operations angle. On the development side, ML teams need complete freedom in experimenting, training, and deploying. They should be as autonomous as possible in their development sandbox and free to use many different tools. If you're wearing the data science or machine learning engineering hat, you want to build workflows for data processing, model training, model evaluation, batch transform, and even local prediction in your account. Developers I meet use a mix of open-source and AWS tools. For open-source, popular options include MLflow, a project started by Databricks, and Apache Airflow, which is very popular for orchestrating machine learning workflows. On the AWS side, we have the Step Functions Data Science SDK and, more recently, the SageMaker Pipelines SDK, a Python SDK that makes automation pretty simple. These are just examples, and there are many other tools worth a look.
On the operations side, teams are responsible for the uptime, performance, scalability, and security of the production platform. This requires implementing rock-solid processes that guarantee the quality of all artifacts coming from the development teams, such as QA, continuous integration, and continuous deployment. These processes need to be automated, although some manual approval steps may be necessary, like approving models before deployment. There are many tools for managing production and infrastructure. If you're in DevOps, your focus is on building workflows to provision, manage, and scale infrastructure, such as deploying models, updating models in production, and scaling prediction infrastructure. Here too, there's a mix of open-source and AWS tools. Open-source tools include Terraform, a great infrastructure-as-code project, and Troposphere, a Python project that lets you code your CloudFormation templates. AWS options include CloudFormation, which uses templates for infrastructure as code, and CDK, a programmatic way to create templates using languages like JavaScript and Python. The MLOps feature of SageMaker Pipelines helps define deployment templates that integrate with the SageMaker environment.
Can we use only one tool to do everything? This is a good question, because data scientists usually work with one set of tools, and ops teams with another. This can make it difficult to move models across environments, for example, from the data science sandbox to a production environment. This is why customers asked us to build SageMaker Pipelines, a tool that all teams can collaborate on to facilitate and speed up ML workflows. Let's start looking at our pipeline. The steps we're going to add to the pipeline using the Pipelines SDK are feature engineering, ingesting the engineered features into SageMaker Feature Store, building the dataset using an Athena query on the offline store, training with BlazingText, creating the model in SageMaker, and registering the model in the SageMaker Model Registry. We will then use the SageMaker SDK and CloudFormation to deploy our models. So, end-to-end, we start from the dataset, perform feature engineering, ingest into the feature store, build the dataset, train, register, and deploy.
The methodology I use is this: first, I write a notebook to solve the problem, similar to what we did two weeks ago. Once the model works, I start automating each step individually with SageMaker Processing. This gets you close to building the pipeline. After that, you can take all the pieces and build your pipeline with the SageMaker Pipelines SDK. Eventually, I want to completely automate this, so I move all the code to a Python script or a Lambda function that is triggered or scheduled whenever the workflow needs to run. In short, the steps are: make it work, automate it piece by piece, put all the pieces together in a pipeline, and then run it automatically or on demand.
A SageMaker pipeline is a series of interconnected steps, described by a JSON pipeline definition that encodes a directed acyclic graph (DAG). The structure and dynamics of the pipeline are determined by the data dependencies between steps. When you create a pipeline instance, you define a name and the steps of the pipeline along with the parameters attached to these steps. First, I copy the datasets to S3 and define parameters for the pipeline, such as the region, the number of instances for SageMaker Processing jobs, the instance type for training, the default model approval status, and the S3 location for the input data. The preprocessing step uses a SageMaker Processing object, similar to what we did last week. The input of the ingestion step is the output of the preprocessing step, and the output of the ingestion step is the feature group name. The dataset is built using an Athena query on the offline store, and the input here is the output of the previous step. The training step uses the same code as before, with the inputs being the outputs from the Athena query. After training, I create the model in SageMaker and register it in the model registry.
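To make the idea of data dependencies concrete, here is a minimal sketch that uses only the Python standard library, not the SageMaker Pipelines SDK. The step names are illustrative labels for the steps described above, not identifiers from the actual notebook. Each step lists the steps whose outputs it consumes, and the run order falls out of those dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps whose outputs it consumes.
# The dictionary is deliberately written out of order.
# Step names are illustrative, not taken from the original notebook.
dependencies = {
    "register_model": {"create_model"},
    "train": {"build_dataset"},
    "preprocess": set(),
    "create_model": {"train"},
    "build_dataset": {"ingest_features"},
    "ingest_features": {"preprocess"},
}


def execution_order(deps):
    """Return a valid run order derived purely from data dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because this particular graph is a simple chain, there is only one valid order, and it matches the sequence of steps described above regardless of how the dictionary is written.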
The pipeline is defined with a name, parameters, and steps. The steps can be listed in any order because dependencies are managed by inputs and outputs. We run the pipeline by upserting it, which creates or updates the pipeline definition in SageMaker, and then starting an execution. We can pass parameters, such as the S3 location of the customer reviews dataset and the model name to register in SageMaker. In SageMaker Studio, we can see all the executions of the pipeline, including logs and outputs. If a step fails, the logs are useful for debugging, and you can iterate quickly without leaving Studio.
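As a rough sketch of starting a run with parameters, the function below calls the SageMaker API through boto3's `start_pipeline_execution`. It assumes the pipeline definition already exists in SageMaker. The pipeline name and parameter names in the usage comment are hypothetical, and the client is passed in so the function can be exercised without AWS credentials.

```python
def start_pipeline(sagemaker_client, pipeline_name, parameters):
    """Start one execution of an existing pipeline, overriding some parameters.

    sagemaker_client is expected to be a boto3 SageMaker client,
    i.e. boto3.client("sagemaker").
    """
    response = sagemaker_client.start_pipeline_execution(
        PipelineName=pipeline_name,
        # The API takes parameters as a list of name/value pairs with string values.
        PipelineParameters=[
            {"Name": name, "Value": str(value)}
            for name, value in parameters.items()
        ],
    )
    return response["PipelineExecutionArn"]


# Hypothetical usage (names are assumptions, not from the original notebook):
#   import boto3
#   arn = start_pipeline(
#       boto3.client("sagemaker"),
#       "my-blazingtext-pipeline",
#       {"InputDataUrl": "s3://my-bucket/reviews/", "ModelName": "reviews-classifier"},
#   )
```

The returned execution ARN is what you would look up in Studio, or pass to other API calls, to follow that specific run.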
The model registry is important because it lets you track and catalog your models. In SageMaker Studio, you can view model history, list and compare versions, and track metadata such as model evaluation metrics. You can define which versions may or may not be deployed in production. After creating a model version, you typically want to evaluate its performance before deploying it to a production endpoint. If it performs well, you update the approval status to approved. If not, you update it to rejected. When the status is set to approved, it triggers a CI/CD deployment for the model.
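Here is a minimal sketch of that approve-or-reject decision, using boto3's `update_model_package` call. The accuracy metric and the 0.9 threshold are assumptions chosen for illustration. In practice, you would apply whatever evaluation criteria fit your model.

```python
def review_model_version(sagemaker_client, model_package_arn, accuracy, threshold=0.9):
    """Set the approval status of a model version based on an evaluation metric.

    sagemaker_client is expected to be boto3.client("sagemaker").
    The metric and threshold are illustrative assumptions.
    """
    status = "Approved" if accuracy >= threshold else "Rejected"
    sagemaker_client.update_model_package(
        ModelPackageArn=model_package_arn,
        ModelApprovalStatus=status,
        # Keep a short trace of why the decision was made.
        ApprovalDescription=f"accuracy={accuracy:.3f}, threshold={threshold:.3f}",
    )
    return status
```

Recording the reason in the approval description keeps the decision visible next to the model version in the registry.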
To deploy models in a sandbox, we use the SageMaker SDK. We start from the model package ARN, find it in the model registry, and import it using the model package object. We then call deploy as usual, ensuring we use one of the instance types defined when the model was registered. We can deploy the model to an endpoint and predict with it.
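A sandbox deployment along those lines could look like the sketch below. It assumes version 2 of the SageMaker Python SDK and its `ModelPackage` class. The function name, the arguments, and the optional `model_cls` hook (there only so the logic can be tried without the SDK installed) are illustrative choices, not code from the original notebook.

```python
def deploy_from_registry(model_package_arn, role, instance_type, endpoint_name,
                         model_cls=None):
    """Deploy a registered model version to a real-time endpoint.

    By default this uses sagemaker.ModelPackage (SageMaker Python SDK v2).
    model_cls exists only to allow substituting a stand-in class.
    """
    if model_cls is None:
        from sagemaker import ModelPackage  # imported lazily; requires the SDK
        model_cls = ModelPackage

    model = model_cls(role=role, model_package_arn=model_package_arn)
    model.deploy(
        initial_instance_count=1,
        # Must be one of the instance types declared when the model was registered.
        instance_type=instance_type,
        endpoint_name=endpoint_name,
    )
    return endpoint_name
```

Once the endpoint is in service, you can send prediction requests to it by name.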
For lineage tracking, SageMaker Studio provides detailed information on each step, including the artifacts produced and the inputs that contributed to them. This is useful for tracing model lineage, improving model governance, and strengthening compliance posture. For example, you can see which dataset a model was trained on and the exact version of the code used.
For production, we need a more robust and traceable method than Jupyter notebooks. AWS CloudFormation is a good solution. CloudFormation is infrastructure as code: you write templates in JSON or YAML that describe your desired resources and their dependencies, and CloudFormation launches and configures them together as a stack. This is useful for managing resources throughout their lifecycle. You can deploy any kind of AWS resource, from EC2 instances to S3 buckets.
A CloudFormation template for deploying a model includes parameters like the model name, the path to the model artifact, the container used to train the model, the instance type, and the role. The resources section includes the SageMaker model, endpoint configuration, and endpoint. The output of the stack is the endpoint ID and name, which are created automatically. You can run the template using the CloudFormation console or programmatically using Python code. The events section in CloudFormation shows everything happening in real-time, and you can see the endpoint being created.
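The template described above can be sketched as a Python dictionary serialized to JSON. This is not the template used in the session. The logical resource names, parameter names, variant name, and default instance type are assumptions, while the resource types and properties are the standard CloudFormation ones for SageMaker.

```python
import json


def build_endpoint_template():
    """Build a CloudFormation template with a model, endpoint config, and endpoint."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "ModelName": {"Type": "String"},
            "ModelDataUrl": {"Type": "String"},    # S3 path to the model artifact
            "ContainerImage": {"Type": "String"},  # container image URI
            "InstanceType": {"Type": "String", "Default": "ml.m5.large"},
            "RoleArn": {"Type": "String"},
        },
        "Resources": {
            "Model": {
                "Type": "AWS::SageMaker::Model",
                "Properties": {
                    "ModelName": {"Ref": "ModelName"},
                    "ExecutionRoleArn": {"Ref": "RoleArn"},
                    "PrimaryContainer": {
                        "Image": {"Ref": "ContainerImage"},
                        "ModelDataUrl": {"Ref": "ModelDataUrl"},
                    },
                },
            },
            "EndpointConfig": {
                "Type": "AWS::SageMaker::EndpointConfig",
                "Properties": {
                    "ProductionVariants": [
                        {
                            "VariantName": "variant-1",
                            "ModelName": {"Fn::GetAtt": ["Model", "ModelName"]},
                            "InitialInstanceCount": 1,
                            "InstanceType": {"Ref": "InstanceType"},
                            "InitialVariantWeight": 1.0,
                        }
                    ]
                },
            },
            "Endpoint": {
                "Type": "AWS::SageMaker::Endpoint",
                "Properties": {
                    "EndpointConfigName": {
                        "Fn::GetAtt": ["EndpointConfig", "EndpointConfigName"]
                    }
                },
            },
        },
        "Outputs": {
            "EndpointId": {"Value": {"Ref": "Endpoint"}},
            "EndpointName": {"Value": {"Fn::GetAtt": ["Endpoint", "EndpointName"]}},
        },
    }


template_body = json.dumps(build_endpoint_template(), indent=2)
print(template_body)
```

To run it programmatically, one option is to pass `template_body` as the `TemplateBody` argument of boto3's CloudFormation `create_stack` call, together with a stack name and values for the parameters.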
For deploying new model versions, you can use blue/green deployment or canary deployment. Blue/green deployment involves duplicating the environment, testing the new model in that hidden environment, and then switching traffic to the new model using DNS. If issues arise, you can quickly roll back to the old environment. Canary deployment involves introducing the new model to a small percentage of traffic and monitoring its performance. If everything is fine, you gradually increase the traffic sent to the new model.
To implement canary deployment with CloudFormation, you update the stack with a new version of the template that adds the new model alongside the current one, each with its own weight. A change set lets you see what changes will be made before executing the update. This keeps risk low and allows you to monitor business KPIs and roll back if necessary.
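As a sketch of how the weights work, the helper below builds the two production variants that an updated endpoint configuration could contain. SageMaker splits traffic in proportion to each variant's weight divided by the sum of all weights. The variant names, model names, and default instance type are hypothetical.

```python
def canary_variants(current_model, new_model, canary_weight,
                    instance_type="ml.m5.large"):
    """Build production variants sending a fraction of traffic to the new model."""
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0 and 1")

    def variant(name, model_name, weight):
        return {
            "VariantName": name,
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
            "InitialVariantWeight": weight,
        }

    return [
        variant("current", current_model, 1.0 - canary_weight),
        variant("canary", new_model, canary_weight),
    ]


def traffic_share(variants):
    """Return the fraction of requests each variant receives."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


variants = canary_variants("model-v1", "model-v2", canary_weight=0.1)
print(traffic_share(variants))
```

Increasing `canary_weight` in successive stack updates shifts more traffic to the new model. Setting it back to 0 is the rollback.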
In summary, automation is not complicated. Using the SageMaker Pipelines SDK, you can build and deploy models manually in your account. For production, you can use CloudFormation templates to deploy in a safe, controlled, and repeatable way. This approach is simple and works for production, and there are more sophisticated methods with full CI/CD using SageMaker Pipelines and MLOps.
All the notebooks and resources are available on GitLab, and you can run everything we covered today. The SageMaker docs and a blog post on SageMaker pipelines provide more details. Thank you for watching, and we'll see you in two weeks. Keep rocking with machine learning. Bye-bye.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.