Mixing Spark with SageMaker?
This short post comes from a question asked by Manel Maragal (thanks!) on my YouTube channel. It’s a really good question and hopefully my answer doesn’t suck… so why not share both with everyone?
Here’s Manel’s question:
Along the lines of “Building your own algorithm container”, is it possible to run Spark code entirely (and in a distributed fashion) on SageMaker? What I get from the documentation is that I’m supposed to do ETL in my Spark cluster, and then, when fitting the model, use sagemaker_pyspark, which creates a SageMaker training job: it moves the DataFrame to S3 in protobuf format and trains on a new cluster of SageMaker instances.
The question is: if I already have my DataFrame loaded in my distributed cluster, why would I want to use SageMaker? I might as well use Spark ML, which has better algorithm support and avoids creating an additional cluster. Maybe I got the whole thing wrong…
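For context, here’s roughly what the flow Manel describes looks like with sagemaker_pyspark. This is a minimal sketch based on the documented K-Means example: the S3 paths, role ARN, feature dimension and instance settings are placeholders you’d replace with your own.

```python
from pyspark.sql import SparkSession
import sagemaker_pyspark
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

# Make the SageMaker Spark JARs visible to Spark.
classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", classpath)
         .getOrCreate())

# Placeholder: a DataFrame with "label" and "features" columns, e.g. from libsvm files.
training_data = (spark.read.format("libsvm")
                 .option("numFeatures", "784")
                 .load("s3://my-bucket/train/"))

# The estimator serializes the DataFrame to S3 in protobuf format, then runs a
# SageMaker training job on its own, separate cluster of ML instances.
estimator = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/my-sagemaker-role"),  # placeholder
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=2,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1)
estimator.setK(10)
estimator.setFeatureDim(784)

# fit() trains on SageMaker and deploys the resulting model behind an endpoint;
# transform() sends the DataFrame to that endpoint for inference.
model = estimator.fit(training_data)
predictions = model.transform(training_data)
```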
My answer:
Spark and Spark ML are great tools, more power to both :) Still, I see a few reasons why combining them with SageMaker would make sense:
1 — Decoupling ETL infrastructure from training infrastructure. They could have different sizing requirements (compute and/or storage) and you wouldn’t want to oversize your Spark cluster just because one part of the overall process requires more capacity. Sure, you can resize an EMR cluster dynamically, but it’s extra work and potentially troublesome if you have a lot of data. SageMaker will start and terminate instances automatically, which I find cleaner and simpler :)
2 — In the same vein, imagine you need GPUs for training. It would be costly and sub-optimal to run your EMR cluster on GPU instances just for this.
3 — Spark ML is great and indeed has a lot of algos. SageMaker is just getting started and we’ll keep adding scalable implementations of state-of-the-art algos, so maybe one of them, such as DeepAR (released a couple of days ago), will catch your eye.
4 — Deploying models in production on managed infrastructure. To me, this is the single most important feature in SageMaker: call a single API and deploy an endpoint.
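To illustrate that last point, here’s a minimal sketch with the SageMaker Python SDK: one call to deploy() and you have a managed HTTPS endpoint. The role ARN, bucket, data and instance types are placeholders, and I’m using the built-in K-Means algorithm just as an example.

```python
import numpy as np
from sagemaker import KMeans

role = "arn:aws:iam::123456789012:role/my-sagemaker-role"  # placeholder
train_array = np.random.rand(1000, 50).astype("float32")   # placeholder data

# Train on a separate, fully managed cluster of ML instances.
kmeans = KMeans(role=role,
                train_instance_count=2,
                train_instance_type="ml.m4.xlarge",
                output_path="s3://my-bucket/output/",  # placeholder bucket
                k=10)
kmeans.fit(kmeans.record_set(train_array))

# One call: SageMaker creates the model, the endpoint configuration and the
# HTTPS endpoint on managed instances, and returns a predictor you can use right away.
predictor = kmeans.deploy(initial_instance_count=1,
                          instance_type="ml.m4.xlarge")

result = predictor.predict(train_array[:5])  # invoke the endpoint
```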
If you’re curious about combining SageMaker and Spark, this GitHub repository has several examples.
Keep the questions coming and have fun testing :)