Mixing Spark with SageMaker?
This short post comes from a question asked by Manel Maragal (thanks!) on my YouTube channel. It’s a really good question and hopefully my answer doesn’t suck… so why not share both with everyone?
Here’s Manel’s question:
Along the lines of “Building your own algorithm container”, is it possible to run Spark code entirely (and in a distributed fashion) on SageMaker? What I get from the documentation is that I’m supposed to do ETL in my Spark cluster, and then, when fitting the model, use sagemaker_pyspark, which creates a SageMaker training job: it moves the DataFrame to S3 in protobuf format and trains on a new cluster of SageMaker instances.
The question is: if I already have my DataFrame loaded in my distributed cluster, why would I want to use SageMaker? I might as well use Spark ML, which has better algorithm support and avoids creating an additional cluster. Maybe I got the whole thing wrong…
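For context, here’s roughly what the flow Manel describes looks like with sagemaker_pyspark. This is a minimal sketch based on the documented K-Means example: the S3 paths, role ARN, feature dimension and instance settings are placeholders you’d replace with your own.

```python
from pyspark.sql import SparkSession
import sagemaker_pyspark
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

# Make the SageMaker Spark JARs visible to Spark.
classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", classpath)
         .getOrCreate())

# Placeholder: a DataFrame with "label" and "features" columns, e.g. from libsvm files.
training_data = (spark.read.format("libsvm")
                 .option("numFeatures", "784")
                 .load("s3://my-bucket/train/"))

# The estimator serializes the DataFrame to S3 in protobuf format, then runs a
# SageMaker training job on its own, separate cluster of ML instances.
estimator = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/my-sagemaker-role"),  # placeholder
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=2,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1)
estimator.setK(10)
estimator.setFeatureDim(784)

# fit() trains on SageMaker and deploys the resulting model behind an endpoint;
# transform() sends the DataFrame to that endpoint for inference.
model = estimator.fit(training_data)
predictions = model.transform(training_data)
```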
My answer:
Spark and Spark ML are great tools, more power to both :) Still, I see a few reasons why combining them with SageMaker would make sense:
1 — Decoupling ETL infrastructure from training infrastructure. They could have different sizing requirements (compute and/or storage) and you wouldn’t want to oversize your Spark cluster just because one part of the overall process requires more capacity. Sure, you can resize an EMR cluster dynamically, but it’s extra work and potentially troublesome if you have a lot of data. SageMaker will start and terminate instances automatically, which I find cleaner and simpler :)
2 — In the same vein, imagine you need GPUs for training. It would be costly and sub-optimal to run your EMR cluster on GPU instances just for this.
3 — Spark ML is great and indeed has a lot of algos. SageMaker is just getting started and we’ll keep adding scalable implementations of state-of-the-art algos, so maybe one of them, such as DeepAR (released a couple of days ago), will catch your eye.
4 — Deploying models in production on managed infrastructure. To me, this is the single most important feature in SageMaker: call a single API and deploy an endpoint.
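To illustrate that last point, here’s a minimal sketch with the SageMaker Python SDK: one call to deploy() and you have a managed HTTPS endpoint. The role ARN, bucket, data and instance types are placeholders, and I’m using the built-in K-Means algorithm just as an example.

```python
import numpy as np
from sagemaker import KMeans

role = "arn:aws:iam::123456789012:role/my-sagemaker-role"  # placeholder
train_array = np.random.rand(1000, 50).astype("float32")   # placeholder data

# Train on a separate, fully managed cluster of ML instances.
kmeans = KMeans(role=role,
                train_instance_count=2,
                train_instance_type="ml.m4.xlarge",
                output_path="s3://my-bucket/output/",  # placeholder bucket
                k=10)
kmeans.fit(kmeans.record_set(train_array))

# One call: SageMaker creates the model, the endpoint configuration and the
# HTTPS endpoint on managed instances, and returns a predictor you can use right away.
predictor = kmeans.deploy(initial_instance_count=1,
                          instance_type="ml.m4.xlarge")

result = predictor.predict(train_array[:5])  # invoke the endpoint
```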
If you’re curious about combining SageMaker and Spark, this GitHub repository has several examples.
Keep the questions coming and have fun testing :)