Data preparation: AWS Glue DataBrew or Amazon SageMaker Data Wrangler?

December 07, 2020
In this video, I compare two AWS services for data preparation: AWS Glue DataBrew and Amazon SageMaker Data Wrangler. I discuss their unique capabilities, and when you'd want to use one, the other, or both.

For more content:
* AWS blog: https://aws.amazon.com/blogs/aws
* Medium blog: https://julsimon.medium.com/
* YouTube: https://youtube.com/juliensimonfr
* Podcast: http://julsimon.buzzsprout.com
* Twitter: https://twitter.com/@julsimon

Transcript

Hi everybody, this is Julien from Arcee. In this video, I would like to answer a question I've been asked a few times since we launched SageMaker Data Wrangler, our data preparation service inside SageMaker. The question is: should I use Data Wrangler, or should I use Glue DataBrew? It's a really good question, because the two services are pretty similar. They both help you prepare, clean, transform, and export data for downstream applications like analytics, training machine learning models, and so on. So it's worth figuring out which one to use, and what you can do with one that you can't do with the other.

Let's quickly introduce DataBrew. DataBrew was released a few weeks ago. It's part of AWS Glue, a fully managed ETL service, and it lets you connect to data sources and create datasets from S3, Redshift, RDS, and so on. Then you can start processing. Here, I've already created a pretty simple CSV dataset. One of the things I really like about DataBrew is that once you've imported a dataset, you can run a profile job. This is a fully managed job that gives you basic information, column statistics, and so on. All it takes is saying "run the profile on this dataset"; it's a one-click operation. You also get a correlation matrix, so you can see which features are highly correlated and which ones you may want to remove later on. I think this is pretty cool and unique to DataBrew. You can't do the same in SageMaker Data Wrangler, although you can run your own code in your own notebook to achieve similar results. The UI in DataBrew is pretty slick.

On the Data Wrangler side, we can connect to S3 or Athena, import data, and get a graphical view of our workflow.

When it comes to preparing data, let's go back to DataBrew. You can create a project, and once it's created, you get a slick UI with your transforms. It looks pretty similar to Data Wrangler: cleaning missing values and so on, and I'm sure each service has some unique transforms. One obvious difference is the ability in Data Wrangler to run your own custom code, such as PySpark, Pandas, or PySpark SQL, as well as custom formulas in Spark SQL, which, to my knowledge, is not available in DataBrew. If you don't want to run any code, DataBrew is a good option. If you need or want to run your own code and work at the code level, Data Wrangler is a better option.

For example, if I want to drop the name column here, I just select the column, delete it, preview the changes, and apply them. You build your recipe just like this, and you can download it as a YAML or JSON file if you want to use it for other purposes.

In Data Wrangler, you can also build analyses, create graphs, and use the quick model capability, which is really cool. You can train a quick model in place; here it's a classification problem, so we see an F1 score and feature importance. These are things you'd be interested in for machine learning, to see the impact of feature engineering on the metric that matters for your problem. This is not something you can do in DataBrew.

When it comes to running your transformations, DataBrew makes it very simple. You go to Jobs, create a job, select your project, give the job a name, and select the file type for the output; there are quite a few options, which is nice and probably not available as is in Data Wrangler. You specify an output location, fire up the job, and you'll get your transformed file in S3.
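To give an idea of what those jobs look like outside the console, here's a minimal boto3 sketch of a profile job and a recipe job like the ones described above. It assumes the dataset and recipe already exist in DataBrew; every name (dataset, recipe, role, bucket) is a placeholder, not something from the video.

```python
import boto3

databrew = boto3.client("databrew")

# The "one-click" profile, as an API call: profile an existing dataset
# and write the statistics to S3. All names below are placeholders.
databrew.create_profile_job(
    Name="customers-profile-job",
    DatasetName="customers-dataset",
    OutputLocation={"Bucket": "my-databrew-output", "Key": "profiles/"},
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
)

# A recipe job that applies a published recipe to the dataset and writes
# the transformed file to S3, mirroring the console workflow above.
databrew.create_recipe_job(
    Name="customers-clean-job",
    DatasetName="customers-dataset",
    RecipeReference={"Name": "customers-clean-recipe", "RecipeVersion": "1.0"},
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    Outputs=[{
        "Format": "CSV",
        "Location": {"Bucket": "my-databrew-output", "Key": "clean/"},
    }],
)

# Fire up the job and check its status.
run = databrew.start_job_run(Name="customers-clean-job")
state = databrew.describe_job_run(Name="customers-clean-job", RunId=run["RunId"])
print(state["State"])  # e.g. RUNNING, then SUCCEEDED or FAILED
```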
In Data Wrangler, you have multiple export options. You can export to Python code that you add to your project, export your features to SageMaker Feature Store, export to SageMaker Pipelines for automation, or export to a Jupyter notebook that runs a SageMaker Processing job to transform your data. So we have more options here, and again, code-level options. If you're just interested in the transformed dataset, DataBrew is probably much simpler. If you're interested in code, automation, and feature engineering, Data Wrangler is the most flexible option.

These are similar services, and you can certainly do the same things with one or the other. The main difference is that if you're a machine learning engineer or data scientist and need code artifacts for automation, customization, further exploration, tweaking, and so on, Data Wrangler has the advantage, because it can easily export all your transformation steps to Python code. If you're only interested in transforming data from one format to another, DataBrew is simpler. It has a nicer UI, and you can easily get the job done without writing a single line of code. You can still automate those steps, as DataBrew has APIs just like pretty much every AWS service: once you've defined a recipe, you can run it again programmatically. But you can also work entirely in the UI and transform your data interactively without ever seeing a line of code. For business analysts or non-technical people, DataBrew is a friendlier, easier option, and you can still plug it into Glue and automate down the road. If you're really focused on machine learning and need full visibility into the code, SageMaker Data Wrangler is probably a better option.

I would encourage you to try both and see which one works better for you. Maybe you can even combine them. If you already use Glue to extract data from different data sources, it's probably very easy to integrate DataBrew for initial processing and cleaning, the common, generic part of the cleaning and processing work you'd do on your datasets. Machine learning teams could then use Data Wrangler on those pre-processed datasets to work on feature engineering and do the work that's specific to their ML project.

In any case, feel free to experiment and invent your own workflows. If you have any questions, I'm happy to answer them. If you have feedback, I'm also very happy to hear about it. And until then, keep rocking.
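As an illustration of the code-level workflow mentioned in the video, here's a minimal sketch of the kind of custom Pandas transform you can run in Data Wrangler, where the custom transform step exposes the working dataframe as a `df` variable. The columns are made up for the example, and the same logic works as plain Python once you've exported your flow to code.

```python
import pandas as pd

# Stand-in for the dataframe that Data Wrangler hands to a custom
# transform step; inside Data Wrangler it already exists as `df`.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [34, None, 52],
})

df = df.drop(columns=["name"])                    # drop the identifier column
df["age"] = df["age"].fillna(df["age"].median())  # impute missing values
print(df)
```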

Tags

AWS SageMaker Data Wrangler, AWS Glue DataBrew, Data Transformation and Cleaning, Machine Learning Data Preparation, ETL Services Comparison

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.