Data preparation: AWS Glue DataBrew or Amazon SageMaker Data Wrangler?

December 07, 2020
In this video, I compare two AWS services for data preparation: AWS Glue DataBrew and Amazon SageMaker Data Wrangler. I discuss their unique capabilities, and when you'd want to use one, the other, or both.

For more content:
* AWS blog: https://aws.amazon.com/blogs/aws
* Medium blog: https://julsimon.medium.com/
* YouTube: https://youtube.com/juliensimonfr
* Podcast: http://julsimon.buzzsprout.com
* Twitter: https://twitter.com/@julsimon

Transcript

Hi everybody, this is Julien from Arcee. In this video, I would like to answer a question I've been asked a few times since we launched SageMaker Data Wrangler, our data preparation service inside SageMaker. The question is: should I use Data Wrangler, or should I use Glue DataBrew? It's a really good question, because the two services are pretty similar. They both help you prepare, clean, transform, and export data for downstream applications like analytics, training machine learning models, and so on. So it's worth figuring out which one to use, and what you can do with one that you can't do with the other.

Let's quickly introduce DataBrew. DataBrew was released a few weeks ago. It's part of AWS Glue, a fully managed ETL service, and it lets you connect to data sources and create datasets from S3, Redshift, RDS, and so on. Then you can start processing. Here, I've already created a pretty simple CSV dataset. One of the things I really like about DataBrew is that once you've imported a dataset, you can run a profile job. This is a fully managed job that gives you basic information, column statistics, and so on. All it takes is saying "run the profile on this dataset"; it's a one-click operation. You also get a correlation matrix, so you can see which features are highly correlated and which ones you may want to remove later on. I think this is pretty cool and unique to DataBrew. You can't do the same in SageMaker Data Wrangler, although you can run your own code in your own notebook to achieve similar results. The UI in DataBrew is pretty slick.

On the Data Wrangler side, we can connect to S3 or Athena, import data, and get a graphical view of our workflow.

When it comes to preparing data, let's go back to DataBrew. You can create a project, and once it's created, you get a slick UI with your transforms. It looks pretty similar to Data Wrangler: cleaning missing values and so on, and I'm sure each service has some unique transforms. One obvious difference is the ability in Data Wrangler to run your own custom code, such as PySpark, Pandas, or PySpark SQL, as well as custom formulas in Spark SQL, which, to my knowledge, is not available in DataBrew. If you don't want to run any code, DataBrew is a good option. If you need or want to run your own code and work at the code level, Data Wrangler is a better option.

For example, if I want to drop the name column here, I just select the column, delete it, preview the changes, and apply them. You build your recipe just like this, and you can download it as a YAML or JSON file if you want to use it for other purposes.

In Data Wrangler, you can also build analyses, create graphs, and use the quick model capability, which is really cool. You can train a quick model in place; here it's a classification problem, so we see an F1 score and feature importance. These are things you'd be interested in for machine learning, to see the impact of feature engineering on the metric that matters for your problem. This is not something you can do in DataBrew.

When it comes to running your transformations, DataBrew makes it very simple. You go to Jobs, create a job, select your project, give the job a name, and select the file type for the output; there are quite a few options, which is nice and probably not available as is in Data Wrangler. You specify an output location, fire up the job, and you'll get your transformed file in S3.
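To give an idea of what those jobs look like outside the console, here's a minimal boto3 sketch of a profile job and a recipe job like the ones described above. It assumes the dataset and recipe already exist in DataBrew; every name (dataset, recipe, role, bucket) is a placeholder, not something from the video.

```python
import boto3

databrew = boto3.client("databrew")

# The "one-click" profile, as an API call: profile an existing dataset
# and write the statistics to S3. All names below are placeholders.
databrew.create_profile_job(
    Name="customers-profile-job",
    DatasetName="customers-dataset",
    OutputLocation={"Bucket": "my-databrew-output", "Key": "profiles/"},
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
)

# A recipe job that applies a published recipe to the dataset and writes
# the transformed file to S3, mirroring the console workflow above.
databrew.create_recipe_job(
    Name="customers-clean-job",
    DatasetName="customers-dataset",
    RecipeReference={"Name": "customers-clean-recipe", "RecipeVersion": "1.0"},
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    Outputs=[{
        "Format": "CSV",
        "Location": {"Bucket": "my-databrew-output", "Key": "clean/"},
    }],
)

# Fire up the job and check its status.
run = databrew.start_job_run(Name="customers-clean-job")
state = databrew.describe_job_run(Name="customers-clean-job", RunId=run["RunId"])
print(state["State"])  # e.g. RUNNING, then SUCCEEDED or FAILED
```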
In Data Wrangler, you have multiple export options. You can export to Python code that you add to your project, export your features to SageMaker Feature Store, export to SageMaker Pipelines for automation, or export to a Jupyter notebook that runs a SageMaker Processing job to transform your data. So we have more options here, and again, code-level options. If you're just interested in the transformed dataset, DataBrew is probably much simpler. If you're interested in code, automation, and feature engineering, Data Wrangler is the most flexible option.

These are similar services, and you can certainly do the same things with one or the other. The main difference is that if you're a machine learning engineer or data scientist and need code artifacts for automation, customization, further exploration, tweaking, and so on, Data Wrangler has the advantage, because it can easily export all your transformation steps to Python code. If you're only interested in transforming data from one format to another, DataBrew is simpler. It has a nicer UI, and you can easily get the job done without writing a single line of code. You can still automate those steps, as DataBrew has APIs just like pretty much every AWS service: once you've defined a recipe, you can run it again programmatically. But you can also work entirely in the UI and transform your data interactively without ever seeing a line of code. For business analysts or non-technical people, DataBrew is a friendlier, easier option, and you can still plug it into Glue and automate down the road. If you're really focused on machine learning and need full visibility into the code, SageMaker Data Wrangler is probably a better option.

I would encourage you to try both and see which one works better for you. Maybe you can even combine them. If you already use Glue to extract data from different data sources, it's probably very easy to integrate DataBrew for initial processing and cleaning, the common, generic part of the cleaning and processing work you'd do on your datasets. Machine learning teams could then use Data Wrangler on those pre-processed datasets to work on feature engineering and do the work that's specific to their ML project.

In any case, feel free to experiment and invent your own workflows. If you have any questions, I'm happy to answer them. If you have feedback, I'm also very happy to hear about it. And until then, keep rocking.
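As an illustration of the code-level workflow mentioned in the video, here's a minimal sketch of the kind of custom Pandas transform you can run in Data Wrangler, where the custom transform step exposes the working dataframe as a `df` variable. The columns are made up for the example, and the same logic works as plain Python once you've exported your flow to code.

```python
import pandas as pd

# Stand-in for the dataframe that Data Wrangler hands to a custom
# transform step; inside Data Wrangler it already exists as `df`.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [34, None, 52],
})

df = df.drop(columns=["name"])                    # drop the identifier column
df["age"] = df["age"].fillna(df["age"].median())  # impute missing values
print(df)
```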

Tags

AWS SageMaker Data Wrangler, AWS Glue DataBrew, Data Transformation and Cleaning, Machine Learning Data Preparation, ETL Services Comparison

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.