Simplifying dataset preparation with Amazon SageMaker Processing (October 2020)
October 12, 2020
Webinar organized by Data Science UA, 8/10/2020
https://data-science-ua.com/events/simplifying-dataset-preparation-with-amazon-sagemaker-processing-online-meetup/
As ML practitioners know, transforming raw data into a dataset ready for training is hard work. Converting data to the format expected by the algorithm, splitting and shuffling data, handling outliers, filling missing values, engineering new features: the list goes on! Indeed, running, scaling, and keeping track of these processing jobs can quickly add lots of extra cost and complexity to any ML project. In this session, we'll start with a quick introduction to Amazon SageMaker, a fully managed and modular service for ML. Then, we'll discuss SageMaker Processing, a capability that lets you easily run your data processing workloads without having to worry about infrastructure at all. We'll also talk about SageMaker Experiments, another capability that makes it easy for you to organize and track ML jobs at any scale.
Transcript
We are ready. Hi everyone, let's start. I am very happy to see you on our online webinar. We have a special guest, Julien Simon. Yes, you managed it well. Thank you. Julien is a Principal Developer Advocate for AI and Machine Learning at Amazon. I heard Julien speak one or two years ago here in Kyiv. Now, of course, we are only in an online format. Today, we will have an online meetup on simplifying dataset preparation with Amazon SageMaker Processing. I will tell you a little bit about our company, Data Science UA, after which we will have a very interesting speech from Julien, followed by a Q&A session. We have a chat here in Zoom. You can use this chat to introduce yourself. Please share your position (data scientist, data analyst, data engineer, or manager) and your city and country. It's also interesting to know your goals for today's meetup. If you have questions, please write them. We will be happy to read them.
I will tell you a little bit about our company. We are Data Science UA. We have been building a data science community in Ukraine since 2016. We have several tracks in our company. One is education. We organize the Data Science Conference, our big international brand. We have held eight conferences, some offline in Kyiv and one online. Our ninth Data Science Conference will be on November 20th, and Julien will be a speaker. We also organize courses, meetups, and workshops on data science, AI, and analytics. Before the quarantine, everything was offline in Kyiv. Now, due to the quarantine, everything is online. This allows us to invite participants from all over the world, including Ukraine, Europe, the USA, Canada, and other countries. We have an efficient recruiting team that helps find the best projects for data scientists globally. We can close positions not only in data science but also in other IT roles, such as front-end, back-end, architects, and C-level positions. However, our main focus is on data science and AI positions. Please meet our recruiting team: Nastya, Alena, Alexandra, Bogdan, Oksana, and Sofia. You can contact them if you are interested in our positions. They will provide more details about the projects, requirements, and everything else. Don't hesitate to reach out to them with your questions.
We also have a consulting track. We help companies with business analytics, data collection, machine learning, advanced analytics, data mining, and business intelligence. If you have data, we can help you understand what you can do with it. Nika Tamaya-Forrest, our head of consulting, knows everything about data and how to make the best predictions and insights. Please write to me or Nika if you need help using AI in your company. We offer education and mentorship programs on data visualization and data science. We can customize courses based on your needs. For example, if you work in FinTech, we can create data science cases specifically for FinTech or e-commerce. Please contact us for more information.
A few months ago, we launched a new track. We opened an R&D center in Kyiv for our American company. We hired an extremely efficient and cool team of computer vision, machine learning, and server-side engineers. They are working on complex solutions for industrial manufacturing safety. We can open R&D centers in Ukraine and help you find the best team for your projects. This is a promo for our conference, which will be held on November 20th. We will have 24 hours of content across three tracks: business, technical, and workshops. Julien will be speaking on natural language processing with Amazon. Thank you for coming, Julien. Please follow us on social networks. We have a page on Facebook, a Telegram chat, an Instagram page, and a YouTube channel. You can find videos from our previous offline conferences on YouTube. If you need help with data science or developer positions, please write to me. Now, I will hand it over to Julien. Thank you again for coming. It's a pleasure to have you.
I see some messages in the chat. Yuri, CEO of a bank in Kyiv, is here. Someone asked about the sound. Do you hear me, guys? Do you see everything in the chat? Yes, it works for me. Julien, the floor is yours.
---
Good evening, everyone. My name is Julien, and I'm a tech evangelist working for AWS. I've been with AWS for about five years and have had the pleasure of visiting Ukraine several times. You have a fantastic tech community in Kyiv and Lviv. Unfortunately, the world is a bit crazy right now, so we have to rely on online conferences. I really hope to visit you in person next year and enjoy some Ukrainian food, which I love. I apologize for the football game last night; I'm sure if your usual team had been playing, it would have been a different story. Good luck to Shevchenko in rebuilding the Ukrainian team. There's a lot of work to be done, but now let's talk about machine learning.
In this session, I will focus on the most time-consuming step in the machine learning workflow: data preparation. As we all know, building and deploying a machine learning model involves many steps. You might think you'll spend all your time writing complex machine learning algorithms and training models, but in practice, you'll spend a lot of time collecting, preparing, and cleaning data. Then, you'll try to find an algorithm that fits your dataset, build and manage training infrastructure, and keep track of the hundreds or thousands of jobs you run in a project. Bringing the model into production, managing prediction quality, monitoring performance, and scaling prediction infrastructure are also significant challenges. You have to go through all these steps.
A few years ago, as machine learning became a growing priority for customers, they asked us to make this process easier. We listened to many customers, as everyone approaches machine learning differently, and we came up with Amazon SageMaker. SageMaker is our flagship service for machine learning, allowing you to go from experimentation to production using a single service with many capabilities under the same umbrella. The goal is to help you move from data collection and preparation to scalable production as quickly as possible. Over time, we added capabilities to SageMaker to assist you at every step of the process.
Today, I will focus exclusively on the data preparation step. In my talk at the conference, I will show you the full process with a different dataset. I'm a bit lazy, so I won't repeat the same content. If you're interested, make sure to attend my presentation at the conference, where we'll go from data cleaning and processing all the way to deployment.
You likely have existing tools for data cleaning and feature engineering, such as Spark, Hadoop, or custom code. On AWS, you can use big data tools like Amazon EMR for managed Hadoop or Spark, run your code on EC2 instances, or use our database services for data transformations. We also have an ETL service called AWS Glue for running ETL jobs. There are many ways to do this.
We thought, "Why not integrate one capability into SageMaker in the easiest way possible?" When you focus on machine learning, you want to spend 99% or even 100% of your time on machine learning, not on infrastructure. This led to the creation of SageMaker Processing. SageMaker Processing is fully integrated with SageMaker and has simple APIs. It allows you to run batch jobs on your data, such as data cleaning and feature engineering, but you can also use it for model evaluation, running test sets, or cross-validation on trained models. It's fully managed, so you never worry about infrastructure. You don't have to provision or configure instances; it's all done for you.
If you're working on machine learning, you should focus on machine learning, not on managing infrastructure. You can bring your own code, as data processing is dataset-specific: existing scripts can be integrated into SageMaker Processing without rewriting them. All activity in SageMaker is based on containers, including training, deployment, and processing. We provide a built-in scikit-learn container, so you can write scikit-learn code and run it. As of a week ago, you can also run PySpark on Processing, which could replace Spark jobs running elsewhere. You can also bring your own container if you need to run, for example, feature engineering code written in C++. Everything runs on AWS, so you get built-in security, compliance, on-demand infrastructure, and pay-as-you-go pricing.
To use SageMaker Processing, you need to bring your processing code, adapt it to read the dataset and write the processed dataset, and write a few lines of Python code to get everything going. You can forget about managing, scaling, and paying for clusters or any infrastructure you traditionally use for this.
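To make that concrete, here is a minimal sketch of what "adapting your code" means in practice, assuming the usual SageMaker Processing convention of local container paths under /opt/ml/processing/. The file names and cleaning steps below are placeholders, not the webinar's actual script.

```python
# Minimal sketch: SageMaker Processing copies the S3 inputs you declare to
# local paths inside the container (by convention under /opt/ml/processing/),
# runs your script, then uploads whatever you write to the declared output
# paths back to S3. File names here are illustrative.
import pandas as pd

input_path = "/opt/ml/processing/input/raw.csv"      # mapped from your S3 input
output_path = "/opt/ml/processing/output/clean.csv"  # uploaded to your S3 output

df = pd.read_csv(input_path)
df = df.dropna().drop_duplicates()  # your actual cleaning logic goes here
df.to_csv(output_path, index=False)
```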
Let's move on to some demos. We'll start with a demo based on Spark and then show another demo based on scikit-learn, including an extra SageMaker capability called SageMaker Experiments to help manage all your jobs.
Let me switch to my browser. Here is the SageMaker console. If you work with AWS, you're familiar with the AWS console, which is the web application for working with all AWS services. Inside the SageMaker console, you can see SageMaker processing jobs. One cool feature in SageMaker is the availability of built-in development environments. We have notebook instances, which are managed EC2 instances pre-installed with Jupyter and all the necessary libraries. You can create one in a few minutes, open it, and start working in a Jupyter notebook. For the second demo, I'll use a more recent development environment called SageMaker Studio, which is an IDE running in the browser.
I've already created a notebook instance for the first demo. We'll process a dataset using Spark and SageMaker Processing. The first important step is to have the SageMaker SDK. We released a major revision of the SDK in early August, version 2.x. If you see 1.x in older notebooks, there are a few differences, mostly renaming, with very few breaking changes.
First, we need to get some data. We'll download a dataset from Amazon S3, our storage service. It's the Abalone dataset, which describes abalone shellfish. The goal is to predict the age of the shellfish based on physical properties like weight and size. Here, we'll just clean and process the data. The dataset is in CSV format. It has a categorical variable (M for male, F for female, I for infant) and seven numerical features. The label is the age of the shellfish.
We have a PySpark script to process this data. We'll index the string column, one-hot encode it, and combine everything into a feature vector, using standard PySpark APIs. The script receives command-line arguments for the input and output locations, which are in S3. We define a schema, read the CSV file, apply the schema, and perform the transformations. We create a pipeline, apply it to the dataset, split the data into training and validation sets, and save the processed data back to S3.
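For reference, here is a condensed sketch of what such a PySpark script can look like. The column names and schema are assumptions based on the public Abalone dataset, and the argument names are illustrative; it is not the exact script from the demo.

```python
# Condensed sketch of a PySpark processing script of the kind described above.
# Column names/schema are assumptions based on the public Abalone dataset;
# input/output S3 locations arrive as command-line arguments.
import argparse
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

parser = argparse.ArgumentParser()
parser.add_argument("--s3_input_path", type=str)
parser.add_argument("--s3_output_path", type=str)
args = parser.parse_args()

spark = SparkSession.builder.appName("abalone-preprocess").getOrCreate()

# Explicit schema: one categorical column ('sex'), seven numerical features, and the label.
numeric_cols = ["length", "diameter", "height", "whole_weight",
                "shucked_weight", "viscera_weight", "shell_weight"]
schema = StructType(
    [StructField("sex", StringType(), True)]
    + [StructField(c, DoubleType(), True) for c in numeric_cols + ["rings"]]
)
df = spark.read.csv(args.s3_input_path, header=False, schema=schema)

# Index the string column, one-hot encode it, then assemble everything into one vector.
indexer = StringIndexer(inputCol="sex", outputCol="sex_idx")
encoder = OneHotEncoder(inputCol="sex_idx", outputCol="sex_vec")  # one-hot encode the indexed column
assembler = VectorAssembler(inputCols=["sex_vec"] + numeric_cols, outputCol="features")

pipeline = Pipeline(stages=[indexer, encoder, assembler])
transformed = pipeline.fit(df).transform(df).select("features", "rings")

# Split and write the result back to S3.
train, validation = transformed.randomSplit([0.8, 0.2])
train.write.mode("overwrite").parquet(args.s3_output_path + "/train")
validation.write.mode("overwrite").parquet(args.s3_output_path + "/validation")
```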
We use the PySpark processor object from the SageMaker SDK to run this code. We create a unique timestamp for the job, upload the unprocessed CSV file to S3, and define the processing job. We specify the Spark version and the number of instances to use. SageMaker makes infrastructure transparent; you just say how many instances you need. We run the job, passing the location of the code, input and output locations, and any arguments. The job runs on the specified instances, and the processed data is saved to the output location in S3.
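A sketch of the notebook side, using the SageMaker Python SDK v2, might look like the following. The bucket, prefixes, Spark version, and argument names are assumptions for illustration.

```python
# Sketch of launching the Spark job from the notebook (SageMaker Python SDK v2).
# Bucket/prefix names, Spark version, and script arguments are illustrative assumptions.
import time
import sagemaker
from sagemaker.spark.processing import PySparkProcessor

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
timestamp = time.strftime("%Y-%m-%d-%H-%M-%S")

# Upload the unprocessed CSV file to S3 under a unique, timestamped prefix.
input_uri = session.upload_data("abalone.csv", bucket=bucket,
                                key_prefix=f"abalone/raw/{timestamp}")

# Choose the Spark version and the number of instances; SageMaker provisions
# and tears down the cluster for you.
spark_processor = PySparkProcessor(
    base_job_name="sm-spark-abalone",
    framework_version="2.4",          # assumed Spark version; pick one the SDK supports
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

spark_processor.run(
    submit_app="preprocess.py",       # the PySpark script sketched above
    arguments=[
        "--s3_input_path", input_uri,
        "--s3_output_path", f"s3://{bucket}/abalone/processed/{timestamp}",
    ],
)
```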
For the second demo, I'll use Amazon SageMaker Studio, a web-based IDE for machine learning. It's available in several regions, including Europe. You set up a user with a simple wizard, and then you can open Studio and start working. It's based on JupyterLab, so if you're familiar with JupyterLab, you'll be comfortable with SageMaker Studio.
This time, we'll use scikit-learn to process a marketing dataset. We download the data, which has about 41,000 lines, 20 features, and one label. The features are customer attributes, and the label indicates whether the customer accepted a marketing offer. We upload the data to S3, use the scikit-learn processor object, and run the job on ml.m5.xlarge instances. We pass the script and parameters, such as the input and output locations and the train-test split ratio.
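Here is a sketch of what launching that job can look like with the SKLearnProcessor from the SageMaker Python SDK v2. The script name, S3 locations, and output names are illustrative assumptions.

```python
# Sketch of launching the scikit-learn processing job (SageMaker Python SDK v2).
# Script name, S3 paths, and output names are illustrative assumptions.
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

role = sagemaker.get_execution_role()

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",       # built-in scikit-learn container
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

sklearn_processor.run(
    code="preprocessing.py",          # the processing script sketched below
    inputs=[ProcessingInput(
        source="s3://my-bucket/marketing/raw.csv",   # assumed input location
        destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train_data",
                         source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data",
                         source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)
```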
The processing script reads the command-line arguments, loads the data, removes missing values and duplicates, and performs transformations. It handles an imbalanced dataset by creating a new column for previous contact and grouping certain job categories. We split the data for training and validation, build a scikit-learn pipeline, and save the processed data to S3.
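The sketch below shows the general shape of such a processing script. The column names, label name, and transforms are simplified assumptions based on the public direct marketing dataset, not the exact script from the demo.

```python
# Condensed sketch of a scikit-learn processing script of the kind described above.
# Column names ('pdays', 'y') and transforms are simplified assumptions.
import argparse
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

parser = argparse.ArgumentParser()
parser.add_argument("--train-test-split-ratio", type=float, default=0.2)
args = parser.parse_args()

# SageMaker Processing copies the declared input here before the script runs.
df = pd.read_csv("/opt/ml/processing/input/raw.csv")
df = df.dropna().drop_duplicates()

# Example of the feature engineering mentioned above: a binary flag indicating
# whether the customer was contacted in a previous campaign (column name assumed).
df["no_previous_contact"] = (df["pdays"] == 999).astype(int)

X = df.drop("y", axis=1)   # 'y' assumed to be the label column
y = df["y"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=args.train_test_split_ratio, stratify=y)

# Scale numerical columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), X.select_dtypes(include="number").columns.tolist()),
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse=False),
     X.select_dtypes(include="object").columns.tolist()),
])
train_features = preprocess.fit_transform(X_train)
test_features = preprocess.transform(X_test)

# Anything written to the declared output paths is uploaded back to S3.
os.makedirs("/opt/ml/processing/train", exist_ok=True)
os.makedirs("/opt/ml/processing/test", exist_ok=True)
pd.DataFrame(train_features).assign(label=y_train.values).to_csv(
    "/opt/ml/processing/train/train.csv", index=False)
pd.DataFrame(test_features).assign(label=y_test.values).to_csv(
    "/opt/ml/processing/test/test.csv", index=False)
```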
If you run many processing jobs, managing them can become challenging. SageMaker Experiments helps track all jobs involved in a machine learning project. You create an experiment to group related trials, run processing jobs, and attach them to the experiment. You can view and manage these jobs in SageMaker Studio, which provides visual cues and allows you to explore jobs programmatically using the SageMaker Experiments SDK.
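As a rough sketch of how a processing job gets attached to an experiment, the pattern looks like the following. It assumes the sagemaker-experiments package is installed; the experiment and trial names are illustrative, and the processor's inputs and outputs are omitted for brevity.

```python
# Sketch of grouping processing jobs under a SageMaker Experiment.
# Requires the 'sagemaker-experiments' package; all names are illustrative.
import time
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

role = sagemaker.get_execution_role()
timestamp = time.strftime("%Y-%m-%d-%H-%M-%S")

# An experiment groups related trials; each trial collects the jobs of one attempt.
experiment = Experiment.create(
    experiment_name=f"marketing-preparation-{timestamp}",
    description="Data preparation jobs for the direct marketing dataset")
trial = Trial.create(
    trial_name=f"sklearn-processing-{timestamp}",
    experiment_name=experiment.experiment_name)

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1", role=role,
    instance_type="ml.m5.xlarge", instance_count=1)

# Attaching the job is just a matter of passing experiment_config to run();
# inputs/outputs are omitted here for brevity (same as in the earlier sketch).
sklearn_processor.run(
    code="preprocessing.py",
    experiment_config={
        "ExperimentName": experiment.experiment_name,
        "TrialName": trial.trial_name,
        "TrialComponentDisplayName": "Processing",
    },
)
```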
SageMaker has many more capabilities beyond processing. We have a collection of AI services based on pre-trained models, such as image and video analysis, speech-to-text, and text-to-speech. We also offer infrastructure and framework tools for customers who want to build and manage their own machine learning infrastructure, including EC2 instances, custom chips like Inferentia, and optimized libraries.
If you're interested in learning more, visit ml.aws for an overview of all AWS machine learning services. The SageMaker page has customer references and high-level information. The AWS blog has many posts on SageMaker, and we have a collection of sample notebooks. My YouTube channel and Medium blog have more content. We also have SageMaker Fridays, a series of events starting this Friday, where we dive deep into real-life use cases for machine learning. Our technical conference, re:Invent, is coming up and will be free this year. Finally, I recently published a book on SageMaker, which covers all its capabilities and includes 62 original notebooks. You can get a discount on the paper edition and the ebook.
Thank you very much for your attention. I hope to see you in person next year. If you have questions in the future, feel free to reach out to me on Twitter or LinkedIn. Thanks again, Julien.