An overview of Machine Learning services on AWS (September 2018)
August 01, 2019
Closing keynote from AWS Innovate EMEA 2018.
Transcript
Hey, hi everybody. My name is Julien. I'm a tech evangelist with AWS and I focus on AI and machine learning. You've just had a busy day of sessions covering all sorts of topics, and now it's my job to close the day. I figured you might be tired of slides and PowerPoint, so I decided to do something a little different. We're going to do a whiteboard session, and I'm going to talk about machine learning, but from a slightly different angle.
A lot of people, when they talk about machine learning, focus on algorithms, theory, and the technical aspects. While these are important, sometimes we should take a step back and look at the bigger picture. Where does the data come from? What do we need to do to the data before it can be used for training a machine learning model? And what happens after the model has been trained? This is what we're going to discuss for the next 30 minutes.
You may have seen this re:Invent talk, which has been quite popular over the last few years. It's called "Big Data Architectural Patterns," and you can find the recording on YouTube; the session number to look for is ABD201. The reason I'm mentioning it is that I'm going to start from there, with a focus on machine learning. Let's get started.
Generally, when you work with data, you need to go through different steps. The first step is to collect the data, ingest it. The second step is to store the data. Then, the fun starts. We start processing the data, transforming it, and finally, we move to consuming the data.
If you looked at this from a big data perspective, you could say, "I've got some web logs; I'm collecting them from my web servers, copying the files to S3, and storing everything there. Then, I run Hadoop jobs on those logs, or query them with Redshift or Athena, extracting information that gets pushed somewhere for consumption."
Now, let's use this same pipeline but with a focus on machine learning. Let's look at all four steps and see what's available on AWS to help you get from A to Z. There's quite a lot to talk about, as you will see.
### Collection
Data scientists might say, "I don't care so much about collecting the data; it's not my job. Just put the data somewhere I can work with it." But collecting the data is important. Let's look at some typical scenarios. Files are going to be important: web logs, CSV files, JSON files. You can grab them from web servers, storage systems, and so on, inside your company or on the internet. Streams are increasingly important too, capturing streaming data from mobile apps, IoT systems, and devices. Kinesis is your friend here. Amazon Kinesis is a scalable streaming service that can handle very large volumes of data.
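To make this concrete, here's a minimal sketch of a producer pushing JSON events into a Kinesis stream with boto3. The stream name, field names, and event shape are hypothetical; you'd create the stream beforehand and adapt the payload to your application.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

def send_event(event: dict) -> None:
    """Push one JSON event into a (hypothetical) 'clickstream' stream.

    The partition key controls which shard the record lands on.
    """
    kinesis.put_record(
        StreamName="clickstream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user_id"],
    )

send_event({"user_id": "42", "action": "click", "page": "/pricing"})
```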
Kinesis Video Streams, announced at re:Invent last year, is a simple way to stream video data into AWS for later processing. Kinesis Firehose is another easy way to move data into S3 or Redshift. A cool feature here is that you can use Lambda to run functions on messages in transit in Firehose. This can be useful for cleaning data, looking for missing or weird values, and performing cleanup and validation before storage.
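As an illustration, here's a hedged sketch of what such a transformation Lambda could look like. Firehose hands the function a batch of base64-encoded records and expects each one back tagged "Ok", "Dropped", or "ProcessingFailed"; the validation rule (requiring a `user_id` field) is just a hypothetical example.

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose transformation Lambda: validate records while in transit."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        if payload.get("user_id"):  # hypothetical rule: drop records missing user_id
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(
                    (json.dumps(payload) + "\n").encode("utf-8")
                ).decode("utf-8"),
            })
        else:
            output.append({
                "recordId": record["recordId"],
                "result": "Dropped",
                "data": record["data"],
            })
    return {"records": output}
```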
When it comes to data, you will have all kinds of backends: SQL, NoSQL. You might need to pull data from these backends to build or refresh your dataset. There are many ways to do this. You could use AWS Data Pipeline, a tool like Sqoop, or even DMS (Database Migration Service) to pull data from SQL databases into S3. SageMaker needs all data to be in S3, so there's some work involved in pulling data into S3. DMS is a fully managed way to migrate your data to S3, and it's a cool, easy way to do it.
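For a small table, you don't even need a dedicated service; here's a minimal sketch that pulls a SQL table into S3 with pandas and boto3. The connection string, table, and bucket names are all hypothetical.

```python
import boto3
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string, table, and bucket names.
engine = create_engine("postgresql://user:password@my-db-host:5432/shop")

# Pull the table into a DataFrame, then stage it in S3 as CSV.
df = pd.read_sql("SELECT * FROM customers", engine)
df.to_csv("/tmp/customers.csv", index=False)

boto3.client("s3").upload_file("/tmp/customers.csv", "my-ml-bucket", "raw/customers.csv")
```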
### Storage
When you talk about storage for machine learning on AWS, it's pretty much going to mean S3. If you work with high-level services like Rekognition, you can pass your image to the API call, either inline or from S3. For SageMaker, all data needs to be in S3. So, you need ways to funnel everything into S3. Firehose, DMS, and copying files to S3 are all good options. S3 is where your data needs to be. You can also use Lambda to trigger functions when new objects are written to S3, for cleaning, aggregating, or pre-processing.
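For example, here's a minimal sketch of a Lambda function triggered by s3:ObjectCreated events that does a trivial cleanup pass; the "clean/" prefix and the blank-line rule are hypothetical choices.

```python
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered on s3:ObjectCreated:*; re-writes a cleaned copy of each object."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        # Hypothetical cleanup: drop blank lines before storing under 'clean/'.
        cleaned = "\n".join(line for line in body.splitlines() if line.strip())
        s3.put_object(Bucket=bucket, Key="clean/" + key, Body=cleaned.encode("utf-8"))
```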
### Processing
Let's subtitle this "Transform, Train, and Optimize." Compared to running a query in Redshift or Athena, we need to do more with the data. Transforming the data into features is crucial. For example, if you have customer addresses with street number, street name, and zip code, you might concatenate these into a single full-address feature, or even geocode them into GPS coordinates. This is a simple example, but it illustrates the importance of feature engineering.
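In pandas, that address example might look like this; the column names and sample values are made up for illustration.

```python
import pandas as pd

# Toy customer records with hypothetical column names.
df = pd.DataFrame({
    "street_number": ["12", "7"],
    "street_name": ["Rue de Rivoli", "Market Street"],
    "zip_code": ["75001", "94103"],
})

# Concatenate the raw columns into a single, higher-level feature.
df["full_address"] = df["street_number"] + " " + df["street_name"] + ", " + df["zip_code"]
print(df["full_address"].tolist())
```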
For transformation, you could use EMR and Spark for ETL operations at scale. Redshift, Athena, and Glue (our ETL service) are also good options. For interactive data exploration, I recommend using a notebook instance in Amazon SageMaker, which comes pre-installed with Jupyter and popular machine learning libraries.
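As a rough sketch of the Spark route, here's a small PySpark job that reads raw CSV logs from S3, builds a simple aggregate feature, and writes it back as Parquet; the bucket, paths, and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature-etl").getOrCreate()

# Hypothetical bucket and column names: raw web logs staged in S3 as CSV.
logs = spark.read.csv("s3://my-ml-bucket/raw/logs/", header=True, inferSchema=True)

# Keep successful requests and count them per user and hour of day.
features = (
    logs.filter(F.col("status") == 200)
        .withColumn("hour", F.hour("timestamp"))
        .groupBy("user_id", "hour")
        .count()
)

features.write.mode("overwrite").parquet("s3://my-ml-bucket/features/traffic/")
```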
### Training
If all you need are general-purpose capabilities, high-level services like Rekognition, Polly, etc. don't require any training. If you need to train your own model, SageMaker is the service to look at. SageMaker supports training at any scale using data hosted in S3, and you don't manage any servers. You can choose from built-in algorithms, deep learning environments (TensorFlow, MXNet, PyTorch, Chainer), or bring your own custom environment.
For infrastructure, C5 instances are good for small to medium-sized datasets, while P3 instances, with NVIDIA V100 GPUs, are ideal for large datasets. SageMaker automatically terminates the training instances once the job is complete, so you never overpay.
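Putting that together, here's a minimal sketch of a training job with the SageMaker Python SDK, using the built-in XGBoost algorithm; the IAM role, bucket, hyperparameters, and algorithm version are assumptions you'd adapt.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.image_uris import retrieve
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical IAM role

# Built-in XGBoost algorithm: SageMaker resolves the training image for us.
container = retrieve("xgboost", session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",            # CPU is fine for small/medium data
    output_path="s3://my-ml-bucket/models/",  # hypothetical bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# The training data must already live in S3; instances are torn down at job end.
train_input = TrainingInput("s3://my-ml-bucket/features/train/", content_type="text/csv")
estimator.fit({"train": train_input})
```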
### Optimization
Hyperparameter optimization (HPO) is a crucial step. SageMaker uses Bayesian optimization to find the best parameters, helping you converge quickly to the right model.
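With the SageMaker SDK, HPO is a thin layer on top of the estimator. Here's a hedged sketch reusing the `estimator` and `train_input` from the training example above, plus an assumed validation channel; the objective metric and parameter ranges are illustrative.

```python
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# Assumed validation dataset, staged in S3 alongside the training data.
validation_input = TrainingInput(
    "s3://my-ml-bucket/features/validation/", content_type="text/csv"
)

tuner = HyperparameterTuner(
    estimator=estimator,                     # from the training sketch above
    objective_metric_name="validation:auc",  # built-in XGBoost metric
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,           # total jobs the Bayesian search may launch
    max_parallel_jobs=2,   # jobs running concurrently
)

tuner.fit({"train": train_input, "validation": validation_input})
print(tuner.best_training_job())
```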
### Deployment
Once you have a model, you can deploy it to an HTTPS endpoint, use batch transform, or integrate it into your own application. You can also use high-level services like Rekognition, Polly, Translate, and Transcribe for general-purpose tasks.
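For the endpoint route, deployment is a one-liner on the estimator, and any application can then call the endpoint over HTTPS; here's a sketch with an assumed endpoint name and CSV payload.

```python
import boto3

# Deploy the trained model behind a real-time HTTPS endpoint.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="demo-endpoint",  # hypothetical name
)

# Invoke the endpoint from any application via boto3.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="demo-endpoint",
    ContentType="text/csv",
    Body="42,0.7,1,0",  # one CSV row of features
)
print(response["Body"].read())
```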
That's a good overview of what it means to do machine learning on AWS. Just before I go, let me remind you that Q&A will run for 30 more minutes. You can go to the Ask the Expert section and ask questions about anything you heard today. We're more than happy to help.
If you can convince your boss to send you to re:Invent in Vegas at the end of November, I'll be there running some sessions. We'd be more than happy to meet and help you learn even more. For more information, visit ml.aws. If you want to stay in touch, connect with me on LinkedIn or Twitter. Thank you for participating in this online conference today. I hope you learned a lot. It was a pleasure talking to you, and I hope to see you at re:Invent or on the road. Have a great day. Bye bye.