Devoxx Ukraine 2020. Getting started with Machine Learning on AWS. Julien Simon
November 11, 2020
The talk from Devoxx Ukraine 2020 ONLINE
Fb: https://www.facebook.com/DevoxxUkraine/
Website: https://www.devoxx.com.ua
Over the years, cloud computing has proven to be the most agile, scalable and cost-effective option to build and run IT applications. Of course, these benefits can also apply to Machine Learning (ML). In this hands-on session, we'll start with a quick introduction to the different layers of the AWS ML stack. Then, we'll dive into the different solutions that you can use to build, train and deploy your ML applications, using Amazon EC2 instances, container services, Amazon SageMaker, or high-level services like Amazon Forecast and Amazon Personalize. Services will be demoed using the AWS console and AWS SDKs. Along the way, we'll also compare them from a skill, technical, operational and cost perspective. In order to fully enjoy this session, you should have a basic understanding of core AWS concepts (EC2, S3, IAM, etc.). We'll use simple ML examples, so no deep ML knowledge is required.
#devoxxua #java #devoxx
Transcript
Hi, Julien. How are you today?
Hello, I'm good. How are you? Have you already woken up?
Yeah, it's fine. I'm really happy to speak to all of you this morning. I wish I could be in Ukraine. I've been there many times. It's a place that I really like. Unfortunately, we have to stick to online videos, but I really hope we can soon be in the same room and have a great conversation. The sooner, the better.
Yes, I hope. Let's do it next year. Let's imagine. And what topic have you prepared for us?
Well, this morning, we're going to start the day in style. It's an introduction to machine learning, what machine learning is, how to easily get started with your machine learning project, sharing some best practices, and then walking through and demoing some of the services that AWS has built for people who are just beginning with machine learning or people who already have a bit of experience. So, lots of real-life advice and, of course, lots of demos and code, which is what we like. The stage is yours; I'll come back after you finish.
All right. Thank you very much. Okay. So let me share my screen, full screen, and we can get started. So again, good morning, everybody. Thank you very much for joining me this morning. We have a long session with lots of stuff to cover. My name is Julien. I'm a tech evangelist working with AWS. I've been working with them for about five years. My role is to help developers and customers understand how to build quickly and easily with AWS services.
We're going to start with a primer on machine learning, because I'm guessing many of you are new to it. We've got to get our definitions right, understand what this thing is, and look at some of the technical issues and challenges we need to focus on. Then we'll talk about starting your machine learning projects, figuring out which tools to use and how to use them, and how to organize the projects, which is a really important topic if you want to be successful. And then we'll start looking at services.
So let's start with a few basic definitions. We hear constantly about AI and machine learning. Is there a problem with the screen? Oh, you don't see the full screen. OK. That's OK. Let me try screen sharing again then. Looks like I just don't know which screen it's sharing. OK, it's kind of weird because I'm sharing the right one. But okay, anyway. All right. We'll figure it out. No worries. Okay. Here we are.
So the first thing we need to get clear is AI versus machine learning versus deep learning. You hear those words thrown around constantly, and not everybody understands the differences. So really quickly, just to set the scene, AI is the high-level domain. AI is a 60-plus-year-old domain and it focuses on building and designing software applications that exhibit human-like behavior. Building apps that can recognize images or videos, use natural language, and show some kind of reasoning, intuition, and the ability to solve complex problems. Inside AI, tons of different domains have emerged over time, such as expert systems, computer vision, speech synthesis, and more. But as it turns out, one of those domains, machine learning, has become very successful because it proved to be very effective at solving hard problems.
Machine learning is a capability for machines to learn from datasets using statistical algorithms, without being explicitly programmed. When you write business rules in your applications, you explicitly state what you want to achieve, manipulate data, process it, and get some kind of output. Machine learning is different. It starts from datasets where features, such as columns or variables, are clear. You use off-the-shelf algorithms to learn how to predict a certain value in the dataset. Instead of writing the application code, you use existing algorithms to extract patterns and insights from datasets.
Deep learning is a subset of machine learning that focuses on neural networks. It uses various architectures to extract information from complex datasets where features are not easily expressed. Neural networks are great at figuring out patterns and useful information in complex data like images, videos, and speech. They are mysterious and complex but very effective. So, artificial intelligence is the high-level topic, inside of which is machine learning, and inside machine learning is deep learning.
Now that we're clear, let's talk about the two main ways you can do machine learning. The most common way is supervised learning. As the name suggests, you show the algorithm the correct answer in your dataset, and the algorithm learns how to predict this answer. If the model is built correctly, it will generalize to other data it hasn't seen before. The key here is that you need to label the dataset. For example, if you're building a fraud detection model for credit card transactions, you label transactions as legitimate or fraudulent. The model learns to predict the right answer. Building the dataset and labeling it is a lot of work. Classification is a popular use case for supervised learning. It can be binary, like yes or no, or multi-class, like categorizing images into 50 different categories. Regression is another use case, where you predict a numerical value, such as the price of a house or the number of traffic jams in Paris tomorrow.
Unsupervised learning, on the other hand, doesn't require a labeled dataset. You start with a raw dataset, and the model finds patterns and organizes samples. Clustering is a good example, where you group samples based on their statistical properties. For instance, you might group customers into bronze, silver, and gold categories. Topic modeling is similar, grouping text documents based on common words. Both techniques are useful. Supervised learning builds very sophisticated models, such as those used in computer vision and natural language processing, but it requires a lot of labeled data. Unsupervised learning achieves less sophisticated tasks but doesn't require labeled data and usually needs less data.
There's another technique called reinforcement learning, which works even without data. It involves letting an algorithm explore a simulated reality and figure out how to do the right thing. It's used in robotics, the stock market, and energy management, where building a dataset is not an option. But I won't discuss that today.
To give you a sense of what these algorithms are, here's a list of some available in the popular machine learning library called scikit-learn. We can see different topics like classification, regression, clustering, and dimensionality reduction. This map shows that there are many algorithms to choose from, depending on your problem. You start from data, figure out the problem, and experiment with different algorithms. Decision trees are a popular family of algorithms. They build a tree that lets you go left or right based on feature values, eventually reaching the right answer. They are easy to understand and explain, and variants like XGBoost are very popular. The goal is to figure out the parameters automatically, such as which features to use and where to split.
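To make this concrete, here is a minimal scikit-learn sketch that fits a decision tree classifier on a small built-in dataset. It is illustrative only and not part of the original demo; the dataset and parameters are just examples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small labeled dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The tree learns its split parameters (which feature, which threshold) from the data
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```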
The machine learning cycle starts with the business problem. It's crucial to step back and think about what you're trying to solve. Maybe you're classifying images, detecting fraudulent transactions, or translating languages. Then you frame the problem, understanding the data and algorithms needed. Not all problems are machine learning problems. If data can't help, you need another approach. Assuming it is a machine learning problem, you collect, clean, and prepare data, which can take 50 to 80% of the time. Then you experiment, visualize, and build features. Next, you train and tune models, measure accuracy, and iterate until you're satisfied. Once you have a good model, you deploy it for predictions, but production challenges like monitoring, scaling, and debugging become important. You'll need to retrain models to account for new data. This is the machine learning cycle, and you'll need tools and infrastructure to manage it.
Machine learning is more complex than just training a model. It involves data collection, preparation, and production operations. Setting expectations is crucial. People have many ideas about machine learning, but you need a clear, quantifiable business question. For example, detecting fraudulent transactions with 99% accuracy or forecasting monthly revenue within 1% error. You need data to solve the problem, and if you don't have it, you need to figure out how to get it. Involving a diverse team is essential, including business, IT, and operations people. Red flags include vague requests like "see what this technology can do" or "we have tons of data, surely we can do something with it." These are not business questions. Define clear metrics showing success, such as reducing fraud detection errors by 5% each quarter. Machine learning is iterative, so define what success looks like but be reasonable. Technical and business metrics are both important.
Assess what you really need based on your skills. Can you build and manage a dataset? Do you have the tools and skills to write and tweak algorithms? There's a wide range of services, from fully managed to do-it-yourself, so figure out where you stand. If you have to pick between cost, time, and accuracy, you only get to choose two. A less accurate but quick and inexpensive solution might be enough to get started and build a proof of concept. Improving accuracy becomes increasingly difficult and costly, so know when to stop. State-of-the-art models are exciting, but be pragmatic. Techniques like transfer learning and AutoML are useful, but hype-driven machine learning is dangerous. Use best practices, as machine learning is software engineering. Use development tools, QA, documentation, and agile techniques. Experiment in production as often as needed, and iterate. Start small, with simple datasets and algorithms, and learn from each iteration.
Machine learning can be complex, but focusing on the business problem, your skills, and your objectives makes it clearer which tools to use. AWS has many machine learning use cases, from product recommendations to backend tasks like logistics and customer service. The secret is customer obsession and machine learning, making everything efficient. We help you use our services to build smarter applications. We have tens of thousands of active customers in all industry segments, from large enterprises to startups and public sector organizations. You can explore customer use cases and quotes at this URL.
We have built many services over the years, organized into three layers. Today, I'll focus on the top and middle layers, with a few words about the bottom layer. The top layer, AI services, provides high-level services where you don't need to do any training. You just call a cloud API, pass your data, and get an answer. These are based on pre-trained models optimized on large datasets. In this layer, we also have services like Amazon Personalize, Forecast, and Fraud Detector, where you can bring your own data and train models with simple APIs, without needing machine learning knowledge. I'll show you a quick demo of Amazon Personalize, which helps build recommendation models. You can also do application profiling with CodeGuru, detecting performance problems and bottlenecks in production. For the strictly development part of the software engineering process, this is a pretty cool service.
If you're completely new to machine learning, AI services are probably where you want to go. Now, ML services are one level down. The rationale for this set of capabilities, which are part of a very important service called Amazon SageMaker, is that we want to give you full control over the machine learning process. You can bring your own datasets in whatever format and your own machine learning code based on whatever open-source library you use or maybe your own custom code. Full control, full flexibility, but without having to manage any infrastructure, as we will see. Infrastructure becomes completely transparent. You don't have to spend a minute on it and can focus on the machine learning problem. On top of that, we bring high-value machine learning capabilities that help you solve and go quickly through that machine learning cycle. Training, debugging, data preparation, data labeling, pretty much all the boxes in that cycle are covered. These are very high-productivity tools that help you go faster from experimentation to production.
Frameworks and infrastructure is exactly what the name says. You'll find all the Amazon EC2 instances, CPU, GPU, FPGA, and a custom chip called Inferentia. I'll say a few words about that. All the infrastructure you need to build your own ML platform inside AWS, if that's your goal. We also have software tools, prepackaged popular machine learning libraries inside Amazon machine images for virtual machines and inside containers. You can just pick those tools and fire up whatever infrastructure you want to build, saving you time and getting to work quicker.
Let's start exploring those services. I'm going to do some demos. We're not going to go through all of them, but I've picked a few that I like and think you should like. Let's try a bit of Rekognition first. Rekognition is about analyzing visual content, images, and videos. Here, I'm using the AWS console, a web app that lets you easily use AWS services. It's great for experimentation, learning, and, of course, demos. Everything I'm doing here is based on APIs, so anytime you see me doing something, I'm actually calling an API, which you could call as well in your favorite SDK, whatever language you want to use.
Let's try face detection. I'll pick an image. Here, I'm uploading an image to the service. I could also pick an image hosted in Amazon S3, our storage service, and call the detect faces API. We can see the request here. We pass the image and get a JSON response with bounding boxes and attributes. Every time I do something here, I'm calling a service API, which you could call in Python, Java, PHP, etc. We can see all the faces have been correctly detected, and we see some attributes. This is clearly a machine learning-powered solution based on deep learning models.
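For reference, a minimal boto3 sketch of that face detection call; the image file name is a placeholder, and credentials and region are assumed to be configured.

```python
import boto3

rekognition = boto3.client("rekognition")

# Load a local image; you could also reference an object stored in Amazon S3
with open("faces.jpg", "rb") as f:
    image_bytes = f.read()

response = rekognition.detect_faces(Image={"Bytes": image_bytes}, Attributes=["ALL"])

# Each detected face comes back with a bounding box and a set of attributes
for face in response["FaceDetails"]:
    print(face["BoundingBox"], face["Smile"]["Value"], face["AgeRange"])
```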
Now, we can do another thing with this. Let's try celebrity detection. Here, we're detecting faces and trying to match them to known people—sports players, artists, politicians, business people, movie stars, etc. This is the detect celebrities API. You can compare faces, detect text in images, and more. For example, it detects numbers on jerseys and text in the image, and we get the coordinates in the response.
Image moderation is also fun. It can detect explicit content, suggestive content, and violent content. Let's give it a shot. It says "violence and weapon violence," which is pretty accurate. You get a simple JSON response with labels, so it's super simple to use. We recently added personal protective equipment detection, which is a major concern. If you want an automated solution to ensure workers and staff are properly protected, you can use this feature. You can select face cover, hand cover, head cover, etc., and know whether people are correctly protected. This feature is also available for video analysis.
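A similarly minimal sketch of the image moderation call described above, reusing the Rekognition client and image bytes from the previous snippet:

```python
response = rekognition.detect_moderation_labels(Image={"Bytes": image_bytes})

# Labels such as "Violence" or "Weapon Violence" come back with a confidence score
for label in response["ModerationLabels"]:
    print(label["Name"], label["ParentName"], round(label["Confidence"], 1))
```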
You can also train your own model in a simple way using custom labels. You bring your data to Amazon S3 in a simple format and train your own image detection model without going into deep learning. This is a pretty interesting solution.
Let's try Polly now. Polly is text-to-speech. I'll turn on the sound so you can hear this. Hi, my name is Joanna. I will read any text you type here. This is the synthesize speech API. You pass the text and the voice. Each voice has a name and corresponds to a certain language. Let's try Norwegian. You call the API, say, "I want to say this using Liv's voice," and that's it. You can save the output to an MP3 file or get a byte stream to play immediately. It's super simple and has many voices.
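Here is what that call looks like with boto3; the voice, text, and output file name are just examples.

```python
import boto3

polly = boto3.client("polly")

response = polly.synthesize_speech(
    Text="Hi, my name is Joanna. I will read any text you type here.",
    VoiceId="Joanna",
    OutputFormat="mp3",
)

# The audio comes back as a byte stream; save it to an MP3 file or play it directly
with open("speech.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```

Selecting the neural engine discussed next is a matter of also passing Engine="neural" with a voice that supports it.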
We have different engines: the standard engine and the neural engine, which generates a more natural-sounding waveform. Let's give it a shot. Hi, my name is Amy. I will read any text you type here. It sounds more fluid and human-like. Let's try a longer piece of text with the standard voice and then the neural voice.
Barcelona presidential hopeful Tony Freixa has admitted that club captain Lionel Messi will have to lower his current wage at the club to renew his contract. The Argentine superstar's contract expires at the end of the current campaign, and he will be free to negotiate terms with other clubs from January 1st, should he not renew his deal ahead of that point.
The neural voice sounds more fluid and human-like. Another cool feature is SSML, a markup language that lets you specialize how the speech is spoken. You can use the newscaster style for a more professional tone. Let's give it a try. You can hear the difference; it sounds like a news broadcast.
If you need to generate sound or voice messages for your applications, Polly is a very simple and quick way to do that with very good quality.
Now, let's look at speech-to-text with a service called Transcribe. Transcribe has real-time transcription. We can run batch jobs by putting sound files in S3 and getting transcriptions, but we can also do it real-time. I'll stick to English. We have more languages for batch transcription, but we're only supporting four for real-time at the moment. Let's give it a try.
Hey, good morning, everyone. My name is Julien. I'm very happy to talk to you this morning and introduce you to machine learning. I hope you learn a lot and can start your own projects very soon. This is pretty accurate, and I'm quite satisfied with this demo.
From an API perspective, we're streaming data to Transcribe and using natural language processing for live transcription. You can imagine the applications: live captioning, live transcriptions for audio and video content, transcribing university classes, and more. By the time the class is over, there's a transcription available. You can share it with other students or archive it. Many options here, like customer conversations and meeting notes. If you have Transcribe, you don't need meeting notes. You can focus on the meeting itself.
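The streaming API used in the live demo takes a bit more plumbing, but the batch workflow mentioned earlier is only a few lines with boto3; the bucket, key, and job name below are placeholders.

```python
import boto3

transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="devoxx-demo-job",
    Media={"MediaFileUri": "s3://my-bucket/talks/intro.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
)

# Poll for completion; the finished job includes a URI pointing to the JSON transcript
job = transcribe.get_transcription_job(TranscriptionJobName="devoxx-demo-job")
print(job["TranscriptionJob"]["TranscriptionJobStatus"])
```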
There's also a medical version of Transcribe for medical content, specialized with medical vocabulary. Same idea, batch or real-time, but trained to understand medical jargon.
Translate is pretty obvious. It's translation, and we can do batch and real-time translation. Translate will automatically detect the input language. Let's give it a try. I'll start typing and see if it translates to Ukrainian correctly. Let's check what I said and translate it back to English. That's what I meant. Hopefully, this makes sense.
Translate is super simple and supports many languages, including some rare ones. We have variants like Canadian French and Mexican Spanish. You can combine these services. You can transcribe with Amazon Transcribe and feed it into Translate. Now, not only can you transcribe your meetings or classes, but you can almost instantly translate them into many languages.
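For reference, a single translation call with boto3 looks like this; the input text is an example, and "auto" lets the service detect the source language.

```python
import boto3

translate = boto3.client("translate")

result = translate.translate_text(
    Text="Good morning, everyone, and welcome to this machine learning session.",
    SourceLanguageCode="auto",   # let Translate detect the input language
    TargetLanguageCode="uk",     # Ukrainian
)
print(result["TranslatedText"])
```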
Let's try Textract, an OCR service with a twist. It can extract text from scanned documents and PDF files, but it can also extract forms and tables. I'll upload a document to the service or use one in S3, call the appropriate API, and get a JSON response. It's a research article with illustrations and tables. We detected all the text correctly and kept the table structure. This is crucial for processing documents and keeping the data structure.
Let's try a form, like an insurance form. The text is correctly detected, and more importantly, the form fields are associated with their labels. You can insert this into your database and do whatever you need with it. The table is also correctly detected. This service makes it easy to solve the problem of extracting information from documents.
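Here's a minimal analyze_document sketch with boto3, asking for both forms and tables; the file name is a placeholder.

```python
import boto3

textract = boto3.client("textract")

with open("insurance_form.png", "rb") as f:
    document_bytes = f.read()

response = textract.analyze_document(
    Document={"Bytes": document_bytes},
    FeatureTypes=["FORMS", "TABLES"],
)

# The response is a list of blocks: lines, words, key-value pairs, tables, cells...
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```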
Let's quickly look at Personalize. The philosophy of Personalize, Forecast, and Fraud Detector is very similar. If you understand one, you understand all three. I have a dataset called the MovieLens dataset, popular for movie recommendations. It shows user-item interactions, like user 298 liking item 474. I want to build a movie recommendation model. Recommendation is a complex problem, but Amazon Personalize makes it easy. You bring the data, upload it to S3, call APIs to train a model, deploy it, and predict.
The steps are simple: upload the data, create a solution (a recommendation model), launch a campaign (deploy the model), and predict. Each step is one API call. You can stream data from mobile or web apps, create a user-item interaction dataset, and add metadata for increased accuracy. Then, train a solution, give it a name, select a recipe, and deploy it to a campaign. You get a prediction API, which you can invoke to get recommendations for users.
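Once a campaign is deployed, getting recommendations is a single call on the personalize-runtime client; the campaign ARN below is a placeholder.

```python
import boto3

personalize_runtime = boto3.client("personalize-runtime")

response = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:eu-west-1:123456789012:campaign/movie-recs",  # placeholder ARN
    userId="298",
    numResults=10,
)

# Each recommended item comes back with an item ID and a relevance score
for item in response["itemList"]:
    print(item["itemId"], item.get("score"))
```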
Forecast works similarly for time series data, and Fraud Detector works for building fraud detection models. These services quickly get your models into production.
Now, let's look at machine learning services, primarily Amazon SageMaker. SageMaker was launched at re:Invent three years ago and has added many capabilities. I'll stick to the simple ones today. Remember the machine learning cycle: data preparation, building datasets, training models, debugging, optimizing accuracy, and deployment. SageMaker covers the full spectrum, from experimentation to production.
Some customers need only specific parts. For example, you might train in the cloud and deploy on-premises, or you might have an existing model and want to deploy it in the cloud. SageMaker supports these workflows. We build the end-to-end experience but also ensure you can use each module independently.
Let's overview the modeling options. The first option is to use pre-trained models from the AWS Marketplace. You can find a model that fits your needs, deploy it in SageMaker, and experiment. Most models are free to test, and you pay a fee if you deploy them.
The second option is SageMaker Autopilot, an AutoML feature. Bring your data to S3, define the target value, and Autopilot will build a model for you. You get auto-generated notebooks showing the process, and you can replay them manually.
The third option is to use built-in algorithms. We have 17 algorithms covering typical problems. You don't need to write machine learning code; just select an algorithm and use it to train.
The fourth option is to bring your own code. If you have existing code for TensorFlow, PyTorch, or XGBoost, you can run it on SageMaker. Training and deployment activities are container-based, and we open-source the framework containers. You can inspect, customize, and push them back to AWS.
The last option is full customization. If you need other languages like R or C++, or a custom Python container, you can build it, push it to AWS, and use it on SageMaker. Training and deployment are fully managed, and you can use spot instances to reduce training costs by up to 75-80%.
Let's see how this works. This is the SageMaker console, and I'm using SageMaker Studio, a web-based IDE for machine learning based on Jupyter. I'll run an example using XGBoost to predict house prices based on the Boston Housing Dataset. It's a regression problem, and we want to predict the median value of houses based on features like crime rate, pollution, distance to the city center, and tax level.
We read the CSV file with pandas, display the first five lines, and see the features. The median value is the target we want to predict. The metric we use is how close we get to the actual prices. This is a reasonable problem to understand, and it's a good example of how SageMaker can help you solve it. Since this is a regression problem, and XGBoost is a good algorithm for regression, we'll use it. XGBoost is one of the built-in algorithms in SageMaker. I get the name of the XGBoost container for the AWS region I'm using, which is EU West 1 in Ireland. This is all I need to do to set up the container.
Next, I configure the training job using the SageMaker SDK. I pass the container, the role (which handles permissions for accessing S3), and decide on the instance type. I'm using an M5 large instance, which is suitable for this simple training job. This is the only infrastructure decision I need to make. No need to set up EC2 instances, VPCs, or subnets—SageMaker takes care of it all.
I set the output location for the model in S3 and configure the training job parameters. For regression, I set the appropriate objective for XGBoost and decide to train for 200 rounds, with early stopping if accuracy doesn't improve. I define the training and validation channels, specifying the S3 locations and data format (CSV). Then, I call the `fit` method on the estimator to start the training job.
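Putting those steps together, here is roughly what the notebook code looks like with the SageMaker Python SDK (v2); the S3 prefixes, container version, and hyperparameter values are illustrative.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()          # permissions to read/write S3, etc.
region = session.boto_region_name              # e.g. eu-west-1
bucket = session.default_bucket()

# Built-in algorithm: just look up the XGBoost container for this region
container = image_uris.retrieve("xgboost", region, version="1.0-1")

xgb = Estimator(
    container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",               # the only infrastructure decision
    output_path=f"s3://{bucket}/boston-housing/output",
)

# Regression objective, 200 rounds, stop early if validation error stops improving
xgb.set_hyperparameters(
    objective="reg:squarederror",
    num_round=200,
    early_stopping_rounds=10,
)

train_input = TrainingInput(f"s3://{bucket}/boston-housing/train/", content_type="text/csv")
val_input = TrainingInput(f"s3://{bucket}/boston-housing/validation/", content_type="text/csv")

xgb.fit({"train": train_input, "validation": val_input})
```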
From a machine learning perspective, we've done very little. We put CSV files in S3, selected XGBoost, and told it what to do. When we call `fit`, SageMaker handles the infrastructure. It launches the instance, copies data from S3, pulls the XGBoost container, and starts training. We can see the training log, which is also available in Amazon CloudWatch.
The training job completes after 47 rounds, with the best model at 37 rounds. The model is saved in S3, the instance shuts down automatically, and we only pay for 59 seconds. Using spot instances could reduce costs even further, potentially by 70% to 75%.
Now that we have a model, we can deploy it using the deploy API. I select a simple instance for deployment, and after a few minutes, the endpoint is ready. The endpoint is an HTTPS URL that can be invoked using any language or tool to get predictions. I call the predict API with some test data and get the predicted house price, which is 23.9. When I'm done, I delete the endpoint to stop paying for it.
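The deployment and prediction steps, continuing the sketch above; the instance type and test sample are examples.

```python
from sagemaker.serializers import CSVSerializer

# Deploy the trained model to a real-time HTTPS endpoint
predictor = xgb.deploy(initial_instance_count=1, instance_type="ml.t2.medium")
predictor.serializer = CSVSerializer()

# One sample from the test set, passed as a CSV string of features
sample = "0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98"
print(predictor.predict(sample))

# Delete the endpoint when done, to stop paying for it
predictor.delete_endpoint()
```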
We could do the same with a scikit-learn script. Using the same dataset, I train a linear regression model with scikit-learn code. I use the `sklearn` estimator in SageMaker, pass my training code, and handle hyperparameters and data locations using environment variables. The scikit-learn code is vanilla, with minimal changes to work with SageMaker.
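In script mode this looks roughly like the following sketch; the entry point script name is hypothetical, and inside it the vanilla scikit-learn code reads its input and output paths from SageMaker environment variables such as SM_CHANNEL_TRAINING and SM_MODEL_DIR.

```python
from sagemaker.sklearn import SKLearn

sklearn_estimator = SKLearn(
    entry_point="sklearn_boston_housing.py",   # hypothetical training script
    framework_version="0.23-1",
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    hyperparameters={"normalize": True},       # passed to the script as command-line arguments
)

# The script receives this S3 location through the SM_CHANNEL_TRAINING environment variable
sklearn_estimator.fit({"training": f"s3://{bucket}/boston-housing/train/"})
```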
For customers who want to build and manage their own infrastructure, SageMaker offers various instance types, including CPU, GPU, and FPGA. We also have a custom chip called Inferentia for accelerating machine learning predictions at high throughput and low cost. You can use Amazon EC2, ECS, and EKS to build your training and deployment clusters, with optimized machine images and deep learning containers.
For more information, visit ml.aws for a high-level overview. For SageMaker-specific details, check the SageMaker service page, the GitHub repo for the SDK, and the notebook examples. My YouTube channel and blog have more content, and my book on SageMaker is available with a discount. The discount link is valid for a few more days, so don't wait if you're interested.
Hi, Julien. Thank you for waiting. If you have a few minutes for a Q&A session, you can join the Zoom room to meet the participants and answer their questions.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.