A whirlwind tour of Amazon SageMaker (July 2019)

August 02, 2019
Going through every single feature added since launch at AWS re:Invent 2017!

Transcript

So here we go. How many does it say? That's impossible. No, what? Let me see that. Oh, we're live. We're live. We went live. Okay, great. Hi. Hi, everyone. Sorry, I missed the notification on that one. So my name's Nikki. I'm a senior technical evangelist. I have the absolute pleasure and honor of being joined today by Julien Simon. Hi, everybody. Machine learning expert and our principal evangelist for machine learning. So he's actually going to do something really cool today that we haven't had the opportunity to do yet. He's going to go through what SageMaker is, you know, why we made it, and everything that's launched from the day that it launched, which was 2017. So if you feel like you've missed out at any point in time, or you're just joining the SageMaker train and you want to know its full capability, this is the session for you to tune into. Absolutely, we're going to cover pretty much all the new features that were added to SageMaker since re:Invent 2017. So it's over a year and a half ago. So we have a long list. So let's get started. So who should watch? Can anybody watch? Anybody follow along? Or are you looking for a certain kind of developer? Well, I guess everybody can follow along. We're going to try and keep things reasonably simple, because that's really why we built Amazon SageMaker: to make machine learning accessible and simple for everybody, from new machine learning practitioners all the way to experts. So yeah, stick around. Please ask all your questions. Nikki is keeping an eye on questions. She will send them to me later. So yeah, stick around, and if you're already using SageMaker, then it's an opportunity to catch up on the new stuff. If you've never used SageMaker, if you don't know what SageMaker is, we're going to explain what it is and what the different modules in there are. And hopefully, you'll want to try it. And I'm going to show you some examples. I won't really run any demos today, but I'll show you some and basically how to get started. Sounds great. So everybody can watch. Seriously, if you're not familiar with machine learning or SageMaker, feel free to tune in and keep asking your questions. I will be watching that chat live and I will stop him at every single point that you guys ask something that is important. So yeah, we definitely encourage audience participation. So without further ado, what is SageMaker? What are we talking about? So SageMaker is a fully managed service for machine learning. And that says it all, really, but I've got to dive a little deeper. So let's maybe rewind to how we used to do machine learning a few years ago. So a few years ago, you would build a data set; you would start from whatever data was meaningful to solve your problem. And that was already, sometimes, a daunting task, right? Building the data set. Finding the data. Yeah, because there's a big difference between having data and having a data set, right? So data is in your backends, you know, MySQL and Postgres or whatever you use, or maybe web logs, you know, Apache somewhere. So that's data, and it's in raw form: it's unclean, it's incomplete, you know, it's messy in all kinds of ways. And then you need to build that into a data set that's good enough to train on. So we would do that, and that was already quite an involved process. And then we would write, you know, bespoke code using libraries, or maybe even custom code, to train on. And we would try this on dev infrastructure, so whatever was on the desktop or whatever we had in the closet, right?
Most of the time, that was in the closet. You have a server just chilling in your closet somewhere? Yeah, just warming the office in the winter and warming the office in the summer as well. Got it. So that got interesting. Then we would try to get some kind of early model, and we would try then to put that on prod infrastructure, train at scale, and basically you would have to manage all the infrastructure yourself, all the way to deployment, and we'll talk a lot about deployment today because I believe it's actually the hardest part. And we would have to manage everything, really. So it was heavy lifting from beginning to end, spending a significant portion of your time on non-machine learning tasks. And even the machine learning tasks were not that easy. And if someone like me, who's just a regular software developer, wanted to get started with machine learning, how difficult would that have been in those days? Maybe five, seven years ago, the answer would be: pretty difficult. I had no chance, guys. Yeah. So you would need to do all that work, and you would need a solid background in computer science and machine learning and stats and math, because again, chances were you would write your own algos or you would tweak pretty deeply what those algos did. And so the typical setup back then was you would have a machine learning team or a data science team, or whatever the name was for those guys at the time, and they would build the models in their sandbox and then they would literally throw them over the fence to product teams or engineering teams, who would treat those as kind of a black box, try to integrate them into apps and deploy them, etc. And I was that guy. That was my team back in the day. And I got a lot of stuff thrown over the fence. And then I would put it in production on quite a bunch of web servers and everything would be on fire. And then we would have to debug it. It was a painful conversation, because there was a gap in vocabulary and skills and everything, really culture, between the data science guys and the engineering guys. And that's what SageMaker is about: closing that gap, allowing software developers who have very little data science and ML background. Yeah, you and the other 20 million people. So we want to help all of you out there do machine learning, simplifying the machine learning part, giving you built-in algos, we'll talk about that, and freeing you completely of infrastructure constraints. So zero infrastructure work. And I'll stand by that; I'll show you the actual code, a couple of examples, later on. Zero infrastructure work, whatever the scale is. So that's SageMaker. Focus on machine learning 100% of your time, working on the data, working on the algo, getting the best accuracy, no infrastructure work, and in most cases, not a lot of machine learning expertise needed. So again, all the newbies out there. You said all the right things. Stay there. You're going to learn. So how many new features have been released for this service since it launched in 2017? So it's a lot. I mean, you know how we build services, right? We listen to customers. We start from what customers need. We build the first version of the service. We have this MVP, minimum viable product, mentality. So we put it out there quickly, get some feedback, and then iterate, iterate, iterate and never stop. Some services like S3 or EC2 have been iterating for more than 10 years, so the wheel never stops. We're doing the same for SageMaker.
Globally, last year in 2018, we released over 200 new machine learning features. It's almost one every day if you exclude weekends. We've got to rest sometimes. Sometimes. Sometimes. Occasionally. So that's a whole lot of features, and quite a few of those are actually SageMaker features. So yeah, as you will see, it's across the spectrum and it's a lot of new stuff, and we're not done, far from it. So are you ready to dive into these launches? So let's go. Let's do it. Have a cup of coffee or 12. We're diving in. We're going to be here for a while. Okay. The first set of launches is around security. Of course. So what did we launch around security? Yeah, and it could be a surprise, right? It's like, why are those guys talking about security? They said they'd talk about machine learning. But guess what? Machine learning starts with data, right? And of course, all of you out there know it, and all of us, right? Everyone would ask that first question: hey, wait a minute, you want me to put my data on the AWS cloud? And it could be very sensitive data. If you're building, let's say, healthcare applications, it's going to be healthcare data. So that's pretty sensitive. If you want to do fraud detection, well, that's customer transactions. It's very, very, again, sensitive data. So it's got to be safe, right? Yeah. So we say it all the time and it also applies to machine learning: security is job zero. I mean, we're not getting anything done until we have security figured out. And so SageMaker is about training and deploying models. So it's still based on EC2 instances. They're fully managed, of course, but features related to security, to network security and encryption, etc., will be there. So for example, we integrated SageMaker with the VPC, so Virtual Private Cloud, which is your own little piece of the AWS cloud with network traffic completely private inside that VPC. So you can launch your SageMaker instances inside your own VPC. You can apply VPC policies. So if you deploy your SageMaker endpoints, so if you deploy your models to APIs hosted in your VPC, you can actually control who accesses the endpoint using IAM, Amazon IAM policies, applying all kinds of restrictions or access control based on IP addresses and time of day and whatnot. So anything that IAM allows you to do, you can also apply on SageMaker. So this is really important because a lot of customers want to train and deploy inside their own VPC. They want their infrastructure to be completely private. We got a funny question. Of course. Let's have the questions. Richard H. Boyd says, can SageMaker predict how bad pineapple pizza will taste? I'm losing my voice. Absolutely. So the answer is yes. So Richard would have to build a pizza dataset. So I would recommend trying pizza places across the US, I guess. Right. And probably around Europe, because I can see there's a joke in there, because it feels like Europeans have a stronger taste for pineapple pizza. Is that what you mean, Richard? So you go and try out a whole bunch of... I love pineapple pizza. What is the problem with that? It's horribly wrong. Sorry. We'll do a pizza session later on. So go and try out all kinds of pizza places, and build a data set from that. So maybe the features in the data set would be, you know, time of day and everything.
what date you had that pizza, and the location, so maybe the zip code, and how much pineapple was actually on the pizza, because I suspect that's an important feature, and how many ounces or grams of pineapple, depending on where you live, you have on that pizza, and a few more features like that. And what did you drink with that pizza? Because if you have a Coke with the pizza, it's probably different than if you have different flavors. Yeah, a glass of red wine, so what did you drink, blah, blah, blah. And then maybe rate your experience from 1 to 5 or 1 to 10, and you would go and have a few thousand pineapple pizzas. And that would be enough to build a classification or a regression model, actually, that would predict, from a new sample, you know, the 1 to 5. Great. So, yeah, absolutely. How good that pizza will be. Yeah. So that's typical. And it's a very cool question, because it goes to show: if you have a clear business question that you want answered, you should start from there, express the problem in a single sentence that people want to figure out, then find what data is necessary to answer that, then build a data set, then go and train, and then go and predict. Right, and of course if you had a billion samples for that, that's a whole lot of data. That's a whole lot of servers. That's a whole lot of work if you want to do this manually. Let's have a pizza streaming session where we actually try like 10 different kinds of pizza and then we create our own little tiny data set. I'll be back for that. All right. We're gonna do that. Thanks Richard. Good idea guys. Okay, so continuing on the security. Yes. What else is a part of SageMaker? What else did we launch? So network and access control is important. Encryption is paramount. We keep hearing Werner saying encrypt everything, encrypt everything. Yeah, the shirts all say encrypt everything. Exactly. And Steve Schmidt, our chief security officer, says the same. And he says when you encrypt everything, you actually make your job easier and you make our job easier. Because should anything bad happen, there's a huge difference between leaking clear text data and leaking encrypted data. So encrypt everything, encrypt everything. And SageMaker is fully integrated with our encryption service called Amazon KMS, Key Management Service, that lets you encrypt your storage, basically, either with Amazon-provided keys or your own keys, depending on your security requirements. And so we can encrypt all data at rest on SageMaker, so data stored for training purposes or data stored for prediction purposes. So you can fully encrypt all your storage. We can also encrypt communications between instances. That's an important use case if you do what we call distributed training. So if you use multiple training instances that collaborate on a single training job because you have a lot of data. Like parallel training? Exactly. Yeah, you basically have those parallel instances that all work on a piece of the data set and then they pool the results. They come together. Exactly. So of course they need to exchange information to make sure they are synchronized and everything. And so we have encryption there as well. So that ongoing traffic during the training process can be encrypted. That's awesome.
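For reference, here's roughly what those network and encryption options look like when you configure a training job with the SageMaker Python SDK. This is a hand-written sketch rather than code from the stream: the role ARN, subnets, security groups, bucket and KMS key are placeholders, and the parameter names are those of the SDK version of the time, so check the current documentation before using them.

```python
# A sketch of the security options discussed above, using the SageMaker Python SDK.
# All ARNs, subnets, security groups, buckets and the container image are placeholders.
from sagemaker.estimator import Estimator

role = "arn:aws:iam::123456789012:role/MySageMakerRole"        # placeholder IAM role
kms_key = "arn:aws:kms:eu-west-1:123456789012:key/EXAMPLE"     # placeholder KMS key

estimator = Estimator(
    image_name="123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-algo:latest",  # placeholder container
    role=role,
    train_instance_count=2,                   # two instances collaborating on one job
    train_instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/output",      # placeholder bucket
    subnets=["subnet-0abc1234"],              # train inside your own VPC
    security_group_ids=["sg-0abc1234"],
    train_volume_kms_key=kms_key,             # encrypt the training volumes at rest
    output_kms_key=kms_key,                   # encrypt model artifacts written to S3
    encrypt_inter_container_traffic=True,     # encrypt traffic between the training instances
)
```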
And I guess last but not least, we can also restrict permissions on your dev environment. So SageMaker has those things called notebook instances. And notebook instances are really what the name says, right? It's a fully managed instance pre-installed with a dev environment, and that's for data science and machine learning. Jupyter is the number one thing. So Jupyter is pre-installed. Machine learning and deep learning libraries are pre-installed, so you get your scikit-learn and your TensorFlow and your Apache MXNet and so on. And all the SageMaker libraries as well. And the SageMaker libraries and the SDK and Boto3, pretty much all in one. You can fire up those instances and get to work. I guess we'll see some of that when we talk about training. We got a good question here from ZH Jinx. Does SageMaker have a built-in UI or do I need a front-end dev to build that? That's a good question. Just like for any other service, you can use SageMaker with the AWS console. We're going to show that. Yeah, we're going to show that. I think it's fine if you're learning about the service, if you're experimenting, it's okay. I mean, generally, we think for production you should not use the console, because it's hard to trace what you've done, it's hard to automate, etc. But for learning and experimentation, it's perfect. So you can actually, you know, I know some people who work with the console and that's fine. The preferred experience would be working in the notebook using the SageMaker SDK, which is a Python SDK, to get training going and deployment going, et cetera. And then eventually you will deploy your model, and if you deploy to an HTTPS endpoint, then it's just a regular API that can be integrated with your apps. You can also do batch prediction. We'll talk about that. And you still can put security on that endpoint. Exactly. So coming back quickly to those notebook instances: of course, they're Jupyter instances, so you can open a terminal and you can add some extra libraries if you want, pip install, whatever. And they're connected to the internet, because you may want to download data sets, et cetera. Again, in some environments, that's too permissive. So you can actually restrict, you can shut down internet access for notebook instances. If you're concerned that maybe somebody in the company would do something silly and leak some of your data to the internet, you can cut that off. And you can also prevent root access on notebook instances. So from, I would say, network access and access control, to encryption, to user permissions, we think we have a good set of security features, and we actually have a very good blog post on building a secure data science environment. I think that's pretty much what the title is. Another question here from Madhuji. Can we use SageMaker for academic projects? Is it included in the free tier or in the education tier? So SageMaker is absolutely part of the free tier. Just go to the AWS free tier page. I think it's something like 450 hours. Yeah, it's a few hundred hours, but I would say you can only use the smaller instances. So I would say, once again, for education, personal education and experimentation and just learning the service, it's certainly fine. I don't think you could run actual projects, but there are several initiatives for education and research organizations. To give you that flexibility to use it for academic purposes. Yeah, to give you credits and grants, etc. So yeah, get in touch with us and we can point you to the right people in AWS to get you started there. Awesome. So moving on now, we covered the security features. Let's get to the important stuff. Data. Building data sets.
So, I mean, I'm pretty new to machine learning and I've built a few models now, but I would say the hardest thing and the thing that takes the longest is building the data set before you actually train. Sure. So what have we launched to actually make that easier, simpler, go faster? Yeah. So I would say a lot of people simply work with, you know, let's say, tabular data, right? So they pull some data from whatever relational database they have, or they have CSV files or TSV files, and I guess they still need to do some cleaning, etc., etc., but these are not so bad, right? You need to do some ETL work and you need to do some labeling. Maybe split the data, half for training and half for prediction. Exactly. So there's some work there. Now, people have been doing this for a while, and I guess that's all right, because if you're building that fraud detection data set, let's say, your historical data tells you this transaction was fraudulent. Okay, so there's a one somewhere that says it's fraudulent, a zero somewhere that says it's not fraudulent. So you have that; it's called ground truth. Sorry for the machine learning mumbo jumbo. Ground truth means you're literally writing down in the data set that this is the actual truth for this sample. Now, if you work with more complex data like images or... Like media. Yeah, unstructured data. So natural language, images, videos, etc. You need to build a dataset. And let's say you want to build a pizza image model. I have like 100 JPEGs. Now what? Oh, more than that. You have probably 100,000. Okay, 100,000. You have 100,000 pictures of food. It was a smaller dataset. Yeah, all kinds of food, and you want to learn how to detect pizza. So you start from those images, and you need to annotate this data. Let's say you want to do labeling and object detection. So you want a text label that tells you what kind of food is in the picture and where it is. So you want what we call a bounding box around the pizza. Let's say you want to detect pizza and ribs and wings and hummus. Hi, Buzz. Build that hummus/not hummus demo. It's pretty hard. Um, anyway. So you would take those hundred thousand images and label them manually, saying, hey, this image has a pizza, this image has wings, this image has whatever, and then draw a rectangle, literally drawing rectangles. And there are different ways to label, right? Like sometimes it could be putting images in different folders that are labeled different things, and you need to organize the data set and you need to do all kinds of things. Okay, so let me ask you a question now. Given one of those pictures, how much time would it take you to annotate it? Like, decide which category that food is and draw the box, and maybe you have different types of food, maybe there's a pizza and wings. Gonna take me a hot second. Yeah, it would take you, you know, a few seconds, like 10, 20, 30 seconds. Okay, multiply that by about a hundred thousand. Now that takes forever. Yeah. Multiply by a million, multiply by 10 million. Okay, impossible. So we built a service called SageMaker Ground Truth. Okay, it's actually a module inside of SageMaker, and it gives you graphical tools to annotate text data and image data for different use cases. So you can do it yourself: you upload your raw data to S3 and then you build your workforce. The workforce could be you and me annotating stuff, or a team of evangelists annotating stuff.
A private workforce, people you know inside your company, or it could be a third-party company approved by Amazon. A vendor that knows how to annotate your kind of data. Or it could be Amazon Mechanical Turk. He was getting to it. I was waiting for it. Scaling up to 500,000 workers. So if you need to annotate millions of images, that's how you do it. Plus, while humans are annotating, if you want to, we can enable this feature called active learning. And that's going to train a machine learning model that looks at human annotations and tries to learn how to annotate itself. So it can do it in the future. Exactly. So let's say human accuracy is right there. Initially, the model doesn't do a good job. It's learning, it's learning, it's learning. And at some point, it becomes just as good, and maybe better, than humans at annotating, because it learned from those guys. And then it starts annotating like crazy. And all those people are out of a job. And all those people say, okay, we can move on with our life. And you save time, and you can save up to maybe 70% of the manual work. So it just goes faster. It's cheaper as well. And that's how you annotate millions or hundreds of millions of data points. And that's what you need to do for things like autonomous driving or healthcare applications, where there's just tons of images or text that needs to be annotated. So that's how you scale it. Ground Truth. Awesome service. I mean, you have to try it. Really, really cool. And it's kind of funny because you actually have to try it and annotate whatever you like to annotate. It's pretty funny. I really enjoy it. It makes for good demos as well. So let's try it. Definitely check that one out. It'll make you go faster with building your data sets. Absolutely. So we're continuing on down the SageMaker set of launches. What did we launch for notebooks? And talk a little bit about how you can launch a Jupyter notebook on an instance in SageMaker. Tell us a little bit more about that feature and what we launched to help make that easier. So the notebook instance is just a very convenient way to have your dev environment or your experimentation environment ready in minutes. Because of course, we could take your laptop and we could install Jupyter and we could install TensorFlow and MXNet and scikit-learn, and if it had a GPU, then, you know, we would go and install GPU drivers, and that's kind of an experience every time you try and do that. I actually recommend that you try it for yourself, see how enjoyable it is, especially when you need to do it every now and then. So again, we want to help people get started quickly with ML and we want them to focus 100% of their time on ML. So just click in the console or call an API, start a notebook instance in just a couple of minutes, open it, and you jump straight into Jupyter. Okay, so that's the notebook instance. Maybe I can just show what I mean by that. Yeah, why don't we just show you guys really fast, see what it looks like. Okay, so here's the SageMaker console I was talking about. So we see the Ground Truth module here. And notebook instances are here. So creating one, I'm not going to do it live, I just want to show you quickly, is as difficult as entering a name, choosing an instance size, and pretty much clicking on that yellow-orange button here. Okay, that's about it. And a few minutes later, you have an instance, and you open it, and it will jump straight to Jupyter. Yep. Okay.
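The notebook instance creation that was just clicked through in the console can also be done with a single boto3 call, including the internet and root-access restrictions mentioned earlier. A minimal sketch with placeholder names and a placeholder role ARN:

```python
# Minimal sketch: create a notebook instance with boto3, locking down
# direct internet access and root access as discussed above.
import boto3

sm = boto3.client("sagemaker")

sm.create_notebook_instance(
    NotebookInstanceName="my-notebook",                            # placeholder name
    InstanceType="ml.t2.medium",
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",      # placeholder role
    VolumeSizeInGB=5,
    DirectInternetAccess="Disabled",   # no direct internet access from the notebook
    RootAccess="Disabled",             # users cannot become root on the instance
)
```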
And so you could start, you know, just, okay, I want to create a notebook for, let's say, TensorFlow Python 3. And you can see all those different environments are already set up. It's based on Conda. Conda is a package manager for Python. And so you can easily jump from one environment to the next, with everything pre-installed, no chance to mess things up like I do on my laptop. It's like, oh, I want to try the new TensorFlow version. I need to uninstall the previous one. Oh, wait, it needs a different NumPy version than MXNet. Three hours later. Three hours later, I'm on Stack Overflow trying to figure out what I've done wrong. Yep, happens to the best of us, I guess. All of us, actually. And so that's a problem. So I can quickly start one of those. Or I can, if I'm completely new to this, I can... Can you also just really quickly show us what a model looks like if you were to open up an actual one of those? Sure. You have a nice collection of notebooks here. These are pre-installed on the notebook instances. These are all examples and great starting points if you're just trying to write your first model. They're also available on GitHub. If you look for this repo on GitHub, that's the exact same thing. Cool. So really quickly, let's say I'm new to this. Well, you can pull the image classification one and do some pizza. Let's look at, okay, let's have a, oh yeah, we could do image classification, why not? We were talking about this pizza example, right? So let's find... so these are, yeah, these are the image classification examples, quite a few, and we can just, let's say, okay, let's quickly look at that one. Yeah, okay, so just use that example. Okay, and here's the notebook. And it's fair to say, you know, the people who wrote this did a really good job, because it's not just code, I mean, there's a lot of explanations, and that's why I keep pointing people to this. Because of course you should take a look at the SageMaker documentation, which is pretty good as well. But you will really understand how the service works, and you will really learn what machine learning is and what you can build with it, by running those examples. And all of those notebooks have a lot of explanations. I mean, they're really, really meant for people who don't have a lot of experience right now. If you're an expert, you'll fly through that stuff. You will learn the SDK in a couple of hours. If you're new, you need to understand, okay, what's a training set? What's a validation set? What are the hyperparameters? Exactly. And I really recommend this. It's really, really helpful. So that's a notebook instance. So sure, you could absolutely install everything on your local laptop or your local dev server. But I find this is really convenient. Plus, if you have to manage environments for a larger team, right? Let's say you're a DevOps engineer and you need to provide environments for 100 data scientists, that's a very convenient way as well. Yeah, totally. Okay. So we got a question about, is this like Google Colab? So it's based on Jupyter, so of course it's quite comparable. I would say, from a Jupyter perspective, yes. I mean, you can take your Python code and you can move it across different Jupyter environments. You know, it's vanilla Jupyter. We haven't tweaked anything here. So it's really standard. I think the benefit of notebook instances is that, like I said, you can launch them in your own VPC. You can secure them.
So for basic experimentation, you know, sure, when I'm playing around with demos, I'm not paying attention to security, because I'm not using any kind of data that could be sensitive. And the same goes for other environments. But when you start working with real-life problems, real-life data, real-life scale, you need to work in an environment that will let you secure and scale all the way to the moon. And I think SageMaker does that pretty well. Totally. I would agree. So besides just the initial, I want to get a Jupyter notebook running, what other features did we launch around notebook instances? So, um, we added, let me think, there was one... Git integration? Yes. So let me go back to the console. So there's a Git integration where we can actually add repositories, so you can reference repos, you can clone... Yeah, you can add repos, your own standard or internal repos, and make sure they're easily accessible. And when you create a notebook instance, you can pick from those repos as well. So again, if you want to provide even faster and easier access to your own repos, then you can do that. So Git integration works with CodeCommit, which is our service for Git repositories, or it works with GitHub or GitLab, public services like that. Got it. And what about that lifecycle configuration? This is pretty cool, because there was a lot of feedback. So people generally like the notebook instances because they're such a huge time saver. But of course, everybody likes to set up their environment in their own specific ways. It's like, oh, I need this specific Python library, or I actually have company code that I want to use. So when I'm working with a notebook, I don't want to have to go and pip install stuff in my notebook. Like 10 different things just to get my environment set up. Yeah, people will forget, you know, and then it's not so convenient. So lifecycle configurations are pretty similar to user data on EC2, if you're familiar with that. So user data on EC2 is a small script that you can pass to EC2 instances that they run when they're started or resumed. And you can do the same for... no, started only. Started or created? You can do it for creation and resume. So you could say, well, every time I'm starting my notebook instance, I want to install those Python libraries. And maybe every time I resume my instance, because you can stop and resume notebook instances, every time I resume, maybe I want to refresh my repo. So I want to git pull, whatever. Got it. Again, automation, saving you time, keeping things simple, and letting you focus on working on the data and not setting up environments. Super, super cool feature. I really, really like that one. I didn't know you could do that. Also, when you're building a model in a Jupyter notebook, a lot of the stuff that you're doing, like if you run the different cells, that information will get cached on that instance and it'll be there. And if you stop the instance, it goes away. And so those are helpful scripts to maybe help you put some of the pieces back. Yeah, sure. If you just want to make sure you always have the freshest and most up-to-date environment, you can just clean stuff. Every time I resume my instance, I want to maybe stop my notebooks and clean stuff generally and make it look nice. You can do that.
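A lifecycle configuration is really just a script (or two) attached to the notebook instance. Here's a sketch of what that looks like with boto3; the script content, the package name and the repo name are made up for illustration:

```python
# Sketch: a lifecycle configuration whose OnStart script runs every time the
# notebook instance is started or resumed. Package and repo names are placeholders.
import base64
import boto3

on_start = """#!/bin/bash
set -e
sudo -u ec2-user -i <<'EOF'
pip install --upgrade some-internal-library      # placeholder package
cd ~/SageMaker/my-repo && git pull               # placeholder repo to refresh
EOF
"""

sm = boto3.client("sagemaker")
sm.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="my-lifecycle-config",
    OnStart=[{"Content": base64.b64encode(on_start.encode()).decode()}],
)
```

You then reference the configuration by name when creating or updating the notebook instance.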
Looks like we got a question from Yusuf AWS. Can I create a notebook AMI? I could harden it and circulate it amongst data scientists. Okay, so these are managed instances running AMIs, obviously. We maintain that AMI. So you cannot customize the AMI used on the notebook instances. If you have that kind of requirement, then I would recommend using the Deep Learning AMI, which is accessible on the AWS Marketplace. There's an Ubuntu version, there's an Amazon Linux version, and they also come with Conda environments, so pretty much the same thing. And this is a proper AMI that you can either use as is, or of course you can fire it up, customize it, save it as your own AMI and distribute it. So if you have that kind of requirement, I would recommend starting from the Deep Learning AMI and customizing it. But when it comes to SageMaker, we do that work for you. Totally. So moving on. Yes. So now I'm in my notebook and I need to use an algorithm to train my model. What are we offering around our built-in algorithms or other algorithms or libraries that I could be using? How did we help support developers there? Okay, so early on, you know, I said stick around if you're new to this, because we have built-in algos. Yeah. And that's what I mean. So we realize, again, a lot of organizations need to use machine learning, but they don't have a lot of skills, and they cannot hire those unicorn data scientists. It's difficult, and so on. So how about we build built-in algos that are easy to use, that solve the typical, and not so typical, machine learning problems out there. And of course, AWS being AWS, this needs to scale. Okay, totally, because of course a lot of people at first will solve smaller problems, but as you move on, you will start solving bigger and bigger problems, and scaling machine learning is not that easy. It's one of those roadblocks you hit if you use, I would say, the legacy ML way of doing things: you scale up and up and up until you cannot. And writing distributed algos is not for everybody, right? It's very, very hard. I mean, generally distributed programming is difficult; distributed ML is crazy. We're talking about rewriting the training and parallelizing it, and I've done that work, so... we've done this for you. So today we have 17 built-in algos. I think we started with nine when we launched. What were the initial ones? So the initial ones were the classical ones like linear regression. I think KNN was there. PCA was there. Image classification was there, et cetera, et cetera. So classical problems. But again, we made sure you could scale those things. So if you need to train linear regression on a petabyte of data, not everybody's problem, I agree. A petabyte of data. And so someone will say, come on, I've got 10 petabytes, right? I'm waiting for that question. If you need to train on a lot of data, then you need to be able to do distributed training, so training at scale with multiple instances. And of course, you need to be able to stream the data to the training instances, because you're not going to copy even a terabyte of data. You're not going to copy one terabyte of data over the network to the training instance, because you would need one terabyte of storage, and no one wants to pay for that. And it takes a while to copy one terabyte anyway. So we have this feature called Pipe mode, which lets you stream the training data directly from S3 all the way to the training instances. That is so cool.
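Here's a sketch of what Pipe mode looks like from the SDK side, using the built-in Linear Learner algo as an example. The bucket names, role ARN and hyperparameter values are placeholders, and get_image_uri is the helper from the SDK of that era (newer versions retrieve images differently):

```python
# Sketch: distributed training with Pipe mode, so the data set is streamed from
# S3 to the training instances instead of being copied to their volumes first.
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"          # placeholder role
container = get_image_uri(session.boto_region_name, "linear-learner")

ll = Estimator(
    container,
    role,
    train_instance_count=4,                  # four instances collaborating on the job
    train_instance_type="ml.c5.2xlarge",
    input_mode="Pipe",                       # stream from S3 instead of the default File mode
    output_path="s3://my-bucket/output",     # placeholder bucket
)
ll.set_hyperparameters(feature_dim=50, predictor_type="regressor")   # placeholder values
ll.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/validation"})
```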
And that's why we say you can train on infinitely large data sets, because as you are streaming, you don't have a requirement to have a crazy amount of RAM or a crazy amount of storage. You could be training on medium-sized instances and that's okay, because they will get the data batch by batch, or chunk by chunk, and they will train on that and get the next one. Pipe mode is available for several of those algos, and it's pretty cool. And like I said, we added new algos as well. How do you find out which ones Pipe mode is available for? So, yeah, the documentation will tell you that. So if you look at the SageMaker documentation, which is online just like all the others, there's a specific section for built-in algos and a subsection for each algorithm that tells you what input format the algo expects for training and prediction, what's the recommended instance type for it, and also does it support distributed training, does it support Pipe mode. So the doc will tell you that. And so we added algos, of course, because some customers said, yeah, that's pretty cool, but we have other problems. Right. And so now we're starting to get into the more exotic stuff. And I think this is part of the unique appeal of SageMaker. You have algos like DeepAR. So DeepAR is actually an Amazon-invented algo. It was published. So Amazon research. Woohoo. Well done, guys. And what is this algorithm for? DeepAR is for time series. So predicting, forecasting, basically. So a lot of companies, a lot of organizations have time series data, and it's all around us. And predicting that with the right level of accuracy is not easy, especially at large scale. So DeepAR will actually be able to take many time series in parallel. So imagine you're trying to forecast demand for a thousand different products. You'll actually be able to build a single model based on those thousand time series. So if you have common patterns between the time series, like if you're selling skiing equipment, there's probably some kind of relation between skis and ski boots. And also time of year. Exactly, time of year, etc. So DeepAR will do that. It's a great algo. It's part of SageMaker. Another one is called BlazingText. I like the name of this one. It's got a cool name, BlazingText. It's a natural language processing algo. It improves on an algo called FastText, which was designed by Facebook, which is a really cool algo. Super cool. Blazing is faster than fast. We enhanced it. We enhanced it, so you can now train NLP models on GPUs, and FastText could only do the same on CPU. And it stays compatible. So FastText, if you work on NLP applications, of course you know FastText. And it's fully compatible. And we keep adding more, like anomaly detection algos and deep learning algos for object detection and semantic segmentation, etc. Yeah, the names are... yeah. Random Cut Forest. Random Cut Forest, yeah. I could talk about that one. What is that one? You don't want to know. Okay, I don't want to know, apparently. Who gets to decide the names of the algorithms? Whoever invents them, right? Whoever invents them. But I think half the fun in doing algo research is getting to pick the name. Got it. I really want to meet the creator of Random Cut Forest. Random Cut Forest, exactly. So it's an anomaly detection algo. And again, this is a pretty hard problem. And we think we make it much simpler by providing you those built-in algos. So the way you work with those is they're off the shelf.
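The built-in algos also come with convenience classes in the Python SDK (the container-based way of using them is described next). A sketch with Random Cut Forest, the anomaly detection algo just mentioned, trained on random placeholder data; the class name and hyperparameters are as I recall them from the SDK of the time, so verify against the documentation:

```python
# Sketch: the Random Cut Forest built-in algo via its SDK convenience class,
# trained on random placeholder data just to show the shape of the API.
import numpy as np
from sagemaker import RandomCutForest

role = "arn:aws:iam::123456789012:role/MySageMakerRole"    # placeholder role

rcf = RandomCutForest(
    role=role,
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    num_samples_per_tree=512,       # algo hyperparameters
    num_trees=50,
)

data = np.random.randn(10000, 3).astype("float32")         # placeholder data set
rcf.fit(rcf.record_set(data))       # converts to the algo's format, uploads to S3, trains
```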
So you literally, with the SageMaker SDK, you select your algo. And it's all based on containers, by the way. So that really means giving the name of that container, setting some parameters, and a few lines of code to set up training, and you're done. So when we talk about training, I'll show you how to do this. But no, you don't need to write a line of machine learning code to do this. So if you're new to ML, again, look at those sample notebooks I pointed you at and start running those examples. Yeah, I did this on Twitch two weeks ago. You will get it. Yeah, you'll get it in no time. So you said you launched with nine initial algorithms and then we launched eight more for use cases. So we have 17 algorithms. And they cover a nice section of the problems you would be facing when working on ML, and deep learning as well. Super awesome. Okay, so let's move on to frameworks. Everybody wants to know about the frameworks. TensorFlow, MXNet... maybe you've heard these framework names before, but you don't know what they are. So how are we supporting built-in frameworks? So this is important because, of course, built-in algos will help you solve quite a few problems. But what about customers who are already doing machine learning, right? I'm sure a lot of you out there, you're already running TensorFlow code, Keras code, MXNet code, PyTorch code, and so on and so on. Totally. So what we built is we have that list of built-in frameworks, so built-in containers for those environments, and you just bring your code. And we have this cool feature called script mode that really, really makes it easy to take existing code, let's say TensorFlow code, and run it on SageMaker, just integrating it with SageMaker with just a few lines of code. And I mean just a few, I'm talking like three, four lines of code to basically plug your code inside the SageMaker environment, receiving the location of a training set and validation set and passing hyperparameters. So script mode will get you up and running in no time. So again, bring your own code, run it on SageMaker. Distributed training is available out of the box. So no need to set up distributed training. Again, that's one of the hardest things. And which frameworks? TensorFlow? Okay, so TensorFlow, Apache MXNet. Keras is supported in both of these. As you probably know, Keras is a high-level API that supports multiple backends, like TensorFlow and MXNet. So you can run Keras in either the TensorFlow container or the MXNet container. And Keras is, if you don't know where to start with deep learning, I think Keras is a good way to start. It's very easy and very well documented. PyTorch is available. Chainer is available. Scikit-learn is a recent addition; we added it not that long ago. So scikit-learn is a popular one for ML. And we have a few more, but these are really the main ones, the ones that everyone wants. And so you could say, well, yeah, sure, okay, nice, but I can just grab a TensorFlow container or an MXNet container from the Docker Hub. And what's the big difference? Again, the big difference is this stuff is integrated with SageMaker. So distributed training and all kinds of cool features we're going to talk about are fully integrated. Infrastructure is transparent, because you just train with a couple of lines of code. Deployment. Yeah, deployment is integrated. And these are not the vanilla versions. Okay, so now we're getting to the really good stuff. So we have dedicated teams. And they're probably pizza teams, right? You know, we're big on pizza.
But they don't eat pineapple pizza. No, I don't guess. No, they don't. Just had to ask. Yeah, they're too reasonable to do that. Yeah, probably. And so we have specific teams who focus on optimizing those frameworks. So, for example, we have a team working exclusively on TensorFlow and making sure TensorFlow runs as fast as possible on AWS infrastructure. I didn't know that. So we've done some benchmarks last year showing that we could train 11 times faster. I said 11, okay? He specifically said 11. 11 times faster on CPU instances compared to the stock version. We also scaled TensorFlow linearly all the way to 256 GPUs, because we found the vanilla version would not scale linearly. So if you doubled the number of GPUs on a training job, you would not get twice the speedup. And so that means basically you're wasting money and your training times are longer. So we tweaked that. And now you can scale linearly up to 256. So that's the kind of work we do. It's not just packaging. It's, in some cases, tearing those things apart, diving very, very deep on how they work, and optimizing the core technology in there to make sure we make the most of our CPU instances, GPU instances, to give customers basically the most bang for the buck, right? All right, so let's get to the good part. Tell me about training. This is the fun part. This is the really, really fun part of building a model, training, and then you get to inference. This is the exciting part, for me at least. So what are our features around training? So training, and if we could please switch to my screen for a second, because I said, you know, it's zero infrastructure work, and I want to make this very clear. He really wants to stand by his word. Absolutely. So this is how you do it. Okay. So, all right, that part. So here we selected the built-in algo for image classification, so that's going to be the first parameter here, and that estimator object is how you configure training jobs. An IAM role. Yeah, an IAM role and this. Okay, oops. That's it. So if you want to train your image classification algo on that GPU instance, p2.xlarge is one of our GPU instance types, it's going to spin it up and then shut it down once it's done. Just say, I need one. If you need 10, just say 10. If you need 50, just say 50. But then it's going to spin all of them up, train them and shut them down. And when you run that cell, actually when we run the next cell, which is fit, this one here... The fit API is the one that gets the training going, receiving the location of the data, et cetera. SageMaker spins it up, pulls the image classification container in this case, injects your parameters, copies the data set, because here it's a tiny one, we can copy it, and the training process starts. The training log is available in CloudWatch and CloudWatch Logs. And then when the training is over, SageMaker shuts down everything and you just get billed for the amount of seconds that you trained. So there's no chance ever that you will leave stuff running for no good reason. And that's again one of the benefits of SageMaker. Not only do you not manage anything when it comes to training infrastructure, you're guaranteed by design that you only pay for the seconds, minutes or hours that you train. And then nothing stays on, and some of us are guilty of that, right? Leaving EC2, EMR running... Guilty as charged. Yeah, except we don't pay our AWS bills, and all those good people out there, they do. So we want to make sure they don't spend their money inefficiently.
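The on-screen snippet used a built-in algo; with script mode and your own framework code the shape is very similar. A sketch with the TensorFlow estimator, where train.py is a hypothetical script name and the parameter names are those of the 2019-era SDK:

```python
# Sketch: script mode with the TensorFlow estimator. train.py is your own
# (hypothetical) training script; SageMaker provides the container, the
# instances, the data channels and the hyperparameters.
from sagemaker.tensorflow import TensorFlow

role = "arn:aws:iam::123456789012:role/MySageMakerRole"    # placeholder role

tf_estimator = TensorFlow(
    entry_point="train.py",              # your existing training script
    role=role,
    train_instance_count=2,              # distributed training out of the box
    train_instance_type="ml.p3.2xlarge",
    framework_version="1.13",
    py_version="py3",
    script_mode=True,
    hyperparameters={"epochs": 10, "batch-size": 256},     # passed to train.py as arguments
)
tf_estimator.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/validation"})
```

Inside train.py, the script picks up the data locations from environment variables and the hyperparameters from command-line arguments, which is the "three or four lines of code" mentioned above.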
So that's what I mean by zero infrastructure work. And I guess we can take my screen away, please. And specifically, since we're talking about training, we launched specific features around reinforcement learning, for example. So the first one I want to mention, it's not quite there yet, but it was announced at the New York Summit and a lot of you are waiting for this: Spot Instances. Oh, yes. So we will soon have Spot Instances available for SageMaker training. And you know, generally Spot gives you about a 70% discount on EC2, and we expect to see the same on SageMaker. So stay tuned, it's coming soon. If you're not familiar with EC2 Spot, also, shameless plug: I'm doing a stream about it next Wednesday, so tune in for that. So check that out if you don't know about Spot. Please, yeah, listen to this, you're going to save a crazy amount of money. Yeah, really. So that's the latest big announcement. And at last re:Invent we also added capabilities for reinforcement learning. So reinforcement learning is a pretty new type of machine learning. I don't want to go into details because that would be another two-hour session right there. But basically, reinforcement learning is good for problems where it's too difficult or even impossible to build a dataset. Imagine you're trying to build data for the stock market. It's too chaotic, right? Or oil exploration, or autonomous driving. It's very difficult. So instead of trying to build a dataset that would not summarize the problem well, we use an exploration technique. And by alternating cycles of exploration and learning, exploration and learning, literally interacting with the environment, which is usually a simulator, getting rewards, like a Roomba hitting the walls and trying to find what works and what doesn't, you gradually learn how to do more of the positive stuff and less of the negative stuff, and you learn how to do things right. And the best example of this is DeepRacer. I'm sure you've seen DeepRacer. I'm sure we have tons of DeepRacer videos on Twitch. We have a lot of DeepRacer content coming for you guys. DeepRacer is based on reinforcement learning, and you can train it on AWS, meaning really training it on SageMaker, and then deploy the model to the car. So RL is new, and that's pretty cool. And it's fully integrated with SageMaker. Okay, so what's up next? What other feature did we launch for training? Another important feature, or I guess another challenge, is when you train your models, when you go into that cycle of tweaking the data, training, looking at results, tweaking the model, training again, blah, blah, blah. You want to maximize accuracy. And if you do this manually, honestly, it's going to take you forever. It's going to cost a bit of money. And it's not guaranteed that you're going to get a good result. But let's be honest, when it comes to picking hyperparameters and so on, or even model architecture, we don't really know what we're doing. I mean, I don't know what I'm doing. I'm faking it. He totally knows what he's doing. I'm totally faking it. So we have this feature called automatic model tuning.
And again, it's a good name, because that's exactly what it is. It will automatically explore ranges of hyperparameters that you set. That's really cool. So yeah, it will train a certain number of models, using machine learning techniques, not random search, although we do support random search as well; that's one of the new features, because customers were using it and they wanted to have that on SageMaker as well. But it's not going to search randomly within those hyperparameters. It's a range. So you define ranges, and you say explore, let's say, the learning rate and the batch size, and whatever else you want to explore, between those two values. And I guess we can't find the optimal parameters ourselves, but at least we can come up with reasonable ranges. Totally. And say, okay, SageMaker, please run 20 or 30 jobs, be smart about it, and find me the most accurate model using parameters in those ranges. And it will tell you what they are at the end. I mean, the analogy I take is, okay, we're in this room, so we can define the ranges from the left wall to the right wall and the back wall to this wall. And I'm looking for my car keys. They're somewhere in there. SageMaker, find me the spot in this space where I dropped my car keys or my smartphone or my glasses. Okay, so if you were to manually explore that space, it would take you a while to find that specific place. If you let SageMaker do it, then it's going to be smart about it and it's going to quickly find that spot in the space with the optimal result. So I'm still waiting for SageMaker to tell me where my glasses are. Make a feature request for the team, please. Model tuning is this. You can do what we call Bayesian optimization, which is the clever way. Or you can do random search, which people use as a baseline most of the time, because they want to do this first and then they run the clever way and they want to see that it's better, and it is. All right, so we only have a couple of minutes left. Yeah, sure. Let's get to: I have my model trained, I'd like to make predictions. How do I get that into the world? So deployment is just about the same. I mean, I showed you the code for training, right, where we just say, hey, please train on X instances of that type. It's exactly the same for deployment. Just say, hey, please deploy on one m4.xlarge instance, et cetera, et cetera. So again, zero infrastructure. So you can deploy in two ways. You can deploy to HTTPS endpoints, which become APIs that you can invoke using your favorite language, or the SageMaker SDK if you're experimenting. Or you can do batch prediction, batch transform we call it. So that's useful if, let's say, you need to predict 10 gigabytes of data every month. You don't want to deploy an endpoint. And you don't want to push 10 gigabytes over HTTPS, right? No, you just want it to run through. It's a one-off, so just create this transform operation and be done with it, literally. Again, fully managed. And now we have this brand new feature where you can actually select the features that you want to predict on. So you don't have to pre-process the data set, excluding columns and so on: I want to remove that stuff. You can actually give it your extract, the data extracted from your Redshift or RDS backend or wherever it is, and say, okay, please use columns 1, 9, 3, and 20, and these go to the model. So you save on ETL as well. Wow. So yeah, that's pretty nice. Batch transform is very good.
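Both of those pieces, automatic model tuning and the one-line deployment, look roughly like this in the SDK. It's a sketch with placeholder buckets, values and hyperparameter ranges, using the built-in image classification algo; the objective metric and tunable parameters depend on the algo you pick:

```python
# Sketch: automatic model tuning on the built-in image classification algo,
# then deploying the best model found. Buckets, values and ranges are placeholders.
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"        # placeholder role
container = get_image_uri(session.boto_region_name, "image-classification")

estimator = Estimator(container, role,
                      train_instance_count=1, train_instance_type="ml.p3.2xlarge",
                      output_path="s3://my-bucket/output")
estimator.set_hyperparameters(num_classes=2, num_training_samples=1000)   # placeholder values

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:accuracy",           # metric name depends on the algo
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(0.001, 0.1),  # explore between these two values
        "mini_batch_size": IntegerParameter(32, 512),
    },
    max_jobs=30,              # total number of training jobs to run
    max_parallel_jobs=3,      # how many run at the same time
    strategy="Bayesian",      # or "Random" as a baseline
)
tuner.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/validation"})

# Deploying the best model found is a one-liner, just like for a plain estimator.
predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")
```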
We have a question really fast. Sure. JM Castagnetto says, is there provisioning for model versioning? That's a good question. All the training jobs are fully traceable. If you look at the console, you'll see historical information on all your training jobs: what algo you used, what hyperparameters you used, what training set you used, what validation set you used. There's a search feature available in SageMaker. It's still a beta, but you can use it. That lets you quickly find models. Okay. So I want to find that XGBoost model that I trained over a month ago and that gave me 0.95-plus accuracy. So yeah, you can figure that out. So another cool feature that we launched is called inference pipelines. Okay. So in a nutshell, most of the time you need more than one model. Of course, when you learn about ML, one model is more than enough. But pretty quickly you realize, oh, I need a model to pre-process my data, and I need a model to predict, and I need another model to post-process. So initially, let's say you have those three models, you had to deploy three endpoints and you had to manually orchestrate the pipeline, essentially. Call the first one, call the second one... You're just creating a pipeline. So that was boring, right? That was boring. Yeah, sorry. So now we have inference pipelines. So we got rid of that. Well, you could still do that if you love orchestration code, but I would recommend building the pipeline. So you train those three models, you chain them as a pipeline, and you deploy the pipeline as a single unit to a single endpoint. All within the endpoint. And now when you push data to the endpoint, it's going to flow through the pipeline. That's really cool. You can do this on endpoints and you can do this on batch transform. You can do up to five models, so if you have complex workflows, you can chain them up and deploy them just like that. Let's keep going, right? Yeah, I told you. So I guess the next one I want to talk about, and it's a major one, is called Elastic Inference. And again, this was released at last re:Invent. Yeah. So Elastic Inference lets you use fractional GPUs. Prior to that, you had to pick between a GPU instance or a CPU instance. For some customers, they were stuck between a rock and a hard place, because if they deployed on CPU, it was probably more cost-effective, but maybe a little too slow, especially if you have image models and complex models. If they deployed to GPU instances, everything was super fast, but maybe a little bit on the expensive side, especially if the model was not large enough to fully utilize the GPU instance. If you have crazy large models, fine. If you have mid-sized models, they're not going to keep the GPU instance fully busy. That's kind of a waste. So Elastic Inference lets you take a CPU instance, any CPU instance, and it works on EC2 as well, not just SageMaker, and you can attach an accelerator, a GPU accelerator, to it. And those come in three sizes: medium, large, XL. And each size gives you a specific acceleration level, a number of teraflops. So you can find the right compromise. And you can get a huge discount compared to a full-fledged GPU. So now people are choosing that CPU with the accelerator. Yeah. You choose the CPU instance that works best for your business app, let's say, and you attach an accelerator just for the prediction part, and you can tune it so that you get the right level of performance and the right level of pricing. So it's a super, super cool feature.
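From the SDK side, Elastic Inference is literally one extra parameter on deploy. A sketch, assuming a trained framework estimator like the ones above (TensorFlow or MXNet models that support EI) and the eia1 accelerator sizes available at the time:

```python
# Sketch: deploy to a CPU instance and attach an Elastic Inference accelerator.
# Assumes `estimator` is a trained estimator from one of the earlier sketches;
# the instance and accelerator sizes are just examples.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",          # CPU instance for the endpoint
    accelerator_type="ml.eia1.medium",     # fractional GPU attached to it
)
```

For the inference pipelines mentioned above, the SDK also has a PipelineModel class that takes a list of models and deploys them behind a single endpoint.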
If you're deploying to GPU instances by default, please try Elastic Inference. You're going to save money and you can thank me on Twitter. And I guess the last one... Make sure you thank him on Twitter. Or yell at me on Twitter if you didn't save money, but I'd be surprised. Yelling at us is part of our job description, right? Yeah, that's true. That's all right. And I think the last one I want to share with you guys is a service called Neo. So SageMaker Neo. So Neo is basically a model compiler and a model runtime. So the idea is you train a model on SageMaker, or maybe elsewhere. It could be a pre-trained model, because again, it's a modular service. So you take a model, compile it with Neo for a certain architecture. You could say, I want to compile it for Intel platforms or NVIDIA platforms or ARM platforms. So one API call, super simple. And then you get an optimized version of the model with native code for that platform. To run on those devices. And then you take the Neo runtime, which is a really tiny runtime, much smaller than the TensorFlows of the world and everything else. And you take that runtime, you load the model, and now you predict with native code. And there's a nice speedup. So again, it's fully integrated with SageMaker. It's just like Elastic Inference. Elastic Inference is just one parameter away. Neo is one API call away, and deployment is super easy. You could also use this on EC2 if you wanted. But you can now optimize machine learning for a specific hardware target, right? Especially if you're on ARM platforms, if you use ARM-based IoT devices where you want to run prediction, Neo is a nice tool to have.
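Neo's "one API call" looks roughly like this with boto3. Job name, role, S3 paths, the input shape and the target device are all placeholders, so treat it as a sketch rather than a copy-paste recipe:

```python
# Sketch: compile a trained model with SageMaker Neo for a specific hardware
# target. All names, paths and the data shape are placeholders.
import boto3

sm = boto3.client("sagemaker")
sm.create_compilation_job(
    CompilationJobName="my-model-neo",
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",       # placeholder role
    InputConfig={
        "S3Uri": "s3://my-bucket/model/model.tar.gz",               # trained model artifact
        "DataInputConfig": '{"data": [1, 3, 224, 224]}',            # input tensor shape
        "Framework": "MXNET",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetDevice": "jetson_nano",                              # e.g. an ARM-based edge device
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
```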
All right, so we have one minute left. Yes. What do you want to leave them with? Should we throw some links at them? Should we do a giveaway? Yeah, so I guess we said we'd give some away. You guys want AWS credits? Who wants AWS credits? All right, so let's start this giveaway. All right, while you do this, okay. I think the best way to get started, like I said, is to go through the SageMaker documentation. Give it a quick run through. Be nice to her, come on. So just start running those examples, and don't worry, they're not expensive. Okay, so these are really designed to run in just a few minutes. So you're not going to spend your budget on those notebooks. Most of them just train for a few minutes, so it's going to be a few pennies, right? So just go through that, and work your way through the built-in algos and the frameworks, etc. So I think this is how you jump in. We also have a 10-minute tutorial. You can easily Google that, SageMaker 10-minute tutorial, if you just want the quick experience of building and training and deploying a model. Of course we have lots of re:Invent videos out there, there were lots of sessions. Oh yeah, no, just "giveaway", guys, without the exclamation mark. It works, it's putting you in the viewer list. So just type giveaway and you'll enter into the giveaway, and then we'll pick a winner, and then we will wrap. Yep. So yeah, YouTube videos, re:Invent videos, the AWS machine learning blog, we have a forum as well for SageMaker, and I'm trying to keep an eye on SageMaker questions on Stack Overflow, too, if you tag them with amazon-sagemaker. All right, two seconds left to enter the giveaway. Ending in two seconds and I'm going to click the button. We have plenty of people in this giveaway right now. All right, ready? Do you want to click? Do you want the honors? No, no, no. They'll be yelling at me, and I get it. It's rigged? Yes. Yeah, they're going to say it's rigged. I'm not innocent. Anyway, drum roll... Richard, you win! The pizza guy, Richard H. Boyd, wins. He asked the question. You make sure pineapple pizza wins all the time. You whisper me on Twitch and I will send you the AWS credits. It is not rigged, it was not rigged, I promise. All right, we've got to wrap. Thank you guys so much for joining us. If you'd like to see more content from Julien, make sure you send him a tweet, or me. We can get him back on Twitch to teach us more about machine learning. J-U-L Simon on Twitter.
I'll put his Twitter in the chat. I'm kind of easy to find. My messages are open, so shoot me questions, or again, ask questions on Stack Overflow and I'll get back to you. I want to thank you very, very much. You have a really cool studio. It was a pleasure to talk to all of you. It was my pleasure to have you here. I hope you learned a few things, and of course, you know, let's continue the conversation online, and maybe I'll see some of you on the road as well, right? Absolutely. Thank you so much. Thank you guys so much for tuning in and we'll see you later. Bye bye.

Tags

Amazon SageMaker, Machine Learning, AWS services, Data Science, Model Deployment