by Julien Simon
DEVit 2017 Conference, Talk 8 "AI on a Pi"
http://devitconf.org/2017/
Julien Simon
https://twitter.com/julsimon
https://github.com/juliensimon
https://aws.amazon.com/evangelists/julien-simon/
Talk Description:
In recent months, Artificial Intelligence has become the hottest topic in the IT industry. In this session, we'll explain how Deep Learning — a subset of AI — differs from traditional Machine Learning and how it can help you solve complex problems such as computer vision or natural language processing. We'll introduce you to MXNet, an open source Deep Learning library, and we'll show you how to run it on a Raspberry Pi. Then, using a camera and a pre-trained object detection model, we'll show random objects to the Pi… and listen to what it thinks the objects are, thanks to the text-to-speech capabilities of Amazon Polly.
Transcript
So we have two more talks. The next one, our speaker is a tech evangelist for Amazon. He's going to talk about artificial intelligence and deep learning and he brought some robots for us. So please help me in welcoming Julien Simon. Enjoy.
Well, good evening. It's almost evening, isn't it? I can see that being on time is not really a Greek thing, but hey, that's fine. You had technical difficulties, and I can fully understand that. And again, congratulations to the tech team for fixing this in so little time. It's a challenge. My name is Julien and I work for AWS. The light is extremely bright, so I cannot see you. I can kind of see that there are people over there. Today I'm going to talk about a pretty cool subject, at least I hope it's cool: deep learning. My challenge for these 43 minutes and 54 seconds is to talk about it without ever talking about math or theory—the boring stuff. What I want to show you today is that you can get started on deep learning applications even if you're a normal software engineer without a lot of theoretical background. And hopefully, I will succeed.
So here's the agenda. We're going to talk pretty quickly about AI and where it started and why it's been a pretty disappointing thing for decades and why it's changing. I will give you some examples of applications of deep learning, what we can do with it, and some of them are in real life as you will see. Then we'll talk about an open source library called Apache MXNet. This open source library allows you to build and train deep learning models in your applications using a number of languages. As you will see, it's quite high level, it's not a lot of code, and normal people, normal developers like me and most of you, I guess, can do this. And obviously, it's pretty dreadful to look at APIs on slides, so I will switch to code and show you some actual MXNet code, run some examples on a GPU instance running in AWS, and hopefully on my little friend here. Looks like he survived the trip and hopefully is willing to work today. We'll see.
So, AI. Well, I don't want to give theoretical definitions. That's it. For me, that's it, right? You've all seen this movie, I hope. If not, go and watch it. That's the dream that a lot of us have: build this thing, build this system that is super smart, smarter than us, that accomplishes very complicated tasks for us, that we can talk to. Oh, it's super dark. Oh, the projector is dying again? No, it's not from my end. I can do this without slides, and I believe I will have to do a little dance if this fails again. You have been warned. Don't fail. So, a complex system that can fulfill very, very sophisticated tasks, that we can talk to, that can talk back to us, and that hopefully is not trying to throw us into space and kill us. That's the nasty part. But this is what has been obsessing computer scientists, AI researchers, and science fiction fans for a long time. Unfortunately, this movie was released in 1968, I believe, and decades later, AI, that computer system, is still nowhere to be seen. And actually, in 2001, the actual year 2001, Marvin Minsky, who is one of the fathers of artificial intelligence and who contributed to the 2001 movie—he was an advisor to Kubrick, I don't know if you knew that, but it's a fact—wrote a paper saying, well, it's 2001, where is HAL? And, well, it's not there, is it? And he listed a number of reasons why, and they're all very frustrating. At about the same time, machine learning started to pick up, right? I mean, in the first half of the 2000s, machine learning exploded, and all of a sudden all the big web companies and everybody was working on machine learning, and I'm guessing all of you in the room do machine learning, or at least you have machine learning on your CV, right? Might be a different thing, but hey, keep it on the resume, you know, that's the way to do it. You can always learn it later. And so machine learning exploded, and it's super popular, and it's solving a lot of problems in lots of different applications. But the thing is, machine learning works when you can define your data pretty well and define what the features are.
So let's take an example that should be familiar to you. You're trying to predict user behavior based on web logs. So you're trying to predict who is going to click on what ad, who is going to click on what product, who should be recommended what product, etc. And you use your navigation logs to do that. And it works well because in those logs, features are defined. You know the user, you know the product, you know the date, you know the time, you know the user agent, you know maybe the screen resolution, and so on. And you can then ask your data scientists to go and figure out which of those features should be used to build a prediction model, and it's a lot of work. But at least the features are available, and they just need to find the right subset. Now what about using more complicated data, like pictures? If I give you a 1000x1000 pixel picture and I ask you whether this is a dog or a cat or a tiger or something else, how do you handle that problem? You have one million pixels. Is every single pixel a feature? Do all pixels mean something? Can you keep a subset of those pixels? How do you do that? It's complicated. If I give you a recording of my voice and I ask you to predict if I'm a man or a woman, what's my age, what's my nationality, did I drink too much last night, will I drink too much tonight, all those kinds of very important questions. Then again, how do you do it? What parts, what bits, what samples in my voice are you going to use to build your model? And of course, that's not going to work. Traditional machine learning is not going to work. And this is why we need something else. We need a way to teach computers informal knowledge. Something that's obvious to us, something that might be obvious to a five-year-old, but something that is completely alien to a computer system. And to do this, we need to use neural networks. And neural networks are great because if the network is large enough, and if you give it enough data, it can learn to approximate pretty much anything. So show it enough data, and it will learn how to predict the correct output for a given input. But neural networks have been around for ages. I mean, some of you could be older than me, maybe. And if that's the case, you may remember from university that when we started out, we were told something like: yeah, artificial intelligence has been around for 40 years, it's super cool, but it doesn't really work outside of the lab, right? It doesn't scale, it doesn't really work for real-life applications. So it's cool, but hey, go and find a different job. Okay, fair enough. And again, these techniques have been around forever, right? Forever, since the 50s. Why did they fail for decades? They failed because there wasn't enough data to train them, right? You need lots of data to train neural networks. And in the 50s and 60s, what was the size of the largest hard drive again? Remember that? Well, you can look it up on Wikipedia. Some of those systems didn't even have hard drives. They just didn't have the data. And in the rare event that they actually had data, they didn't have the computing power to train the networks. Training a network is very compute intensive. You need to push that data through the network multiple times. We're going to do that in a few minutes. Computing power. In those days and even until recently, it was pretty difficult to come by. Now it's different, because data is available; everything we do is digital. As I speak, I am, and you are, generating tons of data.
That gentleman here is filming me, and all you guys are sending tweets saying, oh my god, this is really a dreadful presentation, but that's okay, because it's data that I can use for my next demos, right? And some of you might be watching stuff on YouTube because you're bored, or chatting with your boyfriend or girlfriend; keep doing that. Anyway, there are just petabytes and petabytes and exabytes of data available. And some of it ends up in public datasets, like the ImageNet dataset, which I'm going to show you, and some other datasets that are available to researchers. And as you know, computing power is now massive thanks to GPUs. Researchers started using GPUs for deep learning in the mid-2000s, and this really changed the game, because those chips have thousands of cores, so now you can run massive training operations in parallel instead of running them on just a couple of CPUs. And the last thing is, of course, you can get all that stuff, you know, storage, GPUs, CPUs, etc. All those infrastructure resources, you don't even have to buy them; you can just rent them, right? You can get what you need in the cloud for a few hours or a few days, build your deep learning model, release everything when you're done, and just pay for what you used. You don't have to invest in a ton of hardware to build those models. So that's pretty cool, especially given the fact that running a deep learning model is actually not very compute intensive. So it makes a lot of sense not to buy tons of deep learning hardware, because you will only need it for training. You will not need it for actual usage, as I will show you.
Okay, so now things are moving along pretty fast, and not a day goes by without a pretty cool article on AI and deep learning showing up on Hacker News or your favorite place for tech news. So let's look at a few examples. There's a challenge every year called the ImageNet challenge, where research teams all over the world use the same database of millions of images sorted into a thousand categories, and they try to predict the correct category for each image. And actually, they have to predict five categories, and if the correct category is among the five, then they consider that they successfully identified the image. Here's an example. I'm not sure you have much use for those dogs in Greece, although you do get some snow here. These are real images from the dataset. Who thinks both of these dogs are the same breed? Come on, raise your hand. Who thinks they are different breeds? And all the other ones are, what? Snow? What is snow? I've never seen snow here. Come on. So for the record, the answer is: these are different dogs, different breeds. But don't feel bad, because the guys in Stockholm got it wrong too, and they have no excuse not to know this. So they're different. This is the kind of challenge that deep learning models have to face. So here are the results from the last five or six years. The blue line is the error rate. And it's a little small on that screen, so let me read it. It goes from 28% in 2010, on the left, to 3% in 2016. So you can see it's improving very quickly over the years. And the red bars are the number of layers in the model. So we went from just one layer in the early years, to 8, to 19, to 22, to 152, and to 269 layers. Can you imagine that? A crazy number, for the winner in 2016. This is why it's called deep learning: because the networks now have hundreds of layers. So we get to a 3% error rate. What about humans? Is this good? Is this bad? Who thinks humans can do better than 3%? No one. Okay. Who thinks humans are actually, let's say, above 10%? Alright. Between 5 and 10%? Alright. And everybody else is still wondering about that snow thing. Okay. Alright. Well, the answer is 5%. So humans, I mean normal humans in a normal state, whatever that means, if they haven't had too much Ouzo the night before, can do 5%. So that means deep learning can now do significantly better than humans at classifying very complex images. So computer vision, I'm not saying it's a solved problem, but it's making a lot of progress, as you can see. I'm curious to see how they will do this year.
Here's another example. You may have seen this one. It's the family of Echo devices. These are home devices that you can literally talk to, to order a taxi, get a pizza delivered, get the latest news, get weather information, you know, all the little things that you like to do in life. And the device itself, I love to describe it as a loudspeaker and a microphone, and I get angry emails from the Alexa team every time I do that, so hi, Alexa team. But it really is just a hardware device, although it's a very nice device, a friendly device, and a beautiful device; the smart part lives in the cloud. The natural language processing and the text-to-speech and everything live in the cloud. And they are now available as AWS services, although I will not really mention them today. So, deep learning in the house. And of course, you can do fun stuff, right? Yeah, come on, show me that picture. Yeah, here it is. So you've seen this one in the news lately, hopefully. It's a Basquiat painting that was sold two days ago for $110 million. Well, Basquiat has been dead for a while, so he's not going to enjoy that money, but at least he can enjoy the fame wherever he is. So that's Basquiat, right? Very nice. Well, that's just me. And I figured, I don't have 110 million dollars to spend on a Basquiat painting. I would love to. So I figured, hey, can I use deep learning to build my own Basquiat? And there's this fantastic research paper that came out a couple of years ago about neural art, and in MXNet, which I'm going to show you in a minute, there's an example that allows you to run it. This is the result. I'm not going to ask the question, but I think most of you think that I look better in the bottom picture. If anybody wants to buy that one... I'll give you a very cheap price. Can we do 110,000 euros or something? It's a good deal. So we can do silly stuff, fun stuff, or art with deep learning, which is nice.
Alright, let's move on to MXNet. So why am I even talking about MXNet today? Because it's the deep learning library that AWS is committing to for the future. We're officially supporting the project. We have active contributors to MXNet in the AWS teams. We actually hired some of the MXNet designers, and we want to push that project, share it with the community, and of course use it internally. So, like I said, it's an open source library. It's now part of the Apache project, so it belongs to the community, and I think that's a good sign for the future. It supports a number of languages: you can use Python, C++, MATLAB, R, and some other ones that I don't even know about. So it's quite accessible. And it's quite high level; as you will see in the examples, there isn't a lot of code to write. You can design your networks in 10 or 20 lines and train them in just a couple of lines. So it's very developer friendly, at least I think it is. And it scales very well. So this is a graph of multiple models being trained with MXNet on one to 16 GPUs living in the same server. The red line is ideal, linear scaling, and as you can see, we're really close to linear scaling for some of the more advanced models like ResNet or Inception. So that's very nice. And what happens if you push it? If you push it to 256 GPUs, and that's a lot of GPUs, well, you can see that MXNet keeps scaling almost linearly. And that's really, really good, because if you have huge deep learning jobs to run, if you're doing speech recognition, for example, and you need to run that job for days, then if you throw more GPUs at the problem, you can expect MXNet to reduce the computing time linearly. Scalability is really one of the strong advantages of MXNet compared to other libraries.
So let's look at the API. And as promised, I will really not show you APIs on slides; I will show you some code. The first demo I'm going to show you is a popular one. It's based on the MNIST dataset. The MNIST dataset is a set of 60,000 handwritten digits from 0 to 9, obviously written by many different people. And the goal, of course, is to show digits to the neural network and have it predict whether each digit is a 0, a 1, a 2, a 3, etc., up to a 9. So our input data will be those images. They are 28 by 28 pixels, grayscale, so it's pretty easy to convert each one into a 2D matrix of pixel values and train the network on them. So let's switch to the console and take a look. Start yelling in the back when you can actually read it. Thank you.
Okay, so here's my first example. It's called train model. The first thing we're going to do is load our data. The data comes in binary files, and we need to split it into two parts: one main part for training, the training set, and one part that we keep for validation, to check the accuracy of the trained network. In MXNet, you do this using an object called an iterator, which will load the files and feed the samples batch by batch into the network. You don't feed one sample at a time; that would be really too slow. In this case, I'm using a batch size of 128, a typical value. So 128 samples go into the network at a time, and we use the iterator to do this. The next part is to build my network. Here I'm using the Symbol object. I need an input layer for my data. As you can see here, I really don't give a lot of detail on what my data looks like. I'm not saying anything about how these are 28 by 28 pixel grayscale images. And that's another good thing about MXNet: there is a very clear separation between what the data looks like and what the network looks like. You load the data and you apply it to your network, and if the data changes, if all of a sudden I wanted, let's say, color images, which would be red, green, and blue 28 by 28 pixel values, a very different shape, then I could use the same network. The network would adapt its input layer to this data. So, clear separation between data and network. That's one of the strengths of MXNet. Then I'm defining my layers. I have the input layer, where I'm going to feed the pixel values for the images, and then a fully connected layer with 128 neurons. And as you can see, I'm doing this in one call, right? I don't have to define every single neuron and how it's connected to its neighbors, etc. This high-level API lets me build a layer in one call. Then I have a second layer with 64 neurons, and then my last output layer, which has 10 neurons, because what I want is a probability for each digit between 0 and 9. So I need 10 different output values. And that's it, right? That's my network. Six or seven lines of code. Now, I'm going to put the data and the network together using the Module object, and as you can see here, I'm going to run it on one of my GPUs, and that's all it takes. I don't have to mess with the GPU itself; I just say, give me GPU 0 and please run on that guy. Very high level, no worries. Here I'm binding the network to the data, saying, okay, here's the training data, here's the validation data, apply it to that model. I have to set some parameters, and then I go and train it. I train it for a number of epochs; an epoch is one full run of the dataset through the network. Here I'm going to do it 50 times, so the training set is going to go through the network 50 times, batch by batch. I'm saving the model, and I'm using the validation set to measure accuracy. How many lines of code is this? 44 lines, including comments. It's really very compact code. So how about we train it?
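[For readers following along without the screen, here is a minimal sketch of what a training script like this can look like with the MXNet Module API. The file names, optimizer settings, and checkpoint prefix are assumptions, not the exact code shown on stage:]

```python
import mxnet as mx

# Iterators feed the 60,000 training and 10,000 validation images to the
# network batch by batch; flat=True delivers each image as a 784-value vector
train_iter = mx.io.MNISTIter(image='train-images-idx3-ubyte',
                             label='train-labels-idx1-ubyte',
                             batch_size=128, shuffle=True, flat=True)
val_iter = mx.io.MNISTIter(image='t10k-images-idx3-ubyte',
                           label='t10k-labels-idx1-ubyte',
                           batch_size=128, flat=True)

# The network: an input layer, two fully connected layers (128 and 64
# neurons), and a 10-neuron softmax output, one probability per digit
data = mx.sym.Variable('data')
fc1  = mx.sym.FullyConnected(data, num_hidden=128)
act1 = mx.sym.Activation(fc1, act_type='relu')
fc2  = mx.sym.FullyConnected(act1, num_hidden=64)
act2 = mx.sym.Activation(fc2, act_type='relu')
fc3  = mx.sym.FullyConnected(act2, num_hidden=10)
mlp  = mx.sym.SoftmaxOutput(fc3, name='softmax')

# Bind data and network together in a Module, train on GPU 0 for 50 epochs,
# and checkpoint the model after each epoch
mod = mx.mod.Module(mlp, context=mx.gpu(0))
mod.fit(train_iter, eval_data=val_iter,
        optimizer='sgd', optimizer_params={'learning_rate': 0.1},
        num_epoch=50,
        epoch_end_callback=mx.callback.do_checkpoint('mlp'))
```

[Note how the symbols only say "fully connected, 128 neurons" and never mention 28x28 grayscale images: the input shape is taken from the iterator when the module binds, which is the data/network separation described above.]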
So it's going to load the data: 60,000 training samples plus 10,000 validation images. And it's pretty fast, as you can see, thanks to the GPU. You can see the training accuracy going up, which means the network is correctly learning the dataset. It will normally get to 1, meaning it can learn the training set perfectly. And then, after the 50th epoch, it's going to run the 10,000 validation samples through and give me the accuracy. And here I can see I get to 97.7% accuracy. You can see it right in the middle here. So is that good or is it bad? Well, let's find out. What I did is I took my favorite paintbrush application, like in the good old days, and I drew some digits. And it's pretty ugly handwriting, especially my 9 down there. It's an ugly 9. So let's load those digit images, put them in a matrix, run them through the trained network, which I saved, and see how the network does. And let me just show you the code for a second. I'm loading the model that I saved, I'm loading the images into a proper array, and then I'm basically sending each image through the trained network to get outputs, and I'm printing the output, for all 10 images. So for each image here, I get 10 probabilities. For this one, which is the digit 0, my network says it is 99.46% sure this is a 0, which is good. It's pretty okay about the 1, the 2, the 3, etc. Actually, you know, my digits are not so bad, except the last one, my ugly 9. And this network says, nope, that's not a 9, that's an 8: it's about 65% sure it's an 8 and only 9% sure it's a 9. So it gets it wrong. Alright? So let's build a better model and see if we can improve on this. Here I'm using the same dataset; I'm just building what we call a convolutional neural network. You may have heard that term a lot, CNN, and they're really, really good at recognizing images. It's the same principle: I'm using the high-level Convolution API to build the different layers, and as you can see, I'm building my network again, and here's the output layer with 10 values again. So again, it's 10 lines of code. And actually, you could take a research paper, look at the network structure defined in that paper, and code it into MXNet. That's what people do; they don't go and invent networks. You can reuse networks that have been published and worked on by much smarter people than myself. And then it's the same story, right? Load the data, create the module. Here I'm going to run it on three GPUs. This machine has eight, and it's as easy as saying, give me GPUs 0, 1, and 2. I'm running for 30 epochs, and the rest is completely similar. Okay, so let's go and do this.
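[As a sketch only, under the same assumptions about file names and the 'mlp' checkpoint prefix, the prediction step could look like this:]

```python
import mxnet as mx
import numpy as np
from PIL import Image

# Load the symbol and weights checkpointed at epoch 50
sym, arg_params, aux_params = mx.model.load_checkpoint('mlp', 50)
mod = mx.mod.Module(sym, label_names=None, context=mx.cpu())
mod.bind(data_shapes=[('data', (1, 784))], for_training=False)
mod.set_params(arg_params, aux_params)

for digit in range(10):
    # Each hand-drawn digit: 28x28 grayscale, flattened to 784 values
    img = np.array(Image.open('%d.png' % digit).convert('L'), dtype=np.float32)
    img = img.reshape(1, 784) / 255.0
    mod.forward(mx.io.DataBatch([mx.nd.array(img)]), is_train=False)
    print(digit, mod.get_outputs()[0].asnumpy())  # 10 probabilities per image
```

[And the convolutional network described here can be written like the classic LeNet example; the layer sizes below follow that example and may differ from the ones used on stage:]

```python
# LeNet-style CNN: two convolution + pooling stages, then fully
# connected layers; the output layer still has 10 neurons
data  = mx.sym.Variable('data')
conv1 = mx.sym.Convolution(data, kernel=(5, 5), num_filter=20)
act1  = mx.sym.Activation(conv1, act_type='tanh')
pool1 = mx.sym.Pooling(act1, pool_type='max', kernel=(2, 2), stride=(2, 2))
conv2 = mx.sym.Convolution(pool1, kernel=(5, 5), num_filter=50)
act2  = mx.sym.Activation(conv2, act_type='tanh')
pool2 = mx.sym.Pooling(act2, pool_type='max', kernel=(2, 2), stride=(2, 2))
flat  = mx.sym.Flatten(pool2)
fc1   = mx.sym.FullyConnected(flat, num_hidden=500)
act3  = mx.sym.Activation(fc1, act_type='tanh')
fc2   = mx.sym.FullyConnected(act3, num_hidden=10)
lenet = mx.sym.SoftmaxOutput(fc2, name='softmax')

# Same Module API, but spread each batch across three GPUs,
# then bind and fit exactly as before, with num_epoch=30
mod = mx.mod.Module(lenet, context=[mx.gpu(0), mx.gpu(1), mx.gpu(2)])
```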
Okay, and it's going to take a little while, so let me check if my buddy here is willing to say hi. Too many wires in here. Be nice, okay? Or I'm throwing you in the sea. Okay, so I should switch this thing on, if I can do that. Or do I need a deep learning engineer to do it? I think I need help with this. Sorry. He has a PhD, thanks. Okay, so say something. "Hello friends. Thank you for inviting us to Thessaloniki. Tonight, I'm going to recharge my batteries with souvlaki and Ouzo. Now, Julien, could you please stop clowning around and get on with the demo?" Talk to me like that again and you die. All right. Okay. Silly robot. So, let me get back to my training. Can you see the accuracy? It's now 99%, so that's quite a bit better. Let's try to predict my numbers again. I just need to use my new network. It's a network called LeNet, because it was designed by a French scientist called Yann LeCun. So, you know, hey, we did something cool, for once. Congratulations. Alright, so let's predict. And as you can see, for the first few digits, it's very, very sure these are the right numbers. And if you look at the last one, way down at the bottom here, well, this network is now 99% sure that my ugly 9 is a 9. So it's definitely a much better network, much more accurate. So that's an example of building different networks and training them, and as you can see, with GPUs it goes pretty fast and you can get pretty good results. Of course, you could do this with letters, and you could do this with the Greek alphabet. That would be a very nice demo. You should try that next time.
So that was the first demo. Let's get to... I've got 11 minutes. We can be silly a little bit. Let's be silly. I said I wouldn't talk about the AWS services, but I do have a few minutes. We have a technical conference every year called re:Invent. I'm sure you've heard about it. We have plenty of videos on YouTube about re:Invent. And we announced three services there. The first one is called Polly, and you just heard it. It's text-to-speech; it can speak 24 languages with 47 voices. The second one is called Rekognition; it can recognize faces and objects, and we're going to try it. And the third one is called Lex, and it sounds very much like Alexa, and that's not an accident: it's the chatbot technology that is used by Alexa and the Echo devices, which you can now integrate into your own applications. Polly for text-to-speech, Rekognition for images, and Lex for chatbots. So let's try Rekognition, but can I have my assistant? I have an assistant. Thank you. Thank you, my assistant. Come on, you can encourage her. Okay, so what we're going to do is take some pictures with the robot. And we're going to try to get some faces, yeah? There's not a lot of light, so the camera's here. Just point the camera at a friendly face. Don't drop it, alright? Alright. Okay, okay. Yeah. Okay, oh yeah, thanks for the light, that's going to work. Alright. See? Okay, let's try this. And you should have sound too. Yeah. Okay, could you give me the mic? Oh, thanks. Okay, so we're going to do a few of those, right? Alright, find someone else. Okay. Okay. No face has been detected. What? Okay, okay, I'm trying, guys. You're not human, I'm sorry. Okay, and so it's tweeting. All right, no, we can't let that failure affect us. Come on, let's do another one. Okay. I'm trying. Is it too dark? No, face has been... Yeah, just let me try. No, but the camera is a little low, so... All right. Oh, yeah, now, okay, now it's... Well, that should work, okay. Okay. You're a little distant, but you can see that it does... Okay, it does pick up the fact that this is a crowd and we're in a room with a theater, etc. Okay, yeah, so that gentleman was correctly recognized? Oh yeah, sure. That one was a little dark. That's not a problem. And here, you're a little too far away. But even though it doesn't detect faces, it understands the context. So that's a mix of Polly and Rekognition. But this is cloud-based, right? I'm calling the AWS API to send that picture and get it processed in the cloud, and I'm calling another AWS API with a text message to get a sound file back and play it. And we cannot trust robots to have cloud connectivity 100% of the time. So the next step, and thank you, my assistant, you were great, is to have local AI, right? To be able to do AI on that little Raspberry Pi robot with no cloud connection. Is that possible? Well, it is. Because, like I said earlier, it takes a lot of resources to train models, but it does not take a lot of resources to run them. So what I have here is a Raspberry Pi on wheels running Raspbian, an embedded Linux distribution. I installed MXNet, the same library I showed you on that powerful GPU instance, and I just copied over a model trained for image recognition, a model I trained in the cloud. So I used the cloud to train it for hours and hours, and then I just copied that file to my Raspberry Pi, and hopefully it's going to work.
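[Before going local: the cloud side of the demo we just did boils down to two API calls. Here is a minimal boto3 sketch; the region, the voice, the file names, and the use of label detection (the face demo uses detect_faces in the same way) are assumptions on my part:]

```python
import boto3

rekognition = boto3.client('rekognition', region_name='eu-west-1')
polly = boto3.client('polly', region_name='eu-west-1')

# Send the camera picture to Rekognition and collect the detected labels
with open('picture.jpg', 'rb') as image:
    response = rekognition.detect_labels(Image={'Bytes': image.read()},
                                         MaxLabels=5, MinConfidence=70)
labels = [label['Name'] for label in response['Labels']]

# Ask Polly to turn a sentence into an MP3 sound file, ready to play
speech = polly.synthesize_speech(Text='I can see: ' + ', '.join(labels),
                                 OutputFormat='mp3', VoiceId='Joanna')
with open('speech.mp3', 'wb') as f:
    f.write(speech['AudioStream'].read())
```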
To make things even harder for me, and hopefully this is still working, I also have an Arduino here. I'm sure you guys are super familiar with these, right? And I have a joystick to drive the robot, and these things talk to one another using AWS IoT, which is our IoT platform. So whenever I move that joystick or click the button, it sends MQTT messages, which is the standard IoT protocol that we use. It sends those messages to Dublin, I believe, the Ireland region, and obviously the robot is listening for those messages, and it should do something. Alright, so we have a video, let's see how we can do that. Okay, we'll agree this is a baseball, right? It's my lucky baseball. Yeah, let's try to turn it. Alright. Well, we had too many technical difficulties, right? So this is going to work now. Will it? Okay, I'm connected, alright. Okay, let me start that thing again. Oh yeah, I think now we're connected. Okay. All right, taking a picture now. What do you see? I'm 38% sure that this is a stingray. Right, it's not, but that's not the point, right? Oh, come on. Go that way. All right. Nice. Okay, look. Looks like I have a little latency here. So, okay. Let's move on. I'm 27% sure that this is an umbrella. Yeah, sure. Will you shut up and move? All right, okay. Sorry about that. Just reboot the damn thing. Let it reconnect; hopefully, that's going to work. Come on. Yeah, it works the other way, right? Some people think humanity will be slaughtered by robots; I think we are safe. Alright, now let's try and take a picture. Could I have that mic, please? Alright, thanks a lot. Alright, make me proud. So I'm pushing that button here. I'm 99% sure that this is a baseball. All right. All right, so you get to come back to France. Okay, so again, how does this work? It's all fun and games, but again, I'm just pushing this not-super-precise joystick here to send IoT messages through our platform in Dublin. The robot subscribes to those messages and receives them, and I have a Python server running on the robot getting the messages and trying to do what I asked it to do. And when I push that button, it sends a C command, so the robot takes a picture of the object in front of it with the tiny camera here, and it runs that image through a deep learning model trained in the cloud. But all the actual deep learning processing happens here, right? And the voice comes from the cloud; we could maybe do local voice as well using other libraries. This is pretty cool. You may have seen it on the slides or not, but I will put the slides on Twitter tonight. I wrote a number of articles on Medium about exactly this: MXNet, the robot, and everything. You can absolutely redo all of this. You have all the code; if not, send me a tweet or an email, and I'll help you out. But if I can do it, you can do it, and I don't have a deep learning PhD, thank God. So yeah, I wasted my life in different ways.
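[On the robot itself, the local classification step could look roughly like this. The 'Inception-BN' checkpoint, epoch number, 'synset.txt' label file, and input size are assumptions about a typical pre-trained ImageNet model, not necessarily the exact one used on stage; image preprocessing such as mean subtraction is omitted for brevity:]

```python
import mxnet as mx
import numpy as np
import cv2

# Load the image classification model trained in the cloud and copied
# to the Pi, plus the ImageNet category names
sym, arg_params, aux_params = mx.model.load_checkpoint('Inception-BN', 126)
mod = mx.mod.Module(sym, label_names=None, context=mx.cpu())
mod.bind(data_shapes=[('data', (1, 3, 224, 224))], for_training=False)
mod.set_params(arg_params, aux_params)
labels = [line.strip() for line in open('synset.txt')]

def classify(filename):
    # Reshape the camera picture to the (1, 3, 224, 224) input the model expects
    img = cv2.cvtColor(cv2.imread(filename), cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (224, 224)).astype(np.float32)
    img = np.expand_dims(img.transpose((2, 0, 1)), axis=0)
    # All of the deep learning happens locally, on the Pi's CPU
    mod.forward(mx.io.DataBatch([mx.nd.array(img)]), is_train=False)
    probs = mod.get_outputs()[0].asnumpy()[0]
    best = probs.argmax()
    return labels[best], probs[best]

print(classify('picture.jpg'))  # e.g. ('n02799071 baseball', 0.99)
```

[The Python server on the robot just wires this function to the incoming MQTT messages: on a C command, take a picture, call classify(), and send the resulting sentence to Polly for the voice.]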
We have other tools out there. The Deep Learning AMI is an Amazon Machine Image that we provide for you to run on GPU instances. It's the one I used in the previous demo. It has all the deep learning frameworks installed, so you don't have to install them; just boot that AMI and get to work. We have plenty more resources for you to learn from, so just go and try it, build cool stuff, write blog articles, send me tweets, and I'll be happy to share them and show everyone the cool things that you built. I want to thank you massively for having me today. It's a beautiful city. Costas, man, you picked me up at the airport last night. It was very late. That was so kind of you. Everybody has been super kind to me. That robot was almost perfect today. So thank you, robot. And you've been great. So thank you very much. Thank you, Julien.