Infoshare 2017: Deep Learning on AWS with MXNet (Julien Simon, Amazon Web Services)
June 20, 2017
Deep learning continues to push the state of the art in domains such as computer vision, natural language understanding, and recommendation engines. One of the key reasons for this progress is the availability of highly flexible and developer-friendly deep learning frameworks. Apache MXNet is a fully-featured, flexibly-programmable, and ultra-scalable deep learning framework supporting innovative deep models including convolutional neural networks (CNNs) and long short-term memory networks (LSTMs). After a quick overview of what deep learning is, we will introduce you to GPU instances and to the Deep Learning AMI, and of course, we'll run several MXNet examples based on Python notebooks.
The talk was given on 18 May 2017 on the Tech Stage at infoShare 2017.
Follow infoShare:
https://facebook.com/infosharepl
https://twitter.com/infosharepl
https://instagram.com/infoshare/
https://linkedin.com/company-beta/1272775/
Transcript
Good afternoon. My name is Julien. It's my first time in Poland. What took me so long, right? So, super happy to be here today. This is a pretty cool subject. Let's get started right away. I'm going to talk about deep learning for a bit. Then we're going to try some demos. If the demo gods are with me, they're going to work. If not, you can throw stuff at me. I'm used to it. Let's go.
Here's the agenda. We're going to talk very quickly about AI and the story so far. Then I'm going to show you a few applications of deep learning technology. Just a few things you can do. Then we'll dive into the topic of today, which is Apache MXNet, a deep learning library to build neural networks, train them, and use them in your applications. I'll talk about MXNet a little bit, show you some key parts of the API, and as fast as possible, I want to show you some code and run some demos, because this is how you really understand how to work with MXNet.
AI, the story so far. Do you remember when that movie came out? Most of you were not born, I'm afraid. It came out in 1968: "2001: A Space Odyssey." A lot of us have been obsessed with this movie. I've been obsessed with building HAL, that artificial intelligence you can talk to, that can talk back and do some complicated stuff, and hopefully fail to kill you. That's the part we would like to forget. For decades, people have been trying to build systems like this. Artificial intelligence is maybe something you studied at university. When I was there a century ago, they would tell us, "AI has been around since the 50s, but it's very cool, and you can't really do anything useful with it. So it stays in the lab, and that's it." And that's a sad story.
In 2001, the actual year, Marvin Minsky, one of the founders and fathers of AI, and an advisor to Kubrick on the movie, wrote an article saying, "Hey, it's 2001, where is HAL?" Fifty years after artificial intelligence was invented, we're still very far from having something remotely as smart as HAL. He listed a number of reasons why this failed to happen. That was 2001. Today, it's 2017, and we still have no HAL. I'm still wondering why. On the other hand, everyone is doing machine learning today. Most of you are doing machine learning either as a service or building stuff with Spark and open-source libraries. Machine learning has never been so powerful and successful as today. But how could machine learning be so successful and AI be so terrible, to be honest?
Traditional machine learning doesn't work when you're trying to solve problems where features cannot be explicitly defined. Let me take an example. If you take web logs and try to predict what page or ad a user is going to click, you look at what's in the web log—time, date, user ID, URL, user agent, and tons of things. You have to find the meaningful ones to build your prediction model, but at least you get to pick your features from a well-defined set inside your log. Now, imagine you're trying to recognize images. I give you an image that's 1,000 pixels by 1,000 pixels. I want to know if it's a dog, a cat, a tiger, or a bottle of wine. That's one million pixels. Is every single pixel a feature? Which pixels really make sense? Compared to the 10 or 20 features you typically use with machine learning, it's a very different problem to solve. That's why traditional machine learning has not worked and still does not work to solve problems where features are so many and so complex that they can't be explicitly expressed.
The problem we have to solve is how to get that knowledge into a computer without explicitly expressing features. To do this, we have to go back to something very old in computer science: neural networks. Neural networks have been around almost since the 1940s, with the first major breakthroughs in the 50s. They're literally 60 years old. They've been around forever, but until very recently, they did not deliver the goods. Why? Mostly for two reasons. Back in the 50s, 60s, 70s, 80s, and even the 90s, we did not have very large data sets. To train a neural network properly, you need tons of data. The more data, the better. Without lots of data, training is not going to be very good. The second thing is, even if you had a bit of data, you needed a lot of computing power to train the network. Training is a very compute-intensive operation, and on the CPUs of the time, it simply wasn't feasible.
Now, it's quite different because we have huge data sets. All of us are generating tons of data on mobile phones, the web, video, mobile gaming, and everything. Everything we do now ends up in logs somewhere. There are large public data sets like ImageNet or the YouTube data set. Even on AWS, we host some public data sets for everyone to consume. Data sets are large and available. Thanks to GPUs, we have a ton of processing power. Recent GPUs have thousands of cores on one chip. When you have multiple GPUs in one server, that's a crazy amount of processing power in just one server. Computing is not really a problem anymore. Data sets are not really a problem anymore. Now we have all we need to train those neural networks and build cool stuff. We can do it cost-effectively thanks to the cloud, where we can grab all the resources we need, compute everything, and generate results. When we're done, we can release all those resources and stop paying for them. Scalability and elasticity are very important.
For the last few years, this is why you've been hearing about deep learning and AI every single day. It's not a buzzword. These multiple factors now make deep learning a reality, and people can use it cost-effectively to build very smart applications. Let's take a few examples. Every year, there's a challenge for research teams called the ImageNet Challenge. It's based on the ImageNet data set. They have a lot of images that need to be sorted and identified according to 1,000 categories. For example, this is an actual image in the data set. Who thinks this is the same breed of dog? Who thinks this is a different breed? Who has no idea even if I gave you 10 minutes to figure it out? The challenge is to predict five categories for each image, and if the right one is in the five, it's considered a success. For the record, this is not the same breed. I was in Stockholm not long ago, and they couldn't explain it either. So I don't feel too bad.
Every year from 2010 to this year, we have this competition. The blue line is the error rate, going from a 28% error rate down to about 3% last year. The red bars are the number of layers in the network, from one layer to 8, 19, 152, 200, and 269. That's why it's called deep learning: 269 layers in the network. The error rate for humans is 5.1%; a normal human in a decent state gets about 5%. So, computers and models are now better than us at recognizing complex images. They keep improving and can accurately classify millions of images into a thousand categories within seconds.
This one here, I'm sure you've heard about it. I met some colleagues here who are contributing technology to this, and some of it is actually built in Poland. Congratulations. That's the Echo device. I love to describe it as a mic and a loudspeaker, and people hate it when I do that. But that's mostly what it is. It's connected to the cloud and can do natural language understanding, speech processing, and text-to-speech. The quality of the voice and interaction is very impressive, and all this is based on cloud-based deep learning technology. We use deep learning in everyday life now.
Let's talk about MXNet. MXNet is an open-source library for deep learning. It was accepted into the Apache Incubator, so it belongs to everyone now. It's not driven by a single company. It's fairly easy to use, supports multiple programming languages, and is quite scalable. This is one of the reasons we picked it as our preferred library at AWS. It scales on multiple GPUs in the same machine, from 1 to 16. The red line is the ideal line for linear scaling. As you can tell, for some models, we get pretty close to linear scaling. If you go beyond 16 up to 256 GPUs running on many machines, this trend continues. MXNet scales almost linearly up to 256 GPUs. If someone were willing to spend the money to go to 10,000 GPUs, we would see that trend continue. This is important when you want to train very large models that could last for hours or days. If you want to get things done faster, you can add more GPUs and reduce the time linearly.
Training models is always about three things: defining data, defining the network, and combining them. MXNet has ND arrays for data, symbols for defining the network, and modules for combining the two. ND arrays are n-dimensional arrays where you put your input data. Symbols are how we define the networks, building a graph connecting nodes and layers. The cool thing is we can build graphs without specifying the data, allowing us to build networks that apply to any kind of data. The module combines the data and the network for training. We have functions to help you load data from well-known formats like images to make your life easier. If you want to know more, this URL is a blog article I wrote a few weeks ago as an introduction to the API. Feel free to look at it. You can go as slow as you need and understand every detail.
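To make those three building blocks concrete, here is a minimal sketch in the MXNet Python API; the array shape and layer size are illustrative, not taken from the talk:

```python
import mxnet as mx
import numpy as np

# NDArray: an n-dimensional array holding the input data
data = mx.nd.array(np.random.uniform(size=(64, 100)))  # 64 samples, 100 features

# Symbol: a graph defining the network, built without referencing any actual data
x = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data=x, name='fc1', num_hidden=10)
net = mx.sym.SoftmaxOutput(data=net, name='softmax')

# Module: binds the symbolic graph to concrete data so it can be trained
mod = mx.mod.Module(symbol=net, context=mx.cpu())
```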
On top of this API, you have higher-level APIs to build full networks in just a few lines of code. If you want to define a fully connected layer, you use the fully connected API. If you want to define a convolution layer for a convolutional neural network, you use the convolution API. These helper functions allow you to define networks in just a few lines. Using these, you can throw images, video, sounds, and text into the right model and do examples like image detection, image segmentation, and machine translation. MXNet provides all the APIs to do this and build the networks to do this.
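As an illustration, a small convolution block built with these helper functions might look like this; filter counts and kernel sizes are arbitrary:

```python
import mxnet as mx

x = mx.sym.Variable('data')
# Convolution layer: 32 filters of size 3x3
conv = mx.sym.Convolution(data=x, kernel=(3, 3), num_filter=32)
act = mx.sym.Activation(data=conv, act_type='relu')
# 2x2 max pooling to halve the spatial resolution
pool = mx.sym.Pooling(data=act, pool_type='max', kernel=(2, 2), stride=(2, 2))
# Flatten and classify with a fully connected layer
flat = mx.sym.Flatten(data=pool)
fc = mx.sym.FullyConnected(data=flat, num_hidden=10)
```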
Before I go into the demo, a quick note: you will get the slides, and we have a few more resources to make your life easier. We have the Deep Learning AMI, an Amazon Machine Image pre-installed with all the tools and frameworks. You just start the AMI, and you have MXNet, TensorFlow, and others installed, along with the CUDA drivers, so you can start working with GPUs on GPU instances right away. We also have blog articles, documentation, and additional information on our websites to help you get started, and my own blog posts are available too.
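For instance, launching a GPU instance from the Deep Learning AMI can be scripted with boto3. This is a hypothetical sketch: the AMI ID and key pair name are placeholders you would replace with your own values.

```python
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
response = ec2.run_instances(
    ImageId='ami-xxxxxxxx',    # placeholder: current Deep Learning AMI ID in your region
    InstanceType='p2.xlarge',  # a single-GPU instance type
    KeyName='my-key-pair',     # placeholder: your own EC2 key pair
    MinCount=1,
    MaxCount=1,
)
print(response['Instances'][0]['InstanceId'])
```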
Here's the first demo. I'm going to train an MXNet model on the MNIST dataset, a set of 60,000 handwritten digits, each a 28 by 28 pixel grayscale image. We'll load this into ND arrays. Let me show you the code. I'm loading the data from files into an iterator, which serves samples batch by batch. We define our network with an input layer, a first fully connected layer, a second fully connected layer, and an output layer. In four lines of code, I define my network. Then I bind my iterator to the network, specify the GPU, and train for a few epochs, saving a small part of the data set for validation and measuring accuracy. That's not a lot of code. Let's train it.
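The notebook itself isn't reproduced here, but the steps described above map onto a sketch like this; batch size, layer widths, and optimizer settings are illustrative assumptions:

```python
import mxnet as mx

# Iterators serve the MNIST samples batch by batch
# (assumes the standard MNIST files sit in the current directory)
train_iter = mx.io.MNISTIter(image='train-images-idx3-ubyte',
                             label='train-labels-idx1-ubyte',
                             batch_size=128, flat=True)
val_iter = mx.io.MNISTIter(image='t10k-images-idx3-ubyte',
                           label='t10k-labels-idx1-ubyte',
                           batch_size=128, flat=True)

# The network: input layer, two fully connected layers, output layer
data = mx.sym.Variable('data')
fc1  = mx.sym.FullyConnected(data=data, num_hidden=128)
act1 = mx.sym.Activation(data=fc1, act_type='relu')
fc2  = mx.sym.FullyConnected(data=act1, num_hidden=64)
act2 = mx.sym.Activation(data=fc2, act_type='relu')
fc3  = mx.sym.FullyConnected(data=act2, num_hidden=10)
out  = mx.sym.SoftmaxOutput(data=fc3, name='softmax')

# Bind the iterator to the network, pick a GPU, and train for a few epochs
mod = mx.mod.Module(symbol=out, context=mx.gpu(0))
mod.fit(train_iter, eval_data=val_iter,
        optimizer='sgd', optimizer_params={'learning_rate': 0.1},
        eval_metric='acc', num_epoch=10)
```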
I'm loading my images and running the 60,000 samples through the network. The training accuracy is going up, showing the network is learning to predict the correct number from a given image. We get to 100% on the training set. When I run the validation set against the model, I get 97.7% accuracy. In one minute, I can train a model to recognize handwritten digits. Let's try it. I drew some numbers with my favorite paintbrush application. We'll run these images through the network. For each image, I see a vector of 10 floats, each the probability for the corresponding digit. The network gets all of them right with high probability, except the last one, which it thinks is an 8 with 65.9% probability.
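Prediction follows the same pattern. A sketch, assuming `out` and `mod` come from the training sketch above, and `my_digits` is a hypothetical numpy array of hand-drawn images preprocessed like the training data:

```python
import numpy as np
import mxnet as mx

# 'my_digits' is a hypothetical array of shape (n, 784): hand-drawn
# 28x28 grayscale images, flattened and scaled like the training set.

# Fresh inference module: same symbol, no labels, batch size of one
pred_mod = mx.mod.Module(symbol=out, label_names=None, context=mx.cpu())
pred_mod.bind(data_shapes=[('data', (1, 784))], for_training=False)
pred_mod.set_params(*mod.get_params())

for img in my_digits:
    pred_mod.forward(mx.io.DataBatch([mx.nd.array(img.reshape(1, 784))]))
    p = pred_mod.get_outputs()[0].asnumpy()[0]
    # p is a vector of 10 floats: the probability of each digit 0-9
    print('predicted: %d (%.1f%%)' % (p.argmax(), 100 * p.max()))
```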
Can we build a better network? Yes. This time, it's a convolutional neural network, known to be very good at recognizing images. I'm defining a slightly different network and running it on three GPUs to make it faster. Training takes a bit more time, but we'll get there. While it's training, I'll check if my little buddy here is still running. It looks like it's running. Let's turn it on. Whatever that means. Blame Google Translate if it makes no sense. That's Polly, one of our AI services, mostly built in Gdansk. We should congratulate your engineers for building this. Training is done. Validation accuracy is 99% this time. Let's predict. Same thing as before, my handwritten images. It gets all of them right, including the nine, even with my bad handwriting.
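The convolutional network isn't spelled out in the talk; a plausible LeNet-style sketch, with the multi-GPU training he mentions expressed as a list of contexts:

```python
import mxnet as mx

# A small LeNet-style convolutional network (layer sizes are illustrative).
# Note: the iterator must now serve images as (batch, 1, 28, 28), not flattened.
data  = mx.sym.Variable('data')
conv1 = mx.sym.Convolution(data=data, kernel=(5, 5), num_filter=20)
tanh1 = mx.sym.Activation(data=conv1, act_type='tanh')
pool1 = mx.sym.Pooling(data=tanh1, pool_type='max', kernel=(2, 2), stride=(2, 2))
conv2 = mx.sym.Convolution(data=pool1, kernel=(5, 5), num_filter=50)
tanh2 = mx.sym.Activation(data=conv2, act_type='tanh')
pool2 = mx.sym.Pooling(data=tanh2, pool_type='max', kernel=(2, 2), stride=(2, 2))
flat  = mx.sym.Flatten(data=pool2)
fc1   = mx.sym.FullyConnected(data=flat, num_hidden=500)
tanh3 = mx.sym.Activation(data=fc1, act_type='tanh')
fc2   = mx.sym.FullyConnected(data=tanh3, num_hidden=10)
lenet = mx.sym.SoftmaxOutput(data=fc2, name='softmax')

# The only change needed for multi-GPU training: pass a list of contexts
mod = mx.mod.Module(symbol=lenet, context=[mx.gpu(0), mx.gpu(1), mx.gpu(2)])
```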
I have a few minutes left, so I'm going to try this with my friend here. If it doesn't work, I'm going to crush you. Can you see it? Yeah? He has a Twitter account. So that's a Raspberry Pi with a pre-computed MXNet model. I trained the model for hours in the cloud and copied it in there. I also have an Arduino with a remote control, and they're talking through AWS IoT. If you want to see this, please come here. Let's try to move it. You can tell it's moving, right? That's fantastic. Can I get that video? It can do this, right? It can do a little dance. Okay, I'm going to cheat a little bit. If I push that button here, I'm 88% sure that this is a baseball. The object is 21 centimeters away. Is it a baseball? Yeah? I'm done. Thank you.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.