Deep Learning for Developers lecture by Julien Simon Code Europe Autumn 2017
March 19, 2018
In recent months, Deep Learning has become the hottest topic in the IT industry. However, its arcane jargon and its intimidating equations often discourage software developers, who wrongly think that they’re “not smart enough”. Through code-level demos based on Apache MXNet, we’ll demonstrate how to build, train and use models based on different types of networks, such as multi-layer perceptrons or convolutional neural networks. Finally, we’ll share some optimization tips which will help improve the training speed and the performance of your models.
Transcript
Good morning everyone. My name is Julien. I have a working mic and I'm the AI evangelist for EMEA at AWS. As you know, we just had a very large event in the US called re:Invent, and I was there all week last week, which means I'm very, very jet lagged. Hopefully, I will still make sense. If not, just boo and throw stuff at me, okay? Today I'm going to talk about deep learning for developers. I like to call this session deep learning for humans, but whenever I have people with PhDs in the room, they get angry. So, do we have any PhDs in the room? None? Oh, now they're afraid, okay. All right. So, this is deep learning for humans then.
What are we going to cover today? Well, I'm going to give you an introduction to deep learning, and it's going to be a gentle introduction. There will be no math at all. I'm going to use your intuition and really basic concepts to explain what deep learning is and just how the hell it works. Then we're going to study an open-source library for deep learning called Apache MXNet. And more importantly, we're going to stop looking at slides as soon as possible, and I'm going to run some demos. I'm going to do this using Jupyter notebooks, which is a friendly way to do that. In the process, I will introduce you to a new service we launched literally days ago at re:Invent for machine learning and deep learning called Amazon SageMaker. Pretty exciting. And of course, I'll share some links. Slides will be available on Twitter this afternoon, I guess, if I don't fall asleep.
Just a few definitions before we start. Artificial intelligence is trying to build systems that show human-like behavior—systems that can talk, systems that can understand natural language, systems that can follow complex reasoning beyond explicit programming, and systems that can do prediction. And it's been around for 60 years, so it's not new at all. Machine learning is a subset of artificial intelligence that tries to teach machines to learn from data. So, using algorithms and data sets, we're trying to learn and show predictive behavior, classification behavior, etc. Machines learn without being explicitly programmed. Deep learning is a subset of machine learning. It has the same goal—let machines understand data, learn from data—but this time, deep learning focuses on complex data, data where features, variables, columns, if you want to think in database terms, are not visible. So think images, videos, speech, text, etc. This is what deep learning is all about.
Some examples: Amazon Alexa is based on deep learning. It's a very intuitive system to use. You talk to Alexa, and Alexa talks back to you. So it's all natural language, but, of course, behind the scenes, it's all based on deep learning. We also launched more services for deep learning last year and very recently, a few days ago. Services that can understand visual context, images, videos, services to build chatbots, services that understand text, automatic translation, natural language processing, understanding the context, speech to text, text to speech. All these services are super easy to use, just one API call away, and we try to make them as simple as possible. But behind the scenes, if you lift the cover and look at how these systems are built, they're using deep learning.
We've got customers doing this as well. Here's just one example: a Chinese company called TuSimple. They build autonomous driving systems. Last summer, they drove a truck autonomously, with no driver, for 300 kilometers in the US. This is based on MXNet as well: MXNet is part of their stack, used to build and train the models that drive the truck. So these are some examples of what you can do with machine learning and deep learning. Let's dive in and see what deep learning is all about.
The basic construct at the core of deep learning is called the neuron. The neuron is based on trying to imitate the neurons in our brain. A neuron is a mathematical construct that takes a number of inputs, the X values on the left. Each input is multiplied by a weight, which is a floating-point value, could be positive or negative, and we add all of that up. This gives us what we call the multiply and accumulate operation. We take the inputs, multiply them by weights, and get a value. We run that value through what we call an activation function. There are many different activation functions. For example, the rectified linear unit (ReLU) is very popular at the moment. This function decides that if the output value from the neuron is negative, we set it to zero. We consider that this neuron has not fired and don't want to propagate that value down to other layers. If the value is positive, we just output it. The role of these functions is to introduce non-linear behavior into the network, which is a complicated way to say we have to make a decision. We have to draw a line somewhere and say, on this side, I don't care what the neural output is, and on this side, I'm interested and want to keep it.
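To make the multiply-and-accumulate idea concrete, here is a minimal NumPy sketch of a single neuron with a ReLU activation. The bias term and the example values are illustrative additions, not something from the talk.

```python
import numpy as np

def relu(z):
    # Rectified linear unit: keep positive values, set negative ones to zero
    return np.maximum(z, 0.0)

def neuron(x, w, b):
    # Multiply and accumulate: each input times its weight, all added up (plus a bias)
    z = np.dot(x, w) + b
    # The activation function introduces the non-linear "decision"
    return relu(z)

x = np.array([0.5, -1.2, 3.0])   # three inputs
w = np.array([0.8, 0.1, -0.4])   # one weight per input
print(neuron(x, w, b=0.2))       # the neuron's output (0.0 if the weighted sum is negative)
```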
By itself, a neuron doesn't do much. You need to combine them to build layers and neural networks. Here's a very basic example. We have inputs on the left, outputs on the right, and in the middle, we have what we call a hidden layer. As you can see, it's fully connected to the inputs and outputs. This is probably the simplest neural network you could build. How do we train this kind of network with our data? We build our data set, which is made of samples. Each line in that matrix is a sample, and as you can see, it has as many columns as we have neurons on the input layer. Here, we have m samples, and each sample has l features, which is the number of neurons on the input layer. For each sample, we will have a label. Let's say we're trying to build a classification system. We're trying to classify those samples into a number of categories. For each sample, I've got a category. For the first line, I've got that y1 value, which is my label. Imagine we have image categories, a hundred different categories, category 1, 2, 3, etc. These are the values we would use here. So, I've got m samples, m labels, and on the output layer, I've got N2 output neurons, so N2 possible categories.
What I want on this output layer is the probability, in each neuron, that a given sample belongs to that category. For example, the first sample belongs to category 2, so the probability that it belongs to category 2 is 100%, which is 1, and all the other probabilities are 0. This technique is called one-hot encoding, and it delivers good results. We use it when we do classification. Let's put this into practice. Here, I've got my first sample and the one-hot encoded category for that sample. You have to rotate that in your mind: put the sample on the input layer and the one-hot encoded category on the output layer. The goal is to build a network that can predict this correctly. Whenever we show the X1 sample on the input layer, this is the result we want to get. The same for X2, X3, etc. This is the goal of training. We measure training by accuracy: we run our data set through the network, count the number of correct predictions, and calculate accuracy. Initially, the network is blank and doesn't know how to predict. When I apply the network to my X1 sample, I get Y', which is probably completely wrong. I want to predict the correct results, so I use a loss function, a mathematical function that calculates the difference between the predicted value and the real value. We compute the error between what we get and what we want to get. We extend this to a batch of samples because we train the network batch by batch to speed things up, say 32 or 64 samples at a time. We add up the errors for all the samples in the batch, and that gives us the error for the batch. The purpose of training is to minimize this error: if we minimize the error, our predicted result gets closer to the actual result, which is what we want.
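As a small illustration, here is a NumPy sketch of one-hot encoding and of a batch error computed with a cross-entropy loss, which is one common choice of loss function for classification; the sample values are made up.

```python
import numpy as np

def one_hot(labels, num_classes):
    # Each label becomes a vector with a 1 at its category index and 0 everywhere else
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def batch_loss(predictions, one_hot_labels):
    # Cross-entropy between predicted probabilities and one-hot labels,
    # summed over all the samples in the batch
    eps = 1e-9
    return -np.sum(one_hot_labels * np.log(predictions + eps))

y = np.array([2, 0, 1])                        # three samples: categories 2, 0 and 1
print(one_hot(y, num_classes=3))               # rows: [0,0,1], [1,0,0], [0,1,0]

predictions = np.array([[0.1, 0.2, 0.7],       # made-up network outputs for this batch
                        [0.8, 0.1, 0.1],
                        [0.3, 0.4, 0.3]])
print(batch_loss(predictions, one_hot(y, 3)))  # the error for the batch
```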
How do we minimize the error? We tweak all the weights. If you remember analog TV, when you had to tweak the knobs to get a clear picture, it's exactly what we're doing here, except it's not just a few knobs; it's hundreds of thousands, possibly millions. The training process is all about this. You take a batch, push it through the network, compute the error, and run an algorithm called backpropagation, which goes back through the network and tweaks the weights to reduce the error just a little bit. We won't get it right after one batch, but at least we go one step in the right direction. We do the second batch and backpropagate again. We do this batch by batch until we've gone through the whole data set, which is called an epoch. We run this again and again for 20 epochs, 50 epochs, 200 epochs, maybe more, until we get to a very tiny error. Once we're done, we get a trained network. The weights have been tweaked to minimize error for that data set. The important parameters, called hyperparameters, are the batch size, the learning rate, and the number of epochs.
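To show that batch-by-batch weight tweaking in miniature, here is a toy NumPy training loop for a single linear neuron with a squared-error loss. It is not the talk's network, just a sketch of how batch size, learning rate, and epochs fit together.

```python
import numpy as np

np.random.seed(0)
X = np.random.rand(256, 4)                 # 256 samples, 4 features each
true_w = np.array([1.0, -2.0, 0.5, 3.0])   # the "right answer" we hope to recover
y = X.dot(true_w)                          # targets for a toy regression task

w = np.zeros(4)                            # a blank model: all weights at zero
batch_size, learning_rate, num_epochs = 32, 0.1, 20   # the hyperparameters

for epoch in range(num_epochs):            # one epoch = one full pass over the data
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        error = xb.dot(w) - yb             # predicted minus actual, for this batch
        gradient = xb.T.dot(error) / len(xb)
        w -= learning_rate * gradient      # tweak the weights a little, in the right direction
    print(f"epoch {epoch}: loss {np.mean((X.dot(w) - y) ** 2):.4f}")   # keeps shrinking
```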
The training part is easy because the network will learn the training set very well. The problem is how the network behaves when we show it samples it has never seen before, which is called the validation data set. We split the data set between a training part and a validation part. We take the validation data set, push it through the network, and measure performance, and we like to do this at the end of each epoch. We also save the model after each epoch, so after each epoch we have both a checkpoint and a measure of how well the network predicts samples it has not seen before. The question is when to stop. If you plot these values, training accuracy will go up, move up and down a bit, but eventually you get to 100%. If you plot the loss function, it will go down towards zero. If you plot validation accuracy, it will follow training accuracy for a while, and at some point it will start to go down. This is bad because it means the network is not working well with new samples, and you want to stop at this point. While training, you don't know when that will happen, which is why it's important to save the model after each epoch. When you look at the graphs later, you can say, "This was the best epoch; after that, it just goes down." Once validation accuracy goes down, we call this overfitting, which means the network learns the training set too well, hurting prediction for new samples. That's why you need to save your model and keep all the checkpoints, so you can go back once training is complete and say, "This is the best one." Make sense?
Let's talk about MXNet now. MXNet is an open-source library. The project started in 2015 and is a collaboration between industry and universities. It's the preferred framework at AWS because it supports more languages than most others and works equally well in cloud environments and IoT environments. I run some demos on Raspberry Pis, and it works pretty well. It's part of the Apache project now, so it doesn't belong to anybody; it's just driven by Apache, which I like personally. It's high performance and scales very well: you can scale it to hundreds of GPUs almost linearly, which is very good if you have a huge data set. MXNet 1.0 was just released yesterday, so it's stable now. Here are some projects built with MXNet; all of these are GitHub projects you can grab. Object detection finds out what's in the picture and where it is. Object segmentation, which is one step further, finds the silhouette of objects, which is crucial for autonomous driving, where you need to be precise. You can do text recognition, even with very weird fonts and bizarre images. Face detection finds faces in a picture and attributes like gender, age, and whether the person has glasses. We have a service called Amazon Rekognition, which is very high-level, but if you want to build your own version, there you go. Pose estimation, my personal favorite, detects the position of the human body in real time, useful for video games and movies. For machine translation, there's another open-source project by AWS called Sockeye: you train a model to translate, say, English to French or German to Swedish, and it works quite well.
Let's look at the key objects in the API and then jump into the demo. The four things you need to build MXNet code are: one, a way to store your data, which is the NDArray API for N-dimensional arrays. Two, a way to build models, defining layers and how they're connected, using the Symbol API. Three, a way to serve data to the model during training, using iterators. Four, a high-level object that puts the data and the model together for training, called a Module. These are the four objects we'll see in the code.
Let's jump to the demos. This is the Hello World for MXNet, and it's still interesting. We'll create a synthetic dataset from random values: 10,000 samples, 8,000 for training and 2,000 for validation. My matrix has 10,000 lines, each sample has 100 features, so my matrix has 100 columns. Each sample is labeled with a category between 0 and 9. I declare my samples, initializing X with random values between 0 and 1 for the features and Y with random integer values between 0 and 9 for the labels. So each sample has 100 random values between 0 and 1 and a label which is an integer between 0 and 9. Once I've built my data, I split it: 8,000 samples for training and 2,000 for validation.
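A minimal sketch of that data preparation with NumPy and MXNet; the variable names are mine, but the shapes match the ones described above.

```python
import numpy as np
import mxnet as mx

# 10,000 samples, 100 random features each, labels between 0 and 9
num_samples, num_features, num_classes = 10000, 100, 10
X = np.random.uniform(0, 1, (num_samples, num_features)).astype('float32')
Y = np.random.randint(0, num_classes, (num_samples,)).astype('float32')

# Split: 8,000 samples for training, 2,000 for validation
X_train, Y_train = X[:8000], Y[:8000]
X_val, Y_val = X[8000:], Y[8000:]

# Iterators serve the data to the model batch by batch during training
batch_size = 16
train_iter = mx.io.NDArrayIter(X_train, Y_train, batch_size, shuffle=True)
val_iter = mx.io.NDArrayIter(X_val, Y_val, batch_size)
```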
Let's build a very simple network: one input layer, one hidden layer with 1024 neurons, and one output layer with 10 neurons. Why 10? Because I have 10 categories. Each neuron in the output layer stores the probability that the sample belongs to a specific category. It's straightforward to define this network, just a few lines of code. Then, I build my iterator, which serves the data batch by batch. The batch size is 16, so I serve my samples 16 by 16 to the model. I bind the model to the training iterator and labels, set initial parameters, and choose my optimizer for backpropagation. Now, let's run it. I'll train the dataset for 50 epochs. It's very fast because I'm running on a GPU and it's a small dataset. As you can see, I quickly get to 100% accuracy. After 50 epochs, training accuracy is 100%. If you give enough time and data to a neural network, it will learn anything perfectly. This is called the Universal Approximation Theorem. It's very cool, but remember the validation accuracy. Training accuracy is not a surprise; it's 100%. What about validation accuracy? I build a validation iterator with the validation data set and score the validation samples. The accuracy is 10%, which is very bad. It's 10% because I have 10 classes, and if I take random picks, my score will be 10% over time. This shows that while the network can learn anything, it performs terribly on new samples if the data set makes no sense.
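Putting it together, here is a hedged sketch of this demo with the MXNet 1.x Module API, recreating the toy data from the previous sketch; the number of epochs is reduced to keep it quick, and the optimizer settings are illustrative.

```python
import numpy as np
import mxnet as mx

# Toy data as in the previous sketch: 10,000 random samples, 100 features, 10 classes
X = np.random.uniform(0, 1, (10000, 100)).astype('float32')
Y = np.random.randint(0, 10, (10000,)).astype('float32')
train_iter = mx.io.NDArrayIter(X[:8000], Y[:8000], batch_size=16, shuffle=True)
val_iter = mx.io.NDArrayIter(X[8000:], Y[8000:], batch_size=16)

# One hidden layer with 1024 neurons, one output layer with 10 neurons (Symbol API)
data = mx.sym.Variable('data')
fc1 = mx.sym.FullyConnected(data, num_hidden=1024)
relu1 = mx.sym.Activation(fc1, act_type='relu')
fc2 = mx.sym.FullyConnected(relu1, num_hidden=10)
mlp = mx.sym.SoftmaxOutput(fc2, name='softmax')

# Module: bind data and model, pick an optimizer, train for a few epochs
# (use mx.gpu() instead of mx.cpu() if a GPU is available)
mod = mx.mod.Module(symbol=mlp, context=mx.cpu())
mod.fit(train_iter,
        eval_data=val_iter,                        # validation accuracy after each epoch
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.1},
        num_epoch=10,                              # 50 in the demo
        epoch_end_callback=mx.callback.do_checkpoint('mlp'))  # save a checkpoint per epoch

# Score on the validation set: with random labels, expect roughly 10% accuracy
print(mod.score(val_iter, 'acc'))
```

Because the labels are random noise, training accuracy can climb while validation accuracy stays around chance level, which is exactly the point of this Hello World.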
Let's look at a more meaningful example. Let's pretend we don't want to train and just classify images using pre-trained networks. You can download these from the MXNet website. I've got three networks trained on the ImageNet dataset, which has a thousand categories. I've downloaded them already. These models have been trained on ImageNet, and here are the first ten categories. Let's load these models and run an image through them. I load a model with one line of code, bind it, and prepare the data. The data is one image, three channels (red, green, blue), and each channel is 224 by 224. I load the categories and use OpenCV to read the image, convert it to RGB, resize it to 224x224, and build the ND array. I push the image through the network and get the output, which is a thousand probabilities. I sort these probabilities and return the top five categories. Now, I can predict. I load my image, turn it into an array, and push it through the network. The first network, VGG16, tells me there's a 96% probability there's a violin in the image. ResNet and Inception also predict a violin with high probabilities. It's quite fast, even on the CPU. On the GPU, it's about 20 times faster. GPUs are unbeatable here because they can process multiple images in parallel.
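Here is a sketch of this inference flow with the MXNet 1.x Module API. The checkpoint prefix 'vgg16', the epoch number 0, and the image file name are assumptions based on the usual model zoo naming, and mapping the top-5 indices to human-readable categories would use the ImageNet synset file.

```python
from collections import namedtuple
import cv2
import numpy as np
import mxnet as mx

# Assumes 'vgg16-symbol.json' and 'vgg16-0000.params' were downloaded beforehand
sym, arg_params, aux_params = mx.model.load_checkpoint('vgg16', 0)
mod = mx.mod.Module(symbol=sym, label_names=None, context=mx.cpu())
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params)

# Read the image with OpenCV, convert BGR to RGB, resize to 224x224,
# move channels first and add a batch dimension: shape (1, 3, 224, 224)
# (a production version would also subtract the ImageNet pixel mean)
img = cv2.imread('violin.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (224, 224)).astype(np.float32)
img = np.transpose(img, (2, 0, 1))[np.newaxis, :]

# Forward pass: the output is a vector of 1,000 class probabilities
Batch = namedtuple('Batch', ['data'])
mod.forward(Batch([mx.nd.array(img)]))
prob = mod.get_outputs()[0].asnumpy().squeeze()

# Sort the probabilities and keep the top five categories
for i in np.argsort(prob)[::-1][:5]:
    print(i, prob[i])
```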
Let's train from scratch now. I'm working with the MNIST dataset, which has 60,000 handwritten digits from 0 to 9. The goal is to recognize the digits. I've downloaded the dataset, so let's run it. I build my iterators, build the network, bind, set parameters, and train for 50 epochs. I save the model and measure accuracy, getting 98.43%. Let's predict some real-life samples. I drew these digits myself and will run them through the network. For each digit, I get 10 probabilities. The 0 is recognized as a 0, the 1 is clearly a 1, the 2 is okay, the 3 is very good, and so on through 4, 5, 6, 7 and 8, but the 9 is completely wrong. The network is supposed to be 98% accurate, yet it fails on my 9, so let's try another model, a convolutional neural network (CNN), which is better at understanding images. This time I build the iterator, build the network, bind, and train for 25 epochs. I save the model and predict again. Now, the 9 is correctly identified as a 9. CNNs are much better at understanding images. Accuracy might look like just another data point, but it really matters: the difference between 98.4% and 99.1% can be significant.
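For the convolutional model, here is a hedged sketch of a LeNet-style network with the Module API; it's a classic architecture for MNIST, not necessarily the exact one used in the demo, and mx.test_utils.get_mnist() is used here to fetch the dataset.

```python
import mxnet as mx

# Download MNIST and wrap it in iterators (images are 1x28x28, labels 0 to 9)
mnist = mx.test_utils.get_mnist()
batch_size = 64
train_iter = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'],
                               batch_size, shuffle=True)
val_iter = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)

data = mx.sym.Variable('data')
# Two convolution + pooling blocks learn local patterns in the image
conv1 = mx.sym.Convolution(data, kernel=(5, 5), num_filter=20)
act1 = mx.sym.Activation(conv1, act_type='relu')
pool1 = mx.sym.Pooling(act1, pool_type='max', kernel=(2, 2), stride=(2, 2))
conv2 = mx.sym.Convolution(pool1, kernel=(5, 5), num_filter=50)
act2 = mx.sym.Activation(conv2, act_type='relu')
pool2 = mx.sym.Pooling(act2, pool_type='max', kernel=(2, 2), stride=(2, 2))
# Fully connected layers turn those patterns into 10 class probabilities
flat = mx.sym.Flatten(pool2)
fc1 = mx.sym.FullyConnected(flat, num_hidden=500)
act3 = mx.sym.Activation(fc1, act_type='relu')
fc2 = mx.sym.FullyConnected(act3, num_hidden=10)
lenet = mx.sym.SoftmaxOutput(fc2, name='softmax')

mod = mx.mod.Module(symbol=lenet, context=mx.cpu())   # a GPU is much faster here
mod.fit(train_iter,
        eval_data=val_iter,
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.1},
        num_epoch=5,                                   # 25 in the demo
        epoch_end_callback=mx.callback.do_checkpoint('lenet'))
print(mod.score(val_iter, 'acc'))
```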
For the last few minutes, I want to talk about SageMaker. All the notebooks you saw are running in the cloud on a GPU instance created with Amazon SageMaker, launched at re:Invent a few days ago. SageMaker lets you go end-to-end. You can run your notebooks, train models, start instances for distributed training, save models, and deploy them behind a REST endpoint. You can go from experimentation to training to deployment with the same set of tools. Here's an example of training MNIST with SageMaker. It's very similar to what we did, but instead of training locally, I can start multiple instances to train large models. This is done with a single line of code. If you train the same model with different learning rates or parameters, you can test them to see which performs best.
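For illustration, here is a hedged sketch using the SageMaker Python SDK as it looked around that time (v1-era parameter names); the training script 'mnist.py', the IAM role, the S3 path, and the instance types are placeholders, not values from the talk.

```python
from sagemaker.mxnet import MXNet

# An MXNet estimator: the training code lives in a separate script (here, a
# hypothetical 'mnist.py'), and SageMaker runs it on managed instances.
estimator = MXNet(entry_point='mnist.py',
                  role='SageMakerRole',                 # placeholder IAM role
                  train_instance_count=2,               # distributed training on 2 instances
                  train_instance_type='ml.p2.xlarge',   # GPU instances
                  hyperparameters={'learning_rate': 0.1, 'epochs': 25})

# Train on data stored in S3, then deploy the model behind a managed endpoint
estimator.fit('s3://my-bucket/mnist')
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge')
```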
If you want to get started, these are the right pages to look at. The MXNet website and the repo on GitHub are great resources. There's another API on GitHub called Gluon, which is even simpler than the one I showed you. The Sockeye project is useful for automated translation. I recorded a 30-minute walkthrough video of SageMaker, which goes into a lot of detail on how to use it. My blog on Medium has a lot of MXNet and deep learning content, so feel free to read that if you're interested.
Thank you very much for listening. Thanks to Code Europe for this fantastic event. It's my first time here, and I'm very impressed. If you have questions, I'll be outside and I'll do my best to answer them. Jet lag wasn't such a problem after all. Thank you very much. Have a good day.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.