Deep Learning for Developers - Julien Simon - GOTO 2018

August 23, 2018
This presentation was recorded at GOTO Amsterdam 2018. http://gotoams.nl

Julien Simon - Principal AI Evangelist at Amazon

ABSTRACT
Deep Learning has become the hottest topic in the IT industry: a week hardly goes by without a new breakthrough. In this talk, we'll give you a tour of the state of the art: through code-level demos based on open-source libraries like Apache MXNet or Keras, we'll demonstrate advanced models based on CNN, GAN and LSTM architectures, which solve complex [...]

Download slides and read the full abstract here: https://gotoams.nl/2018/sessions/415

Transcript

My name is Julien, I'm a tech evangelist with AWS focusing on AI and machine learning, and I've been with AWS for almost three years now. My topic today is deep learning for developers. I wanted to call it deep learning for humans, because it's really deep learning for normal people, not people who have 5 PhDs and 200 IQs. I have neither. My point today is to show you that if you can read and write 50 lines of Python, you can do deep learning. I really want to show you how easy it is to understand the basics, the concepts of deep learning, and then to actually write the minimal code and run the minimal processes needed to use deep learning in your apps. So I will start with an introduction to deep learning. I'll try to make it as easy as possible, with as little math as possible, but feel free to yell and throw stuff at me if it makes no sense. It usually does. Then we'll go through demos, because I want to run some code and show you that anybody in the room can do it.

We've all heard about neural networks, and this is the technology that deep learning relies on. Machine learning in general relies mostly on statistics and math and can use a combination of algorithms; deep learning, which is a subset of machine learning, uses neural networks and only neural networks. So we need to understand what a neural network is, and we should start by understanding what a neuron is. What we're trying to do here is approximate the biological neuron, the brain neuron. Brain neurons, if they're stimulated enough, fire an electrical current, and if they're not stimulated enough, they don't fire. An artificial neuron is a collection of inputs, of signals. Each input is associated with a weight. The first thing we do is take each input, multiply it by its weight, and sum everything. This is called multiply and accumulate. These inputs and weights are just floating-point values, just information, data samples that we feed to the neuron. And really, that's what deep learning is all about: multiplying and adding stuff. You don't need a math PhD to do this.

The problem, of course, is that this function is a linear function. If you change the inputs, the result will vary in a linear fashion. That's not what we want, because we said that the biological neuron fires or not. There's a limit, a threshold, that says: below this threshold nothing happens, above this threshold something happens. That's nonlinear behavior. So we need to add an extra function after the multiply and accumulate operation, and this is called the activation function. It decides whether the neuron will fire or not. Over time, a number of functions have been used. The popular one these days is called ReLU. If the input to the activation function is negative, the output is zero. If the input is positive, the output is that value itself. This is the behavior we want: if the neuron is stimulated enough, we get an output, and if the neuron is not stimulated enough, nothing happens. This introduces nonlinear behavior in the neuron. At the core of neural networks, this is what really happens: take inputs, multiply by weights, add everything, and run the result through the activation function. A neuron by itself doesn't do much; it's not really interesting. We need to put them together to build networks.
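To make that concrete, here is a minimal sketch of a single artificial neuron in plain NumPy (the inputs and weights are made up for illustration): a multiply-and-accumulate followed by a ReLU activation.

```python
import numpy as np

def relu(x):
    # ReLU activation: 0 for negative inputs, the value itself for positive inputs
    return np.maximum(0, x)

def neuron_output(inputs, weights):
    # Multiply each input by its weight and sum everything (multiply and accumulate),
    # then run the result through the activation function
    z = np.dot(inputs, weights)
    return relu(z)

# Made-up sample: three input values and three weights
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
print(neuron_output(x, w))  # a single floating-point output
```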
This is probably the simplest neural network you can build: an input layer where we push our samples, an output layer where we read results, and in the middle, at least one layer, which we call the hidden layer. In this case, we call it a fully connected layer because each neuron is fully connected to all inputs. It's a full mesh. In case you wonder why this is called deep learning, it's because you tend to have many hidden layers. You stack them up like pancakes, except the stack is horizontal. In this case, we use just one hidden layer.

How does that work? We need data. For the sake of discussion, let's say we're trying to classify images. We have a collection of images, and to make things simple, we'll flatten all these images into vectors. Each individual pixel value is a value in this vector. So x1 is my first image, and each x is a pixel value. x2 is the second sample, and xn is the last sample. We take images, flatten them into vectors, and use the pixel values as features. We know what those images are. Let's say we have 10 categories: dogs are category 2, cats are category 0, elephants are category 4, etc. We know what those images are, so we can build labels in a vector and provide the corresponding category number for each sample. The first image is a dog, the second is a cat, and the last one is an elephant. This is called supervised learning: you have data and you know what that data is. Classification is an example of supervised learning. You can also do unsupervised learning, and clustering would be an example of that: taking a million samples and trying to group them into 10 groups without knowing what those groups are, just letting the algorithm decide.

Actually, these raw category numbers are not really useful. In machine learning, we don't like categorical values, so the first thing we do is use one-hot encoding to transform them. Each category number becomes a vector of 10 bits, and we flip to one the bit that corresponds to the actual category. This is called one-hot encoding. It is more expressive because we can look at those numbers as probabilities for each of the classes. When we look at this sample, we know it has a 100% chance of being in category 2, and when we look at this one, we know it has a 100% chance of being in category 4 and a 0% chance for all the other classes. This is more expressive because we see the number of possible classes and the probability for each class. Here we have 0s and 1s because we know for sure what's in those pictures, but what we predict will be different probabilities. Hopefully, one of them will be really close to one, showing the right class.

So how do we actually do this? If we have a trained network, we could take the first sample, put each feature into a neuron in the input layer, which means I need as many input neurons as I've got features. If I've got a thousand pixels, I will need a thousand input neurons. Then I will run the multiply and accumulate operations using the weights in the network and run the activation functions. This is called forward propagation. Ideally, I will get zeros everywhere except in the neuron for the right class. This requires that I have as many output neurons as I've got classes: if I have 10 classes, I need 10 output neurons. In a perfect world, this is how things would work. But there are two questions we haven't answered: How do you train the network? How do you find the right weights? And will we really get zeros and ones? What we care about is the accuracy: the number of correct predictions for the dataset divided by the total number of predictions.
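As a small illustration of the two ideas we just covered, here is a NumPy sketch (with hypothetical labels and predictions) of one-hot encoding and of computing accuracy as correct predictions divided by total predictions.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    # Each category number becomes a 10-bit vector with a single 1
    # in the position of the true class
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

labels = np.array([2, 0, 4])          # hypothetical samples: dog, cat, elephant
print(one_hot(labels))

# Predictions are probabilities, not exact 0s and 1s, so we take the most likely class
predicted = np.array([[0.1, 0.0, 0.8, 0.0, 0.1, 0, 0, 0, 0, 0],   # argmax = 2, correct
                      [0.7, 0.1, 0.1, 0.1, 0.0, 0, 0, 0, 0, 0],   # argmax = 0, correct
                      [0.0, 0.6, 0.0, 0.0, 0.4, 0, 0, 0, 0, 0]])  # argmax = 1, wrong
accuracy = np.mean(np.argmax(predicted, axis=1) == labels)
print(accuracy)                        # 2 correct out of 3 -> about 0.67
```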
A high accuracy means we've successfully learned how to predict our samples. In a nutshell, we're trying to forward propagate samples and read results on the output layer, hopefully correct ones. Initially, it won't work like that because the weights are random values. It's not going to be all zeros and ones; it's just going to be random values. There are many ways to initialize those weights, but to keep it simple, let's say they are random values. If I take a sample and forward propagate it, I will get something completely wrong on the output. I won't even get a 1 in the wrong position and 0s elsewhere; I will just get random probabilities.

If you see this network as a math function, the input and all the weights are the parameters. Let's call this function f. When I take a data sample and compute its category with f, I get the wrong result, so I need to measure the difference between the real value and the predicted value. We use loss functions to do this. The loss function computes the error, a floating-point value, between those two vectors. Once we know the error for a sample, we could tweak the weights to reduce the error, then run another sample and do it again and again. Most of the time, we work with batches of samples, not individual samples. We take a bunch of samples, run each through the model, and add up all the individual errors. This gives us a batch error, and with that batch error, we can make decisions on updating the weights. This is called mini-batch training. The purpose of the training process is to minimize loss, to minimize error, gradually, batch by batch, by adjusting the weights. Forward propagate a batch, adjust the weights, and hopefully we made the right decision; then move on to another batch and adjust the weights again. If we make the right decisions, we reduce the error gradually.

This is what the training process looks like. We have a training dataset. We slice that dataset into batches, and the libraries will usually do this for us. We propagate each sample of a batch, add up all the errors, and come up with a batch error. Then backpropagation comes into play. Backpropagation goes back from the output layer to the input layer, adjusting the weights layer by layer in the direction that reduces the error. For each weight, we need to decide whether to increase or decrease it to reduce the error. Backpropagation does this for each layer, moving back all the way to the input layer. We do this again with another batch and backpropagate again. If we update the weights correctly, we take the error down a little bit, and if we do this for many batches, step by step, we get to a lower error. Eventually, we get to the end of the training set, which is called one epoch. We train for many epochs, maybe 100, 200, 300 epochs. It's a lot of iterative processing: sending a batch, backpropagating, and doing it all over again until we reach the end of the dataset, then starting all over again. Gradually, if we make the right decisions, the weights converge to a set that gives us a very low error, and if you have a very low error, you have very high accuracy.
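Here is a minimal sketch of that mini-batch training loop using MXNet Gluon. It is not the code from the talk: the data is random stand-in data and the model is a single layer, but the structure (slice into batches, forward propagate, compute the batch loss, backpropagate, update the weights, repeat for several epochs) is the one just described.

```python
import mxnet as mx
from mxnet import autograd, gluon

# Random stand-in data: 1000 samples of 784 features, labels in 10 classes
X = mx.nd.random.uniform(shape=(1000, 784))
y = mx.nd.random.randint(0, 10, shape=(1000,)).astype('float32')
loader = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y),
                               batch_size=128, shuffle=True)  # slices the set into batches

net = gluon.nn.Dense(10)                        # stand-in model: one fully connected layer
net.initialize()                                # weights start as (pseudo-)random values
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()  # measures the error between prediction and label
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

for epoch in range(5):                          # one epoch = one full pass over the dataset
    for data, label in loader:
        with autograd.record():
            output = net(data)                  # forward propagate the batch
            loss = loss_fn(output, label)       # per-sample errors for the batch
        loss.backward()                         # backpropagation: gradient for every weight
        trainer.step(data.shape[0])             # adjust the weights to reduce the error
```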
The magic part is how we know whether a weight should be increased or decreased, and this takes us back to high school derivatives. Derivatives give you the slope, and when you know the slope, you know which way is up and which way is down. Imagine a simple parabolic function. Take any point: if you're on the right-hand side of the axis, the slope is positive, so increasing x increases y, and decreasing x makes y go down, which is where you want to go. This is high school stuff. The difference here is that these functions have multiple parameters. In this layer, the output depends on three parameters, so the error function has three parameters, not just one, and you need to compute partial derivatives. Backpropagation computes the partial derivative with respect to each weight and decides whether to increase or decrease that weight. The deep learning library does this automatically for you.

The learning rate influences the size of the steps you take, the size of the updates to the weights. If you have a very tiny learning rate, you make very small updates, so you might be moving in the right direction, but very slowly. If you have a large learning rate, you make very large updates, which might be too large. The number of epochs is how many times you go through the full dataset. These are called hyperparameters, and they are critical in getting good results. The actual optimizer is the algorithm that updates the weights; backpropagation is the overall process. The granddaddy of optimizers, SGD, was invented in 1951, even before AI and machine learning. Imagine the error function has just two parameters, x and y, and z is the error. If you plot that, you start anywhere on that surface with random values for x and y. You compute partial derivatives to get the slope with respect to x and y, so you know which way is down. You make a tiny update to x and y, so you go down a bit. You do it again and again, and by walking down the mountain step by step, you get to a very low error value. This is what SGD does. We need to do this many times because we tend to take small steps down the mountain. The step size is the learning rate: if you have a very tiny step size, it will take forever to get there, and if you have a large learning rate, you might bounce around and take giant steps that prevent you from going down.

SGD is just one way of doing it. Other optimizers, like the Ada family (AdaGrad, AdaDelta, Adam), have nice properties: they avoid getting stuck in local minima and can detect the slope and speed up, getting you to the minimum faster. But life is never easy. What if the error function looks like this, and you start here on the red mountain and end up in a local minimum? Intuitively, it would be more desirable to be over here, but you get stuck. There's a debate in the deep learning community about whether these local minima really exist and whether they hurt the training process. Research by Ian Goodfellow suggests that while local minima exist, the training process tends to avoid them, and we don't get stuck in very shallow local minima.
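To make the walking-downhill idea concrete, here is a tiny numerical sketch of gradient descent on the parabola y = x² mentioned earlier: the derivative gives the slope, and the learning rate sets the step size (the starting point and values are arbitrary).

```python
# Gradient descent on y = x**2: the derivative dy/dx = 2x gives the slope,
# and we repeatedly step in the opposite direction.
learning_rate = 0.1   # step size: too small is slow, too large can bounce around
x = 4.0               # arbitrary starting point

for step in range(25):
    gradient = 2 * x                   # slope at the current point
    x = x - learning_rate * gradient   # update: move a small step downhill

print(x)  # close to 0.0, the minimum of the parabola
```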
The training process should work, and we need to measure how well it does. We keep part of the data for validation, and at the end of each epoch we predict those samples, which the network hasn't seen during training, to measure accuracy. This gives us a sense of how well the trained model generalizes to new samples. Once we're done training, we use yet another dataset, called the test dataset, to get a final benchmark accuracy. So the training set is for training, the validation set is for checking accuracy after each epoch, and the test set is for a final benchmark. If you plot everything, the training accuracy will go up quickly and then plateau; if you train long enough, it will get to 100%, and the training loss will go down to almost zero. The validation accuracy will go up sharply and plateau, and if you train too long, it might drop, which is called overfitting. Overfitting means you trained so hard on your training set that you cannot predict anything else: the model is too specialized and doesn't generalize to new samples. You need to train long enough to reach the top accuracy, but it's difficult to forecast which epoch will be the best one, so you train for a long time, save the model at the end of each epoch, and plot the accuracies to find the best one. Deep learning is about finding a local minimum, not the global minimum, because finding the global minimum is NP-hard. You just need to find a minimum that's good enough for your business problem. If your business problem requires 95% accuracy and you get 95.1%, that's fine. Why spend more time and resources to get to 97% if there's no business value? You need to find a minimum that works for you, and if you can find it fast, that's even better, because you will train many times and retrain with new data, so finding that result again and again is important.

Alright, we're done with the theory, so let's look at an example. We'll work with the MNIST dataset, a set of handwritten digits from 0 to 9, with 70,000 images: 60,000 for training and 10,000 for validation. We'll try to learn these images using a fully connected network with Apache MXNet, the preferred library at AWS. I've downloaded the dataset and imported MXNet. I'll train for 50 epochs and use an iterator to slice the dataset into batches. I'll build a simple fully connected network with an input layer, a first fully connected layer with 1024 neurons activated by ReLU, a second layer with 512 neurons activated by ReLU, and an output layer with 10 neurons for the 10 categories. I'll train on a GPU, bind the model to the dataset, and pick an optimizer: AdaDelta with a learning rate of 0.1. We can do this live; the batches and epochs go by quickly because it's a GPU and a small dataset. After 50 epochs, the model reaches 98.14% validation accuracy. I can save it and try it with some new images. I drew some digits with a paintbrush, resized them to 28x28, and will try to classify them. I load the trained model and forward these new images through it. For each digit, you see the 10 probabilities, one for each class; the first value is for class zero. The first digit has a 99.56% probability of being class zero, which is great. The second digit is classified nicely, and the third digit is also good. The fourth digit, however, is classified as class 4, which is wrong: the model is confused by this ugly 9 and classifies it as a 4. This shows that even with 98% validation accuracy, one of your own samples can be completely wrong. The moral is never to trust validation sets; trust your own samples.
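The notebook itself isn't reproduced in this transcript, but a network and training setup like the one just described (two ReLU layers of 1024 and 512 neurons, 10 outputs, AdaDelta with a 0.1 learning rate, 50 epochs on a GPU) might look roughly like this with MXNet's Module API; the batch size and data-loading details are assumptions.

```python
import mxnet as mx

# Download MNIST and flatten each 28x28 image into a 784-value vector
mnist = mx.test_utils.get_mnist()
train_iter = mx.io.NDArrayIter(mnist['train_data'].reshape(-1, 784),
                               mnist['train_label'], batch_size=128, shuffle=True)
val_iter = mx.io.NDArrayIter(mnist['test_data'].reshape(-1, 784),
                             mnist['test_label'], batch_size=128)

# Fully connected network: input -> 1024 ReLU -> 512 ReLU -> 10-way softmax output
data = mx.sym.Variable('data')
fc1 = mx.sym.FullyConnected(data=data, num_hidden=1024)
act1 = mx.sym.Activation(data=fc1, act_type='relu')
fc2 = mx.sym.FullyConnected(data=act1, num_hidden=512)
act2 = mx.sym.Activation(data=fc2, act_type='relu')
fc3 = mx.sym.FullyConnected(data=act2, num_hidden=10)
out = mx.sym.SoftmaxOutput(data=fc3, name='softmax')

# Bind the model to a GPU, pick the optimizer, train for 50 epochs, then save it
mod = mx.mod.Module(symbol=out, context=mx.gpu(0))
mod.fit(train_iter, eval_data=val_iter,
        optimizer='adadelta', optimizer_params={'learning_rate': 0.1},
        eval_metric='acc', num_epoch=50)
mod.save_checkpoint('mlp-mnist', 50)
```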
What did we do wrong? We flattened the images, which is probably not a good idea because it loses the spatial relationship between the pixels. For example, an 8 has a cross in the middle, and if we flatten, we lose that spatial relationship. When we work with images, we tend to use convolutional neural networks (CNNs). CNNs are different because they work on 2D or 3D data and don't flatten anything. The basic idea is that in a convolutional network, you have multiple stages, each consisting of a convolution operation followed by a pooling or subsampling operation. Convolution is about taking a small matrix called a filter or kernel, sliding it across the image, multiplying the pixels by the filter values, and adding everything up into a new image. If you pick the right values, you can detect edges. Convolution filters extract features, and the training process learns the filter values: you start with random filters and use backpropagation to learn the values that extract the features. Deep learning extracts features automatically. This process of extracting features and shrinking the image is repeated, resulting in very tiny images that still contain the right information. These can then be flattened and used with a fully connected network. There's much more to say about convolution, but the key is to build intuition. The math and theory can get complex, but understanding what convolution does is crucial. State-of-the-art models have hundreds of layers, but you need to grasp the basics: convolution learns filters that help extract features, and once you've extracted the features, you throw away the black pixels that mean nothing and repeat the process. This shrinks large images down to tiny ones while keeping the right information.

You can do a lot with this, and I want to point out an open-source library called Gluon, which is an API on top of MXNet. Gluon CV, for computer vision, has a collection of pre-trained models for classification, detection, and segmentation. Training these models yourself can be a lot of work, so using pre-trained models is a great option. With Gluon, you can load a model, predict, and print results in just a few lines of code. You can also fine-tune these models on your own data for the best of both worlds.

Let's try a simple CNN, like LeNet, which was invented by Yann LeCun, now a top AI researcher at Facebook. The network structure is straightforward: input data, a first convolution layer, a pooling layer, a second convolution layer, another pooling layer, flatten, and then a fully connected layer to classify the data into 10 categories. Training on a GPU is efficient, and with a different optimizer we can reach around 99.15% accuracy, which is higher than the fully connected network. When we test the model, it performs well on digits like 0, 1, and 2, and even improves on digit 9, though it's still not perfect. The model might have learned a feature that detects the cross in an 8, which is a distinctive feature of that digit. To improve the model, you could add more layers, deeper layers, or more filters. Inspecting the filters and layers for low-performing samples can provide insights into why the model makes certain predictions. For color images, the process is similar, but you use a 3D convolution filter that slides across the red, green, and blue channels. This increases the computational load, which is why GPUs are essential for convolution operations. If you can achieve good results with grayscale images, it's more efficient to use them.
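As a sketch of the LeNet-style structure just described (convolution, pooling, convolution, pooling, flatten, fully connected, 10-way output), here is roughly what it could look like with MXNet's Gluon API; the channel counts and kernel sizes are illustrative assumptions, not the exact values from the demo.

```python
from mxnet.gluon import nn

# LeNet-style CNN: two convolution + pooling stages, then flatten and classify
net = nn.HybridSequential()
net.add(
    nn.Conv2D(channels=20, kernel_size=5, activation='relu'),  # first convolution layer
    nn.MaxPool2D(pool_size=2, strides=2),                      # pooling shrinks the image
    nn.Conv2D(channels=50, kernel_size=5, activation='relu'),  # second convolution layer
    nn.MaxPool2D(pool_size=2, strides=2),
    nn.Flatten(),                                               # only now do we flatten
    nn.Dense(500, activation='relu'),                           # fully connected layer
    nn.Dense(10)                                                # one output per digit class
)
net.initialize()
```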
LSTM (Long Short-Term Memory) networks are another architecture; they add memory to the model, making them suitable for sequence prediction tasks like time series analysis and machine translation. An example of a project using LSTMs is Sockeye, an open-source project that builds machine translation models using MXNet and LSTM architectures. Generative Adversarial Networks (GANs) are another exciting area. GANs can generate realistic images, such as faces that don't exist, by learning from a dataset. Another GAN project can generate realistic images from semantic maps, which are color-coded representations of different objects. These projects are based on TensorFlow and PyTorch, and the generated images can be highly detailed.

AWS DeepLens is Amazon's deep learning camera; it can run image models locally, making it possible to perform tasks like object detection in real time without cloud dependency. For further learning, you can explore AWS resources, including the MXNet Gluon pages and the Gluon CV library. If you need to train large models, Amazon SageMaker provides fully managed infrastructure. I also have a blog on Medium and a YouTube channel with machine learning and deep learning content. The notebooks used today are available in a OneDrive repo. Thank you for your attention, and I'm happy to answer any questions.

Tags

Deep Learning, Neural Networks, Machine Learning Basics