This talk was given by Julien Simon (AWS) at AI Dev Days, held on 9th March 2018 in Bangalore. For more details about AI Dev Days, please check http://www.aidevdays.com
The arcane jargon and intimidating equations of Deep Learning often discourage software developers, who wrongly think that they’re “not smart enough”. This talk will start with an explanation of how Deep Learning works. Then, through code-level demos based on Apache MXNet, we’ll demonstrate how to build, train and use models based on different network architectures (MLP, CNN, LSTM, GAN). Finally, you will learn about Amazon SageMaker, a new service that lets you train and deploy models into a production-ready hosted environment.
Presentation link corresponding to this video is available here:
https://www.slideshare.net/JulienSIMON5/an-introduction-to-deep-learning-march-2018
Transcript
My name is Julien. I'm an AI evangelist with AWS. I'm based in France, but as you can see, I travel a lot, and it's a pleasure to be in India today. I'm going to introduce you to deep learning. Deep learning is a complicated topic, and it seems every technical article on deep learning starts with math, equations, and linear algebra, which tends to scare a lot of people away. I think that's not the proper way to teach deep learning. In this talk, I'm going to explain the basic concepts of deep learning. There is a bit of theory, but I'll try to keep it minimal, and then we'll try some actual examples using Apache MXNet and a new AWS service called Amazon SageMaker.
We'll start with a quick introduction to deep learning. Then, to give you more familiarity with deep learning, we'll look at some network architectures that are commonly used and the kinds of problems they can solve. I'll share some open-source projects that use MXNet and deep learning networks that could be of interest, and then we'll do some demos. Of course, I'll share some resources, and I will share my slides on Twitter after the talk, so you can just go and grab that. Everything will be in there, including the link to the code that I'll use.
What is deep learning? Deep learning is a subset of machine learning that uses a specific technology called neural networks to learn behaviors and patterns from data sets. In traditional machine learning, data scientists have to build features. They have to extract variables from a data set, and these features will be used to train a model. With deep learning, things are a bit different. Most of the time, we're not going to do feature engineering. We're going to use raw data, throw it at a neural network, and let it learn.
So, what is a neural network? Let's start by explaining what a neuron is. A neuron is a mathematical construct trying to mimic the biological neuron. A neuron is basically a set of inputs, which are just numbers, floating-point numbers. Each input has a weight associated with it. The first thing we do is run a multiply and accumulate operation, which is just taking each input, multiplying it by its weight, and summing everything. This gives us the U value, the raw output from the neuron, which is a floating-point number. As you can see, this function is a linear function. If you have fixed weights and change the inputs, the result will vary linearly with respect to the inputs. However, this is not what we want. Biological neurons have a threshold. If they are activated enough by the inputs, they fire; if not, they don't. This is a nonlinear behavior. To model this, we need to add an activation function, which introduces nonlinearity and a threshold that will let a neuron fire or not.
There are several activation functions. The one most commonly used today is called ReLU. It is a nonlinear function: if the input to ReLU is a negative value, the output of the neuron will be zero; if the input is a positive number, the output will be that same positive number, however large. This is what we want: a clear threshold below which nothing happens, and above which the neuron activates.
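To make that concrete, here is a rough sketch of what a single neuron computes, written in plain NumPy; the input and weight values are made up for the example:

    import numpy as np

    def relu(u):
        # ReLU activation: zero for negative inputs, identity for positive inputs
        return np.maximum(0.0, u)

    def neuron(inputs, weights, bias=0.0):
        # Multiply and accumulate: each input times its weight, then sum everything
        u = np.dot(inputs, weights) + bias
        # The activation function adds the nonlinear threshold
        return relu(u)

    x = np.array([0.5, -1.2, 3.0])   # example inputs
    w = np.array([0.8, 0.1, -0.4])   # example weights
    print(neuron(x, w))              # u is negative here, so the neuron outputs 0.0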
One single neuron is not very useful. They become useful when we put them together and build layers. This is probably the simplest neural network you could build. We have an input layer, an output layer where we read results, and in the middle, a hidden layer. In this case, it's fully connected, meaning all the inputs are connected to all the hidden layer neurons, which are connected to all the output layer neurons. This is the simplest you could build. How do we use it? We need data. We need to train the network so that when I present a given input, I get the expected result on the output layer.
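In MXNet's Gluon API, a fully connected network like the one described here can be written in a few lines. This is only a sketch; the layer sizes are illustrative:

    import mxnet as mx
    from mxnet import gluon

    # Input -> one fully connected hidden layer -> fully connected output layer
    net = gluon.nn.Sequential()
    with net.name_scope():
        net.add(gluon.nn.Dense(64, activation='relu'))  # hidden layer
        net.add(gluon.nn.Dense(10))                      # output layer: one neuron per category
    net.initialize(mx.init.Xavier())                     # start from random weights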
Let's suppose we want to train the network on a classification problem. We have a data set of samples; each line in the X matrix is a sample. Each sample has a label, a numerical value between 0 and 9, for example, representing 10 categories. We want to predict the correct label for each sample. The labels are integers, but the output layer has one neuron per category, so we use a technique called one-hot encoding: we transform each label into a vector of bits. If we have 10 categories, the vector has 10 bits, and we set the bit corresponding to the integer value to 1. For example, if the label is 2, we set bit number 2 to 1. This technique is useful because it lets us read each number as the probability that the sample belongs to a certain category. For example, the X1 sample has a 100% chance of being category 2.
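A minimal sketch of one-hot encoding in NumPy, assuming 10 categories as in the example:

    import numpy as np

    def one_hot(labels, num_classes=10):
        # Turn each integer label into a vector of bits, one bit per category
        encoded = np.zeros((len(labels), num_classes))
        encoded[np.arange(len(labels)), labels] = 1.0
        return encoded

    print(one_hot([2, 0, 9]))
    # label 2 -> [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.], i.e. 100% probability of category 2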
When we present the samples on the input layer, we should get the correct one-hot encoded vector on the output layer. We will have as many neurons on the output layer as we have categories. The first two neurons should have a zero, meaning 0% for those classes, the next one should be 1, meaning 100%, and all the other ones should be 0. This is the ideal result once the network has been trained. We measure accuracy, which is correct predictions divided by the number of predictions. Initially, the network is random, and the weights associated with the connections are random values. Predictions will not work at all, and we'll get completely random values. We need to measure the difference between the real value and the predicted value using a loss function. Deep learning frameworks provide this, so you don't have to worry about it.
We compute the error between the prediction and the actual value. We don't do this sample by sample because it would be too slow. Instead, we work with batches of samples, typically 32 or 64 samples. We push the batch through the network, add all the individual prediction errors, and that becomes the batch error. We push a batch, look at the error, and then make adjustments. The purpose of the training process is to minimize the error for the data set. We use batches, push them through the model, measure the loss, and then run backpropagation to adjust the weights layer by layer in a direction that minimizes error. We do this batch by batch until we get to the end of the data set, which is called an epoch. We continue doing this for multiple epochs, typically 10, 50, or 100 epochs. The batch size, number of epochs, and learning rate are important parameters.
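A bare-bones Gluon training loop along those lines might look like this. It continues the network sketch above, and train_data is a hypothetical iterator over batches (for example a gluon.data.DataLoader); the batch size, number of epochs, and learning rate are just the kind of values mentioned above, not tuned settings:

    from mxnet import autograd, gluon

    loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()    # loss function provided by the framework
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

    epochs = 10
    for epoch in range(epochs):
        for data, label in train_data:                # batches of, say, 32 or 64 samples
            with autograd.record():
                output = net(data)                    # push the batch through the network
                loss = loss_fn(output, label)         # prediction errors for the batch
            loss.backward()                           # backpropagation
            trainer.step(data.shape[0])               # adjust the weights to reduce the error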
To adjust weights, we compute partial derivatives, which give us the slope. Once we know the slope, we know whether to increase or decrease the weights. This is done using an algorithm called stochastic gradient descent (SGD). We start somewhere on the error surface and take small steps in the direction that minimizes the error. In practice, we use more advanced optimizers like Adagrad and Adam, which are better at finding the minimum faster. We also keep a part of the data as a validation set to measure how well the network generalizes to new samples. At the end of each epoch, we run the validation set through the network and measure accuracy. We want the validation accuracy to go up, indicating that the network can predict correctly on new samples.
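Continuing the same sketch, switching to a better optimizer is a one-line change, and validation accuracy can be measured with a small helper; val_data is again a hypothetical batch iterator:

    import mxnet as mx
    from mxnet import gluon

    # Swap plain SGD for Adam; the learning rate is illustrative
    trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 0.001})

    def evaluate(net, val_data):
        # Validation accuracy: correct predictions divided by the number of predictions
        acc = mx.metric.Accuracy()
        for data, label in val_data:
            acc.update(labels=label, preds=net(data))
        return acc.get()[1]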
If the validation accuracy drops, it indicates overfitting, meaning the network has learned the training set too well and won't generalize to new samples. To avoid this, we save the model after each epoch and pick the one with the highest validation accuracy. This is the model that will generalize best. Deep learning is about finding a minimum value for the loss function that generalizes well. Using a better optimizer can help find this minimum faster and more consistently.
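Keeping the best model is then just a matter of checkpointing after each epoch, as in this sketch that reuses the evaluate helper above (save_parameters may be called save_params on older MXNet versions):

    best_accuracy = 0.0
    for epoch in range(epochs):
        # ... train on all batches as shown earlier ...
        accuracy = evaluate(net, val_data)
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            net.save_parameters('best-model.params')   # keep the weights that generalize best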
Let's look at some examples. Fully connected networks are still popular and work well for many problems, but they don't work well for image processing. Convolutional neural networks (CNNs) are excellent for image classification and processing. They use a mathematical operation called convolution to extract important features from images. By applying filters, the network repeatedly shrinks the image while keeping the important information, and at the end a simple classifier predicts the image's class. CNNs are the go-to architecture for image processing.
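A small convolutional network in Gluon, just to give an idea of the shape of such a model; the layer sizes are made up for the example:

    from mxnet import gluon

    net = gluon.nn.Sequential()
    with net.name_scope():
        # Convolution + pooling blocks extract features and shrink the image
        net.add(gluon.nn.Conv2D(channels=32, kernel_size=3, activation='relu'))
        net.add(gluon.nn.MaxPool2D(pool_size=2))
        net.add(gluon.nn.Conv2D(channels=64, kernel_size=3, activation='relu'))
        net.add(gluon.nn.MaxPool2D(pool_size=2))
        # A simple classifier sits on top of the extracted features
        net.add(gluon.nn.Flatten())
        net.add(gluon.nn.Dense(10))                    # one output per image category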
For sequence data, such as predicting Bitcoin prices or translating languages, we use recurrent neural networks (RNNs) like LSTM (Long Short-Term Memory). LSTMs have short-term memory, remembering the last few predictions, which is useful for tasks like machine translation. We have an open-source project called Sockeye, an LSTM architecture for training machine translation models. You can bring your data set, train for a few hours on a GPU instance, and have a working model.
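This is not Sockeye itself, but a minimal Gluon sketch of an LSTM-based model for sequence data, for example predicting the next value of a series; the hidden size is illustrative:

    from mxnet import gluon

    class SequenceModel(gluon.Block):
        def __init__(self, **kwargs):
            super(SequenceModel, self).__init__(**kwargs)
            with self.name_scope():
                # The LSTM layer carries short-term memory from one step to the next
                self.lstm = gluon.rnn.LSTM(hidden_size=128)
                # A dense layer turns the last state into a prediction, e.g. the next price
                self.output = gluon.nn.Dense(1)

        def forward(self, x):
            # x has shape (sequence length, batch size, features)
            outputs = self.lstm(x)
            return self.output(outputs[-1])   # predict from the last time step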
Deep learning can also generate data. Generative Adversarial Networks (GANs) can generate realistic images, text, and even 3D models from 2D images. These applications are constantly evolving, and new uses are discovered every week. However, training these models at scale requires significant resources. AWS provides Amazon SageMaker, a machine learning service that lets you go from experimentation to training to deployment. SageMaker includes managed Jupyter notebook instances, pre-built environments for popular libraries, and the ability to train on multiple GPU instances with a single API call. Once you're happy with your model, you can deploy it to serve predictions with an HTTP endpoint.
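Before moving on to the SageMaker demo, here is a very rough idea of the two halves of a GAN, written as tiny Gluon networks; the architectures are made up for illustration, and a real image GAN would typically use convolutional layers:

    from mxnet import gluon

    # Generator: turns a random noise vector into a fake sample
    generator = gluon.nn.Sequential()
    with generator.name_scope():
        generator.add(gluon.nn.Dense(256, activation='relu'))
        generator.add(gluon.nn.Dense(784, activation='tanh'))   # e.g. a flattened 28x28 image

    # Discriminator: learns to tell real samples from generated ones
    discriminator = gluon.nn.Sequential()
    with discriminator.name_scope():
        discriminator.add(gluon.nn.Dense(256, activation='relu'))
        discriminator.add(gluon.nn.Dense(1))                     # real vs. fake score

During training, the two networks play against each other: the discriminator is trained to spot the fakes, and the generator is trained to fool it.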
Let's look at a SageMaker example. I'm on a notebook instance, which is a managed Jupyter instance. We're working on a sentiment analysis example using MXNet and the Gluon API. I download a movie review data set, labeled as positive or negative, and upload it to S3. We use an MXNet script to train the model. The script is brought into SageMaker, and we use the SageMaker SDK to create an MXNet object, specify the number of instances, and set training parameters. With one API call, we start the training job, and SageMaker handles the rest. All the activity in SageMaker is based on Docker, but it's quite invisible: you don't have to know the first thing about Docker to use it, unless you really want to customize things. In most cases, you can ignore Docker. SageMaker configures the training job with the parameters we just set, points it at your data set in S3, and then you just wait for a bit. You go and have coffee, or tea, or beer.

After a while, we get to the end of the training job. The model is saved into S3, and I want to deploy it. The only thing I have to do to deploy it is this: please take the model and deploy it to a single c4.xlarge instance. That's it. This will fire up one instance, deploy the prediction container for MXNet, load your model into it, create the HTTP endpoint, and so on, automatically, and you can start serving predictions as soon as this completes. Compare this to how you would do it manually; this is much easier. And of course, now we can run some predictions. Here I'm using the SDK to do the predictions, but once again, this is an HTTP endpoint, so you could literally call it from anywhere: curl, or whatever language you use, it's just an HTTP call. I try a few example reviews, and it seems to work. One is positive, zero is negative. Job done.
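The steps described in the demo boil down to a few calls to the SageMaker Python SDK. This is a hedged sketch based on the SDK as it was at the time; the script name, S3 locations, instance types, and hyperparameters are placeholders, and the prediction input format depends on the code in the training script:

    import sagemaker
    from sagemaker.mxnet import MXNet

    role = sagemaker.get_execution_role()

    # Point SageMaker at the MXNet training script and choose the training infrastructure
    estimator = MXNet(entry_point='sentiment.py',             # hypothetical script name
                      role=role,
                      train_instance_count=1,
                      train_instance_type='ml.p2.xlarge',
                      hyperparameters={'epochs': 10, 'batch_size': 64})

    # One API call starts the training job on the data set uploaded to S3
    estimator.fit('s3://my-bucket/sentiment-data')             # hypothetical S3 location

    # One more call deploys the trained model behind an HTTP endpoint
    predictor = estimator.deploy(initial_instance_count=1,
                                 instance_type='ml.c4.xlarge')

    print(predictor.predict(['this movie was great']))         # 1 = positive, 0 = negative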
So this is just one way of using SageMaker. If you want to know more, we have the workshop tomorrow, but I'm afraid it's already sold out, so I'm going to share some resources. Really quickly, remember that we have built-in algorithms. If you don't want to write MXNet or TensorFlow code, and you want to do, let's say, linear regression at scale or time-series forecasting at scale, you can grab one of the existing algorithms. It works in much the same way: you don't use the MXNet object, you use something else, but the idea is similar. And at the other end of the spectrum, if you're very advanced and you want to use fully custom training code and fully custom prediction code, say you wrote your own prediction code in super-optimized C++ and that's what you want to use, you can build your own Docker container, push it to SageMaker, and use the exact same APIs to create the training infrastructure and the deployment infrastructure.
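For the built-in algorithms and the bring-your-own-container case, the flow looks much the same with the generic Estimator. This sketch uses the SDK's parameter names from that period; the container image URI is a placeholder, not a real one, and the hyperparameter and S3 location are illustrative:

    import sagemaker
    from sagemaker.estimator import Estimator

    role = sagemaker.get_execution_role()

    # Point at a built-in algorithm's container image, or at your own image pushed to ECR
    image = '<algorithm-or-custom-container-image-uri>'        # placeholder
    estimator = Estimator(image_name=image,
                          role=role,
                          train_instance_count=1,
                          train_instance_type='ml.m4.xlarge')
    estimator.set_hyperparameters(epochs=10)                   # illustrative hyperparameter
    estimator.fit('s3://my-bucket/training-data')              # hypothetical S3 location
    predictor = estimator.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')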
As a conclusion, you will find those demos on GitHub. There are also plenty of sample notebooks inside SageMaker, so just go and look for the documentation and the sample notebooks on GitHub; you'll find them. Some resources: the high-level pages for the machine learning services at AWS, and the high-level page for SageMaker, where you'll find some customer use cases if you're curious about what people do with SageMaker. This was really a technical talk today, but we have customers for this, of course. The AI blog, if you're curious about the more technical side of things: code examples, customer stories, new product announcements, etc. It's all in there. This is probably the best way to get into MXNet, if you're interested, with the Gluon API: one of my colleagues wrote a full online book on deep learning. It takes you through the deep learning concepts in general, and then it shows you how to implement them with Gluon. I really recommend it to everybody. If you're a beginner, this is a great way to get started; if you're more advanced, I'm sure you will still learn some things. That guy has done a really brilliant job. And there is a video that goes through the different ways you can use SageMaker: built-in algorithms, bring your own code, bring your own model, bring your own container. It seems to be quite popular, so I'm happy to help you with this. Once again, I will post this stuff on Twitter, and if for whatever reason you don't get it, just ping me later on. All right, I'm done. I want to thank you very much. I want to thank AI Dev Days and CodeOps for inviting me. It's great. It's my first time in India and I love it.