Talk @ ODSC West, November 2nd 2018, San Francisco
An introduction to Deep Learning theory
* Neurons & Neural Networks
* The Training Process
* Backpropagation
* Optimizers
Network architectures and use cases
* Convolutional Neural Networks
* Long Short Term Memory Networks
* Generative Adversarial Networks
Getting started
Transcript
So, welcome everybody. Thank you very much to ODSC for inviting me again. It's always a pleasure to meet the community, and I will be at the US ODSC conference at the end of October. Thank you in advance for that as well. Indeed, I'm the evangelist for AI and machine learning at AWS. My talk today is about deep learning in the most simple, bullshit-free way possible. My crusade, so to speak, is to show everybody out there that if you can read and write 50 lines of Python, you can do deep learning. Who can read and write 50 lines of Python? Come on, everybody. It's not even 50 in most cases. Let's get started.
This is the single thing that gets me aggravated in no time: when people make it sound like it's dark magic. They make it sound complicated, and what they really mean is, "I'm smart, you're not, and pay me a lot of money because I know this." Right? So, okay, fair enough. Everybody's got to run their own business. But that doesn't work, right? Companies and organizations need a lot of machine learning and deep learning engineers, and there will never be enough people with PhDs. Nothing wrong with PhDs, obviously. I don't have one. Never needed one. But if you have one, congratulations. The fact is that there will never be enough experts out there to serve the machine learning and deep learning needs of companies.
Obviously, if you're a developer, you've learned a lot of complicated things already. I'm 100% convinced you're smart enough to learn quite a bit about this. The reality is that machine learning and deep learning is a bit of science. Obviously, if you're looking for the Turing Award in AI, yeah, you need to be a genius, you need to have training, you need to have a PhD, etc. But if you're like me and you're just trying to bring deep learning to apps and real-life projects and organizations, if you're just trying to solve real problems, then you don't need to worry about the science at all. You just need to understand what the technology can do and how to use it. In that respect, deep learning is just a bit of code. And like I said, you can see all those clever deep learning notebooks and examples out there, and it's never 500 lines of code. It's always a tiny bit of code because the code isn't really what it's all about. It's about the data, how you process it, the parameters you set, but it's not a lot of code.
You need a few chips. It always sounds funny when I say this in the UK because immediately I think about fish and chips, but no. But I don't have any other word for it. Chips, right? And, of course, you need your Intel CPUs and your NVIDIA GPUs, and you might need your Xilinx FPGAs. All three are available on AWS, and you can run deep learning workloads on all three. The FPGA world is pretty exotic. I won't talk too much about that, but keep an eye on it. Ask me questions. There are some new developments that are pretty interesting.
So, let's get started. When we talk about deep learning, we usually talk about neurons and neural networks. We're trying to mimic the human brain, right? And I have to say we're failing a lot at this because the brain is complicated, and anything we build is very crude. But we're trying to mimic the biological neuron. The one thing we know about the biological neuron is that it has inputs; it's connected to other neurons. If the inputs are stimulated enough, the neuron fires an electrical current. If it's not stimulated enough, it doesn't fire. That's what we need to mimic in a way.
Here's the computer neuron in its simple form: a bunch of inputs. Again, those will be floating-point numbers, nothing complicated. Each input is associated with a weight, more floating-point numbers. The first thing a neuron does is a multiply and accumulate operation where we take each input, multiply it by the corresponding weight, and add everything together. Multiply and accumulate. That's about as much math as you will see in this talk. So if you can add and multiply, you're on your way to deep learning greatness.
Unfortunately, this doesn't exhibit the behavior I mentioned, the fact that sometimes it fires and sometimes it doesn't. This will always output some kind of value. So we need to introduce non-linearity, which is a complicated way to say that below a certain threshold, the neuron should not output anything, and beyond that threshold, it should output some kind of value. But the multiply-and-accumulate alone is not enough, which is why we need to introduce activation functions. These are pretty simple math functions that will modify the raw output of the neuron.
Over time, a number of activation functions have been designed. I'm sure a PhD thesis has been written on this and probably books as well, so I'll keep it short. The one that is heavily used today is the last one called ReLU. As you can see, it's a very simple math function. If the input value to ReLU is negative, the output is zero. If the input is positive, then the output is whatever the input was. So there is that nonlinear behavior. Beyond that threshold, nothing happens. After that threshold, something happens, and it could be a very high value. There's no limit to the activation value for ReLU.
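To make this concrete, here is a minimal sketch of a single neuron in plain Python with NumPy (not code from the talk): multiply and accumulate, then ReLU. The input and weight values are made up for illustration.

```python
import numpy as np

def relu(x):
    # ReLU activation: 0 for negative inputs, identity for positive inputs.
    return np.maximum(0.0, x)

def neuron(inputs, weights):
    # Multiply each input by its weight, add everything together,
    # then apply the activation function to the raw output.
    raw_output = np.dot(inputs, weights)
    return relu(raw_output)

inputs = np.array([0.5, -1.2, 3.0])     # three input values
weights = np.array([0.8, 0.1, -0.4])    # one weight per input
print(neuron(inputs, weights))          # fires (positive value) or stays at zero
```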
So, a neuron is a combination of multiplying weights and inputs, adding all those things together, and then filtering, processing that raw value with an activation function to introduce a nonlinear behavior. Fire, no fire. That's the neuron. Done. A neuron by itself does nothing. A neural network is the thing we need to work with. This is a simple one, probably the simplest you can build. This is called a fully connected network. It's easy to see why, because each input is connected to all neurons in the next layer, which themselves are connected to all outputs in the next layer. Fully connected. Pretty obvious.
So, there's an input layer where we will put our data to be predicted. There's an output layer where we will read results. And in the middle, we have at least one extra layer. It's called a hidden layer. If you wondered why this thing is called deep learning, it's because usually you have a lot of hidden layers. State-of-the-art networks can have 100, 200 layers. So, they get really, really deep. The hidden layer will compute the multiply and accumulate on the inputs, run the activation functions, propagate to the output layer, and we will read results. That's the smallest you can build.
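As a sketch of what such a network looks like in code, here is a minimal example using the Gluon API (the MXNet library mentioned later in this talk); the layer sizes and the random input are arbitrary assumptions.

```python
from mxnet import nd
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(64, activation='relu'),   # hidden layer: 64 neurons with ReLU activation
        nn.Dense(10))                      # output layer: one neuron per class
net.initialize()                           # weights start out random

x = nd.random.uniform(shape=(1, 784))      # one flattened 28x28 image, for example
print(net(x))                              # ten raw scores, one per class
```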
Of course, we need data. For the sake of the argument, let's say we're trying to predict, we're trying to classify images. We have a bunch of images, let's say in ten categories, and we have dogs, cats, elephants, tigers, snakes, and anything you like. Let's keep it really simple. We'll take the images, flatten them into vectors, and each individual value is a pixel value. We put those in a matrix that we call X. So each line of X is actually all the pixel values for that first image. The second line is another image, and so on. And we have a bunch of images.
Here, we're going to take the simple road into deep learning, which is called supervised learning. Supervised learning really means you know what the data set looks like. You know the first image is a dog, the second image is a cat, and so on. You know what your data is, and you want the neural network to learn this representation and be able to predict with additional samples. So X are the images, and Y are the labels, the category numbers. We said 10 categories. So maybe category two is a dog. The first sample is a dog. The second sample is category zero. Maybe that's a snake, and so on. And obviously, we have as many labels as we have samples.
If you've done machine learning before, you know that we don't really like to work with category numbers because they don't tell the truth. When I see zero and two and four, I don't know if 2.5 is a legit value. And it's not. Those are category numbers. They need to be integers. So instead of using those integers, we're going to use a technique called one-hot encoding. Again, complicated word for a simple thing. So if we have 10 categories, we're going to replace each label with a vector of 10 bits, and we flip to one the bit that corresponds to the actual category. So the first sample is category two, so we flip bit two to one. And remember, we start counting at zero, so that's why it's the third bit, in case we have less technical people in the room. The second sample is category zero, so we flip bit zero to one, and so on. One-hot encoding. This is how you work with categories when you do machine learning.
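Here is a minimal sketch of that preparation step in NumPy, flattening images into the X matrix and one-hot encoding the labels; the image sizes and label values are made up for illustration.

```python
import numpy as np

# Pretend data set: 4 tiny 28x28 grayscale images and their category numbers.
images = np.random.rand(4, 28, 28)
labels = np.array([2, 0, 9, 4])
num_classes = 10

# Flatten each image into a vector of pixel values: one image per line of X.
X = images.reshape(len(images), -1)   # shape (4, 784)

# One-hot encode the labels: one 10-value vector per sample,
# with a 1 at the index of the actual category.
Y = np.eye(num_classes)[labels]       # shape (4, 10)
print(Y[0])                           # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
```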
This is actually much more expressive because now I know how many categories I have. I just need to look at the size of that vector. And I could also see those zeros and ones as probabilities. So let's look at the first one here. I know for a fact with 100% chance, 100% probability that this first sample is category two. And it has 0% probability of being any other class. In the same way, I know that the last sample has 100% chance of being category four and 0% chance of being any other category. So this is much more useful because, actually, when we're going to predict our new samples, we'll never get zeros and ones. Zeros and ones are a lie anywhere in computers in the universe. There is no such thing as zeros and ones. It's only about probabilities. So we'll never get zeros and ones when we predict. We'll get probabilities between zeros and ones.
So, how does that work in practice? Let's say we take the first sample in the X matrix and put each feature, each value, in an input neuron on the input layer. And by the way, yes, this means the input layer has to have as many neurons as you have features. So if I have 10,000 pixels in my image, then yes, I do need 10,000 input neurons. So sizing the input layer is easy. Just pick the number of features that you have. Let's assume the network has already been trained. And I run those multiply and accumulates, and I run those activation functions all the way to the output layer. In a perfect world, I should get this. I should get zeros in all output neurons except in neuron two, where I should read a perfect 1.00. But again, this never happens, as we'll see.
The consequence of that is the output layer, when you're trying to solve that kind of classification problem, needs to have as many neurons as you have classes. So sizing the input is easy, sizing the output is easy, the big mess is right in the middle. So in theory, that's how it works. And of course, we're interested in accuracy, which is the number of correct predictions divided by the total number of predictions. The closer to 100%, the better. So the whole purpose of training is to get to this result, to get an accurate network that predicts any sample correctly. But correctly doesn't mean zeros and ones, as we'll see. It means with a very high probability, maybe we'll get 0.01 for the first neuron and 0.02 for the second, but as long as we get 0.9 something for neuron two, we'll be happy. It's high enough to tell us it's the right category.
Of course, it doesn't work like that. It's all a lie initially because the network has not been trained. So initially, when you predict any sample, it gives completely wrong results. And it won't even give you that one in the wrong place. It will give you 10 random probabilities because the weights are all random. And so the multiply and accumulates and the activations yield random results. Random in, random out. That's a fact of life. So when you predict that X1 sample, you don't get the right label, you don't get the right probability vector, you get something else, something that's horribly wrong initially. So you need to measure how wrong it is, and we use a loss function to do this. It's a math function that computes the difference between the real label, that 10-value probability vector, and the one that you got. It's not as easy as subtracting them because those are vectors. But you should not worry too much about those because all the deep learning libraries provide those functions. They have a collection of loss functions that help you compute the prediction error when you train the network.
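For intuition, here is what one common loss function, cross-entropy (one of the many the libraries provide for you), looks like when applied by hand to a single prediction; the numbers are made up.

```python
import numpy as np

def cross_entropy(true_probs, predicted_probs):
    # Penalizes the network for assigning low probability to the true class.
    return -np.sum(true_probs * np.log(predicted_probs + 1e-12))

true_label = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])   # the sample is class 2

good_guess = np.array([0.01, 0.02, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])
bad_guess  = np.array([0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10])

print(cross_entropy(true_label, good_guess))   # small loss: close to the right answer
print(cross_entropy(true_label, bad_guess))    # larger loss: mostly wrong
```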
So we do this. And then you could say, okay, so we're predicting, measuring the error, and then we're going to do some kind of magic tweaking in the weights to get to a lower error. So, yeah, you could do that. You could do the tweaking after each sample, but in most cases, you don't do that because it's too slow and it has a number of problems. Usually, you work with batches of samples. So we're going to predict, let's say, 32 or 64 samples, one at a time, measure the prediction error for each sample, and then add all the errors together. And then we're going to do the tweaking. It's called mini-batch training, and that's what most people do these days. So, in a nutshell, the purpose of the training process is to minimize the prediction error for the data set by gradually adjusting weights. Simple. See, you didn't need a PhD for that.
So we have all those knobs, right? Remember, all these arrows have a weight. So that's a knob. And you need to tweak those knobs to get to the lowest prediction error. Yes? Is it something similar to reinforcement learning? So, reinforcement learning is another technique. You have supervised learning, which I explained. Unsupervised learning, which is, for example, clustering. I want to cluster all of you into five groups. So, you have a number of features, and I run some kind of algo that automagically builds five clusters. But I didn't know what the answer was. I want to find the answer, so it's unsupervised. Reinforcement is basically learning from scratch how to play Angry Birds or any kind of video game by letting the software run and get some kind of reward for the action, positive or negative. Imagine you're trying to learn how to drive a car. If you steer off the cliff, to me, that's negative reward. So you start again and learn that no, don't go left because you're off the cliff. So stay straight for a while and then that's a positive reward until the first turn that you can see. And so you drive off the cliff again and say, ah, no, I should. So it's straight, straight, straight, and then right. See? So it's a bit different. Supervised learning is really the simplest one to understand.
So we need to train. We're going to take the data set, slice it into batches, and again, your favorite deep learning library will do that. We're going to predict one batch. So sample by sample, predict, find what the error is for that sample, do that for the whole batch, and add up all the errors together. Mini-batch training. And then the tweaking happens. The tweaking has a complicated name called backpropagation, which means I'm going to go back from the output layer. Backpropagation, the opposite of forward propagation. I'm going to go back from the output layer all the way to the input layer and do the tweaking. I'm going to change the weights, so I'm going to start here and change those three weights in the direction that I know minimizes error just a tiny bit. And then I'm going to do it again for this neuron and then this neuron and this neuron. And if I had more layers, I would just go back through those layers and adjust the weights in a direction that I know will minimize error just a little bit. And obviously, that's the secret sauce. How do you know? You have three weights here. How do you know if this one should be increased or decreased, and this one and this one? Remember, these are floating-point values, so you can only increase or decrease them. But you need to know in which direction. So I'm delaying the answer for now. But that's what backpropagation does. It will go back through the network for that batch and just do the tweaking weight by weight. And then you run the second batch. Here we go. And you do the same thing. Do more tweaking. And you do it again and again and again until you get to the end of the data set. And this is called an epoch. Why didn't they call that a round or... I don't know. I hate that mumbo-jumbo, right? And you do it again. You run another epoch and then another one and then another one and then another one. And when you train from scratch on large data sets, it's quite common to train for 100, 200 epochs. So now you see why you need those fast GPUs and CPUs and so on. It's math and compute-intensive.
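Putting the pieces together, here is a hedged sketch of what that loop looks like with the Gluon API, using MNIST as the data set; the batch size, learning rate and number of epochs are arbitrary choices for illustration, not values from the talk.

```python
from mxnet import autograd, gluon
from mxnet.gluon import nn

# Simple fully connected network, as sketched earlier.
net = nn.Sequential()
net.add(nn.Dense(64, activation='relu'), nn.Dense(10))
net.initialize()

# Data set, sliced into batches by the library.
def transform(data, label):
    return data.astype('float32').reshape((784,)) / 255.0, label

train_data = gluon.data.DataLoader(
    gluon.data.vision.MNIST(train=True, transform=transform),
    batch_size=32, shuffle=True)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()                  # the loss function
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1})                 # does the tweaking

for epoch in range(5):                       # one epoch = one pass over the data set
    for data, label in train_data:           # one mini-batch at a time
        with autograd.record():
            output = net(data)               # forward pass: predict the batch
            loss = loss_fn(output, label)    # measure the prediction error
        loss.backward()                      # backpropagation: compute the gradients
        trainer.step(data.shape[0])          # adjust the weights a tiny bit
    print('epoch', epoch, 'done')
```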
So, of course, there are some parameters here. The batch size is important. You need to choose that one wisely. If it's too big, you don't get too many shots at backpropagation. If it's too small, you might get a lot of shots at backpropagation, and then training is fairly slow, and you have other problems. The learning rate dictates the size of the adjustments that you make to the weights. Again, if it's too small, you make very tiny adjustments, and it takes forever to train. If you make very large adjustments, you could be oscillating and swinging wildly between values and never reaching the correct value for a given weight. That's the intuition. And the number of epochs is, like I said, how many times will you go through the full dataset? And these are called hyperparameters, and picking those is difficult. You can start with reasonable values, but finding the optimal values is a whole new thing. And there are more automatic techniques called hyperparameter optimization, which is basically using machine learning to figure out what those things are. Machine learning within machine learning. Welcome to the matrix.
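To give a flavour of it, here is a very crude sketch of random search over those knobs. `train_and_evaluate` is a hypothetical helper that would run a training loop like the one above and return the validation accuracy; real hyperparameter optimization tools are much smarter than this.

```python
import random

def train_and_evaluate(learning_rate, batch_size, epochs):
    # Hypothetical helper: train the network with these hyperparameters
    # and return the validation accuracy. Stubbed out so the sketch runs.
    return random.random()

best = None
for trial in range(10):
    params = {'learning_rate': random.choice([0.001, 0.01, 0.1]),
              'batch_size': random.choice([32, 64, 128]),
              'epochs': random.choice([10, 20, 50])}
    accuracy = train_and_evaluate(**params)
    if best is None or accuracy > best[0]:
        best = (accuracy, params)

print('best validation accuracy:', best[0], 'with', best[1])
```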
So, how do you know that this works? The training process lasts for a while, and you'd like to know if it's working or not. So we use a different part of the data set called the validation data set. At the end of each epoch, we're going to predict this data set. Since we're doing supervised learning, we can compare the predicted values with the real values, and hopefully, the validation accuracy is going up. Which tells me (yeah, just a sec) that this network can predict samples that it hasn't seen before. This is called generalization. It's really important because the samples you're going to send to that model during production are samples that it hasn't seen before. So it's not good enough to learn the training samples. You need to predict well on samples that haven't been used for training.
Yes? Doesn't the loss also depend on the bias, and doesn't the bias have to be learned along the way as well? Yes. So there is an extra parameter or set of parameters in the network called the bias, which is a fixed value that you plug into each layer. But in the interest of simplicity, I'm sweeping this one under the carpet. But yeah, you need to also learn the bias. But it doesn't change much. The basic idea is still the same. So at the end of each epoch, I'm going to predict, and I hope to see my validation accuracy going up. Okay?
Then, how do I know once I'm completely done training and I saw my training accuracy going up, fine, and I saw my validation accuracy going up as well, very happy, how do I measure, I would say, the accuracy of this model versus the one that I trained last month or two months before? What's the benchmark for this model? The good practice here, the best practice is to use another data set or it could be a fraction of the initial data set called the test set that you predict once you're completely done tweaking. So you make no more changes to the network and you run the test set to get some kind of benchmark. And the reason why you need that is because you will use, obviously, the training set to train. You will use the validation set to measure accuracy and generalization. And then you're going to go and tweak the network again. So as you experiment, you will actually use the validation data set to make decisions maybe on the network itself or on the parameters. So there's a bias here that you're introducing. The validation accuracy is not an honest benchmark of your model. You need to have something else that is run at the very end, and you don't make any decisions on the network using that accuracy. So you need those three datasets. The training process is pretty simple. It's an iterative thing, and backpropagation runs, and then the validation dataset gives me a hint on how well I'm generalizing.
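A minimal sketch of splitting a data set into those three parts; the 80/10/10 proportions are just a common convention, not something prescribed in the talk.

```python
import numpy as np

def split_dataset(X, Y, train=0.8, val=0.1, seed=42):
    # Shuffle the samples, then carve out training, validation and test sets.
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))
    X, Y = X[order], Y[order]
    n_train = int(len(X) * train)
    n_val = int(len(X) * val)
    return ((X[:n_train], Y[:n_train]),                                # learn the weights
            (X[n_train:n_train + n_val], Y[n_train:n_train + n_val]),  # check after each epoch
            (X[n_train + n_val:], Y[n_train + n_val:]))                # touch once, at the very end

X = np.random.rand(1000, 784)
Y = np.random.randint(0, 10, size=1000)
train_set, val_set, test_set = split_dataset(X, Y)
print(len(train_set[0]), len(val_set[0]), len(test_set[0]))   # 800 100 100
```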
The one question I haven't answered is how the hell do I know in which direction the weights should be updated? Let's take a simple example. Ignore all the text. Just look at the picture here. What are we really trying to do? Let me go back to this one here. I have a function here with three parameters, the weights, and they yield a result, the loss. And what I'm trying to do is to find the three parameters that give me the smallest possible loss for that neuron. And then the same thing for all other neurons. So I can't plot things in four dimensions, not on the slides and not in my head. So let's stick to three dimensions. So here I've got a function with two parameters, and you could see those as weights. And the output value, you could see that as the prediction error, the loss. And let's say I plot function f and I find something like this, this nice 3D graph. The intuition here is I want to get to the X and Y that give me the lowest Z. So I want to walk down that slope and get to the lowest point. So, of course, initially, X and Y are random. So I'm going to start maybe here. And I'd like to get here. So that's maybe x and that's maybe y, and these are the weights that I want because they give me the lowest error. So you could say, well, it's a simple function, I can compute x and y. Yeah, maybe. But imagine you have hundreds of weights, hundreds of parameters, maybe thousands of parameters. They cannot be computed; they have to be estimated. They have to be discovered. That's the thing. That's why we have this iterative process, because we can't compute those weights. So we're going to start somewhere, and then if we take a step in the right direction, and I'm putting quotes around that, I will get a little lower. And then if I take another step in the right direction, I still get a little lower. And again, and again, and again. And step by step, I'm getting to that lowest point, and I get X and Y that give me the lowest error. That's the intuition. I say intuition a lot because the math behind some of those things is so complicated that unless you are a proper expert, you'll never understand it. But you can easily build an intuition for what it does. And that's very, very important. It helps me a lot because I'm not smart enough to go through the real equations. And I don't think I want to anyway.
So then the last question I need to answer is, okay, where the hell is the right direction? So I'm going to take you back to high school for a second. Okay, for some of us, it's a long time ago. For some of you, it could be just a few years ago, right? Lucky you. So remember derivatives? Ah, man, yeah, sure, of course. What a nightmare, okay? Okay, remember derivatives. You have a curve, any kind of curve, and you compute the derivative at any point on that curve, and what do you get? You get the slope, right? And once you know what the slope is, you know which way is up and you know which way is down. Right? There you go. Isn't this what we want? The only thing here, obviously, is we have multiple parameters. So we need to compute partial derivatives for each of those parameters. So here, if you compute the partial derivative with respect to x, we'll know which way is down with respect to x. Same thing for y. So if we take a step in the right direction for x and the right direction for y, there we go. We get to a lower z. And we do that again and again and again. And now some of you probably think, you said no math, you bastard. So yeah, that's what I said. But the good thing is you'll never have to worry about this because again, the deep learning libraries do this for you. But you need to understand how this works. So it's mostly high school math or maybe freshman math, but it's not much more than that. So this is how we know where to go, partial derivatives. And the learning rate that I mentioned is the step size. So if you have a large learning rate, you will take large steps down the mountain. If you have a small learning rate, you will take small steps. So now you see why it shouldn't be too big, it shouldn't be too small. Obviously, in real life, it could be ugly. It could look something like this. And that's the nightmare of many deep learning practitioners. Imagine you start here and you walk down the mountain and you end up here. And it's called a local minimum. And anywhere you look is up. So you're staying there. And the problem here, of course, is the error is higher than here. So this is probably where you want to go. So how do we avoid those things? Saddle points are another nightmare. This is a saddle point, it's like a horse saddle. So visualize that one in your head. Imagine you end up right at the saddle. Here's a better view. We'll go back to the previous one. That red ball here stops. Why? Because in this direction, this place is a minimum. And as you know from high school, we get a zero derivative, and so we don't update weights anymore. There is no more slope. In this dimension, it is a maximum. And again, we get a zero derivative in this direction, so we don't update weights. This is why that red ball stops moving. This is called a saddle point. So that's probably even worse than a local minimum.
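Here is that walk-down-the-mountain intuition as a tiny sketch: a two-parameter function, its partial derivatives, and steps taken downhill with a given learning rate. The function is a toy bowl shape chosen so the true minimum sits at x=3, y=-2; nothing here comes from the talk itself.

```python
def loss(x, y):
    # Toy loss surface: a bowl with its lowest point at x=3, y=-2.
    return (x - 3) ** 2 + (y + 2) ** 2

def gradients(x, y):
    # Partial derivatives: which way is "up" with respect to x and to y.
    return 2 * (x - 3), 2 * (y + 2)

x, y = 9.0, 7.0          # random-ish starting point
learning_rate = 0.1      # the step size

for step in range(50):
    dx, dy = gradients(x, y)
    x -= learning_rate * dx    # step in the direction that lowers the loss
    y -= learning_rate * dy

print(x, y, loss(x, y))        # close to (3, -2), loss close to zero
```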
To make a long story short, a few years ago, Ian Goodfellow, one of the leading guys in AI, wrote an article saying, well, basically looking at lots of different trainings with typical neural network architectures, he figured out that, yes, you see those things, those local minima seem to exist, so they definitely are present, but for reasons not quite clear, we tend to avoid them. Okay? And most of the reason is actually that walking down the mountain process is always a little bit noisy. So you never exactly end up in a place where you can't escape. Right. That's a very crude way of explaining it. If you want the details, go to that article. The algo, I forgot to mention it, walking down the mountain step by step by computing derivatives is called SGD, stochastic gradient descent. It's even older than me. It had nothing to do with machine learning initially, but it became popular. It works very well. It is very robust. But over time, more optimizers have been implemented. This is what you see here. They all have weird names, and the modern ones, all the Ada* family, meaning adaptive, can actually, as you can see, speed up or slow down on the slope. They can accelerate down the slope because they know where the slope is and they can modify the learning rate to take bigger steps. And that's why you see that yellow guy, you know, zooming by because that algo knows where the slope is, and it will literally run down the slope. So many, many more algos out there.
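In practice, swapping optimizers is usually a one-line change. For instance, with the Gluon `Trainer` used in the earlier sketches (the learning rates below are just common starting values, not recommendations from the talk):

```python
from mxnet import gluon
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(64, activation='relu'), nn.Dense(10))
net.initialize()

# Plain SGD: fixed-size steps down the slope.
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

# Adam, one of the adaptive optimizers: adjusts its effective step size as it goes.
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 0.001})
```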
So to summarize everything, if you plot all those things, you'll see something like this. Training accuracy going to 100%, given enough time. Loss accordingly going to zero, almost zero. But you might see validation accuracy climbing and then dropping after a while. And the problem here is called overfitting, and you need to avoid that at all times, at all costs. You basically spend too much time training, and you can't generalize anymore to additional samples. So the best way to do this is to save the weights after each epoch. It's called checkpointing, and all the libraries support that. And then after training is complete, you just plot that thing and you see what the best epoch is.
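Here is a hedged sketch of checkpointing with Gluon: save the weights after every epoch, then keep the epoch with the best validation accuracy. `validation_accuracy` is a hypothetical helper standing in for the evaluation step, and the save/load method names are those of recent MXNet versions.

```python
import random
from mxnet.gluon import nn

# Same kind of network as before; input size given explicitly so the
# parameters exist before the first forward pass.
net = nn.Sequential()
net.add(nn.Dense(64, activation='relu', in_units=784),
        nn.Dense(10, in_units=64))
net.initialize()

def validation_accuracy(net):
    # Hypothetical helper: predict the validation set and return its accuracy.
    # Stubbed out with a random number so this sketch runs on its own.
    return random.random()

best_epoch, best_accuracy = None, 0.0
for epoch in range(10):
    # ... one epoch of training would run here, as in the earlier loop ...
    net.save_parameters('checkpoint-%03d.params' % epoch)   # checkpoint this epoch
    accuracy = validation_accuracy(net)
    if accuracy > best_accuracy:
        best_epoch, best_accuracy = epoch, accuracy

# Reload the weights from the best epoch instead of the last one.
net.load_parameters('checkpoint-%03d.params' % best_epoch)
print('best epoch:', best_epoch)
```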
So let's look at a few other architectures. We talked about fully connected networks. But I'm sure you've heard about CNNs, convolutional neural networks. They're the kings of anything that is image-related. I can't go into the full explanation, but you will get the slides obviously, and I've got other longer talks where I go into all the details. The idea here is starting from images to use filters, to use the convolution operation, which is a pretty simple math operation, to extract features. So we're applying filters to images, different filters, and we're extracting information from that, and then we shrink them using another operation called pooling, subsampling, and we do it again and again and again until we have a collection of very weird-looking tiny images that are nothing like the initial image but that still contain interesting information. And then I can flatten all that stuff, build a big vector, and use a fully connected network. This is the simplest CNN you could build. So just rounds of extracting features with filters, just like you would do with your favorite image app. Here's an example here: running a small three-by-three filter over that image will yield that. Okay, so it's an edge detector. And if we knew what that animal was in the original image, we'd still know here; it's enough information to know what that animal is. And we threw away all the unwanted information. So, of course, how do you know which values to use here? Well, now you know the answer. You learn them. When you train the CNN, the kernels, the filters, are what you actually learn. And then you can shrink the images by keeping the highest value in each group of pixels and you do it again and again and again. So, very simple in a way. And of course, you could do very complex things in the end. You can do image classification, you can do object detection, figuring out where objects are, you could do segmentation, finding the boundaries of objects. So you could use a data set, you could label that data set, and you could train with a complex model again and again and again to get pretty good results. Or maybe you don't want to do it because you're lazy like me, which is a virtue as we know, and you just need to do classification or detection. And you could go and grab a pre-trained model, a model that has already been trained on a very large data set, and basically write four lines of code to load your image, run it through the pre-trained model, and get results. This is an example with an open-source library called Gluon. It's on GitHub. It's part of the MXNet project that we support. Those examples are literally four lines of code. The networks involved are crazy complicated, but you don't need to worry about that too much. Just load them, predict, get your results. Using pre-trained models is a powerful technique.
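As an illustration, here is roughly what those few lines look like with the Gluon model zoo. This is a hedged sketch, not the exact example from the slide; it uses a ResNet-18 classifier and a random tensor in place of a real, properly preprocessed image.

```python
import mxnet as mx
from mxnet.gluon.model_zoo import vision

# Download a model that has already been trained on ImageNet.
net = vision.resnet18_v1(pretrained=True)

# A real application would load and preprocess an actual image here;
# a random 224x224 RGB tensor keeps the sketch self-contained.
image = mx.nd.random.uniform(shape=(1, 3, 224, 224))

probabilities = mx.nd.softmax(net(image))
print(probabilities.topk(k=5))   # indices of the 5 most likely classes
```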
And you can do other things like face detection, face recognition, and this recent project actually gets extremely high accuracy on reference data sets. This is available on GitHub. You can go and grab it. Both are MXNet projects. But sometimes you need to do more than images. What about text? What about translation, what about predicting Bitcoin prices? Don't bother. So we need a new type of neuron because the neuron we saw so far has no memory, right? That basic neuron, you could predict 10 images in any order, you would always get the same results for an individual image. It doesn't remember past predictions. So when you're trying to translate maybe English to French, of course, you don't translate word for word. The translation for a given word kind of depends on the past few translations. You need context. So you need a neuron that remembers the past few predictions. And this is the LSTM neuron. And that's a weird name, long short-term. How could something be long and short? Again, not a really smart name if you ask me. They have short-term memory because they remember the past few predictions and they are used in long networks. So how about long networks with short-term memory, something like that? That would make more sense. And they're great at predicting sequences of data, like translation, and we have an open-source project called AWS Sockeye. It's on GitHub and it implements a pretty complex LSTM architecture, and you can go and grab it and train your own translation models. So if you have a data set for English to Russian, let's say, you could learn and build a model with that. Okay, and this is based on LSTM.
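For a feel of the building block itself, here is a tiny hedged sketch of an LSTM layer in Gluon processing a batch of sequences; the sizes are arbitrary, and Sockeye's actual architecture is far more involved than a single layer.

```python
from mxnet import nd
from mxnet.gluon import rnn

# An LSTM layer with 20 hidden units: its output at each step depends
# on the current input AND on what it remembers from previous steps.
lstm = rnn.LSTM(20)
lstm.initialize()

# A batch of 4 sequences, each 10 steps long, with 8 features per step
# (think: 10 words, each represented by an 8-value embedding).
sequences = nd.random.uniform(shape=(10, 4, 8))   # default layout: (time, batch, features)

outputs = lstm(sequences)
print(outputs.shape)   # (10, 4, 20): one output per time step
```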
Now, here's a really recent example. Anybody know Tesseract? All right, yeah, it's a really great open-source project to do OCR. They've used traditional techniques so far, and in the latest version, 4.0 (it's still beta, it's very new, it came out in June, I think), they're actually using an LSTM architecture to improve the accuracy. And this is an example I reused from that great blog down there. And as you can see, this is a receipt, right? And those are notoriously difficult to understand and extract information from because they're all a tiny bit different. So you can't really generalize. And the result is pretty good, I think, right? So all hail the age of LSTM for expense reports, hopefully. That will save me a lot of time.
But that's great, but that's a little bit boring. So can we be silly with deep learning? Of course, we can be silly. So let's be silly for a second. So anybody who's seen this talk before cannot play the game, right? Especially you. But does anybody here recognize those people? I know you're all working very hard, but you must be watching TV, right? Come on. No? No idea? All right. Okay, I can't fool people anymore with this. Now they know the trick. These faces do not exist. These people do not exist. They're fake. They're completely fake. So those faces have been generated by a network architecture called GAN, Generative Adversarial Network. It's definitely out of scope to explain how they work today. You can look it up. It's not that weird. But the idea is really to start from a data set of samples and to generate samples that look like, that are similar to the ones that are present in the dataset. And it's not as easy as cut and paste or computing random statistics. Those are really generated pixel by pixel. And now everybody's familiar with that because of deep fakes and all the fun and not so fun applications of deep fakes. This is a TensorFlow project.
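For the curious, here is a very rough sketch of the adversarial idea: two networks trained against each other, a generator producing fakes and a discriminator scoring real versus fake. Everything here (layer sizes, learning rates, and the random stand-in for real data) is an assumption for illustration, and it has nothing to do with the network that produced those faces.

```python
from mxnet import nd, autograd, gluon
from mxnet.gluon import nn

# Generator: turns random noise into a fake "sample" (here just a 784-value vector).
generator = nn.Sequential()
generator.add(nn.Dense(128, activation='relu'), nn.Dense(784, activation='sigmoid'))
generator.initialize()

# Discriminator: tries to tell real samples from generated ones.
discriminator = nn.Sequential()
discriminator.add(nn.Dense(128, activation='relu'), nn.Dense(1))
discriminator.initialize()

loss_fn = gluon.loss.SigmoidBinaryCrossEntropyLoss()
trainer_g = gluon.Trainer(generator.collect_params(), 'adam', {'learning_rate': 0.0002})
trainer_d = gluon.Trainer(discriminator.collect_params(), 'adam', {'learning_rate': 0.0002})

batch_size = 32
real_label, fake_label = nd.ones((batch_size,)), nd.zeros((batch_size,))

for step in range(100):
    real = nd.random.uniform(shape=(batch_size, 784))   # stand-in for a batch of real samples
    noise = nd.random.normal(shape=(batch_size, 100))
    fake = generator(noise)

    # Train the discriminator: real samples should score 1, fake ones 0.
    with autograd.record():
        loss_d = loss_fn(discriminator(real), real_label) + \
                 loss_fn(discriminator(fake.detach()), fake_label)
    loss_d.backward()
    trainer_d.step(batch_size)

    # Train the generator: it wants its fakes to be scored as real.
    with autograd.record():
        loss_g = loss_fn(discriminator(generator(noise)), real_label)
    loss_g.backward()
    trainer_g.step(batch_size)
```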
So here's another example. So if you remember when you were five years old or if you have kids, right, you have those on your fridge, and this is called a semantic map, okay? And the semantic map just contains outlines of objects with different colors. So blue is for cars and green is for trees, I suppose, and whatever that color is, is the road, and so on. And so you build a data set like that with real images and semantic maps saying, okay, here's the car and here's the road, using the same color for every object of a given category. And now, if you have a trained model, you can take that semantic map, that's your new sample, you can draw one yourself, and you can ask the neural network to generate. So you go from semantic map to high-resolution picture. And this one is very impressive. This is a PyTorch project, and there are more. So GANs, generating new samples based on an existing data set. It's going to make fake news very, very interesting.
The last thing I want to touch upon is an emerging trend, it's more than a trend, I think. As you can imagine, those things are crazy computation heavy, right? Let me show you a basic example. A few more minutes. Here we go. Okay, so I'm running on my Mac here, okay? And this is a simple MXNet script to learn a toy data set called MNIST. And I sure hope that in 2018 everybody has heard about MNIST, right? If not, okay, that's fine. You can catch up. It's an image data set with digits from zero to nine. It's good for nothing, it's just a toy data set, but it's a good one to play with. And I'm building a simple fully connected network. This is MXNet code in Python. As you can see, it is not a lot of code. 32 lines. And I'm just piling up layers. The fully connected layer and the activation layer, etc. It doesn't really matter what we do here. Simple fully connected network. Okay? So let's run this thing just to get a sense of how fast it goes. That's the thing I'm interested in. Okay? So simple, fully connected network on a toy data set on my old Mac. Okay. I can run an epoch in 2.2 seconds. All right. That's fast. Right? I could train for 200 epochs in a few minutes. Okay, fair enough. So hold that thought. 13,000 samples per second that I can learn. Now let's look at the same data set with a simple CNN. And this is really pretty much the CNN that you saw on my slide. Two convolution/pooling blocks, which is probably the simplest you could do. Okay? Let's run this on my machine. Again. 13,000 for the previous one. Ah, no. That was way too fast. Okay, here's the CNN. I can drink a little bit. Right? How's your day going? Hmm, okay. About 30 to 40 times slower. So now each epoch is not going to be two seconds, it's going to be, let's say, 70 seconds, right? Or 80 seconds. And this is a ridiculously simple data set. Now, going back to things like this, now you see what the problem is going to be, right? We need massive, massive processing power. So you can train on CPUs and, yeah, on AWS. I'm not here to talk about AWS specifically, but you'll find CPU instances and GPU instances and FPGA instances, and you get all the processing power that you need there. But a lot of companies are now moving into building hardware dedicated to machine learning and deep learning. So Google built their own ASIC, application-specific integrated circuit, the TPU. Intel is working on a generation of chips called Nervana. Xilinx just announced a generation of new FPGAs for deep learning called Everest, and I'm sure there are a million startups out there who are working on that. Okay, so that's cool, and I'm sure eventually those will be used and they will be available in clouds and so on. Fine. But the specific interest here is that you can't just take a deep learning model and run it efficiently on one of those chips. You need to optimize it for that hardware. You need to make some trade-offs. And these are more advanced topics, but just to get you thinking, you can use techniques like quantization.
So quantization will actually not use floating points for activations and weights, okay? Because floating-point arithmetic is slow and it needs a lot of power, okay? And if you want to deploy deep learning models at the edge, right, glasses or tiny cameras, et cetera, the power budget that you get is way too small for that. So quantization really means training models with smaller precision than floats. So people have tried; they've moved from 32-bit floats to 32-bit integers, then 16-bit integers, and then they tried 8-bit. And now you even have projects where they train with single-bit weights. So that's easy, right? I'm sure it trains faster because a weight can either be zero or one, which tends to break everything I've explained so far. But as it turns out, you can actually do this if you're clever and at the cost of slightly lower precision. You lose accuracy when you use tiny weights like that, but you save a lot of time predicting because you're not predicting with floating-point arithmetic, you're predicting with integer or even binary arithmetic, which is way faster, and again, the power budget is much lower. So quantization. If you can't sleep tonight, just look up quantization, deep learning, welcome to my world. And I forgot to mention, some of the popular deep learning libraries already support that: TensorFlow, MXNet, they use reduced-precision training. So it's not science fiction. It's being used today. The next technique that is really applied is called pruning, and it's going to shrink networks. Because typically, one of the problems is we tend to build networks that are too complicated, too big. They have too many connections. We go big too quickly. So we end up having large neural networks that are maybe 150 megabytes or 200 megabytes. And again, it's not a problem on the server. It's a problem at the edge. We need smaller, faster, more efficient models. Pruning is a technique that removes useless connections. How do you figure out which ones are useless? Again, out of scope for today. It's an interesting technique. Again, you shrink the model, it becomes smaller, faster for embedded applications, it's really important. And the last one is basically compression, which is a simple technique. Instead of encoding weights, instead of using, let's say, 32-bit or 16-bit weights, you can just use traditional compression algorithms to make them smaller. And again, it shrinks the models and it helps save memory. And all these techniques are interesting per se, but in the context of machine learning hardware and deep learning hardware and embedded deep learning, they are critical. There is a lot of research going on there. And you'll find NVIDIA as well and Xilinx and everybody is pushing a lot in this direction to figure out how to shrink those models and make them efficient everywhere, not just on big fat servers. Okay. So here the main concern is really prediction, right? Because training will always happen in the cloud, right? You're not going to... There are some applications where you can, you know, your mobile phone, your connected glasses can learn from scratch, doing unsupervised learning, et cetera, et cetera. But most apps will be trained in the cloud. So power is not really an issue. Space is not an issue. But then you take that model. Imagine you need to deploy to one million water pumps or 10 million connected glasses or whatever. 10 million video cameras.
The smaller it is, the easier it is to deploy, and the less hardware you need to actually run it. So you apply those techniques during training, of course, but they will yield results during prediction. Does that answer your question? Thanks.
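For intuition only, here is a crude sketch of the 8-bit quantization idea in NumPy; real toolchains are far more sophisticated, but the size-versus-precision trade-off is visible even here. The array sizes are arbitrary.

```python
import numpy as np

weights = np.random.randn(1000).astype(np.float32)      # 32-bit weights: 4000 bytes

# Map the float range onto 8-bit integers (256 levels).
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)   # 8-bit weights: 1000 bytes

# At prediction time, weights are recovered (approximately) from the integers.
recovered = quantized.astype(np.float32) * scale

print('size reduction: %dx' % (weights.nbytes // quantized.nbytes))
print('max error:', np.abs(weights - recovered).max())  # small precision loss
```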
Okay, I'm almost out of time. A question I get a lot is, okay, well, now thanks for the headache. I knew I should have had that extra coffee. This is fascinating. Yeah? Yeah. Thanks for the vote of confidence. But it is overwhelming if it's the first time you hear about it. And I fully realize it. And one of my goals is to overwhelm you a little bit but at the same time show you that you can get started. Like I did pretty much. So how did I do it is the question I get a lot. So I wrote a blog post on Medium. It's in two parts, and as you can see, the title is the 10 steps to deep learning. I'm not claiming it's the absolute best way. It's all from my experience and talking to a few people. And this is what I think are the reasonable steps in the right order to get started with this. So if you're a developer, or even if you're not a developer, because I guess the first step is learn Python, if I remember correctly. But if you already know Python, okay, you're already past step one. So you're not starting from scratch. And I'm trying to push math and theory as far down the list as possible. It is there at some point, because if you want a finer understanding of how this thing works, yeah. You need to understand a bit of math. But you can do a lot without a line of math. Using pre-trained models, using high-level libraries like TensorFlow, MXNet, PyTorch, et cetera. It's not math, it's code. But it's good if you understand what's happening under the hood. It helps you debug and tune your networks. Hopefully that helps. If you have feedback, some people wrote to me later on and said, I tried it and I made it to step eight and yeah, now it's getting hard, but I learned a lot. So it seems to work for a lot of people. So my 10 steps. I hope you like them. If you want additional resources, these are the ones that I would recommend you look at. So if you're interested in machine learning on AWS, SageMaker and all the cool services that we have to help you build and scale your machine learning workflows, just go to ml.aws, simple URL. We have an AI blog, machine learning blog, with code and customer examples and all kinds of good things, which you may like. I mentioned Gluon, one of those high-level libraries to get started. Why not? It's not a bad choice. You could use TensorFlow or anything else, but hey, Gluon is there as well. It has excellent documentation, especially for beginners, so that's why I'm recommending it. And here's the URL to my blog where you'll find lots of stuff, some AWS stuff, but not only. So take a look and send me some feedback. By now I've got a pretty big collection of talks and videos on YouTube as well from AWS events and third-party conferences. So if you want more of this, there is more. I have longer versions of this talk, or maybe versions of this talk where I emphasized something else. It might be interesting if you want to share it with colleagues or watch it later, etc. And of course, I've got some code on GitLab, so two different repos with MXNet code and a bunch of notebooks for all kinds of libraries that you can go and grab. Most of them correspond to blog articles. So, there you go. Thanks very much for listening to me today. Thanks again to ODSC for the kind invite. If you want to stay in touch, Twitter or obviously LinkedIn, but Twitter is the easiest way. If you have questions, if you're looking for resources, if you have built something very cool that you want to share, like Julien from Arcee. This guy is a crazy, crazy data science blogger.
So I want to see more of that. And please send me your articles, and I'm more than happy to share them with my followers as well. Thanks again, and enjoy the rest of your day. And I'll hang around for questions, of course, if you have questions. Thank you so much.