AWS DevDays 2020 An Introduction to Deep Learning Theory and Use Cases
March 26, 2020
In this session, we’ll explain how Deep Learning differs from traditional Machine Learning and how it can help you solve complex problems such as computer vision or natural language processing. Then, using Python code, we’ll run some examples using popular libraries like TensorFlow, Keras, Apache MXNet and PyTorch. Finally, we’ll demonstrate how you can achieve excellent results quickly by using state-of-the-art pre-trained models.
Code: https://github.com/juliensimon/dlnotebooks
For more content, follow me on:
* Medium: https://medium.com/@julsimon
* Twitter: https://twitter.com/juliensimon
Transcript
Hi everybody! Welcome to this webinar on deep learning. If you have any questions, please submit them in the questions pane in the control panel, and I will answer them at the end. A copy of today's slides can be found in the handout tab on the control panel, and you will receive a copy of the recording in a follow-up email after the event. Let's get started with deep learning.
So, what are we going to cover? We're going to look at deep learning theory. Don't be scared; it's not going to be too extreme. There will be just a little bit of math, the minimum amount to understand what's really happening. This is a light introduction to deep learning theory. We'll talk about neurons and neural networks, the training process, and backpropagation, that mysterious algorithm. We'll discuss optimizers and how they help the learning process. Then, we'll look at common neural network architectures and use cases, and I will run a bunch of demos. We'll talk about convolutional neural networks for computer vision and recurrent neural networks, which are heavily used in natural language processing. LSTMs are a common architecture for recurrent neural networks. We'll end with GANs, which are pretty fun. And, of course, I'll share some resources.
Okay, let's get started. And once again, if you have questions, please ask them. So here we go with the theory. First, let's explain what a neuron is. We're trying to mimic the biological neuron. What we know about biological neurons is that they have inputs assigned a weight. If the inputs are strong enough, the neuron fires; if not, it doesn't. We're trying to mimic that using code. We have a number of input signals, \( x_1, x_2, \) etc., each with a weight, \( w_1, w_2, \) and so on. We multiply each input by its weight and add everything together. This operation is called multiply and accumulate, so \( x_1 \times w_1 + x_2 \times w_2 + x_3 \times w_3, \) etc. This gives us a value we call \( u \). In most networks, we also use a bias value, but for simplicity, I'll ignore the bias. The bias is a fixed value added to the multiply and accumulate operation. So \( u \) is a linear function of the inputs. If the inputs vary linearly, \( u \) will vary linearly. This is a problem because we want a threshold. We want to mimic biological neurons, which sometimes fire and sometimes don't. There seems to be a threshold beyond which something happens. To reproduce this, we use activation functions. Over time, many activation functions have been designed, and the popular one these days is the ReLU (Rectified Linear Unit), which is very simple and fast to implement. If \( u \) is negative, the activation of \( u \) is zero; if \( u \) is positive, the activation of \( u \) is \( u \). This gives us a threshold and potentially very large activations, unlike functions like tanh or arctan, which are limited between fixed values. This is a desirable property for learning. So, a neuron is a combination of multiply and accumulate and an activation function, giving us the output value of the neuron, which we call the activation value.
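The multiply-and-accumulate plus activation described above can be sketched in a few lines of numpy. This is an illustrative toy, not code from the demo; the input and weight values are made up.

```python
import numpy as np

def relu(u):
    # ReLU activation: 0 for negative inputs, identity for positive inputs
    return np.maximum(0.0, u)

def neuron(x, w, b=0.0):
    # Multiply and accumulate: u = x1*w1 + x2*w2 + ... (+ bias), then activate
    u = np.dot(x, w) + b
    return relu(u)

x = np.array([1.0, 2.0, -1.0])   # input signals
w = np.array([0.5, -0.25, 1.0])  # weights
print(neuron(x, w))  # u = 0.5 - 0.5 - 1.0 = -1.0, so ReLU outputs 0.0
```

Note that ReLU leaves positive values untouched, so activations can grow arbitrarily large, unlike tanh, which is capped.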
A single neuron by itself won't do much, so we want to put them together in layers and networks. This is possibly the simplest neural network you could build, called a fully connected network. Each input is fully connected to all intermediate neurons, and each intermediate neuron is fully connected to all outputs. The input layer is the data we send to the network for prediction, and the outputs are the results. In the middle, we have the hidden layers, which learn the right parameters to correctly predict inputs to outputs. You need at least one hidden layer. This is the simplest network you can build.
How do we go from input to output? First, we need data. Imagine we have a dataset and put it in a matrix, \( X \). Each row in the matrix represents a sample, and each column represents the features. \( X_{1,1} \) is the first feature of the first sample, \( X_{1,2} \) is the second feature of the first sample, and so on. \( X_{2,1} \) is the first feature of the second sample, and so on. We need as many neurons on the input layer as we have features. So, sizing the input layer is easy: you need the same number of neurons as features.
Most deep learning is used for supervised learning problems, which means we need labels. We store our labels in another matrix, \( Y \). For example, \( X_1 \) is labeled with a 2. If this is a classification problem and we're trying to classify samples into ten different categories, \( X_1 \) is class 2, \( X_2 \) is class 0, and the last sample is class 4. If you've done machine learning before, you know that when working with categories, we prefer one-hot encoded vectors. If you represent categories as integers, the model might learn by mistake that category 4 is twice category 2, which doesn't make sense. To avoid this, we replace category integers with one-hot encoded vectors. If we have 10 different categories, we replace each category with a vector of 10 bits and set the bit corresponding to the right category to 1. This is called one-hot encoding and helps us represent categorical variables without scaling issues. It also helps us size the output layer because if we have 10 classes, we need 10 output neurons. This helps us see zeros and ones as probabilities. For example, if \( X_1 \) is category 2, the probability that it is category 2 is 100%, and 0% for all other classes.
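One-hot encoding is simple to write out by hand. Here is a minimal numpy sketch, using the example labels from above (class 2, class 0, class 4) with 10 categories:

```python
import numpy as np

def one_hot(labels, num_classes):
    # Replace each integer class with a vector of num_classes bits,
    # with a 1 at the index of the right category
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

y = np.array([2, 0, 4])
print(one_hot(y, 10))  # row 0 has a 1 in column 2, row 1 in column 0, etc.
```

In practice, libraries provide this directly (Keras has `to_categorical`, for instance), but the operation is exactly this.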
The way this works is that we take a sample, and if everything works fine, when you put the features of the first sample on each neuron of the input layer, run the multiply and accumulate operations, and run the activation functions, at the end, in the output layer, you will have zeros in all output neurons except for the neuron corresponding to the right category. This is the ideal scenario. We measure accuracy, which is the total number of correct predictions divided by the total number of predictions. Ideally, we have 100% accuracy, but it takes work to get there.
Initially, the network won't predict correctly because all the weights are initialized at random. The purpose of the training process is to figure out the right weights for maximum accuracy. If you try to predict with a random network, you'll get garbage on the output layer. But that's okay; as long as we can measure the difference between the real label and the predicted one, we can improve. To measure the distance between the two vectors, we need a loss function, which computes the numerical error between the two vectors. Most deep learning frameworks, like TensorFlow, include these loss functions. We compute the error for each individual sample, but we want to compute the accumulated error for a batch of samples. We run a few samples through the network, add up all the individual losses, and get the batch error. Then, we make optimization decisions.
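For classification, the usual loss function is cross-entropy. The sketch below (toy values, not framework code) shows the key property: predictions far from the one-hot label produce a larger loss, and the batch loss aggregates the per-sample losses.

```python
import numpy as np

def cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    # Per-sample loss: -sum(true * log(predicted)); batch loss is the mean.
    # eps avoids taking log(0).
    per_sample = -np.sum(y_true_onehot * np.log(y_pred_probs + eps), axis=1)
    return per_sample, per_sample.mean()

y_true = np.array([[0, 1, 0], [1, 0, 0]])                # one-hot labels
good = np.array([[0.05, 0.9, 0.05], [0.8, 0.1, 0.1]])    # close to the labels
bad  = np.array([[0.6, 0.2, 0.2], [0.1, 0.3, 0.6]])      # far from the labels
_, good_loss = cross_entropy(y_true, good)
_, bad_loss = cross_entropy(y_true, bad)
print(good_loss, bad_loss)  # the bad predictions yield a much higher loss
```

Frameworks like TensorFlow and Keras ship this as a built-in (`categorical_crossentropy`); you never write it yourself in practice.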
The purpose of the training process is to minimize prediction error, also called loss, by gradually adjusting weights. If we tweak the weights in the right direction, we should be able to minimize the prediction error and increase accuracy. This is not an easy problem because we need to adjust each weight in the right direction. We use an algorithm for this, which we'll discuss in a moment. This is what mini-batch training is. We start from the training set, slice it into batches, and start with the first batch. We take each sample in the batch and forward propagate it. We put each sample in turn on the input layer, run multiply and accumulate, get to the output layer, compute the loss for that sample, and add up all the individual losses to get the batch loss. Then, we run the backpropagation algorithm. Backpropagation starts from the output layer and goes back through the network layer by layer, adjusting weights individually in the direction that reduces error. Backpropagation figures out that if you increase this weight a bit, decrease that weight a bit, and so on, the error will be reduced. This is done for each weight, layer by layer. The next time we propagate the next batch, we should have a lower loss because we adjusted the weights in tiny steps in the right direction. We backpropagate again, making the right decisions, batch after batch, until we get a trained network.
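The forward-propagate / compute-loss / update-weights loop can be seen end to end on a toy problem. The sketch below trains a single linear layer (standing in for a full network, so backpropagation reduces to one analytic gradient) with mini-batch SGD on synthetic data; all names and values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # 100 samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                            # labels for a toy regression task

w = rng.normal(size=3)                    # weights initialized at random
lr, batch_size, epochs = 0.1, 20, 10

for epoch in range(epochs):
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start+batch_size], y[start:start+batch_size]
        pred = xb @ w                      # forward propagation for the batch
        error = pred - yb
        loss = np.mean(error ** 2)         # batch loss (mean squared error)
        grad = 2 * xb.T @ error / len(xb)  # gradient of the loss w.r.t. w
        w -= lr * grad                     # tiny step in the descent direction

print(np.round(w, 2))  # the weights converge toward [ 1.  -2.   0.5]
```

In a real network, the `grad` line is replaced by backpropagation working backward through every layer, but the batch loop and the update rule are exactly this.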
The batch size, learning rate, and number of epochs are important hyperparameters. The batch size should be chosen carefully; if it's too small or too large, the training process might not run well. The learning rate guides how large the updates to the weights should be. The number of epochs is how many times we go through the dataset, batch by batch. For large datasets and training from scratch, it's typical to train for hundreds of epochs.
During the training process, at the end of each epoch, we run a validation dataset through the network. This dataset is a fraction of the original dataset that we kept aside and did not use for training. We measure accuracy on this validation set to ensure the network is generalizing to data it hasn't seen. It's also good practice to keep a test set, which you only use at the end of the experimentation to benchmark your network against previous versions.
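A sketch of the split just described, with assumed fractions of 10% for validation and 10% for test:

```python
import numpy as np

def split_dataset(X, y, val_fraction=0.1, test_fraction=0.1, seed=42):
    # Shuffle the samples, then keep a validation set (checked at the end of
    # every epoch) and a test set (used only once, at the very end) aside
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    n_val = int(len(X) * val_fraction)
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

X, y = np.arange(100).reshape(100, 1), np.arange(100)
(X_train, _), (X_val, _), (X_test, _) = split_dataset(X, y)
print(len(X_train), len(X_val), len(X_test))  # 80 10 10
```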
How do we know how to adjust the weights? This is done using an algorithm called Stochastic Gradient Descent (SGD). The intuition is that if you're lost in a mountain and can only see a little bit ahead, you would identify the steepest slope and take tiny steps down to get to the lowest point. In deep learning, we start with random values for the weights and try to find the values that minimize the loss. The loss is a function of the network parameters, and we want to find the values of the parameters that yield the lowest loss. SGD uses math to find the direction of the steepest descent. The step size is based on the learning rate. If the learning rate is high, you take big steps; if it's low, you take tiny steps. We need a balance.
To find the direction of the steepest descent, we use derivatives. Derivatives give the slope of the tangent on a curve. For a one-dimensional problem, the derivative at a point gives the slope. For multi-dimensional problems, we compute partial derivatives in each dimension. By computing partial derivatives, we know which direction to move in each dimension to decrease the loss. We put all these partial derivatives in a vector called the gradient. The gradient is the vector of all partial derivatives in all dimensions. This is why the algorithm is called Stochastic Gradient Descent. The gradient tells us which way is down, and we adjust the weights accordingly.
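The "nudge each dimension, measure the slope" intuition can be made concrete with a numerical gradient on a toy two-dimensional loss (frameworks compute exact gradients with backpropagation instead; this finite-difference version is just for illustration):

```python
import numpy as np

def numerical_gradient(f, params, h=1e-6):
    # Partial derivative in each dimension: nudge one parameter by h,
    # measure the change in loss; collect them all into the gradient vector
    grad = np.zeros_like(params)
    for i in range(len(params)):
        bumped = params.copy()
        bumped[i] += h
        grad[i] = (f(bumped) - f(params)) / h
    return grad

# A bowl-shaped loss whose minimum sits at (3, -1)
loss = lambda p: (p[0] - 3) ** 2 + (p[1] + 1) ** 2

p = np.array([0.0, 0.0])
lr = 0.1
for _ in range(100):
    p -= lr * numerical_gradient(loss, p)  # step against the gradient
print(np.round(p, 2))  # converges toward [ 3. -1.]
```

The gradient points uphill, so stepping in the opposite direction, scaled by the learning rate, walks the parameters down toward the minimum.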
The loss function can have multiple valleys, including global and local minima. Deep learning researchers aim to avoid getting stuck in local minima because the error is higher there. Saddle points, where the loss is a maximum in one dimension and a minimum in another, can also be problematic because all derivatives are zero, and the algorithm might stop moving. However, these issues tend not to be significant in practice.
SGD is the granddaddy of optimizers, but there are others like AdaGrad, AdaDelta, and Adam. These adaptive optimizers can take larger steps in steep areas and smaller steps in shallow areas, and they can use different learning rates for different dimensions. If you're new to this, start with SGD and explore others later.
If you're still with me, congratulations! You now know a lot about deep learning. Over time, training accuracy should go up, and loss should go down. Validation accuracy should also go up but will eventually plateau. If you train too long, validation accuracy might drop, indicating overfitting. Overfitting means the model is too specialized to the training set and doesn't generalize well to new data. Techniques like checkpointing and early stopping can help avoid overfitting.
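The early stopping logic amounts to watching the validation loss and giving up when it stops improving. Here is a minimal sketch (the `patience` parameter and the loss values are illustrative; frameworks like Keras provide this as a ready-made callback):

```python
def early_stopping(val_losses, patience=3):
    # Stop when validation loss has not improved for `patience` epochs;
    # return the epoch to stop at, or None if training should continue
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch   # checkpoint the best model here
        elif epoch - best_epoch >= patience:
            return epoch
    return None

# Validation loss improves, then degrades: overfitting sets in after epoch 2
history = [0.9, 0.6, 0.5, 0.55, 0.6, 0.7]
print(early_stopping(history))  # stops at epoch 5, keeping the epoch-2 model
```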
Let's do a quick demo using the MNIST dataset, a collection of 70,000 images of handwritten digits (60,000 for training and 10,000 for testing). Each image is 28x28 pixels, and we can represent them as a matrix of pixel values between 0 and 255. I'm using Keras, a beginner-friendly deep learning library. We download the dataset, normalize the pixel values to the 0-1 range, and one-hot encode the labels. We define a simple network with a flatten layer, two fully connected layers with ReLU activation, and a softmax output layer. We compile the model using the Adam optimizer and a classification loss function. After training, we evaluate the model on the validation set and achieve 98% accuracy.
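The preprocessing and the softmax output described above look like this in numpy (a stand-in sketch with fake images, not the actual demo notebook, which is in the GitHub repo linked above):

```python
import numpy as np

# Two stand-in 28x28 grayscale images with pixel values between 0 and 255
images = np.random.default_rng(0).integers(0, 256, size=(2, 28, 28))

# Normalize pixel values to the 0-1 range before training
normalized = images.astype("float32") / 255.0

# The Flatten layer turns each 28x28 image into a vector of 784 input values
flattened = normalized.reshape(len(normalized), -1)

def softmax(u):
    # The softmax output layer turns 10 raw scores into class probabilities
    e = np.exp(u - u.max())
    return e / e.sum()

scores = np.array([1.0, 3.0, 0.5, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print(flattened.shape)  # (2, 784)
```

The 784 flattened values feed the input layer, and softmax on the 10 output neurons gives the per-class probabilities that the loss function compares against the one-hot labels.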
Next, I'll try the model on some hand-drawn digits. The model misclassifies one digit but performs well overall. To improve performance, we can use a convolutional neural network (CNN). CNNs account for the multidimensional nature of the data and use convolution and pooling operations to extract and shrink information. We build a CNN with convolution and pooling layers, followed by fully connected layers for classification. After training, the CNN achieves 99.16% accuracy on the validation set and performs well on the hand-drawn digits, even the ugly ones.
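The convolution and pooling operations a CNN is built from can be demonstrated directly in numpy. This toy sketch (made-up image and kernel, single channel, no padding or strides) shows a kernel sliding over an image and max pooling shrinking the resulting feature map:

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image; each output pixel is the sum of an
    # elementwise multiply between the kernel and the patch beneath it
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(image, size=2):
    # Max pooling shrinks the feature map, keeping the strongest activations
    h, w = image.shape
    return image[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])  # vertical-edge detector
features = conv2d(image, edge_kernel)   # shape (5, 5)
print(max_pool(features).shape)         # (2, 2)
```

A real CNN stacks many such kernels (with learned values) and pooling layers, then feeds the shrunken feature maps into fully connected layers for classification.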
CNNs are powerful for computer vision tasks. We can use pre-trained models from libraries like GluonCV for tasks like object detection and segmentation. For example, using a pre-trained YOLO V3 model, we can detect objects in an image with high accuracy. Similarly, a pre-trained segmentation model can segment objects in an image. These models are trained on large datasets and can be used out of the box.
Recurrent neural networks (RNNs) are important when the order of the data matters, such as in text translation or sentiment analysis. LSTMs are a type of RNN that remembers past inputs, making them suitable for long sequences. GRUs are a simpler variant of LSTMs that are easier to train and often just as effective.
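The "memory" in a recurrent network comes from a hidden state that is carried from one time step to the next. The sketch below is a minimal vanilla RNN step in numpy (random weights, toy dimensions); LSTMs and GRUs add gating on top of this same idea:

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    # One recurrent step: the new hidden state mixes the current input with
    # the previous hidden state, so earlier inputs influence later outputs
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

rng = np.random.default_rng(0)
Wx = rng.normal(scale=0.1, size=(3, 4))  # input-to-hidden weights
Wh = rng.normal(scale=0.1, size=(4, 4))  # hidden-to-hidden weights
b = np.zeros(4)

h = np.zeros(4)                          # initial hidden state
sequence = rng.normal(size=(5, 3))       # 5 time steps, 3 features each
for x_t in sequence:
    h = rnn_step(x_t, h, Wx, Wh, b)      # state carries over between steps
print(h.shape)  # (4,)
```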
Finally, Generative Adversarial Networks (GANs) are fascinating. GANs can generate realistic images, such as faces or photorealistic scenes. They consist of a generator and a discriminator that compete to improve the quality of generated images. GANs have applications in image generation, style transfer, and video processing.
To get started with deep learning, here are some resources:
- Machine Learning Academy on AWS
- Deeplearning.ai by Andrew Ng
- Fast AI by Jeremy Howard
- Books: "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (math-heavy) and "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron (code-heavy)
- Gluon and Keras for deep learning
- My GitHub repo with examples and more
Thank you for attending this session. It was very dense, and I'm sure you have plenty of questions. I look forward to answering them. Thank you very much.
Tags
Deep Learning, Neural Networks, Backpropagation, Optimizers, Convolutional Neural Networks
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.