SageMaker Fridays Season 2, Episode 6: Computer Vision Large-Scale Training (November 2020)

November 14, 2020
Broadcast live on 13/11/2020. Join us for more episodes at https://amazonsagemakerfridays.splashthat.com/ ⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️ This episode explains how to train computer vision models on large-scale datasets. Starting from the ImageNet dataset, we use SageMaker to train a model with the built-in algorithm for image classification and 64 GPUs! We also discuss SageMaker features that help you scale, such as RecordIO files, Pipe Mode, distributed training, and GPU instances.

Transcript

Hey, good morning, everybody. Good afternoon, wherever you are. Welcome to episode six of SageMaker Fridays. It is the last episode of the season already. Time flies. My name is Julien, and I'm a principal developer advocate focusing on AI and machine learning. Please meet my brilliant co-presenter, Segolen Boudhaus. Hi, everyone. My name is Segolen, and I'm a senior data scientist working with the AWS Machine Learning Solutions Lab. Thanks again for joining us, Segolen. We're going to need your expertise. Everybody, let me remind you that episodes are 100% live, and you can ask all your questions in the chat. We have friendly and expert moderators waiting for your questions. I keep telling you there are no silly questions. Ask anything you like about machine learning at AWS. We're trying to help you learn. So don't be shy. Just go for it. Okay. So this is the last episode of the season, as I said, and we're going to try and close the season in style, right? No cliffhangers; we saved the biggest one for last. This week, we're going to focus on computer vision, which is obviously a very popular topic for machine learning. Specifically, we're going to focus on large-scale training. Segolen, please tell us more. Yes, indeed, Julien. Today, we are going to talk about how you can scale your training jobs on large and complex datasets, which are pretty common in the computer vision domain. Computer vision models require a lot of training to reach good accuracy. Scaling is really a fundamental question for both experimentation and production, as you want to keep your training time and cost under control. Today, we are going to learn several techniques leveraging SageMaker capabilities to do so. Bringing your data science project to the next level with a scalable deep learning infrastructure is the main takeaway of this final episode. We're going to go big. SageMaker has a lot of built-in features that make it really easy to scale. The great thing is we're going to keep using the same familiar training and deployment workflow that we used all season. So when I say big, what do I mean exactly? Segolen, give us some details on that. So this is really a big episode because we're going to train the ResNet-50 network from scratch on the ImageNet dataset, which is a reference dataset for many computer vision applications. Today, we're going to train an image classification model on about 150 gigabytes of data with large P3 GPU instances, using Pipe Mode and distributed training. This is really the cherry on the cake of SageMaker Fridays season two. Oh, yeah, definitely. So it looks like we're going to dive into machine learning engineering. Exactly, but it doesn't require any infrastructure skills. If you think, "Oh, I'm a data scientist, this is going to be about VPCs and subnets and SSH keys," no, no, no. It's going to be about you scaling very easily with very large datasets. So get some coffee, get your energy drink, get anything you need because we're going to start now. As usual, the material is online. Let me share my screen and show you the repository we're going to use today. It's one of the examples from my SageMaker book, and you can find it in this GitHub repo. Of course, you will find the URL on the final slide. Okay. And so it's building an image classification model on ImageNet. We're going to dive into the code, data preparation, and lots of different things. But as always, we want to focus on the problem first. The problem itself is pretty simple to understand, right? 
We want to classify images. This is a common use case for many customers. There are plenty of good models for image classification. The real problem is how do you scale your training jobs when you have to deal with very large datasets? Segolen, what can we say about this? So, yes, Julien, exactly. This question of scalability is a very important topic for our customers. As a data scientist at the ML Solutions Lab, I hear two recurring questions from deep learning practitioners. The first one is, "Can you check if we are making good use of the infrastructure we are paying for?" The second question is, "Can we train faster without spending more money?" The answer depends on the customer's business requirements. Some companies run training jobs that last days, even weeks, whereas other customers need to get the freshest models possible, retraining every hour. But in both cases, our customers want to avoid potential bottlenecks in their training infrastructure, which may lie in CPU, GPU, memory, or disk utilization, or in training throughput. In a nutshell, deep learning practitioners want the most optimized workload possible so they can scale. Today, we are going to see some techniques that can help here. Indeed, we're going to take a very good example of scaling a training job by training an image classification algorithm from scratch on the ImageNet dataset. Plus, we will explain why and how transfer learning techniques can benefit your deep learning projects. So, the problem is easy to understand. We have more data, and training times get longer; or we have lots of training jobs and want to run all of them at scale; or we want to retrain as often as possible. Some customers retrain every hour, sometimes more often. And, of course, training time and costs need to be under control. So we understand the problem. Let's take a look at the dataset. If you go to the ImageNet website, you can see how big this dataset is. It has millions of images and thousands of categories. It mostly includes plants, animals, and objects. You can see all those different categories, and there are some really strange ones if you're curious. You'll find lots of different things. You can explore the hierarchy of classes and so on. It's a huge dataset. Segolen, why is ImageNet such an important dataset? Can you tell us more about it? Yes, ImageNet is really a building block of modern computer vision. It has revolutionized the field of large-scale visual recognition, and ImageNet serves as a benchmark for many computer vision models. The project was launched more than 10 years ago by Professor Fei-Fei Li from Stanford to provide researchers with high-quality labeled image datasets. The images were collected from the web and labeled by human labelers using a crowdsourcing tool. As a result, ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. These categories are organized according to the WordNet hierarchy, with each node depicted by hundreds or even thousands of images. Today, we are going to use a lighter version of ImageNet: a dataset containing about 1.2 million images in 1,000 classes, which is about 150 gigabytes. ImageNet also reshaped the field of computer vision because, starting in 2010 as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was held, and it ran until 2017. 
During this ILSVRC, several key milestones in model architecture for image classification emerged, such as AlexNet, VGG, GoogLeNet, and others. You can see the results from that competition from 2010 to 2016. Each year, you see the error rate that the winning model achieved on image classification. Initially, it was 28.2%, then it went down to 25%, 16%, 6.7%, 3.6%, and finally 3.0%. The competition stopped in 2017. The blue line is the error rate, and the red bars show how many layers the winning model had. In 2010 and 2011, there was no deep learning, just shallow networks. Starting in 2012, AlexNet won with eight layers. The number of layers grew very large, helping to reduce errors significantly. Human error on this image classification task is about 5%. In 2015, we dropped below human error. The winning network is called ResNet, and you'll hear more about it. So, I mentioned ResNet. I guess you know which algorithm we're going to use. Segolen, tell us a little more about ResNet. Yes. Today, as we did in previous sessions, we are going to use a built-in algorithm on SageMaker. We are going to use the Amazon SageMaker image classification algorithm, which is a supervised learning algorithm that supports multi-label classification. It takes an image as input and outputs one or more labels assigned to that image. This algorithm uses a convolutional neural network (CNN) called ResNet, which can be trained from scratch or using transfer learning when a large number of training images are not available. Let me explain these concepts quickly. ResNet, short for residual networks, was the winner of ILSVRC 2015. It delivered an outstanding top-five error rate of under 3.6% using an extremely deep CNN with 152 layers. It is one of the most commonly used neural nets for computer vision tasks. In the research paper, the authors address the degradation problem observed while training deep neural nets by using skip connections and deep residual layers. The deep residual layer plus the skip connection creates an identity mapping between layers. By adding this skip connection, the signal can flow easily across the whole network, making progress even if some layers have not started learning yet. Inputs can propagate faster through the residual connection across layers. On my screen, you can see the skip connection, which we discussed when we talked about LSTM. The idea is to avoid information being forgotten as we go through layers, addressing the vanishing gradient problem. Skip connections re-inject the input signal into downstream layers, ensuring information is propagated all the way to the output layer without losing too much. Here, you can see a 34-layer ResNet, with all the skip connections feeding the input to deeper layers. This is a very elegant and powerful idea. ResNet was published in 2016, and it's still a favorite. It's one of those reference models that machine learning teams rely on for many tasks. Today, we are going to train from scratch. Many teams also use pre-trained versions and fine-tune them. Can you tell us a little bit about that? Yes, the image classification algorithm from SageMaker can be run in two modes: full training and transfer learning. In full training mode, the network is initialized with random weights and trained on user data from scratch. In transfer learning mode, the network is initialized with pre-trained weights, and only the top fully connected layer is initialized with random weights. 
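As a side note, here is a minimal, purely illustrative sketch of the skip connection idea described above, written with MXNet Gluon. It is not the code of the built-in algorithm, and the layer sizes are arbitrary:

```python
import mxnet as mx
from mxnet.gluon import nn

class ResidualBlock(nn.HybridBlock):
    """Simplified residual block: output = F(x) + x."""
    def __init__(self, channels, **kwargs):
        super().__init__(**kwargs)
        # Two 3x3 convolutions form the residual function F(x)
        self.conv1 = nn.Conv2D(channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm()
        self.conv2 = nn.Conv2D(channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm()

    def hybrid_forward(self, F, x):
        residual = x                       # the skip connection keeps the input as-is
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + residual)      # re-inject the input signal downstream

# Quick check on a random 64-channel feature map
block = ResidualBlock(64)
block.initialize()
x = mx.nd.random.uniform(shape=(1, 64, 56, 56))
print(block(x).shape)  # (1, 64, 56, 56)
```

In a full ResNet, the skip path also needs a 1x1 convolution whenever the number of channels or the spatial resolution changes; this sketch skips that detail.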
The idea behind transfer learning is to leverage pre-trained networks and fine-tune them with new data to generalize to other datasets. If you have a million images describing your own objects for a business case, you might train from scratch once to create a baseline model. Then, you could specialize the model to detect a subset of those object classes by fine-tuning. For example, you might fine-tune to detect only dogs and increase accuracy. Today, we're going to train from scratch because it's exciting. So, we have a dataset, and a big one. We have a built-in algorithm, so let's put everything together and start looking at the code. The first step is to understand how to get those 1.2 million images ready on our own machine. This is a problem in itself. Let me close this. These steps are not super well-documented, and I had to do some exploration. So, we're going to go slowly here. First, you need to download the dataset. You need to register on the ImageNet website, and it's not just a matter of downloading 150 gigs. What you end up downloading is a bunch of huge compressed tar files. The TensorFlow repository has a script called `download_imagenet` that simplifies this process. It needs your ImageNet username and access key, and it fetches, extracts, and saves the dataset, saving you a lot of trouble. It took me five days to download this using a nice EC2 instance, so make sure you launch it in a way that won't be interrupted, using `nohup` or something similar. Five days later, you have a file hierarchy with one folder for each of the thousand classes and all the validation images in one folder. I used another script to organize the validation images into their own class directories. This way, we have neatly organized training and validation images. We could upload these individual images to S3 and train on them directly, but managing a million individual files for multiple epochs would introduce a lot of overhead. Instead, we use a file format called RecordIO, part of Apache MXNet. The idea is to pack all your images into a smaller number of files. I packed my validation images into six files and my training images into about 140 chunks. Each chunk is about 100 to 200 megabytes, which is a good size for performance. We use a tool in MXNet called `im2rec` to create these RecordIO files. After syncing them to S3, we have a much more manageable and performance-optimized setup. Now, let's open the notebook. We start by installing the SageMaker SDK and grabbing a default bucket. We use a cool feature in SageMaker called Pipe Mode for high-performance training. Segolen, can you explain what Pipe Mode is? Yes, with Pipe Mode, your dataset is streamed directly to your training instances instead of being downloaded first. This means shorter startup times, higher I/O throughput, and virtually unlimited dataset sizes. Pipe Mode is recommended for large datasets, while File Mode is useful for smaller datasets that easily fit on local storage. Together, both input modes cover a wide range of use cases, from small experimental jobs to petabyte-scale distributed training jobs. We use the same training input object to define the properties for our training and validation channels. We specify the S3 location, full replication, and Pipe Mode. We shuffle the files for each epoch to avoid bias. For the validation dataset, we don't need to shuffle because the order doesn't matter. So, we have our inputs figured out. 
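For reference, here is a rough sketch of what this input configuration might look like with the SageMaker Python SDK (version 2). The bucket and prefixes are placeholders, and the exact values in the notebook may differ:

```python
import sagemaker
from sagemaker.inputs import TrainingInput, ShuffleConfig

session = sagemaker.Session()
bucket = session.default_bucket()           # placeholder: use your own bucket/prefixes

# The RecordIO files were packed beforehand with MXNet's im2rec tool and synced to S3
train_input = TrainingInput(
    s3_data=f's3://{bucket}/imagenet/train/',
    content_type='application/x-recordio',
    s3_data_type='S3Prefix',
    distribution='FullyReplicated',         # every instance sees the full dataset
    input_mode='Pipe',                      # stream data instead of downloading it
    shuffle_config=ShuffleConfig(seed=59)   # reshuffle file order at each epoch
)

validation_input = TrainingInput(
    s3_data=f's3://{bucket}/imagenet/validation/',
    content_type='application/x-recordio',
    s3_data_type='S3Prefix',
    distribution='FullyReplicated',
    input_mode='Pipe'                       # no shuffling needed for validation
)
```

Because the input mode is set at the channel level, nothing else in the training setup has to change, which is what makes switching between File Mode and Pipe Mode so painless.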
We see the data location in S3, grab the container name for the image classification algorithm in our region, and create the estimator. Segolen, what do we have here? Yes, we have the container, role, and instance type. You could start with a few smaller p3.8xlarge instances, but for this training job, I used eight p3dn.24xlarge instances. These are the largest and most powerful GPU instances available, with eight V100 GPUs each, twice the GPU memory, and 100-gigabit networking. This gives us a total of 64 GPUs in the training job. We also use Spot Instances to save costs, which can be very interesting when scaling training jobs. We set the output path and other common parameters. For hyperparameters, we have a huge number to play with to optimize our deep learning model. We choose the number of layers, and in our case, we use ResNet-50. We set `use_pretrained_model` to 0 for training from scratch. The number of classes is 1,000, and the number of training samples is 1.2 million. The mini-batch size is 2,800, which I determined through trial and error to maximize GPU memory usage. We also set the learning rate and the learning rate scheduler factor to reduce the learning rate over time. We use synchronous distributed training, so all instances share the same model and gradients are synchronized. We call `fit`, and the usual process starts: firing up instances, downloading the container, and starting training. With Pipe Mode, data download takes zero seconds, and training starts immediately. The training job ran for five hours, and we hit early stopping at epoch 150, achieving a validation accuracy of about 65%, which is pretty satisfying for a first attempt. Looking at the training job, we saved 70% of the cost thanks to Spot Instances. We ran for five hours, and the training was very resource-intensive. The GPU utilization was consistently high, around 600-700% (with 800% being the maximum). The GPU memory utilization was also very high, around 754%. We achieved almost 91% training accuracy and about 65% validation accuracy. We can see the same metrics in CloudWatch, where we can visualize CPU, memory, and GPU utilization, as well as accuracy over time. This shows that the training was efficient and the GPUs were kept busy throughout the process. So, that's it for this episode. We trained a state-of-the-art model using just a few lines of code and a scalable infrastructure. Thanks for joining us, and we hope you learned a lot. Stay tuned for more episodes and keep experimenting with SageMaker.
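To recap the configuration described in the walkthrough above, here is an approximate sketch continuing from the previous snippet. The layer count, classes, sample count, and batch size are the values mentioned in the episode; the learning rate, scheduler factor, epochs, and Spot timeouts are placeholders:

```python
from sagemaker import image_uris
from sagemaker.estimator import Estimator

# Built-in image classification algorithm container for the current region
container = image_uris.retrieve('image-classification', session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),
    instance_count=8,                         # 8 x p3dn.24xlarge = 64 V100 GPUs
    instance_type='ml.p3dn.24xlarge',
    output_path=f's3://{bucket}/imagenet/output/',
    use_spot_instances=True,                  # Managed Spot Training to cut costs
    max_run=36000,                            # placeholders: adjust to your budget
    max_wait=72000
)

estimator.set_hyperparameters(
    num_layers=50,                 # ResNet-50
    use_pretrained_model=0,        # full training from scratch
    num_classes=1000,
    num_training_samples=1281167,  # ImageNet (ILSVRC) training set size
    mini_batch_size=2800,          # tuned by trial and error to fill GPU memory
    epochs=150,                    # placeholder: combine with early stopping
    learning_rate=0.4,             # placeholder value
    lr_scheduler_factor=0.1        # placeholder value
)

estimator.fit({'train': train_input, 'validation': validation_input})
```

Note that with Managed Spot Training, `max_wait` must be greater than or equal to `max_run`; the training job details then report the actual savings, around 70% in this case.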

Tags

SageMaker, Computer Vision, Large Scale Training, ResNet, ImageNet

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.