Vision Transformer on SageMaker, part 1: dataset preparation
November 25, 2021
This video is the first in a series of three, where I focus on training a Vision Transformer model on Amazon SageMaker.
In this video, I start from the « Dogs vs Cats » dataset on Kaggle, and I extract a subset of images that I upload to S3. Then, using SageMaker Processing, I run a script that loads the images directly from S3 into memory, extracts their features using the Vision Transformer feature extractor, and stores them in S3 as Hugging Face datasets for image classification.
In the next two videos, I’ll use these datasets to train models using the Trainer API in the Transformers library (https://youtu.be/iiw9dNG7JcU), and then PyTorch Lightning (https://youtu.be/rjYV0kKHjBA).
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️
Vision Transformer paper: https://arxiv.org/abs/2010.11929
Dataset: https://www.kaggle.com/c/dogs-vs-cats/
Code: https://github.com/juliensimon/huggingface-demos/tree/main/vision-transformer
New to Transformers? Check out the Hugging Face course at https://huggingface.co/course
Transcript
Hi everybody, this is Julien from Hugging Face. This is the first of a series of three videos where I'll focus on training a Vision Transformer model on Amazon SageMaker. You're certainly familiar with using transformers for natural language processing tasks, but in fact, thanks to the Vision Transformer, we know that we can also use transformer models for computer vision tasks, such as image classification. And this is exactly what we're going to do here. In the first video, we're going to build a dataset using images stored in S3. In the second video, we'll use this dataset to train a Vision Transformer model using the Hugging Face container on Amazon SageMaker, and we'll use the Trainer API from the Transformers library. In the third video, we'll train again using the Hugging Face container and SageMaker, but this time we'll use PyTorch Lightning, which is a very popular add-on to PyTorch that makes it pretty easy to train models. So that's just another way to do it. OK, let's get started.
First of all, we need a dataset. To keep it simple, I'm going to grab the Dogs vs. Cats dataset, download it, and extract it locally. We're going to keep just 10% of the dataset to keep the demo and training time pretty short, but you could certainly use more images. You can see the script here; nothing really difficult. Extract the dataset, move all the dog images to one folder and all the cat images to another folder, then keep only 1,250 images from each class and sync that to an S3 bucket. I'll include the script in the repo. Again, nothing really complicated here.
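For reference, here's a rough Python sketch of the same idea (the actual script in the repo may differ; the bucket name and local paths are hypothetical, and I use boto3 uploads instead of `aws s3 sync`):

```python
import os
import random
import boto3

# After extracting the Kaggle archive, images live in train/ as dog.0.jpg, cat.0.jpg, ...
local_dir = "train"
bucket = "my-dogscats-bucket"   # hypothetical bucket name
images_per_class = 1250         # keep roughly 10% of the 25,000 images
s3 = boto3.client("s3")

all_files = os.listdir(local_dir)
for class_name in ["dog", "cat"]:
    files = [f for f in all_files if f.startswith(class_name)]
    random.shuffle(files)
    for f in files[:images_per_class]:
        # One prefix per class at the root of the bucket, e.g. dog/dog.42.jpg
        s3.upload_file(os.path.join(local_dir, f), bucket, f"{class_name}/{f}")
```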
So once we've done that, what we have is an S3 bucket where we have at the root one prefix per class. So obviously, the cat prefix includes all the cat images and the dog prefix includes all the dog images. This is the assumption that I make on my data. This is what my dataset script expects. Okay, so pretty typical: one prefix with all the class images under it.
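Concretely, with a hypothetical bucket name, the layout looks like this:

```
s3://my-dogscats-bucket/cat/cat.0.jpg
s3://my-dogscats-bucket/cat/cat.1.jpg
...
s3://my-dogscats-bucket/dog/dog.0.jpg
s3://my-dogscats-bucket/dog/dog.1.jpg
...
```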
Let's take a look at the high-level process, and then we can look at the individual functions. This is my SageMaker Processing script, and here's the entry point, the main function. The script will receive arguments: the bucket, the prefix, and the list of classes that are present in the dataset. In this case, that's going to be an array with two classes, dog and cat. I also pass the name of the Vision Transformer model that I want to use. So we receive those arguments and extract them. The first step is to read all those images directly from S3, figure out their labels, and store everything in a Python dictionary with two keys: the image, stored as a NumPy array, and the label, so the class index, 0 or 1. We'll look at how this works.
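A minimal sketch of what that entry point could look like (argument names, defaults, and the model identifier are my assumptions, not necessarily what the script in the repo uses):

```python
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--bucket", type=str)
    parser.add_argument("--prefix", type=str, default="")  # empty: class prefixes sit at the bucket root
    parser.add_argument("--classes", type=str, default="dog,cat")
    parser.add_argument("--model_name", type=str, default="google/vit-base-patch16-224-in21k")
    args, _ = parser.parse_known_args()

    class_list = args.classes.split(",")
    # Load every image from S3 into memory:
    # {"img": [ndarray, ndarray, ...], "label": [0, 1, ...]}
    image_dict = build_image_dict(args.bucket, args.prefix, class_list)  # sketched further below
    # ...then build, preprocess, split, and save the dataset (next snippet)

if __name__ == "__main__":
    main()
```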
So once I've got this dictionary, I can use the convenient `from_dict` API in the datasets library, applying the features, so the label and the image. This gives me a Hugging Face dataset object built from my in-memory dictionary. The next step is, of course, to extract features from the images. I'm using the Vision Transformer feature extractor for that. What this really does is add a new feature called `pixel_values` to the dataset. Since I'm working with a Hugging Face dataset, I can use the `map` API to apply a preprocessing function that extracts the features with that extractor. Once we've done that and we have this extra feature in the dataset, we split the dataset into three parts, train, validation, and test, save them as Hugging Face datasets to well-known locations inside the SageMaker Processing container, and that container automatically moves them to S3. So that's the high-level process: load images into a dictionary, use the feature extractor to add pixel values, and then split and save.
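Here's a hedged sketch of that part, using the `datasets` and `transformers` APIs as I understand them; the exact feature definitions, model identifier, and split ratios in the repo may differ:

```python
import numpy as np
from datasets import Dataset, Features, Array3D, ClassLabel
from transformers import ViTFeatureExtractor

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

# Describe the two features coming from the in-memory dictionary
features = Features({
    "img": Array3D(dtype="uint8", shape=(3, 224, 224)),
    "label": ClassLabel(names=class_list),
})
dataset = Dataset.from_dict(image_dict, features=features)

# preprocess_images() is sketched a bit further below; it adds the pixel_values feature
dataset = dataset.map(preprocess_images, batched=True)

# Split 80/10/10; the job in the video ends up with 2,000/250/250 images
splits = dataset.train_test_split(test_size=0.2)
val_test = splits["test"].train_test_split(test_size=0.5)
train_dataset, val_dataset, test_dataset = splits["train"], val_test["train"], val_test["test"]

# SageMaker Processing automatically uploads these local directories to S3
train_dataset.save_to_disk("/opt/ml/processing/output/train")
val_dataset.save_to_disk("/opt/ml/processing/output/validation")
test_dataset.save_to_disk("/opt/ml/processing/output/test")
```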
Let's zoom in on the individual functions here. Building the image dictionary starts from an empty dictionary, and we paginate the objects that live in that bucket, because you can only list a thousand objects at a time. Since we have more than that, we need to iterate over the pages. For each object on each page, we load it from S3 as a NumPy array, figure out its label, and add it to the dictionary. How do we load the object? We read it as a byte stream, which we open with the Python Imaging Library (PIL). We resize the image to 224x224, which is what the Vision Transformer expects. If your images already have the right size, you can of course skip that step. I convert the image to a NumPy array and then move the channels to the first position, because I'm going to use PyTorch for training, and that's what PyTorch expects. These are color images with three channels: red, green, and blue. We want to make sure that the channel dimension is the first dimension of the array. Unfortunately, PIL does it the other way around, so channels last, which is why I need to move the channel axis to the front. That's how you load an image from S3.
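This is roughly what those helpers could look like (a sketch, assuming the boto3 and PIL calls described above; function names are my own):

```python
from io import BytesIO

import boto3
import numpy as np
from PIL import Image

def load_image_from_s3(s3_client, bucket, key):
    # Read the object as a byte stream and open it with PIL
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    image = Image.open(BytesIO(body)).convert("RGB")
    # Resize to the 224x224 input size the Vision Transformer expects
    image = image.resize((224, 224))
    # PIL gives us channels-last (height, width, channels); PyTorch expects channels-first
    return np.moveaxis(np.array(image, dtype=np.uint8), -1, 0)

def build_image_dict(bucket, prefix, class_list):
    image_dict = {"img": [], "label": []}
    s3_client = boto3.client("s3")
    # list_objects_v2 returns at most 1,000 keys per call, so iterate over all pages
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            image_dict["img"].append(load_image_from_s3(s3_client, bucket, key))
            image_dict["label"].append(get_label(key, class_list))  # see next snippet
    return image_dict
```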
Figuring out the label is super simple because, of course, the key for the image contains "dog" or "cat." So we can split that key, find the "dog" or "cat" position, and then, since we have the list of classes, we can find the label, which is just the index of "dog" or "cat" in the class list. So 0 or 1, pretty much. And that's how we do it. So nothing really complicated here.
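In code, that could be as simple as this (assuming keys like `dog/dog.42.jpg`; the helper name is hypothetical):

```python
def get_label(key, class_list):
    # One of the path segments in the key is the class name, e.g. "dog/dog.42.jpg" -> "dog"
    for part in key.split("/"):
        if part in class_list:
            return class_list.index(part)  # 0 for "dog", 1 for "cat"
    raise ValueError(f"No known class found in key: {key}")
```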
The preprocessing function is just what you would expect. Starting from a batch of images stored in the dataset, we apply the feature extractor and add that new feature called `pixel_values`. I guess that's the canonical way to process a Hugging Face dataset: using the `map` API with a function.
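A sketch of that preprocessing function (the real one in the repo may handle the batch slightly differently):

```python
def preprocess_images(batch):
    # map() hands us a batch of examples; turn the images back into NumPy arrays
    images = [np.array(img, dtype=np.uint8) for img in batch["img"]]
    # The feature extractor normalizes the images and returns the pixel_values feature
    batch["pixel_values"] = feature_extractor(images)["pixel_values"]
    return batch
```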
Now let's see how we run this using SageMaker Processing. SageMaker Processing is really super simple. We use a built-in container. In this case, I'll use the scikit-learn processor. You could use PySpark if you wanted to distribute computation, but here I don't need that. So I'll just stick to my good friend the sklearn processor, passing my infrastructure requirements. I just need a decent amount of memory on the instance because I'm going to load all those images, so I'm going to go with an ml.m5.4xlarge. That should be more than enough, but you can go bigger: you can use larger M5 instances or very large R5 instances. So if you really need to load tons of data, you can just change the instance type, and you should be fine.
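Setting that up looks something like this (the scikit-learn framework version is an assumption for late 2021):

```python
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",          # assumption; any recent sklearn container version works
    role=sagemaker.get_execution_role(),
    instance_type="ml.m5.4xlarge",       # plenty of RAM to hold all the images in memory
    instance_count=1,
)
```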
Then I'm just going to run this processor object, passing my image processing script, which we just looked at. It doesn't have an input per se; the input is really defined by the bucket, the prefix, and the list of classes. It has three outputs because, as we saw, we have three splits for the dataset: the training data, the validation data, and the test data. That's it. Super nice. We just run this cell, and we can see a bit of logging here. We're installing the transformers library and the datasets library. We can see that we load the images and then build a dataset. We extract features; here we see we're downloading the feature extractor for the Vision Transformer model we selected. Then we split the dataset, and we see we'll have 2,000 images for training, 250 for validation, and 250 for test. And then we just save them. All of that took nine minutes. And I can visualize the actual location of those outputs: I see all three are living in the same S3 prefix. If I list that, I see my training dataset with the familiar Arrow file, and the same for the validation dataset and the test dataset.
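The run call then maps the three splits to the three well-known output paths in the container. A sketch of what that cell could look like (the script name is hypothetical, and `bucket` and `prefix` are placeholders defined earlier in the notebook):

```python
from sagemaker.processing import ProcessingOutput

sklearn_processor.run(
    code="vit-image-processing.py",   # hypothetical script name
    arguments=[
        "--bucket", bucket,
        "--prefix", prefix,
        "--classes", "dog,cat",
        "--model_name", "google/vit-base-patch16-224-in21k",
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/output/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/output/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/output/test"),
    ],
)

# Inspect where SageMaker uploaded each output
description = sklearn_processor.jobs[-1].describe()
for output in description["ProcessingOutputConfig"]["Outputs"]:
    print(output["OutputName"], output["S3Output"]["S3Uri"])
```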
Okay, so that's the end of this first video. Now we have a dataset in S3, a Hugging Face dataset, and in the next video, I'll show you how to train a model using this dataset and the Trainer API in the Transformers library. Okay, so keep watching.