Generate synthetic data with Stable Diffusion to augment computer vision datasets

October 26, 2022
Building image datasets is hard work. Instead of scraping, cleaning and labeling images, why not generate them directly with a Stable Diffusion model? In this video, I show you how to generate new images with a Stable Diffusion model and the diffusers library, in order to augment an image classification dataset. Then, I add the new images to the original dataset, and push the augmented dataset to the Hugging Face hub. Finally, I fine-tune an existing model on the augmented dataset.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️

- Code: https://github.com/juliensimon/huggingface-demos/tree/main/food102
- Food101 dataset: https://huggingface.co/datasets/food101
- Original model: https://huggingface.co/juliensimon/autotrain-food101-1471154053
- How the original model was created with AutoTrain: https://youtu.be/uFxtl7QuUvo
- Stable Diffusion model: https://huggingface.co/runwayml/stable-diffusion-v1-5
- Stable Diffusion Space: https://huggingface.co/spaces/runwayml/stable-diffusion-v1-5
- Diffusers library: https://github.com/huggingface/diffusers
- Food102 dataset: https://huggingface.co/datasets/juliensimon/food102
- New model: https://huggingface.co/juliensimon/swin-food102

Transcript

Hi everybody, this is Julien from Arcee. As you can see, I'm not in my usual setting. I'm in California for a couple of conferences, but that won't stop me from doing another YouTube video. So what is this one about? Well, as we know, building datasets is a very time-intensive and rather painful task, especially for computer vision datasets. We need to collect images, label them, etc. And I thought, can we use Stable Diffusion, which is a generative AI technique, to create synthetic images that we can add to existing datasets instead of scraping the web, cleaning images, resizing them, and labeling them? Well, that's exactly what we're going to do today. We're going to create new images using Stable Diffusion, add them to an existing dataset, retrain a model on this new dataset, and we'll see how that goes. So this is going to be a fun one, stick around.

In a previous video, I showed you how to use AutoTrain, our AutoML service, to fine-tune an image classification model on the Food101 dataset. As the name implies, that dataset includes 101 classes with different types of food, and it's a public dataset on the hub. Starting from this and using AutoTrain, I created this model, which is public as well, and this one lets us classify food images. Let's grab maybe this one. And we can score food images against those 101 categories. This model was pretty accurate: 91.5%. If you want to know how this model was created, go and check out that AutoTrain video. I will put the link in the video description.

Of course, the model only knows about the 101 classes that are part of the dataset. We can see them here. What happens if we try to predict an image that shows something else? For example, let's say I want to predict an image showing Boeuf Bourguignon, one of the most popular meals in France and one of my favorites: beef, mushrooms, carrots, red wine sauce, amazing. If we predict this image, it tells us steak, which is not totally wrong because it is a beef-based meal, but it's not steak. The rest is awfully wrong too. That's the problem I want to solve here: I want to teach that model how to predict additional classes. As an example, we'll use Boeuf Bourguignon. So, we need to collect images that show plates and meals with Boeuf Bourguignon, add them to the dataset, and train again with 102 classes this time. The problem is, where do you find those images? You could go to Google, search for Boeuf Bourguignon, and scrape those images, but then you would have to write the code for that, resize them, remove some that aren't exactly what you're looking for, etc. That's a valid way of doing it, but I thought I'd do it differently today. I'm going to use Stable Diffusion, the latest version at the time of recording (1.5), to generate images showing Boeuf Bourguignon that I can add to the dataset. That should be easier and quicker than scraping the web for whatever images I need.

Okay, so let's try this. We can't use the inference API for this model, but we have a bunch of Spaces here. Let's see if that model can actually generate the samples we need. It takes a few seconds. Let's see if it's good enough, precise enough, and realistic enough. Well, it's not so bad; it's actually very good. Let's generate a thousand images, because that's how many images we have in each class in the original Food101 dataset, and we'll add them to the dataset and train again. Let me switch to the Stable Diffusion notebook and I'll show you how to do this with very little code.
In this first notebook, we start by installing some dependencies. The main one is the diffusers library, which lets us work with Stable Diffusion models. We need to log in to the hub because that's how the Stable Diffusion model is configured; it requires logged-in users. Thanks to the diffusers library, we can create a pipeline just like we would with the Transformers library. We're going to use the FP16 version, which lets us generate much more quickly than full precision, with no difference in the quality of the images. That's a really good trick. Of course, we want to make sure that pipeline is running on a GPU. I'm using a GPU instance here with a V100 GPU. Download the model, and then we can generate our images.

The generation itself is super simple. It's just the pipeline, the text prompt, how many images we want to generate for each prompt, and other technical parameters like the guidance scale. The number of images you can generate in one go depends on the GPU memory you have. Here, I can do four in a round. If I want to generate a thousand images, I need to invoke the pipeline 250 times. That's why I have this nested loop here: first, I iterate on how many times I need to invoke the pipeline, and then I iterate on the four generated images and save them. Very simple code.

What about this guidance scale parameter? It gives more or less freedom to the model when generating images. A value of 8, which seems to be a good default, will generate images strictly compliant with the text prompt. Lower values give the model more freedom to explore. Let's try this and see how it goes. Let's generate maybe eight images with the default value for the guidance scale and display them. This is very fast; it takes about 10 seconds for four images, so you can do a thousand images in just under an hour, which is pretty fast. I can see my generated images, and they're good. This one is just a little bland, so let's try images with a little more freedom. We might start seeing other objects appearing in the pictures, which would be nice because it creates more diverse images and makes the model work harder at figuring out the context in the image. You can see we have multiple plates and silverware, which we didn't see before. Sometimes you get a glass of wine; I have a few pictures like that. We see bread, additional objects, more plates. Generally, this is more interesting, because if you have a thousand images centered right on the meal with nothing else visible, it's not so realistic. You want a little more chaos in the picture. Lowering that guidance scale parameter will give you that.

I could just do this and generate a thousand images, which I've already done. It takes about an hour, and the images are saved to a local directory. Let me switch to the directory and show you the pictures, and then we'll see how we can add them to the dataset. About an hour later, I get my images. Here they are: a thousand images. Generally, they're very good quality. Now, how do we add them to the dataset? The Food101 dataset, which you can get on Kaggle, looks something like this. I renamed it Food102 because it has an extra class. It's a simple structure: one folder for each type of food, and the folder name will be the label for the class. What I did is simply create a new folder and copy my images in there. That's all it takes. If you have this image folder structure, it's super straightforward to add data to it.
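To make the generation loop concrete, here is a minimal sketch using the diffusers library, along the lines of what the notebook does. The prompt wording, output directory, guidance value, and file naming are my own placeholders, not the exact values from the notebook.

```python
import os

import torch
from diffusers import StableDiffusionPipeline

# Placeholders: adjust the prompt, output directory and guidance value to your own use case.
model_id = "runwayml/stable-diffusion-v1-5"
prompt = "a photo of a plate of boeuf bourguignon"
output_dir = "food102/train/beef_bourguignon"
os.makedirs(output_dir, exist_ok=True)

# Load the pipeline in half precision (FP16) for faster generation, then move it to the GPU.
# You may need to log in first (huggingface-cli login) to download the model.
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

images_per_call = 4   # limited by GPU memory
num_calls = 250       # 250 calls x 4 images = 1,000 images

for i in range(num_calls):
    # A lower guidance_scale than the default gives the model more freedom,
    # which produces more varied scenes (extra plates, glasses, bread, etc.).
    output = pipe(prompt, num_images_per_prompt=images_per_call, guidance_scale=6.0)
    for j, image in enumerate(output.images):
        image.save(os.path.join(output_dir, f"{i * images_per_call + j:04d}.png"))
```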
If you have a CSV or JSON Lines structure, it's not much more difficult, but the image folder format is very convenient for adding more data. Now we have Food102, and it's not split yet; it has only a training set. We have the data, and now we can move on to building a Hugging Face dataset for this and then train. Let's build a dataset here. We'll start from that image folder and turn it into a dataset that we can push to the Hugging Face hub. For this, we need to have Git LFS installed. These are the instructions for this machine. Shouldn't be too different on yours; maybe you need to replace `yum` with `apt`. We need to be logged into the hub because we're going to push the dataset to the hub. We can use the super convenient image folder format to load from the folder into a Hugging Face dataset. Just point the loader at the top of the file tree, and it will load all those files. We can check that we have 102 classes and all our class names, including the new one. Then we can split for training and validation; by default, it's 25% for testing. We can just push this to the hub, and voilà! If I go to the Hugging Face hub, I can see my dataset and browse it.

Now that we have a Hugging Face dataset ready, we can move on to the last part, which is fine-tuning that original model I created with AutoTrain on the Food102 dataset and see how that goes. If you're familiar with the Trainer API in the Transformers library, you already know what I'm going to do here. It's very similar and easy to understand. Install transformers and datasets, import some classes, load our dataset from the hub, and verify the 102 classes and class labels. We can easily build the mapping between the class IDs and the class labels, which will be useful for human-readable predictions. Define the base model that I want to fine-tune. This is the one we saw earlier, trained on Food101 for a few epochs. We're going to train for a few more on Food102 this time. I've gone for a really small learning rate because this is very light fine-tuning. I also set the training and eval batch sizes, and FP16, which is a huge time saver. We'll evaluate after each epoch.

Next, we download the base model, passing the labels and mappings. This is important because the classifier layer of the model is going to change. The original model has 101 output classes; the new one has 102. We need to resize it, and if you don't set the `ignore_mismatched_sizes` parameter to true, you'll get an error. Next, I need a transform function that converts the images to RGB format and passes them to the feature extractor for the model, which applies resizing and normalization. We apply that to the full dataset. Define the collator function to ensure the batches have the proper format, with the proper features and pixel values. Define the metrics function; here, I'll just output accuracy. I left the code for F1, precision, and recall, but it's per class, so it gets a bit messy. Feel free to use it if you need it. The training arguments put everything together, and then the trainer, which includes the model, arguments, feature extractor, collator, metrics function, training set, and eval dataset. I call train and run for three epochs. That's about 20 minutes per epoch with this dataset, so about an hour in total. As you can see, the accuracy improves pretty rapidly. The base model was 91.5%, and after three epochs, I'm at almost 93.4%. Maybe we could push it a little more, but an hour of training is good enough. So, that's a very accurate model. Very cool. We push it to the hub, and we're done.
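Here is a minimal sketch of the dataset-building step, assuming the class-per-folder layout described above; the local path and the hub repository name are placeholders, not the exact ones from the notebook.

```python
from datasets import load_dataset

# Point the imagefolder loader at the top of the class-per-folder tree (path is a placeholder).
dataset = load_dataset("imagefolder", data_dir="food102")

# Sanity check: 102 classes, including the new one.
labels = dataset["train"].features["label"].names
print(len(labels))

# Split the single training set into train / test; datasets uses a 25% test split by default.
dataset = dataset["train"].train_test_split(test_size=0.25, seed=42)

# Push to the Hugging Face hub (requires git-lfs and `huggingface-cli login`).
dataset.push_to_hub("your-username/food102")
```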
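And here is a sketch of the fine-tuning step with the Trainer API, under the same assumptions; the hyperparameters (learning rate, batch sizes) and the dataset repository name are illustrative, not the exact values from the notebook.

```python
import numpy as np
import torch
from datasets import load_dataset
from transformers import (AutoFeatureExtractor, AutoModelForImageClassification,
                          Trainer, TrainingArguments)

dataset = load_dataset("your-username/food102")          # the augmented dataset pushed above
labels = dataset["train"].features["label"].names
id2label = {i: name for i, name in enumerate(labels)}
label2id = {name: i for i, name in enumerate(labels)}

base_model = "juliensimon/autotrain-food101-1471154053"  # the original Food101 model
feature_extractor = AutoFeatureExtractor.from_pretrained(base_model)

# The classifier head grows from 101 to 102 outputs, so mismatched sizes must be allowed.
model = AutoModelForImageClassification.from_pretrained(
    base_model,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)

def transform(batch):
    # Convert images to RGB and apply the model's resizing and normalization.
    inputs = feature_extractor([img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    inputs["label"] = batch["label"]
    return inputs

dataset = dataset.with_transform(transform)

def collate_fn(batch):
    # Stack individual samples into properly shaped batches of pixel values and labels.
    return {
        "pixel_values": torch.stack([x["pixel_values"] for x in batch]),
        "labels": torch.tensor([x["label"] for x in batch]),
    }

def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return {"accuracy": float((predictions == eval_pred.label_ids).mean())}

args = TrainingArguments(
    output_dir="swin-food102",
    learning_rate=3e-5,              # small learning rate: this is light fine-tuning
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    fp16=True,
    remove_unused_columns=False,     # keep the image column for the transform
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=feature_extractor,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.push_to_hub()
```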
If I go back to the hub, I can find this new model, which is a Swin model by the way. If you want to try this with a Vision Transformer model, it should work the same; the feature extractor code is the same as well. 93.38% accuracy on those 102 classes. Now, if I try my image, it's properly scored. Very cool. So there you go. A bit of a weird technique, but I think it's actually quite simple. It took very little effort to generate those images, and this is a very scalable process. If I needed 10,000 images, I would just let it run for 10 hours or use a multi-GPU instance and scale things a bit. Once you have this, you can generate any number of images for any prompt. If you wanted to try chicken teriyaki or something else, you can add as many classes as you want. This is much simpler than trying to find the appropriate number of images on the web, scraping them, worrying about whether you're allowed to use those images, and figuring out how much work you need to do to process them. Here, they're going to be the right size and the right quality, and you can get as many as you need. This is actually a rather fast process, and thanks to the datasets and transformers libraries, it's quite easy to augment your datasets and retrain your models. One hour of data generation and one hour of training, and we get an extremely good 93.38% accuracy. I was surprised to get such good results. I guess I found a real-life business use case for Stable Diffusion. Sure, you can generate crazy pictures of dragons and unicorns, but if you want to get real business done with Stable Diffusion, this is one way to do it. That's it for me. I hope you liked it, and I hope this was useful. I'll see you when I'm back on the other side of the pond. Until then, keep rocking.

Tags

StableDiffusion, DatasetAugmentation, ComputerVision, ModelFineTuning, SyntheticDataGeneration