Hi everybody, this is Julien from Arcee. A lot of folks still think that fine-tuning large models is difficult and expensive. Well, in this video, I'm going to show you how you can fine-tune a Stable Diffusion model for literally one dollar. And we won't even write a line of code. We'll just use an off-the-shelf script from the Diffusers library and advanced training techniques implemented in the PEFT library, which, as we know by now, stands for parameter-efficient fine-tuning. So this is pretty cool stuff. It's very efficient. Let's get to work.
This video is based on a blog post that my colleagues wrote a little while ago, and I strongly encourage you to read it, of course. I will put all the links in the video description. So, in a nutshell, what we're going to do here is start from a Stable Diffusion model. We're going to fine-tune it on an image dataset of Pokemons, which is pretty funny. And we can do this very efficiently using a technique called LoRA, which stands for low-rank adaptation. I will explain this in a minute. In a nutshell, this technique reduces the number of parameters that we actually have to fine-tune, and that definitely lowers the bar on how much infrastructure is required. Okay?
Moving on, this is the actual script I'm going to use. Very simple. And, of course, we'll generate some Pokemons with the trained model. The model is also on the Hub, so you'll be able to replicate this with your own images. That's pretty fun.
First things first, let's take a look at LoRA, what it means in hopefully plain English, and why it's an amazing technique for reducing the amount of infrastructure that's required. As you may know, traditional fine-tuning updates all the model parameters. So when we're working with billion- or multi-billion-parameter models, this is obviously a time-consuming process, and it becomes expensive quickly because we need lots of infrastructure and powerful GPUs with lots of memory just to fit the model.
A while ago, this new technique came out. It's called LoRA, which means low-rank adaptation. I will explain what the low-rank bit means. Please go and check out the research paper if you'd like. What LoRA says is that we can fine-tune a model by training just two small matrices, multiplying them together, and adding the result to the original weights. And obviously, the point here is that instead of fine-tuning the full original model, we're just going to learn those update matrices, which are much smaller.
So this is from the research paper and it looks scary, but hopefully, I can translate it into plain English. We start with the original model, so a weight matrix W0 with two dimensions, d and k. Instead of learning the full set of parameters, which is the product of d and k, we're only going to learn two matrices, A and B, which we can multiply together. A and B are much smaller because they have a low rank r, which is, loosely speaking, the math term for their inner dimension. The product of those two matrices is added to the original weights. So as you can see, we're freezing the original weights in the W0 matrix, and only A and B contain trainable parameters. That means instead of tweaking all of W0's parameters, which would be d multiplied by k, we only learn the parameters in A and B, which is r times d plus r times k, or r multiplied by (d plus k).
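If it helps, here is the same idea written out as a formula, using the notation from the paper. This is just my transcription of what I described, nothing more:

```latex
% LoRA update: the frozen weight W_0 plus a low-rank correction.
% Only A and B are trained.
h = W_0 x + \Delta W x = W_0 x + B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
       B \in \mathbb{R}^{d \times r},\;
       A \in \mathbb{R}^{r \times k},\;
       r \ll \min(d, k)

% Trainable parameter count:
r \cdot d + r \cdot k = r (d + k) \quad \text{instead of} \quad d \cdot k
```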
So as you can see, we're turning a quadratic problem into a linear one. If we scale up the size of W0, we don't have to learn d multiplied by k parameters; we only have to learn d plus k multiplied by a small integer r. So the number of trainable parameters now scales linearly instead of quadratically, and that's the core benefit of LoRA. We can work with bigger models without scaling the infrastructure nearly as much. What this really means in practice is that we can reduce the number of trainable parameters by a factor of a thousand or more. So we might be training only something like 0.1% of the original parameters, with negligible loss of accuracy. That's a huge bonus, because now we can fine-tune those large models on a mid-range GPU, since we don't need as much GPU memory.
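To make that concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer. This is just my own illustration of the idea, not the actual PEFT implementation; the name `LoRALinear` and the scaling convention are my assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy illustration of a LoRA update around a frozen linear layer."""
    def __init__(self, d: int, k: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Original weight W0 (d x k): frozen, never updated during fine-tuning.
        self.weight = nn.Parameter(torch.randn(d, k), requires_grad=False)
        # Low-rank update: B (d x r) starts at zero, A (r x k) starts small.
        # These are the only trainable parameters.
        self.lora_B = nn.Parameter(torch.zeros(d, r))
        self.lora_A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (B A) x * scaling: the frozen path plus the learned update.
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Trainable parameters: r*(d+k) instead of d*k.
layer = LoRALinear(d=768, k=768, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / total: {total}")  # roughly 2% in this toy case
```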
And at inference time, we just collapse everything. We load the original model unchanged, we load the LoRA weights, and we add them together. There's no added latency. In fact, you'll see in my model repository that we only store the LoRA weights, and the merge is handled automatically by the library when we load them. So no difficulty, no latency. This technique is implemented in the PEFT library, and again, PEFT means parameter-efficient fine-tuning. This is what the blog post here is using. Hopefully, that gives you a little bit of background. Again, if you want the hardcore math, please go check out the paper. But the intuition is that we don't touch the original model. We just learn a couple of much smaller, lower-rank matrices, and we add that update to the original model and get amazing results.
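And that inference-time collapse really is just an addition. Continuing the toy example above (again, my own sketch, not the library code):

```python
import torch

@torch.no_grad()
def merge_lora(layer):
    """Fold the low-rank update into the frozen weight: W = W0 + B A * scaling.

    After this, the layer behaves like a plain linear layer again,
    so there is no extra computation and no extra latency at inference time.
    """
    layer.weight += (layer.lora_B @ layer.lora_A) * layer.scaling
    # Zero the update matrices so the forward pass doesn't apply them twice.
    layer.lora_B.zero_()

merge_lora(layer)  # 'layer' is the LoRALinear instance from the sketch above
```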
So let's take a look at the actual process. I'm starting from the Diffusers library, which I cloned to this machine. And if we go to examples, text to image, we'll see different scripts to train diffusion models. There's the vanilla one and there's the LoRA one. Feel free to go and read them. It's not strictly necessary, but if you're curious about all the details, you can certainly learn a lot here. To keep it simple, I'm just reusing the script from the blog post. I don't think I tweaked anything. So we're going to fine-tune this model, Stable Diffusion 1.5, on the Pokemon dataset, which you can see here. It has 833 Pokemons with descriptions. That's a fun one. But again, it would be reasonably easy to build your own dataset with just images and descriptions. So that's what we're doing here. We're going to save the model locally, and once we're done, we're going to push the model to the Hub with this name.
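If you want to peek at the dataset yourself before training, you can load it with the datasets library. I believe the dataset id used in the blog post is lambdalabs/pokemon-blip-captions, with an image column and a text column, but do double-check the dataset card:

```python
from datasets import load_dataset

# Assumed dataset id from the blog post; check the dataset card if it has moved.
dataset = load_dataset("lambdalabs/pokemon-blip-captions", split="train")

print(dataset)             # ~833 rows with 'image' and 'text' columns
print(dataset[0]["text"])  # a short caption describing the Pokemon
```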
Okay, so we're launching the training script with Accelerate. The rest is really just standard parameters. Again, feel free to tweak them. You can change the validation prompt if you like; validation images are generated regularly so you can keep an eye on the training process. So here we are generating a few validation images as we go. Why not? Okay, so that's the script. Now all I need to do is launch it, which is simple.
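For reference, here is roughly what the launch looks like, written as a small Python wrapper around accelerate so everything in this post stays in one language. The flag names mirror what diffusers' train_text_to_image_lora.py example exposes, but treat the exact values as my assumptions and check the script's --help for your version:

```python
import subprocess

# Hypothetical launch wrapper; equivalent to running `accelerate launch ...` from a shell.
subprocess.run([
    "accelerate", "launch", "train_text_to_image_lora.py",
    "--pretrained_model_name_or_path", "runwayml/stable-diffusion-v1-5",  # assumed SD 1.5 id
    "--dataset_name", "lambdalabs/pokemon-blip-captions",                 # assumed dataset id
    "--resolution", "512",
    "--train_batch_size", "1",
    "--num_train_epochs", "100",
    "--learning_rate", "1e-4",
    "--validation_prompt", "cute dragon creature",                        # placeholder prompt
    "--output_dir", "sd-pokemon-model-lora",
    "--push_to_hub",
], check=True)
```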
What kind of instance is this? Well, this is a very small instance. This is a g4dn.xlarge AWS instance, which is probably the smallest GPU instance you can get on AWS. If you don't know about G4 instances, go and check out the product page. Let's go look at those things here. So you can see this is the smallest one. It's got a single T4 GPU with just under 16 gigs of usable GPU memory, plus 16 gigs of instance RAM, and the on-demand price is 52 cents an hour. Definitely not expensive, especially when you compare that to the bigger GPU families like P4, let alone P5. That's the whole point here: to train in a cost-effective way, but also to be able to train at all, because availability of P4s and P5s is pretty challenging, to say the least. Thanks to the LoRA technique, you can actually train your models on much smaller GPU instances, which are very easy to grab, whether you want to use G4 or G5, which I'll show you in another video later. These are just available everywhere, in all AWS regions. I keep meeting customers who complain they can't get P4s, let alone P5s, in their regions. Well, they certainly can get G4 and G5. So this solves a lot of problems, from availability to cost.
I'm making a point of using the smallest one here, and obviously, you could scale up a little. You could try g4dn.12xlarge, which has four GPUs; you would probably get some speedup there, but it's a bit more expensive. Or you could go and try G5. But again, I wanted to show you that you can run this on the smallest GPU instance available. So we just need to launch this. Why don't we do that? So that's my script, and we just have to launch it. There we go. It will take a while, so we're not going to run it to completion. I just want to show you, first of all, that the script works, and how long it should take. We can see at the bottom of the screen that this will take something like six hours. And maybe you're thinking, oh wow, that's way too long. Again, the fact that this runs at all on this tiny instance is just amazing. You can scale up if you want, but you can run this on a tiny instance for very little cost. So, about six hours. Let's interrupt it, because I've already done this, and some of you are thinking, wait, you said I could do this for $1. Six hours multiplied by $0.52 is about $3, so how do I do it for $1? Well, you do it by using spot instances. And if this is the first time you're hearing about spot instances, you have been missing out. Spot instances are an amazing way to optimize cost. Go and read about them. If we look at the price for g4dn.xlarge in the us-east-1 region, we see that the on-demand price is 52 cents, and the spot price is, let's say, 15 cents. And this is very consistent: that's a week, that's a month, and that's three months. Super stable. No worries, no problem. You will get g4dn.xlarge at 15 or 16 cents an hour. Multiply that by six, and it only costs you about a dollar. So I wasn't lying. I never lie. Stable Diffusion, $1.
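Here's the back-of-the-envelope math, using the prices I just quoted (approximate, and spot prices do vary a little):

```python
hours = 6                   # approximate training time on g4dn.xlarge
on_demand_per_hour = 0.52   # on-demand price quoted above
spot_per_hour = 0.16        # typical spot price quoted above

print(f"on-demand: ~${hours * on_demand_per_hour:.2f}")  # ~$3.12
print(f"spot:      ~${hours * spot_per_hour:.2f}")       # ~$0.96, about a dollar
```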
So I pushed this to the Hub, obviously after six hours, and you'll find the model here. Everything is there: the checkpoints, the TensorBoard logs if you're interested, and a few validation images, which are pretty nice. I included the script so you can run exactly the same thing I ran. You just need to clone the Diffusers library, put this in the right place, and run it. I also included the full training log, so the actual output, just to show you how it went. It's not fascinating, but I know some of you want the full training log, and here it is.
So why don't we try the model now? Okay, I added a bit of information. Let's wait a few seconds for the model to load, and let's see what kind of Pokemon we get. All right. Well, that's a pretty nice flying unicorn. So there you go. Now you can generate Pokemons all day long. Fine-tuning doesn't have to be complicated, because we provide a ton of scripts. You saw Stable Diffusion here, but we have fine-tuning scripts for everything. So please don't go and spend time writing fine-tuning code; there's a good chance we have something you can start from and tweak if you need to. And when it comes to cost, again, techniques like LoRA, implemented in the PEFT library, are amazing. You could actually run this demo on even smaller GPUs, but this is the smallest GPU instance available on AWS, and the cost is negligible. So you could fine-tune tens or hundreds of models in parallel for negligible cost. You could also, of course, do this on SageMaker; just run that same code on SageMaker, no problem at all. So you can fine-tune tons of models, experiment at very low cost, scale on the cloud, and build amazing stuff for very, very little money.
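If you want to reproduce the generation step yourself, here is a minimal sketch with diffusers. I'm using pipe.load_lora_weights, which recent diffusers releases provide for loading LoRA weights on top of a base pipeline; the LoRA repository id below is a placeholder, so point it at my repo on the Hub, or your own:

```python
import torch
from diffusers import StableDiffusionPipeline

# Base model the LoRA weights were trained on (Stable Diffusion 1.5).
# The exact repo id may have changed since; check the Hub.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the small LoRA weight file on top of the frozen base model.
# Replace this placeholder with the actual repo id from the Hub.
pipe.load_lora_weights("your-username/sd-pokemon-model-lora")

image = pipe("a cute green pokemon with wings", num_inference_steps=30).images[0]
image.save("pokemon.png")
```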
So go and experiment, go and run this model, maybe fine-tune it on your own images, and see how easy it is. Well, that's really what I wanted to show you today. There's more coming: I have a Llama fine-tuning video in the works, which I think is pretty cool. I'll be working on that one in the next few days, so keep your eyes open for it. Until then, thank you for watching, and keep rocking.