A fistful of dollars: fine-tune LLaMA 2 7B with QLoRA

October 10, 2023
Fine-tuning large models doesn't have to be complicated or expensive. In this tutorial, I walk step by step through fine-tuning a LLaMA 2 7B model. Using an off-the-shelf script from the TRL library, the configuration leverages the QLoRA algorithm from the Hugging Face PEFT library. Training runs on a modest AWS GPU instance (g5.xlarge), and costs are kept low with EC2 Spot Instances, for a total of just a few dollars.

- Blog: https://huggingface.co/blog/4bit-transformers-bitsandbytes
- Model + code + training log: https://huggingface.co/juliensimon/llama2-7b-qlora-openassistant-guanaco
- Dataset: https://huggingface.co/datasets/timdettmers/openassistant-guanaco
- Amazon EC2 G5 instances: https://aws.amazon.com/ec2/instance-types/g5/

Thumbnail generated with Stable Diffusion XL: "Clint Eastwood western style riding a lama"

Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com.

Transcript

Hi everybody, this is Julien from Arcee. A lot of folks still think fine-tuning large models is difficult, time-consuming, expensive, and not for them. In a previous video, I showed you how to fine-tune Stable Diffusion for literally one dollar, and in this video, I'm going to show you how you can fine-tune the famous Llama 2 model, 7 billion parameters, for just a few dollars. And I really mean a few, less than five. OK? Let's get to work.

My starting point is this really nice blog post by my colleagues, where they introduce quantization and combine it with the famous LoRA technique, which I demonstrated in that Stable Diffusion video. The combination of quantization and LoRA is called QLoRA. I'll explain, hopefully in plain English, what QLoRA means, but just so you know, in a nutshell, it makes it possible to start from really large models and dramatically reduce the amount of infrastructure, and particularly the amount of GPU memory, required to fine-tune them. As we will see in the demo, we're able to start from a multi-billion parameter model and fine-tune it on a really small GPU for almost negligible cost. As usual, I will put all the links in the description. As a starting point, you should definitely read this post and learn about QLoRA and its benefits. There's a lot of technical information here; if that's a bit too much, watch the video first and then go read the nice blog post afterwards, OK?

So let's start by explaining what the QLoRA technique means. In the Stable Diffusion fine-tuning video, I explained LoRA in detail, so please go and watch that for the full explanation; I'll just give you the quick summary here. Full fine-tuning requires tweaking all the model parameters, but LoRA says that instead of doing that, we can just learn a couple of much smaller updates to the original model. The end result is that we only learn a fraction of the original parameters, typically 0.1%, sometimes even less. As you can imagine, we need much less GPU memory, so we can work on mid-range GPUs, which are much less expensive and obviously much easier to find. That's LoRA; again, watch the previous video for the full explanation.

QLoRA adds quantization. We quantize the model before training, where quantization means shrinking all model parameters from, say, 16-bit precision down to 8-bit or even 4-bit. By doing that, we save a lot of memory once again. We quantize the model, which is a fast operation, and then we fine-tune using the LoRA technique. We get even better efficiency because we need even less GPU memory, so really large models can now be fine-tuned in a fraction of the original memory footprint.

This is super easy to do on Hugging Face. We've integrated the bitsandbytes library, which is a quantization library, into transformers. It's literally just one parameter to quantize a model when loading it from the Hub; I'll show a quick sketch of that in a second. As usual, we provide sample scripts to do all of this, and that's exactly what I'll be running today: a supervised fine-tuning script from the TRL library, Transformer Reinforcement Learning. It lets us fine-tune Llama 2, 7 billion parameters, on a nice prompt dataset using PEFT, parameter-efficient fine-tuning, plus 4-bit quantization. We can run this on a GPU as small as a T4. I've actually tried a T4, and it does work, I can confirm it. It just took a little too long for my own taste, so I decided to upgrade to a slightly bigger GPU, but you will see the cost is still negligible. For this demo, I decided to use a g5.xlarge instance on AWS.
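Before we look at the instance, here's that quick sketch of the one-parameter quantized load. This is a minimal illustration, not the TRL script itself: it assumes the transformers, bitsandbytes, and accelerate packages are installed and that you have access to the gated meta-llama/Llama-2-7b-hf checkpoint, and the NF4 settings follow the QLoRA recipe from the blog post rather than any exact script configuration.

```python
# Minimal sketch: load Llama 2 7B in 4-bit with bitsandbytes via transformers.
# Assumes transformers, bitsandbytes, and accelerate are installed, and that
# access to the gated meta-llama/Llama-2-7b-hf model has been granted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"

# NF4 4-bit quantization with bfloat16 compute, per the QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the available GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

The quantization happens on the fly as the shards are loaded, which is why it only takes a minute or two rather than requiring a separate conversion step.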
g5.xlarge is the smallest size in the G5 family. G5 instances come equipped with one to eight GPUs; I'm using this one, which has a single GPU and 16 gigs of RAM, and that's more than enough. If we look at that GPU, we see we've got an NVIDIA A10G with just under 24 gigs of memory. That's really much smaller than what you would get on P4 or P5 instances, and much more cost-effective.

So, what have I done here? I've simply cloned the TRL library from GitHub, which has the nice script I want to run. It's a really simple script for supervised fine-tuning of the Llama 2 7 billion parameter model on this cool dataset. It's got about 10K prompts, with questions, answers, and so on. That's a good example of the kind of prompt dataset you could build for your own projects, and it's a good size, probably more than we even need to fine-tune. We use the PEFT library, which implements LoRA and QLoRA, and we automatically quantize the model to 4-bit when we load it. That's about it. Feel free to go and read the script; as usual, I think it's really nice to read. But if that's a bit too much, just run it for now; I'll also include a condensed sketch of its moving parts below. In another video, I'll show you how to adapt your existing, let's say, vanilla Transformers code for PEFT and QLoRA. Give it a couple of days, and it should be ready.

For now, we'll just run the script. I'm going to start it to show you how long it takes: it takes a few hours. Normally, we would use nohup so that if we lose the connection to the console, the job keeps running in the background; that's how I run long-running scripts. Go read about nohup if this is the first time you've heard of it.

Let's take a quick look at the requirements for running this. Nothing strange, as you would expect: TRL, PEFT, bitsandbytes. The diffusers library is also required, although at the time of recording, I had to pin it to an older version. There's a bug with the latest diffusers: they moved an import from one file to another, and this breaks the script in TRL. This is pretty fresh, so I'm quite sure it'll be fixed at some point, but for now, I would advise you to stick with a slightly older diffusers. Keep in mind that this model is a gated model, so you will need to request access. Super simple: just click here, go to the Meta website, and within a day or two, they give you access.

Now that we've talked about all the tiny pitfalls, we can actually run this. It's important, because when people ask me, "Oh, it doesn't work, I tried it, why is it more complicated than in your video?", it's not; it's just these tiny things. Let's run it. I already downloaded the model, so we're loading directly from the cache; if you run this for the first time, it will download for a little while. Let me pause the video while the shards load, a minute or two, and we'll be right back.

Fine-tuning has now started. We see we fit happily in GPU memory, with some room to grow, maybe not enough to increase the batch size, but we definitely fit that large model in those 24 gigs. It looks like our training job is going to run for about 10 hours, which is interesting: we have about 10K prompts, so if the scaling is roughly linear, you could fine-tune on a thousand prompts in an hour, which is quite fast. Generally, I've seen customers getting great results with 500 to a thousand prompts.
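Here's that condensed sketch of the fine-tuning setup. To be clear, this is my own minimal reconstruction of what the TRL script does, not the script itself: the hyperparameters (LoRA rank, batch size, learning rate, sequence length) are illustrative assumptions, and the calls shown are the transformers, PEFT, and TRL APIs as of late 2023.

```python
# Condensed sketch of QLoRA supervised fine-tuning with TRL's SFTTrainer.
# Hyperparameters are illustrative, not the TRL script's exact defaults.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-7b-hf"
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

# Load the base model quantized to 4-bit (NF4) so it fits on a single A10G.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token

# LoRA adapter: only small low-rank update matrices are trained,
# the 4-bit base weights stay frozen.
peft_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1,
                         task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # guanaco samples are plain-text conversations
    max_seq_length=512,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="llama2-7b-qlora-guanaco",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,
        learning_rate=2e-4,
        num_train_epochs=3,
        logging_steps=10,
    ),
)
trainer.train()
```

In practice you would launch the actual script under nohup, as mentioned above, so the multi-hour job survives a dropped SSH session.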
10K is probably more than you need for initial experiments. We're going to run this for 10 hours, but obviously we're not going to wait for 10 hours, and I've already run it, so why don't I stop this job? I've got my full training log here, which I will include, and we can see the runtime is just about 10 hours. That's three epochs; if you want to do less, you can definitely do less. 10 hours is already a pretty significant job for 10K prompts and three epochs, and most of you will be happy just fine-tuning for maybe one or two hours. You can definitely do this with a tiny GPU instance compared to the P4 and P5 monsters.

I shared the model on the Hub; again, I'll put the link with some details, and you can absolutely experiment with it. I also have a short inference script, so why don't we try that? We can see here we're actually loading from the Hub; let me fix this for a second. The cool thing here is that it's a QLoRA model: if we look at the weights, we're just storing the adapter, which holds only the delta weights, so it's fast to download as well. We'll just grab the vanilla Llama 2 model, add the adapter weights to it, and that's it. We don't need to do anything special; the library is smart enough to know, "Hey, this is a QLoRA model, I need to grab the base model and add the weights." Then I'm just using a simple pipeline and printing out some answers. A condensed sketch of this quick test appears at the end of the transcript. Let's try this and see how it goes. The model was loaded, and I can see some answers here. Feel free to try it yourself. This proves that the model is working; of course, you would probably want to deploy it nicely in a Hugging Face Space or something similar, but this is just a quick test.

What about cost? Let's go back to that page. We can see g5.xlarge is about $1 per hour, so the total cost of that training job would be, let's say, $10. We can do better with Spot Instances, and if you've never heard about Spot, please go learn about it as quickly as you can. Let's look at historical Spot prices for g5.xlarge over, say, the last three months. On-demand is $1 and the Spot price is around $0.30, which means this fine-tuning job would now cost about $3. And if you had just a few thousand prompts, or you didn't fine-tune for three epochs, you can see how you could take this down to $2, maybe $1. So fine-tuning Llama 2, 7 billion parameters, for just $1 or $2 is something all of you out there can do.

Fine-tuning is not difficult. We have ready-made scripts for almost everything, we have sample datasets, and thanks to parameter-efficient fine-tuning and quantization, combined into QLoRA, you can run this on a tiny GPU instance for literally a few dollars. So there's no reason not to try it. You can fine-tune tens, hundreds of models on different datasets, different combinations, for a ridiculously low amount of money. So go and try that. And it looks like folks are picking up on this: if we look at the last week or so, we can see G5 Spot prices rising a bit, which tells me some folks out there know how to do this and have figured it out. They don't need to wait for, or pay for, P4 and P5 instances when they can do the same thing for silly money on G5. So get them cheap while you can.

That's pretty much what I wanted to tell you. Please read the blog post; I'll share all the links in the video description. I have a few more demos coming on LoRA and QLoRA, so watch this space. I hope this was useful. Until next time, keep rocking.
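As promised, here's a condensed sketch of the quick inference test. This is my own minimal reconstruction rather than the exact script from the video: the adapter repo name comes from the links above, the prompt follows the openassistant-guanaco conversation format, and the generation settings are illustrative. AutoPeftModelForCausalLM is the PEFT helper that reads the adapter config, downloads the base Llama 2 model, and applies the adapter weights on top, which is the "smart enough" behavior described in the transcript.

```python
# Minimal inference sketch: load the QLoRA adapter from the Hub; PEFT reads
# the adapter config, fetches the base Llama 2 model automatically, and adds
# the adapter (delta) weights on top.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "juliensimon/llama2-7b-qlora-openassistant-guanaco"
base_id = "meta-llama/Llama-2-7b-hf"  # gated: requires approved access

model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id,
    torch_dtype=torch.float16,
    device_map="auto",  # place the model on the available GPU
)
tokenizer = AutoTokenizer.from_pretrained(base_id)  # tokenizer from base model

# Prompt in the openassistant-guanaco conversation format.
prompt = ("### Human: Why are Amazon EC2 Spot Instances cheaper "
          "than on-demand instances?### Assistant:")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because only the adapter weights live in the repo, the download is small; the bulk of the bytes come from the base model, which is cached after the first run.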

Tags

Fine-tuning, QLoRA, Llama 2, Cost-effective ML, GPU Optimization