Hi everybody, this is Julien from Arcee. In this video, we're going to talk about training and fine-tuning. A lot of you are probably familiar with parameter-efficient fine-tuning techniques like LoRA and QLoRA, which were a really important step forward in making training and fine-tuning more efficient. However, those techniques have some shortcomings that we'll discuss. A few months ago, Arcee contributed to a new parameter-efficient technique called Spectrum. So in this video, we're going to start with a theoretical look at LoRA and Spectrum, and I will actually take you all the way back to singular value decomposition. Then I will show you what Spectrum is all about and why it is a better solution than LoRA. We'll look at some of the results from the research paper and also some of my own results, which pretty much reproduce what the paper says. Okay, so no SageMaker this time, but there will be math, so there's still time to run away. If not, let's get started.
To set the scene, this is really the problem we're trying to optimize. A typical model adaptation workflow involves several steps. Most of the time, you will start from a pre-trained model, a good model that you found maybe on Hugging Face, or maybe a model you pre-trained yourself. You need to push it through a number of steps to specialize it for the particular problem you want to solve. Usually, you want to add some domain knowledge to the model. Maybe you work in financial services, healthcare, or retail, and your company has good quality data on that particular domain. Running some additional pre-training on the model is usually the first step, called CPT (Continuous Pre-Training), and of course, you need a dataset for that. Then, generally, you want to teach the model how to answer your own questions and how to follow instructions; this is called instruction fine-tuning (IFT). Starting from a question-and-answer dataset (Q&A pairs) coming from your domain experts, FAQs, customer support, etc., you're going to teach the model that this particular set of questions should be answered in this particular way. It's about facts, brevity or length, bullet points or long sentences, etc. Finally, you may want to run alignment, which is a process where you start from a preference dataset containing a question, an accepted answer, and a rejected answer. The purpose is to teach the model that this is the preferred answer for this particular question. A lot of the time, the two answers are factually correct, so it's not so much about getting the facts right. It's more about tone of voice, using particular words, or a particular way of answering that your company and your users expect. So CPT, IFT, and alignment generally need to be run on your models. There are lots of good tools and libraries to do this, but obviously, you need datasets and a fair amount of compute. The larger the model and the more data you're working with, the more compute, time, and budget you're going to need. That's really the problem we're trying to optimize.
Building a great model, and the definition of great should be your own, not mine, involves many steps and they are compute-intensive and time-consuming. I cannot overstate how important this is, especially if you're going to do continuous pre-training. You need a lot of data, a lot of good quality data. For instruction fine-tuning and alignment, you need Q&A pairs. You don't need a million, but they need to be high quality, diverse, and represent the breadth and depth of questions the model is likely to receive. That's a lot of work. There are some ways around that. Synthetic data can help, maybe I'll do a video on that at some point. But generally, this is really where you should pay a lot of attention. And this is nothing those training techniques can really fix for you, so you have to do the hard work.
Now, when it comes to training, and I use training in the general sense, it could be pre-training, continuous pre-training, or fine-tuning, most of the time you have to pick between accuracy and cost efficiency. The traditional way to train and fine-tune models is full training or full fine-tuning, meaning you apply your dataset to the model repeatedly and update all the model parameters. So if you work with an 8-billion-parameter model, all 8 billion parameters may be modified by the training job. That means they need to be loaded into GPU memory, along with their gradients and optimizer state, so you need enough GPU memory to hold all of that. And most of the time, you want to work with the original precision: these days, most models are 16-bit, so BF16 or FP16, and each parameter takes two bytes. If you have billions of parameters, that's a fair amount of GPU memory. In fact, really large models may not even fit in a single GPU. Then you need to do distributed training on multiple GPUs, and that's all kinds of fun. It is compute-heavy and can be expensive: long-running jobs on expensive GPU instances, assuming you can even get them. It's no big secret that all the hyperscale clouds are pretty constrained on GPU capacity these days. You may not be able to get the amount of compute you need to get the job done, and maybe not at the price point you were expecting. That's a really big blocker.
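To make that concrete, here's a rough back-of-the-envelope sketch, assuming mixed-precision training with the Adam optimizer; the per-parameter byte counts are common rules of thumb, not exact figures, and they ignore activations and framework overhead.

```python
# Rough GPU memory estimate for full fine-tuning with Adam in mixed precision.
# The per-parameter byte counts below are rules of thumb, not exact figures:
# activations, framework overhead, and sharding strategies change the picture a lot.

def full_finetuning_memory_gb(num_params: float) -> float:
    weights = 2 * num_params        # BF16/FP16 weights: 2 bytes per parameter
    gradients = 2 * num_params      # BF16/FP16 gradients
    optimizer = 12 * num_params     # Adam: FP32 master weights + two FP32 moments
    return (weights + gradients + optimizer) / 1e9

print(f"8B model:  ~{full_finetuning_memory_gb(8e9):.0f} GB before activations")
print(f"70B model: ~{full_finetuning_memory_gb(70e9):.0f} GB before activations")
```

Even under these optimistic assumptions, an 8B model lands well beyond a single 80 GB GPU once activations are added, which is exactly why full fine-tuning quickly forces you into distributed training.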
To solve that, as I mentioned before, some more efficient techniques have been introduced, and they share the same idea: instead of updating all the model parameters, can we update just a subset or a fraction of them? The goal is to save memory, work with smaller GPUs that are less expensive and maybe easier to procure, and generally get close to the quality of full fine-tuning without the price tag. Techniques like LoRA and QLoRA, which we'll dive into, are the most well-known. The idea is that we learn only a small number of parameters; we don't update the full model. We can also apply quantization, which further shrinks the amount of accelerator memory required. If you've never heard of quantization, I would suggest you pause now; I've got a good video on that, or maybe go and watch it later. We're not going to dive too deep into quantization here, but basically, it shrinks, let's say, 16-bit parameters down to 4 bits. That's the 10-second explanation; there's more to it. So this is more memory-efficient: you can work with smaller GPUs, fit bigger models in the same amount of GPU memory, and hopefully train faster because you're training fewer parameters. That's the intuition. This is quite effective for instruction fine-tuning and alignment, the later stages in the model adaptation pipeline I showed you; that's mostly what folks use LoRA and QLoRA for. Unfortunately, for pre-training, LoRA and QLoRA don't work very well, and we'll talk a little bit about that.
The big question is, can we get both cost efficiency and accuracy? The spoiler is yes, and this is why we built Spectrum. But how do I convince you of that? I could run benchmarks all day and all night, and yes, we'll talk about benchmarks and I'll show you some numbers, but I think we need to dive deeper into those techniques and understand the theory behind them. Why can we train with fewer parameters and still get good results? To answer that, we need to talk about linear algebra and matrices. Before you go and watch, you know, dogs and cats videos instead, please stick with me. This is really important. I've tried to keep it as straightforward as possible, and hopefully, this will give you a deeper understanding of LoRA, QLoRA, and Spectrum, and why Spectrum is actually a better solution.
Before we actually start talking about LoRA and Spectrum, I want to talk about singular value decomposition (SVD). A lot of the ideas behind LoRA come from SVD, and I think explaining SVD first will help you understand LoRA better. Singular value decomposition is a matrix factorization technique. If you took linear algebra in college, you may remember eigenvalues, eigenvectors, and all that good stuff. It's the same idea here, except it generalizes to matrices of any size; they don't have to be square. There's also a geometric interpretation: a matrix is a linear transformation, and SVD breaks that transformation down into two rotations and one scaling operation. You can go and watch some really cool YouTube videos with good SVD visualizations, but that's not what I'm getting at here, and it's not essential to our discussion. What we need to focus on is that we're breaking down our matrix into a product of three matrices: U, Σ, and V transpose. Two of those matrices contain vectors: the columns of U are basis vectors for the column space, and the columns of V are basis vectors for the row space. In the middle, we have the singular values, which are scaling factors: Σ is a diagonal matrix with zeros everywhere except on the diagonal, where the singular values sit, sorted in descending order. That's super important, as we will see. So that's the decomposition. If you want a deeper math explanation, I cannot recommend enough the MIT linear algebra course from Professor Strang, specifically the lecture on SVD.
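In symbols, and specializing the shapes to the 1024-by-1024 example we'll use in a minute, the decomposition reads:

```latex
C = U \Sigma V^{\top}, \qquad
U, \Sigma, V \in \mathbb{R}^{1024 \times 1024}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_{1024}),
\quad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{1024} \ge 0
```

with U and V orthogonal matrices.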
So, we have this matrix broken down into a product of three matrices: vectors, singular values on a diagonal, and vectors. If you look at the shapes of those three matrices, you'll see they're the same as the original matrix. In terms of reducing the number of parameters, we did a horrible job here, because now, instead of 1024 by 1024 parameters, we have three times 1024 by 1024, with a bunch of zeros. We didn't reduce anything. But that's what the next slide is all about. The cool thing about SVD is that we can use it to approximate the original matrix. Remember, that Σ matrix in the middle was 1024 by 1024. We can keep only the top K singular values, where K is usually a small number, say 8. So we keep the 8 largest singular values on the diagonal, the first 8 columns, because they're sorted, and we throw away everything else. Now Σ becomes a K by K, here 8 by 8, matrix. For the product to still work, U needs to become 1024 by K, and V transpose needs to become K by 1024. The product of those three matrices is still a 1024 by 1024 matrix, but the number of parameters is much smaller. If we sum the parameters for U, Σ, and V transpose, we get 1024 (2 to the power of 10) times K each for U and V transpose, plus K squared for Σ. If we take K equal to 8, that's a total of 16,448 parameters, versus the original number of parameters in C, which was 1024 by 1024, or 2 to the power of 20, over a million. So take a deep breath: thanks to low-rank approximation, we can approximate that million-plus-parameter matrix C with only about 16,000 parameters, roughly 1.6%. Obviously, this is an approximation, so C_K is not strictly identical to C, but if we do a good job, we can get pretty close. We can measure the difference between the two matrices with a norm called the Frobenius norm. How about we run a bit of Python code to show you this in action? Then we'll talk about LoRA, and you will see why you had to suffer through SVD first.
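Written out, the truncated decomposition and the parameter count from the slide look like this (same 1024-by-1024 matrix, K = 8):

```latex
C \approx C_K = U_K \Sigma_K V_K^{\top}, \qquad
U_K \in \mathbb{R}^{1024 \times K}, \quad
\Sigma_K \in \mathbb{R}^{K \times K}, \quad
V_K^{\top} \in \mathbb{R}^{K \times 1024}

\#\text{params}(C_K) = 2 \cdot 1024 \cdot K + K^2
\;\overset{K=8}{=}\; 16\,384 + 64 = 16\,448
\;\approx\; 1.6\% \text{ of } 1024^2 = 1\,048\,576
```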
Here's a simple example, and all we need is NumPy. Let's start with a square 1024 by 1024 matrix. We generate a random matrix, compute the singular value decomposition of C, and get U, Σ, and V transpose. We print the shapes. Σ comes back as a vector, which is an optimization, since we don't need to store all those zeros. For the sake of working with matrices, I reconstruct the diagonal matrix with the singular values in descending order on the diagonal and zeros everywhere else. We keep only the top K values, and here K is 8. We do the same for U and V transpose. We print the shapes again, print the total number of parameters, reconstruct the C matrix by multiplying the low-rank U, Σ, and V transpose, and compute the Frobenius norm with NumPy. How difficult is that? Not very. Let's run this. The original matrix has a little more than a million parameters, 1024 by 1024. The decomposed form has U at 1024 by 1024, Σ as a vector of 1024, and V transpose at 1024 by 1024. Now I apply the low-rank approximation, keeping only the top eight singular values and the corresponding columns of U and rows of V transpose. U, Σ, and V become low-rank, and the total number of parameters is what we saw before. When we reconstruct the matrix and compute the Frobenius norm, we get a value of 291, just to show you that this is really what you saw on the slide; there's no difference. Let's try increasing the rank. If we set the rank to 64, all the shapes change, but look at the error: it's now 263, a bit lower than 291. This makes sense because we kept more of the singular values; we threw away less information, so the reconstruction is a little more accurate. Obviously, if we set the rank to 1024, you can guess what's going to happen: 1024 means we throw away nothing, we keep the original matrices as they are, and the norm is essentially zero, up to floating-point rounding error.
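For reference, here's a minimal NumPy sketch of what I just described; the variable names are my own and there's no fixed random seed, so your Frobenius norm values will differ slightly from the ones above.

```python
import numpy as np

n, K = 1024, 8

# Random square matrix standing in for a weight matrix
C = np.random.rand(n, n)

# Full SVD: U (n x n), S returned as a vector of singular values, Vt (n x n)
U, S, Vt = np.linalg.svd(C, full_matrices=True)
print(U.shape, S.shape, Vt.shape)

# Low-rank approximation: keep only the top-K singular values/vectors
U_k = U[:, :K]                 # n x K
S_k = np.diag(S[:K])           # K x K diagonal matrix
Vt_k = Vt[:K, :]               # K x n
params = U_k.size + S_k.size + Vt_k.size
print(f"low-rank parameters: {params}")   # 2*1024*8 + 64 = 16448

# Reconstruct and measure the approximation error
C_k = U_k @ S_k @ Vt_k
print("Frobenius norm of (C - C_k):", np.linalg.norm(C - C_k, "fro"))
```

Try changing K to 64 or 1024 and watch the error shrink, exactly as described above.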
Now that's SVD. See, it's not complicated. You get the intuition that we can approximate any matrix by a product of low-rank matrices. That rank, the common dimension, is a critical parameter. If you pick a value that's too low, you throw away so much information that it becomes impossible to reconstruct the matrix accurately. If you pick a high value, you may be able to reconstruct accurately, but you haven't saved much memory; you still have a ton of parameters compared to the original matrix. That's the trade-off. Now let's look at LoRA. LoRA follows the same intuition as SVD. When we fully fine-tune a model, we fine-tune all the parameters. LoRA says, "Well, I'm going to reduce the number of parameters by not training the full layer. In fact, I'm not going to train the layer at all. I'm going to train updates only." So we keep the base model unchanged and just learn the weight updates. This update is parameter-efficient because we learn it as the product of two low-rank matrices. We are not fine-tuning the full layer at all; we are just learning the product of two low-rank matrices, which is a much smaller number of parameters. Why this works is exactly why SVD works: it's an approximation, and the rank r is the critical parameter. For a pre-trained weight matrix W0 of size n by m, we constrain its update ΔW by representing it with a low-rank decomposition: the adapted weight is W0 + ΔW, where ΔW = BA, B is an n by r matrix, and A is an r by m matrix. The rank r, the dimension common to A and B, is a really small value, typically 4 to 32. Once we're done training, we still have the base model, plus what's called the adapter, which holds the weight updates for all the adapted layers. When we want to predict with the fine-tuned model, we load the base model, add the adapter, apply the updates, and voilà: we get one model in the end, same size, same inference latency, same memory consumption, etc. We save a ton during training because we are not training n by m parameters per layer; we are training r times (n + m), which is really tiny. Typically, LoRA will train 1% or sometimes less of the original parameters. We don't usually apply LoRA to all the model layers; this is configurable. Most of the time, LoRA is applied to the attention layers. The intuition is that attention layers pick up the relationships between tokens, so these are the ones that need to be fine-tuned. You could apply LoRA to the whole model, but generally it's mostly the attention layers. This is really easy to use: there's a good library from Hugging Face called PEFT (Parameter-Efficient Fine-Tuning), and you can run this stuff in a couple of lines of code. You may even reuse some of their existing scripts: just point at the model, point at the dataset, and you're good to go. LoRA has become extremely popular.
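For illustration, here's roughly what that looks like with the PEFT library. This is a minimal sketch: the base model name, target modules, and hyperparameter values are placeholder assumptions, not recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical base model, used only for illustration
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# LoRA configuration: rank r and the layers to adapt (here, attention projections)
lora_config = LoraConfig(
    r=16,                                   # the low-rank dimension discussed above
    lora_alpha=32,                          # scaling factor applied to the update BA
    target_modules=["q_proj", "v_proj"],    # attention layers only, the usual default
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the frozen base model with trainable low-rank adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically ~1% or less of the base model
```

Everything outside the adapters stays frozen; only the B and A matrices for the targeted modules are trained.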
There's a variant of LoRA called QLoRA, where the Q stands for quantization. We do exactly the same thing, except that we first quantize the model to a smaller data type. If you have a 16-bit model, you will load it in 8-bit or 4-bit and then apply LoRA. It's the exact same process, but as you can guess, if we shrink from 16 bits to 4 bits, we save even more GPU memory. That's why QLoRA has become a super-effective and inexpensive way to train large models on small GPUs. However, LoRA has a number of challenges. First, it might not be obvious which layers to apply LoRA to. Mostly we do the attention layers, but sometimes that's not enough, and if you do all layers, you might be wasting time or hurting the model. That choice is not really obvious. Another challenge is that LoRA comes with a couple of hyperparameters, the most important one being the rank. Picking that value might not be obvious either. You'll see a lot of scripts that use 16, but should you use 16? What do you gain if you use 8? What do you gain if you use 32? There's a trade-off between accuracy, training time, etc. The rank is really telling us how many dimensions, or directions, we keep when we train the model. If you look at those matrices as linear transformations, a matrix with 1024 columns acts on a 1024-dimensional space. If you use r equal to 32, you're only keeping 32 dimensions of the original space. If you take a value that's too small, you throw away too much information, and you may underfit the data or cause catastrophic forgetting. If you pick a high rank, you won't save as much GPU memory, and you may be fine-tuning the model in dimensions that are not so relevant, or noisy, potentially overfitting. Picking the value is all about compromise. One thing that always nagged me is why the same rank would work well for all layers. I don't have data to back that up, but especially for large models with 80 layers or more, different attention layers are going to learn different patterns. Why would the same rank work well everywhere?
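And here's a hedged sketch of the QLoRA variant with Hugging Face tooling: same adapters, but the base model is loaded in 4-bit NF4 first. Again, the model name and settings are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the base model weights to 4-bit NF4 at load time (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",          # hypothetical base model, for illustration
    quantization_config=bnb_config,
)

# Make the quantized model ready for training, then add 16-bit LoRA adapters on top
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=16, lora_alpha=32,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```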
A bigger problem, identified more recently, is that although LoRA does a good job at fine-tuning, it does not do a good job at pre-training. If you want to inject new knowledge into the model, to teach it a domain that wasn't really present in the original base model, LoRA and QLoRA are not going to work well. Our team actually wrote a blog post on this. Another recent finding, highlighted in a paper called "LoRA versus Full Fine-Tuning: An Illusion of Equivalence," is really interesting. It looks at the dimensions present in the fine-tuned models. Comparing a fully fine-tuned model with a LoRA fine-tuned model, the main finding is that LoRA fine-tuned models have what the authors call intruder dimensions. Remember those singular vectors, which form the basis for the row space or the column space? Well, LoRA models have net new dimensions that did not exist in the original model. Does this explain why LoRA can fail sometimes? Does this explain why LoRA models tend to forget?
Now let's talk about Spectrum. Spectrum makes a very different hypothesis. It's still trying to reduce the number of parameters we train, but Spectrum says that not all model layers contribute equally to the output and to the quality of the output. It defines a signal-to-noise ratio (SNR) for each layer. If a layer has a very high signal-to-noise ratio, the original signal comes out amplified with very little distortion. If it has a low signal-to-noise ratio, the output signal is noisy. The same intuition applies here. If that assumption holds, let's identify the layers that contribute most to the quality of the output and run full fine-tuning on those: no parameter reduction, no quantization, just full fine-tuning on these high-SNR layers. The other layers we don't even touch; we leave them as they are, no updates whatsoever.
Let's look at that signal-to-noise ratio. This takes us back to our singular values. For each layer, we run SVD, which gives us the decomposition with all the singular values on the diagonal in descending order. Once we have the singular values, we look at their distribution. The intuition is that if, for a given layer, some values are much higher than you would expect by chance, the dimensions those values correspond to have a disproportionate influence on the output. We want to keep those: if those dimensions are particularly important, they are probably the ones we want to fine-tune and adapt to our new domain. On the other hand, very small singular values have little influence; those dimensions of the embedding space have a tiny, negligible effect, so they're probably not worth fine-tuning. This tells us which layers to pick: if a layer has a lot of unexpectedly high singular values, it has a strong contribution to the output. What do I mean by high? We need a threshold, and the threshold comes from a statistical result called the Marchenko-Pastur distribution. This distribution gives us the bounds within which the singular values of a large random matrix are expected to fall. They are called λ- and λ+, and they depend only on the shape of the matrix (number of rows and columns) and an estimate of its noise standard deviation. We're looking for the very high singular values, so we set the threshold to λ+. Any value larger than λ+ falls outside the random-matrix distribution and is likely to be significant. Anything between λ- and λ+ is what you'd expect from randomness. Anything far to the right, higher than λ+, is signal and contributes heavily to the output.
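Here's a rough NumPy sketch of how such a threshold can be computed. This is my own illustrative version of the Marchenko-Pastur upper edge, not the exact code from the Spectrum repository, and estimating sigma from the matrix entries is an assumption on my part.

```python
import numpy as np

def marchenko_pastur_upper_edge(n_rows: int, n_cols: int, sigma: float) -> float:
    """Upper edge (lambda_plus) of the Marchenko-Pastur law, expressed on the
    singular-value scale for an n_rows x n_cols matrix whose entries have
    standard deviation sigma."""
    beta = min(n_rows, n_cols) / max(n_rows, n_cols)
    # The largest singular value of a pure-noise matrix concentrates around
    # sigma * (sqrt(n_rows) + sqrt(n_cols)) = sigma * sqrt(max dim) * (1 + sqrt(beta))
    return sigma * np.sqrt(max(n_rows, n_cols)) * (1 + np.sqrt(beta))

# Pure-noise stand-in for a weight matrix: almost no singular value should
# exceed the threshold. A trained layer with strong structure shows many more.
W = np.random.randn(1024, 1024) * 0.02
S = np.linalg.svd(W, compute_uv=False)          # singular values only
threshold = marchenko_pastur_upper_edge(*W.shape, sigma=W.std())
print(f"{np.sum(S > threshold)} of {S.size} singular values exceed lambda_plus")
```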
The signal-to-noise ratio itself is very simple to compute for a given layer. We split the singular values into those below the threshold, which we sum and put in the denominator, and those above the threshold, which we sum and put in the numerator, and we compute the ratio. A layer with a lot of very high singular values will have a large numerator and therefore a high signal-to-noise ratio. That's the definition of the SNR. So, after this mathy explanation, how does Spectrum actually work? It's that simple, and the code is actually very short. For each layer, we compute the singular value decomposition, which gives us all the singular values. We compute the threshold based on the layer's shape and its estimated noise standard deviation. We split the singular values into the high ones and the low ones, sum each group, compute the ratio, and do this for all layers. Once that's done, we keep only the layers with the highest signal-to-noise ratio, typically the top 25%. Spectrum outputs a config file (YAML) with the list of those layers, and that list is what we use in our training script. When we run the fine-tuning job, we start by freezing all the model layers and then unfreeze the top-SNR layers. We run full fine-tuning as usual, but only on those layers. It's a pretty mathy explanation for a very simple process. I encourage you to take a look at the code, and you'll see exactly this happening; it's one or two pages of very simple code. But it helps to understand why this works and what the intuition is: if you see high singular values for a layer, those dimensions have a disproportionate contribution to the output, and these are the ones you want to fine-tune, not the others, which have a negligible impact. And because we run full fine-tuning, we make no compromise on low rank or quantization: we give those important layers the full fine-tuning treatment. They were important before, and hopefully they will still be important in the fine-tuned model.
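Putting the pieces together, here's a simplified PyTorch sketch of that loop, following the description above rather than the exact Spectrum implementation; the sigma estimate and the choice to scan only 2-D weight matrices are my own simplifying assumptions.

```python
import torch

def layer_snr(weight: torch.Tensor) -> float:
    """Signal-to-noise ratio of one weight matrix, as described above: sum of singular
    values above the Marchenko-Pastur edge over the sum of those below it."""
    S = torch.linalg.svdvals(weight.float())
    n, m = weight.shape
    beta = min(n, m) / max(n, m)
    sigma = weight.float().std().item()                        # noise-level estimate (assumption)
    threshold = sigma * (1 + beta ** 0.5) * max(n, m) ** 0.5   # lambda_plus on the singular-value scale
    signal = S[S > threshold].sum()
    noise = S[S <= threshold].sum()
    return (signal / noise).item() if noise > 0 else float("inf")

def select_and_freeze(model, top_fraction=0.25):
    """Freeze everything, then unfreeze the top-SNR fraction of 2-D weight matrices.
    Note: scanning a full 8B model this way is slow; Spectrum does it once and caches the results."""
    snrs = {name: layer_snr(p.data) for name, p in model.named_parameters() if p.dim() == 2}
    keep = sorted(snrs, key=snrs.get, reverse=True)[: max(1, int(len(snrs) * top_fraction))]
    for p in model.parameters():
        p.requires_grad = False
    for name, p in model.named_parameters():
        if name in keep:
            p.requires_grad = True
    return keep
```

From there, training proceeds exactly like full fine-tuning, just with most parameters frozen.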
Using Spectrum is very simple. You can go and read the research paper, and please look at the code; it is pretty short. Using it is super nice: just run spectrum.py with the name of the model, which can be a model on Hugging Face or a model you've already downloaded, and the percentage of layers you want to keep. On the first run, if it's a model you haven't worked with before, Spectrum will compute the signal-to-noise ratio for each model layer. You'll get a file that looks something like the one on the left, with the list of layers and their signal-to-noise ratios. If you said, "Hey, I want to keep 25%," you also get a YAML file with the list of model layers that need to be unfrozen. What do you do with this? The easiest way is to run fine-tuning with Axolotl, which integrates Spectrum: you literally copy-paste that list of layers into your configuration file. If you want to run this with PyTorch or Hugging Face directly, there's a function in the Spectrum repo to unfreeze the appropriate layers. That's all there is to it.
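If you're not using Axolotl, applying that YAML yourself can look roughly like this. It's a sketch, not the actual helper from the Spectrum repo, and I'm assuming the file contains a top-level unfrozen_parameters list of name patterns; adjust to whatever your generated file actually contains.

```python
import re
import yaml
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # illustrative

# Assumed YAML layout: a top-level "unfrozen_parameters" list of regex patterns,
# one per high-SNR layer selected by Spectrum. The filename is hypothetical.
with open("snr_results_top25pct.yaml") as f:
    patterns = yaml.safe_load(f)["unfrozen_parameters"]

# Freeze everything, then unfreeze only the parameters matching Spectrum's selection
for name, param in model.named_parameters():
    param.requires_grad = any(re.search(p, name) for p in patterns)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable / total:.1%} of {total:,}")
```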
What are the results? Let's look at benchmarks. These are benchmarks from the research paper on Mistral 7B. We compare QLoRA (light orange), Spectrum 25% (dark orange), Spectrum 50% (red), and full fine-tuning (pink). We benchmark the fine-tuned models mostly on reasoning datasets, ARC, HellaSwag, MMLU, etc., and on a math dataset, GSM8K. If we look at them one by one, we see that on ARC, full fine-tuning is still the best by a small margin, and Spectrum 50% is very close and definitely higher than QLoRA. On GSM8K, Spectrum is very good; make a note of that because we'll see it again. QLoRA actually does pretty poorly. On HellaSwag, Spectrum 25% and Spectrum 50% are head-to-head with full fine-tuning, and it's about the same on MMLU. So although we are not really running full fine-tuning, only fine-tuning 25% to 50% of the layers, we get equal or sometimes even better results, and definitely much better results than QLoRA. The paper has the same benchmarks for Llama 3 8B, with very comparable results. We can see that Spectrum 25% is extremely competitive with full fine-tuning on most benchmarks. Again, on GSM8K, QLoRA does a terrible job and Spectrum does better than full fine-tuning. There may be a good reason for that: GSM8K is a math benchmark, and math is less forgiving than language. Language is a bit subjective, but math is right or wrong. Maybe the approximations in QLoRA actually hurt accuracy there.
These numbers are from the paper; go and read it. The authors also published memory usage and training time on Llama models. If you look at Table 1, which covers distributed training on Llama 3 8B, you can see that memory usage is pretty competitive between Spectrum and QLoRA. On a single GPU, QLoRA does better, but Spectrum still lets you train models that wouldn't fit in the GPU with full fine-tuning. If you look at training time, Spectrum 25% is actually faster than QLoRA, and Spectrum 50% is close and definitely faster than full fine-tuning. So you get training speed, reasonably good memory efficiency, and better accuracy. I tried to reproduce those numbers and put a huge screenshot in the slides to prove it. I ran those fine-tuning jobs on different configurations using SuperNova-Lite, which is a Llama 3.1 8B model. I used a single GPU for each job, one epoch, batch size one, and fine-tuned the model on the Alpaca dataset, using Axolotl and the EleutherAI evaluation harness. The two commands you see here are really all you need. If you want to reproduce this, it's no big deal; I'll probably put all my config files on GitHub and in the video description if you want to play with this.
What do we see? Full fine-tuning: no luck. Llama 3.1 8B doesn't fit on the GPU at all; I can't even load it for fine-tuning, so it's an out-of-memory error. LoRA, with a rank of 32: training took 48 minutes, and you can see the MMLU, GSM8K, and HellaSwag scores. QLoRA, quantizing the 16-bit model to 4-bit and still using r equal to 32: training is a little faster, and obviously there's a significant reduction in GPU RAM usage because 4-bit weights take less space than 16-bit. Interestingly, the benchmarks are a little higher than LoRA, which I didn't completely expect. Well done, QLoRA. Then I ran Spectrum 25%, keeping only the top 25% of layers with the highest SNR. Training time is 32 minutes, which is 31% faster than QLoRA. GPU RAM usage is a little higher, but only by 22%, so I could still fit this model and even bigger models in there. The MMLU score is higher, and the GSM8K score is much higher. Again, there's something here, especially on math; it shows up all the time. It looks like QLoRA on math datasets is not a really good idea, and full fine-tuning probably captures the finer-grained relationships needed for math outputs to be exact. That's a huge difference, 12 points. I ran it twice to double-check and got the same result. HellaSwag is a little higher too. Just for completeness, I also ran Spectrum 50%, which didn't improve on Spectrum 25%. Close, but no cigar. As you can see, Spectrum 25% is much faster than QLoRA. And although I only ran one epoch, imagine saving 31% on long-running jobs. That's a ton: 24 hours become 16 hours, three days become two days. Those are huge savings, and you get better accuracy. So that's pretty cool.
Spectrum is a really interesting technique: you literally get the best of both worlds, cost efficiency and accuracy. If you've wondered why Arcee models are that good and why we keep scoring very high on the Hugging Face leaderboard, well, Spectrum is one of the reasons. Go and read the research papers and review the slides; you'll find everything in the video description. Maybe run some of your own tests, but hopefully this will convince you to give Spectrum a try. Understanding the theory is sometimes really useful, and you have to dive a little deeper; we certainly did that today. Thank you very much for watching, and until next time, keep rocking.
Tags
Parameter-efficient fine-tuning, Spectrum technique, LoRA and QLoRA, Model adaptation workflow, Singular value decomposition