Deep Dive Model Merging with Arcee Fusion

Transcript

Hi everybody, this is Julien from Arcee. In this video, we're going to dive deep into Arcee Fusion, a model merging recipe implemented in our MergeKit library. First, I'll give you a high-level explanation of Fusion, and then, using the code and animations I built, we'll look at every single step of the Fusion process in detail. Finally, I'll run an example of Arcee Fusion with Llama models and use a small tool I wrote to inspect model layers, showing the differences between fine-tuning and merging with Arcee Fusion. This should be fun. Let's get started.
Before we dive into model merging, let's talk about a common problem with supervised fine-tuning: catastrophic forgetting. Catastrophic forgetting is a common reason why fine-tuning projects fail. In a nutshell, the issue is that the fine-tuning process destroys the knowledge of the original model. When we run supervised fine-tuning, we start from a base model and train it on one of our good datasets. However, instead of getting a fine-tuned model that maintains all the knowledge from the original model and adds specialized knowledge, we often end up damaging or destroying the general knowledge present in the model. This happens because of what I call "spurious updates," where fine-tuning updates every single parameter in the base model, even those that are unimportant or unrelated to the task.
Model merging is a better solution to model adaptation. We start from, say, two models and merge them into a single one, preserving the original knowledge and combining the special abilities that the two models had. Model merging is awesome. You don't need datasets, you don't need GPUs. It's a fast, very cost-efficient process, and you can even run it on your laptop. However, there are plenty of model merging recipes, and it's not always easy to get the performance you want. Let me talk about that for a second and introduce Arcee Fusion.
Model merging is a simple process, and writing the configuration files for the merging recipes is equally simple. But sometimes things don't work out. Let's look at an example where we start from a base model with very good general-purpose performance and want to improve specialized task performance. Usually, you do this through fine-tuning, which will improve specialized task performance but may hurt general-purpose performance, making the model unstable. A common technique is to merge the two models to try to get the best of both worlds: strong general performance and strong specialized performance. But sometimes you get neither and end up with a model that is worse than both originals. That's where Arcee Fusion comes in. It's a more efficient merging recipe that only merges important parameters. We'll explain what important parameters are and how to find them in the rest of this video.
Now, let's dive deep into Arcee Fusion. I encourage you to read the code, which you'll find in the MergeKit repo. Go to MergeKit, Merge Methods, and Arcee Fusion. It's about 100 lines of code, so it's fairly short, though not simple: there's a bit of math involved, and some lines are a bit difficult to understand. I'll break that down for you in the rest of the video. Feel free to read the code before watching the rest of the video, or read along as I explain the different steps. Your choice.
In the introduction, I mentioned that Arcee Fusion only focuses on merging important parameters. This is the key. Fine-tuning often fails because it tweaks every single parameter in the model. Fusion doesn't try to do that. Instead, it looks at the difference between the base model and the fine-tuned model, analyzes the updates, and decides if they're really important. If they're not, they're discarded, and only the important updates are merged.
In the end, we update fewer parameters in the base model, maintaining more of its general knowledge and adding only the updates to parameters deemed important. This should add the specialized knowledge and business value we want. The question now is, what is an important update? How do we identify important updates, and how do we apply them?
First, we need to understand what is an important update and what isn't. When comparing weights for the base model and the fine-tuned model, we need to determine if an update matters. The first step is simple: compute the difference between the base parameters and the fine-tuned parameters. Subtract one from the other and take the absolute value to get the magnitude of the change for every parameter, which is a non-negative value.
We keep these magnitudes and run a second process. We analyze the distribution of parameters for the base model and the fine-tuned model by applying the softmax function to each row of each tensor, turning the parameters into a probability distribution. For each row, we get a base distribution and a fine-tuned distribution. Once we've computed the distributions for each row in both models, we use KL divergence to understand how similar or different these distributions are. KL divergence is a non-negative value, where 0 means the distributions are identical and larger values mean they're increasingly different. For each row of the models, we have these KL values.
Now we can apply the fusion formula, which factors in the magnitude differences and the KL divergence values. By factoring in both, we capture local changes (tiny changes at the parameter level) and global changes (differences in the distributions). Consider a few scenarios. If updates are just random noise, we'll have very noisy magnitudes but low KL divergence, indicating low importance. On the other hand, if we see significant differences in magnitude and distribution, these parameters are probably of high importance on both the local and global levels.
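To make the scoring step concrete, here is a minimal NumPy sketch of the idea described above. It is not the MergeKit code: the function names, the direction of the KL divergence, and the choice to multiply the magnitude by the row's KL value are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def importance_scores(base, tuned, eps=1e-12):
    """Hypothetical scoring: per-parameter magnitude times per-row KL."""
    # Local signal: absolute size of each parameter update.
    magnitude = np.abs(tuned - base)

    # Global signal: turn each row into a probability distribution and
    # measure how far the fine-tuned row moved away from the base row.
    p = softmax(tuned)
    q = softmax(base)
    kl_per_row = np.sum(
        p * (np.log(p + eps) - np.log(q + eps)), axis=-1, keepdims=True
    )

    # Assumption: combine the two signals by multiplication. The row's KL
    # value is broadcast across every parameter in that row.
    return magnitude * kl_per_row


rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8))
tuned = base.copy()
tuned[1] += rng.normal(scale=0.5, size=8)  # only row 1 was "fine-tuned"

scores = importance_scores(base, tuned)
print(scores.round(4))  # rows 0, 2 and 3 score exactly zero
```

With this combination, an update needs both a visible magnitude and a shift in its row's distribution to score highly, which matches the local-plus-global reasoning in the explanation.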
If we see low values for both magnitude and KL divergence, this is probably just noise or tiny changes we don't need to account for. This is importance scoring, and it's crucial to combine local and global differences. Local only would miss the big picture, and global only would miss tiny nuances in parameter updates. By combining these two metrics, we get a better understanding of what really changed when we fine-tuned the model.
The next question is, what do we keep? Where is the threshold between updates we want to keep and merge, and updates we want to discard? We need to find the threshold for this particular job, for this base model and this fine-tuned model, so we use dynamic thresholding. We take the importance scores for all model parameters and compute a distribution. We see that a lot of updates have low importance, while only a fraction have high scores. We apply quantiles to break down this distribution. For example, in a sample distribution, we might have Q25 equal to 0.12, meaning 25% of parameters have an importance score lower than 0.12, and Q90 equal to 0.78, meaning 10% of parameters have an importance score higher than 0.78. These are probably the ones we want to keep.
Instead of just relying on these quantiles, we run interquartile range (IQR) analysis, where we look at Q75 and apply an outlier factor to compute a new threshold. The final threshold we apply is the minimum of that value and Q95. For example, it might be 0.89. Any parameter with an importance score of 0.89 or higher is considered an important update.
The next step is to look at outliers. In this case, outliers are the ones we want. These very high importance parameters are the ones we apply. Generally, they are a small fraction of the model with extremely high importance scores. The goal is to focus on the most important changes and ignore everything else.
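The thresholding rule described above can be sketched in a few lines of NumPy. This follows the description in the video rather than the MergeKit source: the function name is hypothetical, and the outlier factor of 1.5 is an assumption (it is the conventional value in IQR outlier analysis, but the video does not state the value used).

```python
import numpy as np


def dynamic_threshold(scores, outlier_factor=1.5):
    """Hypothetical threshold: min(Q75 + factor * IQR, Q95)."""
    q25, q75, q95 = np.quantile(scores, [0.25, 0.75, 0.95])
    iqr = q75 - q25  # spread of the middle half of the scores

    # Candidate threshold: scores beyond this point count as outliers.
    candidate = q75 + outlier_factor * iqr

    # Cap the candidate at Q95, as described in the explanation.
    return float(min(candidate, q95))


# Worked example: 101 evenly spaced scores 0.00, 0.01, ..., 1.00.
scores = np.arange(101) / 100
# Q25 = 0.25, Q75 = 0.75, so IQR = 0.5 and the candidate is 1.5.
# Q95 = 0.95 is smaller, so it becomes the threshold.
threshold = dynamic_threshold(scores)
print(threshold)
```

Because the rule is built from quantiles of the scores themselves, the threshold moves with each base and fine-tuned pair instead of being a fixed constant.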
Thanks to this dynamic threshold technique, we can adapt to very different fine-tuning jobs. For example, if you do very dense fine-tuning, you'll have a lot of changes, and the threshold will likely be a bit lower. If you did just a tiny bit of fine-tuning, you probably updated only a few parameters, and the threshold might be higher. Obviously, you could have middle-of-the-road scenarios. Because the threshold is computed from the distributions and quantiles of the scores themselves, it adapts to the specific scenario you're working on, however the fine-tuned model was trained.
The hardest part is over. It's easy from now on. Now, let's see how we apply the updates and then do a demo. The math pays off because we now have a threshold and can decide what to apply and what not to apply. We define a mask to select the parameter updates we want to apply. We look at all the importance scores; anything at or above the threshold goes in, and anything lower stays out. That gives us the parameter update mask. Next, we select which parameters in the base model need to be replaced by their fine-tuned equivalents. We cherry-pick the fine-tuned parameters that go into the merged model and keep the base parameters where they're not selected.
This is how Arcee Fusion works. We select what we want from the fine-tuned model based on the mask, itself based on the importance scores, and only keep what matters. We preserve the fine-tuned capabilities and the base knowledge, avoiding averaging just for the sake of averaging.
Let's quickly recap everything and take a quick look at the code. Starting at the execute method, we load the tensors, compute the importance scores, then compute the dynamic threshold. Once we have the threshold, we compute the mask. Once we have the mask, we merge the two models.
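The mask-and-select step is small enough to show in full. Here is a minimal NumPy sketch, assuming the importance scores and the threshold have already been computed; the function name `fuse` and the toy values are hypothetical and are not taken from MergeKit.

```python
import numpy as np


def fuse(base, tuned, scores, threshold):
    """Keep fine-tuned values where the score clears the threshold."""
    # Boolean mask: True marks an update that is important enough to apply.
    mask = scores >= threshold

    # Cherry-pick: take the fine-tuned value where the mask is True,
    # otherwise keep the base value untouched. No averaging involved.
    merged = np.where(mask, tuned, base)
    return merged, mask


base = np.zeros((2, 3))
tuned = np.ones((2, 3))
scores = np.array([[0.10, 0.95, 0.20],
                   [0.90, 0.05, 0.89]])

merged, mask = fuse(base, tuned, scores, threshold=0.89)
print(merged)       # rows: [0, 1, 0] and [1, 0, 1]
print(mask.mean())  # fraction of parameters replaced: 0.5
```

Each merged parameter is either the base value or the fine-tuned value, never a blend, which is what distinguishes this selection step from averaging-based merges.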
It's not that difficult, and if you want to drill down into each function, you'll see it works exactly as described.
Let's do a quick demo now. Here's my Fusion example. I'm going to fuse Llama 3 SQLCoder 8B, a variant of Llama 3.1 8B. The config file is simple. Here's the base model, and this is the fine-tuned model we want to fuse. That's all there is to it. So let's go and fuse it. We use MergeKit, the config file, and an output directory. It will run for a minute or two. I'll pause the video and be back in a sec. Fusion is done. It took about four minutes on my Mac, so it's still fairly fast.
Now we've fused the two models. As a final step, let's compare the fused model with the fine-tuned model. I'll use a small tool I implemented in MergeKit. The pull request is still open, but I'll include it in the video description if you want to grab it. We'll compare the weights between the base model and the fine-tuned model, and between the base model and the fused model, to see how many are different. We expect to see far fewer updates in the fused model. Let me run the tool and show you the results. This is how the tool works: MergeKitDiff, BaseModel, FineTunedModel. Let's launch this. You can also do BaseModel, FusedModel. The tool will load each tensor in the two models, count how many weights are different, and compute KL divergence on those tensors. It will run for a minute. I'll pause and show you the results.
Looking at the analysis between the base model and the fine-tuned model, we see that on average, 98%, almost 99%, of the weights in each layer are different, with a median of 99%. The minimum difference was 81%. This shows that fine-tuning touches almost everything in the base model. The KL divergence values confirm this. Comparing the Llama base model and the fused model, we see that the average difference is only 10%. Fusion is discarding a ton of updates from the fine-tuned model and only keeping the most significant ones, about 10% in this case.
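The kind of comparison described here can be approximated with a short NumPy sketch. This is not the tool from the pull request: the function names are hypothetical, plain dictionaries of arrays stand in for real model checkpoints, and the KL computation (row-wise softmax, averaged over rows) is an assumption.

```python
import numpy as np


def row_softmax(x):
    """Numerically stable softmax over the last axis."""
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def compare_tensors(a, b, eps=1e-12):
    """Return (fraction of weights that differ, mean row-wise KL)."""
    changed = float(np.mean(a != b))
    p, q = row_softmax(b), row_softmax(a)
    kl_rows = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return changed, float(np.mean(kl_rows))


def compare_models(model_a, model_b):
    """Compare two models given as {tensor name: array} dictionaries."""
    return {
        name: compare_tensors(model_a[name], model_b[name])
        for name in model_a
    }


rng = np.random.default_rng(0)
base = {"layer0": rng.normal(size=(4, 8)), "layer1": rng.normal(size=(4, 8))}
fused = {name: tensor.copy() for name, tensor in base.items()}
fused["layer1"][0, :4] += 0.5  # touch 4 of the 32 weights in layer1

report = compare_models(base, fused)
for name, (changed, kl) in report.items():
    print(f"{name}: {changed:.1%} of weights differ, mean KL = {kl:.6f}")
```

In this toy example, `layer0` is reported as unchanged and `layer1` shows 12.5% of its weights modified, the same kind of per-layer summary the video reads out for the real models.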
Some layers are unmodified, which is what fusion is all about. It discards irrelevant noise and tiny changes that might damage the model. When we look at KL divergence, it's an order of magnitude lower than in the previous comparison. There, the median was around 0.01, while for the fused model, it's around 0.001. This indicates that the distributions of the base model and the merged model are much closer when using fusion than with traditional fine-tuning, which should preserve more of the original model's performance.
The last thing we should do, but I'll stop here for time, is evaluate the two models. Run some general-purpose benchmarks, such as MMLU, to see if the fused model is closer to the base model's performance. Also, run a code or SQL benchmark on the fused model to see if it performs better than the fine-tuned model. As an exercise, it'd be interesting to see the results. If someone does it, I'd love to hear about it. I may do it as well and put it on my to-do list.
If you made it this far, I salute you. Congratulations. This was a lot of work with the animations and everything. I hope you liked it. Leave a comment if you did, and if you liked the animations too. Thank you so much for watching, and more coming as always. Until next time, keep rocking.
