Model merging is an increasingly popular technique for adding or removing capabilities from transformer models without additional training.
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️
In a previous video (https://youtu.be/cvOpX75Kz4M), we introduced model merging and studied several merging algorithms implemented in the mergekit library (https://github.com/arcee-ai): model soups, SLERP, Task Arithmetic, TIES, DARE, and Franken-merging.
This new video builds upon the previous one and explores new merging methods: model breadcrumbs, model stock, and DELLA. We also take a quick look at model merging in Arcee Cloud, which you can run as part of the free tier! For a deeper look, please watch https://youtu.be/xObTcelxb24
Slides: https://fr.slideshare.net/slideshow/julien-simon-deep-dive-model-merging/270921708
00:00 Introduction
01:20 Model breadcrumbs
10:15 Model stock
19:55 DELLA
30:20 Merge models for free in Arcee Cloud!
Sign up for Arcee Cloud at https://www.arcee.ai, and please follow Arcee.ai on LinkedIn to stay on top of the latest Small Language Model action! https://www.linkedin.com/company/99895334
#ai #aws #slm #llm #openai #chatgpt #opensource #huggingface
Transcript
Hi, everybody. This is Julien from Arcee. A little while ago, I did a deep dive video on model merging, and we looked at the different techniques implemented in the MergeKit open-source library. Since then, a bunch of new methods have been added. So in this video, we're going to look at those new techniques that are now available. Specifically, we're going to study model breadcrumbs, model stock, and Della. As a refresher, we'll take a quick look at how you can do managed model merging in Arcee Cloud. Okay, let's get started.
The techniques we're going to discuss today are implemented in MergeKit, which you can find on GitHub. The creator of MergeKit is now working for Arcee. We covered MergeKit in the first part of this deep dive, so I won't go through that again. If you keep going down, you can see the list of methods. In the first part of the deep dive, we covered model soups, SLERP, task arithmetic, TIES, and pass-through. Today, we're going to discuss the latest ones: model breadcrumbs, model stock, and Della.
The first technique we're going to discuss today is called model breadcrumbs. It was released at the end of last year and is an improvement on the task arithmetic technique. In task arithmetic, we start from a base model, fine-tune it on different datasets and tasks, and get a collection of fine-tuned models. Then, we create task vectors by subtracting the base weights from the fine-tuned weights, and we combine those differences. Breadcrumbs starts the same way: it computes a task direction, which is the same thing as a task vector, by subtracting the base weights from the fine-tuned weights of each fine-tuned model. We do this for all the fine-tuned models and get those differences, as shown with the green, yellow, and red weights.
Where breadcrumbs differs from task arithmetic is that we ignore, or mask, a percentage of the weights. We look at the green, yellow, and red distributions of those differences, layer by layer, and we drop both the tiny values and the very large outliers according to two hyperparameters called beta and gamma. Setting a difference to zero means ignoring the effect of fine-tuning for that parameter: if the difference between the fine-tuned weight and the base weight is zero, the fine-tuned weight is equal to the base weight. So we zero out the negligible differences as well as the extreme outliers, and keep the informative middle of the distribution. Then, we merge the surviving weights by adding them to the base model. The base model gets updated by the different fine-tuned variants in different ways, as different layers and weights may have survived.
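To make the masking step a bit more concrete, here is a minimal NumPy sketch of the breadcrumbs idea for a single layer. It's an illustration only, not MergeKit's implementation, and the exact convention for beta and gamma (which fraction gets dropped at each end of the distribution) is my own assumption here.

```python
import numpy as np

def breadcrumbs_mask(base, finetuned, beta=0.9, gamma=0.01):
    """Illustrative sketch of the breadcrumbs masking step for one layer.
    beta  : fraction of smallest-magnitude differences to drop (assumed convention)
    gamma : fraction of largest-magnitude differences (outliers) to drop
    """
    delta = finetuned - base                        # task direction for this layer
    magnitude = np.abs(delta)
    lower = np.quantile(magnitude, beta)            # drop everything below this
    upper = np.quantile(magnitude, 1.0 - gamma)     # drop everything above this
    mask = (magnitude >= lower) & (magnitude <= upper)
    return delta * mask                             # the surviving "breadcrumbs"

def merge_breadcrumbs(base, finetuned_models, weights=None, beta=0.9, gamma=0.01):
    """Add the masked task directions of several fine-tuned models to the base."""
    weights = weights or [1.0] * len(finetuned_models)
    merged = base.copy()
    for w, ft in zip(weights, finetuned_models):
        merged += w * breadcrumbs_mask(base, ft, beta, gamma)
    return merged
```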
This method generally outperforms task arithmetic when merging the same number of models, and the paper shows benchmarks for that. More interestingly, it scales better. If you have 100 tasks, you would fine-tune 100 models and then need to find their relative weights when merging, which means optimizing 100 hyperparameters, and that's challenging. Breadcrumbs addresses this by showing that the hyperparameters for merging the masked variants are very stable. The authors ran hyperparameter optimization on 10 tasks to find beta and gamma values that maximized accuracy on an evaluation set. They then froze those hyperparameters, merged more tasks, up to 200, and saw performance continue to improve. This means you can optimize accuracy on a limited set of tasks, freeze the hyperparameters, and extend them to a much larger number of tasks. This is particularly useful for computer vision.
Another benefit is that merging generally improves the performance of fine-tuned models. For example, the paper looks at T5-base fine-tuned on four GLUE tasks (MRPC, RTE, etc.). Zero-shot, the base T5 model scores 74.8 on MRPC, and fine-tuning specifically for this task increases that to 87.9. Merging six other T5 models trained on different datasets, such as IMDB, further increases the performance on MRPC. This often happens when you merge a fine-tuned model with additional off-the-shelf models trained on other datasets: generalization improves. We see this in the benchmarks for MRPC, RTE, CoLA, and SST-2, where the performance of the fine-tuned model increases after merging additional T5 models.
So, breadcrumbs is task arithmetic with the addition of dropping tiny and large weights, leading to better models and a more scalable way to merge a large number of models without extensive hyperparameter optimization.
Let's move on to the next method, model stock. Model stock is a completely different approach, purely mathematical, quite clever, but a bit harder to grasp. The goal is the same: merge a collection of fine-tuned models optimally to maximize accuracy. The authors started by analyzing fine-tuned weights and comparing them to the base model, and they found that fine-tuned weights obtained with different random seeds all lie on a thin shell in weight space. In 3D terms, imagine the base model as a vector in space; fine-tuning it multiple times gives you vectors pointing to a thin surface, like the shell of a sphere. That surface has a center, and the vector pointing to the center is likely the optimal one.
Averaging fine-tuned models, as in model soups or task arithmetic, moves the vector closer to the center. Model stock leverages this by finding the center using fewer models. Instead of averaging 100 fine-tuned models, you can use just two or three to find a good approximation of the center. This is more compute-efficient because you don't need a large collection of fine-tuned models, and you can merge just a few to find the optimal center.
Model stock can be run during training, where periodic merging during fine-tuning is even more effective, or post-training, by grabbing a couple of fine-tuned models and merging them. Benchmarks show that model stock matches the accuracy of plain averaging while producing a model that is more robust to out-of-distribution data, i.e., it generalizes better.
The key intuition is that fine-tuned vectors point to a common surface, and the center of that surface is the optimal point. You can find this center with just a few models, making the process more efficient.
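As a rough sketch of that geometric intuition, here is what the per-layer computation could look like: measure the angle between the fine-tuned task vectors and use it to interpolate between the base weights and the average of the fine-tuned weights. The interpolation formula below is my reading of the model stock paper, so treat it as an assumption and check the paper for the exact expression.

```python
import numpy as np

def model_stock_layer(base, finetuned_list):
    """Sketch of model stock for one layer: interpolate between the base
    weights and the average of k fine-tuned weights, with the ratio set by
    the angle between the fine-tuned task vectors (formula assumed from the
    paper; verify before relying on it)."""
    k = len(finetuned_list)
    assert k >= 2, "model stock needs at least two fine-tuned models"
    deltas = [ft - base for ft in finetuned_list]
    # approximate cos(theta) by averaging pairwise cosines between task vectors
    cosines = []
    for i in range(k):
        for j in range(i + 1, k):
            a, b = deltas[i].ravel(), deltas[j].ravel()
            cosines.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    cos_theta = float(np.mean(cosines))
    t = k * cos_theta / ((k - 1) * cos_theta + 1.0)   # interpolation ratio toward the center
    w_avg = np.mean(finetuned_list, axis=0)
    return t * w_avg + (1.0 - t) * base
```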
Let's move on to the last technique, Della. Della stands for drop and rescale via sampling with magnitude. It's similar to the TIES method but with a major difference. It starts by computing the delta parameters for each fine-tuned model (fine-tuned weights minus base weights). For each layer, a drop probability is assigned to each parameter, and parameters are dropped probabilistically. The drop probability is inversely proportional to the magnitude, so tiny parameters have a high probability of being dropped, and large parameters have a low probability.
After dropping, the surviving parameters are rescaled to compensate for the deleted ones (that's the "rescale" in the name), and then selected for merging using the same sign election technique as in TIES: parameters that don't align with the dominant direction are dropped. Finally, the survivors are fused by averaging, which gives us the merged model. Della performs well in the benchmarks, outperforming previous techniques like task arithmetic and TIES, especially when merging multiple models.
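Here is a compact, illustrative sketch of those steps for a single layer: magnitude-based probabilistic dropping with rescaling, TIES-style sign election, and averaging of the survivors. The drop-probability schedule and the function names are assumptions made for illustration, not MergeKit's actual DELLA code.

```python
import numpy as np

rng = np.random.default_rng(0)

def magnitude_sample(delta, density=0.4, epsilon=0.1):
    """Illustrative magnitude-based sampling: keep roughly `density` of the
    delta parameters, giving larger-magnitude entries a higher chance of
    surviving, then rescale survivors (like dropout) to preserve scale."""
    magnitude = np.abs(delta).ravel()
    ranks = magnitude.argsort().argsort() / max(magnitude.size - 1, 1)   # 0 = smallest
    keep_prob = np.clip(density - epsilon + 2 * epsilon * ranks, 1e-6, 1.0)
    keep = rng.random(magnitude.size) < keep_prob
    survivors = delta.ravel() * keep / keep_prob        # rescale by 1 / p(keep)
    return survivors.reshape(delta.shape)

def della_merge(base, finetuned_list, density=0.4):
    """Drop probabilistically, elect signs as in TIES, then average survivors."""
    deltas = np.stack([magnitude_sample(ft - base, density) for ft in finetuned_list])
    elected_sign = np.sign(deltas.sum(axis=0))          # dominant direction per parameter
    agrees = (np.sign(deltas) == elected_sign) & (deltas != 0)
    fused = (deltas * agrees).sum(axis=0) / np.maximum(agrees.sum(axis=0), 1)
    return base + fused
```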
Merging also improves the accuracy of the individual models on their own tasks. For example, the language model scores 80.8 on AlpacaEval, but merging it with the math model increases the score to 81.8. Similarly, the math model scores 63.5 on GSM8K, and merging improves that result as well. This shows that merging is not just a quick hack but a way to build high-quality models that do better on individual tasks while handling multiple tasks in the same model.
Let's take a quick look at model merging in Arcee Cloud. Arcee Cloud has a free tier that lets you run unlimited merges for free. You create an account, go to the merging tab, create a merge, give it a name, pass the YAML file for MergeKit, and launch it. Arcee Cloud also has a Python SDK for this process. All the links are in the video description.
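For reference, here is a hedged sketch of what such a MergeKit YAML configuration could look like, written out from Python for convenience. The model names are placeholders, and the exact merge_method and parameter names should be double-checked against the MergeKit documentation before launching a merge.

```python
# Sketch of a MergeKit configuration for a DELLA-style merge, written to the
# YAML file you would upload to Arcee Cloud (or use locally with mergekit).
# Model names below are placeholders; verify parameter names in the MergeKit docs.
import yaml  # requires the pyyaml package

config = {
    "merge_method": "della",
    "base_model": "mistralai/Mistral-7B-v0.1",              # placeholder base model
    "models": [
        {"model": "your-org/mistral-7b-code-ft",            # placeholder fine-tune #1
         "parameters": {"weight": 0.5, "density": 0.5}},
        {"model": "your-org/mistral-7b-math-ft",            # placeholder fine-tune #2
         "parameters": {"weight": 0.5, "density": 0.5}},
    ],
    "dtype": "bfloat16",
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```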
So, that's what I wanted to show you today. Hopefully, your brain didn't explode. If you're interested in the math, read the papers. They're pretty interesting. I'll see you soon with more content. Until then, my friends, you know what to do, keep rocking!