Julien Simon @ AIDevTLV 2024: Unleashing the Power of Small Open-Source Language Models
January 01, 2025
AIDevTLV 2024
aidevtlv.com
Israel's largest conference for LLM App developers
Powered by EventHandler and AT&T
Unleashing the Power of Small Open-Source Language Models
As AI developers, we are constantly exploring better ways to build intelligent systems that cater to real-world needs. A significant transition is taking place in Generative AI from large-scale, closed-source models to smaller, open-source alternatives. In this keynote, we will not only delve into the benefits of small open-source language models but also provide practical insights on how to utilize them for domain adaptation, cost-efficiency, and improved privacy.
Firstly, small open-source language models can be easily customized to specific domains with minimal effort and resources, unlocking their full potential for real-world applications. By utilizing community-driven models, datasets, and libraries, developers can tailor their models to address unique business or organizational needs. Secondly, open-source language models provide significant cost savings without compromising performance. Unlike closed models and their rigid, one-size-fits-all approach, open-source models offer flexibility. They allow you to select the best small model for the task at hand, adapt it, and deploy it on cost-effective hardware, making AI development more adaptable and agile. Lastly, by keeping model adaptation in-house, organizations can ensure complete control over their intellectual property and sensitive data. This approach contrasts with using closed models, which may require sending sensitive company and user data to third-party APIs, potentially weakening your security and compliance posture.
Join us as we guide you toward a more effective path to AI innovation!
Julien Simon
Chief Evangelist @ Arcee.ai
Julien Simon, the Chief Evangelist at Arcee.ai, is dedicated to helping enterprise clients develop top-notch and cost-efficient AI solutions using open-source small language models. With over 30 years of tech experience, including more than a decade in cloud computing and machine learning, Julien is committed to daily learning and is passionate about sharing his expertise through code demos, blogs, and YouTube videos. Before joining Arcee.ai, he was Chief Evangelist at Hugging Face and Global Technical Evangelist at Amazon Web Services. He also served as CTO at prominent startups.
Transcript
All right. Good morning, my friends. It was quite an adventure getting here. I'll write the blog post, but I'm super happy I made it. It's an honor to be here and it means a lot to me. And apologies for not speaking Hebrew. I will learn at some point, but if I stick to English, I think I'll get my message across, right? Thank you so much for putting together this amazing event. I'm really, really glad I'm here.
So my name is Julien. I am the chief evangelist for a startup called Arcee. In this session, we're going to talk about one thing: small open-source language models and why they're the number one thing you should be looking at instead of those large, closed models. I spend a lot of time talking to enterprise customers, people who want to build real-life products and applications with AI. This is the one-slide summary of what they want. Number one is maximum privacy, security, and compliance, whether it's the model, the data, or their customers' data or personal information. They don't want to share it with anyone else. "No, no, no, we don't train on your data." Well, maybe you don't, maybe you do, but I'm not taking the chance.
The second thing they want is the best model for each use case. The Swiss Army knife model that is supposed to work everywhere, for every use case in every industry with every language, etc., is a lie. People have realized this by now. They want the best model for the job every time, with different options. Some use cases are small scale, some are large scale, some have high ROI, some not so much. The one-size-fits-all model and the one-size-fits-all infrastructure that supports the model is a terrible option.
Because these folks have real-life problems in their industry and business, they need to design for ROI. They're not trying to build stuff just because it's funny or cool or to bump their resume. They want to make money or save money with that AI solution. Closed models don't deliver. If you don't agree, wait for me after the session, and we can discuss for two hours. But trust me, they don't. Or you can say, "Oh, yeah, you're absolutely right," and we can have a friendly coffee together. No yelling necessary.
I'm sure you've seen this. The open-source community has been busy over the last couple of years, building better and better models that catch up to the best closed models out there. Maxime, if you're listening, could you please update this because it stops at July, and we've had quite a few great models since then. Now it's fair to say you can match or even outperform the largest models out there with open-source models that are just a fraction of the size. These are the baseline models, the off-the-shelf general-purpose models, which are nice. They're a great starting point. But how can we do better than that?
We do better by adapting the models to our particular use case or problem. This is probably how you do it today. You start from a pre-trained model, maybe from Hugging Face, and apply some additional pre-training to inject domain knowledge in financial services, healthcare, or whatever industry you're in. Then you'll probably run instruction fine-tuning to tweak the Q&A behavior of the model, helping it understand how you want your questions answered. Lastly, you run alignment for tone of voice and safety, teaching the model about human preferences: this is a good answer, this is a better answer, so please answer this way instead. There's nothing particularly wrong with that. A lot of good models and custom models are built like that, and we've built a lot of models that way. The papers, tools, and Hugging Face libraries are all out there to let you do it. But that's the state of the art from two years ago, so there are still a few problems.
Building datasets is super difficult. It's a lot of hard work. You need a ton of data if you want to do pre-training. I'm not even talking about initial training. I'm talking about continuous pre-training. We advise customers to have at least 1 billion tokens. So you could say, "Yeah, it's Israel. We have 100 billion. No big deal." I know you guys. But for a lot of organizations, 1 billion tokens is quite a bit of data to prepare. So not so easy. For instruction fine-tuning and alignment, you need very good question and answer pairs. This is the last round of training you're going to do to the model. They need to be high quality and diverse because different people will ask the same question differently. It's a fair amount of work.
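To make the Q&A data concrete, here is a minimal sketch of what instruction fine-tuning records might look like in the common chat "messages" format; the questions and answers below are invented examples, and the same intent is phrased two different ways to illustrate the diversity point.

```python
import json

# Hypothetical instruction fine-tuning records in the common chat
# "messages" format. The same intent is phrased two different ways:
# diversity like this helps the model handle real user variation.
records = [
    {"messages": [
        {"role": "user", "content": "What is our revenue recognition policy?"},
        {"role": "assistant", "content": "Revenue is recognized when the service is delivered."},
    ]},
    {"messages": [
        {"role": "user", "content": "When do we book revenue for a contract?"},
        {"role": "assistant", "content": "Revenue is recognized when the service is delivered."},
    ]},
]

# One JSON object per line (JSONL) is the format most fine-tuning
# tools accept.
jsonl = "\n".join(json.dumps(r) for r in records)
print(len(jsonl.splitlines()))  # → 2
```

Multiply this by thousands of high-quality, diverse pairs and you get a sense of the effort involved.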
I'm not going to focus on datasets today. We have a library to help you with that, and I'll mention it along the way. Today, we'll talk more about training, but keep in mind that datasets should not be underestimated. The bigger problem is that training or fine-tuning models puts you in a difficult situation where you have to decide whether you want to favor accuracy or cost. This is a horrible decision to make because we want both. The reason for this is that you can't really fine-tune those closed models. There are fine-tuning APIs, and I could talk about them for 10 seconds, but this paper tells you everything you need to know: "How well do commercial fine-tuning APIs infuse knowledge into LLMs?" My conclusion is: not well at all, and that's the polite way to put it. So let's ignore closed models for now and focus on small, open-source models. You can run full fine-tuning, tweaking every parameter in the model on your data in the original precision of the model, say 16-bit. That's fine and not very difficult to do. But because you're tweaking every parameter and using a ton of data, it takes quite a bit of computing power. You need to find those GPU instances and pay for them, if you can find them. We work with AWS and have a lot of interesting discussions about finding more GPUs. Most of the time, they manage to.
There are better techniques, like parameter-efficient fine-tuning: LoRA and QLoRA let you train only a small subset of parameters, so you can save on GPU memory, time, and money. This is a good technique for the later stages of the training pipeline, such as instruction fine-tuning and alignment. However, for pre-training and continuous pre-training, there is significant accuracy degradation. If you're trying to inject domain knowledge using LoRA or QLoRA, you will be disappointed. We have a blog post on that, and there's a very good paper, "LoRA vs Full Fine-tuning: An Illusion of Equivalence," which I highly recommend.
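To make the memory argument concrete, here is a toy back-of-the-envelope sketch of why a rank-r LoRA adapter trains far fewer parameters than full fine-tuning of a single d x d layer; the hidden size and rank below are arbitrary illustrative values, not anything from the talk.

```python
# Toy parameter count for one d x d linear layer. LoRA freezes the
# weight matrix W and learns a low-rank update B @ A of rank r, so
# only 2*d*r parameters are trainable instead of d*d. QLoRA
# additionally quantizes the frozen W to 4-bit to save more memory.
d, r = 4096, 16  # hidden size and LoRA rank: illustrative values only

full_params = d * d      # parameters updated by full fine-tuning
lora_params = 2 * d * r  # parameters updated by the rank-r adapter

print(full_params)                # → 16777216
print(lora_params)                # → 131072
print(lora_params / full_params)  # → 0.0078125, under 1% of the layer
```

That gap is where the GPU memory and cost savings come from, and also why a low-rank update can struggle to absorb a billion tokens of new domain knowledge.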
How do we fix these challenges? We fix them by implementing new techniques and evolving that training pipeline. That's what Arcee has been doing for the last 18 months, and we introduced a collection of new libraries to help you do the same. The first technique we promoted is model distillation, implemented in a library called DistillKit. The intuition is that we have a lot of really good open-source models. We have Llama 405B and some amazing models that someone spent a ton of money training. We're not going to do the same. We'll take those models and teach smaller models to mimic their output. We'll take the 405-billion-parameter model and train an 8-billion or 14-billion model to predict as closely as possible to it. We have a teacher model. We run inference on a dataset, store the logits (the raw values coming out of the model before the softmax), and that becomes our training set. We train the smaller model to predict like the bigger model. The loss function minimizes the difference between the two token distributions using the Kullback-Leibler divergence. The intuition is to predict as closely as possible to the teacher. This is a simple process, and the code is actually very simple.
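The loss just described can be sketched in a few lines of pure Python: soften both logit vectors with the same temperature and compute the KL divergence between the resulting distributions. The temperature value and the tiny vocabulary below are illustrative assumptions, not DistillKit's actual code.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student distribution q is from the
    teacher distribution p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Minimize the KL divergence between teacher and student token
    distributions, both softened by the same temperature."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return kl_divergence(p, q)

# A toy 3-token vocabulary. If the student matches the teacher
# exactly, the loss is zero; any mismatch gives a positive loss.
teacher = [2.0, 0.5, -1.0]
print(distillation_loss(teacher, teacher))              # → 0.0
print(distillation_loss(teacher, [0.1, 0.2, 0.3]) > 0)  # → True
```

In real training this loss is computed per token over the whole sequence, often blended with the usual cross-entropy objective.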
Distillation is nice because we can inherit knowledge from a much bigger model without training a new one from scratch. The second thing is the training process. We're stuck between full fine-tuning, which is accurate but expensive, and LoRA, which is not so expensive but less accurate. We solve this with another technique called Spectrum, which we introduced a few weeks ago. The intuition is that we don't have to train all the layers. We need to look for the layers that have the highest contribution to the output and train only those in full precision, rather than with LoRA or QLoRA adapters. There's a bit of math involved, but the key concept is the signal-to-noise ratio of each model layer. We run Spectrum, identify the layers with the highest signal-to-noise ratio, and keep only the top 25 percent. We train those with full fine-tuning in full precision. Spectrum outperforms LoRA and QLoRA on every benchmark I've seen in terms of accuracy, gets very close in terms of memory savings, and is pretty much a match for full fine-tuning at a fraction of the cost. If you haven't looked at Spectrum and are still doing full fine-tuning or are frustrated with LoRA, I highly recommend looking at it. It's very easy to implement, and you will save a lot of time and money without degrading accuracy.
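The selection mechanics can be sketched in a few lines. The SNR proxy below (mean absolute value over standard deviation) is a stand-in assumption for illustration; Spectrum's actual measure is grounded in random matrix theory, but the rank-the-layers-and-keep-the-top-25% logic is the same idea.

```python
import random

def snr(matrix):
    """Toy signal-to-noise proxy for a layer's weight matrix: mean
    absolute value over standard deviation of the entries."""
    flat = [x for row in matrix for x in row]
    mean = sum(flat) / len(flat)
    var = sum((x - mean) ** 2 for x in flat) / len(flat)
    mean_abs = sum(abs(x) for x in flat) / len(flat)
    return mean_abs / (var ** 0.5 + 1e-8)

def select_layers(layers, top_fraction=0.25):
    """Rank layers by SNR and keep only the top fraction for full
    fine-tuning; the rest stay frozen."""
    ranked = sorted(layers, key=lambda name: snr(layers[name]), reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return set(ranked[:keep])

# 32 toy "layers" with random 8x8 weight matrices.
random.seed(0)
layers = {f"layer_{i}": [[random.gauss(0, 1) for _ in range(8)] for _ in range(8)]
          for i in range(32)}
trainable = select_layers(layers)
print(len(trainable))  # → 8, i.e. 25% of 32 layers get full fine-tuning
```

Everything outside the selected set keeps its pre-trained weights, which is where the memory and time savings come from.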
How do we keep improving this? Model merging is another interesting technique. We recognize that we have a ton of good models out there. Hugging Face has over 1 million models. Do you really need to train another one or fine-tune another one from scratch? Maybe not. If we have all those great models, can we build the one we need by merging existing models? Merging is a mathematical operation where we average out the models. Think of model A + model B + model C divided by three. It gets a little more clever than that, but that's the basic idea. We combine multiple task-specific models into a single model that can do all those things, hopefully better. It's not an ensembling technique; there's only one model in the end. You can run this on a CPU. I'll run a demo on my laptop right now. At the end of the day, you get one model with no inference penalty, and it's super nice.
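Here is a minimal sketch of the linear (weighted-average) merge just described, on toy one-parameter "models". MergeKit works tensor by tensor on real checkpoints and offers cleverer methods (SLERP, TIES, and so on), but the averaging idea is exactly this.

```python
def linear_merge(state_dicts, weights):
    """Merge models by taking a weighted average of each parameter,
    e.g. 0.6*A + 0.2*B + 0.2*C. Weights should sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Three toy "models" with a single scalar parameter each.
a, b, c = {"w": 1.0}, {"w": 2.0}, {"w": 3.0}
merged = linear_merge([a, b, c], [0.6, 0.2, 0.2])
print(round(merged["w"], 6))  # → 1.6, i.e. 0.6*1 + 0.2*2 + 0.2*3
```

Because this is plain arithmetic on stored weights, no gradients and no GPU are needed, which is why the demo below runs on a laptop.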
Merging is exactly this: we take model A, model B, and model C, ideally with the same architecture, but you can do heterogeneous merging if you want. We run MergeKit for a few minutes and get a merged model. The reason it works is that if you imagine the embedding space for different models, there's some overlap, but if you pick the models well, they won't overlap much. When you merge them, you add new embeddings to the same space, building a model with the combined knowledge of the original models. If you want to read the code, you can. Let me show you how this works. Clone the library, install it, job done. Write a config file. Here, I'm merging three LLaMA 8 billion parameter models with different weights. The math coder is 60%, the SQL coder is 20%, and the code LLaMA model is 20%. It's pretty much averaging those out. That's my config file.
Now, let's run this. It's local on my Mac; I'm not using a $100-an-hour AWS instance. I've already downloaded the models. Start thinking about how long it's going to take. I have 17 minutes left. Can we do it in under a minute? While we're waiting, imagine how much time and effort you would need to build a fancy math-and-code model using datasets you may not even have. Wow, looks like we're going to be close. Keep going. Usually, it's 1:07. Let's see how deterministic this is. Okay, all right, close: 1:08. So in one minute and eight seconds, we built a merged model on this Mac. I scored the three original models on the GSM8K benchmark, a math benchmark. You can see the scores for the individual models. The merged model is at 73.62, better than any of the three on their own. This improvement of almost 1% in accuracy is huge. If you build models for a living, you know getting an extra percent of accuracy on this kind of benchmark is a lot of effort. Here, it took 20 seconds to write the config file and one minute and eight seconds to run. That's near-infinite ROI.
If you want to know more about the different techniques for model merging, there are two hours of videos on my YouTube channel. Now, let's revisit our training pipeline. We can be super efficient. We have the same steps: continuous pre-training, instruction fine-tuning, and alignment. Across all of those training steps, we could use Spectrum and save maybe 50% on training time. We could still do it like that, but if we want to get fancy, we can use MergeKit at every step of the way. Some models are built strictly with MergeKit; we sometimes build customer models only with MergeKit. You can merge for domain knowledge, Q&A behavior, alignment, or a combination. For example, if you have super confidential company data, you could still run pre-training with Spectrum to inject that data and knowledge into the model. Then for Q&A and alignment, you could fine-tune other models on open-source datasets and run MergeKit instead of fine-tuning.
For datasets, to help you build better, richer, more diverse datasets, we have another library called EvolKit. It starts from your existing dataset and uses clever prompts and a large model on the side to generate better prompts and answers. If you have dataset problems, go and look at EvolKit. This is the foundational technology we've been building for a while. To prove our points, we use our stack to build models and put them on Hugging Face. We built a 1.5-billion-parameter model called Arcee Lite, which was the number one in its class when it was released. We built an 8-billion model, which was the number one when it was released. We built a 14-billion model, which was the best when it was released. We built the best Arabic model, which is not on Hugging Face because it was built for customers. If anyone wants to build the best Hebrew model, come and talk to me. We built the best 70-billion-parameter model.
Why are they the best on those Hugging Face leaderboards? Because you can't just train and build models the way you did two years ago. You need to be more clever and use the fancier techniques. To show you how we built SuperNova, the 70-billion-parameter model: we started from the big Llama, 405B, and distilled it. It took 32 H100s for five days, which is not cheap, but nowhere near the cost of training a new model completely from scratch. Then we took Llama 3.1 70B, built a really nice dataset with EvolKit, and applied Spectrum training to it, giving us another 70B model. We took the original Llama 3.1 70B, ran alignment on our own datasets, and got a third model. We merged them in a clever way, and that gave us SuperNova. SuperNova outperforms Llama 3.1 405B, Claude 3.5 Sonnet, and GPT-4o on the IFEval benchmark. That's not even the best model we have; we have some better ones right now.
How does a company like ours, about 40 people, build a model at a reasonable cost that outperforms the monsters from Silicon Valley? Because of the training pipeline. And it never stops. Just a week ago, we released Virtuoso Small, which is part of the Model Engine we just built. Virtuoso Small is a 14-billion-parameter model, released just weeks ago. You can see some of our other open-source models: the 8B, 14B, and 72B. Look at the gap between SuperNova Medius, released two months ago, and Virtuoso Small, just released: amazing results on all the benchmarks. Now look at how close Virtuoso Small, at 14 billion parameters, is getting to our original 72B model, released in July. So 14B is not quite the new 72B, but give us a few more weeks. If you're working with large models from a year ago, or even six months ago, you can get the same performance today with a model a fraction of the size at a fraction of the cost. The pace of innovation by Arcee and the open-source community is absolutely insane.
We keep building models, and just a couple of weeks ago at AWS re:Invent, we announced our new inference platform called Arcee Model Engine. At the moment, we have six models. We have the Virtuoso models, which are general-purpose SLMs, in three sizes. We have two coding models and a function-calling model. You can use these with pay-per-token APIs and all that good stuff. Let me show you. You can use the sandbox to try the models. Hopefully, the Wi-Fi is not down. I asked about hummus; see, that's all you needed to know. Yes, it's the best in the world, and there's the proof, although shawarma and falafel are good too.
You have APIs you can use, the OpenAI API to call the models, curl, Python, the usual stuff. Let me show you a better example. I'm going to use the function calling model. I'll create an OpenAI client and call the Yahoo Finance API to get stock prices, CEO names, company summaries, etc. I'll define those tools according to the OpenAI function calling spec. Function names, parameters, etc. Then I'll define a function that invokes my model with the OpenAI API, passing the tools and checking if there's a function call to be made and calling the function.
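The flow just described can be roughly sketched as follows. The tool schema matches the OpenAI function-calling spec, but the function name, the canned price, and the simulated model response are assumptions for illustration: in the real demo, the tool call comes back from the model and the function hits the Yahoo Finance API.

```python
import json

# Tool definition following the OpenAI function-calling spec.
# get_stock_price is a hypothetical name for this sketch.
tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the latest stock price for a ticker symbol",
        "parameters": {
            "type": "object",
            "properties": {"symbol": {"type": "string"}},
            "required": ["symbol"],
        },
    },
}]

def get_stock_price(symbol):
    # In the demo this would query the Yahoo Finance API; a canned
    # quote keeps the sketch self-contained and offline.
    return {"symbol": symbol, "price": 295.31}

AVAILABLE = {"get_stock_price": get_stock_price}

def dispatch(tool_call):
    """Route a model-issued tool call to the matching local function."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    return AVAILABLE[name](**args)

# Simulated model response asking us to call the tool.
tool_call = {"function": {"name": "get_stock_price",
                          "arguments": json.dumps({"symbol": "MCD"})}}
result = dispatch(tool_call)
print(result["symbol"])  # → MCD
```

In a live run, you would pass `tools` to the chat completion call, read the tool calls off the model's response, dispatch them like this, and send the results back to the model for a final answer.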
Let's try this one. Let's answer this super important question: what's the last price of Shipwape? We'll call the model, and the model decides to call the get_stock_price function. It calls that function via Yahoo and returns the stock price. Delisted? No... has it perhaps gone bankrupt? Okay, how is that possible? McDonald's still exists; let's try McDonald's. There we go. One call to the model: the model asks for the function, we invoke the function locally, call the Yahoo API, and get the result. But you may want to do something more complicated. Maybe let's do this one. This time, I'll call the model and try to invoke the tools, but maybe there's no tool that can answer that question. I'll take the result and pass it back. No response from the tool; something's broken. We can pass the answer back to the original model and say, "If you don't get an answer from the tool, give me an answer from your own data." Not sure what's going wrong here, but no worries.
You can see here, this is a 30K-context model. It's a 70-billion-parameter model, so it's a large one. You can use them just like that. Let's go back to this. In conclusion, I hope by now, almost 2025, people realize there is no model to rule them all. You tried, you figured it out. There is no model that works across all industries, all use cases, all languages, etc. You need to find the right model for each project and use case. Our customers tell us they get better results, more accuracy, and certainly more ROI from small language models. They just make more sense. If you need to build a model for your use case, if you need to adapt a model for your company and your data, please look at all these new techniques like EvolKit and Spectrum, maybe distillation and merging, etc., because they really change the game on cost, time to production, and model accuracy. They're not difficult to use.
For inference, that's what we're building at the moment. We're testing it with customers right now and seeing extreme performance by combining small language models and agents, which could integrate with Slack, Jira, Salesforce, etc., or call your internal workflows. SLM-based agent workflows are amazing. In a few weeks, we'll be launching our platform for this. It's going to be called Arcee Orchestra. It will be based on some of the models you just saw in the model engine and some new ones. It's drag-and-drop, no code, and all the good stuff. Keep your eye out for this; it's going to be interesting.
Feel free to visit Arcee to learn how to build all of this. This QR code is a link to our newsletter if you want to stay in touch. Please go and do that. Again, I want to thank you very much for having me today. It means a lot to me to be here. Thank you for the amazing conference, and enjoy the rest of your day. Thank you very much.
Tags
Open-Source Language Models, Model Fine-Tuning Techniques, AI Model Optimization
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.