Arcee.ai Tailoring Small Language Models for Enterprise Use Cases 09 2024

October 13, 2024
Talk @ AWS Telco hackathon, Dallas, TX - September 2024. Slides: https://fr.slideshare.net/slideshow/tailoring-small-language-models-for-enterprrise-use-cases/272382540 ⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

Transcript

Good afternoon, everybody. It's a pleasure to be here. My name is Julien. You may remember me from my AWS days or my Hugging Face days. Still the same guy, just a little more tired, I guess. And now I'm working with Arcee. And I actually have colleagues here. Can you guys wave? All right. So if you have questions, don't ask me because I'm jet lagged. Ask them. You'll get much better answers. But you can still come and say hi and ask simple questions. Okay? So, yes, I am going to talk about small language models. But before I do that, I need to set the scene a little bit. A couple of years ago, all we had was closed models, and they were cool, useful, and helped educate a ton of people on what AI really was, particularly for businesses and enterprise use cases. The open source community started running, and fast forward a couple of years, we've come to the point where the quality of open source models is just indistinguishable from closed models, as you can see on this very nice graph. The latest models from Meta and others are just on par with the largest and best closed models from OpenAI and Anthropic, the most important ones. So that's a given. I'm not even going to argue with this. There is no arguing. I think it's a fact. And because I like to look a little forward, it's very clear to me that open source models are state of the art, and they're winning this, clearly. The pace of innovation in the open source community is much higher than in the closed source community. So there's no doubt in my mind that even now, not to say maybe six months or a year from now, the best models period will be open source models. So that's a given. My talk today is not about this. My talk is about how we take those amazing models and make them even better so that they completely outperform the closed models on your enterprise use cases. Before I do that, I just want to call out why customers prefer small language models. It's not just me saying it. I've met probably 200, 300 of them in the last 18 months, and I've consistently heard this. The number one thing they like is accessibility, thanks to Hugging Face and model builders in the open source communities. We have great models to pick from. We can download them in seconds. We don't need to pay anyone anything. We just download them and get to work. So that's very cool and a good way to speed up innovation. The models are open source. I prefer to say open weights. What an open source model really is, at the very least, is that you have access to the model architecture and the model weights, so you know what you're working with and can test it, inspect it, and find out when things work and when they don't, which is pretty much impossible with closed models. Privacy and intellectual property protection go hand in hand. When you work with small language models and open source models, you deploy them on your own infrastructure. So if you work on AWS, you deploy those models in your VPC. The data you send to the model and the answers the model sends back stay within your VPC. There's no one else but you looking at that data, and that's the way it should be. Plus, if you train those models on your own data, the models become yours, and they stay within your VPC. You don't have to deploy them somewhere else with the risk of a knowledge leak or a security breach. Freedom of choice is very important. With closed models, you can pick from a handful of models. With open source models, you can pick from almost a million models now. 
There are almost a million models on Hugging Face. So you're not locked in, which means you can find the right model for every single project. You can probably upgrade to a better model when a new one becomes available. IT flexibility is also very important. Not everybody wants to deploy or train their models the same way. Some folks insist on doing on-prem, maybe because they have to. A lot of folks will use the cloud, etc. So even in the cloud, you could use EC2, EKS, SageMaker, or maybe Bedrock. Again, you have the freedom to pick not only the infrastructure but also the service and technology you want to run your models on. Cost optimization is a central concern these days. We're not living in the sandbox anymore; projects are moving to production, and ROI and cost performance are everything. Because you control model selection, model size selection, and infrastructure selection, you can find the sweet spot for each particular project, something you cannot do with closed models, which are priced per token regardless of the use case. And last but not least, model quality. I've seen enough evidence here to confidently say that a small open source model tailored on quality data will always outperform a generic large model. I used to say almost always, but that was 2023. Now I can say always. I've seen it enough, and my colleagues here and I do that every single day.

So what does a typical workflow look like? Typically, you'll start from a pre-trained model. Let's say a Llama 3 model that you grab from Hugging Face. Sometimes it's just good to go, and you don't need to do anything else; it will answer your questions well enough. But most of the time, and particularly in an enterprise setting, you need to tailor the model to the domain. Whether that's telco, energy, financial services, retail, manufacturing, insurance, etc., your domain and company-specific data and vocabulary need to be baked into the model to increase the quality of the generated answers. So you need to go through that workflow. You'll start from a good model, like a Llama 3 model, and go through different steps. The three main steps are: inject new domain knowledge, teach the model how to answer your questions the way you want, and give feedback on what a good answer is and what a not-so-good answer is. These three steps are continuous pre-training, instruction fine-tuning, and alignment. Typically, customers will go through all these steps, bringing datasets for those different purposes, and hopefully end up with a model that knows about their domain, knows how to answer questions properly, and is aligned for tone of voice and safety. Not so easy to do, lots of work, lots of compute, lots of datasets. What Arcee is doing is trying to simplify this, and I'll show you some examples.

Let's zoom in on those building blocks. The first one is continuous pre-training. Continuous pre-training means, for example, building a telco model by training the model on all 3GPP standards from the last 10 years or a million pages of Cisco product documentation. It's a ton of data, billions of tokens, and as you can imagine, it's pretty compute-heavy. The only choice for a while was full fine-tuning, training the full model in its original precision, probably FP16 or BF16, on that corpus of data. The problem is, we're talking about billions of parameters, billions of tokens, and full fine-tuning is compute-heavy and expensive, which has stopped a lot of folks from doing this.
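To make that step concrete, here is a minimal sketch of what full-parameter continuous pre-training looks like with the Hugging Face libraries: essentially causal language modeling on raw domain text. The model ID, corpus file, and hyperparameters are illustrative placeholders, not the exact setup from the talk, and a production job would typically pack documents into fixed-length blocks and run across many GPUs.

```python
# Hedged sketch: full-parameter continuous pre-training (causal LM on raw domain text).
# Model ID, corpus path, and hyperparameters are illustrative placeholders.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "meta-llama/Meta-Llama-3-8B"  # base model to adapt to your domain
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Raw, unlabeled domain text (standards, product documentation, ...), one file for simplicity.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama3-cpt", per_device_train_batch_size=1,
        gradient_accumulation_steps=16, num_train_epochs=1,
        learning_rate=1e-5, bf16=True, logging_steps=50,
    ),
    train_dataset=tokenized,
    # Causal-LM collator: labels are the input tokens themselves, no masked-LM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```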
About a year and a half ago, a new way to train models was introduced called parameter-efficient fine-tuning, with techniques like LoRA or QLoRA, which enable large memory savings by only training a fraction of the model, not the full model. Introducing quantization with QLoRA can further reduce memory usage. This allows you to train models in a more cost-effective way. Unfortunately, although it works very well for instruction fine-tuning, it doesn't work as well for continuous pre-training. The degradation, the price you pay for only training a fraction of the model, is pretty high. Recently, Arcee has been contributing to a new technique called Spectrum. In a nutshell, Spectrum trains the layers in the model that contribute the most to the prediction, identified through statistical analysis, in their original precision. This is still parameter-efficient fine-tuning because we're not training the whole model, just the most important layers. Spectrum typically trains the top 25% to 50% of layers, and depending on where you set that threshold, you get very significant speedups and memory usage reduction without compromising accuracy. It's almost as good, sometimes even a little better than full fine-tuning, and it's almost as efficient, sometimes a little more efficient, than QLoRA. This is an open-source project available on GitHub.

Instruction fine-tuning has been improved and simplified with LoRA and QLoRA, which are available in the Hugging Face libraries and have become almost household names. They're still very good, cost-effective techniques for fine-tuning. One thing people tend to forget is that it's not just about the training algorithm but also about the quality of the data you train or fine-tune the model on. People are always obsessed with GPU memory usage and might neglect the quality of the data a little bit. We recently released a new project called EvolKit, a toolkit that lets you improve your Q&A dataset by enhancing diversity and complexity. Using a high-quality LLM, we take your existing Q&A dataset and make it better, more diverse, and more complex, which automatically improves the quality of your fine-tuning. This is another one of our open-source projects, and we shared a resulting dataset on Hugging Face as well. We did a study where we took mostly 7 billion parameter open source LLMs, fine-tuned them on various tasks, and compared their performance to GPT-4. In a nutshell, we see that the fine-tuned models are much better than the base models, and more importantly, all the fine-tuned models outperform the OpenAI models, including GPT-3.5 and, in many cases, GPT-4. Google's Gemma doesn't fine-tune very well, so I wouldn't use it, but the Mistral or Zephyr 7B models are very easy to fine-tune and outperform GPT-4. These models are 7 billion parameters and can run on very little infrastructure. The smallest GPU instance on AWS today is g5.2xlarge, which costs about $1.20 an hour on demand and runs these models like a dream. From a cost-performance perspective, this is way better than anything else out there.
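For the instruction fine-tuning step, here is what the LoRA part typically looks like with the Hugging Face peft library. This is a hedged sketch: the rank, alpha, target modules, and base model are illustrative choices rather than a recommended recipe, and the wrapped model then trains with the same Trainer-style loop shown earlier, on instruction data instead of raw text. QLoRA simply adds 4-bit quantization of the frozen base weights on top of this.

```python
# Hedged LoRA sketch with the Hugging Face peft library.
# Rank, alpha, target modules, and the base model are illustrative placeholders.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                      # rank of the low-rank update matrices
    lora_alpha=32,             # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the weights are trainable
# From here, train with a Trainer-style loop on your instruction dataset.
```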
Let's talk about alignment. RLHF has been around for a while, and hopefully it's not the first time you've heard the term. RLHF involves human feedback on generated answers from a model, where a human rates the answer and provides a golden answer. Once you build enough of these, you can train the model again using reinforcement learning so that it learns to improve its answers based on the feedback. This is what made ChatGPT impressive, but there's a dark side. RLHF is difficult to scale, requiring thousands of people to write answers and score prompts, which is not something a typical company can do. Ethics are also a concern, as these companies have been using outsourced workers in different countries, leading to issues like the Kenya story. This is not the way we should be building AI. There are other problems like bias and quality: if the people providing feedback all think the same way, biases creep in. Complexity and cost are also problematic because reinforcement learning is a very heavy technique. One of the most popular alternatives these days is DPO, which does away with the human workforce and the reinforcement learning element. It can start from an existing preference dataset, made of prompts, chosen answers, and rejected answers, and use statistical analysis to learn how to generate answers that are closest to the chosen ones and not the rejected ones. There are a lot of good preference datasets on Hugging Face, and you could start with these to align your model, possibly without having to build your own. This is a much faster and more cost-effective way to align models based on human preference.

We could stop here and say we've improved a few things: CPT with Spectrum, better fine-tuning through LoRA and improved datasets through EvolKit, and DPO for alignment. But can we do away with training and fine-tuning completely? Can we get rid of it? I think we can, and that's the point of model merging. Model merging is based on an Arcee library called MergeKit. Building a great model is difficult, and the first difficulty is defining what "great" means; it means different things to different companies in different industries. Going through the training and fine-tuning pipeline is not simple, so can we simplify that? Instead of retraining a single model on all the data, can we find models that already have the qualities we need and merge them? For example, if we want a model that can do code, math, and Cisco log analysis all at once, we can find a good code model, a good math model, and a good Cisco log analysis model and merge them. We take the weights from these models and average them out, combining task-specific models into a single model without any training. This is not an ensembling technique; we build one model and work with that. Because there is no training, we don't need any GPU compute, and you can run this on your laptop or on a CPU instance. There's no cost for training, no extra cost for inference, and no extra inference latency. All you have to do is look at MergeKit, select a merging technique, and write a config file.

If we look at our workflow again, adding merging and some of the other things I discussed, we can modernize and accelerate the model adaptation workflow while getting a much better model in the end. Merging can happen at every step, replacing training, fine-tuning, or alignment. For example, if you have very specific company data, you might still want to do continuous pre-training because it's unique data you can't find in any other model, and you could run it with Spectrum for efficiency. But for instruction fine-tuning and alignment, you could just do merging, inheriting the work already done by others. You would still use LoRA for fine-tuning and DPO for alignment where needed, and EvolKit to improve your datasets.
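As an illustration of that config-file-driven workflow, here is a hedged sketch of a MergeKit merge. The model IDs are hypothetical placeholders, and the exact options depend on the MergeKit version and the merge method you pick; the point is simply that a merge is described declaratively and runs without any training compute.

```python
# Hedged sketch: a TIES merge of two fine-tuned variants with Arcee's MergeKit.
# The model IDs are placeholders; pick checkpoints that share the same base architecture.
import pathlib
import subprocess

merge_config = """
merge_method: ties
base_model: meta-llama/Meta-Llama-3-8B
models:
  - model: some-org/llama-3-8b-code      # hypothetical code-tuned variant
    parameters: {density: 0.5, weight: 0.5}
  - model: some-org/llama-3-8b-telco     # hypothetical domain-tuned variant
    parameters: {density: 0.5, weight: 0.5}
parameters:
  normalize: true
dtype: bfloat16
"""
pathlib.Path("ties-merge.yml").write_text(merge_config)

# mergekit-yaml is MergeKit's command-line entry point; the merge itself runs fine on CPU.
subprocess.run(["mergekit-yaml", "ties-merge.yml", "./merged-model"], check=True)
```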
This is what Arcee brings to the table, helping you go through that pipeline faster, cheaper, and with higher quality in the end. A lot of what we do is open source. You could run all of this with open source tools, including Spectrum, EvolKit, MergeKit, and the Hugging Face libraries for LoRA and DPO. Some companies don't want to reinvent the wheel, so we have built a cloud platform called Arcee Maestro that does all of this in just a few clicks. Everything you saw on the previous slide can be done there: upload your data, go through the steps, and we have a Python SDK if you prefer writing code. This is available as a SaaS platform and as a VPC deployment for maximum privacy.

We also build models. We have a bunch of open source models on Hugging Face, and you can check them out. We're also building commercial models. One that came out not two weeks ago is called Supernova, a 70 billion parameter model based on the Llama 3 architecture. It was built using all the techniques I've described, including merging and model distillation: starting from the larger Llama 3.1 405B and distilling it into a smaller model. It's the best 70 billion parameter model available today. On Google's IFEval benchmark, Supernova not only outperforms Llama 3.1 70B and Llama 3.1 405B but also outperforms Claude 3.5 Sonnet and GPT-4o. Even if they are trillion parameter models, it's impressive that we can outperform them with a 70 billion parameter model. We have a demo; you'll get the slides. The demo is at supernova.arcee.ai. It's fast, accurate, and the price point is much lower. Go try it. If you want to deploy it in your company, it's available on the AWS Marketplace. If you look for Arcee on AWS Marketplace, you'll see Supernova, and you can deploy it to SageMaker in a few clicks.

We've also built a smaller sibling of Supernova, an 8 billion parameter model called Supernova Lite, based on Llama 3.1. It's the best 8 billion parameter model available today, number one on the Hugging Face leaderboard. You can try it very easily, deploy it from Hugging Face, and because it's so small, you can run it on very cost-effective infrastructure. In fact, I can run it locally on my machine; I have a quantized version running on my Mac. Let's use this prompt. You can run this 8 billion parameter model, the best available today, just like that on your local machine. Zero cloud cost. For production usage, you might want to run this on AWS, so let me show you how to deploy it on Inferentia, the AWS accelerator. All the notebooks are on GitHub, so you'll get the link. Here, all it takes is deploying it on an inf2.xlarge instance, which costs 99 cents an hour. I'm using the AWS inference container, setting some basic options, and calling deploy. Wait a few minutes, and you have your SageMaker endpoint ready. You can start prompting it, and it's as fast as you need it to be. You can run this inside your AWS account for 99 cents an hour, which is the on-demand price, but as an AWS customer, you can optimize that by at least 30% very easily. You get full privacy and control, and you can use the OpenAI prompting format, so if you have OpenAI prompts today, you can reuse them as is.
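The deployment described here roughly corresponds to the following sketch with the SageMaker Python SDK and the Hugging Face inference container for Inferentia. The container backend, environment variables, sequence lengths, and IAM role handling are assumptions to adapt to your account and SDK version, not the exact notebook from the talk.

```python
# Hedged sketch: deploying Supernova Lite to a SageMaker endpoint on Inferentia 2.
# Container version, environment settings, and IAM role handling are assumptions to adapt.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # or pass an explicit IAM role ARN
image_uri = get_huggingface_llm_image_uri("huggingface-neuronx")  # TGI container for Inferentia

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "arcee-ai/Llama-3.1-SuperNova-Lite",
        "HF_NUM_CORES": "2",            # inf2.xlarge exposes two NeuronCores
        "HF_AUTO_CAST_TYPE": "bf16",
        "MAX_BATCH_SIZE": "1",
        "MAX_INPUT_TOKENS": "3072",
        "MAX_TOTAL_TOKENS": "4096",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=900,  # model loading/compilation takes a while
)

print(predictor.predict({"inputs": "What is 3GPP?", "parameters": {"max_new_tokens": 128}}))
```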
Can we get smaller? Of course. I can run this on a Graviton instance, a CPU instance with Arm cores, no GPU whatsoever. Who thinks we can run an 8 billion parameter model fast on a CPU instance? Let's take a look: Supernova Lite quantized to 4 bits, running on CPU. How many tokens per second do you think? 57 tokens per second. This is an r8g instance, which costs about $2.50 an hour, and when the c8g instances become available, this will be even more cost-effective. A state-of-the-art model running at this speed on a CPU instance is where we are today.

Summing things up, the one takeaway is that there is no model that rules them all. Each project is different, with different requirements, domain knowledge, and ROI scenarios. You need to study each project separately and find the right model and infrastructure for each use case. Small, tailored open source models are the way to go. You take the best open source models available today, which are already extremely close to state-of-the-art performance, and through clever techniques, make them even more amazing on your domain knowledge, which is the only thing you care about. Training and fine-tuning techniques are moving very fast. All the latest advances, like MergeKit, Spectrum, etc., are changing the game in terms of speeding up the pipeline, reducing costs, and increasing quality. Again, you can probably do all of this with open source tools and reinvent the wheel a little bit, or you can try our platform in the cloud or in your VPC. And of course, you can try our open source models.

A few resources to close: go check our blog, our model and dataset collection on Hugging Face, and the GitHub repository where I put all my AWS notebooks. If you're interested in deploying models on SageMaker or through the marketplace, that's where you'll find all the notebooks. I have a busy YouTube channel where I keep posting AI and AWS content, and you may like that. This QR code is how to subscribe to our newsletter and stay in touch. We have a bunch of new launches coming, so you don't want to miss out. That's really what I wanted to tell you. Thanks so much for listening. My colleagues are here for questions, and I guess I can take a few too. Thank you very much.

Tags

OpenSourceModels, SmallLanguageModels, ModelMerging, EnterpriseAI, CostOptimization, AI

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.