SLM in Action: Arcee-Lite, a powerful 1.5B distilled model
August 20, 2024
In this video, you will learn about Arcee-Lite, a small yet powerful 1.5B model created with DistillKit, an open-source project for model distillation.
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️
Arcee-Lite outperforms Qwen2 1.5B Instruct and is currently the best 1.5B model available.
First, I run an 8-bit version on my M3 MacBook with Ollama and Open WebUI. Then, I deploy the model on AWS with Amazon SageMaker. I run both synchronous and streaming inference. I also show you how to use the OpenAI Messages API, allowing you to invoke the model with the OpenAI prompting format.
* Model page (full precision model): https://huggingface.co/arcee-ai/arcee-lite
* Model page (quantized models): https://huggingface.co/arcee-ai/arcee-lite-GGUF
* Notebook: https://github.com/juliensimon/arcee-demos/blob/main/sagemaker/deploy_lite_gpu.ipynb
00:00 Introduction
00:55 Introducing Arcee-Lite
04:40 Running Arcee-Lite locally with Ollama and Open WebUI
06:20 Deploying Arcee-Lite on AWS with Amazon SageMaker
08:10 100+ tokens per second on g5.xlarge!
08:45 Streaming inference
09:40 Use cases for this model
Configuration file (Modelfile) for Ollama:
FROM ./arcee-lite-Q8_0.gguf
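A minimal way to try this yourself (the model name arcee-lite is just the name you pick at import time, not something from the video): save the line above as a file named Modelfile, run "ollama create arcee-lite -f Modelfile", then query the model through Ollama's OpenAI-compatible endpoint, for example:

# Sketch: query a locally imported Arcee-Lite model through Ollama's
# OpenAI-compatible endpoint. "arcee-lite" is whatever name you used
# with "ollama create"; the API key can be any placeholder string.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="arcee-lite",
    messages=[{"role": "user", "content": "Summarize what model distillation is in two sentences."}],
)
print(response.choices[0].message.content)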
Sign up for Arcee Cloud at https://hubs.li/Q02Kh_YQ0 and please follow Arcee.ai on LinkedIn to stay on top of the latest Small Language Model action!
#ai #aws #slm #llm #openai #chatgpt #opensource #huggingface
Transcript
Hi everybody, this is Julien from Arcee. I hope you like my summertime office. In this video, I'd like to talk about a new model built by Arcee. It's called Arcee-Lite. It's a 1.5 billion parameter model, so definitely a small one. This model is a little bit special because it was built with a new open source library from Arcee called DistillKit. As you can guess from the name, DistillKit is a model distillation library. We'll talk a little bit about that, but I'll cover DistillKit in more detail in a future video. For now, I just want to show you Arcee-Lite, this really, really sweet 1.5 billion parameter model, first running locally on my machine and then deployed on AWS with Amazon SageMaker. Of course, you can find the Arcee Lite model on the Hugging Face Hub, and as usual, I'll put all the links in the video description.
Arcee Lite is a compact yet powerful 1.5 billion parameter language model developed as part of the DistillKit open source project. Feel free to go take a look at the DistillKit project on GitHub. We'll cover that in more detail later. Distillation is a process where we start from a larger model, and in this case, the team started from Phi-3 Medium, which is a 14 billion parameter model. You run it through the distillation process, and you distill it into a 1.5 billion parameter model, which is based on the Qwen2 1.5B architecture. This represents almost a 10x size reduction from 14 billion to 1.5 billion. The magic of distillation is that, despite slashing the parameter count by almost 10x, we try to keep as much of the good stuff as possible. You can see this in the benchmarks.
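To make the distillation idea a little more concrete, here is a textbook sketch of a logit-distillation loss in Python: soften the teacher and student output distributions with a temperature, then pull the student toward the teacher with a KL-divergence term. This is only an illustration, not DistillKit's actual implementation, and it assumes the teacher and student share a vocabulary, which is a simplification.

# Textbook logit-distillation loss, for illustration only.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then measure how far
    # the student is from the teacher with a KL-divergence term.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    # Scale by T^2 so gradients keep a comparable magnitude across temperatures.
    return kl * temperature ** 2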
Let me zoom in a bit. What's particularly interesting here is the comparison between the yellow bar, which is Qwen2 1.5B Instruct, and Arcee Lite. Both have the same architecture, except the weights come from distilling Phi-3 Medium, a 14 billion parameter model. We can see that across the board, in Big Bench Hard, Multi-Step Reasoning, and MMLU Pro, the distilled model performs much better. At the time of recording, Arcee Lite is actually the best 1.5B model available, according to the LLM leaderboard. This shows how powerful distillation is. You can start from a much larger model, shrink it by almost 10x, and outperform the best in class for that size.
Let's start by running this model locally. We have GGUF versions available, so I just grabbed the 8-bit version here. I could go lower, but because it's already a small model, I thought I'd stick with the 8-bit version. Feel free to try the 6, 5, or 4-bit versions if you have a constrained environment, like running it on your phone. I imported it into Ollama and I'm using Open WebUI. Let's ask a question. I'm not so interested in the result; I just want to know how fast this is. Wow, look at that. This thing is flying. The first benefit of a small model is its speed. At 1.5 billion parameters in 8-bit, it's 1.5 gigabytes, which can run on even a small machine. If you have a nice machine, it runs even faster. I haven't looked at tokens per second, but it's certainly the fastest I've seen on my local machine. If you want to run this model locally, grab a GGUF version, download it, import it into Ollama, and you can query it all day.
For production, you may want to deploy this in the cloud. Let's look at how we would do this on Amazon SageMaker. You'll find the sample notebook in the video description. Basic dependencies as usual, and pointing at the Hugging Face repo. I'm using the smallest possible GPU instance on AWS, a g5.xlarge, which is the smallest and least expensive one you can get because it's a tiny model. I should really try CPU inference, maybe that's another video. The environment for our deployment inference server is, of course, Hugging Face TGI. Model ID, one GPU on this instance, and I'm enabling the OpenAI Messages API for comfortable prediction using OpenAI syntax. Create a model object with the latest TGI container, 2.2, then call deploy and wait for a few minutes. We have our endpoint. Now we have a proper production GPU. Let's see how fast this is. Let's send this prompt and run it. I'm timing it, so let's see how fast we are. Synchronous inference first. We did 461 tokens in 4.20 seconds, which is about 110 tokens per second on the tiniest GPU instance. That's a very good number. The instance costs about a dollar an hour on demand, so you can run that model at scale for very cheap.
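For reference, here is a sketch of what that deployment looks like with the SageMaker Python SDK. The container version, the environment variables, and the example prompt are assumptions based on what the video describes; the notebook linked in the description has the exact code.

# Sketch of the SageMaker deployment described above. The IAM role and
# region come from your own environment.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Hugging Face TGI inference container (the video mentions TGI 2.2)
image_uri = get_huggingface_llm_image_uri("huggingface", version="2.2.0")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "arcee-ai/arcee-lite",  # model repo on the Hugging Face Hub
        "SM_NUM_GPUS": "1",                    # single GPU on g5.xlarge
        "MESSAGES_API_ENABLED": "true",        # accept OpenAI-style chat payloads
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    container_startup_health_check_timeout=600,
)

# Synchronous inference with the OpenAI Messages API format
payload = {
    "messages": [{"role": "user", "content": "Write a short introduction to model distillation."}],
    "max_tokens": 256,
    "temperature": 0.7,
}
print(predictor.predict(payload))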
Now let's try streaming inference. All we have to do is set stream to true in the request we send to the TGI container. That was a short answer, but it's very fast. No one can read that fast. That's my benchmark. You see how fast you can run this. Keep in mind, that's a single GPU on a tiny machine. If you need to scale up, you can. What would you use this model for? It's only 1.5 billion parameters, so it has a limited amount of knowledge. I wouldn't use it for general-purpose Q&A as is, but if you start plugging in external context, it's certainly going to do very well. As a RAG model, this is very tempting. I have a small example here, which is not really RAG but involves adding extra context. I'm asking if Cybertron is the ancestor of deep learning, and we get a good answer. The machine learning Wikipedia page talks about Cybertron, a machine developed by Raytheon in the 50s, etc. So as a RAG model, this is probably good. Instead of a 7B model, try this one. The tokens per second you get here are impressive, and your ROI will shoot up. In RAG architectures, the model is just a writing assistant; the knowledge comes from the external source of truth.
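Here is a sketch of streaming inference against that endpoint with boto3. The endpoint name is hypothetical, and the parsing is deliberately naive: it assumes each payload part contains complete "data: ..." lines, whereas production code should buffer across parts.

# Streaming sketch against the endpoint deployed above.
import json
import boto3

smr = boto3.client("sagemaker-runtime")

payload = {
    "messages": [{"role": "user", "content": "Explain model distillation in two sentences."}],
    "max_tokens": 256,
    "stream": True,
}

response = smr.invoke_endpoint_with_response_stream(
    EndpointName="arcee-lite-endpoint",  # hypothetical name, replace with yours
    ContentType="application/json",
    Body=json.dumps(payload),
)

# The response body is an event stream of payload parts carrying
# server-sent-event lines ("data: {...}").
for event in response["Body"]:
    chunk = event.get("PayloadPart", {}).get("Bytes", b"").decode("utf-8")
    for line in chunk.splitlines():
        if line.startswith("data:") and "[DONE]" not in line:
            delta = json.loads(line[len("data:"):])["choices"][0]["delta"]
            print(delta.get("content", ""), end="", flush=True)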
The second use case would be embedded applications, especially if you look at the tinier versions. For example, the Q4_K_S version is going to be really small, under a gig for sure. You could fit this on a phone or any constrained device. It would still do a very good job at language modeling. I love these tiny models. They pack a ton of punch, run very cheaply, and still do a very good job, thanks to distillation.
That's really what I wanted to tell you today. Before I jump into the swimming pool, my friends, you know what to do. Keep rocking.