Hi everybody, this is Julien from Arcee. Last week we released Llama-Spark, a new Llama 3.1 variant that outperforms the original model. In this video, I'm going to talk a little bit about the model and how we built it, then we'll test it locally on my machine with Ollama, and then we'll deploy it on Amazon SageMaker on a small GPU instance and run some testing. You will see why this model, like small language models in general, is a great alternative to OpenAI models. Let's get started.
Llama-Spark is a variant of Llama 3.1 8B. We started from the base model and fine-tuned it on the Tome dataset, which Arcee released a little while ago. Then we merged that fine-tuned model with the Instruct version of Llama 3.1 8B. The results are pretty good: you can see how Llama-Spark outperforms Llama 3.1 8B Instruct on a lot of different tasks. The model is available on Hugging Face, and you can find more information in the blog post. We also have GGUF quantized versions available for local inference. Last but not least, we have the dataset that was used to fine-tune the Llama 3.1 base model. It's called The Tome, a fairly large dataset with 1.75 million rows, and you can read all about it.
Now let's grab a quantized version of Llama-Spark and run it on my Mac. Llama-Spark is an 8-billion-parameter model, so there's no need to go to extreme quantization. I went for Q5_K_S, which is a very good compromise between model size and accuracy. You just need to go here, where you see the model parameters, and click on the download link. I've done that already. The model file is just a little over 5 GB, which should be very easy to run on this machine. Using a simple config file like this, I could just run `ollama create`. Now we have our model ready to go, and we can run it. Let's ask a question. How about: is there a relationship between Transformer robots and transformer models? There's actually a connection. Interesting. And you can see how fast this is. Running smaller models, particularly quantized versions, on a local machine is just super sweet. The speed is very good. If you have a good laptop or machine, you can do this all day.
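If you'd rather script this than type into the interactive prompt, here is a minimal sketch that queries the local Ollama server over its REST API. The model name `llama-spark` is an assumption; use whatever name you passed to `ollama create`.

```python
# Minimal sketch: query the local Ollama server over its REST API.
# Assumes the model was registered as "llama-spark" with `ollama create`
# and that Ollama is running on its default port 11434.
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama-spark",
        "messages": [
            {
                "role": "user",
                "content": "Is there a relationship between Transformer robots and transformer models?",
            }
        ],
        "stream": False,  # return the full answer in one JSON payload
    },
    timeout=300,
)
response.raise_for_status()
print(response.json()["message"]["content"])
```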
How can I optimize inference latency in the KV cache? Let's see. Quantization, pruning, distillation, and so on. Very detailed answers, very good speed. What's not to like? You can run local inference all day. But chances are, for production, you want to run in a different environment. So now let me show you how we can run this on AWS, and we'll see how you could actually migrate from OpenAI to this with minimal impact.
In this notebook, we're going to deploy Llama-Spark on AWS using Amazon SageMaker. I'll use a really small GPU instance, which is cost-efficient, and we'll run some prompts. Model deployment is very easy. First, define the name of the model, then define some parameters, such as the maximum input length for the model. Llama-Spark has a large context window: you can go all the way to 128K tokens, which is much more than the original Llama. However, this is limited by GPU RAM. On this small instance, I can't go higher than 8K, which is more than enough for a lot of use cases. But if you need to work with a bigger context, you will need an accelerator with more RAM. So max input will be 8K, and total tokens (input plus output) is set to 10K, but you could tweak it. You could have a higher input context and a lower output if you just want short answers. 8K will be more than enough here.
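As a rough sketch, here is what that configuration can look like with the SageMaker SDK. The Hugging Face model ID and the exact values are assumptions based on what I just described; adjust them to your own setup.

```python
# Sketch of the endpoint configuration, assuming the arcee-ai/Llama-Spark
# repository on the Hugging Face Hub and the token limits discussed above.
config = {
    "HF_MODEL_ID": "arcee-ai/Llama-Spark",  # model to pull from the Hub
    "SM_NUM_GPUS": "1",                     # single-GPU instance
    "MAX_INPUT_LENGTH": "8192",             # 8K input context
    "MAX_TOTAL_TOKENS": "10240",            # 10K input + output
}
```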
I'll enable the messages API, which turns on OpenAI compatibility for inference. SageMaker uses the Hugging Face Text Generation Inference (TGI) server, so you can send inference requests and receive answers in the OpenAI format. This really simplifies the whole thing. Then we create a model object with the SageMaker SDK, grab the name of the Docker image for the TGI server, and call `deploy` on a small G5 instance, which costs about $1.2 an hour on-demand. It takes a few minutes to deploy, and then we can run inference. We can use the OpenAI prompting format, which hides the complexity of the Llama prompting format: just define the model as TGI, the system prompt, the user prompt, and the max tokens. Let's run this and do synchronous inference, which generates the full answer in one go and sends it back. Then we should be able to print it. So we have the answer. It's in the OpenAI format: a long answer with the number of tokens, etc. And we can print it nicely because it's a markdown answer. Pretty cool stuff.
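Here is a minimal sketch of the deployment and a synchronous request, reusing the `config` dict from the sketch above. The container version, instance type, and prompts are assumptions; the notebook in the blog post is the reference.

```python
# Sketch of deployment plus one synchronous request, assuming the `config`
# dict defined earlier and the sagemaker Python SDK.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Enable the OpenAI-compatible messages API in the TGI container.
config["MESSAGES_API_ENABLED"] = "true"

# Grab the Docker image for the Hugging Face TGI server.
llm_image = get_huggingface_llm_image_uri("huggingface", version="2.0.2")

model = HuggingFaceModel(env=config, role=role, image_uri=llm_image)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # small single-GPU instance
    container_startup_health_check_timeout=600,
)

# Synchronous inference with the OpenAI chat format.
response = predictor.predict({
    "model": "tgi",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Why are small language models a good fit for production?"},
    ],
    "max_tokens": 1024,
})
print(response["choices"][0]["message"]["content"])
```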
But probably what you want is streaming inference, a nice chatbot experience. I took some support code from an AWS blog post to display the streaming response and tweaked it a little for the OpenAI format. Now I can run a streaming inference request. Same question, except I'm setting `stream` to true. Let's run this and invoke the SageMaker endpoint with a streaming response. Now we are streaming, a much nicer experience. The speed is just fine, and keep in mind, this is not a quantized model. It's generating fast enough on that small GPU instance. We get the same answer. It's very easy to work with these models on AWS. Just deploy them, run inference, and you're on your way.
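For reference, here is roughly what that streaming loop can look like with boto3, reusing the predictor from the deployment sketch above. The support code from the AWS blog post is more robust; this simplified version assumes each server-sent event arrives cleanly within the buffered payload parts.

```python
# Rough sketch of streaming inference with boto3, adapted to the OpenAI
# event format. Real chunks can be split across payload parts, so production
# code should buffer more carefully than this simplified loop does.
import json
import boto3

smr = boto3.client("sagemaker-runtime")

body = {
    "model": "tgi",
    "messages": [
        {"role": "user", "content": "Why are small language models a good fit for production?"},
    ],
    "max_tokens": 1024,
    "stream": True,  # ask TGI for server-sent events
}

response = smr.invoke_endpoint_with_response_stream(
    EndpointName=predictor.endpoint_name,
    ContentType="application/json",
    Body=json.dumps(body),
)

buffer = ""
for event in response["Body"]:
    buffer += event["PayloadPart"]["Bytes"].decode("utf-8")
    # Server-sent events are separated by blank lines: "data: {...}\n\n"
    while "\n\n" in buffer:
        line, buffer = buffer.split("\n\n", 1)
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
```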
One question you may have is how this model performs compared to ChatGPT. I ran a small query on ChatGPT just an hour ago and asked a simple question: is Cybertron the ancestor of deep learning? The answer was that Cybertron is most commonly associated with the fictional planet from the Transformers franchise, not with deep learning or artificial intelligence, and that if there is any connection, it might be through a misinterpretation. It's not a great answer. ChatGPT is dodging the question and spitting out a full page of deep learning history, which is not what I wanted. It's like a media-trained version of ChatGPT, using a technique called block and bridge.
Let's ask the same question to our Llama-Spark model. The answer is not really better. It tells me a little more about what Cybertron is and a little bit about deep learning, but it's still not great. So what do we do next? We use external context. I'm not going to do a full RAG demo, just simulate one: I took the history section of the Wikipedia page for machine learning and loaded it into a file. Now I'm passing this as context. Same question, but using the provided context. Cybertron is not the ancestor of deep learning. The term Cybertron refers to a learning machine developed by the Raytheon Company in the early 60s. Cybertron was used to analyze sonar signals, electrocardiograms, and speech patterns, etc. That's the answer I was looking for.
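In case it helps, here is roughly what that context-passing step looks like, reusing the predictor from the deployment sketch above. The file name `ml_history.txt` is an assumption; any plain-text context works the same way.

```python
# Sketch of the "context injection" step: load the Wikipedia excerpt from a
# local file and pass it alongside the question in the OpenAI chat format.
with open("ml_history.txt", "r", encoding="utf-8") as f:
    context = f.read()

question = "Is Cybertron the ancestor of deep learning?"

response = predictor.predict({
    "model": "tgi",
    "messages": [
        {"role": "system", "content": "Answer the question using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
    "max_tokens": 512,
})
print(response["choices"][0]["message"]["content"])
```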
Some of you might say that if you passed the same context to ChatGPT, it would have figured it out. I agree. My point is: why work with a really large model that you have little control over, that's next to impossible to fine-tune, and that can get expensive at scale? No model knows everything, and the only way to get high relevance is to add external context. So why not use a smaller, cost-efficient model that you can fine-tune, align, and host on your own AWS infrastructure, completely privately, and feed context that makes it super relevant to your business domain? If that context is turned into a good answer by a great open-source model, even better. I can only think of the benefits.
That's pretty much what I wanted to show you today. Llama-Spark, a new state-of-the-art Llama 3.1 variant, is very easy to test on your local machine and deploy on your AWS infrastructure. You can use the OpenAI prompting format, so you don't have to rewrite your application code. It gives you really good answers out of the box, and if you add your own domain-specific context, it gets even better. So SLMs for the win. Count on me to keep diving into that area. Much more content coming. You know what to do, my friends. Keep rocking.
Tags
Llama-Spark, Model Deployment, AWS SageMaker, Local Inference, Quantization