Hi everybody, this is Julien from Arcee. A couple of hours ago, the Hugging Face LLM leaderboard was updated, and we're really happy to announce that our latest 8 billion parameter model, called Supernova Lite, is now officially the best open-source 8 billion parameter model available. It sits at the top of the leaderboard. That's great. In this video, I'm going to show you how you can deploy Supernova Lite, which is a Llama 3.1 model, on Amazon SageMaker. I'm going to use the Inferentia 2 accelerator to do this because it's super cost-effective. In fact, I'm going to use the smallest Inferentia 2 instance available, which costs about 75 cents an hour. Yes, you heard that right. Not 75 dollars, 75 cents. Once you've watched this, you'll be able to run what we think, and what the leaderboard thinks, is the best 8B model available today for under a dollar an hour. Sounds good? Let's get started.
As a reminder, about a week ago, we released Arcee Supernova, our new 70B model, which is competitive with the largest open-source and closed models, including Claude 3.5, GPT-4, etc. I did some videos on that, and you can find them on the channel. There's also an 8 billion parameter version of this model, and this is the one we're going to work with today. This one is called Supernova Lite; its full name is actually Llama-3.1-SuperNova-Lite because this model is based on the Llama 3.1 architecture. This is the model page, and you can read a little bit about the model. In a nutshell, this is a distilled version of Llama 3.1 405B, a much larger model, but we were able to keep a lot of the original goodness in the much smaller 8B model. You can go and look at the evaluations, and of course, I have to show you the leaderboard, right? As usual, all the links will be in the video description. Here, I'm showing you all the 8B models. You can see them here, right? Size 8B. So all of them, the base models, the fine-tuned variants, and we are number one, right? Which is quite nice. We are even outperforming 9B models. Pretty cool. So well done team, Supernova Lite, the best 8B model available today.
Now let's see how we can deploy this. Recently, I did another video on the developer resources that I built to help you deploy our models and, in fact, Hugging Face models in general on AWS, so notebooks, CloudFormation templates, etc. I'm going to use one of those notebooks to deploy our Supernova Lite model, right? If you go into that repo, you'll see two notebooks. There's one for GPU deployment and one for Inferentia 2, which is the one I'm using today. Let's jump into that notebook and see how things work. Here, I'm running this notebook in SageMaker Studio, but you can run this in any Jupyter environment with your AWS credentials.
Let's go and look at the different steps. So dependencies, importing some packages, and you can see here I'm using the DJL inference server. This is the first time I'm showing you this because so far I've been deploying on SageMaker using the Hugging Face TGI container. However, for devices like Inferentia 2 and Trainium, called Neuron devices after the AWS Neuron SDK that powers them, the DJL container is actually a better choice. It supports more models and is officially supported by AWS. I recommend this one. You can read all about it in the DJL documentation, which I'll link in the description. They have this large model inference (LMI) container which supports different backends like vLLM, TensorRT-LLM, and, as we can see here, the NeuronX backend for Neuron devices. The documentation is pretty good, and you should be able to figure it out.
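If you want to follow along outside the notebook, here is a minimal sketch of how the LMI NeuronX container image can be retrieved with the SageMaker Python SDK. The framework name and version string below are assumptions on my part; check the DJL LMI documentation and the SageMaker SDK for the values that match your region and setup.

```python
# Hedged sketch: retrieve the DJL LMI container image for NeuronX devices.
# The framework name "djl-neuronx" and the version are assumptions; check the
# DJL LMI documentation for the image that matches your region and SDK version.
import sagemaker
from sagemaker import image_uris

region = sagemaker.Session().boto_region_name
image_uri = image_uris.retrieve(
    framework="djl-neuronx",  # LMI container with the NeuronX backend
    region=region,
    version="0.24.0",         # illustrative version, pick the latest available
)
print(image_uri)
```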
This is what I'm using today. Let's grab the SageMaker client and the SageMaker bucket. In this notebook, and in general with Inferentia and Trainium devices, you can deploy models in two different ways. The first one is to download the model either from the Hugging Face Hub or load it from an S3 bucket, but load the vanilla PyTorch model and compile it on the fly. The container can do that, and there's nothing particular to set up. That's option number one, and we'll look into that. That's the one I'm going to use today because it's the simplest way. Compilation takes a few minutes. Option two would be to load a pre-compiled model. The notebook has some instructions and pointers on how to do that. This is a slightly more involved process, but the benefit is that you save the compilation time. If you're interested in the second option, I found some pretty good instructions on the DJL website on how to compile and package. If you follow that, you should figure it out. I did, and I guess I'll do a specific video on that compilation and packaging process. It's not as bad as you think; you just need to follow the steps carefully. Today, we're not going to do that. We're going to load the model and compile it on the fly, and it just takes a few minutes as you will see. Because we want compatibility with the OpenAI API, we enable that; it's available out of the box in that container, so we can just use the OpenAI prompting format.
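For reference, a minimal setup sketch for the session, role, and bucket, assuming you run the notebook in SageMaker Studio or with AWS credentials configured:

```python
# Minimal setup sketch: SageMaker session, execution role, and default bucket.
import sagemaker

session = sagemaker.Session()
role = sagemaker.get_execution_role()   # IAM role the endpoint will assume
bucket = session.default_bucket()       # default SageMaker bucket for this region
print(role, bucket, session.boto_region_name)
```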
Now let's look at the configuration for the endpoint. As mentioned, I'm going to use inf2.xlarge, which is the smallest Inferentia 2 instance available. It costs 76 cents an hour on EC2 and 99 cents an hour on SageMaker. That's the on-demand price, of course. Larger instances are available, but for a model in the 7 or 8B range, my tests show that you really don't need more than inf2.xlarge. If you need more throughput, you can just scale out on inf2.xlarge instead of trying to scale up to inf2.24xlarge or inf2.48xlarge, etc. The model is really too small for those larger instances, and you don't get nice linear scaling. My advice would be to start on inf2.xlarge and scale out your endpoints on that instance type.
Now that we have the instance type and container, we just need to define how to load the model and how to compile it. The environment is obviously DJL-specific. If you've watched some of my TGI videos, you'll notice these are different environment variables, but they tend to define the same things, like the tensor parallelism degree. On this inf2.xlarge instance, we have two NeuronCores, the accelerators. We set tensor parallelism to 2, batch size to 2, and sequence length to 8K. That's pretty much it. Remember, because we are compiling the model, we need to work with a fixed batch size and a fixed sequence length. But of course, we'll use padding and everything else. That's also why I think compiling on the fly is a really good option when you're experimenting and want to try out throughput and latency with different batch sizes and sequence lengths. If you compile on the fly, you can try different things quickly. You don't have to pre-compile models only to realize the batch size is too high or too low, then recompile and repackage again. Long story short, I recommend on-the-fly compilation for dev, test, and experimentation. Once you've locked in your configuration, pre-compile and deploy from the pre-compiled artifacts to speed things up, particularly if you have auto-scaling configured and don't want to recompile the model on each auto-scaled instance.
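Here is a hedged sketch of what that configuration can look like. The environment variable names follow the DJL LMI conventions and the model ID points at the Supernova Lite repository on the Hugging Face Hub, but treat the exact keys and values as assumptions and check them against the notebook and the DJL documentation.

```python
# Hedged configuration sketch for the DJL LMI NeuronX container.
# Keys and values are assumptions based on DJL LMI conventions; verify them
# against the notebook and the DJL LMI documentation before using.
instance_type = "ml.inf2.xlarge"  # smallest Inferentia 2 instance, 2 NeuronCores

env = {
    "HF_MODEL_ID": "arcee-ai/Llama-3.1-SuperNova-Lite",  # assumed Hub model ID
    "OPTION_TENSOR_PARALLEL_DEGREE": "2",   # shard the model across both NeuronCores
    "OPTION_MAX_ROLLING_BATCH_SIZE": "2",   # fixed batch size used for compilation
    "OPTION_N_POSITIONS": "8192",           # fixed sequence length (8K)
    "OPTION_DTYPE": "fp16",
}
```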
Here, we're just going with those parameters. Then I'm defining the DJL model object: the model ID, which is our Supernova Lite model on the Hub, the name of the container, the environment, and the SageMaker role. That's all there is to it. Really, really not difficult. As the notebook says, once we've done that, we can create the endpoint. We'll skip the pre-compiled section and come back to that in a future video, maybe once the new SDK is fully integrated. It's really just a matter of calling deploy on that model with the instance count, instance type, endpoint name, etc. Small gotcha: on Inf2 instances, you need to attach an EBS volume because, unlike P4, P5, and Trn1 instances, Inf2 instances don't have instance storage. They have a little bit of storage, but even with an 8B model, there's just not enough to download and copy everything. Your endpoint deployment fails, you'll see "no space left on device" in your CloudWatch log, and you have to start all over again. So here, 64 gigabytes is more than enough. We run the deployment, and it takes about 17 minutes. Let's take a look at the log to see how long the different steps took.
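A minimal deployment sketch, assuming the image_uri, env, role, and instance_type defined in the earlier sketches; the endpoint name is made up for illustration, and the health check timeout is just a comfortable margin for on-the-fly compilation:

```python
# Hedged deployment sketch using the generic sagemaker.Model class with the
# variables defined above (image_uri, env, role, instance_type).
from sagemaker import Model

endpoint_name = "supernova-lite-inf2"  # hypothetical endpoint name

model = Model(image_uri=image_uri, env=env, role=role)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    volume_size=64,  # attach a 64 GB EBS volume: Inf2 instances have no instance storage
    container_startup_health_check_timeout=1800,  # leave room for on-the-fly compilation
)
```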
I'm looking at my endpoint in the SageMaker console. If I click on this and then on logs, I'll go to CloudWatch and see the log for this endpoint. My timestamp tells me when this was created, so it was created at 12:12. The instance starts logging at 12:19. We have two NeuronCores, and we see DJL Serving starting. A whole bunch of messages we're not really interested in. Now we see the model being loaded. Loading starts at 12:19:14. It's downloading the model from the Hugging Face Hub, and that took about three minutes. Then we see compilation starting at 12:22:37 and done at 12:28:39. That's six minutes. Not too bad, because the inf2.xlarge instance is really tiny; it has a handful of vCPUs. If we deployed on inf2.8xlarge, which has the same number of NeuronCores but more vCPUs and more RAM, it would certainly compile a little faster. But again, you don't need more; you want to go small and cost-effective. If you're running the pre-compiled deployment option, you should be able to save those six or seven minutes. That's what it looks like.
Now let's go back to the notebook. Total time, 17 minutes, right? With the pre-compiled option, probably 11 or 12, which doesn't sound like a huge improvement, but if you're auto-scaling, six minutes is very significant. Now we have a running endpoint, so we can start running inference. Let's first start with this prompt: get some names for pet food stores. Here, I'm using the OpenAI format, so the system prompt, the user prompt, and I can pass all the usual parameters like max tokens, etc. I'm not using streaming. Invoking the endpoint is super simple; we use the boto3 invoke_endpoint API, passing the endpoint name and a JSON string with the payload. We deserialize and print the output. We get the OpenAI output format with the content and the number of generated tokens: we generated 339 tokens in 15 seconds, so that's about 22-23 tokens per second, which is pretty good, especially given the price point of Inf2. We can display it as Markdown for a nice-looking answer.
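The invocation code looks roughly like the sketch below, assuming the endpoint exposes the OpenAI chat completion schema as discussed above; the endpoint name, prompt, and parameter values are illustrative.

```python
# Hedged invocation sketch, assuming an OpenAI-compatible chat completion schema.
import json
import boto3

smr = boto3.client("sagemaker-runtime")
endpoint_name = "supernova-lite-inf2"  # hypothetical name from the deployment sketch

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful marketing assistant."},
        {"role": "user", "content": "Suggest ten names for a pet food store."},
    ],
    "max_tokens": 512,
}

response = smr.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)

result = json.loads(response["Body"].read())
print(result["choices"][0]["message"]["content"])
```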
Let's try another one: write a marketing page for a SaaS platform called Arcee Cloud. More of a real-life scenario than the pet food store. 1024 tokens, no streaming, exact same code, and we get our answer. It's inserting emojis and everything. You can see how easy it is to prompt, and those are very simple prompts. We get really good quality answers because the model started from Llama 3.1 8B, which is already very good, and then our team made it way better.
Now let's look at streaming. It's not much different. In your payload, you need to set the stream parameter to true. Instead of calling invoke_endpoint, you call invoke_endpoint_with_response_stream, and you'll receive a streaming object. The slightly tricky part is how to process the data coming from that streaming object, but in the repository, you'll find utility code to do that. I adapted it a little bit for the OpenAI format. Now we just have to pass the response body and call this function to get a nice streaming answer. Let's see that. There we go. Now we have something that looks like a chatbot. The speed is more than 24-25 tokens per second, which is really nice.
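Here is a hedged streaming sketch. The boto3 call is the real invoke_endpoint_with_response_stream API, but the chunk parsing assumes OpenAI-style server-sent-event chunks and glosses over chunks split across payload parts; the repository's utility code handles that more robustly.

```python
# Hedged streaming sketch: OpenAI-style chunks over a SageMaker response stream.
# Parsing assumes each payload part contains whole "data: {...}" lines; the
# repository's utility code buffers partial chunks more carefully.
import json
import boto3

smr = boto3.client("sagemaker-runtime")
endpoint_name = "supernova-lite-inf2"  # hypothetical endpoint name

payload = {
    "messages": [{"role": "user", "content": "Write a haiku about motorcycles."}],
    "max_tokens": 128,
    "stream": True,
}

response = smr.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)

for event in response["Body"]:  # EventStream of PayloadPart events
    chunk = event.get("PayloadPart", {}).get("Bytes", b"").decode("utf-8")
    for line in chunk.splitlines():
        if line.startswith("data:") and "[DONE]" not in line:
            data = json.loads(line[len("data:"):].strip())
            delta = data["choices"][0]["delta"].get("content", "")
            print(delta, end="", flush=True)
```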
Just one more example: a customer email for a motorcycle enthusiast named Julien with a couple of bikes. I try to provide a little more context to see what the model can do. You could write personalized emails for your customers, injecting context that you would retrieve from your database or your RAG pipeline. It's a really good writer, and you can try it with all kinds of things, like code generation, code explanation, and code debugging. All those things work super nicely.
Once you're done, please, as usual, don't forget to delete the endpoint and avoid unnecessary charges. It's not expensive, but you don't want to pay for things you're not using. If you want to be 100% sure, just go back to the SageMaker console, and yes, the endpoint is gone. That's really what I wanted to show you today. To wrap things up, I'm really excited about this Supernova Lite model, the best 8B open-source model available today, and you saw how straightforward it is to deploy it on AWS for under a dollar an hour. That's the on-demand price, and if you're an AWS customer, you can optimize that in different ways. The best 8B model with really amazing abilities, and this will be more than enough for a lot of use cases for under a dollar an hour. I hope you like that. Well, that's it for today. Thank you very much for watching, and there's much more coming with Inferentia and everything else. More models, there's always more. Until next time, my friends, you know what to do. Keep rocking!
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.