Hi everybody, this is Julien from Arcee. In this video, I'm very happy to introduce two new Arcee open-source models that have been distilled from DeepSeek v3. The first one is a 10 billion parameter model based on the Falcon architecture, and we call it Virtuoso Lite. The second one is a 32B model based on the Qwen 2.5 architecture. We call this one Virtuoso Medium V2 because we've already built Virtuoso Medium and I told you about this one a few weeks ago. So we're going to look at the benchmarks. You're going to see how amazing they are. And of course, then we're going to deploy those models on AWS and run a few tests. And I'll share some notebooks with you. Sounds good? Let's get started.
The two models are available now on Hugging Face, and of course, I'll put all the links in the video description. So let's first look at Virtuoso Lite. Virtuoso Lite is a 10 billion parameter model, and it was distilled from DeepSeek V3. I'll also put a link to our blog post that tells you a little bit about that process. This is based on Falcon3-10B, which itself uses a Llama-compatible architecture. So this is a really small model, 10 billion parameters, and as you will see in the demo, we're able to run it very easily on a single GPU instance in a very cost-effective way. And yet, this is a pretty powerful model, so don't let the small size fool you. This is a very, very high-quality model, and it's under the Apache 2.0 license, so you can go and build cool stuff with it.
We also have Virtuoso Medium V2. You may remember Virtuoso Medium, one of the models we released when we launched our inference engine around December time, and there's a video on the Virtuoso models again; I'll put the link in the description. This is the next version of it because we love to iterate quickly. This one is still a 32B model based on Qwen 2.5, just like Virtuoso Medium V1, and again, this one is distilled from DeepSeek V3 on a much larger dataset of over 5 billion tokens. I'll let you go through the details of those models in the blog post, where some pretty cool techniques have been applied to build these models.
What I really want to show you before we dive into the demo are the benchmarks. As usual, it will take a little while for those models to actually show up on the Hugging Face leaderboard, but we're evaluating them using the same benchmarks and the same procedure. So let's first look at how these models compare to other models we've built. Here in light blue, you see Arcee Nova 72B, which is a model we released in mid-2024. That was our original 72B model, open-source, still on Hugging Face. You can go and try it. Purple is Virtuoso Medium 32B, and pink is Virtuoso Lite 10B.
As you would expect, Virtuoso Medium V2 is the best model of the three, no surprise. What is impressive is how it outperforms Arcee Nova 72B, which, when we released it, was the best open-source model in its size category. It's pretty impressive that in a short timeframe, we have models in the 30B size range that easily outperform a much larger model like 72B. In fact, if you look at the 10B model, you can see that it's also outperforming the 72B model in quite a lot of benchmarks. This goes to show that when we're talking about small language models, we should actually say smaller and smaller language models because not only do we keep building better and better models, we also keep building them smaller and smaller. The combination of increasing performance and shrinking the size is amazing news for customers and organizations who want to build not only high-performance solutions but also cost-efficient solutions that give them actual ROI instead of just a big OpenAI bill at the end of the month.
Feel free to compare these to the top models on the leaderboard. In the interest of time, I'm not going to do this, but you will see that Virtuoso Medium V2 is outperforming not only Arcee Nova, which is about a year old, but also some other 70B models that are much more recent. More performance in a smaller package means more ROI for all AI builders out there. Look at the numbers, run your tests, and make up your own mind.
Now let's run the models. Here, I'm going to deploy the two models on AWS using Amazon SageMaker, and of course, the links to those notebooks will be in the video description. First, we import a whole bunch of things as usual, and we're going to deploy using the LMI container by AWS, which is powered by DJL Serving. Virtuoso Lite, as mentioned, is a 10 billion parameter model. The parameters are 16-bit, so we need about 20 gigs, plus a little extra room for the KV cache and all that good stuff. This should easily fit on a g6e.2xlarge instance. These are based on the L40S GPU, which has 48 gigs of RAM, so this should fit very, very easily. You might even fit this on a smaller GPU, but I love g6e; I think it's the best GPU instance family you can use for smaller language models on AWS. A really good cost-performance ratio.
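The sizing logic above (2 bytes per parameter at 16-bit precision, plus headroom for the KV cache) can be sketched as a quick back-of-the-envelope calculation; the 20% overhead factor here is an illustrative assumption, not a precise figure:

```python
# Rough memory estimate for serving a model in 16-bit precision:
# 2 bytes per parameter, plus headroom for the KV cache and runtime
# overhead. The 1.2x overhead factor is an illustrative assumption.
def serving_memory_gb(params_billions, bytes_per_param=2, overhead=1.2):
    """Approximate GPU memory (GB) needed to host the model."""
    weights_gb = params_billions * bytes_per_param  # 1e9 params * bytes -> GB
    return weights_gb * overhead

# Virtuoso Lite: ~10B parameters at 16 bits -> ~20 GB of weights,
# ~24 GB with headroom, well within an L40S GPU's 48 GB.
print(serving_memory_gb(10))  # 24.0
```

The same arithmetic explains later why the 32B model needs more than one GPU.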
Point to the model on the Hugging Face Hub, define the instance type, and then you can just go and create the endpoint: create the model object and call `model.deploy`. We've done this a bunch of times, and if it's the first time you see it, no worries; you can go and read the notebook, and feel free to ask questions in the comments. It took a few minutes to deploy, so now we have our endpoint and can query it. Let's ask the model to suggest names for a neighborhood pet food store and run synchronous inference, generating the full answer before printing it out.
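To make the request concrete, here is a minimal sketch of what the chat-style payload and response parsing look like, assuming the OpenAI-compatible chat-completions schema mentioned in the demo. The response below is a canned example for illustration, not actual model output:

```python
import json

# Sketch of an OpenAI-style chat request for the endpoint.
payload = {
    "messages": [
        {"role": "user",
         "content": "Suggest names for a neighborhood pet food store."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}
body = json.dumps(payload)  # what you'd send via predictor.predict(...)

# Canned response in the OpenAI chat-completions shape, for illustration.
sample_response = {
    "choices": [
        {"message": {"role": "assistant",
                     "content": "1. Pawsome Pantry\n2. The Kibble Corner"}}
    ]
}
answer = sample_response["choices"][0]["message"]["content"]
print(answer)
```

Because the schema matches OpenAI's, extracting the answer from `choices[0].message.content` works the same way it would with the OpenAI client.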
While we do that, let's take a look at the price of that g6e instance. The on-demand price is about $2.20 an hour, which is actually the EC2 price; SageMaker is a little more expensive. You could probably deploy it on a g6e.xlarge, which is even cheaper, with the same GPU but a little less RAM; it should still fit. This is very cost-effective, and if you go for reserved instances, you can probably run this for less than a dollar an hour, which is hard to beat. We generated our answer, and we can print it out. We see the OpenAI format, which is nice because if you're using OpenAI today, you can minimize the amount of application code rewriting. You'll probably need to adapt the prompts a little bit, but you won't have to change how your apps invoke the model. Just switch the URL.
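For a rough sense of what "cost-effective" means here, this sketch turns the quoted on-demand hourly rate into an always-on monthly figure (the ~$2.20/hour price is the one cited in the video; check current AWS pricing for your region):

```python
# Back-of-the-envelope monthly cost for an always-on endpoint,
# using the ~$2.20/hour on-demand EC2 price quoted in the video.
hourly = 2.20
monthly_always_on = hourly * 24 * 30  # 720 hours in a 30-day month
print(f"${monthly_always_on:,.2f} per month")  # $1,584.00 per month
```

Reserved pricing or scaling the endpoint down during quiet hours would bring that figure down considerably.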
Now let's try streaming. Just set the streaming option to `true`, and we have a small utility function to retrieve tokens as they are generated. Let's write a marketing email. Oh, emojis. This is nice; we could say no emojis if you want something a little more enterprise-compatible, but you can also see the speed of generation, which is more than adequate even on a small instance like that. Of course, if you scale up to a g6e.12xlarge with four GPUs or even a g6e.48xlarge with eight GPUs, it would go even faster. Generally, I recommend scaling out, not scaling up. You will get more scalability with a SageMaker endpoint backed by several g6e.2xlarge instances, scaling out and in according to traffic, than by trying to run everything on a single large instance, which will be more expensive and harder to scale down when traffic is low. The fact that you have eight GPUs doesn't mean it will run 8x faster, so you're better off running eight of these than one of those. Trust me.
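Under the hood, the utility function mentioned above typically reassembles the answer from small "delta" chunks, since OpenAI-compatible servers stream the response as a sequence of JSON fragments. Here is a minimal sketch with canned chunks standing in for the live HTTP stream:

```python
import json

# Canned chunks in the OpenAI-compatible streaming shape; with a live
# endpoint these would arrive incrementally over the HTTP stream.
chunks = [
    '{"choices":[{"delta":{"content":"Hello"}}]}',
    '{"choices":[{"delta":{"content":", "}}]}',
    '{"choices":[{"delta":{"content":"world!"}}]}',
]

pieces = []
for line in chunks:
    delta = json.loads(line)["choices"][0]["delta"]
    pieces.append(delta.get("content", ""))  # some chunks omit "content"

print("".join(pieces))  # Hello, world!
```

In a real client you would print each piece as it arrives, which is what makes the generation feel instant.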
Let's try something else. A technical question. Again, you can see this is more than fast enough. And let's try the motorcycle dealership email, my favorite. Ah, more emojis. There you go. So that's super easy to try. Just open your SageMaker, grab my notebook, run it, and give it a shot. And of course, when you're done, please delete your instance to avoid unnecessary charges. Ask your questions in the video description or ping me by email.
All right, so that's Virtuoso Lite, our cool new model with amazing benchmarks. Now let's take a look at Virtuoso Medium V2. Same story; we're going to run this in the same container. This is a bigger model, so it won't fit on a single GPU. The next size up is g6e.12xlarge with four GPUs. AWS wishlist item: instances with two GPUs. I'm not holding my breath, but that would be awesome; maybe I'll get lucky, I don't know. So we have to go to four, but that's okay, because even four GPUs are pretty cost-effective.
We deploy this model in exactly the same way: build a model object, call `model.deploy`. This is why I like SageMaker; I can copy-paste those notebooks and just change the model name. Super simple. Let's go to streaming inference directly to see how fast this is. Let's try the marketing email again. This is still plenty fast. Now we're leveraging those four GPUs. And now you see why it's so important that we raise the bar on model quality and accuracy while shrinking them to smaller and smaller sizes. Maybe a year ago, to get that kind of performance or generation quality, you would have needed a 70B model, which would have been much bigger, bulkier, and slower. We didn't have g6e a year ago, so who knows what kind of instance we would have needed, maybe a P4, which is way more expensive and often difficult to get. This is great. Same quality, faster, more cost-effective on smaller instances that are probably easier to procure.
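The same sizing arithmetic used for Virtuoso Lite shows why the 32B model needs the four-GPU instance: at 16 bits the weights alone exceed a single L40S's 48 GB, so the serving container shards them across GPUs (tensor parallelism). A minimal sketch of the numbers:

```python
# Why Virtuoso Medium V2 (32B) needs a multi-GPU instance:
# at 16 bits, the weights alone are ~64 GB, more than one L40S holds,
# so they are sharded across the four GPUs of a g6e.12xlarge.
params_b = 32               # billions of parameters
weights_gb = params_b * 2   # 2 bytes per parameter at 16-bit precision
gpu_mem_gb = 48             # one L40S GPU
num_gpus = 4                # g6e.12xlarge

per_gpu_gb = weights_gb / num_gpus
print(f"{per_gpu_gb} GB of weights per GPU")  # 16.0 GB of weights per GPU
```

With only 16 GB of weights per GPU, there is plenty of room left on each card for the KV cache, which is what keeps generation fast.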
That's what I wanted to tell you tonight. The models just hit Hugging Face about 15 minutes ago, and I couldn't wait. I'm super excited to bring these models to all of you out there in the community. Again, look at those benchmarks; this is pretty amazing. Do your homework, don't trust me, and you will see that these models are some of the best available out there, particularly in this size range. The question you're probably asking yourself is, "Okay, that's DeepSeek V3, where's DeepSeek R1?" The only thing I'm going to say is, we're just getting started. Until next time, my friends. As always, you know what to do. Keep rocking.
Tags
Arcee, Virtuoso Lite, Virtuoso Medium V2, DeepSeek V3, AWS SageMaker
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.