Hi everybody, this is Julien from Arcee. With the release of our first foundation model, AFM-4.5B, we've decided to release on Hugging Face three models that were commercial-only until now. You can now download them, test them, and deploy them any way you like. That's what I'm going to show you in this video. The three models are Supernova, Virtuoso Large, and Caller. We'll discuss all three, and I'm going to run them on Together AI. Let's get started.
First, we'll discuss Supernova. Supernova was built in September last year. It's based on the Llama 3.1 70B architecture. The most interesting thing about Supernova is that it was distilled from Llama 3.1 405B into the 70B flavor, with additional merging on top. The model is released under the Llama license and is a very good model for general-purpose applications. It has a broad range of abilities. If you're interested, you can read the blog posts we released back then. There are benchmarks you may find interesting. At the time of release, Supernova was on par with or better than the teacher model, as well as LLMs like Claude 3.5 and GPT-4o. So, it's a very capable model within the 70B category. There's another blog post on the training pipeline. If you're interested in distillation and model merging, there's good information on how we built the model. As usual, links are in the video description. Supernova is on Hugging Face. We're not providing a hosted version on Together AI or elsewhere because we feel Virtuoso Large, the next model I'm going to discuss, is actually a better model. However, if you're interested in a Llama-based model for licensing or other reasons, Supernova is still a very good model to work with.
Let's talk about Virtuoso Large. Virtuoso Large is a 72-billion-parameter model based on Qwen 2.5, released under the Apache 2.0 license. In a way, it's our flagship model in the 70B-72B class. Again, very capable and a good general-purpose model for creative writing, data understanding, and so on. We recommend it if you're looking for a 70B-size model for general-purpose applications. This model is available in our SaaS platform, Arcee Maestro. You'll find some good demos on my channel. You can work with our SLMs directly, and we see Virtuoso Large here, or you can work with model routers for general-purpose queries, reasoning queries, or function-calling queries. Depending on the query, we'll use one of our small language models, possibly Virtuoso Large, or we'll use one of those external models only if we can justify the cost. Go check out Arcee Maestro, and you'll find videos on my channel. But I've shown you this before, so instead, I'm going to show you how to work with Virtuoso Large, and our other models generally, on Together AI. Sign up for Together AI, and within minutes you'll be able to start testing. Let me open my notebook for Virtuoso Large and show you what we can do with this one.
The first step is to install the Together client package and make sure we have our API key. You can get an API key when you sign up for Together AI. Here, I'm storing it in an environment variable; please don't hard-code keys into your notebooks. Let me show you different ways you can work with the models. The first good thing to try is plain prompting, with no prompting format. Let's call this basic text completion. It's a good way to see how the model behaves: the tone, whether it uses bullet points or a more narrative mode, and so on. Let's run this with a prompt about transfer learning and see what we get. We got a response with some bullet points. It got truncated because I asked for just a few tokens. That's the first thing you may want to try.
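Here's a minimal sketch of that first step, assuming the `together` package (`pip install together`) and a `TOGETHER_API_KEY` environment variable. The model identifier and the prompt are illustrative assumptions; check the Together model catalog for the exact string.

```python
import os
from together import Together

# Assumes TOGETHER_API_KEY is set in the environment (never hard-code keys).
client = Together(api_key=os.environ["TOGETHER_API_KEY"])

# Basic text completion: no chat template, just a raw prompt.
# The model identifier below is an assumption; check the Together catalog.
response = client.completions.create(
    model="arcee-ai/virtuoso-large",
    prompt="Explain transfer learning in a few sentences.",
    max_tokens=64,  # deliberately small, so truncation is expected
)
print(response.choices[0].text)
```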
Now, let's use proper chat. Using the same API, let's give it a little more room and a bit more creativity. We'll use the prompting style you're familiar with, OpenAI-style prompting: a system prompt and a user prompt, and see what this gives us. This usually works better because that's how the model was trained. We get a good answer, bullet points, very clear, and very fast too, just three seconds. That's because Together AI is fast, and we're only working with a 70B-class model. Whether that still counts as a small language model is an interesting discussion, but it's certainly smaller than the OpenAI or Anthropic models.
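A sketch of the chat-style call, reusing the `client` from above; the system prompt, user prompt, and generation settings are illustrative.

```python
# OpenAI-style chat prompting: a system prompt plus a user prompt.
response = client.chat.completions.create(
    model="arcee-ai/virtuoso-large",  # assumed identifier
    messages=[
        {"role": "system", "content": "You are a helpful technical assistant."},
        {"role": "user", "content": "Explain transfer learning with a short bullet list."},
    ],
    max_tokens=512,   # a little more room than before
    temperature=0.7,  # a bit more creativity
)
print(response.choices[0].message.content)
```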
Let's look at streaming. We can stream the response, printing each chunk as it becomes available. The prompt is: explain the concept of gradient descent in three or four paragraphs. The time to first token is very fast. We got about 2,000 tokens, and it's a very good answer. We could keep prompting, but if you're using OpenAI or Anthropic today, you'll notice how fast this is, both in time to first token and in general text-generation speed. Let's do a little more prompting and ask it to follow a certain format: headers, bullet points, and code examples, to explain k-means clustering in Python. We're not streaming here. The model closely follows the instructions, giving the different steps with bullet points and step-by-step code. Virtuoso Large is doing a good job here at following instructions. We do see truncation, so we need to increase the max_tokens value.
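Here's what the streaming variant could look like, again reusing the same `client`; the prompt and token budget are illustrative.

```python
# Streaming: print each chunk as soon as it arrives.
stream = client.chat.completions.create(
    model="arcee-ai/virtuoso-large",  # assumed identifier
    messages=[
        {"role": "user",
         "content": "Explain the concept of gradient descent in three or four paragraphs."},
    ],
    max_tokens=2048,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```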
Let's do multi-turn. Here, the first question was what a neural network is, and let's suppose this is the previous answer. We come back with, "What are some common activation functions?", simulating a chat session. We get a pretty interesting answer with LaTeX formatting, which would need to be rendered properly in our application. The model is trying to give detailed information, math equations, etc.
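A minimal multi-turn sketch: we replay the first exchange and then ask the follow-up. Here, `previous_answer` is assumed to hold the reply from the first turn.

```python
# Multi-turn: include the earlier exchange in the message history,
# then ask the follow-up question.
messages = [
    {"role": "user", "content": "What is a neural network?"},
    {"role": "assistant", "content": previous_answer},  # answer from the first turn
    {"role": "user", "content": "What are some common activation functions?"},
]
response = client.chat.completions.create(
    model="arcee-ai/virtuoso-large",  # assumed identifier
    messages=messages,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```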
Finally, here's how you would work with the model if you didn't want to use the Together AI client, using the `requests` library or another plain HTTP/HTTPS library. As usual, put your key in the headers and use `requests.post` with the URL and headers. This is a synchronous call, so no streaming, but it's still pretty fast. That's how you would work with Virtuoso Large on Together AI. It's well-behaved, good with code, and comfortable with math and a ton of other tasks. I use this model a lot, even locally as a quantized model on my machine. It's a very capable model.
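A sketch of the plain-HTTP approach against Together's OpenAI-compatible endpoint; the model identifier and prompt are illustrative.

```python
import os
import requests

# Plain HTTPS call, no Together client needed.
url = "https://api.together.xyz/v1/chat/completions"
headers = {
    "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}",
    "Content-Type": "application/json",
}
payload = {
    "model": "arcee-ai/virtuoso-large",  # assumed identifier
    "messages": [{"role": "user", "content": "Summarize what model distillation is."}],
    "max_tokens": 256,
}
# Synchronous call: the full response comes back at once, no streaming.
response = requests.post(url, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```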
Let's switch to the Caller model. As the name implies, Caller is optimized for function calling and tool calling. It's based on Qwen 2.5 32B and released under Apache 2.0. At only 32B, it's a good fit for automation, at-scale workflows, and agentic apps. 32B is small enough to run on a small GPU instance, and you can use quantized versions for even more optimization. This matters because function calling and tool calling require multiple SLM invocations, as we'll see in the demo. The more calls you make, the more you'd be penalized by a slow or expensive model. Working with a small language model here makes a lot of sense on both fronts. Let's open a notebook to see how this one works. Caller is also available on Arcee Maestro, where it's called Caller Large, which is the same model. You can also find it on Together AI.
This is a demo you may have seen if you watched my previous video on Caller, which I did when the model came out a few months ago. Here, I'm using the Yahoo Finance API to leverage the function calling capability of Caller to invoke several APIs on Yahoo Finance, retrieve information about listed companies, and answer finance-related prompts. As before, we need the package, import it, set the key, and implement the three functions the model will work with. These match Yahoo Finance APIs: `getStockPrice`, `getCEOName`, and `getCompanySummary`. These are self-explanatory. Let's test them first. Get the Amazon price from Yahoo, information about Ford, and information about Verizon. I can call those Yahoo APIs from my notebook. Great. Now, let's see how we can connect them to the model.
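Here's a sketch of what those three helpers could look like, assuming the `yfinance` package (`pip install yfinance`); the exact field names returned by Yahoo Finance can vary, so treat the lookups as illustrative.

```python
import yfinance as yf

# Three helper functions the model can ask us to call.
# Field names like "currentPrice" may vary by ticker and yfinance version.
def getStockPrice(company_name: str, stock_symbol: str) -> str:
    info = yf.Ticker(stock_symbol).info
    price = info.get("currentPrice") or info.get("regularMarketPrice")
    return f"The current price of {company_name} ({stock_symbol}) is {price} USD."

def getCEOName(company_name: str, stock_symbol: str) -> str:
    officers = yf.Ticker(stock_symbol).info.get("companyOfficers", [])
    ceo = next((o["name"] for o in officers if "CEO" in o.get("title", "")), "unknown")
    return f"The CEO of {company_name} ({stock_symbol}) is {ceo}."

def getCompanySummary(company_name: str, stock_symbol: str) -> str:
    return yf.Ticker(stock_symbol).info.get("longBusinessSummary", "No summary found.")

# Quick sanity checks, as in the notebook.
print(getStockPrice("Amazon", "AMZN"))
print(getCompanySummary("Ford", "F"))
print(getCompanySummary("Verizon", "VZ"))
```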
We use the OpenAI-style function calling API. In this big JSON object, we define the functions: `getStockPrice`, `getCEOName`, and `getCompanySummary`. For each, we have a description, sample questions, and parameters: the company name, which I provide in the prompt, and the stock symbol, which the model has to figure out. If I ask a question about Apple, I expect the model to map that to the AAPL ticker. We define these three tools, plus a function that handles the actual function calling. We send our prompt and pass the list of tools in the context. Caller decides which tool to use, if any; we could get zero, one, two, or three different function calls. Once we've invoked Caller, we look at the response, extract any tool calls it requested, and call the appropriate function with the parameters.
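Here's a condensed sketch of that round trip, reusing the `client` and the helper functions above. The tool schema, descriptions, and the Caller model identifier are illustrative assumptions; only one tool is spelled out, the other two follow the same pattern.

```python
import json

# OpenAI-style tool definitions passed to Caller (one shown, the other two are analogous).
tools = [
    {
        "type": "function",
        "function": {
            "name": "getStockPrice",
            "description": "Get the current stock price of a listed company. "
                           "Example: 'What is Apple's stock price?'",
            "parameters": {
                "type": "object",
                "properties": {
                    "company_name": {"type": "string", "description": "Company name from the prompt"},
                    "stock_symbol": {"type": "string", "description": "Ticker symbol inferred by the model"},
                },
                "required": ["company_name", "stock_symbol"],
            },
        },
    },
    # getCEOName and getCompanySummary are defined the same way.
]

AVAILABLE_FUNCTIONS = {
    "getStockPrice": getStockPrice,
    "getCEOName": getCEOName,
    "getCompanySummary": getCompanySummary,
}

def call_with_tools(prompt: str) -> list[str]:
    """Send the prompt to Caller, run any requested tool calls, return their raw outputs."""
    response = client.chat.completions.create(
        model="arcee-ai/caller",  # assumed identifier for Caller on Together AI
        messages=[{"role": "user", "content": prompt}],
        tools=tools,
        tool_choice="auto",
    )
    results = []
    # Zero, one, or several tool calls may come back.
    for tool_call in response.choices[0].message.tool_calls or []:
        func = AVAILABLE_FUNCTIONS[tool_call.function.name]
        args = json.loads(tool_call.function.arguments)
        results.append(func(**args))
    return results
```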
The model will only tell us to call, for example, `getCEOName` with those parameters. Let's make sure this is defined. This works well, but what we get in the end is the raw output from the API, which may or may not be enough to answer the question. Hopefully, it's useful, and the model picked the right answer, but maybe it's not enough information. Most of the time, we'll call two models. First, we call Caller for function calling, retrieve the output from the function call, and then generate a better, more comprehensive response with another model. In this case, I'll use Virtuoso Large. It's a two-step process: call Caller, figure out which function to call, call it, retrieve that data, and then pass the response from the API to the model for enrichment and extra guidelines, like showing URLs.
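A sketch of that two-step pipeline, building on the `call_with_tools` helper above; the system prompt and model identifier are illustrative.

```python
def answer(prompt: str) -> str:
    """Two-step pipeline: Caller picks and runs the tools, Virtuoso Large writes the final answer."""
    # Step 1: Caller decides which function(s) to call; we execute them locally.
    tool_outputs = call_with_tools(prompt)

    # Step 2: pass the raw API output to Virtuoso Large for a readable, enriched answer.
    context = "\n\n".join(tool_outputs)
    response = client.chat.completions.create(
        model="arcee-ai/virtuoso-large",  # assumed identifier
        messages=[
            {"role": "system",
             "content": "Answer the user's question using the context below. "
                        "Include relevant URLs if any appear in the context.\n\n" + context},
            {"role": "user", "content": prompt},
        ],
        max_tokens=1024,
    )
    return response.choices[0].message.content

print(answer("Who is the CEO of General Motors?"))
```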
We need streaming. Let's try some prompts. First, who's the CEO of General Motors? Caller says to call `getCEOName` with arguments General Motors and GM. The ticker was correctly inferred by Caller. Next, we call the local Python function, which calls the Yahoo API, and that's what we get. This is the raw output from the Yahoo API. For a simple question, that might be enough, but we can pass this to Virtuoso Large for a better answer. Yes, the CEO is Mary Barra, but Virtuoso Large adds more information. If you add a knowledge base, you could connect it through a router to make it even better, using not only the model knowledge but also external sources.
Let's try something more challenging: does 3M make filtration products for the automotive industry? Let's see what Yahoo says. Calling `getCompanySummary` gives us the company summary, but it's not exactly what I want; I want a clear answer. That's why Virtuoso Large is useful again. It takes the context returned by the API and writes a better answer, confirming that 3M does make filtration products for the automotive industry.
Let's do another one to demonstrate multiple calls. I want to know on which products P&G and J&J compete the most. That's two company names, so Caller asks us to call the function twice, once for Procter & Gamble and once for Johnson & Johnson. Function calling isn't limited to a single call; it can trigger several different calls. We get the two company summaries and ask Virtuoso Large to unpack them for us. It says the information is based on the company descriptions provided in the query; maybe I need to fix my prompt, since that information actually came from two different tool calls. Still, the context had the right information, so we get a direct answer, an analysis, and resources for the two companies. Definitely a much better answer.
This is how you use Caller on Together AI for function calling. Feel free to grab the notebook and tweak it for your own particular use case. That's what I wanted to show you. Three formerly commercial models are now available on Hugging Face for you to experiment with: Supernova, based on Llama 3.1 70B, a very good general-purpose model with a Llama license; Virtuoso Large, probably more capable than Supernova, based on Qwen 2.5 and available on Together AI; and Caller, a very good 32B model for function calling and tool calling. Go grab them from Hugging Face or use them on Together AI and build cool stuff with them. That's it for this video. I hope you liked it. More coming as always, and until next time, keep rocking.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.