Arcee Conductor advanced model routing selects the best SLM or LLM automatically

March 05, 2025
In this video, we introduce Arcee Conductor (https://www.arcee.ai/product/arcee-conductor). This new Arcee inference platform intelligently routes any query to the best model, efficiently delivering precise and cost-effective results for any task. We first look at the list of available SLMs and LLMs in the Conductor user interface and compare some of them on a couple of prompts, looking at speed, quality, and price. Then, we run more prompts programmatically with the OpenAI API. Sign up and get $20 of free inference credits! If you'd like to understand how Arcee AI can help your organization build scalable and cost-efficient AI solutions, don't hesitate to get in touch with sales@arcee.ai or book a demo at https://www.arcee.ai/book-a-demo.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. You can also follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

* Arcee Conductor: https://conductor.arcee.ai
* Arcee Conductor product page: https://www.arcee.ai/product/arcee-conductor
* Arcee Model Engine video: https://youtu.be/yVlHEjlIZVY
* Notebook: https://github.com/juliensimon/arcee-demos/blob/main/model-engine/test-model-engine-auto.ipynb

Transcript

Hi everybody, this is Julien from Arcee. In this video, I'm extremely happy to introduce a new Arcee service called Arcee Conductor. Conductor is a new inference platform that automatically sends your prompt to the best and most cost-efficient model, picking that model from a list of Arcee small language models as well as closed LLMs from other providers. The end result? You get the best, fastest, and cheapest model available, delivering huge savings in terms of latency and cost. First, I'm going to show you Conductor in the user interface, displaying the list of models. Then, I'll demonstrate how you can easily compare the models available in Conductor, looking at latency, relevance, and price. Lastly, I will show you how to use Conductor programmatically with the OpenAI-compatible API. Let's get started. At the time of recording, Conductor is not out yet. We're going to launch in about a week. You can get early access by filling out a simple form; of course, I will put the link in the description.

So now, let me switch to Conductor and show you the list of available models. Here is the Conductor UI. By default, we'll leave the model set to automatic, which points to our router model, but all these models are available. Here, you see the models already available in Model Engine, which is our inference platform; if you're new to Model Engine, I'll put the link to my video in the description. We have coding models, function-calling models, general-purpose language models in different sizes, and so on. These are extremely cost-efficient because they are small models, yet they have excellent performance. I've talked about these models many times already. We also have the ability to use external models like Claude, DeepSeek, or Gemini, or some of the OpenAI models. These have stronger abilities for advanced reasoning or really complex questions, but they are much more expensive.

You don't want to use these fancy and expensive models for 100% of your prompts: they would do a good job, but they would be too expensive. That's where Conductor comes into play. It solves the problem of picking the right model for the job every single time. Today, some customers work with those closed LLMs because they have complex prompts, but those may be just a fraction of the prompts they have to deal with, so they waste a lot of money on everything else. Simpler prompts and questions should go to more cost-effective models, and only the most complex and advanced prompts should go to one of the fancier LLMs. Until now, you had to compromise: some people use smaller models and don't always get the best answers, while others use larger models all the time and pay way too much overall. With Conductor, you don't have to compromise anymore.

Before we run some live examples, let's compare the output of some of those models. My question is, "Explain logits-based distillation versus hidden-state distillation." It's a technical question with not a lot of detail or context. Model A is Virtuoso Small, and Model B is GPT-4o. First, let's look at the answers. I'll leave them on screen so you have time to read them. The answers are pretty similar: each model explains what logits-based distillation is, with GPT-4o's answer a bit more structured and Virtuoso Small's a bit more narrative. For hidden-state distillation, we get a bit of a comparison at the end. Virtuoso Small generated the answer in 5.2 seconds, while GPT-4o took about 13 seconds. The output was a bit longer for GPT-4o.

Looking at cost, there's a huge difference: GPT-4o is 188 times more expensive. Its answer looks a little better, but would you pay 188 times more for that answer compared to the one from Virtuoso Small? That's the whole point. If you're working with a single model, you either optimize strictly on cost and work with a small model, or you use one of those fancier, more expensive closed LLMs and, as you can see, pay 100 to 200 times more on some queries. At the end of the month, with hundreds or thousands of users, imagine the difference in dollar amount. That's precisely what Conductor solves for you, as you don't want to figure that out on a prompt-by-prompt basis.
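For readers less familiar with the two techniques the prompt asks about, here is a minimal, illustrative PyTorch sketch of the two distillation losses. The temperature, tensor shapes, and projection layer (for example, projecting BERT-Base's 768-dimensional hidden states to BERT-Large's 1024) are placeholder assumptions for illustration, not code shown in the video:

```python
import torch
import torch.nn.functional as F

def logits_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Logits-based distillation: make the student match the teacher's softened
    # output distribution via KL divergence (temperature is an assumed hyperparameter).
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

def hidden_state_distillation_loss(student_hidden, teacher_hidden, projection):
    # Hidden-state distillation: align intermediate representations, projecting the
    # student's hidden size up to the teacher's (e.g., 768 -> 1024) before the MSE.
    return F.mse_loss(projection(student_hidden), teacher_hidden)

# Toy tensors standing in for one batch of teacher/student outputs.
student_logits, teacher_logits = torch.randn(4, 30522), torch.randn(4, 30522)
student_hidden, teacher_hidden = torch.randn(4, 128, 768), torch.randn(4, 128, 1024)
projection = torch.nn.Linear(768, 1024)

print(logits_distillation_loss(student_logits, teacher_logits))
print(hidden_state_distillation_loss(student_hidden, teacher_hidden, projection))
```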
Let's look at a second example. This time, I'm asking, "Write a short welcome message for a new user of Model Engine, Arcee's inference platform." On the left, I'm sending this to Conductor, and on the right, to GPT-4o. The answers are equivalent. Conductor automatically selects Virtuoso Medium, a 32-billion-parameter model, with a response time of 184 seconds and a very low cost. The reasoning for the selection is provided: given the task's moderate complexity, general text generation requirements, and the computing domain, Virtuoso Medium is an appropriate choice. GPT-4o was a bit faster, but it gave a shorter answer, and the cost is 18 times higher. Why pay 18 times more for an equivalent answer?

Now, let's run a couple of live examples. Let's try this prompt: "Explain the pros and cons of CPU inference and GPU inference for small language models." We get a nicely structured answer with pros, cons, and a detailed summary. We used Virtuoso Large, a 72-billion-parameter model, which is still very cost-effective compared to larger models. This is the key value proposition of Conductor: selecting the right model for the job, so you no longer have to compromise between optimizing for cost and optimizing for capability.

Let's run some programmatic examples with the OpenAI API and different prompts. As usual when working with Model Engine, we can reuse the OpenAI client. We just have to set the endpoint to models.arcee.ai, make sure we have an API key, and set the model to auto. If I wanted to pin this to Virtuoso Large, I could do that, but I want to use Conductor, so my model is the automatic one picked by Conductor. Let's create the client and run our prompts. This is one we ran before: "Write a short welcome message." I have a small utility function that prints the streaming response, the number of tokens, and the model that was picked. This is quite simple.
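As a rough illustration of that setup, here is a minimal sketch of an OpenAI-compatible client pointed at the endpoint mentioned above, with a streaming helper similar to the one described. The exact base URL path, the ARCEE_API_KEY environment variable, and the stream_response helper name are assumptions for illustration; the actual notebook is linked above.

```python
import os
from openai import OpenAI

# OpenAI-compatible client; the base URL path and environment variable name
# are assumptions based on the endpoint mentioned in the video.
client = OpenAI(
    base_url="https://models.arcee.ai/v1",
    api_key=os.environ["ARCEE_API_KEY"],
)

def stream_response(prompt, model="auto"):
    # Print the streaming completion, then the token count and the model Conductor picked.
    stream = client.chat.completions.create(
        model=model,  # "auto" lets Conductor route the prompt
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        stream_options={"include_usage": True},
    )
    model_used, usage = None, None
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
        model_used = chunk.model or model_used
        usage = chunk.usage or usage
    print()
    if usage:
        print(f"Tokens: {usage.total_tokens}")
    print(f"Model picked: {model_used}")

stream_response("Write a short welcome message for a new user of Model Engine.")
```

Passing a specific model name instead of "auto" would pin the request to that model rather than letting Conductor route it.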
Let's try a slightly more complicated one: "Explain what put and call options are and when they are used, and show a simple example for both." This time, Conductor picked Virtuoso Large because the question is more complex, requiring deeper explanations and examples.

Let's try another one: "Explain the difference between logits-based distillation and hidden-state distillation. Show an example for both with PyTorch code, using BERT-Large as the teacher model and BERT-Base as the student model." Now we're talking. This time, we're using something else: the model is thinking, having an internal discussion with itself, which is interesting. This is not going to be one of the Virtuosos. It's generating the code, and it looks like good code. The answer is very detailed, and you could probably run it out of the box. We generated over 8k tokens, and the model was DeepSeek R1. The trade-off is clear: this is a complex question involving language, code, and reasoning. DeepSeek R1 does a good job, but it's a long answer and more expensive. That's fine; use the bigger, more expensive models when they're really needed and the small, cost-efficient models for everything else.

Let's try a code generation example: "Write a Python function that prints a streaming response from an OpenAI API call." This time, we used GPT-4o. We could also ask it to write an SQL query. Now, let's do some writing. Here, I'm loading "Alice in Wonderland" and asking the model to write a psychological profile of the main characters, supported by direct, relevant quotes from the text. This could take a while. After some thinking, the model answers the question, writing profiles for the main characters and quoting from different chapters. This is DeepSeek, and it took 49 seconds.

We could go on forever, but I think you get the idea. Conductor picks the right model for the job in terms of quality and cost efficiency, so you don't have to make that decision yourself, and you don't have to compromise between cost efficiency and quality: you get the best of both worlds. This is a super exciting service that solves a problem customers have faced for a long time. Please go to our webpage, sign up for early access, start testing Conductor, and let us know what you like and what you don't like. We're happy to take your feedback. That's it for today. Thanks for watching. Until next time, keep rocking.

Tags

Arcee Conductor, Model Inference Platform, Cost-Efficient AI Models, Automated Model Selection, OpenAI-Compatible API

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.