Arcee AI webinar: pick the right SLM/LLM for each query with Arcee Conductor

March 26, 2025
In this video, we introduce Arcee Conductor (https://www.arcee.ai/product/arcee-conductor). This new Arcee inference platform intelligently routes any query to the best SLM/LLM, efficiently delivering precise and cost-effective results for any task. If you’d like to understand how Arcee AI can help your organization build scalable and cost-efficient AI solutions, don't hesitate to contact sales@arcee.ai or book a demo at https://www.arcee.ai/book-a-demo.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. You can also follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

* Arcee Conductor: https://conductor.arcee.ai
* Arcee Conductor product page: https://www.arcee.ai/product/arcee-conductor

Transcript

Looking forward to that for sure. Okay, I think we're live, right? Let's give it a few seconds. There's always a bit of a delay. But yeah, we see the red light. So we are live. Good afternoon, everybody. The Arcee Avengers are back with another action-packed webinar. My name is Julien. I'm the chief evangelist for Arcee, and today I have three fantastic guests. In no particular order, we have Abhishek and his fancy hat. Abhishek recently joined us; you may remember him from Hugging Face and Kaggle, all kinds of amazing things, and he's now working on Conductor, so we'll get into that in a second. We have Lucas, who was already with me last time. Lucas, nice to have you back. Good to be back, Julien. Thank you. Lucas is co-leading Arcee Labs, our research arm. And we have Fernando. He works on a daily basis, sometimes on an hourly basis, and sometimes on a minute-by-minute basis with Lucas to build all our cool products and models. Fernando, we're very happy to have you as well. Glad to be here. And we're all over the world too, right? I'm still based out of Paris. Abhishek, where are you? You're in Norway. Are you on mute? Yeah, I think Abhishek is in Norway. Lucas, you're in the US. And Fernando, where are you? São Paulo, Brazil. Nice. You're the lucky one. Okay, so today we are going to talk about our latest launch: Arcee Conductor. Chances are, if you are watching this, you probably saw our announcements, maybe you're already using Conductor, and today we're going to show you Conductor in detail. Lucas is going to jump into a demo in a few minutes. And of course, we'll discuss how we built it and answer all your questions. So you have a chat box, feel free to use it, and we'll try to keep an eye on it during the conversation and answer as many questions as we can. Okay, guys, ready? Let's get started. So, Lucas, can you explain Conductor in 60 seconds? Yes. Conductor is our way of giving customers the best of both worlds. We have a lot of really powerful small language models that are smaller, cheaper, and faster, for the most part, than what you'll get from the leading closed-source APIs. What we noticed in our day-to-day usage of these models for training and evaluation is that they solve 50 to 60% of the real-world tasks people actually want done. That excludes the AI hype bubble of engineers, coders, and others doing tasks that smaller, non-frontier models are less skilled at. For the average LLM use case, you could use our models and save yourself a lot of money. But instead of going to people and saying, "Hey, you should just use us instead," we recognize that brand identity and comfort matter: you put a lot of your day-to-day workflow through these systems, and they become part of your career and livelihood. So the pitch is: use us in addition to the others. The easiest way to do that was to train a tiny classifier that is, for all intents and purposes, a complexity classifier: it estimates how difficult a query would be for an LLM to answer, and a ranking system on top decides which model to use. The promise is that you hit Conductor with your queries, the router model analyzes each one, and it sends the query to the best and most cost-effective model, only going to the fancy but expensive LLMs when necessary. So why don't we show Conductor right away and make it very pragmatic for the audience. Lucas, feel free to share your screen, run some prompts, and walk us through the Conductor UI.
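To make the idea concrete before the demo, here is a minimal sketch of the routing pattern Lucas describes: a small classifier scores query complexity, and a threshold policy picks the cheapest adequate model. The scorer, model names, and thresholds are illustrative assumptions, not Arcee's actual implementation.

```python
# A conceptual sketch of complexity-based routing, not Arcee's implementation.
# "score_complexity" stands in for Conductor's tiny router model (hypothetical),
# and the model names and thresholds below are illustrative.
def route(query: str, score_complexity) -> str:
    score = score_complexity(query)  # assumed to return a complexity score in [0, 1]
    if score < 0.4:
        return "blitz"           # small, cheap, fast model for easy queries
    if score < 0.7:
        return "virtuoso-large"  # mid-size model for moderate queries
    return "claude-sonnet"       # frontier model only when necessary

# Toy stand-in scorer for demonstration: longer queries count as more complex.
print(route("Explain the keto diet.", lambda q: min(len(q) / 500, 1.0)))
```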
We'll talk about the API too, but just give us a quick sense of what the service looks like. Yeah, I don't know. Are we able to verify that people can see my screen? Yes. Okay. Good. I see it at least. Well, good. That's all I care about. So I will go here so you guys don't have to see all of my chaotic testing prompts. But I'll just take a few that I've thought of. Here's one. This is a pretty decent prompt that is not hyper-specific. There's not a single model that I would think of that is particularly strong at explaining the keto diet. Typically, what you would get from a Wikipedia page is what the model would provide. So I'm going to send this prompt, and you'll see the model classify it and route it to the best model for this task. At the top left, we see "auto." That's a drop-down box showing we're in Conductor mode. We could go and hit "Blitz" or "Verge," or choose a large model if we wanted to, but here we're just hitting the router. Which model did we pick this time? Okay, this one went to Blitz, which is correct: it's a low-complexity task. In the UI, we have a nice little summary generated to tell you why a certain model was chosen. Blitz is one of our recent models, a 24-billion-parameter Mistral model that we improved. It's on Hugging Face and is the most cost-effective. You can see the cost of the query; it's a lot of zeros. I'll show you this. I'll put in the same prompt but compare it against using Claude Sonnet. I apologize if you hear a leaf blower outside my house. They decided to do it now. We made these classifiers really tiny so that there would be a negligible delay. We're seeing quite a bit of demand for this product, so we're working on auto-scaling it well. I'm going to send these. You'll see Claude start first because on the "auto" side we need to classify the query first. The cost is going to be unbelievably lower, and the answer is negligibly different. I actually did the math. Token generation with Blitz is 300 times more cost-effective: it's five cents per million output tokens, while Sonnet, which is an amazing model, is $15. So that's not money well spent on low-complexity tasks. That's the thing about Conductor. Most businesses and individual developers go to the best models, like Sonnet, GPT-4.5, or whatever the best is, assuming their task requires the absolute best. But there are many AI use cases, and companies founded on the original ChatGPT model, where we are beating that GPT performance with a 20-billion-parameter model. There's no reason to use the monster LLMs and their price tag for everyday business productivity tasks like translating, rewriting, improving emails, and summarizing. Okay, show us a couple more prompts. I'm going to go back to the single page. We'll try to get all the secrets from Fernando. Oh, gosh, dude. He might just give them to you. No, you won't. You won't. But we want to know. Here's another one. The way we trained the router was not based on human annotations of complexity. Tasks we might consider quite complex as humans can come back as rather low complexity from the classifier. I think this one's going to be Blitz too, which is funny, because I came up with all these prompts and they keep coming back as Blitz. Fernando, do you have any prompt that you've seen that routes away from Blitz? Coding tends to go to Sonnet a lot. Yes, by design, for now, because we're working on how we can best integrate a coding orchestration suite within Conductor. If it's coding, it goes to Sonnet.
So, if I go here and ask for a Python script to show the Fibonacci sequence, this is likely going to be classified as coding, regardless of the complexity of the task, and sent to Sonnet. Let's see. Okay. As you can see here, we have a nice little animation. And as you saw, since it came back as code generation, it went to Sonnet. I have one for you: explain put and call options and give me an example. I wouldn't be surprised if it went to Blitz; it very well might go to Virtuoso or a larger medium model. I think there's going to be a flash. Yes, there we go. So, go and try that out. It's open for business. You can go to conductor.arcee.ai and sign up. For a limited time, you will get $200 worth of free inference. By now, it's pretty clear that if you hit Blitz or the most cost-effective models in the list, $200 will take you very far. It's going to be hundreds of millions of tokens, unless you play with Sonnet all day; then you'll burn that cash and ask me for more, and I'll say no, because you really need to use the smaller models all the time. So, conductor.arcee.ai. Thank you, Lucas, for the demo. You can compare the different models to see that, for these prompts, the LLMs are not doing a significantly better job. And even if they are, are you willing to pay 100, 200, or 300 times more for it? Absolutely not. So that's the thing. Now, I think we'd like to understand a little more about what this router is. Fernando, being a researcher, will tell us the secrets. So, Fernando, can you tell us a little more about how you're training it? Do we use fancy techniques like model merging and distillation, or is it simpler than that? Everyone's curious about that tiny model. Well, the first thing to realize, as Lucas already mentioned, is that LLMs don't know what they don't know. If you prompt even the fanciest large language model about the complexity of a query, it's just going to guess, without any robustness. To save money, we must introduce a routing policy; it's really an optimization problem. The challenge is understanding what is complex for these models and scaling the synthetic data needed to train them. This is a ranking problem, not a regression or classification problem: an exact complexity score is very hard to predict every single time, so we framed it as an ordering problem instead. Above all, we must avoid classifying something very complex as very simple, which would be a huge mistake. We came up with a ranking system and developed some novel techniques in our labs to solve this. We also tried to set this up a bit like AI feedback systems, using reinforcement learning to measure the gap between the quality of the answers. In short, this is how we train these models in terms of complexity, which is the core of our policies. There are other parts of the router, such as programming-language detection and natural-language routing, which are also tricky to put together while keeping latency very small. For example, getting very good answers in Portuguese is tricky, depending on the model. Being able to route according to natural language, programming language, and task type is also core, and putting everything together while keeping that small latency is important. We could discuss what kind of data is needed to train this, but that would probably take us too far. As far as complexity goes, most of our secret sauce was in the training libraries we had to invent to fit it into this size.
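As a rough illustration of that ordering framing, here is a minimal sketch that turns ensemble agreement into ordinal complexity labels, the labeling scheme Lucas describes next. The bucket boundaries are illustrative assumptions, not Arcee's actual thresholds.

```python
# A minimal sketch: derive ordinal complexity labels from ensemble pass rates.
# Bucket boundaries are illustrative assumptions, not Arcee's actual thresholds.
def complexity_label(num_correct: int, num_models: int = 20) -> int:
    """Return an ordinal label: 0 = easy ... 3 = very hard."""
    pass_rate = num_correct / num_models
    if pass_rate >= 0.9:  # e.g., 19 of 20 models answered correctly
        return 0          # easy: a small model will do
    if pass_rate >= 0.6:
        return 1
    if pass_rate >= 0.3:
        return 2
    return 3              # very hard: e.g., only 2 of 20 got it right

# Training a ranker on ordinal labels penalizes ordering mistakes, so a very
# complex prompt is unlikely to be scored as very simple.
print(complexity_label(19), complexity_label(2))  # -> 0 3
```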
We took a tremendous amount of real-world prompts available online and had many models, ranging from 500 million parameters to the frontier, answer them. We then used an ensemble of high-quality models to judge those responses for correctness. If 19 out of 20 models got a prompt right, you can assume it's a rather easy task; if only two out of 20 got it right, that's a high-complexity task. Training a 30-billion-parameter model to do this classification would be trivially easy. Getting it into something this tiny is challenging, but it allows us to make it fast, fit it on a phone, and offer it on-prem. A small model also means low latency, and it's easier and faster to retrain. If you want to add a new model or five new models tomorrow, the router has to know about them. The first router we trained, which gave us the idea for Conductor, worked on a model-specific basis. Now, because we're classifying, the router spits out numbers, and we decide which model to use based on those numbers. If a new model comes out, like GPT-5, we can integrate it with very little delay. We aim to stay on the cutting edge, understanding which models are best for which tasks, so you can focus on building your cool product. We'll ensure you're always getting the best model. We also have plans for Conductor. We wanted to get it out for people to use, but we understand that reasoners, tool use, vision, and coding are important. We're adding configurations so you can specify, for example, "I want to hit a reasoner right now," and we'll route between the best reasoners for that task. The same goes for vision and tool use. We didn't want to include reasoners in the default version because we didn't want someone expecting an instant response to wait three minutes. I wanted to talk about multimodality. Abhishek is taking care of the chat. Abhishek, you should be back on screen. Come on. Abhishek is shy. It's the hat. There you go. Everybody loves your hat. If you love Abhishek's hat, please tell us in the chat. Abhishek, tell us a little bit about images and speech. What's your vision? Yeah, I think that's another big challenge for everyone on the Conductor team, and we'll need a lot of support from the labs team. How to route properly for images, or for asking questions about documents, is a big research question, so I don't want to say too much about it right now. The easiest thing we'll probably do first is route to the right vision model based on prompt complexity. Over time, we'll specialize and pick niche use cases that don't generally work well on VLMs and send them to the best model. As people use Conductor, we'll get a better idea. Right now, Conductor has been built with a lot of assumptions in mind by our team. Over time, as we get more concrete data on how people are using it, we'll have a much better idea of how to improve it. We looked at the UI, but there's an API as well, right? I would expect most customers to work through the API. Yes, the API is OpenAI-compatible, and that's a very reasonable choice. You can just use the OpenAI client. I did a YouTube video showing you that, and all my code examples will be based on it. You just create an OpenAI client with the OpenAI library, point it at our URL, conductor.arcee.ai, and set your key just like you would set your OpenAI key. That's literally all you do: change the base URL and replace your OpenAI token with the one you get from our website, and you're off to the races.
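As a minimal sketch of that setup with the openai Python library: the "/v1" path and the "auto" model name below are assumptions based on the webinar, so check the Conductor documentation for the authoritative values.

```python
from openai import OpenAI

# Point the standard OpenAI client at Conductor instead of api.openai.com.
# The "/v1" path and the "auto" routing model name are assumptions; verify
# them against the Conductor documentation.
client = OpenAI(
    base_url="https://conductor.arcee.ai/v1",
    api_key="YOUR_CONDUCTOR_API_KEY",  # token from conductor.arcee.ai
)

response = client.chat.completions.create(
    model="auto",  # let the router pick the best SLM/LLM for this query
    messages=[{"role": "user", "content": "Explain the keto diet."}],
)
print(response.choices[0].message.content)
```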
I think that's a good design decision. If you're working with GPT-4 today, or any OpenAI model, there's probably no change at all. You can just point at Conductor and you're off to the races. Let's talk about pricing for a second. If you sign up today at conductor.arcee.ai, you get $200 worth of free inference. Don't wait, because we could stop being that generous at some point; for now, you get the full $200. You can look at the pricing page, which goes from Blitz at five cents per million output tokens to Sonnet 3.7 at $15. We've done some simulations, and it's very easy to put together an Excel sheet. If 30% of your prompts are low complexity and go to Blitz, 40% are medium complexity and go to another model, and 30% are high complexity and go to the LLMs, you can easily get at least a 60-75% cost reduction, unless you're using insanely complicated prompts all day. We also did some simulations with academic benchmarks. An early version of the router was put against MMLU, GPQA Diamond, and HumanEval. We got better results, and I attribute a lot of that to the math scores, because Gemini is very strong at math. Our models were hit around 38-40% of the time, and the others were each hit in a pretty equal 20% range. You get the same performance at about 65% of the cost. Virtuoso Large is 50 cents per million tokens, which is still 30x cheaper than Sonnet. Even if you only hit Virtuoso Large, which is unlikely because we hit Blitz all the time, you save a significant amount of money. It's important to note that we're not charging a premium to use the router. We make our money when a query hits our models, so we have an incentive to make our models better so the router can send traffic to them more often. And if you decide, "Hey, I'm saving a bunch of money hitting Virtuoso Large 40% of the time, I want to hit it 100% of the time," that's fine. You can hit all of our models directly. You can't do that for the closed-source ones, but you can for ours. Fernando, if we looked at Conductor as a single model, how well would it do on MMLU and the usual benchmarks? Assuming it's a black box, how does the solution as a whole perform compared to GPT-4 or Sonnet? We expect that if our policy is optimal, we should be very close to the performance of the frontier models, or even better. We are not setting benchmarks as our target variable; we are not p-hacking or gaming the benchmarks. We are trying to figure out optimal routing policies on top of the topic or theme of the prompt, not trying to work out which model excels in a specific benchmark setting. We are optimizing for customers, not benchmarks.
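To make the savings arithmetic above concrete, here is a quick back-of-the-envelope sketch of that traffic mix. The Blitz, Virtuoso Large, and Sonnet prices are the ones quoted in the webinar; the exact split across tiers is an illustrative assumption.

```python
# Back-of-the-envelope cost per million output tokens under a routed mix.
# Blitz ($0.05/M), Virtuoso Large ($0.50/M), and Sonnet ($15/M) prices are
# quoted in the webinar; the traffic split is an illustrative assumption.
mix = {
    "blitz":          (0.30, 0.05),   # 30% low complexity
    "virtuoso-large": (0.40, 0.50),   # 40% medium complexity
    "sonnet":         (0.30, 15.00),  # 30% high complexity
}

routed = sum(share * price for share, price in mix.values())  # $4.72/M
all_frontier = 15.00                                          # everything to Sonnet
print(f"routed: ${routed:.2f}/M vs all-Sonnet: ${all_frontier:.2f}/M "
      f"-> {1 - routed / all_frontier:.0%} savings")          # ~69%, within 60-75%
```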
Abhishek, what's coming next? What are the two or three features our friends can expect in the next few weeks? In the next few weeks, reasoning models, tool calling, and a better complexity model for code are coming to Conductor. We are also working on multimodal models. Those are all things we would want in something we were using ourselves, so we're keen to get them out. Lucas, last time we were online, we discussed Arcee Maestro, our agentic workflow platform. What's the connection between Conductor and Arcee Maestro, and how do we create more business value for customers by connecting the two? When we were building Arcee Maestro, a huge part of the initial selling point was the model routing. We found it was such a powerful tool, one that a lot of people resonated with on the sales side and in early conversations, that we decided it needed to be separated from Arcee Maestro to grow and flourish. Conductor aims to find the not-quite-explainable patterns in daily AI workflows so that each task is handled by the best possible model at the lowest cost. Arcee Maestro is more about AI agent orchestration created by a human. Conductor has the potential to be very powerful on-device, for people trying to do agent orchestration on a phone where you might not have as much horsepower or RAM. You can use the router to break down complex tasks and route each part to the best model. For example, you can use different models for each step in a complex task, like explaining quantum string theory at a PhD level. Once broken down, the original query might be an eight in complexity, but each part might be only a three or four. This is another way to save money. Yes, let's break down complex prompts. A lot of my queries are multiple questions in one: explain this, and if the answer is yes, then why, and if not, why not, and do it under these constraints. Breaking things down and sending them to different models can certainly get better results. Let's go around the virtual table. Fernando, what are you most excited about, and where do you dream of taking Conductor in terms of capabilities and features? In the short term, I feel that being able to route to powerful reasoning models according to complexity is very important. Extending that as multimodal reasoning models rise is also a big challenge, and I'm excited about coming up with something novel to deal with it. Lucas, what's your moonshot project for Conductor? I think Conductor has an unbelievable chance to improve code. Many large-scale coding services, whether it's Cursor, Bolt, or individual users in VS Code, can benefit from routing to the best model based on programming language. Using Conductor as an orchestration suite that simplifies things for the end user, who doesn't enjoy doing all that manually, is very exciting. Adding an API toggle to do all of that for the user will be powerful. Abhishek, what do you think is unique about Conductor? Conductor is unique because of how we route using our own custom models. We've created our own datasets, with a lot of work from the labs team; we call it intelligent model routing. Even if we can route 50% of queries accurately, that's a big saving for a large-volume enterprise, and other, non-intelligent model routing services don't do that accurately. Solving problems for customers is my obsession, and I think we're doing that with Conductor. We remove the complex equation of choosing models, and we build very small, cost-effective models that deliver a lot of value for the buck. We try to get to those models as often as we can to deliver the benchmark-level quality Fernando was talking about. When we have to, we call the big guys and you spend a little more, but we think it's money well spent. So, go sign up at conductor.arcee.ai. It takes a minute, and for a limited time you will get $200 of free inference. If you watch this video in a month, I cannot promise you'll still get $200, so do it now. If you have feedback, please reach out; you can contact all of us on LinkedIn or through the website. We're happy to see what you're building and what works, and we'd love to see the prompts that don't work so we can fix them for everyone else. That's it for today.
The Arcee Avengers, Fernando, Lucas, and Abhishek, thank you so much for spending time with us. Now you can go back to building Conductor and implementing all those cool ideas. Thanks, everyone, for watching. We'll be back soon with another topic. As usual, thank you to the team: there are a lot of other folks working on Conductor, not just the faces you see here. Thanks to our marketing team for putting this event together and helping us share the good word with all of you. Thanks, everyone. Until next time, go work with Conductor and let us know what you think.

Tags

Conductor, Arcee, LLM Routing, Cost-Effective AI, Model Optimization

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.