Arcee Maestro 7B: A small language model that outperforms o1-preview

February 21, 2025
In this video, we introduce Arcee Maestro 7B (preview), a new Arcee open-source model based on a Qwen model and further trained with GRPO, the reinforcement learning technique used for DeepSeek-R1. We first look at the model page on the Hugging Face Hub. Then, we discuss evaluation benchmarks. Finally, we deploy the model on Amazon SageMaker. If you’d like to understand how Arcee AI can help your organization build scalable and cost-efficient AI solutions, please get in touch with sales@arcee.ai or book a demo at https://www.arcee.ai/book-a-demo.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. You can also follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

* Blog post: https://www.arcee.ai/blog/arcee-maestro-7b-preview-arcee-blitz-advancing-reasoning-and-speed-in-smaller-models
* Arcee Maestro 7B (preview): https://huggingface.co/arcee-ai/Arcee-Maestro-7B-Preview
* Notebook: https://github.com/arcee-ai/aws-samples/blob/main/model_notebooks/sample-notebook-maestro-7B-preview-on-sagemaker.ipynb
* Arcee Model Engine: https://youtu.be/yVlHEjlIZVY

Transcript

Hi everybody, this is Julien from Arcee. In this video, I would like to introduce yet another open-source model by Arcee. This one is called Arcee Maestro 7B, and this is the preview version; we're still working on it. Maestro is based on a Qwen model that we further trained on math problems, and we used the same reinforcement learning technique that DeepSeek used for their R1 model, an algorithm called GRPO. So we're going to take a quick look at the model and the benchmarks. You will see it does outperform o1-preview on math problems. Then we'll deploy it on Amazon SageMaker and run some tests. Let's go.

Maestro came out yesterday. You can read about it in the blog post. In a nutshell, it is a 7B model that we further trained on math and some coding problems. Looking at the benchmarks, we can see that Maestro not only outperforms some other really good 7-billion-parameter models but also outperforms o1-preview from OpenAI. I think this is very impressive for such a small model. When compared with larger models like the 14-billion and 32-billion distillations from R1, we can see that this 7B model is pretty close. So again, another sign that this is a really, really strong small model. Keep in mind this is just a preview version, so we're quite hopeful that we can make it even better.

So now let's deploy this model on Amazon SageMaker and run some reasoning tests. As usual, I'm going to download the model from the Hugging Face Hub and use the AWS LMI container to deploy it to a SageMaker instance. We need to set up our dependencies, and because this is such a small model, it can easily fit on a small GPU instance, which is really cost-effective and my preferred GPU instance type on AWS: g6e.2xlarge. This one has a single L40S GPU with 48 GB of memory, which is more than we need for this 7B model. So let's just go and select that instance type and use the usual code to create an endpoint.

After a few minutes, that GPU instance is up, and we can start running a couple of examples. Here, I'm going to run this apparently simple question, but we're going to see that it turns out to be not that simple: "Explain the impact of rising US interest rates on emerging market bonds. Give an example from recent history." Let's run this with streaming inference. Because this model is a reasoning model, it's going to be extremely chatty, which is a good thing. It's going to walk us through how it's looking at the problem, breaking down my question into bits, figuring out what each bit means, and then starting to formulate an answer. This is really funny because it says, "Oh, I'm confused. Oh, but wait, no, I'm thinking about this wrong. Or hey, let me look at this again." This is really not how we're used to working with language models. Usually, we get that super nice polished answer. So this could be a little upsetting at first, but you're really looking at the model working through the problem. When it's done, it's going to output the closing </think> tag, saying, "Okay, so now I'm done thinking." Then you get your answer, which is all nice and neat and structured. There's no doubt or, "Oh, wait, no, I'm wrong here." This is really similar to sitting down next to a colleague or an expert, asking a question, and watching that person think about the whole thing, maybe backtrack, or look at the question in different ways, etc. So this is really interesting.
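For reference, here is a minimal sketch of the deployment flow described above, assuming the SageMaker Python SDK and an AWS LMI (DJL-Serving) container. The container image URI, environment options, and endpoint name are illustrative, not the exact values from the video; the linked sample notebook has the code actually used.

```python
# Sketch: deploy Arcee-Maestro-7B-Preview on SageMaker with an LMI container.
# Assumptions: SageMaker Python SDK installed, an execution role with SageMaker
# permissions, and an up-to-date LMI image URI for your region.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Illustrative LMI container image; look up the current URI for your region
# in the AWS Deep Learning Containers documentation or the sample notebook.
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"

model = Model(
    image_uri=image_uri,
    role=role,
    sagemaker_session=session,
    env={
        "HF_MODEL_ID": "arcee-ai/Arcee-Maestro-7B-Preview",  # pulled from the Hugging Face Hub
        "OPTION_ROLLING_BATCH": "vllm",   # assumed serving backend
        "OPTION_MAX_MODEL_LEN": "8192",   # illustrative context limit
    },
)

# A single L40S GPU (48 GB) is more than enough for a 7B model.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.2xlarge",
    endpoint_name="arcee-maestro-7b-preview",  # hypothetical name
)
```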
If you ask me, this output is super valuable, and you would probably want to keep it, not just the nice answer but the thought process as well. You can use it for content creation or as extra context in many different ways. At the end of the day, we did get an answer. You may wonder, "Is this a good answer, yes or no?" So I actually asked another model, Virtuoso Large, which is a much larger model at 72 billion parameters, "Do you agree with this explanation?" I just took the answer from the small model and pasted it. The conclusion from the much larger model is that the small model correctly summarizes that rising US interest rates can lead to significant declines in the price of emerging market bonds, etc. Feel free to do the same with other models. This is really interesting because we have this small model that can give a really good answer that a much larger model basically agrees with; it cannot say much more than what the small model says. But with the small model, you also get the thought process.

Okay, now if you watched my previous video, you saw me use this prompt with a bunch of llama.cpp code for ARM CPUs, asking the model to analyze how NEON instructions were used for vectorization. The larger model gave us an answer, and there was nothing wrong with it, but it was just the answer. Now let's run the same prompt with this small reasoning model and see what happens. Once again, you see the model working through the problem, looking at the first few lines, the loop, the inner loop, noting certain things, etc. So it's literally walking through that code and figuring things out. Again, once it's done, </think>, and now you get your precise answer. For a model that small, we get a very, very good explanation. It might just be better than the one we got from the larger model.

So feel free to download this model from Hugging Face, deploy it anywhere you like, and give it a spin. Share some comments in the community tab on Hugging Face or in the comments for this video, and let us know how you're doing. That's it for today. I'll see you soon with more content. Until next time, keep rocking.
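If you want to reproduce the streaming demos from the video, here is a rough sketch of the invocation, assuming the LMI container's default "inputs" / "parameters" / "stream" payload schema and the hypothetical endpoint name from the deployment sketch above; the exact payload format depends on the container version, so check the sample notebook.

```python
# Sketch: stream tokens from the SageMaker endpoint with boto3.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": (
        "Explain the impact of rising US interest rates on emerging market bonds. "
        "Give an example from recent history."
    ),
    "parameters": {"max_new_tokens": 2048, "temperature": 0.6},  # illustrative settings
    "stream": True,
}

response = runtime.invoke_endpoint_with_response_stream(
    EndpointName="arcee-maestro-7b-preview",  # hypothetical name from the sketch above
    ContentType="application/json",
    Body=json.dumps(payload),
)

# Print chunks as they arrive. The reasoning trace comes first, then the closing
# </think> tag, then the final structured answer. Depending on the container
# version, each chunk may be a JSON line you would want to parse instead of
# printing raw bytes.
for event in response["Body"]:
    chunk = event.get("PayloadPart", {}).get("Bytes")
    if chunk:
        print(chunk.decode("utf-8"), end="", flush=True)
```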

Tags

Arcee-Maestro-7B, Math-Reasoning, AWS-SageMaker, Model-Deployment, Cost-Effective-GPU-Usage

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.