EP2 | Arcee.ai: Small Language Models, Open Source, and Cost-Efficient AI | AWS for AI Podcast

May 20, 2025
Join us for an enlightening conversation with Julien Simon, VP and Chief Evangelist at Arcee.ai, as he shares deep insights on building practical and cost-efficient AI solutions. Drawing on his extensive experience at AWS, Hugging Face, and now Arcee.ai, Julien discusses why "small is beautiful" when it comes to language models, revealing how 10B-parameter models can now match the performance of much larger 72B models from just months ago. Learn about innovative techniques like model merging, the importance of proper infrastructure choices, and practical advice for organizations starting their AI journey.

This episode covers critical topics including:
● Why small language models are the future of enterprise AI
● How to optimize costs while maintaining performance
● The role of CPU vs GPU inference
● Essential architecture considerations for AI workloads
● Best practices for building production-ready AI systems

Whether you're a startup, enterprise, or public sector organization, this episode offers invaluable guidance on building scalable, efficient, and practical AI solutions in today's rapidly evolving landscape.

Learn more:
Build and scale the next wave of AI innovation on AWS: https://go.aws/ai
Arcee.ai: https://www.arcee.ai/
Julien Simon's YouTube channel: https://www.youtube.com/@juliensimonfr

Chapters:
00:00:00 Introduction
00:02:18 Journey into AI
00:06:40 Arcee.ai, small language model champions
00:09:02 Arcee.ai's global presence
00:10:19 Use cases for SLMs and AI agents
00:15:00 Post-training with model merging and model distillation
00:17:20 Model routing
00:19:15 Orchestra, a drag-and-drop agentic platform
00:20:42 How to build the best SLM?
00:23:29 Synthetic data and data quality
00:25:26 Open source in AI
00:28:04 Reflecting cultural nuances
00:31:02 Biases in synthetic data
00:34:55 What is an SLM?
00:36:33 Obsessing over cost efficiency
00:39:37 CPU inference
00:41:49 Infrastructure and model choice
00:45:38 GPU-less and microservices architecture
00:48:02 Training on AWS: HyperPod & Trainium
00:55:48 Key advice for organizations starting with AI
01:02:14 Closing remarks and resources

Transcript

Welcome to the AWS for AI podcast. Welcome to AWS for AI, the podcast where we explore cutting-edge AI solutions and the innovators behind them. I'm your host Hamza Mimmi, AWS Solution Architect. Today, I'm thrilled to welcome a very special guest, Julien Simon, VP and Chief Evangelist at Arcee. I knew Julien from a past life back in Paris when he was working for Hugging Face, and I was personally impressed with his depth of knowledge, helping hundreds of customers solve their AI challenges and always sharing his unique perspective. Julien is a strong advocate for open-source AI and small language models, and is dedicated to helping enterprise clients develop top-notch and cost-efficient AI solutions. With over 30 years of tech experience, including more than a decade in cloud computing and machine learning, Julien is committed to daily learning and passionate about sharing his expertise through code, videos, and more. At Arcee, he's working on things like the Arcee Meraj model, an Arabic language model which consistently outperforms state-of-the-art models across most open Arabic LLM benchmarks, showcasing remarkable improvements and securing the top spot on the Hugging Face leaderboard. Before joining Arcee, Julien was Chief Evangelist at Hugging Face and Global AI Evangelist at Amazon Web Services. He also served as CTO at prominent startups. In 2021, Julien was ranked number one in the list of top 10 AI evangelists worldwide by AI Magazine, alongside Lex Fridman, Ilya Sutskever, and others shaping the industry. Julien, it's a pleasure to have you today with us. It's a pleasure to be here. Welcome to the show. Very happy to be in Dubai and very happy to be your guest today.

Thank you. Before we dive into technical discussions, I want to go back to your personal story, getting into AI and now AWS, Hugging Face, Arcee. What brought you to AI in general? And what's your story getting into Arcee?

It's a long story. I'll keep it short and hopefully not boring. It all started with MySQL. Back in the early 2000s, I was working on web platforms and e-commerce websites, and data was everywhere—product catalogs we had to manage. So, MySQL, Postgres, Oracle, you name it. Then I joined another company in ad tech, where the data was still in databases but mostly unstructured—web logs with ad displays and clicks. We started using machine learning to predict clicks: click-through rate prediction. We built a very large Hadoop cluster, a multi-petabyte cluster, which was insanely large at the time. That's how I got into machine learning, and though I had no formal training, I found it interesting—predicting behavior from web logs. I started learning about it, and the ball kept rolling. Around 2015-2016, when I joined AWS, AWS started launching its first machine learning and AI services. I was curious about deep learning and decided to focus on it. I convinced my boss to let me focus exclusively on AI and ML, which few people understood 10 years ago. They said, "Sure, go do your thing." That's how it went—learning every day, starting from scratch, and figuring out use cases: what's in it for customers? Not the theory, but what can I build, solve, and improve? How can I enhance user experience and company agility? Figuring out the best technology and models for each challenge, which changes daily, so you have to keep learning and experimenting. That's how I describe my job—spending hours learning or unlearning and trying to figure things out so you don't have to.
If I spend 20 hours on a problem and can make a YouTube video to explain it, and it gets thousands or tens of thousands of views, I save a lot of time globally for the user community. Thousands of saved hours is my leverage.

Now, Arcee. Can you talk a little bit about Arcee's mission? What are some of the things you're working on, and how are you trying to achieve that mission?

I describe Arcee as the SLM champions—small language model champions. We champion SLMs because we believe customers will get more value from their services and products using small language models versus large language models. An increasing number of people agree with this. We're also SLM champions because we're good at building models. You mentioned Meraj, our Arabic language model. We start from the best open-source models available, like Llama, Mistral, or Qwen, and run them through our post-training stack, which is our own IP. This stack includes open-source libraries we've built to not only build outperforming models but also build them cost-efficiently. We share some models on Hugging Face, keep some for commercial customers, and are also a platform builder. We deploy our models on SaaS platforms for inference, with features like model routing and agentic workflows. We're trying to advance AI by recognizing that no single model can do everything. Just like there's no Swiss Army knife database or programming language, you need the right tool for the job. We help customers use the right models and tools for their specific needs.

What are some of your customers, industries, and geographies where you're focusing right now?

Arcee is an American company, and most of the team is based in the US. A few of us are in Europe; I'm based in France. We have a US focus, but we have international conversations. One reason I'm here this week is to meet with customers, potential customers, and partners. By the time you watch this, you will have heard of our announcements in the GCC region. We're serious about the region, seeing a lot of AI activity, interesting customers, and partners. We hope to help customers here as well.

I see the same—AI is huge here, but it's huge everywhere. Here, things are moving very fast. I want to talk a bit about some use cases. You work with customers on building small language models, but you also have products. Maybe talk about some of the use cases you're working on with your customers.

Initially, the vision was one LLM to rule them all, but that didn't work well, especially for customers with strong privacy or domain-specific needs. The open-source community started building great models, and now we have millions of models on Hugging Face, which is amazing but creates complexity. Everyone started fine-tuning models, trying to embed their domain knowledge, with varying success. The world is more complicated, and building automated or smart workflows requires more than one model. You need a collection of models, and the smaller the better, because small means fast, less expensive, and more scalable. You also need IT tools. You have interesting data in Salesforce, GitHub, and all your apps. Trying to build a single model to replicate all that knowledge is very hard and often fails. The right approach is to use models for what they're great at—data analysis, data conversion, story writing—and combine them with data from your existing IT systems. This opens up infinite use cases because you can combine models and external tools in countless ways.
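To make the "right model for the right task" idea concrete, here is a minimal routing sketch in the spirit of what Julien describes: a cheap scoring step looks at each prompt and dispatches it to the smallest model expected to handle it. This is not Arcee's Conductor implementation; the model names, prices, heuristic scorer, and the call_model helper are all illustrative placeholders (a real router would typically be a small trained classifier).

```python
# Prompt-routing sketch: score each prompt, send it to the cheapest
# model expected to handle it well. Names, prices, and call_model()
# are placeholders, not a real vendor API.

from dataclasses import dataclass

@dataclass
class Route:
    name: str                 # backend model to call
    cost_per_1k_tokens: float # illustrative pricing

ROUTES = [
    Route("small-slm-3b", 0.0002),   # rewrites, summaries, simple Q&A
    Route("mid-slm-10b", 0.001),     # multi-step or domain questions
    Route("frontier-llm", 0.02),     # long, complex, open-ended requests
]

def score_complexity(prompt: str) -> float:
    """Crude stand-in for a trained router model: returns a value in [0, 1]."""
    signals = [
        len(prompt) > 1500,
        "```" in prompt,
        prompt.count("?") > 2,
        any(w in prompt.lower() for w in ("explain", "compare", "analyze")),
    ]
    return sum(signals) / len(signals)

def pick_route(prompt: str) -> Route:
    s = score_complexity(prompt)
    if s < 0.25:
        return ROUTES[0]
    if s < 0.75:
        return ROUTES[1]
    return ROUTES[2]

def call_model(route: Route, prompt: str) -> str:
    # Placeholder: in practice this would hit whichever endpoint serves route.name.
    return f"[{route.name}] response to: {prompt[:40]}..."

if __name__ == "__main__":
    prompt = "Summarize this meeting invite in two sentences."
    route = pick_route(prompt)
    print(route.name, call_model(route, prompt))
```

The design point is simply that routine prompts never reach the expensive model, which is where the cost savings at scale come from.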
Instead of teaching a single model to do customer support, fraud prevention, and document analysis, use each model for its strengths. We work with companies across various industries, from financial services to insurance and retail. Ten years ago, I would have said financial services were ahead, but now we see customers in all verticals, including EdTech and companies that are not tech-savvy but have data and need to automate and scale.

You also have products beyond the custom models you build directly with customers, like Conductor, Orchestra, and MergeKit. What kind of problems are you solving with these?

Initially, Arcee started as a model-building company, specifically for custom models. Companies came to us saying they needed a model for a specific use case in a particular industry. We were successful because we didn't just use the fine-tuning notebook from Hugging Face. We realized there's a different way to build models. You mentioned MergeKit, our model merging library. Model merging is a technique where you literally merge models. You take model A, model B, and model C, and run a math operation. There's no training involved. You can merge single-task models into multi-task models. This technique is popular, and companies like Google use it for their Gemma models. Model distillation is another great technique, now widely known thanks to DeepSeek. We've been doing this for a while, and our stack is sophisticated, not just vanilla techniques. This is how we build great models.

Once we have those models, we built an inference platform called Conductor. Conductor involves model routing, where every prompt is analyzed by a tiny router model. We look at complexity, domain, and other factors to make a quick decision and send it to the most appropriate model, ideally the smallest one. Small models can do tasks like writing meeting invites, summarizing documents, and translating documents extremely well. Sending these tasks to an LLM is 100-200 times more expensive. At scale, this adds up quickly. The idea behind Conductor is to find the best model for each prompt in real time and optimize for cost efficiency. We have examples where small models do as good a job as top models from OpenAI or Anthropic at a fraction of the cost.

We also have Orchestra, our agentic workflow platform, with a simple UI where you drag and drop boxes. You can connect Salesforce as an input to an SLM, add custom code, use the models, run workflows, and deploy them as APIs or through a chat interface. The philosophy is that no single model, even the best open-source or Arcee model, is enough. You need a collection of models used at the right time for the right task. Conductor helps by letting the platform decide which model to use, which is great because models change constantly, and no one has the time to evaluate everything. We do that work, letting customers focus on their business operations.

Let's dive into the model merging technique and how you build small language models. You talked about distillation, quantization, merging, and training from scratch. How do you get the best small language model?

We're pragmatic. Some people have spent billions building great models like Mistral, Llama, Qwen, and Gemma. We start from there. Our research team evaluates these models, connects with the teams, and identifies which new models are better at specific tasks or languages. Depending on what we're building, we select the right model and run the post-training process. We do this cost-efficiently, without spending millions.
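As a rough illustration of the "math operation, no training involved" point, here is a sketch of the simplest merging variant: a weighted average of the weights of two fine-tuned checkpoints that share the same architecture. This is not MergeKit's own API (MergeKit supports far more sophisticated methods such as SLERP, TIES, and DARE), and the model IDs below are placeholders.

```python
# Linear (weighted-average) model merge: the simplest form of model merging.
# Both checkpoints must share the same architecture and tokenizer.
# Model IDs are placeholders; this is a sketch, not MergeKit's implementation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_A = "org/slm-finetuned-on-task-a"   # placeholder checkpoint
MODEL_B = "org/slm-finetuned-on-task-b"   # placeholder checkpoint
ALPHA = 0.5                               # weight given to model A

model_a = AutoModelForCausalLM.from_pretrained(MODEL_A, torch_dtype=torch.float32)
model_b = AutoModelForCausalLM.from_pretrained(MODEL_B, torch_dtype=torch.float32)

state_a = model_a.state_dict()
state_b = model_b.state_dict()

# Average every matching tensor: no gradients, no training data involved.
merged_state = {
    name: ALPHA * state_a[name] + (1.0 - ALPHA) * state_b[name]
    for name in state_a
}

model_a.load_state_dict(merged_state)
model_a.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained(MODEL_A).save_pretrained("./merged-model")
```

In practice the merged checkpoint is evaluated against both source tasks; naive averaging can degrade performance, which is exactly why more advanced merge methods exist.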
Techniques like model merging and distillation are super cost-effective and have repeatedly delivered outperforming models. Every time we publish a model on Hugging Face, it has been number one in its size category. Over time that changes, but every model we release is the best at release time. We work very hard to ensure customers get a lot out of it. There are many ways to improve models, including post-training and synthetic data. The quality, diversity, and complexity of your data are critical. When Falcon from TII came out, it was better because they spent a lot of time cleaning and curating the data. This is crucial for fine-tuning models and building question-and-answer datasets. Diversity and complexity are important. If you fine-tune with 100 Q&A pairs, you'll get good results on those pairs, but not on a broader range of questions. Diversity and complexity are key.

Most of these techniques require the models to be open source. You're very passionate about this.

Open source is not well defined in AI and generative AI; the term is often misused. We're starting to see open weights, but very few models are truly open source. To be open source, you need the training dataset, the training code, and the post-training code—the full recipe. Hugging Face has built a few models where they shared everything, which is interesting. The community's contribution is critical. Initially, meaningful open-source or open-weights models came from academia, with Stanford's Alpaca and with Vicuna. Meta joined with the Llama models, and many others followed. We see actors from various regions, including TII, Alibaba with Qwen, and models from India and other Asian countries. We'll see models from Africa and other regions. This is great because local teams are better suited to build models for local languages and cultures. No single company, or a small set of US-based companies, can be in charge of that. Local initiatives, languages, and cultures are essential. For example, a US West Coast team might not be the best to build a Swahili or Tagalog model. Cultural differences should be reflected in the models used. Sovereignty is an abstract concept, but if you need a model that is appropriate for schools from a language, safety, cultural, and religious perspective, that matters. If I build a French model, I'll do it the French way. If I help build a model for Singapore, there are local rules to respect. This is how I look at it.

You touched on synthetic data, where you use a large language model to generate the data. How is synthetic data useful in this scenario?

Synthetic data can be a vicious circle: feeding AI models too much AI-generated data can go wrong. Synthetic data is useful for multiplying and diversifying your data. You can ask domain experts for the top 100 questions customers ask, and then use models to rewrite 50 versions of each question, accounting for cultural, age, and other factors. This enriches and completes your datasets faster. Generating net new data from scratch and relying too heavily on it can introduce bias. If the large model you use to generate data has a bias, that bias will be in your dataset, making it hard to detect and fix. It's like technical debt. To tackle synthetic data, use the knowledge already available, get humans in the loop, and use multiple models to average out any biases. Be very critical and review the data.

Let's talk about small language models. This is an undefined topic. What is your definition of a small language model?

My definition is anything you can run on a single accelerator without splitting or sharding.
Realistically, anything larger than 70 billion parameters cannot be called a small language model. If you're targeting devices like phones, you're looking at single-digit-billion models. AWS has a wide range of accelerated instances, from various GPUs to AWS chips like Trainium and Inferentia. Cost efficiency is critical. When I was at AWS, I told customers to use the smallest EC2 instance possible. Sometimes a t2.micro instance worked just fine, saving thousands a month. The same applies to AI. Obsessing over cost efficiency is crucial because AI is about scale and automation. If AI can help doctors spend less time on paperwork and more time with patients, that's great. The same applies to teachers and other jobs. More face time with humans and less time on low-value tasks. If AI can do that, it needs to scale. For AI to scale, the cost needs to be minimal, which means using the smallest compute possible for inference, whether GPU or even CPU. I'm a fan of CPU inference and get fascinating results because the models get so small. That doesn't mean small models have fully caught up with large ones; 72B models today are still better. But for many business use cases, you can get the same performance today from an 8 or 10 billion parameter model that would have required a 72 or 34 billion parameter model six or nine months ago. This makes a world of difference in deployment because a 10 billion parameter model can run on the smallest GPU instance available on AWS.

I've seen your demos on Graviton, and you can run them on Graviton for small-scale use cases. So, what's your advice for people starting to run their models: how should they evaluate the right size, the right model, and the right infrastructure?

Start with the best 7 or 8 billion parameter model available today. You can try smaller models like Google's Gemma or Microsoft's models, but start at 7 or 8 billion parameters. Run them on the smallest GPU instance on AWS, which costs around a dollar an hour. The cost of experimentation is very low. Build evaluation datasets that reflect what your users will do, score with your internal metrics, and do human review. If it works with a 7B or 8B model, try 5B or 3B. The 3B models today go a long way.

When should you try Graviton in production?

First, there are scenarios where you have no GPUs. Some AWS regions are not GPU-rich, and it's difficult for smaller customers to get quota. There are also edge scenarios where you have a server in a restaurant or supermarket, or devices that don't have GPUs. In these cases, if you have the right type of CPU and know how to optimize models, you can get good results. Graviton is based on Arm technology, which is well supported by open-source tools for optimizing models. Even in the cloud, there are scenarios where it makes sense not to use GPUs. For small-scale workloads, you can embed optimized models on the same instances that run the application code. This avoids having a single point of failure and reduces costs. There are lots of scenarios where a GPU-less architecture makes sense. If you need 10,000 inferences per second, this won't work, but not everyone needs that. Sometimes you just need a local model to do something small-scale, and you can run it in place instead of having a GPU instance that sits idle 99% of the time. There's a GPU-less architecture to be invented. We'll still need GPUs and accelerators like Trainium and Inferentia, but we need the right tool for the job. There is no single technical solution to everything. For small language models, edge scenarios are a must.
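For the GPU-less scenarios described above, one common approach (not necessarily the exact stack used in the demos mentioned here) is to run a quantized SLM with llama.cpp, whose Arm-optimized kernels make it a natural fit for Graviton or edge boxes. A minimal sketch with the llama-cpp-python bindings follows; the GGUF file path is a placeholder for any quantized small model.

```python
# CPU-only inference sketch using llama-cpp-python, which ships
# Arm-optimized kernels and therefore runs on Graviton instances.
# The GGUF file path is a placeholder for any quantized SLM.

from llama_cpp import Llama

llm = Llama(
    model_path="./models/slm-8b-q4_k_m.gguf",  # placeholder quantized model
    n_ctx=4096,      # context window
    n_threads=8,     # roughly match the instance's vCPU count
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You summarize documents in three bullet points."},
        {"role": "user", "content": "Summarize: Q3 revenue grew 12% while costs fell 4%..."},
    ],
    max_tokens=256,
    temperature=0.2,
)

print(out["choices"][0]["message"]["content"])
```

The same model file runs unchanged on a laptop, an edge server in a store, or a Graviton instance that also hosts the application code, which is what makes the "run it in place" pattern attractive for small-scale workloads.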
Even when training models in the cloud, at Arcee we used Trainium and reduced costs by 98%. We've also used SageMaker HyperPod. Moving from CUDA to Trainium involves a learning curve, but it's worth it for cost efficiency and scaling. If you work with well-known architectures, the learning curve is manageable. You'll find notebooks and examples, and you'll get support from your account team. HyperPod manages failures and restarts, provides monitoring and observability, and is valuable for long-running training jobs. Trainium is an interesting option, especially if you struggle with access to the latest GPUs. AWS has more control over Trainium, making it easier to scale and cost-optimize. The cost benefits are real, and it's worth a shot if you want to be independent of GPU supply chains.

For new startups, public sector organizations, and enterprises starting from scratch, don't treat AI differently from typical software engineering efforts. A model endpoint is just another microservice. Focus on cost, scaling, monitoring, observability, security, and compliance from day one. Design for production, not the sandbox. Talk to all the stakeholders who will sign off on your deployment. Use AWS tools for cost management, compliance, and security. Talk to solution architects, who are free and here to help. Automate everything, right-size your infrastructure, and avoid silly mistakes. Eat your vegetables from day one: automate, cost-optimize, and talk to solution architects. They are the best resource you have.

Thank you for listening to this episode of AWS for AI. If you're interested in learning more from Julien and Arcee, check out Julien's YouTube channel and Arcee's website. You can find links in our show notes, along with resources for exploring the AWS services mentioned today. Don't forget to subscribe to AWS for AI on your favorite podcast platform and leave us your feedback and thoughts about today's discussion. Until next time, this is Hamza Mimi thanking you for listening. Keep exploring, keep innovating, and we'll catch you on our next episode.

Tags

AI Evangelism, Small Language Models, Cost-Efficient AI Solutions, Arabic Language Modeling, Model Merging Techniques

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.