Small Language Models with Julien Simon | Let's Build a Startup S3E9

September 09, 2025
Discover how startups can benefit from adopting or creating Small Language Models and the role they have in enabling agentic AI. Learn from world-class experts Julien Simon, Chief Evangelist at Arcee, and Nicolas David, Sr. Startup Architect at AWS.

Transcript

[Music] In the heart of the urban jungle, a unique species of homo sapiens thrives. Identified by their common adornments, the humble hoodie, and their peculiar yet endearing rituals involving ping-pong tables and an astonishing amount of free coffee. The startup homo sapiens. I'm no David Attenborough, but I know a thing or two about startups. For example, did you know they don't run on coffee alone? I'm here to guide you on an extraordinary journey. Our new Twitch series, AWS Let's Build a Startup. Together we shall observe how an idea evolves, grows, and matures into a real minimum viable product. All of this within the AWS ecosystem, whether it's the thirst for knowledge guiding you or simply your passion for free caffeine. Join me on this fascinating journey. Hey, hello folks and hello founders. Welcome back to AWS Let's Build a Startup. This is your weekly one-hour show that takes you through the secrets and mechanisms of success we see in awesome startups building with us on AWS. I'm Giuseppe Battista, your host for today. And in store for today, I think we have a great episode. We're going to dive a little bit deeper into what small language models are, how startups can actually make use of them, and their role in agentic AI. To do that, I have two exceptional guests with me. I will be introducing them shortly. But please let me know if you have any questions about small language models, how we can use them, or if you're already using them in your startups. Before we dive deep into the world of small language models, let me just remind you of a few new episodes that are coming. A few weeks ago, we had a fantastic episode with Lovable. We will be sharing a link to that episode in a few moments. But next week, for example, we will take you through the journey from dataset to actual inference and production with two of my favorite startups and AWS partners: SuperAnnotate and Fireworks AI.
We will take you through the journey from gathering your first piece of data, adding labels to it, automating the process of adding labels, all the way through to production-ready endpoints. Without further ado, I want to welcome to the stage Julien Simon from Arcee and ex-AWS, and my friend and colleague Nicolas David. Hey folks, how's it going? Nicolas is a startup specialist like myself. I'm based in the UK, and Nicolas is based in Bahrain, working with Sub-Saharan Africa. I'm super eager to learn more about the stories from startups there. Julien, I don't think you need any sort of introduction, but I'll do it for the sake of it because I really like the sound of my voice. I think you're a pillar of the open-source community when it comes to AI, and I'm trying to be the same voice, yes. So, folks, I know that you have been exposed to small language models in some way. Julien, I know that you have bought almost completely into the idea and the approach of small language models. But I really want to hear from you folks. So I want to spend maybe the first couple of minutes of the show helping me understand what a small language model is and how we are going to use it. What are the differences maybe with large language models? So I'll start, and you can correct me, Nico, okay? So small language models are small compared to large language models, right? How about that for an introduction? Large language models are basically the models we started working with when ChatGPT came out. Those are the models that everybody listening to us right now would know and be familiar with. Very large models, increasingly large models even, hosted behind APIs and generating text, generating images, etc. So that's how it started. And we tend to call those closed language models as well, because we don't really know what they are. They're hosted behind APIs, and we don't really know what they've been trained on, etc. So there's a black box element.
A few months after those models, the open-source community started releasing their own flavor of language models. Hugging Face played a very important role there. We started seeing models from U.S. universities, then pretty soon from startups, and then from Meta. Now, everybody's familiar with the Llamas and the Mistrals. And I guess we'll talk about Arcee later on. These models are definitely smaller than their closed equivalents. Even though if you read research papers, you will still see people saying LLMs and then they'll use Llama or Mistral or something else. They'll be working with models that are way smaller than the bigger ones. If we assume that the largest models from OpenAI and Anthropic are hundreds of billions of parameters, maybe even trillions, or several models collaborating, obviously, those small language models are much smaller. A lot of folks out there use Llama 3 8 billion or Mistral 7 billion. Arcee released a few weeks ago a 4.5 billion parameter model that is actually better than the 8B model from Llama. So I would say now we live in a world where you can work with single-digit billion parameter models and get real work done, not just toying around or experimenting, but actually deploying enterprise use cases and getting business value from them. And I think that's why we should really all try and say, you know, small language models now. Yes, I know there are larger open-source models. OpenAI, funny enough, finally released an open-source model again, and it's 120 billion. So you could say, okay, that's a larger model. Maybe that still qualifies as an LLM. But when we say small language models, we really mean models of 10 or 20 billion parameters, nothing really bigger than that. Okay. So it's about the number of parameters, right? How about your opinion, Nicolas? What do you think about them? Yeah, I 100% agree with Julien.
I think Julien was one of the earlier advocates, if not the earliest, of small language models when I first met him in his role. Probably one of the loudest. And so recently, as you mentioned, Giuseppe, I cover sub-Saharan Africa with early-stage startups. One of the customers I've been working with recently, based out of Johannesburg, called Lelapa, presented at the summit a couple of weeks ago about the model they use. They've actually released a model called InkubaLM, which is a small model with 1.9 billion parameters and covers five languages. This model, covering five languages, is basically translating, let's say, from Zulu to English or to another African language. The idea behind this is to remove barriers in education and several other commercial use cases as well. A fantastic use case, yet a small language model is used, not something with 120 billion parameters. So a fantastic example of what can be done with these models, SLMs. It's not just about capabilities. This is a really good point. If you're listening to me from time to time, you keep hearing me say models get smaller and smaller, yet they get better and better. That's a really important point because, you know, our 4.5 billion model is better than 7B or 8B models from six months or a year ago. And that trend will continue. I don't think we need additional buzzwords, but I'll start calling them crazy small language models, CSLMs. Now we're seeing 3B or even smaller models that are very capable. Of course, they won't have all the knowledge and abilities that bigger models would have. But if you focus them on a single job and optimize them for that job, they will work great. The first thing I want to make clear is that there's been a lot of focus on parameter count in the last few years. A lot of folks out there want you to believe that bigger is better, that if you want more ROI from your model, you need a bigger model, whether open-source or closed-source. That's not true.
The 70B models from a year ago are obsolete, and you can certainly get as much performance today with maybe an 8 or 9 billion parameter model. It really comes down to what you're getting from models today, not the size. The size with respect to abilities is a very fast-moving target. From one month to the next, it will literally change. So don't be afraid to test the smaller models, the latest smaller models. Hugging Face, as you know, has the SmolLM models, and those are very good models, too. You should really try them out. You will be surprised how much value you get from them. That kind of association between the number of parameters and capabilities in general goes out the window as we progress into enhancing the way we train these models. The short answer is it's all about the quality of the data and, of course, the improved training techniques. But if I have to pick one thing, it's the quality of the data. We saw that when we trained our 4.5 billion parameter model, which is called AFM-4.5B. AFM means Arcee Foundation Model. When we were monitoring the progress of the training job and running evaluations on the early checkpoints, we could see that the early checkpoints were outperforming some of the better models out there. The curve was skyrocketing and delivering very strong performance early on. Because, of course, our training recipes are not too bad, but it's really the quality of the data. If you're in school, you can have an amazing teacher, but if the material sucks, if your manual sucks, if you don't work on the right exercises, if you don't practice the right stuff, you won't really learn. So you need the right teacher, the right recipe, but more than anything, you need the right material. This is why those models are getting awesome very quickly. I've shared a link on LinkedIn, and I will be sharing it on YouTube and Twitch soon. But if you want to have a look, there's a link to the description page of the model card.
We do have a few questions coming from YouTube. Our friend Vivek. Hey, Vivek. Thank you so much for your question. Can small language models be a game changer for edge deployments where we need offline intelligence in near real-time? What do you think about that, folks? Yes, yes, yes. I see you on LinkedIn a lot, Vivek, so that's good. Actually, if you think about what I was saying earlier, which is the models get smaller and smaller, yet better and better, it's a very positive flywheel. We love flywheels. Now we're getting to a point where the models are so small that you can optimize them for edge platforms. Let's say a CPU platform, a server running in an edge location in a restaurant or gas station, or even a smaller device. They're so small already, and through further quantization and optimization with frameworks like Intel OpenVINO or llama.cpp, you can shrink the model further. Modern CPUs, and we'll do some demos later in the session, give you really great performance because of hardware acceleration. So if I'm noisy about one thing, it's really SLMs plus CPU. I think this is an amazing combination. A lot of us are running these things on our laptops without GPUs. You can run them on the plane, anywhere. It just works. Nico, did you have some experience doing that? Yeah, actually, back when SageMaker had this other service called SageMaker Neo, you could compile a model specifically for edge devices like a Raspberry Pi. Fast forward a few years, and I think it was last year or a couple of years ago, the Raspberry Pi got an AI HAT that gives you up to 26 TOPS of performance. If the model is small enough and you have those CPUs or dedicated chips, like Inferentia or Trainium, you can do a lot without a GPU. Even at the edge with a Raspberry Pi, which is based on Arm, you can do a lot. I'm doing a lot of work with Arm, and as we know, Graviton CPUs are Arm-based.
All those chips have the instruction sets you need to accelerate matrix multiplication and other operations. If you combine that with llama.cpp, it's a great combination. If you move from an 8B model to a 4B model to a 2B model, you maintain the level of business performance and get maybe twice the speed or throughput on a tiny, cost-effective platform that can run outside the cloud and almost anywhere. This goes into answering the question that Bhavesh has asked us: Would you recommend small language models for IoT applications with lower resources? I've only heard good stuff about this in terms of IoT. Is there any reason why we should maybe shy away from this and still keep some of the inference back in the cloud? So, I would say right now, anything smaller than a Raspberry Pi is probably not a good idea. If you're mentioning ESP32 or small 16-bit microcontrollers, that's not reasonable. You could probably run small computer vision models, but not generative AI models. IoT means a lot of different things to a lot of different people. Some people are running NVIDIA platforms at the edge, like Jetsons, which are small GPUs. Of course, you can look at those. But you won't be able to run a text generation transformer on a 16-bit Atmel chip anytime soon. There's a lot of interest in IoT and generative AI at the edge. A scenario we're seeing is turning machine data, technical logs, or complex data generated or received at the edge into natural language text. Predictive maintenance, writing emails automatically, shooting emails to technicians, generating failure reports, or daily reports of what's working and what's not, can be done right there. It's so much better than going through logs. Imagine showing up at a technical location and being able to chat with the system installed there and figure out what's going on instead of opening technical logs and debugging with outdated diagnostics. So, you can think of a lot of scenarios there.
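To make the log-to-report scenario above concrete, here is a minimal sketch. It assumes llama-cpp-python (`pip install llama-cpp-python`) and a locally downloaded quantized GGUF file; the model filename and the log lines are placeholders, not from the episode.

```python
def build_report_prompt(log_lines):
    """Turn raw machine log lines into a prompt asking for a plain-English report."""
    joined = "\n".join(log_lines)
    return (
        "You are a maintenance assistant. Summarize the following device logs "
        "as a short plain-English report for a technician:\n\n" + joined
    )

def summarize_logs(log_lines, model_path="afm-4.5b-q8.gguf"):
    """Run the prompt through a small quantized model, fully offline on a CPU.

    The import is done lazily so the prompt helper above works without the
    library installed; model_path is a placeholder for your own GGUF file.
    """
    from llama_cpp import Llama
    llm = Llama(model_path=model_path, n_ctx=4096, n_threads=16)
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": build_report_prompt(log_lines)}],
        max_tokens=256,
    )
    return out["choices"][0]["message"]["content"]

if __name__ == "__main__":
    logs = ["03:12 pump_2 pressure_low", "03:14 pump_2 shutdown"]
    print(summarize_logs(logs))
```

The same shape works for the daily-report and chat-with-the-machine use cases; only the prompt changes.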
Bhavesh was suggesting edge processing for smart meter analytics, which is fantastic. Another use case is preventative maintenance. These models are great at understanding very complex data and turning it into plain English or whatever your language is. It's also about the need for smaller corpora of data to train these models. If we're shifting our attention from quantity to quality, we can build knowledge in models that abstract ways of thinking for languages that don't have extensive literature or cultural heritage passed down in writing. I'm thinking of the type of languages that your customers are more accustomed to, Nicolas. I also want to mention something that happened very recently. A lot of people at AWS spoke about this, specifically what happened in Madrid, where the whole city was out of electricity. The infrastructure kept running, but there was no way to interact with things. If you think about scenarios where you need to manage traffic in a smart city environment with intelligence at the edge, you can manipulate traffic lights to make traffic as smooth as possible. There are plenty of scenarios in this domain, especially in rural areas where connectivity is not great or you need satellites. Smart cities and living digital twins, which collect data from various sensors, are super interesting. Dealing with this data at the edge saves on latency and a ton of other things. Thanks to SLMs, you can deploy this on hardware that doesn't consume tons of electricity. That's a very interesting use case. As we build these systems and potentially rely more on them, it's great to think that we have a default option for local inference compute. We have a very interesting question from Bruno on LinkedIn. Hey, Bruno, thank you so much for that question. Can we execute an ephemeral environment for small language models in Lambda and use S3 for persistence? I love that kind of architecture. Any suggestions here?
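One possible shape for Bruno's Lambda-plus-S3 idea, sketched under assumptions: llama-cpp-python packaged in the function's container image, and a quantized model cached in `/tmp` (Lambda's only writable storage) so warm invocations skip the S3 download. The bucket and key names are placeholders.

```python
import os

MODEL_BUCKET = "my-models-bucket"    # placeholder bucket
MODEL_KEY = "gguf/afm-4.5b-q4.gguf"  # placeholder key for a quantized model

def cached_path(key, cache_dir="/tmp"):
    """Derive the local cache location under Lambda's writable /tmp storage."""
    return os.path.join(cache_dir, os.path.basename(key))

def ensure_model(bucket=MODEL_BUCKET, key=MODEL_KEY):
    """Download the model from S3 once per execution environment (cold start only)."""
    import boto3  # provided by the Lambda runtime
    path = cached_path(key)
    if not os.path.exists(path):
        boto3.client("s3").download_file(bucket, key, path)
    return path

def handler(event, context):
    """Lambda entry point: load the cached model and answer one prompt."""
    from llama_cpp import Llama  # shipped in the function's container image
    llm = Llama(model_path=ensure_model(), n_ctx=2048)
    out = llm(event["prompt"], max_tokens=128)
    return {"completion": out["choices"][0]["text"]}
```

Cold starts pay for the S3 transfer and model load, which is why this fits latency-insensitive batch work better than interactive chat.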
There's no reason not to if you understand the compute restrictions that come with Lambda, like no GPUs. Lambda does not have the latest CPUs, so the hardware acceleration I mentioned might not be available. You can select between AMD and Intel or Arm, but it's probably not the latest Graviton. So, technically, it's possible. You could use llama.cpp with the Python API, grab a model, and go. You have to worry about cold start and transferring data from S3. Not every Lambda invocation should copy the model from S3, but you should know the tricks by now, like temporary storage. There are good blogs on how to do that. Technically, it's possible, but the performance might not be great. If you have a thousand parallel invocations, which is easy to do with Lambda, this could work for your use case, especially for batch processing where you need very low cost and are not sensitive to latency. Let's take this last question, and then we'll go into a demo. Any ideas about the cost of training a small language model? How do we help with that? Well, it's a complex thing. You can't just say it's three dollars. You have to define what training means to you. If you're talking about training from scratch, we're still talking millions of dollars. Preparing the data will be the worst part of the project. Training is actually not the hardest bit; the data is the killer. The quality of the model is really linked to the quality of the data. If we're talking about fine-tuning or post-training, starting from one of the good Hugging Face models, like AFM-4.5B, and specializing it for a particular use case, the cost is much more reasonable. It could go from a few thousand dollars to a few tens of thousands, depending on the model size and how many runs you need. But it's orders of magnitude simpler. Arcee does that, by the way. If you're interested, you can ping me, and I can connect you to the right people. Someone mentioned that fine-tuning is susceptible to catastrophic forgetting, and that's absolutely right.
Catastrophic forgetting means the model is trained so much on your own use case, like a customer support chatbot for banking, that it only knows how to do that, and everything it could do before is gone. It's like frying your brain on linear algebra and then trying to switch back to Greek philosophy. That's why I prefer to say post-training, because there is much more than fine-tuning. Techniques like model merging and model distillation, which Arcee has shown to be pretty good at, are proven ways to enhance models and add knowledge or behaviors in a non-destructive way compared to fine-tuning. Fine-tuning is a blunt tool and can damage your models. Merging and distillation are more like scalpels and neurosurgery tools. We've been chatting for half an hour, which I believe is a record for this show because we try to grasp the attention of the viewers with nice demos. So, let's get into a demo. Folks, you're watching AWS Let's Build a Startup. This is your weekly one-hour show dedicated to the secret sauce of successful startups. Today, we are here with Julien Simon from Arcee and my friend Nicolas, a senior solutions architect focusing on early-stage startups in Sub-Saharan Africa. We had a few very interesting episodes of the show. A few weeks ago, we had Lovable. I'm going to share some of the links here. Next week, we're going to have Fireworks AI and SuperAnnotate, who will take us through the journey from annotating your first piece of data all the way down to enterprise-ready inference. But let's get back to small language models and your demo. We're going to show you two demos. The first one is our 4.5 billion model running on EC2. Recently, AWS released a new family of Intel instances. I'll talk about them in a minute. We're going to show you EC2 inference on an Intel platform and then the same on SageMaker, the managed service for AI and ML, using a Graviton instance.
If you want to dive a little deeper, Giuseppe has shared some links, but you can go to the Hugging Face Hub. This is the model I'm going to be using: AFM-4.5B. That's the vanilla model. arcee-ai/AFM-4.5B-GGUF is the GGUF version, optimized for llama.cpp. If you go to the Arcee blog, you'll find blog posts and everything you need to know about AFM, why we built it, why we think it's good, benchmarks, etc. These are the instances I was talking about: R8i. They're the latest Intel CPU generation, the so-called Granite Rapids. R8i is the first one. We'll see C8i and other variants. But for now, that's what we have. You can read this good blog post and learn about this. I'm using an R8i instance here. I've built a Docker container that makes it very easy to grab models from Hugging Face and run them with llama.cpp on EC2 or any Docker environment, including SageMaker. If you go to github.com/JulienSimon, you'll find it. If you like it, I'm always happy to see some stars. There are test containers on Docker Hub, so you can just run this in five seconds. It works on any Intel or Arm system. I'm using it on my Mac very successfully. Here, I'm just going to run my container and grab the 8-bit optimized model. This is an 8-bit quantization of the base model. Q8 means quantized to 8 bits, which means we are shrinking the bit width of all the weights in the model. The base model, when you see Hugging Face models, is usually in the safetensors format and trained in 16-bit precision. When we say 4.5 billion parameters, every parameter is a 16-bit value. Quantization is a process that rescales model parameters to a smaller range. When we say quantize to 8 bits, it means we're looking at those 16-bit values and rescaling them to 8 bits. The model size will be divided by 2. You might think we're losing precision, but in my experience, going from 16 to 8 bits is completely invisible for models of this size.
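The arithmetic behind that halving is simple enough to check on the back of an envelope. This only counts the weights themselves; real GGUF files are slightly larger because quantized formats also store per-block scale factors, and inference needs extra memory for the KV cache.

```python
def model_size_gb(parameters, bits_per_weight):
    """Approximate size of the weights alone: parameters * bits / 8 bits-per-byte / 1e9."""
    return parameters * bits_per_weight / 8 / 1e9

params = 4.5e9  # a 4.5-billion-parameter model like AFM-4.5B

fp16 = model_size_gb(params, 16)  # 9.0 GB: the 16-bit safetensors checkpoint
q8 = model_size_gb(params, 8)     # 4.5 GB: half the size, as described above
q4 = model_size_gb(params, 4)     # 2.25 GB: a quarter of the original size
```

Those last two numbers are what make single-digit-billion models fit comfortably in laptop or edge-device RAM.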
If you go smaller to 4 bits, there might be a little bit of degradation, but the model is now 25% of the size. It's a trade-off. The smaller you go, the more memory you save, but the more degradation you could see. So, 8 bits is really safe. The model is already downloaded, so it's super fast. We can see llama.cpp detecting all those great instruction sets: AVX, AVX512, AMX. These are instruction sets to accelerate the typical math operations involved with deep learning and transformers. I'm using 32 cores. We loaded the model, and it should be ready to go. Now it's running locally on this machine, and I can use curl and the OpenAI API format to say, "Explain how the attention layer works in transformer models." Here we go. If the voice goes robotic, it's because it's running on the CPU. Sorry, it's on the wrong port. It's 8080. It should work. Let's give it a few seconds. Folks, in the meantime, have you ever used a small language model before? Please let us know in the chat box. There you go. Is it up and running now? Yeah, that's it. Here I'm using curl, so there's no streaming. But you can see how fast this was. This means you can get good performance out of the box on this kind of machine. For a single session, you don't need 30 tokens per second, so you could run one to four sessions in parallel, depending on the latency sensitivity of the use case. You could also run multiple models, giving each 16 threads. This way, you can run multiple models on the same instance, and even the application itself. You can scale out, running inference pods and app pods on the same instances, regardless of whether it's CPU or GPU. CPU instances tend to be cheaper, so you could scale bigger and get more parallelism and resilience. If you have 50 nodes in a cluster, anything failing won't be noticeable. CPU inference is an interesting architecture.
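The curl call from the demo above can equally be made from Python's standard library, since llama.cpp's server exposes an OpenAI-compatible endpoint. The port and path assume the demo's local setup (a `llama-server` listening on 8080); no third-party packages needed.

```python
import json
from urllib import request

def chat_payload(prompt, max_tokens=256):
    """Build an OpenAI-style chat completion request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt, url="http://localhost:8080/v1/chat/completions"):
    """POST the prompt to a local llama.cpp server and return the reply text."""
    body = json.dumps(chat_payload(prompt)).encode()
    req = request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("Explain how the attention layer works in transformer models."))
```

Because the server speaks the OpenAI format, any existing OpenAI client library can also be pointed at it by overriding the base URL.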
It won't work for high-scale scenarios where GPU acceleration is needed, but for small to medium scale scenarios, CPU could really work. Ramesh asked about small language models suitable for MEP-related projects. MEP stands for mechanical, electrical, and plumbing. In these scenarios, data is crucial, making up 80 to 90% of the problem. A colleague in Dubai created the Garnet Framework, an open-source framework to build dynamic knowledge graphs. This can be used to create a digital twin of a building, making decisions at the edge. Hugging Face published a blog about popular small language models like Qwen 1.5B, a smaller distilled version of DeepSeek, or Gemma. The combination of this framework, the right model, and data from sensors could be an excellent solution for MEP projects. Now, I'm using SageMaker, a managed service for machine learning on AWS. It works with a Python SDK that allows you to fire up training and inference instances. I'm using the same container as on EC2, hosted in ECR. I'm giving my inference endpoint a name and using a Graviton3 instance. I'm deploying the model from the safetensors format, quantizing it to 8 bits on the fly. I create a model object with the container and parameters, then call deploy. The deployment log shows the model being downloaded and quantized layer by layer. We can run inference, and the speed is virtually the same as on a GPU. This is a c7g.8xlarge, a rather cheap endpoint. You could go even lower to a 4xlarge and still get good performance. SageMaker provides auto-scaling and other features. This is a viable option if you don't need GPUs due to cost or availability issues. When you're done, delete the endpoint to stop paying. Thank you, everyone, for sticking around. If there's anything I missed, let me know. Edward asked about SLM plus graph resources. I'll be on LinkedIn, so hit me up if you need more information. We'll be in London on October 14th at the AWS GenAI Loft. Book your ticket at startups.aws.events.
It's free, but space is limited. Next week, at 2pm BST, 6am PST, and 9am Eastern Time, we'll talk about going from data to inference with SuperAnnotate and Fireworks AI. If you have any questions, feel free to reach out on LinkedIn. See you next week. Bye-bye. Thank you.

Tags

Small Language Models, AWS Let's Build a Startup, Edge Deployment, Startup Ecosystem, Machine Learning Optimization