AWS AI and Data Conference 2025: Knowledge Distillation - Build Smaller, Faster AI Models

April 03, 2025
Knowledge distillation transfers capabilities from large language models to smaller, faster models while maintaining performance, letting organizations achieve dramatic improvements in throughput and cost efficiency. Learn how to implement distillation using Amazon Bedrock or build a custom solution on Amazon SageMaker. Julien Simon will showcase how Arcee AI uses distillation to develop industry-leading small language models (SLMs) based on open architectures. He will also introduce the open-source DistillKit library and demonstrate several newly distilled SLMs from Arcee AI.

Speakers:
Laurens van der Maas, Machine Learning Engineer, AWS
Aleksandra Dokic, Senior Data Scientist, AWS
Jean Launay Orlanda, Engagement Manager, AWS

Learn more about AWS events: https://go.aws/events

Transcript

Welcome. We're going to get started slowly. I want to open this session today with a question: how many of you have tried, let's say in the last half a year, to build something using GenAI, whether a tool, an application, or a business use case? Show of hands. There are definitely some people in the audience. GenAI today is becoming a common part of our lives, and we really do use it to transform our business and to excite our customers and end users. But even though our business is transforming, as we have already heard in the keynote, our customers haven't really changed that much, have they? They want what they have always wanted: a great experience, quality of service, and they want it now. A simple ask, right? Today, we will talk about how you can make them happy, and hopefully yourself as well, using a technique called knowledge distillation. Now, don't get this mixed up with whiskey distillation, which might be popular in this region. The first one aims to make your models smarter, while the second one just makes you think you're smarter. But it will be worth the next 30 minutes of your time. My name is Aleksandra, and today, together with my colleagues Laurens, Jean, and Julien from Arcee, we will discuss some details about knowledge distillation. We will tell you how you can do that on AWS today, with different options for implementation, and we will also bring you some insights and experiences from the field. On that note, let's dive in. For those who have tried to build something with GenAI, the first question you might need to answer is which foundation model to use. There are so many available out there. You read a bit, understand what capabilities they have, narrow down the list, run some tests and benchmarks, and finally you might zero in on that one model that is right for your business based on your priorities.
What we have seen is that these priorities very often rest on three main pillars: the quality of the responses, the speed of the responses, and the cost factor, because we all want a return on that investment. If you put these on a chart, they strongly relate to the size of your model. If you go with one of those big 70-billion-parameter models, you will get great response quality, but that response might come in seconds, sometimes even minutes, depending on the use case. And the bigger the model, the higher the price tag. On the other hand, if you take a smaller model, say a one-billion-parameter model, you will get those tens-of-milliseconds responses you're looking for, but the quality might not be quite there. So what we aim to do here is start with a small model and try to improve its performance to close the gap with those bigger models. We're going to do that by leveraging the big models we have. Our big models, say 30 billion or 70 billion parameters, will be called teacher models, and we want to see if they can instruct our smaller student models to perform better. This is an expert talk, so let's try to understand a little of how this process looks behind the scenes. It is closely related to the process of training your models, specifically supervised fine-tuning, for which you need labeled data, ideally perfect ground-truth labels. You run model predictions on that data and try to minimize the difference between the predictions and the labels; this difference is the model's loss, or in this setup, the teacher loss. If we take a smaller model as a student, the process is exactly the same, but the student might not capture the same complexities in the data as the teacher does, so its performance won't be as good. This is where the idea came from to get the student to imitate the teacher.
If we can get a student to imitate the teacher well, it might perform better even without understanding every complexity in the data itself. Here, we try to minimize the difference between the student's and the teacher's predictions, and this is our distillation loss. That last line is knowledge distillation in itself. But if you have a good golden-label dataset, you should make use of it. You can combine the last two steps by using your ground-truth labeled data together with additional unlabeled data paired with teacher predictions. This is how you get the best quality from your student model. The right combination depends on the dataset profile, so you have to look into that. The final piece of theory I want to introduce for this presentation is the two categories of knowledge distillation. What we talked about on the previous slide is black-box distillation: we compare the predictions of the student and teacher models. We focus on this for proprietary models, because there you often have access only to the final predictions. If you have an open-weights model, rather than just the final answers, you can also look at the thinking process of your teacher model. In this case, you have a white-box approach, where that process is exposed through the model's logits. In the black-box approach, the most common method is sequence-level KD, or SeqKD, which we often see implemented and used. In white-box approaches, you have multiple methods; some examples are forward KD, reverse KD, and adaptive KD. Forward KD is very similar to SeqKD, but you use logits rather than the final outputs. In reverse KD, the student makes the predictions and the teacher corrects them. Adaptive KD is a mixture of the two, using a symmetrical loss function to leverage both. So much for theory. Now we want to show you how you can do these things on AWS, and you can do that in two ways.
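The combined objective described above (a weighted mix of the ground-truth loss and the distillation loss) can be sketched in a few lines. This is a minimal, framework-free illustration of the standard temperature-scaled recipe; the temperature, weighting, and exact loss form are illustrative assumptions, not necessarily what any particular AWS service uses internally:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """alpha * cross-entropy with the gold label (the 'student loss')
    + (1 - alpha) * T^2 * KL(teacher || student) on softened logits
    (the 'distillation loss')."""
    # Hard-label loss against the ground truth.
    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[true_label])
    # Soft-label loss against the teacher's softened distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kd = kl_divergence(p_teacher, p_student) * temperature ** 2
    return alpha * ce + (1 - alpha) * kd
```

Setting `alpha` closer to 1 leans on the golden labels; closer to 0 leans on imitating the teacher, which is the trade-off the speakers describe when mixing labeled and teacher-generated data.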
Either you can use Amazon Bedrock, with a simple set of API calls to access these capabilities, or, if you want to try some of the white-box approaches we discussed, you can go with an option that gives you more fine-grained control and customization: Amazon SageMaker. That's it for the intro. Now I'm going to hand over to Jean to tell you more about how to do it on Bedrock. Thank you, Aleksandra. Okay, now that we understand the concept of knowledge distillation, let's talk about how we can efficiently apply it using Amazon Bedrock. First, let's quickly cover why you would do knowledge distillation with Bedrock. First of all, fast training and deployment: when using Bedrock, you don't need to worry about GPUs, storage, or cluster management. Bedrock takes care of all the underlying infrastructure for you when fine-tuning your student model. Secondly, it provides direct access to industry-leading large language models, allowing you to easily distill from the largest models available without major development effort. Now, let's look at how Amazon Bedrock does knowledge distillation in the background. You first select your teacher and student models: your large teacher model and your smaller, cost-efficient student. Then you provide your prompts. Bedrock will use your teacher model to generate responses and feed those prompt-response pairs to your student model to fine-tune it, producing a custom distilled model that is smaller but approaches the performance of your teacher model. Additionally, Bedrock applies proprietary data-synthesis techniques to enrich your prompts and responses, reducing the number of iterations needed to improve the model's performance. Bedrock streamlines this into a single workflow, saving you multiple rounds of tweaking prompts, training, and evaluating results. You have two approaches. The first is to use prompts that you provide.
You provide your own prompts, and Bedrock will generate responses with your teacher model, which are then used for fine-tuning. Or you can use your production data: if your teacher model is already running in production and you have invocation logs enabled, you can use those logs to fine-tune your student model. Another important point is that Bedrock focuses on black-box distillation, meaning it fine-tunes on generated responses using sequence-level KD. If you want to do white-box distillation, where you fine-tune on logits and learn from the reasoning of the teacher model, you can do that on SageMaker, and my colleague Laurens will explain that in a second. Now, looking at the step-by-step guide, we first select our teacher and student models. Right now, the models available for knowledge distillation on Bedrock are from Anthropic, Meta, and Amazon, and distillation jobs run in the US regions, specifically us-east-1 and us-west-2. You then prepare your dataset. If you decide to use your own prompts, you prepare them as JSON Lines. When using Anthropic or Meta models, only single-turn conversations are supported; if you use Amazon Nova, you can include multi-turn conversations in your prompts. You upload this to Amazon S3 and give Bedrock access to those prompts. If you go for the second option, you need to make sure your teacher model had invocation logging enabled to store your prompts and responses. You can then select whether you want Bedrock to use only the prompts and generate new responses, or to reuse the pre-generated responses from production. Another important feature is that you can filter the prompts you use for fine-tuning, which matters because you only want the prompts relevant to your specific use case; you can use the request metadata to filter them. Finally, you need a minimum of 100 prompts and a maximum of 15,000 prompts for fine-tuning your student model.
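Preparing your own prompts for that first option might look like the sketch below. The prompts and system text are hypothetical, and the record schema shown follows the Bedrock conversation format as documented at the time of writing; verify the exact field names against the current Bedrock model-distillation documentation for your model family:

```python
import json

# Hypothetical prompts for the use case you want to distill.
prompts = [
    "Summarize the return policy for electronics.",
    "Draft a short reply to a customer asking about delivery times.",
]

def to_record(prompt, system_text="You are a helpful support assistant."):
    """One JSON Lines record in Bedrock's conversation schema
    (schemaVersion and field names per the Bedrock docs; check the
    current documentation before relying on them)."""
    return {
        "schemaVersion": "bedrock-conversation-2024",
        "system": [{"text": system_text}],
        "messages": [
            {"role": "user", "content": [{"text": prompt}]},
        ],
    }

# Write one JSON object per line, ready to upload to Amazon S3.
with open("distillation-prompts.jsonl", "w") as f:
    for p in prompts:
        f.write(json.dumps(to_record(p)) + "\n")
```

The resulting file is what you upload to S3 and point the distillation job at.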
You trigger your distillation job with the CreateModelCustomizationJob API call. In the background, Bedrock runs the whole process in us-east-1 or us-west-2. Once training is finished, you can copy your custom model and deploy it in your region of choice using provisioned throughput. I encourage you all to try this yourself. I will now hand over to Laurens, who will look at how to do this using Amazon SageMaker. Thank you very much. Right. Now that we've seen Bedrock, let's go into SageMaker. SageMaker gives you more flexibility, enabling methods that are not purely black-box. Sequence-level KD is very useful if all you have is a black box, but if you have the weights and logits available during training, you might want to dive into more complex methods with potentially better results. Before I get into knowledge distillation, let's first talk about training and fine-tuning on SageMaker. Could I get a quick show of hands: who has trained a model on SageMaker before? For those who did not raise their hands, let me do a quick recap of what it means to train on SageMaker. SageMaker is a managed service that allows you to train, deploy, and monitor your models on AWS. For training, you start by calling the training job API. SageMaker provisions a training job in the default VPC or your VPC of choice, performs health checks before billing starts, and you stream or copy the data to your training job, typically from S3, though you might use EFS or FSx for Lustre for size or speed requirements. When training large language models, you typically use PyTorch or TensorFlow. If you want full control of your container, you can store it in ECR and use it from there, but SageMaker provides pre-built deep learning containers, so you can typically use one of those for large language model training. During training, logs are streamed to CloudWatch, so you can see what's happening in near real time.
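A job request for that API call can be assembled as below. The model ARNs, role ARN, and bucket are placeholders, and the parameter names reflect my reading of the Bedrock API for distillation jobs; double-check them against the current CreateModelCustomizationJob reference:

```python
def build_distillation_job(job_name, role_arn, bucket):
    """Request parameters for bedrock.create_model_customization_job.
    All ARNs, model identifiers, and bucket names are placeholders."""
    return {
        "jobName": job_name,
        "customModelName": f"{job_name}-distilled",
        "roleArn": role_arn,
        "customizationType": "DISTILLATION",
        # Student model (placeholder identifier).
        "baseModelIdentifier": (
            "arn:aws:bedrock:us-west-2::foundation-model/"
            "amazon.nova-micro-v1:0:128k"
        ),
        "trainingDataConfig": {
            "s3Uri": f"s3://{bucket}/distillation-prompts.jsonl"
        },
        "outputDataConfig": {"s3Uri": f"s3://{bucket}/output/"},
        "customizationConfig": {
            "distillationConfig": {
                "teacherModelConfig": {
                    # Teacher model (placeholder identifier).
                    "teacherModelIdentifier": (
                        "arn:aws:bedrock:us-west-2::foundation-model/"
                        "amazon.nova-pro-v1:0:300k"
                    ),
                    "maxResponseLengthForInference": 1000,
                }
            }
        },
    }

# With valid AWS credentials, the job would be started like this:
# import boto3
# bedrock = boto3.client("bedrock", region_name="us-west-2")
# params = build_distillation_job("demo", "arn:aws:iam::...:role/...", "my-bucket")
# bedrock.create_model_customization_job(**params)
```

Keeping the request as a plain dict makes it easy to review and version before submitting the job.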
Your metadata and hyperparameters are stored in SageMaker, so you know exactly what you did and when, which is important for experiment reproducibility. You can also use TensorBoard or any other tool for experiment tracking. Once your training job is complete, the model artifacts are written to your output location and the training cluster spins down. Importantly, for very large models you often use multi-GPU settings, and many of the pre-built containers support distributed training out of the box, whether model-parallel or data-parallel, depending on your use case. Now, let's look at what it actually takes to prepare for this training job. First, you prepare your dataset. We're talking about instruction tuning here, supervised fine-tuning, so typically you'll have JSON with a set of prompts and, ideally, human-labeled responses. Next, you select the appropriate instance type. Unlike Bedrock, where the GPU is entirely managed for you, here you need to think about which GPU will fit your memory requirements. Then, you select the tuning parameters. The typical ones for deep learning are epochs and batch size, but since this is natural language processing, you also look at sequence length. For large models, you might use LoRA or some form of quantization to limit memory requirements, which is often the bottleneck in LLM training. Finally, evaluation. In natural language processing, you have straightforward problems like classification, where you use typical metrics such as accuracy, F1, and precision. For summarization, you need a different approach: you might have a human-written summary, but how do you know if the machine-generated summary is worse or better? This requires some time and thinking before you start. For all of this, we have a great team that has spent a lot of time building examples of how to fine-tune large language models on SageMaker. I encourage you to check them out; they're really intuitive and straightforward.
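The preparation steps above can be captured in a small config plus an estimator call. The instance type, hyperparameter names, and framework versions below are illustrative assumptions, not recommendations; adjust them to your model size and the SageMaker Python SDK version you use:

```python
def training_config():
    """Typical knobs for an LLM fine-tuning job (illustrative values only)."""
    return {
        # 4x A10G GPUs; size this to your model's memory requirements.
        "instance_type": "ml.g5.12xlarge",
        "hyperparameters": {
            "epochs": 2,
            "per_device_train_batch_size": 4,
            "max_seq_length": 2048,      # sequence length matters for NLP workloads
            "learning_rate": 2e-5,
            "use_lora": True,            # parameter-efficient tuning to fit memory
        },
    }

# With the SageMaker Python SDK installed and an execution role, a job
# could be launched roughly like this (sketch, not verified end to end):
# from sagemaker.huggingface import HuggingFace
# cfg = training_config()
# estimator = HuggingFace(
#     entry_point="train.py", source_dir="scripts", role=role,
#     instance_count=1, instance_type=cfg["instance_type"],
#     transformers_version="4.36", pytorch_version="2.1", py_version="py310",
#     hyperparameters=cfg["hyperparameters"],
# )
# estimator.fit({"train": "s3://my-bucket/train.jsonl"})
```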
You have these notebooks, you run through them, and they show you exactly how to run these training jobs on SageMaker. From fine-tuning, we'll move to sequence-level KD. This transition is natural, because sequence-level KD is just supervised fine-tuning with different data than the human-labeled data you started with. We begin with the advanced model, the teacher, and let's say we want to instruction-tune that as well. We're looking at four steps here. It's a bit more work to set up, but if the gap is big enough, it's well worth it. You train the teacher, then you run SageMaker inference or spin up an inference server, run prompts through it, and collect the responses. Now you have your distilled data. You also have the golden data, the human labels. You combine the two to create a mixed dataset, which becomes the input to the fine-tuning job for the student model. It's a similar process, just with a few more steps, and in the end you get a similar result: that's offline sequence-level KD, an asynchronous process in four steps. You can also do it online, meaning synchronously within the same job. The teacher still needs to be trained first, but now, within the same job, the teacher model and the student model are both in memory. The teacher runs batches of prompts, giving you the teacher's logits for those batches. You run the same batches through the student model and compare the logits, as Aleksandra described earlier. This is forward KD. Reverse KD and adaptive KD are slightly different. For reverse KD, you run the prompts through the student first, see what it generates, and the teacher judges the student's output, which requires some stabilization techniques. These are all described in the papers that originally published these techniques.
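The teacher-inference step of the offline pipeline (steps two and three: run prompts through the trained teacher, keep the pairs) reduces to a simple loop. `teacher_generate` is a stand-in for your deployed teacher endpoint; in practice you would call a SageMaker inference endpoint here:

```python
def teacher_generate(prompt):
    """Stand-in for invoking your deployed teacher model
    (e.g. a SageMaker inference endpoint)."""
    return f"[teacher answer to: {prompt}]"

def build_distilled_dataset(prompts):
    """Offline sequence-level KD, steps 2-3: run unlabeled prompts
    through the trained teacher and keep the prompt/response pairs."""
    return [{"prompt": p, "response": teacher_generate(p)} for p in prompts]

distilled = build_distilled_dataset(["What is knowledge distillation?"])
```

The resulting records feed the mixed dataset for the student's fine-tuning job, alongside the golden labels.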
Given that we're here to talk about this topic, we've spent quite a bit of time applying these techniques in the wild, and we have some insights to share that might be useful if you want to try them yourself. First, performance. If the gap between your teacher and student is not big enough, and the baseline of your student without the teacher-generated information is already good enough, then don't bother, because it's quite a bit of work. Next, data. With sequence-level KD, the golden dataset is valuable because it was labeled by humans, so you might want to upsample it to get a balanced mix between the golden data and your teacher-generated data. Finally, memory management is crucial when training large language models. Quantization is very useful here; LoRA wasn't that useful for us in our online experiments, but it's always worth investigating. Now, we've shown you a lot of theory: what knowledge distillation is, how to do it on Bedrock, and how to do it on SageMaker. We're very lucky to have the champion of small language models here, Julien Simon, who will speak about Arcee and how they employ knowledge distillation for real-life use cases. Thank you. Good morning, everyone. It's a pleasure to be back in Ireland. My name is Julien. I'm the chief evangelist for Arcee. Arcee is a U.S. startup, under two years old, and we are collectively the champions and leaders in small language models. We only do small language models; I'll explain why in a minute and show you some of the models we've been building. We also put those models to work. We're platform builders, obviously on AWS, and I'll show you a quick demo of our latest inference platform, as well as our agentic workflow platform. These platforms can run as SaaS on our own AWS infrastructure, in your VPC for more privacy and compliance, and, if you're that kind of company, on-prem, although that's not our preferred option.
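The upsampling advice above can be sketched as follows. The target ratio and the round-robin repetition strategy are illustrative choices, not a prescription from the talk:

```python
import random

def mix_datasets(golden, distilled, golden_fraction=0.5, seed=0):
    """Upsample the (small) human-labeled set so it makes up roughly
    `golden_fraction` of the final mix, then shuffle it together with
    the teacher-generated data."""
    rng = random.Random(seed)
    # How many golden examples we need to hit the target fraction.
    target = int(len(distilled) * golden_fraction / (1 - golden_fraction))
    # Repeat golden examples round-robin until we reach the target count.
    upsampled = [golden[i % len(golden)] for i in range(max(target, len(golden)))]
    mixed = upsampled + list(distilled)
    rng.shuffle(mixed)
    return mixed
```

For example, two golden examples mixed 50/50 with ten distilled ones yields a twenty-example set in which each golden example appears five times.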
Over the last 18 months, we've been focusing on enterprise customers across verticals, including financial services and telcos. That's just a small group of customers, just the logos I can show. We've been delivering platforms and building models for them. When it comes to using Arcee on AWS, we're deeply integrated. We work across the board with service teams. You can find our models and SaaS offerings on the AWS Marketplace, and we have models on SageMaker JumpStart and on Amazon Bedrock. So quite a few options. Arcee started as a small language model company because two of the founders, as well as myself, spent some time at Hugging Face early on. We realized that enterprise customers would get the most business value from their AI projects if they used small language models versus closed LLMs. Starting from quality open-source models available on Hugging Face, we ran them through our stack and added a bit of our secret sauce. Across the board, those models ended up taking the top spot on the Hugging Face leaderboard. We also started building commercial models like Mirage or Supernova, which we'll discuss in a moment. Running the same benchmarks as the Hugging Face leaderboard plus other benchmarks, at the time of release they took the top spot. So we were doing something right. The key to the quality of those models is starting from the best architectures available at any given time, not sticking to a particular architecture, just taking whatever works best right now, and applying some of our post-training techniques, including distillation. Let's dive into Supernova, the first commercial model we built, released last September. This model is based on Llama 3.1 70B. The first step in building it was distilling Llama 3.1 405B, the largest Llama model available. Because we apply advanced distillation techniques, we ended up writing our own open-source library, DistillKit, which you can find on GitHub.
It implements the techniques my co-speakers introduced, plus some others. We ran this distillation for five days on 32 H100s. Not a tiny workload, but not a huge one either, and certainly much smaller than training a new 70B model from scratch. That's one of the benefits of distillation: it's a cost-efficient way to build very high-quality models compared with training them from scratch. Because one model is never really enough, we also built another one, starting from Llama 3.1 70B, training it on synthetic data generated with our EvolKit library and applying parameter-efficient training techniques with Spectrum, another one of our libraries. This resulted in a much shorter training time and gave us another very high-quality 70B model with slightly different qualities. Because two is not enough, we did another one, starting from the Llama 3.1 base model, which we realigned using DPO techniques. At the end of the day, we had three 70B models based on the same architecture, and we combined them using model merging with our MergeKit library. Who has ever heard of model merging? Okay, so all of you are heavily missing out. Take my word for it: go learn about model merging tomorrow and take a look at MergeKit. All the best models coming out today use merging; if you're not looking at it, you're missing out. This gave us Supernova. As you can see on this IFEval benchmark, Supernova outperforms its teacher model, which may be unexpected, but if you do it right, you actually get a model that performs better than the original. Supernova also outperforms Claude 3.5 and GPT-4o on these particular benchmarks and others. The takeaway is that distillation is not a toy; you get amazing models. Talking about DeepSeek, we also ended up distilling DeepSeek V3, building two other models that are available on Hugging Face. One is Virtuoso Lite, a 10-billion-parameter model based on the Falcon 3 architecture.
The other is Virtuoso Medium V2, a 32-billion-parameter model based on Qwen 2.5. We distilled DeepSeek V3 into these two models. Obviously, they are pretty small models, but look at the benchmarks. The blue bars represent a 72B model from a year ago, called Nova, which you can get on Hugging Face. When we released it, it was the best 70B-class model available on Hugging Face. But look at the performance of the other two: the purple bars are the 32B model, and the pink ones are the 10B model. On almost all benchmarks, you can outperform the 72B model, not only with the 32B model but also with the 10B model. This shows that distillation is a powerful technique. It also shows that small language models are getting better all the time. We started from Falcon 3 and Qwen 2.5, which were better than what Nova was based on, so base models are getting better, and the post-training techniques are getting better. There is a flywheel here: base models get smaller and better, leading to distilled models that are smaller and better, which can be further distilled, and so on. That's the beauty of small language models. They're not toys; they're outperforming like you wouldn't believe, even against much larger models. If you're interested in these models and distillation in general, I have a bunch of YouTube videos that seem to be a little bit popular, so go and find me on YouTube, and if you have questions, I try to answer all of them. Now, let's look at why we're doing all of this. Smaller models, faster models, cheaper models, and so on. But now you have a million models, literally more than that. If you're an enterprise user, option A is to find one of those amazing small models and run all your prompts on it. It will be great and cost-effective, but on some prompts, it won't be awesome, especially the most complex ones. Option B is to go with Claude 3.7 Sonnet or the latest GPT or whatever came out yesterday, hoping it does a good job, but it costs a ton of money.
Compared with the most efficient model in the platform you're about to see, a model called Blitz, based on Mistral Small 3 24B, Claude 3.7 Sonnet is 300 times more expensive. Not 300 percent: 300 times. When you're sending a simple prompt to Claude 3.7 Sonnet or GPT, you're overpaying 100, 200, 300 times. So what do you do? That's why we built this. This is Arcee Conductor, our inference platform. Based on model routing, it automatically picks the best and most cost-efficient model for every prompt. A simple prompt goes to a simple, cost-effective model, while a more complex prompt, like a coding prompt, goes to Claude or maybe a GPT model. This happens automatically, so you get the best model every single time: small, fast, very cost-efficient models for simple prompts, and slower, more expensive, more advanced models for the prompts that really deserve them. We are actually launching this today, so go to arcee.ai. The other way to use these small models is in agentic workflows. My LinkedIn feed is full of influencers talking about agents, but this is what we're building, and we know what our customers are doing with it. This is Arcee Orchestra, where you can build your workflows with drag and drop, no code. This is the content-generation workflow I use to turn my YouTube videos into blog posts in different languages, saving me a ton of time. You can see we have different boxes here, including Google Docs, YouTube, and so on, but we also have a number of model boxes, and more complex workflows will obviously have many of them. Imagine overpaying 300 times for each of those model steps. Plus, it would take 30 or 50 seconds every single time, even for simple tasks. Instead, you can use the right model for the job. Let me run this. Run test. Yes. Now we can select the best, most cost-effective, and fastest model every single time.
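The routing idea can be illustrated with a deliberately simple sketch. Everything here is hypothetical: the model names, prices, and keyword heuristic are made up for illustration, and a production router such as Arcee Conductor would use a learned classifier rather than keywords:

```python
# Hypothetical model tiers with illustrative per-1K-token prices.
MODELS = {
    "small":    {"name": "small-slm-24b", "cost_per_1k": 0.0005},
    "frontier": {"name": "frontier-llm",  "cost_per_1k": 0.15},
}

# Toy signal for "this prompt is hard": a few task keywords.
COMPLEX_HINTS = ("code", "prove", "refactor", "derive", "debug")

def route(prompt):
    """Toy router: long prompts or prompts containing hard-task keywords
    go to the expensive frontier model; everything else goes to the
    cheap small language model."""
    text = prompt.lower()
    if len(prompt) > 500 or any(h in text for h in COMPLEX_HINTS):
        return MODELS["frontier"]
    return MODELS["small"]
```

Even this crude heuristic captures the economics: if most traffic is simple, most tokens are billed at the small model's rate rather than the frontier model's.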
So now your workflows run fast, scale better, cost less, and give you the best quality possible. Here you're seeing a manual run, but you can also run this from a chat interface, trigger workflows automatically, or trigger them through APIs. Summing up, this is where we think the future is: building smaller models that are on par with, or sometimes better than, the closed models, giving you the best tool for the job at the best possible cost, and building SaaS platforms on top of those models to give you cost-effective inference without degrading quality, letting you build powerful workflows, apps, and SaaS tools. Thank you very much. That's what I wanted to tell you today. Thank you to my co-speakers; let's give them a warm round of applause. The next session will keep discussing agentic workflows and GenAI with Zervi, so don't miss that one. Thanks again. Have a great day.

Tags

Knowledge Distillation, AWS Bedrock, Amazon SageMaker

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.