AWS AI and Data Conference 2025 Knowledge Distillation Build Smaller Faster AI Models

April 03, 2025
Knowledge distillation transfers capabilities from large language models to smaller, faster models while maintaining performance. Organizations can achieve dramatic improvements in throughput and cost efficiency. Learn how to implement distillation using Amazon Bedrock or to build a custom solution on Amazon SageMaker. Julien Simon will showcase how Arcee AI uses distillation to develop industry-leading small language models (SLMs) based on open architectures. He will also introduce the open-source DistillKit library and demonstrate several newly distilled SLMs from Arcee AI. Speakers: Laurens van der Maas, Machine Learning Engineer, AWS Aleksandra Dokic, Senior Data Scientist, AWS Jean Launay Orlanda, Engagement Manager, AWS Learn more about AWS events: https://go.aws/events Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4 ABOUT AWS Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world’s most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster. #AWSEvents

Transcript

. Welcome. We're going to get started slowly. So I want to open up this session today with the question. I want to know how many of you have tried, let's say, in the last half a year, to build something using Gen. A tool, an application, business use case. Show hands? There are definitely some people in the audience. I think Gen AI today is becoming a common part of our lives. And we really do use it to transform our business and to excite our customers and our end users. And even though our business is transforming, as we have already heard in the keynote, I think our customers, well, they haven't really changed that much, have they? They want what they have really always wanted. They want a great quality of service, and they want it now. Simple ask, right? Today, we will talk how you can make them happy, you as well, using a technique which is called knowledge distillation. Now, don't get this mixed up with a whiskey distillation, which might be popular in this region, as the first one aims to make your models smarter, while the second one just make you think you're smarter. But it will be worth the next 30 minutes of your time. Now, my name is Alexandra, and today, together with my colleagues, Lawrence and Jean and Julien from RCE, we will discuss some details about knowledge distillation, we will actually tell you how you can do that on AWS today with different options for implementation. And we will also bring you some insights and experiences from the field. On that note, let's dive in. So for you who have actually tried to build something with Gen AI, the first question you might need to answer is which foundational model you actually want to use. There are so many available out there, So you read a bit, you understand what capabilities they actually have, you try to narrow down that list, you run some tests, you run some benchmarks, and finally you might zero in on that one final model that is good for your business based on your priorities. What we have actually seen is that these priorities are very often a pillar of three main things. You want to see how good quality of the response you get? You want to see what is the speed that you get those responses with and the cost factor because of course we all want to get the return of that investment. Now if you take this tree and actually put them on a chart, you will see that they very strongly lay to the size of your model. If you go with those big 70 billion parameter models, you will get the great quality of the response, but that response might come in seconds, sometimes even minutes depending on the use case. And you know, the bigger the model, usually the higher the cost tag as well. On the other hand, if you take a smaller model, let's say one billion model, you will get those tens of milliseconds responses you're looking for. But the quality might not be quite there. So what we are aiming to do, or what you can actually do here, you can either start with a big model and try to speed it up. And there are a lot of techniques out there which focus really on this. But this is not what we're going to talk about today. That's a separate presentation. As we're going to start with a small model and we're going to try to improve its performance to reach that gap towards those bigger models. And we're going to do that by trying to leverage those big models we have. Our big models are going to be called now our teacher models of 13 or 70 billion and we want to see if they can instruct our smaller student models to actually perform better. Now this is an expert talk, so let's try to understand a little bit how this process looks actually behind the stages. It is very closely related to the process of actually training your models, process of supervised fine-tuning, and for that, you need your label data or your perfect ground-root labels, and you're going to run model predictions on that data as well. Ideally, you actually try to compare those and minimize the difference between them, and that is our loss for the model, or our teacher loss. Now, if we take a smaller model with the student, the process is exactly the same, you will see that a student might not quite be able to capture the same complexities in the data as a teacher does, so performance won't be so good. This is how the idea came to be to actually rather try to get a student to imitate the teacher. Because if we can get a student to imitate the teacher very well, it might perform actually better even without understanding every complexity which comes from the data itself. And here we actually try to minimize a different between a student and a teacher prediction, and this is our distillation loss. Now this last line that is the knowledge installation already in itself. But if you do have a good golden label data set, you should make use of it. So you can actually combine the last two steps. You can use your ground-root label data, and you can also use additional unlabel data with teacher predictions, combine them, and this is how you would actually get the best quality value of your student the combination of this might depend on a data set profile so you have to look into those. The final thing which I want to introduce as a theory for you to know for this presentation is some categories for knowledge installation. We have two. What we have talked in the previous slide is actually black box knowledge installation and in this case we actually compare the predictions of the student and the teacher model and we try to focus on these for proprietary models, because for those, what you actually often do have is just the final predictions. Now, in case you have an open source model, rather than just getting those final answers, you can instead try to also understand the thinking process of your teacher model. And in this case, you actually have a white box approach, and the thinking is actually exposed through our model loggots. In the black box approach, the most common method is sequence, and this is what we very often see implemented that used out there. If you look into white box approaches, you have multiple methods out there. Some examples are forward KD, reverse KD, adaptive KD. Now forward KD, that's very similar like sequence KD, you just use logits rather than the outputs. For the reverse KD, you actually get your student to do the predictions and then the teacher corrects them. And adaptive is a mixture of the above two methods. It's symmetrical loss function actually so it can kind of leverage. The boat. So much on theory. Now we actually want to show you how you can do these things in AWS. And you can do that in two ways. Either you can use Amazon Betrock with a simple set of API calls to actually access these functionalities. Or if you want to try some of those white box approaches we discussed, then you might want to go with an option which lets you run a bit more fine-grained control and customization with Amazon SageMaker AI. That's it from the intro. Now I'm going to give it over to So I'll tell you more about how to do it on Bedrock. Thank you, Alexandra. OK, so now that we understand how the concept of knowledge distillation, let's talk about how we can efficiently apply it using Amazon Bedrock. First, let's quickly talk about why doing knowledge distillation with Bedrock. First of all, faster training and deployment. So when using Bedrock, you don't need to worry about GPUs, storage, cluster management. Bedrock takes care of all the underlying infrastructure for you when fine-tuning your student model, when doing knowledge installation. Secondly, it provides you direct access to industry-leading large language models and to easily fine-tune the largest models available without major development efforts. So now let's look at how does Amazon Bedrock do knowledge installation in the background. So you first select your teacher, and student model, your teacher, large model, and you smaller cost-efficient model. And then you provide your prompts. And Bedrock will use your teacher model to generate responses, and it will fit those prompt response pairs to your student model to fine-tune it. And it will generate a custom distilled model, smaller model with the same performance as your teacher model. Additionally, Bedrock also adds some proprietary data synthesis techniques that looks at enriching your prompts and responses reduce the number of iterations you need to do to improve the performance of a certain model. So Betrock streamlines this process in a single workflow. It saves in multiple rounds of tweaking prompts, training, evaluating the results in a single API call. Now, you have two approaches. You can use pronts that you will provide. So you provide your own pronts, and Bedrock will generate responses with your teacher model, and you will use those for with logits and we learn from the reasoning of the teacher model, you can do that on SageMaker and my colleague Lawrence will explain this in a second. Now, if we look at the state-by-state guide, right, we first select our teacher and student model. Right now, the models that are available for notice installation on bedrock are the models for Anthropic, Mehta and Amazon, and these are possible in the U.S. Region, so in U.S. 2 and U.S. East 1. You then prepare your data set. If you decide to use your own prance, you prepare them in JSON. When you want to use anthropic or meta for fine-tuning, there's only single-term conversation. If you use Amazon Nova, you can do multi-term conversation in your prunts. And then you upload it to Amazon S3, and you just have to provide bedrock access to these prongs. If you go for the second option, of course, you need to make sure that your teacher model had invocation logs enabled to store your prompts and responses. And then you can select two options. You can select whether you want Bedrock to use only the prompts and generate new responses or whether you want to use the prompt responses that were already generated in production. Another important feature is that you can filter down the prompts that you want to use for fine tuning. And this is very important because you only want to use the prompts that are relevant for your specific use case. And you can use the metadata, the request metadata, to filter those prompts. And final data point, you need a minimum 100 prongs and a maximum of 15,000 pounds for fine-tuning your student model. Finally, you just have to trigger your distillation job. So you trigger with the Create Model Customization Job API call. And in the background, Bedrock will do the whole process that we reviewed just a second ago in the US 1 or USWS2 reason. Once you finish your training, you can copy your custom model and deploy it in your region of choice using provision throughput. All to try this yourself. I will now hand it over to Lawrence who will look at how to do this using Amazon SageMaker. Thank you very much. Right. So now that we've seen Bedrock, let's go into SageMaker. And SageMaker, you can imagine, allow some more flexibility. It allows you to employ these methods that Alexandra described that are not purely dependent on and black boxes. Sequence KD is very useful if you only have black box, but let's say you do have the weights, you have the logits available during training. You might want to dive into some more complex methods with potentially even better results. Now before I go into knowledge distillation, let's talk first about training and fine training on SageMaker. Could I get a quick show of hands? Who has trained a model on StageMaker before? Their hands up let me do a quick recap of what it means to train on SageMaker. SageMaker is a managed service that allows you to train, deploy, and monitor your models on AWS. And for training, you start by calling the training job API. What SageMaker then does for you is it provisions a training job in the default VPC or your VPC of choice. It does some health checks before the billing starts, and you start streaming DAVID MALAN, or maybe copying the data to your training job. Typically, you'll use S3 for it, but you might have some size or speed requirements that cause you to use EFS or FSX for Luster. Next is we're training large language models. Typically, you'll be using Pyotorch or maybe TensorFlow. If you want full control of your container, that's entirely possible. You just save it on ECR and use it from there. StageMaker provides a bunch of pre-built deep learning containers for you, So you can just typically use one of those for your training job of large language models. During your training, you're going to be streaming the logs in real time to CloudWatch or SageMaker will be doing that for you, and you'll be able to know and see in real time, near real time, what is going on. For your training job, your metadata, high parameters are all going to be stored on SageMakers, so you know exactly what you did and when, which is very important for experiment visualization. You can also use TensorFlowBord if you really want to go in depth. Or any other tool for experiment tracking. Next is outputting your model and data once your training job is complete and you spin down the training cluster. Now importantly, since we are talking about these very large models, often it's not just one instance that we use, we can be in multi-GPU settings as well. And the great thing about these pre-built DLC is that many of them support distributed training out of the box. Parallel, whichever is necessary for your use case. So that's the training bit. Let's go into what it actually means to do all of the preparatory work before you start this training job. First is you prepare your data set, right? And we're talking about instruction tuning here, supervised fine tuning. So typically you're going to have some JSON with a bunch of prompts and I would say ideally human labeled responses as well, right? So those are the labels for your supervised training. Next is selecting the appropriate instance type. Where the GPU is entirely managed for you, you need to think about which GPU is going to fit my memory requirements. Then you select the optimal tuning parameters. The typical ones for deep learning are epoch, batch size you need to think about. But since this is natural language processing, we're also looking at the sequence length. And since we're looking at large models, you probably will be using Laura or some form of quantization to limit the memory requirements, which often is the bottleneck in your LLM training. Finally, evaluation. Because in natural language processing, you can have maybe straightforward problems such as classification, where you have the typical accuracy, F1, precision recall metrics. But when it comes to summarization, you need to approach the problem differently. You might have a human summary right there, but how do you know if the machine generated summary is worse or better than the human summary? Won't go into detail there for now, but you can imagine the evaluation metrics also require some time in thinking before you start. For all of this, we have a great team that spent a lot of time making a bunch of examples of how to fine tune large language models on SageMaker. I do encourage you really to check it out. Get your hands dirty. It's really intuitive. It's straightforward. You have these notebooks. You run through them, and it's going to show you exactly, like you saw on the previous slide, how to run these training. Jobs on StageMaker. Good. We'll go from fine-tuning to Sequence KD. This transition is natural because Sequence KD is just supervised fine-tuning with some different data than just the human label data that you initially had. We start with the advanced model, the teacher, and let's say we want to instruct you in that as well. We're going to be looking at four steps here. So it's a bit more work to set up. But yeah, if the gap is big enough, that is very interesting and worth it to do. Do. You've trained the teacher, then you run some stage maker inference, so you spin up an inference server, you run a prompts through them, and you get the responses. So now you have your distilled data. You also have the golden data, the labels I call them, and we're going to combine these to create a mix of the dataset. And this mixed dataset is going to be the input to your fine-tuning job for the student model. So similar process, just a few more steps. In the end, you get a similar result sequence kd it's offline in the sense that it's an asynchronous process you really do four steps to to finalize this now you can also do it online and online is going to mean that you do it synchronous within the same job still we're looking at two steps because the teacher needs to be trained if you want to train the teacher now you've done this what's going to happen next is within the same job and this is you can imagine for the memory requirements bit harder you're going to have the teacher model and the student model in memory and the teacher model is going to have batches of prompts run through them and you'll have the logis for these batches and you'll also run these batches of prompts to the student model and then compare the loges as alexander was describing earlier in the presentation and that's forward kd reverse kd and adaptive kd are slightly different because reverse kd we're going to run the prompts first through the students see what it rates. Well, it's probably going to generate some nonsense in the very beginning. And rather than the student mimicking the teacher, we're going to have the teacher saying to the student, this is not so good. This is pretty good. So it's kind of judging the student. And you'll need some stabilization techniques for that as well. These are all readily described in the papers that initially published these techniques, in this case, is mini-LLM. Now, given that we're here to talk about this topic, we also spent quite a bit of time employing this techniques. In the wild and some insights that we wanted to share that might be useful if you want to try this technique for yourself which we do encourage you to do first is performance if the gap is not big enough your teacher and your students at let's say the baseline of your student without the teacher generated information is not big enough then don't bother because it's quite a bit of work and well maybe your student your much smaller model is already good enough Next is data. When we're talking about sequence KD, you have the golden data set. You actually had a human who spent a lot of time labeling this data or a bunch of people. And that data is very valuable. So you might want to upsample it to get kind of a similar mix between the golden data and your teacher generated data. And finally, and I can't say this enough, whenever we're talking about training large language models, memory management is crucial. Quantization is going to be very useful here. That useful for us in our online experiments, but it's always worth investigating. So now we've shown you a lot of theoretics, I would say, right? What is knowledge installation? How do you do it on bedrock? I do it. How do you do it on stage maker? We're very lucky to have the champion of small language models here. Giulien Seymour, we will speak about RCAI and how they employ knowledge distillation and use it for real-life use cases. Thank you. Good morning, everybody. It's a pleasure to be back in Ireland. My name is Julien. I'm the chief evangelist for RCAI. RCAI is a US startup. It's under two years old. And yes, we are collectively the champions and the leaders in small language models. We only do small language models. I will explain why in a minute. And I'll show you some of the models that we've been building. Also put those models to work. We're platform builders, obviously on AWS. And I'll show you later in the presentation a quick demo of our latest inference platform, and as well as our agentic workflow platform. And of course, they're heavily based on small language models. And again, we'll explain why. Our platforms can run on SaaS in our own AWS infrastructure. They can run in your VPC for more privacy compliance. And if you're that kind of company, yes, we can do on-prem, although that's not our preferred option. Over the last 18 months, we've been focusing on enterprise customers across verticals. As you can see, financial services and telcos and others. That's just a small group of customers, but just the logos I can show, actually. And we've been delivering platforms and building models for those. When it comes to using Arcee on AWS, we're deeply integrated. We work across the board with service teams. You can find our models and SaaS offerings on the AWS Marketplace. We have models on SageMaker Jump Start, and we have models on Amazon Bedrock. So quite a few options. So Arcee started as a small language model company. The main reason is two of the first. Founders as well as myself spend some time at HockingFace early on and we realized that enterprise customers would get most business value from their AI projects if they use small language models versus closed LLMs. And so because the proof is in the pudding, we started building models ourselves. So starting from quality open source models available on Huggings, we ran them through our stack and added a bit of our secret sauce and across the board across model sizes all those models ended up taking the top spot on the Hugging Face leaderboard and we also started building commercial models like Mirage or Supernova which will double click on in a minute and again running the same benchmarks as the Hugging Face leaderboard plus other benchmarks at the time of release we saw them taking the top spot. So certainly we were doing something right. And I think the key to the quality of those models is, number one, starting from the best open source architectures available at any given time, not sticking to a particular architecture, just taking whatever works best right now, and then obviously applying some of our post-training techniques, including distillation. So let's double click on Supernova, which is the first commercial and we built we built last September. So this model is based on Llama 317B. And the first step in building it was distilling Llama 3105B, which is the largest Llama model available. So because we apply advanced distillation techniques when writing our own open source library, which is called Distill Kit, you can find it on GitHub, it does implement the techniques that my co-speakers introduced plus some other techniques. And we distilled this for five days on 32H100. So not a tiny workload, but not a huge workload either. And certainly a much smaller workload than training a net new 7TB model from scratch. So that's also one of the benefits of distillation. It is a cost-efficient way to build very very high quality models versus training them from scratch. Because one model is never really enough, we also built another one, starting from Alama 317B, training it on synthetic data generated with our Evolkit library and applying parameter efficient training technique with Spectrum, yet another one of our libraries. And that was a much shorter training time. Very high quality 7TB model with slightly different qualities. And because 2 is not enough, we did another one, which we are realigned, starting from Llama 31Base, which we realigned to our own preferences using DPO techniques. And so at the end of the day, we got those 370 models based on the same architecture, and we use model merging with our MerchKit library. Who has ever heard of model merging? So, okay, so all of you are heavily missing out, okay? Take my word for it. Go learn about model merging tomorrow and take a look at MerchKit. All the best models coming out today use merging. So if you're not looking at it, you're missing out. And that gave us supernova. And as you can see on this benchmark on IFE Val, not only does supernova outperform, teacher model, which is unexpected, right? Because a lot of people look at distillation as a quick and dirty way to kind of get nice small models, but not as nice as the bigger model. Well, if you do it right, you actually get a model that performs better than the original model. Why and how is a 600 level session, I guess, maybe next time. And Supernova also outperforms Cloud 3-5 And GPT4 on these particular benchmarks and others. So the takeaway here is distillation is not a toy. Distillation is not nice but not great. You get amazing models. And I think DeepSeek kind of showed that, right? Talking about DeepSeek, we also ended up distilling Deepseek V3. We built two other models that are available on Hugging Face. One is called Virtual Light. It's a 10 billion parameter model based on the Falcon 3 architecture. And Virtuoso Medium V2 is a 32 billion model based on Quinn 2.5. So we distilled Dipsic V3 and train those two models. Okay, and obviously pretty small models, but look at the benchmarks. So here, the blue bars are our 72B model from a year ago, a model called Nova, which you can get on Hocking Face. When we released it, it was the best 70B model available. So not a bad model at all. But look at the performance of the other two. So the purple one is the 32B and the pink-pish one is the 10-B model. So you can see that on almost all benchmarks, you can outperform the 72-B model, not only with the 32-B model, but also with the 10-B model. So this shows a lot of things. Again, this shows that, you know, is a powerful technique. This shows that Bayes small language models are getting better and better all the time. We started from Falcon 3 and Q22, which were better than whatever NOVA was based on. I think QN2, Q1.5 maybe. So base models are also getting better. And the post-training techniques are getting better. So, you know, Amazon loves flywheels. So I think there is a flywheel here where, you know, So, get smaller and better, and lead us to distilled models that are smaller and better, and those may be further distilled, et cetera, et cetera. So that's the beauty of small language models. Again, they're not toys. They're outperforming like you wouldn't believe, even against much larger models. If you're interested in those models and in distillation in general, I have a bunch of YouTube videos which seem to be a little bit popular, And of course, if you have questions, I'm trying to answer all those questions. So now let's look at, oh, you want to take a picture? OK. All right, sorry, sorry. Take a picture. All right, get your pictures? OK. All right. Yes? OK, let's keep going. You'll get the deck, no worries. So now, why, why, why are we doing all of this? OK, yes, smaller models, faster models. Cheaper models, etc. But now you have a million models, okay? Literally, more than that. So if you're an enterprise user, option A, well, I'm going to go and find one of those amazing small models that Julien told me about, and I'm going to run all my prompts on it. And well, yeah, it's going to be great. But on some prompt, and it's going to be cost effective, very, but on some prompts, it's not going to be awesome, right? Won't be handled as efficiently. So option B is I go with Sony 3-7 or GPT-9 or whatever insanity came out yesterday and hopefully does a good job, but it's costing you a ton of money. And when I say a ton of money, I do mean a ton of money. The cost difference between our most efficient model in the platform you're going to see here, a model called Blitz, which is a Mistral 324B model. The difference between this and Sony A37 is 300 times more expensive. Not 300 percent, 300 times. So when you're sending a small prompt, a simple prompt, a vanilla prompt, and most of them are vanilla. When you're sending those to Sony A37 or GPT whatever, you're overpaying 100, 200, 300 times. So you don't want to do that too much. Well, that's why we build this. So this is called Arcee conductor, and it's our inference platform. Based on model routing, it will automatically pick the best and most cost-efficient model for every prompt. So a simple prompt will go to a simple cost-effective model, a more complex prompt, like maybe a coding prompt, will go to Sonay or maybe a GPT. And this is done automatically. And routed automatically. And so that means you get the best model every single time. Small, fast, very cost-efficient model for simple prompts. Slower, more expensive, maybe more advanced models for the prompts that really deserve it. I believe we are actually launching this today. I should know. So go to rc.a.I and take a look. So that's one way to use those SLMs. Building agentic workflows. So my LinkedIn feed is full of influencers talking about agents. I don't know what those people are talking about. This is what we're building and we know what our customers are doing with it. So this is Arcee Orchestra. You can drag and drop no code and build your workflows. That's the content generation workflow I used to turn my YouTube videos into blog posts in different languages. It saves me a ton of time. And you can see we have different boxes here. We have Google Docs and we have different things. We have YouTube and so on. But we also have a bunch of model boxes. And more complex workflows. We'll obviously have a lot of model boxes. So imagine overpaying 300x for each of those model steps. Not too cool, right? Plus it's going to take 30 seconds or 50 seconds every single time, even for simple tasks. So instead, you can use, again, the right model for the job. Let me run this. And now we can select the best, most cost-effective and fastest model every single time. So now your workflows run faster, scale better, cost less. I mean, they cost, but they should cost, not 310 more. And they give you the best quality possible. Okay? So here you're seeing like a manual run. You can run this from a chat interface, triggering workflows automatically, and you can trigger workflows. So, summing up, this is where we think the future is, building smaller models that are on par or sometimes better than the closed models, giving you the best tool for the job at the best possible cost, and then building SaaS platforms based on those models to give you cost-effective inference without degrading quality, and letting you build powerful workflows. Integrating with all your IT apps and SaaS tools. Thank you very much. That's what I wanted to tell you today. Thank you very much for my co-speakers. Let's give them a warm round of applause. And I think the next session will keep discussing agentic workflows and Gen. AI with Zervi. Hopefully I pronounce all right. So don't miss that one. Thanks again. Have a great day. Thank you.

Tags

Knowledge DistillationAWS BedrockAmazon SageMaker