Good afternoon, everyone, and welcome to Under the Hood with AWS Compute, the show that takes you behind the scenes to understand how AWS services and solutions help drive great outcomes everywhere from media and entertainment to technology. I'm your host, Lorenzo Winfrey, Senior Specialist for Flexible Compute. This is going to be a good one, folks. I'm really excited about today's show. My guest today is none other than Julien Simon, Chief Evangelist at Arcee, and my peer Jeff Underhill, Principal Specialist for Flexible Compute. On today's episode, we'll dive deep into the optimal way to leverage AWS Graviton for running AI ML workloads and share the exciting work that Arcee is doing with Graviton. Good afternoon, Julien, Jeff. Welcome to the show. How are you doing?
Doing great. Thank you for having me. Pleasure to be back.
That's right. Jeff, I think we had you on a few weeks ago. Glad to have you back as well. And I'm super excited because we're talking about two of my favorite things, which are Graviton and AI. We're combining those two things today. So if you're tuned in, everyone in the chat, you're in for a good one. As always, drop in the chat. Let us know where you're watching us from so we can give you a shout out. I'm hanging out in the DMV, the D.C., Maryland, Virginia area, here today. And I think, Julien, Jeff, I think you guys are out on the West Coast.
Oh, I'm out. I'm outside Paris. So, you know, Paris, France, in case you wondered. It's horribly late, but like I said, you know, I'm jet lagged, so it doesn't really matter what time zone I'm in.
And yeah, Jeff, where are you by the way?
Yeah, I am in, I'm going to say sunny Seattle. It's not sunny here today, but so yeah, I'm on the West Coast of the US, and hey, I appreciate you hanging with us late in your evening.
Yeah, sure. Especially jet lagged. My pleasure.
For sure. Look at that commitment, folks. I always appreciate you hanging in with us. All right. So, Jeff, before we dive kind of super deep into the details, can you tell the audience exactly what AWS Graviton is and how it works?
Yeah, sure. So you should know, you've got the shirt on, right? So you're sporting the Graviton t-shirt. Nice. Representing there. Yeah. So fundamentally, AWS has been developing custom silicon for a number of years, I think probably over 10 years now. Back in 2017, we introduced the Nitro system, and that was the first example of custom silicon from AWS. Now Graviton is our general-purpose processor. We introduced the first generation of Graviton in 2018. So we're about six years into this journey now. It's amazing to think I've been here for all of that time, and it's been quite the journey. And at re:Invent at the end of last year, so literally one year ago, almost to the day, with re:Invent 2024 just a couple of weeks away, we introduced Graviton4. So we're on our fourth generation of processors right now. And over those four generations and that six-year period, we've been able to deliver four times the performance from generation one to generation four. The current Graviton4 processors are powering multiple instance families, such as C8g, R8g, and X8g, with the X8g offering up to 192 vCPUs and 3 terabytes of memory. So it's actually quite a beast of a processor. And yeah, it's been around for a number of years. Let me see, what else can I say? We've built over 2 million Graviton processors. That was a stat that we shared last year. We've got over 50,000 daily active customers using it across over 150 instance types in our global regions. So yeah, we've been doing this for a while. It's a mature offering, and lots of customers rely on Graviton to run their business every day.
No doubt. No doubt. It's definitely one of those things that I'm glad to see folks coming around on, right? The whole Arm architecture, people got exposed to it on their cell phones. And now it's powering some of the most dynamic and versatile workloads that we're running at scale these days, which I think is really great. I think, Jeff, one of the biggest things is Graviton's up to 40% better price performance. Can't forget about that one. That's always something we want to note. So, you know, definitely check out that fourth generation, Graviton4, and the 8g instance families. So folks, if you haven't checked out Graviton, I'm sure you're going to want to check it out after this episode. The applications for it are just getting bigger and bigger every day. So, okay, Jeff. I'm always a fan of getting the tough questions out of the way early on. We can kind of smooth it out after that. So when it comes to Graviton, the primary thing I think everybody watching in the audience would like to know today is if and how they can use Graviton for their AI ML workloads. Can you talk a little bit about the AI ML pipeline and how Graviton can play a role there?
Sure. So first off, I'm going to say I'm not sure that's a tough question. And the reason I say that is fundamentally, if you look at the AI ML space, you've got a ton of diverse workloads. And that means there's no one size fits all for running all AI ML workloads, right? So a lot of people gravitate to accelerators and think, you know, that's the only way you can get this done. In the AWS Cloud, we've got AMD, Intel, NVIDIA, and AWS Graviton. So it's about choice fundamentally. And so the question is not tough. Where it gets into the tough part, though, is figuring out which AI ML workloads belong where. And even then, if you think about it, in the cloud you have access to all of these things. So once you've got an AWS account, all of these are at your disposal. And because you only pay for what you use, it fosters the opportunity to evaluate and experiment, right? So that said, the key is having options. So, can you use Graviton? Absolutely. And throughout the course of the remainder of the session here, we'll sort of get through that. I'll be talking about that at re:Invent 2024 as well. Shameless plug. And then if we look at Graviton3 and Graviton4 and the advancements there from a compute perspective. So let's say Graviton3, we doubled the SIMD bandwidth over the previous generation. So 2x the floating-point capacity, right? It added the SVE vector extension in addition to Neon, which has historically been in the architecture. We improved the branch predictor. It was the first EC2 processor to have DDR5 memory. So lots of memory bandwidth to feed the beast. We've also been contributing software optimizations to the frameworks to make this performance accessible to people. And then when we look at Graviton4, right, 96 cores instead of 64, 30% better performance, it just takes it all to a whole new level. More memory channels, faster memory channels. So yeah, absolutely, you can. And if you're not considering running your AI ML workloads on CPUs, well, you may just be spending more and giving up performance. So I don't know. I think, yeah, people should really be taking a closer look at this.
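If you want to see those ML-relevant hardware features for yourself, here is a minimal check you can run on a Graviton3 or Graviton4 instance. The flag names are the standard Linux arm64 feature flags; nothing here is specific to the episode.

```bash
# Quick check for the ML-relevant CPU features mentioned above on an arm64 Linux instance.
grep -o -w -E 'asimddp|sve|i8mm|bf16' /proc/cpuinfo | sort -u
# On Graviton3/Graviton4 you should see: asimddp (int8 dot product), bf16 (BFloat16),
# i8mm (int8 matrix multiply), and sve (Scalable Vector Extension)
```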
Agreed. Agreed. And hopefully, like I say, after today's show, they will be motivated and equipped with the information they need to go and take advantage of this opportunity. Because I do think that there's a broad kind of belief that GPUs are really the only way to do some great things in this space. And I think that Julien is definitely going to have some alternative perspectives on that, which I think is good. So, you know, definitely pull up a seat and lean in with us today. So of course, today, many of our customers and partners have started to explore and use Graviton for their AI ML workloads. And that's why we really wanted to have Julien on with us today to dive deep into how Arcee is leveraging Graviton and some of the benefits that they've been able to achieve doing that. And so, Julien, I really want to, of course, welcome you to the show. Thanks for hanging out with us. Like I said, I know it's late there. And I wanted to start off by having you tell our audience a little bit about Arcee and exactly what it is that you all do.
Okay, sure. So, Arcee is a US startup. It's about a year and a half old. Not that old, but it's been making a lot of noise in those 18 months. So what we do is we're the champions of small language models. We only do small language models, which we refer to as SLMs. So you're going to hear that word a lot. Save us some time along the way. We do nothing but SLMs. I don't think we'll do anything else. And the reason we focus on SLMs is because we, not only are we convinced, but we also see that enterprise customers and users in general can get an awful lot of business value from those models. And when I say business value, I mean they can get close or even above the performance of the largest closed models like GPT-4 for a fraction of the price because small language models are small. What's the definition of that? I would say all the way to 70 billion is still considered small. Some folks could challenge me and say, no, that's too big. But if we assume that the largest closed models are, let's say, a trillion parameters, then 70 billion is still kind of small, and you can run it on a single accelerator. I guess that's my definition. A model you can load on and run on a single accelerator is considered an SLM. We're not even going to use accelerators today. So we're going to take you down the rabbit hole of shrinking and cost optimizing. So that's what Arcee does, bring SLMs to the enterprise world. And we do this in different ways. So we build our own models. We are model builders. We share some of them on Hugging Face for the community to use for free, and we keep some of them as commercial models because, of course, we want to grow the company. All our models are based on open-source architectures. So we start from, you know, let's say LLaMA or Qwen models, et cetera. And we run them through our stack. We have a model tailoring stack which is based on open-source libraries, which are getting a lot of attention, like MergeKit for model merging or DistillKit for distillation, et cetera. So a combination of open-source, in-house recipes, and the know-how of our amazing research team. And so we build models that sit at the top of the Hugging Face leaderboard. And we also build models that compete with the best of them. Like I said, our largest model, our 70 billion parameter model, which is called Supernova, can outperform GPT-4 and other closed models on several benchmarks. So we build those models and we also build a platform that customers can use where they will find our own SLMs, where they can tailor their own SLMs to the particular business problem they want to solve, and those models are orchestrated, and we let them build workflows, use agents, etc. And that's going to be a really huge launch in a few weeks, hopefully by re:Invent time. So if you've taken a look at Arcee maybe a year ago and thought, well, okay, just a couple of models, what's the big deal? You should keep an eye on us because what we're bringing to customers in the next few weeks is going to be absolutely amazing in terms of cost-performance and some features that are probably just not available anywhere else, certainly not at the cost-performance price point that we're going to deliver. So SLMs all the way, you know, small is beautiful. And obviously that took us to, you know, to different paths of, hey, okay, we have great models. We deliver outstanding performance with fewer, much fewer parameters. Now, what are the best hardware platforms for that? And well, we can run on small GPUs no problem. 
We don't need the huge ones. We can run on Trainium and Inferentia. We also work with that team, so hi to the other chip builders at AWS. And Graviton is of course on the map, and I guess we'll talk more about that.
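For anyone who wants to try one of Arcee's open models, here is a minimal sketch of pulling one from the Hugging Face Hub. The repo id below refers to the SuperNova-Lite model demoed later in the episode and is an assumption on my part; check huggingface.co/arcee-ai for the current list.

```bash
# Download an Arcee open model from the Hugging Face Hub (repo id assumed).
pip install -U "huggingface_hub[cli]"
huggingface-cli download arcee-ai/Llama-3.1-SuperNova-Lite --local-dir ./supernova-lite
```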
Thank you, Lorenzo. Just before you chime in there, so Julien mentioned Inferentia and Trainium, and people viewing here may not realize what those are. So I talked about the AWS custom silicon innovation, starting with Nitro, our virtualization offload, and Graviton, our general-purpose processor. AWS Trainium is a purpose-built accelerator from our silicon team built for doing training, hence the name Trainium. And AWS Inferentia, similarly, is a purpose-built accelerator for inference. So we have that suite of four custom chips that we build today. And as it turns out, Trainium is also very good at inference. Well, yeah. I mean, there's no compute problem there, right? It is also very good for inference. So all those chips are on the radar, and that's great. You know, AWS has always been big on freedom of choice. And I know because I worked at AWS for six years. How many databases do you have? How many ways do you have to deploy a container? And I guess now the new joke will be how many chips do you have to run an SLM on AWS? And not all customers have the same ROI targets or expectations or cost constraints, et cetera. So some use cases with very high ROI might justify the cost of more expensive hardware accelerators, but maybe some use cases need to be commoditized and cost-optimized to the max. And that's where, you know, you need to look at the smallest possible models on the smallest possible and most cost-effective chips. And I think that's Graviton in a nutshell.
Hey, I like the way that sounds. So chat, what we got? We got small language models, right? You know, from this perspective, I think I like Julien's definition in terms of, hey, can I run this thing on a single accelerator, right? 70 billion parameters or less. And I think that you talked a little bit about this at a high level, Julien, but I definitely want to make sure we pull the thread on this a little bit. And so, of course, earlier this year, we had your CEO, Mark McQuade, on. Shout out to Mark on that. Hey, Mark. Exactly. And I was a fan of SLMs ever since. So shout out to Mark for getting me hooked on that one. And so I think the big thing here, and I think you talked about it a little bit, but I really want to bring it home, right? Is the differences between small language models and large language models, which is what most people are probably really familiar with, and the situations and use cases where you'd want to use one versus the other. Right. So I think we really got clear up front that, okay, small language models, right, they're significantly smaller. But kind of like, what are the big differences and what do you get from them? Any guidance for our audience in terms of when to use one or the other?
So, obviously a couple of years ago, ChatGPT came out, and everybody, I mean, all of us, rushed to go and test it. And it was a real breakthrough. It helped a lot of folks, including non-technical folks, understand what AI was, what generative AI was, what they could do with it in their organization, HR teams, finance teams, supply chain teams, not just the data fans and MLOps geeks like us. So that was great. And then, of course, they started building POCs. Some of those POCs became production. And when you get to production is where the rubber meets the road, that's where you see if the solution works or not. And in some cases, it does in terms of ROI, right? And in terms of does the model provide enough domain knowledge, or am I starting to see some limits there? And in terms of security and compliance, big mistakes were made, sending confidential data to OpenAI without any vetting, without any precaution there. But generally, I mean, I guess any organization can find use cases that will work with closed models. So it could be OpenAI; on AWS it could be Claude models on Bedrock, for example. Do a bit of prompt engineering, figure things out, it's fine. But what literally sends customers our way is a combination of three things. I would say a lot of it is cost. You know, pay per token looks fine, but I keep saying it's deceptively simple, and people don't truly understand the real cost. They think if they ask a 15-word question, it's going to be 15 tokens, but no, because they're injecting system prompts. They're injecting data through retrieval-augmented generation. They do four or five back and forths, et cetera. So that seemingly short conversation ends up being, you know, maybe 50,000 tokens every single time. So the costs do add up, and the price is the price, and there isn't a lot of flexibility here. So as you start scaling use cases, as you start scaling the number of users, as you add more and more data to the mix, et cetera, things can go off the rails, literally. So that could be problem number one. Can I replicate that closed-model experience with a smaller model? Number two, some highly regulated customers also want to have solutions they control completely. They need full ownership of the solution. They're uncomfortable with or sometimes prevented from using those third-party APIs. Bedrock is a little simpler because it stays in the VPC, but I would say outside APIs are sometimes not possible at all, and generally, they prefer to run and manage and control what's going on. Number three is they just need deeper domain knowledge. If you're doing vanilla tasks with closed models, and I don't mean that in a negative way, the usual day-to-day business stuff, rewriting or translating documents, etc., nothing extremely fancy, those models do a really good job. They're really good at language. But if you need deep company knowledge, deep company concepts and terminology, etc., in the energy domain or the telco domain, etc., those models don't have that. They certainly know about telco, they certainly know about finance, they certainly know about chemical engineering, but do they really know all the finer points of how that domain is actually implemented in your company, in your business unit, in your R&D team? Of course not, because that would require data that is not in public training sets. And so that's where the ability to tailor the models, fine-tune them, or further train them on your data is critical. And that's next to impossible to do with those closed models. So that's really the split.
I think customers who are generally happy with the closed models don't need insane domain depth and can live up to a point with a solution they do not fully own and control. If you start pushing those sliders into the red zone, one of them or two of them or even all three, things start to break in terms of economics or quality, etc. And that's where SLMs are unbeatable, in my opinion. And that's why you see all those folks moving to SLMs and tailoring them, etc.
Yeah, so that's, and of course, there is overlap. But generally, if you want to look at things from a simple perspective, that's really the split between the closed models and the smaller models. Gotcha. So what I heard, Julien, is when you think about small language models versus large language models, cost is going to be a major driver, right? Yes. Especially as you go to prod. People want to use AI. When you're in the sandbox, everything is, I mean, there is no cost issue. But if you want to deploy AI, multiple AI apps for 1,000 or 2,000 internal users, and of course, you want people to use that stuff because you see the benefits, you see the productivity, you shouldn't be stopped. You shouldn't be prevented by costs. It's such a frustrating thing having your CFO yell at you and having to literally dial down on your AI efforts strictly because of cost. And again, those closed models cannot solve that problem. They have no way of solving that problem because they were built with assumptions that make them expensive. You can't have it all, right? You can't get the best of both worlds. There are design assumptions behind what they are. I'm not saying they're bad. They are what they are. And the cost structure is going to be what it is. And yes, you need all those huge GPUs to run those models. And we know how much they cost, and there's no solution to this. No doubt. And so I think that's really, like I really like to say, I love where the space is going, because you have all these options where you can find the right one for you, right? Exactly. Use what works, you know? Yes, the SLMs will be more cost-effective. Because of their nature, they are small, and it will be easier to run them on all kinds of silicon. And I think, as I said, when you go to production, many times you're looking for a narrow scope. You don't need a model that's a mile wide and an inch deep. You don't need the recipe of the day and the astronomy questions and the poetry, right? You want to summarize ER notes or answer deep PhD-level questions on genomics. And that's exactly where you need a model that's an inch wide and a mile deep, not the other way around.
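To make the pay-per-token math from this exchange concrete, here is a rough back-of-the-envelope sketch. Every number below is an illustrative assumption, not a figure quoted in the episode or any provider's actual pricing; the point is simply how a "15-word question" balloons once the hidden context is counted.

```bash
# Illustrative numbers only -- why a short question is never just 15 tokens.
SYSTEM_PROMPT=800     # system prompt injected on every turn
RAG_CONTEXT=6000      # retrieved documents added to the prompt
QUESTION=20           # the user's actual question
ANSWER=400            # a typical generated reply
TURNS=5               # back-and-forth exchanges in one conversation

PER_TURN=$((SYSTEM_PROMPT + RAG_CONTEXT + QUESTION + ANSWER))
PER_CONVERSATION=$((PER_TURN * TURNS))
echo "tokens per turn:         $PER_TURN"          # 7220
echo "tokens per conversation: $PER_CONVERSATION"  # 36100 -- tens of thousands, per user, per conversation
```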
No, definitely. So I think, as I say, for anybody starting to think about it, those are the vectors you're going to want to think about. Okay. So now that we've laid out the use cases around SLMs, let's get a little bit deeper if you don't mind, Jeff. So I think it's become a common belief that the use of accelerators and specifically GPUs are the key to successful AI ML use cases. Is that always true? Do you think there's a place for CPUs and, in particular, AWS Graviton to have a material impact in this space?
Yeah. So, I like to use an old phrase, right? If the only tool you have is a hammer, then every problem looks like a nail, right? And I think that's what we're suffering from a little bit here in the world of AI. So Julien just laid it out about SLMs versus LLMs as well, right? It's picking the right tool for the job. And just like an LLM versus an SLM, you want to do the same with the infrastructure that underpins that and runs it. So GPUs are amazing. Accelerators are amazing. I mean, we built purpose-built accelerators for training and inference, et cetera. And there are massive models and large models and places where you want to use that. But as you start to get more efficient and you narrow the scope and focus on domain-specific knowledge, that brings with it a level of efficiency as well and allows you to run things on CPUs. Julien mentioned SLMs sort of capping out at 70 billion parameters. I like that you've tried to quantify that as fitting on a single instance. Arm actually published a blog where they've shown a 70 billion parameter model running on a Graviton instance. So that size model can run on a CPU. And when you do that, you have some other efficiencies. So CPUs, they're general-purpose. They're easy to program. Lots of software support, lots of variety of instance families within EC2, instance sizes, global availability, the cost structure. So there's a lot of benefits that start to come to bear once you can use a CPU to run your models, right? And, you know, Graviton for sure. So I kind of alluded to it earlier: with Graviton3, we doubled the floating-point performance and added DDR5 memory; Graviton4, you know, more cores, faster cores, more memory interfaces, faster memory. We've also got native data types like BFloat16, which came in with Graviton3, and some int8 instructions that help with quantization and some other optimization techniques. So there are some ML-specific capabilities in these CPUs now. And so what you end up with is kind of a continuum of compute. So CPUs can get you this far, and then there's a bit in the middle where CPUs and accelerators overlap, and then you need accelerators to take you the rest of the way. And the key is figuring out where on that continuum is the best price-performance for your specific use case, because that's ultimately what you're optimizing for, right? I've got my model, I've got my input context, I've got my particular technical and business objectives I'm trying to meet. And I'm trying to scale my business and do so as efficiently as possible. So you want to make sure you're taking advantage of the cloud, and the experimentation that it provides, to pick the right infrastructure to run your workload as efficiently as possible. So yeah, short answer, Lorenzo, absolutely. And if you're not trying Graviton for your AI ML workloads today, you're leaving money on the table, and you should probably go and experiment with it. So yeah, absolutely.
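One practical way to start the experimentation Jeff describes is to enumerate the Graviton (arm64) instance types available in your region and shortlist candidates to benchmark. This assumes the AWS CLI v2 is configured; the region and output fields are just examples.

```bash
# List arm64 (Graviton) instance types in a region, with vCPU count and memory.
aws ec2 describe-instance-types \
  --filters "Name=processor-info.supported-architecture,Values=arm64" \
  --query "InstanceTypes[].[InstanceType, VCpuInfo.DefaultVCpus, MemoryInfo.SizeInMiB]" \
  --output table \
  --region us-east-1
```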
Right. So that sounds good to me, chat. I'd say the key takeaway on that one is that it is not always true that you're going to need those GPUs and those accelerators to successfully execute your ML use cases. As we kind of said, and as you're probably going to hear over and over again here today, it's really about finding the right tool for the job. Yes. There are lots and lots of tools in that toolbox that you can go out and leverage. So definitely find the right one. And I think that's a really key thing. And maybe if you don't remember anything else we say here today, one of the high-level things that's really important is that when you talk about CPUs in this space, there is, in the right context, more than enough room for them, where you can use them to great effect. So don't ignore CPUs in this space. Otherwise, like I say, you're leaving money on the table.
Yeah, if I can use an analogy, I'm sure we have software engineers in here. It's, you know, if you've been long enough in tech, you see that we keep solving the same problems again and again and again. And we keep coming up with similar solutions. It's just that generally a new generation rediscovers what the older guys had already done. Like 20 years ago, what were we all doing? I mean, we were writing huge enterprise apps, maybe in Java. And we were building that huge Java monolith and deploying it on a fancy server, probably not in the cloud, on some JVM with obscure settings, et cetera. And, you know, don't get me wrong, that helped a lot, mostly. And then we hit the wall on performance, debuggability, cost, and all those usual things. And what did we do, you know? Microservices, right? So we started breaking the monolith into smaller things that would do only one thing but do it great. Okay. We learned how to orchestrate those microservices. We learned how to deploy each individual microservice on the best hardware platform and scale it out, you know. Some would be memory-bound, some would be compute-bound, some would be IO-bound, and not all servers are built the same. The reason why AWS has so many instance types is precisely that not every app needs the same configuration. Okay. And I guess that's how we build stuff now: APIs, scaling out, and using the right tool and the right library and the right hardware for the job. Okay. It's kind of the same with AI, right? We started with the monolith, and, you know, I'll call the GPTs the monolith, running on huge, huge GPU instances. I mean, God only knows what they really use, but we can figure it out, right? If you want to deploy those huge models, you need a very expensive multi-GPU box. I'm not even talking about the training part, just the inference side. What are we doing with SLMs? We are breaking the monolith. And that's what the Arcee platform is, where instead of using a monster, you know, we use a swarm of SLMs, one for text generation, one for maybe text to speech, one for maybe image modalities, et cetera, et cetera. Those collaborate just like microservices collaborate, and we can scale and optimize each one of them on the best platform. So maybe the image stuff needs the GPU because it's highly parallel and we need more oomph, but maybe the text generation stuff can run just fine on CPU and Graviton. And if you're an AWS customer today, go ask for 16 GPU instances, let alone 64, okay, you'll see what they'll tell you. Ask for 16 Graviton instances or ask for 200 Graviton instances, and you'll get a very different reaction. It's like, yeah, okay, sure, of course, you can grab them, and we'll give you a good price, and you can do reserved instances, et cetera. So the engineering mindset is just at work once again, and that's what SLMs are. Think of SLMs as breaking the monolith, working with a collection of tools optimized for one thing, orchestrating them, and scaling and cost-optimizing each one of them. If it worked for microservices, it's working for AI. There's no difference. So no, you don't need the fat, expensive, hard-to-procure, hard-to-scale monster GPU boxes. Sorry.
So as an extension of my hammer analogy, right? There's a variety of hammers. You don't always need a sledgehammer, right? If you're trying to put a little tack in the wall to hang a photo, you don't need a sledgehammer. So definitely picking the right tool for the job. I like the analogy of breaking the monolith. So I think there's a continuum here of specialization and efficiency, right? At one end of the spectrum, you've got LLMs, which are amazing, trained on vast data sets, capable of so many things, reasoning, and stuff like that. And then at the other end, you've got the smaller, more specialized, more agile SLMs. And you bring up a good point there, Julien, that using combinations of those, you mentioned a swarm of them, right? So yes, you're putting them in concert, like microservices, and bundling the right things here. So yeah, that's a good way to think of it. It's definitely, I think, a really good way to think about it and to kind of bring it home in terms of the orchestration, right? Because I think that's the next big thing that's going to become important: the orchestration of different smaller models together to kind of do these bigger jobs. And external tools too. People keep talking about agents, which is just a fancy word, honestly. Again, it's another buzzword, but guess what? People have been building and running workflows for 50 years without any AI, right? So if you need to do math, the last thing you want to do is use an LLM or an SLM to do that, because they are not deterministic animals. They are probabilistic animals. And, you know, math is kind of deterministic if you ask me, right? I mean, two plus two has a pretty deterministic answer. You don't want to roll the dice, right? Even if it's 99.9% right, well, it's the rest that's going to kill you and kill your app and kill your workflow. So the language stuff is what models should do. The data pattern analysis is what those models should do. But when it comes to, you know, I need to do a transaction or I need to do something deterministic, and guess what, I have a legacy IT app or a third-party tool to do it, and I know it works, go use that. So orchestration is for models, but it's also, and that's what we call workflows, the ability to help models and external tools collaborate. We see a ton of value there, and, again, as Jeff said, it's about using the right tool for the job, which means using language models for language and complex data understanding and using the rest for the deterministic stuff. So that's where we are today.
For sure. For sure. So as always, folks, we're a little bit past the top of the hour here. So for anybody who wasn't with us at the start of the show, I'm Lorenzo Winfrey. I'm your host of Under the Hood with AWS Compute. And today we're talking about running performant and cost-effective Gen AI applications with AWS Graviton and Arcee. I'm joined by my colleague, Jeff Underhill, and Julien Simon, who is the Chief Evangelist of Arcee. And we're just, as you can tell, folks, vibing, and we've got a good little demo coming up here in just one sec. So I definitely want to make sure we get to that. But right before we do that, Jeff, everything we talked about really makes it about finding the right compute option for that particular kind of use case and workload. Jeff, do we have any other customer use cases about customers who've adopted Graviton for their AI and ML? And can you shed any light on those before we jump into the demo that Julien's brought with us to demonstrate how to use llama.cpp to quantize and run a model on Graviton instances?
Yeah, so I'm excited to see Julien's demo, so I'm going to be pretty succinct here. But some that spring to mind, and these are all publicly out there, we've got blogs on these so people can go read more about them. One is a customer called Sprinklr. They have been using Graviton for some time now, and they switched a bunch of their AI workloads to Graviton3. And what I like about this is there are a couple of things. First, they were able to see a 30% reduction in their latency. They saw 25% to 30% cost savings. And then they circled back a little later after they'd adopted Graviton and looked at the downstream impact to their actual business, where they recently talked about resolving customer queries about 50% faster. And they've been able to increase the productivity of their agents, not the agents Julien was talking about. These are people, agents, people. Real people. We still need them. Yeah, exactly. So they increased their productivity by up to 40%. So that's one example, and you can find a blog. Another partner, ThirdAI, they've been running neural net training on CPU. So training is typically very much thought of as something you need accelerators for. They're doing this with CPUs, and they've got some of their own technology there, but they've seen 30 to 40% performance improvements on Graviton over comparable instances and a price-performance improvement of nearly 50%. Again, there's a blog on that. Let me see, Databricks. So another partner who has been supporting their Spark analytics platform on Graviton for a while announced their ML runtime is available on Graviton earlier this year. And there's a blog on this, but it talks about two specific areas, one being AutoML, where they're doing experimentation with hyperparameters. And on Graviton3, they were able to run 63% more trials in a period of time. And what that means is you can explore the space more and ultimately end up with better combinations, and then ultimately improve the accuracy and precision of the models. And then MLlib, they've got libraries like XGBoost, Spark MLlib, et cetera. And there the engineering team saw 30% to 50% speed-ups. And then last but by no means least, the Graviton chips are based on the Arm instruction set, and Arm themselves have done some experimentation where they saw 3x better performance for prompt processing and token generation using LLaMA models. And then I mentioned this earlier, they also experimented with a 70 billion parameter model on Graviton4 and demonstrated that you could run a model of that size with human-readable performance. What that means is, for anything interacting with someone like you or I, we can only consume information at a certain rate. The general understanding is about 10 tokens a second. Correct. Yes. We can't really read faster than that, so generating faster than that is kind of the bar. So anyway. Just for the record, Jeff, I did reproduce all those numbers you were quoting. Yeah. We confirmed that. And yeah, we are running our 70 billion at about 10 to 12 tokens per second. So yeah, those numbers are real. Just so, you know, sometimes people are skeptical about numbers, and they worry it could be just marketing fluff. I appreciate you. You can be honest there. No, no, no. These are real numbers. Yeah. And maybe it's a shameless plug, but Jeff and myself and the team, we are working on a blog post. So we'll have more demos and more numbers to show you in the next few weeks. But there's a lot coming.
Nice, folks. I think the key takeaway from that, of course, is you don't have to take our word for it. Check out those blogs, run your own tests. Like I said, there's just a ton of info. I've dropped a couple of links in the chat. Check it out, read it for yourself, and go make it happen. And now, I think, to the portion that everybody's here for. I'll hand it over to Julien. Let's check out that demo.
Okay. So now you can see my screen. So I'm going to do this. That's fancy camera work, all right, because otherwise you're gonna be looking at me staring at the ceiling, and that's gonna look a little stupid. So what we're gonna look at here is one of our models running on a very cost-effective instance. Okay. So for context, this is our organization page on the Hugging Face hub. So just go to huggingface.co/arcee and you can see a few things here, our models, our research papers, et cetera, et cetera. So that's a good entry point. And one of the models we've built this year is called SuperNova-Lite. And SuperNova-Lite is a Llama 3.1 model. So we started from Llama 3.1 Instruct, the Meta model. And as I mentioned in the intro, we train the model further using our stack. And again, on the model page, you can read a little bit about that. The model was actually distilled from Llama 3.1 405B, and then merged, et cetera. So we gave Llama the Arcee treatment, so to speak. And when the model was published, it was the best 8 billion parameter model available, outperforming even the Llamas from Meta, and it's still very high on the leaderboard. So it's a really good llama. Okay, so that's the model, 8 billion, right? And of course, the original model is a 16-bit model, as you can see here, BF16. So it is a reasonably large model. It's probably around 17 gigs or something, so not a tiny model, and a very capable one. So it's a good way to get started with CPU inference generally. We find the sweet spot is, I would say, 7 billion all the way maybe to 14 billion parameters. Makes a lot of sense, and yes, you can run bigger models, but I think the sweet spot for us is really, let's say, 7 to 14. That's where we see the really nice cost-performance. So what do we do with this? So first of all, okay, let me switch to a terminal here. So first thing, and this is for the record, this is an R8g.4xlarge instance, which is, let me look at my notes, 94 cents an hour on-demand, which is already much cheaper than the cheapest GPU instance you can get on AWS, which I believe is G5.xlarge at one dollar something. So the baseline cost is already good. So we're starting from there. You could go a little more cost-effective, actually. You could run a C8g.4xlarge, which is 63 cents an hour. But I've decided to use R8g because I find I'm getting a little more memory bandwidth on this, and it's noticeable in performance. But C8g is a very good pick as well. So we have the model here, okay, SuperNova-Lite, and I just downloaded it from the Hugging Face Hub. And we're not going to run the 16-bit model, okay? We could. It does run, and we would see some okay performance, right? But we're gonna first apply a process called quantization. I think one of you mentioned it earlier, so maybe we should explain what quantization is. So I guess you could do a PhD thesis on this, but keeping it simple, it's a process where we rescale model parameters to a smaller range. Okay, so here we're starting from 16-bit values. Okay, that's the model we see here. So we have 8 billion 16-bit parameters, and we're going to rescale each one of those to a smaller range. And here I'm going to go low, I'm going to go to 4 bits. Okay, so imagine the dynamic range of 16-bit values. We have to rescale that to the dynamic range of 4 bits, which is obviously much smaller. Why do we want to do that?
Well, you guessed it, because there's an immediate 4x reduction in model size going from 16 bits to 4, and we can leverage CPU instructions that are optimized for, you know, 8-bit and even 4-bit processing, right? Inference with those models is still mostly, you know, general matrix multiplication; as fancy as we want to make them, at the end of the day, we still need to multiply matrices and add them up together, and the Graviton chips, as Jeff mentioned, coming from the Arm instruction set and also from the Graviton additions, have dedicated instructions that make that process very, very fast. So we shrink the model, which obviously brings some performance because we just have less stuff to move from RAM to the CPU, but also we can leverage dedicated instructions. So how do you do that? Well, there's an amazing project called llama.cpp, right, and llama.cpp is a high-performance C++ toolset for model inference, and it also includes a quantization tool. So if it's the first time you hear about llama.cpp, please go and check it out. It's very simple to use, and the performance is great. So let's cut to the chase and just run quantization, and this is not going to take too long. So let's go into the llama.cpp folder. I cloned the repo and I compiled the tools, so I'm just running llama-quantize on my 16-bit model and quantizing to 4 bits, and you can see this has a funky name, Q4_0_4_8, which is a quantization type optimized for Graviton4, leveraging those fancy instructions that Jeff mentioned. Okay, so let's just run this. And Julien, while you're doing that, I just want to point out, so what this is doing is taking those 16-bit numbers and essentially converting them to 4-bit numbers, so you've got a smaller model that's more efficient to run, but this is a one-time operation too, right? You do this only once, right? And we can see it. For each one of the layers, it says, you know, F16 converting to Q4, etc., etc., and it's going to take a few seconds and save the file, and you do this once, right. And you can do 4 bits, and you can do 6 bits, and you can do 8 bits, I mean, you can experiment with different quantization schemes, but this one here is specifically designed for Graviton4, right? And if you want to see all of them, just run the tool without any arguments, and it'll give you all the available options, right? So you have vanilla 4-bit and then you have the optimized versions. There's one for Graviton4, and this one here is for Graviton3 actually, if you want to run on Graviton3, right? And you could do 3 bits and 4 bits and even 2 bits. I mean, you can go crazy with this. Okay, so once we've done that, and you saw how it took, you know, 30 seconds or something, now we have this. Let me show you the size because it is smaller. Okay, we have the quantized model. Okay. And so now this thing is only 4.6 gigs or something because we only have four bits per parameter. So it's just a smaller file, and it will load faster and process faster. Okay. So now that we have the quantized artifact, we can just go and run inference with it. And this time we use llama-cli, right, passing the quantized model, asking it to generate 256 new tokens and starting with a prompt like "what is deep learning", which isn't a great prompt, but it will serve our purpose. Okay, so there we go. I couldn't even explain it, it's already generating. Okay, so the model was loaded, right. Okay, we need to do this again because it's so fast. I'm gonna try more tokens. Let's do 1024.
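For readers following along at home, here is a sketch of the steps Julien just walked through, assuming the model was downloaded to ./supernova-lite as in the earlier snippet. File names and thread counts are assumptions, script and binary names follow the llama.cpp repository at around the time of recording, and Q4_0_4_8 is the Graviton-optimized type shown in the demo (newer llama.cpp builds may handle this differently).

```bash
# Build llama.cpp, convert the Hugging Face checkpoint to a 16-bit GGUF file,
# then quantize it to 4 bits with the Arm/Graviton-optimized Q4_0_4_8 type.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

pip install -r requirements.txt
python convert_hf_to_gguf.py ../supernova-lite --outtype f16 --outfile supernova-f16.gguf
./build/bin/llama-quantize supernova-f16.gguf supernova-q4.gguf Q4_0_4_8

# Running llama-quantize with no arguments lists every available quantization type.
./build/bin/llama-quantize
```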
Okay, so llama-cli will load the quantized model and start generating, and, you know, off we go. Okay, so Jeff, you made a very interesting point saying, you know, we can't really read faster than 10 tokens or 10 words per second. This is way, way faster. Right. I love it. It's even quoting Yann LeCun and Yoshua Bengio. That's perfect. Okay. And it keeps going, and you know, those things are chatty, right? Because I said 1024. And here we see some very interesting numbers. First, we see how fast the model is evaluating the prompt. Okay, so here I have a super short prompt. I only have five tokens in my question. So the speed is irrelevant, but imagine passing, imagine you're translating a 4K or 8K token document. Processing speed for the input tokens is very important. And with larger workloads, we can very easily get 200 to 300 tokens per second on this instance, which means even if you're passing, you know, let's say 4K tokens, which is several pages of text, it's only going to take a few seconds to analyze those tokens and start generating, right? So the time to first token is going to be quite fast. And then we see how fast we generate. And you can see, you know, well, it will change from one run to the next, but generally, we're able to generate at 39 or let's say 40 tokens per second. We could enable something called flash attention, which might help us get over the 40 limit. Let's see. Flash attention is an optimized way to run inference, and it is available in llama.cpp. Let's see if we can make 41 or 42, which is generally what I get. Let's see, and you can see how fast this is. I mean, this is way, way faster than anyone can read. Right? Let's see. It's repeating itself a little bit. I asked for too many tokens. Oh, 38. Okay, so now it's going to behave, but generally, you can easily run at 40 tokens per second. So if you want to chat, you can use the conversation mode. So let's try the conversation mode. It's nicer. Let's try this. What's the weather like in Seattle? So that's an impossible question to answer. That's cruel. I've lived here for six years and I've learned not to trust the weather forecast here. Be nicer. Why not? Oh, yes. Yes. Okay. So, just to show you, okay, that's just llama-cli. Again, that's just the llama-cli conversation mode. And again, llama.cpp is an amazing project, but it's just to show you how fast generation is. All of us are used to using other chatbots, other systems. How is this different, right? I mean, the speed is more than adequate, right? The quality of the answers is very good. So we'll double-click on that in a minute, but you can see this is not a toy. It's not a toy for geeks. It's a solid solution where we start from a very good open-source model, in this case from Arcee. We start from a very good model, we quantize it, and we can run it on a super cost-effective instance. And now, what I love to do, you know, I do a ton of benchmarks, and you can argue about benchmarks forever. My metric, the way I look at it, and I'm not saying it's the ultimate way, the way I look at it is, so I'm getting, let's say, 40 tokens per second. Where was that number? It's buried here. Okay. So I'm getting, let's say, 40 tokens per second. And that's just one inference request. Okay. If we started batching things, we'd get more. And actually, there is a good way to benchmark this as well. It's called llama-bench. Maybe I can run it while I'm explaining.
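The generation commands from this part of the demo look roughly like the sketch below. The model path follows the quantization sketch above, the thread count is an assumption, and the flag spellings match llama.cpp builds from around the time of recording.

```bash
# Single-shot generation: load the 4-bit model and generate up to 1024 tokens.
./build/bin/llama-cli -m supernova-q4.gguf -t $(nproc) -n 1024 -p "What is deep learning?"

# Same prompt with flash attention enabled, then the interactive conversation mode.
./build/bin/llama-cli -m supernova-q4.gguf -t $(nproc) -n 1024 -fa -p "What is deep learning?"
./build/bin/llama-cli -m supernova-q4.gguf -t $(nproc) -fa -cnv
```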
So here we're gonna run larger prompt sizes, larger context sizes, and we're going to see more throughput. So you're starting from this good model, you're quantizing it, you're getting this really great throughput, and you're getting this for again 60 or 70 cents an hour. So if you look at the speed you're getting, you know, cost-performance, we mentioned it several times already, cost-performance is what you need to look at. Okay, so cost-performance is how fast can I generate for this dollar amount. So I would encourage you to do that math. Take the cost per, let's say the cost per hour, or let's take the tokens per second. That's an easier way to look at it. Let's take the tokens per second that you get, divide it by the instance cost per hour, and you get a decent metric, right? And you'll be shocked. Trust me, I've run those numbers, you'll be shocked. Maybe we'll put some of them in the blog post. But when you start looking at it this way, you know, I'm generating at, let's say maybe 100 tokens per second, and I'm spending, you know, a dollar an hour or less. You know, how does that compare to your, you know, fancy GPU instance that costs you up to $100 an hour, even if it's faster, you know, I can guarantee it's not 100 times faster. So cost-performance is everything. People are obsessed about throughput, and they forget to look at cost, right? But you need to look at both sides of the equation. You know, what performance am I getting, and what's the spend? And if you do the math, you will see Graviton instances sitting at the top of that performance leaderboard. And if you need more, my friends, you just scale out. Okay. I'm taking you back to that microservices discussion. If you need more throughput for your AI app, then you can just add, you know, add a load balancer and have 64 Graviton instances serving in parallel. You know, that stuff is stateless. You don't need to be on a single box. You know, why are we forgetting all those good design practices? So people want to scale up all the time, and I think it's a huge mistake. I think the best way is to scale out on the smallest possible compute unit, right? Just like we were scaling on, you know, T3 micro or T2 micro back in the day or tiny instances. This is the way to do it, and that's definitely how we're doing it, and we're pretty happy with it.
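And here is the benchmarking step plus the tokens-per-dollar math Julien describes, as a sketch. The prompt and generation sizes are examples, and the throughput and price are simply the figures quoted earlier in the demo, not independent measurements.

```bash
# Benchmark prompt processing and generation at a larger context size.
./build/bin/llama-bench -m supernova-q4.gguf -p 4096 -n 256 -t $(nproc)

# Cost-performance: tokens generated per dollar, using the figures from the demo.
TOKENS_PER_SEC=40        # generation speed observed above
PRICE_PER_HOUR=0.94      # r8g.4xlarge on-demand price mentioned earlier (USD)
echo "$TOKENS_PER_SEC * 3600 / $PRICE_PER_HOUR" | bc
# => ~153000 tokens per dollar; compare this number across instance types
```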
For sure. Thank you so much for that demo and explanation, Julien, showing how easy it actually is to run a language model on Graviton. Chat, as you can see, not a heavy lift at all. I always say that the majority of the work is in the planning and making sure that you have the proper workload fit. But as demonstrated here, when it comes down to that actual execution after you've done that planning, it's often very simple and straightforward. And so as I'm looking at the clock here, it looks like we're coming down to the end of the show. And with re:Invent around the corner, I would like to remind all of you, if you haven't already, if you're going, awesome. If you can't make it in person, a reminder that you can definitely register virtually to attend as well. So I will say to you, Julien, are you going to re:Invent?
So I am not. You're not? Okay. No, I mean, I have flown so much in the last few months. And it's not my dog, in case you're wondering. It's fine. Sorry, I don't have one. No, it's okay. I have a cat, so they're more quiet. But no, so I've flown enough in the last few weeks. But it's a good point. I mean, my colleagues will be at re:Invent. We actually have a booth. So look for the Arcee booth in the Expo Hall, and, you know, please go say hi and learn more about Arcee. And by then, you know, I think the new version of our platform will be out. And I'm sure they'll have an amazing agent and workflow demo for you there. So yeah, go say hi. Maybe they'll have stickers. I don't know. You always got to look for that. I'll be back at re:Invent another time, but I've done enough re:Invents for now, and being home for a little bit is going to be nice. I'm going to watch the keynote from the comfort of my chair, and I'm super excited about all the cool stuff that's going to be announced.
Nice. So you heard that, chat. If you are in person at re:Invent, make sure you stop by the Arcee booth and check out all the cool stuff they're doing. How about you, Jeff? Are you going?
I will definitely be there. Yeah, I've got a couple of chalk talks talking about Graviton adoption, key learnings from developers that have adopted Graviton at scale. And then I have a breakout session on optimizing AI ML workloads with Graviton. So yeah, come check us out.
Awesome. Again, chat, if you're there, make sure you find Jeff, tell him what's up, tell him you saw him on Under the Hood with AWS Compute and you learned a lot of cool stuff there. All right. So as we come down to the end of the show, I again would like to thank my guests, Julien and Jeff, for joining me today. If you like what you heard, make sure you hit that follow button so you get notified every time we go live. We are here every Tuesday at 2 PM Pacific, 5 PM Eastern. And as always, don't forget to join us next time on Under the Hood with AWS Compute, the show that takes you behind the scenes to understand how AWS services and solutions help drive great outcomes everywhere from media and entertainment to technology. I'm your host, Lorenzo Winfrey. See you next time, everybody. And remember, go be great. Take care. Cheers. Thank you. Bye.
Tags
AWS Graviton, AI ML Workloads, Cost-Effective Computing, Small Language Models, Graviton Optimization
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.