#cloudfest #ai #ampere #responsibleai #sustainableai #cloudinfrastructure #cloudcomputing #csp
As AI technology encroaches on all traditional technology industries, IT professionals and enterprise managers are searching for more responsible and efficient ways to meet the growing compute demands behind AI's compelling benefits while still making business sense. The concern that the capital and operating costs of AI infrastructure are manifestly unreasonable is growing. Creative solutions to these problems are emerging through the efforts of an entire community of companies that are not on the front page of the tech press today. In this CloudFest 2025 keynote, Sean Varley, Chief Evangelist & VP, Business Development at Ampere Computing, explores the hinge factors and technology innovations that lead to more efficient, affordable, sustainable, and open solutions to the AI infrastructure problem, from edge to core data centers, made possible through the cooperation of a growing community of alternate voices in the industry.
Find out more about the world’s top Cloud and internet infrastructure festival:
cloudfest.com
Transcript
Why? Because GPUs are not very virtualizable. You've got a sledgehammer, and that's all you've got. If you need something that's a lot less than a sledgehammer, you don't have it. You only get that if you break down these models, put them in virtual machines and containers, and run them on power-efficient CPUs.
Good morning. Good morning. Thank you, Jeff, and thank you to CloudFest for having me. So I'm going to riff off what Johan was talking about a little earlier, and I'll have to remember to give him a couple of bucks for being a straight man for me, because I don't really want to talk about training. I want to talk to you about inference, and he has primed you all for this.
Before I do that, let me introduce Ampere. Ampere is a sustainable computing company. We are innovators in what we call AI computing, and we introduced cloud-native processors, which have now been imitated by all of our major competitors, so it's been a wildly successful architecture. I'm going to show you why it is the most responsible type of component for you to use for your inference infrastructure.
Before I do that, I want to talk a little bit about legacy computing. Most data centers were built on CPUs, and now GPUs, that were designed when power didn't matter at all. Code and applications were monolithic, and they fit inside that one machine or that one virtual machine. The architecture has also evolved from client-server to what we now call cloud-native. I'll talk a little more about what that really means.
Now, how does AI meld into this picture? We have some top trends happening today that you've already heard about on this stage. Agentic AI, as Gartner has published, is the number one trend for 2025, but energy-efficient computing is also up there. Conveniently, on that particular diagram, they fit into my box. Now, we are already talking about sovereignty and security for data, which is a primary concern when you get to inference. Why? Because all of the important things to infer are in the data that enterprises own, right? So naturally, they want to keep that data secure, but in order to actually take the next step down the inference journey, they're going to need to make this whole paradigm, the use cases that Johan was talking about, much more economical. Those use cases also need to be right-sized for the task, and I'm going to talk a little more about what that means. They're also going to be distributed. Why? Because the AI use cases that Johan was talking about, and that most people will talk to you about, are naturally going to be distributed to where people and events are. And so that's going to be a very important aspect of the use cases that develop in the future.
So I have a bit of a paradigm here for you all. This is AI from edge to cloud, and it's really built around a power paradigm. If you look all the way from embedded devices up into the core data center, what you'll find is that to do inference, to really do this stuff well and economically, right-sized for the task, you only need about 300 watts for an edge device, maybe less. And when you finally get all the way up into the core data center, you probably need less than about 1,800 watts, budgeted between the general compute and the AI acceleration. So that is part of what this whole paradigm is going to teach you: you don't need GPUs to do all of this inference. I'm going to show you some very specific reasons why that's true.
First, let's talk about models. Models come in four classes, class 0 through class 3. If you separate those classes by parameter count, an interesting pattern emerges in the graph above this chart: the computational and memory requirements for models grow exponentially as you move to the right. So a 70-billion-parameter model requires a whole lot more resources than a 5-billion- or 1-billion-parameter model. What you can do with the class 0 and class 1 models is run them on very efficient hardware, in fact on AI-optimized CPUs. But when you get up into the class 2 and class 3 models, you're going to need accelerators, you're going to need GPUs, because those models require a lot of memory and a lot of computational resources. So I'm going to postulate to you that the volume of inference is going to be at the lower end of this range.
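To make the idea of class-based placement concrete, here is a minimal sketch. The parameter thresholds and function names below are hypothetical illustrations, not Ampere's published class definitions.

```python
# Hypothetical sketch: routing a model to hardware by parameter count.
# The class boundaries below are illustrative assumptions, not official definitions.

def model_class(params_billions: float) -> int:
    """Bucket a model into class 0-3 by parameter count (assumed thresholds)."""
    if params_billions <= 3:
        return 0          # small task models
    elif params_billions <= 13:
        return 1          # mid-size SLMs
    elif params_billions <= 70:
        return 2
    else:
        return 3          # frontier-scale models

def target_hardware(params_billions: float) -> str:
    """Class 0/1 fits on AI-optimized CPUs; class 2/3 needs accelerators/GPUs."""
    return "cpu" if model_class(params_billions) <= 1 else "gpu"

for p in (1, 5, 24, 70, 405):
    print(f"{p}B params -> class {model_class(p)}, run on {target_hardware(p)}")
```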
And you can say, why, why, why? I can give you a million technological reasons why that's the wrong assumption, but I'm going to say to you: that doesn't matter, because none of this is going to be usable until it's affordable. The one common denominator that always drives markets is economics. And so naturally, economics are going to push the models down here, and I'm going to show you why.
So, we talked about agentic AI. We've also heard other buzzwords like mixture of experts, MoE, and multimodal AI. What are these things? They're just hierarchies. Think of them as a natural tree. What I'm showing here is a tree of agents: a domain agent and some task agents. Those are optimized models. Task agents get called by domain agents, and domain agents are the ones that take a query and break it down into little tasks, right? This is the optimization that's going to get the cost out of inference. So AI computing is the combination of these optimized trees of models, with task agents running on general-purpose processors and domain agents running on accelerators. The domain agents are probably going to be bigger models. But you're also going to need a lot of other applications to build out a complete AI compute infrastructure. So the combination of model proliferation and optimization plus AI computing is what gets us to more sustainable, responsible AI.
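As a rough illustration of that hierarchy, here is a minimal sketch of a domain agent decomposing a query and dispatching to task agents. The agent names, the fixed task plan, and the toy responses are all hypothetical simplifications, not any vendor's implementation.

```python
# Minimal sketch of a domain agent dispatching to task agents.
# The agents, routing rules, and responses here are hypothetical placeholders.

from typing import Callable, Dict, List

# Task agents: small, specialized models (the ones that run on CPUs in the talk's paradigm).
def summarize_agent(text: str) -> str:
    return f"[summary of: {text[:40]}...]"

def sentiment_agent(text: str) -> str:
    return "positive" if "good" in text.lower() else "neutral"

TASK_AGENTS: Dict[str, Callable[[str], str]] = {
    "summarize": summarize_agent,
    "sentiment": sentiment_agent,
}

def domain_agent(query: str) -> List[str]:
    """Break a query into sub-tasks and call the matching task agents."""
    subtasks = ["summarize", "sentiment"]  # a real domain agent would plan these dynamically
    return [TASK_AGENTS[name](query) for name in subtasks]

print(domain_agent("The new release looks good and ships next week."))
```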
So let's take a look at the actual build-out of data centers today. What you have today is a situation where a lot of GPUs have been installed to do training. If you look at the racks built for those, they're very high-power-budget racks, but with very low utilization in terms of space, right? Because you can't even fit many of these GPUs into one rack. Now, if you follow the paradigm I've just been describing, where you move to these optimized models, you can put a lot more of those models onto compute platforms that require a lot less power. Coming back to my earlier slide, you can put a lot more of those into a rack, and so you'll have domain agents running on accelerators and also task agents, potentially thousands, maybe even tens of thousands of task agents eventually, running on CPUs. The ratio of task agents to domain agents is going to increase: maybe it starts out at 5, 10, or 15 to 1, but it probably grows to 50 or 100 to 1 when these model paradigms really start to build out.
All right, so I am going to invite up Julien, Julien Simon, another chief evangelist, by the way, so let's see how that goes, because you never know what's going to happen with two chief evangelists on the stage. For some reason your logo didn't come up, but Julien is from Arcee, and we've got the important part here, so thank you for joining me, Julien. Thank you. Tell us about who you are and what Arcee is. So Arcee is a US startup, and we're the champions of small language models. What I mean by that is we only work on small language models. We start from existing open-source architectures, and we make them better with our training stack and secret sauce. And we're also the champions because it's fair to say we're building pretty good models that tend to rank very highly on various leaderboards. We also build platforms, not just models. You mentioned agentic, et cetera. We have an agentic platform called Arcee Orchestra, where you can drag and drop and build business workflows in a very simple way. And just yesterday we launched another inference platform, and inference is the topic today, called Arcee Conductor, which automatically sends your prompts to the best model in terms of domain knowledge and performance. So you literally get the best model for the job at the prompt level. So that's who we are.
Nice, thank you. And that Conductor you were just talking about is really an example of one of these domain agents, right? Because it's doing some of that query breakdown we talked about earlier, handing work off to task agents. Exactly. You know, I meet with a lot of customers, and there's this dilemma: AI can get really expensive, especially if you use the large language models. So should you optimize for cost and work with a small language model that will do a very good job most of the time, but sometimes deliver lower-quality answers to complex problems? Or should you optimize for quality and use one of those larger models all the time, which are likely to do a good job on most problems but will be very expensive? We were chatting earlier, and the cost difference is insane. The text-generation cost difference between our most cost-effective model, which is called Blitz, a 24-billion-parameter model, and a large model like Anthropic's Claude Sonnet 3.6 is 300x, not 300%, 300x. Most of the prompts you send to models every day in most organizations are simpler prompts, so you are paying 300 times more than you need to when you use the large language models for everything. And that's what Conductor is all about: analyze the prompt in an instant and decide where to send it to give you the most cost-efficient and highest-quality answer. And this is really important. There's no Swiss Army knife model; I've been saying it for years. When I was at Hugging Face, I used to say it all the time. We're not in Switzerland, so I can say it; apologies to the Swiss fellows here. But there is no Swiss Army knife model. You need to find the right model for the job. And it's not just the right model for the task, it's the right model for the prompt now. That's where we are if we want to optimize and be power-efficient and cost-efficient.
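To illustrate the routing idea being described, here is a minimal sketch of a cost-aware prompt router. This is not Arcee Conductor's implementation: the model names, the complexity heuristic, and the absolute per-token prices are hypothetical; only the roughly 300x ratio between the small and large model comes from the talk.

```python
# Illustrative prompt router in the spirit of what was described above.
# Model names, the complexity heuristic, and the prices are hypothetical;
# the ~300x small-vs-large cost ratio reflects the figure quoted in the talk.

SMALL_MODEL = {"name": "small-24b", "cost_per_1k_tokens": 0.0001}     # assumed price
LARGE_MODEL = {"name": "large-frontier", "cost_per_1k_tokens": 0.03}  # ~300x higher

def looks_complex(prompt: str) -> bool:
    """Toy complexity check; a real router would score domain knowledge and difficulty."""
    return len(prompt.split()) > 200 or "prove" in prompt.lower()

def route(prompt: str) -> dict:
    """Send simple prompts to the small model, hard ones to the large model."""
    return LARGE_MODEL if looks_complex(prompt) else SMALL_MODEL

prompt = "Summarize yesterday's support tickets in three bullet points."
model = route(prompt)
print(f"Routing to {model['name']} at ${model['cost_per_1k_tokens']}/1k tokens")
```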
Right. Well, Julien, I appreciate you coming up and joining me, and helping me illustrate these concepts. Thank you very much. My pleasure. All right, thank you very much, Julien. Have a good day.
We're going to talk a little more now about how to actually put in place an inference infrastructure that is going to be sustainable and responsible. Our company, Ampere, has been building products for over seven years, and we now have three product lines: the Altra line, which was the very beginning, followed by AmpereOne, built on our own architecture and our own cores, and now, today, I'm going to talk to you about AmpereOne M. These products range from 32 to 192 cores, and they sit in the lanes I'm showing you here. On the left-hand side are the 32-to-128-core, DDR4 parts; these are good for edge and telco kinds of use cases, they're very power-efficient, and they're going to be good for all sorts of domain and task agents at the edge. AmpereOne is the volume-compute, 192-core beast that's going to be very good for SLMs and traditional AI. And now we're introducing the AmpereOne M product for the high end.
One of the things I also want to point out is that we have a software layer that does a lot of model optimization and instruction mapping, and that gives us a 2 to 5x speed-up for all of this, so it's a very important piece. Now, Ampere introduced the concept of performance per rack, and I'm going to show you some benchmarks and metrics around that in the next couple of slides. Ampere Altra started it all. This is the platform that introduced the idea that you could have three times the performance per rack versus legacy x86 processors. It also takes a lot less space, in fact one-third of the space, to do any particular amount of work, and it consumes about one-third the power of the legacy x86 processor. All of this works best at high utilization. Because our processors are so efficient and scale linearly, when you run them at high utilization you start to see the real magic behind this architecture: you get much better performance per watt and really hone in on that efficiency aspect. So high utilization goes hand in hand with this whole thing: 3x the performance per rack in about a third of the space and power.
AmpereOne comes along, and now you've got this 192-core beast for all of the number crunching that goes into small language models, as well as all of the typical applications that surround the inference task itself: vector databases, regular databases, all the web infrastructure required to carry queries, all of that. AmpereOne delivers up to 2x better performance per rack than the closest x86 competitor in this industry. You can see some of the benchmarks here, from web hosting with Nginx, to caches like Redis and Memcached, to the MySQL relational database, and finally two typical AI benchmarks: DLRM, a recommender engine, and of course Llama 3 8B. So this is something very powerful; it gives you a lot of performance per rack and highlights that efficiency aspect.
Now I want to spend some time on this slide, because the AmpereOne M I'm introducing now is the density leader for SLMs, small language models, and agents. Coming back to my paradigm from earlier, where I was showing you a full rack of these high-efficiency processors, 17 servers in a 12.5-kilowatt rack, that now gets compared, I've shifted the competitive angle, to NVIDIA. Why? Because NVIDIA is sort of the standard today for a lot of AI tasks. But that's not the whole story, because as Julien said earlier, the reason all these really large models cost so much money is that they're being run on GPUs. I'm going to draw your attention to the red writing on this slide. The red writing is cost to operate per year. If you start on the far right of this chart, you'll see that it costs about the same on an NVIDIA GPU to run a 70-billion-parameter model as it does a 1-billion-parameter model. Why? Because GPUs are not very virtualizable. You've got a sledgehammer, and that's all you've got. If you need something that's a lot less than a sledgehammer, you don't have it. You only get that if you break down these models, put them in virtual machines and containers, and run them on power-efficient CPUs. So you go from eight H100s in a rack to 17 AmpereOne servers, and in those 17 AmpereOne servers I can deliver over 1,600 small language models, over 1,600, at an annual cost to operate of less than $15 per model. This is how economics drives inference. This is why models will get smaller, and why you will need many more of them to do any sort of inference task. Small language models cost a lot less to execute on CPUs, and all of that wraps up into a value proposition that is up to 200x more cost-effective than GPUs. This is major.
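For context, here is the back-of-the-envelope arithmetic implied by those figures. The 1,600 models per rack, the sub-$15 per model per year, and the "up to 200x" claim come from the talk; the derived totals below are rough implications of those numbers, not measured data.

```python
# Back-of-the-envelope economics implied by the figures quoted in the talk.
# Inputs are the talk's claims; everything derived is a rough implication, not measurement.

models_per_rack = 1600          # SLMs across 17 AmpereOne servers (quoted in the talk)
cost_per_model_per_year = 15.0  # USD, upper bound quoted in the talk

rack_cost_per_year = models_per_rack * cost_per_model_per_year
gpu_cost_per_model = cost_per_model_per_year * 200  # implied by the "up to 200x" claim

print(f"Implied rack operating cost: ~${rack_cost_per_year:,.0f}/year")
print(f"Implied per-model cost on GPUs: ~${gpu_cost_per_model:,.0f}/year")
```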
Okay, you can see AmpereOne M at our booth today, E01. It's in an OCP host processor module, which is also unique in the industry today, because it's the very beginning of a modularity paradigm that is moving through the OEM market, the suppliers to data centers and on-prem environments. OCP HPM modules let them get there much more cost-effectively than before. So come see that at our booth. We also have an AI Platform Alliance, Arcee is a member, and we are all about establishing open, efficient, and sustainable inference solutions. We have a solutions marketplace and a very robust ecosystem building up around cost-effective, very power-efficient AI inference. You can see all of this at platformalliance.ai. Here is a smattering of the companies in the alliance, but it's a growing community, and we are very blessed to have many of those people with us here at the show today. If you come by our booth, here are a number of the people you'll be able to hear from who are also in this ecosystem, in this alliance, looking to build much more cost-effective, responsible AI and compute.
Okay, so thank you very much for your attention today. Come join us. If you're interested in joining us, come to the platformalliance.ai and join up. We would love to have you. Thank you so much.
Tags
AI Inference, Power Efficiency, Cloud-Native Processors