Hi everybody, this is Julien from Arcee. In just a few months, text-to-image models have become extremely popular, and you can use them for use cases like content creation, synthetic data generation, and all kinds of different things. These models, particularly Stable Diffusion models, are really large. So far, to get good performance, the only option was to use GPUs for inference. But thanks to work we've been doing with Intel recently, in this video I'm going to show you how you can use Intel CPUs to generate images with Stable Diffusion models in under five seconds. Don't believe me? Let's get started.
Before we start accelerating, we need to know what the baseline is. So let's start by running a vanilla Stable Diffusion pipeline on an AWS instance powered by a new-generation Xeon CPU with the Sapphire Rapids architecture. I've mentioned these CPUs already 10 times, so go and watch the previous videos if you want to know why they are really cool. Here, we're creating a simple Stable Diffusion pipeline with the diffusers library and a model from the hub. We generate images a few times, average the latency, and print it out. So nothing complicated. Let's just run this. I'm using a vanilla PyTorch environment here, so I'll just go and run this and we'll see what the baseline is.
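As a rough sketch, the baseline script looks something like this. The checkpoint name, prompt, and iteration counts are assumptions on my part, not the exact values from the video:

```python
import time
from diffusers import StableDiffusionPipeline

# Assumed checkpoint; the video just loads "a model from the hub"
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id)

prompt = "sailing ship in a storm, oil painting"  # hypothetical prompt

# Warm-up iterations so the measured runs are meaningful
for _ in range(2):
    pipe(prompt)

# Time five runs and average the latency
latencies = []
for _ in range(5):
    start = time.time()
    pipe(prompt)
    latencies.append(time.time() - start)

print(f"Average latency: {sum(latencies) / len(latencies):.2f} seconds")
```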
Okay, all right, let's give it a second. I'm running a couple of warm-up iterations just to have meaningful numbers when running the actual predictions. Okay, let's see what the number is. It should be around 30 seconds. Okay, so that's the warm-up iteration, and here's the first real iteration. Yes, it's going to be around 35 seconds. We've run our five iterations, averaged out the values, and we can see the average latency here is 36 seconds. That's quite a bit faster than on previous-generation Xeons, but it's probably still too slow for production. So now we can start accelerating.
There are different ways to do this, but in this video, I'm going to show you first how to accelerate inference using OpenVINO, which is an Intel tool that optimizes models and which we've integrated into our Optimum Intel library. Starting from the code above, with the Stable Diffusion pipeline, we're just going to install Optimum Intel and OpenVINO. I'll put all the setup steps in the video description; it's really just a pip install command. Then we switch to the new code. What we're doing here is replacing our StableDiffusionPipeline with an OVStableDiffusionPipeline, the OpenVINO pipeline that comes from the Optimum Intel library. So that's about it.
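Here's a minimal sketch of that swap. The checkpoint name and prompt are still my assumptions; the export=True flag tells Optimum Intel to convert the PyTorch checkpoint to the OpenVINO format when loading:

```python
from optimum.intel import OVStableDiffusionPipeline

# Assumed checkpoint, same as in the baseline sketch
model_id = "runwayml/stable-diffusion-v1-5"

# export=True converts the PyTorch checkpoint to the OpenVINO format on the fly
ov_pipe = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)

# Generate an image exactly as before
image = ov_pipe("sailing ship in a storm, oil painting").images[0]
image.save("ship.png")
```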
We're going to run a few warm-up iterations and see what happens. So let's just run the OpenVINO version. Remember, we started from 36 seconds. We're going to do the same: load the model with the OpenVINO pipeline, which gets automatically optimized by the OpenVINO runtime, then run a couple of warm-ups and start predicting. Okay. Just switching to the OpenVINO pipeline takes us down to 16 seconds. From 36 to 16, in one line of code. That's pretty good and really easy to do. But 16 seconds is probably still a bit too slow. If you want to generate thousands and thousands of images a day, or if you need to generate maybe 10,000 or 100,000 images for a dataset, that's still too slow. So what can we do next?
Starting from the same code, as I mentioned, we can actually apply a static shape to the pipeline. When you create a pipeline like this one, there's no assumption on the size of the images you're going to generate: 256 by 256, 512 by 512, or any size the model supports. But in practice, it's quite likely you'll only generate a few sizes, or maybe just one. So if you know that all your images are going to be, let's say, 512 by 512, giving this information to OpenVINO helps it optimize the model further, because now it can use statically sized tensors for all the layers and apply more optimizations. That's what we're doing here, as shown in the sketch below. If you need, let's say, both 256 by 256 and 512 by 512, you can create two pipelines, optimize each one, and use one or the other depending on the size you need. So it's a great trick to really accelerate those models.
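Here's a sketch of what fixing the shape looks like with Optimum Intel. I'm assuming a batch size of one, 512 by 512 images, and a single image per prompt; loading with compile=False simply delays compilation until after the reshape:

```python
from optimum.intel import OVStableDiffusionPipeline

# Assumed checkpoint, same as before
model_id = "runwayml/stable-diffusion-v1-5"

# Load without compiling yet, so we can fix the input shapes first
ov_pipe = OVStableDiffusionPipeline.from_pretrained(model_id, export=True, compile=False)

# Statically shape the pipeline: batch size 1, 512x512 images, one image per prompt
ov_pipe.reshape(batch_size=1, height=512, width=512, num_images_per_prompt=1)
ov_pipe.compile()

# From now on, every call must use these exact dimensions
image = ov_pipe("sailing ship in a storm, oil painting", height=512, width=512).images[0]
```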
Let's run that thing again with the static shape and compare. So let's just save this and run it again. It's going to take a few more seconds and then we can see the results. Okay. We've run the baseline OpenVINO pipeline again and we get 16 seconds again. Now we're predicting with the statically shaped pipeline, and it is much, much faster. In fact, we're even under 4.5 seconds. So this is a good run. We started at 36-plus seconds, took it down to about 16 with OpenVINO, and slashed it down to 4.5 seconds, which is definitely fast enough for a lot of real-life use cases. This shows that you don't have to use GPUs all the time. There are some use cases where GPUs are required, but in this particular case, you could very well run your Stable Diffusion inference on CPU and scale out to as many servers as you need.
So I think this is another really great collaboration between Intel and Arcee. If you want to learn more, I would encourage you to go check out the Optimum Intel library; we have examples, docs, etc. I actually wrote a blog post on this with my colleague Ella, who's the lead developer for Optimum Intel. So great job, Ella. You can go and find that blog post as well. And last but not least, I would encourage you to go check out the Intel organization page on the Arcee Hub, where you'll find developer resources, examples, and a great space comparing Stable Diffusion inference on the brand-new Xeons and the older ones, so you can see the speedup there. They also have lots of very interesting models, quantized models, etc. So go and check that out if you're interested in performance optimization.
Okay, well, that's really what I wanted to tell you today. Thank you for watching. I hope this was useful. I'll be back very soon with more videos, and definitely more Stable Diffusion videos. Okay, see you, and until next time, keep rocking.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.