Accelerating Stable Diffusion Inference on Intel CPUs with Hugging Face, Part 2

April 03, 2023
In this video, you will learn how to accelerate image generation with an Intel Sapphire Rapids server. Using Stable Diffusion models, the Intel Extension for PyTorch, and system-level optimizations, we're going to cut inference latency from 36+ seconds to 5 seconds!

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️

- Blog post: https://huggingface.co/blog/stable-diffusion-inference-intel
- Code: https://github.com/juliensimon/huggingface-demos/tree/main/optimum/stable_diffusion_intel
- Jemalloc: https://jemalloc.net/
- Intel Extension for PyTorch: https://github.com/intel/intel-extension-for-pytorch
- Intel Sapphire Rapids: https://en.wikipedia.org/wiki/Sapphire_Rapids
- Intel Advanced Matrix Extensions: https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions

Transcript

Hi everybody, this is Julien from Arcee. In a previous video, I showed you how to generate images with Stable Diffusion models on Intel CPUs in less than five seconds. To get those great results, we combined Intel OpenVINO and our very own Optimum Intel library. Now, in some scenarios, maybe you can't or don't want to use OpenVINO, and that's fine. In this video, I'm going to show you a different set of techniques, using the Intel Extension for PyTorch, some system-level optimizations, and a few more cool tricks, to get to the same result: image generation with Stable Diffusion models on CPU in five seconds. Let's get to work.

Just like in the previous video, let's take a quick look at our baseline example. This is the one we're starting from. We're creating a Stable Diffusion pipeline with the diffusers library, loading a Stable Diffusion model, running a few iterations, and averaging inference time. My environment here only includes vanilla PyTorch, nothing fancy. We should be around 35 or 36 seconds. Let's check that, and then we can start optimizing. I'll be back in a sec. Okay, so we ran our five iterations, and the average latency is 35 seconds. Now we can start accelerating.

The first step is to optimize the operating system for this particular job. This is sometimes overlooked: people rush to optimize the code or the machine learning side of things, but there's a lot you can do at the system level to speed things up. Stable Diffusion models are really big, and the generation process is both memory-intensive and compute-intensive, so we can leverage multiple cores, threads, and smarter memory allocation. First, we'll upgrade the memory allocation library. I'm using jemalloc, a high-performance memory allocator. Its tuning guide gives good tips for resource-intensive applications that prioritize CPU utilization, which is a good starting point for us. These are the settings I'm using.
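The jemalloc configuration can be sketched as an environment variable, adapted from jemalloc's tuning guidance for CPU-bound, resource-intensive applications. The exact values are an assumption and should be tuned for your own workload:

```shell
# jemalloc tuning (illustrative values -- adjust for your workload):
# - background_thread offloads page purging to dedicated threads
# - dirty_decay_ms / muzzy_decay_ms keep dirty pages around for 60 s
#   instead of returning them to the OS, lowering allocation overhead
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"
```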
In the interest of time, I won't break everything down, but we're maximizing parallelism and increasing the time that dirty memory pages are kept before being reclaimed by the system. This lowers memory allocation overhead and increases parallelism. The second step is to add a thread management library from the Intel OpenMP collection, which is part of the Intel MKL package on Ubuntu. This lets us set the number of threads we want to use for parallelization. I'm setting it to 32, the number of physical cores on this machine. We need to make sure both libraries are loaded using the LD_PRELOAD environment variable: jemalloc and libiomp. Now we're good to go, with no changes to the Python code. The last tweak is to use the numactl tool to pin our threads to particular cores. I'm going to use all 32 cores, from 0 to 31, and pin the threads of our Python application to those cores. This avoids some of the overhead related to context switching. Let's try this; it should speed things up.

Let's see how fast we're going now. We have our warm-up iteration, and now our real iteration, which is noticeably faster. I'm going to say 11 seconds, something like that. We can wait for the five iterations to complete. We went from 35 or 36 seconds to 11 seconds. That's more than a 3x improvement, just with system-level optimization. One caveat is that by changing memory allocation, we're changing the system's behavior. If you have other applications running and you apply these environment variables globally, check that you haven't negatively impacted those applications. But if you have dedicated inference servers, go ahead and apply every possible tweak to speed things up. We're at 12 seconds, so more than a 3x speedup with no code change, just tweaking. That's pretty cool. But we can still go faster. Next, we'll start changing our code to speed things up even more. First, we'll install the Intel Extension for PyTorch (IPEX), version 1.13.1, which matches our PyTorch version.
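The preload and core-pinning steps above can be sketched as a small launch script. The library paths are typical Ubuntu locations and `generate.py` is a placeholder name for the benchmark script, so adjust both for your setup:

```shell
# Preload jemalloc and Intel OpenMP so they take over memory allocation and
# threading before Python starts (paths are typical Ubuntu locations).
export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libjemalloc.so:/usr/lib/x86_64-linux-gnu/libiomp5.so"

# One OpenMP thread per physical core (32 on this Sapphire Rapids server).
export OMP_NUM_THREADS=32

# Pin the process to cores 0-31 to avoid context-switching overhead:
#   numactl -C 0-31 python generate.py
```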
This extension leverages hardware acceleration features on our Intel CPU and automatically enables them in PyTorch. Among other things, it brings support for JIT compilation. Now, let's look at our code changes. We still start from the same pipeline as before, and we'll optimize every part of it with IPEX for BFLOAT16, which is supported by the Sapphire Rapids CPU. First, we convert all components in the pipeline to the channels-last memory format, which works best with IPEX. Then we optimize each element of the pipeline with ipex.optimize, specifying the BFLOAT16 data type. This leverages the Intel AMX hardware acceleration. We also pass a random sample input with the proper dimensions for the model, to enable JIT compilation during inference. Not a lot of complexity here: just channels last, then optimize, and we only need to do this once.

Then we run our prediction loop again, making sure to enable BFLOAT16 autocast. This is key: otherwise you'll run with the default Float32, and you won't get the significant speedup. We still have our system-level optimizations in place, so let's run this, making sure to use numactl. Now we have both system-level and framework-level optimizations, so this should be fast. Wow, the suspense is killing me, although I do see the number and it's good. You'll see it in a second. Alright, can you guess? One more. All right, so now we're down to 5.3 seconds. From 36 to 12 to 5.3: that's another very nice 2.5x speedup with minor code changes, pretty much just compiling the model and predicting as usual. We're getting close to our five seconds.

Let's see if we can do just a tiny bit better. There is one last trick we can use. Starting from the exact same code, the only difference is applying a different scheduler from the diffusers library. The scheduler controls the compromise between denoising quality and speed: Stable Diffusion starts from random noise and gradually denoises the image to match your prompt, and fewer denoising steps mean faster but potentially lower-quality generation.
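The IPEX steps above can be sketched on a toy stand-in model. This is a minimal sketch using a single Conv2d layer in place of the real UNet, VAE, and text encoder; the `ipex.optimize` call is shown as a comment because it requires the Intel Extension for PyTorch to be installed, and the layer and shapes are illustrative:

```python
import torch

# Toy stand-in for one pipeline component; the real code applies the same
# steps to pipe.unet, pipe.vae, and pipe.text_encoder from a diffusers
# StableDiffusionPipeline.
model = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1).eval()

# Step 1: convert to the channels-last memory format, which IPEX prefers.
model = model.to(memory_format=torch.channels_last)

# Step 2: IPEX optimization for BFLOAT16 (requires intel_extension_for_pytorch;
# sample_input lets IPEX trace the model for JIT compilation):
#   import intel_extension_for_pytorch as ipex
#   model = ipex.optimize(model, dtype=torch.bfloat16,
#                         sample_input=torch.randn(1, 4, 64, 64))

# Step 3: run inference under BFLOAT16 autocast. Without autocast, the model
# still executes in Float32 and you lose the AMX speedup.
x = torch.randn(1, 4, 64, 64).to(memory_format=torch.channels_last)
with torch.inference_mode(), torch.autocast("cpu", dtype=torch.bfloat16):
    y = model(x)
# y.dtype is now torch.bfloat16
```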
According to the documentation, this particular scheduler offers the best compromise between speed and quality. I've tried it, and it works. Adding this scheduler to the pipeline, let's run this again and see if we can go just a tiny bit faster. 5.25 is what we got. Let's see if we can scrape a few more milliseconds; every bit counts, especially at scale. This is fast. Can we go under five? 5.06. Pretty close to five. That's a smaller improvement, maybe 5%, but still significant at scale.

So there you go. We started from 36 seconds and are down to five. OpenVINO did a tiny bit better, but in some scenarios you can't use OpenVINO, and even then you can get really close to five seconds per image by leveraging system-level optimizations, the Intel Extension for PyTorch, and advanced settings in the diffusers library. Speed is everything. I'm sure there are more things we could do; if you have ideas, post a comment. That's pretty much what I wanted to show you today. I hope this was fun, and all the information will be in the video description. I'll see you soon with more videos. I have quite a to-do list, and until next time, keep rocking. Thank you.

Tags

Stable Diffusion, Intel CPUs, System Optimization, PyTorch, Image Generation

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.