SLM Inference on a Windows Laptop: Intel Lunar Lake CPU, GPU, NPU + OpenVINO

July 14, 2025
Unlock the full potential of your Intel Lunar Lake processor! In this demo, we transform an MSI Prestige 13+ Evo laptop running Windows 11 into a local AI powerhouse, running cutting-edge language models like Llama-3.1-SuperNova-Lite (8B) with very good performance and efficiency.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. You can also follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

Intel's Lunar Lake architecture brings together the CPU, GPU, and NPU (Neural Processing Unit) in a single package. With OpenVINO's Intel-optimized inference engine, you'll see how to leverage each component for maximum AI performance. No more cloud dependencies or expensive API calls: everything runs locally on your Intel hardware! Intel's deep integration with OpenVINO means you get highly optimized performance. From quantization techniques to hardware-specific optimizations, every aspect is fine-tuned for Intel architecture. Whether you're a hobbyist or building enterprise AI solutions, this combination delivers professional-grade results.

** Laptop specs: https://www.msi.com/Business-Productivity/Prestige-13-AI-plus-Evo-A2VMX/Specification
Intel Core Ultra 9 288V, 32GB RAM, Intel Arc 140V GPU (16GB), Intel NPU

** Guide and code: https://github.com/juliensimon/arcee-demos/tree/main/openvino-lunar-lake

⭐️⭐️⭐️ While you're here, I've got a great deal for you! If you care about your online security, you need Proton Pass, the ultra-secure password manager from the creators of Proton Mail. GET 60% OFF at https://go.getproton.me/aff_c?offer_id=42&aff_id=13055&url_id=994 ⭐️⭐️⭐️

Transcript

Hi everybody, this is Julien from Arcee. I hope this video looks okay, because I'm recording it on a Windows machine. Yes, you heard that right, and no, I haven't gone crazy. The reason I'm using Windows today is to demonstrate how you can run local inference on a Windows machine powered by an Intel Lunar Lake CPU. Lunar Lake is particularly interesting because it gives us the option to run on the CPU, on an Intel GPU, and on an Intel NPU. So I'm going to use OpenVINO to optimize a small language model, and we're going to run inference on those three hardware platforms. This should be interesting. Let's go.

Before we dive into the demo, a few words about the dev environment. This machine runs Windows 11, and I'm assuming you have a working Python environment. I won't show you how to set that up; there are lots of good resources out there. You can use PowerShell, Miniconda, etc., it doesn't really matter. I also tried to use the Windows Subsystem for Linux and Docker, which would have given me Ubuntu and tools I'm generally more familiar with. However, I couldn't get access to the GPU or the NPU in WSL or Docker. I spent a fair amount of time trying and followed all the tutorials, but couldn't get it to work. So either I'm an idiot, which is a strong possibility, or this particular GPU is not supported by WSL or Docker. If you know how to get this right, leave a comment; I'll be happy to learn. Since I couldn't get it to work, I'm going to use the CPU, GPU, and NPU natively. As you will see, this is pretty efficient.

Okay, let's talk about OpenVINO now. OpenVINO is an Intel toolkit that helps you optimize deep learning models across Intel platforms. The documentation is actually pretty good, but it covers a lot of different things, from computer vision models to generative AI models, etc. Starting from your working Python environment, the most useful page for what we're going to do today is this one, and I'll put all the links in the video description. We're going to export models using a script. The first step is to grab a model from Hugging Face. I'm going to use SuperNova-Lite, our own Llama-3.1-based 8-billion-parameter model, and export it to the OpenVINO format. Once we've done that, we'll be able to run inference across different platforms. We'll run local inference in a Python script, and I'll also show you how to use the OpenVINO Model Server (OVMS) for proper serving through HTTP APIs. In the repository, you'll find a small guide I wrote to walk you through the steps, from model export to inference with OVMS.

Let's look at model export. We grab our model from Hugging Face and optimize it for a particular device, which can be CPU, GPU, or NPU, choosing a quantization format, either int8 or int4. Both work for CPU and GPU; for the NPU, we should really use int4. This is also where we save the model and the config file for model serving. So nothing really crazy here. Let's run one example. I'm going to run the NPU export just so you can see how that works. I already downloaded the model from Hugging Face, so it's in my cache; if you run this for the first time, you'll see the model download happening. Then OpenVINO loads the model and optimizes it for this particular configuration. This should only take a minute or two, so let's just wait a moment and we'll see the results. Okay, we can see OpenVINO optimizing the model. Most parameters are quantized to 4-bit precision, and a few stay in 8-bit precision. As you can see, this is a fast process.
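For reference, here is a minimal Python sketch of what that export and quantization step boils down to with Optimum Intel. The demo itself uses the export script from the repo, so treat the model ID, output directory, and options below as illustrative assumptions rather than the script's actual contents:

```python
# Minimal sketch: export a Hugging Face model to OpenVINO with int4 weight
# quantization using Optimum Intel. Assumes `pip install optimum[openvino]`.
# The model ID and output directory are assumptions for illustration only.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "arcee-ai/Llama-3.1-SuperNova-Lite"
output_dir = "supernova-lite-ov-int4"

# Convert from the Transformers format and quantize the weights to 4 bits.
quant_config = OVWeightQuantizationConfig(bits=4)
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=quant_config,
)
model.save_pretrained(output_dir)

# Save the tokenizer next to the model so inference and serving can find it.
AutoTokenizer.from_pretrained(model_id).save_pretrained(output_dir)
```

Running the same code with `bits=8` produces an int8 variant; the saved directory is what the inference examples and the OVMS config point to.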
Once it's done, we get the saved model locally and the config file to serve it with the model server. Let's give it a few more seconds. Now we have the model and the config file. The process is exactly the same for the other platforms. There are some more hardcore quantization settings if you want to look at those; I'd refer you to the OpenVINO documentation, but for our purpose this is more than enough.

Now that we have quantized all those models, we can run inference using a small script I wrote, `OpenVINOExample.py`. We just need to specify the device, the precision, and a prompt. Here we can see the activity for each device: things are a bit slower than usual because I'm recording with OBS, the NPU is doing nothing, and the GPU is managing video. Let's try the CPU first, with int4. Here we're doing local inference, not model serving: we're just loading the model and predicting with the pipeline. We can see the CPU is a bit busy, and the speed is not too bad. It's probably a bit too slow for conversational usage, but it's not ridiculous; we could use this. Let's see how many words per second we're getting. We got 4.5 words per second. English text has about 30% more tokens than words, so that's about six tokens per second. A bit on the low end, but usable for short prompts and small applications.

Let's try int4 with the GPU this time: write a creative story. The GPU is pretty busy. It's an Intel Arc 140V with 16 gigs, which is definitely enough for this model; it also works in 8-bit with room to spare, so GPU memory size shouldn't be an issue here. We wouldn't be running very big models anyway. Let's see what kind of speed we get. We got 9.4 words per second; adding 30%, that's probably 12 tokens per second. Why not run int8 as well and see if there's a big difference? It looks a bit slower, but if you need a little more quality, int8 could be an option: quantization degrades the model just a tiny bit, so 8-bit quantization should be almost invisible. It's slower, but not that much slower. Maybe that was just a bad run, but it is a bit slower, about seven tokens per second.

Now let's try the NPU. The NPU is a chip with dedicated hardware to accelerate deep learning operations. Let's grab the prompt here. The first time you run this, it's going to be slow because there's an extra compilation step, on top of the OpenVINO conversion, to further optimize the model for the NPU. I'm using the caching mechanism available in OpenVINO, and you can look at my code to see how that works. The first run is slow and takes a few minutes, and you'll see the CPU jump to almost 100%. But then the compiled NPU artifact is saved, and subsequent runs start very quickly. We can see the NPU going to almost 100%: very good hardware utilization here. The speed looks fairly nice. The good thing is, we leave the GPU alone. If you need the GPU for video processing or other applications, or if you're running another model on the GPU, this is a really good option. We're getting about eight tokens per second with this 8-billion-parameter model, which is pretty good performance. The NPU has 16 gigs as well.

The combination of CPU, GPU, and NPU is very interesting. You have the flexibility to run different models, potentially at the same time, and pick the sweet spot for each one. If you have a really tiny model, maybe it's worth running it on the CPU. If you need maximum speed, the GPU is the place to go. And if you want to leave the CPU and GPU alone for other applications, the NPU is there for you. That's pretty cool.
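If you want to reproduce this kind of local inference without the repo script, here is a minimal sketch using Optimum Intel and the Transformers pipeline, matching the pattern described above (load the model on a device, then predict with the pipeline). The directory name, cache path, and generation settings are assumptions, and `OpenVINOExample.py` adds device/precision arguments and words-per-second timing on top of this; NPU inference may also need extra configuration, so check the repo guide for the exact setup:

```python
# Minimal sketch: local inference with an OpenVINO-exported model.
# Assumes `pip install optimum[openvino]` and the int4 export from the previous step.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_dir = "supernova-lite-ov-int4"   # directory produced by the export step (assumption)
device = "GPU"                         # "CPU", "GPU", or "NPU" on Lunar Lake

# CACHE_DIR keeps compiled artifacts around, so only the first run on a device
# (especially the NPU) pays the extra compilation cost.
model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device=device,
    ov_config={"CACHE_DIR": "./ov_cache"},
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = generator("Write a creative story about a lighthouse keeper.", max_new_tokens=256)
print(result[0]["generated_text"])
```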
That's local inference. Now, if we want to invoke our model through HTTP, maybe with an OpenAI-compatible API, the OpenVINO Model Server is a good option. I provide two possibilities here: you can work with an OVMS instance that's already running, or have the script start one locally on the machine. Let's do that. I'll launch OVMS, which I installed on my machine, and it will load the model; this can take maybe 30 seconds. Then we'll run inference through HTTP, using the OpenAI client directly against OVMS. We can see that the model has been loaded and is available. Now we can prompt it, and you'll recognize the OpenAI format. If you need OpenAI compatibility, HTTP access, or to serve several apps, this is a good option. This run was a little faster, but we just got lucky. That's how you can do it. Of course, you can also launch OVMS manually and then run inference; it works the same way, just skip the "start server" option so the script doesn't try to launch one itself.

That's pretty much what I wanted to show you. For fun, and for the Windows fans, I included some extra commands in the guide: if you want to query the model from PowerShell, you can do that. The guide also covers the script options we saw, the model names, and some troubleshooting tips.

As for performance, I only ran a few examples, so your mileage may vary. Generally, GPU int4 is the fastest, consistently around 10 words per second. The NPU is generally second; this last run was actually 7.5 words per second, so almost 10 tokens per second, while leaving the rest of the machine free for other applications. GPU int8 is a bit slower, and then come CPU int4 and CPU int8. Bottom line: if you really want speed, the GPU is the best place to run, but if you need the GPU for something else, or if you want to run two models, the NPU is also very interesting.

You'll find everything in the repo, including the script; you can tailor it to your liking and use your own model. I think this is an easy way to experiment with small language models and OpenVINO, and it should work equally well on AI PCs from other generations. That's what I wanted to show you for today. I hope this was useful, and I hope you enjoyed looking at Windows for a change; it wasn't too different or painful for me. Thanks a lot for watching. I'll be back soon with more content. Until next time, keep rocking.
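For readers following along with the guide, here is a minimal sketch of querying OVMS through its OpenAI-compatible API with the standard OpenAI Python client. The port, endpoint path, and served model name below are assumptions; check the repo guide and your OVMS configuration for the actual values:

```python
# Minimal sketch: call a model served by OVMS through its OpenAI-compatible API.
# Assumes `pip install openai` and an OVMS instance already running locally.
# The base URL, port, and model name are assumptions; adjust to your config.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v3",  # OVMS exposes OpenAI-style endpoints under /v3
    api_key="unused",                     # OVMS does not validate the API key
)

response = client.chat.completions.create(
    model="supernova-lite-ov-int4",       # servable name from the OVMS config file
    messages=[
        {"role": "user", "content": "Write a creative story about a lighthouse keeper."}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

The same endpoint also answers plain HTTP requests, which is how the PowerShell commands mentioned above query the model.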

Tags

OpenVINO, Intel Lunar Lake, Local Inference, Windows Machine, NPU GPU CPU

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.