Running AFM-4.5B on Intel CPUs with OpenVINO

September 11, 2025
Unlock the full potential of your Intel Xeon server! In this demo, we transform an Amazon EC2 r8i instance equipped with the powerful Intel Xeon 6 (Granite Rapids) processor into a local AI powerhouse, running cutting-edge language models like AFM-4.5B by Arcee AI with exceptional performance and efficiency.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. You can also follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

First, we optimize the AFM-4.5B model with the OpenVINO toolkit, quantizing it to 4-bit and 8-bit precision, and run inference with a simple Python example written with Hugging Face Transformers and Optimum Intel. Next, we install the OpenVINO Model Server to serve the optimized model, invoking it with both curl and the OpenAI Python client. Intel's Granite Rapids architecture delivers enterprise-grade AI performance thanks to its P-cores and built-in acceleration instruction sets such as AVX-512 and AMX. With OpenVINO's Intel-optimized inference engine, you'll see how to leverage the Xeon 6's substantial compute power for maximum AI performance.

Resources:
- Code and commands: https://github.com/juliensimon/arcee-demos/tree/main/openvino-granite-rapids
- Amazon EC2 r8i: https://aws.amazon.com/blogs/aws/best-performance-and-fastest-memory-with-the-new-amazon-ec2-r8i-and-r8i-flex-instances/
- Docker on Ubuntu: https://docs.docker.com/engine/install/ubuntu/
- Arcee AI AFM-4.5B:
  * https://huggingface.co/arcee-ai/AFM-4.5B
  * https://huggingface.co/arcee-ai/AFM-4.5B-ov
- Intel OpenVINO:
  * https://openvinotoolkit.github.io/openvino.genai/docs/guides/model-preparation/convert-to-openvino/
  * https://docs.openvino.ai/2025/model-server/ovms_demos_continuous_batching.html
- Hugging Face Optimum Intel: https://huggingface.co/docs/optimum/intel/index

Transcript

Hi everybody, this is Julien from Arcee. I'm a big fan of running small language models on CPU, and recently I've been working with Intel to optimize AFM-4.5B, our first foundation model, for Intel platforms. In this video, I'm going to show you how to optimize this model using the Intel OpenVINO toolkit and deploy it on the latest generation of Intel Xeon processors, and I'll use an AWS instance for that. This should be fun. Let's get started.

Before we dive into the demo, let's look at the building blocks we're going to use. We're going to need an Intel server, and I'm going to use the newly launched R8i instances on AWS. Instance family names keep getting crazier. So what is R8i? R8i is Intel-based and uses the latest generation of Intel chips, the Intel Xeon 6, also known as Granite Rapids. The important thing is that it's the latest generation, and that matters a lot, as we will see, because these CPUs have dedicated instruction sets to accelerate AI inference. You can use any Granite Rapids server; in fact, the previous generation would work as well. You can use a real server or a cloud server, it doesn't really matter. Granite Rapids servers on AWS are currently only available in the R family, which means memory-optimized instances with high memory capacity. I remember that because R probably stands for RAM. They have way more memory than we need for our purpose, but that's the only option at the time of recording. I'm guessing we'll see C8i at some point, the compute-optimized instances, which will have less RAM and be more cost-effective. That would be my go-to option, but for now we'll use R8i, and I'll put the link to this nice blog post in the video description.

We'll need a model, obviously, and I'll use AFM-4.5B, which we've discussed quite a few times already. This is the model page on Hugging Face; again, I'll add the link. Just for information, if you want to skip the model optimization process altogether and just deploy it, I've also added OpenVINO versions on Hugging Face. There's a 16-bit, an 8-bit, and a 4-bit model. You can just clone the repo and load those if you prefer, but in this demo I will go through the full process. And of course, we'll need the OpenVINO toolkit, which, as you probably know, is an Intel toolkit to optimize models for a wide range of Intel platforms: CPU, GPU, and even NPU for AI PCs. We'll do CPU this time around. And last but not least, we'll use Optimum Intel, which is an open-source library by Hugging Face that makes it really easy to work with Intel tools. Literally a one-liner, as you can see. These are the building blocks. You'll find all the resources in the video description. Now let's get to work.

In the interest of time, I've already launched the AWS instance. I'm running Ubuntu 24, so all the commands here will be for Ubuntu, but if you prefer another Linux variant, that's not a problem. If we run `lscpu`, we'll see we are running this fancy Intel Xeon, which is indeed the Granite Rapids generation. If we look at the CPU flags, we'll see a whole collection of flags that really matter here: AVX-512 and its sub-instruction sets, AMX, which is a matrix multiplication instruction set, and a few more. These are what will speed up AI inference on this instance. This is the R8i 4XL with 32 vCPUs. You can probably go a little lower if you want; I don't think we need 32. We'll see what kind of speed we get, but I'm pretty sure you could go down to maybe 8 vCPUs and still get decent performance.
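If you'd rather check for those acceleration flags programmatically than eyeball the `lscpu` output, here is a minimal Python sketch. The flag names are assumptions based on what Granite Rapids typically reports on Linux, so adjust the list for your own platform.

```python
# Minimal sketch: check /proc/cpuinfo for the AI acceleration flags
# mentioned above (Linux only). The flag names below are assumptions
# based on what Granite Rapids typically exposes; adjust as needed.
from pathlib import Path

WANTED = ["avx512f", "avx512_vnni", "amx_tile", "amx_int8", "amx_bf16"]

flags = set()
for line in Path("/proc/cpuinfo").read_text().splitlines():
    if line.startswith("flags"):
        flags.update(line.split(":", 1)[1].split())
        break

for name in WANTED:
    print(f"{name}: {'present' if name in flags else 'missing'}")
```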
The instance I picked is a middle-of-the-road option. In the interest of time, I've already installed Docker. If you don't know how to install Docker on Ubuntu, I'll add the link, but I didn't think it was important to show you that; just copy-paste from the instructions. Now we're ready to create our environment, install our tools, and optimize the model. First, let's install a few dependencies. We're not going to need much; I just want to make sure I have git and Python's virtual environment support. I'll put a link to all those commands as well, so you can just copy-paste. Let's create a virtual environment, which is generally good practice, and make sure we have the latest pip. Now we can install almost everything with a single command: `pip install optimum-intel[openvino]`. Don't forget the `[openvino]` extra, because if you do, OpenVINO will not get installed and you'll see errors. We'll install directly from the Git repo, which guarantees you have the latest version. Optimum Intel moves fairly quickly, so it's usually better to install from source. After a minute or so, everything is installed. We can check that we have what we need: Optimum and OpenVINO are the important ones, and this is the right version. AFM support was added in OpenVINO 2025.3, which just came out last week, so we should be good to go.

The next thing we need to do is make sure we can download the model from Hugging Face. I've already accepted the model terms; you'll need to accept them too before you can download the model. Make sure you have a token. If you don't have one, you can create it under Access Tokens; a read-only token is all you need. Store it somewhere safe. Now we can log in to the Hugging Face hub: just paste your token here. We don't need to save it as a Git credential. Now that we're logged into Hugging Face, we can download the model.

The next step is very simple. We're going to use Optimum and its nice CLI to download AFM and optimize it with OpenVINO. All we have to do is run `optimum-cli export openvino`, first optimizing the model for 8-bit precision, so int8, and passing the local path where the optimized model will be saved. Let's run this. We should see the model download. If this fails, it means your token is invalid or you didn't log in, or something like that, but if you followed my steps, it should be okay. We downloaded the model. Now it's loading, and in a few seconds we'll see OpenVINO quantizing it to 8-bit. There we go: we have 218 layers, and 100% of the layers are quantized to 8-bit precision. The model is saved to that local path, and we can see it here. Let's do the same for 4-bit; it's exactly the same process, just say int4. If you want a 16-bit version, you just say fp16, but I'm not going to do that. We've already downloaded the model, so we can skip that step and quantize to 4 bits. With the 4-bit export, some layers stay in int8 by default, and most of the layers are 4-bit. There are a ton of options for Optimum Intel and OpenVINO in general; we'll look at a few more later, and if you have fancy quantization recipes, you can apply them.

Once that's done, let's write a small Python program to load the model. Let's make sure our environment is active. Now we're going to run a very simple test script: we'll use the Optimum Intel API to load the 8-bit model and the Hugging Face Transformers library to predict.
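Here is a minimal sketch of what such a test script could look like, assuming the int8 export above was saved to a local folder like `./afm-4.5b-ov-int8`; the path, prompt, and generation settings are placeholders.

```python
# Minimal local-inference sketch with Optimum Intel's OpenVINO backend.
# Assumes the int8 model exported above was saved to ./afm-4.5b-ov-int8.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_path = "./afm-4.5b-ov-int8"  # placeholder: local path used at export time

# Drop-in replacement for transformers' AutoModelForCausalLM
model = OVModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Arcee AI is", max_new_tokens=64)[0]["generated_text"])
```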
The really nice thing about Optimum Intel is that it is extremely close to the vanilla Transformers API. If you're familiar with Transformers, you would say `AutoModelForCausalLM`; here we just turn that into `OVModelForCausalLM`, and the rest of the code is exactly the same. You can use the Hugging Face pipeline, etc. Let's try this and see if it works. Now we're loading the quantized model, and it starts predicting: "Arcee AI is an AI research and engineering firm", blah, blah, blah. This works. We could do the same thing with the 4-bit model. Let me bring that code up again: as you can see, that's all there is to it. If you want to integrate the model directly into an app and don't want to run an inference server, this is how you could do it. If you want the high-level Hugging Face API, you can use the pipeline. There's also a native OpenVINO API, which is a little different and available in Python and C++; the OpenVINO website has code snippets, and there are good examples in their repos as well. So that's the first way we can work with the models.

Now, can we deploy the model behind a proper inference server? The answer is yes. OpenVINO comes with one, called OpenVINO Model Server, or OVMS. I'm going to show you how to set it up, configure the models for deployment, and deploy them. Let me clean things up a little bit and I'll be right back.

Let's look at OVMS. OVMS uses OpenVINO under the hood and needs a few config files so the inference server can load the models. We could write them manually, or we can use the export script that comes with the OpenVINO Model Server, which re-optimizes the model and creates the config file in one step. First, let's clone the model server repo. The only reason we need it is for the export script, which lives here. We don't need to install its requirements because we've already installed OpenVINO. Let's just create a directory to store the models and config files, which we'll pass to the inference server. In a very similar way as before, we can run the export script: `export_model.py text-generation --quantize int8 --ovms_config_path ...`. I'm also adding two performance options, `--enable_prefix_caching` and `--kv_cache_precision int8`, which cache prefixes and save RAM, respectively. And of course, I want to convert this for the host CPU. Let's run this. Under the hood, it uses exactly what we've done before, so we'll see the model conversion process again. If we look in the models directory, we'll see the model itself and the config file. This is the bare-bones config file, which tells the server where the model is. To add the 4-bit model as well, we just change the model name and weight format; the model gets converted and added to the config file. This means you could have lots of different, completely unrelated models configured at the same time: as long as they're defined in the config file, you can request them, and they get loaded on demand. Now we have two models configured for our model server, which is exactly what we wanted.

Next, let's grab the OpenVINO Model Server, which is a Docker image. It's not big, less than 500 MB. Now we can just run the server: `docker run -p 8000:8000 -v :/models --config_path /models/config.json`. Make sure you use OpenVINO Model Server 2025.3 or higher. We should see something like this, which is good news: AFM-4.5B int8 is available, so the model is ready to be served.
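If you'd rather check model availability programmatically than read the server logs, here is a small sketch against OVMS's config status endpoint. The port matches the `docker run` command above, but treat the exact endpoint path and response layout as assumptions to verify against the OVMS documentation.

```python
# Minimal sketch: query the OVMS config status endpoint to see which
# models are loaded and whether they are AVAILABLE. Port 8000 matches
# the docker run command above; the endpoint and response layout are
# assumptions to double-check against the OVMS docs.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/config") as resp:
    config = json.loads(resp.read())

for model_name, info in config.items():
    for version in info.get("model_version_status", []):
        print(f"{model_name} v{version.get('version')}: {version.get('state')}")
```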
If you see "unavailable" or some kind of error, it means there's something wrong with your config file, the path is wrong, the model doesn't exist, or it's corrupted. But if you see "available," it means it's ready to be served. Now, let's run another example. We can try `curl`, why not? First, `curl ovms` on port 8000. We use the OpenAI format here, which is fully supported by OVMS. We can pass the model name, and this should be the name of the model in the config file. Some additional parameters if you feel like it, and then the prompt. Let's see what happens. We get an answer. As you can see, we can invoke this just like we would invoke any OpenAI model. Now, let's use the OpenAI client with Python. So `pip install openai`. Let me grab my code here. Import the OpenAI client, point to the endpoint, so the OVMS endpoint, no API key needed, and we can do streaming inference this time with the 8-bit model, streaming equal to true, and then a very simple function to decode the streaming answer. Let's try this one. Here we go. This is pretty good. Let's try a slightly more complicated question. How about my favorite question? It keeps me awake at night. Let's go for more tokens here. Off it goes. You can see the speed is more than adequate. We're running on all threads, 32 threads. We're actually using every single thread on the system. Some of them are busy doing something else, so we could try fewer threads. But you can see here, this is already faster than I can read. So there you go. As you can see, it's really easy to work with small language models in general and AFM-4.5b in particular on Intel instances. Even with a small instance, we get very nice speed. For some use cases, this is everything we need. We can run several models or several model variants even on the same machine. We're not limited by GPU RAM. There's much more host RAM than we'll ever need here. So we can just load models on demand and work with them without the need for GPU instances, which tend to be expensive and a single point of failure in your architecture. Nice little demo. Again, I'll put everything in the video description. I'll put all the commands that I use for setup. I hope you liked it. Until next time, my friends, you know what to do. Keep rocking.

Tags

Intel-Xeon, OpenVINO, AFM-4.5b, Optimum-Intel, AWS-R8i