Run SLMs locally: Llama.cpp vs. MLX with 10B and 32B Arcee models

February 05, 2025
In this video, we run local inference on an Apple M3 MacBook with llama.cpp and MLX, two projects that optimize and accelerate small language models on consumer hardware, including Apple silicon. For this purpose, we use two new Arcee open-source models distilled from DeepSeek-V3: Virtuoso Lite (10B) and Virtuoso Medium v2 (32B). First, we download the two models from the Hugging Face hub with the Hugging Face CLI. Then, we go through the step-by-step installation procedure for llama.cpp and MLX. Next, we optimize and quantize the models to 4-bit precision for maximum acceleration. Finally, we run inference and look at performance numbers. So, who's fastest? Watch and find out :)

If you’d like to understand how Arcee AI can help your organization build scalable and cost-efficient AI solutions, don't hesitate to contact sales@arcee.ai or book a demo at https://www.arcee.ai/book-a-demo.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. You can also follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

* Blog post: https://www.arcee.ai/blog/virtuoso-lite-virtuoso-medium-v2-distilling-deepseek-v3-into-10b-32b-small-language-models-slms
* Virtuoso Lite: https://huggingface.co/arcee-ai/Virtuoso-Lite
* Virtuoso Medium v2: https://huggingface.co/arcee-ai/Virtuoso-Medium-v2
* Llama.cpp: https://github.com/ggerganov/llama.cpp
* MLX for language models: https://github.com/ml-explore/mlx-examples/

00:00 Introduction
00:40 A quick look at Virtuoso-Lite and Virtuoso-Medium-v2
03:10 Downloading the models from Hugging Face
04:30 Building llama.cpp from source
06:15 Converting Hugging Face models to GGUF
09:20 Quantizing models with llama.cpp
11:20 Running local inference with llama.cpp
15:00 Installing MLX
11:20 Running local inference with MLX
19:45 Conclusion

Transcript

Hi everybody, this is Julien from Arcee. Small language models are great candidates for local inference. In this video, I'm going to show you how to run two of our latest open-source models, Virtuoso Lite and Virtuoso Medium v2, on a MacBook, specifically an M-series MacBook. To do this, I'm going to show you two different tools: llama.cpp and MLX. I'll guide you through every step, from installation to converting the models and running inference, so you'll be able to replicate everything. Sounds good? Let's get started.

Before we dive into the tools, let's take a quick look at the two models we're going to work with. Virtuoso Lite and Virtuoso Medium v2 were released as open-source models about a week ago. I'll put a link to our blog post, where you can read about the models and look at the benchmarks, and to the model pages on Hugging Face, so you can read all about them and grab the models. What's really cool about these two models is that, despite their sizes (Virtuoso Lite is 10B and Virtuoso Medium v2 is 32B), they are way better than much larger models from just a few months ago. On the leaderboard, Virtuoso Lite is the number one model among models up to 14 billion parameters. There's no ranking for Virtuoso Medium v2 yet, but I wouldn't be surprised if it ends up very close to the top. We have new models coming all the time, so keep an eye on those. Virtuoso Lite is a very good choice in the 10B-and-under range. If you're using 8B models today, such as Llama 8B or similar, I highly recommend you look at Virtuoso Lite. It's just a tiny bit bigger, so it should still fit on whatever platform you're using today, and it's likely to be a massive upgrade. Details are in the blog post.

Now, let's take a look at the tools we're going to use. The first one is llama.cpp, a great library that optimizes models for inference across a wide range of platforms.
We'll set it up, compile it, and then work with MLX, specifically MLX Examples, which has the language model support we need. Let's jump into a terminal and start installing everything.

Before we build the tools, let's grab the models. The easiest way is the Hugging Face CLI, which you can install with `pip install huggingface_hub`. Then, you need to log in with a read-only token from your Hugging Face account. I've already done this, so there's no need to log in again. You can then run `huggingface-cli download` with the model name and a local directory. I've already downloaded the models, but let's run it again for the sake of it. Let's do the same for the other model. Good, now we have both models.

Let's install our tools. I'm assuming you have a build environment ready, which should be as simple as installing `make` and `gcc` with Homebrew. We have all of that, so we should be good. Let's clone llama.cpp. The build is pretty simple: we want the Metal backend, the Apple one, and then we can just build it. Let's make it fast. Once the build completes, all the tools we're going to use are in the `bin` directory. There's one last step: `pip install -r requirements.txt`, which makes sure we have everything the Python tools in llama.cpp need.

Now we can start working with the models. The first step is to convert the Hugging Face models to GGUF, the format llama.cpp uses. It packs everything into a single file, which is easier to work with. Let's run this: you just point it at your model directory, and it does its thing. This shouldn't take too long. In many cases, you will find ready-made GGUF models on the hub. For example, we have Virtuoso Lite GGUF in our org, with plenty of variants: 2-bit, 3-bit, 4-bit. If you find the variant you need, great, you don't have to run the conversion. But conversion is very fast, so there's no harm in doing it yourself.
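The download, build, and conversion steps above can be sketched as shell commands. This is a minimal sketch: the model ids come from the Hugging Face pages linked in the description, while the build flags, conversion script name, and output paths are assumptions that may differ slightly across llama.cpp versions.

```shell
# Install the Hugging Face CLI and download the two models to local directories
pip install huggingface_hub
huggingface-cli download arcee-ai/Virtuoso-Lite --local-dir ./Virtuoso-Lite
huggingface-cli download arcee-ai/Virtuoso-Medium-v2 --local-dir ./Virtuoso-Medium-v2

# Clone and build llama.cpp with the Metal backend (recent versions build with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j   # binaries land in build/bin

# Python dependencies for the conversion script
pip install -r requirements.txt

# Convert a Hugging Face checkpoint to a single 16-bit GGUF file
python convert_hf_to_gguf.py ../Virtuoso-Lite --outfile ../virtuoso-lite-f16.gguf
```

On older llama.cpp checkouts the build step is a plain `make`, and the conversion script may be named `convert-hf-to-gguf.py` instead.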
Once you have the 16-bit, single-file version of the model, we can run it and see how fast it is. Let's use `llama-cli` with conversation mode disabled, generate 512 tokens, and prompt something. We're running the 16-bit model at its original precision. It's reasonably fast, about 16 tokens per second. By default, we're not using all the threads on this machine, so let's try more threads with `--threads 16`. That keeps the machine busy, but the speed is about the same.

To make this faster, we can quantize the model. We'll start from the 16-bit model and use Q4_0 quantization, which is usually very fast. If you want a little more precision, you can try Q4_K_S or Q4_K_M. Quantization slightly degrades accuracy, but for most use cases it should be fine. Let's quantize to Q4_0. It's a very fast operation, and we get a much smaller GGUF file: the original 16-bit model is about 20 gigs, and the Q4_0 version is under 6 gigs. Now let's run inference again with the quantized model, generating 512 tokens, conversation mode off. It's much faster, about 47 tokens per second, almost three times faster. llama.cpp also has flash attention, which you can try; it's sometimes faster and sometimes not. Here, the answers are still good and the speed is about the same. You can run more benchmarking with `llama-bench` and measure model quality with `llama-perplexity`. For Virtuoso Lite on the WikiText-2 dataset, perplexity only increases by about 1% when going from 16-bit to 4-bit, which is negligible and amazing. In other words, you can run the 4-bit model with nearly the same quality as the 16-bit model, but 3x faster.

Let's do the same for the other model, Virtuoso Medium v2. The process is identical. We quantize to Q4_0, and the 32B model runs at 16 tokens per second, the same speed as the 10B model at 16-bit precision. So for the same speed, you can either run a 10B model at original precision or a much larger model quantized to 4 bits.
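The quantization and inference steps above can be sketched as follows. Binary names match recent llama.cpp builds; the prompt, file paths, and thread count are illustrative assumptions, not the exact ones used in the video.

```shell
# Quantize the 16-bit GGUF to Q4_0 (roughly 20 GB down to under 6 GB for a 10B model)
./build/bin/llama-quantize ../virtuoso-lite-f16.gguf ../virtuoso-lite-q4_0.gguf Q4_0

# Run inference: 512 tokens, conversation mode disabled, 16 threads
./build/bin/llama-cli -m ../virtuoso-lite-q4_0.gguf -n 512 -no-cnv \
    --threads 16 -p "Explain the difference between nuclear fission and fusion."

# Optional: try flash attention, run benchmarks, and measure perplexity
./build/bin/llama-cli -m ../virtuoso-lite-q4_0.gguf -n 512 -no-cnv -fa \
    -p "Explain the difference between nuclear fission and fusion."
./build/bin/llama-bench -m ../virtuoso-lite-q4_0.gguf
./build/bin/llama-perplexity -m ../virtuoso-lite-q4_0.gguf \
    -f wikitext-2-raw/wiki.test.raw
```

The same three commands, pointed at the Virtuoso Medium v2 GGUF, reproduce the 32B numbers discussed above.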
If you're after performance and speed, the 10B model is unbeatable. Let's earmark those numbers: 16 tokens per second for the larger model, and 47-48 for the smaller one.

Now, let's switch to MLX. MLX is an array framework with a NumPy-like API that accelerates numerical computing on Apple hardware. The repo we're interested in is MLX Examples. Installing the language model package is as simple as `pip install mlx-lm`. You can use it programmatically through a Python API or from the command line; `mlx_lm.generate` is where you start exploring. MLX is similar to llama.cpp in that you need to convert and optimize the models, which you can do with the Python API or the command line. MLX supports quantization, and there's a large community on the Hugging Face Hub that converts and shares models: you can find over 1,400 already-converted models in the mlx-community organization. Let's run Virtuoso Lite 4-bit directly. We don't need to convert it; we can just run `mlx_lm.generate` with the model and a prompt. The model was downloaded, and now we can generate tokens. It's nice and fast, about 51 tokens per second, roughly 10% faster than llama.cpp. Let's try the larger model, Virtuoso Medium v2 4-bit. It's already downloaded, and we're at about 17 tokens per second, an 8-9% improvement.

That's what I wanted to tell you today. I'm a huge fan of local inference for small language models, and both llama.cpp and MLX deliver. MLX is a little faster on Mac hardware, probably by 8-9%, and I love the community they have on Hugging Face, which makes it easy to find and contribute models. llama.cpp is still very good and runs on many more platforms, not just MacBooks. Both tools are equally interesting, and you should follow them both. That's it for today, and until next time, stay metal (Apple Metal, that is). Keep rocking.
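The MLX workflow above can be sketched like this. The `mlx_lm.generate` and `mlx_lm.convert` entry points are real MLX-LM commands, but the mlx-community repo id below is a hypothetical placeholder; search the Hub for the exact pre-converted model names.

```shell
# Install MLX-LM (requires an Apple silicon Mac)
pip install mlx-lm

# Run a pre-converted 4-bit model from the mlx-community organization
# (repo id is illustrative)
mlx_lm.generate --model mlx-community/Virtuoso-Lite-4bit \
    --prompt "Explain the difference between nuclear fission and fusion." \
    --max-tokens 512

# Or convert and 4-bit-quantize a Hugging Face model yourself
mlx_lm.convert --hf-path arcee-ai/Virtuoso-Lite -q --q-bits 4
```

Unlike llama.cpp, there is no separate build step: MLX-LM downloads, converts, and runs models straight from the Hub.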

Tags

Local Inference, Small Language Models, MacBook M Series

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.