Benchmarking TurboQuant with MLX on Apple Silicon

May 31, 2026

KV-cache quantization in MLX: I benchmarked TurboQuant on Apple Silicon to find out how much memory it really saves, and where it breaks.
⭐️⭐️⭐️ Thanks to VEED for sponsoring. Try the VEED Subtitles API on Fal: https://bit.ly/4fCVp4g ⭐️⭐️⭐️
The KV cache grows linearly with context and quickly becomes the memory wall on a Mac. 8-bit quantization is the easy fix, but 4-bit breaks greedy decode, because the two halves of the cache, K and V, behave completely differently: K feeds softmax (fragile), V gets averaged (forgiving). TurboQuant (Hadamard rotation + a Lloyd-Max codebook) is built to exploit that. I measured it across three architectures and four cache backends.
Measured (resident KV cache, Qwen 2.5 32B @ 32K):
• baseline fp16: 7.85 GB
• scalar K8V4: 3.17 GB (−60%)
• TurboQuant K4V4: 2.07 GB (−74%)
• TurboQuant K3V3: 1.71 GB (−78%)
And the honest part: on Gemma 4 (hybrid sliding-window) every TurboQuant config tested produces garbage in Python MLX, while scalar K8V4 stays coherent. The gain is real, but it depends on your architecture and context length: you have to know which regime you're in.
Code and slides: https://github.com/juliensimon/turboquant-apple-metal
More content on Substack at https://www.airealist.ai

Transcript

Hi everybody, Julien here. For the past few weeks, I've been trying to make sense of KV cache quantization on Apple Silicon. There's a Google paper called TurboQuant, several great MLX ports, and people online saying this cuts your KV cache in half, which sounds amazing. But every time I try to reproduce that on my own Mac, the numbers don't quite match the story. In this video, I run the actual experiments...
You make content for a living, you really want to use the API. One API call in your favorite language is all you need. Pass the video URL, get the finished video out. This handles transcription, line breaks, styling, and burn-in rendering. 125 languages are supported and the captions are 98% accurate. Pricing starts at 10 cents per minute. For a content pipeline, this is the difference between a lot of manual work and...
...that's infeasible. So we cache K and V once per token and reuse them again and again. Generation becomes linear cost per token. But the cache lives in memory and it grows linearly with context forever. I measured this on Qwen 2.5 32B. With 4K context, the KV cache is 1 gig. At 16K, 4 gigs. At 32K, almost 8.
...and the V cache at 8-bit precision and you're done. vLLM has it, TensorRT-LLM has it, MLX exposes it as a one-line flag, KV bits equals 8. So you quantize the KV cache from 16-bit precision down to 8-bit and memory halves. Quality loss at 8-bit precision is pretty much zero. Problem solved. So why are we even talking about this?
K and V behave very differently under quantization. So here's the picture: K on the left, V on the right. Let's look at K first. So K is what attention uses to pick the winner token through a function called softmax. So softmax takes the dot product scores between the current query and every past key, runs exponentiation and normalizes so that all tokens...
So now let's look at V. V gets averaged after the winner is picked. So every cached token's V vector is multiplied by its attention weight and summed.
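To make the two roles concrete, here is a minimal NumPy sketch of single-query attention on toy random data. Everything in it is an assumption made for illustration: the sizes, the random tensors, and the uniform noise standing in for quantization error. It is not the benchmark code from the video. It shows the two mechanisms just described: noise on V passes through a weighted sum whose weights add up to one, so the output error can never exceed the per-element noise, while noise on K shifts the scores and softmax exponentiates that shift. The toy numbers are not meant to reproduce the 4-bit failure itself.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def attend(q, K, V):
    """Single-query attention: softmax over q.K scores, then a weighted sum of V rows."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = softmax(scores)
    return weights, weights @ V

rng = np.random.default_rng(0)
n_tokens, head_dim = 64, 128          # toy sizes, chosen for illustration only
q = rng.standard_normal(head_dim)
K = rng.standard_normal((n_tokens, head_dim))
V = rng.standard_normal((n_tokens, head_dim))

# Uniform noise in [-eps, eps] is a crude stand-in for quantization error.
eps = 0.1
noise = rng.uniform(-eps, eps, size=(n_tokens, head_dim))

weights, out = attend(q, K, V)
weights_k, out_k = attend(q, K + noise, V)   # error on K only
_, out_v = attend(q, K, V + noise)           # error on V only

# V error passes through a convex combination, so it can never exceed eps.
v_err = np.abs(out_v - out).max()

# K error shifts the scores, and softmax exponentiates the shift:
# each weight is rescaled by exp(shift) before renormalization.
score_shift = noise @ q / np.sqrt(head_dim)
k_weight_ratio = weights_k / weights

print(f"max output error from V noise: {v_err:.4f} (bound: {eps})")
print(f"weight rescaling from K noise: {k_weight_ratio.min():.2f}x to {k_weight_ratio.max():.2f}x")
```

In a real model the query and key magnitudes are not unit Gaussians, so the size of the score shift, and therefore how badly the weights move, depends on the actual activations.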
So it's a linear weighted average. And of course, averaging is more forgiving. Green dots are the true V values and the red crosses are heavily quantized V values. And if we look at the dashed lines...
...if we do naive 3-bit, evenly spacing the levels across the value range, well, we're going to miss out on some information. And that's exactly where TurboQuant comes in.
Here's the TurboQuant trick in one slide. So we have two distributions here. On the left in red, we have a realistic KV cache value distribution. Most numbers are grouped around zero...
We're throwing away a lot of precision exactly where it would matter most. Looking at the right panel, we see the TurboQuant trick, which is to reshape the data before compressing it. And that happens in two steps that are both reversible. This is very heavy math, so I'll just try to give you the overview and I'll share links to more precise resources at the end.
Step 2, and that's a mouthful, is called the Walsh-Hadamard transform. So think of it as a recursive shuffle. So we take the 128 numbers, the ones we flipped, and we replace each pair with the sum and the difference of the two numbers. Then we take those results, we pair them again, and we do that again and again and again. We do it seven times.
...at Bell Labs, Joel Max at MIT, and so it's called the Lloyd-Max quantizer. And if we look at the blue lines on the right panel, the vertical lines, you can see they're packed close together where the bell curve is really dense and they're sparser out there on the tails. So every bit is doing useful work, and I would say more useful work than with the naive codebook. When we read the cache back, we can...
Great, but how do we run this on a Mac? There are several serious implementations of this on Apple Metal already, and let's take a quick look. So we have an mlx-lm fork by Arazonoff. That's the drop-in version.
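Going back to the reshaping steps and the codebook described above, here is a minimal NumPy sketch of the idea. It is a toy under stated assumptions, not the TurboQuant reference code or any of the MLX ports. The flip step is assumed to be random sign flips, the codebook is fitted on random Gaussian samples for simplicity, and the names `fwht`, `lloyd_max` and `quantize` are made up for this example. It shows the sum-and-difference passes (seven of them for 128 numbers), that the rotation is reversible, and that a Lloyd-Max codebook starting from the naive evenly spaced levels ends up with a lower error on bell-shaped data.

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform of a 1-D vector (power-of-two length)."""
    x = np.array(x, dtype=np.float64)            # work on a copy
    n = x.size
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:                                  # log2(n) passes: 7 for n = 128
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b                    # sums
            x[i + h:i + 2 * h] = a - b            # differences
        h *= 2
    return x / np.sqrt(n)                         # with this scaling the transform is its own inverse

def lloyd_max(samples, bits, iters=50):
    """Fit 2**bits levels to 1-D samples, starting from the naive evenly spaced grid."""
    levels = np.linspace(samples.min(), samples.max(), 2 ** bits)
    for _ in range(iters):
        nearest = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(levels.size):
            members = samples[nearest == k]
            if members.size:                      # keep the old level if no sample maps to it
                levels[k] = members.mean()
    return levels

def quantize(x, levels):
    """Replace each value with its nearest codebook level."""
    return levels[np.abs(x[:, None] - levels[None, :]).argmin(axis=1)]

rng = np.random.default_rng(0)
n = 128
signs = rng.choice([-1.0, 1.0], size=n)           # assumed flip step: random sign flips
x = rng.standard_normal(n) * 0.05                 # most values near zero...
x[:3] = [4.0, -3.0, 5.0]                          # ...plus a few large outliers

rotated = fwht(x * signs)                         # flip, then transform
restored = fwht(rotated) * signs                  # undo: transform again, then flip back

samples = rng.standard_normal(4096)               # stand-in for rotated, bell-shaped values
uniform_levels = np.linspace(samples.min(), samples.max(), 8)
lloyd_levels = lloyd_max(samples, bits=3)
mse_uniform = np.mean((samples - quantize(samples, uniform_levels)) ** 2)
mse_lloyd = np.mean((samples - quantize(samples, lloyd_levels)) ** 2)

print(f"largest value before/after rotation: {np.abs(x).max():.2f} / {np.abs(rotated).max():.2f}")
print(f"3-bit MSE, uniform vs Lloyd-Max: {mse_uniform:.4f} vs {mse_lloyd:.4f}")
```

The rotation spreads the outliers across all 128 coordinates, which is why the largest value shrinks while the vector length stays the same.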
It has the easy scalar K8V4 path, which uses a native quantized matrix multiplication available on Apple hardware, so no custom kernels. Then we have another MLX fork, from Landon Maltz. This one has native Metal codebook attention built inside the MLX fork. And it's tracked currently by MLX issue 3404 if you want to follow integration. And last but not least, we have Tom's TurboQuant Plus, which is probably the most production-grade effort. This one is not based on MLX. It's based on llama.cpp.
Then TurboQuant K4V4. And then TurboQuant K3V3, trying to push things a little more. So if we look at Qwen 2.5 32B, which is a dense model, we see that the baseline KV cache usage is 7.85 gigs at 32K. Apple scalar K8V4 cuts that by 60%. TurboQuant K4V4 cuts it by 74%. And the Hadamard rotation in TurboQuant actually makes 4-bit K survivable on dense models. The rotation that we discussed smooths out the outliers that would otherwise break the process. And that's where the extra savings come from. The keyword here is dense. And if we look at an MoE model like Qwen 3.5 with 3B active, same story, 67% with TurboQuant, 54% with Apple scalar. And the bottom row, gigabytes back, is the one that decides what you can actually run. On Qwen 2.5 32B at 32K, real TurboQuant K3V3 gives you over 6 gigs back.
The yellow line, the Apple scalar, tracks much lower, but the green and blue lines, the TurboQuant lines, stay reasonably flat. And the gap between yellow and green widens as context grows. That's the visual proof that real TurboQuant beats scalar mode by a large margin. The middle panel is Gemma 4, the hybrid model. We see the quantized lines are closer to one another because...
...really saving half a gig. So percent saved tells you how well TurboQuant is doing the job. Gigabytes saved tells you whether this is a change that really means anything to you.
So what can we do with the savings? Let me translate the bytes into something concrete.
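The two metrics, percent saved and gigabytes back, are easy to recompute from the measured Qwen 2.5 32B figures at 32K quoted in the video description. The short sketch below does only that arithmetic; the helper name `savings` is made up for this example.

```python
# Resident KV cache sizes in GB, Qwen 2.5 32B at 32K context (from the video description).
baseline_gb = 7.85
measured_gb = {
    "scalar K8V4": 3.17,
    "TurboQuant K4V4": 2.07,
    "TurboQuant K3V3": 1.71,
}

def savings(baseline, quantized):
    """Return (gigabytes back, percent saved) relative to the fp16 baseline."""
    saved = baseline - quantized
    return saved, 100.0 * saved / baseline

report = {name: savings(baseline_gb, gb) for name, gb in measured_gb.items()}

for name, (gb_back, pct) in report.items():
    print(f"{name}: {gb_back:.2f} GB back ({pct:.0f}% saved)")
```

The percentage describes the compression scheme, while the gigabytes depend on how large the cache was to begin with, which is why the same percentage can mean several gigs on a dense model at long context and very little on a small cache.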
On my 48 gig Mac running Qwen 2.5 32B with 4-bit weights, model weights require 7...
I can fit about 270K tokens, so about 2.5 times more. With K4V4 TurboQuant, I can get 410K context. And with K3V3, I can go all the way up to 480K context. So huge differences, but these are a bit theoretical.
You will save 50% of a bounded thing. So you're not going to double the context. You're going to double on a fraction of the layers. So sure, you get extra headroom for other things, but you don't get the same bang for your buck. So memory savings are real, but they're half the story.
So it is a smoke test. Just run a few prompts and try to detect issues. So your mileage may vary. This is really designed to catch catastrophic failures, not subtle, weird problems. So for production, you would want to run more extensive tests and measure perplexity and cosine similarity, et cetera. Still, I did find catastrophic failure.
That's a big surprise, TurboQuant breaks every single time. So K4V4, K3V3, and even V-only 4-bit and V-only 3-bit, they generate garbage. I tried it again and again, and it just doesn't work. So that's unexpected. It is a problem. And the reason is that basically the attention mechanism doesn't tolerate the reconstruction error. The mix of global layers plus quantized sliding window layers just doesn't click. The error compounds and decoding breaks. So this isn't something that can be fixed with a kernel patch. It needs a bunch of tricks, and those are actually implemented in one of the projects we've discussed.
If your model is a classic dense model, Llama, Qwen 2.5, et cetera, all attention layers being global attention, then TurboQuant K4V4 works perfectly well, and you get about 70% cache savings. If your model is anything with a mix of sliding window plus global attention, you cannot use TurboQuant on Python MLX today.
At 54%. And you can try TurboQuant K4V4 with extra testing on top if you really need the extra savings.
So if you want the full optimization, again, this will be covered in the follow-up video where I will try out Tom's llama.cpp fork. And the set of commands, et cetera, will be in the repo shared in the video description. So in the next video, I will dive into the llama.cpp side of the story. And we'll see we can actually do a little more. OK, well, that's it for today. I hope you enjoyed the video. Share your numbers in the comments. Really curious to see what you're getting on your own models. Until next time, my friends, you know what to do.

Tags

AI, Machine Learning, Technology
