Deep Dive: How Three MoE Reasoning Models Actually Work — Trinity, DeepSeek R1, Kimi K2 - Part 2
May 4, 2026
Three frontier open-weight MoE reasoning models — Trinity Large Thinking (~400B), DeepSeek R1 (671B), and Kimi K2 Thinking (1T) — are compared side by side. Architecture, training, and post-training, explained from first principles.
⭐️⭐️⭐️ More content on Substack at https://www.airealist.ai ⭐️⭐️⭐️
In Part 1 (https://youtu.be/2uQQ8nKNq1U), I broke down how these three models are actually built — not benchmarks, not vibes, but the engineering decisions and why they matter.
Now the question that actually matters: which one should you use? Benchmarks, costs, deployment, and practical details.
*** Models
Trinity Large Thinking: ~400B total, ~13B active, 256 experts, 512K context, Apache 2.0 https://huggingface.co/arcee-ai/Trinity-Large-Thinking NVFP4: https://huggingface.co/arcee-ai/Trinity-Large-Thinking-NVFP4
DeepSeek R1: 671B total, ~37B active, 256 experts, 128K context, MIT https://huggingface.co/deepseek-ai/DeepSeek-R1
Kimi K2 Thinking: 1T total, ~32B active, 384 experts, 256K context, Modified MIT https://huggingface.co/moonshotai/Kimi-K2-Instruct
*** Papers & blogs
Trinity architecture blog: https://www.arcee.ai/blog/trinity-large
Trinity tech report: https://arxiv.org/abs/2602.17004
OpenRouter (all three): https://openrouter.ai/
#llm #moe #deepseek #reasoning #architecture #training #openweight #inference #arcee #kimi #trinity #mixtureofexperts
Transcript
In part one, we looked at how Trinity Large Thinking, DeepSeek R1, and Kimi K2 Thinking actually work under the hood. In part two, we're going to answer the question that actually matters: which one should you use? So we're going to talk about benchmarks, cost, deployment, and practical details. Let's get started. All three models are sparse mixture-of-experts models with chain-of-thought reasoning.
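To make "sparse" a bit more concrete, here is a minimal sketch that turns the routed expert counts from the recap in this transcript (4 of 256, 8 of 256, 8 of 384) into a fraction of experts used per token. The dictionary layout and function names are just for illustration, and the calculation deliberately ignores shared experts, attention, and embeddings, so it is not the same thing as the active-parameter ratio.

```python
# Routed experts per token and total experts, as recapped in the video.
EXPERTS = {
    "Trinity Large Thinking": {"active": 4, "total": 256},
    "DeepSeek R1": {"active": 8, "total": 256},
    "Kimi K2 Thinking": {"active": 8, "total": 384},
}


def expert_fraction(active: int, total: int) -> float:
    """Fraction of routed experts that process any single token."""
    return active / total


def expert_fractions() -> dict[str, float]:
    """Expert fraction for every model in EXPERTS."""
    return {name: expert_fraction(**cfg) for name, cfg in EXPERTS.items()}


if __name__ == "__main__":
    # Print from most sparse (smallest fraction) to least sparse.
    for name, frac in sorted(expert_fractions().items(), key=lambda kv: kv[1]):
        print(f"{name}: {frac:.2%} of routed experts active per token")
```

By this measure, a smaller fraction means a sparser model, which is the sense in which Trinity is described as the most sparse of the three.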
They share more architectural DNA than you would expect, but they made very different trade-offs in terms of sparsity, scale, and attention. So just as a reminder, Trinity is the most sparse, with four experts active out of 256 and only 13 billion parameters active out of about 400 billion. DeepSeek is the middle ground: 8 experts out of 256, with 37 billion parameters active out of 671B. And Kimi is the largest model: 8 experts out of 384, with 32 billion parameters active out of a full trillion. Okay, let's look at the benchmarks. So what do the benchmarks tell us?
We can see in general that Trinity is delivering very strongly on agentic benchmarks. This is exactly what the model was trained for. In the first video, we saw that even though the models share common architecture features, the way they are post-trained really decides what they're optimized for, and in the case of Trinity, it is really agentic applications. So we see the really strong scores on Tau2 Telecom, on PinchBench, Tau2 Airline, and LiveCodeBench, where Trinity scores very well for an open model. And so all these benchmarks reflect the agentic performance of Trinity Large Thinking. We can see Kimi K2 is a good model across the board. It's pretty balanced, and the latest model from Moonshot, Kimi K2.5, is even stronger than that. DeepSeek R1 is a little behind, but hey, it was the first of that generation of models.
And the value now lies in the many distilled versions that we can find on Hugging Face, like R1 Distill Qwen 32B, which you can run on a single NVIDIA RTX 4090. So yeah, DeepSeek R1 has fallen behind, but it has spawned a lot of smaller, super powerful models. We see the numbers for Opus 4.5 for reference, and generally, yes, the closed frontier models are still ahead, but we can see the gap is closing, and it will certainly keep closing as we see more releases.
So now let's talk about inference costs and deployment. Well, this is where the rubber meets the road: we need to talk about money. So first, let's look at the API pricing. At the time of recording, all three models are available through API providers, and we can see the costs per million tokens. So DeepSeek R1 is $0.55 input, $2 output. It's widely available on Together, on AWS Bedrock, etc., etc.
So it's definitely the less expensive option for most use cases. Kimi K2 is just a little more expensive, 60 cents and $2.50 on Moonshot's platform. And it's still competitive. Kimi K2.5 is actually 60 cents and $3. The prices just change all the time.
So you may want to check the latest rates. And Trinity Large Thinking, which is available on Arcee's own inference platform and OpenRouter, is 90 cents output, which is crazy. That's not only much cheaper than the Chinese models, but it's also massively cheaper than the frontier models. So when you see Trinity Large Thinking going head to head with Opus 4.5 or maybe 4.6, it does it at a much, much lower price point. So it's definitely worth a shot, and the few percent of performance that you may lose working with Trinity compared to the frontier models might not mean anything for a lot of use cases. So clearly here we see Trinity is the most efficient. Arcee has really, really optimized the thinking token budgets to deliver a very, very cost-effective model.
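As a rough illustration of how those per-million-token prices turn into a per-request cost, here is a minimal sketch. The prices are the ones quoted in the video at recording time, the request size is a hypothetical example, and the class and function names are made up for this sketch. Only the output price is quoted for Trinity Large Thinking, so it is compared on output tokens only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Prices as quoted in the video at recording time; they change often.
PRICES = {
    "DeepSeek R1": Price(0.55, 2.00),
    "Kimi K2": Price(0.60, 2.50),
    "Kimi K2.5": Price(0.60, 3.00),
}

# Only the output price is quoted for Trinity Large Thinking.
TRINITY_OUTPUT_PER_M = 0.90


def request_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request; output_tokens includes thinking tokens."""
    return (input_tokens * price.input_per_m
            + output_tokens * price.output_per_m) / 1_000_000


def output_cost(output_per_m: float, output_tokens: int) -> float:
    """Cost in USD of the generated tokens alone."""
    return output_tokens * output_per_m / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 20,000 prompt tokens, 5,000 generated tokens.
    for name, price in PRICES.items():
        print(f"{name}: ${request_cost(price, 20_000, 5_000):.4f}")
    trinity = output_cost(TRINITY_OUTPUT_PER_M, 5_000)
    print(f"Trinity Large Thinking (output only): ${trinity:.4f}")
```

Because reasoning models bill their thinking tokens as output, the output price usually dominates the bill on hard problems, which is why the output column is the one to watch.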
So now let's talk about self-hosting. And, well, these are large models, but you may still be able to justify the hosting cost for those models. So let's look at DeepSeek R1. So R1 is probably the easiest to self-host. You can run the full 671B model on eight H100s if you use FP8 quantization.
So that's a large AWS instance. But again, the real story is the many, many distilled models that you can run on an RTX 4090, or even a large Mac: a Mac Studio with 192 gigs is going to do the job. You have even smaller versions. There's a 14B version that you can run on any laptop, and the Hugging Face community, whether through GGUF or MLX, has done a lot of work to make it possible to run DeepSeek distilled models
on local machines, right? And you may even be able to use LoRA with Unsloth for fine-tuning. And if you go down to four bits, then you can run the original model on four A100s, which is still not so expensive for a model of this quality.
So Kimi K2 is harder to self-host because it is quite a bit bigger: one trillion parameters. So full precision is going to take 16 H100s. And even with FP8 quantization, that's still going to be probably eight of those. So that's probably a little too much, and in practice, you may want to use the API. The Kimi K2.5 model is actually native INT4 quantization, with quantization-aware training, so a little smaller, but you're still looking at a substantial GPU infrastructure.
So Trinity Large is still quite big at 400 billion parameters, but because it has only 13 billion active parameters, you will get the best throughput out of it. So you will need a few GPUs, yes, but you will get higher tokens per second because we're just running fewer parameters at inference time. So that's a good one. Plus, Trinity has the 512K extended context window, which is the largest of the three models, and that should allow you to run probably larger workloads and longer conversations than with the other models. And you'll find, again, GGUF quantized versions on Hugging Face.
But now let's look at context size and what it really means for long-running, multi-turn conversations. This is not really discussed very often. Everybody talks about the context size, and that looks great, I guess, but the reality is a bit different, because the thinking tokens actually eat a lot of your context. When a reasoning model thinks, of course, we know it generates chain-of-thought tokens, and these are part of the context window even though you may not see them in the final answer. So the advertised context window is in fact quite a bit smaller once you add the thinking overhead.
So if we look at DeepSeek R1, which has 128K context, it is thinking heavy, and on multi-step complex tasks, all those thinking tokens will add up and consume a substantial portion of the context. So on hard problems and long-running discussions, the actual context you can use is quite reduced. So that's a problem. Newer R1 variants and DeepSeek V4 have introduced larger context, so you need to check for those as well.
Kimi K2 Thinking has a 131K context window and Kimi K2.5 has 256K. So here Moonshot built the models for long-context conversations from the start, and the thinking chain is a little shorter and a little more structured than for DeepSeek, meaning you're likely to consume less context for reasoning tokens.
And Trinity Large Thinking, which has a 512K extended context, the largest of the three, has been specifically optimized by Arcee to minimize the thinking token overhead. So the approach is to adjust the chain-of-thought depth to the task complexity. So simpler tasks will not generate a zillion tokens and will preserve your context window a little more. So that's an interesting technique that's quite efficient.
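The context arithmetic described here can be sketched in a few lines. This is only a back-of-the-envelope model: the per-turn token counts are hypothetical, the function names are invented for the sketch, and it assumes thinking tokens from every turn stay in the context, which depends on how your client or provider handles earlier reasoning.

```python
def tokens_per_turn(thinking: int, visible: int) -> int:
    """Tokens one turn adds to the context: hidden reasoning plus visible text."""
    return thinking + visible


def remaining_context(window: int, turns: int, thinking: int, visible: int) -> int:
    """Context left after a number of turns, never below zero."""
    used = turns * tokens_per_turn(thinking, visible)
    return max(window - used, 0)


def max_turns(window: int, thinking: int, visible: int) -> int:
    """How many full turns fit in the window before it is exhausted."""
    return window // tokens_per_turn(thinking, visible)


if __name__ == "__main__":
    # Hypothetical turn: 1,500 visible tokens (prompt + answer).
    visible = 1_500
    for window in (128_000, 256_000, 512_000):
        heavy = max_turns(window, thinking=4_000, visible=visible)
        light = max_turns(window, thinking=1_000, visible=visible)
        print(f"{window:>7} window: {heavy} turns with heavy thinking, "
              f"{light} turns with light thinking")
```

The point of the sketch is that the thinking budget per turn changes the usable conversation length as much as the advertised window does.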
And the combination of the long context window plus the thinking budget optimization is one of the things that helps Trinity achieve strong performance on agentic tasks. The takeaway is you shouldn't just compare the context sizes. You need to understand how much of that is left, so to speak, once you've removed the number of tokens that the thinking process is going to typically consume. The next item I want to discuss is licenses. This matters more than you think, because it determines what you can actually build with the model.
So DeepSeek R1 is under the MIT license, which is very permissive: commercial use, modification, redistribution, fine-tuning, distillation. All of them are allowed with no restrictions, and this is probably why we see so many distilled variants on Hugging Face. You can take DeepSeek R1 and pretty much do anything you want with it, and redistribute it, right? And the distilled models become MIT, too.
So good license, and again, probably a good reason why DeepSeek R1 has been so popular. So Kimi K2 and K2.5 are Modified MIT, still very permissive: commercial use, modification, and redistribution are all allowed. The modification touches on the acceptable use policy, which restricts harmful applications like weapons or surveillance or deception and nasty things like that. In practice, legitimate commercial use is totally okay. And, well, I guess that's a reasonable compromise.
And Trinity Large is Apache 2.0, which allows commercial use, modification, distribution, etc. ... but it's still otherwise super permissive, and there are no custom restrictions. So that's a very straightforward open source license that a lot of enterprises are comfortable with, and your legal team certainly knows about Apache 2.0. So all three are very permissive licenses, and in practice you shouldn't have any issue building commercial applications with any of those. So now we've come to the last slide.
Which one should you use? These are my honest recommendations. DeepSeek R1 has been out for a while. It's, I guess, the safe pick. You should use it when you need a proven, well-supported reasoning model with a large community ecosystem. It also works for budget-conscious teams and for people who are just starting to work with reasoning models. The MIT license means zero legal concerns, and the large number of distilled versions will help you find the sweet spot between reasoning power and cost. So you can start small with a small distilled model and scale up if you need a little more power. Deployment issues are solved. The model is supported absolutely everywhere. It's supported in every major cloud.
That's fine. If you're not sure, you can't really go wrong with DeepSeek. Kimi K2 and K2.5 are a slightly more powerful family of models with very broad capabilities: long context, document processing, multi-turn agent workflows, et cetera, et cetera. They're also quite good on multilingual applications. So today this is probably the highest composite benchmark score across open models. It is a trillion-parameter model, so you're very unlikely to self-host it, but you can use it through the Kimi API, and you should get your money's worth. So if you have slightly more complicated use cases, like document AI or complex multilingual deployments, and if you're comfortable with an external API, which you may not be for different reasons, this is a good choice.
And finally, Trinity Large Thinking is really the most efficient model you can use. So if you want the best performance-per-dollar ratio, this is it. If you have complex agentic workflows with many concurrent sessions and very, very long conversations, this is the one. If you want to self-host without massive GPU infrastructure, this is also the one. So it's efficient. It has the smallest number of active parameters, meaning the inference cost is also going to be the lowest and the throughput is going to be the highest, and you can just run this on your own platform at a reasonable cost. So this one is really best, as mentioned, for agentic AI in general, especially if you need to self-host and keep an eye on your inference budget.
So that's what I wanted to cover in those two videos. The three models represent a very nice step up: you can own the weights, and you can pay 10 to 50x less than you would pay for the closed models. So part one covered the how: the architectures, the training, etc. Part two covered, okay, what now? How do I work with those? Which one do I pick? So now you should have everything you need to make the right decision and pick the right model for your application. That's what I wanted to tell you today. I hope this was informative. If you have any questions, please ask them in the comments. And until next time, my friends, you know what to do: keep rocking.