Homunculus 12B and GLM-4-32B-Base-32K: two new Arcee AI research-oriented models
July 03, 2025
In this video, I introduce two new research-oriented models that Arcee AI recently released on Hugging Face.
Homunculus is a 12 billion-parameter instruction model distilled from Qwen3-235B onto the Mistral-Nemo backbone. It was purpose-built to preserve Qwen’s two-mode interaction style—/think (deliberate chain-of-thought) and /nothink (concise answers)—while running on a single consumer GPU, and even on CPU as demonstrated in the video.
GLM-4-32B-Base-32K is an enhanced version of THUDM's GLM-4-32B-Base-0414, specifically engineered to offer robust performance over an extended context window. While the original model's capabilities degraded after 8,192 tokens, this version maintains strong performance up to a 32,000-token context, making it ideal for tasks requiring long-context understanding and processing.
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. You can also follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️
** Homunculus
- https://huggingface.co/arcee-ai/Homunculus
- https://huggingface.co/arcee-ai/Homunculus-GGUF
bin/llama-cli -m ~/models/homunculus/Homunculus-Q4_K_M.gguf --color -c 65535
"Looking at multi-head attention, group-query attention, multi-query attention, and multi-head latent attention, which method would optimize inference latency for a small language model with 32 attention layers running on a 64-core Intel CPU?"
** GLM-4-32B-Base-32K
- https://huggingface.co/arcee-ai/GLM-4-32B-Base-32K
- https://huggingface.co/bartowski/arcee-ai_GLM-4-32B-Base-32K-GGUF
- https://www.arcee.ai/blog/extending-afm-4-5b-to-64k-context-length
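To get a feel for the extended context window, you can run the GGUF with a 32K context and feed it a long document that ends with a question, similar to the needle-in-a-haystack evaluation discussed in the transcript below. This is a hedged sketch: the local path and quantization filename are assumptions (use whichever quant you downloaded from the repo above), and since this is a base model, it performs plain text completion rather than chat.

# Long-context completion with a 32K window (base model: no chat template, plain completion)
# long_report.txt is a placeholder for your own long document ending with the question to answer
bin/llama-cli -m ~/models/glm-4-32b-base-32k/arcee-ai_GLM-4-32B-Base-32K-Q4_K_M.gguf \
  --color -c 32768 -n 256 -f long_report.txt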
⭐️⭐️⭐️ While you're here, I’ve got a great deal for you! If you care about your online security, you need Proton Pass — the ultra-secure password manager from the creators of Proton Mail. GET 60% OFF at https://go.getproton.me/aff_c?offer_id=42&aff_id=13055&url_id=994 ⭐️⭐️⭐️
Transcript
Hi everybody, this is Julien from Arcee. In this video, we're going to discuss two new research-oriented open models that we recently released on Hugging Face. The first one is called Homunculus, a strange name that I will explain. Homunculus is a distillation of Qwen3-235B into a Mistral Nemo 12B architecture. The second model is a variant of the GLM-4-32B model that we built while working on our first foundation model, AFM-4.5B. We used this variant to validate the context extension work that we did on AFM. Let's dive into this.
As usual, you will find all the links in the video description. Let's start with Homunculus. First, we should explain what this strange word means. Homunculus is a tiny human being created through artificial means. This word comes from alchemy. It's a fitting name for this particular model because, as we can see, it was distilled from a very large model, Qwen3, 235 billion parameters, into Mistral Nemo, which is a 12B model. So, about a 20x size reduction through distillation.
What's really interesting about this model is that it preserves the reasoning and interaction style of Qwen. You can prompt it with /think if you want to use reasoning mode, or /nothink if you just want straight answers. Because the model is only 12B, it's definitely small enough to run anywhere, and in fact, I'm going to run it on my laptop, and you will see it works very well.
Let's start with a straight question first. So, no thinking mode. We're going to ask: "Looking at multi-head attention, group query attention, multi-query attention, multi-head latent attention, which method would optimize inference latency for a small language model, 32 attention layers running on a 64-core Intel CPU?" Okay, let's give it a shot.
Okay, so we can see there's no thinking. We get a direct, knowledge-based answer: a nice little description of the four methods and a nice little table. We're not going to read through all of it, but we can see we got the expected behavior: a fast answer with some good details.
Now, let's ask the same question but with the thinking mode on. And off it goes. I suspect this will take a while, so I might speed it up or edit it. We'll see what works best. Okay, so after a little while and thousands of tokens, no doubt, we have an answer: group-query attention, which makes sense. This ran for 251 seconds, so a little more than four minutes, and we generated 6,000 tokens. Definitely a deeper and richer answer, and we can see the model working through the complexity math to figure out which method would be optimal. Actually, I think the content of the thinking process is probably more valuable than the answer itself because it helps you discover and understand the problem from different angles. So, I like these reasoning models a lot.
That's Homunculus 12B: it runs very nicely on your local machine or on a small GPU, and it definitely packs a punch thanks to distillation.
Let's switch to the GLM model. The second model we released improves on GLM-4-32B, which is a pretty popular model. Technically, this model has a 32K context size, but experiments show that it degrades very rapidly after 8K. While working on our first foundation model, AFM-4.5B, our team worked on context extension to give AFM a 64K context, and they were successful. To validate their approach, and I highly encourage you to read this amazing blog post which shares a lot of the recipes we used, we wanted to replicate the same technique on a different and larger model, to figure out whether it was just a lucky accident or whether it also worked for larger models. That's what the team did, and the result is this model: GLM-4-32B-Base-32K, the same architecture, but run through the context extension process.
It's a long blog post, and we're not going to look at all of it. You can go and try the model and see how this Arcee variant works better with larger context sizes. In the chart, the x-axis is context length, and the y-axis is the needle-in-a-haystack score: finding a short piece of information inside a very large context, similar to a RAG scenario. Up to 8K, all models do very well. Past 8K, the original GLM model drops very rapidly: at 16K it only finds the item about 50% of the time, and past 24K it fails completely. We can see that the model our team improved is doing much, much better: it still has a 60% to 70% chance of finding the data you need in a 32K context. This validates the approach we used for context extension, and again, this is a really good post by Charles, highly recommended.
That's what I wanted to tell you today. Two cool open-source models: one with a focus on thinking and distillation, and the other, a nice experiment in context extension, and overall, a very useful model to consider for your applications and projects. All links are in the video description. Thank you so much for watching. Until next time, keep rocking!
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.