Hi, everybody. This is Julien from Arcee. A couple of weeks ago, we released our first foundation model, AFM-4.5B: the Arcee Foundation Model, with 4.5 billion parameters. In this video, I'm going to demo the preview version of the model, which is available on Together AI. I'll run some prompts across different domains and ask DeepSeek R1 to judge the quality of the answers. This should be fun. Let's go.
If you want to quickly try AFM-4.5B, the easiest way is just to go to our website, arcee.ai, and on the homepage you'll find a "Chat with AFM" link that will take you to the playground. You can just go and ask important questions like this. Okay, so you can quickly try it out. If you want to work with it in a slightly more interesting way, which is what we're going to do in the demo, you can work with the model on Together AI. They have a playground as well, and of course you can use the API, and that's what we're going to do.
Before we dive into the code, I just want to point out the different blog posts we wrote about AFM. There's a high-level post announcing the AFM family. This model is actually the first, and others will follow—smaller, bigger, domain-specific, we'll see. You can also learn about the value proposition of the AFM models, why we built them, what customers have been asking us, and why we think this is the right way forward. There's also a deep dive post, obviously going deeper into the building process, the data work that we did, and model merging, et cetera. And last but not least, there's a really cool post about the work the team did on extending context length to 64K using, of course, model merging. You can read all about it. As usual, I'll put all the links in the video description. But now, let's switch to the notebook, run some prompts, and see what DeepSeek thinks.
In this notebook, I'm going to run some prompts for knowledge questions, creative writing, and domain-specific questions in healthcare, finance, etc. In the right corner, we have AFM, 4.5 billion parameters. In the left corner, we have DeepSeek R1, a 671-billion-parameter model. So AFM is roughly 150 times smaller. Let's find out how it holds up.
So we'll use the two models on Together. We need the SDK, the key, and a streaming function. Okay, you can go through all this stuff. So let's start with a couple of knowledge questions. You can add your own. I'm going to run the first one here: Compare and contrast the first and second industrial revolutions, analyzing their distinct technological innovations and socioeconomic impacts across different regions of the world. It feels like you're back to school, right? Okay, let's run this thing. So we can see how fast AFM is. Well, that's what you get with a 4.5 billion parameter model. Okay.
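The setup mentioned above (SDK, key, streaming function) can be sketched roughly like this. This is a minimal sketch assuming the Together Python SDK (`together` package) and its OpenAI-style chat interface; the model ID shown is a hypothetical placeholder, not necessarily the exact one used in the notebook.

```python
# Sketch of the notebook setup: a streaming chat-completion helper.
# Assumes the "together" package (pip install together) and an API key
# in the TOGETHER_API_KEY environment variable. The model ID below is
# a placeholder for the AFM preview on Together.

def build_messages(prompt, history=None):
    """Assemble an OpenAI-style message list, optionally with prior turns."""
    messages = list(history or [])
    messages.append({"role": "user", "content": prompt})
    return messages

def stream_completion(client, model, prompt):
    """Stream a chat completion, printing tokens as they arrive."""
    chunks = []
    stream = client.chat.completions.create(
        model=model,
        messages=build_messages(prompt),
        stream=True,
    )
    for event in stream:
        delta = event.choices[0].delta.content or ""
        print(delta, end="", flush=True)
        chunks.append(delta)
    return "".join(chunks)

# Usage (requires network and a valid key, so left commented out):
# from together import Together
# client = Together()  # reads TOGETHER_API_KEY from the environment
# stream_completion(client, "arcee-ai/AFM-4.5B-Preview", "Hello!")
```

Streaming is what makes the speed difference so visible in the demo: with a 4.5B model, tokens start arriving almost immediately.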
So I will leave all the output in the notebook. I won't clean it up, and you can go and read everything if you want. So what does DeepSeek think? This is a high-quality answer. It accurately identifies key technological innovations. It is complete. The analysis is concise, structured, and avoids significant issues. Okay, sounds like a thumbs up to me. Let's try the second one. The second one is about photosynthesis. Okay, so accuracy: the answer is fundamentally correct. Completeness: the answer covers the key differences, but there's a small omission. The explanation is very clear and well-structured. Just a tiny omission here, not a big problem. Try your own questions; see what works.
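The judging step can be sketched as a rubric prompt sent to the larger model. The criteria below (accuracy, completeness, quality) mirror what DeepSeek reports throughout the notebook; the exact wording and function name are my own, not the prompt actually used in the demo.

```python
# Sketch of an LLM-as-judge prompt. The rubric mirrors the accuracy /
# completeness / quality verdicts DeepSeek gives in the notebook; the
# template text itself is illustrative.

JUDGE_TEMPLATE = """You are an impartial judge. Evaluate the answer below.

Question:
{question}

Answer:
{answer}

Rate the answer on three criteria:
1. Accuracy: is the information correct?
2. Completeness: does it address every part of the question?
3. Quality: is it clear, well-structured, and concise?

Give a short verdict for each criterion."""

def build_judge_prompt(question, answer):
    """Fill the rubric template with a question/answer pair."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)

# The resulting prompt would then be sent to the judge model
# (DeepSeek R1 on Together) as a regular, non-streaming chat completion.
```

Using a much larger model as the judge is the whole point of the matchup: a 671B reasoning model grading a 4.5B model's homework.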
Okay, creative writing: Write a short story that begins at... Let's give that a shot. Great writer, by the way. I recommend him. Okay, so we have a story here. Let's wait for the judge. The answer demonstrates high quality with vivid Borgesian imagery and engaging prose. It is accurate to the prompt, etc. The story is complete. Wow, well done.
Okay, let's try another creative writing prompt: Write a story that integrates Victorian Gothic, cyberpunk, and magical realism. And we have a topic. Okay, man, this is hard. Let's see how we do here. The writing is vivid and evocative, successfully blending all three styles. The core prompt is directly addressed, and the narrative fulfills all requirements. Okay, so it looks like AFM is a really good writer. That's nice. For a small model, this is awesome.
Okay, how about going into domain-specific prompts? Let's do healthcare: Compare and contrast the mechanism of action between CRISPR-Cas9 and zinc finger nucleases. Okay, we get an answer. Thank God the LLM is judging, because I wouldn't know. Okay, the answer accurately describes the core mechanisms and key differences. Completeness: effectively addresses all requested comparison points. Quality: the information is well-structured, clear, and concise. Okay, so I have no idea what this is about, but DeepSeek thinks this is a good answer. This is a crazy question.
Okay, let's try finance: The Black-Scholes-Merton model is used to price options. Does it work during extreme market volatility? Well, I guess we found out in 2008, right? Okay. So accuracy sounds good. There's a tiny imprecision, and that's the kind of thing fine-tuning would get rid of, obviously. Completeness: covers key limitations, proposes enhancements with examples. Quality: well-structured, flows logically from limitations to solutions to challenges. Okay, so again, that's a strong answer on a really complex question.
Okay, let's try tech: The architectural trade-offs between transformer-based language models and RNNs for real-time natural language processing on resource-constrained edge devices. Ah, CPU inference. Woo-hoo. Okay. So AFM, come on. Prove yourself. Okay, the information is technically accurate, and completeness is impressive. The answer demonstrates high quality and covers all critical dimensions of the question. Not bad at all.
How about education? Evaluate the neuroscientific evidence supporting spaced repetition and interleaving in knowledge retention. Okay, that's a long answer. Quality: high, well-structured, clear, logically organized. Accuracy: good. Completeness: good. Okay, AFM knows more about education than I do. Not surprising.
Let's try a multi-turn conversation. We'll do some quantum stuff. Why not? And follow up with questions: Explain the implications of the holographic principle for quantum gravity. Yeah, sounds like something from Interstellar. Come on, DeepSeek. Question one, what do you think? Okay, accurate, complete, clear, and concise. Again, I'll leave everything in the notebook. You can read through all of it.
Follow-up question: What are the limitations of that principle in addressing quantum entanglement across the event horizon? I feel stupid reading this. Sounds like a science fiction movie. Okay, the answer demonstrated high quality by accurately identifying key limitations, etc. It is comprehensive and logically structured. Good. So the follow-up question was good. Let's try a final follow-up about the tensions between general relativity and quantum mechanics. Einstein versus Niels Bohr. Great story. Okay, this seems like a high-quality answer for its purpose. Comprehensive, high-quality, accurate, complete. All right. So it looks really good to me.
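Multi-turn conversations like this one work by resending the full message history on every turn, so each follow-up stays in context. Here's a minimal sketch of that pattern, with the actual model call abstracted behind a callable (so the structure is clear without assuming a particular SDK).

```python
# Minimal multi-turn chat sketch: each follow-up resends the full history,
# which is how the quantum-gravity follow-ups above stay in context.
# `complete` stands in for any chat-completion call (e.g. the Together SDK).

class Conversation:
    def __init__(self, complete):
        self.complete = complete  # callable: list of messages -> reply text
        self.messages = []

    def ask(self, prompt):
        """Append a user turn, send the full history, record the reply."""
        self.messages.append({"role": "user", "content": prompt})
        reply = self.complete(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply

# Usage with a stub in place of a real model call:
chat = Conversation(lambda msgs: f"(reply to {len(msgs)} messages)")
chat.ask("Explain the implications of the holographic principle.")
chat.ask("What are its limitations across the event horizon?")
```

With a real backend, `complete` would wrap a chat-completions call and return the assistant's text; everything else stays the same.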
And before you ask, all those prompts were actually generated with an LLM. So they're not curated examples. You can try your own. Some will be better, some will be worse. But I find it quite impressive that those 10 or so examples are all very positively judged by DeepSeek. Again, we're only talking about a 4.5 billion parameter model, and it's the preview version. So there you go. The model will be on Hugging Face in a few weeks. It will be available for non-commercial use. If you want commercial use, then please get in touch with us, and we'll tell you all about the commercial license. And if you're interested in fine-tuning the model for specific domains to make AFM even better than what you saw here on a particular slice of knowledge, again, talk to us. I'm easy to find, or you can just go through our website and contact us.
OK, that's it for today. Welcome, AFM. You look like a great model. Can't wait to keep testing you. And thank you so much for watching once again. I hope you enjoyed this. Until next time, you know what to do. Keep rocking.
Tags
AI foundation model, Model demonstration, Deep learning, Natural language processing, Technical evaluation