Hi everybody, this is Julien from Arcee. Yes, you heard that right. I left Hugging Face a few days ago and I'm now working with Arcee.ai. As you can imagine, there's a ton of content coming your way on open-source models, small language models, and cloud stuff. Yes, still a fair chunk of AWS, no worries there. And obviously, I will show you how you can use Arcee Cloud, the Arcee SDKs, models, and datasets to build enterprise AI solutions with small open-source models: very high-quality, cost-effective, and secure models that you can host on your infrastructure of choice. So, tons of stuff coming your way, but I couldn't wait to do this first video. Just yesterday, Arcee released a new model called Scribe, which is based on InternLM; it's actually a merge of InternLM with additional fine-tuning on top. It has state-of-the-art performance on creative writing, a topic I'm quite interested in for obvious reasons, so I thought, let's show you this model. We'll start from the model on Hugging Face and, for a change, I'm not going to use any cloud services. I will download the model and we'll run some local inference on my Mac with llama.cpp, and you'll see how fast that goes. Pretty cool demo, let's get started.
Before we dive into the demo, here's the blog post that my colleague Lucas published yesterday on Arcee Scribe. I will, of course, put the link in the video description. I would encourage you to go and read it; it has some really impressive examples and benchmarks showing how well the model is doing. So go check it out, it's a good place to start. Once you've done that, head over to the Hugging Face Hub and go to the organization page for Arcee. That's where you will find learning resources, papers, models, spaces, and datasets that Arcee is publishing to help the community and enterprise users build cool stuff. We can see Scribe was published just a day ago, not even a day ago. There's the Scribe model, and you can see some information on the model page, as we know it. This is a 7.7-billion-parameter model based on InternLM2, which we have here. InternLM has a ton of good models, so I would encourage you to check those out as well. Scribe is available as your good old PyTorch model, so feel free to use that if you'd like. But I'm going to go directly with the GGUF versions, because I'm going to run them directly on my machine with llama.cpp. You could grab the PyTorch model and run the GGUF conversion on it yourself if you wanted to, but you can save time and grab one of those GGUF models. We have quite a few here, including a bunch of quantized versions. Let's start with the 8-bit version. You don't need to clone the repo with all those big files; you can just download the file directly from here, which I have done. I've also grabbed a 4-bit model to see if it's faster, better, or worse. We'll run both. Go and download that stuff. I've done it already, and now we can switch to a terminal. I've got my models right there. The 8-bit version is about 7.7 gigs, as you would expect, and the 4-bit is a little more than half that. llama.cpp is already installed here. It's super straightforward, so I won't cover the installation steps. I just ran the vanilla instructions available in the GitHub repo, and it works. If you're a Mac user, you can also install llama.cpp directly with Homebrew, but here I just cloned the repo and compiled it. Let's go and run something now.
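If you want to reproduce this setup, here is a minimal sketch of the build and download steps. The repository and file names below are assumptions on my part, so double-check them on the Arcee organization page; the huggingface-cli tool ships with the huggingface_hub Python package.

# Build llama.cpp from source (on macOS you can also do: brew install llama.cpp)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Download the GGUF files directly, no need to clone the whole model repo.
# Repo and file names are illustrative; check the model page for the exact ones.
pip install huggingface_hub
huggingface-cli download arcee-ai/Arcee-Scribe-GGUF Arcee-Scribe-Q8_0.gguf --local-dir ./models
huggingface-cli download arcee-ai/Arcee-Scribe-GGUF Arcee-Scribe-Q4_K_M.gguf --local-dir ./models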
To test the model, I have a first prompt: "Please write a marketing speech for a new SaaS AI platform called Arcee Cloud. We will send this speech by email to business and technical decision-makers, so make it sound exciting and convincing. The contact email is sales@arcee.ai. Feel free to use emojis as appropriate. Arcee Cloud makes it simple for enterprise users to tailor open-source small language models to their own domain knowledge in order to build high-quality, cost-effective, and secure solutions." Just a basic prompt to get things going. Let's see how the model does. At the top, you see the usage on my machine. Let's run this thing, starting with the 8-bit version. GPU usage is shooting up, and we'll start seeing generation once the model has been loaded. And off it goes. You can see how fast this is. Well done, llama.cpp, and well done, Apple; the M3 hardware is pretty good. It's expensive, but it is pretty good. Here's our generated email. It's got emojis for sure. I'll leave it on screen so you can read it. It looks quite good, certainly better than some of those marketing emails I get daily from random companies. With a little bit of prompting, you can tweak this in any way you like. This is an interesting model. A basic prompt turns out a pretty good email, and I'm running it locally. Outside of the cost of the laptop itself, the cost is zero. No MLOps, no deployment, no nothing. Just run it, and you can run it all day. Or, of course, run a local inference server and plug it into a proper web app. That's pretty amazing. This is why I'm so excited about small language models: the quality you get from a 7-billion-parameter model, and the fact that I can run it that fast on my local machine without any fuss.
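For reference, the command I'm running looks roughly like this. Depending on your llama.cpp version the binary may be called ./llama-cli or ./main, and the model file name is whatever you downloaded, so treat this as a sketch rather than the exact command.

# -ngl 99 offloads all layers to the GPU (Metal on Apple silicon),
# -n caps the number of generated tokens.
# Paste the full prompt from above after -p; it is shortened here for readability.
./llama-cli -m ./models/Arcee-Scribe-Q8_0.gguf -ngl 99 -n 1024 \
  -p "Please write a marketing speech for a new SaaS AI platform called Arcee Cloud. ..."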
Now, let's try another prompt: "Write a fictional technical description of a conversation between Alice, a senior MLE working at a Fortune 500 company in the telco industry, and Bob, an Arcee pre-sales engineer. Alice is trying to figure out if Arcee Cloud is a good fit to help her build a customer support chatbot to offload their existing call centers. Bob should focus on understanding the customer's pain points and see how Arcee can help answer them. Bob should explain how continuous pre-training, model merging, and instruction fine-tuning can help Alice tailor her models on company and customer data with a high level of accuracy." Let's see how the model handles this. Still pretty fast, and we start seeing the conversation. I'll put that in the video description as well. It's a pretty interesting conversation. Certainly, you would want to tweak it, but as a baseline, I think it's more than fine. Let's try the 4-bit model. This is even smaller and would run on an even smaller accelerator. It looks even faster to me. It might generate a slightly different story, but that's not bad. Let's run the technical discussion again. It's still doing a good job. Even though its weights are quantized to only 4 bits, it's faster and still pretty good. You need to experiment and find the right size for your use case.
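If you'd rather plug the model into an application instead of the CLI, llama.cpp also ships a small HTTP server with an OpenAI-compatible endpoint. This is a sketch under the same assumptions about file names; the server binary is called llama-server in recent versions (./server in older ones).

# Serve the 4-bit model locally; port 8080 is the default.
./llama-server -m ./models/Arcee-Scribe-Q4_K_M.gguf -ngl 99 --port 8080

# From another terminal (or your web app), query the OpenAI-compatible endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a short product pitch for Arcee Cloud."}], "max_tokens": 512}'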
That's pretty much what I wanted to tell you. If you've been wondering why people keep getting excited about small language models (SLMs), it's exactly this: taking a really good open-source model and focusing it on a particular task. Here, it's creative writing, and you could specialize it further. These are already small models that you can run on a mid-range GPU, and as you can see, if you quantize them to 8-bit or 4-bit, you keep shrinking them and can start running them on local machines (there's a rough sketch of that workflow after this wrap-up). I guess this would still run on a CPU as well. This is super exciting to me, and I think it's a great approach. The combination of high-quality open-source models, the ability to make them better and more knowledgeable about your particular domain, and then shrink them, quantize them, and run them on very little infrastructure in a simple, repeatable way is game-changing. This is what Arcee is building. Go check out our website and our platform, ask for a demo, sign up, read the docs, ask me. That's what I'm here for. Again, that's what I wanted to show you today. There are a million more things coming, and I'm very excited about Arcee. Until next time, my friends, you know what you need to do. Keep rocking.
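As promised, here is a rough sketch of shrinking a model yourself with llama.cpp, in case you'd rather start from the PyTorch weights than download the pre-quantized GGUF files. Script and binary names vary a bit between llama.cpp versions, and the paths are illustrative.

# Convert the Hugging Face checkpoint to a full-precision GGUF file
# (the script lives in the llama.cpp repo; older versions name it convert-hf-to-gguf.py).
pip install -r requirements.txt
python convert_hf_to_gguf.py /path/to/arcee-scribe --outfile arcee-scribe-f16.gguf

# Quantize to 8-bit or 4-bit to shrink the file and speed up local inference
# (the binary is named quantize in older llama.cpp builds).
./llama-quantize arcee-scribe-f16.gguf arcee-scribe-Q8_0.gguf Q8_0
./llama-quantize arcee-scribe-f16.gguf arcee-scribe-Q4_K_M.gguf Q4_K_M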
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.