Hi everybody, this is Julien from Arcee. In this video, I'd like to introduce a new Arcee model called Blitz. Blitz is an open-source model based on Mistral Small 24B. Thanks to additional training based on distillation from DeepSeek, Blitz significantly outperforms its base model. We're going to take a quick look at the model itself and then deploy it on Amazon SageMaker and run some tests. Let's get started.
Blitz was released yesterday, and you can read about it in this blog post. As usual, all the links will be in the video description. Blitz is based on Mistral Small 24B, and we trained it further on tokens distilled from DeepSeek V3. Looking at the benchmarks, we can see how Blitz significantly improves on its base model, particularly on math benchmarks, where Blitz is definitely a much better model than Mistral Small 3. You'll also see a nice bump on MMLU-Pro. So go and download the model; it's available on Hugging Face, and you can run your own tests and make up your own mind.
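If you'd like to kick the tires locally before deploying anything, here's a minimal sketch using the transformers library. The Hugging Face model id (arcee-ai/Arcee-Blitz) and the generation settings are assumptions on my part, so double-check the model card.

```python
# Minimal local test with transformers; model id assumed to be arcee-ai/Arcee-Blitz.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arcee-ai/Arcee-Blitz"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Suggest three names for a pet food store."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```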
Now let's deploy the model on SageMaker and run some tests. As usual, I'm going to download the model from Hugging Face and use the AWS LMI container to deploy it on a SageMaker GPU instance. We need to import our dependencies. Blitz is still a fairly small model, only 24B, and I like to use cost-efficient configurations, so I'm going to stick with the g6e instances, which are based on NVIDIA L40S GPUs. The model is too large to fit on a single GPU, so I'm going to use ml.g6e.12xlarge, which has four GPUs, more than I need. If you wanted to deploy this on EC2, ECS, or EKS, maybe with vLLM, you could run it on just two GPUs, but we don't get that option in SageMaker. So let's use the ml.g6e.12xlarge instance and create the endpoint. We use the LMI container, which is based on DJL Serving, which is why you see DJLModel here.
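Here's a rough sketch of what that deployment looks like with the SageMaker Python SDK. The model id, the environment keys, and the exact DJLModel constructor arguments are assumptions based on my setup, so adapt them to yours.

```python
import sagemaker
from sagemaker.djl_inference import DJLModel

role = sagemaker.get_execution_role()

# Point the LMI (DJL Serving) container at the Hugging Face model;
# model id and environment keys are assumptions, adjust to your setup.
model = DJLModel(
    model_id="arcee-ai/Arcee-Blitz",
    role=role,
    env={
        "TENSOR_PARALLEL_DEGREE": "4",  # shard the model across the four L40S GPUs
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.12xlarge",
)
```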
After a few minutes, the instance is up, and we can start running some tests. Let's run my usual pet food store name prompt. Here we're running synchronous inference, generating the full completion, and printing out the result. As usual, we see the OpenAI response format, and if we render the response as Markdown, we see some good names. Bark Bistro, that's not bad.
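In code, that synchronous call looks roughly like this. The payload follows the OpenAI chat-completion format the LMI container accepts; the exact keys and response shape depend on the container version, so treat them as assumptions.

```python
from IPython.display import Markdown, display

prompt = "Suggest five fun names for a pet food store, with a one-line tagline for each."

# OpenAI-style chat payload; the container returns a chat-completion-shaped response.
response = predictor.predict({
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": 512,
    "temperature": 0.7,
})

# The generated text lives in choices[0].message.content, OpenAI style.
display(Markdown(response["choices"][0]["message"]["content"]))
```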
Now let's look at slightly more interesting examples and run inference. I want the model to write a marketing pitch for a new SaaS AI platform called Arcee Maestro, as a nice email I could send to customers. We can see that the model is doing well, adding emojis, which we could get rid of if we wanted. The generation speed is really nice because we're leveraging those four GPUs, and for a small model like this, inference is quite fast. Let's ask a technical question about transformer models, and again, we see a well-structured, nicely detailed response. This model is really suited for general-purpose tasks, making it a drop-in replacement for its base model. At 24 billion parameters, it's still very cost-efficient, and you can certainly get a lot done with this particular model.
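If you'd rather see tokens as they are generated instead of waiting for the full completion, here's a hedged sketch using the SageMaker runtime streaming API. The "stream" flag and the chunk format depend on the LMI container version, so this is an assumption, not the exact code from the video.

```python
import json
import boto3

smr = boto3.client("sagemaker-runtime")

payload = {
    "messages": [{"role": "user", "content": "Write a short marketing email introducing Arcee Maestro, a new SaaS AI platform."}],
    "max_tokens": 1024,
    "stream": True,  # assumption: the LMI container streams token chunks when this flag is set
}

response = smr.invoke_endpoint_with_response_stream(
    EndpointName=predictor.endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)

# Each event carries a chunk of bytes; the exact chunk format varies by container version,
# so here we simply print the raw text as it arrives.
for event in response["Body"]:
    chunk = event.get("PayloadPart", {}).get("Bytes")
    if chunk:
        print(chunk.decode("utf-8"), end="", flush=True)
```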
Let's try the motorcycle dealership email. We pass in some information about the customer and check whether the model actually uses it. This is a nice email. Of course, you could keep tweaking the prompts to get a general sense of the model's writing skills.
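A minimal sketch of how the customer details could be templated into the prompt, with a hypothetical customer record:

```python
# Hypothetical customer record; the details are interpolated into the prompt
# so we can check whether the model actually uses them.
customer = {"name": "Alex", "bike": "2021 adventure tourer", "last_service": "14 months ago"}

prompt = (
    f"Write a friendly email to {customer['name']}, who rides a {customer['bike']} "
    f"and last came in for service {customer['last_service']}. "
    "Invite them to book a maintenance appointment this month."
)

response = predictor.predict({"messages": [{"role": "user", "content": prompt}], "max_tokens": 512})
print(response["choices"][0]["message"]["content"])
```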
Now let's try something completely different with a new prompt. I took some code from llama.cpp, one of the files targeting ARM CPUs that implements matrix operations with 4-bit quantized kernels. This is pretty hairy code, not your typical Python notebook kind of thing. It has macros and lots of functions that are not included, so a lot of context is missing. The prompt is: in the code below, can you analyze how NEON instructions are used for vectorization? NEON is ARM's SIMD instruction set for vector acceleration. Write a clear and compact summary. Let's run this.
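The prompt is built by pasting the whole source file into the request, along these lines; the file name below is illustrative, not the actual llama.cpp path.

```python
# Read a local copy of the llama.cpp ARM kernel file; the file name here is illustrative.
with open("llamacpp_arm_kernels.c") as f:
    source_code = f.read()

prompt = (
    "In the code below, can you analyze how NEON instructions are used for vectorization? "
    "Write a clear and compact summary.\n\n" + source_code
)

response = predictor.predict({
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": 1024,
})
print(response["choices"][0]["message"]["content"])
```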
That's a good summary. The code employs NEON SIMD instructions to perform operations on multiple data points at once, which speeds up matrix multiplication. The model picks out the dot product function (and inference is mostly dot products), the data types, the data loading, and so on. So the answer is pretty satisfactory; it analyzed that code quite nicely, even though it's fairly unusual code.
Looks like a very nice, cost-efficient, general-purpose model outperforming its base model. Go and download it from Hugging Face, give it a try, and let us know. Ask your questions in the comments. And until next time, which is really just a few minutes from now, keep rocking.