SLM in Action: Local Inference with Arcee-Nova 72B and Ollama
July 19, 2024
In this video, you will learn about Arcee-Nova, a new state-of-the-art 72-billion parameter model created by Arcee.ai.
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️
At the time of recording, Nova sits at the #1 spot in the Hugging Face Open LLM leaderboard, outperforming all other open-source models out there.
The model is a merge of Qwen2-72B with additional training. It was built on Arcee Cloud and you can learn more at https://www.arcee.ai/product/arceecloud.
Here, I run a 3-bit quantized version with Ollama on my M3 MacBook, using a couple of creative writing and multi-step reasoning prompts. Excellent results and very decent speed for a model this large!
00:00 Introduction
00:25 Introducing Arcee-Nova
03:10 Running Arcee-Nova locally with ollama
05:25 Trying a few simple questions
07:25 Looking at the MUSR benchmark (Multi-Step Reasoning)
09:15 Trying a sample question from the MUSR dataset
11:50 Conclusion: try out the Arcee.ai platform!
* Blog post: https://blog.arcee.ai/introducing-arcee-nova/
* Model page: https://huggingface.co/arcee-ai/Arcee-Nova-GGUF
* Ollama: https://github.com/ollama/ollama
Merging the shards into a single GGUF file:
llama.cpp/llama-gguf-split --merge Arcee-Nova-Alpha-GGUF.Q3_K_S-00001-of-00008.gguf Arcee-Nova-Alpha-GGUF.Q3_K_S.gguf
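Before merging, it can save time to check that all eight shards actually made it to disk. A small sketch (shard names follow the pattern of the first shard in the merge command above):

```shell
# Sanity check before merging: verify all eight shards are present
# in the current directory and report any that are missing.
present=0; total=0
for i in $(seq -f "%05g" 1 8); do
  f="Arcee-Nova-Alpha-GGUF.Q3_K_S-${i}-of-00008.gguf"
  total=$((total+1))
  if [ -f "$f" ]; then
    present=$((present+1))
  else
    echo "missing: $f"
  fi
done
echo "$present of $total shards present"
```

If anything is missing, re-download just that shard before running the merge.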
Configuration file for ollama:
FROM ./Arcee-Nova-Alpha-GGUF.Q3_K_S.gguf
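Putting the pieces together, here is a minimal sketch of the Ollama workflow. The model name `arcee-nova` is my own choice, not something from the video; pick any name you like:

```shell
# Write the one-line Modelfile shown above, pointing at the merged GGUF.
cat > Modelfile <<'EOF'
FROM ./Arcee-Nova-Alpha-GGUF.Q3_K_S.gguf
EOF
cat Modelfile

# Then register and run the model ("arcee-nova" is an arbitrary name):
# ollama create arcee-nova -f Modelfile   # takes a few minutes
# ollama list                             # the new model should appear
# ollama run arcee-nova                   # loads the model, opens a prompt
```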
#ai #slm #llm #openai #chatgpt #opensource #huggingface
Sign up for Arcee Cloud at https://www.arcee.ai, and please follow Arcee.ai on LinkedIn to stay on top of the latest Small Language Model action! https://www.linkedin.com/company/99895334
Transcript
Hi everybody, this is Julien from Arcee. In this video, I'm happy to introduce a new state-of-the-art model called Nova. Nova has been built by Arcee starting from Qwen and improved with merging and fine-tuning. At the time of recording, it sits at the top of the Hugging Face Open LLM leaderboard. Let's get started.
This model was literally released yesterday, and here's the blog post announcing it, with a few details on the model itself. What I think is really interesting is its performance. What you see here is a screenshot of the Hugging Face Open LLM leaderboard. Here's the live version, and we can see the models at the top are still the same. These include Qwen2-72B Instruct, ARCA Mini, and Llama-3 70B. So, nothing really changed overnight. We still see the same models here. We see Arcee-Nova evaluated on the same benchmarks, such as Big Bench Hard, MATH, and MUSR (multi-step reasoning), on which it is doing quite well, MMLU, etc. If we average these scores, which is the default setting for the leaderboard, Nova is the best at the moment. That's pretty cool.
You can find the model on the Hugging Face website. You have the vanilla model, and you have GGUF versions quantized at different resolutions. I'm going to be working with one of those quantized versions. For simplicity, I will run it on my Mac. It's a big model, a 72-billion-parameter model. Even the 8-bit version is out of the question, so I've decided to go with a 3-bit version, specifically the Q3_K_S flavor. This comes in eight different chunks, each a little more than 4 gigs. You want to download all of these. Once you've downloaded the eight chunks, you can stitch them back together and run the model with Ollama. Funny enough, I couldn't get the model to work with llama.cpp this time, even though I'm running the bleeding-edge version of llama.cpp. I would expect the model to work, as Qwen is supposed to be supported in llama.cpp. So, we'll use Ollama this time.
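One convenient way to grab all eight shards at once is `huggingface-cli` from the huggingface_hub package. This is not shown in the video, so treat the repo id and include pattern as assumptions based on the model page. Dry-run sketch:

```shell
# Print the download command; uncomment the eval line to actually run it.
# Repo id and include pattern are assumptions based on the model page.
repo="arcee-ai/Arcee-Nova-GGUF"
cmd="huggingface-cli download $repo --include '*Q3_K_S*' --local-dir ."
echo "$cmd"
# eval "$cmd"   # requires: pip install -U huggingface_hub
```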
First, you should install Ollama, which is super simple. Just go to the repo, and they have instructions. Then, we need to create an entry for our model. Here's how you would merge the shards using the llama-gguf-split utility in the llama.cpp repo. Even though it's called split, it has an option called merge, where you pass the first shard and then the name of the full-size model you want to build. It runs for a minute or two, and then you get your full-size model. It's kind of big, 32 gigs, even though it is quantized to 3 bits. Next, we need to create the model in Ollama. This is the simplest thing; we just need a one-line Modelfile with the path to the model itself. I'll include all of that in the video description. With this Modelfile, we'll be able to create the model using the create command. It takes a few minutes, and then you can run ollama list and see that your model has been created. Now we can just run it. It takes a minute to load, and once we have a prompt, we'll start testing the Arcee-Nova model.
Let's find out a little bit about the model and get a sense of its speed. This is based on Qwen, and the speed is not bad at all. It's generating faster than I can read, which is always good. Let's ask a few more things. What are your guidelines? No harmful, illegal content. No controversial issues. No violent activities. So, we'll stick to legal and friendly activities. Why don't we try a prompt we tested yesterday on a model for creative writing? Let's try this one again: Write a marketing speech for a SaaS AI platform called Arcee Cloud. This is a much larger model than Scribe, so it should be able to do a whole bunch of different things, not just creative writing. As we will see later in the demo, it's quite good at multi-step reasoning. It's generating and using emojis. The speed is more than adequate. Pretty cool. It's got a motto for Arcee Cloud. I should send that to my colleagues. I love how creative these models are.
Let's try something harder. If we look at the benchmarks again, this model is good at a bunch of things, but it looks particularly good at the MUSR benchmark, which stands for multi-step reasoning. It's better than the original Qwen2-72B model, so something in the merging or fine-tuning really improved it here. Let's look at this benchmark and the dataset behind it, and take an example. The dataset is on the Hugging Face Hub, and all links will be in the video description. It's an interesting dataset with murder mysteries, object placements, and team allocation problems. You have rather long stories, thousands of tokens, and then a question. For example: given the story, how would you uniquely allocate each person to make sure both tasks are accomplished efficiently? You have a whole bunch of options, and this is the correct answer. Let's try one of these. I've prepared the whole prompt. Here we go. I'll create a multiline prompt. We have a story about folks working at an art gallery: Patricia, Matthew, and Rebecca. Each has a different personality and skills, and we want to know who should be creating art and who should be selling art. The model needs to understand the context, work its magic on the longer prompt, and start answering. The most efficient allocation would be Rebecca creating art and Matthew and Patricia selling art. This is the correct answer. Creating art: Rebecca. Selling art: Matthew and Patricia. The model got it right. More importantly, the model explains why it thinks this, based on the information given. Matthew is forgetful and not very successful at selling, but his jovial character could still make him an approachable salesperson. Patricia's frankness about true value can aid in selling, compensating for her lack of interest in socializing. Rebecca, with her meticulous art-creation skills but weakness in selling, would be best placed to focus solely on creating high-quality art pieces.
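For repeat runs, one can save a MuSR-style prompt to a file and pipe it in non-interactively. This is a sketch, not what the video does: the story body is a placeholder standing in for the full multi-thousand-token story, and the model name `arcee-nova` is my own choice.

```shell
# Save a MuSR-style team-allocation prompt to a file; the story body is
# a placeholder, not the actual dataset text.
cat > musr_prompt.txt <<'EOF'
<full story about Patricia, Matthew, and Rebecca at the art gallery>

Given the story, how would you uniquely allocate each person to make
sure both tasks (creating art, selling art) are accomplished efficiently?
EOF
wc -l musr_prompt.txt

# Pipe it into the local model non-interactively:
# ollama run arcee-nova < musr_prompt.txt
```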
This is not just the answer but also extra information on why the model came to that conclusion.
As you can see, the model is large, yet the speed is more than adequate on this local machine; why anyone would still use ChatGPT is beyond me. But hey, maybe I'm missing something. That's really what I wanted to show you about this cool new model, Nova. You can find it on the Hugging Face Hub. It was built using the Arcee Cloud platform. If you're curious, visit the Arcee website to learn about model merging and continuous pre-training, and how you can do the same with your own models in your own domains. There's much more coming. We'll start diving into the Arcee platform in the next videos. I'll show you how you can do all of this yourself. Until then, my friends, keep rocking.
Tags
Arcee Nova, LLM leaderboard, Multi-step reasoning, Quantized models, Creative writing AI
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.