Introducing the Arcee Model Engine

December 06, 2024
In this video, you will learn about the Arcee Model Engine (https://models.arcee.ai), a new SaaS platform for Small Language Model inference. Model Engine introduces 6 new models, with more to come.

*** UPDATE: The Model Engine has been upgraded to Conductor. See https://youtu.be/1-SCHE9Idcs

If you’d like to understand how Arcee AI can help your organization build scalable and cost-efficient AI solutions, please contact sales@arcee.ai or book a demo at https://www.arcee.ai/book-a-demo.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. You can also follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

First, we look at the Arcee models available in Model Engine: Virtuoso, a general-purpose SLM; Coder, an SLM specialized for code generation and code conversations; and Caller, an SLM specialized for function calling and tool usage. Then, we discuss model pricing, which is usage-based. Finally, we test Virtuoso and Coder, both in the UI sandbox and in a Jupyter notebook with the OpenAI API.

00:00 Introduction
00:30 What is Model Engine?
02:30 A quick look at model benchmarks: is 14B the new 70B?
04:30 Testing Virtuoso Large in the sandbox
08:30 Testing Virtuoso Large with the OpenAI API
12:00 Testing Coder Large with the OpenAI API
15:40 Conclusion

* Blog post: https://blog.arcee.ai/announcing-the-arcee-model-engine-public-beta/
* Sample notebook: https://github.com/juliensimon/arcee-demos/blob/main/model-engine/

Interested in learning more about our solutions? Book a demo at https://www.arcee.ai/book-a-demo

Transcript

Hi everybody, this is Julien from Arcee. Just a few days ago, we launched our new SaaS inference platform called Model Engine. Model Engine introduces a new collection of Arcee models, and in this video, I'm going to show you what those models are, discuss their benchmarks and how they compare to models we've previously built, and of course, we'll run some demos using the models. Let's get started.

To get started with Model Engine, you need to go to models.arcee.ai and sign up. All these models are commercial models. You can sign up in a minute, enter your credit card details, and you're good to go. Our current plan starts at $20 a month, which buys you a significant amount of credits that you can then spend on the different models according to their prices. We'll look at the prices as we move along.

What models are available today? We have three different models:
- Virtuoso: a general-purpose model for conversational apps, available in large, medium, and small sizes.
- Coder: a code generation and code conversation model, available in large and small sizes.
- Caller: a model optimized for function calling and working with external tools, currently available only in a large size.

We have more models coming very soon; by the time you watch this, they may already be available. If you look at our launch blog post (I'll put the link in the video description), you'll see that we have two more models on the way:
- Spotlight: a vision-language model for working with images and text.
- Arcee Maestro: a reasoning model.

Before we dive into testing the models, I wanted to show you how they compare to some of our previous models. Here, you see benchmarks from the Hugging Face leaderboard for three of our open-source models: Llama 3.1 8 billion, SuperNova Medius 14 billion, and Arcee Nova 72 billion. All three models are open source and available on Hugging Face. You can also see the numbers for Virtuoso Small, which is 14 billion. If we compare the two 14 billion models, SuperNova Medius was released not even two months ago, and you can see how Virtuoso Small outperforms it significantly on all benchmarks. This shows how fast we are moving and improving our models. Virtuoso Small is also getting dangerously close to Arcee Nova 72B. When Nova was released last July, it was the best 70B-class open-source model. Now, we can build a 14 billion model that is almost as good: IFEval is actually higher, BigBench Hard is a little lower, and other benchmarks like MMLU-Pro are pretty close. So, is 14B the new 70B? Not quite, but we're getting really close. You can imagine the performance you can get from our larger Virtuoso models, and the increased cost-performance and ROI that come from the simpler, more cost-effective infrastructure needed to deploy 14B models compared to 72B.

Let's take a look at the models now. We'll start with Virtuoso. Here are the prices: $1.27 per million input tokens, $1.50 per million output tokens, and a context length of 128K. The large model is 72B, the medium is 32B, and the small, which we just looked at, is 14B; you can see all their prices. Let's ask a question. What should we ask? Here's my prompt. Let me zoom out a bit. It's working, generating, and it's fast. We generated 604 tokens. If we look at usage, we'll see how many queries we sent to the model, how many input tokens we used, and how many output tokens we used. I've been playing with this a little bit, and you can see the cost for 1 million input plus output tokens.
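To make that cost concrete, here is a minimal Python sketch of the arithmetic, using the per-million-token prices quoted above. This is an editorial illustration, not code from the video, and the function name is just for clarity.

```python
# Minimal sketch: estimating usage cost from token counts, using the
# per-million-token prices quoted in the video ($1.27 input, $1.50 output).
# Other models and sizes have different prices; check the Model Engine pricing page.

INPUT_PRICE_PER_M = 1.27   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 1.50  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for a given number of input and output tokens."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + (
        output_tokens / 1_000_000
    ) * OUTPUT_PRICE_PER_M

# One million input tokens plus one million output tokens comes to $2.77,
# which is the figure mentioned in the transcript.
print(f"${estimate_cost(1_000_000, 1_000_000):.2f}")
```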
So, once you spend $2.77, you will spend another $2.77 for another million input plus output tokens. Let's go back to the models. We can play with them in the sandbox to get a feel for the performance, but obviously, we want to work programmatically as well. As a SaaS platform, Model Engine lets you query the models through APIs, and you will recognize the familiar OpenAI format for URLs and payloads. One thing to keep in mind is that we use HTTP/2, which is much more efficient, so make sure you enable HTTP/2 in curl or in the OpenAI client. You also need to pass your authorization token, your API key, which you can create in the API keys section. Once you have your API key, you can use it to query the models. Here's how we would do it with curl, and this is how we would do it with Python. All the models are OpenAI-compatible, so you can use the vanilla OpenAI client; just make sure you enable HTTP/2. The rest is the chat completions API that many of you are already familiar with.

Let's switch to a notebook and run some examples. Today, I'll show you Virtuoso Large and Coder Large. We need to define the endpoint, have our API key ready, and the name of the model. This lets us create the OpenAI client. I also have a utility function to print the streaming response, which looks nicer in demos. Let's run a first prompt: "Write a short horror story in the style of H.P. Lovecraft. It should take place in the 20s in Antarctica. Write at least 2,000 words." I'm setting streaming to true. It's generating, and the generation speed is more than adequate. It ended up generating 3,500-plus tokens in 75 seconds, which is a good figure. I'll save that story and read it over the weekend.

Let's try something else and leverage the huge context we have. I have the full text of "Alice in Wonderland" and "The Great Gatsby." Let's ask the model to draw a parallel between the main characters, Alice and Gatsby, and pass the full text of both books. That's over 70,000 words, which is probably close to 100,000 tokens. The processing was super fast, and here's the parallel between Alice's and Jay Gatsby's quests for identity and self-discovery. It covers dreamlike and unreal worlds, naivety and idealism, the role of the narrator, and the role of authority figures. It's worth a read.

Let's try Coder. We'll set up our client again and use the streaming response. Let's try this prompt: "Explain the difference between logit-based distillation and hidden-state distillation. Show an example for both with PyTorch code, using BERT-Large as the teacher model and BERT-Base as the student model." The model provides a bit of explanation and then some code. It correctly loads the models, creates a simple dataset, and shows the forward pass, the teacher and student outputs, and the distillation loss. For hidden-state distillation, it defines a linear projection, which is exactly how you would do it. This should work, and it's a pretty good answer with good explanations and code.

Let's try to get the model to fix my code. Here's my streaming function; let's ask it to improve the following code and explain what it did. The model provides an improved version of the code, using more Pythonic techniques for safe access to the delta content and for updating the token counts. It's a fair improvement.

That's what I wanted to show you today about our brand new Model Engine and the initial batch of models, available in the UI or through API calls with the OpenAI client. Go and have fun, sign up, and let me know how you're doing.
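As a companion to the API walkthrough above, here is a minimal sketch of querying Model Engine with the vanilla OpenAI client over HTTP/2 and printing a streamed response. The base URL and model identifier are assumptions for illustration; check the Model Engine documentation and your API keys page for the exact values, and see the sample notebook linked in the description for the actual demo code.

```python
# Minimal sketch of an OpenAI-compatible client for Model Engine with HTTP/2 enabled.
# pip install openai "httpx[http2]"

import os

import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://models.arcee.ai/v1",   # assumed endpoint, check the docs
    api_key=os.environ["ARCEE_API_KEY"],     # key created in the API keys section
    http_client=httpx.Client(http2=True),    # enable HTTP/2 as recommended
)

def print_streaming_response(stream):
    """Print a streaming chat completion as it arrives, with safe access to delta content."""
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

stream = client.chat.completions.create(
    model="virtuoso-large",                  # assumed model identifier
    messages=[
        {"role": "user", "content": "Write a short horror story set in 1920s Antarctica."}
    ],
    stream=True,
)
print_streaming_response(stream)
```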
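To make the distillation discussion easier to follow, here is a minimal PyTorch sketch of the two losses, written as an editorial illustration rather than a reproduction of the model's actual answer. It uses the BERT-Large teacher and BERT-Base student from the prompt, with a KL divergence on softened logits for logit-based distillation and an MSE on hidden states, bridged by a linear projection, for hidden-state distillation.

```python
# Minimal sketch of logit-based vs. hidden-state distillation losses.
# pip install torch transformers

import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

teacher = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2, output_hidden_states=True
).eval()
student = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2, output_hidden_states=True
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A tiny toy batch, just to run both forward passes
batch = tokenizer(["a great movie", "a terrible movie"], padding=True, return_tensors="pt")

with torch.no_grad():
    teacher_out = teacher(**batch)
student_out = student(**batch)

# 1) Logit-based distillation: KL divergence between softened teacher/student distributions
temperature = 2.0
logit_loss = F.kl_div(
    F.log_softmax(student_out.logits / temperature, dim=-1),
    F.softmax(teacher_out.logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature**2

# 2) Hidden-state distillation: MSE on the last hidden states, with a learned
#    linear projection from the student's 768 dims up to the teacher's 1024 dims
projection = torch.nn.Linear(student.config.hidden_size, teacher.config.hidden_size)
hidden_loss = F.mse_loss(
    projection(student_out.hidden_states[-1]), teacher_out.hidden_states[-1]
)

print(f"logit-based loss: {logit_loss.item():.4f}, hidden-state loss: {hidden_loss.item():.4f}")
```

In a real training loop, these losses would be combined with the task loss and backpropagated through the student and the projection layer only; the teacher stays frozen.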
All questions are welcome, and I'll see you soon with more examples, more models, and the distillation video. Thanks for watching. Keep rocking.

Tags

Model Engine, Arcee, SaaS Inference Platform, AI Models, API Integration