SLM in Action: SuperNova-Medius, a high-performance 14B model

November 12, 2024
In this video, you will learn about SuperNova-Medius, a powerful 14-billion-parameter model based on the Qwen2.5 architecture.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

At release time, SuperNova-Medius was the best 14B model according to the Hugging Face Leaderboard, and it is still one of the top models in its class, with performance close to that of 70B models. I show you how to deploy SuperNova-Medius from the AWS Marketplace, first in a single click with AWS CloudFormation, then with a Jupyter notebook.

* Blog post: https://blog.arcee.ai/introducing-arcee-supernova-medius-a-14b-model-that-rivals-a-70b-2/
* Model page: https://huggingface.co/arcee-ai/SuperNova-Medius
* Arcee AI on the AWS Marketplace: https://aws.amazon.com/marketplace/seller-profile?id=seller-r7b33ivdczgs6
* Notebooks: https://github.com/arcee-ai/aws-samples

00:00 Introduction
03:25 Finding Arcee models on the AWS Marketplace
07:05 Deploying with AWS CloudFormation
11:20 Finding our AWS developer resources
12:20 Testing the endpoint
15:15 Deploying with a Jupyter notebook
19:10 Testing the endpoint again, with streaming inference

#ai #aws #slm #llm #openai #chatgpt #opensource #huggingface

Transcript

Hi everybody, this is Julien from Arcee. I'm in the Bay Area once again, so I guess I have to continue the tradition of doing those hotel room videos. So here it is. In this video, we're going to talk about one of our latest models, called SuperNova-Medius. Medius is a 14-billion-parameter model with performance close to our 70-billion-parameter model called SuperNova. I'm going to show you how to deploy this model on AWS and how to do this from the AWS Marketplace. First, we'll do it in a single click with CloudFormation and predict with the endpoint. Then, we'll deploy with a proper Jupyter notebook for those of you who prefer to run code. Of course, we'll run some examples and talk about SuperNova-Medius along the way. Let's get started.

As usual, I'll put all the links in the video description, particularly this blog post which introduces our SuperNova-Medius model. As you would guess, this has been built using our tech stack with model merging and distillation, starting from the much larger Llama 3.1 405B model and then throwing in a Qwen model for good measure. It's actually a mix of two different architectures, which is probably why this is such a good model. It also raises all kinds of interesting questions on how to merge models from different architectures into a single model. There's a pretty cool tool in MergeKit called MergeKit Token Surgeon, which is a good name and has quite a story to it, so I won't go too deep into this today, but please read the blog post and you'll see why this is such a good model and why our tech stack is really the best way to build those high-performance models. We published it on Hugging Face, so you can definitely go and grab it from there and look at some of the benchmarks. When this model was published, it became the best 14B model, so it was at the top of the leaderboard. Since then, we have a few merges that seem to outperform it just a little bit, but this is still an amazing model. The good thing is it gets really close to the performance of our SuperNova 70B model, which I covered in other videos. Deploying a 14B model is much more cost-effective; we won't need a huge GPU instance to do this, so I would recommend checking out Medius. There's also a good amount of positive feedback in the community tab, so go and check that out. I also saw some tweets saying that this was almost as good as NVIDIA's 70B Nemotron model, so apparently this is good and the community likes it. Go try it out; maybe that's all you need.

Let's go and deploy this. As mentioned, we'll start from the AWS Marketplace. Go to aws.amazon.com/marketplace, look for Arcee, and you should find us. You'll see our latest models in there, like SuperNova, and if you go down, you'll find SuperNova-Medius. Please read all of this, not because I spent some time writing it, but because I try to give you an overview and some usage instructions. We support the OpenAI input and output format here, which is good. There are some release notes, and you can see this particular model package will run on 4-GPU and 8-GPU instances in the G5, G6, P4, and P5 families, with the context size set to 32K and the OpenAI Messages API enabled. No G6e because, at the time of recording and packaging, they're not on SageMaker, but once they become available, trust me, I will add them. I like those G6es a lot. Please take a look and make sure you're familiar with this. Next, you need to figure out where and how you're going to deploy the model.
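Since the listing advertises the OpenAI Messages API with a 32K context, requests to the endpoint will look something like this minimal sketch. The field values are illustrative, not copied from the listing, so check the usage instructions on the marketplace page.

```python
# Illustrative OpenAI-messages-style request body (placeholder values).
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is model distillation?"},
    ],
    "max_tokens": 256,   # stay well under the 32K context mentioned in the listing
    "temperature": 0.7,
}
```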
There is only one version for G5, G6, etc. Maybe I'll do other versions for Trainium or other platforms, but for now, there's only one package. For the region, I'm going to deploy in Oregon, where I have all my limits set up the way I like them. You can see the model is free, so you won't pay any extra for inference. Zero is exactly what it says; you will only pay for the underlying instance. By default, I've selected g5.12xlarge, which is more than enough for testing and pretty cost-effective at about seven bucks an hour. If you need faster, more expensive instances, you can go and try P4 or the monster P5. Why not? But that's scary, so we'll just stick to G5. That's more than enough. Once you've read all of this and know what to do, click on "Continue to subscribe". We'll make you an offer you can't refuse, because it's a free model. Again, nothing to pay. The Medius license is really the Qwen license, because Qwen is the architecture of the model, and we can accept the offer. It takes a minute or two for the subscription to go through. After a minute or two, you should see "Thank you for subscribing; you can use our product." This means you now have access to the model package I created and can go and deploy it.

There are different ways to do this, so let's continue to configure, and maybe open a new window for this because we'll come back to that window later. Now we can deploy the model package, so SuperNova-Medius on g5.12xlarge, using different techniques. We can do CloudFormation, which is the one we're going to do. We could also deploy from the SageMaker console, which pre-fills all the boxes so you just click through, but we've seen this too many times now. If you're adventurous, you can try the CLI, although I'm not sure that's a good idea. Let's do CloudFormation instead. This will launch a CloudFormation template, and we'll look at it. It will create a SageMaker endpoint and all the resources. I really like this version best because it's a one-click thing, and you don't need to worry. We only have one software version here, which is V1, just one package. Pick our region, so let's do Oregon. If this is the first time you're doing this, you may have to create a service role for SageMaker and make sure it has SageMaker full access attached to it. If you have no idea what SageMaker is, pause the video, go read the SageMaker getting-started docs, and then come back here. I have everything set up and have a service role already, so I'm just going to use the existing one. We could download the CloudFormation template, but we're not going to do a CloudFormation video this time around. We'll just launch it directly. Let me zoom in a bit. Everything is pre-filled; you shouldn't need to change anything, maybe just the name. "Medius one" is not a great name, so let's make it a bit more unique. The same for the endpoint name. Make sure you have your IAM role, one instance, g5.12xlarge. That's the package ARN, don't touch it. Just click on "Create stack". Now, we see the stack being created. Let's just close this and see the resources popping up. The model and endpoint config are fast, and the endpoint will take a few minutes. If we go to the SageMaker console, we'll see the endpoint being created. Here it is. That's pretty cool. There's really nothing to do here. If you're not familiar with the SageMaker SDK and just need the one-click easy button, this CloudFormation deployment is really it.
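While the stack is coming up, here is a small optional sketch for watching the deployment from code instead of the console. It assumes your boto3 credentials are configured; the stack and endpoint names are made up, so use whatever you entered in the CloudFormation form.

```python
import boto3

# Hypothetical names: use whatever you typed into the CloudFormation form.
region = "us-west-2"  # Oregon, as in the video
stack_name = "supernova-medius-demo"
endpoint_name = "supernova-medius-demo-endpoint"

cf = boto3.client("cloudformation", region_name=region)
sm = boto3.client("sagemaker", region_name=region)

# Stack status, e.g. CREATE_IN_PROGRESS, then CREATE_COMPLETE.
stack = cf.describe_stacks(StackName=stack_name)["Stacks"][0]
print("Stack status:", stack["StackStatus"])

# Block until the SageMaker endpoint reaches InService.
sm.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
endpoint = sm.describe_endpoint(EndpointName=endpoint_name)
print("Endpoint status:", endpoint["EndpointStatus"])
```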
Let's wait a few minutes for the stack to be created, and then we'll take a look at the endpoint itself. After a little while, the stack creation is complete. We see the endpoint has been created, the stack status is complete, and if we go to our endpoint, it says "InService". We can check the logs, and it's good to go. What now? Well, now we need to invoke the endpoint. Maybe you already have your own notebooks for that. If not, I have a repo on GitHub called aws-samples. Feel free to give it a star if you like it. I'm in SageMaker Studio, but you can use any Jupyter environment as long as you have your SageMaker credentials set up. Let's clone this and go to the folder called "model package notebooks". That's the one we want, because we are deploying from marketplace packages, not from the Hugging Face Hub. We'll look at one of those later, but for now, open the sample notebook for all models with existing SageMaker endpoints, because that's what we have right here. We have an endpoint ready to go; we just need to enter its name. Grab the name here, and leave the quotes. This is a generic notebook that will work with any SageMaker endpoint as long as it supports the OpenAI API. We have a test payload: "Suggest five names for a neighborhood pet food store. Names should be short, fun, easy to remember," etc. Let's use this. You can see this is the OpenAI format. We just use the SageMaker runtime client to invoke the endpoint. Let's run this. Here we do synchronous inference: we generate the full answer and print it out. I'll show you streaming inference in the following example. Now it's generated, and we can print it out. We see the OpenAI output format and the number of generated tokens, and we could print it out in a nicer way with Markdown. We could run more examples. That's the first way to work with our marketplace models: deploy them from the marketplace with CloudFormation, wait for a few minutes, open this notebook, enter the endpoint name, and predict. Super simple, no code to write, and it will work for every single model.

Before I show you how to deploy from a proper Jupyter notebook, please don't forget to delete the stack, because you have an endpoint running. I don't recommend deleting the endpoint from the SageMaker console, because it was created through CloudFormation, so the proper way is to delete it the same way. Just click on delete. It will say "Delete in progress", and very quickly you'll see all the resources disappearing. You can double-check in the endpoints section. Bam. Endpoint deleted. No more charges. Please don't forget to delete. That's also a good way to free up AWS instances for your fellow developers.
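For reference, that synchronous call boils down to a few lines of boto3. Here is a minimal sketch: the endpoint name is a placeholder for the one your stack created, and the payload mirrors the notebook's pet-food-store example.

```python
import json

import boto3

# Placeholder endpoint name: use the one created by your CloudFormation stack.
endpoint_name = "supernova-medius-demo-endpoint"
smr = boto3.client("sagemaker-runtime", region_name="us-west-2")

payload = {
    "messages": [
        {
            "role": "user",
            "content": "Suggest five names for a neighborhood pet food store. "
                       "Names should be short, fun, and easy to remember.",
        }
    ],
    "max_tokens": 512,
}

response = smr.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)
result = json.loads(response["Body"].read())

# OpenAI-style output: the completion text, then the token counts.
print(result["choices"][0]["message"]["content"])
print(result["usage"])
```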
Now, let's look at the second way to work with our marketplace models. Remember, going back to when we subscribed, we got this screen: "Thank you for subscribing", and we continued to configuration for the CloudFormation deployment. Instead of continuing to configuration, you could stop right there. You did subscribe, and you have access to the model package. We can come back to my GitHub repo and its model package notebooks. We already cloned this, so let's close the one we don't need anymore and look at this one: "Sample notebook: SuperNova-Medius on SageMaker". We just subscribed, and now we want to deploy. There's a bit of information here on what's happening. Here, I filled in all the model package ARNs for all the regions where SageMaker is available. You don't need to edit this; just click through. Let me double-check. No, everything is fine. Let's go through those cells.

This is the list of model package ARNs, i.e., Amazon Resource Names, for all the regions where SageMaker is available. If your region is not on the list, it means SageMaker is not available there; don't yell at me, yell at AWS. You do not have to edit anything here unless there's a typo, and please let me know if you find one, but these should be fine. Let's select this and import our dependencies. I am running in the Oregon region, so if we print this, yes, us-west-2, and we also selected from the list above. Please do not edit that list. Grab the role, and now we're going to create our endpoint. If you're used to doing this with SageMaker, it's the same old process: launch the instance, download the container, load the model, etc. The only difference is that we're deploying from the model package. Let's define the model name and instance type, and use an instance type that is available in the listing. If G6e became available this morning and you tried it, it would fail, because it is not listed in the package's supported instance types. So, let's just say G5. Here is where the magic happens. Instead of pointing at a Hugging Face model or an S3 artifact, we're pointing at the model package ARN, which we retrieved just a few seconds ago. Then we just call deploy on it. From then on, it's exactly the same SageMaker workflow you already know. We need to wait for a few minutes, and we'll have an endpoint we can test. I'll show you streaming inference.

After a few minutes, the endpoint is in service. We can see it here. Great, and now it's just a SageMaker endpoint as we know it. We can invoke it and do synchronous inference. Let's try the same example as before. Let's wait for this, and now we can do streaming inference. We pass `stream=True` and use a different API, `invoke_endpoint_with_response_stream`; both the deployment and the streaming call are condensed in the sketch at the end of this transcript. There's a utility function that will retrieve each token or set of tokens as they are generated. Let's try this. Write a marketing pitch for a SaaS AI platform. Why not? As you can see, we are streaming, and it's generating emojis and everything. That's pretty nice. Here's another one, and of course, feel free to take those notebooks and add anything you like; try your own data. This runs in your AWS account, so you can pull in data from RDS, S3, or anything you like. Here's another one getting me excited about more cycles. That's pretty easy, I have to say. Once again, once you're done, please run the cleanup cells, delete everything, and stop paying. If you want to double-check, go to endpoints, refresh, and it's gone.

That's pretty much what I wanted to show you: how simple it is to work with our models on the marketplace. One-click deployment with CloudFormation, then just open a simple notebook and start testing. If you prefer to run a full deployment notebook, go to my GitHub repo and find the one for the model package you're interested in. Make sure you have subscribed to that package on the marketplace page; that's still a mandatory step. Then you can go and deploy it and test it in any way you like. Well, that's it for me. I guess I'm going to go enjoy San Francisco, hopefully. Until next time, keep rocking.
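To close, here is the whole model-package path condensed into one sketch. It is a best-effort reconstruction rather than the notebook itself: the package ARN, endpoint name, and prompt are placeholders, and the raw stream chunks are printed as-is instead of going through the notebook's parsing utility.

```python
import json

import boto3
from sagemaker import ModelPackage, Session, get_execution_role

session = Session()
role = get_execution_role()  # assumes a SageMaker execution context

# Placeholder ARN: pick the real one for your region from the notebook's list.
package_arn = "arn:aws:sagemaker:us-west-2:123456789012:model-package/supernova-medius"
endpoint_name = "supernova-medius-pkg"

# Deploy straight from the marketplace model package: no model weights,
# no S3 artifact, just the package ARN.
model = ModelPackage(role=role, model_package_arn=package_arn,
                     sagemaker_session=session)
model.deploy(initial_instance_count=1,
             instance_type="ml.g5.12xlarge",
             endpoint_name=endpoint_name)

# Streaming inference: pass stream=True and read the response event stream.
smr = boto3.client("sagemaker-runtime")
body = json.dumps({
    "messages": [{"role": "user",
                  "content": "Write a marketing pitch for a SaaS AI platform."}],
    "max_tokens": 512,
    "stream": True,
})
response = smr.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=body,
)
# Each event carries a PayloadPart with raw bytes (server-sent-event chunks);
# the notebook parses them with a small utility, here we print the raw stream.
for event in response["Body"]:
    part = event.get("PayloadPart")
    if part:
        print(part["Bytes"].decode("utf-8"), end="", flush=True)

# Cleanup: delete the endpoint when you're done; the notebook's cleanup cells
# also remove the endpoint config and the model.
boto3.client("sagemaker").delete_endpoint(EndpointName=endpoint_name)
```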

Tags

Supernova Medius, AWS Marketplace, CloudFormation Deployment