Hi everybody, this is Julien from Arcee. In this video, I'm going to show you the simplest way to deploy Arcee models on AWS, and that is Amazon SageMaker JumpStart. As you will see, our most popular models are listed in JumpStart, and you can one-click deploy them to Amazon SageMaker in just a few minutes and open one of our sample notebooks to get to work. Super simple, very quick path to experimentation. Let's get started.
For this demonstration, I've decided to deploy Arcee Nova. Arcee Nova is a 72-billion-parameter model based on the Qwen2 architecture. Here we see the model page on Hugging Face, and we can see this is quite a popular model, with thousands of downloads last month. You can go and read a little bit about the model, look at the benchmarks, etc. It's still a very good 72B model. So let's say we'd like to deploy this on AWS to experiment: get a few results from the model, run some of our prompts, and generally decide whether this is a good model for our project or not. Okay? So how do we do this?
First, you need to start from the SageMaker Studio homepage. If you scroll down a little bit, you will see a section called JumpStart. Let's click on this. JumpStart gives you a curated list of the best models out there, and we see a lot of good companies, including our Hugging Face friends, and we see Arcee as well. So let's just click on Arcee. At the moment, we have five models: four open-source models, plus our commercial SuperNova model, which I've already discussed quite a few times. Let's go and click on Nova.
Here, you get a summary of the model, which comes from the model page. So what do we do next? How does JumpStart work? JumpStart is actually based on the AWS Marketplace. If we click on deploy here, we see a message saying, "Hey, you are about to deploy a model that requires an AWS Marketplace subscription." Maybe you've already subscribed and can go straight to deployment, but in this case, we're going to do the whole thing. So let's just click on subscribe. It opens the Marketplace page for us. Let me zoom in here; it's too small. We see the Arcee Nova page. If you want to see all the details, you can click on product detail here, and you'll see my super nice Marketplace page. Great, let's go back to the original page.
We need to subscribe, but this is an open-source model, so you can see there's no extra fee to use it on AWS. You will obviously pay for the underlying instance, but the model itself doesn't add any charge; that's why you see zeros all the way. Some of our models may have different versions; I don't think that's the case here. For example, some models have one version for P4 and P5 instances, and another, quantized version for smaller GPU instances, etc. Just double-check you are using the version that matches what you're trying to do. There's only one here, so no worries. So let's just click on accept offer; this takes a minute. I'll just pause the video, and once the subscription is complete, we can go back to JumpStart and deploy.
Okay, that just took a minute, and now we have a subscription. Once we've subscribed to the model, we can deploy it in one of two ways. You can click on the deploy button, which provides a managed deployment experience: you get to pick the instance type, etc. But sometimes you will run into problems, because it's a generic way to deploy models, and with large models like this 72B one in particular, you can hit some timeouts; the default timeouts implemented in the deploy process can sometimes cause failures. So I would actually recommend deploying from our notebooks, which are on GitHub in the Arcee AI AWS samples repository. That gives you a little more control. For smaller models, the vanilla deployment will be okay; for larger models, I recommend running my notebook, which uses less strict timeouts.
So how do you run those notebooks? Very simple. Click on notebooks, and you get a preview of the notebook that handles deployment. You just have to click "open in JupyterLab". I have an existing space, so I'll go and do that. Okay, the notebook opened, and we can just start running it. So just go here, maybe close this. Here, we see the list of model packages for Nova across all the regions; the notebook will automatically select the one in the right region, so no need to change anything here. Grab the package, grab a SageMaker client, and then go and deploy. We'll deploy that Nova package on a P4 instance, and we'll just run this. Right. And these are the timeouts I was talking about: the data download timeout and the startup timeout. The vanilla deployment from the Marketplace uses values that may be too short for large models like this, which is why, if you click on that deploy button, your mileage may vary. It may work, it may not.
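For reference, here's a minimal sketch of what that deployment cell does, with a hypothetical model package ARN and endpoint name; the notebook carries the real per-region ARNs, so treat the values below as placeholders:

```python
import sagemaker
from sagemaker import ModelPackage

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Hypothetical ARN: the notebook selects the real one for your region.
model_package_arn = (
    "arn:aws:sagemaker:us-east-1:123456789012:model-package/arcee-nova-example"
)

model = ModelPackage(
    role=role,
    model_package_arn=model_package_arn,
    sagemaker_session=session,
)

# The relaxed timeouts mentioned above: give a 72B model enough time to
# download its weights and start the model server before SageMaker gives up.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",
    endpoint_name="arcee-nova-endpoint",  # hypothetical name
    model_data_download_timeout=3600,  # seconds
    container_startup_health_check_timeout=3600,  # seconds
)
```

The two timeout parameters at the end are the ones that matter for large models; an hour is generous, but it beats a failed deployment.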
So let's just give this model a few minutes to deploy, and then we'll run it. Okay, it worked. It actually took longer than I thought; I suspect I was waiting for capacity for a little while. But anyway, we have the endpoint. If we look at the SageMaker console, we see the endpoint in service. If we open the logs, we see that the shards are loaded and the model server is ready, so we are good to go. Going back to our notebook, we can just try a prompt. All right, and this worked. As you can see, once again, we use the OpenAI format for inputs and outputs, so if you have existing OpenAI prompts, you don't have to rewrite any code. And of course, we can run some more examples here. Let's see how that works. A marketing email, technical questions on transformer models, and, I guess, another email from a motorcycle shop. Why not? There you go.
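Here's a sketch of what such an invocation cell looks like, assuming the hypothetical endpoint name from the deployment sketch above; the exact set of generation parameters the container accepts may vary:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# OpenAI-style chat payload, as shown in the notebook.
payload = {
    "messages": [
        {"role": "user", "content": "Write a short marketing email for a motorcycle shop."}
    ],
    "max_tokens": 256,
}

response = runtime.invoke_endpoint(
    EndpointName="arcee-nova-endpoint",  # hypothetical name from the deploy step
    ContentType="application/json",
    Body=json.dumps(payload),
)

# The response body follows the OpenAI chat completion format.
result = json.loads(response["Body"].read())
print(result["choices"][0]["message"]["content"])
```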
Once we're done with this, we shouldn't forget to delete the endpoint and the model so we stop paying. The endpoint should be gone now. So that's what I would call the more elaborate way to deploy models from SageMaker JumpStart, which, again, I recommend if you're working with larger models, because you have a little more control over deployment parameters, timeouts, etc. And if something fails, you'll generally see an error in the notebook, which makes it a little easier to understand what went wrong; sometimes you don't even get CloudWatch logs if the process fails early.
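The cleanup itself is a minimal sketch, using the predictor object from the deployment sketch earlier:

```python
# Delete the model object and the endpoint so billing stops.
predictor.delete_model()
predictor.delete_endpoint()  # also removes the endpoint configuration
```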
Now let me show you how you can deploy one of our models directly from JumpStart without a sample notebook. Again, this should work most of the time, particularly for the smaller models, which are a little less demanding. Okay, so back to Arcee. Why don't we try Arcee Lite? We did the really large one; now let's do the really small one. This is a 1.5-billion-parameter model, and you can go check out its model page on the Hugging Face Hub. Let's just do this quickly and deploy. I believe I've already subscribed to this, but let's double-check. Yes, we're good. As you can see, we have a v1 version with CPU and GPU instances, and a v2 version which adds more CPU instances (C7 and, I believe, C6 down there too) for customers who want to run this on CPU. Already subscribed, so we can move straight to deployment.
We get to pick the instance type, so why don't we use g5.xlarge, the smallest, most cost-effective GPU instance. We'll just change this, click on deploy, wait for a few minutes, and the endpoint should be ready. As you can see, this time around it was just a click, and endpoint creation was much faster; this is a much smaller model. So it is in service, and we see it in the console. If we open the log once again, we see the model is ready. So what do we do next? Well, all we need is a simple notebook that lets us invoke a SageMaker endpoint. As it turns out, I have one for you. If we go to my GitHub repo and open the model package notebooks folder, you will see the notebook called "sample notebook, all models, existing SageMaker endpoint". Hopefully, that's a good name. This one lets you run inference on any endpoint. So all we have to do is clone the repo. Let's go to SageMaker Studio and run git clone. Bam. Okay, I had already done this, and now, of course, we have our notebook. That's the one: sample notebook, all models, existing endpoint. And this will work for all endpoints, or at least all the ones based on Arcee models. Then we just need to grab the name of the endpoint. Not the URL, the name. So this. Let's just put it here and run the cells. And we should be able to predict. There we go. Right? So this is a good one to have. It will work for all our models. Just change the endpoint name, and then you can start tweaking.
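In essence, that notebook attaches a predictor to an already-deployed endpoint by name. Here's a minimal sketch of that pattern, with a placeholder endpoint name you'd replace with your own:

```python
import sagemaker
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Paste your endpoint name here (the name, not the URL); this is a placeholder.
predictor = sagemaker.Predictor(
    endpoint_name="arcee-lite-endpoint",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Same OpenAI-style payload as before; it works against any Arcee endpoint.
response = predictor.predict({
    "messages": [{"role": "user", "content": "Explain transformer attention in one paragraph."}],
    "max_tokens": 128,
})
print(response["choices"][0]["message"]["content"])
```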
Let's do another one. And you can see how fast this model is; it will easily generate at least 100 tokens per second on a single request. So go and have fun, and please make sure to delete the endpoint when you're done. I haven't added that to the notebook, so, as usual, just go to endpoints in the console and delete it. Okay, all right, that's what I wanted to show you today. As you can see, you can find our models in JumpStart, with sample notebooks for most of them. The managed deployment experience, the deploy button, should work, but if you run into timeout problems, it means the default deployment configuration implemented in the service doesn't work for that model. In that case, use our sample notebooks: subscribe to the model in the Marketplace, open the sample notebook, and run the deployment; that should work. And if it doesn't, ping me and let me know what kind of problem you had. We'll have more models in there soon, as you can guess. All right. Well, thanks again for watching. I hope this was useful. And my friends, until next time, you know what to do. Keep rocking.