Deploying Llama3 with Inference Endpoints and AWS Inferentia2

May 24, 2024
Are you curious about deploying large language models efficiently? In this video, I'll show you how to deploy a Llama 3 8B model using Hugging Face Inference Endpoints and the powerful AWS Inferentia2 accelerator. I'll be using the latest Hugging Face Text Generation Inference container to demonstrate the process of running streaming inference with the OpenAI client library. Stay tuned as I also delve into Inferentia2 benchmarks, offering insights into its performance. The Hugging Face Inference Endpoints provide a seamless way to deploy models, and when coupled with the AWS Inferentia2 accelerator, you can achieve remarkable efficiency. Don't miss out on this opportunity to enhance your deployment game!

#LargeLanguageModels #HuggingFace #AWSInferentia2 #Deployment #MachineLearning

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

Inference Endpoints: https://huggingface.co/docs/inference-endpoints/index
Model: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
Notebook: https://github.com/juliensimon/huggingface-demos/blob/main/inference-endpoints/llama3-8b-openai-inf2.ipynb
Inferentia2 benchmarks: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/benchmarks/inf2/inf2-performance.html#inf2-performance

Transcript

Hi everybody, this is Julien from Hugging Face. As you move your LLMs to production, one of the key topics you have to work on is cost-performance optimization. Of course, you should get the latency and the throughput that are compatible with your business requirements, but you should also try to minimize cost, right? You need to work on both parts of the equation. So in this video, to help you do that, I'm going to show you how you can easily deploy Hugging Face models on our model deployment service called Inference Endpoints, relying on the AWS Inferentia2 accelerator, which we find delivers great cost performance for LLMs in production. Okay, let's get started.

If you enjoy this video, please consider subscribing to my YouTube channel, and I would really appreciate it if you hit the like button and gave me a thumbs up. This really helps with recommendation. Also, why not share this video with your colleagues or on your social networks? Because if you enjoyed it, chances are someone else will. As always, thank you so much for your support.

Let's say we would like to deploy Llama 3 8B on an inference endpoint. That's super easy, as you will see. I'll show you how to do it with the UI, and then I'll show you how to do it with the SDK. On the model page, click on the Deploy button, select Inference Endpoints, make sure you selected AWS, and then Inferentia2 instances, which come in two sizes: a small one for dev and test or small-scale traffic at $0.75 an hour, and a larger one for more serious traffic at $12 an hour. We'll just go with the inexpensive one; we don't need more right now.

Then we can decide if we'd like to scale the endpoint to zero after a certain amount of time, which is super important for dev and test to minimize your cost. So you can scale down after 30 minutes, one hour, et cetera. Or if you're actually using this for production and you'd like to have low latency all the time, you can decide never to scale down.

Then you need to set the security level. You can go from fully public, which is exactly what the name says, no authentication, to fully private, where the endpoint can only be accessed from your AWS account with PrivateLink, which is a very simple setup. Or, in the middle, you have protected, where your endpoint is visible on the internet but requires token authentication. And that's what I will use here.

We could just click on Create Endpoint and wait for a few minutes to see the endpoint. We could also look at advanced configuration like autoscaling, model settings, et cetera, but I'll just use the defaults here. So click on Create Endpoint, wait a few minutes, and you have your endpoint running. After a few minutes, our endpoint is up, as we can see here. We can test it: we pass a token and try basic text generation. All right, there you go. So you have your endpoint running, and that's just a click away.
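For reference, here is a minimal sketch of that quick test in Python, assuming a protected endpoint backed by the Text Generation Inference container. The endpoint URL and token below are placeholders, not values from the video:

```python
import requests

# Placeholders: copy the real URL from the Inference Endpoints UI and use your own
# Hugging Face token, since the endpoint is "protected" (token authentication).
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HF_TOKEN = "hf_xxx"

# Basic text generation request against the TGI server behind the endpoint.
response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "inputs": "Deep learning is",
        "parameters": {"max_new_tokens": 128},
    },
)
response.raise_for_status()
print(response.json())
```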
So now let me show you how to do this with the Hugging Face API. We need to install the huggingface_hub library, which provides the create_inference_endpoint API. And just like I showed you in a previous video, we're also going to install the OpenAI client, because Inference Endpoints lets you invoke the model with the OpenAI Messages API. So if you've built application code that uses the OpenAI client because you are working with OpenAI models, this is a super easy way to migrate to open-source models without having to change much in your code, as you will see in the sketch below.

So log into the Hub and then just call create_inference_endpoint with the model name, and set the accelerator to neuron, not GPU. The vendor is AWS, the region is us-east-1, the security type is protected, the instance type is Inferentia2, and the instance size is x1, the small one. Then we pass some settings for the model. For example, we pass the number of cores to use on the Inferentia2 accelerator; this particular instance has two cores, so that's what we'll use. And we want a sequence length of 4K and a batch size of 1.

These settings are important because, as mentioned in previous videos, we pre-compile the models and store them in the optimum-neuron cache repository. So if you want to avoid pre-compiling the model yourself, make sure you work with a configuration that we support. Just go to the inference cache config directory, look for, let's say, Llama 3, and there you'll see the pre-compiled versions. That's the one I'm actually using: batch size 1, sequence length 4K, number of cores 2. If you match that, then once you deploy the model, the API will fetch the pre-compiled artifact and deploy it, so you won't have to pre-compile the model before deployment. A good shortcut. And feel free to try the other supported configurations.

So then we run this cell, we wait for a few minutes, the endpoint is up, and I can invoke it. As you can see, I can invoke it with the OpenAI client and the OpenAI Messages API. Let's try this. And yes, it is running. And this, my friends, is how you run Llama 3 8B very easily with OpenAI code and OpenAI prompts on a $0.75-an-hour instance. For dev and test or small-scale usage, it's just perfect. And if you need to handle more serious traffic, you can run the larger Inf2 instance: you would just change the instance size and the number of cores, and then decide whether you want to optimize for throughput or latency. If you want low latency, batch size 1 is the best choice. If you want more throughput, you need to increase the batch size, but the latency will get a bit worse.

A lot of folks are asking about benchmarks, and there's a really cool page in the Neuron SDK documentation with some good numbers: encoder models, decoder models, and encoder-decoder models. So let's look at Llama 3. We have numbers for Llama 3 8B and above, and because this is really production-oriented, they give you numbers for the larger instance: time to first token, tokens per second, et cetera. They give you both throughput-optimized and latency-optimized numbers; throughput-optimized uses batch size 8, while latency-optimized uses batch size 1. So depending on what you're trying to achieve, you will pick one or the other. For example, for the latency-optimized configuration, time to first token is 30 milliseconds at P50 and, let's say, 42 milliseconds at P99. That's pretty good, and throughput is about 145 tokens per second. If we optimize for throughput instead, throughput goes up 4x, which is pretty nice, and latency is a little worse, as you'd expect.
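Here is a minimal end-to-end sketch of that SDK flow, loosely following the linked notebook. The container image URL, the environment variable names (MODEL_ID, HF_NUM_CORES, HF_AUTO_CAST_TYPE, MAX_BATCH_SIZE, MAX_INPUT_LENGTH, MAX_TOTAL_TOKENS), the endpoint name, the exact instance_type/instance_size strings, and the token handling are assumptions based on the standard TGI-on-Neuron setup, not copied from the video, so double-check them against the notebook before running:

```python
from huggingface_hub import login, create_inference_endpoint
from openai import OpenAI

HF_TOKEN = "hf_xxx"  # placeholder: a token with Inference Endpoints permissions
login(token=HF_TOKEN)

# Create a protected endpoint on the small Inferentia2 instance (2 Neuron cores).
# Batch size 1 and a 4K sequence length are chosen to match a pre-compiled
# configuration in the optimum-neuron cache, so no on-the-fly compilation is needed.
endpoint = create_inference_endpoint(
    "llama-3-8b-instruct-inf2",  # endpoint name (assumption)
    repository="meta-llama/Meta-Llama-3-8B-Instruct",
    framework="pytorch",
    task="text-generation",
    accelerator="neuron",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_type="inf2",
    instance_size="x1",
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/neuronx-tgi:latest",  # TGI Neuron image (assumption)
        "env": {
            "MODEL_ID": "/repository",
            "HF_NUM_CORES": "2",
            "HF_AUTO_CAST_TYPE": "fp16",
            "MAX_BATCH_SIZE": "1",
            "MAX_INPUT_LENGTH": "3072",
            "MAX_TOTAL_TOKENS": "4096",
        },
    },
)
endpoint.wait()  # block until the endpoint is running

# Invoke the endpoint with the OpenAI client and the Messages API, streaming tokens.
client = OpenAI(base_url=endpoint.url + "/v1/", api_key=HF_TOKEN)
stream = client.chat.completions.create(
    model="tgi",  # TGI accepts a placeholder model name
    messages=[{"role": "user", "content": "Why are open-source LLMs gaining momentum?"}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```

Because the OpenAI client only needs a different base_url and api_key, existing OpenAI-based application code can be pointed at the endpoint with essentially no other changes.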
So, find the right configuration that works for you there. All right, well, that's pretty much what I wanted to show you, and as you can see, it is really, really simple to deploy models on AWS Inferentia2 with Inference Endpoints. I showed you how to do this on SageMaker in another video, but if you just want the quickest, simplest way, including reusing your OpenAI prompts, well, I think this is it. So thank you very much for listening, I hope this was useful, and I'll see you with more content in the future, of course. And until then, as always, keep rocking.

Tags

CostPerformanceOptimization, LLMDeployment, AWSInferentia2, HuggingFaceInferenceEndpoints, OpenAIAPICompatibility