In this video tutorial, I'll show you how easy it is to deploy the Meta Llama 3 8B model using Amazon SageMaker and the latest Hugging Face Text Generation Inference containers (TGI 2.0). Follow along as I guide you through the process of setting up synchronous and streaming inference, making text generation tasks a breeze!
The Meta Llama 3 8B model is a powerful tool for natural language processing, and with Amazon SageMaker's scalable infrastructure, you can leverage this model efficiently. I'll take you through the step-by-step process, from setting up the environment to running inference, ensuring you have the knowledge to implement this in your own projects.
So, whether you're a data scientist, machine learning engineer, or developer interested in text generation and NLP, this video is for you!
#MachineLearning #NLG #AmazonSageMaker #HuggingFace #TextGeneration
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️
Model:
https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
Notebook:
https://github.com/juliensimon/huggingface-demos/blob/main/llama3/deploy_llama3_8b.ipynb
Deep Learning Containers:
https://github.com/aws/deep-learning-containers/blob/master/available_images.md
Transcript
Hi everybody, this is Julien from Arcee. Big news today: Llama 3 is out, and I couldn't wait to show it to you. In this quick video, I'm going to walk you through the deployment of Llama 3 8B on AWS, and as usual, I'm going to show you how to do this with Amazon SageMaker. You barely have to write any code. For extra oomph, I will add the latest version of our Text Generation Inference server, TGI 2.0, and I'll also show you how to do streaming inference. Okay, let's get started. If you like this video, don't forget to give it a thumbs up. Please consider joining my channel and enabling notifications.
Llama 3, here we go. As usual, we start from the Hugging Face Hub and go to the Meta Llama 3 8B page. Here, I'm using the instruction fine-tuned version. On your first visit, you will have to enter your personal details to get access to the model. It is a gated model, but you get access immediately, so no problem there. Once you've done that, you can go to Deploy, then SageMaker, and you'll see the deployment code snippet. Just copy it and move on.
Here, I'm using SageMaker Studio. I just pasted the code snippet. The only thing you need to change is adding your Hugging Face token so you can access the gated model. Here's mine, but I will revoke it after the demo, of course. Because at the time of recording the SageMaker SDK hasn't been updated for TGI 2.0, I can't use the LLM image URI API. Instead, I will just add the full name of the container. You can find those names in the deep learning containers repository by AWS on GitHub. It has the full list of SageMaker containers, including the Hugging Face containers. Here, you can see we use TGI 2.0. Hopefully, by the time you watch this, you'll be able to simply specify version 2.0 and skip that step.
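Here is a minimal sketch of what that modified snippet can look like. The image URI, IAM role, and token are placeholders; look up the exact TGI 2.0 container name for your region in the deep learning containers list linked above.

```python
# Sketch of the deployment setup, based on the Hub snippet with the two changes
# described above (Hugging Face token and explicit TGI 2.0 container name).
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # IAM role used by SageMaker Studio

# Full container name copied from the AWS deep-learning-containers repository
# (placeholder here); it replaces the LLM image URI lookup until the SDK
# supports TGI 2.0 directly.
image_uri = "<huggingface-pytorch-tgi-inference TGI 2.0 image URI for your region>"

hub = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",
    "HUGGING_FACE_HUB_TOKEN": "<your Hugging Face token>",  # required for the gated model
    "SM_NUM_GPUS": json.dumps(1),  # single GPU on g5.xlarge
}

model = HuggingFaceModel(role=role, image_uri=image_uri, env=hub)
```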
So, we build the model object and call `deploy`. I'm using a small GPU instance here, g5.xlarge. That's the one I usually use for those 7B models, so hopefully it still works for 8 billion. And in fact, it does. After a few minutes, I've got my endpoint and can start predicting. Just grab the tokenizer, build a system prompt and a user question in the OpenAI message format. I don't need to worry about the Llama 3 prompt format; that's taken care of automatically by the chat template. The only thing to worry about is the end-of-sequence tokens, but fortunately, the model page gives you the right information. So just specify those tokens and feed them to the generation arguments.
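As a rough sketch of that step, assuming the instance type from the video, example messages, and the terminator tokens listed on the model page:

```python
# Deploy the model, then build a Llama 3 prompt from OpenAI-style messages.
from transformers import AutoTokenizer

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",  # small single-GPU instance, as in the video
    container_startup_health_check_timeout=600,  # give TGI time to download the weights
)

# A Hugging Face token may be needed here too, since the model is gated.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# OpenAI-style messages; apply_chat_template turns them into the Llama 3 prompt format.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # example system prompt
    {"role": "user", "content": "Why is the sky blue?"},            # example user question
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# End-of-sequence tokens documented on the model page, passed later as stop sequences.
stop_tokens = ["<|eot_id|>", "<|end_of_text|>"]
```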
Let's use synchronous inference first, where we generate the full answer and print it out in one go. And then, of course, we'll see streaming inference. We get 256 tokens in about 9 seconds, so roughly 30 tokens per second. We're not trying to optimize anything here; that feels about right. Then we print out the answer.
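A sketch of the synchronous call, using the standard TGI request and response format; the sampling parameters are illustrative:

```python
# Synchronous inference: send the prompt, get the full answer back in one go.
payload = {
    "inputs": prompt,
    "parameters": {
        "max_new_tokens": 256,   # matches the ~256 tokens mentioned above
        "temperature": 0.6,      # illustrative sampling settings
        "top_p": 0.9,
        "stop": stop_tokens,     # Llama 3 end-of-sequence tokens
    },
}

response = predictor.predict(payload)
print(response[0]["generated_text"])  # TGI returns a list holding the generated text
```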
If you want to do streaming inference, just like I showed you with the Zephyr model in a previous video, we can reuse the very nice code snippet from an AWS blog post that shows how to stream tokens coming from the endpoint. I literally took that code from the blog post, thank you. The only thing we really need to do is reconfigure the endpoint for streaming: set up the serialization for streaming and enable streaming in the request. Let's do this, and then we can invoke the endpoint, and it is streaming. The speed is quite good.
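One way to sketch the streaming path, assuming boto3's response-stream API and TGI's server-sent-events output; a simple buffering loop stands in for the LineIterator helper from the AWS blog post:

```python
# Streaming inference: ask TGI to stream tokens and print them as they arrive.
import json
import boto3

smr = boto3.client("sagemaker-runtime")

body = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 256, "stop": stop_tokens},
    "stream": True,  # tell TGI to stream server-sent events
}

resp = smr.invoke_endpoint_with_response_stream(
    EndpointName=predictor.endpoint_name,
    Body=json.dumps(body),
    ContentType="application/json",
)

# Each payload part holds one or more "data: {...}" lines; buffer across chunks
# in case a line is split between two parts.
buffer = ""
for event in resp["Body"]:
    buffer += event["PayloadPart"]["Bytes"].decode("utf-8")
    while "\n" in buffer:
        line, buffer = buffer.split("\n", 1)
        if line.startswith("data:"):
            data = json.loads(line[len("data:"):])
            token = data.get("token", {})
            if not token.get("special", False):
                print(token.get("text", ""), end="", flush=True)
print()
```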
As usual, our collaboration with AWS makes it very simple to try out these new models. You literally go to the model page, copy and paste the code snippet, and that's it. You're done. You can do streaming, and with TGI 2.0, you get very decent performance even on a small, cost-effective GPU instance. I guess in the next video, I'll show you how to do this on Inferentia, but it won't be very different.
All right, well, that's it. Welcome, Llama 3, to the family, and enjoy testing. That's it for me. Until next time, keep rocking.