Deploy Hugging Face models on Google Cloud directly from Vertex AI

April 09, 2024
In this series of three videos, I walk you through the deployment of Hugging Face models on Google Cloud, in three different ways:

- Deployment from the Hub model page to Inference Endpoints (https://youtu.be/mlU-2QYx4a0), with the Google Gemma 7B model
- Deployment from the Hub model page to Vertex AI (https://youtu.be/cBdLw5BnGrk), with the Microsoft Phi-2 2.7B model
- Deployment directly from within Vertex AI (this video), with the TinyLlama 1.1B model

Get started at https://huggingface.co :)

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

Transcript

Hi everybody, this is Julien from Arcee. In the two previous videos, I showed you how to easily deploy Hugging Face models to Google Cloud, starting from the Hub and using either Inference Endpoints or Vertex AI. There's still another way to do this. So, in this video, I'm going to show you the third way to deploy Hugging Face models on Google Cloud: starting directly from Vertex AI and referring to models hosted on the Hugging Face Hub. Okay, let's get started.

If you enjoy this video, please give it a thumbs up and consider subscribing to my YouTube channel. If you do, please don't forget to enable notifications so that you won't miss anything in the future. Also, why not share this video on your social networks or with your colleagues? If you enjoyed it, it's very likely someone else will. Thank you very much for your support.

Unlike the two previous videos where we started from the Hugging Face Hub, this time we're starting directly from the Google Cloud console, specifically the Model Garden page in Vertex AI. You can't really miss it: there's an option called "Deploy from Hugging Face Hub". Let's click on it, and we see the same screen we saw when we deployed from the Hub to Vertex AI, but this time we need to fill in the details. So why don't we try TinyLlama? I like the autocomplete here; let's just select this one. Okay: same region, no token needed, model name, endpoint name, all looks good to me. An instance type is selected for me, and there's a sample request; let's grab it to use later, and just click Deploy. It will take a few minutes, so I'll pause the video. As you can see, if you'd rather not visit the Hugging Face Hub and instead work directly from Vertex AI and the Model Garden, you can absolutely do that. This looks like a curated list, though; I don't think you get all the models, but I guess the main ones are there.

Okay, so let's wait for the model to come up, and then we'll test it. After a few minutes, we see the endpoint come up, and it's ready. Let's go back to the Model Garden, view our models, and now we can test it. This is a Llama model fine-tuned for chat, so we need to prompt it accordingly. Okay, so that's the format: no system prompt, just a user question and the stop token. Let's see if it works. "France won the World Cup for the first time on July 15, 1998, when it defeated Italy." No, of course it wasn't Italy, it was Brazil. All right, but at least the general process works. Anyway, that's creative AI for you. So there you go, that's how you deploy.

In the previous video, I told you not to forget to delete everything, but I didn't show you how, so I'll show you now. Starting from that same page where we tested the model, go and undeploy the model. Okay, confirm. This removes the model from the endpoint, but it doesn't delete the endpoint; those are separate operations. So if you go to the list of endpoints, you can now delete it. Confirm. Okay, and it's gone. You still have the model in the Model Registry, which I don't think is a problem, but if you really want to delete that one as well, you can do it here. Which means, of course, that next time you deploy it, you'll have to import it again, and that will take a little bit of time. Yeah, let's just delete it. Okay, now it's all gone.
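If you'd rather script this flow than click through the console, here is a minimal sketch using the google-cloud-aiplatform Python SDK, mirroring what the Model Garden UI does. The project, region, machine type, and serving container URI are assumptions, not taken from the video; in particular, look up the current Hugging Face text-generation serving container for Vertex AI before using this.

```python
# A minimal sketch (not from the video) of deploying a Hub model to Vertex AI
# with the google-cloud-aiplatform SDK. Project, region, machine type, and
# the serving image URI are assumptions.
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")

# Hypothetical placeholder for a TGI-style serving container image.
SERVING_IMAGE = "us-docker.pkg.dev/my-gcp-project/serving/huggingface-tgi:latest"

# Import the Hub model into the Vertex AI Model Registry.
model = aiplatform.Model.upload(
    display_name="tinyllama-1-1b-chat",
    serving_container_image_uri=SERVING_IMAGE,
    serving_container_environment_variables={
        "MODEL_ID": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    },
)

# Deploy it to a new endpoint on a single GPU instance.
endpoint = model.deploy(
    machine_type="g2-standard-4",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
print(endpoint.resource_name)
```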
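Once the endpoint is up, you can query it with the chat format mentioned above: no system prompt, a user turn terminated by the stop token, then the assistant tag. The payload schema below ("inputs" plus "parameters") follows the Text Generation Inference convention and is an assumption; the sample request shown in the console is the authoritative format.

```python
# Querying the deployed endpoint. TinyLlama-1.1B-Chat uses a Zephyr-style
# chat template; the "inputs"/"parameters" payload is an assumption based on
# the TGI convention -- prefer the sample request from the console.
prompt = (
    "<|user|>\n"
    "When did France win the World Cup for the first time?</s>\n"
    "<|assistant|>\n"
)

response = endpoint.predict(
    instances=[{"inputs": prompt, "parameters": {"max_new_tokens": 128}}]
)
print(response.predictions[0])
```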
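Finally, the cleanup steps I just walked through in the console map one-to-one onto SDK calls. Deleting the registry model is optional, at the cost of importing it again next time.

```python
# Tear-down, mirroring the console steps: undeploy, delete the endpoint,
# and optionally delete the model from the Model Registry.
endpoint.undeploy_all()  # removes the deployed model; the endpoint remains
endpoint.delete()        # deletes the now-empty endpoint
model.delete()           # optional: re-importing later takes a few minutes
```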
So there you go, three different ways to deploy Hugging Face models on Google Cloud:

1. From the model page on the Hub to Inference Endpoints, on our managed infrastructure.
2. From the model page on the Hub to Vertex AI.
3. Directly from Vertex AI, referencing models on the Hub, which we saw in this video.

I hope this was useful, that you learned a few things, and that you will enjoy deploying models on Google Cloud. Again, if you enjoyed these videos, don't forget the thumbs up and the subscription. I really appreciate your support. And until next time, keep rocking.

Tags

Hugging Face, Google Cloud, Vertex AI, Model Deployment, Machine Learning

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.