Deploy Hugging Face models on Google Cloud from the hub to Vertex AI
April 09, 2024
In this series of three videos, I walk you through the deployment of Hugging Face models on Google Cloud, in three different ways:
- Deployment from the hub model page to Inference endpoints (https://youtu.be/mlU-2QYx4a0), with the Google Gemma 7B model,
- Deployment from the hub model page to Vertex AI (this video), with the Microsoft Phi-2 2.7B model,
- Deployment directly from within Vertex AI (https://youtu.be/PFHzfzyY2iY), with the TinyLlama 1.1B model.
Get started at https://huggingface.co :)
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️
Transcript
Hi, everybody. This is Julien from Arcee. In a previous video, I showed you how you could very easily deploy Hugging Face models to Google Cloud using Inference Endpoints, our very own model deployment service. I mentioned there were other ways to do this. So, in this video, I'm going to show you method number two: starting from the hub and deploying Hugging Face models using Google's machine learning service, Vertex AI. This will give you more control and let you work with the Google Cloud console. Let's get started.
If you enjoy this video, please give it a thumbs up and consider subscribing to my YouTube channel. If you do, please don't forget to enable notifications so that you won't miss anything in the future. Also, why not share this video on your social networks or with your colleagues because if you enjoyed it, it's very likely someone else will. Thank you very much for your support.
Just like in the previous video, let's start from the Hugging Face Hub. Last time we deployed Google Gemma; let's try something else. Why don't we try Microsoft Phi-2, which I find really interesting. Click on Deploy, and this time we won't pick Inference Endpoints; we'll go for Google Cloud. If you're logged into your Google Cloud account, this should open the Google Cloud console immediately. If not, it will prompt you to log in.
Let's maybe zoom in a bit here. We see everything has been filled in automatically. The model is preselected, and we get an option to deploy on Vertex or Kubernetes; I'll save Kubernetes for next time and stick to Vertex, which is a bit simpler. We get to pick the region, and if you want to deploy a private model, there's an option to enter your token. We don't need that here. The model name and endpoint name are filled in, and the best instance type is automatically selected. We can see we're using TGI, and we get a sample request, so why don't we save this for later and click on Deploy. Let's just do this. It starts deploying here. We see the little progress indicator, and if we click on it, we're taken to a pretty useless place. What you really want to look at is Vertex online prediction.
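The sample request mentioned above targets a TGI-based container, which expects an "inputs" string and generation "parameters" wrapped in the "instances" envelope Vertex online prediction uses. Here is a minimal sketch of building that payload; the helper name, prompt, and parameter values are my own illustration, not taken from the console:

```python
# Sketch: build the JSON payload for a Vertex AI online prediction request
# against a TGI-based Hugging Face container. Hypothetical helper; the exact
# payload shown in the console may include additional fields.

import json


def build_tgi_payload(prompt: str, max_new_tokens: int = 128,
                      temperature: float = 0.7) -> dict:
    """Wrap a prompt in the 'instances' envelope Vertex online prediction expects."""
    return {
        "instances": [
            {
                "inputs": prompt,
                "parameters": {
                    "max_new_tokens": max_new_tokens,
                    "temperature": temperature,
                },
            }
        ]
    }


payload = build_tgi_payload("Write a haiku about model deployment.")
print(json.dumps(payload, indent=2))
```

You can paste the resulting JSON straight into the endpoint's test form in the console, or send it programmatically as shown later.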
We can see it's deploying. In the interest of time, I've got one that I already deployed. We can see some monitoring information here; it's not super busy at the moment. If we go back to the Model Garden, we see "View my models", which is what I want. This is the one that's already been deployed. Let's just click on it, paste the sample request, and call Predict. See how that goes. Yeah, we did get an answer. Of course, we could use the Google Cloud SDK and whatnot to do the same, but that's not really the point today. I just wanted to show you how simple it is to deploy those models from the Hugging Face Hub straight to Vertex. And, of course, when you're done, don't forget to delete your models and delete your endpoints to avoid unnecessary charges.
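For those who do want to skip the console, the same prediction can be made over Vertex AI's REST API with just the standard library. This is a sketch under stated assumptions: the project ID, region, and endpoint ID are placeholders for your own resources, and the access token would come from `gcloud auth print-access-token`:

```python
# Sketch: call a Vertex AI online prediction endpoint over REST.
# PROJECT_ID, REGION, and ENDPOINT_ID are placeholders, not real resources.

import json
import urllib.request

PROJECT_ID = "my-gcp-project"   # placeholder
REGION = "us-east1"             # placeholder
ENDPOINT_ID = "1234567890"      # placeholder


def predict_url(project: str, region: str, endpoint_id: str) -> str:
    """Build the REST URL for Vertex online prediction."""
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{region}/endpoints/{endpoint_id}:predict"
    )


def predict(prompt: str, access_token: str) -> dict:
    """POST a TGI-style request to the endpoint (requires a valid token)."""
    body = json.dumps(
        {"instances": [{"inputs": prompt, "parameters": {"max_new_tokens": 128}}]}
    ).encode("utf-8")
    req = urllib.request.Request(
        predict_url(PROJECT_ID, REGION, ENDPOINT_ID),
        data=body,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


print(predict_url(PROJECT_ID, REGION, ENDPOINT_ID))
```

For cleanup, the console works fine, and the `gcloud ai endpoints delete` and `gcloud ai models delete` commands do the same from the terminal.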
That's it for deployment technique number two. We have one more to go, which is deploying from Vertex directly without visiting the Hugging Face hub. So check out the last video of this series. Hope this was useful. Keep rocking.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.