Deploy Hugging Face models on Google Cloud from the hub to Inference Endpoints
April 09, 2024
In this series of three videos, I walk you through the deployment of Hugging Face models on Google Cloud, in three different ways:
- Deployment from the hub model page to Inference Endpoints (this video), with the Google Gemma 7B model,
- Deployment from the hub model page to Vertex AI (https://youtu.be/cBdLw5BnGrk), with the Microsoft Phi-2 2.7B model,
- Deployment directly from within Vertex AI (https://youtu.be/PFHzfzyY2iY), with the TinyLlama 1.1B model.
Get started at https://huggingface.co :)
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️
Transcript
Hi everybody, this is Julien from Arcee. As you can see, I'm on the road right now, but that's not an excuse not to do any videos. As you probably know, we've recently announced a partnership with Google Cloud. In this video and the following ones, I will show you how to quickly and easily deploy Hugging Face models on Google Cloud. There are different ways to do this, which is why I'm going to do several videos. In this first one, I'm going to show you how to use our own deployment service, Inference Endpoints, and we'll see how we can deploy models from the hub to Google Cloud in one click. As simple as that. Let's get started. If you enjoy this video, please give it a thumbs up and consider subscribing to my YouTube channel. And if you do, please don't forget to enable notifications so that you won't miss anything in the future. Also, why not share this video on your social networks or with your colleagues? Because if you enjoyed it, it's very likely someone else will. Thank you very much for your support.
Starting from the hub, let's find a good model to deploy on Google Cloud. How about we try Gemma, the new model from Google? Let's just click on it. If this is the first time you open this model page, you'll have to ask for access. This is a gated model, but just enter your email and confirm, and you should have access in seconds. OK, so don't let that stop you. As always, I would encourage you to read about the model, maybe test it locally, etc. Right, lots of good information there. But for now, we want to deploy it on Inference Endpoints. So let's just click on Deploy, then Inference Endpoints. You can see we have a new option for Google Cloud, right next to AWS and Azure. So why don't we select Google? At the moment, we have a single US region, but I'm pretty sure we will add more. And we automatically select what we think is the best configuration for this model. So here we're going to deploy on this particular instance. As you can see, we are deploying with our TGI serving container. By the way, just yesterday, we announced that TGI is back to the Apache 2.0 license, which I think is good news for everyone.
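If you do want to test the model locally before deploying, here is a minimal sketch with the transformers library. It assumes you've been granted access to the gated repo and are logged in with your Hugging Face token, and that you have a GPU with enough memory for a 7B model; the prompt is just an example.

```python
import torch
from transformers import pipeline

# Load the gated Gemma 7B model locally.
# Requires accepted access on the model page and a logged-in Hugging Face token.
pipe = pipeline(
    "text-generation",
    model="google/gemma-7b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

print(pipe("What should I visit in Seattle?", max_new_tokens=128)[0]["generated_text"])
```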
We can decide what the security level should be. Remember, public means public, so wide open to the public Internet, no authentication. I wouldn't recommend it. Protected means accessible from the internet, but with token authentication. We don't have a private option for now, which we do have on other clouds. So let's go with protected. We could always take a look at the configuration: do we want auto-scaling? Do we want a particular revision of the model? I guess we'll go with TGI. We could enable quantization if we wanted, etc. But I will stick with all those defaults. Okay, so very simple: just select Google and that's pretty much it, plus the security level, of course. Okay, let's click on create endpoint. It will now take a few minutes: the service automatically launches this GCP instance in our own account, prepares the endpoint, etc. So I'll pause the video and wait for the endpoint to come up, and of course, we'll test it afterwards.
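If you'd rather script the same deployment instead of clicking through the UI, the huggingface_hub library has a create_inference_endpoint helper. Here is a minimal sketch: the endpoint name, region, instance type, and instance size below are placeholders I made up for illustration, so copy the exact values the Inference Endpoints UI suggests for your account.

```python
from huggingface_hub import create_inference_endpoint

# Create a protected Inference Endpoint on Google Cloud for Gemma 7B.
endpoint = create_inference_endpoint(
    "gemma-7b-gcp-demo",              # endpoint name (your choice)
    repository="google/gemma-7b",     # the gated model we just requested access to
    framework="pytorch",
    task="text-generation",
    vendor="gcp",                     # deploy on Google Cloud
    region="us-east4",                # placeholder: use the region offered in the UI
    accelerator="gpu",
    instance_size="x1",               # placeholder: copy from the UI
    instance_type="nvidia-l4",        # placeholder: copy from the UI
    type="protected",                 # token authentication, not public
)

endpoint.wait()   # block until the endpoint reports "running"
print(endpoint.url)
```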
After a few minutes, the endpoint is up. I can see it says running here, so why don't we test it? We could test it with the playground. We just need to select a token, obviously, because we're using the protected security level, right? So let's just try this. Challenging question. Trust me, all right, let's see what it says. All right, what did I tell you? Starbucks, horrible coffee. So hopefully there's something more interesting in Seattle than Starbucks. Anyway, that was the playground; now let's try the API. So again, I need my token; I could change some of the generation parameters, increase temperature, etc. Let's just include the token. Don't worry, I will invalidate it afterwards. I just need to copy this. Okay. And let's just switch to a notebook. So let's just paste the code. Maybe I'll change the question. Let's try that again. Okay, just run this, invoking the endpoint, passing the token. Okay, and I guess we need to print the output. Okay, let's just print the output. All right, the Seattle fire of 1907 was one of the more destructive in American history. Okay, well, that's clearly more interesting than Starbucks. Okay, so as you can see, super, super nice and simple. Just one click to deploy, then copy-paste the code, and you can test in minutes.
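For reference, here is roughly what that copied snippet looks like. This is a sketch, not the exact code from the video: the endpoint URL is a made-up placeholder you should replace with the one shown on your endpoint's overview page, and the token must be a Hugging Face token that has access to the protected endpoint.

```python
import requests

# Placeholder endpoint URL: copy the real one from the endpoint's overview page.
API_URL = "https://your-endpoint-name.us-east4.gcp.endpoints.huggingface.cloud"
headers = {
    "Authorization": "Bearer hf_xxx",   # your Hugging Face token (protected endpoint)
    "Content-Type": "application/json",
}

def query(payload):
    # Send a text-generation request to the TGI-backed endpoint.
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "What happened during the great Seattle fire?",
    "parameters": {"max_new_tokens": 256, "temperature": 0.7},
})
print(output)
```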
When you're done, don't forget to delete the endpoint. Let me show you. When you're done testing, just go to settings, scroll all the way down, type or paste the endpoint name, click on delete, and it goes away, and you stop being charged. Right? Perfect. Okay, so that's the first way to deploy Hugging Face models on Google Cloud using Inference Endpoints. I hope this was interesting. I've got two more ways to show you, so keep an eye out for the next two videos. Okay? Keep rocking.
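If you created the endpoint with the huggingface_hub sketch above, you can also tear it down programmatically. A minimal sketch, using the same placeholder endpoint name as before:

```python
from huggingface_hub import delete_inference_endpoint

# Placeholder name: use the name you gave the endpoint at creation time.
delete_inference_endpoint("gemma-7b-gcp-demo")
```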
Tags
Hugging Face Models, Google Cloud Deployment, Inference Endpoints, Machine Learning, Model Deployment
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.