Deploy models with Hugging Face Inference Endpoints

October 10, 2022
In this video, I show you how to deploy Transformer models straight from the Hugging Face hub to managed infrastructure on AWS, in just a few clicks. Starting from a model that I already trained for image classification, I first deploy an endpoint protected by Hugging Face token authentication. Then, I deploy a second endpoint in a private subnet, and I show you how to access it securely from your AWS account thanks to AWS PrivateLink.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️

- Model: https://huggingface.co/juliensimon/autotrain-food101-1471154053
- Inference Endpoints: https://huggingface.co/inference-endpoints
- Inference Endpoints documentation: https://huggingface.co/docs/inference-endpoints/index
- AWS PrivateLink documentation: https://docs.aws.amazon.com/vpc/latest/privatelink/concepts.html

Code:

import requests, json, os

API_URL = ENDPOINT_URL
MY_API_TOKEN = os.getenv("MY_API_TOKEN")
headers = {"Authorization": "Bearer " + MY_API_TOKEN, "Content-Type": "image/jpg"}

def query(filename):
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

output = query("food.jpg")

Transcript

Hi, everybody. This is Julien from Arcee. In this video, we're going to talk about Inference Endpoints. Inference Endpoints is a new service we just launched to make it very simple to deploy transformers from the Hugging Face hub to managed infrastructure on your favorite cloud. Simplicity doesn't come at the expense of production-grade quality and scalability. Let me show you. In this demo, I'm going to reuse a model that I already trained. In one of my recent videos, I showed you how to fine-tune an image classification model with AutoTrain on the Food101 dataset. You can go and watch that one if you want to know how the model was built. Here, I'll just start directly from the model. This is the model we trained, and this is the Food101 dataset, which, as the name implies, has 101 classes for different types of food. It's a fun dataset to work with. The model is ready here, and if I go and click Deploy, I can deploy straight away to Inference Endpoints. The model repo has been selected. Let's just use another name for this. It's a Swin model, so let's call it swin-food-101-demo1. Then I can select a cloud provider. At the time of recording, we support AWS and Microsoft Azure, and I'll just go and select AWS because that's the one I know better. Why don't we go and deploy in the Ireland region, eu-west-1? We could keep it at that, but let's click on Advanced Configuration. We can pick an instance type, so we get different options for CPUs and GPUs. Let's be conservative and keep a medium-sized CPU. I can set up autoscaling, but maybe I don't want to autoscale. Maybe I want a single instance because it's just a very low-traffic endpoint, so no need for that. I can see the task type and framework filled in automatically. I could enter the revision of the model. By default, we'll use the latest version, but if you want to deploy a previous version of the model that's in the repo, you can just enter the commit here. I could also bring a custom image instead of using the built-in container that comes with Inference Endpoints, but let's keep it simple for now. The most important setting here is the security level of the endpoint. Public means it's open on the internet: anyone can use that endpoint without any authentication. Maybe that's what you want, maybe not. Think twice about the security, privacy, and cost implications. Most likely, you want either protected or private, and we'll do both. I'll start with protected. Protected means the endpoint is going to be created in a public subnet managed by Hugging Face, but you will need to provide a Hugging Face token to access it. That's a decent first level of security. We'll do private afterwards. So let's just go and create that endpoint. Click on this, and it's going to run for a few minutes. We can see it's building the container with the model I selected and provisioning all the infrastructure. We can see the logs here if you're interested. But for now, I'll just pause the video and I'll be back when the endpoint is ready. The endpoint is ready. We see the endpoint URL, and of course, we should be able to test it. Let's just go and drag an image here. One of my favorite foods ever. Hopefully, yours too. Yes, very, very high probability, probably higher than 0.9999, that it is indeed hummus, marvelous hummus.
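For reference, the endpoint configuration clicked through above can also be created programmatically. Here is a minimal sketch using the create_inference_endpoint helper from the huggingface_hub library; the instance type and size values are placeholders, and the options actually available depend on your account, region, and library version.

from huggingface_hub import create_inference_endpoint

# Deploy the AutoTrain food101 model as a protected endpoint on AWS (sketch).
endpoint = create_inference_endpoint(
    "swin-food-101-demo1",                                   # endpoint name
    repository="juliensimon/autotrain-food101-1471154053",   # model repo on the hub
    framework="pytorch",
    task="image-classification",
    vendor="aws",                                            # cloud provider
    region="eu-west-1",                                      # Ireland
    type="protected",                                        # requires a Hugging Face token
    accelerator="cpu",
    instance_type="intel-icl",                               # placeholder CPU instance type
    instance_size="x1",                                      # placeholder instance size
    min_replica=1,
    max_replica=1,                                           # single instance, no autoscaling
)

endpoint.wait()        # block until the endpoint is running
print(endpoint.url)    # the URL to send requests to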
But probably what we really want to do is run some code. So this is the URL of the endpoint. This is my token, which, I'm sorry, you won't get to see. I build headers with the token and the content type, and then I just use the requests library to send an HTTP POST request to the endpoint. We should see the same prediction. So why don't I run this code? Just copy this stuff here and put it in a terminal. Hopefully, we see some probabilities now. Yes, 0.9998, it is indeed hummus. That's it. This is how you deploy an inference endpoint in literally minutes on managed infrastructure. You can have autoscaling in place, you have authentication in place, and it's backed by AWS infrastructure, so probably fairly solid. It's just a vanilla HTTP endpoint that you can use just like this. There are a few more things you may want to look at. If we go here, we'll see some statistics: latency, 2xx invocations, 4xx invocations, and so on. We also have the build logs, and we actually have the invocation logs too. You can see the two invocations I've done here, so that's easier for debugging if you need it. So that's really it. As simple as that. Now, let me show you how to do a private deployment. Creating a private endpoint means that the endpoint will not be accessible on the internet. It's going to be deployed in a private subnet inside the Hugging Face account, and we will connect this endpoint to your own AWS account using an AWS service called AWS PrivateLink. What that really means is we'll do the setup on our side, and then in your AWS account, you need to create a VPC endpoint that connects to our account so that the inference endpoint is visible in one of your VPCs and in whatever subnets you select there. It's not a complicated procedure, even if you're not familiar with private networking. I will put all the documentation in the video description. So let's just go and create the endpoint, and do the Hugging Face side of the job, so to speak. I just set a different name for the endpoint. We're still using AWS in eu-west-1, and this time we're going with private. You need to enter the ID of the AWS account that will access the endpoint. Don't get it wrong because otherwise, nothing will work. You can find this ID in the AWS console or with the AWS CLI. It shouldn't be difficult. So that's all it takes. Enter your AWS account ID and click on Create Endpoint. It's going to do some setup on the Hugging Face side, then it's going to pause, and we'll switch to the AWS console to complete the setup on the customer side. So if you're a little bit confused, this is what things will look like once we're done. Within the same region, we have two accounts. We have the service provider account and VPC, which is, of course, Hugging Face, hosting the endpoint in a private subnet. On the other side, we have the consumer VPC, in the customer account, in your account. This is where we'll need to create the VPC endpoint that connects your instances to the inference endpoint in our own account. Don't confuse the inference endpoint, which is really the prediction API, with the VPC endpoint, which is just the channel that connects the two accounts. So let's wait a couple of minutes for the setup to be done on the provider side, and then we'll switch to the AWS console. After a minute or two, you should see the VPC service name. When you see this, it means we've done our side of the job, and now you need to switch to the AWS console to complete and authorize the connection between our account and your account. Let's keep this name handy; we're going to need it.
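The Hugging Face side of this private setup can be scripted in the same way. Here is a rough sketch, again with create_inference_endpoint, assuming the library's account_id parameter is used to pass the AWS account that will connect over PrivateLink; the account ID and instance values shown are placeholders.

from huggingface_hub import create_inference_endpoint

# Same model, but deployed as a private endpoint tied to one AWS account (sketch).
endpoint = create_inference_endpoint(
    "swin-food-101-private",                                 # hypothetical endpoint name
    repository="juliensimon/autotrain-food101-1471154053",
    framework="pytorch",
    task="image-classification",
    vendor="aws",
    region="eu-west-1",
    type="private",                                          # not reachable from the public internet
    account_id="123456789012",                               # placeholder: your AWS account ID
    accelerator="cpu",
    instance_type="intel-icl",                               # placeholder
    instance_size="x1",                                      # placeholder
)
# Once provisioning on the Hugging Face side is done, the endpoint page shows
# the VPC service name needed for the AWS-side setup.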
So let's just go to this page. I am in the VPC console, in the Endpoints section. We'll just click on Create Endpoint. Let's give it a name; let's give it the same name as the model for convenience. We're trying to connect to another endpoint service. You can use VPC endpoints for AWS services or AWS Marketplace services; here we have an existing service, and that's the name we just copied. Let's verify that the name exists and is correct. Then we need to select a VPC. That VPC endpoint will be created in the VPC you select. In this account, I just have a default VPC, so simple enough. And then I can pick the subnets where the VPC endpoint will be accessible. I'll just go and select all three subnets in this VPC, but it could be that you want to give access to just one subnet. So you don't have to click them all; just click the ones where you're going to run the instances or applications that need to access this endpoint privately. Let's just go and use IPv4. Next, we need to assign a security group, just like every time we do networking on AWS, and this will restrict, for example, the ports that are open between the endpoint and the instances. Keep in mind, only resources running in those subnets will be able to access the endpoint, but you may want to restrict things further. I think I've got one here already, and this one is actually not very restrictive at all, but good enough for my demo. I think that's it, so I can just click on Create Endpoint. It's going to make that connection, as we saw in the documentation, and connect the two accounts. It just took a few more seconds, so now it's available, and we're good to go. Now we have this connection between the consumer account and the provider account, so we can test the endpoint. What I've done is create an EC2 instance, which we can see here. It's just a plain instance; you just need to make sure it's in the VPC that's connected and in one of the subnets that you gave access to. The instance is up, and I've created it here. Now we can try to call the endpoint. First things first, I will need the URL of that endpoint, which I see here. So let me grab that, update my script, and we'll test it. In my code, I've simply updated the URL of the endpoint to point at this private endpoint. Just to be on the safe side: don't go and try invoking the VPC service name. That's just an AWS identifier; it's nothing like a URL. The endpoint URL is what you want to use. So now I can just go back to my instance and invoke it. And of course, we see predictions. Now, if I try this on my own machine... let's log out. I'm back on my Mac. Let's run Python again. Ah, of course, it's not working, because this endpoint is not visible on the internet. That's what we wanted, so security works; the call will just block forever. That's really it. That's what I wanted to show you. We could do more videos on scaling, maybe we'll do custom containers, and there are more features coming, of course, as you would expect. But as you can see, this is really, really simple. You can go and deploy your endpoints in a few clicks or with curl. You can pick the level of security that works for you, and then invoking the endpoints is just as easy as a few lines of Python. So give it a try. Let us know what you think and what you would like to see next in the service; that's super useful. All right, well, that's it for Inference Endpoints. I hope you learned a few things, and as usual, keep rocking.
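If you prefer to script the consumer-side setup shown above instead of clicking through the VPC console, here is a minimal boto3 sketch of the same steps. The service name comes from the Inference Endpoints page, and the VPC, subnet, and security group IDs are placeholders for resources in your own account.

import boto3

# Create the interface VPC endpoint that connects your VPC to the
# Hugging Face endpoint service over AWS PrivateLink (sketch).
ec2 = boto3.client("ec2", region_name="eu-west-1")           # same region as the inference endpoint

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                           # placeholder: your VPC
    ServiceName="com.amazonaws.vpce.eu-west-1.vpce-svc-0123456789abcdef0",  # placeholder: from the endpoint page
    SubnetIds=["subnet-0aaa1111bbbb2222c"],                  # subnets that need access to the endpoint
    SecurityGroupIds=["sg-0ddd3333eeee4444f"],               # restricts traffic between instances and endpoint
)

print(response["VpcEndpoint"]["State"])                      # e.g. "pending" or "available"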

Tags

Inference Endpoints, Model Deployment, Hugging Face, AWS, PrivateLink, Cloud, Machine Learning