Migrating from OpenAI models to Hugging Face models

May 16, 2024
In this video, we'll discuss how to seamlessly migrate your applications from OpenAI models to Hugging Face's open-source alternatives. We'll cover the step-by-step process of deploying a Llama 3 8B model on a Hugging Face Inference Endpoint running on Google Cloud and invoking it with the OpenAI client library. We'll also deploy a Zephyr 7B model on Amazon SageMaker and demonstrate how easily it can be integrated with the OpenAI Messages API. By the end of this video, you'll be equipped to unlock the full potential of open-source models, all without rewriting your existing code! This video is perfect for data scientists, machine learning engineers, and developers looking to explore the benefits of open-source models and expand their ML engineering toolkit. #OpenAI #HuggingFace #MLEngineering

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

Notebooks:
* https://github.com/juliensimon/huggingface-demos/blob/main/inference-endpoints/llama3-8b-openai.ipynb
* https://github.com/juliensimon/huggingface-demos/blob/main/sagemaker-openai/deploy_zephyr_openai.ipynb

Blog: https://huggingface.co/blog/tgi-messages-api

Transcript

Hi, everybody. This is Julien from Arcee. A lot of you have been experimenting with OpenAI models in the last year or so, and some of you have moved those models to production. But as you've probably figured out by now, open-source models have a lot of benefits compared to those closed-source models in terms of privacy, security, cost-performance, and scaling. To help you transition your applications from OpenAI models to open-source models, I will do two demos in this video. The first demo will be based on Inference Endpoints, our own model deployment service, and I will show you how you can easily use the OpenAI client library with a model deployed on Inference Endpoints. In the second demo, I will start from an open-source model deployed on an Amazon SageMaker endpoint, and you will see how you can easily use the OpenAI Messages API with that endpoint. Both techniques should help you adapt your applications to open-source models and enjoy all their benefits. Enough talk, let's get started.

If you enjoyed this video, please consider joining my YouTube channel, and I would really appreciate it if you hit the like button and gave me a thumbs up. Also, please enable notifications so that you won't miss anything in the future. And last but not least, why not share this video with your colleagues or on your social networks? Because if you enjoyed it, chances are someone else will. Thank you so much for your support.

Let's start with the first demo. Here, I'm going to deploy Llama 3 8B, the instruction-tuned version, on Inference Endpoints. Starting from the model page, you can just click on the Deploy button, select Inference Endpoints, and this time around, why not deploy it on GCP? So let's just grab this one, zoom in a bit maybe. I'll select the instance type, select the protected security level to enable token authentication, and that's really all we would have to do. So we could click on Create Endpoint. I've shown you this UI flow a few times before, but we could also use the API, and I don't think I've shown you that. There's an example here where you can use curl, and if you want to see all the parameters, you could do that. Or you could use the proper Python API, and this is what I'm going to show you this time around. This is actually based on a really nice blog post by my colleague, so as always, I would recommend reading that.

What am I doing here? I'm installing the huggingface_hub library, which is what I need for the Inference Endpoints API. I need the OpenAI library because that's how we will invoke the model. And well, you can ignore this; this is SageMaker being weird and insisting on this particular version of Pydantic. Anyway, once we've done that, we can just start running cells. I need to log in to the Hub in order to have permission to create the endpoint, so I just need to go to my settings, grab my access token, and come back to my notebook and log in. Now we are logged in to the Hub and we can call Hub APIs.

Next, we're going to create the endpoint. This is the create_inference_endpoint API, which is part of the package we just installed. Nothing really strange: you will see the same parameters that you would see here, or in the curl API call. So: endpoint name, the model ID on the Hub, the framework, the task type, and I guess the only ones you may want to tweak are these: the platform you want to deploy on, and the security level. Here I am deploying on GCP, in the us-east4 region. I am using an L4 instance type and the instance size is x4.
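For reference, here is a condensed sketch of the first demo in code: creating the endpoint with the huggingface_hub library, calling it with the official OpenAI client (those invocation and cleanup steps are walked through just below), and deleting it. The endpoint name, the token placeholder, and the exact instance_type/instance_size identifiers are illustrative, not the notebook's exact values; check the Inference Endpoints pricing page for the current identifiers.

```python
# Sketch of demo 1: deploy Llama 3 8B Instruct on Inference Endpoints (GCP),
# call it with the official OpenAI client, then delete it.
# Names and instance identifiers are illustrative, not prescriptive.
from huggingface_hub import login, create_inference_endpoint
from openai import OpenAI

login()  # paste a Hugging Face token with write permissions

endpoint = create_inference_endpoint(
    "llama3-8b-demo",                                  # endpoint name (your choice)
    repository="meta-llama/Meta-Llama-3-8B-Instruct",  # model ID on the Hub
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="gcp",                                      # deploy on Google Cloud
    region="us-east4",
    instance_type="nvidia-l4",                         # assumed identifier for the L4 GPU
    instance_size="x4",
    type="protected",                                  # require a token to invoke
)
endpoint.wait()  # the "waiter": blocks until the endpoint is running
print(endpoint.url)

# TGI exposes an OpenAI-compatible route, so the official OpenAI client works as-is.
client = OpenAI(base_url=endpoint.url + "/v1/", api_key="hf_...")  # read-only HF token
stream = client.chat.completions.create(
    model="tgi",  # TGI doesn't route on the model name, but the field is required
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why are transformers better models than LSTMs?"},
    ],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")

endpoint.delete()  # don't leave the endpoint running and get charged
```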
If you're curious about all the available instance types and their pricing, this is the page you should go to, and the parameters you see there are the ones you can pass to the API. And then the security level: again, protected, which will protect the endpoint with a token. The rest is just input length, max total tokens, etc. I did not invent those; these are the default settings you would see in the configuration, so if you don't know where to start, this is a good place to start. Okay, so once we have that, we can just run the cell. It's going to take a few minutes. A waiter is a convenient way to know when this is complete, so let's just run the waiter, and once this is complete, we'll be able to invoke the endpoint. Okay, see you in a few minutes.

After a few minutes, the endpoint is ready to go. We can see its name, its URL, etc. So let's continue. We will need a Hugging Face token to invoke this endpoint because we did set the security level to protected. You only need a read-only token, and obviously, I will recycle this one right after the recording. So we have the token, and we can pass the endpoint URL and the token to the OpenAI library, which we installed here. That's the official OpenAI client. Once we have the client, obviously, we can invoke it, and we invoke it using the OpenAI messages format: the system prompt and the user prompt. Super simple, and we just need to set the model parameter to "tgi", since TGI is the inference server used by Inference Endpoints. Okay, so we'll use streaming and we'll just stream the answer. Let's see how that goes. All right, pretty cool. So, why are transformers better models than LSTMs? And this is what Llama 3 thinks about that. That's all it takes to switch from OpenAI models to an Inference Endpoint running your favorite open-source model. Very simple. When you're done, please don't forget to delete the endpoint; you don't want to leave that thing running for no reason and get charged. So don't forget to call delete.

So that was demo number one: the OpenAI client with an open-source model running on Inference Endpoints. Let's switch to the second demo. This time we're going to deploy another model; I've picked Zephyr 7B beta. We're going to deploy it on a SageMaker endpoint, and we're going to see how we can invoke it with the OpenAI messages format. This time around, we can't use the OpenAI client, because SageMaker endpoints require specific authentication based on AWS signatures, which are not implemented in the OpenAI client. Maybe someone at some point will add that; one can dream. So the next best thing is to use the OpenAI Messages format directly on the endpoint.

Let's see how that works. You can go to the model page, click on Deploy, then SageMaker, and you see the code snippet to deploy that model on SageMaker. This is the code for a GPU instance, which is the one I used, but don't forget, you could also give Inferentia 2 a try and see if you get more cost-performance out of that. I think you would. Anyway, that's the code snippet; that's what I started from, and that's what we see here. So we call deploy on a small GPU instance, wait a few minutes for the endpoint to come up, and then we can run inference. And you see this is really exactly the same format, because obviously it is the OpenAI messages format. Okay, so that's the body of my request, and I will just send it to the endpoint.
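As a rough sketch of what that deployment step looks like: the instance type, container settings, and the MESSAGES_API_ENABLED flag below are taken from the model page snippet and the blog post linked above, not from this transcript, so treat the exact values as assumptions.

```python
# Sketch of demo 2's deployment step: Zephyr 7B beta on a SageMaker GPU endpoint,
# served by TGI with the OpenAI Messages API enabled. Instance type, timeouts and
# environment values are plausible defaults, not the notebook's exact settings.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # or an explicit IAM role ARN outside SageMaker

model = HuggingFaceModel(
    role=role,
    image_uri=get_huggingface_llm_image_uri("huggingface"),  # TGI serving container
    env={
        "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta",
        "SM_NUM_GPUS": "1",
        "MAX_INPUT_LENGTH": "2048",
        "MAX_TOTAL_TOKENS": "4096",
        "MESSAGES_API_ENABLED": "true",  # accept OpenAI-style chat payloads (see the blog post)
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",       # a small GPU instance
    container_startup_health_check_timeout=300,
)

# The request body is plain OpenAI Messages format.
body = {
    "model": "tgi",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why are transformers more interesting than LSTMs?"},
    ],
    "max_tokens": 512,
}
```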
Here I'm using the SageMaker SDK for simplicity, but if you want to use the requests Python library, or really any other library in any other language (it doesn't have to be Python), that would obviously work. The SageMaker endpoint is an HTTPS endpoint, and you just need to pass the body and that will work. Okay? So let's try that. Here we're doing synchronous inference, so we're generating the full response and then printing it. It's going to take a few seconds, and then, of course, we'll try streaming inference. So you see, once again, if you are using that format in your apps, the only thing you need to do is... well, just switch that HTTP request from OpenAI to your SageMaker endpoint, and it will just work the same. By the way, I'm showing you SageMaker here, but it would work with TGI anywhere. It would work on Google Vertex, because we use TGI there as well. So that's TGI behavior, not SageMaker behavior. Okay. So let's see. We do see the response, right, on why transformers are more interesting than LSTMs, and we see the OpenAI output format, which is what we expected.

Okay. What about streaming? We can do streaming too. Let's just switch the deserializer to a streaming one, instead of just JSON. Then we have this utility function, which you've seen in a few other videos; I borrowed it from this nice post on the AWS Machine Learning blog, so feel free to go and read that. It helps us iterate on the streaming object returned by the endpoint. So let's just use that. This is how we grab each token and display it. You may need to adapt this from one model to the next, because the stop token could be different, etc. Here, for Zephyr, this is working nicely. Now let's try that again: same body, the only difference is I'm setting streaming to true and invoking the endpoint. The response is a streaming response, so I just see the streaming object here, and once I start iterating on that streaming object, I should see... Oh, it's fast, yeah. I should see the streaming answer. There we go. Let's try that again. Scroll quicker. Okay, here we go. Looks like streaming now.

Once again, your application code shouldn't be impacted by the switch from OpenAI to open-source models. Just remember to set the model parameter to "tgi", but the body and everything else will be the same. And at the end of the day, it is still an HTTPS query. As always, once you're done, please don't forget to delete the endpoints and avoid unnecessary charges.

Okay, well, that's really what I wanted to show you. So give that a try. If you're hitting some cost-performance or scalability walls with OpenAI models, I would really recommend that you look at open-source models. They're really good, and 7B or 8B is really the sweet spot for a lot of enterprise applications. So, you know, Llama 3, Zephyr, Google Gemma, and all the ones that are coming soon are going to be really interesting. Give them a try and tell us what you think. I'll see you soon with more content as usual. And until next time, keep rocking.
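To round out the second demo, here is roughly what the synchronous and streaming calls look like, continuing the deployment sketch above. This is a simplification: the notebook switches the predictor's deserializer to a streaming one and uses the line-buffering helper from the AWS Machine Learning blog post mentioned in the video, while the loop below is a naive stand-in that buffers the event stream line by line.

```python
import json
import boto3

# Synchronous inference: the full OpenAI-style response comes back at once.
response = predictor.predict(body)
print(response["choices"][0]["message"]["content"])

# Streaming inference, sketched with the low-level boto3 streaming API.
smr = boto3.client("sagemaker-runtime")
stream = smr.invoke_endpoint_with_response_stream(
    EndpointName=predictor.endpoint_name,
    Body=json.dumps({**body, "stream": True}),
    ContentType="application/json",
)

# TGI streams OpenAI-style "data: {...}" lines; chunks may be split across events,
# so we buffer bytes until we have complete lines.
buffer = b""
for event in stream["Body"]:
    part = event.get("PayloadPart")
    if not part:
        continue
    buffer += part["Bytes"]
    while b"\n" in buffer:
        line, buffer = buffer.split(b"\n", 1)
        text = line.decode("utf-8").strip()
        if not text.startswith("data:") or text.endswith("[DONE]"):
            continue
        chunk = json.loads(text[len("data:"):])
        print(chunk["choices"][0]["delta"].get("content") or "", end="")
```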

Tags

OpenAI, Open Source Models, Model Deployment, Inference Endpoints, SageMaker