How to Easily Deploy Your Hugging Face Models to Production | MLOps Live #20 With Hugging Face
October 26, 2022
Watch Julien Simon (Hugging Face), Noah Gift (MLOps Expert) and Yaron Haviv (Iguazio) discuss how you can deploy models into real business environments, serve them continuously at scale, manage their lifecycle in production, and much more in this on-demand webinar!
Transcript
Thanks for joining us and welcome to the MLOps Live webinar series, where we talk about bringing data science to production and creating real business value with AI. My name is Sahar, I'm the VP of Marketing at Iguazio, and it's great to see all of you with us again for our 20th episode. Today, we're going to be talking about how to easily deploy your Hugging Face model to production at scale. I'd like to start by introducing our speakers, who really need no introduction.
Our first speaker is Julien Simon, Chief Evangelist at Hugging Face. Prior to joining Hugging Face, Julien spent six years at AWS as the global technical evangelist for AI and ML. Before that, he spent 10 years as CTO and VP of Engineering at a series of large-scale startups.
Next, we have Noah Gift, who is one of the top voices in MLOps today. He is an accomplished author who published "Practical MLOps" with O'Reilly, and he's now working on a second book, "Implementing MLOps in the Enterprise," together with Yaron Haviv, with a couple of other books in various stages of development. He is also a lecturer at Duke, has several courses on Coursera, and is a big advocate for MLOps education. He's the founder of Pragmatic AI Labs. We're very excited to have both of you with us.
Last but not least, we have Yaron Haviv, who's the co-founder and CTO of Iguazio. Yaron has spent decades in the data and AI space and initiated two open-source frameworks: Nuclio, the serverless framework, and MLRun, the MLOps orchestration framework. He also sits on the data science committee of AIA, and it's a pleasure to have all three of you here with us today.
This dynamic trio is going to take us through a couple of very exciting topics. We'll start by talking about the key challenges of operationalizing machine learning. We'll give a quick introduction to Hugging Face and MLRun, and we'll focus on deploying your Hugging Face model to production at scale in a reproducible way, because we all know how challenging that can be. We'll finish off with a live demo of Hugging Face and MLRun, taking you through the entire MLOps pipeline. We'll leave plenty of time for Q&A at the end so you can ask your questions live.
At this point, I also want to invite you to join our thriving community of over 800 data scientists, ML engineers, and data engineers. That way, you can continue the conversation with our speakers and with each other. We'll drop the link in the chat now and run two polls about your MLOps journey. We really want to hear from you, and it helps us make these sessions as valuable as possible for you. Please do fill those in, and we'll share the results with everyone at the end.
In terms of Q&A, feel free to add your questions as the session goes. We'll answer some of them as we go through the different topics and save most of them for the end so that we can have a great live discussion. We'll now put everyone on mute and record the session so that we can send it to you afterwards for your convenience. With that, I will pass it on to Noah to kick things off.
All right. I'm ready to talk a little bit about some of the stuff I've been doing recently, especially with Duke students. One of the challenges in the current environment is that we might be heading into a recession for the next couple of years. As a result, I've mentioned to students that Hugging Face is something they should invest in and learn about—how to efficiently build models and put them into production. I can't think of a better technology to showcase their ability to do this.
The models have a lifecycle where you not only use them in web apps, Hugging Face Spaces apps, GitHub CodeSpaces, and command-line tools, but you can also pick a particular revision, download it, upload it, and put it somewhere else. The special sauce is the ability to fine-tune a model. This is where we need lots of assistance and tooling, like MLRun, because in some cases, you may want a large-scale fine-tuning process that uses a cluster of GPUs or special data. You also want to keep track of your experiments. In a nutshell, I can't think of a technology I would recommend more for people to master the Hugging Face model lifecycle.
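To make the lifecycle point concrete, here is a minimal sketch of pinning and downloading a specific model revision from the Hugging Face Hub so it can be packaged and redeployed elsewhere. The checkpoint name and revision below are illustrative, not from the webinar.

```python
# Pin a model revision from the Hugging Face Hub for reproducible deployments.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
revision = "main"  # could be a specific commit hash or tag for reproducibility

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForSequenceClassification.from_pretrained(model_id, revision=revision)

# Save locally so this exact revision can be uploaded or deployed somewhere else
model.save_pretrained("./sst2-model")
tokenizer.save_pretrained("./sst2-model")
```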
This culminates in a pretty revolutionary change in machine learning. It's not difficult to find slides that say, "What is AI? What is machine learning?" They often talk about unsupervised and supervised machine learning. Of course, we need supervised machine learning, which uses historical data to make predictions, like predicting a professional basketball player's salary based on points scored. However, one problem with supervised machine learning is the data and compute constraints.
We don't need to reinvent the wheel, and this is where transfer learning comes into play. We can stand on the shoulders of giants, take large language models (LLMs), and tweak them a little bit to use the best parts of the model. For example, with a news dataset, you could grab just the body of the model trained on the news data, swap out the head, and make it do summarization for 18th-century literature. This is one of the great advantages of leveraging powerful resources developed by organizations like OpenAI, Stability AI, and Google. With Hugging Face and its ecosystem, you can tweak a few knobs and use a high-quality model trained in a planet-friendly way.
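The same idea underlies the summarization example above: keep the pretrained body, replace or adapt the head, and fine-tune. A minimal sketch of that setup, with an illustrative checkpoint and label count that are assumptions rather than anything shown in the webinar:

```python
# Transfer learning: reuse a pretrained "body" and attach a fresh task head.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"  # pretrained language model body
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# num_labels attaches a new, randomly initialized classification head on top of
# the pretrained weights; only a modest fine-tuning run on your own data is
# needed from here, instead of training from scratch.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
```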
In the coming years, 2023 and 2024, there will be a reckoning in terms of people asking about your spend and why you're using inefficient training processes. We need to focus on efficient training processes. That's my section. We have one more slide here, which is that once you have a robust AI ecosystem, you have AI tools that talk to AI tools that create AI. This creates a feedback loop that allows you to be extremely productive.
For example, I could launch GitHub Codespaces, use a co-pilot tool to help me write code, including Hugging Face code. Once I take that suggestion, I could create a model, take that model, and deploy it back to a cloud-based environment. This is a meta AI-enabled programming system. I was talking to Julian earlier, and he's doing similar things with Hugging Face technology that talks to pre-trained models to create synthetic datasets. This concept is going to explode in 2023 and beyond, where you need AI tools to help you create other AI tools.
Thanks so much, Noah. And what a great segue to Julien's presentation. Good morning, everyone, or good afternoon, depending on where you are. Thanks for joining, and thank you for having me today. I'd like to tell you a little bit about Hugging Face and why I think you should care. What we see today is a reinvention of deep learning. One way to put it is that transformer models are eating deep learning.
We've been doing deep learning in recent years using traditional architectures like CNNs, LSTMs, and RNNs to solve problems with unstructured data like natural language, speech, and images. These models have proven quite efficient. However, since the arrival of transformer models and Google's BERT in 2018, we've seen the transformer architecture prove more efficient than anything else at extracting insights from unstructured data.
Whether we're talking about natural language with models like BERT, BART, GPT-2, and GPT-3, or our very own BigScience BLOOM model, an open alternative to GPT-3, or computer vision with CLIP and the Vision Transformer, or text-to-image models like Stable Diffusion, or speech-to-text with Wav2Vec2 from Meta or the latest Whisper model from OpenAI, transformer models keep improving on state-of-the-art benchmarks.
It's not just researchers saying this; the industry has picked up on it. In last year's State of AI report, transformers were called out as a general-purpose architecture for machine learning. In the Kaggle data science survey, we saw RNN and CNN usage going down, while transformers are going up. We really see this reinvention of deep learning based on the generalization of the transformer architecture.
The latest State of AI report came out a couple of weeks ago, and one slide really caught my eye. This slide shows the modality of transformer usage in research papers. A couple of years ago, transformers were used mostly for NLP (81%), with almost zero usage in images. In just under two years, NLP is now the minority use case (41%), while image usage has exploded 10X or even 11X. We see different modalities rising, such as audio, video, multimodal, text-to-image, and other use cases like recommendation, protein, and drug prediction. This is a strong sign that the transformer architecture is generalizing to many different deep learning and machine learning use cases.
Hugging Face is a company and the steward of several open-source projects. The most popular one is the Transformers library, which lets you download models from the Hugging Face Hub, our main website, fine-tune them, predict with them, and so on. The Transformers library is one of the fastest-growing open-source projects ever. The GitHub star counts for various projects are shown on the slide. Hugging Face is the blue line with the little rocket, showing a steep slope and a continuing trend. We're well past 70K stars and aiming for 100K, so if you haven't starred the library yet, we'd appreciate your help.
We're growing faster in popularity than amazing projects like PyTorch, Keras, and others, and even faster than Kubernetes, which is mind-blowing. This continued adoption in the community is a strong sign that it's not just a fun toy but a tool that researchers and developers are using to get work done.
The Hugging Face Hub, our main website at HuggingFace.co, is often called the GitHub of machine learning. Just like you go to GitHub to find code for your projects and share your own code with the community, you go to the Hugging Face Hub to find models and datasets for your machine learning projects and share them with the community.
As of today, we have a large number of models. When I put this slide together a week ago, it was 78,000 models. I checked just before the webinar, and now it's 81,000 models and over 12,000 datasets. We have over 10,000 organizations, from Google to Meta to NVIDIA to Microsoft to research labs, companies, and individual developers, sharing models and datasets every minute. We have more than 1 million model downloads every single day.
This shows the continued popularity and usage of Hugging Face models. If you've never checked it out, I encourage you to do so. You can sign up in seconds; it's completely free and open source. You can start experimenting with models right away.
Before we continue with the deployment discussion, I wanted to show you quickly what Hugging Face is all about. We try to build simple tools that let you use state-of-the-art models in the easiest possible way, even if you're not a machine learning expert. Here, I'm importing the pipeline object from the Transformers library and starting with a pre-trained BART model from Facebook. I create a pipeline for zero-shot classification, where we score the content of a text against an arbitrary list of labels.
Once I have that pipeline, I just invoke it on my sentence, passing my labels, and I get my scores. This is how you do zero-shot classification in literally four lines of Python code. What happens under the hood is complex, but we don't need to know about it. We just need to know what business problem we're trying to solve and which popular models can handle it, and you can find those models on the Hugging Face Hub.
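For reference, the four lines look roughly like this. The model and labels are the standard zero-shot example (facebook/bart-large-mnli) and are an assumption, since the exact values from the slide aren't reproduced in the transcript.

```python
# Zero-shot classification in a few lines with the Transformers pipeline API.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "I'd like to book a flight to Paris next week.",
    candidate_labels=["travel", "cooking", "finance"],
)
print(result["labels"][0], result["scores"][0])  # highest-scoring label and its score
```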
Anyone who can write simple Python code can get the work done. Running a simple prediction is one thing, but what happens when we want to deploy this in production for 24/7 usage? That's where the real problem starts. You want automation, a simple, reliable way to deploy models into production, and a scalable, solid way to manage them. Working in a notebook is one thing, but automating and scaling your MLOps workflow is another.
That's where we want to hear about MLRun and how we can bring Hugging Face models into production in a solid and simple way. Thanks so much, Julien. Those are some impressive stats, I have to say.
At this point, we're going to pause for our first poll. What is your biggest challenge when deploying your Hugging Face model? We'll share the results at the end.
I'd now like to invite Yaron to talk a little bit about the next stage, which is deployment and management in production. Yaron, over to you.
Thank you, Sahar. Before we dive into the demo, let's go over some details. We'll show how to deploy Hugging Face models to production, to real-world endpoints, along with data preparation through a real application pipeline. We'll also show how to retrain the model with fresh data to update it to something more accurate.
MLRun is an open-source project that automates your entire MLOps workflow. The main components in MLRun are a feature store for feature engineering; a fully automated flow for building models, including testing, training, and deployment across distributed cluster environments; real-time serving pipelines; and real-time monitoring of all endpoints.
MLRun consists of a client and a server. The server runs on Kubernetes, which can live in any cloud environment, on-premises, or on virtual machines. The client can be anything you want: Jupyter notebooks, VS Code, SageMaker, Azure ML, PyCharm, or any environment where you can edit code. You can build your functions locally, test them, and then launch them on a distributed cluster with all the tracking, auto-scaling, and operational aspects handled automatically for you.
The paradigm shift that MLRun brings is that before, you had to write code, and then machine learning engineers would convert it into containers and elastic services, adding instrumentation, security, logging, and other operational aspects. Once you've done that, you build workflows from those individual functions that handle data processing, training, testing, deployment, and so on. Finally, you need to build monitoring. Each task is usually done in silos with a lot of manual activities.
In MLRun, you can write your code in your favorite development environment, and with a single click or API call, automatically convert it into a fully elastic, production-grade service that tracks all information, artifacts, and data. You can build real-time or batch pipelines using those individual functions with a single line of code. The monitoring is built-in, as all the classes we use internally know how to monitor activities, whether it's data analysis, model monitoring, or infrastructure monitoring.
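As a rough illustration of that "single API call" idea, here is a hedged sketch using MLRun's code_to_function/run_function pattern. The project name, script file, and handler are hypothetical placeholders, not artifacts from the demo.

```python
# Turn local code into a tracked, cluster-ready MLRun job (a sketch, not the demo code).
import mlrun

project = mlrun.get_or_create_project("huggingface-demo", context="./")

# Wrap a local Python file as a serverless "job" function that can run on the cluster
trainer = mlrun.code_to_function(
    name="trainer",
    filename="train.py",   # hypothetical script containing a train() handler
    kind="job",
    image="mlrun/mlrun",
    handler="train",
)
project.set_function(trainer)

# Run locally while developing (local=True), or drop the flag to run on the cluster;
# parameters, results, and artifacts are tracked automatically either way.
run = project.run_function("trainer", params={"epochs": 1}, local=True)
```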
All this data is fed back into the system to trigger actions like redeployment, retraining, or tuning your model. This is the general idea, and we'll show how it works, especially how it applies to Hugging Face.
Before we dive into the demo, which has two parts—one for serving and one for automated retraining—let's look at MLRun. You can visit MLRun.org for links and documentation. The best way to start with MLRun is to go to the MLRun documentation site. The landing page is organized by what you want to do, such as ingesting data, building a model, using the model, deploying it into applications, and monitoring.
Each section has documentation, tutorials, and videos to show you how to do it. For each category, you may have different things like deploying batch or real-time, and you'll have different examples and tutorials. MLRun has many built-in components that serve this flow, such as serverless functions that automatically take your code and build elastic services, data processing workflows, distributed runtimes, and real-time pipelines.
We'll use some of these components in the demo. If you want to understand how it works, you can go into the tutorial section, where you'll find many tutorials, including videos for each step of the flow, from training, deploying, and serving models to building custom model serving, advanced serving, model monitoring, and batch processing for script analysis.
One of the things we'll show in this demo is the concept of real-time serving pipelines. We call it a pipeline rather than an endpoint because an endpoint merely serves the model. When you build an application, there are aspects of intercepting messages, transforming them, enriching data, doing scoring, and post-processing.
You need a multi-stage flow that encompasses all these steps and can be treated as one entity for rolling upgrades or debugging. This is the concept of real-time serving pipelines or graphs, where you can build topologies. We'll see how to build an NLP topology with pre-processing, a Hugging Face model, and post-processing.
Let's move into the demo. We'll show two different flows. The first is a simpler one where I'll take a model from Hugging Face, build a serving pipeline using that model, and test it with Gradio. The second, more advanced scenario involves taking a new dataset with up-to-date information, retraining the model, and deploying it to an existing endpoint.
We'll start with the survey application notebook. In MLRun, everything is a project, which usually maps to a Git repo for saving work and collaborating. We import MLRun, create a project, and define a serving function with multiple steps. The serving function intercepts the incoming message and runs it through a graph with three steps: pre-processing, sentiment analysis using the Hugging Face model, and post-processing.
Hugging Face support is built into the latest MLRun releases for serving and training, so you don't need to build anything special. To build this graph, I need a pre-processing function, the Hugging Face model serving class with parameters for the model and tokenizer, and a post-processing function. The model serving layer also supports other frameworks, like TensorRT and KFServing.
The usual serving convention requires passing a dictionary of inputs for scoring and returning a dictionary with output vectors and other data. We transform the incoming text into a serving request and post-process the response into a text message. We build this application pipeline with multiple steps, save it in the project, and simulate it to verify that it works.
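Here is a hedged sketch of what that three-step graph and local simulation might look like. The serving class path, handler names, model, and request path are assumptions based on the demo narration, not the actual demo code.

```python
# A sketch of the multi-step serving graph: preprocess -> Hugging Face model -> postprocess.
import mlrun

serving_fn = mlrun.code_to_function(
    name="sentiment-serving", filename="serving.py", kind="serving", image="mlrun/mlrun"
)

graph = serving_fn.set_topology("flow", engine="async")
graph.to(handler="preprocess", name="preprocess") \
     .to(
         class_name="mlrun.frameworks.huggingface.HuggingFaceModelServer",  # assumed class path
         name="sentiment-analysis",
         task="sentiment-analysis",                                          # assumed parameters
         model_name="distilbert-base-uncased-finetuned-sst-2-english",
         tokenizer_name="distilbert-base-uncased-finetuned-sst-2-english",
     ) \
     .to(handler="postprocess", name="postprocess").respond()

# Simulate the whole pipeline locally before deploying
mock = serving_fn.to_mock_server()
print(mock.test(path="/predict", body={"text": "I hate flying"}))
```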
We push requests into the pipeline and verify the output. Once we've debugged it and are satisfied, we want to turn it into a real-world endpoint that runs on Kubernetes with API services and rolling upgrades. We deploy the function, and in the background MLRun builds the containers, Kubernetes assets, and API gateways, and adds security, logging, and other features.
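Continuing the sketch above, deployment and a test call come down to a couple of lines; the path and payload mirror the mock test and are illustrative.

```python
# Deploy the serving pipeline as a real endpoint and call it.
serving_fn.deploy()  # builds the container, the Nuclio function, and the API gateway

response = serving_fn.invoke(path="/predict", body={"text": "I hate flying"})
print(response)
```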
The endpoint serves the entire application pipeline, not just the model. To test it, we use Gradio, a simple way to build front-ends with minimal Python code. We build a widget that accepts text and calls the model using the URL generated by the endpoint. We launch the widget, and it builds a local app.
For example, if I input "I hate flying," it returns a sentiment analysis. We've built a complete application pipeline, including pre-processing, the Hugging Face model scoring, and post-processing, tested it locally, and deployed it into an elastic, auto-scaling service with additional features.
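A small Gradio front-end like the one in the demo can be written in a few lines. The endpoint URL below is a placeholder for the address returned when the serving function is deployed.

```python
# Minimal Gradio widget that sends text to the deployed sentiment endpoint.
import gradio as gr
import requests

ENDPOINT_URL = "http://<your-endpoint>/predict"  # placeholder for the deployed URL

def analyze(text: str) -> str:
    resp = requests.post(ENDPOINT_URL, json={"text": text})
    return str(resp.json())  # e.g. a sentiment label and score

gr.Interface(fn=analyze, inputs="text", outputs="text", title="Sentiment demo").launch()
```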
Now, we want to take fresh data, build a fresh model, and deploy it to replace the existing model. We'll build a training pipeline that automatically deploys the new model into production. We import MLRun, log into the project, and add a training function. This function includes data preparation, training, evaluation, testing, and optimization methods.
We build a workflow with these steps. The training function prepares the dataset, trains the model using the Hugging Face Trainer API, and optimizes the model with ONNX for higher performance. The workflow file, written with Kubeflow Pipelines semantics, has several steps: getting parameters, recording the dataset, training, optimizing the model, and deploying the serving function.
The first step gets parameters like the dataset and pre-trained model. The second step records the dataset, the third trains the model, and the fourth optimizes it. Finally, we deploy the serving function. This pipeline takes everything we did, from data preparation to training, testing, evaluation, and optimization, and deploys it into production. After retraining, the serving function can be loaded and tested, showing improved accuracy with the new dataset. This demonstrates how to build and automate a pipeline for retraining models and deploying them into production. The process is streamlined, making it accessible even to data scientists without extensive DevOps or MLOps skills.
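A condensed sketch of such a retraining workflow, using MLRun's Kubeflow Pipelines-style workflow syntax, is shown below. The function names, parameters, and artifact keys are assumptions that mirror the steps in the narration rather than the actual demo file.

```python
# Retraining workflow sketch: record data -> train -> optimize -> redeploy serving.
from kfp import dsl
import mlrun

@dsl.pipeline(name="huggingface-retraining")
def pipeline(dataset: str, pretrained_model: str):
    # 1. Record the fresh dataset as a tracked artifact
    prep = mlrun.run_function("data-prep", params={"dataset": dataset}, outputs=["train_set"])

    # 2. Retrain the model on the new data, starting from the pretrained checkpoint
    train = mlrun.run_function(
        "trainer",
        params={"pretrained_model": pretrained_model},
        inputs={"train_set": prep.outputs["train_set"]},
        outputs=["model"],
    )

    # 3. Optimize the trained model (e.g. ONNX conversion) for faster inference
    optimize = mlrun.run_function(
        "optimize", inputs={"model": train.outputs["model"]}, outputs=["model"]
    )

    # 4. Deploy (or redeploy) the serving pipeline with the newly trained model
    mlrun.deploy_function(
        "sentiment-serving",
        models=[{"key": "sentiment", "model_path": optimize.outputs["model"]}],
    )
```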
Julien highlighted that starting with an off-the-shelf model from Hugging Face can save time and effort. With over 80,000 models available, you're likely to find one close to your business problem. If the initial model is good enough, you can move to production quickly. If more accuracy is needed, find a relevant dataset, train, and iterate. This iterative approach, reusing existing models and data, is encouraged to get models into production faster and test them with real data.
Another use case is updating models with fresh data, such as labeled images or text from chatbots. This helps maintain and improve model accuracy over time, addressing issues like model drift. Even deep learning models can benefit from periodic retraining, especially with changing data patterns.
Julien also noted that the Hugging Face Hub is becoming a popular target for publishing models, not just a source for production deployments. MLRun helps users with less deep technical expertise maintain and improve models, contributing to a growing ecosystem of models on Hugging Face.
Yaron explained that MLRun supports building models in a development account on AWS and promoting them to upstream environments and other accounts. Projects in MLRun are versioned using Git, allowing for seamless transitions between development, staging, and production. CI/CD integrations automate the deployment process, making it easier to manage multiple environments.
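As a hedged sketch of that promotion flow, the same Git-backed project can be loaded in a staging or production environment and its workflow run there; in CI/CD this would be triggered automatically. The repository URL, workflow name, and dataset path are placeholders.

```python
# Load a versioned, Git-backed MLRun project in another environment and run its workflow.
import mlrun

project = mlrun.load_project(
    context="./sentiment-prod",
    url="git://github.com/<org>/<repo>.git#main",  # placeholder Git source
    name="huggingface-demo",
)

# Run the registered workflow with environment-specific arguments
project.run("main", arguments={"dataset": "s3://<bucket>/fresh-data.csv"}, watch=True)
```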
Regarding the cost of deploying transformer models, Julien emphasized that while these models can be expensive, starting small and optimizing models can significantly reduce costs. Tools like Optimum, Hugging Face's open-source optimization library, can optimize models for inference, reducing their size and improving efficiency. Balancing accuracy and cost is crucial, and smaller models often suffice for many business problems.
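A minimal Optimum sketch: export a Transformers checkpoint to ONNX and run it through ONNX Runtime for cheaper inference. The model id is an example, and on older Optimum releases the export flag is `from_transformers=True` rather than `export=True`.

```python
# Export a checkpoint to ONNX with Optimum and serve it via a Transformers pipeline.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classifier = pipeline("text-classification", model=ort_model, tokenizer=tokenizer)
print(classifier("Smaller, faster models can be good enough for many business problems."))
```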
Noah Gift mentioned that Makefiles can complement any workflow, including those using MLRun. While Jupyter notebooks are valuable, the trend is moving toward more developer-oriented workflows, where Makefiles help standardize local and build-system processes.
Yaron discussed MLRun's environment agnosticism, noting that it can run on various cloud and on-premises environments, including Kubernetes clusters, and alongside cloud services like SageMaker and Azure ML. MLRun can integrate with third-party services and build workflows around them, making it versatile for different deployment scenarios.
In comparing MLRun to other frameworks like MLflow and BentoML, Yaron highlighted that MLRun is designed for production, focusing on building and deploying serverless functions, feature engineering, and real-time and batch workflows. MLRun provides a comprehensive metadata layer that integrates training, serving, and monitoring without additional glue code. This makes it suitable for building complex, real-world applications.
Finally, Yaron shared ways to try MLRun, including small-scale deployments on a laptop, large-scale deployments on-premises or in the cloud, and managed solutions. The session concluded with polls on the challenges and methods of deploying models, showing that model monitoring, infrastructure costs, and CI/CD automation are common challenges, while open-source frameworks are widely used for deployment.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.