Machine Learning Hyper Productivity with Transformers and Hugging Face
July 08, 2022
Talk at Databricks Data and AI Summit, San Francisco, 28/07/2022.
According to the latest State of AI report, "transformers have emerged as a general-purpose architecture for ML. Not just for Natural Language Processing, but also Speech, Computer Vision or even protein structure prediction." Indeed, the Transformer architecture has proven very efficient on a wide variety of Machine Learning tasks. But how can we keep up with the frantic pace of innovation? Do we really need expert skills to leverage these state-of-the-art models? Or is there a shorter path to creating business value in less time?
In this code-level talk, we'll gradually build and deploy a demo involving several Transformer models. Along the way, you'll learn about the portfolio of open source and commercial Hugging Face solutions, and how they can help you become hyper-productive and deliver high-quality Machine Learning solutions faster than ever before.
Code: https://github.com/juliensimon/huggingface-demos/tree/main/amazon-shoes
Transcript
Thank you, and good morning, everybody. Hopefully, jet lag won't crush me in the next 34 minutes. Let's see how this goes, okay? So today, I'd like to introduce you to what we build at Hugging Face. Hugging Face is a French-American startup, and we do machine learning. We do it in a way that helps you go pretty quickly. I will try to show you lots of things today. So let's just get started.
This is probably what some of you or a lot of you are doing today. This is how the world sees machine learning, which has become a synonym for deep learning in many places. We start with neural networks and neural network architectures. We spend an insane amount of time cleaning, preparing, and labeling large data sets. A lot of the sessions here will talk about that, but I won't. Hopefully, that's good news.
We then put those models and data sets together and use compute, usually in the form of GPUs, to train and get a model that's good enough for production. How do we do this? We use a bunch of tools, many of which are open source. Don't get me wrong; they're great. I love TensorFlow and PyTorch as much as anyone else, but they're still very difficult to use. Unless you have solid skills in machine learning, deep learning, computer science, statistics, and all that, it's pretty difficult to get production-grade models. That's what a lot of people do today, and that's what we're trying to reinvent.
In the last few years, we've seen transformers, a new architecture for deep learning models, become extremely efficient over a wide range of use cases. They are becoming the de facto solution for many problems, as you'll see. Instead of building and cleaning and labeling massive data sets, we start from pre-trained models, off-the-shelf models for whatever task you're trying to solve. We use transfer learning to reuse and inherit knowledge from the pre-trained model. Sometimes, the model is good enough as is, and we save ourselves the trouble of building a machine learning project from scratch. Other times, we fine-tune it with a bit of data. The key word here is "a bit of data." We don't need millions of labeled samples; a few hundred or a few thousand, and sometimes much less, can do the trick.
GPUs are still around, but we also see a new generation of chips designed to accelerate machine learning. As we all know, GPUs were originally designed for 3D games, but they work well for machine learning too. Last but not least, we're trying to do all of this with developer tools. We build tools that make it easy for anyone to do this stuff. You don't need to be an expert or write thousands of lines of machine learning code. You can do it with very little, very generic, and very simple code.
Transformer models took off in 2018, when Google launched BERT, which you've all heard about. Transformers is also the name of an open-source library that we steward. We have a huge community around it, but we lead that effort. Transformers is one of the fastest-growing open-source software projects ever. This is a bold statement, but we can back it up by looking at GitHub stars. The Transformers library is the yellow line on the left with the steepest slope. We're growing in popularity faster than excellent projects like PyTorch, Keras, or Spark. We also grow faster than Kubernetes or Node.js, which is mind-boggling.
With this library and our companion website, the Hugging Face Hub, where we host models and data sets, we serve over 1 million models every day and counting. We have crazy adoption in the open-source community, and we see analysts and industry experts catching up. The State of AI report said transformers are emerging as a general-purpose solution for ML. If you thought transformers were just good for NLP, they are still excellent for NLP, but they also work very well for computer vision, speech, reinforcement learning, and more.
We have plenty of seats around here. There's no risk in sitting in the front row. I'm not going to pick on you. Even if you write Python on Windows, you're safe. The Kaggle survey told us that recurrent neural network and convolutional neural network usage is going down, while transformers usage is going up. We're becoming the de facto solution.
I don't have a lot of time today, so I'll go straight to the point. This is the family picture for Hugging Face. We host over 5,000 data sets, which is probably 6,000 by now. On the Hugging Face Hub at huggingface.co, we host over 55,000 models. By combining these with the Transformers library, you can train and experiment anywhere—on your laptop, on your GPU server, in the cloud. We have additional libraries for hardware acceleration. If time permits, I'll give you a quick sneak peek at that.
We've built Spaces, a cool way to showcase machine learning models to non-technical stakeholders with simple web apps. I'll definitely show you that today. We have a managed inference service called the Inference API that lets you deploy and predict with your models in one line of code, on managed infrastructure. We also have partnerships with large cloud vendors: AWS, with SageMaker, and more recently Azure, where you can deploy any NLP model from the Hub on Azure infrastructure with literally one click.
Now, let's run some code. I'll need my glasses for that. This is the Hugging Face Hub if you've never seen it. Who's never been to the Hugging Face Hub? A few people. Okay, so this is where we have 50,000 models and 6,000 data sets. This is the raw material we'll use to build our applications. Let's say I want to build a text classification model that takes Amazon product reviews for shoes and predicts the star rating from 1 to 5.
First, I would go to the Hugging Face Hub and find pre-trained models. You'll find sentiment analysis models and others, but not one specifically trained for shoe reviews and star ratings. However, we can find models trained on large text corpuses, like BERT, RoBERTa, and DistilBERT, which understand the finer points of the English language. We can start from one of those.
We'll need data, so we can use the Amazon US reviews data set, which is readily available on the hub. It has millions of product reviews, including a shoe category. We can browse it on the hub and see that it includes product titles, reviews, and star ratings. For experimentation, I'll use the datasets library to download the data set, which is one line of code. Since it's a large data set, I'll take 10% of it, reducing it to 436,000 reviews. I'll drop unnecessary columns and focus on the review text and star rating.
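Here's a minimal sketch of that loading step, assuming the Shoes_v1_00 configuration of the amazon_us_reviews dataset and an illustrative 10% slice:

```python
from datasets import load_dataset

# Load 10% of the shoe reviews; the configuration name is an assumption.
dataset = load_dataset("amazon_us_reviews", "Shoes_v1_00", split="train[:10%]")

# Keep only the review text and the star rating.
dataset = dataset.remove_columns(
    [c for c in dataset.column_names if c not in ("review_body", "star_rating")]
)
print(dataset)
```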
I'll run a sanity check on the star ratings to ensure they range from 1 to 5 and check if the data set is balanced. It's not balanced, with many five-star reviews. For machine learning, this is a problem, so I'll rebalance the data set to have an equal number of reviews for each star rating, reducing it to 100,000 ratings. This is more than enough for experimentation.
The datasets library is very similar to pandas, so if you're familiar with pandas, you'll feel at home. I need to adjust the star ratings to start at zero for classification, so I'll write a simple function to decrement the ratings and apply it to the data set. I'll also rename the columns to match the model's expectations, such as "text" and "labels." After ten lines of easy Python, the data is prepared, and I ran all of this on my laptop.
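Continuing from the dataset loaded above, the rebalancing and label adjustments could look roughly like this; the 20,000-reviews-per-class figure is an assumption chosen to reach the ~100,000 balanced reviews mentioned earlier:

```python
from datasets import concatenate_datasets

# Sample the same number of reviews for each star rating (1 to 5).
per_class = 20_000
balanced = concatenate_datasets([
    dataset.filter(lambda x: x["star_rating"] == stars)
           .shuffle(seed=42)
           .select(range(per_class))
    for stars in range(1, 6)
])

# Classification labels must start at 0, so decrement the star ratings,
# then rename the columns to what the model expects.
balanced = balanced.map(lambda x: {"star_rating": x["star_rating"] - 1})
balanced = balanced.rename_column("review_body", "text")
balanced = balanced.rename_column("star_rating", "labels")
```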
I'll split the data for training and validation and save it locally. I can save it to CSV for additional exploration or processing. I can save it back to S3, as the datasets library is integrated with cloud storage services. I can also push it to the Hugging Face Hub, creating a git repository and pushing the data there.
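The split, local save, and push to the Hub could look like this; the repository name is a placeholder:

```python
# Hold out 10% of the data for validation.
split = balanced.train_test_split(test_size=0.1, seed=42)

# Save local CSV copies for further exploration or processing.
split["train"].to_csv("train.csv")
split["test"].to_csv("test.csv")

# Push the dataset to the Hugging Face Hub (creates a git-backed dataset repo).
split.push_to_hub("my-username/amazon-shoe-reviews")
```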
Now, we're ready to train. Or should we use AutoML? Let's use AutoTrain, which is part of our services. This is a multi-class classification problem, and I want the service to pick the models automatically. I'll use the data set I just pushed, map the columns, and start the training. This will run 15 jobs in the background, and I can go back to other things. When it's done, it will tell me which model architecture works best.
I've run this before, and the top accuracy model is automatically pushed to the hub with a filled model card. I can see the model file and the metrics. Let's assume we don't know this yet and experiment. I'll start from the Transformers library and train locally on this machine. I'll use DistilBERT, a smaller, faster version of BERT, for a good baseline. I'll define simple hyperparameters, load the data set from the hub, and define a function to compute metrics like accuracy, F1, precision, and recall.
I'll load the base model and tokenizer from the hub, tokenize the text, and define training arguments. I'll configure the training job with the model, parameters, metrics function, training set, and validation set. This code is generic and can be used for other tasks like image or audio classification. I'll call the train function for a single epoch to keep the training time short. The accuracy is a bit under 58%, which is not great but okay for a baseline. I'll save the model locally and push it to the hub, where I can share it with my team or the community.
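Here's a condensed sketch of that training flow with the Trainer API; the dataset and model repository names, hyperparameters, and metric choices are illustrative, not the exact ones used in the talk:

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"
dataset = load_dataset("my-username/amazon-shoe-reviews")  # placeholder dataset repo

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted")
    return {"accuracy": accuracy_score(labels, predictions),
            "precision": precision, "recall": recall, "f1": f1}

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

training_args = TrainingArguments(
    output_dir="amazon-shoe-reviews-distilbert",
    num_train_epochs=1,                 # single epoch to keep training time short
    per_device_train_batch_size=32,
    learning_rate=5e-5,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())   # accuracy, F1, precision, recall on the validation set
trainer.push_to_hub()       # share the fine-tuned model on the Hub
```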
The model is loaded on demand using the Inference API, and I can test it right there with the inference widget, a cool way to try models without writing any code. I can send the model page to others so they can run their own samples and give feedback. On this review, the model predicts label four, which means five stars, since we decremented the ratings during data preparation, with a confidence score of about 72%. Not a difficult example at all.
Now, we have the model on the hub, and we can download it again or use the Transformers library to predict with it. The pipeline is super simple: define the task type, pass the model name, and call the classifier to get the top-scoring class. We can go a level deeper if needed, controlling tokenization or pre-processing.
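A minimal pipeline sketch, with the model id as a placeholder:

```python
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="my-username/amazon-shoe-reviews-distilbert")
print(classifier("These shoes fit perfectly and look great."))
# e.g. [{'label': 'LABEL_4', 'score': 0.72}]  -> label 4 means five stars
```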
This model is good enough for POCs. It took about an hour of training, 10 minutes of copy-pasting, and 15 minutes of debugging. Once you know how to do this, you can experiment quickly and get to a first model in a couple of hours.
Imagine you want to show this to your marketing director or CEO. They might be scared by the technical output. What they want is something user-friendly, like a simple web app where they can enter a customer review and get stars and a confidence score. This is a really simple example, and it takes only 15 lines of code. Most of the code is about extracting the label and displaying stars. I tested it locally with the Gradio library, created a space repository on the Hugging Face Hub, and pushed the file. The hub handles the DevOps, and it runs on managed infrastructure.
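The Gradio app could look something like this; the model id and the exact layout are assumptions:

```python
import gradio as gr
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="my-username/amazon-shoe-reviews-distilbert")

def predict(review):
    result = classifier(review)[0]
    stars = int(result["label"].split("_")[-1]) + 1   # undo the label decrement
    return "⭐" * stars, f"{result['score']:.0%} confidence"

demo = gr.Interface(fn=predict, inputs="text", outputs=["text", "text"],
                    title="Amazon shoe review rating")
demo.launch()
```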
Spaces is great for showcasing models. If you've played with the DALL·E Mini model, generating silly images, you might have noticed the huggingface.co URL. You can find thousands of Spaces, and it's a cool way to showcase your models.
After the demo, your marketing director might be happy and ask you to keep improving accuracy. You can train for a few more epochs on more data or look at the AutoTrain model, which might be more accurate. Now, you need to deploy it for real. The first option is to use the Inference API, which takes minimal code. Define the URL of the model, pass an authentication token, and fire up a query. The first call might take a few seconds to load the model, but subsequent calls are faster. You can pin models with an enterprise plan for 24/7 availability.
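Calling the Inference API is just an HTTP request; the model id and token below are placeholders:

```python
import requests

API_URL = ("https://api-inference.huggingface.co/models/"
           "my-username/amazon-shoe-reviews-distilbert")
headers = {"Authorization": "Bearer hf_xxx"}  # your Hugging Face access token

response = requests.post(API_URL, headers=headers,
                         json={"inputs": "These shoes fit perfectly and look great."})
print(response.json())
```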
If you run on AWS or Azure, you can deploy any model from the hub to Amazon SageMaker or Azure with a few lines of code. For SageMaker, point to the model on the hub, create a SageMaker object, and call deploy. For Azure, search for Hugging Face endpoints in the Azure Marketplace and deploy with a few clicks. This way, you don't write DevOps code, and the process is completely generic.
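Here's a sketch of the SageMaker path, based on the Hugging Face integration in the SageMaker SDK; the container versions, instance type, and role handling are assumptions that depend on your environment:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # or an IAM role ARN outside SageMaker

hub = {
    "HF_MODEL_ID": "my-username/amazon-shoe-reviews-distilbert",  # model on the Hub
    "HF_TASK": "text-classification",
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
)

predictor = huggingface_model.deploy(initial_instance_count=1,
                                     instance_type="ml.m5.xlarge")
print(predictor.predict({"inputs": "These shoes fit perfectly and look great."}))
```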
Now your model is in production and it works, but it's a bit slow. Transformer models are big, and optimizing them by hand is complex. Instead of doing manual optimization, you can use the Optimum library, which we build in partnership with hardware vendors like Intel, Graphcore, and Habana Labs to accelerate transformer models. I'll show you inference optimization on Intel hardware.
I'll load the trained model, define a function to evaluate its accuracy, and use the Intel Neural Compressor for quantization. Quantization replaces 32-bit floating-point values with 8-bit integers, which makes the model smaller and faster. I'll set a target of losing no more than 3% accuracy. After quantization, the model's accuracy actually improves slightly, and it predicts the evaluation set about 30% faster. Hardware optimization has never been simpler.
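Below is a rough sketch of post-training dynamic quantization with Optimum Intel. The exact class names and arguments have changed across optimum-intel releases, so treat them as assumptions to verify against the current documentation; the full flow in the talk also passes an evaluation function and the 3% accuracy-loss target to drive the tuning.

```python
from optimum.intel import INCQuantizer
from neural_compressor.config import PostTrainingQuantConfig
from transformers import AutoModelForSequenceClassification

# Placeholder model id for the fine-tuned classifier.
model = AutoModelForSequenceClassification.from_pretrained(
    "my-username/amazon-shoe-reviews-distilbert")

# Post-training dynamic quantization: weights are stored as 8-bit integers.
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(quantization_config=quantization_config,
                   save_directory="quantized-model")
```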
That's a quick tour through the Hugging Face family. Machine learning code, not really. DevOps, not really. Simple code, copy-paste, adapt, experiment, and iterate. If you want to get started, this is the one slide you want to take a picture of. You can join Hugging Face, learn more about machine learning tasks and transformers, and ask questions in the forums. If your company needs help, we have a support program and can deploy these tools on your own infrastructure.
Looks like jet lag didn't kill me. Now I can crumble. Thank you very much. I hope you liked it. Thank you.