July 14, 2022
The DataHour: Building NLP applications using Hugging Face
As amazing as state-of-the-art machine learning models are, training, optimizing, and deploying them remains a challenging endeavor that requires a significant amount of time, resources, and skills, all the more when different languages are involved. Unfortunately, this complexity prevents most organizations from using these models effectively, if at all. Instead, wouldn’t it be great if we could just start from pre-trained versions and put them to work immediately?
This is the exact challenge that Hugging Face is tackling.
In this DataHour, Julien will introduce you to Transformer models and what business problems you can solve with them. Then, he’ll show you how you can simplify and accelerate your machine learning projects end-to-end: experimenting, training, optimizing, and deploying. Along the way, he’ll run some demos to keep things concrete and exciting!
Prerequisites: Enthusiasm for learning and basic concepts of Machine Learning & Python.
🔗 More action-packed sessions here: https://datahack.analyticsvidhya.com/contest/all/
Transcript
Hello and welcome everyone to another session in this DataHour series. We are thrilled to be here with you this evening for a session full of action-packed learning. I am Ketak Gunjar, part of the Data Science team at Analytics Vidhya, and I will be moderating this session along with my colleagues Rishabh and Monish. For those who have joined us for the first time, here is a brief introduction to the DataHour sessions. The DataHour is a series of webinars conducted by Analytics Vidhya and led by top industry experts. It is a fun way to understand the concepts of data science from the leading players in the data tech domain. As the name suggests, this is one whole hour dedicated only to data. We are hopeful that these sessions will be a great source of enrichment and value for our audience.
Now onto our session today, which is Building NLP Applications Using Hugging Face. In this DataHour, Julien, our speaker, will introduce you to transformer models and the business problems you can solve with them. Then he will show you how to simplify and accelerate your machine learning projects end-to-end, including experimenting, training, optimizing, and deploying. Along the way, he will also run some demos to keep things concrete and exciting for you. I hope you are excited to attend this DataHour with us. Now, on to our speaker. In this session, we have Julien Simon with us. Julien is currently Chief Evangelist at Hugging Face. He previously spent six years at Amazon Web Services, where he was the global technical evangelist for AI and machine learning. Prior to joining AWS, Julien served for 10 years as CTO or VP at large-scale startups. So over to you, Julien, the virtual stage is all yours.
Hello, Julien. Hi, Julien. Are you able to unmute yourself? Yeah, hi Julien, you can start your session. Okay, let me get started. Ready to go. Julien, are you able to hear us? Okay, I guess. Yes, I think you can start. Okay, all right, I'll take this as a yes. Okay, so let me share my screen, and we'll be starting right now. Okay, here we go. All right, you should see my screen now. Okay. Ah, double checking everything. Yeah, let me get this out of the way. I want to keep the chat window open if I can. All right, here it is. Okay, let's get started. So, I guess good morning, good afternoon, probably good evening. We have lots of people from all over the world. That's amazing. Thank you so much for joining. Hi, you know, I never thought there would be so many people on the session. So thanks again. And thanks for showing up, even if it's very late where you are. I really appreciate it. I'm based outside Paris. It's mid-afternoon for me right now.
So in this session, I'm going to try to introduce you to Hugging Face and how to build NLP applications. Let's start with a few slides to set the scene and introduce you to what transformers are, what Hugging Face is, and what we're trying to build. So, of course, we'll quickly dive into a very long demo, trying to highlight most of the solutions we have out there. Okay, and I'll keep some time for questions. So, you can post your questions in the Q&A window, and I'll try to keep an eye on it. Okay, so I'll try to answer as many as I can here. Okay, so let's get started.
The first thing we should discuss is the starting point. Of course, I'd love to convince you that transformers and Hugging Face are awesome, but to do that, I need to explain where we're starting from. What I call Deep Learning 1.0 is deep learning the way we've been practicing it from roughly 2012 to 2018, the first few years when deep learning exploded onto the stage and became not just a research tool but something that businesses and organizations could use. The foundation of deep learning is neural networks, which are really old technology when you think of it; they just came back and proved very efficient for different tasks like computer vision, natural language processing, and more.
So architectures like convolutional neural networks, recurrent neural networks, LSTMs, etc., have been heavily used and are still heavily used. To train models with neural networks, you need datasets. And as we all know, deep learning is very data-intensive. The majority of those problems are supervised learning problems, which means you need to label that data. Not only do you need to collect and clean the data, but you also need to label it, and that's a lot of work. It's very painful. Although there are some open datasets out there, like ImageNet for image recognition, generally, it's been extremely painful for companies to build those huge datasets, and it's been slowing them down in adopting deep learning.
To put the models, the algorithms, and the data together and build models, of course, you need computing power. So far, this has really meant GPUs. There's nothing wrong with that, but GPUs are expensive and weren't designed specifically for machine learning. They were originally designed for graphical computing and 3D games. They're a brute-force solution that gets the job done, but they're expensive and power-hungry, and there's got to be a better solution than that. Finally, we need to put all that stuff together and write some code to get training and deployment going. In the first few years, that really meant working with libraries like the first versions of TensorFlow and the first versions of Torch. These are great tools, but they're not the easiest ones to use. Even now, if you're going to write TensorFlow code or PyTorch code, you need to know what you're doing. You need a deep understanding of the models and the machine learning process. Let's face it, that's not something everybody has.
So it kind of restricts the field to experts, and we don't think that's a good thing, which is why we're trying to reinvent the way machine learning is done today. The first step is, of course, standardizing models with transformers. In a nutshell, a transformer is a deep learning architecture that originally proved very efficient for natural language processing tasks. Everyone here, I'm sure, has heard of the BERT model from Google, released in 2018. That was the first major transformer to be available. But very quickly, we saw models for computer vision, audio, and speech. So transformers are really becoming a kind of standard now for deep learning and, in a way, even for traditional machine learning.
Instead of building those huge datasets, we can now rely on transfer learning and pre-trained models. Transfer learning is a way to use models that have been pre-trained over a huge dataset, like Wikipedia or millions of images, to pick up the relationships between all the different concepts and patterns inside that data. This initial training job is very expensive and complex, but it's already been taken care of. Some companies and organizations have already done that for us, so now we can come and grab those models off the shelf. We can either use them as is, or we can take those pre-trained models and fine-tune them. Now you don't have to go through the crazy effort of collecting, cleaning, and labeling data. You just need a little bit of data to get the job done.
GPUs are still around, obviously, and they're still working. But we also see some new companies building dedicated chips for machine learning, whether it's for training or inference. These are amazingly efficient, and I'll talk about that a little more in a minute. Finally, putting everything together, we still need expert tools for low-level work and low-level libraries, but we also need easy, open-source libraries that everyone can use, even if they don't have a formal education in machine learning. So, in just a few lines of code, you can get the job done and go quickly from your IDE to your model without the need for expert skills or complex code writing. And hopefully, that's what I'll show you in the demo today.
So, transformers are this deep learning architecture, as I mentioned, but when it comes to Hugging Face, Transformers is also an open-source library that Hugging Face stewards. This is one of the most popular projects in open-source history, and we were really humbled by that and very impressed by the adoption we see in the community. What you see here on this slide is the number of GitHub stars for different projects. Hugging Face is the yellow line on the left, and you can see it has the steepest slope. That means we're growing faster than these other cool projects. Don't get me wrong; we have huge respect for all the projects out there and generally all open source, but it's really amazing to see we're growing faster than PyTorch, Keras, Spark, and even Kubernetes, which is a little bit mind-boggling. So that's pretty cool. And again, thank you everyone for supporting us. If you haven't given a star to the Transformers library, now would be the time to do it. Or maybe after the demo, we can always use more stars out there. So, go to GitHub.com/huggingface/transformers and give us a star.
Community adoption is really cool, but of course, we want to see industry adoption as well. We did see in a few industry reports, like the State of AI report, that transformers are increasingly popular. They're called out as a general-purpose architecture for ML, not just NLP. We see traditional deep learning architectures becoming less popular, and transformers becoming more popular. That's a really good sign that there's a shift from deep learning to transformers. Just to give you a couple of numbers, we have over 1 million model downloads from the Hugging Face Hub, which lives at huggingface.co. We serve over 100,000 users every day, and that's still growing very quickly. So, there's a lot of traffic on that website for sure.
Here's the family picture before we dive into the demo. We're going to cover quite a few of those building blocks in the demo. On the right-hand side, we have the artifacts hosted on the Hugging Face Hub. Over 6,000 datasets and over 55,000 models today. Starting from those, you can use the Transformers library to train your models with very little code and pretty simple code, as you will see. But you could also use Auto-Train, which is our AutoML solution. It's a no-code solution where you just use the user interface to train your models. You can also use the Optimum open-source library to accelerate training. I'll talk about Optimum in a little more detail in a second. And when you have a model you like, you can deploy it on one of our solutions called Spaces. Spaces is a really easy way to build a web app to showcase your models. We'll look at Spaces today. Finally, if you want to deploy for production, you can deploy on our very own managed inference API, which I'll show you as well. You can use Optimum to accelerate inference, too.
So, that cycle is very fast, as you will see. It's not a lot of code that you need to write, and you can iterate very quickly to get to high-quality models. I also want to mention that we have cloud partnerships. The first of those is with AWS, where Hugging Face is a first-party framework on Amazon SageMaker. Hugging Face containers for training and inference are readily available in SageMaker, so you don't have to build them. You can just bring your code and leverage the long list of SageMaker features. A few weeks ago, we also launched Hugging Face endpoints on Azure, where you can deploy any public NLP model from the hub to a managed endpoint on Azure. There will probably be more things coming up, but as you can see, that's the family picture. So, quite a few options here. Open source, you know, run them on your laptop, run them on your server, run them on the cloud. We want to be everywhere that you are, right?
Maybe just a quick word on hardware acceleration, which I think is a really important topic. Optimum is worth a look, and I'll show you an example of model quantization at the end of the demo. Basically, Optimum builds on top of the Transformers library to bring you training acceleration with chips from Habana, and inference acceleration using ONNX Runtime and the Intel Neural Compressor. The API is very close to the Transformers library, so it's very easy to move your vanilla Transformers code to Optimum. We also have another hardware acceleration collaboration, which is not part of Optimum: the AWS SDK for Inferentia, a custom chip that AWS built to accelerate inference. All right, that's the family picture. I will share some resources at the end, but for now, I think it's time to start running some code.
So, before everybody asks, yes, this code is available. I'll leave it on the screen for a second. You can grab this whole thing on GitLab here. This is actually a self-paced workshop that I'm building and keep adding stuff to it and trying to keep it up to date. I've picked a few modules out of that today, but there's certainly more to explore, and if you keep an eye on it, I'll keep adding stuff. Okay. All right. So, GitLab.com, and maybe I should post that to the chat as well so that you can all catch it. Okay. There we go. All right. So, there are quite a few things, and this is trying to replicate a real-world scenario.
I'll accelerate a little bit because we really want to be running on time, but I encourage you to go and read this intro notebook here. Basically, what we're trying to do is assume we're working for a retailer selling shoes. We'd like to build a machine learning model to classify customer feedback, whether it's product reviews on our website or forums or social networks. To keep it simple, we'd like to provide an English language review and score it using the Amazon star rating concept. You could do positive, negative, neutral, or different things, but here, let's go for stars. Simple enough. When it comes to machine learning, this is a multi-class classification problem where I'm going to give you an English piece of text and predict whether it's a one-star, two-star, three-star, four-star, or five-star review. So, five classes.
I mentioned those 55,000 models, so we could look at the hub and maybe we're lucky. Maybe someone already built this. We could browse the Hugging Face Hub and look for text classification models for English, and maybe someone has already shared that. There are tons of models for sentiment analysis, hate speech detection, emotion detection, and more. But, unfortunately, we don't have a model to classify product reviews for stars. So, we'll need to pick a model that's been originally trained, and we can pick any of the language models for English. We don't even need to start with a sentiment analysis model. Here, I'll just start from the DistilBERT model, which is a condensed version of the BERT model. This one has been trained on a very large English corpus. I'll start from there and fine-tune it on our own data.
Speaking of which, we do need data. If we're a shoe retailer, we probably have some customer reviews and customer posts on forums, etc. But remember, you need to clean that stuff and label it, and it's a pain. Maybe you need to do that to get to the maximum accuracy you want. But to get started quickly, maybe we can find a dataset that just works out of the box. In fact, there is this Amazon US reviews dataset. Let's click on that. As the name implies, it's millions of real-life reviews from Amazon.com. It's neatly organized in categories, and of course, they have shoes. We could just go and click on any of those. We see the product title, the product review, and the star rating. So, this looks like a good start. We have customer content and a star rating. These are not the exact shoes we're selling, but I'm guessing if someone's happy with shoes, they're going to say similar things regardless of the specific shoes. So, this is a really good place to get started. That's what we're going to do now. We're going to start working with this dataset and see where that takes us.
Moving on to the next notebook, let's take a look at those shoe reviews. The library I'm going to be using here is mostly the datasets library, another Hugging Face library, so easy to install. In one line of code, I can just download that Amazon review dataset from the hub. It's really huge, over two gigabytes, and it has 4.3 million reviews. That's nice. That's a lot of data. But to experiment, I don't need that much. So, I'll just download 10%. That's a reasonable number. It's going to be cached on my local machine, so it's going to be fast. You can see here we're using datasets, so you're not downloading again and again; it's cached locally. Now we have 436,000 reviews. We can see the different columns. Here's an example, and we can see all those columns. There's a lot of stuff here.
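For reference, that loading step looks roughly like this with the datasets library. This is a minimal sketch: the dataset and configuration names ("amazon_us_reviews", "Shoes_v1_00") reflect how this dataset was published on the Hub at the time and may have changed since.

```python
from datasets import load_dataset

# Grab only 10% of the training split; the download is cached locally,
# so re-running this does not fetch the data again.
reviews = load_dataset("amazon_us_reviews", "Shoes_v1_00", split="train[:10%]")

print(reviews)      # number of rows and the column names
print(reviews[0])   # one full review with all of its columns
```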
What we want to keep is really the review body and the star rating. The product title could be interesting, maybe other things would be interesting, but I'm just going to stick to those two columns and remove everything else. Now my dataset looks like this. You can see how easy it is to work with the datasets library. A couple of lines of code to clean things out, and it's super nice. I want to check that my star ratings are what they should be, make sure we don't have weird values. The unique values are really 1, 2, 3, 4, 5. Good to know. Another thing I could do is count how many reviews I have per class. I can see a potential problem here. We have a ton of five-star reviews and not so many one and two-star reviews. This is a challenge because if we have way too many five-star ratings, the model is going to be biased toward that class. So, I'll run a couple of pandas lines of code and rebalance everything to make sure I have the same number of reviews in each class. The imbalance is almost one to 10 between one-star and five-star reviews, so let's just rebalance everything. We'll have 100,000 reviews, which is still more than we need, and we can work with that.
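A hedged sketch of that cleanup and rebalancing, assuming the columns are named review_body and star_rating as shown on the dataset page; the per-class sample size is just an illustration.

```python
from datasets import Dataset

# Keep only the review text and the star rating.
cols_to_drop = [c for c in reviews.column_names
                if c not in ("review_body", "star_rating")]
reviews = reviews.remove_columns(cols_to_drop)

# Sanity check: label values should be exactly 1..5.
print(reviews.unique("star_rating"))

# Rebalance: the same number of reviews per class
# (20,000 each here, for a 100,000-review working set).
df = reviews.to_pandas()
df = df.groupby("star_rating").sample(n=20_000, random_state=42)
reviews = Dataset.from_pandas(df, preserve_index=False)
print(df["star_rating"].value_counts())
```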
Another thing I need to do is make sure the labels start at zero; machine learning is no different from the rest of computer science here. My star ratings start at one, and if I train that way, I'm going to run into problems. So, I can write a simple function, apply it to the full dataset, and decrement all the star rating values by one. Now my star rating values are between 0 and 4, which is what the model expects. All right. So, now my dataset looks like this. Labels are from 0 to 4, and text is just text. I'm going to split this between training and validation. I'll keep 10% for validation and 90% for training, which are typical values. My dataset is ready. With the datasets library, you can download data from the hub, explore it, and process it, all with a super simple API. You can learn this library in a couple of hours.
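A minimal sketch of that relabeling and splitting step, continuing from the previous snippet (the column and variable names are carried over from it):

```python
# Shift labels so they start at 0, rename the text column,
# and carve out a 10% validation split.
def decrement_stars(row):
    return {"labels": row["star_rating"] - 1}

reviews = reviews.map(decrement_stars, remove_columns=["star_rating"])
reviews = reviews.rename_column("review_body", "text")

splits = reviews.train_test_split(test_size=0.1, seed=42)
train_dataset, eval_dataset = splits["train"], splits["test"]
print(train_dataset, eval_dataset)
```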
I can save that data locally, why not? I could load it again just like that, super simple. I could equally easily save it as CSV data. It's always good to keep a CSV file somewhere. I could save it to Amazon S3 or use another cloud storage service on Google Cloud or Azure to save that data in the cloud. But most of all, I would want to push that dataset back to the hub. All it takes is this: just call the push to hub API, give the name of the dataset repository on the hub, and it saves everything. If I go to this repo in my HuggingFace account, I can see this dataset. I can explore it, and I see that my files have been automatically saved in a Git repository because all models and data on the hub are managed as Git repos. We could use the load dataset library to load the dataset or use the Git workflow to clone the dataset directly from the hub, whichever you prefer. So, now my data is ready and on the hub. It's time to train something on it. Let's move on to this notebook.
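The saving options mentioned here look roughly like the sketch below; the repository name is a placeholder, and pushing requires being logged in with a Hugging Face token (for example via `huggingface-cli login`).

```python
from datasets import DatasetDict, load_dataset

dataset = DatasetDict({"train": train_dataset, "validation": eval_dataset})

# Save and reload locally.
dataset.save_to_disk("./amazon_shoe_reviews")
# from datasets import load_from_disk
# dataset = load_from_disk("./amazon_shoe_reviews")

# Export a split to CSV, just to keep a copy around.
dataset["train"].to_csv("train.csv")

# Push everything to the Hub as a dataset repository.
dataset.push_to_hub("my-username/amazon-shoe-reviews")

# Later, load it straight back from the Hub.
dataset = load_dataset("my-username/amazon-shoe-reviews")
```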
I'm installing a few more libraries, and most of all, I'm installing the Transformers library, which is the one I'm going to use to train. I'm training on this machine with a GPU. Generally, you would want GPU compute or Optimum acceleration to train. Training on CPU is going to be extremely painful if possible at all. Importing a few objects we'll see in a second, defining a few hyperparameters. To keep my training time low, I'm just going to train for one epoch. I have to use five labels, remember? Star ratings go from zero to four now, so five classes. The learning rate, batch size, and a few more things. You could probably use the default values here if you wanted to. I need to load the data, so I'll pass that to my training job. I can load it from the hub just like we saw. Of course, I could load it locally, but let's load it from the hub. No change; the training set has 90k samples, and the validation set has 10k. It still looks the way it should. That's good news. That's pretty much what we need to do with the dataset; it's been prepared already.
I want some detailed metrics during my training job, so I would like to see accuracy, F1 score, precision, and recall. I can write a simple compute metrics function that takes the predictions as input and scores the predictions compared to the original labels. This is not specific to the text classification problem we're working on. Generally, you'll see that this code, the Transformers code, can easily be reused from one task type to the next. All right. So, now I'm going to grab my model. This model is the base model ID I define here. It's my DistilBERT model. We can see it on the hub. It's been trained on a ton of English text, so it's a good starting point to understand what customers are saying about our products.
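A typical compute_metrics function for the Trainer looks like this sketch. It is completely generic and uses scikit-learn for scoring, which is one common way to do it; the talk may compute the metrics with a different backend.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# The Trainer passes in the raw logits and the reference labels,
# and expects a dict of named scores back.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted")
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
```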
I'm downloading the model and the tokenizer. If you're not familiar with NLP, one of the key things you need to understand is that words, and strings in general, mean nothing to machine learning models. Machine learning models want numerical data. So, tokenization is a process that replaces each individual word and punctuation sign with an integer. As the model has been pre-trained, we also get a trained tokenizer that replaces each word from the vocabulary this model has been exposed to with an integer. We'll see that in more detail in a few minutes, but it's not critical right now that you understand it. I apply the tokenizer to the training set and the validation set, the same way we did when we decremented the star ratings: we write a simple function that applies the tokenizer to the text column in the dataset and call the map API to process everything. Very simple data processing.
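In code, that tokenization step is a one-line function applied with map; this sketch assumes the text column is called text after the earlier preparation.

```python
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Convert the text column to token ids; batched=True processes
# the dataset in chunks, which is much faster.
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

train_dataset = train_dataset.map(tokenize, batched=True)
eval_dataset = eval_dataset.map(tokenize, batched=True)
```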
The dataset has been turned into tokens, and we can actually go and train. We first define our training arguments, where to store the model, how many epochs to train for, the batch size, the learning rate, and all that good stuff. One important parameter is this one because, as we will see, we can automatically push the trained model back to the hub. This is where I would like to push it, in that new repo. Finally, we put everything together with the Trainer API, which is super simple and high-level. It's really my preferred way of training models. I just pass the model, the training arguments, the tokenizer, the metrics function, the training set, and the validation set. Then I call train, and that's it. This fires up the training, I should say, the fine-tuning of this model. As you can see, we did not write any machine learning code here. The model I'm working with is a PyTorch model, but I did not have to understand how to train a PyTorch model. I'm just using the high-level Trainer API to get the job done. Once again, you can see how generic this is. There's nothing here that's specific to text classification.
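Putting it all together with the Trainer API looks roughly like this; the hyperparameter values and the hub_model_id are illustrative, not the exact ones used in the demo.

```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

# Five classes: star ratings 0..4.
model = AutoModelForSequenceClassification.from_pretrained(model_id,
                                                           num_labels=5)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,               # keep training time low
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="epoch",
    push_to_hub=True,                 # upload the trained model afterwards
    hub_model_id="my-username/distilbert-amazon-shoe-reviews",
)

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```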
I'm copying and pasting that code again and again across NLP, computer vision, and audio examples because it is completely generic. And I think this is really one of the strengths of this library. You don't need to know anything about TensorFlow or PyTorch, and in fact, you don't really need to know much about machine learning at all. You just need to understand what data you want to train on and what model you want to apply to it, and off you go. We see the training happening locally. We train for a single epoch, so I just see one line of log. I see all the metrics that my compute metrics function computed. My accuracy is about 57%, which is a good start, but not amazing. For one epoch and five classes, it's probably good enough to continue working here. But of course, you would want to train a little faster. This took about 24 minutes on a GPU, so still pretty intensive.
What's next? Well, next, we could evaluate here on the validation data, but if you had a test set or another benchmark dataset, you could very easily evaluate the model on it. Of course, you want to save the model. You could do that locally, or you could push it to the hub. When you do that, this model is going to be pushed to the hub. The information has been added automatically for me, which is nice. We call this the model card. It describes the metrics and hyperparameters. If I had trained for 10 epochs, I would see information on 10 different epochs. All that stuff is created automatically, and it's not magic; it's just a markdown file created in the repo and displayed by the hub. You can edit this and add extra information on the training process, to make it very clear what this model is and what it's good for. I can certainly try this model right there. Let's give it a shot. This should know about shoes now, so let's see what this thing says.
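Evaluating, saving, and pushing are each a single call on the trainer; a minimal sketch, using the objects defined above:

```python
# Evaluate on any prepared dataset, save the model locally, and push
# the model plus its auto-generated model card to the Hub.
metrics = trainer.evaluate(eval_dataset=eval_dataset)
print(metrics)
trainer.save_model("./distilbert-amazon-shoe-reviews")
trainer.push_to_hub()
```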
This is the inference widget, and it's going to load the model on demand on the inference service that we build. It should predict this as a positive review. Let's wait a few more seconds. Okay, yes, so "LABEL_4" really means five stars. I could have renamed those labels to make them more human-friendly; these are the default ones. So, that's a five-star review with a confidence score of almost 88%. That's a very positive review, and this model is kind of promising. All right, so where do we go from here? We have this model, we've pushed it to the Hub, and now it's available. I'm also downloading the tokenizer and tokenizing a couple of reviews. The tokenizer transforms strings into something the model understands. For example, with this uncased model, "The" and "the" are tokenized the same way because they represent the same concept; each word maps to an integer from the vocabulary ("the", for instance, is token 1996). We can add padding if the samples are too short and truncation if they are too long.
We pass these tokens to the model and get outputs. These are not probabilities; they are raw activation values, or logits. To make them look like probabilities, we apply the softmax function, which rescales the values so they add up to one. For the first sample, the shoes falling apart after a few days, the highest probability is for class zero, which is a one-star rating, very negative. The other review is very positive, with the highest probability for class four, which is a five-star rating. That's what softmax does.
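Here is what that manual inference path looks like in code, with made-up review strings and the tokenizer and model from the training steps above:

```python
import torch

# Two example reviews, tokenized and sent through the fine-tuned model;
# softmax turns the raw logits into probabilities.
samples = [
    "These shoes fell apart after a few days, what a waste of money.",
    "Great fit, very comfortable, I wear them every day!",
]
inputs = tokenizer(samples, padding=True, truncation=True,
                   return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(probabilities)                       # one row per review, 5 columns
print(torch.argmax(probabilities, dim=1))  # 0 = one star, 4 = five stars
```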
Let's try deploying this model on the inference API. You need a Hugging Face account, so let's go to my Hugging Face account and log in. I'll use the access token called Demo. I'm building the inference API URL with the name of my model, passing my token in the authorization header, and running an HTTPS query with a payload. The model loads on demand, so it might take a few seconds. You can use the inference API for free, but models are unloaded again after a short period without traffic. To keep models loaded and pinned on GPUs, you need a paid plan. Once the model is loaded, you can predict just like that. This is one of the simplest ways to deploy a model to a production API.
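A minimal sketch of calling the hosted Inference API with requests; the model id and token below are placeholders.

```python
import requests

API_URL = ("https://api-inference.huggingface.co/models/"
           "my-username/distilbert-amazon-shoe-reviews")
headers = {"Authorization": "Bearer hf_xxxxxxxx"}   # your access token

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

# The first call may return a "model is loading" message; retry after
# a few seconds once the model is warm.
print(query({"inputs": "I love these shoes, best purchase ever!"}))
```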
Now, imagine you've iterated a few times and have a good, accurate model. You want to show it to your marketing director or customers, who are not technical. Showing them a Jupyter notebook won't be impressive; they want to see a web application. Here's a simple web app where we can type a review. Let's take a real one. Shoes. No shoes on the homepage, so let's grab some shoes. This is predicted as a four-star review with 47% confidence, still a pretty positive review. Showing this to customers or your marketing director makes it clear what you're doing, and they can use and test it if you integrate it into your web or mobile app. This is based on a solution we built called Spaces.
How many lines of code does this take? It's 15 lines. I imported my model, created a simple user interface using the Gradio framework, which is part of Hugging Face. There's an input box, an output box, and a button that calls the predict function, extracts the numerical label, and prints out the corresponding number of stars. You can run this on your local machine, test it, and push it to a Space repo on Hugging Face, which will automatically create the web app. Spaces is great, and I encourage you to check them out. There are many examples, and the best way to learn is to find a model you like, check its page, and see the spaces that use it.
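A Gradio app of that shape fits in roughly the 15 lines mentioned above. This is a sketch rather than the exact Space code, and the model id is a placeholder.

```python
import gradio as gr
from transformers import pipeline

# Load the fine-tuned model from the Hub as a text-classification pipeline.
classifier = pipeline("text-classification",
                      model="my-username/distilbert-amazon-shoe-reviews")

def predict(review):
    result = classifier(review)[0]                  # e.g. {'label': 'LABEL_4', 'score': 0.88}
    stars = int(result["label"].split("_")[1]) + 1  # LABEL_4 -> 5 stars
    return f"{stars} star(s) ({result['score']:.0%} confidence)"

# One input box, one output box, and a button wired to predict().
demo = gr.Interface(fn=predict,
                    inputs=gr.Textbox(label="Your review"),
                    outputs=gr.Textbox(label="Prediction"))
demo.launch()
```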
The very last thing I want to show you is inference optimization. Your model might be a bit slow in production. One way to solve this is to use Optimum, which is integrated with different chips. Here, I'll use Intel optimizations. I'm installing Optimum and running this on my Mac, so I don't need a GPU. I'm loading a test set and the same model we trained. I write a simple evaluation function to score the evaluation dataset on this model because quantizing and optimizing can degrade accuracy. We use the Intel Neural Compressor to do this and set an accuracy target, like not losing more than 3% accuracy.
Optimum computes the baseline accuracy and how long it took. The model predicted the evaluation dataset in 18 seconds with 42% accuracy. Now, it's running optimization, replacing 32-bit floating-point values with 8-bit integers. After the first pass, the optimized model is a bit more accurate at 42.79%. The optimized model predicts in less than 13 seconds, a 30% speedup. You can save the model and push it to the hub. Just a few lines of code, and you get significant results.
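The demo does this with Optimum and the Intel Neural Compressor; since that exact API has evolved, here is the same idea sketched with plain PyTorch dynamic quantization instead: swap the fp32 weights of the linear layers for int8 and compare accuracy and latency on CPU.

```python
import time
import torch

# Dynamic quantization: replace fp32 linear-layer weights with int8.
cpu_model = model.to("cpu").eval()
quantized_model = torch.quantization.quantize_dynamic(
    cpu_model, {torch.nn.Linear}, dtype=torch.qint8)

def evaluate(m, dataset):
    correct = 0
    for example in dataset:
        inputs = tokenizer(example["text"], truncation=True,
                           return_tensors="pt")
        with torch.no_grad():
            prediction = m(**inputs).logits.argmax(-1).item()
        correct += int(prediction == example["labels"])
    return correct / len(dataset)

subset = eval_dataset.select(range(1000))   # small slice to keep it quick
for name, m in [("fp32", cpu_model), ("int8", quantized_model)]:
    start = time.time()
    accuracy = evaluate(m, subset)
    print(f"{name}: accuracy={accuracy:.2%}, time={time.time() - start:.1f}s")
```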
That's pretty much what I wanted to tell you today. Here are some resources to get started. Sign up on Hugging Face if you haven't. The best place to start is the tasks page, which introduces you to various machine learning tasks. Take the Hugging Face course, which is beginner-friendly. If you have questions, ask them on the forums where the whole team is waiting to help. We have additional commercial products for companies needing support or an in-house version of the hub.
If you want to stay in touch, follow me on Twitter, my Medium blog, and my YouTube channel. I hope this was useful. Thank you for being here. I'm impressed by the diversity of the audience from all over the world. Machine learning can solve global problems, so get educated and start solving them. Thank you for your questions and messages. Thanks to Analytics Vidhya for the invite. Have a great day. Bye-bye. Thanks a lot, Julien. This was really insightful. I would like to thank you for your time and for delivering such a wonderful session on behalf of Analytics Vidhya. I'm sure our audience found it insightful, and hopefully, we can conduct more such sessions with you in the future.