Julien Simon Machine Learning 2.0 With Hugging Face PyData London 2022

July 07, 2022
Julien Simon Presents: Machine Learning 2.0 with Hugging Face

In this session, we'll introduce you to Transformer models and the business problems you can solve with them. Then, we'll show you how to simplify and accelerate your machine learning projects end to end: experimenting, training, optimizing, and deploying. Along the way, we'll run some demos to keep things concrete and exciting!

https://www.pydata.org

PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with presentations from novice to advanced levels. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.

Transcript

Please come in and take plenty of seats in the front; it's safe. Usually, at shows, you don't want to sit in the first row because you never know what's going to happen. But I can promise it's totally safe today. I won't pick on you or anything, even if you run Python on Windows, I won't, okay. Thank you very much for showing up. It's a pleasure to be back in actual rooms with actual people after sitting on my bottom for two years and watching Zoom people. Thank you, London, for not being 40 degrees Celsius today. I just took the train from Paris; it's blazing hot there, so stay in London. It's raining, very safe. My name is Julien, and I work for a company you may have heard of called Hugging Face. We build this thing called Transformers, and that's what I'm going to talk about today. I have a few slides to get the party going, just a few, I promise, and then we'll start running some demos, okay. Before we get started, who here has never worked with transformers? It's okay. All right, I barely know the stuff. Okay. Who's running production workloads on transformers? Okay, a few, all right, and everybody's kind of in the middle. Okay, so hopefully, everybody will learn something today. We like to oversimplify things, and I'm no different. So, I guess deep learning kind of started, technically it was there in the 1940s, but it really exploded onto the stage from 2012 onwards, when people started using GPUs for more than shooting each other in video games, which was still a good thing to do, I guess. That's what I call deep learning 1.0. Back then, what made deep learning possible was that neural networks became cool again after decades of being unusable due to a lack of computing power and scalability. So, neural networks, convolutional networks, recurrent neural networks became hot again. People started solving complex problems with those. Obviously, we needed some data for that, and we had a few open datasets.
The king of them all was ImageNet back then, with the ImageNet competition. All of a sudden, all those PhDs got obsessed with recognizing dogs, cats, and elephants, which was probably a refreshing change from their daily work. But anyway, that got the interest, and people could understand that AI is actually able to understand complex data and features. Putting the two together, of course, would not be very good if you didn't have GPUs and massive computing power. Cloud computing made it reasonably easy and cheap to grab those GPUs for a couple of hours, train some models, etc. The last bit was tools. And I think that's where the problem was, and maybe it still is for a lot of folks. Expert tools. Who wants to tell me that the first few versions of TensorFlow or Theano were user-friendly? No? No? No. They were great, much better than the alternative, which was writing everything from scratch. But they were still very hard to use. Those initial open-source libraries were not for the average developer. You really needed some deep machine learning and computer science background to get them working. Still, some people got some work done, although a lot of people got the feeling that deep learning wasn't really delivering. Everybody went to work, and I think things have changed a little bit now. So, I'm calling this 2.0. I'm really hoping it's not 1.1, but we'll see in a few years. The main thing for us at Hugging Face is Transformers. Everybody here has heard of BERT, Google BERT, and the following models. These have really broken all those state-of-the-art benchmarks and keep breaking them. The transformer architecture has proven very efficient, much more efficient than anything that came before, like CNNs and LSTMs. Initially, on natural language processing, but now increasingly on computer vision, speech, reinforcement learning, and all kinds of different new tasks. We think this is a significant step. 
Instead of building massive datasets, which no one really likes (raise your hand if you like labeling and cleaning data four days a week), okay, good, you're smart people in London. Yes, we use transfer learning instead. Someone has built those massive datasets, and they don't need to be particularly clean because transformers are robust to unclean data. They've pre-trained them on large GPU clusters and give them away to the community. Now, you can grab your BERT, RoBERTa, or T5 model, pre-trained, and use it as is or fine-tune it a little bit. This is called transfer learning. The benefit is you don't need to label millions of data instances or spend hundreds of thousands of dollars training models from scratch. You can just reuse existing models. GPUs are still around, and many of you use them daily. As great as they are, they were never designed for machine learning, but they're catching up. We see a new generation of chips designed and optimized for machine learning workloads, either for training or inference, and I'll talk about that a little later. Finally, Hugging Face, with the help of the open-source community, is pushing very hard to build software tools that everyone can use. If you're an expert, that's great; you can still tweak things to death. But if you're the average Joe or Jane and know a little Python, you can now plug some machine learning magic into your app much more easily than before. We keep improving that. That's our vision and what we're trying to build. More than anything else, you may know Hugging Face for the Transformers library, one of the fastest-growing open-source projects. We're the yellow line on the left, with the steepest slope. These are GitHub stars, and we're past 65,000 stars. Thanks, everyone, and if you haven't starred the repo, you know what to do next. We're growing faster than Kubernetes, which is pretty cool. It shows we have a lot of adoption.
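Reusing a pre-trained model really can be a few lines of code. Here is a minimal sketch of the "use it as is" path with the `pipeline` API; the model name and review text are illustrative, not from the talk:

```python
from transformers import pipeline

# Grab a pre-trained sentiment model from the Hugging Face Hub and use
# it as is, with no fine-tuning; the checkpoint name is illustrative
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

result = classifier("These shoes are incredibly comfortable!")
print(result)
```

The pipeline downloads the model once, caches it locally, and handles tokenization and post-processing for you.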
The State of AI report, which I'm sure you've read, mentions that transformers are becoming a general-purpose solution. I'm not tired of doing NLP talks, and we'll do NLP again today, but transformers work very well in computer vision, speech, and many other use cases. The Kaggle data science survey echoed this, showing the popularity of algorithms in the Kaggle community. We see RNNs and CNNs going down and transformers going up. It's a good sign that deep learning workloads are shifting to the transformer architecture. You should learn about it, not just for Kaggle competitions. Usage numbers are always fun. We have over one million model downloads every day and more than 100,000 hub users daily. These days, it's particularly insane. Who here has tried the DALL·E Mini thing? Okay, the infra team is thanking you for very busy nights scaling that. But it's a lot of fun. This is what the family picture looks like. I try to put everything on a single page. What we're striving to do is build a very fast development cycle that's truly agile. Within a day, you can go from a dataset and a model, fine-tune them using your infrastructure or cloud infrastructure, build a demo with Spaces, deploy it, evaluate the model in production, and keep doing that. We see our customers doing this in a day, sometimes in hours. I keep telling customers and execs that this is a very fast way to do machine learning. You can have something to show next Monday with a nice safety margin. This is a very fast way to do machine learning, and you'll see in the demo later that we're not writing machine learning code, which is great because I can't really do that anyway. The cycle we promote uses the datasets library, the transformers library, and the Optimum library, all open-source. We have customers using the cloud, and we have a partnership with AWS on SageMaker. We have a very deep integration and co-engineering with AWS on SageMaker. Anyone using SageMaker here? Good.
Because I worked for AWS before, so I convinced you, right? To keep everyone happy, we also have a partnership with Microsoft Azure. This one is recent. You can go to the Azure marketplace, look for Hugging Face, and in a couple of clicks, deploy any NLP model from the hub behind a managed endpoint and predict. If you like Azure, that's fine. There's one cloud missing from that picture. If anyone is working for that cloud in the room, you know where to find me. If you have friends working there, give them my email address; I'm sure they'll listen. That's the big picture. We'll look at most of these blocks quickly, but you can get all the code later and run it again. One thing I want to zoom in on is hardware acceleration because it's becoming more important. These transformer models are huge. BERT initially seemed big, but by today's standards, BERT and DistilBERT are very lightweight. New models are 5 gigs, 10 gigs, 20 gigs, 30 gigs. We have BigScience models that we're training with the community, and these will be hundreds of gigs. Loading, training, and predicting is difficult. We have a library called Optimum, which I encourage you to look at. It's an open-source library built on hardware acceleration partnerships with companies like Habana and Graphcore for training, and on ONNX Runtime and Intel for inference. It makes it very easy to train or infer with your models, quantize, etc. We also have integration with AWS Inferentia, their custom chip for hardware acceleration on inference. This is a problem you'll face in production. Those of you who are in production are probably facing it already, and those who are not will definitely hit that problem at some point. It's a good idea to check out what Optimum can do and learn how to speed up those models. That's it for the slides. We'll come back to the last one for resources, but for now, I need to put my glasses on, and we need to start running demos. This is part of a workshop I'm building.
You can find all the code at gitlab.com/juliensimon/huggingface-demos. I've decided to pick just a few bits and pieces. In a nutshell, we're training a model to classify Amazon product reviews according to the star rating, one to five stars, focusing on shoe reviews. We'll go through the major steps you would take if you had to work on this project yourself. The first step is to go to the Hugging Face Hub, where you have 53,000 models and almost 6,000 datasets. You could ask, is there a model ready to go out of the box? We don't have a ready-made model to classify shoe reviews by star rating, but we have things that are close enough. We even have a dataset. The Amazon US reviews dataset is on the hub and includes shoes. You can visualize this data on the Hugging Face Hub and see that it's a good starting point. If you're a shoe retailer, you probably have thousands of shoe reviews. You can use this data, but maybe you need to clean and process it. Since you promised to show something next Monday, you'll start from data that's already there. That's why we have almost 6,000 datasets on the hub: data that's ready to use in your project. The first step will be to use this data and build a proof of concept (POC) to show what the model could look like and get feedback. Meanwhile, you can assign data science interns to label 1.2 million unclean product reviews, tweets, and emails. Of course, they will all quit after a while, which is why you have interns. They need to learn data cleaning, the number one skill. Seriously, this is what we're going to do. We can start working with the datasets library, one of our open-source libraries. Loading the dataset is as easy as one line. This downloads the dataset from the hub. If you have your own data in Amazon S3 or Azure storage, you can do the same. Let's run this and see what happens. It's a big one, with 4.3 million reviews, so we'll take just 10% for experimentation. We can look at this data and see lots of columns.
We'll keep it simple and focus on the star rating and the review itself. We'll drop everything else. The datasets API is simple if you know pandas. We can check for weird values, ensure the dataset is balanced, and see that we have many more five-star reviews than other ratings. This isn't ideal, so let's rebalance everything and keep 20,000 for each star rating. A few lines of code, and we have a balanced dataset. We need to decrement the labels to start at zero. We can write a small function, map it to the datasets, and now we have everything we need. We can split the dataset for test and validation, save it locally, and push it to the hub. I've done this before, so here's my dataset on the hub. You can share it with your team or the community. All it took was a simple notebook and pushing it to the hub. If you've never used Hugging Face before, this is just a Git repository. You can use the open-source libraries to work with it or use the Git workflow to clone from the URL directly. Now we have a dataset and can use it to fine-tune a model. There was a talk earlier today on training Hugging Face models on SageMaker, but my train was late, and I missed it. I encourage you to look at it if you're using AWS. This time, I'll use the transformers library and start with the DistilBERT model, which I can see on the hub. This is a BERT-style language model for English, but it hasn't been set up for our classification task. We'll grab this model, download it from the hub, configure it for classification, and apply our dataset. We need to define a few parameters, load the dataset, and define a metrics function for the training job. We grab the DistilBERT model and tokenizer, tokenize the training and validation sets, and configure the training job. The underlying model is a PyTorch model, but we can use the high-level trainer API in the Hugging Face transformers library. We define the model, parameters, tokenizer, metric, and datasets, and then call train.
For simplicity, I trained it for a single epoch to keep the training time reasonable. Once training completes, we can see the metrics, evaluate the model, and push it to the hub. The push-to-hub API creates a new Git repo on the Hugging Face hub, uploads all the model files, and creates a model card with metrics, hyperparameters, and library versions. This is a markdown file, so you can edit it and add your own description. This is good practice for anyone in your team or company who wants to use the model. Now we have the model on the hub and can predict with it. We can test it here. This loads the model on demand on our infrastructure and predicts using the inference API, which is one of our services. The free version is integrated on the model page, and we have a commercial version for companies who want 24/7 endpoints. Let's give it a few seconds to load. We see that this is label 4, which means 5 stars because we decremented everything. It's a very positive review. You see how fast this is. In a couple of hours, you can experiment, deploy, and show results. We didn't write machine learning code; we just grabbed off-the-shelf components and used the Hugging Face services. Another bit you might be interested in is a new library called Evaluate, which makes it easy to evaluate your models on different metrics. This is useful early in the project when you have multiple models to experiment with. We download the model from the hub, load the evaluation set, and see all the built-in metrics. You can add custom metrics if needed. It's as easy as loading the accuracy metric and computing it on the model. We can push the results to the hub to keep track of all the experiments. The model page stores all the information, so you can evaluate the model on multiple test sets and store the results there. Now that we have a model we like, let's build a demo. Raise your hand if you've done a Jupyter Notebook demo to business people who went, "Huh?" Yes, all of you, me too. 
You show them tokens and attention masks, and they're lost. Instead, let's build something they can relate to and understand. This is called Spaces, a simple way to build a demo app on the Hugging Face hub using Gradio or Streamlit. I'm not much of a UI guy, but I managed to do it and even enjoyed it. You can play with all those models on the hub. I'll show you the simplest example. We import the Gradio framework and transformers, create a text classification pipeline, and build a simple interface with an input text box and an output text box. The input is where we'll type the review, and the output is where we'll see the result. We have a big button that invokes the predict function, which passes the text input, predicts it, and displays the result. I created a space repo on the Hugging Face hub, added 10 lines of Python, pushed, and waited 30 seconds for the space to go up. That's it. You can try it here. This is what it looks like. I went from this to this by creating a space repo, adding the code, and pushing it. It's a trivial example, but we have better ones. Let's try this one. It runs voice queries on financial documents, does speech-to-text, translation, and semantic search using Sentence Transformers and Facebook models. Let's try it. I'll record something here. It will do speech-to-text, translate from French to English, and then do semantic search on S&P 500 annual reports to find the top matching sentences. Imagine showing this to a business owner instead of a complex notebook. I do this demo a lot for bank CXOs, and they go blank for 60 seconds, then ask, "How much?" That's where the interesting discussion starts. It took me two days to build this, but anyone here can do it in an afternoon once you figure it out. Spaces is super powerful, so check it out. Find something that looks like your problem, read the code, and build something cool. The last bit I want to show you is hardware acceleration.
You've got your dataset, found your model, done your CXO demo, and promised it for production next week. You deploy it, and it predicts in 900 milliseconds, which is too slow. Enter hardware acceleration. I'll show you a couple of examples on inference, not training, which is a session in itself. I'll show you how to use ONNX acceleration and Intel Neural Compressor acceleration, which we just launched. It's not scary because every time you say hardware in a room full of data science people, they tend to go, "You mean what? You mean GPUs?" Don't worry; we don't need any hardware skills, which I don't have either, so it's perfect. We're installing the Optimum library, loading the model, and converting it to ONNX. Now you could deploy it on any ONNX-compatible model server. We can predict with it, and it works fine. This is a simple benchmark: 14 seconds to do 500 predictions. That's a miserable way to benchmark, but it gives us a baseline. Now we can optimize this a bit by setting the optimization level to 99, which means: apply every optimization you can to this model. Now we have this optimized model, which we can load again. We can still use that very friendly pipeline object, and we can predict 500 times again. You can see we went from 14 seconds to a bit under 14 seconds. Again, this is a very crude way to benchmark, but clearly something happened. We already saved some time here and probably didn't hurt accuracy too much. We would need to check. If we want to go a little crazier, we can do quantization. For those not familiar with quantization, it means you replace your 32-bit floating-point parameters with 8-bit integers. Now, everybody goes, "What? I trained for three weeks. I wanted all those 32-bit values. Now you're taking them away." Well, I'm taking them away because 8-bit integer arithmetic is quite a bit faster than 32-bit floating-point arithmetic. It will shrink the model, making it smaller, so less memory usage and faster compute.
You need to strike a balance between how much you want to shrink it and how fast you want to make it, and how much accuracy you're ready to lose in the process. Maybe you're ready to lose 1% if you can make the model 10 times smaller. We just do this, and here we're running on a CPU machine, optimizing for Intel chips with the AVX 512 instruction set, designed to accelerate multiply and accumulate operations frequent in deep learning. We save the model again, and we can see the model did shrink quite a lot. We shrunk it from 256 megs to 165 megs, which is pretty impressive. Predicting again, we see a very significant speed up. Again, you would need to run much better benchmarks, but this shows a couple of lines of code can have a large impact on inference performance. You would still need to run your evaluation dataset against this to make sure you didn't lose 10% accuracy, which is possible sometimes. But generally, Optimum makes it simple to do this. An alternative to using ONNX is this new collaboration we have with Intel. They have this tool called the Intel Neural Compressor, which does similar things, like quantization. It works pretty much the same. Here, we're loading the model, passing a config file, and quantizing. What's interesting is you can set a target, saying, "I don't want to lose more than 1% accuracy." In this instance, it says 3%. So, quantize but don't drop the accuracy more than 3%, and it will iterate on the quantization efforts until it's within budget. In this instance, I got very lucky because the first round actually hit that target, and the metric actually improved a little bit. Sometimes accuracy will go up a little bit because quantization helps fight overfitting in many cases. The only thing I did was this, so you don't need to know the first thing about the hardcore details of hardware optimization. It's very easy to use. 
So, I think now we've covered most of the spectrum: datasets, models, training, building demos with Spaces, optimizing. Now it's your turn to play. Where do you go next? Well, I guess you go to the Hugging Face Hub and sign up. It takes 30 seconds, it's completely free. If you're completely new, the best entry points are the tasks where the DevRel team has built lots of really interesting content explaining what those NLP, computer vision, and speech tasks are. What does it mean to do speaker separation? What does it mean to do zero-shot classification in plain English? Then you can take the Hugging Face course, which is awesome. The same folks also published a book on transformers. I highly recommend it, and if you have questions, you can go to the forum and ask. The whole company is reading and answering, so don't be shy. If you have production workloads and want to go to the next level, we can help and make a living doing so. We can provide expert support on the whole scope from picking models to optimizing latency down to crazy low numbers. If you work for a regulated company like a bank, insurance company, life sciences company, or public sector and have to run everything on infrastructure you manage, we can help with that because everything you saw here can also be deployed on your own infrastructure in the cloud or on-prem. So, you can get in touch. That's what I wanted to tell you today. It's a whirlwind tour of Hugging Face, June 2022. We're running pretty fast and want to keep building good stuff for the community and customers as well. So, I hope you like it, and if you need help later on, don't hesitate to get in touch. Thanks a lot.

Tags

HuggingFace, Transformers, MachineLearning, DeepLearning, TransferLearning

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.