Accelerate Transformer Model Training with Hugging Face and Habana Labs
July 27, 2022
Transformer models deliver state-of-the-art performance on a wide range of machine learning tasks, such as natural language processing, computer vision, speech, and more. However, training them at scale often requires a large amount of computing power, making the whole process unnecessarily long, complex, and costly. Join us for a live webinar to learn how the Hugging Face and Habana Labs joint solution makes it easier and quicker to train high-quality transformer models. Live demo included!
Transcript
Thank you for joining us today. This is Accelerate Transformer Model Training with Hugging Face and Habana Labs, presented by Julien Simon, Chief Evangelist at Hugging Face, and Shree Ganesan, Head of Software Products at Habana Labs. Before I hand it over to them, I want to give a few housekeeping rules. Today's session is being recorded and will be made available for later viewing. For any questions or issues, please use the chat feature to reach us. At the end of the presentation, we will have a live Q&A where you can type your questions, and we will answer them. Now, I would like to present our first presenter, Shree. Shree, I'll pass it on to you. Welcome.
Thank you, Pauline. Thank you all for taking the time to join us today. I'm Shree Ganesan, Head of Software Products at Habana Labs. With me, we have Julien from Hugging Face. Julien?
Yeah. Hi, everybody. Good morning, good afternoon, wherever you are. I'm Julien. I'm based outside Paris. I'm the Chief Evangelist for Hugging Face, and that means I work with customers to help them understand and adopt Hugging Face tools to build efficient ML workflows. Hopefully, that's what we're going to do today.

So we're going to talk about how we've partnered together with Hugging Face to make it really easy to bring the goodness of Hugging Face and the goodness of the Gaudi training platform to our developers and end users. Julien is going to start off by setting the stage for that.
Before we talk about how we're trying to reinvent how we do deep learning, it's important to understand where we're starting from. I call this deep learning 1.0, which is how everybody's been doing deep learning in the last five to six years. We started with neural networks, a technology that was resurrected and quickly became very powerful at extracting insights from unstructured data, using convolutional neural networks for computer vision or recurrent neural networks for natural language. They became the de facto solution for deep learning and processing unstructured data. However, building models often meant designing architectures from scratch or tweaking existing ones, which wasn't easy. Collecting and cleaning data was also complex and time-consuming, as deep learning is very data-hungry. Extracting, cleaning, and labeling data was a huge task, especially for supervised learning. This led to a long lead time before experimentation could begin.
When you could start experimenting, you needed computing power. One reason deep learning became popular is because we figured out how to use massively parallel chips like GPUs for more than 3D gaming and apply them to deep learning and other scientific problems. GPUs are useful and efficient but costly, not power-efficient, and difficult to obtain. So, perhaps something else is needed now. Putting everything together, we worked with expert tools like early versions of TensorFlow, PyTorch, Theano, etc. While brilliant, they were difficult to use unless you were a machine learning expert with a strong background in computer science and statistics. This is a problem because we need everyone to join the machine learning party, and any company should be able to add machine learning models to their workflow.
The next step is simplifying the deep learning experience, which I call deep learning 2.0. The first step was standardizing deep learning models, moving from a collection of different architectures to a more standard architecture. Transformer models, which everyone has heard about, were put on the map by Google's BERT in 2018, breaking NLP benchmarks. Since then, transformers have proven efficient across a wide range of use cases like computer vision and speech. They are becoming a general-purpose solution.
The next step is that it's not so important to build huge datasets anymore. Thanks to transformers and transfer learning, we can use off-the-shelf pre-trained models and get to work much quicker. You can grab a model pre-trained on a large dataset similar to the one you need and quickly test if it's a good fit. If it is, you might not need to train it further. If you do, you can fine-tune it with a little bit of domain-specific data. This takes much less time and effort, which is a huge win.
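To make that concrete, here is a minimal sketch of trying an off-the-shelf model before committing to any fine-tuning; the checkpoint name is just an example of a publicly available sentiment model.

```python
from transformers import pipeline

# Load a model that was already fine-tuned on a task close to ours (sentiment analysis)
# and check whether it is good enough out of the box before investing in training.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # example checkpoint from the Hub
)

print(classifier("These running shoes are comfortable and well worth the price."))
# -> [{'label': 'POSITIVE', 'score': 0.99...}]
```

If the off-the-shelf model isn't a good fit, the same checkpoint, or its base model, can be fine-tuned on a small amount of domain-specific data instead of being trained from scratch.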
When it comes to training, companies are building amazing new hardware designed from the ground up to accelerate machine learning workloads, whether for training or inference. Habana Labs is doing this, and you'll learn how to use it in a minute. We are building developer tools that simplify the whole experience, abstract away complexity, and allow any developer to work with models, fine-tune them, and deploy them without understanding all the nitty-gritty details. This is important for democratizing machine learning and making it available to as many people as possible.
Transformers, when it comes to Hugging Face, is the name of our most popular library. It's an open-source library you can find on GitHub, and it's one of the fastest-growing open-source projects. The graph on the left shows the number of GitHub stars, a good measure of popularity. Hugging Face has the steepest slope, showing that we are growing faster than projects like Kubernetes and Node.js. We are humbled by this amazing adoption in the community and happy to see industry uptake. Whether it's the State of AI report or the Kaggle Data Science Survey, practitioners are using transformers more and more over traditional deep learning architectures. This is a good sign, as it means companies and businesses are adopting transformers.
We see over 1 million model downloads from our hub, the Hugging Face Hub, and that number grows every day. We serve over 100,000 users daily. We work hard to keep the community happy and help our customers build the best workflows possible. Now, I'll pass it to Shree because we need to zoom in on training and how Habana helps with the whole story.
Thank you, Julien. Julien set the stage for where deep learning has gone from 1.0 to 2.0 and how transformers are taking center stage. However, when working with deep learning, you need to train these models, even if you're doing transfer learning or fine-tuning. This involves having a pre-trained model and training it further. Businesses are adopting AI, and we see trends from IDC surveys showing that businesses are using AI for various applications, trying to get value from AI for their business problems, and building more complex models that require many training cycles. 74% of IDC respondents say they do 5 to 10 iterations of training, and half of them rebuild their models weekly or more often. 25% rebuild daily or hourly. There is a huge demand for training cycles, which will grow as AI adoption increases.
The big industry challenge is cost. Cost is one of the most significant challenges for businesses trying to implement AI solutions. How do you provide customers with more training cycles in a cost-effective way and make it affordable? This is where Habana comes in. Julien talked about purpose-built hardware to solve deep learning training problems. Habana has built a purpose-built AI training and inference processor. Our training processor, the Gaudi family, is designed to optimize AI training efficiency. The architecture is very different from general-purpose architectures, combining a matrix math engine to accelerate matrix operations, a cluster of tensor processing cores for neural networks, and a software-managed memory architecture with local memories, SRAMs, and HBM. The first-generation Gaudi has 32 gigabytes of HBM, similar to a V100 GPU.
A unique feature of Gaudi is the integration of ten ports of 100-gigabit Ethernet on-chip, reducing the number of discrete components needed and saving costs. We use standard Ethernet protocols for flexibility. With models becoming more complex and distributed training growing, scaling and scaling efficiency are first principles. Our Gaudi platform offers this with unique architectural features. The hardware is in an industry-standard form factor, the OCP OAM form factor, allowing the larger ecosystem to benefit from industry-standard features.
Next, I'll talk about our software suite because it's not enough to have just a purpose-built architecture and hardware without the software to put it in the hands of developers. We've spent a lot of time designing our software stack for performance and ease of use. First, we ensure it is integrated into TensorFlow and PyTorch with minimal code changes, allowing developers to get started by continuing to work in familiar frameworks. Julien will talk about how easy it is to get started. We have a rich library of performance-optimized kernels, a graph compiler, and a runtime that automatically identify subgraphs that can be accelerated by Gaudi devices, optimize performance, and run the optimized recipe on Gaudi devices. We also provide a custom kernel library for writing your own kernels if needed.
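As a rough illustration of those "minimal code changes", here is a sketch of plain PyTorch targeting a Gaudi device through Habana's SynapseAI bridge. The import path habana_frameworks.torch.core, the "hpu" device name, and the mark_step() call reflect my understanding of the SynapseAI PyTorch integration at the time and may differ across releases, so treat them as assumptions rather than a verbatim recipe.

```python
import torch
import habana_frameworks.torch.core as htcore  # Habana's SynapseAI PyTorch bridge (assumed import path)

device = torch.device("hpu")                # Gaudi devices are exposed as "hpu"
model = torch.nn.Linear(128, 5).to(device)  # toy model, just to show the flow
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Synthetic batch, created on CPU and moved to the Gaudi device.
inputs = torch.randn(32, 128).to(device)
labels = torch.randint(0, 5, (32,)).to(device)

loss = torch.nn.functional.cross_entropy(model(inputs), labels)
loss.backward()
optimizer.step()
htcore.mark_step()  # flush the accumulated graph in lazy execution mode
```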
We didn't stop with just building a software stack. We want to make sure developers can use it effectively. We have developer collaterals on our Habana developer site, 40+ models on our GitHub repository, a developer forum, tutorials, and video tutorials in various formats to make learning easy. Lowering the barrier to entry is critical, and we empathize with developers to make this happen.
Part of our partnership is also lowering that barrier, right, Julien?
Yeah, and if you're a ninja developer who can write custom kernels for Habana, I'm in awe. I'm a simple person who just wants to get the job done quickly without diving into hardcore details. That's what we focused on. When we worked with Habana to integrate the Gaudi chip and the Synapse SDK into the Transformer tools, we aimed for the simplest way. The family picture for Hugging Face includes datasets and models hosted on the Hugging Face Hub, with over 6,000 datasets and 60,000 models. You can download these in one line of code using our open-source libraries. When accelerating with Habana, you want to work with the Optimum library, which has a simple API close to the vanilla Transformers API. You can accelerate your jobs and train models much faster. You can then move the model to Spaces, a way to build web applications using frameworks like Gradio or Streamlit, showcasing your model in a user-friendly way. You can also deploy models for production using our inference API solution or the Optimum library, which includes hardware acceleration for inference. We have cloud partnerships with Amazon SageMaker and Azure, making it easy to train and deploy models on these platforms.
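Going back to the one-line downloads mentioned a moment ago, here is what they look like in practice; the dataset and checkpoint names are purely illustrative.

```python
from datasets import load_dataset
from transformers import AutoModel

dataset = load_dataset("imdb")                           # one line to pull a public dataset from the Hub
model = AutoModel.from_pretrained("bert-base-uncased")   # one line to pull a pre-trained model
```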
Now, let's zoom in on Optimum. Optimum is an open-source library dedicated to accelerating training and inference with minimal code changes. We support Habana Gaudi using either a single or multiple accelerators. We'll show you some good numbers. If you're interested in the background and setup, refer to this blog post. When I say it's simple and minimal changes, this is what I mean. The top code snippet shows vanilla Transformers using the Trainer API. You define training arguments, hyperparameters, etc., and add them to a Trainer object with the model and dataset, then call train. To move from training on any platform to Habana, you need to import the Optimum Habana library, replace Trainer with GaudiTrainer, and TrainingArguments with GaudiTrainingArguments. You also need to load a config for the model, including parameters for the training process. That's it—three lines of code to migrate from CPU or GPU training to Habana training.
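The slides themselves are not reproduced here, so below is a hedged sketch of what those three changes look like, based on the Optimum Habana API described above. The exact argument names and the Habana/distilbert-base-uncased configuration are taken from the library's examples as I recall them and may vary by version; model and train_dataset stand for the objects you already build with vanilla Transformers.

```python
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

# 1. Load a Habana configuration for the model family (hosted on the Hub).
gaudi_config = GaudiConfig.from_pretrained("Habana/distilbert-base-uncased")

# 2. Replace TrainingArguments with GaudiTrainingArguments.
training_args = GaudiTrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    use_habana=True,
    use_lazy_mode=True,
)

# 3. Replace Trainer with GaudiTrainer; model and train_dataset are built as usual.
trainer = GaudiTrainer(
    model=model,
    gaudi_config=gaudi_config,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```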
Let's go to the demo. As part of our partnership, we focused on making the solution as easy and similar as possible to the existing Transformer user experience. Transformer users can take advantage of Gaudi with very little effort. The code I'm running is on GitLab. I'm using an instance on AWS because the Habana accelerators are available on DL1 instances. We won't go through the setup, but all the steps are in the blog post. I'm logged into the DL1 instance and will pull the Docker container that includes the Habana SDK, drivers, and PyTorch 1.11. I've already done this, and the container is running. The first thing to do is check the Habana Gaudi chips using the hl-smi command, which shows eight Gaudi chips ready to go.
Let's enter my repository. The use case is a multi-class classification model to predict the star rating of shoe reviews. This is part of a larger workshop. The vanilla script can run on CPU and GPU. I'm importing libraries, loading the dataset, and downloading the base model, DistilBERT, which is a good baseline model. I tokenize the dataset and define training arguments, then create a Trainer object that puts everything together and calls train. This is Transformers 101. To move to Habana, you need to replace the Transformers objects with Gaudi counterparts, point to a Habana configuration for distilbert-base-uncased, and replace TrainingArguments with GaudiTrainingArguments and Trainer with GaudiTrainer. This is all it takes, and it worked on my first try.
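To fill in the data-preparation side of that script, here is a hedged sketch of the steps just described. The dataset name is a placeholder for the shoe-review dataset used in the workshop, and the Gaudi-specific part is the same as in the earlier snippet.

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"

# Placeholder dataset name; substitute the shoe-review dataset from the workshop.
dataset = load_dataset("your-org/shoe-reviews")

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

# Five output classes, one per star rating.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

# From here, the GaudiConfig / GaudiTrainingArguments / GaudiTrainer setup is the same
# as in the earlier snippet, followed by trainer.train().
```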
Here's the code we have now. The only thing left is to run it. I'll run it as a single-node run, which will grab one Gaudi chip, compile, optimize, and start training. This will take 29 minutes on one chip, but we won't wait that long. To use all eight chips, you can use the Gaudi spawn script, which distributes the job automatically on the eight accelerators. This is very straightforward, and you don't need to change your script. The Gaudi spawn script handles the distribution.
Part of the work is done inside the model itself, integrated with PyTorch DDP. When working with the Hugging Face team, we enable distributed training through the Gaudi spawn script. If you're using a TensorFlow model, we have it integrated with Horovod or the TensorFlow API for distributed training. Our goal is to take advantage of what's available in the frameworks and make it simple to turn on distributed training. After setting up the cluster on AWS, you can run it on any number of Gaudi devices.
I ran this example on 1, 2, 4, and 8 HPUs. The configuration used is a dl1.24xlarge instance on EC2, costing a little more than $13 per hour on-demand. I used an on-demand instance to avoid interruption, but I also ran it on Spot for a 70% discount. The plot shows the real-life training times (blue line) and the ideal linear scaling (red line). There is a little overhead, but we get near-perfect linear scaling from one to eight HPUs. I compared this to GPU training using a p3.16xlarge with eight NVIDIA V100s. The Gaudi instance was about 3x faster at roughly half the hourly cost, resulting in a 6x improvement in cost-performance.
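As a back-of-the-envelope check on that figure, combining the quoted ~3x speedup with approximate on-demand hourly prices (the prices below are assumptions from memory, not official AWS pricing) lands at roughly 6x:

```python
# Rough cost-performance estimate; hourly prices are approximate on-demand rates
# at the time of the webinar and are assumptions, not official figures.
dl1_price_per_hour = 13.1   # dl1.24xlarge, 8 Gaudi HPUs (assumed)
p3_price_per_hour = 24.5    # p3.16xlarge, 8 NVIDIA V100s (assumed)
speedup = 3.0               # Gaudi vs. V100 training time, as quoted above

price_ratio = p3_price_per_hour / dl1_price_per_hour   # ~1.9x cheaper per hour
cost_performance_gain = speedup * price_ratio          # ~5.6x, i.e. roughly 6x
print(f"~{price_ratio:.1f}x cheaper per hour, ~{cost_performance_gain:.1f}x better cost-performance")
```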
The cost savings of running ResNet50 on Gaudi DL1 versus GPUs are 45% to 49% versus A100 and 77% versus V100. For BERT-Large, we got 22% to 26% cost savings versus A100 and 64% versus V100. Each model is different, but you should expect significant cost savings. Gaudi 2, our second-generation product, was launched in May. It offers leadership performance and nearly 2x better throughput versus A100 for ResNet50 and BERT. We submitted results to MLPerf and achieved strong performance.
The three main takeaways are that transformers are quickly becoming a general-purpose solution for deep learning, Habana Gaudi has the best cost-performance ratio for transformer training, and it is accessible. You can fire up a DL1 instance on AWS in minutes, adapt your Transformer code for Optimum Habana in minutes, and start training. This is a very interesting platform for developers who want to build high-quality models quickly.
To get started with Hugging Face, check out the tasks page and the Hugging Face course. We have libraries on GitHub, and I'm available on LinkedIn, Twitter, Medium, and YouTube. For Habana Gaudi, the best place to start is our developer site, which has documentation, training, tutorials, and reference examples on GitHub. We have 40+ reference models, and more transformer models will be added over the next few months. If you have questions, reach out to us on our Habana forum. We hope to hear your stories and make this platform accessible to more developers.
Thank you, Shree. I hope this was useful. We have a few questions. Let's pick a few. What are the nine models on Hugging Face, and what if I want to use another Hugging Face model? Those nine models are examples to show you how to use the Gaudi config and trainer. You can use other transformer models by following the instructions. If it doesn't work, it's a bug, and we'll fix it. More models and features are coming to Optimum Habana.
What happens to models not on the Habana organization page? The instructions are there for how to make changes to existing transformer models. Once you know how to work with the Hugging Face Transformer library, it's simple to use the Optimum Habana library. More models will be published to the model page by the end of the year.
Does Gaudi 2 support the FP8 data format, and do you expect FP8 to be used extensively for training versus BF16? We support FP8, and we will see how models and the ecosystem evolve. There is interest in lower precision for reducing cost or power, but there is an accuracy trade-off. We will work with the developer community as things evolve.
Will there be SaaS options for Gaudi like Google? We are looking into this to empower more developers and democratize access. Having a cost-effective way to access a single Gaudi chip would be great. We will explore options to enable other paths for enthusiasts.
Thank you, Shree. I want to thank our presenters. This was very informative. Thank you to our participants. This will be available on our website, and I'll send an email with a link later this week. If we didn't answer your questions, please go to our developer site and the developer forum. Thank you so much, and have a wonderful rest of your day. Bye.