Well, hello and welcome everyone to another thrilling and interactive deep dive session on fine-tuning for audio classification with transformers. Before we kick things off and I hand it over to our speaker, a quick recap of the guidelines for this session. Please keep your chats relevant to what is being asked by the speaker. Chats will be moderated at all times. The moderator reserves the right to take down any suspicious messages from the chat. All your questions need to be posted in the Q&A section. Questions will be subject to moderation. If your question is already posted in the Q&A section, do not post it again. Just upvote the question, and the most upvoted question will be answered by the speaker. Also, we'll be posting some polls in a short while, so just keep an eye on the poll section. With that, now on to our speaker for the session. We are delighted to be joined by Mr. Julien Simon, Chief Evangelist at Hugging Face. Julien has an unending curiosity for technology and innovative ways to solve real-world business challenges. He has recently spent six years at Amazon Web Services where he was the global technical evangelist for AI and machine learning. Prior to joining AWS, Julien served for 10 years as CTO and VP of engineering at large-scale startups. We have a short introductory clip on him. Without further ado, let's play the clip.
Welcome, Julien. Good morning, afternoon, or evening, depending on your time zone. Good afternoon. Happy to be here. It's morning for me, but that's okay. All right, so with this, you are all set to go. The virtual stage is all yours. Okay, well, thanks again for inviting me, and thanks, everyone, for attending. We have a pretty long session today, so that's great because we can dive pretty deep, and we'll certainly have time for questions at the end. Should we get started? Yes, let's just have a quick check on your screen sharing. Okay, let's do that right now. Okay. Here we go. Okay, hopefully you see my screen now, yes? Yeah, works fine. Okay. Well, I think we're ready then. So, very quick intro. Obviously, I'm Julien. I work for Hugging Face, and I'll explain a little bit about what Hugging Face is about. I'm based outside Paris, although I do travel. I really hope to be back in India and APAC as soon as I can. I always had a blast over there. So, hopefully, I can.
Can you please just put your slides on full-screen mode? Yeah, sure. Here we go. Thanks. All right. So, our topic today is going to be audio classification with transformers. Before we do that, I want to set the stage. I want to explain why transformers are important and disruptive. To do that, we need to understand how we were doing things before. Machine learning and deep learning have been around for a while now; deep learning is not completely new anymore. But the way we've been doing it, from roughly 2015 until just a couple of years ago, is really different from what many of us do today. Neural networks are not a new technology, but they've been given a new life thanks to deep learning. Using models like convolutional neural networks or recurrent neural networks and their variations, LSTMs, etc., we've been able to extract insights from complex, unstructured data like natural language, images, videos, or audio. Generally, this has meant writing neural networks from scratch or trying to work with existing networks, which, as we all know, is not so easy.
The next important step in building a deep learning-powered application is, of course, to train your neural network on lots of data. This is a huge problem because most of the time, practitioners have been training from scratch. Deep learning is very data-hungry, and you need hundreds of thousands, maybe millions of data instances to get an accurate model. This means collecting a lot of data, cleaning a lot of data, labeling a lot of data, whether it's text, images, or audio. That's a lot of work. If you do this for a living, you know that building the dataset is going to be the most difficult and longest part of the project. This slows everything down. You can't really get started unless you have that data ready. The next step is to put the two together and train your neural network on your dataset, which really means running that dataset through the network over and over again.
The explosion of deep learning is also due to the fact that GPUs have become more widely available than before. Researchers have found a way to use them for scientific computing and deep learning, not just 3D games. GPUs have been really powerful and helpful in getting those models trained. To run the whole thing, to experiment, train, and deploy, practitioners have used a collection of tools. Early on, tools like TensorFlow, Torch, and others were used, and they are open-source tools, which is great. But they are generally very difficult to use. You really need to understand the finer details of the machine learning process and the neural network you're working with. This means that unless you have a formal background in computer science, statistics, and machine learning, it's really difficult to get good results. This is a problem because we really want machine learning and deep learning to be a common tool that any developer can use. But in those first few years of deep learning, it was really difficult and reserved for people who knew exactly what they were doing. We think that's a problem.
At Hugging Face, we're trying to reinvent and simplify the whole machine learning lifecycle. The first step is to work with transformer models. We'll talk about transformers in a few minutes. The transformer architecture was launched in 2017. I'm sure everyone here has heard about the BERT model from Google, which was the first widely known transformer model, breaking old benchmarks on natural language processing tasks. Transformers are still deep learning models but have a very specific architecture that's proven extremely efficient, not just for NLP, but also for computer vision, audio, and other tasks. They're really becoming the new standard, the de facto solution for machine learning and deep learning.
The next evolution is moving away from labeling huge datasets to using pre-trained models and transfer learning. One of the really good things about transformers is that they can be initially trained by experts and large organizations on huge datasets, like Wikipedia. These models can be shared and either used as-is for certain tasks, like translation, summarization, text generation, etc., or they can be fine-tuned. Fine-tuning is a simple and reasonably inexpensive process to specialize them on your own data for a particular task. For example, you could start from an off-the-shelf model for English summarization and fine-tune it at a very low cost on medical documents to make it even more efficient in that domain. This is a huge difference because instead of building large datasets, you only need to build a small fine-tuning set, which is one or two orders of magnitude smaller than the original training set. This is a strong benefit, and we'll use transfer learning today.
The next evolution is the emergence of specialized chips built from the ground up to accelerate machine learning workflows, whether it's training or inference. GPUs are still around and very interesting, but there's more choice now. I'll talk about this a little more in a few minutes. Finally, we're trying to build tools that are accessible to every developer. That's why I really want to call them developer tools, not expert tools, because we're trying to abstract away as much complexity as we can and make it unnecessary to understand every tiny detail about your model or underlying framework. The Transformers library and other open-source libraries by Hugging Face are really a simple API that makes it easy to experiment, train, and deploy models with little code and simple code.
Zooming in a little bit, the Transformers library, named after the transformer models, is a library created by Hugging Face a few years ago. Hugging Face is the steward for this library and others, with the help of the community. It's amazing to see how, in just a few years, Transformers has become one of the fastest-growing open-source software projects ever. You can see on this graph the number of GitHub stars for different projects. Hugging Face Transformers is the yellow line on the left with the steepest slope, showing the fastest growth. We have lots of respect and love for all the other projects, and of course, we use projects like PyTorch and Keras. It's amazing to see our popularity growing faster than those. What's even more impressive is that we're growing faster than Kubernetes, which is a project that literally everyone uses. This adoption and support from the community are incredible.
Another interesting number is how often we serve models from the Hugging Face Hub, where we host models and datasets: models are served from the Hub over a million times every day. The adoption and continued use of the library and models in the community and industry are extremely strong. By the way, if you haven't given a star to the Transformers library, I would really appreciate it if you did. You can just go to that GitHub URL and star it. We're trying to get to 100K stars as quickly as we can. Your help will be appreciated.
We also see this adoption translating into industry visibility and recognition. The State of AI report called out transformers as a general-purpose architecture for ML. Even though NLP was the starting point, transformers are now really fit for computer vision, audio, speech, reinforcement learning, and other tasks. They're generalizing to a lot of problems. The Kaggle data science survey confirmed this, showing that CNN and RNN usage was going down, and transformers usage was going up. There's really a shift from traditional deep learning to transformers.
So, I want to focus mostly on the demo today and on answering questions at the end. Here's the family picture of Hugging Face. You can learn more about all of these during the demo and on our website, huggingface.co. Starting on the right, we have the Hugging Face Hub where we host models and datasets. We've hit 6,000 datasets, but we'll check once we go to the hub. We have over 55,000 models uploaded by the community. Chances are, you'll find either a model that works out of the box for you or a model you can use as a starting point for your project. Using the Transformers library, which is open source, you can train and experiment on your laptop, on a server, in the cloud, anywhere.
If you're interested in AutoML, we have a solution called AutoTrain, which is completely no-code. Just a few clicks in the UI will let you train and optimize models. We also have another open-source library called Optimum, dedicated to hardware acceleration, whether it's training or inference. I'll zoom in on Optimum in a minute. With these tools, you can train models very easily with very little code. Once you have a model you like, you can deploy it to Spaces. Spaces is a simple way to build web applications using simple web frameworks to showcase your models and demo them to non-technical people.
When it comes to deployment, you can deploy your models on the inference API in one line of code, which I'll show you again today. You can also use Optimum to accelerate inference. On top of that, we have cloud-based solutions. We have a partnership with AWS where Hugging Face is deeply integrated into Amazon SageMaker. We have training and inference containers readily available on SageMaker and deep integration with all SageMaker features. If you use AWS and SageMaker, make sure you read about Hugging Face on SageMaker. More recently, we launched a solution on Microsoft Azure called Hugging Face Endpoints, where you can go to the Azure Marketplace and find Hugging Face there. In just a few clicks, you can deploy any public NLP model from the hub to a managed instance on Azure and predict with it in one line of code.
So, there are more options for everyone, whether you want to work on your own machine or in the cloud. We think we have you covered, and we'll keep building more. One last thing before we dive into the demo: I want to zoom in on hardware acceleration, which is important for transformers because they are large models and can take some time to train and predict. That's why we built Optimum, which you can find on GitHub. Optimum is based on our collaborations with hardware partners: Habana Labs and Graphcore to accelerate training, and ONNX Runtime and Intel to accelerate inference. With minimal changes to your existing code, you can switch from GPU training to Habana training, or use ONNX Runtime and Intel optimizations for inference. These are very simple to use and can get great results out of the box without deep manual optimization, which is extremely difficult. Another way to accelerate inference, which is not part of Optimum but still very interesting, is our work with AWS on Inferentia, a custom-designed chip to accelerate inference. The Neuron SDK, the AWS SDK to compile and optimize models, supports transformers. If you use AWS, this is a good option.
Okay, it's time to jump to the demo. You'll find the demo on GitLab. There are a few demos in there, but the one I'm going to cover today is called keyword spotting. Let's jump into that notebook. First, let's take a quick look at the Hugging Face Hub. You'll find the hub at huggingface.co. You can sign up in seconds, and I encourage you to do that. This will give you access to all the features, completely free. There's no reason not to do it. As we can see on the hub, we find over 56,000 models for various tasks, including NLP, computer vision, audio, multimodal models, tabular data, reinforcement learning, and more. We'll look at some models in more detail. We also have datasets, more than 7,000 of them now. It's growing faster than I can keep up.
We have datasets that you can just grab for whatever task you're interested in. It's very easy to download these datasets and work with them. You don't need to build large datasets from scratch. You can get started very quickly with existing datasets. The problem we want to solve today is audio classification. Let's explain what audio classification is. The best way to do this is to look at a few models. Let's say I'm interested in audio classification models for English. I can see some models. Let's grab this one and see what it does. This model expects a raw audio signal and is about speech emotion.
We're familiar with text classification, like sentiment analysis (positive, neutral, negative) or classifying news articles by topic. Audio classification is similar, except we use a piece of audio instead of natural language. We have existing models for emotion recognition, and we can experiment with our own. We can test these models right there. Let's try emotion detection. Hopefully, it will say "happy" or something like that.
Hey, good morning, everybody. Super happy to be here. That's a random live test. It should be fun. Let's see. Oh, it doesn't work. Oh, that's too bad. Let's try another one. Can we try this one? Okay. This is another one for emotion detection. Let's give it a try. Hey, good morning, everybody. Very happy to be here. This is called the inference widget, and it's available for almost all models on the Hub. You can test the models right there. The model gets loaded, and you get a prediction right in the page. This one works. It's mostly happy. As a random example, this shows that you can try out different models and find a good starting point.
What I want to do today is a specific type of audio classification called keyword spotting. Keyword spotting is a task where I want the model to pick up a particular keyword from a certain list. Applications include voice-based systems like Google Assistant or Amazon Echo, where you say "Alexa" or "Hey, Google." Generally, any voice-based system like in-car systems looking for specific commands like "call home" or "go to this location." We're going to try keyword spotting and see how accurate we can make it.
The first step is to have a model that understands speech. We have many speech-to-text models on the Hub. Let's look at automatic speech recognition. There's a really interesting family of models released by Facebook called the Wav2Vec2 architecture. Let's grab this one. These models have been trained on 960 hours of speech in the LibriSpeech dataset. Let's give it a try. We can even try real-time speech recognition, which is fun.
Hey, you guys, I'm going to try it. I'm really happy to be talking about machine learning today. That's not too bad. You can see how easily you can try these models without writing any code. You can try them in different languages. A few weeks ago, there was an evolution of this model called the Conformer model. Let me show you this one because that's the one we're going to use. It's a recent addition, and if you're interested, you can read about it in the Facebook paper. In a nutshell, the Conformer is a combination of convolutional and transformer architectures. It brings the best of both worlds, using transformers to understand very long sequences and convolution to understand very local relationships.
This is a good pick for keywords because a keyword is very short. The keywords I'm working with are one second long. I want to see these keywords as digital audio, sampled and used as a sequence. It's still a long sequence of digital samples, but it's just one word. Understanding both the long-range relationships across the whole sequence and the close local relationships between neighboring samples should work well. That's my intuition, and that's why I decided to go with this model. Now, let's see it in action.
This notebook was originally written by one of my colleagues and is on GitLab. It was written by Anton, who initially used a different model. When I wanted to try this new Conformer model, I thought I'd have to write a completely new notebook. In fact, I took this notebook, tweaked a few things, used the new Conformer model instead of the original Wav2Vec2 model, and it worked out of the box. This shows that the API we're going to see in the notebook is very flexible and generic. The code you write with the Transformers library is extremely reusable and not task-dependent. The training loop can be used for almost any kind of task, whether it's natural language, audio, or speech.
Let's zoom in a bit more and get started. This is the name of the model I'm starting from, which has been pre-trained on 960 hours of LibriSpeech. I'm going to start with this English language model. The next step is to install the open-source libraries I need: the Transformers library, the datasets library, and Librosa, which is a nice library to load and process audio samples. Then I can log into the Hugging Face Hub. This is not strictly necessary, but if I want to automatically push the model to the Hub once I've trained it, I need to do this. I can log in using the command line interface or directly in the notebook.
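A minimal sketch of that setup, assuming the packages named above (transformers, datasets, librosa) and a notebook environment:

```python
# Install the open-source libraries used in the demo:
#   pip install transformers datasets librosa

from huggingface_hub import notebook_login

# Log into the Hugging Face Hub from the notebook so the trained model
# can be pushed there automatically after training.
notebook_login()
```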
Now we can get started. We have a model, but we need a dataset. I could build my own dataset, but I'm using an existing one called SpeechCommands. This dataset contains prerecorded clips for 35 different keywords. It's a great way to start. The keywords include simple words like "yes," "no," "up," "down," "left," "right," "on," "off," and a few longer ones. Together with a silence class, that gives us 36 labels in this dataset. We can work with this directly and add our own at the end.
Loading the dataset is super simple. We just need to run the `load_dataset` API, which will download the SpeechCommands dataset from the Hub to my local machine. I have almost 85,000 keywords. We can see the labels and look at a particular sample. A particular sample is a WAV file that's part of the dataset. When we load a sample, we convert it automatically to an array of digitized audio. We need a sampling rate of 16 kilohertz because that's what the model has been trained on. If you add your own samples, you need to pay attention to this.
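A sketch of that loading step; the dataset id "speech_commands" and the "v0.02" configuration are assumptions about the exact names used in the notebook:

```python
from datasets import load_dataset

# Download the SpeechCommands dataset from the Hub to the local machine.
dataset = load_dataset("speech_commands", "v0.02")

print(dataset)                                     # train / validation / test splits
labels = dataset["train"].features["label"].names  # the keyword labels
print(len(labels), labels[:8])
print(dataset["train"][0])                         # one sample: audio array, sampling rate, label
```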
We can try listening to some samples. You can see short keywords, different speakers, and different voices, which is important. There's some diversity there. Now we can move on and process the data. We have a built-in feature extractor that comes with the model. It will truncate the samples to a fixed length. We need to set the max duration to one second. We write a simple function that iterates over each audio clip, sets the max length to one second, and truncates it. We apply this function to the training set and the validation set using the `map` API in the datasets library.
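Here is roughly what that preprocessing might look like; the exact Wav2Vec2-Conformer checkpoint name and the dataset column names are assumptions:

```python
from transformers import AutoFeatureExtractor

# Assumed checkpoint name for the Wav2Vec2-Conformer model pre-trained on 960h of LibriSpeech.
model_id = "facebook/wav2vec2-conformer-rel-pos-large-960h-ft"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

max_duration = 1.0  # keywords are at most one second long

def preprocess(batch):
    audio_arrays = [sample["array"] for sample in batch["audio"]]
    return feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,                   # 16 kHz
        max_length=int(feature_extractor.sampling_rate * max_duration),  # 16,000 samples
        truncation=True,
    )

# Apply the function to every split with the datasets `map` API.
encoded = dataset.map(preprocess, batched=True, remove_columns="audio")
```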
Now we have our 84,000+ samples ready and truncated. There's not a lot of processing to do here. We're just taking the waveform and making sure it's one second maximum. This is the sequence of values we'll train the transformer on. The training code is super simple because it's abstracted away by the Transformers library. I'm using the high-level Trainer API, which is my favorite way to train models because it's very simple and generic.
The first step is to load the base model, the Facebook Wav2Vec2-Conformer model. I can do this in one line of code using the `AutoModelForAudioClassification` API, passing the model name on the Hub and the number of labels (36 classes in this dataset). For convenience and readability, we can pass mappings from labels to class names and from class names to labels. This lets me easily decode class IDs and class names when predicting.
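Continuing the sketch above (reusing `model_id` and `labels`), loading the classification model might look like this:

```python
from transformers import AutoModelForAudioClassification

# Label/id mappings make predictions human-readable later on.
label2id = {name: str(i) for i, name in enumerate(labels)}
id2label = {str(i): name for i, name in enumerate(labels)}

model = AutoModelForAudioClassification.from_pretrained(
    model_id,
    num_labels=len(labels),   # 36 classes in SpeechCommands
    label2id=label2id,
    id2label=id2label,
)
```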
Next, I need to set some training parameters in the `TrainingArguments` object. We set the name of the model we want to create on the Hub. Once we've trained this model, we want to automatically push it to the Hub, so `push_to_hub=True`. We set the learning rate, batch size, number of epochs, and the metric we want to run, which is accuracy. You could stick with defaults for many of these parameters. The important ones are the number of epochs and whether you want to push the model to the Hub.
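A sketch of those arguments; the hyperparameter values below are illustrative, not the exact ones used in the session:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-conformer-keyword-spotting",  # also the repo name on the Hub
    push_to_hub=True,                                  # upload the model once training is done
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    num_train_epochs=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```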
I need to provide a function to compute the metrics, which will be reported after each epoch. For a classification problem, we'll use accuracy. I write a function that takes the predictions, compares them to the labels, and returns the appropriate metric. This is a generic metrics function for classification that you can reuse for text or image classification.
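A minimal version of such a metrics function, here using the `evaluate` library (an assumption; the original notebook may load the metric differently):

```python
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred holds the raw predictions and the reference labels for the evaluation set.
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)
```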
Finally, I put everything together in the `Trainer` object, passing the model, training arguments, training set, evaluation set, feature extractor, and metrics function. Then we simply call `train`. This isn't really machine learning code; we're not going deep into the model or PyTorch. We're just providing arguments, a metrics function, and a trainer object, and all the complexity of the training process and optimization is hidden inside the objects. If you want to work one level down, you can customize everything with PyTorch code.
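Putting it together, still reusing the objects from the earlier sketches:

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=feature_extractor,      # the feature extractor stands in for a tokenizer here
    compute_metrics=compute_metrics,
)

trainer.train()
```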
I ran this training job before because it lasted for a few hours. We'll look at the results. Once the training is complete, you can call `push_to_hub` to automatically push your model to the Hub along with lots of information. The code itself is not difficult at all and is very flexible and generic. Fast forward a few hours, and what do we get? Once the training is complete and the model is pushed to the Hub, I have this new model stored in my user interface. All the files have been pushed there, and we can see this is a Git repository because all models and datasets on the Hub are stored in Git repos.
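Pushing the result is a single call in this sketch:

```python
# Uploads the model weights, the feature extractor and an auto-generated model card
# (training parameters, per-epoch metrics) to the Hub repository named above.
trainer.push_to_hub()
```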
Because every model lives in a Git repository, I can easily download it using the Transformers library or use a simple Git workflow to clone the repo to my local machine and experiment. The Transformers library created all this information, including the training parameters and results, which we see here. This is called the model card and is super important because it gives you a lot of useful information about the model. You can quickly understand what it is and whether it's a good fit for you. We can see the different epochs and metrics. The best model hit 97.24% accuracy, which is very good. I trained this on SageMaker using a large instance with eight GPUs, and it lasted about four hours, costing around $150. For a business app, this is a good deal for a high-accuracy model.
Now we want to test the model. I recorded a few WAV files to try real-life samples. These are 16-kilohertz mono WAV files. Let's listen to one. This is "marvin." It's under a second, so it should work. We can predict with the model using the pipeline API. We create a pipeline for audio classification, pass the name of the model, and pass the WAV file directly or the loaded file, as shown in the sketch below. This tells me the clip is "marvin" with 52% confidence. Let's try another one.
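A sketch of that pipeline call; the model id and file name are placeholders:

```python
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="my-user/wav2vec2-conformer-keyword-spotting",  # placeholder repo name
)

# The pipeline accepts the path to a local WAV file directly.
print(classifier("marvin.wav"))
# e.g. [{'label': 'marvin', 'score': 0.52}, ...]
```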
Happy. This is a very short word, but the clip is slightly longer than one second, so it will be truncated. The model still figures it out with 54% confidence. Another way to predict, if you want more control, is to load the model using the `AutoModelForAudioClassification` API, load the sample, make sure it's 16 kilohertz, use the feature extractor to truncate it to one second, and pass the input to the model for prediction. The output is a tensor with 36 values (one for each class). Applying the softmax function rescales these values into probabilities.
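The lower-level version might look like this; the repository name and file name are placeholders:

```python
import librosa
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

repo = "my-user/wav2vec2-conformer-keyword-spotting"   # placeholder
feature_extractor = AutoFeatureExtractor.from_pretrained(repo)
model = AutoModelForAudioClassification.from_pretrained(repo)

# Load the clip at 16 kHz, the sampling rate the model was trained on.
samples, _ = librosa.load("happy.wav", sr=16000)

inputs = feature_extractor(
    samples,
    sampling_rate=16000,
    max_length=16000,      # truncate to one second
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**inputs).logits       # shape (1, 36), one value per class

probabilities = torch.softmax(logits, dim=-1)[0]
predicted_id = int(probabilities.argmax())
print(model.config.id2label[predicted_id], float(probabilities[predicted_id]))
```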
Let's run this on a new sample. It should say "happy" with 54% confidence. Now, let's deploy this on a real endpoint using the Inference API. You build the URL for your endpoint, pass an authorization token, and send an HTTPS POST request to that endpoint with the audio data as the payload. The first time we hit it, it says it's still loading the model. If we try again, it should predict. You can use the Inference API for free, but for 24/7 availability, GPU acceleration, and auto-scaling, you need a paid plan.
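Roughly, the HTTP call looks like this; the repository name, token, and file name are placeholders:

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/my-user/wav2vec2-conformer-keyword-spotting"
headers = {"Authorization": "Bearer hf_xxx"}   # your Hub access token

with open("happy.wav", "rb") as f:
    data = f.read()

response = requests.post(API_URL, headers=headers, data=data)
print(response.json())   # the first call may report that the model is still loading
```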
Imagine you want to show this to business stakeholders or customers who are not technical. Instead of showing them Jupyter notebooks and HTTP queries, you can show them this model running on Spaces. This is what I've built here. It's public, so you can try it. You can go to the URL and try it yourself. It's a simple and plain app, but you can make it look like your own application. How much code is that? It's about 10-12 lines of code using Gradio, one of the two frameworks you can use in Spaces (the other is Streamlit).
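A sketch of that kind of small Gradio app, with a placeholder model id:

```python
import gradio as gr
from transformers import pipeline

classifier = pipeline("audio-classification", model="my-user/wav2vec2-conformer-keyword-spotting")

def predict(audio_path):
    # Return the scores as a dict so Gradio can render them as a label widget.
    return {r["label"]: float(r["score"]) for r in classifier(audio_path)}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Audio(type="filepath"),      # record or upload a clip
    outputs=gr.Label(num_top_classes=5),
    title="Keyword spotting demo",
)
demo.launch()
```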
You can write this code on your local machine, install Gradio, and debug everything. Once you're happy, you create a new space with the Hugging Face CLI, which creates a Git repository where you commit your files. Spaces is really great. If you've been playing with the DALL-E Mini model, you've already been using Spaces without knowing it. There are plenty of spaces to look at, and when you browse models, you'll see spaces for that model. This gives you examples, use cases, and sample code to get started.
Before we conclude, let's see where we could go next. We have an accurate model, but we could make it better by experimenting with noisy or distorted samples. I found a cool library called audiomentations for data augmentation. It has a bunch of transforms. I loaded the validation dataset and built an augmentation object that picks three transforms out of four and applies them. The transforms include pitch shifting, time shifting, stretching, and masking. Using the `map` API, I applied this function to my validation set.
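An approximation of that augmentation step with audiomentations; the talk picks three transforms out of four at random, while this sketch simply applies each one with probability 0.5 and writes the result to a new column:

```python
import numpy as np
from audiomentations import Compose, PitchShift, Shift, TimeMask, TimeStretch

augment = Compose([
    PitchShift(min_semitones=-4, max_semitones=4, p=0.5),
    Shift(p=0.5),                                      # time shifting
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    TimeMask(p=0.5),
])

def augment_batch(batch):
    batch["augmented_audio"] = [
        augment(np.asarray(sample["array"], dtype=np.float32), sample_rate=16000)
        for sample in batch["audio"]
    ]
    return batch

# `dataset` comes from the earlier load_dataset sketch.
augmented = dataset["validation"].map(augment_batch, batched=True)
```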
We can explore some samples, look at the original and distorted samples, and display them. We can listen to them and see the difference. If I predict the clean samples, it says "bad" with 95% confidence. The augmented sample does very well, too, with 93% confidence. This is good because it sounds weird. Data augmentation is a good technique to make your model more resilient. You can add adversarial samples or distortions to the dataset and train your model a little longer to improve its performance.
Let me share a few resources before we take questions. If you want to get started, the first step is to join our community. Sign up, which takes just a few seconds. Check out the tasks page, which introduces you to different machine learning tasks you can work on with Transformers. Take the Hugging Face course, which is completely free and very practical. If you have questions, go to the forums and ask. The whole team is there to help.
From a business perspective, if your company needs help with picking, training, or deploying models, we have an expert support program. If you have strong privacy, security, or compliance requirements, we can do private hub deployments on your own infrastructure, whether on-prem or in your private cloud. You can follow me on Twitter, take a look at my blog or YouTube channel, and connect with me on LinkedIn.
Thanks very much. I hope this was useful. We have plenty of time to take all the questions. The most upvoted question is about the format of audio data and how it can be used in neural networks. Originally, transformers were used for natural language processing, where we transform natural language into numbers using tokenization. For audio data, the sequence comes from digitizing the audio. In this case, we have 16,000 floating-point numbers for one second at 16 kilohertz. This is the sequence that the model works on. You need to match what the model has been trained on, which is a WAV file digitized at 16 kilohertz.
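As a quick illustration of that digitization (the file name is a placeholder):

```python
import librosa

# One second of audio sampled at 16 kHz becomes a sequence of 16,000 floats.
samples, rate = librosa.load("marvin.wav", sr=16000)
print(rate)           # 16000
print(samples.shape)  # about (16000,) for a one-second clip
print(samples[:5])    # floating-point amplitude values
```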
Another question is about the applications of audio classification. Audio classification can be used for sentiment analysis, real-time detection of emotions in speech, picking up specific words or product names, and voice-based systems like Amazon Echo or Google Assistant. For keywords, you can use it in car navigation systems, robotics, or any scenario where you need to understand specific commands.
Can we use audio classification with synthetic, non-human voices? Yes, and you can also use text-to-speech systems to generate samples. For example, Amazon Polly can generate human-like voices. You can use these to add more diverse voices to your dataset, but you also need real people with different accents and demographics.
What are the major advantages of transformers compared to LSTMs? Transformers handle very long sequences better, understand context in both directions, train faster, and work well with transfer learning. LSTMs have a complex structure, making training long and costly.
Can we use transformers for object detection? Absolutely. If we go to the Hub and look at computer vision tasks, we have models for object detection. For example, the Facebook DETR model is very good for this.
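For instance, a DETR checkpoint can be tried with the object-detection pipeline (the checkpoint name is assumed, the image path is a placeholder):

```python
from transformers import pipeline

detector = pipeline("object-detection", model="facebook/detr-resnet-50")
print(detector("street.jpg"))
# e.g. [{'label': 'car', 'score': 0.99, 'box': {...}}, ...]
```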
What are the types of audio classification? Generally, the two main applications are classifying whole utterances (for example, emotion detection) and spotting specific commands or keywords.
What are transformers? Transformers are based on the attention mechanism, which allows the model to look at both left and right context and understand how to best predict a particular word. They started with NLP but have been generalized to other tasks. The original research paper is a good starting point, and there are many introductory blog posts available.
What is language modeling? Language modeling means training a model to predict the next word in a sequence. For example, you take a huge corpus of text, mask some words randomly, and train the model to predict the masked words. This helps the model understand the relationships between words in the vocabulary. You can then use this pre-trained model for downstream tasks like text classification or summarization.
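A quick illustration of masked language modeling with an off-the-shelf BERT checkpoint:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# The model proposes the most likely words for the masked position.
print(fill_mask("Keyword spotting is an audio [MASK] task."))
```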
Which format of data is good for deep learning models, and does the format affect the model? The format of the data itself doesn't matter. What matters is that the data is in the format the model expects and has the right feature names and types when you train or predict with the model. For example, the speech model expects input audio sampled at 16 kilohertz, and a text classification model expects a feature called "text" with the actual text and a label.
What is the difference between active learning and transfer learning? Transfer learning involves fine-tuning a pre-trained model on a small dataset. Active learning is an ongoing process where the model trains as it goes. For transformers, we do fine-tuning.
Thanks a lot, Julien, for a wonderful session. Thanks, everyone. Have a good day. For all attendees, there's an announcement. Please join the main stage. We are having a power talk on riding the flywheel of recommender systems by Deed Dood Mukherjee, starting at 2:30 p.m. IST. So please go to the main stage. Thanks.