Hi, everybody. This is Julien from Hugging Face. A few weeks ago, I published a short video where I demoed a multilingual voice query application on financial documents. In this video, we're going to dive in and see how I built this, first in a Jupyter Notebook, and then how I moved it to a Hugging Face Spaces application, which is what you probably saw in that video. I will include all the links in the video description so that you can follow along. Okay? All right. Let's get started.
So here's the application hosted on Spaces that we're going to build. But of course, first, we need to prepare some data and experiment a little bit, which is why we're going to start with a Jupyter notebook. I'm running this one on SageMaker because I need a GPU instance for embeddings, as we will see in a minute. But of course, you could run this locally as long as you have some GPU processing for embeddings. I ran it on CPU first, and it took hours versus not even two minutes on GPU. But I will also provide the embeddings file so that you don't have to do it all over again. So let's take a look at the notebook, start from the beginning, and see how we can solve that problem.
Here's the notebook and some data files. The link to all that stuff is in the video description. Obviously, I need to install some dependencies, some obvious ones like transformers, sentence transformers, NLTK, etc. I also need librosa, a sound processing library, to work with the WAV files that I'm going to record. OK, so install, import, all that stuff. And then, of course, we need some data, right? Here, I'm going to reuse a file that I actually built in a previous video. I will include the link to that, but the file is in the repo as well. It's basically a CSV file that includes all the 2020 annual filings. Downloading that stuff and preparing it is a story in itself, so I'll point to that. But let's just say we have that data and we can get started with it.
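As a rough sketch, the install and import cell might look something like this (the exact package list and versions depend on your environment, so treat it as an approximation of the notebook's setup):

```python
# Install the main dependencies (versions omitted; pin them as needed).
# !pip install transformers sentence-transformers nltk librosa pandas numpy

import pandas as pd
import numpy as np
import nltk
import librosa
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# The NLTK sentence tokenizer needs its model data downloaded once.
nltk.download("punkt")  # newer NLTK versions may also need "punkt_tab"
```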
We'll take a look at the file in just a minute. A few definitions first: the model I'm going to use to embed the text in those SEC filings, and some file names, because embedding takes a while and you only need to do it once for a given set of documents. I want to save the results and run my examples again and again without going through the embedding step every time. So we'll see what those files are. The first step is to load the data, figure out which text in that dataset I actually want to run my voice queries on, and then embed that text. The simplest way to do the embedding is the sentence-transformers library. It's pretty much a one-liner with the model name I selected, a distilled BERT variant. You can try other ones. So now I have this embedding model ready to go. The next step is to load the data, meaning read the CSV file. There are a few duplicates in there; I'm not sure why, my ingestion process was probably a little weird, so I drop them. I can see I have 492 annual reports. They're already broken down into individual sections; again, this is explained in a different video. All the sections in the documents are stored in different columns, and the one that's particularly interesting to me is the MDNA column, the management's discussion and analysis. That's where the company's management explains how the year went, how earnings look, and all the financial details. There's lots of good stuff in there, and that's where we want to run the queries.
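In code, that boils down to very little. The model and file names below are placeholders, not necessarily the ones in the notebook (msmarco-distilbert-base-v4 is just one plausible distilled BERT variant from sentence-transformers):

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

# Placeholder names -- the real ones are defined at the top of the notebook.
EMBEDDING_MODEL = "sentence-transformers/msmarco-distilbert-base-v4"
FILINGS_CSV = "sec_filings_2020.csv"

# One line to get an embedding model ready to go.
model = SentenceTransformer(EMBEDDING_MODEL)

# Load the filings and drop the duplicate rows left over from ingestion.
data = pd.read_csv(FILINGS_CSV).drop_duplicates()
print(len(data))       # 492 annual reports
print(data.columns)    # one column per filing section, including the MDNA text
```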
So what are we going to do with that MDNA section? First, we're going to break it down into individual sentences. That's a design decision I made: I want my queries to match individual sentences in a document. You could go for paragraphs, but here I went for sentences. So for each row in my dataset, I break the MDNA section into individual sentences using the NLTK tokenizer, and I also store the number of sentences in each of those MDNAs. The reason I'm doing this is that when I run a query on the corpus of embedded sentences, the result is a corpus ID, the identifier of the top matching sentence or sentences. If I want to locate the actual sentence and the actual document, I need some kind of index that tells me corpus ID 123,456 lives in this particular document. In real life, you could use a vector database or something similar, but here, to keep it simple, I'm just building an in-memory index from the number of sentences in each document. If you ask me, "Where is sentence 123,456?", I can quickly iterate over those counts and tell you which sentence it is and which document it lives in. It's a bit of a hack to keep things simple; at scale, you'd want a proper vector store instead. So that's what I'm doing: breaking MDNA sections into sentences and keeping the number of sentences in each document. The total number of sentences is a little more than 185,000. I'm storing those per-document sentence counts in a pickle file because I want to load them again and again without redoing that process.
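A minimal sketch of that step, assuming the MDNA text lives in a column I'm calling "mdna" here (the actual column name is in the repo):

```python
import pickle
from nltk.tokenize import sent_tokenize

corpus_sentences = []   # every sentence in every MDNA section, flattened
sentence_counts = []    # number of sentences per document, our in-memory "index"

for mdna in data["mdna"]:           # "mdna" is a guess at the column name
    sentences = sent_tokenize(str(mdna))
    corpus_sentences.extend(sentences)
    sentence_counts.append(len(sentences))

print(len(corpus_sentences))        # a little over 185,000 sentences

# Save the per-document counts so we never have to tokenize again.
with open("sentence_counts.pkl", "wb") as f:
    pickle.dump(sentence_counts, f)
```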
Obviously, the more important thing is to take those individual sentences and embed them. Using the model I loaded with sentence-transformers, I just call `model.encode`, which is super simple. It takes about a minute and forty-seven or forty-eight seconds on a single GPU instance, so that's quite fast. Each sentence becomes a 768-dimensional embedding, so the corpus has as many embeddings as I have sentences, each with 768 dimensions. That's a fairly big file, although not huge. Again, I'm saving it because I don't want to run that process again, so I write it to a NumPy file. Simple enough. So now we have the original dataset, the embeddings, and the per-document sentence counts, and we can run some queries.
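The embedding step itself is about this much code, assuming the `model` and `corpus_sentences` objects from the cells above:

```python
import numpy as np

# Encode all sentences in one call; on a GPU this takes a couple of minutes.
corpus_embeddings = model.encode(corpus_sentences,
                                 convert_to_numpy=True,
                                 show_progress_bar=True)
print(corpus_embeddings.shape)      # (number of sentences, 768)

# Save the embeddings so the expensive step only runs once.
np.save("corpus_embeddings.npy", corpus_embeddings)
```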
So how do we run a query? First, we'll look at text queries, and we'll see how to do voice in a minute. A query is very simple, really just a couple of lines with sentence-transformers: encode the query as an embedding, then call the semantic search function, passing the query embedding and asking for the top five results. That gives me five hits in this case, each with a corpus ID, a unique identifier pointing at an embedding in my corpus. That's where my in-memory index comes into play: all I have to do is enumerate the sentence counts and sum them until I find the range that includes the corpus ID. An in-memory index is pretty simple, but it works very nicely here and saves me from using any kind of backend. If that's unclear, just go and run it. I'm adding up sentence counts for all the documents until I find the range that includes the sentence the query returned, and that's good enough for me.
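Here's a minimal sketch of that flow, reusing the `model`, `corpus_embeddings`, `corpus_sentences`, `sentence_counts`, and `data` objects from the earlier cells; the function names and the "ticker" column are my own guesses, not necessarily what the notebook uses:

```python
from sentence_transformers import util

def find_document(corpus_id, sentence_counts):
    """Walk the per-document sentence counts until we reach the range
    that contains corpus_id; return that document's row index."""
    total = 0
    for doc_index, count in enumerate(sentence_counts):
        if corpus_id < total + count:
            return doc_index
        total += count
    return None

def find_sentence(query, top_k=5):
    # Encode the query, then search the corpus embeddings for the closest sentences.
    query_embedding = model.encode(query)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
    for hit in hits:
        doc_index = find_document(hit["corpus_id"], sentence_counts)
        # "ticker" is a guess at the column holding the company identifier.
        print(data.iloc[doc_index]["ticker"],
              round(hit["score"], 3),
              corpus_sentences[hit["corpus_id"]])
```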
So how do we find something in there? We just call that function. Let's try to query, "Energy prices could have a negative impact in the future." This returns five documents. The corpus ID is what the search query actually returns, and using my index, I can match this to this document, CTRA. You can look it up. I have the sentence that actually matches my query and a score returned by semantic search. I've got the top five hits in descending order. Here's another one: "International sales have significantly increased." Again, I find the top five sentences, their corpus ID, and I can match that to the document that includes them. Very, very simple. One line to encode, one line to query, and matching the top hits to the original corpus. No backend, all in memory, which is fine for a smallish dataset like this.
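With the sketch above, those two example queries would just be:

```python
# Same queries as in the notebook; each prints the top five matching sentences.
find_sentence("Energy prices could have a negative impact in the future", top_k=5)
find_sentence("International sales have significantly increased", top_k=5)
```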
Now that we have text figured out, how do we do speech? We'll keep it simple and use a speech-to-text model. There's a really great one from Facebook, a Wav2Vec2 model. This one has 300 million parameters; you can try the multi-billion-parameter variants, but this one is already very good. It does speech-to-text translation from 21 languages into English, so I can speak French, German, Spanish, etc., and what I get back is an English sentence. Here, I'm going to use the model in the simplest possible way: the Hugging Face pipeline, loading the model for automatic speech recognition. So, one line of code. Then all I have to do is pass it some recordings. I've got a couple of samples that match the text queries we used a minute ago, so we can use those WAV files to convert speech to text and then run text queries. Very simple. We can actually listen to this. Let me get some sound.
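Setting that up really is a one-liner. I believe the checkpoint matching that description is facebook/wav2vec2-xls-r-300m-21-to-en, but treat the exact name as an assumption and check the notebook:

```python
from transformers import pipeline

# Speech-to-English-text translation; ~300M parameters, 21 source languages.
asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-xls-r-300m-21-to-en")
```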
Okay, so that's my recording in French. I'm just going to load that WAV file; this is where you need the librosa library. Then I run it through my Hugging Face pipeline, the ASR model, and the output is just an English sentence: "Our international sales have significantly increased." This took a few seconds; again, this is a GPU instance, so it's reasonably fast. I have another one here: "Le prix de l'énergie pourrait avoir un impact négatif" ("Energy prices could have a negative impact"). So, French again: load it, translate it to English, and I get "The price of energy could have a negative impact on the future." At this point we're back to text queries: thanks to that sweet speech-to-text model, we can grab the output of the speech-to-text translation and pass it to the same query function. Unsurprisingly, we hit the same documents. And that's really the bulk of the application: two models, one to encode and embed the corpus and one for speech-to-text. Beyond that, there's no machine learning code to write; the rest is just a Python app. So that's pretty cool. That's the notebook.
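End to end, the voice query path is just these few lines, reusing the `asr` pipeline and the `find_sentence` sketch from above; the WAV file name is a placeholder for one of the sample recordings in the repo:

```python
import librosa

# Load the recording; the model expects 16 kHz mono audio.
speech, sample_rate = librosa.load("query_french.wav", sr=16000)

# Any of the 21 supported languages in, English text out.
text_query = asr(speech)["text"]
print(text_query)

# From here, it's just a text query again.
find_sentence(text_query, top_k=5)
```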
Now imagine you want to demo this to your stakeholders. You need more than a notebook. So let me show you how to grab this stuff and build an application with it. This is where I use Spaces. I've covered Spaces in a lot of detail in previous videos, so I'll cut straight to the chase here. Spaces is a Hugging Face feature that lets you host web applications written with Streamlit or Gradio to showcase machine learning applications. It's super simple to use: you start by creating a Git repository, which I've already done here, and you add your files to that repo. You can see I have my CSV file with the SEC filings, my embeddings, and a couple of sample WAV files, and of course the application code, which we'll look at in a minute. So I created a repo and pushed all that stuff in there. If you're lucky, it works; if not, you can easily debug it, and we'll take a look at the logs. The cool thing about using Gradio here is that I can do all the work locally. I can fire up the Gradio app on my machine, which makes it really easy to write, test, and debug. Once it works locally, I just push it to the repository, and it works. Pretty cool. I like to work locally when I can.
Here's the repo that I created. We can see all the files; I'll include the URL to this. It's a public Space, so you can go and play with it. No surprise, everything is in there, and the most interesting bit is, of course, the application. So let's take a look at the app. You'll see it's very, very similar to the notebook code. I'm importing all the same things; Gradio is the most important addition here. And obviously, I'm not running the embedding process here. I'm just using the embeddings that I've already computed and saved; you can see them here. I'm using the same models as before, so the application is very similar to the notebook. I load the corpus because I do need the original sentences, and I tokenize them again. I guess I could have saved that as well, but it was easier to just load the data and redo it. That works here because it's a small dataset, and again there's no backend integration, so I'm cutting a few corners. So: load the corpus, run the tokenization and sentence count process again, then load the embeddings we already computed, because that took about two minutes on a GPU and we don't want to do it every time. The rest is pretty much identical: load the embedding model, load the speech-to-text model, and define my `find_sentence` function, which runs the query and finds the actual source document. There's a little more data returned here, because of a cool feature in Gradio: if a function returns a pandas DataFrame, Gradio displays it very nicely, as we'll see in a second. That beats plain text, I guess. Looks nice.
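The only real change from the notebook's query code is the return type. Instead of printing hits, the app-side function collects them into a pandas DataFrame. Roughly, with the same hypothetical `find_document` helper and "ticker" column as before:

```python
import pandas as pd

# Inside find_sentence, instead of printing each hit, build rows for a DataFrame.
rows = [{
    "Company": data.iloc[find_document(hit["corpus_id"], sentence_counts)]["ticker"],
    "Score": round(hit["score"], 3),
    "Sentence": corpus_sentences[hit["corpus_id"]],
} for hit in hits]
results = pd.DataFrame(rows)   # Gradio displays a returned DataFrame as a nice table
```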
Then I have the Gradio part. The whole thing is 112 lines, and honestly, half of that is UI code. So I took that query process from my notebook and built the Gradio UI, which is very simple. It has a radio button to choose between a text query and a speech query, a text box for text input, an audio input for speech, and a slider to control how many hits the query should return. The outputs are a text box showing the query itself, as it comes out of the speech-to-text model, and the DataFrame output, where I display the results returned by my search function. The interface is super simple, and I have a bunch of ready-made examples. The process function does what you'd expect: if I'm going for speech, it loads the file recorded through the microphone, runs the ASR model, and matches sentences; if the input selection is text, I already have the text query and can just run it against the corpus.
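A rough sketch of that UI wiring, reusing the `asr` pipeline and `find_sentence` function from the earlier sketches; Gradio's component arguments have shifted a bit between versions, so this is an approximation rather than the exact app code:

```python
import gradio as gr
import librosa

def process(input_selection, text_query, audio_file, top_k):
    # Speech path: translate the recording to English text first.
    if input_selection == "speech" and audio_file is not None:
        speech, _ = librosa.load(audio_file, sr=16000)
        text_query = asr(speech)["text"]
    # Either way, we end up running a plain text query against the corpus.
    return text_query, find_sentence(text_query, top_k)

iface = gr.Interface(
    fn=process,
    inputs=[
        gr.Radio(["text", "speech"], label="Input type"),
        gr.Textbox(label="Text query"),
        gr.Audio(type="filepath", label="Speech query"),  # microphone or uploaded WAV
        gr.Slider(1, 10, value=5, step=1, label="Number of hits"),
    ],
    outputs=[
        gr.Textbox(label="Query"),
        gr.Dataframe(label="Results"),
    ],
)
iface.launch()
```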
There wasn't any trick to migrating my notebook to Spaces; I didn't hit a lot of problems. The main thing worth mentioning is how dependencies get installed: Python packages go in a `requirements.txt` file in the repo, and if you have native dependencies, you list those in a `packages.txt` file. When the app starts, it installs all of that automatically. You can see the logs here, which is really nice for debugging: you can see the models being downloaded and everything that's going on. Again, I recommend testing locally; it makes everything simpler. Once it works on your machine, push it to Spaces, and if you're missing a dependency, you can just add it to the `requirements.txt` or `packages.txt` files.
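For reference, a plausible `requirements.txt` for an app like this might look like the following; the real file may pin versions or differ, and on a Gradio Space the gradio package itself typically comes with the SDK. Native audio libraries such as libsndfile1 (and possibly ffmpeg) would go into `packages.txt`.

```
transformers
sentence-transformers
torch
nltk
librosa
pandas
```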
Let's try the app. I can see it's running, so hopefully it's working. Let's run those samples, the same ones we ran in the notebook. Okay, so that's my recording, the local WAV file that the ASR model will work from. Click on Submit; it takes a few seconds to run the speech-to-text translation, and this is the first hit. That one's going to be slower, I guess. Then it runs the query. Yep. Let's do another one, in Spanish. I don't speak Spanish, and it's not my voice; I actually generated this one with Amazon Polly, the text-to-speech service on AWS. Okay, we run the query, and we should see results. All right, the translation is good, and the hits are good. Let's do one in German. I think this one comes from Polly as well. Let's ask for five hits, click on Submit, and we see the query and the results. All right.
So this is pretty cool. Just to give you a little background, I actually built this for a customer presentation. When I had the original idea, I thought, "Oh, I'm going to need so much help from my nice Hugging Face colleagues on working with embeddings, working with the models, building Spaces, etc." Much to my surprise, I built all of it in less time than I expected; it was an easy process. If it took me a couple of days to do this, all you smart people out there will go much faster. There's really nothing complicated here. The big mess is always getting the data. Once your data is in a CSV file or something you can load in pandas and start messing with, you're almost there. Well, that's my experience, at least. Anyway, I hope that was fun, and I hope this gives you lots of ideas to go and play with all those models: NLP, speech-to-text, computer vision, etc. There are so many cool transformer models out there. Give it a try. Find some data. Start combining models. Publish some cool Spaces. And yeah, ping me; I'm happy to share all the good stuff that you build. Okay. All right. That's it for today. Hope you learned a few things. Hope that was fun. All the links are in the video description. Don't forget to subscribe for more cool stuff in the future. Until then, keep rocking. Bye.