Hi everybody, this is Julien from Arcee. Everyone's excited about Generative AI these days, but sometimes the use cases are not so clear. In this video, I'm going to show you a pragmatic, real-life Generative AI scenario that you can probably use in your organization: text summarization. We have way too many documents, and we don't have time to read all of them. Sometimes we just need a one-line summary of those documents, maybe for search, indexing, or just for reference. We're going to start from a language model from the T5 family, try it locally, then fine-tune it on a legal dataset, and see how well we can summarize complex legal documents. You can certainly adapt the example to any other domain you're interested in. Okay, let's get to work.
The Flan-T5 family of models is really interesting. They come in different sizes, which we'll look at in a second, and they are really good general-purpose language models that perform very well on a variety of tasks. Here, I'm looking at the large version, which sits in the middle of that family. Reading the model card, we can see it's been trained on a variety of tasks in multiple languages; the list is somewhere down there, and it's quite a few languages. It's a good starting point, and many customers have good success with these models, so I'm going to start from this one. Of course, I could evaluate it in the inference widget here, but let's jump into a notebook straight away and see what this model is capable of. I'm running this on a notebook instance on SageMaker, but you could run it in any Jupyter environment. I will put a link to the code in the video description. Let's just run the first few cells and install the Transformers library. First, we're going to try the model locally.
For reference, I wrote down the parameter count for the different sizes. The small version is 80 million parameters, base is 250 million, large, which is the one I'm using, is 780 million, and then we have XL at 3 billion and XXL at 11 billion. In general, I recommend starting small and evaluating accuracy before scaling up. Just like with EC2 instances, you'd run your app on a small instance first and scale up only if needed. I'm starting in the middle of the range here, but you could try the 80-million or 250-million-parameter versions; maybe they would work well enough. We've installed the Transformers library, so let's create a pipeline object for summarization using Flan-T5 large. The download is already about three gigabytes, so it's a pretty large model. Hopefully, it will work well. Once we have our pipeline, we'll try to summarize this bit of text on the Eiffel Tower, which is general-purpose text from Wikipedia. We get a very good summary: "The Eiffel Tower is the tallest freestanding structure in France." This general-purpose model works well for summarizing general-purpose English text. The next step is to deploy it in the cloud and start experimenting with real-life data instead of working in the notebook.
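Here's a minimal sketch of that local test, assuming the `google/flan-t5-large` checkpoint from the hub; the Eiffel Tower paragraph below is just an illustrative stand-in for the Wikipedia text used in the video.

```python
# Local test of Flan-T5 large with the Transformers summarization pipeline.
from transformers import pipeline

summarizer = pipeline("summarization", model="google/flan-t5-large")

text = (
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, "
    "France. It is named after the engineer Gustave Eiffel, whose company designed and "
    "built the tower from 1887 to 1889. It is 330 metres tall and was the tallest "
    "man-made structure in the world until 1930."
)

print(summarizer(text, max_length=50)[0]["summary_text"])
```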
We can very easily do this by going back to the model page, clicking on Deploy, selecting Amazon SageMaker and the summarization task on AWS, and grabbing the code snippet that we can copy and paste into a notebook. We don't even need to write the deployment code ourselves. In the interest of time, I have already done this and updated it for the latest Transformers version, so let's just run it. While it's deploying, I can explain what we're doing here. First, we're referencing the model ID on the hub and deploying straight from the hub, which is really cool. You can do this with any model from the hub using the SageMaker SDK: just point at the model, give the task name, and it's done. Then we create a Hugging Face model object with the SageMaker SDK, passing the model name, and call deploy on that model with the number of instances needed to back the endpoint. I'll go with a single G5 instance, which is a small, cost-efficient GPU instance on AWS. You could also use CPU instances if you're not concerned with latency, though they will be slower. For now, a small GPU will be sufficient. It will take a couple of minutes to deploy, so I'll pause the video and be right back when the endpoint is ready.
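For reference, here's roughly what the adapted snippet looks like; the container versions and the execution role lookup are assumptions you'd adjust to your own environment.

```python
# Deploy google/flan-t5-large straight from the hub to a SageMaker endpoint.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

hub = {
    "HF_MODEL_ID": "google/flan-t5-large",  # model ID on the Hugging Face hub
    "HF_TASK": "summarization",             # task for the inference toolkit
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.26",  # assumed versions; use ones supported in your region
    pytorch_version="1.13",
    py_version="py39",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",  # small, cost-efficient GPU instance
)
```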
Okay, the endpoint has been deployed, and we can see it in the SageMaker console. Now let's invoke it. We'll first try the same Eiffel Tower text we used locally, and sure enough, we get a good result, the same result actually. Now we have the model in our production environment; we could set up auto-scaling and other features if needed, but not for now. We can pull some data from one of our backends, and let's say we have legal data that we want to summarize. Here's a blob of legalese in English. The model has been trained on a lot of English data, so hopefully it will do a good job summarizing legal text. Let's run this prediction. It's not a very good summary at all: it just gave us back the first sentence, the name of the section, which hints at what the document is about, but it isn't really a summary in plain English. That's disappointing, but it doesn't mean the model cannot do the job. It means the model needs to be trained to generate summaries for this kind of legal document, and that's what we're going to do. But first, let's delete the endpoint to avoid unnecessary charges.
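In code, the invocation and cleanup look something like this; `legal_text` is a placeholder for the document you'd pull from your backend.

```python
# Summarize a legal document with the deployed endpoint, then clean up.
legal_text = "SECTION 1. SHORT TITLE. This Act may be cited as the ..."  # placeholder

response = predictor.predict({"inputs": legal_text})
print(response[0]["summary_text"])

# Delete the endpoint to stop paying for the instance behind it.
predictor.delete_endpoint()
```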
Now let's fine-tune the model on legal documents to improve how it summarizes them. First, we need data. Maybe you have a dataset already, and that's great. If not, you can look for legal datasets on the hub, for example with the full-text search and a query like "summarization, legal." We can see some interesting starting points. The one I'm going to use is called Billsum. It's an English-language dataset of legal texts; each document comes with a summary, which is still a bit long, and a title. I'm interested in a one-line summary of these documents, so I'll use the title column as the label and the text column as the input; we won't need the longer summary column. This is a good way to save time because we don't have to build our own dataset.
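Loading it takes a couple of lines; the column names used below (`text`, `summary`, `title`) come from the dataset card.

```python
# Load the Billsum dataset from the Hugging Face hub.
from datasets import load_dataset

dataset = load_dataset("billsum")

print(dataset)                       # splits and sample counts
print(dataset["train"][0]["title"])  # the one-line label we'll train on
```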
We import the SageMaker SDK and the Transformers and Datasets libraries from Hugging Face, and then load the dataset. We have about 19,000 training samples and 3,000 validation samples. This is a good number, and we could probably get away with less: if you have a few thousand documents with summaries, you're probably good to go. The whole point of fine-tuning is that you only need a small quantity of data to customize the model to your own needs. We do need to do a tiny bit of preprocessing. The model expects an instruction, and in this case the instruction will be "summarize." We add that prefix to the legal blob, so the input becomes "summarize: [text]", and we tokenize it. We use the one-line title as the label and tokenize it as well. So the model input is the tokenized prefix plus legal text, and the label is the tokenized title.
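Here's a sketch of that preprocessing function; the maximum sequence lengths are assumptions, not values from the video.

```python
# Build model inputs ("summarize: " + legal text) and labels (the title).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
prefix = "summarize: "

def preprocess(examples):
    inputs = [prefix + text for text in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=examples["title"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```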
We apply this to our dataset with the map function in one go, and save the processed dataset so we don't have to redo the preprocessing every time. The next step is to upload the dataset to S3, where SageMaker expects it. We define a few paths, one for training and one for validation, and save the datasets to disk with S3 as the file system. We can reload them from S3 with `load_from_disk`, passing the S3 URI. Now we're ready to train. In the interest of time, I won't go through every line of the training script. In a nutshell, it's not difficult to adapt your training code for SageMaker, whether you use scikit-learn, PyTorch, or Hugging Face. The script will be invoked by SageMaker inside the training container as an actual script; it will be run as `python my_script.py` with command-line arguments for the parameters. Generally, the only thing you need to do is add `argparse` code to handle hyperparameters and the locations of the training and validation sets. This feature is called script mode, and the AWS documentation for it is still not great, but I have other videos on this channel that walk you through it.
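Put together, the dataset mapping and the S3 upload from above look roughly like this, assuming `s3fs` is installed so the Datasets library can write to `s3://` URIs directly; the bucket and prefixes are placeholders.

```python
# Tokenize the whole dataset in one go and save it to S3 for SageMaker.
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=["text", "summary", "title"]
)

training_input_path = f"s3://{bucket}/flan-t5-billsum/train"
validation_input_path = f"s3://{bucket}/flan-t5-billsum/validation"

tokenized["train"].save_to_disk(training_input_path)
tokenized["test"].save_to_disk(validation_input_path)
```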
The rest is vanilla Hugging Face code. I'm using the high-level Trainer and Auto APIs: loading the model and tokenizer, defining the training arguments and the trainer, then calling `train` and saving the model. You can take vanilla Hugging Face code that runs on your laptop or a virtual machine and adapt it in 15 minutes or less to run on SageMaker. Once we have this, we need to define the hyperparameters, which match the parameters the script expects as command-line arguments. We define a single epoch, the learning rate, and so on, and we pass the model ID as a parameter so the script can be reused for different models. Then we create a Hugging Face estimator with the SageMaker SDK, passing the script, its dependencies, the hyperparameters, the Transformers, PyTorch, and Python versions, and the infrastructure. I'm using a p3dn.24xlarge instance, which comes with eight NVIDIA V100 GPUs. This is the whole point of using managed services: you can go as big as you want for a few minutes. I'm also enabling distributed training with SageMaker's data parallel library, which is a high-performance implementation of data-parallel training.
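The estimator configuration looks roughly like this; the script name, hyperparameter names, and container versions are assumptions that have to match your own training script.

```python
# Configure the training job: script, hyperparameters, infrastructure.
import sagemaker
from sagemaker.huggingface import HuggingFace

hyperparameters = {
    "model_id": "google/flan-t5-large",
    "epochs": 1,
    "learning_rate": 3e-4,
    "train_batch_size": 4,
}

huggingface_estimator = HuggingFace(
    entry_point="train.py",            # the training script described above
    source_dir="scripts",              # folder holding the script and its dependencies
    instance_type="ml.p3dn.24xlarge",  # 8 NVIDIA V100 GPUs
    instance_count=1,
    role=sagemaker.get_execution_role(),
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters=hyperparameters,
    # SageMaker data parallelism for distributed training across the 8 GPUs.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
```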
We call `fit`, passing the locations of the training and validation sets, and the training job starts. It creates the instance, downloads the dataset from S3, and pulls the Hugging Face training container for PyTorch 1.13. The training log is long, so let's skip to the end. We can see evaluation running and generation happening. Eventually, the model is saved, and the training job ran for just about 30 minutes, so you pay for 30 minutes of that instance type. We could configure Spot Instances to reduce costs, but I didn't want to do too much here; you can look for the Spot Instances videos on the channel if you want more information.
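Launching the job then comes down to a single call with the S3 locations defined earlier as input channels.

```python
# Start training; this blocks and streams the training log to the notebook.
huggingface_estimator.fit(
    {"train": training_input_path, "validation": validation_input_path}
)

# After the job completes, the model artifact lives in S3.
print(huggingface_estimator.model_data)
```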
Now we have the model in S3. We could load it locally, but what we really want is to predict with it, so I'm going to deploy it on a slightly larger instance, a P3, which is more powerful than the G5 we used earlier. We'll look at one of the samples from the test dataset. Here's the legal blob again, with formatting and section numbers. We just send it to the endpoint for prediction. The summary we get is much better than what we got with the vanilla model: it's an actual sentence, and if you're a lawyer, you would see that it's a good summary for this particular document. It reads like natural English. That shows how quickly you can fine-tune models. Thirty minutes for a large model is really good, meaning we can iterate many times in a single business day. We can try different dataset combinations and hyperparameters, and get a lot of iterations done in a day to reach a solution that's accurate enough for production.
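Deploying the fine-tuned model and summarizing a test sample look roughly like this; the instance type is an assumption, and the exact response format depends on the task the inference container infers for the model.

```python
# Deploy the fine-tuned model straight from the estimator and test it.
predictor = huggingface_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.2xlarge",
)

sample = dataset["test"][0]["text"]
response = predictor.predict({"inputs": "summarize: " + sample})
print(response)
```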
Of course, you could add your own data to the mix, but you may not even need to: if you try this fine-tuned model on your own English-language legal data, it might be just fine as is. Once we're done, let's not forget to delete the endpoint. That's really what I wanted to show you: how to quickly work with hub models and hub datasets, try them out locally, evaluate them on your own data, and bring them to your production environment on AWS by copying and pasting the code we provide, and, if needed, fine-tune on SageMaker to get much better relevance and accuracy on your domain-specific data. I hope this was useful. Feel free to ask questions, and I have more videos coming. Until next time, keep rocking.