Hi everybody, this is Julien from Hugging Face. In this video, we're going to do something a little different. We're going to start from a research paper that uses transformer models and see if we can replicate the results and achieve state-of-the-art results with a few lines of code. The problem we're studying is called counterfactual detection. In a nutshell, we're trying to figure out if a piece of text mentions a fact that did not take place or could not take place. We'll see some examples, and as you will see, these are really, really confusing.
The research paper in question is by the Amazon Science team, and it studies counterfactual text in product reviews. So, this is what we're going to do. We're going to take a quick look at the paper and the dataset the authors have shared, and then run a notebook where we train a transformer model using the Hugging Face libraries. We'll see if we can match or maybe exceed the results in the paper. All right, let's get started.
This is the paper we're going to start from. It's a collaboration between the University of Liverpool and Amazon. It's a good read, and I really enjoyed it, which is why I selected it. The good thing is that the authors actually share the dataset used in the paper, which is why we're able to try the experiment. The dataset includes four subsets: English, English extended (which has fewer counterfactual sentences, making it a bit harder to learn), German, and Japanese. These are separate, so we can pick any of them. We're going to work with the English one, which has about 5,000 samples and 18.9% counterfactual sentences.
If we check the dataset on GitHub, we see the sentence and a label indicating whether the sentence is counterfactual or not. For example, "I wish the cord was longer and had a flat plug" means the cord is not longer and doesn't have a flat plug, so these facts did not happen. Let's take a few more examples: "I wish I had had him as an instructor at college", meaning the reviewer didn't, and "If you wanted to go commando, you could do that too", but they did not. This is the model we're trying to build: figuring out if a sentence mentions a fact that did not happen. It's all about picking up particular grammatical constructs, which makes it a really interesting problem.
If we go back to the research paper and skip ahead a few pages, we'll see the algorithms the authors have tried. They tried traditional algorithms like SVMs and random forests to get a baseline, and then transformer models like BERT and RoBERTa. They train these models on the four datasets we just saw and share different metrics. The main metric they report is the Matthews correlation coefficient (MCC), which works well for imbalanced datasets like this one. In the appendix, they also share accuracy and F1 scores for the same training jobs.
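As a quick reminder (this is the standard definition of the metric, not something specific to the paper), the MCC for binary classification is computed from the confusion matrix as:

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
                    {\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
```

It ranges from -1 to +1 and stays informative even when one class, like counterfactual sentences here, is much rarer than the other.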
They share the actual hyperparameters they used, so we'll try to replicate those as closely as possible and run the job under the same conditions. Now, let's go to the notebook and see how we can run this. I'm using a GPU instance on AWS, a p3.2xlarge with one NVIDIA V100, and I'm running everything directly in the notebook, not on managed infrastructure. I've uploaded the three TSV files from the English dataset: one for training, one for validation, and one for testing.
The first step is to create a Hugging Face dataset from these three files. We need to specify the tab character as the separator because the sentences contain plenty of commas. Once we've done that, we have our Hugging Face dataset. I'm going to rename the label column to "labels," which is what the model expects. Now, my dataset has three splits, training, validation, and test, each with two columns: the sentence and the labels.
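In code, that step looks roughly like this. The file names and the label column name (Is_Counterfactual) are assumptions for the sake of the sketch, not the exact names used in the video:

```python
from datasets import load_dataset

# Assumed file names for the three TSV files uploaded to the notebook.
data_files = {
    "train": "train.tsv",
    "validation": "valid.tsv",
    "test": "test.tsv",
}

# The sentences contain plenty of commas, so we read the files as
# tab-separated values instead of the default comma-separated format.
dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

# The model expects the label column to be called "labels".
dataset = dataset.rename_column("Is_Counterfactual", "labels")
print(dataset)
```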
Next, let's grab the model. In the research paper, they used XLM-RoBERTa, so I'm going to pick the same model, XLM-RoBERTa base, from the Hugging Face hub. I'm downloading the tokenizer and tokenizing my dataset with the map function, setting the max length to 256, padding if the sentence is not long enough, and truncating if it's too long. Now, I've got my three splits tokenized and ready.
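Here's a minimal sketch of the tokenization step, assuming the text column is called "sentence" (that column name is an assumption):

```python
from transformers import AutoTokenizer

# Same checkpoint family as in the paper: XLM-RoBERTa base.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize(batch):
    # Pad short sentences to 256 tokens and truncate longer ones.
    return tokenizer(
        batch["sentence"],
        max_length=256,
        padding="max_length",
        truncation=True,
    )

# Tokenize all three splits with map().
tokenized = dataset.map(tokenize, batched=True)
```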
It's time to load the model. I can load the XLM-RoBERTa base model from the hub and update three hyperparameters according to the values in the research paper: the dropout for the attention probabilities, the dropout for the classification layer, and the layer normalization epsilon. I've modified the default configuration for the model, and now I can create my model from the pre-trained checkpoint with the customized configuration.
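A sketch of that configuration step is below. The dropout and epsilon values are placeholders; the real values are the ones listed in the paper's hyperparameter table:

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained(
    "xlm-roberta-base",
    num_labels=2,                      # counterfactual vs. not counterfactual
    attention_probs_dropout_prob=0.1,  # dropout on attention probabilities (placeholder value)
    hidden_dropout_prob=0.1,           # dropout before the classification layer (placeholder value)
    layer_norm_eps=1e-5,               # layer normalization epsilon (placeholder value)
)

# Create the model from the pre-trained checkpoint with the customized configuration.
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", config=config
)
```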
Next, we'll use the Trainer API, which is really simple. First, I need to set the training arguments: the output directory for the training job, the number of epochs (50, as in the paper), the training batch size (16, as in the paper), and the evaluation batch size (also 16). I evaluate and log metrics after each epoch, and I also force logging every 100 steps to make sure I get enough data points. The default is every 500 steps, but with a small dataset and a batch size of 16, some metrics might not show up at all, so I'm forcing logging to happen more often.
I also set the evaluation accumulation steps to prevent a CUDA out-of-memory error. The default behavior is to keep the predictions for all evaluation steps on the GPU before moving them to the CPU, which caused an OOM here. Instead, I'm forcing the predictions to be moved off the GPU after every evaluation step, which saves GPU memory and avoids the error. I've updated a few more hyperparameters according to the paper, so I should have exactly the same model and configuration.
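Putting those two paragraphs together, the training arguments look roughly like this. The epochs and batch sizes come from the paper; the output directory name, the logging settings, and the accumulation value of 1 are choices made for this sketch:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-counterfactual",  # assumed output directory name
    num_train_epochs=50,                      # as in the paper
    per_device_train_batch_size=16,           # as in the paper
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",              # report metrics after each epoch
    logging_steps=100,                        # log more often than the default of 500 steps
    eval_accumulation_steps=1,                # move predictions to the CPU after every eval
                                              # step instead of keeping them all on the GPU
)
```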
The next step is to set up the metrics. I want to report on the same metrics as in the paper: accuracy, F1, and Matthews correlation coefficient (MCC). My compute metrics function looks like this: it computes the three metrics and returns a dictionary with the values. Now, it's time to put everything together with the Trainer object: the model, the training args, the training dataset, the evaluation dataset, and the metrics. Then, I go and train.
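Here's a sketch of the metrics function and the Trainer setup. I'm using scikit-learn for the three metrics, which is just one possible way to compute them:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from transformers import Trainer

def compute_metrics(eval_pred):
    # The Trainer passes a (logits, labels) tuple for the evaluation set.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions),
        "mcc": matthews_corrcoef(labels, predictions),
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    compute_metrics=compute_metrics,
)

trainer.train()
```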
This ran for an hour and 12 minutes, all the way to 50 epochs. The first few epochs are already very good, with decent accuracy, F1, and MCC. If I scroll down to epoch 50, I can see the best epoch, and the convergence is quite good: three epochs give the exact same results, so 50 epochs looks like a good number. On the validation set, I get an accuracy of 0.9552, an F1 score of 0.8739, and an MCC of 0.8471.
How does that compare to the paper? Let's check the accuracy first: 0.9552. The paper reports an accuracy of 0.92 for English (without masking), so we've improved quite a lot. Our F1 score of 0.8739 is a bit lower than the 0.92 reported in the paper. I ran different jobs and couldn't reproduce that F1 score, so maybe there's a parameter that isn't specified, or maybe it's just bad luck. The MCC is 0.8471, which is much better than the 0.79 in the paper.
There might be a trade-off here: we have significantly higher accuracy and MCC but a slightly lower F1. The MCC is the metric the authors highlight in the main body of the paper (accuracy and F1 only appear in the appendix), so I'm pretty happy that we exceeded it.
If we run evaluation on the test set, do we get the same? Running the evaluation, we see slightly lower numbers: an accuracy of 0.9447, an F1 score of 0.8673, and an MCC of 0.8351. These test-set values are the ones you would use as the benchmark, and they're still better than those in the research paper. This looks like a pretty good training job.
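For reference, the test-set evaluation is just one more call on the same Trainer (the split name is the one assumed in the dataset sketch above):

```python
# Evaluate the fine-tuned model on the held-out test split.
test_metrics = trainer.evaluate(eval_dataset=tokenized["test"])
print(test_metrics)
```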
What does this tell us? The Transformers library makes it pretty easy to try and replicate papers, especially if the authors share hyperparameters and detailed training information. We can work with a pretty vanilla notebook, load the dataset, pass the model name, be careful with hyperparameters, and train. If an idiot like me can match or exceed state-of-the-art results, well, all your smart people out there can certainly do much better.
Since this is the first time I've done this, I think I should do a little victory dance. State-of-the-art, baby. That's my victory dance. You didn't want to see that, I know. Well, that's it for this video. Maybe I'll try more. I never thought this would be so easy, so I'll keep an eye out for papers I can try to replicate. Hope that was fun, and you learned a few things. I'll share everything in the video description. Until next time, have fun, keep learning. Bye-bye.
Tags
Transformer Models, Counterfactual Detection, Hugging Face Library, Machine Learning Replication, Natural Language Processing