Hi everybody, this is Julien from Arcee. Training deep learning models means having a dataset, and that usually means labeling data. Deep learning models are really data hungry, so we need lots of data, lots of labeled data, and that's time-consuming and expensive, not so fun to do. What if we could actually train models with just a tiny bit of labeled data? And when I say a tiny bit, I mean really a handful of data instances. This is exactly what you can do with the SetFit library. Bring in just a few labeled samples and get really good results with very short training times. In fact, training times are so short that you can consider training on CPU. I've been a little bit obsessed with those new Intel CPUs lately. Of course, we're going to train with SetFit on Intel CPUs. You'll see it'll only take minutes, and we're going to get extremely good results.
To begin with, you should really read the SetFit blog post, which has a lot of interesting information. In a nutshell, SetFit is an open-source library that lets you train Sentence Transformer models for text classification using few-shot learning, so with just a few labeled samples. The mile-high view is that you start from that tiny bit of data, say positive and negative sentiment data. SetFit generates a number of sentence pairs from it: pairs drawn from the same class, and pairs drawn from different classes. This is called contrastive learning. First, the model is automatically fine-tuned on those sentence pairs. The intuition is that we want to understand how a positive sentence differs from a negative one, so we fine-tune the embeddings here. Then we encode the sentences in the dataset with the fine-tuned Sentence Transformer model. That gives us embeddings, which we use to train the actual classification head. Start with a little bit of data, learn the differences, fine-tune a model, and then use that model to train a classifier. Pretty cool. And the code is dead simple. Why don't we look at that?
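To make the contrastive step concrete, here's a toy sketch of pair generation in plain Python. This is an illustration of the idea, not SetFit's actual implementation: sentences from the same class form "similar" pairs, sentences from different classes form "dissimilar" pairs, and the Sentence Transformer is then fine-tuned to pull similar pairs together and push dissimilar pairs apart.

```python
import itertools

def make_contrastive_pairs(positives, negatives):
    """Toy illustration of contrastive pair generation.

    Same-class pairs get label 1.0 (embeddings should move closer),
    cross-class pairs get label 0.0 (embeddings should move apart).
    """
    pairs = []
    # Same-class pairs, within each class.
    for a, b in itertools.combinations(positives, 2):
        pairs.append((a, b, 1.0))
    for a, b in itertools.combinations(negatives, 2):
        pairs.append((a, b, 1.0))
    # Cross-class pairs, one sentence from each class.
    for a, b in itertools.product(positives, negatives):
        pairs.append((a, b, 0.0))
    return pairs

pos = ["great food", "loved it"]
neg = ["terrible service", "never again"]
pairs = make_contrastive_pairs(pos, neg)
# 2 same-class pairs and 4 cross-class pairs: 6 pairs from 4 sentences
```

Notice how even four sentences already yield six training pairs; this multiplicative effect is part of why a handful of labeled samples goes such a long way.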
Here's a very simple example, adapted from the blog post. I experimented with different datasets and went for the Yelp polarity dataset. It's a set of reviews from Yelp, obviously, with two classes, positive and negative, and they're really strong, polarized reviews. I thought that should work pretty well with few-shot learning. I'm only grabbing eight reviews from each class, and as you can see here, I'm using the full test set for evaluation. Then I'm downloading a Sentence Transformer model from the hub with the SetFitModel API, which is very familiar. Thank you, library creators. We also have a SetFitTrainer object, which will feel very familiar: we pass in the model, the training set, the evaluation set, the metric, the loss class we want to use, the batch size, the number of iterations, and the number of epochs. I'll fine-tune for one single epoch. All right, so that's simple enough. Then I just call train, then evaluate, and print the metrics. Yes, as mentioned before, I am running this on a Sapphire Rapids Intel instance on AWS. We've seen this before: AMX support is enabled, so the Advanced Matrix Extensions that accelerate matrix multiply-accumulate operations, which we're certainly going to run during the training step. Let's just run this code, and it should be fast.
The original dataset is actually 560,000 reviews; I'm just taking eight per class. Then I'm using the full test set. So that's my actual dataset. Now it's training, and you can see one epoch will last just a little bit less than three minutes. Why go and fire up a GPU for this? It's fast enough, right? You can run it on any machine you have lying around. So that's pretty cool. Let's just wait for this to complete and then we'll see what accuracy we get. Once evaluation is done, we see accuracy is a very nice 94.5% with 2 minutes and 19 seconds of training. What else can I say? SetFit is very impressive, and I really encourage you to give it a try. It gets very sweet results with just a tiny bit of data, so you could very well save yourself a world of labeling trouble. And of course, you can work with CPU servers, which are my favorite. They're easier to manage and generally less expensive as well. So yeah, it looks like we have a good combination of SetFit and CPU, so I'll keep digging. That's it for today. I hope this was fun and informative. I'll see you soon with more content. Until next time, keep rocking.