Let me show you how to use Hugging Face models on SageMaker. In this example, we're going to build a movie review classification model. Starting from a pre-trained model, we're going to fine-tune it on the movie review dataset, which is labeled with positive and negative reviews. Positive reviews are labeled with ones, and negative reviews are labeled with zeros. We're going to fine-tune the model on SageMaker, then copy it to our local machine, and use it for predictions.
First, we need to install the Transformers library and the Datasets library. I also recommend upgrading your PyTorch and TensorFlow versions, as Transformers tends to require recent versions of both. You will also need the latest SageMaker SDK. Here, I'm using a beta environment, so I'm installing from a local version of the SDK, but once this is generally available, just make sure you upgrade your SageMaker SDK to the latest version.
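Here's roughly what that setup cell looks like in a notebook; the exact version pins are my own assumption rather than what's shown on screen, and the leading "!" simply runs shell commands from the notebook.

```python
# Minimal setup sketch (assumed package versions): install Transformers and Datasets,
# upgrade PyTorch/TensorFlow, and make sure the SageMaker SDK is up to date.
!pip install --upgrade transformers datasets
!pip install --upgrade torch tensorflow
!pip install --upgrade sagemaker
```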
Next, we grab a bucket and a role as usual in SageMaker. The first step will be to download the dataset. This is the IMDB dataset, with 25,000 movie reviews for training and 25,000 for validation. We download it using one of the APIs from the Datasets library, and you can see how simple that is. We now have these two datasets. We can look at the first training example, which has a label of one, indicating a positive review, and the actual review text.
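In code, that step is just a few lines; this is a sketch rather than the exact notebook, but the APIs are the standard ones from the SageMaker SDK and the Datasets library.

```python
import sagemaker
from datasets import load_dataset

# Standard SageMaker setup: default bucket and execution role.
sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()

# Download the IMDB movie review dataset: 25,000 training and 25,000 validation reviews.
train_dataset, test_dataset = load_dataset("imdb", split=["train", "test"])

# Look at the first training example: the review text and its label (1 = positive).
print(train_dataset[0])
```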
To feed this data to the model, we need to convert it to a format the model understands. We're going to use a BERT variant called DistilBERT, which has already been trained on a large corpus of English text. The first step is to grab the tokenizer that was learned during that initial training. The tokenizer replaces words with numerical IDs that the model can use. We download the existing tokenizer and then tokenize the training set and the validation set. Now, this is what the first sample looks like. We can see the tokens: each word and punctuation mark has been replaced with a token ID. We also see an attention mask, where one means to take the token into account and zero means to ignore it. We see a bunch of zeros because we're padding each review to the maximum sequence length the model can work with.
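Here's what that tokenization step looks like; I'm assuming the "distilbert-base-uncased" checkpoint, since the video doesn't spell out the exact model name.

```python
from transformers import AutoTokenizer

# Grab the tokenizer that was learned when DistilBERT was pre-trained.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Pad and truncate every review to the model's maximum sequence length.
    return tokenizer(batch["text"], padding="max_length", truncation=True)

# Tokenize both splits in batches.
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# The first sample now has input_ids (token IDs) and attention_mask (1 = real token, 0 = padding).
print(train_dataset[0]["input_ids"][:20])
print(train_dataset[0]["attention_mask"][:20])
```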
Next, we rename the label column to "labels," which is what the model expects. We then upload the dataset to S3, as training data mostly lives in S3 unless you really need EFS or FSx. We can use a handy API in the datasets library to upload directly to an S3 prefix. Now we have our data in S3.
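A sketch of that step is below; the S3 prefix names are my own, and I'm assuming a recent version of the Datasets library that can write directly to an S3 URI (with s3fs installed).

```python
# Rename the "label" column to "labels", which is what the model expects,
# and keep only the columns the model needs, formatted as PyTorch tensors.
train_dataset = train_dataset.rename_column("label", "labels")
test_dataset = test_dataset.rename_column("label", "labels")
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Upload the processed datasets straight to S3 (hypothetical prefixes).
training_input_path = f"s3://{bucket}/imdb/train"
test_input_path = f"s3://{bucket}/imdb/test"
train_dataset.save_to_disk(training_input_path)
test_dataset.save_to_disk(test_input_path)
```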
This is our training script, which uses script mode. We pass hyperparameters as command line arguments and read the locations of the training and test sets from environment variables. This is all you need to add to interface your code with SageMaker. The rest is vanilla Hugging Face code. We download the pre-trained model, set training arguments such as epochs, batch size, and learning rate, and configure the training job using the Trainer API. We then train, evaluate on the test set, and save the model.
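Here's a condensed sketch of what such a script-mode training script looks like; argument names and defaults are illustrative, not the exact ones from the video.

```python
# train.py -- sketch of a script-mode training script for SageMaker.
# Hyperparameters arrive as command-line arguments; data and model locations
# arrive in SageMaker's SM_CHANNEL_* and SM_MODEL_DIR environment variables.
import argparse
import os

from datasets import load_from_disk
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train-batch-size", type=int, default=32)
    parser.add_argument("--learning-rate", type=float, default=5e-5)
    parser.add_argument("--model-name", type=str, default="distilbert-base-uncased")
    parser.add_argument("--training-dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test-dir", type=str, default=os.environ["SM_CHANNEL_TEST"])
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    args = parser.parse_args()

    # Load the datasets that were uploaded to S3 and copied into the container.
    train_dataset = load_from_disk(args.training_dir)
    test_dataset = load_from_disk(args.test_dir)

    # Download the pre-trained model with a two-class classification head.
    model = AutoModelForSequenceClassification.from_pretrained(args.model_name, num_labels=2)

    # Vanilla Hugging Face training configuration.
    training_args = TrainingArguments(
        output_dir=args.model_dir,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        learning_rate=args.learning_rate,
    )
    trainer = Trainer(model=model, args=training_args,
                      train_dataset=train_dataset, eval_dataset=test_dataset)

    # Train, evaluate on the test set, and save the model for SageMaker to package.
    trainer.train()
    trainer.evaluate()
    trainer.save_model(args.model_dir)
```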
We define hyperparameters, ensure we use the proper container for Hugging Face, and use the new Hugging Face estimator, passing the training script and training on a GPU instance for one epoch. We call fit, and the model is saved in S3. We can easily retrieve it from the known S3 location, copy it locally, and load it with the Hugging Face API. We see the DistilBERT model with a classification head at the end that outputs two scores, one for the negative class and one for the positive class.
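Here's a sketch of that estimator, assuming the train.py above; the instance type and framework versions are illustrative assumptions, so pick the combination supported by the Hugging Face containers in your region.

```python
from sagemaker.huggingface import HuggingFace
from transformers import AutoModelForSequenceClassification

# Hyperparameters passed to train.py as command-line arguments.
hyperparameters = {
    "epochs": 1,
    "train-batch-size": 32,
    "model-name": "distilbert-base-uncased",
}

# The Hugging Face estimator: one GPU instance, one epoch (versions are assumptions).
huggingface_estimator = HuggingFace(
    entry_point="train.py",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters=hyperparameters,
)

# Launch the training job on the train and test channels we uploaded to S3.
huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})

# The trained model lands in S3 as model.tar.gz; copy and extract it locally, e.g.:
#   aws s3 cp <model_data_uri> . && tar xvf model.tar.gz
print(huggingface_estimator.model_data)

# Then load it locally with the Hugging Face API (from the extraction directory).
model = AutoModelForSequenceClassification.from_pretrained(".")
```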
Let's try a prediction. If you think "The Phantom Menace" was a really bad movie, we can predict that. First, we tokenize the sample and forward it through the model. We get logits, and by applying the softmax function, we turn them into probabilities between 0 and 1. The top probability is at index 0, indicating a negative review. If you think Jar Jar rocks, the highest probability is at index 1, indicating a positive review.
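A sketch of that local inference step, assuming the model and tokenizer loaded above:

```python
import torch
from transformers import AutoTokenizer

# Tokenize a sample review and return PyTorch tensors.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
inputs = tokenizer("The Phantom Menace was a really bad movie.", return_tensors="pt")

# Forward the sample through the model without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)

# Turn the logits into probabilities: index 0 = negative, index 1 = positive.
probabilities = torch.softmax(outputs.logits, dim=1)
print(probabilities)
```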
The last thing I want to show you is the same example using distributed training. We can use the data parallelism library launched at re:Invent. This time, I'm training on two very large P3 instances. The only thing I have to do is add a parameter to the estimator. No changes to my code or training script are needed. Training now runs across those two instances. If you have very large training jobs and want to fine-tune Hugging Face models, just enable data parallelism. It's as easy as that.
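Concretely, the only difference from the single-instance estimator is the distribution parameter and the bigger instances; everything else, including train.py, stays the same. Instance type and versions below are again assumptions.

```python
from sagemaker.huggingface import HuggingFace

# Same estimator as before, now on two large P3 instances with the
# SageMaker data parallelism library enabled via the distribution parameter.
huggingface_estimator = HuggingFace(
    entry_point="train.py",
    instance_type="ml.p3dn.24xlarge",
    instance_count=2,
    role=role,
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters=hyperparameters,
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})
```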
That's pretty much what I wanted to show you today. Hope that was useful, and I'll see you around. Bye.