Log with MLflow and Hugging Face Transformers

July 15, 2022
In this video, I show how to use MLflow (https://mlflow.org) with the Transformers library, and why it's a good idea to store the logs on the Hugging Face Hub :) Let us know how you'd like to use MLflow with Transformers and the Hugging Face Hub! Please join the discussion at https://discuss.huggingface.co/t/calling-mlflow-users/20420

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️

Doc: https://huggingface.co/docs/transformers/v4.20.1/en/main_classes/callback#transformers.integrations.MLflowCallback
Notebook: https://github.com/juliensimon/huggingface-demos/tree/main/mlflow

Transcript

Hi everybody, this is Julien from Hugging Face. MLflow is a really popular open-source project that helps you keep track of your machine learning experiments and visualize them. In this video, I'd like to show you how easily you can use the Transformers library with MLflow. We're going to run a notebook, log some metrics, and visualize them. So let's get started.

Here's what we're going to do. We're going to start with a simple Transformers training job — I'm fine-tuning DistilBERT on the IMDB dataset for movie review classification — but you could use your own code; it doesn't matter what training job we run here, this is completely generic. We're going to run it, and it's going to log some MLflow information: parameters, metrics, and so on. We're going to visualize that information, because it's stored in a simple format that's easy to load. We're going to push everything to the Hugging Face Hub — the model and the logs — because it's good practice to keep everything in the same place. Then I'll show you how you can grab that information from the Hub and load it in the MLflow UI. As you'll see, it's all super simple.

Let me go full screen here. First things first, we install some dependencies. As you will see, you don't need to change any of your Transformers code to use MLflow. The only requirement is that MLflow is installed — and please make sure it's installed in the actual environment where your Python code runs, meaning your virtual environment or your Conda kernel. If it's installed outside of that and your Python code cannot see it, it's not going to work. Here I kept things simple and installed it from the notebook, but in this case you would want to install it in your Conda kernel. That's really the only gotcha, I think. Then we need Git LFS enabled, because we're going to push the model back to the Hub, so we run the usual commands. Next, we import a few things and log in to the Hugging Face Hub. Here I'm using notebook_login, but you could use the Hugging Face CLI instead. That's what allows us to push files to the Hub.

Now we move on to the actual training job. I'm loading the IMDB dataset, which I'm sure you're familiar with: movie reviews, either positive or negative. Very good data to experiment with, and not too big. Then I'm loading the model and the tokenizer for DistilBERT, with two labels — again, positive and negative. I think that's the default value, but I always prefer setting it explicitly, just to build the habit and avoid problems when I have more than two labels. Then there's a tokenizing function to tokenize the movie reviews in the dataset — pad and truncate — which we apply to the training set and the test set. Business as usual here. Then a very simple metrics function to compute accuracy.

Then I can configure my MLflow experiment a little. In a nutshell, MLflow organizes your projects as experiments and runs, and one experiment can have multiple runs. The first thing we need to do is set the experiment name. Where did I find this slightly obscure environment variable? In the Transformers documentation: if you go to the callbacks page and look for the MLflow callback, you'll find a section that tells you a little bit about the configuration you can do. You can decide to log artifacts, set the experiment name, add tags to your run, use nested runs, attach to an existing run ID, and flatten parameters — that is, log nested parameter dictionaries as individual flattened parameters rather than one single dictionary. That's all there is to it; there's nothing else to configure.
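For reference, here is a minimal sketch of the setup steps just described — installing the dependencies, logging in, loading IMDB and DistilBERT, and a simple accuracy metric. The exact cells are in the linked notebook; the checkpoint name (distilbert-base-uncased) and the column name are assumptions on my part.

```python
# A minimal sketch of the setup steps (install, login, data, model, metrics).
# The exact cells live in the linked notebook.

# %pip install transformers datasets mlflow   # install in the same kernel/environment that runs this code

import numpy as np
from datasets import load_dataset
from huggingface_hub import notebook_login
from transformers import AutoModelForSequenceClassification, AutoTokenizer

notebook_login()  # or run `huggingface-cli login` in a terminal

checkpoint = "distilbert-base-uncased"   # assumed checkpoint for "DistilBERT"
dataset = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(examples):
    # Pad and truncate the movie reviews so they all fit the model's input size
    return tokenizer(examples["text"], padding="max_length", truncation=True)

train_dataset = dataset["train"].map(tokenize, batched=True)
eval_dataset = dataset["test"].map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # Plain accuracy: fraction of predictions that match the labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}
```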
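And here is a sketch of how that configuration translates into environment variables read by the MLflow callback, per the callbacks documentation linked above. The values are illustrative; in the video, only the experiment name and parameter flattening are actually set.

```python
# Environment variables read by transformers.integrations.MLflowCallback
# (see the callbacks doc linked above). Values here are examples only.
import os

os.environ["MLFLOW_EXPERIMENT_NAME"] = "distilbert-imdb"   # experiment to log runs under (name is an example)
os.environ["MLFLOW_FLATTEN_PARAMS"] = "TRUE"               # log nested dicts as individual flattened parameters
# os.environ["HF_MLFLOW_LOG_ARTIFACTS"] = "TRUE"           # also upload output artifacts (useful with a remote tracking server)
# os.environ["MLFLOW_TAGS"] = '{"team": "demo"}'           # dict of tags to attach to the run
# os.environ["MLFLOW_NESTED_RUN"] = "TRUE"                 # nest this run under the currently active run
# os.environ["MLFLOW_RUN_ID"] = "<existing run id>"        # attach to an existing run instead of creating one
```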
Here, I just went for setting the experiment name and flattening the parameters. Since I'm logging locally, the model is already saved by the Trainer API, so it's not really useful to log artifacts. But if you were working with an external MLflow server, you could send those artifacts to the server, and the server could store them wherever it wants — in S3, locally, and so on. Here I just went for the simple use case: I run everything locally, I don't have a tracking server, and I just log to local storage.

Next, the training arguments: we'll train for one epoch, and this is where we'll push the model. That's about it. Then we put everything together with the Trainer API. As you can see, there's really nothing about MLflow here; the only thing I did was name the experiment. The MLflow callback is enabled by default, provided MLflow can be found in your environment. You could tweak it and add it explicitly if you wanted to, but here I'll just go with the defaults.

Then we launch training. I can see a message telling me it's creating a new experiment because it does not exist yet; otherwise, it would have added the run to the existing experiment. It trains for a little while, I see some logging information and the model being saved, and that's it. So for Transformers, as usual, no change — and hopefully, in the background, some information was logged to MLflow.

We do need to end the run. I read in the documentation that the callback does not explicitly close the run — it's probably on that same page; in any case, I did not invent it. Closing the run means that a new iteration of the notebook will simply generate a new run.

Then we push everything to the Hub. At this point, I have a new model repository with the model card and the model files in it. What I want to do is keep the MLflow information — the mlruns folder — in the same repo, because it's all useful information about how the training job went, the parameters, and so on. What I'm doing is super simple: I just copy the mlruns folder, which MLflow creates locally as the job runs (again, I'm not using a tracking server), into the output directory, which is where the model repo lives, and then I add, commit, and push. As you would expect, I see the mlruns folder in the repo — well, only one run, really — with my metrics and everything else. Now you have everything in the same place, which is practical.

What else could we do with that information? Well, we'd want to visualize it. We have the information locally, and the format of those files is really simple. Let's look at one of them, say the loss. There you go: it's CSV with a space separator — an SSV, a space-separated value file, if you like. That means I can very easily load it with pandas: read_csv with the metric file, a space separator, and names for the columns, and that's it. I can plot it, with steps on the x-axis and values on the y-axis. So with those local MLflow files, I can build my plots right there in the notebook — I don't even need the MLflow UI for that. I can plot the learning rate, plot all of it. And if I had multiple runs, it would be very easy to modify that function and plot everything in the notebook.
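Here's a condensed sketch of the training and publishing steps just described, reusing the objects from the setup snippet above. The output directory / repo name is a placeholder, and pushing the mlruns folder with huggingface_hub's Repository class is just one way to do the add/commit/push step.

```python
# Condensed sketch of training, closing the MLflow run, and pushing everything
# to the Hub. Reuses model, tokenizer, train_dataset, eval_dataset and
# compute_metrics from the setup sketch; the repo name is a placeholder.
import shutil

import mlflow
from huggingface_hub import Repository
from transformers import Trainer, TrainingArguments

output_dir = "distilbert-imdb-mlflow"   # also becomes the Hub repo name

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1,     # one epoch is enough for this demo
    push_to_hub=True,       # the Trainer clones/creates the Hub repo in output_dir
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# No MLflow-specific code here: the MLflowCallback is added automatically
# as long as the mlflow package is importable in this environment.
trainer.train()

# The callback does not close the run, so end it explicitly; re-running the
# notebook will then create a fresh run in the same experiment.
mlflow.end_run()

# Push the model card and model files to the Hub...
trainer.push_to_hub()

# ...then copy the local mlruns folder into the repo clone and push it too,
# so the logs live next to the model. You could also just run
# git add / commit / push in a terminal instead of using Repository.
shutil.copytree("mlruns", f"{output_dir}/mlruns")
repo = Repository(local_dir=output_dir)
repo.push_to_hub(commit_message="Add MLflow runs")
```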
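And a small sketch of the pandas plotting trick, assuming the default local mlruns layout (mlruns/&lt;experiment_id&gt;/&lt;run_id&gt;/metrics/&lt;metric name&gt;); the plot_metric helper is mine, not the notebook's.

```python
# Plot a logged metric straight from the local mlruns files, without the MLflow UI.
# Each metric file has one "timestamp value step" line per logged point; the glob
# pattern assumes the default local mlruns layout and a single run. Plotting
# relies on matplotlib being available, as it usually is in a notebook.
import glob

import pandas as pd

def plot_metric(metric_name):
    # e.g. mlruns/<experiment_id>/<run_id>/metrics/loss
    path = glob.glob(f"mlruns/*/*/metrics/{metric_name}")[0]
    df = pd.read_csv(path, sep=" ", names=["timestamp", "value", "step"])
    df.plot(x="step", y="value", title=metric_name)

plot_metric("loss")
plot_metric("learning_rate")
```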
Now, obviously, you'll also want to use the MLflow UI itself. All we need to do is grab a terminal, go to the repo, and zip the mlruns folder — it's not big, because it doesn't contain the artifacts. Then I can download it to my local machine, my MLflow server, or anywhere convenient, and move it there. Now I've got my mlruns, and if I just run mlflow ui, I should see the whole thing come to life — I just need to open the URL it prints. I see my MLflow experiment, with only one run. We can open it, and sure enough, we see all the parameters for the training job. We also see the metrics, and we can click on any of them to see the plot.

That's a really easy way to document your Transformers training jobs: just make sure you have MLflow in your environment, and save those runs to the model repo. Here I zipped them and downloaded them from the notebook, but they live in the repo, so any time you clone it, they come along. Let's prove that point and clone the repo. Once it's cloned, we can see it here, and of course it has the mlruns folder. If we start the MLflow UI again and open it, we see the same information. So that's a really convenient way to keep all those logs next to the model, in the same place. Very easy, very simple, and there's really no code to write, which I always enjoy — I like being lazy.

I'm sure there's more we could do with MLflow. I'll create a discussion on the Hugging Face forum and include the link in the video description. If you have ideas on how we could make MLflow and Transformers work better together — maybe even integrate some visualization on the Hugging Face side, why not — join the discussion and let us know what you'd like. In any case, I think this is already useful, and we can always do more. Well, I hope this was interesting and useful. Thanks for watching, and I'll see you soon. Bye-bye.
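For reference, here's one way to pull those logs back down later and open them in the MLflow UI. The video does this by zipping the folder and cloning the repo with git; the sketch below uses snapshot_download instead, and the repo id is a placeholder.

```python
# One way to re-open the logs later: download the model repo from the Hub and
# point the MLflow UI at the mlruns folder inside it. The repo id below is a
# placeholder — use your own username/model.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="your-username/distilbert-imdb-mlflow")
print(local_path)

# Then, in a terminal:
#   cd <that local path>
#   mlflow ui          # by default it serves ./mlruns; open the printed URL in a browser
```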

Tags

MLflow, Hugging Face, Transformers, Machine Learning, Model Tracking