Managing datasets and models in your Hugging Face organization

November 18, 2021
In this video, I show you how to manage models and datasets in your own Hugging Face organization:

- Creating your organization
- Creating private repositories with the Hugging Face CLI
- Importing models and datasets with git
- Editing dataset and model cards
- Setting permissions for organization members

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️

Dataset: https://huggingface.co/datasets/reuters21578
Model: https://huggingface.co/juliensimon/autonlp-reuters-summarization-31447312

New to Transformers? Check out the Hugging Face course at https://huggingface.co/course

Transcript

Hi everybody, this is Julien from Hugging Face. In this video, I would like to show you how to use the model management and dataset management capabilities of the Hugging Face Hub, with a focus on working with an organization and setting access permissions. In large companies with different machine learning teams, it's important to ensure that specific datasets and specific models are only accessible to the appropriate people. We're going to start from a dataset and a model on the Hub. I'm going to reuse the dataset and model from a previous AutoNLP video, but you could apply this to any model and any dataset. Let's get started.

First, let's talk about the organization. You can see my page here on the Hub, and I'm part of a few organizations. I created one called julsimon-test, and some of my colleagues have joined to help me demonstrate a few things. It's all empty for now: no models, no datasets. You can go and create your own organization right here. It takes five seconds, and then you can invite team members to that org. It's super simple. If you look at the settings, it's really just a name, a logo, a contact email for billing, and the team members. We'll come back to that a little later. There's no subscription for now, and my API token is here. You can set all of this up in about ten seconds.

Now, what can we do with this? First, let's find a dataset and a model that we like on the Hub. Browsing the Hub, I found a dataset I'd like to import into my organization: the Reuters dataset, which I used in a previous video to train a summarization model with AutoNLP. Again, you could use any dataset. The plan is to grab a copy of the dataset, create a repository in my org, and push the dataset there. I'm going to push both the original version, to keep a fresh copy, and a processed version.

Let's do this in a notebook. To get started, we can easily import the dataset into our environment, and I'm going to save a fresh copy to my local disk. We have a training set and a test set. Exactly as I did in the AutoNLP video, I'm going to remove some columns, rename some columns, and apply a cleaning function to remove unwanted characters so that the text looks nicer. This is typical of the cleaning and data prep work you would do in a local environment. I also want to save a processed version of the data, for versioning and to avoid redoing this work later. Looking at my local environment, I now see the original version and the processed version. The first sketch below shows what these steps could look like.

Now let's create a repository for this dataset in my org and push those versions. First, I need to log in to the Hugging Face Hub, which I've done. Then I can create a repository in my organization: `repo create`, the organization name, the name of the repo, and the dataset flag to indicate this is a dataset repo. I can see the repo in my org now, and it's empty. Next, I clone that repo and start adding files. I'll first commit the original version of the dataset: we copy everything over, and we see a dictionary with the training and test sets. I add everything, and this is the first version, which I can push to my repo. If I go back to my page, I'll see the first version of this dataset. I could create a dataset card here, for example "Reuters dataset for my projects"; you would probably want to be more descriptive than that. Now, let's bring in the processed version. I need to pull first, because I updated the README file on the Hub, and then I copy in the processed version of the files. The second sketch below shows these repository steps.
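As a rough illustration, here is what the notebook steps above might look like. This is a minimal sketch, not the exact code from the video: the `ModApte` configuration name, the column choices, and the cleaning function are assumptions.

```python
# Sketch of the data prep steps. The config name, column names, and
# cleaning logic are illustrative assumptions, not taken from the video.
import re
from datasets import load_dataset

# Import the dataset and keep a fresh copy on local disk
dataset = load_dataset("reuters21578", "ModApte")  # train and test splits
dataset.save_to_disk("reuters-original")

# Keep only the columns we need for summarization (assumed here)
keep = {"title", "text"}
dataset = dataset.remove_columns(
    [c for c in dataset["train"].column_names if c not in keep]
)
dataset = dataset.rename_column("title", "target")

def clean(example):
    # Remove unwanted characters and collapse whitespace
    example["text"] = re.sub(r"\s+", " ", example["text"]).strip()
    example["target"] = re.sub(r"\s+", " ", example["target"]).strip()
    return example

dataset = dataset.map(clean)

# Save the processed version separately, for versioning
dataset.save_to_disk("reuters-processed")
```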
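And here is a sketch of the repository steps, using the `huggingface_hub` Python API as an equivalent of the CLI and git commands shown in the video. The organization name (`julsimon-test`) and the repo layout are placeholders.

```python
# Sketch: create a dataset repo in the org, clone it, and push a first
# version. Assumes you are already logged in (`huggingface-cli login`).
# Org and repo names are placeholders.
import shutil
from huggingface_hub import create_repo, Repository

# Create an empty dataset repository under the organization
create_repo("julsimon-test/reuters21578", repo_type="dataset")

# Clone it locally; Repository wraps the usual git commands
repo = Repository(
    local_dir="reuters21578-repo",
    clone_from="julsimon-test/reuters21578",
    repo_type="dataset",
)

# Copy the original files into the clone, then add, commit, and push
shutil.copytree("reuters-original", "reuters21578-repo/data", dirs_exist_ok=True)
repo.git_add(".")
repo.git_commit("Add original version of the dataset")
repo.git_push()

# Later: overwrite these files with the processed version and push again;
# the Hub keeps the full git history of the dataset.
```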
You want to overwrite the previous files, not store them in different folders; that's the old way of doing things. What you want instead is the Git workflow, with versions, commits, and so on. If we look at the repo now, all the files have been updated, so I can add them again, commit them, and push them. Going back to my repository, I now see the history. If you're familiar with Git, there's nothing surprising here, but for machine learning datasets this is a really good way to work: tracing changes, documenting commits, and keeping the full history of your datasets is important. The dataset is now in my org, and I can go to the settings and make it private, so that only organization members can see it.

Now let's do the same with the model. Here's a model I like, because I trained it with AutoNLP, but you could do this with any model on the Hub. I'm going to clone that repo, create a new repo in my org, and move the model there. Let's try this. I clone the model locally; it's a big model, so it takes a few seconds. Once cloning is complete, I can see the model files. The next step is to create a model repository in my org: `repo create`, the organization name, and the model name. That's a new repo, and I clone it as well. It's empty, and if I go to my org page, I see this new repository. Let's copy everything in, `git commit` the initial version, and `git push`. It takes a few seconds. Now, if I go back to the repo page, I see all the files. I have a model card, because it was already in the original repo, and I see the version. I could edit the model card to provide additional information, so let's do that: commit changes, done. Then I pull again locally, and my README is up to date; the model card has been updated. The model is now in the org, so let's make it private too. The first sketch below shows this migration.

We can now work with this model, fine-tune it, and use it for our own projects. I'm not going into fine-tuning here, but I'll show you that I can download the model from the Hub. Notice that I'm not using the original version; I'm using the organization version. I can fetch that, and we could also load from a local copy. We're going to use the pipeline API and use the pipeline to predict. This text comes from a Yahoo article, and we can try to summarize it. It takes a few seconds, and we can control the minimum and maximum length of the summary. There it is, we see the summary. This is all local work; we could fine-tune the model further and push it back to the Hub, keeping all our versions properly traced. The second sketch below shows the inference step.
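Here's roughly what the model migration could look like, scripted with the `huggingface_hub` Python API. In the video the original model is cloned with git; here `snapshot_download` stands in for that step, and the target repo name in the org is hypothetical.

```python
# Sketch: copying a Hub model into the organization. The target repo
# name "reuters-summarization" is a placeholder.
import shutil
from huggingface_hub import Repository, create_repo, snapshot_download

# Grab a local copy of the original model (the video uses `git clone`)
local_model = snapshot_download("juliensimon/autonlp-reuters-summarization-31447312")

# Create an empty model repo in the org and clone it
create_repo("julsimon-test/reuters-summarization")
repo = Repository(
    local_dir="reuters-summarization",
    clone_from="julsimon-test/reuters-summarization",
)

# Copy the model files in, then commit and push the initial version
shutil.copytree(local_model, "reuters-summarization", dirs_exist_ok=True)
repo.git_add(".", auto_lfs_track=True)  # large files go through git-lfs
repo.git_commit("Initial version")
repo.git_push()
```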
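And a minimal sketch of the inference step, assuming the placeholder repo name from the previous sketch; the article text and the length values are illustrative.

```python
# Sketch: summarizing with the organization's copy of the model.
# Repo name is a placeholder; min/max lengths are illustrative.
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="julsimon-test/reuters-summarization",  # org copy, not the original
)

article = """(long news article text goes here)"""

# Control the size of the generated summary
result = summarizer(article, min_length=25, max_length=100)
print(result[0]["summary_text"])
```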
The last thing I want to show you is permissions. Going to my org, I see the team members. I'm the admin, so I have full access: I can manage members and use all the datasets and models. My colleague Kunal is contributing to the project, so he needs write access to the models and datasets: he can clone the repos and push new versions. I could change these roles if I wanted to. Brian is just using the models we work on, so he has read permission: he can clone and pull from the repos, but he can't push. Maybe Brian is part of a different team, and we invited him to our org so he can benefit from the models we trained, but he's not allowed to modify them. This is how you would organize work: we recommend creating an organization for each team, where most team members have write permission so they can contribute, and where guests from other teams can reuse models, or not, depending on your compliance rules. Maybe Brian would have his own organization, with his own team members working on their own models, and I wouldn't be part of that.

So, in a nutshell: one org per team, using admin, write, or read permissions for team members to control model and dataset management. It's a good model, and hopefully it'll work for you. That's what I wanted to tell you today. As you can see, there are lots of model management and dataset management features on the Hub, and we'll keep adding more. If you have questions, please ask them in the comments or get in touch. Until next time, keep learning!

Tags

Hugging Face Hub, Model Management, Dataset Management, Organization Permissions, Machine Learning Teams