Train tabular models automatically with Hugging Face AutoTrain

April 21, 2022
In this video, I use AutoTrain, an AutoML product designed by Hugging Face, to train a multi-class classification model on tabular data (the PetFinder dataset from Kaggle).

Original dataset: https://www.kaggle.com/competitions/petfinder-adoption-prediction/
Dataset on the hub: https://huggingface.co/datasets/juliensimon/autotrain-data-petfinder-demo
Model on the hub: https://huggingface.co/juliensimon/autotrain-tabular-demo-762523398
Notebook: https://github.com/juliensimon/huggingface-demos/tree/main/autotrain-petfinder

New to Transformers? Check out the Hugging Face course at https://huggingface.co/course

Transcript

Hi everybody, this is Julien from Hugging Face. In this video, I'm going to talk about AutoTrain, our AutoML service. This service was formerly known as AutoNLP, and we renamed it. Don't worry, there's more to this video than a name change. The reason we renamed it is that the service is not just about building NLP models anymore. We can now work with tabular data, and in this video, I'm going to walk you through a simple example where I start from tabular data stored in CSV files and build a multi-class classification model in just a few clicks. Let's get started.

First, we need a dataset. In this example, I'm going to use a well-known dataset called the PetFinder dataset. Each entry in this dataset describes a pet, and the label is the adoption speed for that pet. We can use this dataset to train a multi-class classification model. You can easily download it from Kaggle as usual. If we take a quick look, we have a training set and a test set. I'll only use the training set; the test set is for the competition. It is a CSV dataset with a number of features: Type, Name, Age, Breed1, Breed2, Gender, Color1, Color2, Color3, and so on. We also see the AdoptionSpeed label, which is an integer ranging from 0 to 4, if I'm not mistaken. That's the starting point. We're going to use this dataset as is, with no preprocessing, and get to work.

Next up is the Hugging Face Hub, where of course you need to create an account and log in. Once you've done that, you can go to "New AutoTrain Project" and click on "New Project." Let's give it a name; we'll call this "PetFinder Demo." Then we pick a task type. We see all the previous ones inherited from AutoNLP and the new ones for tabular data: binary classification, multi-class classification, multi-label classification, and regression. Here, we want multi-class. Let's create the project. We can upload either CSV data or JSON Lines data; I'm going to work with CSV here. I'll select my training file and let AutoTrain split it for training and validation. We have about 15K rows, so that should be enough. We need to map the columns: the target is the label we want to predict, which for me is AdoptionSpeed, and the optional ID is PetID. I'll add this to the project. I can add more files if I want to, but I'm done here. Interestingly, we can also pick datasets directly from the Hub: if you store your datasets on the Hub, you can go and pick one here.

Let's go to "Trainings" and train 15 models. We could go anywhere from 5 to 100 models, but let's stick with the default. I see the budget for this, and I'll start training. Confirm, and off it goes. Within seconds, we should see those 15 jobs starting, and they should start reporting metrics. After a little while, we should see the winner. I have some information on data processing and splitting for training and validation. The dataset is pushed to the Hub automatically, and now the jobs start. If I click here, I can see my training data stored on the Hub. If I click here, I go back to "Trainings" and can see my jobs. Pretty soon, they start reporting metrics. In the interest of time, I've already run this exact same configuration, so let me switch to the completed job and see how we did. This is the one I ran previously. Some jobs were stopped early because they were not converging or something else was not right; a few jobs did complete.
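As an aside, a quick way to sanity-check the training CSV before uploading it to AutoTrain is to load it with pandas. This is a minimal sketch, not something shown in the video; the column names (Type, Age, Breed1, ..., AdoptionSpeed, PetID) come from the Kaggle PetFinder dataset, and the file path is an assumption.

```python
# Minimal sketch (not from the video): inspect the Kaggle training CSV before upload.
# Assumes train.csv from the PetFinder competition is in the current directory.
import pandas as pd

df = pd.read_csv("train.csv")
print(df.shape)                             # roughly 15K rows
print(df.columns.tolist())                  # Type, Name, Age, Breed1, Breed2, Gender, Color1, ...
print(df["AdoptionSpeed"].value_counts())   # the target label, an integer from 0 to 4
```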
If I go to the metrics, I can see different metrics: loss, accuracy, precision, recall, F1, and their macro and weighted versions. I'll stick with the loss. This is the winner, the "salamander" one. Let's go back to "Trainings." This is the model that won, and we can view it on the model hub. Let's open it on the Hub. Here we see the winning model. All these models are pushed automatically to the Hub and are private. We see tags telling us they've been trained with AutoTrain, as well as an XGBoost tag. We also see CO2 emissions and the metrics.

Now let's try this model. We have a code snippet to use it. I'll clone the repo, which I've already done, but let me do that again: grab the name and clone it. There we go. If we go in there, we see the model. Now I can switch to a notebook, which I've prepared. Here it is. We can test the model. I'm testing this locally on my machine, so I want to make sure I've got scikit-learn, joblib, and xgboost installed. At the time of recording, you need xgboost 1.5; if you watch this later, it might have been upgraded to 1.6 or a later version. If you see xgboost errors when you load the model, just make sure you have the appropriate version. The dependencies are listed in the config, and I grab the features. I can take a look at the model. As expected, it's a pipeline: a preprocessor step with the usual column transformations (imputing numerical values, scaling numerical values, dealing with categories, etc.), and then the actual model, an XGBoost model. We see all the hyperparameters that were used, and they're in the config file as well if you want to replicate this. Now I can just load my test set, remove the label, and keep the features, because I want to predict. There's one tiny modification we need to make: the input features for the transformation have the "feat_" prefix, so we want to make sure the test set has it as well, by adding the prefix to all the columns. Now I'm fine; I can see all those columns. Then I just predict, like I would with any scikit-learn model, and voilà, here are the predictions.

This is really simple. All you need to do is start from your dataset in CSV or JSON Lines format, upload it in a couple of clicks to the Hugging Face UI, select the task type, select the number of jobs, click, wait a little while, and then you see the winning model and can grab it on the Hub. The dataset has been pushed to the Hub as well. Then you can clone that repo and work with the model just like you would work with any scikit-learn or XGBoost model. Super simple; I like that. Go to the Hugging Face Hub and give it a try. If you have comments, feedback, or questions, leave a comment, and I'll try to answer as many as I can. Thanks a lot for listening, and I'll see you next time. Keep rocking.
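For reference, the local prediction steps described in the walkthrough boil down to something like the following. This is a minimal sketch, not the notebook itself: the model file name (model.joblib), the local paths, and the exact columns to drop are assumptions, so adjust them to match your clone of the repo.

```python
# Minimal sketch of the local prediction steps (not the exact notebook from the video).
# Assumes the cloned model repo contains a model.joblib and that compatible versions
# of scikit-learn, joblib, and xgboost are installed.
import joblib
import pandas as pd

# Load the pipeline (preprocessing + XGBoost classifier) from the cloned repo.
model = joblib.load("autotrain-tabular-demo-762523398/model.joblib")
print(model)  # inspect the preprocessor step and the XGBoost step

# Load the data to score and keep only the input features.
data = pd.read_csv("test.csv")
data = data.drop(columns=["AdoptionSpeed", "PetID"], errors="ignore")

# The pipeline's column transformer expects columns prefixed with "feat_".
data = data.add_prefix("feat_")

predictions = model.predict(data)
print(predictions[:10])
```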

Tags

AutoML, Tabular Data, Multi-Class Classification, Hugging Face, AutoTrain