SageMaker Fridays Season 4, Episode 12: Using AutoML to build a multimodal classification model

October 22, 2021
Broadcast on 22/10/2021. Join us for more episodes at https://pages.awscloud.com/SageMakerFridays. In this episode, we use AutoGluon, an open-source library for AutoML, to automatically build a classification model on the multimodal PetFinder dataset (image, text, and columnar data). Notebook: https://github.com/juliensimon/amazon-studio-demos/tree/main/sagemaker_fridays/season4/s04e10

Transcript

Hi, everybody, and welcome to this new episode of SageMaker Fridays, Season 4. If you've been following us so far, you'll notice I'm on my own again. Well, my friend Sego had a baby, and she's busy with the baby. Who could blame her? It's more important than machine learning. But, hey, I'll do my best to help you learn a few things today. So this is actually the last episode for now. It's also the last of our AutoML segment. I think it's a pretty cool one because today we will keep using AutoGluon, the open-source library we've been discussing for a few weeks now. We are going to use it to train a model for a multi-modal dataset. This dataset contains images, natural language, and tabular data. We're going to see how we can train an ensemble of models on this dataset. As usual, it will only take a few lines of code. Okay, so that's pretty cool. Let me show you where to get the code for today's demo. You'll find it in this repository. Just go and grab it, tweak it, and run it yourself to learn as much as you can. In case you haven't watched the previous episodes on AutoGluon, here's a quick recap. AutoGluon is an open-source library designed by Amazon teams. It lets you build models automatically on tabular data, image data, NLP data, and, as we will see today, a combination of those, which I think is a really interesting feature. Code, samples, and documentation are on GitHub. There's a good research paper that tells you more about the design principles and unique features of AutoGluon. I highly recommend it. It's a really good read. And that's about it. It's literally two or three lines of code to build sophisticated models. Okay, let's dive into the code right now. This is one of the tutorials from AutoGluon. You'll find this example on the AutoGluon website, along with API information and additional examples. So that's the topic we're focusing on today: how can we very easily train models on datasets that contain a mix of data types?
In this dataset, which is the PetFinder dataset, we'll find tabular data, such as integers and categorical data. We have lots of good algorithms to train models on this. We've done this in a previous episode, and data scientists know how to do this very well. We have natural language, specifically a long text description that cannot be treated as categorical data. It's language, so we want to apply natural language models to it. There is also a picture, and we want to use computer vision algorithms for that. You could break this dataset into three parts: train traditional algorithms on the tabular data, train NLP algorithms on the text, and train computer vision algorithms on the image. Then, you could combine them for ensemble prediction. I'm sure you would get decent results. However, combining the models in the same training job, which is what AutoGluon does, is actually easier. We'll see very little code, and it lets you leverage the fact that this is part of the same dataset. The tabular data, text, and image are related, so you can learn better by treating it as a single dataset rather than three sub-datasets. That's what the multi-modal problem really is. We're going to try to do this. Let's take a look at the dataset first. I'm working with Studio, just like last week, using a GPU instance because we're going to be fine-tuning computer vision and NLP algorithms, which are deep learning-based. GPUs are important for performance. I'm using a small one here, the g4dn.xlarge, which is the most cost-effective GPU instance you can run on Studio. It's a good choice for experimentation. For larger scale, we could switch to P3 instances with multiple GPUs, but for now, this is good enough. First, install AutoGluon. I'm uninstalling Torch because having Torch and MXNet installed in the same environment can cause issues with the GPU. Let's make sure we have a single library and a single set of drivers.
Alright, AutoGluon, and then we just specify a directory where we're going to download our dataset. It's in S3, and it's a zip file. It's the PetFinder dataset from Kaggle. We can easily download and unzip this dataset to our local instance, and this is what we get: CSV files with the tabular data, and image directories. We can load the CSV files and display some of that data. This is pet data, and we're trying to predict the adoption status for each animal, that is, how quickly the animal is likely to be adopted. We have different classes, which we'll see here. Adoption speed, right? We have five different speeds. In this dataset, we have the name of the animal, its age, breed (a category), gender, and color codes, all of which are categorical variables. Quantity, I'm guessing one of each, unless we have twins with the same name, which would be kind of weird. Rescuer ID, description (a natural language text field), the label (adoption speed), and a path to the image. That's what this dataset looks like. The label is adoption speed. We can try to display some images. There's very little processing we need to do here. The images column actually contains multiple images per animal, so we'll keep only one image, specifically the first image from the list. We'll also fix the image names to include the full path because that's what AutoGluon expects. We need to fix these two things. We split the images column and keep only the first image. You could argue that we're wasting data and that we don't know if it's the best image to keep. That's a fair point. We could add multiple columns with different images or duplicate the samples as many times as we have different images. I didn't consider this much, but it's something to think about. Then we just add the full path. Now we have a single image in the images column with the full path. Another look at all the features we have: lots of categories, the description, and the image.
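The two image fixes described above (keep the first image, then add the full path) can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the `;` separator, the helper name `first_image_path`, and the example paths are assumptions.

```python
import os


def first_image_path(images_field: str, base_dir: str) -> str:
    """Keep only the first image of a separated list and prefix it with base_dir.

    Assumption: the images column stores several relative paths in one string,
    separated by ';'. Adjust the separator to match the actual CSV files.
    """
    first = images_field.split(";")[0]
    return os.path.join(base_dir, first)


# Hypothetical value and directory, for illustration only.
example = first_image_path("images/pet-1.jpg;images/pet-2.jpg", "/data/petfinder")
print(example)
```

With a pandas DataFrame, the same helper could be applied to a whole column, for example `train_data["Images"].apply(lambda s: first_image_path(s, dataset_path))`, where the column name `Images` and the variable `dataset_path` are also assumptions.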
If we look at the description, it's natural language, so we'll need to use a deep learning model to extract patterns. Before we do that, we'll probably need to process the data, tokenize or vectorize it so that it can be learned by a deep learning model. We have pictures, and, of course, it's a cat. We'll have a dog later on. This dataset is pretty interesting, with lots of different things. There is no silver bullet, so no single model could be very efficient on all of this. Someone might say they can build a custom model that takes everything as input, but that's a lot of work. For many people, this is not an option, so we'll use AutoML instead. We'll train different algorithms and see how they do on this dataset. I'm working with the full dataset. If you want very short training times, you can sample the dataset and keep only 500 rows. We'll work with the full dataset. In a previous episode, we discussed how AutoGluon automatically figures out the type of each column. You ask AutoGluon to infer the data types from your dataset. This is what it comes up with: a bunch of ints, mostly categorical, some objects, and the text description. One important bit is missing: the image. It didn't pick it up because we store a string, not the image itself. We need to add a special type to the metadata and say, "Hey, this column is actually an image path." This tells AutoGluon that the weird string with slashes is a path to an image, and that it will need a computer vision algorithm. Make sure the data type is the right type for the model. Some of these integers, like age, have a sense of scale, but breed one and breed two are really categories. We can reassign the type of those categorical variables to category, which can give a nice boost in accuracy. In episode 9, this gave us a few extra percent of accuracy, which was meaningful. This is how you would do it. In this example, we'll carefully set all these to categories so that our tree algorithms know what to do with them.
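The typing steps above might look like the following sketch. It assumes the AutoGluon release from around the time of this episode, where `FeatureMetadata.from_df` and `add_special_types` are available in `autogluon.tabular`; the column name `Images` and the helper names are assumptions for illustration, and `train_data` is expected to be a pandas DataFrame.

```python
# Assumed name of the image-path column; not confirmed by the transcript.
IMAGE_COL = "Images"
# Special type telling AutoGluon that this string column is a path to an image.
IMAGE_SPECIAL_TYPES = {IMAGE_COL: ["image_path"]}


def as_categories(train_data, columns):
    """Return a copy of the DataFrame with the given columns cast to 'category'."""
    out = train_data.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


def build_feature_metadata(train_data):
    """Infer column types from the DataFrame, then flag the image-path column.

    The import is local so that this file loads even when AutoGluon
    is not installed.
    """
    from autogluon.tabular import FeatureMetadata

    feature_metadata = FeatureMetadata.from_df(train_data)
    return feature_metadata.add_special_types(IMAGE_SPECIAL_TYPES)
```

Casting to `category` before inferring the metadata means the inferred types already reflect the categorical columns.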
Now we need to tell AutoGluon that this is a multi-modal dataset. We have a default setting for this. We can grab hyperparameters for multi-modal training jobs and print them out. This will show the specific settings AutoGluon will use for different algorithms. We see default settings for LightGBM, CatBoost, XGBoost, and more. We could add algorithm-specific parameters, but we'll work with the default parameters AutoGluon sets. What's particularly interesting is AG_TEXT_NN, which picks deep learning models for NLP, and AG_IMAGE_NN, which picks computer vision models. Last week, we worked with an image dataset and trained computer vision models from GluonCV. Here, AutoGluon will try a combination of all of this: tree-based algorithms, NLP algorithms, and computer vision algorithms. We can pick models from the GluonCV and GluonNLP model zoos. We'll probably use ResNet for computer vision and a BERT variant for NLP. The more models you add, the longer it will train, so find the right balance. Pay attention to your metadata and data types. Make sure your hyperparameters and the list of algorithms are correct. Training is super easy. We call the fit API, using a TabularPredictor because our data lives in columns, even though some of those columns are not just plain data. We specify the training set, the label, the hyperparameters, feature metadata for typing, and the time limit. The shortest useful training time I found was about 70 seconds on the full dataset. If you want to train for shorter periods, use fewer data samples. Initially, we have 12,000 rows and 24 columns. AutoGluon infers that our prediction problem is multi-class, with five adoption speed values. It performs feature engineering, filling missing values and handling categorical variables. We see NLP processing, n-grams, and vectorization. The description column is transformed into a vector of integers using a count vectorizer.
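As a rough sketch of the training call described above, assuming the AutoGluon release from around the time of this episode: `get_hyperparameter_config("multimodal")` returns the default multi-modal settings, and `TabularPredictor.fit` takes the arguments listed in the talk. The function name `train_multimodal` and the default `time_limit` of 900 seconds are placeholders, not values taken from the episode.

```python
def train_multimodal(train_data, label, feature_metadata, time_limit=900):
    """Fit a TabularPredictor with AutoGluon's default multi-modal settings.

    The imports are local so that this file loads even when AutoGluon
    is not installed. time_limit is in seconds and is a placeholder value.
    """
    from autogluon.tabular import TabularPredictor
    from autogluon.tabular.configs.hyperparameter_configs import (
        get_hyperparameter_config,
    )

    # Default per-algorithm settings, including the text and image entries.
    hyperparameters = get_hyperparameter_config("multimodal")
    print(hyperparameters)

    predictor = TabularPredictor(label=label).fit(
        train_data=train_data,
        hyperparameters=hyperparameters,
        feature_metadata=feature_metadata,
        time_limit=time_limit,
    )
    return predictor
```

After training, `predictor.leaderboard()` lists the individual models and the ensemble with their validation scores.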
Each description becomes a vector of 5,400-something values, representing the occurrences of each token. This is stored efficiently using a sparse matrix. Pet ID is dropped because it's a unique ID and doesn't carry any meaning. We split the data for training and validation and train eight level-one models: LightGBM, LightGBM Extra Trees, CatBoost, XGBoost, NeuralNet (MXNet), and LightGBM Large. These models focus on the tabular features. CatBoost sets a sampling parameter because of the large number of features. Training time is about 70 seconds. For the deep learning part, we fine-tune a pre-trained BERT variant from the GluonNLP model zoo on the NLP part of the dataset. We see validation accuracy improve over epochs, reaching about 40%. Training takes about 64 minutes. Next, we pull an image predictor from the GluonCV model zoo and fine-tune it for a few epochs. Once we've trained these L1 models, we have a collection of very different models: tree-based, computer vision, and NLP. The final step is to train an ensemble model that predicts in parallel with all the L1 models and outputs an ensemble prediction. Training takes a little less than two hours. All models are saved in the AutogluonModels directory. Training completes, and we print the leaderboard. The ensemble model is better than all the individual models. LightGBM and CatBoost did well, and the text and image models did okay, considering the short training time. We can predict one sample from the training set. We print the image and predict the adoption speed. Invoking predict requires passing a DataFrame, not a pandas Series. We predict this as adoption speed 4. The predictor context is still there, and you can load models that have been archived. It's super easy to work with models you've trained previously. That's pretty much what I wanted to show you today. Here's where to find the code.
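To make the ensemble step more concrete, here is a simplified sketch of one common way to combine level-one models: a weighted average of their class probabilities. AutoGluon learns its ensemble weights during training, and the talk does not detail that procedure, so the weights, model names, and probabilities below are made up purely for illustration.

```python
def weighted_ensemble(probabilities, weights):
    """Average per-model class probabilities with normalized weights.

    probabilities: dict of model name -> list of class probabilities.
    weights: dict of model name -> non-negative weight.
    Returns (averaged probabilities, index of the most likely class).
    """
    total = sum(weights.values())
    n_classes = len(next(iter(probabilities.values())))
    averaged = [0.0] * n_classes
    for name, probs in probabilities.items():
        w = weights[name] / total
        for k, p in enumerate(probs):
            averaged[k] += w * p
    predicted = max(range(n_classes), key=lambda k: averaged[k])
    return averaged, predicted


# Made-up outputs of three level-one models for one pet, five adoption speeds.
probs = {
    "tabular": [0.1, 0.1, 0.2, 0.2, 0.4],
    "text": [0.2, 0.2, 0.2, 0.3, 0.1],
    "image": [0.1, 0.3, 0.2, 0.2, 0.2],
}
# Hypothetical weights; a real ensemble would learn these on validation data.
weights = {"tabular": 0.6, "text": 0.2, "image": 0.2}

averaged, predicted = weighted_ensemble(probs, weights)
print(averaged, predicted)
```

The point of the sketch is that very different models (tree-based, text, image) can vote on the same label, with stronger models given more say.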
I think AutoML is really interesting for people who don't have a ton of ML experience and for those who want to quickly explore and evaluate if a dataset shows promise. It's something you definitely want to add to your skills. As you can see, it's very little code. Try it with some data you have and train some models to see how you compare to existing models. AutoML is here to stay, and I'm really curious to see where it's headed next. It's the last episode of the season. I hope I did okay on my own. Thanks to Sego for helping me with this season and the previous one. She couldn't be here today because babies are babies, and that's more important than anything else. Sego, thank you so much for your help. To everybody out there, I hope you learned a few things with this episode and the previous ones. I'll see you soon. Bye-bye.

Tags

AutoML, Multi-modal Data, AutoGluon, Machine Learning, PetFinder Dataset

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.