Hi, this is Julien from AWS. Welcome to episode 11 of my podcast. This episode is an XGBoost special. As you probably know, XGBoost 1.0 came out just a few days ago, and I thought this would be a good opportunity to show you the different ways you can train and deploy XGBoost models on SageMaker. So don't forget to subscribe to my channel for future videos, and let's dive into XGBoost.
Let's take a look at the new features in XGBoost 1.0. The first one is better performance scaling on multi-core CPUs, with claims of up to a 5x speed-up on Intel CPUs with many cores. This is significant because on AWS you can definitely get Intel CPUs with many cores, and you have a wide choice of training instances. Whether you run XGBoost on EC2 directly or on a managed service like SageMaker, this should make a difference. It's definitely worth benchmarking soon.
The next feature is good news for everyone using macOS: it's now easier to install XGBoost on macOS, which is not a minor improvement. There's also distributed XGBoost on Kubernetes, complete with a tutorial, so you might want to check that out. Running this on EKS could be interesting and is something I can add to my to-do list.
Ruby bindings for XGBoost are now available for Ruby developers. A Dask interface has been added as well; Dask is a very cool framework for distributed computing, and XGBoost can now use it natively to train on multiple GPUs, which is quite interesting. Dask is becoming increasingly popular, so this is another item for the to-do list.
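To give you an idea of what the Dask integration looks like, here's a minimal sketch. It assumes you have the separate dask_cuda package installed for the GPU cluster, and the file name and the 'label' column are just placeholders.

```python
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client
from dask_cuda import LocalCUDACluster  # separate package, one worker per local GPU

# Start a local Dask cluster backed by the GPUs on this machine
cluster = LocalCUDACluster()
client = Client(cluster)

# 'train.csv' and the 'label' column are placeholders
df = dd.read_csv('train.csv')
X, y = df.drop('label', axis=1), df['label']

# The native Dask API: a distributed DMatrix plus xgb.dask.train()
dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {'objective': 'binary:logistic', 'tree_method': 'gpu_hist'},
    dtrain,
    num_boost_round=100)
booster = output['booster']
```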
First-class support for cuDF data frames and CuPy arrays has been added, further integrating XGBoost with NVIDIA GPUs. This is a strong theme in XGBoost 1.0, and it's interesting because, while most customers I meet run XGBoost on CPU and are quite happy, the appeal of GPUs grows as they train on larger and larger datasets. It's good to see XGBoost integrating more deeply with GPUs, and this is definitely worth testing.
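Here's a rough sketch of what single-GPU training with a cuDF data frame could look like; again, the file name and the 'label' column are placeholders, and the exact cuDF calls may vary with your RAPIDS version.

```python
import cudf
import xgboost as xgb

# 'train.csv' and the 'label' column are placeholders
gdf = cudf.read_csv('train.csv')
X, y = gdf.drop(columns=['label']), gdf['label']

# DMatrix now accepts cuDF data frames directly, keeping the data on the GPU
dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic', 'tree_method': 'gpu_hist'}
bst = xgb.train(params, dtrain, num_boost_round=100)
```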
Other notable features include ranking on GPUs and external memory support for GPU training, which is useful if your dataset is larger than what your GPU memory can accommodate: XGBoost can now handle this, just as it already handled larger-than-memory datasets on CPUs. Improvements to the scikit-learn interface are also good news, and there are updates to XGBoost4J-Spark for those interested.
One of the last features I want to mention is the new format for saving models. Previously, most of us used pickle to serialize XGBoost models in Python, but this created several issues. XGBoost is moving to a more open and portable format, which is JSON-based, and they have new APIs for this. If you have pickle models, you will need to retrain them and save them using the new format. Loading pickle models in XGBoost 1.0 and later might be problematic, so retraining is recommended.
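In practice, saving and loading in the new format is just a matter of using a .json file name with save_model and load_model. Here's a small self-contained example; the synthetic data is only there to have something to train.

```python
import numpy as np
import xgboost as xgb

# Tiny synthetic dataset, just to get a trained Booster to save
X = np.random.rand(100, 4)
y = np.random.randint(2, size=100)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=10)

# Save in the new JSON-based format (the .json extension selects it)
bst.save_model('model.json')

# Load it back later, or in another environment
bst2 = xgb.Booster()
bst2.load_model('model.json')
```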
There are also bug fixes and other improvements, making this a major release. There's already a 1.0.1 patch release, addressing a minor issue that slipped past validation in 1.0.0.
Now, let's look at how to run XGBoost on SageMaker. The first way is to use it as a built-in algorithm. SageMaker has a collection of built-in algorithms, and XGBoost is one of them, which means you get a built-in XGBoost container that you can use directly: just set hyperparameters, define the location of your data, and train. You can select the version you want to use, and I recommend the latest one, 0.90-2, because it supports SageMaker Debugger. This service lets you configure debugging rules and inspect your training jobs as they run. If you use an older version, SageMaker Debugger won't be available.
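As a rough sketch of the built-in algorithm flow, assuming the v1-style SageMaker Python SDK in use at the time, it could look like this; the bucket names and S3 paths are placeholders.

```python
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.session import s3_input

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Pick the built-in XGBoost container, latest version at the time of recording
container = get_image_uri(session.boto_region_name, 'xgboost', repo_version='0.90-2')

# 's3://my-bucket/...' paths are placeholders
xgb = sagemaker.estimator.Estimator(
    container, role,
    train_instance_count=1,
    train_instance_type='ml.m5.xlarge',
    output_path='s3://my-bucket/output')

xgb.set_hyperparameters(objective='binary:logistic',
                        eval_metric='auc',
                        num_round=100)

xgb.fit({'train': s3_input('s3://my-bucket/train/', content_type='csv'),
         'validation': s3_input('s3://my-bucket/validation/', content_type='csv')})
```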
The second way to use XGBoost on SageMaker is to treat it as a built-in framework, just like TensorFlow, PyTorch, and MXNet. The container for this is open source, and you can find the links in the video description; you can even build and customize it on your local machine. The process is similar to using the other frameworks: you use the `XGBoost` estimator from the SageMaker SDK, pass it a script, define your infrastructure requirements, set hyperparameters, and specify the framework version.
Here's a simple example where I train a classifier on the direct marketing dataset using script mode. Script mode is a way to interface existing framework code with SageMaker: your script receives hyperparameters on the command line, and environment variables tell it where to find the training set and the validation set, and where to save the model. If you have existing XGBoost code, converting it to script mode is straightforward.
In the script, I import XGBoost, receive a single hyperparameter (max depth), and grab environment variables for the training set, validation set, and model save location. I load the dataset, which is already split into training and validation sets, using pandas. The labels are in the "yes" column, and the samples are the remaining columns. I create an XGBoost classifier with the binary:logistic objective and the area under the curve (AUC) metric. I train the model, score it on the validation set, print the AUC, and save the model using the new JSON-based format.
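Here's a minimal sketch of what such a script-mode training script could look like. The file names, the "yes" label column, and the single max-depth hyperparameter follow the description above; the SM_* environment variables are the ones the SageMaker training containers set.

```python
import argparse
import os

import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--max-depth', type=int, default=5)
    # SageMaker passes these locations through environment variables
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--validation', type=str, default=os.environ['SM_CHANNEL_VALIDATION'])
    args = parser.parse_args()

    # The dataset is already split and saved as CSV; 'yes' is the label column
    train = pd.read_csv(os.path.join(args.train, 'train.csv'))
    val = pd.read_csv(os.path.join(args.validation, 'validation.csv'))
    X_train, y_train = train.drop('yes', axis=1), train['yes']
    X_val, y_val = val.drop('yes', axis=1), val['yes']

    # Binary classification with the max depth passed as a hyperparameter
    clf = xgb.XGBClassifier(objective='binary:logistic', max_depth=args.max_depth)
    clf.fit(X_train, y_train)

    # Score on the validation set and print the AUC
    auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
    print('AUC:', auc)

    # Save with the new JSON-based format
    clf.get_booster().save_model(os.path.join(args.model_dir, 'xgb.model.json'))
```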
To use this script, I have a vanilla XGBoost script mode notebook. I download the dataset, one-hot encode it, drop the "yes" column, split the dataset into training and validation sets, and save them as CSV files. I define the location to save the model and use the XGBoost estimator to pass my script. I can train locally for quick debugging, set the framework version, and see the model training and AUC printed out.
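The notebook side of this could look roughly like the sketch below. The dataset file name, the S3 prefix, the label-column handling, and the framework version are assumptions (the version shown assumes a container that ships XGBoost 1.0); 'local' as the instance type enables local mode for quick debugging.

```python
import pandas as pd
import sagemaker
from sagemaker.xgboost import XGBoost
from sklearn.model_selection import train_test_split

# Placeholder file name for the direct marketing dataset
data = pd.get_dummies(pd.read_csv('bank-additional-full.csv', sep=';'))
# (drop the redundant complementary label column here if one-hot encoding created one)
train, val = train_test_split(data, test_size=0.1)
train.to_csv('train.csv', index=False)
val.to_csv('validation.csv', index=False)

session = sagemaker.Session()
train_uri = session.upload_data('train.csv', key_prefix='xgb-script-mode')
val_uri = session.upload_data('validation.csv', key_prefix='xgb-script-mode')

estimator = XGBoost(entry_point='xgb_script.py',        # the script sketched above
                    role=sagemaker.get_execution_role(),
                    framework_version='1.0-1',           # assumed version with XGBoost 1.0
                    hyperparameters={'max-depth': 5},
                    train_instance_count=1,
                    train_instance_type='local')         # local mode for quick debugging

estimator.fit({'train': train_uri, 'validation': val_uri})
```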
Once the model is trained, I can deploy it using the `deploy` API. After a few minutes, I can predict using a sample from the validation set. I convert the sample to CSV and use the `InvokeEndpoint` API from Boto3 to send the payload to the model and read the probability between 0 and 1.
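Deployment and prediction could look like this sketch, still assuming the v1-style SDK; the instance type, the 'yes' label column, and the CSV payload shape are illustrative.

```python
import boto3

# Deploy the trained estimator behind a real-time endpoint
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m5.large')

# Grab one sample from the validation set, minus the label, as a CSV string
sample = val.drop('yes', axis=1).iloc[:1]
payload = sample.to_csv(header=False, index=False)

# Invoke the endpoint directly with Boto3
runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(EndpointName=predictor.endpoint,
                                   ContentType='text/csv',
                                   Body=payload)
probability = float(response['Body'].read())
print(probability)      # a score between 0 and 1
```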
If you want to use a different format, such as NumPy arrays, you need to write an `input_fn` function to convert the input. The `model_fn` function, which loads the model, is mandatory and must be provided by you.
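As a rough sketch of those two handlers, added alongside the training code in the entry-point script, it could look like this. The model file name matches what the training script above saved, and the NumPy deserialization is a placeholder since the exact wire format depends on how the client serializes the request.

```python
import os

import numpy as np
import xgboost as xgb


def model_fn(model_dir):
    """Mandatory: load the model saved by the training script."""
    booster = xgb.Booster()
    booster.load_model(os.path.join(model_dir, 'xgb.model.json'))
    return booster


def input_fn(request_body, request_content_type):
    """Optional: convert incoming payloads, e.g. NumPy, into a DMatrix."""
    if request_content_type == 'application/x-npy':
        # Placeholder deserialization; adapt to how your client sends the array
        array = np.frombuffer(request_body, dtype=np.float32).reshape(1, -1)
        return xgb.DMatrix(array)
    raise ValueError('Unsupported content type: ' + request_content_type)
```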
Regarding loading and saving models, XGBoost is moving to a new JSON-based format. If you have models from XGBoost 0.90, you can try loading them with the new version, but it might be problematic. The XGBoost repo provides a script to convert pickle models to the new format, but it's advised not to use it for stability-critical applications. Retraining is generally a better approach.
Finally, you can use a completely custom container. Last week, I showed how to do this with MLflow, where I trained a local model on my Mac and deployed it using a custom container on SageMaker. MLflow built the container for me, making it a hands-off operation. Alternatively, you can build your own SageMaker container if you have specific requirements.
These are the three main options: use the built-in algorithm for a straightforward setup, use the framework mode for more control over your code, or bring your own container for full customization. That's it for episode 11. I hope you learned a few things. Thank you for watching, don't forget to subscribe to my channel for future videos, and until next time, keep rocking.