Train and deploy Keras models with TensorFlow and Apache MXNet on Amazon SageMaker - Julien Simon

Keras is a popular and well-documented open source library for deep learning, while Amazon SageMaker provides you with easy tools to train and optimize machine learning models. Until now, you had to build a custom container to use both, but Keras is now part of the built-in environments for TensorFlow and Apache MXNet. Not only does this simplify the development process, it also allows you to use standard Amazon SageMaker features like script mode or automatic model tuning.

Keras’s excellent documentation, numerous examples, and active community make it a great choice for beginners and experienced practitioners alike. The library provides a high-level API that makes it easy to build all kinds of deep learning architectures, with the option to use different backends for training and prediction: TensorFlow, Apache MXNet, and Theano.

In this post, I show you how to train and deploy Keras 2.x models on Amazon SageMaker, using the built-in environments for TensorFlow and Apache MXNet. In the process, you also learn how to use script mode and automatic model tuning with your Keras code.

The Keras example

This example demonstrates training a simple convolutional neural network on the Fashion MNIST dataset. This dataset is a drop-in replacement for the well-known MNIST dataset. It has the same number of classes (10), samples (60,000 for training, 10,000 for validation), and image properties (28×28 pixels, black and white). But it’s also much harder to learn, which makes for a more interesting challenge.

First, set up TensorFlow as your Keras backend (and switch to Apache MXNet later on). For more information, see the mnist_keras_tf_local.py script.

Positioning your image channels can be tricky. Black and white images have a single channel, while color images have three channels (red, green, and blue). The library expects data to have a well-defined shape when training a model, describing the batch size, the height and width of images, and the number of channels. TensorFlow expects the input shape (batch size, height, width, channels), with channels last. Meanwhile, MXNet expects (batch size, channels, height, width), with channels first. To avoid training issues created by using the wrong shape, I add a few lines of code to identify the active setting and reshape the dataset to match.
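A minimal sketch of that reshaping step, assuming the images are NumPy arrays shaped like Fashion MNIST. In a Keras script, the active setting would come from keras.backend.image_data_format(); here it is passed in as a plain string so that the example runs without Keras installed. The function name add_channel_axis is hypothetical and not taken from the original script.

```python
import numpy as np

def add_channel_axis(images, data_format):
    """Add a single-channel axis to a batch of black and white images.

    images:      array of shape (batch, rows, cols)
    data_format: 'channels_last' or 'channels_first'
    """
    if data_format == 'channels_last':
        return images[..., np.newaxis]       # (batch, rows, cols, 1)
    if data_format == 'channels_first':
        return images[:, np.newaxis, ...]    # (batch, 1, rows, cols)
    raise ValueError('unknown data format: %r' % data_format)

# Stand-in for the training images (zeros, not real data).
x_train = np.zeros((8, 28, 28), dtype='float32')
print(add_channel_axis(x_train, 'channels_last').shape)   # (8, 28, 28, 1)
print(add_channel_axis(x_train, 'channels_first').shape)  # (8, 1, 28, 28)
```

The input_shape passed to the first layer of the model has to follow the same convention, minus the batch dimension.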

Now check that this code works by running it on a local machine, without using Amazon SageMaker.

$ python mnist_keras_tf_local.py
Using TensorFlow backend.
channels_last
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
<output removed>
Validation loss    : 0.2472819224089384
Validation accuracy: 0.9126

Training and deploying the Keras model

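You must make a few minimal changes, but script mode does most of the work for you. Before invoking your code inside the TensorFlow environment, Amazon SageMaker sets four environment variables: SM_NUM_GPUS, SM_MODEL_DIR, SM_CHANNEL_TRAINING, and SM_CHANNEL_VALIDATION. You can read them in your script as default values for command line arguments: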

parser.add_argument('--gpu-count', type=int, default=os.environ['SM_NUM_GPUS'])
parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
parser.add_argument('--training', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
parser.add_argument('--validation', type=str, default=os.environ['SM_CHANNEL_VALIDATION'])

What about hyperparameters? No work needed there. Amazon SageMaker passes them as command line arguments to your code.
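Put together, the argument handling could look like the sketch below. It is an illustration under assumptions, not the original script: the hyperparameter defaults and the local fallback paths are made up, and os.environ.get is used so that the same file also runs on a local machine where the SM_* variables are not set. Unknown arguments are ignored with parse_known_args.

```python
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser()

    # Hyperparameters, received as command line arguments.
    # The default values here are placeholders.
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--learning-rate', type=float, default=0.01)
    parser.add_argument('--batch-size', type=int, default=128)

    # Environment settings. The fallback values are assumptions
    # that only matter when running outside Amazon SageMaker.
    parser.add_argument('--gpu-count', type=int,
                        default=os.environ.get('SM_NUM_GPUS', 0))
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', './model'))
    parser.add_argument('--training', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING', './data/training'))
    parser.add_argument('--validation', type=str,
                        default=os.environ.get('SM_CHANNEL_VALIDATION', './data/validation'))

    # Ignore any argument this script does not declare.
    args, _ = parser.parse_known_args(argv)
    return args

args = parse_args(['--epochs', '20', '--batch-size', '256'])
print(args.epochs, args.batch_size, args.learning_rate)
```

Because argparse converts string defaults with the declared type, a value such as SM_NUM_GPUS ends up as an integer in args.gpu_count.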

Training on Amazon SageMaker

After adapting your Keras script, you can begin training on Amazon SageMaker. For more information, see the Fashion MNIST-SageMaker.ipynb notebook.

In the training log, you can see how Amazon SageMaker sets the environment variables and how it invokes the script with the three hyperparameters defined in the estimator:

/usr/bin/python mnist_keras_tf.py --batch-size 256 --epochs 20 --learning-rate 0.01 --model_dir s3://sagemaker-eu-west-1-123456789012/sagemaker-tensorflow-scriptmode-2019-05-16-14-11-19-743/model
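To make the mapping in that log line concrete, here is a small illustration of how a hyperparameter dictionary translates into command line arguments. This is not Amazon SageMaker's implementation; the helper to_cli_args is hypothetical, and sorting by name is an assumption chosen to match the order seen in the log.

```python
def to_cli_args(hyperparameters):
    """Turn {'name': value} pairs into ['--name', 'value', ...]."""
    args = []
    for name in sorted(hyperparameters):
        args += ['--' + name, str(hyperparameters[name])]
    return args

# The three hyperparameters visible in the training log.
hyperparameters = {'epochs': 20, 'learning-rate': 0.01, 'batch-size': 256}
print(' '.join(to_cli_args(hyperparameters)))
# --batch-size 256 --epochs 20 --learning-rate 0.01
```

Every value reaches the script as a string, which is why each argument in the training script declares its own type.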

Because you saved your model in TensorFlow Serving format, Amazon SageMaker can deploy it just like any other TensorFlow model by calling the deploy() API on the estimator. Finally, you can grab some random images from the dataset and predict them with the model you just deployed.

Script mode makes it easy to train and deploy existing TensorFlow code on Amazon SageMaker. Just grab those environment variables, add command line arguments for your hyperparameters, save the model in the right place, and voilà!

Switching to the Apache MXNet backend

As mentioned earlier, Keras also supports MXNet as a backend. Many customers find that it trains faster than TensorFlow, so you may want to give it a shot.

Everything discussed above still applies (script mode, etc.). You only make two changes:

You can find the Amazon SageMaker steps in the notebook. Apache MXNet uses virtually the same process I just reviewed, aside from using the MXNet estimator.

Automatic model tuning on Keras

Automatic model tuning is a technique that helps you find the optimal hyperparameters for your training job, that is, the hyperparameters that maximize validation accuracy.

You have access to this feature by default because you’re using the built-in estimators for TensorFlow and MXNet. For the sake of brevity, I only show you how to use it with Keras-TensorFlow, but the process is identical for Keras-MXNet.

First, define the hyperparameters you’d like to tune, and their ranges. How about all of them? Thanks to script mode, your parameters are passed as command line arguments, allowing you to tune anything.

from sagemaker.tuner import IntegerParameter, ContinuousParameter

hyperparameter_ranges = {
    'epochs':        IntegerParameter(20, 100),
    'learning-rate': ContinuousParameter(0.001, 0.1, scaling_type='Logarithmic'),
    'batch-size':    IntegerParameter(32, 1024),
    'dense-layer':   IntegerParameter(128, 1024),
    'dropout':       ContinuousParameter(0.2, 0.6)
}

When configuring automatic model tuning, define which metric to optimize on. Amazon SageMaker supports predefined metrics that it can read automatically from the training log for built-in algorithms (XGBoost, etc.) and frameworks (TensorFlow, MXNet, etc.). That’s not the case for Keras. Instead, you must tell Amazon SageMaker how to grab your metric from the log with a simple regular expression:

objective_metric_name = 'val_acc'
objective_type = 'Maximize'
metric_definitions = [{'Name': 'val_acc', 'Regex': 'val_acc: ([0-9\\.]+)'}]
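Before launching a tuning job, it can help to check the regular expression locally. The sketch below applies the same pattern with Python's re module to a made-up line shaped like the metrics Keras prints; the sample text is an assumption for illustration, not output copied from a real training job.

```python
import re

# Same pattern as in metric_definitions above.
metric_regex = 'val_acc: ([0-9\\.]+)'

# Made-up sample line (illustration only).
sample_log = '... - val_loss: 0.2472 - val_acc: 0.9126'

matches = re.findall(metric_regex, sample_log)
print(matches)  # ['0.9126']
```

The capture group is what gets recorded as the metric value, so the pattern should match the metric name exactly and nothing else in the line.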

Then, you define your tuning job, run it, and deploy the best model. No difference here.

Advanced users may insist on using early stopping to avoid overfitting, and they would be right. You can implement this in Keras using a built-in callback (keras.callbacks.EarlyStopping). However, this also creates difficulty in automatic model tuning.

You need Amazon SageMaker to grab the metric for the best epoch, not the last epoch. To overcome this, define a custom callback to log the best validation accuracy. Modify the regular expression accordingly so that Amazon SageMaker can find it in the training log.
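The tracking logic of such a callback might look like the sketch below. It is kept free of any framework so that it runs anywhere: in a training script, the class would subclass keras.callbacks.Callback, whose on_epoch_end and on_train_end hooks it mirrors. The class name, the best_val_acc label, and the accuracy values in the usage example are all hypothetical.

```python
import re

class BestValAccLogger:
    """Remember the best validation accuracy and print it when training ends.

    In a real script this would subclass keras.callbacks.Callback
    (assumption: the metric is reported under the 'val_acc' key).
    """

    def __init__(self, monitor='val_acc'):
        self.monitor = monitor
        self.best = None

    def on_epoch_end(self, epoch, logs=None):
        value = (logs or {}).get(self.monitor)
        if value is not None and (self.best is None or value > self.best):
            self.best = value

    def on_train_end(self, logs=None):
        if self.best is not None:
            print('best_val_acc: %.4f' % self.best)

# Regular expression to use in the metric definition instead of 'val_acc: ...'.
best_metric_regex = 'best_val_acc: ([0-9\\.]+)'

# Simulated epochs with made-up accuracy values.
logger = BestValAccLogger()
for epoch, acc in enumerate([0.88, 0.91, 0.90]):
    logger.on_epoch_end(epoch, {'val_acc': acc})
logger.on_train_end()  # prints: best_val_acc: 0.9100
```

Using a distinct label such as best_val_acc keeps the new pattern from also matching the per-epoch val_acc lines.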

Conclusion

Thank you very much for reading. I hope this was useful. I always appreciate comments and feedback, either here or more directly on Twitter.

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at AWS and Chief Evangelist at Hugging Face, Julien has authored books on Amazon SageMaker and contributed to the open-source AI ecosystem. His mission is to make AI accessible, understandable, and controllable for everyone.