An introduction to the MXNet API — part 4
In part 3, we built and trained our first neural network. We now know enough to take on more advanced examples.
State-of-the-art Deep Learning models are insanely complex. They have hundreds of layers and take days, if not weeks, to train on vast amounts of data. Building and tuning these models requires a lot of expertise.
Fortunately, using these models is much simpler and only requires a few lines of code. In this article, we’re going to work with a pre-trained model for image classification called Inception v3.
Inception v3
Published in December 2015, Inception v3 is an evolution of the GoogLeNet model (which won the 2014 ImageNet challenge). We won’t go into the details of the research paper, but paraphrasing its conclusion, Inception v3 is 15–25% more accurate than the best models available at the time, while being six times cheaper computationally and using at least five times fewer parameters (i.e. less RAM is required to use the model).
Quite a beast, then. So how do we put it to work?
The MXNet model zoo
The model zoo is a collection of pre-trained models ready for use. You’ll find the model definition, the model parameters (i.e. the neuron weights) and instructions (maybe).
Let’s download the definition and the parameters, and rename the parameter file: load_checkpoint() builds the filename from a prefix and an epoch number, and we’re going to load epoch 0. Feel free to open the first file: you’ll see the definition of all the layers. The second one is a binary file, leave it alone ;)
$ wget http://data.dmlc.ml/models/imagenet/inception-bn/Inception-BN-symbol.json
$ wget http://data.dmlc.ml/models/imagenet/inception-bn/Inception-BN-0126.params
$ mv Inception-BN-0126.params Inception-BN-0000.params
Since this model has been trained on the ImageNet data set, we also need to download the corresponding list of image categories (1000 of them).
$ wget http://data.dmlc.ml/models/imagenet/synset.txt
$ wc -l synset.txt
1000 synset.txt
$ head -5 synset.txt
n01440764 tench, Tinca tinca
n01443537 goldfish, Carassius auratus
n01484850 great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias
n01491361 tiger shark, Galeocerdo cuvieri
n01494475 hammerhead, hammerhead shark
Ok, done. Now let’s get to work.
Loading the model
Here’s what we need to do:
- load the model from its saved state: MXNet calls this a checkpoint. In return, we get the input Symbol and the model parameters.
import mxnet as mx
sym, arg_params, aux_params = mx.model.load_checkpoint('Inception-BN', 0)
- create a new Module and assign it the input Symbol. We could also pass a context parameter indicating where we want to run the model: the default value is cpu(0), but we’d use gpu(0) to run this on a GPU.
mod = mx.mod.Module(symbol=sym)
- bind the input Symbol to input data. We’ll call it ‘data’ because that’s its name in the input layer of the network (look at the first few lines of the JSON file).
- define the shape of ‘data’ as 1 x 3 x 224 x 224. Don’t panic ;) ‘224 x 224’ is the image resolution: that’s how the model was trained. ‘3’ is the number of channels: red, green and blue (in this order). ‘1’ is the batch size: we’ll predict one image at a time.
mod.bind(for_training=False, data_shapes=[('data', (1,3,224,224))])
- set the model parameters.
mod.set_params(arg_params, aux_params)
That’s all it takes. Four lines of code! Now it’s time to push some data in there and see what happens. Well… not quite yet.
Preparing our data
Data preparation: making our life miserable since the Seventies… From relational databases to Machine Learning to Deep Learning, nothing has really changed in that respect. It’s boring but necessary. Let’s get it done.
Remember that the model expects a 4-dimensional NDArray holding the red, green and blue channels of a single 224 x 224 image. We’re going to use the popular OpenCV library to build this NDArray from our input image. If you don’t have OpenCV installed, running “pip install opencv-python” should be enough in most cases :)
Here are the steps (we’ll also gather them into a single helper right after the list):
- read the image: this will return a numpy array shaped as (image height, image width, 3), with the three channels in BGR order (blue, green and red).
import cv2
import numpy as np

img = cv2.imread(filename)
- convert the image to RGB.
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
- resize the image to 224 x 224.
img = cv2.resize(img, (224, 224))
- reshape the array from (image height, image width, 3) to (3, image height, image width).
img = np.swapaxes(img, 0, 2)
img = np.swapaxes(img, 1, 2)
- add a fourth dimension and build the NDArray
img = img[np.newaxis, :]
array = mx.nd.array(img)
>>> print array.shape
(1L, 3L, 224L, 224L)
Dizzy? Let’s look at an example. Here’s our input picture.
[Image: our input picture]
Once processed, this picture has been resized and split into RGB channels stored in array[0].
[Images: the red, green and blue channels of the processed picture]
If batch size was higher than 1, then we would have a second image in array[1], a third in array[2] and so on.
Was this fun or what? Now let’s predict!
Predicting
You may remember from part 3 that a Module object must feed data to a model in batches: the common way to do this is to use a data iterator (specifically, we used an NDArrayIter object).
Here, we’d like to predict a single image, so although we could use a data iterator, it’d probably be overkill. Instead, we’re going to create a named tuple, called Batch, which will act as a minimal stand-in: all the Module needs is an object that returns our input NDArray when its data attribute is read.
from collections import namedtuple
Batch = namedtuple('Batch', ['data'])
Now we can pass this “batch” to the model and let it predict.
mod.forward(Batch([array]))
The model will output an NDArray holding the 1000 probabilities, corresponding to the 1000 categories. It has only one row, since batch size is equal to 1.
prob = mod.get_outputs()[0].asnumpy()
>>> prob.shape
(1, 1000)
Let’s remove the extra dimension with squeeze(), turning this into a 1000-element vector. Then, using argsort(), we’ll build a second array holding the indexes of these probabilities, sorted in descending order.
prob = np.squeeze(prob)
>>> prob.shape
(1000,)
>>> prob
[ 4.14978594e-08 1.31608676e-05 2.51907986e-05 2.24045834e-05
2.30327873e-06 3.40798979e-05 7.41563645e-06 3.04062659e-08 etc.
sortedprob = np.argsort(prob)[::-1]
>>> sortedprob.shape
(1000,)
According to the model, the most likely category for this picture is #546, with a probability of 58%.
>>> sortedprob
[546 819 862 818 542 402 650 420 983 632 733 644 513 875 776 917 795
etc.
>>> prob[546]
0.58039135
Let’s find the name of this category. Using the synset.txt file, we can build a list of categories and find the one at index 546.
synsetfile = open('synset.txt', 'r')
categorylist = []
for line in synsetfile:
    categorylist.append(line.rstrip())
synsetfile.close()
>>> categorylist[546]
'n03272010 electric guitar'
What about the second highest category?
>>> prob[819]
0.27168664
>>> categorylist[819]
'n04296562 stage'
That’s pretty good, don’t you think?
So there you go. Now you know how to use a pre-trained, state-of-the-art model for image classification. All it took was four lines of code… and the rest was just data preparation.
You’ll find the full code below. Have fun and stay tuned :D