Hi everybody, this is Julien from Arcee. In this video, I would like to introduce a new distributed training capability available in Amazon SageMaker. We just launched it at AWS re:Invent, and it's called SageMaker Data Parallelism. Let me explain what data parallelism is first. Data parallelism is a way to train on really large datasets by splitting the training data and distributing it across multiple training instances. Of course, when you have very large datasets for deep learning training jobs like computer vision or natural language processing, this is a really useful technique. Distributed training was already available in SageMaker, so it's not entirely new. What's the big difference here? This new data parallelism technique is more efficient: it removes the GPU-to-GPU communication that Horovod relies on. Horovod is great, but as you scale to really large training jobs, the network becomes a bottleneck, and there's just too much chatter going on between GPUs. SageMaker Data Parallelism, or SDP for short, removes that. Let me show you how this works.
Here, the data is distributed to the different GPUs: a subset of the dataset is sent to each GPU. Each GPU trains the model on its subset of data and stores gradients, or parameter updates, in GPU memory. When a certain threshold is reached, each GPU sends those gradients to the instance CPUs. The CPUs apply the gradient updates coming from the different GPUs, reduce them, and send them back to the GPUs. As you can see, the GPUs don't talk to one another, so they can hopefully stay 100% busy training rather than communicating. Another benefit is that we don't need dedicated infrastructure. We don't need parameter servers to receive gradients, consolidate updates, and distribute them back. Both the training part and the model updating part are distributed on your training cluster; you don't need anything else. Training happens on the GPUs, consolidation and model updates happen on the CPUs, and everything stays within your cluster, making this more efficient.
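To make that flow a bit more concrete, here is a minimal, purely conceptual sketch in Python. This is not the actual SDP implementation; the number of GPUs, the buffer threshold, and the function names are all made up for illustration.

```python
import numpy as np

NUM_GPUS = 8       # GPUs on the instance (illustrative)
NUM_TENSORS = 4    # gradient tensors buffered before offloading to the CPUs

def fake_backward_pass(shard_id):
    # Stand-in for a real backward pass on one GPU's shard of the data
    rng = np.random.default_rng(shard_id)
    return [rng.standard_normal(10) for _ in range(NUM_TENSORS)]

# 1. Each GPU trains on its own data subset and buffers gradients in GPU memory
gpu_buffers = [fake_backward_pass(gpu) for gpu in range(NUM_GPUS)]

# 2. Once the threshold is reached, the buffers are handed off to the instance
#    CPUs, which reduce (here, average) the updates coming from all GPUs
reduced_grads = [np.mean([buf[t] for buf in gpu_buffers], axis=0)
                 for t in range(NUM_TENSORS)]

# 3. The consolidated gradients go back to every GPU, which applies them;
#    at no point do the GPUs exchange gradients with one another directly
print(f"each of the {NUM_GPUS} GPUs applies the same "
      f"{len(reduced_grads)} averaged gradient tensors")
```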
Let's do an example. In the blog post, I have a PyTorch example, but let's do a TensorFlow example here. First, let me show you where you can find these: they're in the Amazon SageMaker examples repository, under training and then distributed training, with examples for both TensorFlow and PyTorch. Let's look at data parallelism first through a simple example, to show you what it takes to modify your existing code for SDP. There are also larger-scale examples using BERT and CNN, where Amazon FSx for Lustre is used to store the training set and speed up the training job even further. These are really good examples, but a bit too long for this video, so I'm going to use the simple example here. You can find the documentation for this on readthedocs.io. I will put all those links in the video description, including API information for PyTorch and TensorFlow and a guide explaining how to adapt your existing code for SDP.
The first major thing is to update your SageMaker SDK; otherwise, you won't have the data parallel APIs. At the time of recording, the latest version is 2.19, and it works for me. If you have 2.19 or newer, you should be fine. Now, let's look at the training code and focus specifically on what you need to add to make this run with SageMaker data parallelism. First, we need to import the package and initialize it. That wasn't too hard. Then, you need to pin the GPUs present on each instance to the data parallel processes running on that instance: the data parallel library will manage those GPUs, so you just need to declare which GPU each local process works with. This is the line of code you need to add here.
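In the TensorFlow version, those first steps look roughly like this. This is a sketch based on the simple MNIST example in the repository; the exact script on screen may differ slightly.

```python
import tensorflow as tf
# Import the SageMaker data parallel library for TensorFlow and initialize it
import smdistributed.dataparallel.tensorflow as sdp

sdp.init()

# Pin each local process to a single GPU on the instance, using its local rank
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')
```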
We load the dataset, build a TensorFlow dataset object, shuffle it, and so on. We build our model, select the loss function, and then, as we can see, we scale the learning rate by the size of the training cluster, meaning the number of GPU workers: for example, with eight workers, the learning rate is multiplied by eight. We use checkpointing, which is always a good idea for long-running jobs: if something fails, or if you want to resume from a checkpoint and train further, you can. Then we have the training step function. This uses the TensorFlow GradientTape object, which records all the forward propagation operations so that gradients can be computed automatically during backprop. You can keep using GradientTape; all you need to do is wrap it with the distributed gradient tape from the SDP library. Simple enough. The rest is as usual: compute gradients, apply gradients. If this is the first batch, meaning the training job is just starting, we need to broadcast variables to all nodes. This only happens once, so here we just grab the variables from the model and broadcast them. Finally, we average the loss value across nodes by calling the all-reduce operation from the SDP library, which collects and processes the values coming from all the different nodes, and we return the loss value. The rest is identical.
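Put together, the modified training step looks roughly like this. It's a sketch in the spirit of the MNIST example: `model` and `loss_fn` are assumed to be defined earlier in the script, and the base learning rate is just an example value.

```python
# Scale the learning rate by the number of GPU workers in the cluster
opt = tf.optimizers.Adam(0.001 * sdp.size())

@tf.function
def training_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        probs = model(images, training=True)
        loss_value = loss_fn(labels, probs)

    # Wrap the tape so gradients are reduced across the whole cluster
    tape = sdp.DistributedGradientTape(tape)
    grads = tape.gradient(loss_value, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))

    if first_batch:
        # On the very first batch, broadcast the initial variables from rank 0
        # so every node starts from the same model state (happens only once)
        sdp.broadcast_variables(model.variables, root_rank=0)
        sdp.broadcast_variables(opt.variables(), root_rank=0)

    # Average the loss value across all nodes before returning it
    loss_value = sdp.oob_allreduce(loss_value)
    return loss_value
```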
Finally, in the training loop, we only checkpoint on the leader node; it would just be a waste of storage to save a checkpoint on every single node. So here, if the process rank is 0, that is, on the leader node, we checkpoint. That's all we need to know. Summing things up, it's not a large amount of work: import the package, initialize it, pin the GPUs to the SDP processes running on each node of the cluster, scale the learning rate accordingly, use the distributed gradient tape object for backward propagation, use the all-reduce operation for the loss, and checkpoint only on the leader node. All right, that's about it. Not a lot of work. We create a TensorFlow estimator, passing our script. We use TensorFlow 2.3.1, Python 3.7, and two p3.16xlarge instances, for a total of 16 GPUs. That is a lot for MNIST, but it's a simple example. Then we just enable data parallelism by setting the distributed training configuration in the estimator. One thing to know is that this is only available on three instance types: p3.16xlarge, p3dn.24xlarge, and p4d.24xlarge. This makes sense, because if you're running very large training jobs, you would certainly be using the larger GPU instances. We just run the training job and wait. It's a totally normal-looking training job, except that this time you're using this new algorithm instead of the native parameter server available in TensorFlow, or Horovod. In his keynote, Swami announced some pretty spectacular benchmarks, thanks to these new distributed training features on SageMaker. Go take a look at the keynote if you haven't seen it.
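Checkpointing on the leader only is just a simple `if sdp.rank() == 0:` test around the checkpoint save in the training script. On the notebook side, the estimator configuration looks roughly like this; it's a sketch, and the entry point name and role are placeholders, not the exact names from the example.

```python
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='train.py',          # placeholder name for the training script
    role=role,                       # your SageMaker execution role
    framework_version='2.3.1',
    py_version='py37',
    instance_count=2,
    instance_type='ml.p3.16xlarge',  # 2 instances x 8 GPUs = 16 GPUs in total
    # Enable SageMaker data parallelism for this training job
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}},
)

estimator.fit()  # pass your dataset location(s) here as usual
```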
That's what I wanted to tell you. Have fun scaling your training jobs with data parallelism. It's a really nice addition and doesn't take a lot of changes to your code. Let me know how you're doing. Happy to answer questions and listen to feedback. See you soon. Thank you.