Deep Dive: Advanced distributed training with Hugging Face LLMs and AWS Trainium

January 23, 2024
Following up on the "Hugging Face on AWS accelerators" deep dive (https://youtu.be/66JUlAA8nOU), this video zooms in on distributed training with NeuronX Distributed, Optimum Neuron, and AWS Trainium. First, we explain the basics and benefits of advanced distributed techniques like tensor parallelism, pipeline parallelism, sequence parallelism, and DeepSpeed ZeRO. Then, we discuss how these techniques are implemented in NeuronX Distributed and Optimum Neuron. Finally, we launch an Amazon EC2 Trainium-powered instance and demonstrate these techniques with distributed training runs on the TinyLlama and Llama 2 7B models. Of course, we share results on training time and cost, which will probably surprise you! ⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://www.airealist.ai. ⭐️⭐️⭐️ This video focuses on the software details essential for achieving peak performance. Access relevant code snippets and developer resources, suitable for both newcomers and experienced professionals. Whether you're familiar with AWS Trainium or approaching it for the first time, this technical walkthrough ensures your readiness for success in training Hugging Face models on AWS.
00:00 Introduction
01:20 NeuronX Distributed
05:20 Tensor Parallelism
11:10 Pipeline Parallelism
16:55 Sequence Parallelism
20:32 Optimum Neuron
21:40 Optimum Neuron with ZeRO-1
25:55 Optimum Neuron with Tensor Parallelism and Sequence Parallelism
29:20 Amazon Machine Images (AMI) for Neuron devices
30:10 Launching an Amazon EC2 trn1n.32xlarge instance with the Hugging Face Neuron AMI
33:10 Fine-tuning TinyLlama with Optimum Neuron
41:15 Fine-tuning Llama 2 7B with Optimum Neuron

Links:
- Hugging Face Optimum Neuron: https://huggingface.co/docs/optimum-neuron/index
- Source code for supported models: https://github.com/huggingface/optimum-neuron/tree/main/optimum/neuron/distributed
- Release notes: https://github.com/huggingface/optimum-neuron/releases
- Distributed training docs: https://huggingface.co/docs/optimum-neuron/guides/distributed_training
- TinyLlama: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Llama 2 7B: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

Transcript

Hi everybody, this is Julien from Hugging Face. In a previous video, I took you through the complete stack involved in accelerating Hugging Face models on AWS accelerators. We covered the chips, the Neuron SDK, some of the libraries, and in this video, I'm going to keep exploring that stack. We're going to zoom in on distributed training. We'll talk about NeuronX Distributed and its integration into Optimum Neuron. We'll cover techniques like tensor parallelism, sequence parallelism, and ZeRO-1 optimization. These are very cool techniques. And of course, we'll run some tests, and I'll show you how to train TinyLlama to start. Then we'll run distributed training on Llama 2 7B and look at the time and cost. I think you will be surprised. Let's get started. If you like this video, please give it a thumbs up, subscribe to the channel, enable notifications so you won't miss anything in the future, and please consider sharing the video with friends or colleagues. I need all the help I can get to spread the word. This is the stack we covered in the previous video, so if you haven't watched that, please take a look. I will put the link in the video description. In this particular video, we're going to focus on distributed training. First, we should dive into NeuronX Distributed, which is a distributed training and inference library built by AWS. Just as a reminder, this is an AWS library for distributed training and inference on NeuronCore v2 devices, i.e. Inf2 and Trn1 instances. It implements advanced techniques like tensor parallelism, pipeline parallelism, and sequence parallelism. These three techniques are implemented in this library. We can combine them if we want, and in fact, we will during the demos. These are very advanced techniques, and if you want to work directly with NeuronX Distributed, you have to modify your code. You can see some examples in the AWS documentation.
This is very difficult, so unless you really know what you're doing, and if you are, you're probably not watching this video, I wouldn't touch it directly. That's why we integrated this into Optimum Neuron, and you will see how simple we made it. Still, it's interesting to dive into NeuronX Distributed and understand more about tensor parallelism and sequence parallelism, which are already available in Optimum Neuron, and pipeline parallelism, which we're currently working on. We'll figure out what these techniques are, their benefits, and how we can easily enable them in Optimum Neuron during the demos. This is the high-level view of the three techniques. Let's start with tensor parallelism. As the name implies, we're going to parallelize tensors, specifically model weight tensors. We split them into chunks, as many as we have devices. In this example, we have two devices, so let's say two neuron cores. We split the layers into chunks and load the different layer chunks across those devices so that devices can run part of the computation in parallel. We'll look at this in more detail. Pipeline parallelism, as you can see, is not about splitting the layers but the model itself. We split the model into layers. In this example, we see individual layers on three devices, but we could have multiple layers on the same device. We run batches through the layers, forward and backward, and need to orchestrate all of this. Let's see how that works. Lastly, sequence parallelism is not about splitting the layers or the model but the input sequence. We split the input sequence and run sequence chunks through the full model on each device, with synchronization to ensure we don't lose context across those chunks. These are the three different ways we can parallelize training. Let's look at them in more detail. Let's start with tensor parallelism. Tensor operations are at the core of model training, which often means multiplying matrices or tensors. 
If we have multiple compute devices, such as multiple cores or chips, it would be great to split that computation and run each chunk in parallel on several devices. This is exactly what we're doing here. This comes from the Megatron-LM paper from 2019. If you look at the example, we're running A multiplied by B and storing the results in C. We leave A untouched, load A on all devices, split B into two chunks (assuming two devices), and load each B chunk on one device. We can run A multiplied by the first B chunk on one device and A multiplied by the second B chunk on another device, completely in parallel. Then we gather and merge the resulting C chunks. This is about splitting tensors, distributing chunks, computing partial operations, and concatenating the partial results. The benefit is reduced memory usage on each device. We're not loading the full model but only chunks of tensors, saving memory and allowing us to work with very large models that wouldn't fit on a single compute device. We can shard very large models across multiple devices and still train them. The tensor parallelism level is the number of chunks we build. In this example, we're using two-way parallelism because we split B into two to run on two devices. This works for training and inference, making it a distributed inference technique as well. NeuronX Distributed implements this with parallel versions of transformer layers, including parallel embedding and parallel linear layers (fully connected layers, row-wise and column-wise). If you work with these objects, the tensor operations are automatically split or distributed across different neuron cores. However, you need to change your code, replacing your embeddings and linear layers with the parallelized versions in NeuronX Distributed. These are the building blocks used in attention layers, as shown in the illustration from the Neuron docs.
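Before looking at how the attention layers use these building blocks, the column-split matmul just described can be sketched in plain NumPy. This is an illustration of the idea only, not the NeuronX Distributed implementation; the matrix shapes and the two-way split are arbitrary choices for the example.

```python
import numpy as np

# A is replicated on every device; B is split column-wise, one chunk per device.
rng = np.random.default_rng(42)
A = rng.random((4, 8))             # activations, kept whole
B = rng.random((8, 6))             # weight matrix to parallelize
B_chunks = np.split(B, 2, axis=1)  # two-way tensor parallelism

# Each "device" computes its partial product independently...
C_chunks = [A @ chunk for chunk in B_chunks]

# ...then the partial results are gathered and concatenated.
C = np.concatenate(C_chunks, axis=1)

assert np.allclose(C, A @ B)  # identical to the unsplit computation
```

Each chunk of the computation touches only half of B, which is exactly where the per-device memory saving comes from: with N-way parallelism, each device stores 1/N of the weight tensor.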
We replace some attention layers with their parallelized versions, allowing us to run distributed training on different devices. Code modifications are required, and data loading and model definition are different. Fortunately, this is part of Optimum Neuron. You won't have to change anything; you just need to specify the level of tensor parallelism you want. This is very cool and we'll see it in the demo. The second technique is pipeline parallelism. Pipeline parallelism focuses on splitting the model at the layer level. Different parts of the model layers run on different devices. If you have very large models that wouldn't fit on an accelerator, you can split the model and run different layers on different devices. The challenge is orchestrating the forward and backward propagation of data across these layers on different devices. This technique was first introduced by GPipe. We split or partition the model at the layer level, not necessarily one layer per device, and distribute these layer chunks across compute devices. The big challenge is orchestration to keep all devices as busy as possible. We take the mini-batches in your training code and split them further into micro-batches. Let's look at an example. Suppose we split a mini-batch into eight micro-batches and have four devices. Initially, we forward propagate micro-batches one through four sequentially through the first device, then the second, and so on. Once micro-batch one is fully forward propagated, we start backward propagating it. While doing this, we continue forward propagating micro-batches two and three. Once micro-batch one is fully backward propagated, we start with micro-batch five, and so on. This keeps all devices busy, maximizing hardware usage. Once we finish the mini-batch, we sync the results before starting the next mini-batch. This technique can theoretically scale to very large models by adding more devices, though communication overhead can become an issue.
This is a very cool technique, and we'll see it in action soon. The third technique is sequence parallelism. Tensor parallelism and pipeline parallelism help scale compute, but they don't solve the memory usage problem in the last layers of attention, especially when computing activations. At the end, you still need to concatenate results, and each core will have full activation values, leading to high memory usage. Sequence parallelism addresses this by splitting the input sequence. For example, if you have 2K tokens and four devices, you split the tokens into four subsequences. Each subsequence is processed on a device, requiring memory only for that subsequence in the final layers. This allows scaling to longer sequences and contexts. Synchronization ensures the final output is produced correctly. The cores implement a distributed algorithm called ring self-attention, which I won't go into detail about, but you can read the paper if interested. This technique helps reduce memory usage and allows processing longer sequences. In theory, you could process infinitely long sequences by adding more cores, but practical limits exist due to communication overhead. Now, let's see how to implement these techniques easily with Optimum Neuron. Optimum Neuron is our hardware acceleration library for Neuron devices, including Trainium and Inferentia. Using it is simple: install Optimum Neuron or use our AMI on AWS, and import the Neuron Trainer as Trainer. Your Transformers code should work unchanged. Models need to be compiled and optimized for Neuron devices, which is a one-time operation, and the output files (NEFF, Neuron Executable File Format) can be cached on the Hugging Face Hub. Before diving into tensor parallelism and sequence parallelism, let's discuss the zero redundancy optimizer (ZeRO) technique, which is popular and implemented in Optimum Neuron.
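The memory pressure ZeRO attacks comes mostly from the optimizer state, and a toy accounting for Adam makes the scale concrete. This is a back-of-the-envelope sketch with illustrative sizes, not Optimum Neuron code; it assumes the standard two fp32 moment tensors Adam keeps per parameter.

```python
# Toy memory accounting for Adam optimizer state. Without sharding, every
# worker holds the full state; ZeRO-1 gives each of N workers a 1/N shard.
def adam_state_bytes(num_params: int, num_workers: int = 1) -> int:
    full = 2 * 4 * num_params  # exp_avg + exp_avg_sq, 4 bytes (fp32) each
    return full // num_workers

params = 7_000_000_000                       # roughly a Llama 2 7B model
print(adam_state_bytes(params) / 2**30)      # ~52 GiB replicated per worker
print(adam_state_bytes(params, 32) / 2**30)  # ~1.6 GiB per worker, sharded 32 ways
```

For a 7B-parameter model, the two Adam moments alone are on the order of 52 GiB in fp32, which is why replicating them on every worker is so costly and why sharding them, as ZeRO does, matters.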
ZeRO is a data parallelism technique where the model is loaded across a cluster of compute nodes, with a full copy on each node. We run different batches across nodes and merge results. This requires 4x the model size memory to hold the model, gradients, and optimizer state. ZeRO addresses this by partitioning the optimizer state across nodes, reducing memory usage. ZeRO comes in different flavors: ZeRO-1 partitions the optimizer state, ZeRO-2 partitions gradients, and ZeRO-3 partitions parameters. Optimum Neuron currently supports ZeRO-1, which automatically partitions the optimizer state across neuron cores. Enabling ZeRO-1 is simple: set `zero1=True` in your training args or use `--zero1` in your training scripts. Tensor parallelism is supported in Optimum Neuron and is based on the NeuronX Distributed library. When you load a model for training with tensor parallelism, Optimum Neuron automatically transforms the layers into parallel layers. This includes implementations for fully connected layers, cross-entropy, embeddings, and self-attention. Enabling tensor parallelism is as simple as setting the `tensor_parallel_size` parameter in your training args or scripts. The ideal value is the smallest that helps you fit the model on your devices. Sequence parallelism is on by default in Optimum Neuron, but you can disable it if needed. Now, let's see this in action. You don't need to set up anything; you can use pre-configured AMIs. I'll use the Hugging Face AMI, which comes with Optimum Neuron and all Hugging Face libraries. We'll launch a Trainium instance and run the demos. Starting from the EC2 console, we'll select the Hugging Face AMI, choose a large instance size (32xlarge), and select trn1n for enhanced chip-to-chip communication. We'll create a security group, add more storage, and use spot instances to minimize costs. Once the instance is running, we can SSH into it and run neuron-ls to see the devices. We'll clone the Optimum Neuron repo for the examples.
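Put together, the setup and launch steps just described look roughly like the fragment below. This is a sketch, not the exact commands from the video: the example script name and flag spellings are assumptions based on the flags mentioned above, so check the Optimum Neuron distributed training guide for the current interface.

```shell
# List the Neuron devices on the instance (16 chips / 32 cores on trn1n.32xlarge)
neuron-ls

# Grab the training examples shipped with Optimum Neuron
git clone https://github.com/huggingface/optimum-neuron.git

# Workaround for a malloc bug, as noted in the demo
export MALLOC_ARENA_MAX=2

# Illustrative launch: 32 Neuron cores, ZeRO-1, 8-way tensor parallelism.
# Script path and flag names are assumptions; consult the docs before use.
torchrun --nproc_per_node=32 run_clm.py \
  --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
  --zero1 \
  --tensor_parallel_size 8 \
  --output_dir ./llama2-7b-trainium
```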
For the first demo, we'll fine-tune TinyLlama on the Wikitext dataset. We'll start with vanilla distributed training and then add ZeRO-1 and tensor parallelism. The environment variable `MALLOC_ARENA_MAX=2` is used to work around a malloc bug. The first run on this instance will fetch pre-compiled model files from the Hugging Face Hub, avoiding compilation. We'll see the Neuron devices in action and monitor memory usage. Enabling ZeRO-1 and tensor parallelism is as simple as adding the corresponding parameters to the command line. We'll compare the training time and memory usage with different configurations. For the second demo, we'll train Llama 2 7B. The code is similar, but we'll use all 32 cores and enable eight-way tensor parallelism. This should help fit the model on the instance and accelerate training. We'll see the model artifacts from the cache, avoiding compilation, and monitor the training process. Training should take about 22-23 minutes, and the cost will be under $5 using spot pricing. In summary, fine-tuning Llama 2 7B for $5 is possible with the right techniques and tools. I hope you found this video helpful. If you did, please give it a thumbs up, subscribe to the channel, enable notifications, and share the video with your colleagues and friends. There's more coming on Optimum Neuron and model acceleration in general. I'll see you in the next video. Keep rocking.

Tags

Distributed Training, NeuronX Distributed, Optimum Neuron, Tensor Parallelism, Sequence Parallelism