Deep Dive Hugging Face models on AWS AI Accelerators
January 16, 2024
Explore the technical intricacies of optimizing Hugging Face models on AWS accelerators in this detailed walkthrough, possibly the most complete and most up-to-date available today.
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️
This video focuses on the hardware and software details essential for achieving peak performance. Access relevant code snippets and developer resources, suitable for both newcomers and experienced professionals. Whether you're familiar with Trainium and Inferentia2 or approaching these technologies for the first time, this technical walkthrough ensures your readiness for success in deploying Hugging Face models on AWS.
Dive into all key components!
00:00 Introduction
05:00 AWS NeuronCore-v2
10:30 AWS Trainium
13:48 AWS Inferentia2
16:25 Amazon EC2 Trn1
20:12 Amazon EC2 Inf2
23:20 AWS Neuron SDK
30:00 AWS Neuronx Distributed
35:25 AWS Transformers Neuronx
41:41 Hugging Face Optimum Neuron training and inference
Links:
- AWS Neuron SDK: https://awsdocs-neuron.readthedocs-hosted.com/
- Hugging Face Optimum Neuron: https://huggingface.co/docs/optimum-neuron/index
Transcript
Hi everybody, this is Julien from Hugging Face. In the last few months, I've done quite a few videos on accelerating Hugging Face models on AWS accelerators, Trainium and Inferentia 2. We've been moving very fast with AWS, and a lot of things have changed in the last few months, weeks, maybe even days. It's not easy to get a clear picture of what's happening. The information is a little scattered across GitHub repos, websites, etc. So I thought, why don't we start 2024 with a complete, up-to-date overview of the tech stack, from the chips to the AWS libraries to the Hugging Face libraries. If you're completely new to Trainium and Inferentia, this is a really good place to start. You won't waste your time chasing information that's hard to find or outdated. And if you've been working with Trainium and Inferentia, you'll certainly learn what we've been up to in the last few weeks.
Okay, so let's get started, and I'll share all the links in the video description. If you like the video, please don't forget to subscribe, give it a thumbs up, enable notifications, tell your colleagues, friends, or even your cat or dog. Let's bring this to everybody because these are really cool chips.
Let's take a first look at the stack. What are all the moving parts involved? As the slide says, this is not an architecture diagram, so don't think it shows the exact relationships or dependencies between the building blocks. It's a loosely organized collection of technical elements, but hopefully the abstraction levels are in the right place. At the bottom, we have the chips: AWS Trainium for training and AWS Inferentia 2 for inference. They are very similar, and in particular they share the same compute element, the Neuron Core. We'll focus on Neuron Core V2. Neuron Core V1 was only used in Inferentia 1, which is not in scope for this discussion.
On top of this, we have EC2 instances: Trn1 and the Trn1n variant, and Inf2. These are good old EC2 instances that include either Trainium or Inferentia 2 chips. Moving up, we have the Neuron SDK, designed by AWS to work with the Trainium and Inferentia chips. We'll see what's in that box. We also have the deep learning frameworks, PyTorch and TensorFlow, both supported by the Neuron SDK.
Going up, we have a couple of low-level libraries, which are still fairly new, in beta, and iterating quickly. We'll see why they're important and why you should know about them. At the top, we have what I call the high-level libraries. These are user-friendly, easy to use, and generally Hugging Face libraries. We have Transformers and Diffusers, our well-known open-source libraries. If you're not using them, it's time to start. And at the very top, we have Optimum Neuron. Our Optimum family of open-source libraries is dedicated to hardware acceleration, and Optimum Neuron is dedicated to Neuron devices, meaning Trainium and Inferentia.
This is the stack. We're going to cover every block here, except PyTorch and TensorFlow. We'll zoom in, look at the features, where they fit, and why you should care. Let's get started. We'll go bottom-up and start with the Neuron Core V2. If I say core, I really mean Neuron Core, not CPU cores. I'll try to say Neuron Core, but if I say core, you'll know it's the same thing.
The Neuron Core V2 is a hardware design by AWS. It's a compute unit designed to accelerate deep learning workloads. The Neuron Core is at the core of both chips, Trainium and Inferentia 2. Both accelerators are based on Neuron Core V2, with a few differences we'll discuss. You need a software development kit to work with the Neuron Core, and this is called the Neuron SDK. This is what the Neuron Core V2 looks like. It has five main components. We're not hardware engineers, so we won't dive too deep, but it's important to understand what these are.
We have static RAM (SRAM), extremely fast local storage for on-chip operations. We have a scalar engine for processing individual numbers. Not everything is a vector or a tensor, and we need a scalar engine for that. We have a vector engine for typical vector operations like add, multiply, multiply-and-add, normalization, pooling, etc. And we have a tensor engine for general-purpose matrix multiplication, convolution, and all the deep learning operators commonly found in our favorite models.
We also have the GPSIMD engine, which stands for general-purpose single instruction, multiple data. It consists of eight general-purpose processors that can run custom code. This is useful if your models have custom operators or if you need to implement optimized versions of existing operators. You can also implement brand-new operators for extensibility. Customizability is very important. If you want to know more, there's a good section in the Neuron SDK documentation.
Data types are crucial. There's a long list: 8-bit, 16-bit, 32-bit, floating points, integers. We've got you covered. Data types are critical for different use cases, especially with quantization. There are additional features, but we won't go too deep. If you want hardcore details, check the Neuron SDK documentation about control flow, dynamic shapes, and rounding modes, which are important when working with smaller data types.
In a nutshell, this is a custom design by AWS. It powers both training and inference, with hardware acceleration across multiple data types and the ability to run custom code through its general-purpose processors. Now, let's look at Trainium and where it fits. Trainium is a full chip designed by AWS to accelerate training, though it can still run inference, for example for evaluation during the training loop. However, Inferentia 2 is probably more cost-effective for pure inference.
This is what Trainium looks like. We have two Neuron Cores and 32 gigabytes of high-bandwidth memory (HBM). The exact flavor of HBM is not specified, but I suspect it's HBM2e. We need very fast communication with the host, so we use PCIe and DMA. We also need fast communication between Trainium devices for distributed training, which is what NeuronLink V2 is for. The Trn1n variant has more NeuronLink capacity and ports, allowing for quicker communication between devices.
If you're interested in performance numbers, there's a page in the Neuron SDK documentation. It's a bit buried, but here's the link. Trainium2 has been announced, with 3x the memory (96 gigabytes, likely HBM3) and up to 4x faster training. We'll have to verify once we get the chips.
Trainium is a dedicated accelerator for training, with two Neuron Cores per device, fast memory, and fast communication. Now, what about Inferentia 2? Inferentia 2 is designed to accelerate inference. It's almost the same as Trainium, with two Neuron Cores and 32 gigabytes of HBM. The main difference is less chip-to-chip communication capacity, which saves cost. Inference is more sequential, so less bandwidth is needed.
Inference cost is crucial. Customers spend 90% or more of their AI budget on inference. Once a model is in production, it stays there 24/7, so cost is very important. Removing unnecessary features while delivering great performance makes sense. There's a performance page with latency, throughput, and cost per million inferences for supported models.
Trainium devices are available in two EC2 families, Trn1 and Trn1n. Trn1 comes in two sizes: trn1.2xlarge has one device (two Neuron Cores), and trn1.32xlarge has 16 devices. trn1n.32xlarge also has 16 devices but with more Ethernet bandwidth through EFA. For large distributed training jobs, trn1n.32xlarge is the more efficient option. trn1.2xlarge is great for experimentation, trn1.32xlarge is good for larger models, and trn1n.32xlarge is for scaling out. These instances are available in three US regions, but not in Europe yet.
When it comes to SageMaker, you can use Trn1 and Trn1n for training jobs but not as notebook instances or Studio environments. For interactive experiments, launch an EC2 instance and connect remotely to a Jupyter server running there. The Trainium devices in an instance are interconnected through NeuronLink in a two-dimensional torus. This configuration allows for distributed training across 16 devices, and you can cluster Trn1 or Trn1n instances to scale out further.
Inf2 comes in four sizes: inf2.xlarge with one device, inf2.8xlarge with one device but more host memory, inf2.24xlarge with six devices, and inf2.48xlarge with 12 devices. inf2.xlarge is cost-effective for experimentation or small models, inf2.8xlarge is a bit bigger, and inf2.24xlarge and inf2.48xlarge offer more devices. You can shard a large model across devices or run multiple models on the same instance. Unlike Trn1, Inf2 is available outside the US, including Ireland, Germany, and Asia-Pacific.
When it comes to SageMaker, you can use Inf2 for endpoints but not as notebook instances or Studio environments.
The Neuron SDK includes a compiler called neuronx-cc, which optimizes vanilla deep learning models for Neuron devices. You don't need a Neuron-powered instance to compile; a fast compute-optimized instance such as a C7 instance works better. Compilation can take a while, so pre-compiling models on the fastest instance you can get is a good strategy. Compiled models are saved as NEFF files and can be reused. We've implemented NEFF caching on the Hugging Face Hub in Optimum Neuron, allowing you to push and fetch pre-compiled models.
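To make this concrete, here's a minimal compilation sketch using the torch-neuronx tracing API, assuming torch-neuronx and transformers are installed (for example on a trn1/inf2 instance or the Hugging Face Neuron AMI); the model ID, sequence length, and file name are illustrative choices, not the only way to do it.

```python
import torch
import torch_neuronx  # AWS Neuron PyTorch integration
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True)
model.eval()

# The Neuron compiler works on static shapes, so pad to a fixed sequence length.
inputs = tokenizer("Neuron compilation example", padding="max_length",
                   max_length=128, return_tensors="pt")
example = (inputs["input_ids"], inputs["attention_mask"])

# Trace and compile for Neuron; the returned TorchScript module embeds the
# compiled NEFF artifact, so it can be saved and reloaded without recompiling.
neuron_model = torch_neuronx.trace(model, example)
torch.jit.save(neuron_model, "distilbert_neuron.pt")
```

Optimum Neuron wraps this whole flow, plus the NEFF caching mentioned above, behind a single export call, as shown later in the Optimum Neuron section.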
A recent feature is weight decoupling, where you can compile an architecture and use multiple checkpoints without recompiling. The big question is, what models are supported? The list in the Neuron SDK documentation is out of date, so check the release notes for the latest supported models.
The Neuron SDK runtime loads NEFF files on Neuron devices and integrates with PyTorch and TensorFlow through XLA. Monitoring tools like neuron-ls and neuron-top let you list and monitor Neuron devices, and a recent addition is a profiling capability to help you optimize performance.
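As a tiny illustration of the XLA integration, here's a sketch assuming torch-neuronx (which pulls in torch-xla) is installed on a Trn1 or Inf2 instance, where the XLA device maps to a Neuron Core:

```python
import torch
import torch_xla.core.xla_model as xm  # installed alongside torch-neuronx

# On a Neuron instance, xla_device() resolves to a Neuron Core.
device = xm.xla_device()

model = torch.nn.Linear(128, 64).to(device)
x = torch.randn(8, 128).to(device)
y = model(x)

# XLA execution is lazy: mark_step() compiles and runs the accumulated graph.
xm.mark_step()
print(y.shape)
```

While something like this runs, neuron-top on the same instance shows the Neuron Cores and device memory in use.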
NeuronX Distributed is a new library for distributed training and inference on Trainium and Inferentia 2. It implements advanced techniques like tensor parallelism, pipeline parallelism, and sequence parallelism. These techniques are quite advanced and require changes to your training code, so we integrate them into Optimum Neuron to simplify usage.
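To give an intuition for what tensor parallelism does (this is a plain, single-process PyTorch illustration of the idea, not the neuronx-distributed API), here's a column-parallel linear layer where each weight shard would live on its own Neuron Core and the concatenation would be a collective over NeuronLink:

```python
import torch
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """Split a linear layer's output columns across `world_size` shards."""
    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert out_features % world_size == 0
        shard_size = out_features // world_size
        # In a real setup, each shard lives on a different Neuron Core.
        self.shards = nn.ModuleList(
            nn.Linear(in_features, shard_size, bias=False) for _ in range(world_size)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The concatenation stands in for an all-gather collective.
        return torch.cat([shard(x) for shard in self.shards], dim=-1)

# Check that the sharded layer matches a full layer holding the same weights.
full = nn.Linear(512, 2048, bias=False)
tp = ColumnParallelLinear(512, 2048, world_size=4)
with torch.no_grad():
    for i, shard in enumerate(tp.shards):
        shard.weight.copy_(full.weight[i * 512:(i + 1) * 512])

x = torch.randn(2, 512)
assert torch.allclose(full(x), tp(x), atol=1e-5)
```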
Transformers NeuronX is about inference, specifically LLM inference, re-implementing Neuron-optimized versions of popular LLMs. It's checkpoint-compatible with Transformers, so you can train or fine-tune a model with Transformers and use it for inference with Transformers NeuronX.
Optimum Neuron is our open-source library for hardware acceleration on Neuron devices, supporting both training and inference. It simplifies using Neuron devices with Hugging Face models. For training, you import the NeuronTrainer object, and it sets up everything for Trainium. For inference, you use the Optimum CLI to export and optimize models, and updating your code from Transformers to Optimum Neuron is straightforward.
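As a rough sketch of the inference path (class and parameter names are as I recall them from the Optimum Neuron documentation, so double-check export=True, batch_size, and sequence_length against the current docs), exporting and running a Transformers checkpoint looks roughly like this:

```python
from optimum.neuron import NeuronModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True compiles the model for Neuron; static input shapes are required.
model = NeuronModelForSequenceClassification.from_pretrained(
    model_id, export=True, batch_size=1, sequence_length=128
)
model.save_pretrained("distilbert_neuron/")  # reusable compiled artifacts

# Running the model requires a Neuron device (for example an Inf2 instance).
inputs = tokenizer("Optimum Neuron makes this easy", padding="max_length",
                   max_length=128, return_tensors="pt")
logits = model(**inputs).logits
```

On the training side, the change is similarly small: replace the Transformers Trainer with the NeuronTrainer and run the script on a Trn1 instance.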
AWS provides Deep Learning AMIs and containers for PyTorch and TensorFlow, and we have our own Hugging Face Neuron Deep Learning AMI with all the necessary tools and libraries.
Some developer resources: the Neuron SDK documentation, AWS repos with notebooks and code samples, and the Optimum Neuron documentation. Check out Philipp Schmid's blog for the best SageMaker notebooks for Trainium and Inferentia.
That's it for today. This is a comprehensive overview of the stack. If you work with Optimum Neuron, you're leveraging all the hardware and software technology we've built with AWS. Start with Optimum Neuron for the fastest, simplest way to accelerate your Hugging Face workloads on AWS accelerators. If you're curious, explore the lower-level elements, but they are more complex.
Thanks for watching, and I'll see you soon. Keep rocking!