Deep Dive: Quantizing Large Language Models, Part 1

March 06, 2024
Quantization is an excellent technique to compress Large Language Models (LLM) and accelerate their inference. In this video, we discuss model quantization, first introducing what it is, and how to get an intuition of rescaling and the problems it creates. Then we introduce the different types of quantization: dynamic post-training quantization, static post-training quantization, and quantization-aware training. Finally, we start looking at and comparing actual quantization techniques: PyTorch, ZeroQuant, and bitsandbytes. In part 2 https://youtu.be/fXBBwCIA0Ds, we look at and compare more advanced quantization techniques: SmoothQuant, GPTQ, AWQ, HQQ, and the Hugging Face Optimum Intel library based on Intel Neural Compressor and Intel OpenVINO. Slides: https://fr.slideshare.net/slideshow/julien-simon-deep-dive-quantizing-llms/270921785 ⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️ 00:00 Introduction 02:05 What is quantization? 06:50 Rescaling weights and activations 08:17 The mapping function 12:38 Picking the input range 16:15 Getting rid of outliers 19:50 When can we apply quantization? 26:00 Dynamic post-training quantization with PyTorch 28:42 ZeroQuant 34:50 bitsandbytes

Transcript

Hi everybody, this is Julien from Hugging Face. In previous videos, we looked at different techniques to optimize and accelerate large language models, like attention layers, model compilation, and hardware acceleration. In this video, we're going to look at a very important technique called model quantization, which is critical in shrinking and accelerating large models. In fact, I'm going to do two videos because there is a ton of material to cover. In this first video, I'm going to introduce what quantization is, give you an intuition of how it works, and start exploring the different types of quantization, such as post-training static or dynamic quantization, and quantization-aware training. We'll also look at some of the algorithms available and how they work with Transformers. In part two, I'll keep exploring more algorithms, including the latest and greatest bleeding-edge techniques. If you're interested in quantization, you definitely do not want to miss these two videos. Let's get started.

If you like this video, please give it a thumbs up and consider subscribing to my channel. Don't forget to enable notifications so you won't miss future videos. Also, why not share this video on your social networks or with your colleagues? If you found it useful, others may find it useful too. Thank you very much for your support.

As their name implies, large language models are large, and there's always a need to shrink them so they fit into less memory. There's also a need to accelerate them, particularly to speed up inference. In past videos, I've covered different techniques like new attention layers, faster attention layers, hardware acceleration, and model compilation. Today, we're going to focus on another framework-level feature: quantization.

First, let's define what quantization is. As we all know, model weights or parameters are learned during training or fine-tuning and are stored as numerical values. The common data types for these values are floating-point formats, such as floating-point 32 (FP32), which is 32 bits or 4 bytes, and FP16, which is 16 bits or 2 bytes. There's also a variant called BF16, which is still 16 bits but allocates more bits to the exponent, giving it the same dynamic range as FP32 at the cost of some fraction precision. I won't go too deep into numerical formats, but FP32 uses a sign bit, 8 bits for the exponent, and 23 bits for the fraction. The formats have different lengths and split their bits differently, but that's the general idea.

The larger the data type we use to store model parameters, the finer the granularity, and the more precisely we can represent values that are just a little bit different. More bits mean more precision, which helps the models be more accurate. However, this increased precision comes at a cost: more memory is needed to store the model parameters, and more memory bandwidth is required to move those parameters between the GPU and the memory where they're stored. We covered this problem when discussing attention layers, flash attention, etc. Larger models also require more compute, and inference will be slower. While this isn't a big deal for models of reasonable size, large models are getting larger, and as we add more parameters, increase sequence length, and increase the dimension of embeddings, performance can slow down significantly.

The purpose of quantization is to shrink the model by rescaling the weights from a high-precision format to a lower-precision format, using fewer bits to store the parameters; the quick back-of-the-envelope numbers below show how much this saves.
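To put numbers on the storage argument above, here is a minimal back-of-the-envelope sketch (not from the video) that computes the memory needed just to store the weights of a model at different precisions; the 7-billion-parameter size is an illustrative assumption.

```python
# Rough memory footprint of the weights alone, ignoring activations,
# the KV cache, and framework overhead. The 7B parameter count is an
# illustrative assumption, not a model discussed in the video.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

num_params = 7_000_000_000  # a typical "7B" LLM

for fmt, nbytes in BYTES_PER_PARAM.items():
    gib = num_params * nbytes / 1024**3
    print(f"{fmt:>9}: {gib:5.1f} GiB")

# Approximate output:
#      FP32:  26.1 GiB
# FP16/BF16:  13.0 GiB
#      INT8:   6.5 GiB
#      INT4:   3.3 GiB
```

Going from FP32 to int8 divides the weight storage by four, which is exactly the kind of saving quantization targets.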
Quantization reduces the memory required to store the model, reduces the pressure on memory accesses between the GPU and memory, and reduces the compute required, because 8-bit arithmetic is faster than 32-bit arithmetic. We aim to do all this with minimal loss of accuracy, if any. Typically, quantization uses integer formats, such as int8 (8-bit) or int4 (4-bit). Newer techniques can even go below 4 bits, which is impressive. The goal is to start with a model trained in 32-bit or 16-bit mode and automatically rescale its parameters to 8 bits or even less.

The big question is how we map values from a 32-bit space to an 8-bit space. To illustrate, consider a one-dimensional example: weight values spread across the FP32 range, which we need to rescale to the range representable with 8-bit integers, i.e. [-128, 127]. We need to cram all those high-precision values into a much smaller range, and the more clever we get about it, the less accuracy we lose. One simple approach is to use the smallest and largest values in the input range as the bounds for rescaling (this mapping is sketched in code a few paragraphs below). However, this is sensitive to outliers, and it can waste numerical space in the quantized range, squeezing the central values and reducing granularity: different values in the input range may get mapped to identical values in the quantized range. To minimize this, we can use techniques like percentiles or histogram bins to group weights that are close to one another and pack them more efficiently. Another approach is to eliminate outliers by choosing more central values for the input range bounds. This reduces wasted space and spreads the quantized values more evenly, but it can cause accuracy loss if too many outliers are dropped. The goal is to minimize the information loss between the input distribution of weights and the quantized distribution. We can do this by trying different thresholds and using a calibration dataset to observe the quantized distribution, minimizing the difference between the two distributions with the Kullback-Leibler (KL) divergence.

Before diving into quantization libraries and algorithms, we need to discuss when quantization can be applied. The simplest way is post-training quantization, where we take a trained model and quantize its weights. There are two flavors: post-training dynamic quantization and post-training static quantization. In dynamic quantization, we load a trained model, convert the weights, and quantize activations on the fly before running computations. This is simple and flexible, but it adds some runtime overhead, making the model slightly slower. In static quantization, we load the model, convert the weights, and use a calibration dataset to set the scaling factor for activations. This is done ahead of time, which eliminates the runtime overhead, but it can be sensitive to the dataset used for calibration. The third technique is quantization-aware training, where we apply quantization during the training process: we train at full precision (e.g., FP32 or FP16) and apply quantization in parallel to observe the distributions and set the right scaling factors. This results in higher-quality models but requires retraining, which can be expensive for large models.
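To make the rescaling discussion concrete, here is a minimal sketch of an affine (scale and zero-point) int8 mapping in NumPy, with an optional percentile-based clipping of outliers. It is a toy illustration of the ideas above, not the implementation of any particular library, and the function names and percentile value are my own choices.

```python
import numpy as np

def quantize_int8(x, qmin=-128, qmax=127, clip_percentile=None):
    """Minimal affine (scale + zero-point) quantization to int8.

    With clip_percentile=None, the bounds are the min/max of x, which is
    sensitive to outliers. With e.g. clip_percentile=99.9, the bounds come
    from percentiles, trading a little clipping error on the outliers for
    finer granularity on the central values.
    """
    if clip_percentile is None:
        lo, hi = float(x.min()), float(x.max())
    else:
        lo = float(np.percentile(x, 100.0 - clip_percentile))
        hi = float(np.percentile(x, clip_percentile))

    scale = (hi - lo) / (qmax - qmin)              # step size of the int8 grid
    zero_point = int(np.round(qmin - lo / scale))  # int8 code representing 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Reverse mapping back to float; rounding and clipping errors remain."""
    return (q.astype(np.float32) - zero_point) * scale

# Toy "weights": a narrow Gaussian plus one large outlier.
rng = np.random.default_rng(0)
w = np.concatenate([rng.normal(0.0, 0.02, 10_000), [2.0]]).astype(np.float32)

for pct in (None, 99.9):
    q, s, z = quantize_int8(w, clip_percentile=pct)
    err = np.abs(w - dequantize(q, s, z)).mean()
    print(f"clip_percentile={pct}: scale={s:.6f}, mean abs error={err:.6f}")
```

With min/max bounds, the single outlier stretches the scale and every small weight loses granularity; with percentile clipping, the outlier is clipped but the bulk of the weights are represented much more finely, which is exactly the trade-off described above.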
For LLMs, post-training quantization is generally preferred, while quantization-aware training is more suitable for smaller models or non-Transformer models like CNNs. When predicting with quantized models, we also need to dequantize the output to return to the original data type. Dequantization is the reverse operation, mapping values from the small quantized range back to the original range.

Let's look at a quick example using PyTorch. We load a model, such as BERT, and apply dynamic quantization to the linear layers, quantizing them to 8 bits. This process is fast: it reduces the model size by 58% and inference latency by 31%, with an accuracy degradation of under 1% on the MRPC task. The linear layers are replaced by their quantized equivalents, and their data type is int8.

Over time, various algorithms and techniques have been introduced to improve quantization. One interesting technique is ZeroQuant, which is optimized for LLMs. ZeroQuant is dynamic post-training quantization that can handle int8 weights and activations, or int4 weights with int8 activations. It addresses the variability of weight and activation ranges across layers by using group-wise quantization for weights and token-wise quantization for activations. This approach finds the right scaling factor for each group, leading to better performance and hardware efficiency.

Another technique is bitsandbytes, which is also dynamic post-training quantization, for 8-bit and 4-bit quantization of weights. It also introduces 8-bit optimizers to save GPU memory during training, allowing larger models to be trained. bitsandbytes uses vector-wise quantization and mixed-precision decomposition to handle outliers efficiently, providing good compression and precision for most parameters while keeping better precision for the outliers. This results in about 2x memory savings, on-par accuracy with FP16, and speedups for larger models, though smaller models may be slower due to the overhead of mixed-precision decomposition. bitsandbytes is well integrated into Hugging Face libraries like Transformers and Accelerate, making it easy to use: simply set the `load_in_8bit` or `load_in_4bit` parameter to `True` when loading a model. Both the PyTorch example and this one are sketched in code after the transcript.

That's the end of part one. Go and digest all of this, read the papers, and when you're ready for part two, you know where to find it. We'll talk about GPTQ, AWQ, HQQ, and SmoothQuant, which are all very interesting quantization techniques. Until next time, keep rocking!
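For reference, here is a minimal sketch of the two code paths mentioned in the transcript: PyTorch dynamic post-training quantization applied to a BERT model, and 8-bit loading through the bitsandbytes integration in Transformers. The model names and the sequence-classification head are illustrative choices, and the 8-bit path assumes a CUDA GPU with the bitsandbytes package installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

# 1) Dynamic post-training quantization with PyTorch: nn.Linear layers are
#    replaced by int8 equivalents, and activations are quantized on the fly.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# Inspecting a layer shows the quantized replacement, e.g. something like:
# DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, ...)
print(quantized_model.bert.encoder.layer[0].attention.self.query)

# 2) bitsandbytes 8-bit loading through Transformers: pass load_in_8bit=True
#    (or load_in_4bit=True) when loading the model. Requires a CUDA GPU and
#    the bitsandbytes package; the model name below is illustrative.
model_8bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", device_map="auto", load_in_8bit=True
)
print(model_8bit.get_memory_footprint())  # roughly half the FP16 footprint
```

The dynamic quantization call completes in seconds, which is why it is such an attractive first step before moving on to the more involved techniques covered in part two.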

Tags

ModelQuantization, PostTrainingQuantization, QuantizationAwareTraining, DynamicQuantization, StaticQuantization