Quantization is an excellent technique to compress Large Language Models (LLM) and accelerate their inference.
Following up on part 1 https://youtu.be/kw7S-3s50uk, we look at and compare more advanced quantization techniques: SmoothQuant, GPTQ, AWQ, HQQ, and the Hugging Face Optimum Intel library based on Intel Neural Compressor and Intel OpenVINO.
Slides: https://fr.slideshare.net/slideshow/julien-simon-deep-dive-quantizing-llms/270921785
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️
00:00 Introduction
00:55 SmoothQuant
07:00 Group-wise Precision Tuning Quantization (GPTQ)
12:35 Activation-aware Weight Quantization (AWQ)
18:10 Half-Quadratic Quantization (HQQ)
23:15 Optimum Intel
25:45 Accelerating Stable Diffusion with Intel OpenVINO
Transcript
Hi everybody, this is Julien from Hugging Face. This is part 2 of our quantization expedition, so if you did not watch part 1, I recommend that you do that, because I will not reintroduce the basic concepts. If you did watch it, then in this video we're going to keep exploring quantization techniques: SmoothQuant, GPTQ, AWQ, HQQ, and the quantization techniques available for Intel platforms in our own Optimum Intel Library. Let's get started. If you like this video, please give it a thumbs up and consider subscribing to my channel. And don't forget to enable notifications so that you won't miss future videos. Also, why not share this video on your social networks or with your colleagues, because if you found this useful, others may find it useful too. Thank you very much for your support. I will stay off screen for this because the slides are a little busy and I don't want my face to mask some of the content.
SmoothQuant is a post-training quantization technique that you can apply dynamically or statically with a calibration dataset. One of the key findings in SmoothQuant is that activations are much harder to quantize than weights because they display a much larger numerical range. To solve this problem, SmoothQuant introduces the idea of migrating some of the quantization difficulty to the weight layers. Instead of trying hard to quantize huge ranges of activations, can we tweak the weights to reduce that range of activations? This is exactly what they do. They rescale weights so that the range and magnitude of the activations are reduced, making them easier to quantize.
In the paper, they show this with an example. On the left, you see the activation values, with the x-axis representing the channels (different dimensions) and the y-axis showing the actual activation values. Some channels have extreme magnitudes, making it difficult to quantize them meaningfully. To solve this, they apply a per-channel rescaling formula to the weights so that the activation values end up in much more reasonable ranges and become easier to quantize, while the weights take on slightly larger ranges that are still far easier to handle than the original activations. This is a divide and conquer approach, spreading out the difficulty instead of trying to quantize those impossible activations.
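To make the rescaling idea concrete, here is a minimal NumPy sketch of the per-channel smoothing step described in the SmoothQuant paper. The shapes, the random toy data, and the alpha value are illustrative assumptions, not the authors' code; only the smoothing formula (s_j = max|X_j|^alpha / max|W_j|^(1-alpha)) comes from the paper.

```python
import numpy as np

# Toy setup: X is a batch of activations (tokens x channels),
# W is the weight matrix of the following linear layer (channels x out_features).
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 512)) * rng.uniform(0.1, 50.0, size=(1, 512))  # a few channels have huge ranges
W = rng.normal(size=(512, 256))

alpha = 0.5  # migration strength; 0.5 is the default suggested in the paper

# Per-channel smoothing factor: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
act_max = np.abs(X).max(axis=0)        # per input channel of the activations
weight_max = np.abs(W).max(axis=1)     # per input channel of the weights
s = act_max**alpha / weight_max**(1.0 - alpha)

# Mathematically equivalent layer: (X / s) @ (s * W) == X @ W
X_smooth = X / s
W_smooth = W * s[:, None]
assert np.allclose(X_smooth @ W_smooth, X @ W)

# The smoothed activations have a much smaller dynamic range,
# so quantizing both X_smooth and W_smooth to INT8 loses less information.
print("activation range before:", np.abs(X).max(), "after:", np.abs(X_smooth).max())
```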
They apply this to the self-attention and linear layers, then quantize the linear layers to 8-bit integers. The result is that SmoothQuant is as accurate as bitsandbytes, which we covered in the previous video, but it is much faster. Here's a benchmark on very large models: OPT-175B, BLOOM-176B, and GLM-130B. The FP16 baseline is highlighted in blue, and the SmoothQuant results are shown next to it; O1 and O2 are different optimization levels in SmoothQuant. You can see that SmoothQuant matches the accuracy of the baseline, quantizing from FP16 to INT8 with essentially no accuracy degradation. It generally matches or outperforms bitsandbytes, referred to as LLM.int8() in the table, and it clearly outperforms ZeroQuant, which performs poorly on these large models, especially OPT-175B and GLM-130B; ZeroQuant does not scale to very large models. bitsandbytes does a good job on accuracy, but inference performance is a problem, and SmoothQuant matches its accuracy while being faster.
The gray bars represent FP16, and the reddish bars represent SmoothQuant. Even for smaller LLMs like OPT-13B, there is a significant speedup, and the speedup is even more noticeable for larger models. For OPT-175B, we only need four GPUs to load the quantized model instead of eight for the FP16 model. In terms of cost performance, this is excellent: the hardware cost is 2x lower, and we still get a speedup on top of it.
Next, we have Group-wise Precision Tuning Quantization, or GPTQ. GPTQ is a static post-training quantization technique that requires a calibration dataset, which it uses to measure how quantizing each weight affects a layer's output and to compensate for the resulting error. The key feature of GPTQ is that it pushes quantization to lower bit widths, allowing for 3-bit or even 2-bit precision. At its core, GPTQ is a heavily modified version of an algorithm called Optimal Brain Quantization (OBQ). The math is quite advanced, so I recommend reading the research paper for the details.
With 4-bit quantization, we get 4x memory savings compared to FP16, the same inference speed, and almost no accuracy degradation. For more extreme quantization, such as 3-bit, performance stays very close to the baseline for very large models like OPT-175B and BLOOM-176B. In terms of latency, 3-bit quantization of OPT-175B on an A100 GPU delivers a 3.25x speedup, and we can go from 5 GPUs to just 1, so roughly a 5x cost reduction on top of that 3.25x speedup. This shows the significant cost-performance benefits of GPTQ.
GPTQ relies on very fast kernels such as ExLlamaV2, available on GitHub, which are optimized for low-bit inference. GPTQ is integrated into our open-source libraries, making it easy to use: load your model, define a quantization config, and quantize it. GPTQ takes about four GPU-hours to quantize BLOOM-176B, but with a large multi-GPU instance, it can be done in under an hour. The cost-performance improvement at inference time is amazing.
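In practice, this is roughly what the Transformers integration looks like; a minimal sketch, assuming a recent transformers release with the optimum, auto-gptq, and accelerate packages installed, and using a small OPT checkpoint purely as a stand-in for the model you actually want to quantize.

```python
# pip install transformers optimum auto-gptq accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small model used here only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ quantization, calibrated on the built-in "c4" dataset
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization happens while the model loads: the calibration data is run
# through the model, which is why this step takes a while on large models.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

# Save the quantized model so the quantization cost is only paid once
model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")
```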
Next, we have Activation-aware Weight Quantization, or AWQ. AWQ is a static post-training quantization technique that requires calibration. The new insight in AWQ is that a tiny fraction of the weights has a critical influence on the generation process. These weights, called salient weights, are protected so that their information is preserved at high precision. During calibration, roughly 0.1% to 1% of the weight channels are identified as salient, based on the magnitude of the corresponding activations, and are left untouched.
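As a rough illustration of how salient channels can be picked out, here is a simplified NumPy sketch, not the exact criterion or code from the AWQ paper: rank input channels by their average activation magnitude on calibration data and protect the top fraction.

```python
import numpy as np

def find_salient_channels(calib_activations: np.ndarray, fraction: float = 0.01) -> np.ndarray:
    """Return the indices of the most activation-heavy input channels.

    calib_activations: (num_tokens, num_channels) activations collected on a calibration set.
    fraction: share of channels to protect, e.g. 0.01 for 1%.
    """
    # Average absolute activation per channel: the activation-side signal
    # AWQ uses to decide which weight channels matter most.
    channel_magnitude = np.abs(calib_activations).mean(axis=0)
    k = max(1, int(fraction * channel_magnitude.shape[0]))
    # Indices of the k channels with the largest average magnitude
    return np.argsort(channel_magnitude)[-k:]

# Example: protecting 1% of 4096 channels -> 40 channels whose weights
# would be kept at higher precision (or rescaled before quantization).
acts = np.random.default_rng(0).normal(size=(2048, 4096))
salient = find_salient_channels(acts, fraction=0.01)
print(len(salient), "salient channels out of", acts.shape[1])
```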
Comparing the FP16 baseline to the INT3 AWQ-quantized model, we see very close perplexity scores for the Llama and Llama 2 models. For example, Llama-2-70B has an FP16 perplexity of 3.32, and AWQ at 3-bit gives a perplexity of 3.74. At 4-bit, the perplexity is 3.41, very close to the baseline. Inference speed is also significantly improved, with about a 4x speedup on an NVIDIA RTX 4090 GPU. AWQ is robust across datasets and performs well on instruction-tuned models and multi-modal models. Transformers supports loading AWQ models, and many AWQ-quantized models are available on the Hugging Face Hub. AWQ is a very cool technique.
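Loading one of those pre-quantized checkpoints is a one-liner; a minimal sketch, assuming the autoawq and accelerate packages are installed and a GPU is available, and using one of TheBloke's Mistral AWQ repositories as an example of the many AWQ models on the Hub.

```python
# pip install transformers autoawq accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

# One of the many community-quantized AWQ checkpoints on the Hugging Face Hub
model_id = "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization is useful because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```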
Finally, we have Half-Quadratic Quantization, or HQQ. HQQ is a dynamic post-training quantization technique that does not require a calibration dataset. HQQ promises the same accuracy as static post-training quantization but without the calibration phase. It minimizes the quantization error directly on the weights, using a formulation that is robust to outliers. HQQ outperforms GPTQ and AWQ on most tests, especially on quantization time: on an A100, it takes just four minutes to quantize Llama-2-70B, compared to over four hours for GPTQ and about an hour and a half for AWQ.
In terms of accuracy, HQQ matches or outperforms GPTQ and AWQ in most scenarios. For example, 4-bit quantization of Llama-2-70B with HQQ gives a perplexity of 5.3, very close to the baseline of 5.8. Memory usage is about 3 to 4x lower, which makes the model much easier to load. At 3-bit, the perplexity is slightly worse but still acceptable, and memory usage is even lower. At 2-bit, the perplexity degrades further, but the model runs with even less GPU memory. HQQ lets you trade off accuracy against cost efficiency.
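Because HQQ is calibration-free, quantizing a model boils down to a single loading step. Here is a minimal sketch using the HqqConfig integration that recent Transformers versions expose; the parameter values and the Llama checkpoint are assumptions to double-check against the current hqq and transformers documentation, since the API has evolved.

```python
# pip install transformers hqq accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM works

# 4-bit HQQ quantization; note that no calibration dataset is needed
quant_config = HqqConfig(nbits=4, group_size=64)

# Weights are quantized on the fly while the model loads,
# in minutes rather than hours.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```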
Lastly, I want to talk about our Optimum Intel library, built by my colleagues in collaboration with Intel. This library focuses on accelerating Hugging Face models across the range of Intel platforms. Optimum Intel provides a simple, transformers-like API and a command-line interface on top of Intel Neural Compressor and Intel OpenVINO. These tools support static and dynamic quantization, quantization-aware training, pruning, and distillation. SmoothQuant is available in them, making it easy to try out. The library supports a wide range of Intel CPUs and accelerators. For example, I recorded a video on accelerating Stable Diffusion inference with OpenVINO on a Xeon CPU, reducing inference time from 45 seconds to just 5 seconds.
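To give an idea of how little code is involved, here is a minimal sketch using Optimum Intel's OpenVINO integration; the model repository and prompt are just examples, and export=True converts the PyTorch checkpoint to OpenVINO format on the fly.

```python
# pip install "optimum[openvino]" diffusers
from optimum.intel import OVStableDiffusionPipeline

# Convert the PyTorch pipeline to OpenVINO on the fly and run it on the CPU
pipeline = OVStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    export=True,
)

image = pipeline("sailing ship in a storm, oil painting").images[0]
image.save("ship.png")
```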
That's pretty much what I wanted to tell you about quantization. Go and review the material, try out the techniques, read the papers, and you will learn. More quantization techniques will likely emerge in the future, and I might cover them in a part three. Quantization is here to stay and is critical for shrinking and accelerating models, whether in the cloud or on smaller devices and at the edge. Good job on watching this, and until next time, keep rocking.