Quantization, a game-changer for cloud-based machine learning efficiency - Part 2

Diego Coy
6 min read

This is the second article in our series of blog posts around quantization as an optimization technique for your AI models to make the most out of your NVIDIA H100 GPU Instances. In case you missed it, here’s part 1 of the series.

Quantization in the training phase

NVIDIA’s H100 Transformer Engine

The H100 Transformer Engine, introduced with NVIDIA's Hopper architecture and later incorporated into the NVIDIA Ada Lovelace architecture, significantly improves model training performance in terms of time and resources. It is particularly effective in training large models within a matter of days or even hours, depending on their size.

Key aspects of the H100 GPU Transformer Engine include:

  • Floating-Point Precision: It uses 16-bit floating-point (or “FP16”) precision combined with an 8-bit floating-point (or “FP8”) data format. This mix of precisions, along with advanced algorithms within the engine, improves performance without significantly compromising accuracy.
  • Tensor Core Technology: The Engine employs custom NVIDIA fourth-generation Tensor Core technology, designed to accelerate training for models built from transformers. These cores can apply mixed FP8 and FP16 formats, improving the speed and performance of AI calculations for transformers.
  • Dynamic Precision Management: The Engine lets you convert a model’s weights to an FP8 format and dynamically manages per-layer scaling factors, so computations can run on FP8 kernels without sacrificing accuracy.
  • Inference Optimization: The H100 GPU’s Transformer Engine reduces memory consumption, while FP8 quantization maintains higher accuracy than INT8 or INT4 data formats. This gives you an ideal balance of speed and accuracy.
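
To build intuition for why FP8 can preserve useful accuracy, the following standalone sketch simulates rounding values to the E4M3 FP8 format (4 exponent bits, 3 mantissa bits, maximum normal value 448) with NumPy. This is a simplified simulation for illustration only, not the Transformer Engine API:

```python
import numpy as np

def simulate_e4m3(x):
    """Round float values to the nearest representable FP8 E4M3 number
    (4 exponent bits, 3 mantissa bits, max normal value 448).
    Simplified: subnormal handling is approximated."""
    x = np.asarray(x, dtype=np.float64)
    sign = np.sign(x)
    mag = np.abs(x)
    # Exponent of each value; guard against log2(0)
    exp = np.floor(np.log2(np.where(mag == 0, 1.0, mag)))
    # With 3 mantissa bits there are 8 representable steps per power of two
    step = 2.0 ** exp / 8.0
    rounded = np.round(mag / step) * step
    return sign * np.clip(rounded, 0.0, 448.0)

weights = np.array([0.1234, -0.5678, 3.14159, 100.0, 1000.0])
fp8 = simulate_e4m3(weights)
print(fp8)                    # values snapped to the nearest FP8 grid point
print(np.abs(fp8 - weights))  # per-value rounding error; note 1000.0 saturates at 448
```

Small values keep fine-grained resolution while large values are rounded more coarsely, which is exactly the property that makes floating-point 8-bit formats a good fit for neural network weights and activations.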

Check out NVIDIA’s Transformer Engine User Guide for a step-by-step guide on how to make the most out of your models on an NVIDIA H100 GPU.

Quantization-Aware Training

Quantization-Aware Training (QAT) is a method of preparing machine learning models for efficient deployment with minimal loss of accuracy. This technique is particularly advantageous for optimizing the hardware resources (GPU, CPU, TPU) that the model will run on. QAT involves adjusting the training process of a model to accommodate quantization, which results in models that are more robust to the loss of precision when deployed in low-precision formats.

At the heart of QAT lies the “fake quantization” technique, a process where both weights and activations within the model are rounded to mimic lower-precision formats (like FP8, FP4, or even lower) during training. However, unlike actual quantization, these operations are performed using higher-precision floating-point numbers. This means that while the model trains, it simulates the effects of quantization, becoming “aware” of the reduced precision it will eventually work with. This awareness enables the model to adjust and optimize its parameters accordingly during the training phase, maintaining accuracy even after quantization.
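
The fake-quantization step described above can be sketched in a few lines of NumPy. This is a simplified illustration of the quantize-dequantize forward pass (here using a symmetric integer grid), not a complete QAT implementation — real frameworks also handle gradients through the rounding step, typically with a straight-through estimator:

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Simulate quantization during training: round values to a low-precision
    grid, then immediately dequantize so the rest of the network still sees
    floats, with the quantization error baked in."""
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 127 for 8 bits
    scale = np.max(np.abs(x)) / qmax               # symmetric per-tensor scale
    q = np.clip(np.round(x / scale), -qmax, qmax)  # snap to the integer grid
    return q * scale                               # back to float

weights = np.array([0.02, -0.5, 0.73, 1.0])
print(fake_quantize(weights, num_bits=8))  # close to the original values
print(fake_quantize(weights, num_bits=4))  # visibly coarser grid
```

Because the model sees these slightly perturbed values at every training step, it learns weights that remain accurate once the real low-precision conversion happens at deployment time.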

For a detailed, step-by-step guide on implementing Quantization-Aware Training, including code snippets and specific API usage, refer to your framework’s official tutorials on the topic.

These tutorials provide an in-depth look at the process and offer practical insights into effectively applying QAT to your models.

Post-training Quantization

Using a quantized model with Scaleway’s NVIDIA H100 GPU Instances

In a dedicated blog post, Golem.ai shares their experience and the steps they took to improve the performance of their H100 GPU Instances, providing a practical application scenario of using NVIDIA H100 GPUs on Scaleway:

  • Model Used: Golem.ai ran the Llama-2-70B-chat model using llama.cpp, a C/C++ library for inference of LLaMA/Llama 2 models, with NVIDIA CUDA 12.2 on an H100-1-80G GPU Instance (the most powerful GPU Instance Scaleway offered at the time, in early November 2023).
  • Quantization Method: The Llama-2-70B-chat model they used was processed with a 6-bit quantization method. This level of quantization requires around 60GB of memory, which is well within the 80GB of VRAM provided by the H100-1-80G GPU Instance.
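
A quick back-of-envelope calculation shows why those numbers add up: roughly 70 billion parameters at 6 bits each come to about 52.5 GB for the weights alone, and runtime overhead (KV cache, activations) brings the total close to the ~60 GB figure reported above:

```python
# Rough weight-memory estimate for a 70B-parameter model at different bit widths.
params = 70e9

for bits in (16, 8, 6, 4):
    gb = params * bits / 8 / 1e9   # bits -> bytes -> gigabytes
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")
```

At 16 bits the same model needs roughly 140 GB and would not fit on a single 80 GB GPU at all, which is exactly the gap quantization closes here.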

The application of NVIDIA's H100 Transformer Engine in the context of large-scale language models, as demonstrated by Golem.ai, illustrates the potential of this technology in enhancing AI performance. The combination of advanced precision management and the power of the H100 GPU Instances allows for efficient training and inference of large and complex models. Combined, this showcases a significant step forward in the field of AI and ML.

Make sure to read How to Optimize LLM Performance with NVIDIA H100 GPUs from Scaleway, by Golem.ai, for a detailed guide on how Golem.ai uses a pre-quantized model from TheBloke (AKA Tom Jobbins), one of the most popular quantized model contributors. While you’re at it, check out some of the other almost four thousand models they have made available to the public.

Measuring the trade-off

The trade-off of quantization is a balancing act between maintaining model accuracy and achieving computational efficiency. Measuring this trade-off involves a detailed analysis of both the performance metrics and the operational benefits.

Accuracy or Efficiency: Why don’t we have both?

Lower-precision quantization, such as FP8 or FP4 representations, significantly reduces the model size and speeds up inference, thus optimizing hardware resource usage and response times. At the same time, this reduction in precision can lead to a decrease in model accuracy. The loss in accuracy varies depending on the model architecture, the quantization technique, and the target precision. Measuring this trade-off involves conducting extensive testing to compare the model's performance before and after quantization.

To evaluate the trade-off, several key performance metrics are used:

  • Inference Speed: Quantization generally leads to faster inference times. Measuring the time the model takes to produce predictions before and after being quantized is a key indicator of improvement.
  • Model Size: One of the benefits of reducing the model's numerical precision is the reduction of its size, which is crucial for both storage and memory resources. Comparing the model size before and after quantization gives you a clear picture of the gains.
  • Testing and Validation: It's crucial to ensure that the reduced precision will not significantly impact end users by producing inaccurate outputs. This involves not only running the quantized model through a standard testing dataset, but also potentially deploying it in a real-world or simulated environment to verify that inference results stay within the expected range.
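
The first two metrics can be measured with a few lines of code. The hedged sketch below quantizes a synthetic weight tensor to INT8 and reports the size reduction and the mean rounding error; in a real evaluation you would replace the rounding error with a task metric (accuracy, perplexity) on a held-out dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=100_000).astype(np.float32)

# Quantize to INT8 with a symmetric per-tensor scale, then dequantize
scale = np.max(np.abs(weights)) / 127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale

size_fp32 = weights.nbytes / 1e6   # MB at 32-bit
size_int8 = q.nbytes / 1e6         # MB at 8-bit
mean_err = np.mean(np.abs(dequant - weights))

print(f"size: {size_fp32:.1f} MB -> {size_int8:.1f} MB (4x smaller)")
print(f"mean absolute rounding error: {mean_err:.6f}")
```

The same before/after comparison, run against your actual model and validation data, is what tells you whether a given quantization level is acceptable for your use case.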


The decision of whether and how much to quantize a model depends on the application's requirements. For some applications, a slight decrease in accuracy is acceptable in exchange for significant performance gains; for others, where accuracy is critical, even a small loss may be unacceptable.

By embracing quantization-aware training, leveraging NVIDIA’s Transformer Engine, or using existing quantized models, organizations can optimize the cost of their cloud spend, and achieve faster training and inference operations, while at the same time minimizing environmental impact.

With the right approach, balancing performance and accuracy stops being a trade-off and becomes a deliberate decision in your development and deployment process, one that lets you unlock the full potential of your AI models at the right stage.

Where to go from here?

You can give any of these methods a try on an H100 PCIe GPU instance if you haven’t already done so. Furthermore, if you want a technical deep dive into quantization, you should check out HuggingFace’s guide on quantization, a great resource that will take your understanding of the topic to the next level.
