Understanding the NVIDIA FP4 format

The NVIDIA FP4 format is a significant advancement in low-precision computing: a 4-bit floating-point standard designed to optimize AI inference workloads on modern GPU architectures. Developed by NVIDIA, FP4 builds on the principles of earlier reduced-precision formats such as FP8, enabling ultra-efficient model compression while maintaining high accuracy for large-scale deployments. The format is particularly valuable for GPU Instances in the Blackwell series, such as Scaleway B300-SXM Instances, which process FP4 natively.

FP4 addresses the limitations of traditional quantization techniques by introducing a two-level scaling mechanism, combining a fine-grained per-block scale factor with a per-tensor scale, that minimizes errors when converting from higher-precision formats like FP16 or FP8. As AI models continue to grow in size and complexity, often exceeding trillions of parameters, the need for formats that balance memory savings, throughput, and fidelity becomes critical. FP4 delivers roughly a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while keeping accuracy degradation below 1% on key language modeling tasks for some models. These gains come from hardware-accelerated FP4 operations in NVIDIA's fifth-generation Tensor Cores, allowing integration into inference pipelines without significant accuracy trade-offs.
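To make the two-level scaling concrete, the sketch below simulates FP4 (E2M1) quantization of a weight tensor in plain NumPy. The block size of 16, the per-block scale, and the per-tensor scale follow NVIDIA's published description of NVFP4, but the real format is applied in hardware by the Tensor Cores, so treat this as a numerical illustration rather than the exact encoding.

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1 float (sign handled separately).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_two_level(weights: np.ndarray, block_size: int = 16):
    """Simulate two-level-scaled FP4 quantization of a 1-D weight tensor.

    Level 1: a per-tensor scale maps the global dynamic range.
    Level 2: a per-block scale maps each small block onto the E2M1 grid.
    Block size 16 and the scale handling are assumptions for illustration.
    """
    tensor_scale = np.max(np.abs(weights)) / E2M1_VALUES[-1]  # level 1
    quantized = np.empty_like(weights)
    block_scales = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Per-block scale (level 2), expressed relative to the tensor scale.
        block_scale = np.max(np.abs(block)) / (E2M1_VALUES[-1] * tensor_scale)
        block_scale = max(block_scale, 1e-12)
        scaled = block / (block_scale * tensor_scale)
        # Round each value to the nearest representable E2M1 magnitude.
        idx = np.argmin(np.abs(np.abs(scaled)[:, None] - E2M1_VALUES[None, :]), axis=1)
        quantized[start:start + block_size] = (
            np.sign(scaled) * E2M1_VALUES[idx] * block_scale * tensor_scale
        )
        block_scales.append(block_scale)
    return quantized, np.array(block_scales), tensor_scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.05, size=4096).astype(np.float32)
    w_q, scales, tensor_scale = quantize_fp4_two_level(w)
    rel_err = np.linalg.norm(w - w_q) / np.linalg.norm(w)
    print(f"relative quantization error: {rel_err:.3%}")
```

Because each block gets its own scale, outliers in one block do not force the rest of the tensor onto a coarser grid, which is why the two-level scheme loses less accuracy than a single per-tensor scale.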

Use cases and benefits

FP4 is highly effective for LLM inference, particularly for attention and token generation.

Its main benefits include:

- Significantly improved memory efficiency (3.5x smaller model footprints), allowing larger models to run on a given GPU.
- Performance acceleration through reduced memory bandwidth, which increases token throughput and lowers prefill latency (Blackwell GPUs show up to 25x greater energy efficiency than H100).
- Strong accuracy retention, with minimal perplexity increase.
- Substantial energy savings, supporting sustainable AI (Blackwell Ultra achieves up to 50x efficiency gains for massive models).
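The 3.5x figure, rather than the nominal 4x of going from 16-bit to 4-bit values, reflects the overhead of storing scale factors alongside the weights. The short calculation below assumes one 8-bit scale per 16-value block for FP4 (per NVIDIA's NVFP4 description) and negligible scale overhead for FP16 and FP8; exact overheads depend on the implementation.

```python
# Approximate bits stored per weight.
fp16_bits = 16
fp8_bits = 8                    # assuming per-tensor scaling with negligible overhead
fp4_bits = 4 + 8 / 16           # 4-bit value + amortized 8-bit block scale = 4.5 bits

params = 70e9                   # e.g. a 70B-parameter model
print(f"FP16 weights: {params * fp16_bits / 8 / 1e9:.0f} GB")   # ~140 GB
print(f"FP8  weights: {params * fp8_bits  / 8 / 1e9:.0f} GB")   # ~70 GB
print(f"FP4  weights: {params * fp4_bits  / 8 / 1e9:.0f} GB")   # ~39 GB
print(f"FP16 -> FP4 reduction: {fp16_bits / fp4_bits:.1f}x")     # ~3.6x
print(f"FP8  -> FP4 reduction: {fp8_bits  / fp4_bits:.1f}x")     # ~1.8x
```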

Primary use cases include LLM serving (via TensorRT-LLM or vLLM), non-LLM models (exportable via ONNX), and prequantized deployments on Hugging Face, such as Llama 3.1-405B-Instruct-FP4 or FLUX.1-dev.
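As an illustration of the LLM-serving path, the snippet below loads a prequantized FP4 checkpoint with vLLM's offline API on a Blackwell-based Instance such as B300-SXM. The Hugging Face model ID is taken from the example above but should be verified against the actual published checkpoint name, and vLLM is assumed to detect the FP4 quantization from the checkpoint configuration.

```python
from vllm import LLM, SamplingParams

# Prequantized NVFP4 checkpoint; the exact Hugging Face model ID may differ.
llm = LLM(
    model="nvidia/Llama-3.1-405B-Instruct-FP4",
    tensor_parallel_size=8,      # adjust to the number of GPUs on the Instance
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain the NVFP4 format in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```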

For implementation, tools like NVIDIA TensorRT Model Optimizer and LLM Compressor help to optimize quantization workflows. Early support in frameworks like vLLM and upcoming integrations in SGLang further broaden accessibility.
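For quantizing a model yourself, a minimal post-training quantization flow with TensorRT Model Optimizer could look like the sketch below. The mtq.quantize call follows Model Optimizer's documented PTQ API, but the FP4 configuration name and the calibration details are assumptions; verify them against the current Model Optimizer documentation before use.

```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # example model; any causal LM works
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def calibrate(m):
    # Run a few representative prompts so Model Optimizer can collect activation ranges.
    prompts = ["The capital of France is", "Explain transformers in one sentence."]
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt").to(m.device)
        m(**inputs)

# NVFP4_DEFAULT_CFG is the assumed name of the FP4 quantization config.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibrate)
# The quantized model can then be exported for TensorRT-LLM or vLLM deployment.
```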

For more details on FP4 deployment, refer to NVIDIA's official documentation.
