BF16 vs. FP16 vs. FP32 | dzungphieuluuky's Blog

Last Updated:

As deep learning models balloon to billions or even trillions of parameters, the numeric format used for training is no longer a minor detail—it’s a fundamental constraint on what’s possible. We will delve into a new floating point data type that is currently pervasive everywhere in modern deep learning models, apart from the traditional FP16 and FP32 data types.

The Problem: Why FP32 Became a Bottleneck

For years, single-precision floating-point (FP32) was the reliable, default choice for deep learning. It offered high precision and a wide dynamic range, ensuring stable training. However, as models like GPT-4 began to demand billions, then trillions of parameters, a critical bottleneck emerged.

Every parameter stored in FP32 consumes 32 bits (4 bytes) of memory. This means a 1-trillion parameter model requires at least 4 terabytes of memory just to store its weights—before considering optimizers, schedulers, gradients, or activations. Furthermore, matrix operations in FP32 are computationally expensive, slowing down training and increasing costs prohibitively. The field needed a way to reduce the memory footprint and accelerate computation without breaking training stability.

Enter 16-bits format

The natural solution was to reduce precision. If we could cut the bit-width from 32 to 16, we’d immediately halve memory usage and, on modern hardware, achieve massive speedups. Two primary 16-bit contenders entered the arena:

IEEE Standard FP16: The existing 16-bit floating-point format according to IEEE 754 standard.
Brain Floating Point 16 (BF16): A format introduced by Google Brain, designed explicitly for the unique demands of neural network training.

The choice between them hinges on a fundamental trade-off in how those 16 bits are allocated and interpreted by the computers.

A floating-point number is represented by three parts:

Sign Bit (1 bit): Determines if the number is positive or negative.
Exponent Bits: Control the dynamic range—how large or small a number can be represented.
Mantissa Bits (or Fraction): Control the precision—the granularity between numbers within that range.

Format	Total Bits	Sign	Exponent	Mantissa	Key Property
FP32	32	1	8	23	High-precision, wide-range standard
FP16	16	1	5	10	Narrow range, medium precision
BF16	16	1	8	7	Wide range (as FP32), lower precision

BF16 copies the 8-bit exponent from FP32, giving it an identical, massive dynamic range (±3.4 × 10³⁸). It achieves this by drastically reducing the mantissa to only 7 bits, sacrificing fine-grained precision. FP16, in contrast, shortens both the exponent and mantissa.

Stability over Fidelity

Deep learning training is less about perfect mathematical precision and more about stable, convergent optimization. With its large structure with unfathomable numbers of neurons and weights co-adapting with each other, slightly lower precision doesn’t really affect the performance of the model too much. On the other hand, gradients, activations, and loss values can span an enormous range of magnitudes, easily leads to underflowing or overflowing of numerical values.

The FP16 Pitfall: With only 5 exponent bits, FP16’s range is tiny (±6.1 × 10⁴). Large values (common in gradients) overflow to infinity (inf), while tiny values underflow to zero. This causes NaN losses and unstable training, worsen the vanishing gradients and exploding gradients further, requiring complex workarounds like loss scaling, which draws our attention out of the core logic that we are supposed to focus on.
The BF16 Advantage: By keeping FP32’s wide dynamic range, BF16 almost entirely avoids overflow/underflow during training. Networks are remarkably tolerant to the lower precision of the 7-bit mantissa; the direction of gradients matters more than their hyper-precise magnitude. This means BF16 training is inherently more stable than FP16, often matching FP32 final accuracy with far less hassle.

We can have a glimpse into the structure of each data types using the figure below.

Diagram showing FP16's limited range vs. BF16/FP32's wide range

Expansion & Application: Hardware and Practical Usage

The rise of BF16 was cemented by synchronized hardware and software support.

Hardware Acceleration

Modern AI accelerators are built with BF16 in mind, offering dedicated tensor cores or units that perform BF16 operations at peak throughput:

NVIDIA (Ampere+): A100, H100, etc., have BF16 Tensor Cores.
Google TPUs: Were optimized for BF16 from their inception.
AMD / Intel: MI250/Xeon AMX provide robust BF16 support.

This means using BF16 isn’t just about saving memory—it’s about unlocking 2-3x faster matrix operations compared to FP32 on the same hardware, especially convenient and important for more efficient training and inference on large models.

Practical Implementation: Simpler Code

Using BF16 with frameworks like PyTorch leads to cleaner code than the older FP16 approach, as it often doesn’t require manual loss scaling.

# Modern BF16 Mixed-Precision Training (PyTorch)
with torch.autocast(device_type='cuda', dtype=torch.bfloat16): # Context manager
    predictions = model(inputs)
    loss = loss_fn(predictions, targets)
loss.backward() # No GradScaler needed!
optimizer.step()

Compare this to the classic FP16 pattern, which required careful management of a GradScaler object to prevent underflow—an extra complexity that BF16 largely obviates. With convenient library for distributed training and inference such as Accelerate, we can use the configuration parameter named mixed-precision to choose bf16 for its gains. Although this data type has not been fully supported on older hardware such as T4 GPUs, which are based on Turing architecture, we can still leverage fp16 instead or just use the original fp32 data type for maximum numerical accuracy.

Final talk

BF16 has effectively become the de facto standard for training large-scale neural networks. It solves the core problem of FP32’s cost while avoiding the instability of naive FP16.

Use BF16 for training and fine-tuning large models (e.g., with Hugging Face or Unsloth). It offers the best balance of speed, memory efficiency, and stability.
FP16 remains relevant for inference on memory-constrained edge devices where its slightly higher precision can be beneficial.
FP32 is still crucial for specific scientific computing tasks, small models where precision is critical.

The evolution from FP32 -> FP16 -> BF16 illustrates a key insight in deep learning systems: optimal performance comes from designing numerical formats around the algorithm’s robustness, not just abstract mathematical ideals.