Float16, Float32, BFloat16 — What's the Difference and Why It Matters in Deep Learning

A clear breakdown of floating-point formats and why choosing between float16, bfloat16, and float32 makes a real difference in training speed, memory, and stability.

If you’ve trained a neural network, you’ve probably seen torch.float32, torch.float16, or torch.bfloat16 somewhere in your code and wondered: does it actually matter which one I use? Short answer: yes, a lot. Here’s why.


🔢 The Structure of a Floating-Point Number

Every floating-point number is stored using three fields:

\[(-1)^{\text{sign}} \times 2^{(\text{exponent} - \text{bias})} \times 1.\text{mantissa}\]

Think of it like scientific notation in binary: in 1.5 × 10³, the sign is +, the exponent is 3, and the mantissa encodes the 1.5 (in binary formats the leading 1 is implicit, so only the fractional part is stored). The key insight: you have a fixed number of bits to split between exponent and mantissa, and that tradeoff determines everything.

| Format | Total Bits | Exponent Bits | Mantissa Bits | Range | Precision |
|---|---|---|---|---|---|
| float16 | 16 | 5 | 10 | ±65,504 | ~3 decimal digits |
| bfloat16 | 16 | 8 | 7 | ~±3.4×10³⁸ | ~2 decimal digits |
| float32 | 32 | 8 | 23 | ~±3.4×10³⁸ | ~7 decimal digits |
| float64 | 64 | 11 | 52 | ~±1.8×10³⁰⁸ | ~15 decimal digits |

  • More exponent bits → wider representable range
  • More mantissa bits → finer precision between values
  • More total bits → costs more memory and is slower on GPU
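You can check the table's numbers directly with NumPy's finfo (numpy has no native bfloat16 dtype, but bfloat16's range matches float32's, since both use 8 exponent bits):

```python
import numpy as np

f16 = np.finfo(np.float16)
f32 = np.finfo(np.float32)

print(f16.nexp, f16.nmant, f16.max)   # 5 exponent bits, 10 mantissa bits, max 65504
print(f32.nexp, f32.nmant, f32.max)   # 8 exponent bits, 23 mantissa bits, max ~3.4e38
```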

🧠 Why Deep Learning Cares

Deep learning has two very different numerical demands that pull in opposite directions:

1. Weight updates are tiny. A typical gradient update might be w += 0.000003. In float16, adding that to a weight of ordinary magnitude rounds away entirely — the update is lost. In float32, it’s preserved. Over millions of steps, this matters enormously.

2. Activations just need the right ballpark. Whether a neuron fired with value 0.832 or 0.834 is irrelevant. The model only needs approximate magnitudes for the forward pass.
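Point 1 is easy to reproduce; here is a minimal NumPy sketch, with float16 standing in for the half-precision arithmetic a GPU would do:

```python
import numpy as np

update = 3e-6                                # a typical tiny gradient step

w16 = np.float16(1.0) + np.float16(update)   # half-precision accumulation
w32 = np.float32(1.0) + np.float32(update)   # single-precision accumulation

print(w16 == np.float16(1.0))   # True  -- the update was rounded away
print(w32 >  np.float32(1.0))   # True  -- the update survived
```

The gap between representable float16 values near 1.0 is about 0.001, so anything much smaller than that simply vanishes when added.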

This asymmetry is exactly what mixed-precision training exploits:

| Operation | Format | Reason |
|---|---|---|
| Stored weights | float32 | Accumulate small updates without rounding |
| Matrix multiplications | float16 / bf16 | 2–8× faster on modern GPUs |
| Gradient accumulation | float32 | Prevent vanishing updates |
| Loss value | float32 | Cheap to keep precise |

Mixed-precision training is one of the highest-ROI optimizations you can apply — often 2× throughput with near-zero quality loss.


🏆 Why bfloat16 Won

Early mixed-precision training with float16 kept causing gradient explosions. The culprit: float16 has only 5 exponent bits, giving it a tiny range (max ~65k). Large gradients would overflow to inf.
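The overflow takes one line to reproduce (NumPy's float16 behaves like the GPU format here):

```python
import numpy as np

g = np.float16(60000.0)    # a large but still representable gradient
print(np.isfinite(g))      # True -- under float16's max of 65504

g2 = g * np.float16(2.0)   # one slightly-too-big value...
print(g2)                  # inf -- the gradient has overflowed
```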

Google’s bfloat16 (brain float) fixes this by using 8 exponent bits — the same as float32 — while cutting mantissa bits to 7. You get:

  • Same range as float32 → no overflow risk
  • Half the memory → 2× throughput
  • Slightly less precision → almost never matters in practice

This is why all modern LLM training (and most vision model training) defaults to bfloat16 when hardware supports it (A100, H100, recent consumer GPUs).
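Because bfloat16 is literally the top 16 bits of a float32 (1 sign, 8 exponent, 7 mantissa bits), you can simulate it in NumPy by bit truncation and see the range-vs-precision tradeoff directly. A minimal sketch (real hardware rounds to nearest; truncation keeps the demo simple):

```python
import numpy as np

def as_bfloat16(x):
    """Simulate bfloat16 by keeping only the top 16 bits of a float32."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# Same range as float32: a value that overflows float16 is fine here
print(np.float16(1e38))          # inf  -- float16 cannot hold it
print(as_bfloat16(1e38))         # finite -- bfloat16's range matches float32

# ...but coarser precision: nearby values collapse together
print(as_bfloat16(1.005) == as_bfloat16(1.0))   # True
```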


⚖️ The Loss Scaler

When you must use float16 (e.g., older GPUs without bfloat16 support), PyTorch’s GradScaler saves you:

```python
scaler = torch.cuda.amp.GradScaler()

with torch.autocast(device_type='cuda', dtype=torch.float16):
    output = model(input)
    loss = criterion(output, target)

scaler.scale(loss).backward()   # inflates gradients before backward
scaler.step(optimizer)          # unscales before weight update
scaler.update()                 # adjusts the scale factor for next step
```

It artificially multiplies the loss (and thus all gradients) by a large scale factor before backprop, then divides before the weight update. This keeps small gradients in float16’s representable range.
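The arithmetic behind this can be sketched with NumPy (2**16 is a commonly used initial scale; the real GradScaler adjusts it dynamically):

```python
import numpy as np

tiny_grad = 1e-8                        # below float16's smallest subnormal (~6e-8)
print(np.float16(tiny_grad))            # 0.0 -- underflows, the gradient is lost

scale = 2.0 ** 16                       # loss-scaling factor
scaled = np.float16(tiny_grad * scale)  # scale first, then drop to float16
print(scaled)                           # ~0.00066 -- representable again

recovered = float(scaled) / scale       # unscale in higher precision before the update
print(abs(recovered - tiny_grad) < 1e-10)   # True -- the tiny gradient survives
```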

If you see inf or nan in your loss during float16 training, the GradScaler is your first tool to try.


🎯 Practical Takeaway

| Use Case | Recommended Format |
|---|---|
| Standard training | float32 or bfloat16 mixed |
| LLM / large model training | bfloat16 (if A100/H100) |
| Inference (speed matters) | float16 or int8 quantization |
| Scientific computing | float64 |
| Medical imaging (CT reconstruction) | float32 or bfloat16 mixed — preserve subtle signal |

For medical imaging (like CT reconstruction), lean toward float32 or bfloat16 mixed rather than float16. The signal-to-noise in Hounsfield unit differences is subtle enough that precision loss can degrade reconstruction quality.


📝 Summary

float32 = safe default; float16 = fast but fragile; bfloat16 = fast and stable; float64 = overkill for DL.

The format you choose is a tradeoff between memory, speed, range, and precision. Modern best practice is bfloat16 mixed precision — you get most of float32’s safety with float16’s speed.


Part of my deep learning fundamentals series. Next: gradient clipping, learning rate warmup, and training stability tricks.

This post is licensed under CC BY 4.0 by the author.