Float16, Float32, BFloat16 — What's the Difference and Why It Matters in Deep Learning
A clear breakdown of floating-point formats and why choosing between float16, bfloat16, and float32 makes a real difference in training speed, memory, and stability.
If you’ve trained a neural network, you’ve probably seen torch.float32, torch.float16, or torch.bfloat16 somewhere in your code and wondered: does it actually matter which one I use? Short answer: yes, a lot. Here’s why.
🔢 The Structure of a Floating-Point Number
Every floating-point number is stored using three fields:
\[(-1)^{\text{sign}} \times 2^{(\text{exponent} - \text{bias})} \times 1.\text{mantissa}\]

Think of it like scientific notation in binary: 1.5 × 10³ maps to sign = +, exponent = 3, mantissa = .5 (the digits after the leading 1). The key insight: you have a fixed number of bits to split between exponent and mantissa, and that tradeoff determines everything.
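You can see these three fields directly by inspecting the raw bits of a float32. A minimal sketch (the helper name `float32_fields` is mine, not a standard API):

```python
import struct

def float32_fields(x):
    # Pack the float into 4 bytes, then read back the raw 32-bit pattern.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits (bias 127)
    mantissa = bits & 0x7FFFFF       # 23 mantissa bits (fraction after the leading 1)
    return sign, exponent, mantissa

# 1.5 is 1.1 in binary: exponent 0 (stored as 127), mantissa's top bit set.
print(float32_fields(1.5))   # (0, 127, 4194304)
```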
| Format | Total Bits | Exponent Bits | Mantissa Bits | Range | Precision |
|---|---|---|---|---|---|
| float16 | 16 | 5 | 10 | ±65,504 | ~3 decimal digits |
| bfloat16 | 16 | 8 | 7 | ~±3.4×10³⁸ | ~2 decimal digits |
| float32 | 32 | 8 | 23 | ~±3.4×10³⁸ | ~7 decimal digits |
| float64 | 64 | 11 | 52 | ~±1.8×10³⁰⁸ | ~15 decimal digits |
- More exponent bits → wider representable range
- More mantissa bits → finer precision between values
- More total bits → costs more memory and is slower on GPU
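You can verify the range and precision columns above with numpy's `finfo` (numpy has no bfloat16, so this sketch covers the three IEEE formats):

```python
import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    # .max is the largest finite value, .eps the gap to the next number above 1.0
    print(dtype.__name__, info.max, info.eps)

# float16's narrow range means large values overflow to infinity:
print(np.float16(70000.0))   # inf  (max finite float16 is 65504)
```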
🧠 Why Deep Learning Cares
Deep learning has two very different numerical demands that pull in opposite directions:
1. Weight updates are tiny. A typical gradient update might be w += 0.000003. In float16, this rounds to zero — the update is lost. In float32, it’s preserved. Over millions of steps, this matters enormously.
2. Activations just need the right ballpark. Whether a neuron fired with value 0.832 or 0.834 is irrelevant. The model only needs approximate magnitudes for the forward pass.
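The lost-update problem from point 1 is easy to demonstrate (sketched here with numpy scalars rather than real model weights):

```python
import numpy as np

update = 3e-6  # a typical tiny gradient step

w16 = np.float16(1.0) + np.float16(update)
w32 = np.float32(1.0) + np.float32(update)

print(w16 == np.float16(1.0))  # True  — the update rounded away to nothing
print(w32 == np.float32(1.0))  # False — float32 keeps the change
```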
This asymmetry is exactly what mixed-precision training exploits:
| Operation | Format | Reason |
|---|---|---|
| Stored weights | float32 | Accumulate small updates without rounding |
| Matrix multiplications | float16 / bf16 | 2–8× faster on modern GPUs |
| Gradient accumulation | float32 | Prevent vanishing updates |
| Loss value | float32 | Cheap to keep precise |
Mixed-precision training is one of the highest-ROI optimizations you can apply — often 2× throughput with near-zero quality loss.
🏆 Why bfloat16 Won
Early mixed-precision training with float16 kept causing gradient explosions. The culprit: float16 has only 5 exponent bits, giving it a tiny range (max ~65k). Large gradients would overflow to inf.
Google’s bfloat16 (brain float) fixes this by using 8 exponent bits — the same as float32 — while cutting mantissa bits to 7. You get:
- Same range as float32 → no overflow risk
- Half the memory → 2× throughput
- Slightly less precision → almost never matters in practice
This is why all modern LLM training (and most vision model training) defaults to bfloat16 when hardware supports it (A100, H100, recent consumer GPUs).
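A minimal sketch of what this looks like in practice, using a hypothetical toy model — note that with bfloat16 there is no overflow risk, so no GradScaler is needed, just autocast:

```python
import torch

model = torch.nn.Linear(16, 4)                 # stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, target = torch.randn(8, 16), torch.randn(8, 4)

optimizer.zero_grad()
# device_type='cpu' keeps this runnable anywhere; use 'cuda' on GPU
with torch.autocast(device_type='cpu', dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()    # gradients flow back into the float32 weights
optimizer.step()
```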
⚖️ The Loss Scaler
When you must use float16 (e.g., older GPUs without bfloat16 support), PyTorch’s GradScaler saves you:
```python
scaler = torch.cuda.amp.GradScaler()

optimizer.zero_grad()
with torch.autocast(device_type='cuda', dtype=torch.float16):
    output = model(input)
    loss = criterion(output, target)

scaler.scale(loss).backward()   # multiplies the loss so small gradients survive float16
scaler.step(optimizer)          # unscales gradients, skips the step if inf/nan appeared
scaler.update()                 # adjusts the scale factor for the next iteration
```
It multiplies the loss (and therefore every gradient) by a large scale factor before backprop, then divides the gradients by the same factor before the weight update. This shifts small gradients up into float16’s representable range.
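Why scaling helps is easy to see with a single number (a numpy sketch, using 1024 as an arbitrary example scale factor):

```python
import numpy as np

grad = 1e-8     # smaller than float16's tiniest subnormal (~6e-8)
scale = 1024.0

print(np.float16(grad))            # 0.0 — underflows, the gradient is gone
scaled = np.float16(grad * scale)
print(scaled)                      # survives as a representable float16
print(np.float32(scaled) / scale)  # ≈ 1e-8 recovered after unscaling in float32
```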
If you see `inf` or `nan` in your loss during float16 training, the GradScaler is the first tool to reach for.
🎯 Practical Takeaway
| Use Case | Recommended Format |
|---|---|
| Standard training | float32 or bfloat16 mixed |
| LLM / large model training | bfloat16 (if A100/H100) |
| Inference (speed matters) | float16 or int8 quantization |
| Scientific computing | float64 |
| Medical imaging (CT reconstruction) | float32 or bfloat16 mixed — preserve subtle signal |
For medical imaging (like CT reconstruction), lean toward float32 or bfloat16 mixed rather than float16. The signal-to-noise in Hounsfield unit differences is subtle enough that precision loss can degrade reconstruction quality.
📝 Summary
float32 = safe default; float16 = fast but fragile; bfloat16 = fast and stable; float64 = overkill for DL.
The format you choose is a tradeoff between memory, speed, range, and precision. Modern best practice is bfloat16 mixed precision — you get most of float32’s safety with float16’s speed.
Part of my deep learning fundamentals series. Next: gradient clipping, learning rate warmup, and training stability tricks.