Float16, Float32, BFloat16 — What's the Difference and Why It Matters in Deep Learning

A clear breakdown of floating-point formats and why choosing between float16, bfloat16, and float32 makes a real difference in training speed, memory, and stability.

If you’ve trained a neural network, you’ve probably seen torch.float32, torch.float16, or torch.bfloat16 somewhere in your code and wondered: does it actually matter which one I use? Short answer: yes, a lot. Here’s why.


🔢 The Structure of a Floating-Point Number

Every floating-point number is stored using three fields:

\[(-1)^{\text{sign}} \times 2^{(\text{exponent} - \text{bias})} \times 1.\text{mantissa}\]

Think of it like scientific notation in binary: in 1.5 × 10³, the sign is +, the exponent is 3, and the mantissa encodes the 1.5 (in binary formats the leading 1 is implicit, so only the fractional part is stored). The key insight: you have a fixed number of bits to split between exponent and mantissa, and that tradeoff determines everything.

| Format | Total Bits | Exponent Bits | Mantissa Bits | Range | Precision |
|---|---|---|---|---|---|
| float16 | 16 | 5 | 10 | ±65,504 | ~3 decimal digits |
| bfloat16 | 16 | 8 | 7 | ~±3.4×10³⁸ | ~2 decimal digits |
| float32 | 32 | 8 | 23 | ~±3.4×10³⁸ | ~7 decimal digits |
| float64 | 64 | 11 | 52 | ~±1.8×10³⁰⁸ | ~15 decimal digits |

  • More exponent bits → wider representable range
  • More mantissa bits → finer precision between values
  • More total bits → costs more memory and is slower on GPU
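You can check the table's numbers directly with NumPy's finfo (numpy has no native bfloat16 dtype, but bfloat16's range matches float32's, since both use 8 exponent bits):

```python
import numpy as np

f16 = np.finfo(np.float16)
f32 = np.finfo(np.float32)

print(f16.nexp, f16.nmant, f16.max)   # 5 exponent bits, 10 mantissa bits, max 65504
print(f32.nexp, f32.nmant, f32.max)   # 8 exponent bits, 23 mantissa bits, max ~3.4e38
```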

🧠 Why Deep Learning Cares

Deep learning has two very different numerical demands that pull in opposite directions:

1. Weight updates are tiny. A typical gradient update might be w += 0.000003. In float16, adding that to a weight of ordinary magnitude rounds away entirely — the update is lost. In float32, it’s preserved. Over millions of steps, this matters enormously.

2. Activations just need the right ballpark. Whether a neuron fired with value 0.832 or 0.834 is irrelevant. The model only needs approximate magnitudes for the forward pass.
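Point 1 is easy to reproduce; here is a minimal NumPy sketch, with float16 standing in for the half-precision arithmetic a GPU would do:

```python
import numpy as np

update = 3e-6                                # a typical tiny gradient step

w16 = np.float16(1.0) + np.float16(update)   # half-precision accumulation
w32 = np.float32(1.0) + np.float32(update)   # single-precision accumulation

print(w16 == np.float16(1.0))   # True  -- the update was rounded away
print(w32 >  np.float32(1.0))   # True  -- the update survived
```

The gap between representable float16 values near 1.0 is about 0.001, so anything much smaller than that simply vanishes when added.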

This asymmetry is exactly what mixed-precision training exploits:

| Operation | Format | Reason |
|---|---|---|
| Stored weights | float32 | Accumulate small updates without rounding |
| Matrix multiplications | float16 / bf16 | 2–8× faster on modern GPUs |
| Gradient accumulation | float32 | Prevent vanishing updates |
| Loss value | float32 | Cheap to keep precise |

Mixed-precision training is one of the highest-ROI optimizations you can apply — often 2× throughput with near-zero quality loss.


🏆 Why bfloat16 Won

Early mixed-precision training with float16 kept causing gradient explosions. The culprit: float16 has only 5 exponent bits, giving it a tiny range (max ~65k). Large gradients would overflow to inf.
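The overflow takes one line to reproduce (NumPy's float16 behaves like the GPU format here):

```python
import numpy as np

g = np.float16(60000.0)    # a large but still representable gradient
print(np.isfinite(g))      # True -- under float16's max of 65504

g2 = g * np.float16(2.0)   # one slightly-too-big value...
print(g2)                  # inf -- the gradient has overflowed
```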

Google’s bfloat16 (brain float) fixes this by using 8 exponent bits — the same as float32 — while cutting mantissa bits to 7. You get:

  • Same range as float32 → no overflow risk
  • Half the memory → 2× throughput
  • Slightly less precision → almost never matters in practice

This is why all modern LLM training (and most vision model training) defaults to bfloat16 when hardware supports it (A100, H100, recent consumer GPUs).
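Because bfloat16 is literally the top 16 bits of a float32 (1 sign, 8 exponent, 7 mantissa bits), you can simulate it in NumPy by bit truncation and see the range-vs-precision tradeoff directly. A minimal sketch (real hardware rounds to nearest; truncation keeps the demo simple):

```python
import numpy as np

def as_bfloat16(x):
    """Simulate bfloat16 by keeping only the top 16 bits of a float32."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# Same range as float32: a value that overflows float16 is fine here
print(np.float16(1e38))          # inf  -- float16 cannot hold it
print(as_bfloat16(1e38))         # finite -- bfloat16's range matches float32

# ...but coarser precision: nearby values collapse together
print(as_bfloat16(1.005) == as_bfloat16(1.0))   # True
```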


⚖️ The Loss Scaler

When you must use float16 (e.g., older GPUs without bfloat16 support), PyTorch’s GradScaler saves you:

```python
scaler = torch.cuda.amp.GradScaler()

with torch.autocast(device_type='cuda', dtype=torch.float16):
    output = model(input)
    loss = criterion(output, target)

scaler.scale(loss).backward()   # inflates gradients before backward
scaler.step(optimizer)          # unscales before weight update
scaler.update()                 # adjusts the scale factor for next step
```

It artificially multiplies the loss (and thus all gradients) by a large scale factor before backprop, then divides before the weight update. This keeps small gradients in float16’s representable range.
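The arithmetic behind this can be sketched with NumPy (2**16 is a commonly used initial scale; the real GradScaler adjusts it dynamically):

```python
import numpy as np

tiny_grad = 1e-8                        # below float16's smallest subnormal (~6e-8)
print(np.float16(tiny_grad))            # 0.0 -- underflows, the gradient is lost

scale = 2.0 ** 16                       # loss-scaling factor
scaled = np.float16(tiny_grad * scale)  # scale first, then drop to float16
print(scaled)                           # ~0.00066 -- representable again

recovered = float(scaled) / scale       # unscale in higher precision before the update
print(abs(recovered - tiny_grad) < 1e-10)   # True -- the tiny gradient survives
```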

If you see inf or nan in your loss during float16 training, the GradScaler is your first tool to try.


🎯 Practical Takeaway

| Use Case | Recommended Format |
|---|---|
| Standard training | float32 or bfloat16 mixed |
| LLM / large model training | bfloat16 (if A100/H100) |
| Inference (speed matters) | float16 or int8 quantization |
| Scientific computing | float64 |
| Medical imaging (CT reconstruction) | float32 or bfloat16 mixed — preserve subtle signal |

For medical imaging (like CT reconstruction), lean toward float32 or bfloat16 mixed rather than float16. The signal-to-noise in Hounsfield unit differences is subtle enough that precision loss can degrade reconstruction quality.


📝 Summary

float32 = safe default; float16 = fast but fragile; bfloat16 = fast and stable; float64 = overkill for DL.

The format you choose is a tradeoff between memory, speed, range, and precision. Modern best practice is bfloat16 mixed precision — you get most of float32’s safety with float16’s speed.


Part of my deep learning fundamentals series. Next: gradient clipping, learning rate warmup, and training stability tricks.

This post is licensed under CC BY 4.0 by the author.