MSE, PSNR, and SSIM — The Image Quality Metrics Every CV Engineer Must Know
Three numbers that decide if your model actually works — a deep dive into MSE, PSNR, and SSIM with intuition, math, and medical imaging context.
Your model converged. Loss went down. But does the reconstructed image actually look good? MSE, PSNR, and SSIM are the three metrics that answer that question — and understanding the difference between them matters more than most people realize.
🗺️ The Big Picture — What Are We Measuring?
When you compare a ground truth image $y$ against a predicted/reconstructed image $\hat{y}$, you need a number that captures the quality of the prediction. These three metrics approach that problem in fundamentally different ways.
```mermaid
graph LR
    A([Ground Truth y]) --> C{Compare}
    B([Predicted ŷ]) --> C
    C --> D[MSE<br/>Pixel Error]
    C --> E[PSNR<br/>Signal Ratio]
    C --> F[SSIM<br/>Perceptual Quality]
    style A fill:#4A90D9,color:#fff
    style B fill:#9B6EBD,color:#fff
    style D fill:#5BA85A,color:#fff
    style E fill:#E8A838,color:#fff
    style F fill:#D9534F,color:#fff
```
Each metric captures a different “dimension” of quality. In practice, you always report all three.
📐 MSE — Mean Squared Error
The Formula
\[\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\]

For every pixel: subtract, square, average. That’s it.
Step-by-Step Intuition
```text
True:       [100, 150, 200]
Predicted:  [110, 145, 210]
Difference: [ 10,   5,  10]
Squared:    [100,  25, 100]
MSE = (100 + 25 + 100) / 3 = 75
```
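The same computation in a few lines of NumPy — a minimal sketch, not tied to any particular framework:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over all pixels: subtract, square, average."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return np.mean((y_true - y_pred) ** 2)

print(mse([100, 150, 200], [110, 145, 210]))  # → 75.0
```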
The squaring does two things simultaneously:
- Makes negative differences positive (can’t have cancellation)
- Penalizes large errors disproportionately — a pixel off by 20 contributes 4× more than one off by 10
Why It’s Not Enough
MSE treats every pixel equally, regardless of location or visual significance. The human visual system does not.
A blurry image that smooths everything can achieve deceptively low MSE — it avoids large per-pixel errors by being wrong everywhere by a little bit. That’s not a good reconstruction.
When to Use It
- ✅ As a training loss — differentiable, fast, stable
- ✅ As a baseline metric in evaluation tables
- ❌ Not as your primary quality indicator for perceptual tasks
📡 PSNR — Peak Signal-to-Noise Ratio
The Formula
\[\text{PSNR} = 10 \cdot \log_{10}\left(\frac{\text{MAX}^2}{\text{MSE}}\right)\]

where MAX is the maximum possible pixel value: 255 for 8-bit images, 1.0 for normalized tensors.
Intuition: Signal vs. Noise
Think of the image as a signal and the reconstruction error as noise. PSNR asks: “How much stronger is the true signal than the error noise?”
- MAX² = maximum possible signal power
- MSE = noise power
- Their ratio = signal-to-noise ratio (SNR)
- 10 × log₁₀(·) converts that ratio to decibels (dB), the standard engineering unit
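A minimal sketch of the conversion, assuming 8-bit images by default (`max_val=255`). Note that `psnr(100) ≈ 28.1 dB`, consistent with the table below:

```python
import math

def psnr(mse_value, max_val=255.0):
    """PSNR in dB from an MSE value; max_val is the peak pixel value
    (255 for 8-bit images, 1.0 for normalized tensors)."""
    if mse_value == 0:
        return float("inf")  # identical images: zero noise power
    return 10.0 * math.log10(max_val ** 2 / mse_value)

print(round(psnr(100), 1))  # → 28.1
```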
What the Numbers Mean
| MSE | PSNR (8-bit) | Visual Quality |
|---|---|---|
| 0.01 | ~68 dB | Nearly perfect |
| 1 | ~48 dB | Excellent |
| 10 | ~38 dB | Good |
| 100 | ~28 dB | Noticeable degradation |
| 1000 | ~18 dB | Heavy distortion |
For CT reconstruction papers, a PSNR improvement of +2–3 dB over the FBP (filtered back-projection) baseline is considered meaningful. Getting above 40 dB is a strong result.
Why It Exists (If MSE Already Exists)
PSNR is just a more interpretable and industry-standard scale for MSE. It compresses the enormous dynamic range of MSE (0.001 to 10,000+) into a clean 20–70 dB range that’s easy to compare across papers and architectures.
Since PSNR is derived directly from MSE, it inherits the same perceptual blindspot: it doesn’t “see” structure or edges, only pixel magnitudes.
🧠 SSIM — Structural Similarity Index
This is where things get genuinely clever. SSIM was designed to mimic how the human visual system actually perceives image quality.
The Core Formula
\[\text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}\]

But the formula is really a product of three components:
\[\text{SSIM} = \underbrace{l(x,y)}_{\text{luminance}} \times \underbrace{c(x,y)}_{\text{contrast}} \times \underbrace{s(x,y)}_{\text{structure}}\]

Component 1 — Luminance
\[l(x,y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}\]

Compares the average brightness of each patch. If the reconstruction is systematically over- or under-exposed, this score drops.
Component 2 — Contrast
\[c(x,y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\]

Compares local variation (standard deviation). A blurry patch has low $\sigma$; a sharp one has high $\sigma$. Over-smoothed reconstructions are caught here.
Component 3 — Structure
\[s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}\]

This is the most perceptually meaningful part. $\sigma_{xy}$ is the covariance between the two patches; normalized by $\sigma_x \sigma_y$, it acts as a correlation coefficient that measures whether edge and texture patterns spatially align.
The constants $C_1$, $C_2$, $C_3$ (e.g., $C_1 = (0.01 \times 255)^2$) are tiny stability terms that prevent division by zero in flat/uniform regions. In the standard formulation $C_3 = C_2/2$, which is exactly what collapses the product $l \cdot c \cdot s$ into the single combined formula above.
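A single-window sketch of the combined formula in NumPy, with $C_3$ folded into $C_2$ as in the standard combined form (illustrative only, not a drop-in library replacement):

```python
import numpy as np

def ssim_patch(x, y, max_val=255.0):
    """SSIM of two equally sized patches, combined formula with C3 = C2/2."""
    C1 = (0.01 * max_val) ** 2
    C2 = (0.03 * max_val) ** 2
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # cross-covariance
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den

p = np.arange(64, dtype=np.float64).reshape(8, 8)
print(round(ssim_patch(p, p), 4))  # → 1.0 (identical patches)
```

A constant brightness shift lowers only the luminance term: `ssim_patch(p, p + 50.0)` drops below 1.0 even though contrast and structure are untouched.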
How It’s Computed in Practice
SSIM doesn’t compare the whole image at once. It uses a sliding 11×11 Gaussian-weighted window:
```mermaid
graph TD
    A[Input Image Pair] --> B[Slide 11x11 Window]
    B --> C[Compute Local SSIM per Patch]
    C --> D[Average All Patches]
    D --> E[Final SSIM Score]
    style A fill:#4A90D9,color:#fff
    style B fill:#9B6EBD,color:#fff
    style C fill:#E8A838,color:#fff
    style D fill:#5BA85A,color:#fff
    style E fill:#D9534F,color:#fff
```
This local windowing is why SSIM catches spatially localized distortions that global MSE misses.
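The windowing loop can be sketched in plain NumPy. This uses uniform weighting for brevity; a real implementation (e.g. `skimage.metrics.structural_similarity`) uses a Gaussian-weighted window:

```python
import numpy as np

def ssim_map(x, y, win=11, max_val=1.0):
    """Mean of local SSIM scores over a sliding win×win window
    (uniform weighting; production code uses a Gaussian window)."""
    C1, C2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    h, w = x.shape
    scores = []
    for i in range(h - win + 1):
        for j in range(w - win + 1):
            px = x[i:i + win, j:j + win]
            py = y[i:i + win, j:j + win]
            mu_x, mu_y = px.mean(), py.mean()
            cov = ((px - mu_x) * (py - mu_y)).mean()
            num = (2 * mu_x * mu_y + C1) * (2 * cov + C2)
            den = (mu_x ** 2 + mu_y ** 2 + C1) * (px.var() + py.var() + C2)
            scores.append(num / den)
    return float(np.mean(scores))

rng = np.random.default_rng(0)
img = rng.random((32, 32))
print(round(ssim_map(img, img), 4))  # → 1.0 (identical images)
```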
The Key Insight — Why SSIM Beats MSE
Consider two distorted versions of the same image:
- Version A: uniform Gaussian noise added everywhere
- Version B: image shifted by 1 pixel
Both may have nearly identical MSE. But:
- Version A destroys texture and edges → SSIM drops sharply
- Version B preserves all structure, just displaced → SSIM stays relatively high
This matches human judgment far better.
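The same-MSE/different-SSIM effect is easy to check numerically. The sketch below uses a constant brightness offset as the structure-preserving distortion (a 1-pixel shift behaves similarly, but its MSE depends on image content) and a single-window SSIM for brevity:

```python
import numpy as np

def ssim_global(x, y, max_val=1.0):
    """Single-window SSIM over the whole image -- crude, but enough
    to separate structure-preserving from structure-destroying error."""
    C1, C2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + C1) * (2 * cov + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (x.var() + y.var() + C2))

# Textured test image with values in [0.25, 0.75]
xx, yy = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
img = 0.5 + 0.25 * np.sin(10 * xx) * np.sin(10 * yy)

rng = np.random.default_rng(0)
noisy = img + rng.normal(0, 0.1, img.shape)  # destroys texture
offset = img + 0.1                           # preserves structure exactly

# Both distortions have MSE ≈ 0.01 ...
print(np.mean((img - noisy) ** 2), np.mean((img - offset) ** 2))
# ... but SSIM separates them clearly (noisy well below offset)
print(ssim_global(img, noisy), ssim_global(img, offset))
```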
SSIM can be used as a loss term: $\mathcal{L}_{\text{SSIM}} = 1 - \text{SSIM}$. However, its non-linearity makes gradient behavior less predictable than MSE’s, so a combined loss usually works best.
🔬 Full Comparison
| Property | MSE | PSNR | SSIM |
|---|---|---|---|
| Range | [0, ∞) | [0, ∞) dB | [-1, 1] |
| Better = | ↓ Lower | ↑ Higher | ↑ Higher (→ 1) |
| Perceptual quality | ❌ Poor | ❌ Poor | ✅ Good |
| Catches blur | ⚠️ Partial | ⚠️ Partial | ✅ Yes |
| Catches noise | ✅ Yes | ✅ Yes | ✅ Yes |
| Catches shift | ❌ No | ❌ No | ⚠️ Partial |
| Use as loss | ✅ Trivial | ✅ Via MSE | ⚠️ Possible |
| Compute cost | 🟢 Cheap | 🟢 Cheap | 🟡 Moderate |
🏥 Application to CT Reconstruction
In medical and semiconductor imaging, SSIM is the most clinically meaningful metric because:
- Clinicians care about sharp boundaries (tissue edges, anomalies, implant contours)
- A slightly blurry reconstruction may have decent PSNR but miss critical diagnostic features
- Deep learning methods that optimize only MSE tend to produce over-smoothed outputs
Recommended training loss for CT reconstruction: \(\mathcal{L} = \lambda_1 \cdot \text{MSE} + \lambda_2 \cdot (1 - \text{SSIM}) + \lambda_3 \cdot \mathcal{L}_{\text{perceptual}}\) A typical starting point: $\lambda_1 = 0.8$, $\lambda_2 = 0.1$, $\lambda_3 = 0.1$.
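A framework-agnostic sketch of that combined loss. The SSIM and perceptual terms are passed in as callables, since in a real pipeline they would be differentiable implementations from your DL framework (names here are placeholders, not a specific API):

```python
import numpy as np

def combined_loss(y_true, y_pred, ssim_fn, perceptual_fn,
                  lam1=0.8, lam2=0.1, lam3=0.1):
    """lam1*MSE + lam2*(1 - SSIM) + lam3*L_perceptual.
    ssim_fn and perceptual_fn are placeholder callables supplied
    by the training pipeline (hypothetical, for illustration)."""
    mse = float(np.mean((y_true - y_pred) ** 2))
    return (lam1 * mse
            + lam2 * (1.0 - ssim_fn(y_true, y_pred))
            + lam3 * perceptual_fn(y_true, y_pred))
```

For a perfect reconstruction (MSE = 0, SSIM = 1, zero perceptual distance) the loss is exactly 0, which is a quick sanity check worth running on any loss implementation.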
One-sentence intuition: MSE counts pixels, PSNR scales that into decibels, but SSIM actually looks at the picture the way you do.
Part of FYP notes series. Next: Perceptual Loss and Feature-Level Similarity Metrics.