MSE, PSNR, and SSIM — The Image Quality Metrics Every CV Engineer Must Know

Three numbers that decide if your model actually works — a deep dive into MSE, PSNR, and SSIM with intuition, math, and medical imaging context.

Your model converged. Loss went down. But does the reconstructed image actually look good? MSE, PSNR, and SSIM are the three metrics that answer that question — and understanding the difference between them matters more than most people realize.


🗺️ The Big Picture — What Are We Measuring?

When you compare a ground truth image $y$ against a predicted/reconstructed image $\hat{y}$, you need a number that captures the quality of the prediction. These three metrics approach that problem in fundamentally different ways.

```mermaid
graph LR
    A([Ground Truth y]) --> C{Compare}
    B([Predicted ŷ]) --> C
    C --> D[MSE\nPixel Error]
    C --> E[PSNR\nSignal Ratio]
    C --> F[SSIM\nPerceptual Quality]

    style A fill:#4A90D9,color:#fff
    style B fill:#9B6EBD,color:#fff
    style D fill:#5BA85A,color:#fff
    style E fill:#E8A838,color:#fff
    style F fill:#D9534F,color:#fff
```

Each metric captures a different “dimension” of quality. In practice, you always report all three.


📐 MSE — Mean Squared Error


The Formula

\[\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\]

For every pixel: subtract, square, average. That’s it.


Step-by-Step Intuition

```
True:        [100, 150, 200]
Predicted:   [110, 145, 210]
Difference:  [ 10,   5,  10]
Squared:     [100,  25, 100]
MSE = (100 + 25 + 100) / 3 = 75
```
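The same worked example in NumPy (the three-pixel values are illustrative, carried over from above):

```python
import numpy as np

y_true = np.array([100, 150, 200], dtype=np.float64)
y_pred = np.array([110, 145, 210], dtype=np.float64)

# Subtract, square, average — mean of [100, 25, 100]
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # → 75.0
```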

The squaring does two things simultaneously:

  • Makes negative differences positive (can’t have cancellation)
  • Penalizes large errors disproportionately — a pixel off by 20 contributes 4× more than one off by 10

Why It’s Not Enough

MSE treats every pixel equally, regardless of location or visual significance. The human visual system does not.

A blurry image that smooths everything can achieve deceptively low MSE — it avoids large per-pixel errors by being wrong everywhere by a little bit. That’s not a good reconstruction.


When to Use It

  • ✅ As a training loss — differentiable, fast, stable
  • ✅ As a baseline metric in evaluation tables
  • ❌ Not as your primary quality indicator for perceptual tasks

📡 PSNR — Peak Signal-to-Noise Ratio


The Formula

\[\text{PSNR} = 10 \cdot \log_{10}\left(\frac{\text{MAX}^2}{\text{MSE}}\right)\]

Where MAX is the maximum possible pixel value: 255 for 8-bit images, 1.0 for normalized tensors.


Intuition: Signal vs. Noise

Think of the image as a signal and the reconstruction error as noise. PSNR asks: “How much stronger is the true signal than the error noise?”

  • MAX² = maximum possible signal power
  • MSE = noise power
  • The ratio = signal-to-noise ratio (SNR)
  • 10 × log₁₀(·) converts it to decibels (dB) — the standard engineering unit
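The conversion is a one-liner. A minimal NumPy helper (the `max_val` default assumes 8-bit images; pass `1.0` for normalized tensors):

```python
import numpy as np

def psnr(mse: float, max_val: float = 255.0) -> float:
    """PSNR in dB from a precomputed MSE."""
    return 10.0 * np.log10(max_val ** 2 / mse)

# The MSE = 75 example from earlier lands just under 30 dB
print(round(psnr(75.0), 2))  # → 29.38
```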

What the Numbers Mean

| MSE | PSNR (8-bit) | Visual Quality |
|---|---|---|
| 0.01 | ~68 dB | Nearly perfect |
| 1 | ~48 dB | Excellent |
| 10 | ~38 dB | Good |
| 100 | ~28 dB | Noticeable degradation |
| 1000 | ~18 dB | Heavy distortion |

For CT reconstruction papers, a PSNR improvement of 2–3 dB over a filtered back-projection (FBP) baseline is considered meaningful. Getting above 40 dB is a strong result.


Why It Exists (If MSE Already Exists)

PSNR is just a more interpretable and industry-standard scale for MSE. It compresses the enormous dynamic range of MSE (0.001 to 10,000+) into a clean 20–70 dB range that’s easy to compare across papers and architectures.

Since PSNR is derived directly from MSE, it inherits the same perceptual blindspot: it doesn’t “see” structure or edges, only pixel magnitudes.


🧠 SSIM — Structural Similarity Index

This is where things get genuinely clever. SSIM was designed to mimic how the human visual system actually perceives image quality.


The Core Formula

\[\text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}\]

But the formula is really a product of three components:

\[\text{SSIM} = \underbrace{l(x,y)}_{\text{luminance}} \times \underbrace{c(x,y)}_{\text{contrast}} \times \underbrace{s(x,y)}_{\text{structure}}\]

Component 1 — Luminance

\[l(x,y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}\]

Compares the average brightness of each patch. If the reconstruction is systematically over- or under-exposed, this score drops.


Component 2 — Contrast

\[c(x,y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\]

Compares local variation (standard deviation). A blurry patch has low $\sigma$; a sharp one has high $\sigma$. Over-smoothed reconstructions are caught here.


Component 3 — Structure

\[s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}\]

This is the most perceptually meaningful part. $\sigma_{xy}$ is the cross-correlation between the two patches — it measures whether edge and texture patterns spatially align.

The constants $C_1$, $C_2$, $C_3$ (e.g., $C_1 = (0.01 \times 255)^2$) are tiny stability terms that prevent division by zero in flat, uniform regions. With the common choice $C_3 = C_2/2$, the product of the three components reduces exactly to the combined formula above.
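The combined formula can be evaluated directly on a pair of patches. A minimal NumPy sketch, assuming 8-bit data and the conventional constants ($K_1 = 0.01$, $K_2 = 0.03$):

```python
import numpy as np

# Conventional stability constants for 8-bit images
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2

def ssim_patch(x: np.ndarray, y: np.ndarray) -> float:
    """SSIM for one patch pair, straight from the combined formula."""
    mu_x, mu_y = x.mean(), y.mean()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (x.var() + y.var() + C2)
    return num / den
```

Identical patches score exactly 1.0, since the numerator and denominator coincide when $\mu_x = \mu_y$ and $\sigma_{xy} = \sigma_x^2 = \sigma_y^2$.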


How It’s Computed in Practice

SSIM doesn’t compare the whole image at once. It uses a sliding 11×11 Gaussian-weighted window:

```mermaid
graph TD
    A[Input Image Pair] --> B[Slide 11x11 Window]
    B --> C[Compute Local SSIM per Patch]
    C --> D[Average All Patches]
    D --> E[Final SSIM Score]

    style A fill:#4A90D9,color:#fff
    style B fill:#9B6EBD,color:#fff
    style C fill:#E8A838,color:#fff
    style D fill:#5BA85A,color:#fff
    style E fill:#D9534F,color:#fff
```

This local windowing is why SSIM catches spatially localized distortions that global MSE misses.
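A readable sketch of the windowed procedure. To keep it short it uses a small uniform window instead of the standard 11×11 Gaussian weighting; in practice you would call a library implementation such as `skimage.metrics.structural_similarity`:

```python
import numpy as np

C1, C2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2  # conventional stability constants

def ssim_windowed(x: np.ndarray, y: np.ndarray, win: int = 7) -> float:
    """Mean local SSIM over a sliding window (uniform, not Gaussian-weighted)."""
    scores = []
    for i in range(x.shape[0] - win + 1):
        for j in range(x.shape[1] - win + 1):
            px = x[i:i + win, j:j + win]
            py = y[i:i + win, j:j + win]
            mu_x, mu_y = px.mean(), py.mean()
            cov = ((px - mu_x) * (py - mu_y)).mean()
            num = (2 * mu_x * mu_y + C1) * (2 * cov + C2)
            den = (mu_x ** 2 + mu_y ** 2 + C1) * (px.var() + py.var() + C2)
            scores.append(num / den)
    return float(np.mean(scores))
```

Because each window is scored independently, a distortion confined to one region drags down only the patches that cover it, which is exactly the locality MSE lacks.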


The Key Insight — Why SSIM Beats MSE

Consider two distorted versions of the same image:

  • Version A: uniform Gaussian noise added everywhere
  • Version B: image shifted by 1 pixel

Both may have nearly identical MSE. But:

  • Version A destroys texture and edges → SSIM drops sharply
  • Version B preserves all structure, just displaced → SSIM stays relatively high

This matches human judgment far better.

SSIM can be used as a loss term: L_SSIM = 1 - SSIM. However, its non-linearity makes gradient behavior less predictable than MSE. A combined loss works best.


🔬 Full Comparison

| Property | MSE | PSNR | SSIM |
|---|---|---|---|
| Range | [0, ∞) | [0, ∞) dB | [-1, 1] |
| Better = | ↓ Lower | ↑ Higher | ↑ Higher (→ 1) |
| Perceptual quality | ❌ Poor | ❌ Poor | ✅ Good |
| Catches blur | ❌ Partial | ❌ Partial | ✅ Yes |
| Catches noise | ✅ Yes | ✅ Yes | ✅ Yes |
| Catches shift | ❌ No | ❌ No | ⚠️ Partial |
| Use as loss | ✅ Trivial | ✅ Via MSE | ⚠️ Possible |
| Compute cost | 🟢 Cheap | 🟢 Cheap | 🟡 Moderate |

🏥 Application to CT Reconstruction

In medical and semiconductor imaging, SSIM is the most clinically meaningful metric because:

  • Clinicians care about sharp boundaries (tissue edges, anomalies, implant contours)
  • A slightly blurry reconstruction may have decent PSNR but miss critical diagnostic features
  • Deep learning methods that optimize only MSE tend to produce over-smoothed outputs

Recommended training loss for CT reconstruction:

\[\mathcal{L} = \lambda_1 \cdot \text{MSE} + \lambda_2 \cdot (1 - \text{SSIM}) + \lambda_3 \cdot \mathcal{L}_{\text{perceptual}}\]

A typical starting point: $\lambda_1 = 0.8$, $\lambda_2 = 0.1$, $\lambda_3 = 0.1$.
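As a sketch, assuming the three terms are already computed per batch (the function name and keyword defaults are illustrative, with the weights from the recipe above):

```python
def combined_loss(mse: float, ssim: float, perceptual: float,
                  l1: float = 0.8, l2: float = 0.1, l3: float = 0.1) -> float:
    """Weighted reconstruction loss: MSE + (1 - SSIM) + perceptual term."""
    return l1 * mse + l2 * (1.0 - ssim) + l3 * perceptual

# e.g. MSE = 75, SSIM = 0.9, perceptual term omitted
print(combined_loss(75.0, 0.9, 0.0))
```

In a real training loop each term would be a differentiable tensor (e.g. a PyTorch scalar) rather than a float, but the weighting logic is identical.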


One-sentence intuition: MSE counts pixels, PSNR scales that into decibels, but SSIM actually looks at the picture the way you do.


Part of FYP notes series. Next: Perceptual Loss and Feature-Level Similarity Metrics.

This post is licensed under CC BY 4.0 by the author.