
But What Is a Convolution? — A Visual Deep Dive

Convolutions are the backbone of image processing, convolutional neural networks, and digital signal processing — and yet most textbook definitions obscure what’s actually happening. Here’s the real intuition.


🎲 Start with Dice: Discrete Convolution

The cleanest way to understand convolution is through probability — specifically, adding two dice rolls.

Roll one 6-sided die. The outcome is uniform: each face has probability $\frac{1}{6}$. Roll two dice and add them. What’s the probability of getting a sum of 7?

\[P(\text{sum} = 7) = \sum_{k=1}^{6} P(A = k) \cdot P(B = 7 - k)\]

This is exactly convolution. You flip one distribution, slide it across the other, multiply corresponding values, and sum them up.

This is the discrete convolution formula:

\[(f * g)[n] = \sum_{k=-\infty}^{\infty} f[k] \cdot g[n-k]\]

Read it as: for each output position $n$, take a dot product of $f$ with a flipped and shifted copy of $g$.
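The dice example can be checked in a few lines. This is a minimal sketch assuming numpy, whose `np.convolve` implements exactly the sum above:

```python
import numpy as np

# Uniform distribution over faces 1..6 (index 0 = face 1).
die = np.full(6, 1 / 6)

# Convolving the distribution with itself gives the distribution
# of the sum of two independent rolls.
two_dice = np.convolve(die, die)  # length 11: sums 2..12

# P(sum = 7): sum = index + 2, so index 5.
p7 = two_dice[5]
print(p7)  # 6/36, the classic answer
```

Every index of `two_dice` is one "slide position" of the flipped distribution.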


🔢 Polynomial Multiplication = Convolution in Disguise

Here’s a surprising reveal. Consider two polynomials:

\[f(x) = a_0 + a_1 x + a_2 x^2, \quad g(x) = b_0 + b_1 x + b_2 x^2\]

Multiply them. The coefficient of $x^n$ in the product is:

\[c_n = \sum_{k} a_k \cdot b_{n-k}\]

That’s exactly a discrete convolution of the coefficient sequences $[a_0, a_1, a_2]$ and $[b_0, b_1, b_2]$.
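You can see the disguise drop numerically. A small sketch, assuming numpy: convolving the coefficient lists reproduces polynomial multiplication term by term.

```python
import numpy as np

a = [1, 2, 3]  # f(x) = 1 + 2x + 3x^2  (a_0, a_1, a_2)
b = [4, 5, 6]  # g(x) = 4 + 5x + 6x^2  (b_0, b_1, b_2)

# Coefficients of f(x)·g(x), lowest degree first:
# c_n = sum_k a_k * b_{n-k}
c = np.convolve(a, b)
print(c)  # [ 4 13 28 27 18]
```

Check the middle entry by hand: $c_2 = 1\cdot6 + 2\cdot5 + 3\cdot4 = 28$.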

| Operation | What's being convolved |
| --- | --- |
| Adding dice | Two probability distributions |
| Polynomial multiplication | Coefficient sequences |
| Image blurring | Image pixel grid × Gaussian kernel |
| CNN forward pass | Feature maps × learned filters |

Convolution is abstract precisely because the same operation shows up in wildly different contexts. The formula is always the same; only the things being convolved change.


📈 Continuous Convolution — Sliding Functions

The continuous version replaces the sum with an integral:

\[(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) \cdot g(t - \tau) \, d\tau\]

```mermaid
flowchart LR
    A(["f(τ)\nFixed function"]):::blue
    B(["g(t - τ)\nFlipped + sliding"]):::green
    C(["Multiply\npointwise"]):::orange
    D(["Integrate\n(area under curve)"]):::purple
    E(["(f*g)(t)\nOutput at position t"]):::blue

    A --> C
    B --> C
    C --> D --> E

    classDef blue fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef green fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef orange fill:#E8A838,stroke:#b07820,color:#fff
    classDef purple fill:#9B6EBD,stroke:#6b4785,color:#fff
```

The key move: $g(t - \tau)$ is $g$ flipped ($\tau \to -\tau$) and then shifted by $t$. As $t$ varies, the flipped copy slides across $f$. At each position, compute the overlap (integral of the product). The result is the output signal.
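The flip-shift-multiply-sum recipe translates directly into code. Here is a sketch of the discrete analogue written straight from the definition (assuming numpy only for arrays), checked against numpy's own implementation:

```python
import numpy as np

def flip_and_slide(f, g):
    """Discrete convolution from the definition: for each output
    position n, dot-product f with a flipped, shifted copy of g."""
    n_out = len(f) + len(g) - 1
    out = np.zeros(n_out)
    for n in range(n_out):
        for k in range(len(f)):
            if 0 <= n - k < len(g):       # g[n-k] is the flip + shift
                out[n] += f[k] * g[n - k]
    return out

f = np.array([1.0, 2.0, 0.5])
g = np.array([0.25, 0.5, 0.25])
ours = flip_and_slide(f, g)
ref = np.convolve(f, g)  # numpy's implementation of the same sum
```

The continuous integral is this loop with the sum over $k$ replaced by an integral over $\tau$.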


👁️ Image Blurring — The Spatial Convolution

In 2D image processing, convolution slides a small kernel (e.g. a Gaussian) across an image:

\[\text{Output}[x, y] = \sum_{i}\sum_{j} \text{Image}[x-i, y-j] \cdot \text{Kernel}[i, j]\]

| Kernel | Effect |
| --- | --- |
| Gaussian (circular blob) | Smooth blur |
| Laplacian (center positive, edge negative) | Edge detection / sharpening |
| Sobel horizontal | Horizontal edge detection |
| Identity (1 in center) | No change |

The Gaussian blur kernel has weight $K(i,j) \propto e^{-(i^2+j^2)/2\sigma^2}$. Larger $\sigma$ → wider blur. In CNNs, the network learns what kernels to use — no human crafting required.
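A minimal numpy sketch of a Gaussian blur, built from the kernel formula above. (Strictly, sliding without flipping is cross-correlation, but a Gaussian is symmetric, so the two coincide.)

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """size×size kernel with K(i,j) ∝ exp(-(i²+j²)/2σ²), normalised to sum 1."""
    r = size // 2
    i, j = np.mgrid[-r:r + 1, -r:r + 1]
    k = np.exp(-(i**2 + j**2) / (2 * sigma**2))
    return k / k.sum()

def conv2d(image, kernel):
    """'Valid' 2D convolution by explicit sliding windows.
    No flip needed here because the Gaussian kernel is symmetric."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

img = np.random.rand(32, 32)
blurred = conv2d(img, gaussian_kernel(5, sigma=1.0))
```

Because the kernel sums to 1, a constant image passes through unchanged; that is why blurring preserves overall brightness.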


⚡ The Fourier Connection — Convolution Theorem

Here’s the most profound result in all of signal processing:

\[\boxed{\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}}\]

Convolution in time/space = pointwise multiplication in frequency.

```mermaid
flowchart TD
    A(["f(t) and g(t)\nTime domain"]):::blue
    B["Direct convolution\nO(n²) — SLOW"]:::warn
    C(["(f*g)(t)\nOutput"]):::blue

    A --> B --> C

    A2(["f(t) and g(t)\nTime domain"]):::blue
    D["FFT both signals\nO(n log n)"]:::green
    E["Multiply F·G\npointwise — O(n)"]:::green
    F["Inverse FFT\nO(n log n)"]:::green
    C2(["(f*g)(t)\nSame output!"]):::blue

    A2 --> D --> E --> F --> C2

    classDef blue fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef green fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef warn fill:#D9534F,stroke:#a02020,color:#fff
```

Why this matters: Naive convolution of two length-$n$ signals takes $O(n^2)$ operations. The FFT-based approach takes only $O(n \log n)$ — a massive speedup. For a 1M-sample audio signal, that’s the difference between minutes and milliseconds.
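The FFT route is short enough to write out. A sketch assuming numpy: zero-pad both signals to the full output length, transform, multiply, transform back, and the result matches direct convolution to floating-point precision.

```python
import numpy as np

def fft_convolve(f, g):
    """Linear convolution via the convolution theorem:
    zero-pad to length len(f)+len(g)-1, FFT, multiply pointwise, inverse FFT."""
    n = len(f) + len(g) - 1
    F = np.fft.rfft(f, n)   # rfft(x, n) zero-pads x to length n
    G = np.fft.rfft(g, n)
    return np.fft.irfft(F * G, n)

f = np.random.rand(1000)
g = np.random.rand(50)
out_fft = fft_convolve(f, g)
out_direct = np.convolve(f, g)  # O(n²) reference
```

The zero-padding matters: without it the FFT computes *circular* convolution, and the ends of the output wrap around.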


🧠 Why Does This Work?

Sine waves are the eigenfunctions of convolution. Feed a sine wave of frequency $f$ through convolution with any kernel $g$, and out comes the same sine wave, just scaled and phase-shifted. That complex scale factor is exactly $G(f)$: the Fourier transform of the kernel, evaluated at $f$.

So instead of laboriously sliding $g$ across $f$:

  1. Decompose $f$ into its sine-wave components (FFT)
  2. Scale each component by how $g$ responds to that frequency
  3. Reconstruct (inverse FFT)

The Fourier transform turns the messy sliding operation into simple multiplication.
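The eigenfunction claim can be verified numerically. A sketch assuming numpy, using circular convolution (via FFT) so a sine at an integer bin frequency fits the signal exactly: after convolving with an arbitrary kernel, all of the output's energy should still sit at the input frequency.

```python
import numpy as np

N = 256
k = 5                                        # integer frequency -> exact FFT bin
t = np.arange(N)
x = np.sin(2 * np.pi * k * t / N)            # input: pure sine wave
g = np.exp(-0.5 * ((t - N // 2) / 4) ** 2)   # arbitrary (Gaussian-shaped) kernel

# Circular convolution via the convolution theorem.
y = np.fft.ifft(np.fft.fft(x) * np.fft.fft(g)).real

# If sines are eigenfunctions, y is still a sinusoid at frequency k:
# its spectrum should be concentrated at bins k and N-k.
Y = np.fft.fft(y)
energy = np.abs(Y) ** 2
purity = energy[[k, N - k]].sum() / energy.sum()
```

The kernel changed the amplitude and phase (by exactly $G(f)$ at bin $k$), but not the frequency.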

This is also why CNN features can be interpreted in frequency terms: early layers learn low-frequency detectors (edges, blobs), deeper layers learn higher-frequency, more abstract patterns.


🔗 Why Convolutions Are Everywhere in Deep Learning

```mermaid
flowchart TD
    IMG(["Input Image\n224×224×3"]):::blue
    C1["Conv Layer 1\n3×3 kernels × 64\n→ edges, colors"]:::green
    C2["Conv Layer 2\n3×3 × 128\n→ textures, corners"]:::green
    C3["Conv Layer 3+\n→ objects, semantic parts"]:::green
    OUT(["Classification / Detection"]):::purple

    IMG --> C1 --> C2 --> C3 --> OUT

    classDef blue fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef green fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef purple fill:#9B6EBD,stroke:#6b4785,color:#fff
```

CNNs work because:

  • Translation equivariance — a convolution applied to a shifted image gives a shifted output. The kernel “sees” the same thing regardless of where it appears in the image.
  • Weight sharing — one kernel is used across the entire image, massively reducing parameters vs. a fully-connected layer.
  • Hierarchical feature learning — stack convolutions and you build up from pixels → edges → textures → parts → objects.

| Property | Fully Connected | Convolutional |
| --- | --- | --- |
| Parameters (224×224 image, 64 features) | ~3.2 billion | ~1,728 |
| Translation equivariant | ✗ | ✓ |
| Spatial structure preserved | ✗ | ✓ |
| Expressiveness | High (everywhere) | High (local) |
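Translation equivariance is easy to demonstrate in one dimension. A sketch assuming numpy, using circular convolution so the shift wraps cleanly: shifting the input and then convolving gives exactly the convolved output shifted by the same amount.

```python
import numpy as np

def circular_conv(x, k):
    """Circular 1D convolution via the FFT, so np.roll shifts wrap cleanly."""
    n = len(x)
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(k, n)).real

x = np.random.rand(64)
kern = np.array([0.25, 0.5, 0.25])  # small smoothing kernel
shift = 7

shifted_then_conv = circular_conv(np.roll(x, shift), kern)
conv_then_shifted = np.roll(circular_conv(x, kern), shift)
# The two orders agree: convolution commutes with translation.
```

A fully-connected layer has no such guarantee; it would have to relearn the same pattern at every position.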

🩻 Convolution in My FYP — CT Reconstruction

In sparse-view CT reconstruction, convolution is baked into the physics:

  • Filtered Backprojection (FDK) applies a ramp filter via 1D convolution along the detector rows before backprojecting — directly using the convolution theorem to do it efficiently in the frequency domain.
  • The 3D U-Net artifact removal network uses strided 3D convolutions to learn spatially-aware features from the artifact-corrupted FDK output.
  • The data-consistency layer constrains the U-Net output to match the forward projection — itself implemented via discrete convolution in sinogram space.

The ramp filter $h(u)$ has frequency response $H(f) = |f|$, and it is applied row by row: FFT each detector row → multiply by $|f|$ → inverse FFT. That's the convolution theorem in action inside clinical CT scanners.
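A sketch of ramp-filtering a single detector row, assuming numpy (real systems apodize the ramp, e.g. with Shepp-Logan or Hann windows, which is omitted here):

```python
import numpy as np

def ramp_filter_row(row):
    """Apply the ramp filter H(f) = |f| to one detector row via the
    convolution theorem: FFT -> multiply by |f| -> inverse FFT."""
    n = len(row)
    freqs = np.fft.fftfreq(n)  # signed frequencies in cycles/sample
    return np.fft.ifft(np.fft.fft(row) * np.abs(freqs)).real

row = np.random.rand(512)
filtered = ramp_filter_row(row)
```

Note that $|0| = 0$ kills the DC bin, so the filtered row always has zero mean; that high-pass behaviour is what removes the $1/r$ blurring inherent in plain backprojection.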

💡 One-Sentence Intuition

A convolution asks: at each position, how much do these two shapes “overlap” — and the answer, traced across all positions, is the output.


▶️ Watch It Yourself

3Blue1Brown’s visual treatment is genuinely the best explanation of convolutions I’ve seen. The animated sliding-and-multiplying view makes the formula visceral.


Part of my notes series on signal processing and deep learning. Next: the Fourier Transform — what it actually means to decompose a signal into frequencies.

This post is licensed under CC BY 4.0 by the author.