But What Is a Convolution? — A Visual Deep Dive
Convolutions are everywhere: image blurring, convolutional neural networks, audio effects, digital signal processing. Yet most textbook definitions obscure what is actually happening, and the formula rarely clicks on first read. This post builds the real intuition from first principles.
🎲 Start with Dice: Discrete Convolution
The cleanest way to understand convolution is through probability — specifically, adding two dice rolls.
Roll one 6-sided die. The outcome is uniform: each face has probability $\frac{1}{6}$. Roll two dice and add them. What’s the probability of getting a sum of 7?
\[P(\text{sum} = 7) = \sum_{k=1}^{6} P(A = k) \cdot P(B = 7 - k)\]

This is exactly a convolution. You flip one distribution, slide it across the other, multiply the corresponding values, and sum them up.
This is the discrete convolution formula: \((f * g)[n] = \sum_{k=-\infty}^{\infty} f[k] \cdot g[n-k]\) Read it as: for each output position $n$, take a dot product of $f$ with a flipped and shifted copy of $g$.
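The dice example can be checked in a few lines of NumPy, where `np.convolve` performs exactly this flip-slide-multiply-sum:

```python
import numpy as np

# Each die: uniform probability 1/6 on faces 1..6.
die = np.full(6, 1 / 6)

# Convolving the two distributions gives the distribution of the sum.
# Index 0 of the result corresponds to sum = 2 (1+1), index 10 to sum = 12.
sum_dist = np.convolve(die, die)

# P(sum = 7): six ways (1+6, 2+5, ..., 6+1), each with probability 1/36.
print(sum_dist[7 - 2])  # 6/36 ≈ 0.1667
```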
🔢 Polynomial Multiplication = Convolution in Disguise
Here’s a surprising reveal. Consider two polynomials:
\[f(x) = a_0 + a_1 x + a_2 x^2, \quad g(x) = b_0 + b_1 x + b_2 x^2\]

Multiply them. The coefficient of $x^n$ in the product is:

\[c_n = \sum_{k} a_k \cdot b_{n-k}\]

That’s exactly a discrete convolution of the coefficient sequences $[a_0, a_1, a_2]$ and $[b_0, b_1, b_2]$.
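A quick sketch with made-up coefficients confirms that `np.convolve` multiplies polynomials:

```python
import numpy as np

# f(x) = 1 + 2x + 3x^2, g(x) = 4 + 5x + 6x^2
# (coefficient lists ordered from lowest to highest degree)
a = [1, 2, 3]
b = [4, 5, 6]

c = np.convolve(a, b)  # coefficients of the product f(x) * g(x)
print(c)  # [ 4 13 28 27 18], i.e. 4 + 13x + 28x^2 + 27x^3 + 18x^4
```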
| Operation | What’s being convolved |
|---|---|
| Adding dice | Two probability distributions |
| Polynomial multiplication | Coefficient sequences |
| Image blurring | Image pixel grid × Gaussian kernel |
| CNN forward pass | Feature maps × learned filters |
Convolution is abstract precisely because the same operation shows up in wildly different contexts. The formula is always the same; only the things being convolved change.
📈 Continuous Convolution — Sliding Functions
The continuous version replaces the sum with an integral:
\[(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) \cdot g(t - \tau) \, d\tau\]

```mermaid
flowchart LR
    A(["f(τ)\nFixed function"]):::blue
    B(["g(t - τ)\nFlipped + sliding"]):::green
    C(["Multiply\npointwise"]):::orange
    D(["Integrate\n(area under curve)"]):::purple
    E(["(f*g)(t)\nOutput at position t"]):::blue
    A --> C
    B --> C
    C --> D --> E
    classDef blue fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef green fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef orange fill:#E8A838,stroke:#b07820,color:#fff
    classDef purple fill:#9B6EBD,stroke:#6b4785,color:#fff
```
The key move: $g(t - \tau)$ is $g$ flipped ($\tau \to -\tau$) and then shifted by $t$. As $t$ varies, the flipped copy slides across $f$. At each position, compute the overlap (integral of the product). The result is the output signal.
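The integral can be approximated numerically as a Riemann sum: sample both functions, convolve discretely, and scale by the step size. As a sketch, convolving two unit boxes produces the classic triangle:

```python
import numpy as np

# Approximate the continuous convolution of two unit boxes on [0, 1)
# by scaling a discrete convolution with the sample spacing dt.
dt = 0.001
tau = np.arange(0, 1, dt)
f = np.ones_like(tau)  # box function: 1 on [0, 1)
g = np.ones_like(tau)

conv = np.convolve(f, g) * dt  # Riemann-sum approximation of the integral

# Box * box = triangle: peak value 1 at t = 1, total area 1 * 1 = 1.
print(conv.max())       # ≈ 1.0 (peak of the triangle)
print(conv.sum() * dt)  # ≈ 1.0 (output area = product of the two areas)
```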
👁️ Image Blurring — The Spatial Convolution
In 2D image processing, convolution slides a small kernel (e.g. a Gaussian) across an image:
\[\text{Output}[x, y] = \sum_{i}\sum_{j} \text{Image}[x-i, y-j] \cdot \text{Kernel}[i, j]\]

| Kernel | Effect |
|---|---|
| Gaussian (circular blob) | Smooth blur |
| Laplacian (center positive, edge negative) | Edge detection / sharpening |
| Sobel horizontal | Horizontal edge detection |
| Identity (1 in center) | No change |
The Gaussian blur kernel has weight $K(i,j) \propto e^{-(i^2+j^2)/2\sigma^2}$. Larger $\sigma$ → wider blur. In CNNs, the network learns what kernels to use — no human crafting required.
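A minimal blur sketch, using the Gaussian's separability ($K(i,j) = k(i)\,k(j)$, so one 1D pass along rows and one along columns) rather than a library routine. Blurring a single bright pixel spreads it into the kernel itself:

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Sampled 1D Gaussian weights, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(image, sigma=1.0):
    """Separable Gaussian blur: convolve each row, then each column."""
    k = gaussian_kernel1d(sigma, radius=int(3 * sigma))
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

img = np.zeros((9, 9))
img[4, 4] = 1.0  # single bright pixel (an impulse)
blurred = gaussian_blur(img, sigma=1.0)

# The impulse spreads into the 2D Gaussian; total brightness is preserved.
print(blurred.sum())  # ≈ 1.0
```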
⚡ The Fourier Connection — Convolution Theorem
Here’s the most profound result in all of signal processing:
\[\boxed{\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}}\]

Convolution in time/space = pointwise multiplication in frequency.
```mermaid
flowchart TD
    A(["f(t) and g(t)\nTime domain"]):::blue
    B["Direct convolution\nO(n²) — SLOW"]:::warn
    C(["(f*g)(t)\nOutput"]):::blue
    A --> B --> C
    A2(["f(t) and g(t)\nTime domain"]):::blue
    D["FFT both signals\nO(n log n)"]:::green
    E["Multiply F·G\npointwise — O(n)"]:::green
    F["Inverse FFT\nO(n log n)"]:::green
    C2(["(f*g)(t)\nSame output!"]):::blue
    A2 --> D --> E --> F --> C2
    classDef blue fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef green fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef warn fill:#D9534F,stroke:#a02020,color:#fff
```
Why this matters: Naive convolution of two length-$n$ signals takes $O(n^2)$ operations. The FFT-based approach takes only $O(n \log n)$ — a massive speedup. For a 1M-sample audio signal, that’s the difference between minutes and milliseconds.
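Both routes can be checked against each other. A sketch using NumPy's FFT, zero-padding both signals to the full output length so circular wrap-around does not corrupt the result:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(1000)
g = rng.standard_normal(1000)

# Direct O(n^2) convolution.
direct = np.convolve(f, g)

# FFT route, O(n log n): zero-pad to the full output length,
# multiply spectra pointwise, then transform back.
n = len(f) + len(g) - 1
fast = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)

print(np.allclose(direct, fast))  # True
```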
🧠 Why Does This Work?
Sinusoids (more precisely, complex exponentials) are the eigenfunctions of convolution. Convolve a sine wave of frequency $f$ with any kernel and you get the same sine wave back, just scaled and phase-shifted. That scale factor is exactly $G(f)$, the Fourier transform of the kernel evaluated at $f$.
So instead of laboriously sliding $g$ across $f$:
- Decompose $f$ into its sine-wave components (FFT)
- Scale each component by how $g$ responds to that frequency
- Reconstruct (inverse FFT)
The Fourier transform turns the messy sliding operation into simple multiplication.
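The eigenfunction property can be demonstrated directly: circularly convolve a sinusoid with a kernel and check that the output is still a sinusoid of the same frequency, with gain $|G(f)|$. A minimal sketch (the box kernel and the frequency 5 are arbitrary choices):

```python
import numpy as np

n = 256
t = np.arange(n)
x = np.sin(2 * np.pi * 5 * t / n)  # sinusoid: exactly 5 cycles per period

g = np.zeros(n)
g[:9] = 1 / 9  # a simple box-average kernel on the same n-point grid

# Circular convolution computed directly in the time domain:
# y[m] = sum_k x[k] * g[(m - k) mod n]
y = np.array([np.dot(x, np.roll(g[::-1], m + 1)) for m in range(n)])

# The output is STILL a frequency-5 sinusoid: all its spectral energy
# sits in bin 5, and the gain there equals |G[5]|, the kernel's response.
X, Y, G = np.fft.rfft(x), np.fft.rfft(y), np.fft.rfft(g)
print(abs(Y[5]) / abs(X[5]))  # the measured gain...
print(abs(G[5]))              # ...matches the kernel's frequency response
```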
This is also why convolutional filters can be interpreted in frequency terms: a Gaussian blur kernel is a low-pass filter, an edge detector is a high-pass filter, and a CNN’s learned kernels can be analyzed in exactly the same way.
🔗 Why Convolutions Are Everywhere in Deep Learning
```mermaid
flowchart TD
    IMG(["Input Image\n224×224×3"]):::blue
    C1["Conv Layer 1\n3×3 kernels × 64\n→ edges, colors"]:::green
    C2["Conv Layer 2\n3×3 × 128\n→ textures, corners"]:::green
    C3["Conv Layer 3+\n→ objects, semantic parts"]:::green
    OUT(["Classification / Detection"]):::purple
    IMG --> C1 --> C2 --> C3 --> OUT
    classDef blue fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef green fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef purple fill:#9B6EBD,stroke:#6b4785,color:#fff
```
CNNs work because:
- Translation equivariance — a convolution applied to a shifted image gives a shifted output. The kernel “sees” the same thing regardless of where it appears in the image.
- Weight sharing — one kernel is used across the entire image, massively reducing parameters vs. a fully-connected layer.
- Hierarchical feature learning — stack convolutions and you build up from pixels → edges → textures → parts → objects.
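Translation equivariance is easy to verify numerically. A sketch with periodic boundaries (so shifts are exact) and a hand-rolled circular 2D convolution:

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.standard_normal((16, 16))
kernel = rng.standard_normal((3, 3))

def conv2d_circular(image, k):
    """2D circular convolution: sum of shifted copies weighted by the kernel."""
    out = np.zeros_like(image)
    r = k.shape[0] // 2
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            out += k[di + r, dj + r] * np.roll(image, (di, dj), axis=(0, 1))
    return out

shift = (3, 5)
# Convolve-then-shift equals shift-then-convolve: translation equivariance.
a = np.roll(conv2d_circular(img, kernel), shift, axis=(0, 1))
b = conv2d_circular(np.roll(img, shift, axis=(0, 1)), kernel)
print(np.allclose(a, b))  # True
```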
| Property | Fully Connected | Convolutional |
|---|---|---|
| Parameters (224×224×3 input, 64 feature maps) | ~3.2 billion | ~1,728 (3×3×3 kernels × 64) |
| Translation equivariant | ❌ | ✅ |
| Spatial structure preserved | ❌ | ✅ |
| Expressiveness | High (everywhere) | High (local) |
🩻 Convolution in My FYP — CT Reconstruction
In sparse-view CT reconstruction, convolution is baked into the physics:
- Filtered backprojection (FDK, the Feldkamp–Davis–Kress algorithm for cone-beam geometry) applies a ramp filter via 1D convolution along the detector rows before backprojecting, using the convolution theorem to do the filtering efficiently in the frequency domain.
- The 3D U-Net artifact removal network uses strided 3D convolutions to learn spatially-aware features from the artifact-corrupted FDK output.
- The data-consistency layer constrains the U-Net output to match the forward projection — itself implemented via discrete convolution in sinogram space.
The ramp filter $h(u)$ with frequency response $H(f) = |f|$ is applied by: FFT each detector row → multiply by $|f|$ → inverse FFT. That’s the convolution theorem in action inside clinical CT scanners.
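A minimal sketch of that filtering step. The function name is hypothetical, and real FDK implementations additionally zero-pad each row and apodize the ramp (e.g. with a Hann window) to control ringing and noise:

```python
import numpy as np

def ramp_filter_rows(sinogram_rows):
    """Apply the FBP ramp filter |f| to each detector row via the FFT.
    Sketch only: no zero-padding or apodization window."""
    n = sinogram_rows.shape[-1]
    ramp = np.fft.rfftfreq(n)                    # nonnegative frequencies = |f|
    spectrum = np.fft.rfft(sinogram_rows, axis=-1)
    return np.fft.irfft(spectrum * ramp, n, axis=-1)

# Hypothetical data: 4 detector rows of 128 samples each.
rows = np.random.default_rng(2).standard_normal((4, 128))
filtered = ramp_filter_rows(rows)
print(filtered.shape)  # (4, 128)
```

Note that multiplying by $|f|$ zeroes the DC bin, so each filtered row sums to zero, which is exactly the high-pass character that sharpens the backprojection.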
💡 One-Sentence Intuition
A convolution asks: at each position, how much do these two shapes “overlap” — and the answer, traced across all positions, is the output.
▶️ Watch It Yourself
3Blue1Brown’s visual treatment is genuinely the best explanation of convolutions I’ve seen. The animated sliding-and-multiplying view makes the formula visceral.
Part of my notes series on signal processing and deep learning. Next: the Fourier Transform — what it actually means to decompose a signal into frequencies.