But What Is a Convolution? — A Visual Deep Dive
Convolutions are everywhere: image blurring, convolutional neural networks, audio effects, digital signal processing. Yet most textbook definitions obscure what is actually happening, and the formula rarely clicks on first read. This post builds the real intuition from first principles.
🎲 Start with Dice: Discrete Convolution
The cleanest way to understand convolution is through probability — specifically, adding two dice rolls.
Roll one 6-sided die. The outcome is uniform: each face has probability $\frac{1}{6}$. Roll two dice and add them. What’s the probability of getting a sum of 7?
\[P(\text{sum} = 7) = \sum_{k=1}^{6} P(A = k) \cdot P(B = 7 - k)\]

This is exactly a convolution. You flip one distribution, slide it across the other, multiply the corresponding values, and sum them up.
This is the discrete convolution formula: \((f * g)[n] = \sum_{k=-\infty}^{\infty} f[k] \cdot g[n-k]\) Read it as: for each output position $n$, take a dot product of $f$ with a flipped and shifted copy of $g$.
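The dice example can be checked in a few lines of NumPy, where `np.convolve` performs exactly this flip-slide-multiply-sum:

```python
import numpy as np

# Each die: uniform probability 1/6 on faces 1..6.
die = np.full(6, 1 / 6)

# Convolving the two distributions gives the distribution of the sum.
# Index 0 of the result corresponds to sum = 2 (1+1), index 10 to sum = 12.
sum_dist = np.convolve(die, die)

# P(sum = 7): six ways (1+6, 2+5, ..., 6+1), each with probability 1/36.
print(sum_dist[7 - 2])  # 6/36 ≈ 0.1667
```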
🔢 Polynomial Multiplication = Convolution in Disguise
Here’s a surprising reveal. Consider two polynomials:
\[f(x) = a_0 + a_1 x + a_2 x^2, \quad g(x) = b_0 + b_1 x + b_2 x^2\]

Multiply them. The coefficient of $x^n$ in the product is:

\[c_n = \sum_{k} a_k \cdot b_{n-k}\]

That’s exactly a discrete convolution of the coefficient sequences $[a_0, a_1, a_2]$ and $[b_0, b_1, b_2]$.
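A quick sketch with made-up coefficients confirms that `np.convolve` multiplies polynomials:

```python
import numpy as np

# f(x) = 1 + 2x + 3x^2, g(x) = 4 + 5x + 6x^2
# (coefficient lists ordered from lowest to highest degree)
a = [1, 2, 3]
b = [4, 5, 6]

c = np.convolve(a, b)  # coefficients of the product f(x) * g(x)
print(c)  # [ 4 13 28 27 18], i.e. 4 + 13x + 28x^2 + 27x^3 + 18x^4
```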
| Operation | What’s being convolved |
|---|---|
| Adding dice | Two probability distributions |
| Polynomial multiplication | Coefficient sequences |
| Image blurring | Image pixel grid × Gaussian kernel |
| CNN forward pass | Feature maps × learned filters |
Convolution is abstract precisely because the same operation shows up in wildly different contexts. The formula is always the same; only the things being convolved change.
📈 Continuous Convolution — Sliding Functions
The continuous version replaces the sum with an integral:
\[(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) \cdot g(t - \tau) \, d\tau\]

```mermaid
flowchart LR
    A(["f(τ)\nFixed function"]):::blue
    B(["g(t - τ)\nFlipped + sliding"]):::green
    C(["Multiply\npointwise"]):::orange
    D(["Integrate\n(area under curve)"]):::purple
    E(["(f*g)(t)\nOutput at position t"]):::blue
    A --> C
    B --> C
    C --> D --> E
    classDef blue fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef green fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef orange fill:#E8A838,stroke:#b07820,color:#fff
    classDef purple fill:#9B6EBD,stroke:#6b4785,color:#fff
```
The key move: $g(t - \tau)$ is $g$ flipped ($\tau \to -\tau$) and then shifted by $t$. As $t$ varies, the flipped copy slides across $f$. At each position, compute the overlap (integral of the product). The result is the output signal.
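The integral can be approximated numerically as a Riemann sum: sample both functions, convolve discretely, and scale by the step size. As a sketch, convolving two unit boxes produces the classic triangle:

```python
import numpy as np

# Approximate the continuous convolution of two unit boxes on [0, 1)
# by scaling a discrete convolution with the sample spacing dt.
dt = 0.001
tau = np.arange(0, 1, dt)
f = np.ones_like(tau)  # box function: 1 on [0, 1)
g = np.ones_like(tau)

conv = np.convolve(f, g) * dt  # Riemann-sum approximation of the integral

# Box * box = triangle: peak value 1 at t = 1, total area 1 * 1 = 1.
print(conv.max())       # ≈ 1.0 (peak of the triangle)
print(conv.sum() * dt)  # ≈ 1.0 (output area = product of the two areas)
```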
👁️ Image Blurring — The Spatial Convolution
In 2D image processing, convolution slides a small kernel (e.g. a Gaussian) across an image:
\[\text{Output}[x, y] = \sum_{i}\sum_{j} \text{Image}[x-i, y-j] \cdot \text{Kernel}[i, j]\]

| Kernel | Effect |
|---|---|
| Gaussian (circular blob) | Smooth blur |
| Laplacian (center positive, edge negative) | Edge detection / sharpening |
| Sobel horizontal | Horizontal edge detection |
| Identity (1 in center) | No change |
The Gaussian blur kernel has weight $K(i,j) \propto e^{-(i^2+j^2)/2\sigma^2}$. Larger $\sigma$ → wider blur. In CNNs, the network learns what kernels to use — no human crafting required.
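A minimal blur sketch, using the Gaussian's separability ($K(i,j) = k(i)\,k(j)$, so one 1D pass along rows and one along columns) rather than a library routine. Blurring a single bright pixel spreads it into the kernel itself:

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Sampled 1D Gaussian weights, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(image, sigma=1.0):
    """Separable Gaussian blur: convolve each row, then each column."""
    k = gaussian_kernel1d(sigma, radius=int(3 * sigma))
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

img = np.zeros((9, 9))
img[4, 4] = 1.0  # single bright pixel (an impulse)
blurred = gaussian_blur(img, sigma=1.0)

# The impulse spreads into the 2D Gaussian; total brightness is preserved.
print(blurred.sum())  # ≈ 1.0
```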
⚡ The Fourier Connection — Convolution Theorem
Here’s the most profound result in all of signal processing:
\[\boxed{\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}}\]

Convolution in time/space = pointwise multiplication in frequency.
```mermaid
flowchart TD
    A(["f(t) and g(t)\nTime domain"]):::blue
    B["Direct convolution\nO(n²) — SLOW"]:::warn
    C(["(f*g)(t)\nOutput"]):::blue
    A --> B --> C
    A2(["f(t) and g(t)\nTime domain"]):::blue
    D["FFT both signals\nO(n log n)"]:::green
    E["Multiply F·G\npointwise — O(n)"]:::green
    F["Inverse FFT\nO(n log n)"]:::green
    C2(["(f*g)(t)\nSame output!"]):::blue
    A2 --> D --> E --> F --> C2
    classDef blue fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef green fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef warn fill:#D9534F,stroke:#a02020,color:#fff
```
Why this matters: Naive convolution of two length-$n$ signals takes $O(n^2)$ operations. The FFT-based approach takes only $O(n \log n)$ — a massive speedup. For a 1M-sample audio signal, that’s the difference between minutes and milliseconds.
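Both routes can be checked against each other. A sketch using NumPy's FFT, zero-padding both signals to the full output length so circular wrap-around does not corrupt the result:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(1000)
g = rng.standard_normal(1000)

# Direct O(n^2) convolution.
direct = np.convolve(f, g)

# FFT route, O(n log n): zero-pad to the full output length,
# multiply spectra pointwise, then transform back.
n = len(f) + len(g) - 1
fast = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)

print(np.allclose(direct, fast))  # True
```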
🧠 Why Does This Work?
Sinusoids (more precisely, complex exponentials) are the eigenfunctions of convolution. Convolve a sine wave of frequency $f$ with any kernel and you get the same sine wave back, just scaled and phase-shifted. That scale factor is exactly $G(f)$, the Fourier transform of the kernel evaluated at $f$.
So instead of laboriously sliding $g$ across $f$:
- Decompose $f$ into its sine-wave components (FFT)
- Scale each component by how $g$ responds to that frequency
- Reconstruct (inverse FFT)
The Fourier transform turns the messy sliding operation into simple multiplication.
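The eigenfunction property can be demonstrated directly: circularly convolve a sinusoid with a kernel and check that the output is still a sinusoid of the same frequency, with gain $|G(f)|$. A minimal sketch (the box kernel and the frequency 5 are arbitrary choices):

```python
import numpy as np

n = 256
t = np.arange(n)
x = np.sin(2 * np.pi * 5 * t / n)  # sinusoid: exactly 5 cycles per period

g = np.zeros(n)
g[:9] = 1 / 9  # a simple box-average kernel on the same n-point grid

# Circular convolution computed directly in the time domain:
# y[m] = sum_k x[k] * g[(m - k) mod n]
y = np.array([np.dot(x, np.roll(g[::-1], m + 1)) for m in range(n)])

# The output is STILL a frequency-5 sinusoid: all its spectral energy
# sits in bin 5, and the gain there equals |G[5]|, the kernel's response.
X, Y, G = np.fft.rfft(x), np.fft.rfft(y), np.fft.rfft(g)
print(abs(Y[5]) / abs(X[5]))  # the measured gain...
print(abs(G[5]))              # ...matches the kernel's frequency response
```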
This is also why convolutional filters can be interpreted in frequency terms: a Gaussian blur kernel is a low-pass filter, an edge detector is a high-pass filter, and a CNN’s learned kernels can be analyzed in exactly the same way.
🔗 Why Convolutions Are Everywhere in Deep Learning
```mermaid
flowchart TD
    IMG(["Input Image\n224×224×3"]):::blue
    C1["Conv Layer 1\n3×3 kernels × 64\n→ edges, colors"]:::green
    C2["Conv Layer 2\n3×3 × 128\n→ textures, corners"]:::green
    C3["Conv Layer 3+\n→ objects, semantic parts"]:::green
    OUT(["Classification / Detection"]):::purple
    IMG --> C1 --> C2 --> C3 --> OUT
    classDef blue fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef green fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef purple fill:#9B6EBD,stroke:#6b4785,color:#fff
```
CNNs work because:
- Translation equivariance — a convolution applied to a shifted image gives a shifted output. The kernel “sees” the same thing regardless of where it appears in the image.
- Weight sharing — one kernel is used across the entire image, massively reducing parameters vs. a fully-connected layer.
- Hierarchical feature learning — stack convolutions and you build up from pixels → edges → textures → parts → objects.
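Translation equivariance is easy to verify numerically. A sketch with periodic boundaries (so shifts are exact) and a hand-rolled circular 2D convolution:

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.standard_normal((16, 16))
kernel = rng.standard_normal((3, 3))

def conv2d_circular(image, k):
    """2D circular convolution: sum of shifted copies weighted by the kernel."""
    out = np.zeros_like(image)
    r = k.shape[0] // 2
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            out += k[di + r, dj + r] * np.roll(image, (di, dj), axis=(0, 1))
    return out

shift = (3, 5)
# Convolve-then-shift equals shift-then-convolve: translation equivariance.
a = np.roll(conv2d_circular(img, kernel), shift, axis=(0, 1))
b = conv2d_circular(np.roll(img, shift, axis=(0, 1)), kernel)
print(np.allclose(a, b))  # True
```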
| Property | Fully Connected | Convolutional |
|---|---|---|
| Parameters (224×224×3 input, 64 feature maps) | ~3.2 billion | ~1,728 (3×3×3 kernels × 64) |
| Translation equivariant | ❌ | ✅ |
| Spatial structure preserved | ❌ | ✅ |
| Expressiveness | High (everywhere) | High (local) |
🩻 Convolution in My FYP — CT Reconstruction
In sparse-view CT reconstruction, convolution is baked into the physics:
- Filtered backprojection (FDK, the Feldkamp–Davis–Kress algorithm for cone-beam geometry) applies a ramp filter via 1D convolution along the detector rows before backprojecting, using the convolution theorem to do the filtering efficiently in the frequency domain.
- The 3D U-Net artifact removal network uses strided 3D convolutions to learn spatially-aware features from the artifact-corrupted FDK output.
- The data-consistency layer constrains the U-Net output to match the forward projection — itself implemented via discrete convolution in sinogram space.
The ramp filter $h(u)$ with frequency response $H(f) = |f|$ is applied by: FFT each detector row → multiply by $|f|$ → inverse FFT. That’s the convolution theorem in action inside clinical CT scanners.
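A minimal sketch of that filtering step. The function name is hypothetical, and real FDK implementations additionally zero-pad each row and apodize the ramp (e.g. with a Hann window) to control ringing and noise:

```python
import numpy as np

def ramp_filter_rows(sinogram_rows):
    """Apply the FBP ramp filter |f| to each detector row via the FFT.
    Sketch only: no zero-padding or apodization window."""
    n = sinogram_rows.shape[-1]
    ramp = np.fft.rfftfreq(n)                    # nonnegative frequencies = |f|
    spectrum = np.fft.rfft(sinogram_rows, axis=-1)
    return np.fft.irfft(spectrum * ramp, n, axis=-1)

# Hypothetical data: 4 detector rows of 128 samples each.
rows = np.random.default_rng(2).standard_normal((4, 128))
filtered = ramp_filter_rows(rows)
print(filtered.shape)  # (4, 128)
```

Note that multiplying by $|f|$ zeroes the DC bin, so each filtered row sums to zero, which is exactly the high-pass character that sharpens the backprojection.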
💡 One-Sentence Intuition
A convolution asks: at each position, how much do these two shapes “overlap” — and the answer, traced across all positions, is the output.
▶️ Watch It Yourself
3Blue1Brown’s visual treatment is genuinely the best explanation of convolutions I’ve seen. The animated sliding-and-multiplying view makes the formula visceral.
Part of my notes series on signal processing and deep learning. Next: the Fourier Transform — what it actually means to decompose a signal into frequencies.