# LLMs, Foundation Models, and Transformers: How They Relate
A clear breakdown of three core AI terms — Foundation Models, LLMs, and Transformers — and how they nest together.
Three terms you’ll constantly encounter in AI — and they’re closely related but distinct.
## 🔵 The Hierarchy: Nested Circles
```mermaid
flowchart TD
    FM["🌐 Foundation Models\n(pre-trained on massive diverse data)"]:::fm
    LLM["📝 Large Language Models\n(text-specialized Foundation Models)"]:::llm
    EX["e.g. GPT-4, Claude, Llama, Gemini"]:::ex
    FM --> LLM --> EX
    classDef fm fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef llm fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef ex fill:#9B6EBD,stroke:#6b4785,color:#fff
```
Foundation Models are the broadest category. They’re pre-trained on massive, diverse datasets using self-supervised learning, designed to be adapted to many downstream tasks. Crucially, they can be multimodal — handling images, audio, video, text, or code. Examples: CLIP (vision + text), Whisper (audio), DALL-E (image generation).
LLMs (Large Language Models) are a subset of Foundation Models — specifically text-focused. “Large” means billions of parameters trained on huge text corpora. Examples: GPT-4, Claude, Llama, Gemini.
All LLMs are Foundation Models. Not all Foundation Models are LLMs.
| Term | What it is |
|---|---|
| Foundation Model | Pre-trained, general-purpose model |
| LLM | Foundation Model specialized for language |
## 👁️ What is “Attention”?
Before Transformers (2017), models read text word by word, left to right (RNNs). By the time they reached word 50, they’d mostly “forgotten” word 1.
Attention solves this: when processing any word, look at all other words simultaneously and decide which ones matter most.
Example:
“The animal didn’t cross the street because it was too tired.”
What does “it” refer to? → “animal”, not “street”. Attention teaches the model to attend strongly to “animal” when processing “it”, and weakly to “street”.
Mechanically:
- Each word gets a relevance score against every other word
- Scores become weights (via softmax)
- Weighted average = a context-aware representation of that word
Attention = “which words should I focus on right now?”
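The three-step mechanism above (scores → softmax weights → weighted average) can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product self-attention with random toy vectors, not a real model; the variable names and dimensions are chosen for clarity.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: scores -> softmax weights -> weighted average."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # relevance score of every word against every other word
    weights = softmax(scores)        # scores become weights that sum to 1
    return weights @ V, weights      # weighted average = context-aware representation

# Toy example: 4 "words", each an 8-dimensional vector (random, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, w = attention(X, X, X)  # self-attention: queries, keys, and values all come from X
print(w.shape)               # (4, 4): one weight per word pair; each row sums to 1
```

In a real Transformer, `Q`, `K`, and `V` are produced by learned linear projections of the input rather than being the raw embeddings, but the score/softmax/average core is exactly this.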
## ⚙️ What is a Transformer?
A Transformer is an architecture built entirely on attention — no RNNs, no convolutions.
Key components:
- Multi-head attention — run attention multiple times in parallel; each “head” learns different relationships (syntax, coreference, semantics…)
- Feed-forward layers — process each position independently after attention
- Positional encoding — since attention sees all words at once (no inherent order), position info is injected manually
- Stacked layers — early layers tend to capture syntax and surface patterns, later layers capture higher-level semantics
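To make the positional-encoding point concrete, here is a sketch of the sinusoidal scheme from the original Transformer paper — one common choice among several (many modern models use learned or rotary embeddings instead). Constants and dimensions here are illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even dimensions use sine, odd use cosine,
    at geometrically spaced frequencies, so every position gets a unique pattern."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) position indices
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / (10000 ** (i / d_model))  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=50, d_model=64)
# The encoding is simply added to the word embeddings before the first
# attention layer:  x = embeddings + pe
print(pe.shape)  # (50, 64)
```

Because attention itself is order-blind, without this added signal "the dog bit the man" and "the man bit the dog" would look identical to the model.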
## Why Transformers Dominate
| Old (RNN) | Transformer |
|---|---|
| Sequential, slow to train | Parallel, fast |
| Forgets long-range context | Sees everything at once |
| Hard to scale | Scales predictably with data + compute |
The scaling point is key. Transformers follow scaling laws — more data + bigger model = smarter model, almost predictably. This is what unlocked GPT, Claude, and modern AI at large.
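The "almost predictably" part can be sketched as a power law in parameter count, in the spirit of the scaling-law papers (Kaplan et al. 2020; Hoffmann et al. 2022). The coefficients below are roughly the parameter-scaling fit reported by Kaplan et al., but treat them as illustrative — real values depend on the dataset, architecture, and fitting procedure.

```python
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Power-law scaling sketch: loss falls smoothly as parameter count grows.
    n_c and alpha are illustrative constants in the spirit of Kaplan et al. 2020."""
    return (n_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

The practical consequence: because the curve is smooth, labs could train small models, fit the curve, and forecast how much better a 10x or 100x larger model would be before spending the compute.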
## 🧩 Putting It Together
```mermaid
flowchart LR
    T["⚙️ Transformer\n(the engine)"]:::engine --> FM["🌐 Foundation Model\n(the vehicle)"]:::fm
    FM --> LLM["📝 LLM\n(language-optimized vehicle)"]:::llm
    classDef engine fill:#D97B4A,stroke:#9e5430,color:#fff
    classDef fm fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef llm fill:#5BA85A,stroke:#3a6e39,color:#fff
```
Transformers are the engine. Foundation Models are the vehicle built with that engine. LLMs are a specific type of vehicle (the language-optimized one).
Transformers enabled the scale that made Foundation Models practical — and LLMs are just the language-flavored version of that breakthrough.
Part of my AI fundamentals series. Next: how attention scales into multi-head attention and why that matters for reasoning.