# LLMs, Foundation Models, and Transformers: How They Relate
A clear breakdown of three core AI terms — Foundation Models, LLMs, and Transformers — and how they nest together.
Three terms you’ll constantly encounter in AI — and they’re closely related but distinct.
## 🔵 The Hierarchy: Nested Circles
```mermaid
flowchart TD
    FM["🌐 Foundation Models\n(pre-trained on massive diverse data)"]:::fm
    LLM["📝 Large Language Models\n(text-specialized Foundation Models)"]:::llm
    EX["e.g. GPT-4, Claude, Llama, Gemini"]:::ex
    FM --> LLM --> EX
    classDef fm fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef llm fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef ex fill:#9B6EBD,stroke:#6b4785,color:#fff
```
Foundation Models are the broadest category. They’re pre-trained on massive, diverse datasets using self-supervised learning, designed to be adapted to many downstream tasks. Crucially, they can be multimodal — handling images, audio, video, text, or code. Examples: CLIP (vision + text), Whisper (audio), DALL-E (image generation).
LLMs (Large Language Models) are a subset of Foundation Models — specifically text-focused. “Large” means billions of parameters trained on huge text corpora. Examples: GPT-4, Claude, Llama, Gemini.
All LLMs are Foundation Models. Not all Foundation Models are LLMs.
| Term | What it is |
|---|---|
| Foundation Model | Pre-trained, general-purpose model |
| LLM | Foundation Model specialized for language |
## 👁️ What is “Attention”?
Before Transformers (2017), models read text word by word, left to right (RNNs). By the time they reached word 50, they’d mostly “forgotten” word 1.
Attention solves this: when processing any word, look at all other words simultaneously and decide which ones matter most.
Example:
“The animal didn’t cross the street because it was too tired.”
What does “it” refer to? → “animal”, not “street”. Attention teaches the model to attend strongly to “animal” when processing “it”, and weakly to “street”.
Mechanically:
- Each word gets a relevance score against every other word
- Scores become weights (via softmax)
- Weighted average = a context-aware representation of that word
Attention = “which words should I focus on right now?”
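The three-step mechanism above (scores → softmax weights → weighted average) can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product self-attention with random toy vectors, not a real model; the variable names and dimensions are chosen for clarity.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: scores -> softmax weights -> weighted average."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # relevance score of every word against every other word
    weights = softmax(scores)        # scores become weights that sum to 1
    return weights @ V, weights      # weighted average = context-aware representation

# Toy example: 4 "words", each an 8-dimensional vector (random, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, w = attention(X, X, X)  # self-attention: queries, keys, and values all come from X
print(w.shape)               # (4, 4): one weight per word pair; each row sums to 1
```

In a real Transformer, `Q`, `K`, and `V` are produced by learned linear projections of the input rather than being the raw embeddings, but the score/softmax/average core is exactly this.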
## ⚙️ What is a Transformer?
A Transformer is an architecture built entirely on attention — no RNNs, no convolutions.
Key components:
- Multi-head attention — run attention multiple times in parallel; each “head” learns different relationships (syntax, coreference, semantics…)
- Feed-forward layers — process each position independently after attention
- Positional encoding — since attention sees all words at once (no inherent order), position info is injected manually
- Stacked layers — early layers tend to capture syntax and surface patterns, later layers capture higher-level semantics
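To make the positional-encoding point concrete, here is a sketch of the sinusoidal scheme from the original Transformer paper — one common choice among several (many modern models use learned or rotary embeddings instead). Constants and dimensions here are illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even dimensions use sine, odd use cosine,
    at geometrically spaced frequencies, so every position gets a unique pattern."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) position indices
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / (10000 ** (i / d_model))  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=50, d_model=64)
# The encoding is simply added to the word embeddings before the first
# attention layer:  x = embeddings + pe
print(pe.shape)  # (50, 64)
```

Because attention itself is order-blind, without this added signal "the dog bit the man" and "the man bit the dog" would look identical to the model.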
## Why Transformers Dominate
| Old (RNN) | Transformer |
|---|---|
| Sequential, slow to train | Parallel, fast |
| Forgets long-range context | Sees everything at once |
| Hard to scale | Scales predictably with data + compute |
The scaling point is key. Transformers follow scaling laws — more data + bigger model = smarter model, almost predictably. This is what unlocked GPT, Claude, and modern AI at large.
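The "almost predictably" part can be sketched as a power law in parameter count, in the spirit of the scaling-law papers (Kaplan et al. 2020; Hoffmann et al. 2022). The coefficients below are roughly the parameter-scaling fit reported by Kaplan et al., but treat them as illustrative — real values depend on the dataset, architecture, and fitting procedure.

```python
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Power-law scaling sketch: loss falls smoothly as parameter count grows.
    n_c and alpha are illustrative constants in the spirit of Kaplan et al. 2020."""
    return (n_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

The practical consequence: because the curve is smooth, labs could train small models, fit the curve, and forecast how much better a 10x or 100x larger model would be before spending the compute.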
## 🧩 Putting It Together
```mermaid
flowchart LR
    T["⚙️ Transformer\n(the engine)"]:::engine --> FM["🌐 Foundation Model\n(the vehicle)"]:::fm
    FM --> LLM["📝 LLM\n(language-optimized vehicle)"]:::llm
    classDef engine fill:#D97B4A,stroke:#9e5430,color:#fff
    classDef fm fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef llm fill:#5BA85A,stroke:#3a6e39,color:#fff
```
Transformers are the engine. Foundation Models are the vehicle built with that engine. LLMs are a specific type of vehicle (the language-optimized one).
Transformers enabled the scale that made Foundation Models practical — and LLMs are just the language-flavored version of that breakthrough.
Part of my AI fundamentals series. Next: how attention scales into multi-head attention and why that matters for reasoning.