
LLMs, Foundation Models, and Transformers: How They Relate

A clear breakdown of three core AI terms — Foundation Models, LLMs, and Transformers — and how they nest together.


Three terms you’ll constantly encounter in AI — and they’re closely related but distinct.


🔵 The Hierarchy: Nested Circles

```mermaid
flowchart TD
    FM["🌐 Foundation Models\n(pre-trained on massive diverse data)"]:::fm
    LLM["📝 Large Language Models\n(text-specialized Foundation Models)"]:::llm
    EX["e.g. GPT-4, Claude, Llama, Gemini"]:::ex

    FM --> LLM --> EX

    classDef fm fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef llm fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef ex fill:#9B6EBD,stroke:#6b4785,color:#fff
```

Foundation Models are the broadest category. They’re pre-trained on massive, diverse datasets using self-supervised learning, designed to be adapted to many downstream tasks. Crucially, they can be multimodal — handling images, audio, video, text, or code. Examples: CLIP (vision + text), Whisper (audio), DALL-E (image generation).

LLMs (Large Language Models) are a subset of Foundation Models — specifically text-focused. “Large” means billions of parameters trained on huge text corpora. Examples: GPT-4, Claude, Llama, Gemini.

All LLMs are Foundation Models. Not all Foundation Models are LLMs.

| Term | What it is |
| --- | --- |
| Foundation Model | Pre-trained, general-purpose model |
| LLM | Foundation Model specialized for language |

👁️ What is “Attention”?

Before Transformers (2017), models read text word by word, left to right (RNNs). By the time they reached word 50, they’d mostly “forgotten” word 1.

Attention solves this: when processing any word, look at all other words simultaneously and decide which ones matter most.

Example:

“The animal didn’t cross the street because it was too tired.”

What does “it” refer to? → “animal”, not “street”. Attention teaches the model to attend strongly to “animal” when processing “it”, and weakly to “street”.

Mechanically:

  • Each word gets a relevance score against every other word
  • Scores become weights (via softmax)
  • Weighted average = a context-aware representation of that word

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
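The three steps above map directly onto the formula. A minimal NumPy sketch (toy shapes and random inputs, purely illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # relevance of every word to every other word
    weights = softmax(scores, axis=-1)  # scores become weights; each row sums to 1
    return weights @ V, weights         # weighted average = context-aware representation

# 3 "words", embedding dimension 4 (toy values)
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))

out, w = attention(Q, K, V)
print(out.shape)       # (3, 4): one context-aware vector per word
print(w.sum(axis=1))   # each row of attention weights sums to 1
```

In a real Transformer, Q, K, and V are learned linear projections of the same input embeddings, not independent random matrices.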

Attention = “which words should I focus on right now?”


⚙️ What is a Transformer?

A Transformer is an architecture built entirely on attention — no RNNs, no convolutions.

Key components:

  • Multi-head attention — run attention multiple times in parallel; each “head” learns different relationships (syntax, coreference, semantics…)
  • Feed-forward layers — process each position independently after attention
  • Positional encoding — since attention sees all words at once (no inherent order), position info is injected manually
  • Stacked layers — early layers capture syntax, later layers capture semantics and meaning
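The positional-encoding idea is concrete enough to sketch. The original Transformer paper uses fixed sinusoids; here is a minimal NumPy version of that scheme (dimensions chosen for illustration):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, as in "Attention Is All You Need"."""
    pos = np.arange(seq_len)[:, None]   # positions 0..seq_len-1, as a column
    i = np.arange(d_model)[None, :]     # embedding dimension indices, as a row
    # Each dimension pair gets its own wavelength, from 2*pi up to 10000*2*pi
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16): one position vector per token, added to its embedding
```

These vectors are simply added to the token embeddings before the first attention layer, giving the otherwise order-blind attention mechanism a sense of position. Many modern models swap in learned or rotary encodings, but the principle is the same.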

Why Transformers Dominate

| Old (RNN) | Transformer |
| --- | --- |
| Sequential, slow to train | Parallel, fast |
| Forgets long-range context | Sees everything at once |
| Hard to scale | Scales predictably with data + compute |

The scaling point is key. Transformers follow scaling laws — more data + bigger model = smarter model, almost predictably. This is what unlocked GPT, Claude, and modern AI at large.
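One well-known fitted form is the Chinchilla loss curve (Hoffmann et al., 2022), which models loss as a function of parameter count N and training tokens D. The constants below are the fit reported in that paper, used here only to illustrate the shape of the curve:

```python
def predicted_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style scaling law: irreducible loss E plus terms that
    shrink as the model (N) and dataset (D) grow. Constants from
    Hoffmann et al. (2022); treat them as illustrative, not gospel."""
    return E + A / N**alpha + B / D**beta

small = predicted_loss(N=1e9, D=20e9)      # ~1B params, ~20B tokens
large = predicted_loss(N=70e9, D=1.4e12)   # ~70B params, ~1.4T tokens
print(small, large)  # larger model + more data -> lower predicted loss
```

The key property is that loss falls smoothly and predictably as N and D grow, which is what made it rational to bet enormous compute budgets on ever-larger Transformers.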


🧩 Putting It Together

```mermaid
flowchart LR
    T["⚙️ Transformer\n(the engine)"]:::engine --> FM["🌐 Foundation Model\n(the vehicle)"]:::fm
    FM --> LLM["📝 LLM\n(language-optimized vehicle)"]:::llm

    classDef engine fill:#D97B4A,stroke:#9e5430,color:#fff
    classDef fm fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef llm fill:#5BA85A,stroke:#3a6e39,color:#fff
```

Transformers are the engine. Foundation Models are the vehicle built with that engine. LLMs are a specific type of vehicle (the language-optimized one).

Transformers enabled the scale that made Foundation Models practical — and LLMs are just the language-flavored version of that breakthrough.


Part of my AI fundamentals series. Next: how attention scales into multi-head attention and why that matters for reasoning.

This post is licensed under CC BY 4.0 by the author.