The AI Engineer's CS Foundations Roadmap: 408 Syllabus, Top Courses & a 21-Post Blog Series
A first-principles CS study plan for AI engineers — checklist, curated course picks, and a 21-post blog series outline.
If you’re on the path from ML practitioner to AI engineer, you’ve probably hit the wall — the one where library abstractions stop hiding the system beneath. This is my systematic attempt to tear that wall down, using the Chinese CS 408 syllabus as a spine for first-principles learning.
🗺️ Why CS Foundations Still Matter for AI Engineers
Knowing PyTorch is table stakes. Understanding why a DataLoader uses multiprocessing, how IEEE 754 causes gradient underflow, and what happens to TCP throughput during all-reduce in distributed training — that’s what separates engineers from practitioners.
This roadmap covers four pillars: Data Structures · Computer Organization · Operating Systems · Computer Networks — filtered for depth and AI relevance, not exam tricks.
✅ The Study Checklist
Exam-specific mnemonics stripped. Emphasis on system logic, implementation intuition, and AI engineering connections.
📦 Module 1 — Data Structures
- Linear Structures
- Array vs. linked list: memory layout and cache locality implications
- Singly / doubly / circular linked list — pointer manipulation, sentinel nodes
- Dynamic arrays (amortized O(1) append) — how Python `list` and `std::vector` work under the hood
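The doubling strategy behind amortized O(1) append can be sketched in a few lines. This is a toy model: CPython's `list` actually over-allocates by a smaller growth factor (roughly 1.125x), but the amortized argument is identical.

```python
class DynamicArray:
    """Toy dynamic array: doubles capacity when full.
    Simplified — CPython's list uses a smaller over-allocation factor."""

    def __init__(self):
        self._capacity = 1
        self._size = 0
        self._data = [None] * self._capacity

    def append(self, value):
        if self._size == self._capacity:          # full: copy into a 2x buffer
            new_data = [None] * (self._capacity * 2)
            new_data[:self._size] = self._data
            self._data = new_data
            self._capacity *= 2
        self._data[self._size] = value
        self._size += 1

    def __getitem__(self, i):
        if not 0 <= i < self._size:
            raise IndexError(i)
        return self._data[i]

    def __len__(self):
        return self._size
```

Over n appends, the resizes copy O(n) elements in total across only O(log n) doublings, so each append costs O(1) amortized.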
- Stack & Queue
- Stack: LIFO, call stack, recursion unrolling to iteration
- Queue: FIFO, circular buffer, deque
- Applications: expression evaluation (infix → postfix), BFS frontier management
- Priority Queue: binary heap, O(log n) insert/extract-min
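Python's stdlib exposes the binary heap directly, so a priority queue with O(log n) insert and extract-min is three calls; the task names below are invented for illustration.

```python
import heapq

# Min-heap as a priority queue: (priority, payload) tuples,
# O(log n) heappush and heappop.
tasks = []
heapq.heappush(tasks, (3, "retrain model"))
heapq.heappush(tasks, (1, "fix data pipeline"))
heapq.heappush(tasks, (2, "eval run"))

# Popping always yields the smallest priority first.
order = [heapq.heappop(tasks)[1] for _ in range(len(tasks))]
print(order)  # ['fix data pipeline', 'eval run', 'retrain model']
```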
- Trees & Binary Trees
- Binary tree traversals: pre/in/post-order (recursive + iterative), level-order (BFS)
- Threaded binary trees (space-efficient traversal without a stack)
- BST: insert, delete, search — average O(log n), worst O(n) on sorted input
- AVL Tree: balance factor, LL/LR/RL/RR rotations
- Red-Black Tree: five invariants, insertion fixup — understand why `std::map` uses this
- Huffman Tree: greedy construction, prefix-free codes, entropy connection
- B-Tree / B+ Tree: block-aware structure, why databases always index this way
- Graphs
- Representations: adjacency matrix vs. adjacency list (space/time tradeoffs)
- DFS: recursive, iterative (explicit stack), finish-time ordering
- BFS: shortest path in unweighted graphs, level-order traversal
- Minimum Spanning Tree: Prim (priority queue + greedy), Kruskal (union-find)
- Shortest Path: Dijkstra (non-negative weights, O((V+E) log V)), Floyd-Warshall (all-pairs)
- Topological Sort: Kahn’s algorithm (BFS) and DFS-based (reverse finish time)
- Critical Path (AOE networks): forward/backward pass, float/slack
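Kahn's algorithm from the topological-sort bullet above fits in a short function; the package names in the example DAG are hypothetical.

```python
from collections import deque

def topo_sort(graph):
    """Kahn's algorithm: repeatedly remove zero in-degree nodes (BFS flavour)."""
    indeg = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indeg[v] = indeg.get(v, 0) + 1
    queue = deque(u for u in indeg if indeg[u] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in graph.get(u, ()):
            indeg[v] -= 1                 # "remove" the edge u -> v
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(indeg):
        raise ValueError("graph has a cycle")
    return order

# Dependency-style DAG: edges point from prerequisite to dependent.
deps = {"numpy": ["torch"], "typing": ["torch"], "torch": []}
print(topo_sort(deps))
```

This is exactly the dependency-resolution loop behind `make` and package installers: a build order exists iff the graph is acyclic.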
- Search
- Sequential, binary, block (index) search
- B-Tree / B+ Tree (revisit from database storage perspective)
- Hash Table: chaining vs. open addressing, load factor α, rehashing
- Collision resolution: linear probing, quadratic probing, double hashing
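A minimal chaining hash table that rehashes once the load factor α exceeds 0.75 — a sketch of the idea, not how CPython's `dict` works (CPython uses open addressing):

```python
class ChainedHashTable:
    """Separate chaining: each slot holds a list of (key, value) pairs.
    Doubles the slot count when load factor alpha exceeds 0.75."""

    def __init__(self, slots=8):
        self._buckets = [[] for _ in range(slots)]
        self._count = 0

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)      # overwrite existing key
                return
        bucket.append((key, value))
        self._count += 1
        if self._count / len(self._buckets) > 0.75:   # alpha check
            self._rehash()

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def _rehash(self):
        old = self._buckets
        self._buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for k, v in bucket:               # reinsert under the new modulus
                self._bucket(k).append((k, v))
```

Keeping α bounded keeps the expected chain length constant, which is what makes average-case lookup O(1).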
- Sorting
- Comparison sorts: Insertion, Shell, Bubble, QuickSort, MergeSort, HeapSort
- Non-comparison sorts: Counting, Radix, Bucket
- Full complexity table: best / avg / worst time + space (O(1) / O(log n) / O(n))
- Stability analysis: which algorithms preserve relative order of equal elements?
- QuickSort partition schemes: Lomuto vs. Hoare, pivot selection strategies
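The Lomuto scheme from the last bullet, sketched as a plain recursive quicksort. Last-element pivot keeps the code short but degrades to O(n²) on sorted input; Hoare's scheme and randomized pivots mitigate that.

```python
def quicksort(a, lo=0, hi=None):
    """In-place quicksort, Lomuto partition, last element as pivot."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return a
    pivot = a[hi]
    i = lo - 1                          # boundary of the "< pivot" region
    for j in range(lo, hi):
        if a[j] < pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[hi] = a[hi], a[i + 1]   # pivot lands in its final slot
    quicksort(a, lo, i)
    quicksort(a, i + 2, hi)
    return a
```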
🖥️ Module 2 — Computer Organization
- Architecture Fundamentals
- Von Neumann model: CPU, memory, I/O, bus — fetch-decode-execute cycle
- Performance metrics: CPI, MIPS, throughput vs. latency, Amdahl’s Law
- Data Representation & Arithmetic
- Fixed-point: sign-magnitude, one’s complement, two’s complement, excess-N (biased)
- IEEE 754 float32 / float16 / bfloat16: sign · exponent · mantissa; NaN, Inf, denormals
- Practical ML impact: integer overflow, catastrophic cancellation, gradient underflow
- ALU design: ripple-carry adder, carry-lookahead adder (conceptual)
- Booth’s algorithm for multiplication (conceptual understanding)
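The sign · exponent · mantissa decomposition above is easy to inspect with nothing but the stdlib:

```python
import struct

def float32_fields(x):
    """Decompose a float32 into (sign, biased exponent, mantissa) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8 bits, bias 127
    mantissa = bits & 0x7FFFFF          # 23 bits, implicit leading 1
    return sign, exponent, mantissa

print(float32_fields(1.0))        # (0, 127, 0): (-1)^0 * 1.0 * 2^(127-127)
print(float32_fields(2.0**-149))  # (0, 0, 1): smallest positive denormal
```

Exponent field 0 with a nonzero mantissa is the denormal range — exactly where small gradients silently lose precision in float16 training.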
- Memory Hierarchy
- SRAM vs. DRAM: speed, cost, volatility, use cases
- Cache architecture: direct-mapped, set-associative, fully associative
- Cache mapping: tag / index / offset bit decomposition
- Replacement policies: LRU, LFU, FIFO, pseudo-LRU (Clock)
- Write policies: write-through vs. write-back + write-allocate
- Virtual memory: page tables, multi-level paging, page fault handling
- TLB: address translation fast path, TLB shootdown in multicore systems
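The tag / index / offset decomposition can be sketched for a hypothetical direct-mapped cache (64-byte lines, 64 sets, so 4 KiB total); the addresses are arbitrary examples.

```python
def cache_decompose(addr, line_bytes=64, num_sets=64):
    """Split a byte address into (tag, set index, block offset)
    for a direct-mapped cache."""
    offset_bits = line_bytes.bit_length() - 1   # log2(64) = 6
    index_bits = num_sets.bit_length() - 1      # log2(64) = 6
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Two addresses exactly one cache-size (4 KiB) apart map to the SAME set
# with DIFFERENT tags — they evict each other in a direct-mapped cache.
print(cache_decompose(0x1234))          # (1, 8, 52)
print(cache_decompose(0x1234 + 4096))   # (2, 8, 52)
```

This conflict-miss pattern is why power-of-two strides (e.g. naive matrix transposition) can be pathologically slow.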
- Instruction Set Architecture
- Instruction formats: fixed-length (RISC) vs. variable-length (CISC)
- Addressing modes: immediate, direct, indirect, register, base+offset, PC-relative
- CISC vs. RISC: x86 vs. ARM/RISC-V design philosophy and tradeoffs
- CPU Internals
- Single-cycle vs. multi-cycle datapath
- Hardwired vs. microprogrammed control unit
- Classic 5-stage pipeline: IF → ID → EX → MEM → WB
- Pipeline hazards: structural, data (RAW/WAW/WAR), control — forwarding, stalling, branch prediction
- Superscalar and out-of-order execution (conceptual overview)
- I/O Systems
- I/O interface: data register, status register, control register
- Polling vs. interrupt-driven vs. DMA
- Interrupt vector table, context save/restore overhead
- DMA: bus arbitration, burst mode vs. cycle-steal mode
⚙️ Module 3 — Operating Systems
- OS Fundamentals
- Kernel mode vs. user mode, hardware privilege rings
- System calls: trap mechanism, syscall table, overhead
- OS structures: monolithic kernel, microkernel, exokernel, hypervisor
- Process & Thread Management
- Process states: new → ready → running → blocked → terminated
- PCB fields: PID, register set, page table pointer, file descriptor table
- Threads: kernel-level (1:1) vs. user-level (N:1) vs. hybrid (M:N)
- Context switch cost: what gets saved, TLB flush, cache cold-start
- CPU scheduling: FCFS, SJF (preemptive = SRTF), Round Robin, Priority, MLFQ
- Real-time scheduling: EDF, RMS (rate-monotonic)
- Synchronization
- Race conditions, critical sections, mutual exclusion requirements
- Semaphores (counting + binary), mutex, spinlock
- Classic problems: Producer-Consumer (bounded buffer), Readers-Writers, Dining Philosophers
- Monitor: condition variables, wait / signal / broadcast semantics
- Lock-free structures: CAS (compare-and-swap), ABA problem, memory ordering
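The bounded-buffer Producer-Consumer problem above has the classic textbook solution: two counting semaphores tracking free and filled slots, plus a mutex for the buffer itself.

```python
import threading
from collections import deque

BUF_SIZE = 4
buffer = deque()
empty = threading.Semaphore(BUF_SIZE)   # counts free slots
full = threading.Semaphore(0)           # counts filled slots
lock = threading.Lock()                 # mutual exclusion on the buffer
results = []

def producer(n):
    for i in range(n):
        empty.acquire()                 # block if the buffer is full
        with lock:
            buffer.append(i)
        full.release()                  # signal one filled slot

def consumer(n):
    for _ in range(n):
        full.acquire()                  # block if the buffer is empty
        with lock:
            item = buffer.popleft()
        empty.release()                 # free the slot
        results.append(item)

p = threading.Thread(target=producer, args=(100,))
c = threading.Thread(target=consumer, args=(100,))
p.start(); c.start(); p.join(); c.join()
print(results == list(range(100)))  # True: FIFO order is preserved
```

Note the acquire order: semaphore first, then lock. Reversing it (holding the lock while waiting on a semaphore) is a textbook route to deadlock.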
- Deadlock
- Four necessary conditions: mutual exclusion, hold-and-wait, no preemption, circular wait
- Prevention: statically break one condition
- Avoidance: Banker’s Algorithm — safe state definition, resource allocation graph
- Detection + recovery: resource-allocation graph reduction, process termination / preemption
- Memory Management
- Contiguous allocation: fixed/variable partition, fragmentation types
- Paging: page table structure, page fault sequence, demand paging, copy-on-write
- Page replacement: OPT (Belady optimal), FIFO (Belady’s anomaly), LRU, Clock/Second-Chance
- Thrashing: working set model, page fault frequency control
- Segmentation and segmented paging
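The Clock (second-chance) policy from the page-replacement bullet, as a small fault-counting simulation:

```python
def clock_replace(frames, ref_string):
    """Clock / second-chance page replacement: a circular hand that clears
    reference bits and evicts the first page whose bit is already 0.
    Returns the number of page faults."""
    pages = [None] * frames
    ref_bit = [0] * frames
    hand = 0
    faults = 0
    for page in ref_string:
        if page in pages:               # hit: set the reference bit
            ref_bit[pages.index(page)] = 1
            continue
        faults += 1
        while ref_bit[hand]:            # second chance: clear bit, advance
            ref_bit[hand] = 0
            hand = (hand + 1) % frames
        pages[hand] = page              # evict victim, load new page
        ref_bit[hand] = 1
        hand = (hand + 1) % frames
    return faults

print(clock_replace(3, [1, 2, 3, 2, 4, 1]))  # 5 faults
```

Clock approximates LRU with one bit per frame instead of full timestamps, which is why real kernels prefer it.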
- File Systems
- Logical structure: sequential, indexed, hashed files
- Physical allocation: contiguous, linked (FAT), indexed (inode with direct/indirect blocks)
- Directory: hierarchical, hard links vs. symbolic links, DAG structure
- File sharing and protection: access control lists, capability lists
- Disk scheduling: FCFS, SSTF, SCAN (elevator), C-SCAN
- I/O Management
- Device driver layers, I/O subsystem software stack
- Spooling (printer queue as canonical example)
- Buffer types: single, double, circular buffer management
- Block device vs. character device interfaces
🌐 Module 4 — Computer Networks
- Architecture
- OSI 7-layer vs. TCP/IP 4-layer: purpose and key protocols at each layer
- Encapsulation / decapsulation, PDU naming (frame / packet / segment / message)
- Network performance: bandwidth, throughput, propagation delay, RTT, jitter
- Physical Layer
- Nyquist theorem (noise-free channel capacity) vs. Shannon theorem (noisy channel)
- Encoding schemes: NRZ, Manchester, 4B/5B
- Modulation: AM/FM/PM, QAM for high-bandwidth channels
- Transmission media: copper, optical fiber, wireless (characteristics + tradeoffs)
- Data Link Layer
- Framing: byte stuffing, bit stuffing
- Error detection: parity bits, CRC — polynomial division intuition
- Error correction: Hamming code distance
- Flow control + ARQ protocols: Stop-and-Wait, Go-Back-N (GBN), Selective Repeat (SR)
- Efficiency analysis: utilization formula with window size and propagation delay
- MAC protocols: CSMA/CD (Ethernet), CSMA/CA (WiFi 802.11), TDMA, FDMA
- Ethernet frame format, MAC addressing, switch vs. bridge (MAC learning table)
- Network Layer
- IPv4 header fields: TTL, protocol, checksum, fragmentation (DF/MF flags)
- Subnetting: CIDR notation, VLSM — worked prefix calculations
- NAT: static / dynamic / PAT (overload), connection tracking table
- ARP: request/reply, cache, gratuitous ARP, proxy ARP
- DHCP: DORA 4-way handshake
- ICMP: ping (echo request/reply), traceroute (TTL expiry) mechanism
- Routing algorithms: distance vector (RIP, Bellman-Ford, count-to-infinity problem) vs. link state (OSPF, Dijkstra)
- BGP: path-vector protocol, AS, policy-based routing (policy trumps shortest path)
- IPv6: 128-bit addressing, dual-stack transition, tunneling
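Worked CIDR / VLSM calculations are one stdlib import away; the addresses below are arbitrary examples.

```python
import ipaddress

net = ipaddress.ip_network("10.0.0.0/24")

# VLSM: carve the /24 into four /26 subnets of 64 addresses each.
subnets = list(net.subnets(new_prefix=26))
for s in subnets:
    print(s, "usable hosts:", s.num_addresses - 2)  # minus network + broadcast

# Membership test: which subnet does a host fall into?
print(ipaddress.ip_address("10.0.0.130") in subnets[2])  # True: 10.0.0.128/26
```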
- Transport Layer
- UDP: connectionless, checksum-only, use cases (DNS, QUIC, video streaming)
- TCP segment header: sequence number, acknowledgement, flags, window size
- 3-way handshake (SYN → SYN-ACK → ACK) + 4-way teardown, full state machine
- TCP reliability: cumulative ACK, retransmission timer, SACK option
- Flow control: sliding window protocol, receive buffer, zero-window probe
- Congestion control: slow start, congestion avoidance (AIMD), fast retransmit, fast recovery (Reno / CUBIC)
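A toy Reno-style congestion-window trace makes slow start vs. AIMD concrete. This is heavily simplified: round-granularity, one scripted loss event, no timeouts or fast-retransmit detail.

```python
def aimd(rounds, loss_rounds, ssthresh=16):
    """Toy TCP congestion window, one value per RTT:
    slow start doubles cwnd until ssthresh, then additive increase (+1);
    on loss, halve (multiplicative decrease, simplified Reno recovery)."""
    cwnd = 1
    trace = []
    for r in range(rounds):
        trace.append(cwnd)
        if r in loss_rounds:
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh              # cut the window in half
        elif cwnd < ssthresh:
            cwnd *= 2                    # slow start: exponential growth
        else:
            cwnd += 1                    # congestion avoidance: AIMD
    return trace

print(aimd(10, {6}))  # [1, 2, 4, 8, 16, 17, 18, 9, 10, 11]
```

The sawtooth shape — exponential ramp, linear climb, sharp halving — is the signature you see in any TCP throughput plot.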
- Application Layer
- DNS: hierarchy (root → TLD → authoritative), recursive vs. iterative resolution, TTL caching
- HTTP/1.1 vs. HTTP/2 vs. HTTP/3 (QUIC): persistent connections, multiplexing, HOL blocking
- HTTPS: TLS handshake, certificate chain, HSTS
- FTP: control vs. data channel, active vs. passive mode
- Email: SMTP (push/send), POP3 / IMAP (pull/sync), MIME encoding
📚 Top Free Course Recommendations
| Module | Course | Platform | Why Recommended |
|---|---|---|---|
| Data Structures | CS 61B — Data Structures (Josh Hug, UC Berkeley) | YouTube + sp21.datastructur.es | Exceptional pedagogy, Java-based, real projects with auto-graded labs. Covers DS + algorithms at depth — not just theory. |
| Computer Organization | CS:APP / 15-213 — Intro to Computer Systems (CMU) | cs.cmu.edu/~213 + YouTube | The definitive systems course. Caches, linking, concurrency, networking. Legendary labs (bomb, attack, malloc). Covers Module 2 + feeds into 3 & 4. |
| Operating Systems | MIT 6.S081 — OS Engineering (Frans Kaashoek) | pdos.csail.mit.edu/6.828 + YouTube | xv6 labs force you to implement real OS components in C. Best hands-on OS course publicly available. |
| Computer Networks | CS 144 — Intro to Computer Networking (Stanford, Nick McKeown) | YouTube (CS144 playlist) | Builds a TCP/IP stack from scratch in C++. Bridges theory and real protocol implementation. The best networks course online. |
| Systems (Unified) | CS:APP Textbook — Bryant & O’Hallaron | Free PDF widely available | Covers Organization + OS + Networks in one coherent narrative. Best single resource for Modules 2–4. |
| Algorithms | MIT 6.006 — Introduction to Algorithms | MIT OCW (Spring 2020) | Rigorous theory: proof-based analysis, dynamic programming, graph algorithms. Complements CS 61B for depth. |
Bonus for AI context: Pair any systems course with MIT 6.172 Performance Engineering to see how these fundamentals directly apply to optimising ML training loops and inference pipelines.
📝 The Blog Series: 21 Posts
Each post is a standalone deep-dive, but reads coherently as a series. AI engineering angle woven throughout.
🧱 Series 0: The Foundation
Post 0 (this post) — “The AI Engineer’s CS Foundations Roadmap: 408 Syllabus, Top Courses & 21 Posts to Come”
- Why CS fundamentals matter in the LLM era
- 408 syllabus as a learning spine — filtered for engineering depth
- Task checklist · course map · series outline
📦 Series 1: Data Structures (Posts 1–6)
Post 1 — “Memory Layout is Everything: Arrays vs. Linked Lists Through a Cache-Conscious Lens”
- Physical memory layout: stack vs. heap, pointer indirection cost
- Cache line = 64 bytes; spatial vs. temporal locality
- Why PyTorch tensors must be contiguous — what `.contiguous()` actually does
- Dynamic array amortized analysis: doubling strategy
Post 2 — “Stacks, Queues & the Hidden State Machine in Your Python Interpreter”
- Call stack anatomy: frames, local variables, return addresses
- Recursion vs. iteration: stack depth limits and tail-call optimisation
- Queue applications: BFS, async task queues (Celery, asyncio event loop)
- Shunting-yard algorithm: expression evaluation walkthrough
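The shunting-yard algorithm in the last bullet boils down to a single operator stack; a sketch handling left-associative binary operators and parentheses:

```python
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_postfix(tokens):
    """Dijkstra's shunting-yard: infix tokens -> postfix (RPN)."""
    out, ops = [], []
    for t in tokens:
        if t in PREC:
            # Pop operators of >= precedence (left associativity).
            while ops and ops[-1] != "(" and PREC[ops[-1]] >= PREC[t]:
                out.append(ops.pop())
            ops.append(t)
        elif t == "(":
            ops.append(t)
        elif t == ")":
            while ops[-1] != "(":
                out.append(ops.pop())
            ops.pop()                    # discard the "("
        else:
            out.append(t)                # operand goes straight to output
    while ops:
        out.append(ops.pop())
    return out

print(to_postfix("3 + 4 * ( 2 - 1 )".split()))
# ['3', '4', '2', '1', '-', '*', '+']
```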
Post 3 — “BST to Red-Black Tree: Why std::map Beats a Naive Dictionary in Worst Cases”
- BST pathological case: sorted insertion → O(n) search
- AVL rotations: LL/LR/RL/RR — when and why each fires
- Red-Black invariants — conceptual, no rote memorisation
- B+ Trees: block-alignment, why every database index is a B+ tree
Post 4 — “Huffman Codes, Entropy, and the Mathematical Backbone of Data Compression in ML”
- Greedy Huffman construction via priority queue
- Shannon entropy: bits as a measure of surprise
- Connection to cross-entropy loss in classification
- Modern codecs: LZ77, Zstandard, and arithmetic coding
Post 5 — “Graphs as the Universal Data Structure: From Knowledge Graphs to GNNs”
- Matrix vs. adjacency list: when each representation wins
- BFS / DFS and what they actually guarantee
- Dijkstra with heap optimisation: full walkthrough
- Topological sort: dependency resolution (make, pip install)
- Connection to GNN message-passing framework (MPNN)
Post 6 — “Sorting Algorithms Demystified: Why QuickSort Dominates and When MergeSort Wins”
- Comparison sort lower bound: Ω(n log n) via decision tree argument
- QuickSort: Lomuto vs. Hoare partition, pivot selection, Introsort
- MergeSort: external sort, parallel/distributed contexts, stability
- HeapSort: priority queue as a sorting primitive
- Radix sort: O(nk) for integer keys — when it beats O(n log n)
- Full stability table + use-case decision matrix
🖥️ Series 2: Computer Organization (Posts 7–11)
Post 7 — “Von Neumann to Transformer: How CPU Architecture Constraints Shape AI Hardware Design”
- Fetch-decode-execute cycle in detail
- Von Neumann bottleneck: memory bandwidth vs. compute throughput
- How GPUs break the bottleneck: SIMD, massive parallelism, high-bandwidth memory
- TPUs: systolic arrays as matrix-multiply engines
- Amdahl’s Law applied to ML workloads: the serial fraction kills scaling
Post 8 — “IEEE 754 Deep Dive: Why Your Neural Network Loses Precision (And How to Fix It)”
- Float32 anatomy: 1 sign bit · 8-bit exponent · 23-bit mantissa
- Special values: NaN, Inf, denormals — when they appear in training
- Float16 dynamic range vs. bfloat16: why Google chose bfloat16 for TPUs
- Gradient underflow: loss scaling in mixed-precision training (AMP)
- Catastrophic cancellation in attention: why FlashAttention uses tiling
Post 9 — “The Memory Hierarchy: Cache Misses, VRAM Limits, and the Real Bottleneck in LLM Inference”
- SRAM vs. DRAM: speed / cost / capacity tradeoffs
- Cache geometry: set-associative mapping, line size, eviction policy
- Writing cache-friendly code: matrix transposition as a worked example
- GPU memory hierarchy: L1 / L2 / VRAM / NVLink bandwidth numbers
- FlashAttention’s core insight: tile to stay in SRAM, avoid VRAM round-trips
Post 10 — “CPU Pipelines & Branch Prediction: The Invisible Performance Tax on Your Python Code”
- 5-stage pipeline: IF → ID → EX → MEM → WB
- Hazards: data forwarding, load-use stalls, branch misprediction penalty (~15 cycles)
- Superscalar + out-of-order execution: how modern CPUs find ILP
- Spectre / Meltdown: branch prediction as a security vulnerability
- Implications for Python: GIL, CPython bytecode interpreter overhead
Post 11 — “Interrupts, DMA, and the Data Path of a Deep Learning Training Loop”
- I/O without DMA: CPU burns cycles polling status registers
- DMA: bus arbitration, burst mode vs. cycle-steal, memory-to-memory transfer
- Interrupt handling: context save, ISR, return overhead
- Modern AI training I/O: NVMe SSDs, PCIe bandwidth, RDMA / InfiniBand
- `DataLoader` internals: why `num_workers` uses `multiprocessing`, not `threading`
⚙️ Series 3: Operating Systems (Posts 12–16)
Post 12 — “Processes, Threads, and Python’s GIL: The Concurrency Model Every AI Engineer Should Know”
- Process vs. thread: separate address space vs. shared memory
- Kernel threads (1:1) vs. user threads (N:1) — scheduling and blocking I/O
- Python GIL: why CPU-bound threads don’t parallelise, and why it exists
- `multiprocessing` vs. `threading` vs. `asyncio`: decision tree
- PyTorch `DataLoader` and `DistributedDataParallel`: process-level concurrency
Post 13 — “CPU Scheduling & Context Switching: The Hidden Tax on Your Model Training Job”
- Scheduling algorithms: FCFS, SJF, SRTF, Round Robin, MLFQ
- Context switch cost: register save, TLB flush, cache cold-start (~microseconds)
- Priority inversion: Mars Pathfinder bug case study
- Linux cgroups, CPU pinning, and NUMA-aware allocation for ML workloads
- GPU scheduling: CUDA streams, software pipeline, compute/copy overlap
Post 14 — “Deadlock, Semaphores & Why Distributed Training Needs Careful Synchronization”
- Race conditions: data corruption from unsynchronised access
- Mutex, counting semaphore, condition variable — semantics and implementation
- Classic problems solved: Producer-Consumer (bounded buffer), Readers-Writers
- Deadlock: four conditions + Banker’s Algorithm walkthrough with example
- Distributed AI training: AllReduce deadlock, NCCL timeout, collective communication pitfalls
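The Banker's Algorithm walkthrough planned above reduces to a safe-state check: can every process finish in *some* order given what is currently available? A sketch on a small hypothetical instance (three processes, three resource types):

```python
def is_safe(available, allocation, need):
    """Banker's safe-state check. Returns (safe?, completion order).
    need[i][j] = max claim of process i on resource j minus its allocation."""
    work = list(available)
    finished = [False] * len(allocation)
    order = []
    progress = True
    while progress:
        progress = False
        for i, (alloc, nd) in enumerate(zip(allocation, need)):
            # Process i can finish if its remaining need fits in work.
            if not finished[i] and all(n <= w for n, w in zip(nd, work)):
                work = [w + a for w, a in zip(work, alloc)]  # release resources
                finished[i] = True
                order.append(i)
                progress = True
    return all(finished), order

available = [3, 3, 2]
allocation = [[0, 1, 0], [2, 0, 0], [3, 0, 2]]
need = [[5, 2, 3], [1, 2, 2], [0, 0, 2]]
print(is_safe(available, allocation, need))  # (True, [1, 2, 0])
```

If no completion order exists, granting the pending request would risk deadlock, so the banker denies it.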
Post 15 — “Virtual Memory & Paging: Understanding CUDA’s Unified Memory from First Principles”
- Virtual address space layout: code · heap · stack · shared libraries
- Two-level page tables, TLB fast path, TLB shootdown in multicore
- Page fault types: minor (zero page) · major (disk I/O) · copy-on-write
- Page replacement: LRU approximation via Clock algorithm
- CUDA unified memory: page migration, access counters, prefetching hints
- `np.memmap`: how memory-mapped files avoid loading full arrays into RAM
Post 16 — “File Systems & I/O: How HDF5, Arrow, Parquet, and DataLoaders Actually Work”
- inode-based indexed allocation: why file deletion is O(1)
- Directory as a special file: path resolution algorithm
- Journaling: crash consistency, write-ahead log, fsync semantics
- Disk scheduling: SSTF vs. SCAN — seek time optimisation
- HDF5: chunked storage, B-tree metadata index
- Apache Arrow: columnar memory layout, zero-copy IPC, memory-mapped files
- PyTorch `DataLoader`: prefetching, `pin_memory`, and why it speeds up GPU transfer
🌐 Series 4: Computer Networks (Posts 17–20)
Post 17 — “OSI vs. TCP/IP: The Abstraction Layers Powering Distributed AI Systems”
- OSI 7-layer (pedagogical) vs. TCP/IP 4-layer (practical)
- Encapsulation walkthrough: HTTP request from browser to wire (every header)
- gRPC, Protobuf, and how ML serving maps to network layers
- Service mesh (Istio/Envoy): why distributed inference needs L7 load balancing
Post 18 — “TCP Under the Hood: Reliability, Flow Control & Why gRPC Uses HTTP/2”
- 3-way handshake full state machine: SYN_SENT → ESTABLISHED → TIME_WAIT
- Sequence numbers, cumulative ACK, retransmission timer (RTO, Karn’s algorithm)
- Sliding window: receiver buffer, zero-window probe, silly window syndrome
- Congestion control: slow start, AIMD, fast retransmit/recovery (Reno) vs. CUBIC
- HTTP/1.1 vs. HTTP/2: multiplexing, HPACK header compression, server push
- QUIC / HTTP/3: 0-RTT, stream multiplexing without HoL blocking
- gRPC: why HTTP/2 + Protobuf beats REST + JSON for ML serving
Post 19 — “The Network Layer: IP, Routing Protocols & How Model Parallelism Crosses Nodes”
- IPv4 header deep dive: TTL, fragmentation (DF/MF), checksum
- Subnetting: CIDR, VLSM — worked prefix decomposition examples
- ARP · DHCP · ICMP: the glue protocols every engineer hits in debugging
- RIP (distance vector, count-to-infinity problem) vs. OSPF (link state, Dijkstra)
- BGP: inter-AS routing, why policy trumps shortest path
- InfiniBand vs. Ethernet for HPC/AI clusters: latency and bandwidth numbers
- RDMA: kernel bypass for low-latency all-reduce in distributed training
Post 20 — “Application Protocols: DNS, HTTP, TLS & the APIs of Modern AI Services”
- DNS: resolution chain, caching, negative TTL, DNSSEC basics
- HTTP methods, status codes, REST vs. RPC semantics
- TLS 1.3 handshake: 1-RTT, certificate chain, forward secrecy
- WebSockets: persistent bidirectional channels for real-time inference
- SSE (Server-Sent Events): how ChatGPT streams tokens to the browser
- OpenAI API design patterns: rate limiting, token streaming, error handling
🏁 Series 5: Capstone
Post 21 — “Full-Stack Systems Thinking: From Bits to Transformers — CS Foundations Retrospective”
- How the four modules connect: DS → Algorithms → OS → Networks → Distributed Systems → ML Systems
- Mental models that compound: cache locality, pipeline hazards, TCP congestion window
- What to learn next: distributed systems (DDIA), ML systems (MLSys), compilers (LLVM/Triton)
- Full resource list and reading recommendations
🗓️ Suggested Study Timeline
```mermaid
gantt
    title CS Foundations — 12-Week Study Timeline
    dateFormat YYYY-MM-DD
    section Data Structures
    Linear + Stack/Queue :ds1, 2026-04-08, 7d
    Trees :ds2, after ds1, 7d
    Graphs + Search + Sort :ds3, after ds2, 7d
    section Computer Org
    Arch + Numbers + ALU :co1, after ds3, 7d
    Memory Hierarchy :co2, after co1, 7d
    CPU Pipeline + I/O :co3, after co2, 7d
    section Operating Systems
    Process / Thread / Sync :os1, after co3, 7d
    Memory + File + I/O :os2, after os1, 7d
    section Networks
    Physical + Data Link :net1, after os2, 7d
    Network + Transport :net2, after net1, 7d
    Application Layer :net3, after net2, 7d
    Capstone Review :cap, after net3, 7d
```
At one module per week, the full checklist completes in ~12 weeks. Given FYP deadlines (final report 30 Apr · demo 4–8 May), a realistic start is post-FYP submission — treat this roadmap as a summer milestone.
🔗 What’s Next
This post is the map. The journey starts with Post 1 → Memory Layout: Arrays vs. Linked Lists Through a Cache-Conscious Lens.
Every post in this series follows the same structure: concept → implementation intuition → AI engineering connection. No rote memorisation. Just systems thinking, built bottom-up.
Part of the CS Foundations for AI Engineers series. Next: Post 1 — “Memory Layout is Everything.”