The AI Engineer's CS Foundations Roadmap: 408 Syllabus, Top Courses & a 21-Post Blog Series
A first-principles CS study plan for AI engineers — checklist, curated course picks, and a 21-post blog series outline.
If you’re on the path from ML practitioner to AI engineer, you’ve probably hit the wall — the one where library abstractions stop hiding the system beneath. This is my systematic attempt to tear that wall down, using the Chinese CS 408 syllabus as a spine for first-principles learning.
🗺️ Why CS Foundations Still Matter for AI Engineers
Knowing PyTorch is table stakes. Understanding why a DataLoader uses multiprocessing, how IEEE 754 causes gradient underflow, and what happens to TCP throughput during all-reduce in distributed training — that’s what separates engineers from practitioners.
This roadmap covers four pillars: Data Structures · Computer Organization · Operating Systems · Computer Networks — filtered for depth and AI relevance, not exam tricks.
✅ The Study Checklist
Exam-specific mnemonics stripped. Emphasis on system logic, implementation intuition, and AI engineering connections.
📦 Module 1 — Data Structures
- Linear Structures
- Array vs. linked list: memory layout and cache locality implications
- Singly / doubly / circular linked list — pointer manipulation, sentinel nodes
- Dynamic arrays (amortized O(1) append) — how Python `list` and `std::vector` work under the hood
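The doubling strategy behind amortized O(1) append can be sketched in a few lines. This is a toy model: CPython's `list` actually over-allocates by a smaller growth factor (roughly 1.125x), but the amortized argument is identical.

```python
class DynamicArray:
    """Toy dynamic array: doubles capacity when full.
    Simplified — CPython's list uses a smaller over-allocation factor."""

    def __init__(self):
        self._capacity = 1
        self._size = 0
        self._data = [None] * self._capacity

    def append(self, value):
        if self._size == self._capacity:          # full: copy into a 2x buffer
            new_data = [None] * (self._capacity * 2)
            new_data[:self._size] = self._data
            self._data = new_data
            self._capacity *= 2
        self._data[self._size] = value
        self._size += 1

    def __getitem__(self, i):
        if not 0 <= i < self._size:
            raise IndexError(i)
        return self._data[i]

    def __len__(self):
        return self._size
```

Over n appends, the resizes copy O(n) elements in total across only O(log n) doublings, so each append costs O(1) amortized.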
- Stack & Queue
- Stack: LIFO, call stack, recursion unrolling to iteration
- Queue: FIFO, circular buffer, deque
- Applications: expression evaluation (infix → postfix), BFS frontier management
- Priority Queue: binary heap, O(log n) insert/extract-min
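Python's stdlib exposes the binary heap directly, so a priority queue with O(log n) insert and extract-min is three calls; the task names below are invented for illustration.

```python
import heapq

# Min-heap as a priority queue: (priority, payload) tuples,
# O(log n) heappush and heappop.
tasks = []
heapq.heappush(tasks, (3, "retrain model"))
heapq.heappush(tasks, (1, "fix data pipeline"))
heapq.heappush(tasks, (2, "eval run"))

# Popping always yields the smallest priority first.
order = [heapq.heappop(tasks)[1] for _ in range(len(tasks))]
print(order)  # ['fix data pipeline', 'eval run', 'retrain model']
```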
- Trees & Binary Trees
- Binary tree traversals: pre/in/post-order (recursive + iterative), level-order (BFS)
- Threaded binary trees (space-efficient traversal without a stack)
- BST: insert, delete, search — average O(log n), worst O(n) on sorted input
- AVL Tree: balance factor, LL/LR/RL/RR rotations
- Red-Black Tree: five invariants, insertion fixup — understand why `std::map` uses this
- Huffman Tree: greedy construction, prefix-free codes, entropy connection
- B-Tree / B+ Tree: block-aware structure, why databases always index this way
- Graphs
- Representations: adjacency matrix vs. adjacency list (space/time tradeoffs)
- DFS: recursive, iterative (explicit stack), finish-time ordering
- BFS: shortest path in unweighted graphs, level-order traversal
- Minimum Spanning Tree: Prim (priority queue + greedy), Kruskal (union-find)
- Shortest Path: Dijkstra (non-negative weights, O((V+E) log V)), Floyd-Warshall (all-pairs)
- Topological Sort: Kahn’s algorithm (BFS) and DFS-based (reverse finish time)
- Critical Path (AOE networks): forward/backward pass, float/slack
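Kahn's algorithm from the topological-sort bullet above fits in a short function; the package names in the example DAG are hypothetical.

```python
from collections import deque

def topo_sort(graph):
    """Kahn's algorithm: repeatedly remove zero in-degree nodes (BFS flavour)."""
    indeg = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indeg[v] = indeg.get(v, 0) + 1
    queue = deque(u for u in indeg if indeg[u] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in graph.get(u, ()):
            indeg[v] -= 1                 # "remove" the edge u -> v
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(indeg):
        raise ValueError("graph has a cycle")
    return order

# Dependency-style DAG: edges point from prerequisite to dependent.
deps = {"numpy": ["torch"], "typing": ["torch"], "torch": []}
print(topo_sort(deps))
```

This is exactly the dependency-resolution loop behind `make` and package installers: a build order exists iff the graph is acyclic.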
- Search
- Sequential, binary, block (index) search
- B-Tree / B+ Tree (revisit from database storage perspective)
- Hash Table: chaining vs. open addressing, load factor α, rehashing
- Collision resolution: linear probing, quadratic probing, double hashing
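A minimal chaining hash table that rehashes once the load factor α exceeds 0.75 — a sketch of the idea, not how CPython's `dict` works (CPython uses open addressing):

```python
class ChainedHashTable:
    """Separate chaining: each slot holds a list of (key, value) pairs.
    Doubles the slot count when load factor alpha exceeds 0.75."""

    def __init__(self, slots=8):
        self._buckets = [[] for _ in range(slots)]
        self._count = 0

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)      # overwrite existing key
                return
        bucket.append((key, value))
        self._count += 1
        if self._count / len(self._buckets) > 0.75:   # alpha check
            self._rehash()

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def _rehash(self):
        old = self._buckets
        self._buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for k, v in bucket:               # reinsert under the new modulus
                self._bucket(k).append((k, v))
```

Keeping α bounded keeps the expected chain length constant, which is what makes average-case lookup O(1).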
- Sorting
- Comparison sorts: Insertion, Shell, Bubble, QuickSort, MergeSort, HeapSort
- Non-comparison sorts: Counting, Radix, Bucket
- Full complexity table: best / avg / worst time + space (O(1) / O(log n) / O(n))
- Stability analysis: which algorithms preserve relative order of equal elements?
- QuickSort partition schemes: Lomuto vs. Hoare, pivot selection strategies
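The Lomuto scheme from the last bullet, sketched as a plain recursive quicksort. Last-element pivot keeps the code short but degrades to O(n²) on sorted input; Hoare's scheme and randomized pivots mitigate that.

```python
def quicksort(a, lo=0, hi=None):
    """In-place quicksort, Lomuto partition, last element as pivot."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return a
    pivot = a[hi]
    i = lo - 1                          # boundary of the "< pivot" region
    for j in range(lo, hi):
        if a[j] < pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[hi] = a[hi], a[i + 1]   # pivot lands in its final slot
    quicksort(a, lo, i)
    quicksort(a, i + 2, hi)
    return a
```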
🖥️ Module 2 — Computer Organization
- Architecture Fundamentals
- Von Neumann model: CPU, memory, I/O, bus — fetch-decode-execute cycle
- Performance metrics: CPI, MIPS, throughput vs. latency, Amdahl’s Law
- Data Representation & Arithmetic
- Fixed-point: sign-magnitude, one’s complement, two’s complement, excess-N (biased)
- IEEE 754 float32 / float16 / bfloat16: sign · exponent · mantissa; NaN, Inf, denormals
- Practical ML impact: integer overflow, catastrophic cancellation, gradient underflow
- ALU design: ripple-carry adder, carry-lookahead adder (conceptual)
- Booth’s algorithm for multiplication (conceptual understanding)
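The sign · exponent · mantissa decomposition above is easy to inspect with nothing but the stdlib:

```python
import struct

def float32_fields(x):
    """Decompose a float32 into (sign, biased exponent, mantissa) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8 bits, bias 127
    mantissa = bits & 0x7FFFFF          # 23 bits, implicit leading 1
    return sign, exponent, mantissa

print(float32_fields(1.0))        # (0, 127, 0): (-1)^0 * 1.0 * 2^(127-127)
print(float32_fields(2.0**-149))  # (0, 0, 1): smallest positive denormal
```

Exponent field 0 with a nonzero mantissa is the denormal range — exactly where small gradients silently lose precision in float16 training.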
- Memory Hierarchy
- SRAM vs. DRAM: speed, cost, volatility, use cases
- Cache architecture: direct-mapped, set-associative, fully associative
- Cache mapping: tag / index / offset bit decomposition
- Replacement policies: LRU, LFU, FIFO, pseudo-LRU (Clock)
- Write policies: write-through vs. write-back + write-allocate
- Virtual memory: page tables, multi-level paging, page fault handling
- TLB: address translation fast path, TLB shootdown in multicore systems
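The tag / index / offset decomposition can be sketched for a hypothetical direct-mapped cache (64-byte lines, 64 sets, so 4 KiB total); the addresses are arbitrary examples.

```python
def cache_decompose(addr, line_bytes=64, num_sets=64):
    """Split a byte address into (tag, set index, block offset)
    for a direct-mapped cache."""
    offset_bits = line_bytes.bit_length() - 1   # log2(64) = 6
    index_bits = num_sets.bit_length() - 1      # log2(64) = 6
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Two addresses exactly one cache-size (4 KiB) apart map to the SAME set
# with DIFFERENT tags — they evict each other in a direct-mapped cache.
print(cache_decompose(0x1234))          # (1, 8, 52)
print(cache_decompose(0x1234 + 4096))   # (2, 8, 52)
```

This conflict-miss pattern is why power-of-two strides (e.g. naive matrix transposition) can be pathologically slow.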
- Instruction Set Architecture
- Instruction formats: fixed-length (RISC) vs. variable-length (CISC)
- Addressing modes: immediate, direct, indirect, register, base+offset, PC-relative
- CISC vs. RISC: x86 vs. ARM/RISC-V design philosophy and tradeoffs
- CPU Internals
- Single-cycle vs. multi-cycle datapath
- Hardwired vs. microprogrammed control unit
- Classic 5-stage pipeline: IF → ID → EX → MEM → WB
- Pipeline hazards: structural, data (RAW/WAW/WAR), control — forwarding, stalling, branch prediction
- Superscalar and out-of-order execution (conceptual overview)
- I/O Systems
- I/O interface: data register, status register, control register
- Polling vs. interrupt-driven vs. DMA
- Interrupt vector table, context save/restore overhead
- DMA: bus arbitration, burst mode vs. cycle-steal mode
⚙️ Module 3 — Operating Systems
- OS Fundamentals
- Kernel mode vs. user mode, hardware privilege rings
- System calls: trap mechanism, syscall table, overhead
- OS structures: monolithic kernel, microkernel, exokernel, hypervisor
- Process & Thread Management
- Process states: new → ready → running → blocked → terminated
- PCB fields: PID, register set, page table pointer, file descriptor table
- Threads: kernel-level (1:1) vs. user-level (N:1) vs. hybrid (M:N)
- Context switch cost: what gets saved, TLB flush, cache cold-start
- CPU scheduling: FCFS, SJF (preemptive = SRTF), Round Robin, Priority, MLFQ
- Real-time scheduling: EDF, RMS (rate-monotonic)
- Synchronization
- Race conditions, critical sections, mutual exclusion requirements
- Semaphores (counting + binary), mutex, spinlock
- Classic problems: Producer-Consumer (bounded buffer), Readers-Writers, Dining Philosophers
- Monitor: condition variables, wait / signal / broadcast semantics
- Lock-free structures: CAS (compare-and-swap), ABA problem, memory ordering
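The bounded-buffer Producer-Consumer problem above has the classic textbook solution: two counting semaphores tracking free and filled slots, plus a mutex for the buffer itself.

```python
import threading
from collections import deque

BUF_SIZE = 4
buffer = deque()
empty = threading.Semaphore(BUF_SIZE)   # counts free slots
full = threading.Semaphore(0)           # counts filled slots
lock = threading.Lock()                 # mutual exclusion on the buffer
results = []

def producer(n):
    for i in range(n):
        empty.acquire()                 # block if the buffer is full
        with lock:
            buffer.append(i)
        full.release()                  # signal one filled slot

def consumer(n):
    for _ in range(n):
        full.acquire()                  # block if the buffer is empty
        with lock:
            item = buffer.popleft()
        empty.release()                 # free the slot
        results.append(item)

p = threading.Thread(target=producer, args=(100,))
c = threading.Thread(target=consumer, args=(100,))
p.start(); c.start(); p.join(); c.join()
print(results == list(range(100)))  # True: FIFO order is preserved
```

Note the acquire order: semaphore first, then lock. Reversing it (holding the lock while waiting on a semaphore) is a textbook route to deadlock.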
- Deadlock
- Four necessary conditions: mutual exclusion, hold-and-wait, no preemption, circular wait
- Prevention: statically break one condition
- Avoidance: Banker’s Algorithm — safe state definition, resource allocation graph
- Detection + recovery: resource-allocation graph reduction, process termination / preemption
- Memory Management
- Contiguous allocation: fixed/variable partition, fragmentation types
- Paging: page table structure, page fault sequence, demand paging, copy-on-write
- Page replacement: OPT (Belady optimal), FIFO (Belady’s anomaly), LRU, Clock/Second-Chance
- Thrashing: working set model, page fault frequency control
- Segmentation and segmented paging
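The Clock (second-chance) policy from the page-replacement bullet, as a small fault-counting simulation:

```python
def clock_replace(frames, ref_string):
    """Clock / second-chance page replacement: a circular hand that clears
    reference bits and evicts the first page whose bit is already 0.
    Returns the number of page faults."""
    pages = [None] * frames
    ref_bit = [0] * frames
    hand = 0
    faults = 0
    for page in ref_string:
        if page in pages:               # hit: set the reference bit
            ref_bit[pages.index(page)] = 1
            continue
        faults += 1
        while ref_bit[hand]:            # second chance: clear bit, advance
            ref_bit[hand] = 0
            hand = (hand + 1) % frames
        pages[hand] = page              # evict victim, load new page
        ref_bit[hand] = 1
        hand = (hand + 1) % frames
    return faults

print(clock_replace(3, [1, 2, 3, 2, 4, 1]))  # 5 faults
```

Clock approximates LRU with one bit per frame instead of full timestamps, which is why real kernels prefer it.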
- File Systems
- Logical structure: sequential, indexed, hashed files
- Physical allocation: contiguous, linked (FAT), indexed (inode with direct/indirect blocks)
- Directory: hierarchical, hard links vs. symbolic links, DAG structure
- File sharing and protection: access control lists, capability lists
- Disk scheduling: FCFS, SSTF, SCAN (elevator), C-SCAN
- I/O Management
- Device driver layers, I/O subsystem software stack
- Spooling (printer queue as canonical example)
- Buffer types: single, double, circular buffer management
- Block device vs. character device interfaces
🌐 Module 4 — Computer Networks
- Architecture
- OSI 7-layer vs. TCP/IP 4-layer: purpose and key protocols at each layer
- Encapsulation / decapsulation, PDU naming (frame / packet / segment / message)
- Network performance: bandwidth, throughput, propagation delay, RTT, jitter
- Physical Layer
- Nyquist theorem (noise-free channel capacity) vs. Shannon theorem (noisy channel)
- Encoding schemes: NRZ, Manchester, 4B/5B
- Modulation: AM/FM/PM, QAM for high-bandwidth channels
- Transmission media: copper, optical fiber, wireless (characteristics + tradeoffs)
- Data Link Layer
- Framing: byte stuffing, bit stuffing
- Error detection: parity bits, CRC — polynomial division intuition
- Error correction: Hamming code distance
- Flow control + ARQ protocols: Stop-and-Wait, Go-Back-N (GBN), Selective Repeat (SR)
- Efficiency analysis: utilization formula with window size and propagation delay
- MAC protocols: CSMA/CD (Ethernet), CSMA/CA (WiFi 802.11), TDMA, FDMA
- Ethernet frame format, MAC addressing, switch vs. bridge (MAC learning table)
- Network Layer
- IPv4 header fields: TTL, protocol, checksum, fragmentation (DF/MF flags)
- Subnetting: CIDR notation, VLSM — worked prefix calculations
- NAT: static / dynamic / PAT (overload), connection tracking table
- ARP: request/reply, cache, gratuitous ARP, proxy ARP
- DHCP: DORA 4-way handshake
- ICMP: ping (echo request/reply), traceroute (TTL expiry) mechanism
- Routing algorithms: distance vector (RIP, Bellman-Ford, count-to-infinity problem) vs. link state (OSPF, Dijkstra)
- BGP: path-vector protocol, AS, policy-based routing (policy trumps shortest path)
- IPv6: 128-bit addressing, dual-stack transition, tunneling
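Worked CIDR / VLSM calculations are one stdlib import away; the addresses below are arbitrary examples.

```python
import ipaddress

net = ipaddress.ip_network("10.0.0.0/24")

# VLSM: carve the /24 into four /26 subnets of 64 addresses each.
subnets = list(net.subnets(new_prefix=26))
for s in subnets:
    print(s, "usable hosts:", s.num_addresses - 2)  # minus network + broadcast

# Membership test: which subnet does a host fall into?
print(ipaddress.ip_address("10.0.0.130") in subnets[2])  # True: 10.0.0.128/26
```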
- Transport Layer
- UDP: connectionless, checksum-only, use cases (DNS, QUIC, video streaming)
- TCP segment header: sequence number, acknowledgement, flags, window size
- 3-way handshake (SYN → SYN-ACK → ACK) + 4-way teardown, full state machine
- TCP reliability: cumulative ACK, retransmission timer, SACK option
- Flow control: sliding window protocol, receive buffer, zero-window probe
- Congestion control: slow start, congestion avoidance (AIMD), fast retransmit, fast recovery (Reno / CUBIC)
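A toy Reno-style congestion-window trace makes slow start vs. AIMD concrete. This is heavily simplified: round-granularity, one scripted loss event, no timeouts or fast-retransmit detail.

```python
def aimd(rounds, loss_rounds, ssthresh=16):
    """Toy TCP congestion window, one value per RTT:
    slow start doubles cwnd until ssthresh, then additive increase (+1);
    on loss, halve (multiplicative decrease, simplified Reno recovery)."""
    cwnd = 1
    trace = []
    for r in range(rounds):
        trace.append(cwnd)
        if r in loss_rounds:
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh              # cut the window in half
        elif cwnd < ssthresh:
            cwnd *= 2                    # slow start: exponential growth
        else:
            cwnd += 1                    # congestion avoidance: AIMD
    return trace

print(aimd(10, {6}))  # [1, 2, 4, 8, 16, 17, 18, 9, 10, 11]
```

The sawtooth shape — exponential ramp, linear climb, sharp halving — is the signature you see in any TCP throughput plot.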
- Application Layer
- DNS: hierarchy (root → TLD → authoritative), recursive vs. iterative resolution, TTL caching
- HTTP/1.1 vs. HTTP/2 vs. HTTP/3 (QUIC): persistent connections, multiplexing, HOL blocking
- HTTPS: TLS handshake, certificate chain, HSTS
- FTP: control vs. data channel, active vs. passive mode
- Email: SMTP (push/send), POP3 / IMAP (pull/sync), MIME encoding
📚 Top Free Course Recommendations
| Module | Course | Platform | Why Recommended |
|---|---|---|---|
| Data Structures | CS 61B — Data Structures (Josh Hug, UC Berkeley) | YouTube + sp21.datastructur.es | Exceptional pedagogy, Java-based, real projects with auto-graded labs. Covers DS + algorithms at depth — not just theory. |
| Computer Organization | CS:APP / 15-213 — Intro to Computer Systems (CMU) | cs.cmu.edu/~213 + YouTube | The definitive systems course. Caches, linking, concurrency, networking. Legendary labs (bomb, attack, malloc). Covers Module 2 + feeds into 3 & 4. |
| Operating Systems | MIT 6.S081 — OS Engineering (Frans Kaashoek) | pdos.csail.mit.edu/6.828 + YouTube | xv6 labs force you to implement real OS components in C. Best hands-on OS course publicly available. |
| Computer Networks | CS 144 — Intro to Computer Networking (Stanford, Nick McKeown) | YouTube (CS144 playlist) | Builds a TCP/IP stack from scratch in C++. Bridges theory and real protocol implementation. The best networks course online. |
| Systems (Unified) | CS:APP Textbook — Bryant & O’Hallaron | Free PDF widely available | Covers Organization + OS + Networks in one coherent narrative. Best single resource for Modules 2–4. |
| Algorithms | MIT 6.006 — Introduction to Algorithms | MIT OCW (Spring 2020) | Rigorous theory: proof-based analysis, dynamic programming, graph algorithms. Complements CS 61B for depth. |
Bonus for AI context: Pair any systems course with MIT 6.172 Performance Engineering to see how these fundamentals directly apply to optimising ML training loops and inference pipelines.
📝 The Blog Series: 21 Posts
Each post is a standalone deep-dive, but reads coherently as a series. AI engineering angle woven throughout.
🧱 Series 0: The Foundation
Post 0 (this post) — “The AI Engineer’s CS Foundations Roadmap: 408 Syllabus, Top Courses & 21 Posts to Come”
- Why CS fundamentals matter in the LLM era
- 408 syllabus as a learning spine — filtered for engineering depth
- Task checklist · course map · series outline
📦 Series 1: Data Structures (Posts 1–6)
Post 1 — “Memory Layout is Everything: Arrays vs. Linked Lists Through a Cache-Conscious Lens”
- Physical memory layout: stack vs. heap, pointer indirection cost
- Cache line = 64 bytes; spatial vs. temporal locality
- Why PyTorch tensors must be contiguous — what `.contiguous()` actually does
- Dynamic array amortized analysis: doubling strategy
Post 2 — “Stacks, Queues & the Hidden State Machine in Your Python Interpreter”
- Call stack anatomy: frames, local variables, return addresses
- Recursion vs. iteration: stack depth limits and tail-call optimisation
- Queue applications: BFS, async task queues (Celery, asyncio event loop)
- Shunting-yard algorithm: expression evaluation walkthrough
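The shunting-yard algorithm in the last bullet boils down to a single operator stack; a sketch handling left-associative binary operators and parentheses:

```python
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_postfix(tokens):
    """Dijkstra's shunting-yard: infix tokens -> postfix (RPN)."""
    out, ops = [], []
    for t in tokens:
        if t in PREC:
            # Pop operators of >= precedence (left associativity).
            while ops and ops[-1] != "(" and PREC[ops[-1]] >= PREC[t]:
                out.append(ops.pop())
            ops.append(t)
        elif t == "(":
            ops.append(t)
        elif t == ")":
            while ops[-1] != "(":
                out.append(ops.pop())
            ops.pop()                    # discard the "("
        else:
            out.append(t)                # operand goes straight to output
    while ops:
        out.append(ops.pop())
    return out

print(to_postfix("3 + 4 * ( 2 - 1 )".split()))
# ['3', '4', '2', '1', '-', '*', '+']
```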
Post 3 — “BST to Red-Black Tree: Why std::map Beats a Naive Dictionary in Worst Cases”
- BST pathological case: sorted insertion → O(n) search
- AVL rotations: LL/LR/RL/RR — when and why each fires
- Red-Black invariants — conceptual, no rote memorisation
- B+ Trees: block-alignment, why every database index is a B+ tree
Post 4 — “Huffman Codes, Entropy, and the Mathematical Backbone of Data Compression in ML”
- Greedy Huffman construction via priority queue
- Shannon entropy: bits as a measure of surprise
- Connection to cross-entropy loss in classification
- Modern codecs: LZ77, Zstandard, and arithmetic coding
Post 5 — “Graphs as the Universal Data Structure: From Knowledge Graphs to GNNs”
- Matrix vs. adjacency list: when each representation wins
- BFS / DFS and what they actually guarantee
- Dijkstra with heap optimisation: full walkthrough
- Topological sort: dependency resolution (make, pip install)
- Connection to GNN message-passing framework (MPNN)
Post 6 — “Sorting Algorithms Demystified: Why QuickSort Dominates and When MergeSort Wins”
- Comparison sort lower bound: Ω(n log n) via decision tree argument
- QuickSort: Lomuto vs. Hoare partition, pivot selection, Introsort
- MergeSort: external sort, parallel/distributed contexts, stability
- HeapSort: priority queue as a sorting primitive
- Radix sort: O(nk) for integer keys — when it beats O(n log n)
- Full stability table + use-case decision matrix
🖥️ Series 2: Computer Organization (Posts 7–11)
Post 7 — “Von Neumann to Transformer: How CPU Architecture Constraints Shape AI Hardware Design”
- Fetch-decode-execute cycle in detail
- Von Neumann bottleneck: memory bandwidth vs. compute throughput
- How GPUs break the bottleneck: SIMD, massive parallelism, high-bandwidth memory
- TPUs: systolic arrays as matrix-multiply engines
- Amdahl’s Law applied to ML workloads: the serial fraction kills scaling
Post 8 — “IEEE 754 Deep Dive: Why Your Neural Network Loses Precision (And How to Fix It)”
- Float32 anatomy: 1 sign bit · 8-bit exponent · 23-bit mantissa
- Special values: NaN, Inf, denormals — when they appear in training
- Float16 dynamic range vs. bfloat16: why Google chose bfloat16 for TPUs
- Gradient underflow: loss scaling in mixed-precision training (AMP)
- Catastrophic cancellation in attention: why FlashAttention uses tiling
Post 9 — “The Memory Hierarchy: Cache Misses, VRAM Limits, and the Real Bottleneck in LLM Inference”
- SRAM vs. DRAM: speed / cost / capacity tradeoffs
- Cache geometry: set-associative mapping, line size, eviction policy
- Writing cache-friendly code: matrix transposition as a worked example
- GPU memory hierarchy: L1 / L2 / VRAM / NVLink bandwidth numbers
- FlashAttention’s core insight: tile to stay in SRAM, avoid VRAM round-trips
Post 10 — “CPU Pipelines & Branch Prediction: The Invisible Performance Tax on Your Python Code”
- 5-stage pipeline: IF → ID → EX → MEM → WB
- Hazards: data forwarding, load-use stalls, branch misprediction penalty (~15 cycles)
- Superscalar + out-of-order execution: how modern CPUs find ILP
- Spectre / Meltdown: branch prediction as a security vulnerability
- Implications for Python: GIL, CPython bytecode interpreter overhead
Post 11 — “Interrupts, DMA, and the Data Path of a Deep Learning Training Loop”
- I/O without DMA: CPU burns cycles polling status registers
- DMA: bus arbitration, burst mode vs. cycle-steal, memory-to-memory transfer
- Interrupt handling: context save, ISR, return overhead
- Modern AI training I/O: NVMe SSDs, PCIe bandwidth, RDMA / InfiniBand
- `DataLoader` internals: why `num_workers` uses `multiprocessing`, not `threading`
⚙️ Series 3: Operating Systems (Posts 12–16)
Post 12 — “Processes, Threads, and Python’s GIL: The Concurrency Model Every AI Engineer Should Know”
- Process vs. thread: separate address space vs. shared memory
- Kernel threads (1:1) vs. user threads (N:1) — scheduling and blocking I/O
- Python GIL: why CPU-bound threads don’t parallelise, and why it exists
- `multiprocessing` vs. `threading` vs. `asyncio`: decision tree
- PyTorch `DataLoader` and `DistributedDataParallel`: process-level concurrency
Post 13 — “CPU Scheduling & Context Switching: The Hidden Tax on Your Model Training Job”
- Scheduling algorithms: FCFS, SJF, SRTF, Round Robin, MLFQ
- Context switch cost: register save, TLB flush, cache cold-start (~microseconds)
- Priority inversion: Mars Pathfinder bug case study
- Linux cgroups, CPU pinning, and NUMA-aware allocation for ML workloads
- GPU scheduling: CUDA streams, software pipeline, compute/copy overlap
Post 14 — “Deadlock, Semaphores & Why Distributed Training Needs Careful Synchronization”
- Race conditions: data corruption from unsynchronised access
- Mutex, counting semaphore, condition variable — semantics and implementation
- Classic problems solved: Producer-Consumer (bounded buffer), Readers-Writers
- Deadlock: four conditions + Banker’s Algorithm walkthrough with example
- Distributed AI training: AllReduce deadlock, NCCL timeout, collective communication pitfalls
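The Banker's Algorithm walkthrough planned above reduces to a safe-state check: can every process finish in *some* order given what is currently available? A sketch on a small hypothetical instance (three processes, three resource types):

```python
def is_safe(available, allocation, need):
    """Banker's safe-state check. Returns (safe?, completion order).
    need[i][j] = max claim of process i on resource j minus its allocation."""
    work = list(available)
    finished = [False] * len(allocation)
    order = []
    progress = True
    while progress:
        progress = False
        for i, (alloc, nd) in enumerate(zip(allocation, need)):
            # Process i can finish if its remaining need fits in work.
            if not finished[i] and all(n <= w for n, w in zip(nd, work)):
                work = [w + a for w, a in zip(work, alloc)]  # release resources
                finished[i] = True
                order.append(i)
                progress = True
    return all(finished), order

available = [3, 3, 2]
allocation = [[0, 1, 0], [2, 0, 0], [3, 0, 2]]
need = [[5, 2, 3], [1, 2, 2], [0, 0, 2]]
print(is_safe(available, allocation, need))  # (True, [1, 2, 0])
```

If no completion order exists, granting the pending request would risk deadlock, so the banker denies it.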
Post 15 — “Virtual Memory & Paging: Understanding CUDA’s Unified Memory from First Principles”
- Virtual address space layout: code · heap · stack · shared libraries
- Two-level page tables, TLB fast path, TLB shootdown in multicore
- Page fault types: minor (zero page) · major (disk I/O) · copy-on-write
- Page replacement: LRU approximation via Clock algorithm
- CUDA unified memory: page migration, access counters, prefetching hints
- `np.memmap`: how memory-mapped files avoid loading full arrays into RAM
Post 16 — “File Systems & I/O: How HDF5, Arrow, Parquet, and DataLoaders Actually Work”
- inode-based indexed allocation: why file deletion is O(1)
- Directory as a special file: path resolution algorithm
- Journaling: crash consistency, write-ahead log, fsync semantics
- Disk scheduling: SSTF vs. SCAN — seek time optimisation
- HDF5: chunked storage, B-tree metadata index
- Apache Arrow: columnar memory layout, zero-copy IPC, memory-mapped files
- PyTorch `DataLoader`: prefetching, `pin_memory`, and why it speeds up GPU transfer
🌐 Series 4: Computer Networks (Posts 17–20)
Post 17 — “OSI vs. TCP/IP: The Abstraction Layers Powering Distributed AI Systems”
- OSI 7-layer (pedagogical) vs. TCP/IP 4-layer (practical)
- Encapsulation walkthrough: HTTP request from browser to wire (every header)
- gRPC, Protobuf, and how ML serving maps to network layers
- Service mesh (Istio/Envoy): why distributed inference needs L7 load balancing
Post 18 — “TCP Under the Hood: Reliability, Flow Control & Why gRPC Uses HTTP/2”
- 3-way handshake full state machine: SYN_SENT → ESTABLISHED → TIME_WAIT
- Sequence numbers, cumulative ACK, retransmission timer (RTO, Karn’s algorithm)
- Sliding window: receiver buffer, zero-window probe, silly window syndrome
- Congestion control: slow start, AIMD, fast retransmit/recovery (Reno) vs. CUBIC
- HTTP/1.1 vs. HTTP/2: multiplexing, HPACK header compression, server push
- QUIC / HTTP/3: 0-RTT, stream multiplexing without HoL blocking
- gRPC: why HTTP/2 + Protobuf beats REST + JSON for ML serving
Post 19 — “The Network Layer: IP, Routing Protocols & How Model Parallelism Crosses Nodes”
- IPv4 header deep dive: TTL, fragmentation (DF/MF), checksum
- Subnetting: CIDR, VLSM — worked prefix decomposition examples
- ARP · DHCP · ICMP: the glue protocols every engineer hits in debugging
- RIP (distance vector, count-to-infinity problem) vs. OSPF (link state, Dijkstra)
- BGP: inter-AS routing, why policy trumps shortest path
- InfiniBand vs. Ethernet for HPC/AI clusters: latency and bandwidth numbers
- RDMA: kernel bypass for low-latency all-reduce in distributed training
Post 20 — “Application Protocols: DNS, HTTP, TLS & the APIs of Modern AI Services”
- DNS: resolution chain, caching, negative TTL, DNSSEC basics
- HTTP methods, status codes, REST vs. RPC semantics
- TLS 1.3 handshake: 1-RTT, certificate chain, forward secrecy
- WebSockets: persistent bidirectional channels for real-time inference
- SSE (Server-Sent Events): how ChatGPT streams tokens to the browser
- OpenAI API design patterns: rate limiting, token streaming, error handling
🏁 Series 5: Capstone
Post 21 — “Full-Stack Systems Thinking: From Bits to Transformers — CS Foundations Retrospective”
- How the four modules connect: DS → Algorithms → OS → Networks → Distributed Systems → ML Systems
- Mental models that compound: cache locality, pipeline hazards, TCP congestion window
- What to learn next: distributed systems (DDIA), ML systems (MLSys), compilers (LLVM/Triton)
- Full resource list and reading recommendations
🗓️ Suggested Study Timeline
```mermaid
gantt
    title CS Foundations — 12-Week Study Timeline
    dateFormat YYYY-MM-DD
    section Data Structures
    Linear + Stack/Queue :ds1, 2026-04-08, 7d
    Trees :ds2, after ds1, 7d
    Graphs + Search + Sort :ds3, after ds2, 7d
    section Computer Org
    Arch + Numbers + ALU :co1, after ds3, 7d
    Memory Hierarchy :co2, after co1, 7d
    CPU Pipeline + I/O :co3, after co2, 7d
    section Operating Systems
    Process / Thread / Sync :os1, after co3, 7d
    Memory + File + I/O :os2, after os1, 7d
    section Networks
    Physical + Data Link :net1, after os2, 7d
    Network + Transport :net2, after net1, 7d
    Application Layer :net3, after net2, 7d
    Capstone Review :cap, after net3, 7d
```
At one module per week, the full checklist completes in ~12 weeks. Given FYP deadlines (final report 30 Apr · demo 4–8 May), a realistic start is post-FYP submission — treat this roadmap as a summer milestone.
🔗 What’s Next
This post is the map. The journey starts with Post 1 → Memory Layout: Arrays vs. Linked Lists Through a Cache-Conscious Lens.
Every post in this series follows the same structure: concept → implementation intuition → AI engineering connection. No rote memorisation. Just systems thinking, built bottom-up.
Part of the CS Foundations for AI Engineers series. Next: Post 1 — “Memory Layout is Everything.”