The AI Engineer's CS Foundations Roadmap: 408 Syllabus, Top Courses & a 21-Post Blog Series

A first-principles CS study plan for AI engineers — checklist, curated course picks, and a 21-post blog series outline.

If you’re on the path from ML practitioner to AI engineer, you’ve probably hit the wall — the one where library abstractions stop hiding the system beneath. This is my systematic attempt to tear that wall down, using the Chinese CS 408 syllabus as a spine for first-principles learning.


🗺️ Why CS Foundations Still Matter for AI Engineers

Knowing PyTorch is table stakes. Understanding why a DataLoader uses multiprocessing, how IEEE 754 causes gradient underflow, and what happens to TCP throughput during all-reduce in distributed training — that’s what separates engineers from practitioners.

This roadmap covers four pillars: Data Structures · Computer Organization · Operating Systems · Computer Networks — filtered for depth and AI relevance, not exam tricks.


✅ The Study Checklist

Exam-specific mnemonics stripped. Emphasis on system logic, implementation intuition, and AI engineering connections.


📦 Module 1 — Data Structures

  • Linear Structures
    • Array vs. linked list: memory layout and cache locality implications
    • Singly / doubly / circular linked list — pointer manipulation, sentinel nodes
    • Dynamic arrays (amortized O(1) append) — how Python list and std::vector work under the hood
  • Stack & Queue
    • Stack: LIFO, call stack, recursion unrolling to iteration
    • Queue: FIFO, circular buffer, deque
    • Applications: expression evaluation (infix → postfix), BFS frontier management
    • Priority Queue: binary heap, O(log n) insert/extract-min
  • Trees & Binary Trees
    • Binary tree traversals: pre/in/post-order (recursive + iterative), level-order (BFS)
    • Threaded binary trees (space-efficient traversal without a stack)
    • BST: insert, delete, search — average O(log n), worst O(n) on sorted input
    • AVL Tree: balance factor, LL/LR/RL/RR rotations
    • Red-Black Tree: five invariants, insertion fixup — understand why std::map uses this
    • Huffman Tree: greedy construction, prefix-free codes, entropy connection
    • B-Tree / B+ Tree: block-aware structure, why databases always index this way
  • Graphs
    • Representations: adjacency matrix vs. adjacency list (space/time tradeoffs)
    • DFS: recursive, iterative (explicit stack), finish-time ordering
    • BFS: shortest path in unweighted graphs, level-order traversal
    • Minimum Spanning Tree: Prim (priority queue + greedy), Kruskal (union-find)
    • Shortest Path: Dijkstra (non-negative weights, O((V+E) log V)), Floyd-Warshall (all-pairs)
    • Topological Sort: Kahn’s algorithm (BFS) and DFS-based (reverse finish time)
    • Critical Path (AOE networks): forward/backward pass, float/slack
  • Search
    • Sequential, binary, block (index) search
    • B-Tree / B+ Tree (revisit from database storage perspective)
    • Hash Table: chaining vs. open addressing, load factor α, rehashing
    • Collision resolution: linear probing, quadratic probing, double hashing
  • Sorting
    • Comparison sorts: Insertion, Shell, Bubble, QuickSort, MergeSort, HeapSort
    • Non-comparison sorts: Counting, Radix, Bucket
    • Full complexity table: best / avg / worst time + space (O(1) / O(log n) / O(n))
    • Stability analysis: which algorithms preserve relative order of equal elements?
    • QuickSort partition schemes: Lomuto vs. Hoare, pivot selection strategies
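
The amortized-O(1) append claim above can be checked directly. Below is a minimal sketch of a doubling dynamic array (illustrative only: CPython's list actually uses a milder over-allocation factor, not 2x); counting element copies shows the total resize work stays linear:

```python
# Minimal dynamic-array sketch: capacity doubles when full, so the total
# number of element copies across n appends stays below 2n (amortized O(1)).
class DynamicArray:
    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.data = [None] * self.capacity
        self.copies = 0  # counts element moves caused by resizing

    def append(self, x):
        if self.size == self.capacity:          # full: grow by doubling
            self.capacity *= 2
            new_data = [None] * self.capacity
            for i in range(self.size):          # copy every element once
                new_data[i] = self.data[i]
                self.copies += 1
            self.data = new_data
        self.data[self.size] = x
        self.size += 1

arr = DynamicArray()
n = 1000
for i in range(n):
    arr.append(i)
print(arr.copies, "<", 2 * n)   # → 1023 < 2000
```

The resizes happen at sizes 1, 2, 4, ..., 512, so the copy total is a geometric series bounded by 2n: constant work per append on average.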

🖥️ Module 2 — Computer Organization

  • Architecture Fundamentals
    • Von Neumann model: CPU, memory, I/O, bus — fetch-decode-execute cycle
    • Performance metrics: CPI, MIPS, throughput vs. latency, Amdahl’s Law
  • Data Representation & Arithmetic
    • Fixed-point: sign-magnitude, one’s complement, two’s complement, excess-N (biased)
    • IEEE 754 float32 / float16 / bfloat16: sign · exponent · mantissa; NaN, Inf, denormals
    • Practical ML impact: integer overflow, catastrophic cancellation, gradient underflow
    • ALU design: ripple-carry adder, carry-lookahead adder (conceptual)
    • Booth’s algorithm for multiplication (conceptual understanding)
  • Memory Hierarchy
    • SRAM vs. DRAM: speed, cost, volatility, use cases
    • Cache architecture: direct-mapped, set-associative, fully associative
    • Cache mapping: tag / index / offset bit decomposition
    • Replacement policies: LRU, LFU, FIFO, pseudo-LRU (Clock)
    • Write policies: write-through vs. write-back + write-allocate
    • Virtual memory: page tables, multi-level paging, page fault handling
    • TLB: address translation fast path, TLB shootdown in multicore systems
  • Instruction Set Architecture
    • Instruction formats: fixed-length (RISC) vs. variable-length (CISC)
    • Addressing modes: immediate, direct, indirect, register, base+offset, PC-relative
    • CISC vs. RISC: x86 vs. ARM/RISC-V design philosophy and tradeoffs
  • CPU Internals
    • Single-cycle vs. multi-cycle datapath
    • Hardwired vs. microprogrammed control unit
    • Classic 5-stage pipeline: IF → ID → EX → MEM → WB
    • Pipeline hazards: structural, data (RAW/WAW/WAR), control — forwarding, stalling, branch prediction
    • Superscalar and out-of-order execution (conceptual overview)
  • I/O Systems
    • I/O interface: data register, status register, control register
    • Polling vs. interrupt-driven vs. DMA
    • Interrupt vector table, context save/restore overhead
    • DMA: bus arbitration, burst mode vs. cycle-steal mode
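
The IEEE 754 bit-field decomposition above can be inspected with nothing but the standard `struct` module; a small sketch:

```python
import struct

def float32_fields(x):
    """Decompose a float32 into its (sign, biased exponent, mantissa) bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8-bit biased exponent (bias 127)
    mantissa = bits & 0x7FFFFF          # 23-bit fraction
    return sign, exponent, mantissa

print(float32_fields(1.0))     # (0, 127, 0): 1.0 = +1.0 * 2^(127-127)
print(float32_fields(-2.5))    # (1, 128, 2097152): -2.5 = -1.25 * 2^(128-127)
```

The same trick with `">e"` / 16-bit unpacking exposes why float16's 5-bit exponent runs out of dynamic range long before bfloat16's 8-bit one does.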

⚙️ Module 3 — Operating Systems

  • OS Fundamentals
    • Kernel mode vs. user mode, hardware privilege rings
    • System calls: trap mechanism, syscall table, overhead
    • OS structures: monolithic kernel, microkernel, exokernel, hypervisor
  • Process & Thread Management
    • Process states: new → ready → running → blocked → terminated
    • PCB fields: PID, register set, page table pointer, file descriptor table
    • Threads: kernel-level (1:1) vs. user-level (N:1) vs. hybrid (M:N)
    • Context switch cost: what gets saved, TLB flush, cache cold-start
    • CPU scheduling: FCFS, SJF (preemptive = SRTF), Round Robin, Priority, MLFQ
    • Real-time scheduling: EDF, RMS (rate-monotonic)
  • Synchronization
    • Race conditions, critical sections, mutual exclusion requirements
    • Semaphores (counting + binary), mutex, spinlock
    • Classic problems: Producer-Consumer (bounded buffer), Readers-Writers, Dining Philosophers
    • Monitor: condition variables, wait / signal / broadcast semantics
    • Lock-free structures: CAS (compare-and-swap), ABA problem, memory ordering
  • Deadlock
    • Four necessary conditions: mutual exclusion, hold-and-wait, no preemption, circular wait
    • Prevention: statically break one condition
    • Avoidance: Banker’s Algorithm — safe state definition, resource allocation graph
    • Detection + recovery: resource-allocation graph reduction, process termination / preemption
  • Memory Management
    • Contiguous allocation: fixed/variable partition, fragmentation types
    • Paging: page table structure, page fault sequence, demand paging, copy-on-write
    • Page replacement: OPT (Belady optimal), FIFO (Belady’s anomaly), LRU, Clock/Second-Chance
    • Thrashing: working set model, page fault frequency control
    • Segmentation and segmented paging
  • File Systems
    • Logical structure: sequential, indexed, hashed files
    • Physical allocation: contiguous, linked (FAT), indexed (inode with direct/indirect blocks)
    • Directory: hierarchical, hard links vs. symbolic links, DAG structure
    • File sharing and protection: access control lists, capability lists
    • Disk scheduling: FCFS, SSTF, SCAN (elevator), C-SCAN
  • I/O Management
    • Device driver layers, I/O subsystem software stack
    • Spooling (printer queue as canonical example)
    • Buffer types: single, double, circular buffer management
    • Block device vs. character device interfaces
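
The Clock (second-chance) policy from the memory-management section above is short enough to simulate end to end; a sketch, not a kernel implementation:

```python
def clock_replace(frames, references):
    """Second-chance (Clock) page replacement: returns the page-fault count.
    Each resident page carries a use bit; the hand sweeps, clearing use bits,
    until it finds a victim whose use bit is already 0."""
    memory = []     # list of [page, use_bit]
    hand = 0
    faults = 0
    for page in references:
        for entry in memory:
            if entry[0] == page:        # hit: set the use bit
                entry[1] = 1
                break
        else:                           # miss: page fault
            faults += 1
            if len(memory) < frames:
                memory.append([page, 1])
            else:
                while memory[hand][1] == 1:     # grant a second chance
                    memory[hand][1] = 0
                    hand = (hand + 1) % frames
                memory[hand] = [page, 1]        # evict the victim
                hand = (hand + 1) % frames
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(clock_replace(3, refs))   # → 9
```

On this classic reference string Clock matches FIFO; on workloads with genuine reuse, the use bit lets it approximate LRU at a fraction of the bookkeeping cost.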

🌐 Module 4 — Computer Networks

  • Architecture
    • OSI 7-layer vs. TCP/IP 4-layer: purpose and key protocols at each layer
    • Encapsulation / decapsulation, PDU naming (frame / packet / segment / message)
    • Network performance: bandwidth, throughput, propagation delay, RTT, jitter
  • Physical Layer
    • Nyquist theorem (noise-free channel capacity) vs. Shannon theorem (noisy channel)
    • Encoding schemes: NRZ, Manchester, 4B/5B
    • Modulation: AM/FM/PM, QAM for high-bandwidth channels
    • Transmission media: copper, optical fiber, wireless (characteristics + tradeoffs)
  • Data Link Layer
    • Framing: byte stuffing, bit stuffing
    • Error detection: parity bits, CRC — polynomial division intuition
    • Error correction: Hamming code distance
    • Flow control + ARQ protocols: Stop-and-Wait, Go-Back-N (GBN), Selective Repeat (SR)
    • Efficiency analysis: utilization formula with window size and propagation delay
    • MAC protocols: CSMA/CD (Ethernet), CSMA/CA (WiFi 802.11), TDMA, FDMA
    • Ethernet frame format, MAC addressing, switch vs. bridge (MAC learning table)
  • Network Layer
    • IPv4 header fields: TTL, protocol, checksum, fragmentation (DF/MF flags)
    • Subnetting: CIDR notation, VLSM — worked prefix calculations
    • NAT: static / dynamic / PAT (overload), connection tracking table
    • ARP: request/reply, cache, gratuitous ARP, proxy ARP
    • DHCP: DORA 4-way handshake
    • ICMP: ping (echo request/reply), traceroute (TTL expiry) mechanism
    • Routing algorithms: distance vector (RIP, Bellman-Ford, count-to-infinity problem) vs. link state (OSPF, Dijkstra)
    • BGP: path-vector protocol, autonomous systems (AS), policy-based routing (policy takes precedence over path length)
    • IPv6: 128-bit addressing, dual-stack transition, tunneling
  • Transport Layer
    • UDP: connectionless, checksum-only, use cases (DNS, QUIC, video streaming)
    • TCP segment header: sequence number, acknowledgement, flags, window size
    • 3-way handshake (SYN → SYN-ACK → ACK) + 4-way teardown, full state machine
    • TCP reliability: cumulative ACK, retransmission timer, SACK option
    • Flow control: sliding window protocol, receive buffer, zero-window probe
    • Congestion control: slow start, congestion avoidance (AIMD), fast retransmit, fast recovery (Reno / CUBIC)
  • Application Layer
    • DNS: hierarchy (root → TLD → authoritative), recursive vs. iterative resolution, TTL caching
    • HTTP/1.1 vs. HTTP/2 vs. HTTP/3 (QUIC): persistent connections, multiplexing, HOL blocking
    • HTTPS: TLS handshake, certificate chain, HSTS
    • FTP: control vs. data channel, active vs. passive mode
    • Email: SMTP (push/send), POP3 / IMAP (pull/sync), MIME encoding
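
The ARQ efficiency analysis mentioned in the data-link section reduces to one formula, U = min(1, N / (1 + 2a)), where a is the ratio of propagation delay to transmission time; a quick sketch (the link numbers below are made up purely for illustration):

```python
def arq_utilization(window, frame_bits, rate_bps, prop_delay_s):
    """Link utilization for a sliding-window ARQ protocol (error-free case).
    U = min(1, N / (1 + 2a)), a = propagation delay / transmission time.
    window=1 gives Stop-and-Wait."""
    t_trans = frame_bits / rate_bps      # time to push one frame onto the wire
    a = prop_delay_s / t_trans
    return min(1.0, window / (1 + 2 * a))

# 1 KB frames on a 1 Mbps link with 20 ms one-way propagation delay:
# t_trans ~ 8.2 ms, a ~ 2.44, so Stop-and-Wait idles most of the pipe.
print(round(arq_utilization(1, 8192, 1_000_000, 0.020), 3))   # → 0.17
print(round(arq_utilization(8, 8192, 1_000_000, 0.020), 3))   # → 1.0
```

The same formula explains why window size must scale with the bandwidth-delay product: long fat pipes need Go-Back-N or Selective Repeat just to keep the wire busy.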

📚 Top Free Course Recommendations

| Module | Course | Platform | Why Recommended |
| --- | --- | --- | --- |
| Data Structures | CS 61B — Data Structures (Josh Hug, UC Berkeley) | YouTube + sp21.datastructur.es | Exceptional pedagogy, Java-based, real projects with auto-graded labs. Covers DS + algorithms at depth — not just theory. |
| Computer Organization | CS:APP / 15-213 — Intro to Computer Systems (CMU) | cs.cmu.edu/~213 + YouTube | The definitive systems course. Caches, linking, concurrency, networking. Legendary labs (bomb, attack, malloc). Covers Module 2 and feeds into 3 & 4. |
| Operating Systems | MIT 6.S081 — OS Engineering (Frans Kaashoek) | pdos.csail.mit.edu/6.828 + YouTube | xv6 labs force you to implement real OS components in C. Best hands-on OS course publicly available. |
| Computer Networks | CS 144 — Intro to Computer Networking (Stanford, Nick McKeown) | YouTube (CS144 playlist) | Builds a TCP/IP stack from scratch in C++. Bridges theory and real protocol implementation. The best networks course online. |
| Systems (Unified) | CS:APP Textbook — Bryant & O’Hallaron | Free PDF widely available | Covers Organization + OS + Networks in one coherent narrative. Best single resource for Modules 2–4. |
| Algorithms | MIT 6.006 — Introduction to Algorithms | MIT OCW (Spring 2020) | Rigorous theory: proof-based analysis, dynamic programming, graph algorithms. Complements CS 61B for depth. |

Bonus for AI context: Pair any systems course with MIT 6.172 Performance Engineering to see how these fundamentals directly apply to optimising ML training loops and inference pipelines.


📝 The Blog Series: 21 Posts

Each post is a standalone deep-dive, but reads coherently as a series. AI engineering angle woven throughout.


🧱 Series 0: The Foundation

Post 0 (this post) — “The AI Engineer’s CS Foundations Roadmap: 408 Syllabus, Top Courses & 21 Posts to Come”

  • Why CS fundamentals matter in the LLM era
  • 408 syllabus as a learning spine — filtered for engineering depth
  • Task checklist · course map · series outline

📦 Series 1: Data Structures (Posts 1–6)

Post 1 — “Memory Layout is Everything: Arrays vs. Linked Lists Through a Cache-Conscious Lens”

  • Physical memory layout: stack vs. heap, pointer indirection cost
  • Cache line = 64 bytes; spatial vs. temporal locality
  • Why PyTorch tensors must be contiguous — what .contiguous() actually does
  • Dynamic array amortized analysis: doubling strategy
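
Contiguity is easy to see in action. A quick sketch using NumPy as a stand-in for tensors (the same idea `.contiguous()` applies to PyTorch storage):

```python
import numpy as np

# A transpose is a zero-copy view: the strides change, the bytes do not move,
# so the result is no longer C-contiguous (logical rows are scattered in memory).
a = np.arange(6, dtype=np.float32).reshape(2, 3)
b = a.T

print(a.flags["C_CONTIGUOUS"])   # True
print(b.flags["C_CONTIGUOUS"])   # False
print(a.strides, b.strides)      # (12, 4) (4, 12): same buffer, reordered view

# ascontiguousarray materialises a fresh, cache-friendly copy, which is
# essentially the job PyTorch's .contiguous() does for tensors.
c = np.ascontiguousarray(b)
print(c.flags["C_CONTIGUOUS"])   # True
```

Kernels that assume unit-stride inner loops (most of them) are exactly why frameworks insist on contiguous inputs before the hot path.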

Post 2 — “Stacks, Queues & the Hidden State Machine in Your Python Interpreter”

  • Call stack anatomy: frames, local variables, return addresses
  • Recursion vs. iteration: stack depth limits and tail-call optimisation
  • Queue applications: BFS, async task queues (Celery, asyncio event loop)
  • Shunting-yard algorithm: expression evaluation walkthrough
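
The shunting-yard walkthrough condenses to a short sketch like the one below (single-digit operands and four left-associative operators only; a real parser also handles associativity rules, multi-character tokens, and errors):

```python
def to_postfix(tokens):
    """Shunting-yard: convert infix tokens to postfix (RPN) via an operator stack."""
    prec = {"+": 1, "-": 1, "*": 2, "/": 2}
    out, ops = [], []
    for tok in tokens:
        if tok.isdigit():
            out.append(tok)
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops[-1] != "(":       # pop until the matching paren
                out.append(ops.pop())
            ops.pop()                   # discard the "("
        else:                           # operator: flush higher/equal precedence
            while ops and ops[-1] != "(" and prec[ops[-1]] >= prec[tok]:
                out.append(ops.pop())
            ops.append(tok)
    while ops:                          # drain remaining operators
        out.append(ops.pop())
    return out

print(to_postfix(list("3+4*2")))     # → ['3', '4', '2', '*', '+']
print(to_postfix(list("(3+4)*2")))   # → ['3', '4', '+', '2', '*']
```

Evaluating the postfix output needs only a value stack: push operands, pop two on each operator, which is precisely the hidden state machine the post title refers to.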

Post 3 — “BST to Red-Black Tree: Why std::map Beats a Naive Dictionary in Worst Cases”

  • BST pathological case: sorted insertion → O(n) search
  • AVL rotations: LL/LR/RL/RR — when and why each fires
  • Red-Black invariants — conceptual, no rote memorisation
  • B+ Trees: block-alignment, why every database index is a B+ tree

Post 4 — “Huffman Codes, Entropy, and the Mathematical Backbone of Data Compression in ML”

  • Greedy Huffman construction via priority queue
  • Shannon entropy: bits as a measure of surprise
  • Connection to cross-entropy loss in classification
  • Modern codecs: LZ77, Zstandard, and arithmetic coding
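
The greedy construction is a few lines with `heapq` as the priority queue. A sketch that tracks code tables directly instead of explicit tree nodes:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Greedy Huffman construction: repeatedly merge the two least-frequent
    subtrees, prefixing '0'/'1' onto their code tables."""
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)                 # unique key so dicts are never compared
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # two least-frequent subtrees...
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))  # ...merged, reinserted
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("aaaabbbccd")
print(codes)   # frequent 'a' gets a 1-bit code; rare 'd' gets 3 bits
```

The result is prefix-free by construction (every symbol sits at a leaf), which is what makes the bitstream decodable without separators, and the expected code length approaches the Shannon entropy of the source.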

Post 5 — “Graphs as the Universal Data Structure: From Knowledge Graphs to GNNs”

  • Matrix vs. adjacency list: when each representation wins
  • BFS / DFS and what they actually guarantee
  • Dijkstra with heap optimisation: full walkthrough
  • Topological sort: dependency resolution (make, pip install)
  • Connection to GNN message-passing framework (MPNN)
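
Kahn's algorithm, the dependency-resolution workhorse behind `make` and package installers, fits in a few lines (the dependency graph below is hypothetical):

```python
from collections import deque

def topo_sort(graph):
    """Kahn's algorithm: repeatedly emit zero in-degree nodes (BFS flavour).
    Returns a valid topological order, or raises if the graph has a cycle."""
    indegree = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indegree[v] += 1
    queue = deque(u for u in graph if indegree[u] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in graph[u]:              # "remove" u's outgoing edges
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(graph):        # leftover nodes all sit on a cycle
        raise ValueError("cycle detected: no topological order exists")
    return order

# Hypothetical build-dependency graph: edges point prerequisite → dependent.
deps = {"numpy": ["torch"], "typing": ["torch"], "torch": ["model"], "model": []}
print(topo_sort(deps))   # → ['numpy', 'typing', 'torch', 'model']
```

The cycle check falls out for free: if the queue drains before every node is emitted, the remaining nodes form a dependency cycle.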

Post 6 — “Sorting Algorithms Demystified: Why QuickSort Dominates and When MergeSort Wins”

  • Comparison sort lower bound: Ω(n log n) via decision tree argument
  • QuickSort: Lomuto vs. Hoare partition, pivot selection, Introsort
  • MergeSort: external sort, parallel/distributed contexts, stability
  • HeapSort: priority queue as a sorting primitive
  • Radix sort: O(nk) for integer keys — when it beats O(n log n)
  • Full stability table + use-case decision matrix

🖥️ Series 2: Computer Organization (Posts 7–11)

Post 7 — “Von Neumann to Transformer: How CPU Architecture Constraints Shape AI Hardware Design”

  • Fetch-decode-execute cycle in detail
  • Von Neumann bottleneck: memory bandwidth vs. compute throughput
  • How GPUs break the bottleneck: SIMD, massive parallelism, high-bandwidth memory
  • TPUs: systolic arrays as matrix-multiply engines
  • Amdahl’s Law applied to ML workloads: the serial fraction kills scaling

Post 8 — “IEEE 754 Deep Dive: Why Your Neural Network Loses Precision (And How to Fix It)”

  • Float32 anatomy: 1 sign bit · 8-bit exponent · 23-bit mantissa
  • Special values: NaN, Inf, denormals — when they appear in training
  • Float16 dynamic range vs. bfloat16: why Google chose bfloat16 for TPUs
  • Gradient underflow: loss scaling in mixed-precision training (AMP)
  • Catastrophic cancellation in attention: why FlashAttention uses tiling
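
The underflow-plus-loss-scaling story in miniature, using NumPy scalars as stand-ins for gradient tensors (the 2^16 scale factor is an illustrative choice; real AMP tunes the scale dynamically):

```python
import numpy as np

# A gradient of 1e-8 is representable in float32 but underflows to zero in
# float16 (smallest subnormal ~ 6e-8), silently stalling training.
grad = np.float32(1e-8)
print(np.float16(grad))             # → 0.0: underflow

# Loss scaling, the core trick of mixed-precision training: multiply the loss
# by a large constant before backward so gradients land inside float16 range,
# then divide the scale back out in float32 before the optimiser step.
scale = np.float32(2.0 ** 16)
scaled = np.float16(grad * scale)   # ~6.55e-4: comfortably representable
recovered = np.float32(scaled) / scale
print(recovered)                    # ~1e-8 again, up to float16 rounding
```

bfloat16 sidesteps the underflow half of this problem entirely by keeping float32's 8-bit exponent and paying with a shorter mantissa instead.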

Post 9 — “The Memory Hierarchy: Cache Misses, VRAM Limits, and the Real Bottleneck in LLM Inference”

  • SRAM vs. DRAM: speed / cost / capacity tradeoffs
  • Cache geometry: set-associative mapping, line size, eviction policy
  • Writing cache-friendly code: matrix transposition as a worked example
  • GPU memory hierarchy: L1 / L2 / VRAM / NVLink bandwidth numbers
  • FlashAttention’s core insight: tile to stay in SRAM, avoid VRAM round-trips
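
The tiling idea behind the transpose example (and, scaled up, behind FlashAttention) can be sketched in NumPy. This shows the access pattern and checks correctness only; actually measuring the cache effect needs a lower-level language than Python:

```python
import numpy as np

def transpose_tiled(a, tile=64):
    """Cache-blocked transpose: process tile x tile sub-blocks so that both
    the read and the write pattern stay within a few cache lines at a time,
    instead of striding across a full row per element."""
    n, m = a.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            # Each small block fits in cache, so its lines are reused
            # before eviction; the naive version touches every line once
            # per element on the strided side.
            block = a[i:i + tile, j:j + tile]
            out[j:j + tile, i:i + tile] = block.T
    return out

a = np.arange(512 * 384, dtype=np.float32).reshape(512, 384)
assert np.array_equal(transpose_tiled(a), a.T)
```

FlashAttention applies the same principle one level up the hierarchy: tile Q, K, V so each block of the attention computation stays in GPU SRAM rather than round-tripping through VRAM.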

Post 10 — “CPU Pipelines & Branch Prediction: The Invisible Performance Tax on Your Python Code”

  • 5-stage pipeline: IF → ID → EX → MEM → WB
  • Hazards: data forwarding, load-use stalls, branch misprediction penalty (~15 cycles)
  • Superscalar + out-of-order execution: how modern CPUs find ILP
  • Spectre / Meltdown: branch prediction as a security vulnerability
  • Implications for Python: GIL, CPython bytecode interpreter overhead

Post 11 — “Interrupts, DMA, and the Data Path of a Deep Learning Training Loop”

  • I/O without DMA: CPU burns cycles polling status registers
  • DMA: bus arbitration, burst mode vs. cycle-steal, memory-to-memory transfer
  • Interrupt handling: context save, ISR, return overhead
  • Modern AI training I/O: NVMe SSDs, PCIe bandwidth, RDMA / InfiniBand
  • DataLoader internals: why num_workers uses multiprocessing, not threading

⚙️ Series 3: Operating Systems (Posts 12–16)

Post 12 — “Processes, Threads, and Python’s GIL: The Concurrency Model Every AI Engineer Should Know”

  • Process vs. thread: separate address space vs. shared memory
  • Kernel threads (1:1) vs. user threads (N:1) — scheduling and blocking I/O
  • Python GIL: why CPU-bound threads don’t parallelise, and why it exists
  • multiprocessing vs. threading vs. asyncio: decision tree
  • PyTorch DataLoader and DistributedDataParallel: process-level concurrency

Post 13 — “CPU Scheduling & Context Switching: The Hidden Tax on Your Model Training Job”

  • Scheduling algorithms: FCFS, SJF, SRTF, Round Robin, MLFQ
  • Context switch cost: register save, TLB flush, cache cold-start (~microseconds)
  • Priority inversion: Mars Pathfinder bug case study
  • Linux cgroups, CPU pinning, and NUMA-aware allocation for ML workloads
  • GPU scheduling: CUDA streams, software pipeline, compute/copy overlap

Post 14 — “Deadlock, Semaphores & Why Distributed Training Needs Careful Synchronization”

  • Race conditions: data corruption from unsynchronised access
  • Mutex, counting semaphore, condition variable — semantics and implementation
  • Classic problems solved: Producer-Consumer (bounded buffer), Readers-Writers
  • Deadlock: four conditions + Banker’s Algorithm walkthrough with example
  • Distributed AI training: AllReduce deadlock, NCCL timeout, collective communication pitfalls
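
A bounded-buffer producer-consumer in Python, using `queue.Queue`, which packages up the mutex and the two condition variables ("not full" / "not empty") that the classic semaphore solution builds by hand:

```python
import threading
import queue

buf = queue.Queue(maxsize=4)       # bounded buffer: producer blocks when full
results = []

def producer(n):
    for i in range(n):
        buf.put(i)                 # blocks while the buffer holds 4 items
    buf.put(None)                  # sentinel: signal end of stream

def consumer():
    while True:
        item = buf.get()           # blocks while the buffer is empty
        if item is None:
            break
        results.append(item * item)

t1 = threading.Thread(target=producer, args=(100,))
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(results), results[:3])   # → 100 [0, 1, 4]
```

This is the same shape as a DataLoader worker feeding a training loop: the bounded queue provides backpressure, so a fast producer cannot outrun memory and a slow consumer never busy-waits.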

Post 15 — “Virtual Memory & Paging: Understanding CUDA’s Unified Memory from First Principles”

  • Virtual address space layout: code · heap · stack · shared libraries
  • Two-level page tables, TLB fast path, TLB shootdown in multicore
  • Page fault types: minor (zero page) · major (disk I/O) · copy-on-write
  • Page replacement: LRU approximation via Clock algorithm
  • CUDA unified memory: page migration, access counters, prefetching hints
  • np.memmap: how memory-mapped files avoid loading full arrays into RAM
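
A minimal `np.memmap` sketch (the file name is hypothetical): the array lives on disk, backed by the page cache, and only the pages you actually touch get faulted in:

```python
import os
import tempfile
import numpy as np

# Write a ~4 MB array to disk, then map it back without reading it wholesale.
path = os.path.join(tempfile.mkdtemp(), "embeddings.bin")   # hypothetical file
np.arange(1_000_000, dtype=np.float32).tofile(path)

# mode="r" maps the file read-only; slicing triggers demand paging, so a
# single-element access faults in one page, not the whole 4 MB.
mm = np.memmap(path, dtype=np.float32, mode="r", shape=(1_000_000,))
print(float(mm[123_456]))        # → 123456.0, via one page fault-in
print(float(mm[:10].sum()))      # → 45.0
```

This is the CPU-side analogue of CUDA unified memory's page migration: the mapping promises the whole address range, and the fault handler fills it lazily.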

Post 16 — “File Systems & I/O: How HDF5, Arrow, Parquet, and DataLoaders Actually Work”

    • inode-based indexed allocation: why unlinking a file is a cheap metadata operation
  • Directory as a special file: path resolution algorithm
  • Journaling: crash consistency, write-ahead log, fsync semantics
  • Disk scheduling: SSTF vs. SCAN — seek time optimisation
  • HDF5: chunked storage, B-tree metadata index
  • Apache Arrow: columnar memory layout, zero-copy IPC, memory-mapped files
  • PyTorch DataLoader: prefetching, pin_memory, and why it speeds up GPU transfer

🌐 Series 4: Computer Networks (Posts 17–20)

Post 17 — “OSI vs. TCP/IP: The Abstraction Layers Powering Distributed AI Systems”

  • OSI 7-layer (pedagogical) vs. TCP/IP 4-layer (practical)
  • Encapsulation walkthrough: HTTP request from browser to wire (every header)
  • gRPC, Protobuf, and how ML serving maps to network layers
  • Service mesh (Istio/Envoy): why distributed inference needs L7 load balancing

Post 18 — “TCP Under the Hood: Reliability, Flow Control & Why gRPC Uses HTTP/2”

  • 3-way handshake full state machine: SYN_SENT → ESTABLISHED → TIME_WAIT
  • Sequence numbers, cumulative ACK, retransmission timer (RTO, Karn’s algorithm)
  • Sliding window: receiver buffer, zero-window probe, silly window syndrome
  • Congestion control: slow start, AIMD, fast retransmit/recovery (Reno) vs. CUBIC
  • HTTP/1.1 vs. HTTP/2: multiplexing, HPACK header compression, server push
  • QUIC / HTTP/3: 0-RTT, stream multiplexing without HoL blocking
  • gRPC: why HTTP/2 + Protobuf beats REST + JSON for ML serving
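
The congestion-control behaviour above can be caricatured in a dozen lines. A deliberately simplified Reno-style trace (real TCP tracks bytes, timeouts, and fast-recovery window inflation; this toy keeps only slow start, AIMD, and halving on loss):

```python
def simulate_cwnd(rounds, ssthresh=16, loss_rounds=()):
    """Toy TCP Reno congestion window trace, one value per RTT (in MSS units):
    exponential slow start below ssthresh, additive increase above it,
    multiplicative decrease on a loss event (fast retransmit/recovery)."""
    cwnd = 1.0
    trace = []
    for r in range(rounds):
        trace.append(cwnd)
        if r in loss_rounds:            # triple-duplicate ACK detected
            ssthresh = max(cwnd / 2, 2)
            cwnd = ssthresh             # Reno: halve and skip slow start
        elif cwnd < ssthresh:
            cwnd *= 2                   # slow start: double each RTT
        else:
            cwnd += 1                   # congestion avoidance: +1 MSS per RTT
    return trace

trace = simulate_cwnd(10, ssthresh=8, loss_rounds={6})
print(trace)   # → [1.0, 2.0, 4.0, 8.0, 9.0, 10.0, 11.0, 5.5, 6.5, 7.5]
```

The characteristic sawtooth (probe up additively, cut multiplicatively) is what keeps many flows sharing a bottleneck fairly, and it is exactly the signal CUBIC reshapes for long fat networks.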

Post 19 — “The Network Layer: IP, Routing Protocols & How Model Parallelism Crosses Nodes”

  • IPv4 header deep dive: TTL, fragmentation (DF/MF), checksum
  • Subnetting: CIDR, VLSM — worked prefix decomposition examples
  • ARP · DHCP · ICMP: the glue protocols every engineer hits in debugging
  • RIP (distance vector, count-to-infinity problem) vs. OSPF (link state, Dijkstra)
  • BGP: inter-AS routing, why policy trumps shortest path
  • InfiniBand vs. Ethernet for HPC/AI clusters: latency and bandwidth numbers
  • RDMA: kernel bypass for low-latency all-reduce in distributed training
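
The worked prefix calculations can be sanity-checked with Python's standard `ipaddress` module; a short sketch splitting a /24 into four /26 subnets:

```python
import ipaddress

# VLSM-style prefix decomposition: a /24 split into four /26 blocks.
net = ipaddress.ip_network("10.0.1.0/24")
subnets = list(net.subnets(new_prefix=26))
for s in subnets:
    # 64 addresses per /26, minus network and broadcast addresses
    print(s, "usable hosts:", s.num_addresses - 2)

# Membership test: which /26 does an address fall into?
print(ipaddress.ip_address("10.0.1.130") in subnets[2])   # → True (10.0.1.128/26)
```

The same module handles the longest-prefix-match intuition from routing: an address matches many prefixes, and the forwarding table picks the most specific one.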

Post 20 — “Application Protocols: DNS, HTTP, TLS & the APIs of Modern AI Services”

  • DNS: resolution chain, caching, negative TTL, DNSSEC basics
  • HTTP methods, status codes, REST vs. RPC semantics
  • TLS 1.3 handshake: 1-RTT, certificate chain, forward secrecy
  • WebSockets: persistent bidirectional channels for real-time inference
  • SSE (Server-Sent Events): how ChatGPT streams tokens to the browser
  • OpenAI API design patterns: rate limiting, token streaming, error handling

🏁 Series 5: Capstone

Post 21 — “Full-Stack Systems Thinking: From Bits to Transformers — CS Foundations Retrospective”

  • How the four modules connect: DS → Algorithms → OS → Networks → Distributed Systems → ML Systems
  • Mental models that compound: cache locality, pipeline hazards, TCP congestion window
  • What to learn next: distributed systems (DDIA), ML systems (MLSys), compilers (LLVM/Triton)
  • Full resource list and reading recommendations

🗓️ Suggested Study Timeline

```mermaid
gantt
    title CS Foundations — 12-Week Study Timeline
    dateFormat  YYYY-MM-DD
    section Data Structures
    Linear + Stack/Queue     :ds1, 2026-04-08, 7d
    Trees                    :ds2, after ds1, 7d
    Graphs + Search + Sort   :ds3, after ds2, 7d
    section Computer Org
    Arch + Numbers + ALU     :co1, after ds3, 7d
    Memory Hierarchy         :co2, after co1, 7d
    CPU Pipeline + I/O       :co3, after co2, 7d
    section Operating Systems
    Process / Thread / Sync  :os1, after co3, 7d
    Memory + File + I/O      :os2, after os1, 7d
    section Networks
    Physical + Data Link     :net1, after os2, 7d
    Network + Transport      :net2, after net1, 7d
    Application Layer        :net3, after net2, 7d
    Capstone Review          :cap, after net3, 7d
```

At one weekly block per week, the full checklist completes in ~12 weeks. Given FYP deadlines (final report 30 Apr · demo 4–8 May), a realistic start is post-FYP submission — treat this roadmap as a summer milestone.


🔗 What’s Next

This post is the map. The journey starts with Post 1 → Memory Layout: Arrays vs. Linked Lists Through a Cache-Conscious Lens.

Every post in this series follows the same structure: concept → implementation intuition → AI engineering connection. No rote memorisation. Just systems thinking, built bottom-up.


Part of the CS Foundations for AI Engineers series. Next: Post 1 — “Memory Layout is Everything.”

This post is licensed under CC BY 4.0 by the author.