
Slurm on HPC: srun vs sbatch — A Practical Guide for ML Researchers

Slurm is the gatekeeper between your code and the GPU — master interactive, batch, and array jobs and you'll never waste a compute allocation again.


You’ve written the model, the data pipeline is ready — but now you’re staring at a login node on an HPC cluster with no GUI, no desktop, and a python train.py that you definitely should not just run right there. This is Slurm territory. Once you understand the three job modes, the cluster stops being intimidating and starts being your best asset.


🤔 Why HPC? — The 10% Utilisation Problem


Here’s a story that reframes how you think about compute.

A researcher buys a $10,000 workstation. Powerful, dedicated, all theirs. Over months of logging, they find they actually use it 10% of the time — a few heavy days, many idle ones. Effective cost? $100,000 per unit of compute used.

An HPC flips this. Ten researchers each need 10% of a machine. Pool them together, and one machine serves them all — cheaper for everyone, and the hardware is almost never idle.

This is why virtually every major university, research lab, and cloud provider has moved to shared high-performance computing. You’re not just borrowing a fast computer — you’re buying into a much more efficient economic model.

Beyond cost, HPC unlocks resources you simply can’t own: 100 GB RAM nodes for genome assembly, clusters of A100s for transformer pretraining, weeks of uninterrupted walltime. Your laptop becomes a terminal to log in — nothing more.


🧭 What is Slurm?


Slurm (Simple Linux Utility for Resource Management) is the most widely used job scheduler on HPC clusters worldwide — NSCC, NTU’s GPU cluster, AWS HPC offerings, and most academic supercomputers all run it. Some older clusters use Torque instead (same concepts, slightly different syntax — check with your sysadmin which one you have).

Its job is simple: you have many users fighting over a limited number of GPUs. Slurm decides who gets what, when.

flowchart TD
    U1(["👤 You\n(training 3D U-Net)"]):::user
    U2(["👤 User B\n(genomic analysis)"]):::user
    U3(["👤 User C\n(batch inference)"]):::user

    HN["🖥️ Head / Login Node\n(submit jobs here — never run heavy work here)"]:::head

    SL["🧮 Slurm Scheduler\n(Priority Queue)"]:::slurm

    N1(["⚙️ Compute Node 1\n4× A100"]):::node
    N2(["⚙️ Compute Node 2\n8× V100"]):::node
    N3(["⚙️ Compute Node 3\n2× RTX 3090"]):::node

    U1 -->|"srun / sbatch"| HN
    U2 --> HN
    U3 --> HN
    HN --> SL

    SL -->|"allocates"| N1
    SL -->|"allocates"| N2
    SL -->|"allocates"| N3

    classDef user fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef head fill:#D9534F,stroke:#9e2c2c,color:#fff
    classDef slurm fill:#E8A838,stroke:#b07820,color:#fff
    classDef node fill:#5BA85A,stroke:#3a6e39,color:#fff

Never run heavy jobs on the login node. The login node is shared by every user for file editing and job submission only. Running python train.py there will get you throttled or banned. Always go through Slurm.


🗂️ Key Concepts Before You Start


| Term | What It Means |
| --- | --- |
| Login / Head Node | The machine you SSH into — for editing code and submitting jobs only |
| Compute Node | The actual workhorse machines Slurm allocates your job to |
| Partition | A named group of nodes (like a queue), e.g. gpu, cpu, short |
| Job | A unit of work submitted to Slurm |
| Task | A process within a job (for MPI, you might have many tasks) |
| Allocation | The reserved resources Slurm grants you for a job |
| GRES | Generic Resource — how you request GPUs: --gres=gpu:1 |

You tell Slurm: “I need X CPUs, Y GB RAM, Z GPUs, for at most T hours.” Slurm queues your job, finds a compute node that fits, and runs it — sending you an email when it’s done or failed.


⚡ Mode 1: Interactive (srun --pty bash)


The interactive mode gets you a live shell on a compute node. You’re no longer on the login node — you’re on the GPU machine itself, and everything you type runs there in real time.

The cleanest way to do this is a reusable shell script:

#!/bin/bash
# interactive.slurm
srun --account=your_account \
     --partition=gpu \
     --gres=gpu:1 \
     --cpus-per-task=4 \
     --mem=16G \
     --time=02:00:00 \
     --pty /bin/bash

Make it executable and run it:

chmod +x interactive.slurm
./interactive.slurm

You’ll see your prompt change from the login node (gl-login2) to a compute node (gl3062). From there, run nvidia-smi, load your conda env, test your dataloader — anything interactive.

Save this script and reuse it across projects: write it once, copy it anywhere. The resource settings (memory, time, CPUs) are the only project-specific tweaks.

Type exit to leave the compute node and return to the login node.


When to Use Interactive Mode

  • Debugging your training script (test with 1 epoch first)
  • Profiling GPU memory usage
  • Exploring data, checking shapes, sanity-checking pipelines
  • Any task where you need to see output in real time

📋 Mode 2: Batch Job (sbatch)


sbatch submits a shell script to the queue and returns immediately — your terminal is free. Slurm runs the job when resources are available and writes all stdout/stderr to a log file.


Anatomy of an sbatch Script

#!/bin/bash
#SBATCH --job-name=fyp_unet_train     # Visible in squeue
#SBATCH --partition=gpu               # Which node group to use
#SBATCH --gres=gpu:1                  # 1 GPU
#SBATCH --cpus-per-task=8             # CPU cores
#SBATCH --mem=32G                     # RAM
#SBATCH --time=08:00:00               # Max walltime (HH:MM:SS)
#SBATCH --output=logs/%j_out.txt      # %j = job ID
#SBATCH --error=logs/%j_err.txt       # Separate stderr log
#SBATCH --mail-type=BEGIN,END,FAIL    # Email on start, end, or failure
#SBATCH --mail-user=your@ntu.edu.sg

# ── Environment Setup ─────────────────────────────────
echo "Job started: $(date)"
echo "Node: $SLURMD_NODENAME"
echo "GPUs: $CUDA_VISIBLE_DEVICES"

module load cuda/11.8
source ~/miniconda3/etc/profile.d/conda.sh
conda activate fyp_env

# ── Your Actual Job ───────────────────────────────────
cd ~/fyp/sparse-ct

srun python train.py \
    --epochs 100 \
    --batch-size 4 \
    --views 60 \
    --output-dir checkpoints/

echo "Job finished: $(date)"

Submit it:

sbatch train_job.sh
# → Submitted batch job 482931

Notice the srun inside the sbatch script. This is the correct pattern — srun within sbatch properly integrates with Slurm’s step tracking. Calling python directly works but loses resource accounting.


When to Use Batch Mode

  • Long training runs (hours to days)
  • Overnight jobs — submit before sleep, check results in the morning
  • Any job where you don’t need live output
  • Running multiple commands in series (just add lines after the first srun)
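For that multi-step case, the pattern looks like this: later srun lines start only after earlier ones finish. A sketch of such a pipeline (the stage scripts, paths, and flags here are hypothetical):

```shell
#!/bin/bash
#SBATCH --job-name=pipeline
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00

set -e   # abort the remaining steps if any stage fails

# Each srun below is recorded as its own job step in Slurm's accounting.
srun python preprocess.py --input raw/ --output processed/
srun python train.py --data processed/ --epochs 100
srun python evaluate.py --checkpoint checkpoints/best.pt
```

With set -e, a failed preprocessing step stops the whole job instead of wasting GPU hours training on missing data.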

🔢 Mode 3: Array Jobs — Run Many at Once


Array jobs are Slurm’s killer feature. Submit one script, Slurm runs it N times in parallel — each instance gets a unique $SLURM_ARRAY_TASK_ID.

#SBATCH --array=1-100        # 100 parallel jobs: IDs 1 to 100
#SBATCH --array=1-20%4       # 20 jobs, max 4 running simultaneously

Example — training with 100 different random seeds:

#!/bin/bash
#SBATCH --job-name=seed_sweep
#SBATCH --array=1-100
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=logs/array_%A_%a.txt   # %A = master job ID, %a = array task ID

SEED=$SLURM_ARRAY_TASK_ID

echo "Running seed=$SEED (task $SLURM_ARRAY_TASK_ID)"
srun python train.py \
    --seed $SEED \
    --output-dir results/seed_${SEED}/

What happens: Slurm spawns 100 jobs. Each runs with a different SEED (1–100). All run in parallel (up to available GPUs). Each gets its own log file. You’re done in the time it takes one job to finish.

Array jobs are the cleanest hyperparameter sweep on a cluster. No manual loop, no babysitting — each run is isolated with its own log and job ID.
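Beyond seeds, the same task ID can index a full hyperparameter grid. A minimal bash sketch (the learning rates, batch sizes, and train.py flags here are hypothetical):

```shell
#!/bin/bash
# Map one array task ID onto a 2-D hyperparameter grid.
# Outside Slurm, SLURM_ARRAY_TASK_ID is unset — default to 1 for local testing.
TASK_ID=${SLURM_ARRAY_TASK_ID:-1}

LRS=(0.001 0.0003 0.0001)      # 3 learning rates (hypothetical values)
BATCHES=(2 4 8 16)             # 4 batch sizes → 12 combos, so --array=1-12

IDX=$(( TASK_ID - 1 ))                         # 0-based index
LR=${LRS[$(( IDX / ${#BATCHES[@]} ))]}         # row: changes every 4 tasks
BS=${BATCHES[$(( IDX % ${#BATCHES[@]} ))]}     # column: cycles through the 4 sizes

echo "task=$TASK_ID lr=$LR batch=$BS"
# srun python train.py --lr "$LR" --batch-size "$BS"   (hypothetical flags)
```

Task 1 gets lr=0.001 with batch size 2, task 5 gets lr=0.0003 with batch size 2, and so on across all 12 combinations.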


🔍 Monitoring Your Jobs


# See your jobs in the queue
squeue -u $USER

# Watch live (auto-refresh every 5s)
watch -n 5 squeue -u $USER

# All partitions and node availability
sinfo

# Cancel a specific job
scancel <JOB_ID>

# Cancel all your jobs
scancel -u $USER

# Efficiency stats after a job ends (was your RAM/CPU request accurate?)
seff <JOB_ID>

Reading squeue Output

JOBID           PARTITION  NAME        USER    ST  TIME     NODES  NODELIST(REASON)
482931          gpu        fyp_unet    yuxuan  R   0:15:32  1      gpu-node-03
482932_[1-100]  gpu        seed_sweep  yuxuan  PD  0:00     1      (Priority)
| Status | Meaning |
| --- | --- |
| R | Running |
| PD | Pending — waiting in queue |
| CG | Completing — finishing up, releasing resources |
| F | Failed |
| CA | Cancelled |

PD with reason (Resources) means no GPU is free — just wait. Reason (Priority) means other users rank ahead this cycle. Both are normal.


🧹 Cleaning Up Output Files


After jobs finish, Slurm drops output files in your working directory (here named <script>.slurm.o*; the exact pattern depends on your cluster’s defaults). Clean them up:

# Remove array job output files
rm array.slurm.o*

# Remove batch job output files
rm single.slurm.o*

Or redirect them to a logs/ folder upfront using --output=logs/%j_out.txt in your script — cleaner, and they’re all in one place. One catch: Slurm won’t create the directory for you, so run mkdir -p logs before submitting, or the job can fail with no log to tell you why.


⚖️ Three Modes at a Glance


| | Interactive | Batch | Array |
| --- | --- | --- | --- |
| How to submit | srun --pty bash | sbatch script.sh | sbatch --array=1-N |
| Blocks terminal | ✅ Yes | ❌ No | ❌ No |
| Live output | ✅ Yes | ❌ Log file | ❌ Per-task log |
| Best for | Debugging | Long single runs | Sweeps / parallel runs |
| Email notify | ❌ No | ✅ Yes | ✅ Yes |
| Parallelism | 1 job | 1 job | N jobs simultaneously |


flowchart TD
    A(["💻 Login Node\n(edit code, submit jobs)"]):::login

    B["./interactive.slurm\n(get shell on compute node)"]:::srun
    C{"✅ 1-epoch\ntest passes?"}:::check

    D["Write sbatch script\n(full config, email, logs)"]:::sbatch
    E["sbatch train.sh"]:::submit
    F(["📬 Queue: PD → R"]):::queue
    G(["✅ Done\nCheckpoint + log saved"]):::done

    A --> B --> C
    C -->|"❌ debug"| B
    C -->|"✅ ready"| D
    D --> E --> F --> G

    classDef login fill:#9B6EBD,stroke:#6b4785,color:#fff
    classDef srun fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef check fill:#E8A838,stroke:#b07820,color:#fff
    classDef sbatch fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef submit fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef queue fill:#888,stroke:#555,color:#fff
    classDef done fill:#4A90D9,stroke:#2c5f8a,color:#fff

⚠️ Common Gotchas


Running on the login node — the cardinal sin. Always use srun or sbatch. No exceptions.

Walltime buffer — set --time to 10–20% more than expected. Slurm kills jobs the instant they hit the limit: no checkpoint, no warning, no mercy.
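If you estimate runtime in minutes, the buffered --time string is simple arithmetic. A tiny sketch (the 6-hour estimate is just an example):

```shell
#!/bin/bash
# Turn an estimated runtime (in minutes) into a --time value with a 20% buffer.
EST_MIN=360                                   # you expect roughly 6 hours
BUF_MIN=$(( EST_MIN + EST_MIN / 5 ))          # +20% buffer
WALLTIME=$(printf '%02d:%02d:00' $(( BUF_MIN / 60 )) $(( BUF_MIN % 60 )))
echo "--time=$WALLTIME"
```

For a 6-hour estimate this prints --time=07:12:00, comfortably past the expected finish without hogging queue priority with an inflated request.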

Memory requests — always request slightly more RAM than your model + data needs. If you exceed your reserved allocation, Slurm terminates the job with OUT_OF_MEMORY even if the node has free RAM.

Checkpoint often — save model state every N epochs. If the cluster reboots or your job gets preempted, you resume from the last checkpoint instead of restarting from scratch.
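A common companion pattern is to let the sbatch script itself find the newest checkpoint and resume from it, so a resubmitted job picks up where the last one died. A sketch, assuming checkpoints land in checkpoints/ as *.pt files and train.py accepts a --resume flag (both assumptions):

```shell
#!/bin/bash
# Resume from the newest checkpoint if one exists (paths/flags are hypothetical).
CKPT_DIR=checkpoints
mkdir -p "$CKPT_DIR"

LATEST=$(ls -t "$CKPT_DIR"/*.pt 2>/dev/null | head -n 1)

if [ -n "$LATEST" ]; then
    echo "Resuming from $LATEST"
    # srun python train.py --resume "$LATEST"
else
    echo "No checkpoint yet, starting fresh"
    # srun python train.py
fi
```

With this in place, recovering from a walltime kill is just sbatch train_job.sh again, not a restart from epoch zero.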


🧠 One-Sentence Intuition


Use interactive mode to think, batch mode to execute, and array mode to scale — three tools, one mental model.


🎬 Learn More


Prof. Pat Schloss’s HPC intro — covers the economics of shared compute, interactive/batch/array modes end-to-end with live demos. A great 30-minute deep-dive if you’re new to Slurm.


Part of my HPC & ML workflow notes series. Next: setting up conda environments on a cluster, checkpointing with PyTorch Lightning, and multi-GPU training with --ntasks on Slurm.

This post is licensed under CC BY 4.0 by the author.