Slurm on HPC: srun vs sbatch — A Practical Guide for ML Researchers
Slurm is the gatekeeper between your code and the GPU — master interactive, batch, and array jobs and you'll never waste a compute allocation again.
You’ve written the model, the data pipeline is ready — but now you’re staring at a login node on an HPC cluster with no GUI, no desktop, and a python train.py that you definitely should not just run right there. This is Slurm territory. Once you understand the three job modes, the cluster stops being intimidating and starts being your best asset.
🤔 Why HPC? — The 10% Utilisation Problem
Here’s a story that reframes how you think about compute.
A researcher buys a $10,000 workstation. Powerful, dedicated, all theirs. Over months of logging, they find they actually use it 10% of the time — a few heavy days, many idle ones. Effective cost: $100,000 per workstation-equivalent of compute actually used.
An HPC flips this. Ten researchers each need 10% of a machine. Pool them together, and one machine serves them all — cheaper for everyone, and the hardware is almost never idle.
This is why virtually every major university, research lab, and cloud provider has moved to shared high-performance computing. You’re not just borrowing a fast computer — you’re buying into a much more efficient economic model.
Beyond cost, HPC unlocks resources you simply can’t own: 100 GB RAM nodes for genome assembly, clusters of A100s for transformer pretraining, weeks of uninterrupted walltime. Your laptop becomes a terminal to log in — nothing more.
🧭 What is Slurm?
Slurm (Simple Linux Utility for Resource Management) is the most widely-used job scheduler on HPC clusters worldwide — NSCC, NTU’s GPU cluster, AWS HPC, and most academic supercomputers all run it. Some older clusters use Torque (same concepts, slightly different syntax — check with your sysadmin which one you have).
Its job is simple: you have many users fighting over a limited number of GPUs. Slurm decides who gets what, when.
```mermaid
flowchart TD
    U1(["👤 You\n(training 3D U-Net)"]):::user
    U2(["👤 User B\n(genomic analysis)"]):::user
    U3(["👤 User C\n(batch inference)"]):::user
    HN["🖥️ Head / Login Node\n(submit jobs here — never run heavy work here)"]:::head
    SL["🧮 Slurm Scheduler\n(Priority Queue)"]:::slurm
    N1(["⚙️ Compute Node 1\n4× A100"]):::node
    N2(["⚙️ Compute Node 2\n8× V100"]):::node
    N3(["⚙️ Compute Node 3\n2× RTX 3090"]):::node
    U1 -->|"srun / sbatch"| HN
    U2 --> HN
    U3 --> HN
    HN --> SL
    SL -->|"allocates"| N1
    SL -->|"allocates"| N2
    SL -->|"allocates"| N3
    classDef user fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef head fill:#D9534F,stroke:#9e2c2c,color:#fff
    classDef slurm fill:#E8A838,stroke:#b07820,color:#fff
    classDef node fill:#5BA85A,stroke:#3a6e39,color:#fff
```
Never run heavy jobs on the login node. The login node is shared by every user for file editing and job submission only. Running `python train.py` there will get you throttled or banned. Always go through Slurm.
🗂️ Key Concepts Before You Start
| Term | What It Means |
|---|---|
| Login / Head Node | The machine you SSH into — for editing code and submitting jobs only |
| Compute Node | The actual workhorse machines Slurm allocates your job to |
| Partition | A named group of nodes (like a queue), e.g. gpu, cpu, short |
| Job | A unit of work submitted to Slurm |
| Task | A process within a job (for MPI, you might have many tasks) |
| Allocation | The reserved resources Slurm grants you for a job |
| GRES | Generic Resource — how you request GPUs: --gres=gpu:1 |
You tell Slurm: “I need X CPUs, Y GB RAM, Z GPUs, for at most T hours.” Slurm queues your job, finds a compute node that fits, and runs it — sending you an email when it’s done or failed.
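That sentence maps one-to-one onto submission flags. A sketch (the partition name and `job.sh` are placeholders; check `sinfo` for the partitions on your cluster):

```shell
# "I need 8 CPUs, 32 GB RAM, 1 GPU, for at most 4 hours":
sbatch --partition=gpu \
       --cpus-per-task=8 \
       --mem=32G \
       --gres=gpu:1 \
       --time=04:00:00 \
       job.sh
```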
⚡ Mode 1: Interactive (srun --pty bash)
The interactive mode gets you a live shell on a compute node. You’re no longer on the login node — you’re on the GPU machine itself, and everything you type runs there in real time.
The cleanest way to do this is a reusable shell script:
```bash
#!/bin/bash
# interactive.slurm
srun --account=your_account \
     --partition=gpu \
     --gres=gpu:1 \
     --cpus-per-task=4 \
     --mem=16G \
     --time=02:00:00 \
     --pty /bin/bash
```
Make it executable and run it:
```bash
chmod +x interactive.slurm
./interactive.slurm
```
You’ll see your prompt change from the login node (gl-login2) to a compute node (gl3062). From there, run nvidia-smi, load your conda env, test your dataloader — anything interactive.
Save this script and reuse it across projects. Write it once, copy it anywhere. The resource settings (mem, time, CPUs) are the only project-specific tweaks.
Type exit to leave the compute node and return to the login node.
When to Use Interactive Mode
- Debugging your training script (test with 1 epoch first)
- Profiling GPU memory usage
- Exploring data, checking shapes, sanity-checking pipelines
- Any task where you need to see output in real time
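Once you land on the compute node, a quick sanity pass catches environment problems before you burn walltime. A sketch (the env name matches the sbatch example later in this guide; these commands only make sense on a GPU node with PyTorch installed):

```shell
nvidia-smi                     # is the GPU visible and idle?
conda activate fyp_env         # does your environment load on this node?
python -c "import torch; print(torch.cuda.is_available())"   # should print True on a GPU node
```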
📋 Mode 2: Batch Job (sbatch)
sbatch submits a shell script to the queue and returns immediately — your terminal is free. Slurm runs the job when resources are available and writes all stdout/stderr to a log file.
Anatomy of an sbatch Script
```bash
#!/bin/bash
#SBATCH --job-name=fyp_unet_train      # Visible in squeue
#SBATCH --partition=gpu                # Which node group to use
#SBATCH --gres=gpu:1                   # 1 GPU
#SBATCH --cpus-per-task=8              # CPU cores
#SBATCH --mem=32G                      # RAM
#SBATCH --time=08:00:00                # Max walltime (HH:MM:SS)
#SBATCH --output=logs/%j_out.txt       # %j = job ID
#SBATCH --error=logs/%j_err.txt        # Separate stderr log
#SBATCH --mail-type=BEGIN,END,FAIL     # Email on start, end, or failure
#SBATCH --mail-user=your@ntu.edu.sg

# ── Environment Setup ─────────────────────────────────
echo "Job started: $(date)"
echo "Node: $SLURMD_NODENAME"
echo "GPUs: $CUDA_VISIBLE_DEVICES"

module load cuda/11.8
source ~/miniconda3/etc/profile.d/conda.sh
conda activate fyp_env

# ── Your Actual Job ───────────────────────────────────
cd ~/fyp/sparse-ct
srun python train.py \
    --epochs 100 \
    --batch-size 4 \
    --views 60 \
    --output-dir checkpoints/

echo "Job finished: $(date)"
```
Submit it:
```bash
sbatch train_job.sh
# → Submitted batch job 482931
```
Notice the `srun` inside the sbatch script. This is the correct pattern: `srun` within `sbatch` properly integrates with Slurm's step tracking. Calling `python` directly works but loses resource accounting.
When to Use Batch Mode
- Long training runs (hours to days)
- Overnight jobs — submit before sleep, check results in the morning
- Any job where you don’t need live output
- Running multiple commands in series (just add lines after the first `srun`)
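For the multi-step case, each additional `srun` line becomes its own tracked job step, executed in order on the same allocation. A sketch with hypothetical script names:

```shell
# Three steps, one allocation: each runs after the previous one finishes.
srun python preprocess.py --out data/processed/
srun python train.py --data data/processed/
srun python evaluate.py --checkpoint checkpoints/best.pt
```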
🔢 Mode 3: Array Jobs — Run Many at Once
Array jobs are Slurm’s killer feature. Submit one script, Slurm runs it N times in parallel — each instance gets a unique $SLURM_ARRAY_TASK_ID.
```bash
#SBATCH --array=1-100     # 100 parallel jobs: IDs 1 to 100
#SBATCH --array=1-20%4    # 20 jobs, max 4 running simultaneously
```
Example — training with 100 different random seeds:
```bash
#!/bin/bash
#SBATCH --job-name=seed_sweep
#SBATCH --array=1-100
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=logs/array_%a_%j.txt   # %a = array task ID

SEED=$SLURM_ARRAY_TASK_ID
echo "Running seed=$SEED (task $SLURM_ARRAY_TASK_ID)"

srun python train.py \
    --seed $SEED \
    --output-dir results/seed_${SEED}/
```
What happens: Slurm spawns 100 jobs. Each runs with a different SEED (1–100). They run in parallel, up to however many GPUs are free. Each gets its own log file. With enough free GPUs, the whole sweep finishes in roughly the time one job takes.
Array jobs are the cleanest hyperparameter sweep on a cluster. No manual loop, no babysitting — each run is isolated with its own log and job ID.
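Seeds map straight onto the task ID, but the same trick indexes into a parameter grid. A sketch, assuming a 3 × 2 grid of learning rates and batch sizes (so you would submit with `--array=1-6`); the `:-1` default lets you test the mapping locally without Slurm:

```shell
# Map array task IDs 1..6 onto a (learning-rate, batch-size) grid.
TASK_ID=${SLURM_ARRAY_TASK_ID:-1}   # Slurm sets this; default 1 for local testing
LRS=(0.001 0.0003 0.0001)
BATCHES=(2 4)
IDX=$((TASK_ID - 1))
LR=${LRS[$((IDX / 2))]}             # row: changes every 2 IDs
BS=${BATCHES[$((IDX % 2))]}         # column: alternates each ID
echo "task=$TASK_ID lr=$LR bs=$BS"
# srun python train.py --lr $LR --batch-size $BS --output-dir results/task_${TASK_ID}/
```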
🔍 Monitoring Your Jobs
```bash
# See your jobs in the queue
squeue -u $USER

# Watch live (auto-refresh every 5 s)
watch -n 5 squeue -u $USER

# All partitions and node availability
sinfo

# Cancel a specific job
scancel <JOB_ID>

# Cancel all your jobs
scancel -u $USER

# Efficiency stats after a job ends (was your RAM/CPU request accurate?)
seff <JOB_ID>
```
Reading squeue Output
```text
JOBID           PARTITION  NAME        USER    ST  TIME     NODES  NODELIST
482931          gpu        fyp_unet    yuxuan  R   0:15:32  1      gpu-node-03
482932_[1-100]  gpu        seed_sweep  yuxuan  PD 0:00:00   1      (Priority)
```
| Status | Meaning |
|---|---|
| `R` | Running |
| `PD` | Pending — waiting in queue |
| `CG` | Completing — finishing up, releasing resources |
| `F` | Failed |
| `CA` | Cancelled |
`PD` with reason `(Resources)` means no GPU is free; just wait. Reason `(Priority)` means other users rank ahead this cycle. Both are normal.
🧹 Cleaning Up Output Files
After jobs finish, Slurm leaves output files in your working directory (by default `slurm-<jobid>.out`, or `slurm-<jobid>_<taskid>.out` for array tasks). Clean them up:

```bash
# Remove array job output files
rm slurm-*_*.out

# Remove remaining batch job output files
rm slurm-*.out
```
Or redirect them to a logs/ folder upfront using --output=logs/%j_out.txt in your script — cleaner, and they’re all in one place.
⚖️ Three Modes at a Glance
| | Interactive | Batch | Array |
|---|---|---|---|
| How to submit | srun --pty bash | sbatch script.sh | sbatch --array=1-N |
| Blocks terminal | ✅ Yes | ❌ No | ❌ No |
| Live output | ✅ Yes | ❌ Log file | ❌ Per-task log |
| Best for | Debugging | Long single runs | Sweeps / parallel runs |
| Email notify | ❌ | ✅ | ✅ |
| Parallelism | 1 job | 1 job | N jobs simultaneously |
🛠️ Recommended Workflow
```mermaid
flowchart TD
    A(["💻 Login Node\n(edit code, submit jobs)"]):::login
    B["./interactive.slurm\n(get shell on compute node)"]:::srun
    C{"✅ 1-epoch\ntest passes?"}:::check
    D["Write sbatch script\n(full config, email, logs)"]:::sbatch
    E["sbatch train.sh"]:::submit
    F(["📬 Queue: PD → R"]):::queue
    G(["✅ Done\nCheckpoint + log saved"]):::done
    A --> B --> C
    C -->|"❌ debug"| B
    C -->|"✅ ready"| D
    D --> E --> F --> G
    classDef login fill:#9B6EBD,stroke:#6b4785,color:#fff
    classDef srun fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef check fill:#E8A838,stroke:#b07820,color:#fff
    classDef sbatch fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef submit fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef queue fill:#888,stroke:#555,color:#fff
    classDef done fill:#4A90D9,stroke:#2c5f8a,color:#fff
```
⚠️ Common Gotchas
Running on the login node — the cardinal sin. Always use `srun` or `sbatch`. No exceptions.
Walltime buffer — set `--time` to 10–20% more than expected. Slurm kills jobs the instant they hit the limit: no checkpoint, no warning, no mercy.
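If your training loop can checkpoint on demand, Slurm can warn you before the kill. A sketch using `--signal` (the trap handler here just logs; wiring the signal into an actual checkpoint save in your training code is up to you):

```shell
#SBATCH --time=08:00:00
#SBATCH --signal=B:USR1@300   # send SIGUSR1 to the batch shell 300 s before the limit

trap 'echo "Time limit near, checkpoint now"' USR1
srun python train.py --epochs 100 &
wait   # returns when the signal arrives, so the trap can run
```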
Memory requests — always request slightly more RAM than your model + data needs. If you exceed your reserved allocation, Slurm terminates the job with `OUT_OF_MEMORY` even if the node has free RAM.
Checkpoint often — save model state every N epochs. If the cluster reboots or your job gets preempted, you resume from the last checkpoint instead of restarting from scratch.
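On the job-script side, resuming can be automatic: pick the newest checkpoint, if any, and hand it back to your trainer. A sketch assuming a hypothetical `--resume` flag in train.py:

```shell
mkdir -p checkpoints
# Newest .pt file wins; empty string if this is a fresh run.
CKPT=$(ls -t checkpoints/*.pt 2>/dev/null | head -n 1)
if [ -n "$CKPT" ]; then
    RESUME_ARGS="--resume $CKPT"
else
    RESUME_ARGS=""
fi
echo "resume args: '$RESUME_ARGS'"
# srun python train.py $RESUME_ARGS --epochs 100
```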
🧠 One-Sentence Intuition
Use interactive mode to think, batch mode to execute, and array mode to scale — three tools, one mental model.
🎬 Learn More
Prof. Pat Schloss’s HPC intro — covers the economics of shared compute, interactive/batch/array modes end-to-end with live demos. A great 30-minute deep dive if you’re new to Slurm.
Part of my HPC & ML workflow notes series. Next: setting up conda environments on a cluster, checkpointing with PyTorch Lightning, and multi-GPU training with --ntasks on Slurm.