SLURM Cheat Sheet — The Commands You Actually Use on HPC
Stop grepping man pages mid-training — the essential SLURM commands for ML researchers, from job submission to live log tailing.
You’ve got a model to train and a GPU queue to fight. Here’s every SLURM command worth knowing — no fluff, just the ones that actually matter.
📡 Monitoring Jobs — Stop Polling Manually
The worst habit: SSHing in every 5 minutes to run `squeue`. Use `watch` instead:

```bash
watch -n 60 squeue -u your_username
```

`watch` reruns the command in-place every N seconds. Your terminal becomes a live dashboard — no scrolling, no noise.
One-liner intuition:
`watch -n 60 squeue -u <user>` is your HPC heartbeat monitor.
🔁 Useful squeue Variations
```bash
squeue -u n2500633e                                           # your jobs only
squeue --format="%.10i %.9P %.30j %.8u %.8T %.10M %.6D %R"    # detailed columns
```
| Column flag | Shows |
|---|---|
| `%i` | Job ID |
| `%P` | Partition |
| `%j` | Job name |
| `%T` | State (PENDING / RUNNING / FAILED) |
| `%M` | Elapsed runtime |
| `%R` | Reason (why pending, or node assigned) |
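If you want a quick tally rather than a scroll of rows, the format output pipes cleanly into `awk`. A minimal sketch — the here-doc sample stands in for live `squeue` output, so the job IDs and states below are made up:

```bash
# Count jobs per state from squeue-style output.
# The here-doc is a stand-in for: squeue -u <user> --format="%.10i %.8T" -h
awk '{count[$2]++} END {for (s in count) print s, count[s]}' <<'EOF'
101 RUNNING
102 PENDING
103 PENDING
104 RUNNING
105 COMPLETING
EOF
```

On the sample above, this prints one line per state with its count (e.g. `PENDING 2`), which is often all you need to decide whether to keep waiting.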
🚀 Submitting & Cancelling Jobs
Submit
```bash
sbatch job.sh
```

Your `job.sh` should have `#SBATCH` headers for resources — GPUs, memory, time limit, output file.
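For reference, a minimal `job.sh` might look like the sketch below — the partition name, resource numbers, and `train.py` entry point are all placeholders you'd swap for your cluster's actual values:

```bash
#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --partition=gpu              # placeholder: use your cluster's partition name
#SBATCH --gres=gpu:1                 # request 1 GPU
#SBATCH --mem=32G                    # host RAM for the job
#SBATCH --time=24:00:00              # hard wall-clock limit (job is killed after this)
#SBATCH --output=logs/train_%j.out   # %j expands to the job ID

python train.py                      # placeholder for your training entry point
```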
Cancel
```bash
scancel <job_id>        # cancel one job
scancel -u n2500633e    # cancel ALL your jobs (nuclear option)
```
`scancel -u` will kill everything in your queue — pending and running. Use with care during long training runs.
Pause & Resume
```bash
scontrol hold <job_id>      # freeze a pending job (won't start)
scontrol release <job_id>   # unfreeze it
```
Useful when you want to edit your script before a job starts.
🔍 Inspecting Jobs — During & After
While Running
```bash
scontrol show job <job_id>                   # full job details
scontrol show job <job_id> | grep -i gres    # GPU allocation
tail -f slurm-<job_id>.out                   # live training log
```

`tail -f` is probably your most-used command during training. It watches output as it streams — no need to SSH into the compute node.
Name your output file explicitly with `#SBATCH --output=logs/train_%j.out` so logs don't pile up in your home dir. `%j` gets replaced with the job ID automatically.
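One related gotcha: SLURM generally won't create the `logs/` directory for you — if the path in `--output` doesn't exist, the log can be silently lost. Creating it before submitting is cheap insurance:

```bash
mkdir -p logs    # -p: no error if it already exists
sbatch job.sh
```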
After Finishing
```bash
sacct -j <job_id>    # job history + exit code
sacct -j <job_id> --format=JobID,State,ExitCode,Elapsed,MaxRSS
```

`sacct` is your post-mortem tool. If a job failed silently, check ExitCode and State here.
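The `ExitCode` field is actually two numbers, `code:signal` — `1:0` means the program exited with status 1, while something like `0:9` means it was killed by signal 9. A small sketch for splitting it, using a hard-coded value in place of live `sacct` output:

```bash
# Split sacct's ExitCode field ("code:signal") into its parts.
# "1:0" here stands in for: sacct -j <job_id> -n -P -o ExitCode
exitcode="1:0"
code=${exitcode%%:*}    # part before the colon: program exit code
sig=${exitcode##*:}     # part after the colon: terminating signal (0 = none)
echo "exit=$code signal=$sig"
```

A non-zero signal with a zero exit code is the classic signature of an external kill — the time limit or the OOM killer, not your own code.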
| State | Meaning |
|---|---|
| COMPLETED | Finished cleanly ✅ |
| FAILED | Non-zero exit code ❌ |
| TIMEOUT | Hit the time limit ⏰ |
| OUT_OF_MEMORY | RAM/GPU memory exhausted 💥 |
| CANCELLED | Manually killed |
🖥️ Cluster Resource Check
```bash
sinfo                    # all nodes + states
sinfo -o "%n %C %m %G"   # node: CPUs (allocated/idle/other/total), memory, GPUs
```
`sinfo` tells you which partitions have free GPUs before you submit. Saves you from waiting 3 hours in a queue for a partition that's fully booked.
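The `%C` column packs four numbers into one field (`allocated/idle/other/total`), which makes it easy to filter for nodes that still have idle CPUs. A sketch using a here-doc sample in place of live `sinfo` output (the node names are made up):

```bash
# Print nodes with idle CPUs, from sinfo's %C column (allocated/idle/other/total).
# Splitting on both spaces and slashes makes $3 the idle count.
# The here-doc stands in for: sinfo -h -o "%n %C"
awk -F'[ /]' '$3 > 0 {print $1, $3, "idle CPUs"}' <<'EOF'
node01 64/0/0/64
node02 32/32/0/64
node03 0/64/0/64
EOF
```

On the sample above, only `node02` and `node03` are printed — `node01` is fully allocated.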
🧠 Putting It Together — Typical FYP Workflow
```mermaid
flowchart LR
    A[Edit job.sh] --> B[sbatch job.sh]
    B --> C{watch squeue}
    C -->|PENDING| C
    C -->|RUNNING| D[tail -f slurm-ID.out]
    D --> E{Job done?}
    E -->|FAILED| F[sacct -j ID\ncheck ExitCode]
    E -->|COMPLETED| G[Analyze results]
    F --> A
    style A fill:#4A90D9,color:#fff
    style B fill:#5BA85A,color:#fff
    style C fill:#E8A838,color:#fff
    style D fill:#9B6EBD,color:#fff
    style G fill:#5BA85A,color:#fff
    style F fill:#D9534F,color:#fff
```
One-Sentence Intuition:
`watch` to monitor, `tail -f` to observe, `sacct` to diagnose — three commands cover 90% of your HPC day.
Part of HPC & tooling notes. Next: writing efficient SBATCH scripts for multi-GPU training.