
SLURM Cheat Sheet — The Commands You Actually Use on HPC

Stop grepping man pages mid-training — the essential SLURM commands for ML researchers, from job submission to live log tailing.


You’ve got a model to train and a GPU queue to fight. Here’s every SLURM command worth knowing — no fluff, just the ones that actually matter.


📡 Monitoring Jobs — Stop Polling Manually

The worst habit: SSHing in every 5 minutes to run `squeue` by hand. Use `watch` instead:

```shell
watch -n 60 squeue -u your_username   # refresh every 60 seconds
```

watch reruns the command in-place every N seconds. Your terminal becomes a live dashboard — no scrolling, no noise.


One-liner intuition: watch -n 60 squeue -u <user> is your HPC heartbeat monitor.


🔁 Useful squeue Variations

```shell
squeue -u n2500633e                # your jobs only
squeue --format="%.10i %.9P %.30j %.8u %.8T %.10M %.6D %R"  # detailed columns
```
| Column flag | Shows |
|---|---|
| `%i` | Job ID |
| `%P` | Partition |
| `%j` | Job name |
| `%T` | State (PENDING / RUNNING / FAILED) |
| `%M` | Elapsed runtime |
| `%R` | Reason (why pending, or node assigned) |

🚀 Submitting & Cancelling Jobs


Submit

```shell
sbatch job.sh
```

Your job.sh should have #SBATCH headers for resources — GPUs, memory, time limit, output file.
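A minimal `job.sh` sketch is below. The partition name, GPU count, and training command are placeholders — check your cluster's docs for valid partition names and resource limits:

```shell
#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --partition=gpu             # placeholder: pick a partition from `sinfo`
#SBATCH --gres=gpu:1                # request 1 GPU
#SBATCH --mem=32G                   # RAM for the job
#SBATCH --time=24:00:00             # wall-clock limit (HH:MM:SS)
#SBATCH --output=logs/train_%j.out  # %j expands to the job ID

python train.py                     # placeholder for your actual training command
```

Everything after the `#SBATCH` headers runs on the allocated compute node, so activate environments and load modules there, not in your login shell.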


Cancel

```shell
scancel <job_id>          # cancel one job
scancel -u n2500633e      # cancel ALL your jobs (nuclear option)
```

scancel -u will kill everything in your queue — pending and running. Use with care during long training runs.


Pause & Resume

```shell
scontrol hold <job_id>      # freeze a pending job (won't start)
scontrol release <job_id>   # unfreeze it
```

Useful when you want to edit your script before a job starts.
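You can also tweak a held job's settings before releasing it — a hypothetical example with a made-up job ID (note that on many clusters only admins may *increase* a time limit; regular users can typically only lower it):

```shell
scontrol hold 123456                               # keep the job from starting
scontrol update JobId=123456 TimeLimit=12:00:00    # adjust the wall-clock limit
scontrol release 123456                            # back into the queue
```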


🔍 Inspecting Jobs — During & After


While Running

```shell
scontrol show job <job_id>                  # full job details
scontrol show job <job_id> | grep -i gres   # GPU allocation
tail -f slurm-<job_id>.out                  # live training log
```

tail -f is probably your most-used command during training. It streams the log as it's written, so there's no need to SSH into the compute node.

Name your output file explicitly with `#SBATCH --output=logs/train_%j.out` so logs don’t pile up in your home dir. `%j` gets replaced with the job ID automatically.


After Finishing

```shell
sacct -j <job_id>                          # job history + exit code
sacct -j <job_id> --format=JobID,State,ExitCode,Elapsed,MaxRSS
```

sacct is your post-mortem tool. If a job failed silently, check ExitCode and State here.

| State | Meaning |
|---|---|
| COMPLETED | Finished cleanly ✅ |
| FAILED | Non-zero exit code ❌ |
| TIMEOUT | Hit the time limit ⏰ |
| OUT_OF_MEMORY | RAM/GPU memory exhausted 💥 |
| CANCELLED | Manually killed |
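sacct also takes a time window, which is handy for reviewing everything that went wrong this week. A sketch, assuming SLURM's relative start-time syntax (`now-7days`) is available on your cluster:

```shell
# failed/timed-out/OOM jobs from the last 7 days; JobName%30 widens that column
sacct -u $USER -S now-7days --state=FAILED,TIMEOUT,OUT_OF_MEMORY \
      --format=JobID,JobName%30,State,ExitCode,Elapsed
```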

🖥️ Cluster Resource Check

```shell
sinfo                              # all nodes + states
sinfo -o "%n %C %m %G"             # node, CPUs (alloc/idle/other/total), memory, GPUs
```

sinfo tells you which partitions have free GPUs before you submit. Saves you from waiting 3 hours in a queue for a partition that’s fully booked.
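To narrow sinfo down to nodes you could actually land on right now, filter by node state (the partition name `gpu` below is a placeholder — use one from your own cluster):

```shell
sinfo -t idle,mix -o "%P %n %C %G"   # fully idle or partially allocated nodes
sinfo -p gpu -t idle                 # idle nodes in one partition only
```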


🧠 Putting It Together — Typical FYP Workflow

```mermaid
flowchart LR
    A[Edit job.sh] --> B[sbatch job.sh]
    B --> C{watch squeue}
    C -->|PENDING| C
    C -->|RUNNING| D[tail -f slurm-ID.out]
    D --> E{Job done?}
    E -->|FAILED| F[sacct -j ID<br/>check ExitCode]
    E -->|COMPLETED| G[Analyze results]
    F --> A

    style A fill:#4A90D9,color:#fff
    style B fill:#5BA85A,color:#fff
    style C fill:#E8A838,color:#fff
    style D fill:#9B6EBD,color:#fff
    style G fill:#5BA85A,color:#fff
    style F fill:#D9534F,color:#fff
```

One-Sentence Intuition: watch to monitor, tail -f to observe, sacct to diagnose — three commands cover 90% of your HPC day.


Part of HPC & tooling notes. Next: writing efficient SBATCH scripts for multi-GPU training.

This post is licensed under CC BY 4.0 by the author.