
SLURM Cheat Sheet — The Commands You Actually Use on HPC

Stop grepping man pages mid-training — the essential SLURM commands for ML researchers, from job submission to live log tailing.


You’ve got a model to train and a GPU queue to fight. Here’s every SLURM command worth knowing — no fluff, just the ones that actually matter.


📡 Monitoring Jobs — Stop Polling Manually

The worst habit: SSHing in every 5 minutes to run `squeue` by hand. Use `watch` instead:

```shell
watch -n 60 squeue -u your_username   # refresh every 60 seconds
```

watch reruns the command in-place every N seconds. Your terminal becomes a live dashboard — no scrolling, no noise.


One-liner intuition: watch -n 60 squeue -u <user> is your HPC heartbeat monitor.


🔁 Useful squeue Variations

```shell
squeue -u n2500633e                # your jobs only
squeue --format="%.10i %.9P %.30j %.8u %.8T %.10M %.6D %R"  # detailed columns
```
| Column flag | Shows |
|---|---|
| `%i` | Job ID |
| `%P` | Partition |
| `%j` | Job name |
| `%T` | State (PENDING / RUNNING / FAILED) |
| `%M` | Elapsed runtime |
| `%R` | Reason (why pending, or node assigned) |

🚀 Submitting & Cancelling Jobs


Submit

```shell
sbatch job.sh
```

Your job.sh should have #SBATCH headers for resources — GPUs, memory, time limit, output file.
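A minimal `job.sh` sketch is below. The partition name, GPU count, and training command are placeholders — check your cluster's docs for valid partition names and resource limits:

```shell
#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --partition=gpu             # placeholder: pick a partition from `sinfo`
#SBATCH --gres=gpu:1                # request 1 GPU
#SBATCH --mem=32G                   # RAM for the job
#SBATCH --time=24:00:00             # wall-clock limit (HH:MM:SS)
#SBATCH --output=logs/train_%j.out  # %j expands to the job ID

python train.py                     # placeholder for your actual training command
```

Everything after the `#SBATCH` headers runs on the allocated compute node, so activate environments and load modules there, not in your login shell.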


Cancel

```shell
scancel <job_id>          # cancel one job
scancel -u n2500633e      # cancel ALL your jobs (nuclear option)
```

scancel -u will kill everything in your queue — pending and running. Use with care during long training runs.


Pause & Resume

```shell
scontrol hold <job_id>      # freeze a pending job (won't start)
scontrol release <job_id>   # unfreeze it
```

Useful when you want to edit your script before a job starts.
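You can also tweak a held job's settings before releasing it — a hypothetical example with a made-up job ID (note that on many clusters only admins may *increase* a time limit; regular users can typically only lower it):

```shell
scontrol hold 123456                               # keep the job from starting
scontrol update JobId=123456 TimeLimit=12:00:00    # adjust the wall-clock limit
scontrol release 123456                            # back into the queue
```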


🔍 Inspecting Jobs — During & After


While Running

```shell
scontrol show job <job_id>                  # full job details
scontrol show job <job_id> | grep -i gres   # GPU allocation
tail -f slurm-<job_id>.out                  # live training log
```

tail -f is probably your most-used command during training. It streams the log as it's written, so there's no need to SSH into the compute node.

Name your output file explicitly with `#SBATCH --output=logs/train_%j.out` so logs don’t pile up in your home dir. `%j` gets replaced with the job ID automatically.


After Finishing

```shell
sacct -j <job_id>                          # job history + exit code
sacct -j <job_id> --format=JobID,State,ExitCode,Elapsed,MaxRSS
```

sacct is your post-mortem tool. If a job failed silently, check ExitCode and State here.

| State | Meaning |
|---|---|
| COMPLETED | Finished cleanly ✅ |
| FAILED | Non-zero exit code ❌ |
| TIMEOUT | Hit the time limit ⏰ |
| OUT_OF_MEMORY | RAM/GPU memory exhausted 💥 |
| CANCELLED | Manually killed |
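sacct also takes a time window, which is handy for reviewing everything that went wrong this week. A sketch, assuming SLURM's relative start-time syntax (`now-7days`) is available on your cluster:

```shell
# failed/timed-out/OOM jobs from the last 7 days; JobName%30 widens that column
sacct -u $USER -S now-7days --state=FAILED,TIMEOUT,OUT_OF_MEMORY \
      --format=JobID,JobName%30,State,ExitCode,Elapsed
```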

🖥️ Cluster Resource Check

```shell
sinfo                              # all nodes + states
sinfo -o "%n %C %m %G"             # node, CPUs (alloc/idle/other/total), memory, GPUs
```

sinfo tells you which partitions have free GPUs before you submit. Saves you from waiting 3 hours in a queue for a partition that’s fully booked.
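To narrow sinfo down to nodes you could actually land on right now, filter by node state (the partition name `gpu` below is a placeholder — use one from your own cluster):

```shell
sinfo -t idle,mix -o "%P %n %C %G"   # fully idle or partially allocated nodes
sinfo -p gpu -t idle                 # idle nodes in one partition only
```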


🧠 Putting It Together — Typical FYP Workflow

```mermaid
flowchart LR
    A[Edit job.sh] --> B[sbatch job.sh]
    B --> C{watch squeue}
    C -->|PENDING| C
    C -->|RUNNING| D[tail -f slurm-ID.out]
    D --> E{Job done?}
    E -->|FAILED| F[sacct -j ID<br/>check ExitCode]
    E -->|COMPLETED| G[Analyze results]
    F --> A

    style A fill:#4A90D9,color:#fff
    style B fill:#5BA85A,color:#fff
    style C fill:#E8A838,color:#fff
    style D fill:#9B6EBD,color:#fff
    style G fill:#5BA85A,color:#fff
    style F fill:#D9534F,color:#fff
```

One-Sentence Intuition: watch to monitor, tail -f to observe, sacct to diagnose — three commands cover 90% of your HPC day.


Part of HPC & tooling notes. Next: writing efficient SBATCH scripts for multi-GPU training.

This post is licensed under CC BY 4.0 by the author.