HPC & Server-Based Systems: A Practical Guide for Researchers

A concise breakdown of how high-performance computing clusters actually work — from node architecture and storage tiers to job schedulers, environment management, and scaling best practices.

If you’re running deep learning experiments, large-scale simulations, or any compute-intensive research, you’ll eventually outgrow your local machine and need to work on an HPC cluster. These systems look intimidating at first — but once you understand the underlying principles, they follow a clean and logical design.


🖥️ What is HPC?

High Performance Computing (HPC) refers to tightly coupled clusters of servers that collectively deliver far more compute power than any single machine. Performance is measured in FLOPS (Floating Point Operations Per Second) — most large supercomputers operate at petascale (10¹⁵ FLOPS), and the leading systems have reached exascale (10¹⁸ FLOPS).

The two fundamental building blocks of any HPC cluster:

| Node Type | Role |
| --- | --- |
| Login Node | Your entry point: for submitting jobs, editing scripts, and transferring files only |
| Compute Node | Where actual computation happens: CPU, GPU, or memory-optimised variants |

Never run heavy workloads on a login node. It’s a shared gateway — abusing it degrades the experience for every other user on the cluster.


🔧 Node Architecture

Modern HPC clusters offer several specialised compute node types, each tuned for different workloads:

| Node Type | Specs | Best For |
| --- | --- | --- |
| CPU Nodes | 128 cores, 512GB+ RAM | MPI-parallelised simulations, data processing |
| GPU Nodes | Multiple A100/H100 GPUs | Deep learning training |
| Large Memory Nodes | TB-range RAM | Genomics, graph analytics, in-memory workloads |
| High Frequency Nodes | Fewer cores, higher clock speeds | Latency-sensitive, sequential workloads |
| Visualisation Nodes | GPU-accelerated rendering | ParaView, VMD |

Nodes are connected via high-speed interconnects (e.g., 100G InfiniBand or HPE Slingshot) using Dragonfly topology — optimised for low latency and high bisection bandwidth at scale.


💾 Storage Hierarchy

HPC systems expose multiple storage tiers, each with different performance characteristics and retention policies:

| Tier | Mount Point | Speed | Quota | Retention |
| --- | --- | --- | --- | --- |
| Home | $HOME | Moderate | Small (~50GB) | Persistent |
| Scratch (Lustre) | $HOME/scratch | Very fast (parallel I/O) | Large (~100TB) | Auto-purged (30 days) |
| Project | /home/project/&lt;id&gt; | Moderate | By allocation | Project lifetime |
| Local NVMe | /raid | Fastest | Node-local only | Wiped after job ends |

The Golden Workflow

```mermaid
flowchart LR
    A["📁 Project Dir\n(input data)"]:::store --> B["⚡ Scratch\n(fast I/O)"]:::fast
    B --> C["🖥️ Compute Job\n(read/write)"]:::compute
    C --> B
    B --> D["📁 Project Dir\n(persist results)"]:::store

    classDef store fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef fast fill:#D97B4A,stroke:#9e5430,color:#fff
    classDef compute fill:#5BA85A,stroke:#3a6e39,color:#fff
```

  1. Copy input data from project → scratch before the job
  2. Run the job reading and writing to scratch (Lustre delivers high parallel I/O throughput)
  3. Move important outputs back to project storage when done
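Put together, the three steps above form a stage-in / compute / stage-out pattern. A minimal sketch follows; the project and scratch paths are stand-ins (temp directories are used here so the sketch runs anywhere), and the `tr` command stands in for your real computation:

```shell
#!/bin/bash
# Sketch of the stage-in / compute / stage-out workflow.
# PROJECT and SCRATCH are hypothetical stand-ins for your cluster's
# project and scratch roots.
PROJECT="${TMPDIR:-/tmp}/project-demo"
SCRATCH="${TMPDIR:-/tmp}/scratch-demo-workflow"
mkdir -p "$PROJECT/input" "$SCRATCH"
echo "sample" > "$PROJECT/input/data.txt"

# 1. Stage input data onto fast scratch storage
cp -r "$PROJECT/input" "$SCRATCH/"

# 2. Run the computation, reading and writing on scratch only
mkdir -p "$SCRATCH/results"
tr a-z A-Z < "$SCRATCH/input/data.txt" > "$SCRATCH/results/data.out"

# 3. Persist the outputs back to project storage before scratch is purged
cp -r "$SCRATCH/results" "$PROJECT/"
```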

Use lfs setstripe on Lustre for large files to stripe data across multiple OSTs (Object Storage Targets) and maximise read/write bandwidth.
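For example (the stripe count of 8 and the directory name are illustrative; tune the count to your file sizes and the number of OSTs on your system):

```shell
# Stripe new files created in this directory across 8 OSTs (count illustrative)
lfs setstripe -c 8 $HOME/scratch/large_dataset

# Verify the striping layout that was applied
lfs getstripe $HOME/scratch/large_dataset
```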


🔌 Accessing the Cluster

Users connect to the login node via SSH from any SSH client:

```shell
ssh username@cluster.hostname.edu.sg
```

  • Windows: MobaXterm, PuTTY
  • Mac/Linux: built-in terminal

Most clusters require either a direct institutional network connection or a VPN to reach the login node.
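An entry in `~/.ssh/config` saves retyping the full hostname on every connection (the alias, hostname, and username below are placeholders):

```
# ~/.ssh/config — host alias and hostname are placeholders
Host hpc
    HostName cluster.hostname.edu.sg
    User username
    ServerAliveInterval 60
```

With this in place, `ssh hpc` is all you need; `ServerAliveInterval` also helps keep idle connections from timing out.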


📤 File Transfer

Getting data onto and off the cluster is a routine task. Three main methods:

SCP (Secure Copy)

Simple and secure — uses SSH under the hood.

```shell
# Local → Cluster
scp /path/to/localfile username@cluster:/path/to/destination

# Cluster → Local
scp username@cluster:/path/to/remotefile /local/destination
```

Rsync

Better for large or repeated transfers — only syncs changed files.

```shell
rsync -avz /path/to/source username@cluster:/path/to/destination
```

  • -a — archive mode (preserves permissions, timestamps)
  • -v — verbose
  • -z — compress during transfer

FileZilla (GUI)

Drag-and-drop interface over SFTP — ideal for users who prefer a visual workflow.

For transferring large files between storage tiers within the cluster (e.g., GPFS → Lustre), always do it inside a compute job — not on the login node.
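A minimal transfer job might look like the sketch below; the resource request and paths are illustrative, so adapt them to your cluster's conventions:

```shell
#!/bin/bash
#PBS -N DataTransfer
#PBS -l select=1:ncpus=2:mem=8GB
#PBS -l walltime=02:00:00
#PBS -q normal
#PBS -P <Project-ID>

# Run the heavy copy on a compute node, not the login node
rsync -a /home/project/<id>/dataset/ $HOME/scratch/dataset/
```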


📺 The screen Command — Keep Sessions Alive

When you’re running something interactive on the cluster, you don’t want it to die if your SSH connection drops. That’s where GNU Screen comes in — a terminal multiplexer that runs persistent shell sessions that survive disconnection.

Key Commands

```shell
# Start a new named session
screen -S mysession

# Detach from current session (keeps running in background)
Ctrl + A, then D

# List all running screen sessions
screen -ls

# Reattach to a session
screen -r mysession
```

Screen is especially useful for interactive compute jobs that you want to resume later without the job being killed when your terminal closes.


💳 Service Units (Compute Credits)

HPC clusters are shared facilities — compute time is tracked and budgeted using Service Units (SUs).

| Resource | Cost |
| --- | --- |
| CPU job | 1 SU per core-hour |
| GPU job | 64 SU per GPU-hour (CPU cores on GPU nodes are free) |

Example — CPU job:

1 node × 128 cores × 2 hours = 256 SUs

Example — GPU job:

2 nodes × 4 GPUs × 3 hours × 64 = 1,536 SUs

Request only what you need. Over-requesting locks up credits unnecessarily — the scheduler blocks SUs upfront when a job is submitted, and unused SUs are only refunded after completion.
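The arithmetic is simple enough to sanity-check before submitting. A throwaway pair of helpers (hypothetical, using the 1 SU per core-hour and 64 SU per GPU-hour rates above) reproduces the two examples:

```shell
# Hypothetical helpers using the rates above:
# 1 SU per core-hour, 64 SU per GPU-hour
su_cpu() { echo $(( $1 * $2 * $3 )); }        # nodes x cores x hours
su_gpu() { echo $(( $1 * $2 * $3 * 64 )); }   # nodes x gpus x hours x 64

su_cpu 1 128 2   # -> 256
su_gpu 2 4 3     # -> 1536
```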


📋 The Job Scheduler

On HPC systems, you don’t run programs directly on compute nodes. You submit job requests to a scheduler (e.g., PBS Pro, SLURM), which queues them and allocates resources when available.

```mermaid
flowchart TD
    U["👤 User\nqsub script.sh"]:::user --> PBS["📋 PBS Server\nassigns Job ID"]:::pbs
    PBS --> SCHED["🧮 PBS Scheduler\nfinds available nodes"]:::pbs
    SCHED --> EXEC["🖥️ Compute Node\nexecutes script"]:::compute
    EXEC --> OUT["📄 Output files\nreturned to user"]:::output

    classDef user fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef pbs fill:#9B6EBD,stroke:#6b4785,color:#fff
    classDef compute fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef output fill:#D97B4A,stroke:#9e5430,color:#fff
```

Sample Batch Job Script

```shell
#!/bin/bash
#PBS -N MyExperiment
#PBS -l select=1:ncpus=128:mem=440GB
#PBS -q normal
#PBS -l walltime=24:00:00
#PBS -j oe
#PBS -P <Project-ID>

cd $PBS_O_WORKDIR || exit $?

python train.py --config config.yaml
```
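A GPU variant of the same script differs mainly in the resource request. The sketch below follows common PBS Pro conventions (the `ngpus` resource name and the queue/walltime are assumptions; check your cluster's documentation):

```shell
#!/bin/bash
#PBS -N MyGPUExperiment
#PBS -l select=1:ncpus=16:ngpus=1:mem=110GB
#PBS -q ai
#PBS -l walltime=02:00:00
#PBS -j oe
#PBS -P <Project-ID>

cd $PBS_O_WORKDIR || exit $?

module load pytorch/2.1   # module name illustrative
python train.py --config config.yaml
```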

Key PBS Commands

| Command | Action |
| --- | --- |
| qsub &lt;script&gt; | Submit a batch job |
| qsub -I -l select=1:ncpus=16:mem=48G -l walltime=01:00:00 | Start an interactive session |
| qstat -answ | View job status |
| qdel &lt;job-id&gt; | Cancel a job |
| myqstat | Simplified job status |
| myusage | Check compute credit consumption |

Job Queues

| Queue | Max Walltime | Use Case |
| --- | --- | --- |
| normal | 24 hours | Standard jobs |
| qlong | 120 hours | Long-running simulations |
| ai | 2 hours | Interactive GPU jobs |

🧰 Environment Management

Environment Modules

```shell
module avail              # List all available software
module load pytorch/2.1   # Load a specific version
module list               # See currently loaded modules
module swap gcc/11 gcc/12 # Swap versions
```

Containers (Singularity)

For maximum reproducibility, Singularity containers package your entire software stack:

  • No root/sudo required — safe for shared HPC environments
  • Docker images can be converted to Singularity format (singularity pull docker://...)
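Typical usage looks like this (the image name is illustrative; `--nv` exposes the host's NVIDIA driver inside the container for GPU jobs):

```shell
# Convert a Docker Hub image into a local Singularity image file (name illustrative)
singularity pull docker://pytorch/pytorch:latest

# Run a command inside the container; --nv passes through the host GPU driver
singularity exec --nv pytorch_latest.sif python train.py --config config.yaml
```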

Conda / Miniforge

```shell
conda create -n myenv python=3.10
conda activate myenv
pip install torch torchvision
```

📈 Scaling Your Workload

The biggest mistake new HPC users make is requesting too many resources before knowing if their code can use them.

  1. Start small — run on 1 CPU node or 1 GPU card first
  2. Profile utilisation: top for CPU, nvidia-smi for GPU. Aim for >90% utilisation before scaling
  3. Scale incrementally — 2, 4, 8, 16 nodes; measure speedup at each step
  4. Check parallel efficiency — ideal scaling is linear, but real-world efficiency drops at higher node counts due to communication overhead
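Step 4 is worth making concrete: parallel efficiency is E = T1 / (N × TN), where T1 is the single-node runtime and TN the runtime on N nodes. A quick throwaway calculation (the timings below are made up for illustration):

```shell
# Parallel efficiency E = T1 / (N * TN); timings are hypothetical
efficiency() {
    awk -v t1="$1" -v n="$2" -v tn="$3" 'BEGIN { printf "%.2f\n", t1 / (n * tn) }'
}

efficiency 100 2 52   # -> 0.96  (near-linear scaling at 2 nodes)
efficiency 100 8 16   # -> 0.78  (communication overhead showing at 8 nodes)
```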

GPU Best Practices

  • Load the GPU version of your application (e.g., pytorch with CUDA support)
  • Batch size matters — small batches underutilise GPU memory bandwidth
  • Use mixed precision (bfloat16) to double throughput on modern A100s

✅ Do’s and Don’ts

| ✅ Do | ❌ Don’t |
| --- | --- |
| Install apps in Home or Project directory | Run compute jobs on login nodes |
| Use Scratch for active job I/O | Install apps on Scratch (it gets purged) |
| Move output back to Project when done | Copy-paste PBS scripts from Windows (hidden characters break things) |
| Use find + rm for targeted cleanup | rm -rf * on large directories |
| Use environment modules for software | Hardcode paths in .bashrc |
| Request resources proportional to workload | Over-request to “be safe” |
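The find + rm recommendation deserves a sketch. The demo below uses throwaway temp directories rather than real cluster paths, and deletes only files matching a pattern while leaving real outputs untouched:

```shell
# Targeted cleanup demo: delete only *.tmp files, keep real outputs.
# SCRATCH_DIR is a throwaway demo directory, not a real cluster path.
SCRATCH_DIR="${TMPDIR:-/tmp}/scratch-demo"
mkdir -p "$SCRATCH_DIR"
touch "$SCRATCH_DIR/old.tmp" "$SCRATCH_DIR/results.csv"

# -type f restricts to regular files; add -mtime +14 on a real
# cluster to target only files older than two weeks
find "$SCRATCH_DIR" -name '*.tmp' -type f -delete

ls "$SCRATCH_DIR"   # only results.csv remains
```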

💡 Summary

HPC clusters are shared infrastructure governed by three core design principles:

  1. Isolation — users can’t interfere with each other (via job scheduler, storage quotas, module environments, containers)
  2. Tiered resources — fast-but-ephemeral scratch vs. slow-but-persistent home/project storage
  3. Explicit resource contracts — you declare exactly what you need; the scheduler enforces it

Master these, and you can work effectively on any HPC system — whether it’s a national supercomputer, a university cluster, or cloud-based HPC (AWS ParallelCluster, Google Cloud HPC Toolkit).


Based on HPC workshop materials and best practices for server-based computing environments. Next: GPU profiling with nvidia-smi and PyTorch Profiler.

This post is licensed under CC BY 4.0 by the author.