HPC & Server-Based Systems: A Practical Guide for Researchers
A concise breakdown of how high-performance computing clusters actually work — from node architecture and storage tiers to job schedulers, environment management, and scaling best practices.
If you’re running deep learning experiments, large-scale simulations, or any compute-intensive research, you’ll eventually outgrow your local machine and need to work on an HPC cluster. These systems look intimidating at first — but once you understand the underlying principles, they follow a clean and logical design.
🖥️ What is HPC?
High Performance Computing (HPC) refers to tightly coupled clusters of servers that collectively deliver far more compute power than any single machine. Performance is measured in FLOPS (Floating Point Operations Per Second) — modern supercomputers operate at the petascale (10¹⁵ FLOPS), with the largest systems now reaching the exascale (10¹⁸ FLOPS).
The two fundamental building blocks of any HPC cluster:
| Node Type | Role |
|---|---|
| Login Node | Your entry point — for submitting jobs, editing scripts, and transferring files only |
| Compute Node | Where actual computation happens — CPU, GPU, or memory-optimised variants |
Never run heavy workloads on a login node. It’s a shared gateway — abusing it degrades the experience for every other user on the cluster.
🔧 Node Architecture
Modern HPC clusters offer several specialised compute node types, each tuned for different workloads:
| Node Type | Specs | Best For |
|---|---|---|
| CPU Nodes | 128 cores, 512GB+ RAM | MPI-parallelised simulations, data processing |
| GPU Nodes | Multi A100/H100 | Deep learning training |
| Large Memory Nodes | TB-range RAM | Genomics, graph analytics, in-memory workloads |
| High Frequency Nodes | Fewer cores, high clock | Latency-sensitive, sequential workloads |
| Visualisation Nodes | GPU-accelerated rendering | Paraview, VMD |
Nodes are connected via high-speed interconnects (e.g., 100G InfiniBand or HPE Slingshot) using Dragonfly topology — optimised for low latency and high bisection bandwidth at scale.
💾 Storage Hierarchy
HPC systems expose multiple storage tiers, each with different performance characteristics and retention policies:
| Tier | Mount Point | Speed | Quota | Retention |
|---|---|---|---|---|
| Home | $HOME | Moderate | Small (~50GB) | Persistent |
| Scratch (Lustre) | $HOME/scratch | Very Fast (parallel I/O) | Large (~100TB) | Auto-purged (30 days) |
| Project | /home/project/<id> | Moderate | By allocation | Project lifetime |
| Local NVMe | /raid | Fastest | Node-local only | Wiped after job ends |
The Golden Workflow
```mermaid
flowchart LR
A["📁 Project Dir\n(input data)"]:::store --> B["⚡ Scratch\n(fast I/O)"]:::fast
B --> C["🖥️ Compute Job\n(read/write)"]:::compute
C --> B
B --> D["📁 Project Dir\n(persist results)"]:::store
classDef store fill:#4A90D9,stroke:#2c5f8a,color:#fff
classDef fast fill:#D97B4A,stroke:#9e5430,color:#fff
classDef compute fill:#5BA85A,stroke:#3a6e39,color:#fff
```
- Copy input data from project → scratch before the job
- Run the job reading and writing to scratch (Lustre delivers high parallel I/O throughput)
- Move important outputs back to project storage when done
Use `lfs setstripe` on Lustre for large files to stripe data across multiple OSTs (Object Storage Targets) and maximise read/write bandwidth.
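For example — the stripe count and path here are illustrative; run `lfs df` to see how many OSTs your filesystem actually exposes:

```shell
# Stripe new files in a scratch results directory across 8 OSTs
lfs setstripe -c 8 $HOME/scratch/results

# Inspect the striping layout that files in the directory will inherit
lfs getstripe $HOME/scratch/results
```

Striping helps most for large sequential reads/writes; leave small files at the default stripe count to avoid extra metadata overhead.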
🔌 Accessing the Cluster
Users connect to the login node via SSH from any SSH client:
```shell
ssh username@cluster.hostname.edu.sg
```
- Windows: MobaXterm, PuTTY
- Mac/Linux: built-in terminal
Most clusters require either a direct institutional network connection or a VPN to reach the login node.
📤 File Transfer
Getting data onto and off the cluster is a routine task. Three main methods:
SCP (Secure Copy)
Simple and secure — uses SSH under the hood.
```shell
# Local → Cluster
scp /path/to/localfile username@cluster:/path/to/destination

# Cluster → Local
scp username@cluster:/path/to/remotefile /local/destination
```
Rsync
Better for large or repeated transfers — only syncs changed files.
```shell
rsync -avz /path/to/source username@cluster:/path/to/destination
```
- `-a` — archive mode (preserves permissions, timestamps)
- `-v` — verbose
- `-z` — compress during transfer
FileZilla (GUI)
Drag-and-drop interface over SFTP — ideal for users who prefer a visual workflow.
For transferring large files between storage tiers within the cluster (e.g., GPFS → Lustre), always do it inside a compute job — not on the login node.
📺 The screen Command — Keep Sessions Alive
When you’re running something interactive on the cluster, you don’t want it to die if your SSH connection drops. That’s where GNU Screen comes in — a terminal multiplexer that runs persistent shell sessions that survive disconnection.
Key Commands
```shell
# Start a new named session
screen -S mysession

# Detach from the current session (keeps running in background):
#   press Ctrl + A, then D

# List all running screen sessions
screen -ls

# Reattach to a session
screen -r mysession
```
Screen is especially useful for interactive compute jobs that you want to resume later without the job being killed when your terminal closes.
💳 Service Units (Compute Credits)
HPC clusters are shared facilities — compute time is tracked and budgeted using Service Units (SUs).
| Resource | Cost |
|---|---|
| CPU job | 1 SU per core-hour |
| GPU job | 64 SU per GPU-hour (CPU cores on GPU nodes are free) |
Example — CPU job:
1 node × 128 cores × 2 hours = 256 SUs
Example — GPU job:
2 nodes × 4 GPUs × 3 hours × 64 = 1,536 SUs
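Using the rates above, a quick back-of-envelope SU estimator can be sketched as two shell functions (the rates are this article's example values — substitute your site's actual billing policy):

```shell
# Estimate SU cost before submitting, using the example rates above:
# 1 SU per core-hour (CPU), 64 SU per GPU-hour (GPU).
cpu_su() { echo $(( $1 * $2 * $3 )); }       # nodes × cores_per_node × hours
gpu_su() { echo $(( $1 * $2 * $3 * 64 )); }  # nodes × gpus_per_node × hours × 64

cpu_su 1 128 2   # → 256
gpu_su 2 4 3     # → 1536
```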
Request only what you need. Over-requesting locks up credits unnecessarily — the scheduler blocks SUs upfront when a job is submitted, and unused SUs are only refunded after completion.
📋 The Job Scheduler
On HPC systems, you don’t run programs directly on compute nodes. You submit job requests to a scheduler (e.g., PBS Pro, SLURM), which queues them and allocates resources when available.
```mermaid
flowchart TD
U["👤 User\nqsub script.sh"]:::user --> PBS["📋 PBS Server\nassigns Job ID"]:::pbs
PBS --> SCHED["🧮 PBS Scheduler\nfinds available nodes"]:::pbs
SCHED --> EXEC["🖥️ Compute Node\nexecutes script"]:::compute
EXEC --> OUT["📄 Output files\nreturned to user"]:::output
classDef user fill:#4A90D9,stroke:#2c5f8a,color:#fff
classDef pbs fill:#9B6EBD,stroke:#6b4785,color:#fff
classDef compute fill:#5BA85A,stroke:#3a6e39,color:#fff
classDef output fill:#D97B4A,stroke:#9e5430,color:#fff
```
Sample Batch Job Script
```shell
#!/bin/bash
#PBS -N MyExperiment
#PBS -l select=1:ncpus=128:mem=440GB
#PBS -q normal
#PBS -l walltime=24:00:00
#PBS -j oe
#PBS -P <Project-ID>

cd $PBS_O_WORKDIR || exit $?
python train.py --config config.yaml
```
Key PBS Commands
| Command | Action |
|---|---|
| `qsub <script>` | Submit a batch job |
| `qsub -I -l select=1:ncpus=16:mem=48G -l walltime=01:00:00` | Start interactive session |
| `qstat -answ` | View job status |
| `qdel <job-id>` | Cancel a job |
| `myqstat` | Simplified job status |
| `myusage` | Check compute credit consumption |
Job Queues
| Queue | Max Walltime | Use Case |
|---|---|---|
| `normal` | 24 hours | Standard jobs |
| `qlong` | 120 hours | Long-running simulations |
| `ai` | 2 hours | Interactive GPU jobs |
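A GPU job script follows the same pattern as the CPU example above. Treat this as a sketch — the `ngpus` resource keyword, queue name, and module version are site-dependent, so check your cluster's documentation:

```shell
#!/bin/bash
#PBS -N GpuExperiment
#PBS -q ai
#PBS -l select=1:ncpus=16:ngpus=1:mem=96GB   # ngpus syntax varies by site
#PBS -l walltime=02:00:00
#PBS -j oe
#PBS -P <Project-ID>

cd $PBS_O_WORKDIR || exit $?
module load pytorch/2.1      # module name/version illustrative
python train.py --config config.yaml
```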
🧰 Environment Management
Environment Modules
```shell
module avail               # List all available software
module load pytorch/2.1    # Load a specific version
module list                # See currently loaded modules
module swap gcc/11 gcc/12  # Swap versions
```
Containers (Singularity)
For maximum reproducibility, Singularity containers package your entire software stack:
- No root/sudo required — safe for shared HPC environments
- Docker images can be converted to Singularity format (`singularity pull docker://...`)
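A typical flow looks like this — the image name and tag are illustrative, and the commands assume Singularity (or Apptainer) is available on your cluster:

```shell
# Pull a Docker image and convert it to a Singularity image file (SIF)
singularity pull pytorch.sif docker://pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime

# Run a command inside the container; --nv binds the host GPU drivers
singularity exec --nv pytorch.sif python train.py --config config.yaml
```

The `--nv` flag is what makes host NVIDIA GPUs visible inside the container — without it, CUDA code will not find a device.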
Conda / Miniforge
```shell
conda create -n myenv python=3.10
conda activate myenv
pip install torch torchvision
```
📈 Scaling Your Workload
The biggest mistake new HPC users make is requesting too many resources before knowing if their code can use them.
- Start small — run on 1 CPU node or 1 GPU card first
- Profile utilisation — `top` for CPU, `nvidia-smi` for GPU. Aim for >90% utilisation before scaling
- Scale incrementally — 2, 4, 8, 16 nodes; measure speedup at each step
- Check parallel efficiency — ideal scaling is linear, but real-world efficiency drops at higher node counts due to communication overhead
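The speedup check in the steps above is simple arithmetic: speedup is the 1-node runtime divided by the n-node runtime, and parallel efficiency is speedup divided by n. A small helper (runtimes here are illustrative):

```shell
# efficiency t1 tn n → parallel efficiency (%), where t1 is the 1-node runtime
# and tn the runtime on n nodes; speedup = t1/tn, efficiency = speedup/n.
efficiency() {
  awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN { printf "%.0f\n", (t1 / tn) / n * 100 }'
}

efficiency 100 30 4   # 100s → 30s on 4 nodes: 3.3× speedup → 83 (% efficiency)
```

If efficiency drops well below ~70–80% at the next node count, the extra nodes are mostly paying for communication overhead, not computation.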
GPU Best Practices
- Load the GPU version of your application (e.g., `pytorch` with CUDA support)
- Batch size matters — small batches underutilise GPU memory bandwidth
- Use mixed precision (`bfloat16`) to double throughput on modern A100s
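To confirm the GPU is actually being fed, a periodic utilisation query from a second terminal (or logged from inside the job script) is enough; the 5-second interval is arbitrary:

```shell
# Print GPU utilisation and memory use every 5 seconds (Ctrl+C to stop)
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 5
```

Sustained low `utilization.gpu` usually means an input-pipeline bottleneck or an undersized batch, not a slow GPU.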
✅ Do’s and Don’ts
| ✅ Do | ❌ Don’t |
|---|---|
| Install apps in Home or Project directory | Run compute jobs on login nodes |
| Use Scratch for active job I/O | Install apps on Scratch (it gets purged) |
| Move output back to Project when done | Copy-paste PBS scripts from Windows (hidden chars break things) |
| Use `find` + `rm` for targeted cleanup | `rm -rf *` on large directories |
| Use environment modules for software | Hardcode paths in .bashrc |
| Request resources proportional to workload | Over-request to “be safe” |
💡 Summary
HPC clusters are shared infrastructure governed by three core design principles:
- Isolation — users can’t interfere with each other (via job scheduler, storage quotas, module environments, containers)
- Tiered resources — fast-but-ephemeral scratch vs. slow-but-persistent home/project storage
- Explicit resource contracts — you declare exactly what you need; the scheduler enforces it
Master these, and you can work effectively on any HPC system — whether it’s a national supercomputer, a university cluster, or cloud-based HPC (AWS ParallelCluster, Google Cloud HPC Toolkit).
Based on HPC workshop materials and best practices for server-based computing environments. Next: GPU profiling with nvidia-smi and PyTorch Profiler.