HPC & Server-Based Systems: A Practical Guide for Researchers

A concise breakdown of how high-performance computing clusters actually work — from node architecture and storage tiers to job schedulers, environment management, and scaling best practices.

If you’re running deep learning experiments, large-scale simulations, or any compute-intensive research, you’ll eventually outgrow your local machine and need to work on an HPC cluster. These systems look intimidating at first — but once you understand the underlying principles, they follow a clean and logical design.


🖥️ What is HPC?

High Performance Computing (HPC) refers to tightly coupled clusters of servers that collectively deliver far more compute power than any single machine. Performance is measured in FLOPS (Floating Point Operations Per Second) — most large supercomputers operate at petascale (10¹⁵ FLOPS), and the leading systems have reached exascale (10¹⁸ FLOPS).

The two fundamental building blocks of any HPC cluster:

| Node Type | Role |
| --- | --- |
| Login Node | Your entry point: for submitting jobs, editing scripts, and transferring files only |
| Compute Node | Where actual computation happens: CPU, GPU, or memory-optimised variants |

Never run heavy workloads on a login node. It’s a shared gateway — abusing it degrades the experience for every other user on the cluster.


🔧 Node Architecture

Modern HPC clusters offer several specialised compute node types, each tuned for different workloads:

| Node Type | Specs | Best For |
| --- | --- | --- |
| CPU Nodes | 128 cores, 512GB+ RAM | MPI-parallelised simulations, data processing |
| GPU Nodes | Multiple A100/H100 GPUs | Deep learning training |
| Large Memory Nodes | TB-range RAM | Genomics, graph analytics, in-memory workloads |
| High Frequency Nodes | Fewer cores, higher clock speeds | Latency-sensitive, sequential workloads |
| Visualisation Nodes | GPU-accelerated rendering | ParaView, VMD |

Nodes are connected via high-speed interconnects (e.g., 100G InfiniBand or HPE Slingshot) using Dragonfly topology — optimised for low latency and high bisection bandwidth at scale.


💾 Storage Hierarchy

HPC systems expose multiple storage tiers, each with different performance characteristics and retention policies:

| Tier | Mount Point | Speed | Quota | Retention |
| --- | --- | --- | --- | --- |
| Home | $HOME | Moderate | Small (~50GB) | Persistent |
| Scratch (Lustre) | $HOME/scratch | Very fast (parallel I/O) | Large (~100TB) | Auto-purged (30 days) |
| Project | /home/project/&lt;id&gt; | Moderate | By allocation | Project lifetime |
| Local NVMe | /raid | Fastest | Node-local only | Wiped after job ends |

The Golden Workflow

```mermaid
flowchart LR
    A["📁 Project Dir\n(input data)"]:::store --> B["⚡ Scratch\n(fast I/O)"]:::fast
    B --> C["🖥️ Compute Job\n(read/write)"]:::compute
    C --> B
    B --> D["📁 Project Dir\n(persist results)"]:::store

    classDef store fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef fast fill:#D97B4A,stroke:#9e5430,color:#fff
    classDef compute fill:#5BA85A,stroke:#3a6e39,color:#fff
```

  1. Copy input data from project → scratch before the job
  2. Run the job reading and writing to scratch (Lustre delivers high parallel I/O throughput)
  3. Move important outputs back to project storage when done
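Put together, the three steps above form a stage-in / compute / stage-out pattern. A minimal sketch follows; the project and scratch paths are stand-ins (temp directories are used here so the sketch runs anywhere), and the `tr` command stands in for your real computation:

```shell
#!/bin/bash
# Sketch of the stage-in / compute / stage-out workflow.
# PROJECT and SCRATCH are hypothetical stand-ins for your cluster's
# project and scratch roots.
PROJECT="${TMPDIR:-/tmp}/project-demo"
SCRATCH="${TMPDIR:-/tmp}/scratch-demo-workflow"
mkdir -p "$PROJECT/input" "$SCRATCH"
echo "sample" > "$PROJECT/input/data.txt"

# 1. Stage input data onto fast scratch storage
cp -r "$PROJECT/input" "$SCRATCH/"

# 2. Run the computation, reading and writing on scratch only
mkdir -p "$SCRATCH/results"
tr a-z A-Z < "$SCRATCH/input/data.txt" > "$SCRATCH/results/data.out"

# 3. Persist the outputs back to project storage before scratch is purged
cp -r "$SCRATCH/results" "$PROJECT/"
```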

Use lfs setstripe on Lustre for large files to stripe data across multiple OSTs (Object Storage Targets) and maximise read/write bandwidth.
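For example (the stripe count of 8 and the directory name are illustrative; tune the count to your file sizes and the number of OSTs on your system):

```shell
# Stripe new files created in this directory across 8 OSTs (count illustrative)
lfs setstripe -c 8 $HOME/scratch/large_dataset

# Verify the striping layout that was applied
lfs getstripe $HOME/scratch/large_dataset
```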


🔌 Accessing the Cluster

Users connect to the login node via SSH from any SSH client:

```shell
ssh username@cluster.hostname.edu.sg
```

  • Windows: MobaXterm, PuTTY
  • Mac/Linux: built-in terminal

Most clusters require either a direct institutional network connection or a VPN to reach the login node.
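An entry in `~/.ssh/config` saves retyping the full hostname on every connection (the alias, hostname, and username below are placeholders):

```
# ~/.ssh/config — host alias and hostname are placeholders
Host hpc
    HostName cluster.hostname.edu.sg
    User username
    ServerAliveInterval 60
```

With this in place, `ssh hpc` is all you need; `ServerAliveInterval` also helps keep idle connections from timing out.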


📤 File Transfer

Getting data onto and off the cluster is a routine task. Three main methods:

SCP (Secure Copy)

Simple and secure — uses SSH under the hood.

```shell
# Local → Cluster
scp /path/to/localfile username@cluster:/path/to/destination

# Cluster → Local
scp username@cluster:/path/to/remotefile /local/destination
```

Rsync

Better for large or repeated transfers — only syncs changed files.

```shell
rsync -avz /path/to/source username@cluster:/path/to/destination
```

  • -a — archive mode (preserves permissions, timestamps)
  • -v — verbose
  • -z — compress during transfer

FileZilla (GUI)

Drag-and-drop interface over SFTP — ideal for users who prefer a visual workflow.

For transferring large files between storage tiers within the cluster (e.g., GPFS → Lustre), always do it inside a compute job — not on the login node.
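A minimal transfer job might look like the sketch below; the resource request and paths are illustrative, so adapt them to your cluster's conventions:

```shell
#!/bin/bash
#PBS -N DataTransfer
#PBS -l select=1:ncpus=2:mem=8GB
#PBS -l walltime=02:00:00
#PBS -q normal
#PBS -P <Project-ID>

# Run the heavy copy on a compute node, not the login node
rsync -a /home/project/<id>/dataset/ $HOME/scratch/dataset/
```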


📺 The screen Command — Keep Sessions Alive

When you’re running something interactive on the cluster, you don’t want it to die if your SSH connection drops. That’s where GNU Screen comes in — a terminal multiplexer that runs persistent shell sessions that survive disconnection.

Key Commands

```shell
# Start a new named session
screen -S mysession

# Detach from current session (keeps running in background)
Ctrl + A, then D

# List all running screen sessions
screen -ls

# Reattach to a session
screen -r mysession
```

Screen is especially useful for interactive compute jobs that you want to resume later without the job being killed when your terminal closes.


💳 Service Units (Compute Credits)

HPC clusters are shared facilities — compute time is tracked and budgeted using Service Units (SUs).

| Resource | Cost |
| --- | --- |
| CPU job | 1 SU per core-hour |
| GPU job | 64 SU per GPU-hour (CPU cores on GPU nodes are free) |

Example — CPU job:

1 node × 128 cores × 2 hours = 256 SUs

Example — GPU job:

2 nodes × 4 GPUs × 3 hours × 64 = 1,536 SUs

Request only what you need. Over-requesting locks up credits unnecessarily — the scheduler blocks SUs upfront when a job is submitted, and unused SUs are only refunded after completion.
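The arithmetic is simple enough to sanity-check before submitting. A throwaway pair of helpers (hypothetical, using the 1 SU per core-hour and 64 SU per GPU-hour rates above) reproduces the two examples:

```shell
# Hypothetical helpers using the rates above:
# 1 SU per core-hour, 64 SU per GPU-hour
su_cpu() { echo $(( $1 * $2 * $3 )); }        # nodes x cores x hours
su_gpu() { echo $(( $1 * $2 * $3 * 64 )); }   # nodes x gpus x hours x 64

su_cpu 1 128 2   # -> 256
su_gpu 2 4 3     # -> 1536
```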


📋 The Job Scheduler

On HPC systems, you don’t run programs directly on compute nodes. You submit job requests to a scheduler (e.g., PBS Pro, SLURM), which queues them and allocates resources when available.

```mermaid
flowchart TD
    U["👤 User\nqsub script.sh"]:::user --> PBS["📋 PBS Server\nassigns Job ID"]:::pbs
    PBS --> SCHED["🧮 PBS Scheduler\nfinds available nodes"]:::pbs
    SCHED --> EXEC["🖥️ Compute Node\nexecutes script"]:::compute
    EXEC --> OUT["📄 Output files\nreturned to user"]:::output

    classDef user fill:#4A90D9,stroke:#2c5f8a,color:#fff
    classDef pbs fill:#9B6EBD,stroke:#6b4785,color:#fff
    classDef compute fill:#5BA85A,stroke:#3a6e39,color:#fff
    classDef output fill:#D97B4A,stroke:#9e5430,color:#fff
```

Sample Batch Job Script

```shell
#!/bin/bash
#PBS -N MyExperiment
#PBS -l select=1:ncpus=128:mem=440GB
#PBS -q normal
#PBS -l walltime=24:00:00
#PBS -j oe
#PBS -P <Project-ID>

cd $PBS_O_WORKDIR || exit $?

python train.py --config config.yaml
```
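A GPU variant of the same script differs mainly in the resource request. The sketch below follows common PBS Pro conventions (the `ngpus` resource name and the queue/walltime are assumptions; check your cluster's documentation):

```shell
#!/bin/bash
#PBS -N MyGPUExperiment
#PBS -l select=1:ncpus=16:ngpus=1:mem=110GB
#PBS -q ai
#PBS -l walltime=02:00:00
#PBS -j oe
#PBS -P <Project-ID>

cd $PBS_O_WORKDIR || exit $?

module load pytorch/2.1   # module name illustrative
python train.py --config config.yaml
```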

Key PBS Commands

| Command | Action |
| --- | --- |
| qsub &lt;script&gt; | Submit a batch job |
| qsub -I -l select=1:ncpus=16:mem=48G -l walltime=01:00:00 | Start an interactive session |
| qstat -answ | View job status |
| qdel &lt;job-id&gt; | Cancel a job |
| myqstat | Simplified job status |
| myusage | Check compute credit consumption |

Job Queues

| Queue | Max Walltime | Use Case |
| --- | --- | --- |
| normal | 24 hours | Standard jobs |
| qlong | 120 hours | Long-running simulations |
| ai | 2 hours | Interactive GPU jobs |

🧰 Environment Management

Environment Modules

```shell
module avail              # List all available software
module load pytorch/2.1   # Load a specific version
module list               # See currently loaded modules
module swap gcc/11 gcc/12 # Swap versions
```

Containers (Singularity)

For maximum reproducibility, Singularity containers package your entire software stack:

  • No root/sudo required — safe for shared HPC environments
  • Docker images can be converted to Singularity format (singularity pull docker://...)
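Typical usage looks like this (the image name is illustrative; `--nv` exposes the host's NVIDIA driver inside the container for GPU jobs):

```shell
# Convert a Docker Hub image into a local Singularity image file (name illustrative)
singularity pull docker://pytorch/pytorch:latest

# Run a command inside the container; --nv passes through the host GPU driver
singularity exec --nv pytorch_latest.sif python train.py --config config.yaml
```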

Conda / Miniforge

```shell
conda create -n myenv python=3.10
conda activate myenv
pip install torch torchvision
```

📈 Scaling Your Workload

The biggest mistake new HPC users make is requesting too many resources before knowing if their code can use them.

  1. Start small — run on 1 CPU node or 1 GPU card first
  2. Profile utilisation: top for CPU, nvidia-smi for GPU. Aim for >90% utilisation before scaling
  3. Scale incrementally — 2, 4, 8, 16 nodes; measure speedup at each step
  4. Check parallel efficiency — ideal scaling is linear, but real-world efficiency drops at higher node counts due to communication overhead
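Step 4 is worth making concrete: parallel efficiency is E = T1 / (N × TN), where T1 is the single-node runtime and TN the runtime on N nodes. A quick throwaway calculation (the timings below are made up for illustration):

```shell
# Parallel efficiency E = T1 / (N * TN); timings are hypothetical
efficiency() {
    awk -v t1="$1" -v n="$2" -v tn="$3" 'BEGIN { printf "%.2f\n", t1 / (n * tn) }'
}

efficiency 100 2 52   # -> 0.96  (near-linear scaling at 2 nodes)
efficiency 100 8 16   # -> 0.78  (communication overhead showing at 8 nodes)
```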

GPU Best Practices

  • Load the GPU version of your application (e.g., pytorch with CUDA support)
  • Batch size matters — small batches underutilise GPU memory bandwidth
  • Use mixed precision (bfloat16) to double throughput on modern A100s

✅ Do’s and Don’ts

| ✅ Do | ❌ Don’t |
| --- | --- |
| Install apps in Home or Project directory | Run compute jobs on login nodes |
| Use Scratch for active job I/O | Install apps on Scratch (it gets purged) |
| Move output back to Project when done | Copy-paste PBS scripts from Windows (hidden characters break things) |
| Use find + rm for targeted cleanup | rm -rf * on large directories |
| Use environment modules for software | Hardcode paths in .bashrc |
| Request resources proportional to workload | Over-request to “be safe” |
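The find + rm recommendation deserves a sketch. The demo below uses throwaway temp directories rather than real cluster paths, and deletes only files matching a pattern while leaving real outputs untouched:

```shell
# Targeted cleanup demo: delete only *.tmp files, keep real outputs.
# SCRATCH_DIR is a throwaway demo directory, not a real cluster path.
SCRATCH_DIR="${TMPDIR:-/tmp}/scratch-demo"
mkdir -p "$SCRATCH_DIR"
touch "$SCRATCH_DIR/old.tmp" "$SCRATCH_DIR/results.csv"

# -type f restricts to regular files; add -mtime +14 on a real
# cluster to target only files older than two weeks
find "$SCRATCH_DIR" -name '*.tmp' -type f -delete

ls "$SCRATCH_DIR"   # only results.csv remains
```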

💡 Summary

HPC clusters are shared infrastructure governed by three core design principles:

  1. Isolation — users can’t interfere with each other (via job scheduler, storage quotas, module environments, containers)
  2. Tiered resources — fast-but-ephemeral scratch vs. slow-but-persistent home/project storage
  3. Explicit resource contracts — you declare exactly what you need; the scheduler enforces it

Master these, and you can work effectively on any HPC system — whether it’s a national supercomputer, a university cluster, or cloud-based HPC (AWS ParallelCluster, Google Cloud HPC Toolkit).


Based on HPC workshop materials and best practices for server-based computing environments. Next: GPU profiling with nvidia-smi and PyTorch Profiler.

This post is licensed under CC BY 4.0 by the author.