HPC & GPU Programming Interview Guide

SIMD (Single Instruction, Multiple Data)

Instruction Set	Width	Elements/op	Use Case
SSE	128-bit	4 float / 2 double	Baseline SIMD
AVX/AVX2	256-bit	8 float / 4 double	General HPC
AVX-512	512-bit	16 float / 8 double	Intel Xeon, some AMD

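#include <immintrin.h>                 // AVX intrinsics; compile with -mavx
alignas(32) float data[16] = {};       // _mm256_load_ps requires 32-byte-aligned addresses
__m256 a = _mm256_load_ps(data);       // elements 0-7
__m256 b = _mm256_load_ps(data + 8);   // elements 8-15
__m256 c = _mm256_add_ps(a, b);        // 8 float additions in one instruction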

→ src/systems/hpc_gpu/simd_vectorization.cpp

Cache Hierarchy

Level	Size	Latency	Shared
L1	32-64 KB	~1 ns (4 cycles)	Per core
L2	256 KB-1 MB	~4 ns (12 cycles)	Per core
L3	8-64 MB	~12 ns (40 cycles)	Per socket
DRAM	GBs	~100 ns	All cores

Cache line: 64 bytes on most x86-64 CPUs. Once the working set no longer fits in cache, memory access patterns can matter as much as asymptotic complexity.
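To make the access-pattern point concrete, here is a minimal sketch that sums the same row-major matrix in two traversal orders. The function names and the flat std::vector layout are illustrative assumptions, not code from this guide's repository. Both functions return the same value; only the order in which memory is touched differs.

```cpp
#include <cstddef>
#include <vector>

// Row-major traversal: consecutive iterations touch consecutive addresses,
// so every float in a 64-byte cache line is used before the next line is fetched.
double sum_row_major(const std::vector<float>& m, std::size_t rows, std::size_t cols) {
    double sum = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            sum += m[r * cols + c];
    return sum;
}

// Column-major traversal of the same row-major buffer: each step jumps
// cols * sizeof(float) bytes, so for wide matrices nearly every access
// lands on a different cache line.
double sum_col_major(const std::vector<float>& m, std::size_t rows, std::size_t cols) {
    double sum = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            sum += m[r * cols + c];
    return sum;
}
```

Timing both on a matrix much larger than L3 usually shows the row-major version ahead by a wide margin, although the exact ratio depends on the hardware and the prefetcher.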

NUMA Awareness

Non-Uniform Memory Access: memory latency depends on which socket owns it
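numactl --membind=0: allocate memory only from node 0 (add --cpunodebind=0 so the threads run on that same node)
numa_alloc_local(size) from libnuma: allocate on the calling thread's node from code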
Thread pinning: pthread_setaffinity_np or taskset
Crossing NUMA boundary: ~1.5-2x latency penalty
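Thread pinning can be sketched as follows, assuming Linux with glibc. The helper name pin_to_first_allowed_cpu is hypothetical. The sketch calls sched_setaffinity, the system call that pthread_setaffinity_np is built on, because it applies to the calling thread and needs no extra link flags. It queries the current affinity mask first, so it also behaves inside containers that expose only a subset of CPUs.

```cpp
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros (Linux/glibc)

// Pins the calling thread to the first CPU it is currently allowed to run on.
// Returns that CPU id, or -1 if the affinity mask could not be read or set.
int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t target;
            CPU_ZERO(&target);
            CPU_SET(cpu, &target);
            // pid 0 means "the calling thread".
            if (sched_setaffinity(0, sizeof(target), &target) != 0) return -1;
            return cpu;
        }
    }
    return -1;
}
```

Pinning a thread before it allocates and first touches its data helps keep that data on the local node, since Linux by default places a page on the node of the thread that first writes to it.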

Vectorization Tips

Align data to cache lines (alignas(64))
Avoid branches in loops (use masks instead)
#pragma omp simd hints on loops; #pragma omp declare simd or __attribute__((vector)) (Intel compilers) on functions
Check with -fopt-info-vec-missed (GCC) or -Rpass-missed=loop-vectorize (Clang); -Rpass=loop-vectorize lists the loops that were vectorized
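The tips above can be combined in a small sketch: a reduction over positive elements written as a select rather than an if statement, with an OpenMP SIMD hint. The function name sum_positive is an assumption for illustration. The pragma only takes effect with -fopenmp or -fopenmp-simd; without those flags the compiler ignores it and the loop is still correct.

```cpp
#include <cstddef>

// Sums only the positive elements. The ternary expresses a per-element select,
// which a vectorizing compiler can typically lower to a compare plus a masked
// blend instead of a data-dependent jump.
float sum_positive(const float* x, std::size_t n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += (x[i] > 0.0f) ? x[i] : 0.0f;
    }
    return sum;
}
```

A caller would typically pass a buffer declared with alignas(64), for example `alignas(64) float x[1024];`, and then confirm with the compiler reports listed above that the loop was actually vectorized.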

OpenMP Basics

#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; ++i) sum += data[i];

Key directives: parallel, for, critical, atomic. Key clauses: reduction, schedule(dynamic)
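As a sketch of when schedule(dynamic) helps, the loop below does uneven work per iteration, since trial division costs more for larger numbers. The function name count_primes and the chunk size of 64 are illustrative assumptions. Compiled without -fopenmp the pragma is ignored and the function runs serially with the same result.

```cpp
// Counts primes below n by trial division. Iterations get more expensive as i
// grows, so schedule(dynamic, 64) hands out chunks of 64 iterations on demand
// instead of splitting the range into equal static pieces.
int count_primes(int n) {
    int count = 0;
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:count)
    for (int i = 2; i < n; ++i) {
        bool prime = true;
        for (int d = 2; d * d <= i; ++d) {
            if (i % d == 0) { prime = false; break; }
        }
        count += prime ? 1 : 0;  // each thread adds to its private copy of count
    }
    return count;
}
```

The reduction clause gives every thread a private counter and combines them at the end, which avoids the contention that a critical or atomic update on a shared counter would add.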

MPI Basics

Distributed memory: each process has its own address space
MPI_Send/Recv (point-to-point), MPI_Bcast, MPI_Reduce (collective)
Hybrid: MPI between nodes + OpenMP within node

GPU / CUDA

Memory Model

Memory	Scope	Speed	Size
Registers	Per thread	Fastest	Limited
Shared	Per block	~5 ns	48-96 KB (architecture-dependent)
Global (DRAM)	All threads	~400 cycles	GBs
Constant	All threads (read-only)	Cached	64 KB

Thread Hierarchy

Thread → Warp (32 threads) → Block → Grid
All threads in a warp execute the same instruction at the same time (SIMT)
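The index arithmetic behind this hierarchy can be sketched on the host in plain C++. The helper names below are hypothetical; they mirror the CUDA expression blockIdx.x * blockDim.x + threadIdx.x for a 1D launch rather than calling any CUDA API.

```cpp
constexpr int kWarpSize = 32;  // warp width, as stated above

// Number of blocks needed to cover n elements (ceiling division).
constexpr int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Global 1D index of a thread, mirroring blockIdx.x * blockDim.x + threadIdx.x.
constexpr int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Warp number and lane (position within the warp) of a thread inside its block.
constexpr int warp_of(int thread_idx) { return thread_idx / kWarpSize; }
constexpr int lane_of(int thread_idx) { return thread_idx % kWarpSize; }
```

For 1000 elements and 256 threads per block this yields 4 blocks, which is 1024 threads, so a kernel needs a bounds check that makes the 24 surplus threads do nothing.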

Warp Divergence

if/else within a warp → both branches executed, results masked
Fix: structure code so warps take same path, or use predication
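The effect of branch structure can be illustrated with a host-side model, which is a sketch and not GPU code: it counts the warps in which threads disagree on a predicate. The function name divergent_warps is an assumption for illustration.

```cpp
#include <functional>

constexpr int kWarpSize = 32;

// Counts warps whose threads disagree on takes_branch(thread_id).
// In such a warp the hardware runs both sides of the if/else with lanes masked off.
int divergent_warps(int num_threads, const std::function<bool(int)>& takes_branch) {
    int divergent = 0;
    for (int base = 0; base < num_threads; base += kWarpSize) {
        bool any_true = false;
        bool any_false = false;
        for (int t = base; t < base + kWarpSize && t < num_threads; ++t) {
            if (takes_branch(t)) any_true = true;
            else any_false = true;
        }
        if (any_true && any_false) ++divergent;
    }
    return divergent;
}
```

Branching on `t % 2` splits every warp, while branching on `(t / 32) % 2` sends whole warps down one path, so the second form does the same amount of useful work without divergence.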

Memory Coalescing

Adjacent threads access adjacent memory → single transaction
Strided or random access → multiple transactions (huge slowdown)
AoS → SoA transformation for GPU
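A minimal host-side sketch of the AoS to SoA transformation, together with a rough model of why it helps coalescing. The struct and function names are hypothetical, and the 128-byte segment size is an assumed transaction granularity chosen for illustration; real hardware may use different sizes.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Array of structures: the x fields of neighbouring elements are 16 bytes apart.
struct ParticleAoS { float x, y, z, w; };

// Structure of arrays: each field is stored contiguously.
struct ParticlesSoA {
    std::vector<float> x, y, z, w;
};

ParticlesSoA to_soa(const std::vector<ParticleAoS>& in) {
    ParticlesSoA out;
    out.x.reserve(in.size());
    out.y.reserve(in.size());
    out.z.reserve(in.size());
    out.w.reserve(in.size());
    for (const ParticleAoS& p : in) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
        out.w.push_back(p.w);
    }
    return out;
}

// Counts distinct memory segments touched when the 32 threads of a warp each
// read one float located stride_bytes apart, starting at a segment-aligned address.
// segment_bytes = 128 is an assumption of this model.
int segments_touched(std::size_t stride_bytes, std::size_t segment_bytes = 128) {
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < 32; ++t) {
        segments.insert((t * stride_bytes) / segment_bytes);
    }
    return static_cast<int>(segments.size());
}
```

Under this model, a warp reading the x field from the AoS layout (16-byte stride) touches 4 segments, while the same read from the SoA layout (4-byte stride) touches 1.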

Common Interview Questions

Question	Key Answer
Why is GPU fast for ML?	Thousands of cores, optimized for data-parallel ops
Warp divergence	Threads in warp take different branches → serialization
Shared memory bank conflicts	32 banks; threads in a warp hitting different addresses in the same bank → accesses serialized (same address → broadcast, no conflict)
CPU cache vs GPU shared memory	CPU cache is automatic; GPU shared memory is explicitly managed
How to detect NUMA issues?	numastat, perf stat with NUMA events, latency profiling
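The bank-conflict answer can be backed by a small host-side model. It is a sketch under stated assumptions: 32 banks, consecutive 4-byte words mapped to consecutive banks, and threads that read the same word served by a single broadcast. The function name conflict_degree is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <set>

// Worst-case number of serialized accesses when the 32 threads of a warp each
// read the 4-byte word at index t * stride_words in shared memory.
int conflict_degree(std::size_t stride_words) {
    constexpr std::size_t kBanks = 32;
    std::array<std::set<std::size_t>, kBanks> words_per_bank;
    for (std::size_t t = 0; t < 32; ++t) {
        std::size_t word = t * stride_words;
        // Distinct words in one bank must be served one after another;
        // duplicates collapse in the set, modelling the broadcast case.
        words_per_bank[word % kBanks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : words_per_bank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}
```

In this model a stride of 32 words puts every thread on the same bank (32-way conflict), while a stride of 33 spreads them across all banks again, which is why padding a 32-column shared-memory tile to 33 columns is a common fix.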