Skip to content

Lock-Free Concurrency Primitives

Overview

Pre-allocated, cache-line-aligned lock-free data structures for inter-thread communication. All primitives operate without locks, mutexes, or syscalls using only atomic operations with explicit memory ordering (C11 <stdatomic.h>).

┌─────────────────────────────────────────────────────────────────────┐
│                     Producer Thread (Core 2)                          │
│                            │                                         │
│              ┌─────────────▼─────────────────┐                       │
│              │     SPSC Ring Buffer           │                       │
│              │  [head ──────── tail]          │                       │
│              │  release ←─── acquire          │                       │
│              └─────────────┬─────────────────┘                       │
│                            │                                         │
│                     Consumer Thread (Core 3)                          │
│                                                                      │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │  Memory Pool (MPMC freelist):                                  │   │
│  │  alloc() ──► CAS pop from freelist head                       │   │
│  │  free()  ──► CAS push to freelist head                        │   │
│  │  ABA prevention via generation counter                         │   │
│  └───────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Modules

Module Status Pattern Latency
Ring Buffer ✅ Implemented SPSC, memcpy-based < 10 ns
SPSC Queue ✅ Implemented SPSC, zero-copy claim/publish < 5 ns
Memory Pool ✅ Implemented MPMC, CAS freelist < 20 ns

Design Principles

Memory Ordering

/* Producer: write data, then RELEASE (publish) the index */
memcpy(slot, data, size);
atomic_store_explicit(&head, head + 1, memory_order_release);

/* Consumer: ACQUIRE the index, then read data */
uint64_t h = atomic_load_explicit(&head, memory_order_acquire);
memcpy(data, slot, size);
Ordering Meaning Use Case
relaxed No ordering guarantee Statistics counters, local reads
acquire Reads after this see writes before the paired release Consumer loading producer's index
release Writes before this are visible after the paired acquire Producer publishing new data
acq_rel Both acquire and release CAS operations (alloc/free)

Cache-Line Isolation

┌───────────────── Cache Line 0 (64 bytes) ─────────────────┐
│  _Atomic uint64_t head;                                     │
│  uint8_t padding[56];                                       │
├───────────────── Cache Line 1 (64 bytes) ─────────────────┤
│  _Atomic uint64_t tail;                                     │
│  uint8_t padding[56];                                       │
└─────────────────────────────────────────────────────────────┘

Without isolation: writing head invalidates tail's cache line
    → consumer stalls on MESI invalidation (~50-100 ns penalty)
With isolation: each index on its own line
    → no false sharing, sub-10ns operations

Power-of-2 Sizing

/* Modulo operation (SLOW division on critical path): */
index = sequence % capacity;

/* Bitwise AND mask (FAST single instruction): */
index = sequence & (capacity - 1);  /* Only works for power-of-2 */

Comparison

Feature Ring Buffer SPSC Queue Memory Pool
Concurrency SPSC SPSC MPMC
Copy model memcpy Zero-copy (pointer) N/A (alloc/free)
Hot-path ops 1 store + 1 load 1 store + 1 load (cached) 1 CAS
Latency ~8 ns ~4 ns ~15 ns
Use case Byte streams Typed messages Buffer management
ABA safe N/A (SPSC) N/A (SPSC) Yes (gen counter)

Build

# Build all concurrency modules
make -C src/concurrency/ring_buffer
make -C src/concurrency/spsc_queue
make -C src/concurrency/memory_pool

# Run benchmarks (pin to isolated cores)
taskset -c 2,3 ./build/ring_buffer_bench
taskset -c 2,3 ./build/spsc_queue_bench

Integration Pattern

/* Typical low-latency pipeline:
   NIC → raw_socket → ring_buffer → application → spsc_queue → TX */

#include "ring_buffer.h"
#include "spsc_queue.h"
#include "memory_pool.h"

/* Pre-allocate everything at startup */
static uint8_t rb_storage[4096 * 64]  __attribute__((aligned(64)));
static uint8_t sq_storage[1024 * 128] __attribute__((aligned(64)));
static uint8_t pool_storage[256 * 64] __attribute__((aligned(64)));

ring_buffer_t  rx_ring;
spsc_queue_t   tx_queue;
memory_pool_t  buf_pool;

void init(void) {
    ring_buffer_init(&rx_ring, rb_storage, 4096, 64);
    spsc_queue_init(&tx_queue, sq_storage, 1024, 128);
    memory_pool_init(&buf_pool, pool_storage, 256, 64);
}