Lock-Free Concurrency Primitives¶

Overview¶

Pre-allocated, cache-line-aligned lock-free data structures for inter-thread communication. All primitives operate without locks, mutexes, or syscalls using only atomic operations with explicit memory ordering (C11 <stdatomic.h>).

┌─────────────────────────────────────────────────────────────────────┐
│                     Producer Thread (Core 2)                          │
│                            │                                         │
│              ┌─────────────▼─────────────────┐                       │
│              │     SPSC Ring Buffer           │                       │
│              │  [head ──────── tail]          │                       │
│              │  release ←─── acquire          │                       │
│              └─────────────┬─────────────────┘                       │
│                            │                                         │
│                     Consumer Thread (Core 3)                          │
│                                                                      │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │  Memory Pool (MPMC freelist):                                  │   │
│  │  alloc() ──► CAS pop from freelist head                       │   │
│  │  free()  ──► CAS push to freelist head                        │   │
│  │  ABA prevention via generation counter                         │   │
│  └───────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Modules¶

Module	Status	Pattern	Latency
Ring Buffer	✅ Implemented	SPSC, memcpy-based	< 10 ns
SPSC Queue	✅ Implemented	SPSC, zero-copy claim/publish	< 5 ns
Memory Pool	✅ Implemented	MPMC, CAS freelist	< 20 ns

Design Principles¶

Memory Ordering¶

/* Producer: write data, then RELEASE (publish) the index */
memcpy(slot, data, size);
atomic_store_explicit(&head, head + 1, memory_order_release);

/* Consumer: ACQUIRE the index, then read data */
uint64_t h = atomic_load_explicit(&head, memory_order_acquire);
memcpy(data, slot, size);

Ordering	Meaning	Use Case
`relaxed`	No ordering guarantee	Statistics counters, local reads
`acquire`	Reads after this see writes before the paired release	Consumer loading producer's index
`release`	Writes before this are visible after the paired acquire	Producer publishing new data
`acq_rel`	Both acquire and release	CAS operations (alloc/free)

Cache-Line Isolation¶

┌───────────────── Cache Line 0 (64 bytes) ─────────────────┐
│  _Atomic uint64_t head;                                     │
│  uint8_t padding[56];                                       │
├───────────────── Cache Line 1 (64 bytes) ─────────────────┤
│  _Atomic uint64_t tail;                                     │
│  uint8_t padding[56];                                       │
└─────────────────────────────────────────────────────────────┘

Without isolation: writing head invalidates tail's cache line
    → consumer stalls on MESI invalidation (~50-100 ns penalty)
With isolation: each index on its own line
    → no false sharing, sub-10ns operations

Power-of-2 Sizing¶

/* Modulo operation (SLOW division on critical path): */
index = sequence % capacity;

/* Bitwise AND mask (FAST single instruction): */
index = sequence & (capacity - 1);  /* Only works for power-of-2 */

Comparison¶

Feature	Ring Buffer	SPSC Queue	Memory Pool
Concurrency	SPSC	SPSC	MPMC
Copy model	memcpy	Zero-copy (pointer)	N/A (alloc/free)
Hot-path ops	1 store + 1 load	1 store + 1 load (cached)	1 CAS
Latency	~8 ns	~4 ns	~15 ns
Use case	Byte streams	Typed messages	Buffer management
ABA safe	N/A (SPSC)	N/A (SPSC)	Yes (gen counter)

Build¶

# Build all concurrency modules
make -C src/concurrency/ring_buffer
make -C src/concurrency/spsc_queue
make -C src/concurrency/memory_pool

# Run benchmarks (pin to isolated cores)
taskset -c 2,3 ./build/ring_buffer_bench
taskset -c 2,3 ./build/spsc_queue_bench

Integration Pattern¶

/* Typical low-latency pipeline:
   NIC → raw_socket → ring_buffer → application → spsc_queue → TX */

#include "ring_buffer.h"
#include "spsc_queue.h"
#include "memory_pool.h"

/* Pre-allocate everything at startup */
static uint8_t rb_storage[4096 * 64]  __attribute__((aligned(64)));
static uint8_t sq_storage[1024 * 128] __attribute__((aligned(64)));
static uint8_t pool_storage[256 * 64] __attribute__((aligned(64)));

ring_buffer_t  rx_ring;
spsc_queue_t   tx_queue;
memory_pool_t  buf_pool;

void init(void) {
    ring_buffer_init(&rx_ring, rb_storage, 4096, 64);
    spsc_queue_init(&tx_queue, sq_storage, 1024, 128);
    memory_pool_init(&buf_pool, pool_storage, 256, 64);
}