Lock-Free Concurrency Primitives
Overview
Pre-allocated, cache-line-aligned lock-free data structures for inter-thread communication. All primitives operate without locks, mutexes, or syscalls using only atomic operations with explicit memory ordering (C11 <stdatomic.h>).
┌─────────────────────────────────────────────────────────────────────┐
│ Producer Thread (Core 2) │
│ │ │
│ ┌─────────────▼─────────────────┐ │
│ │ SPSC Ring Buffer │ │
│ │ [head ──────── tail] │ │
│ │ release ←─── acquire │ │
│ └─────────────┬─────────────────┘ │
│ │ │
│ Consumer Thread (Core 3) │
│ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Memory Pool (MPMC freelist): │ │
│ │ alloc() ──► CAS pop from freelist head │ │
│ │ free() ──► CAS push to freelist head │ │
│ │ ABA prevention via generation counter │ │
│ └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Modules
| Module | Status | Pattern | Latency |
| Ring Buffer | ✅ Implemented | SPSC, memcpy-based | < 10 ns |
| SPSC Queue | ✅ Implemented | SPSC, zero-copy claim/publish | < 5 ns |
| Memory Pool | ✅ Implemented | MPMC, CAS freelist | < 20 ns |
Design Principles
Memory Ordering
/* Producer: write data, then RELEASE (publish) the index */
memcpy(slot, data, size);
atomic_store_explicit(&head, head + 1, memory_order_release);
/* Consumer: ACQUIRE the index, then read data */
uint64_t h = atomic_load_explicit(&head, memory_order_acquire);
memcpy(data, slot, size);
| Ordering | Meaning | Use Case |
relaxed | No ordering guarantee | Statistics counters, local reads |
acquire | Reads after this see writes before the paired release | Consumer loading producer's index |
release | Writes before this are visible after the paired acquire | Producer publishing new data |
acq_rel | Both acquire and release | CAS operations (alloc/free) |
Cache-Line Isolation
┌───────────────── Cache Line 0 (64 bytes) ─────────────────┐
│ _Atomic uint64_t head; │
│ uint8_t padding[56]; │
├───────────────── Cache Line 1 (64 bytes) ─────────────────┤
│ _Atomic uint64_t tail; │
│ uint8_t padding[56]; │
└─────────────────────────────────────────────────────────────┘
Without isolation: writing head invalidates tail's cache line
→ consumer stalls on MESI invalidation (~50-100 ns penalty)
With isolation: each index on its own line
→ no false sharing, sub-10ns operations
Power-of-2 Sizing
/* Modulo operation (SLOW division on critical path): */
index = sequence % capacity;
/* Bitwise AND mask (FAST single instruction): */
index = sequence & (capacity - 1); /* Only works for power-of-2 */
Comparison
| Feature | Ring Buffer | SPSC Queue | Memory Pool |
| Concurrency | SPSC | SPSC | MPMC |
| Copy model | memcpy | Zero-copy (pointer) | N/A (alloc/free) |
| Hot-path ops | 1 store + 1 load | 1 store + 1 load (cached) | 1 CAS |
| Latency | ~8 ns | ~4 ns | ~15 ns |
| Use case | Byte streams | Typed messages | Buffer management |
| ABA safe | N/A (SPSC) | N/A (SPSC) | Yes (gen counter) |
Build
# Build all concurrency modules
make -C src/concurrency/ring_buffer
make -C src/concurrency/spsc_queue
make -C src/concurrency/memory_pool
# Run benchmarks (pin to isolated cores)
taskset -c 2,3 ./build/ring_buffer_bench
taskset -c 2,3 ./build/spsc_queue_bench
Integration Pattern
/* Typical low-latency pipeline:
NIC → raw_socket → ring_buffer → application → spsc_queue → TX */
#include "ring_buffer.h"
#include "spsc_queue.h"
#include "memory_pool.h"
/* Pre-allocate everything at startup */
static uint8_t rb_storage[4096 * 64] __attribute__((aligned(64)));
static uint8_t sq_storage[1024 * 128] __attribute__((aligned(64)));
static uint8_t pool_storage[256 * 64] __attribute__((aligned(64)));
ring_buffer_t rx_ring;
spsc_queue_t tx_queue;
memory_pool_t buf_pool;
void init(void) {
ring_buffer_init(&rx_ring, rb_storage, 4096, 64);
spsc_queue_init(&tx_queue, sq_storage, 1024, 128);
memory_pool_init(&buf_pool, pool_storage, 256, 64);
}