Zero-Copy I/O¶
Status: Planned
This module is not yet implemented. The design below represents the target architecture.
Overview¶
Planned zero-copy data transfer mechanisms that eliminate or reduce memory copies between kernel space and userspace. The design covers four complementary approaches: mmap, splice/sendfile, io_uring, and MSG_ZEROCOPY.
Target Mechanisms¶
┌──────────────────────────────────────────────────────────────────────┐
│ Zero-Copy Techniques │
├─────────────┬──────────────────────────────────────────────────────┤
│ mmap │ Map kernel pages into userspace (PACKET_MMAP, file) │
│ splice │ Pipe-based kernel-to-kernel transfer (no userspace) │
│ io_uring │ Async I/O with shared submission/completion rings │
│ MSG_ZEROCOPY│ Socket send from userspace pages (kernel pins pages) │
└─────────────┴──────────────────────────────────────────────────────┘
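As an illustration of the mmap row, the sketch below maps a regular file read-only so its contents can be read straight from the page cache without a `read()` copy into a user buffer. The helper names `map_file_readonly` and `unmap_file` are hypothetical and are not part of the planned API. PACKET_MMAP follows the same idea but is not shown here.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not part of the planned API): map a whole file
 * read-only. Returns the mapping and stores its length in *len_out,
 * or returns NULL on failure. */
const void *map_file_readonly(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;              /* mmap rejects zero-length mappings */
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                    /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}

/* Release a mapping obtained from map_file_readonly(). */
void unmap_file(const void *p, size_t len)
{
    munmap((void *)p, len);
}
```

The pages are faulted in on first access, so the first touch of each page still pays a page-fault cost even though no explicit copy is made.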
Architecture¶
graph LR
A[Application Buffer] -->|mmap| B[Kernel Page Cache]
B -->|splice| C[Socket Buffer]
A -->|MSG_ZEROCOPY| C
A -->|io_uring SQE| D[io_uring Ring]
D -->|Kernel| C
C --> E[NIC DMA]
Planned API¶
io_uring Interface¶
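#include <stddef.h>          /* size_t */
#include <stdint.h>          /* uint32_t */
#include <linux/io_uring.h>  /* struct io_uring_sqe, struct io_uring_cqe */

typedef struct {
    int ring_fd;                   /* io_uring file descriptor */
    struct io_uring_sqe *sq_ring;  /* Submission queue (shared mmap) */
    struct io_uring_cqe *cq_ring;  /* Completion queue (shared mmap) */
    uint32_t sq_size;              /* Number of submission entries */
    uint32_t cq_size;              /* Number of completion entries */
    uint32_t sq_tail;              /* Next submission slot */
    uint32_t cq_head;              /* Next completion to read */
} zero_copy_ring_t;

/* Ring lifecycle */
int zero_copy_ring_init(zero_copy_ring_t *ring, uint32_t queue_depth);
void zero_copy_ring_destroy(zero_copy_ring_t *ring);

/* Queue an asynchronous send or receive on fd */
int zero_copy_submit_send(zero_copy_ring_t *ring, int fd,
                          const void *buf, size_t len);
int zero_copy_submit_recv(zero_copy_ring_t *ring, int fd,
                          void *buf, size_t len);

/* Reap one completion */
int zero_copy_complete(zero_copy_ring_t *ring, int *result);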
splice/sendfile Interface¶
/* Kernel-to-kernel zero-copy file → socket transfer */
ssize_t zero_copy_sendfile(int out_fd, int in_fd, off_t offset, size_t count);
/* Pipe-based splice for stream processing */
ssize_t zero_copy_splice(int fd_in, int fd_out, size_t len, unsigned int flags);
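One possible shape for `zero_copy_sendfile` is a retry loop around `sendfile(2)`, sketched below. The return convention (bytes sent, or -1 with `errno` set) and the handling of partial transfers are assumptions, since the design above only fixes the signature.

```c
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/types.h>

/* Sketch of the planned zero_copy_sendfile(): move up to `count` bytes of
 * in_fd, starting at `offset`, to out_fd without passing through userspace.
 * Assumed convention: returns bytes sent, or -1 with errno set. */
ssize_t zero_copy_sendfile(int out_fd, int in_fd, off_t offset, size_t count)
{
    size_t sent = 0;

    while (sent < count) {
        /* sendfile() advances `offset` by the number of bytes it moved. */
        ssize_t n = sendfile(out_fd, in_fd, &offset, count - sent);
        if (n > 0) {
            sent += (size_t)n;
        } else if (n == 0) {
            break;                      /* end of input file */
        } else if (errno == EINTR) {
            continue;                   /* interrupted by a signal: retry */
        } else {
            /* Includes EAGAIN on a non-blocking socket:
             * report partial progress if there was any. */
            return sent > 0 ? (ssize_t)sent : -1;
        }
    }
    return (ssize_t)sent;
}
```

In the intended use `out_fd` is a socket. Recent Linux kernels also accept a regular file as `out_fd`, which is convenient for testing without a network peer.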
MSG_ZEROCOPY Socket Send¶
/* Enable MSG_ZEROCOPY on a socket */
int zero_copy_socket_enable(int sockfd);
/* Send with zero-copy (kernel pins user pages, signals completion) */
ssize_t zero_copy_send(int sockfd, const void *buf, size_t len);
/* Poll for zerocopy completion notifications */
int zero_copy_poll_completion(int sockfd, uint32_t *completed_id);
Performance Targets¶
| Mechanism | Latency | Throughput | Kernel Version |
|---|---|---|---|
| splice | ~2 µs | 10+ Gbps | 2.6.17+ |
| sendfile | ~2 µs | 10+ Gbps | 2.2+ |
| io_uring | < 1 µs | 10+ Gbps | 5.1+ |
| MSG_ZEROCOPY | ~5 µs | 10+ Gbps | 4.14+ |
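To check a measurement against these targets, throughput can be derived from a byte count and two `CLOCK_MONOTONIC` samples. The helpers below are hypothetical names used only for illustration. They rely on the identity that bits per nanosecond equals Gbit/s, so 10 Gbps corresponds to 1.25 × 10⁹ bytes per second.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: nanoseconds elapsed between two clock samples,
 * assuming `end` is not earlier than `start`. */
uint64_t elapsed_ns(const struct timespec *start, const struct timespec *end)
{
    int64_t sec = (int64_t)end->tv_sec - (int64_t)start->tv_sec;
    int64_t nsec = (int64_t)end->tv_nsec - (int64_t)start->tv_nsec;
    return (uint64_t)(sec * 1000000000LL + nsec);
}

/* Hypothetical helper: throughput in Gbit/s (decimal, 1e9 bits).
 * bits / ns == Gbit/s, so no extra scaling factor is needed. */
double throughput_gbps(uint64_t bytes, uint64_t ns)
{
    if (ns == 0)
        return 0.0;
    return (double)bytes * 8.0 / (double)ns;
}
```

A benchmark would sample `clock_gettime(CLOCK_MONOTONIC, ...)` before and after the transfer loop and pass the difference to `throughput_gbps`.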
io_uring Submission/Completion Ring Layout¶
┌───────────────────────────────────────────────────────────────────┐
│ io_uring shared memory (mmap'd between kernel and userspace): │
│ │
│ Submission Queue (SQ): │
│ ┌───────┬───────┬───────┬───────┬───────────────────────┐ │
│ │ SQE 0 │ SQE 1 │ SQE 2 │ SQE 3 │ ... │ │
│ └───────┴───────┴───────┴───────┴───────────────────────┘ │
│ ↑ tail (user writes) │
│ │
│ Completion Queue (CQ): │
│ ┌───────┬───────┬───────┬───────┬───────────────────────┐ │
│ │ CQE 0 │ CQE 1 │ CQE 2 │ CQE 3 │ ... │ │
│ └───────┴───────┴───────┴───────┴───────────────────────┘ │
│ ↑ head (user reads) │
│ │
│ Flow: user writes SQE → kernel processes → kernel writes CQE │
│ No syscall needed for submission (SQPOLL mode) │
└───────────────────────────────────────────────────────────────────┘
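The head/tail bookkeeping in the diagram can be modeled in plain userspace code. The sketch below is a single-threaded model only, with hypothetical names and `int` payloads standing in for SQEs or CQEs. It uses free-running 32-bit indices masked by a power-of-two size, which is the same indexing scheme io_uring uses. A real ring shared with the kernel additionally needs acquire/release memory ordering on the index updates.

```c
#include <stdint.h>

#define RING_ENTRIES 4u                 /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

/* Single-threaded model of one ring (hypothetical type). */
typedef struct {
    uint32_t head;                      /* consumer index, free-running */
    uint32_t tail;                      /* producer index, free-running */
    int slots[RING_ENTRIES];
} model_ring_t;

/* Producer side (the user for the SQ, the kernel for the CQ).
 * Returns 0 on success, -1 if the ring is full. */
int model_ring_push(model_ring_t *r, int value)
{
    if (r->tail - r->head == RING_ENTRIES)
        return -1;
    r->slots[r->tail & RING_MASK] = value;
    r->tail++;                          /* real ring: store-release */
    return 0;
}

/* Consumer side (the kernel for the SQ, the user for the CQ).
 * Returns 0 on success, -1 if the ring is empty. */
int model_ring_pop(model_ring_t *r, int *value)
{
    if (r->head == r->tail)
        return -1;
    *value = r->slots[r->head & RING_MASK];
    r->head++;                          /* real ring: store-release */
    return 0;
}
```

Because the indices are unsigned and only their difference is compared, the logic stays correct when they wrap past 2³² − 1.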
Dependencies¶
| Dependency | Version | Purpose |
|---|---|---|
| Linux kernel | ≥ 5.1 | io_uring |
| Linux kernel | ≥ 4.14 | MSG_ZEROCOPY |
| liburing | ≥ 2.0 | io_uring userspace helpers |
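A runtime guard for the kernel requirements in this table might look like the sketch below, which parses the release string reported by `uname(2)`. All function names are hypothetical. A version check is only a coarse filter: distributions backport features and io_uring can be disabled by system policy, so probing the actual syscall or socket option is the more reliable test.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Parse "major.minor" from a release string such as "5.15.0-91-generic".
 * Returns 0 on success, -1 if the string does not start with two numbers. */
int parse_kernel_release(const char *release, int *major, int *minor)
{
    return sscanf(release, "%d.%d", major, minor) == 2 ? 0 : -1;
}

/* Returns 1 if major.minor >= want_major.want_minor, else 0. */
int kernel_at_least(int major, int minor, int want_major, int want_minor)
{
    return major > want_major ||
           (major == want_major && minor >= want_minor);
}

/* Returns 1 if the running kernel is new enough, 0 if it is older,
 * or -1 if the version could not be determined. */
int running_kernel_at_least(int want_major, int want_minor)
{
    struct utsname u;
    int major, minor;

    if (uname(&u) != 0 ||
        parse_kernel_release(u.release, &major, &minor) != 0)
        return -1;
    return kernel_at_least(major, minor, want_major, want_minor);
}
```

For example, `running_kernel_at_least(5, 1)` would gate the io_uring path and `running_kernel_at_least(4, 14)` the MSG_ZEROCOPY path, with a fallback to ordinary `read`/`write` when either returns 0.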
Implementation Roadmap¶
- io_uring ring setup and teardown
- io_uring async send/recv with fixed buffers
- splice-based file-to-socket transfer
- MSG_ZEROCOPY socket send with completion polling
- Benchmark suite (vs. standard read/write)
- Integration with epoll_reactor