Foundations: Operating Systems Essentials

“vLLM PagedAttention is not a new invention: it is virtual memory paging applied to the KV cache. Cloudflare Workers' V8 isolates are process isolation in user space. Kafka exactly-once uses fsync barriers. eBPF observability reads kernel syscalls. All the ‘magic’ of system design comes back to five or six core OS concepts. An architect who doesn't understand the OS doesn't understand the system they are designing.”

Tags: cs-foundations operating-systems fundamentals Student: Hieu (Backend Dev → Architect) Related: Tuan-Bonus-LLM-Serving-Infrastructure · Tuan-Bonus-Edge-Wasm-Architecture · Tuan-13-Monitoring-Observability · Tuan-Foundations-Computer-Architecture


1. Context & Why

Why does a backend dev need to understand the OS?

| Topic you are studying | Underlying OS concept |
|---|---|
| vLLM PagedAttention | Virtual memory paging |
| K8s containers | Namespaces, cgroups |
| Cloudflare Workers | Process / sandbox isolation |
| eBPF observability | Kernel syscalls, tracepoints |
| fsync in databases | File system journaling |
| Node.js event loop | epoll / kqueue |
| Go goroutines | M:N threading |
| Postgres connections | Process per connection vs. threading |
| Redis single thread | I/O multiplexing |
| NVMe IOPS | Block layer, page cache |

Key insight: the OS is the layer between the code you write and the hardware. Understanding it helps you debug, optimize, and design systems correctly.


2. Deep Dive — Khái niệm cốt lõi

2.1 Process, Thread, Coroutine

Three levels of execution units:

2.1.1 Process

  • Isolated address space (its own virtual memory)
  • Heavy: ~10MB+ memory overhead per process
  • Slow context switch: ~10 μs (TLB flush, page tables)
  • Failure isolated: Process crash doesn’t affect others
  • IPC needed for communication: pipes, sockets, shared memory
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

char *args[] = {"ls", "-l", NULL};
int status;

pid_t pid = fork();
if (pid == 0) {
    // Child process: replace its image with "ls"
    execvp("ls", args);
} else {
    // Parent: block until the child exits
    waitpid(pid, &status, 0);
}

Examples:

  • Each browser tab in Chrome (process per tab for isolation)
  • PostgreSQL (process per connection by default)
  • Apache prefork

2.1.2 Thread

  • Shared address space within process
  • Lighter: ~1MB stack overhead
  • Faster context switch: ~1-2 μs
  • NOT failure isolated: 1 thread crash → whole process down
  • Direct memory access between threads (need synchronization)
#include <pthread.h>

void *worker_fn(void *arg) { return NULL; }  // thread entry point

pthread_t tid;
pthread_create(&tid, NULL, worker_fn, NULL);
pthread_join(tid, NULL);

Examples:

  • Java thread pool (Tomcat, Spring)
  • Apache worker/event MPM
  • Nginx thread pools (for blocking file I/O only; its workers are event-driven processes)

2.1.3 Coroutine / Green Thread / Fiber

  • User-space scheduling (no kernel involvement)
  • Very light: ~2KB stack typical
  • Sub-microsecond switch
  • Cooperative: yields explicitly (await, yield)
  • M:N model: M coroutines on N OS threads
// Go: goroutine, M:N scheduled by the Go runtime
go func() {
    process(req)
}()

# Python: asyncio coroutine
async def handler():
    await db.query()  # cooperative yield

Examples:

  • Go goroutines (M:N scheduler)
  • Python asyncio
  • Kotlin coroutines
  • Rust async/await + Tokio
  • Erlang/Elixir processes (BEAM VM)

2.1.4 Comparison

| | Process | Thread | Coroutine |
|---|---|---|---|
| Address space | Separate | Shared | Shared |
| Memory overhead | ~10 MB | ~1 MB | ~2 KB |
| Context switch | ~10 μs | ~1 μs | <1 μs |
| Communication | IPC | Shared memory | Channels |
| Crash isolation | ✅ Yes | ❌ No | ❌ No |
| Max instances | Thousands | Tens of thousands | Millions |
| Best for | Isolation, security | CPU-bound parallelism | I/O-bound concurrency |

Architectural choice:

  • Microservices = process boundary (security, scaling)
  • Within service: threads or coroutines for concurrency
  • High-concurrency I/O: coroutines (Go, Node, Python asyncio)
  • CPU-bound: threads with thread pool

2.2 Virtual Memory

The most important OS concept for systems engineers.

2.2.1 The problem

  • Physical RAM is limited & fragmented
  • Multiple processes need memory
  • Need isolation (process A can’t read process B’s memory)
  • Need flexibility (allocate more than physical RAM)

2.2.2 Solution — Virtual Memory

Each process sees own continuous virtual address space (e.g., 0 to 2^48 on x86-64).

MMU (Memory Management Unit) translates virtual → physical via page tables.

Process View (virtual):
┌───────────────────────┐ 0xFFFF...
│ Stack (grows down)    │
├───────────────────────┤
│ ↓                     │
│                       │
│ ↑                     │
├───────────────────────┤
│ Heap (grows up)       │
├───────────────────────┤
│ Data (globals)        │
├───────────────────────┤
│ Text (code)           │
└───────────────────────┘ 0x0

OS maps virtual pages to physical RAM pages

2.2.3 Pages

  • Memory divided into fixed-size pages (typically 4KB)
  • Page table maps virtual page → physical frame
  • Lookup is expensive → TLB (Translation Lookaside Buffer) caches recent translations
Virtual address: [Page Number | Offset]
                       ↓
                   Page Table
                       ↓
Physical address: [Frame Number | Offset]

TLB miss = ~100ns penalty (page table walk). TLB hit = ~1ns.
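To make the split concrete, here is a minimal C sketch of the translation arithmetic, assuming the usual 4 KB pages (12 offset bits). Standalone illustration, not kernel code:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                 // 4 KB pages → 12 offset bits

int main(void) {
    uintptr_t vaddr = 0x7f3a12345678;

    uintptr_t vpn    = vaddr >> PAGE_SHIFT;                 // virtual page number
    uintptr_t offset = vaddr & ((1UL << PAGE_SHIFT) - 1);   // offset within page

    // The page table (or the TLB on a hit) maps vpn → a physical
    // frame number; the offset passes through unchanged.
    printf("vpn = 0x%lx, offset = 0x%lx\n",
           (unsigned long)vpn, (unsigned long)offset);
    return 0;
}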

2.2.4 Page Faults

When process accesses page not in physical RAM:

| Type | Cause | Cost |
|---|---|---|
| Minor page fault | Page allocated but not yet mapped | ~1-10 μs |
| Major page fault | Page on disk (swap) | ~ms (swap in from disk) |
| Segfault | Invalid access | Process killed |

Implication: Page fault → context switch → severe latency hit.

# Monitor paging / swapping
$ vmstat 1
procs ... ----swap---- ----io----
 r  b    si   so    bi    bo
 0  0     0    0     5     0
# si/so > 0 = swapping (BAD)
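A process can also read its own fault counters via getrusage(2). A small sketch: first-touching freshly allocated memory generates minor faults, with no disk involved.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void) {
    // Touch 64 MB of fresh memory: the first write to each page
    // is a minor fault (mapping a page, no disk I/O)
    size_t sz = 64UL * 1024 * 1024;
    char *p = malloc(sz);
    memset(p, 1, sz);

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("minor faults: %ld, major faults: %ld\n",
           ru.ru_minflt, ru.ru_majflt);

    free(p);
    return 0;
}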

2.2.5 Application: PagedAttention (vLLM)

vLLM PagedAttention (https://arxiv.org/abs/2309.06180) directly inspired by virtual memory:

Naive KV cache:
  Reserve max-length contiguous memory per request
  → 60-80% memory waste

PagedAttention:
  Divide KV cache into 16-token blocks (= "pages")
  Per-request "page table" maps logical positions → physical blocks
  → 4% memory waste

Same concept as OS virtual memory:
  - Logical (per-request) → Physical (GPU memory)
  - Page table per request
  - Block-level allocation

Why it matters: understanding virtual memory means understanding PagedAttention. See Tuan-Bonus-LLM-Serving-Infrastructure.
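A toy sketch of the idea in C, with hypothetical names (not vLLM's actual code): each request holds a tiny "page table" mapping logical KV-block indexes to physical blocks handed out on demand from a shared pool.

#include <stdio.h>

#define BLOCK_TOKENS 16           // tokens per KV block (the "page size")

// Hypothetical per-request page table: logical block → physical block
typedef struct {
    int table[64];                // supports up to 64 * 16 = 1024 tokens
    int num_blocks;
} ReqPageTable;

static int next_free_block = 0;   // bump allocator over the shared pool

// Map a token position to its physical block, allocating on demand
int physical_block_for(ReqPageTable *pt, int token_pos) {
    int logical = token_pos / BLOCK_TOKENS;
    while (pt->num_blocks <= logical)
        pt->table[pt->num_blocks++] = next_free_block++;
    return pt->table[logical];
}

int main(void) {
    ReqPageTable req = { .num_blocks = 0 };
    printf("token 40 lives in physical block %d\n",
           physical_block_for(&req, 40));   // logical block 2
    return 0;
}

Blocks are allocated only when a request actually reaches that token position, which is exactly why the waste drops from reserving max-length buffers to a few percent.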

2.2.6 Swap

When physical RAM full, OS moves cold pages to swap space on disk.

Implications for production (see the mlockall sketch after this list):

  • Databases never want swap (query latency goes from ~1 ms to seconds)
  • Set vm.swappiness=1 on DB hosts
  • Disable swap entirely on Kafka brokers, Cassandra
  • Monitor si/so in vmstat
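At the process level, latency-critical daemons often opt out of swap entirely by locking their memory. A minimal sketch using mlockall(2), which needs CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK:

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    // Pin all current and future pages in RAM so they can never
    // be swapped out
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");       // typically EPERM without CAP_IPC_LOCK
        return 1;
    }
    puts("memory locked; pages cannot be swapped out");
    return 0;
}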

2.3 File Systems

2.3.1 Inodes & Files

  • Inode: metadata structure (size, permissions, timestamps, block pointers)
  • File: name → inode mapping (in directory)
  • Block: smallest disk allocation unit (4KB typical)
Directory entry: "data.txt" → inode #12345
Inode #12345:
  - Size: 8 KB
  - Permissions: 644
  - Block pointers: [block 100, block 101]
  - Indirect block (for large files): block 200
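The stat(2) syscall exposes the inode fields directly. A small sketch (assumes a local data.txt exists):

#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    struct stat st;
    if (stat("data.txt", &st) != 0) {
        perror("stat");
        return 1;
    }
    // All of these fields come straight from the inode
    printf("inode:  %lu\n", (unsigned long)st.st_ino);
    printf("size:   %lld bytes\n", (long long)st.st_size);
    printf("mode:   %o\n", st.st_mode & 0777);
    printf("blocks: %lld (512-byte units)\n", (long long)st.st_blocks);
    return 0;
}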

2.3.2 fsync — The Bottleneck

Default behavior: write() goes to the page cache (RAM) and returns immediately. The OS flushes to disk later.

Problem: Crash before flush → lose data.

Solution: fsync(fd) — block until data on disk.

write(fd, data, len);     // ~ μs (page cache only)
fsync(fd);                // ~1-10 ms (write to disk)

Database WAL (Write-Ahead Log) uses fsync:

Postgres COMMIT:
  1. Write WAL record
  2. fsync WAL file
  3. Reply "committed" to client
  4. Apply changes to data files (later)

Performance implication: Synchronous commit limited by disk fsync latency.

| Storage | fsync latency | Commits/sec |
|---|---|---|
| HDD | 5-15 ms | 100-200 |
| SATA SSD | 0.5-2 ms | 500-2,000 |
| NVMe SSD | 0.05-0.5 ms | 2,000-20,000 |
| Persistent Memory (Optane) | 0.001 ms | 1M+ |

Why DB on NVMe matters: Sync commit throughput.
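The commit rule in code, as a hedged sketch (illustrative names, not actual Postgres source): acknowledge a commit only after the log record has been fsync'd.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// WAL rule: reply "committed" only after the record is durable on disk
int wal_commit(int wal_fd, const char *record) {
    if (write(wal_fd, record, strlen(record)) < 0) return -1;
    if (fsync(wal_fd) != 0) return -1;   // the slow, disk-bound step
    return 0;
}

int main(void) {
    int fd = open("wal.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
    if (fd >= 0 && wal_commit(fd, "INSERT ...;\n") == 0)
        puts("committed");               // safe to acknowledge the client
    close(fd);
    return 0;
}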

2.3.3 Journaling

Most modern file systems (ext4, XFS) journal metadata; ZFS and Btrfs get equivalent crash safety from copy-on-write:

  1. Write metadata change to journal
  2. fsync journal
  3. Apply metadata change
  4. Mark journal entry done

Crash recovery: Replay journal → consistent state.

Trade-off: 2x writes for metadata, but crash-safe.

2.3.4 Modern File Systems

| FS | Strength | Use case |
|---|---|---|
| ext4 | Mature, Linux default | General |
| XFS | Large files, parallel I/O | Databases, big data |
| ZFS | Checksumming, snapshots, COW | Storage servers, critical data |
| Btrfs | Snapshots, subvolumes | Containers |
| NFS | Network FS | Shared storage |
| EFS (AWS) | Elastic NFS | Cloud shared FS |
| FUSE | User-space FS | Custom (S3FS, etc.) |

2.4 I/O Models

How does code wait for I/O? Four models in practice:

2.4.1 Blocking I/O (default)

read(fd, buf, len);  // Thread blocked until data ready

Pros: simple. Cons: one thread per connection → thousands of threads and constant context switching.

2.4.2 Non-blocking I/O

fcntl(fd, F_SETFL, O_NONBLOCK);
int n = read(fd, buf, len);  // Returns -1 + EAGAIN if no data

Pros: no blocking. Cons: needs a busy loop or readiness polling.

2.4.3 I/O Multiplexing (select / poll / epoll / kqueue)

One thread monitors many sockets. Kernel notifies when ready.

// Linux: epoll
int epfd = epoll_create1(0);
struct epoll_event event = { .events = EPOLLIN, .data.fd = fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);

struct epoll_event events[MAX_EVENTS];
while (1) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
    for (int i = 0; i < n; i++) {
        // Handle ready event: events[i].data.fd is readable
    }
}

This is the solution to the C10K problem: one thread, 10,000 connections.

Implementations:

  • Linux: epoll (best, O(1))
  • macOS/BSD: kqueue
  • Windows: IOCP

Used by: Nginx, Redis, Node.js (libuv), Netty, Tokio.

2.4.4 Async I/O (true async)

io_uring (Linux 5.1+, 2019) and Windows IOCP — completion-based (vs readiness-based for epoll).

How io_uring works:

  • Submission queue (SQ) and completion queue (CQ) rings shared between user space and the kernel
  • Batches many operations per syscall; polling mode removes syscalls from the fast path entirely
  • 2-3x faster than epoll for high-IOPS workloads
// Rust async with Tokio (mainline Tokio uses epoll via mio;
// the separate tokio-uring crate targets io_uring)
let stream = TcpStream::connect("...").await?;
let n = stream.read(&mut buf).await?;
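For a look at the raw interface, here is a minimal liburing read in C (link with -luring, Linux 5.1+); a sketch without the error handling production code needs:

#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);    // set up SQ/CQ rings (8 entries)

    int fd = open("/etc/hostname", O_RDONLY);
    char buf[256];

    // Enqueue a read in the submission ring, then submit
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    // Block until its completion entry appears in the completion ring
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}

Note the shape: epoll tells you when a fd is ready so you can call read(); io_uring hands you the finished read.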

2.4.5 Comparison

| Model | Threads | Concurrency | Best for |
|---|---|---|---|
| Blocking | 1 per conn | ~1K | Simple servers |
| Non-blocking + multiplexing (epoll) | 1 worker | 100K+ | Most servers (Nginx, Redis) |
| Async (io_uring) | 1-N workers | 1M+ | High-IOPS storage |
| Thread pool + blocking | N (= cores) | 10K | CPU-bound + some I/O |

2.5 Process Scheduling

OS scheduler decides which process runs on which CPU.

2.5.1 Linux CFS (Completely Fair Scheduler)

  • Virtual runtime (vruntime): each task accumulates time used
  • Scheduler picks task with lowest vruntime
  • Ensures fairness over time
  • Default scheduler from Linux 2.6.23 until 6.6, where EEVDF replaced it

2.5.2 Real-time scheduling

  • SCHED_FIFO, SCHED_RR: realtime priority
  • Used for low-latency workloads (audio, video, trading)
  • Risk: Starve other processes

2.5.3 CPU affinity

Pin process to specific cores:

#define _GNU_SOURCE
#include <sched.h>

cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(2, &mask);                           // Use core 2 only
sched_setaffinity(0, sizeof(mask), &mask);   // 0 = current process

Why pin?

  • Keeps L1/L2 caches and TLB entries warm (no migration between cores)
  • NUMA locality
  • Deterministic latency for real-time workloads

Examples:

  • DPDK (Data Plane Development Kit) pins cores
  • High-frequency trading
  • Databases with dedicated, fenced-off cores

2.5.4 cgroups (Control Groups)

Resource limits per group of processes:

# cgroup v1 interface shown; cgroup v2 uses memory.max and cpu.max
# Limit memory to 1GB
echo 1073741824 > /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes

# Limit CPU to 50% of one core (50 ms quota per 100 ms period)
echo 50000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us

Used by: Docker, K8s (resources.limits.cpu/memory).

2.6 Inter-Process Communication (IPC)

5 main IPC mechanisms:

2.6.1 Pipes / FIFOs

  • Unidirectional byte stream
  • Anonymous (parent-child) or named (FIFO)
  • Used by shells: cat | grep | sort (see the fork + pipe sketch below)
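A minimal fork + pipe sketch in C, the same mechanism behind shell pipelines:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    pipe(fds);                        // fds[0] = read end, fds[1] = write end

    if (fork() == 0) {                // child: writes into the pipe
        close(fds[0]);
        write(fds[1], "hello\n", 6);
        _exit(0);
    }

    close(fds[1]);                    // parent: reads from the pipe
    char buf[16];
    ssize_t n = read(fds[0], buf, sizeof(buf));
    write(STDOUT_FILENO, buf, n);
    return 0;
}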

2.6.2 Unix domain sockets

  • Bidirectional, like TCP but local-only
  • Faster than localhost TCP (no TCP/IP overhead)
  • Used by Nginx ↔ PHP-FPM, Postgres, and Redis local sockets

2.6.3 Shared memory

  • Fastest IPC (zero copy)
  • Mmap a file or anonymous region into multiple processes
  • Need synchronization (semaphores, futex)
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#define SIZE 4096

int fd = shm_open("/myshm", O_CREAT | O_RDWR, 0600);
ftruncate(fd, SIZE);                 // size the shared region
void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);   // other processes map the same name

Used by: PostgreSQL shared buffers.

2.6.4 Message queues

  • POSIX mq, System V msgq
  • Less common in modern services (largely displaced by Kafka, Redis Streams)

2.6.5 Signals

  • Asynchronous notifications: SIGTERM, SIGKILL, SIGINT, SIGHUP
  • Limited info (no payload)
import signal
def handler(signum, frame):
    print("Graceful shutdown")
signal.signal(signal.SIGTERM, handler)

Best practice: Container should handle SIGTERM for graceful shutdown (drain connections, flush buffers).

2.7 Containers — namespaces + cgroups

Containers are NOT VMs. They are isolated processes using:

2.7.1 Linux namespaces

| Namespace | Isolates |
|---|---|
| pid | Process IDs (container sees its own PIDs) |
| net | Network interfaces |
| mnt | Mount points |
| uts | Hostname |
| ipc | IPC resources |
| user | User IDs |
| cgroup | cgroups view |
# Create new PID namespace
unshare --pid --fork --mount-proc bash
ps  # sees only this namespace's processes

2.7.2 cgroups

Limit resources (CPU, memory, I/O, network).

Container = process(es) in a set of namespaces + cgroups limits + filesystem (overlay).

Why fast: No hypervisor, no separate OS kernel. Container = native Linux process with constraints.

2.8 Kernel vs User Space

Kernel space: privileged, direct hardware access. User space: applications, sandboxed.

Syscalls are bridge:

// User-space code
int fd = open("/etc/passwd", O_RDONLY);
//          ^---- syscall: trap to kernel mode

Cost: Syscall = ~100-500 ns (mode switch).

Why eBPF is revolutionary:

  • Runs sandboxed code inside the kernel without writing a kernel module
  • No per-event user/kernel crossing; data is aggregated in kernel maps
  • Verifier ensures safety

Why io_uring: Submit/complete via shared rings, batch syscalls → near-zero overhead.

2.9 Memory Allocators

User-space allocators (malloc/free) on top of kernel page allocator:

| Allocator | Used by | Property |
|---|---|---|
| glibc malloc (ptmalloc2) | Linux default | Mature, complex |
| jemalloc | Redis, Cassandra, Firefox | Lower fragmentation |
| tcmalloc (Google) | Chrome, gRPC | Thread caches, scalable |
| mimalloc (Microsoft) | Modern apps | Performance |
| Rust default | Rust programs | System allocator (jemalloc until Rust 1.32) |

Production tip: For long-running servers (DBs, caches), switch to jemalloc to reduce memory fragmentation. Redis docs explicitly recommend this.

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./my-server

3. Practical Applications — OS in System Design

3.1 Why Redis is single-threaded

Redis primarily runs one thread + epoll:

  • Each operation is in-memory (μs)
  • Multi-threading would need locks → slower
  • epoll handles 100K+ concurrent connections
  • Background threads handle fsync and AOF rewrite; since Redis 6, optional I/O threads read/write sockets, but command execution stays single-threaded

Throughput: 100K-1M ops/sec/instance.

3.2 Why Postgres uses processes

Postgres = process per connection (with pooling via PgBouncer):

  • Crash isolation (a misbehaving backend cannot corrupt other connections' memory)
  • Mature codebase
  • Trade-off: ~10MB/connection → expensive

With PgBouncer: 10K clients → 100 backend connections.

3.3 Why Node.js single-threaded + libuv

Node.js = 1 main thread + libuv thread pool:

  • Main thread: JS event loop (epoll-based via libuv)
  • File I/O and DNS offloaded to the libuv thread pool (4 threads by default); network I/O stays on the event loop
  • CPU-bound work blocks event loop → use Worker Threads

3.4 Why Go’s M:N scheduler

Goroutines:

  • M = OS threads (parallelism capped by GOMAXPROCS)
  • N = Goroutines (millions possible)
  • Cooperative + preemptive (since Go 1.14)
  • Network poller built-in (epoll/kqueue)

→ Why Go is great for backends: it can handle ~1M concurrent connections with reasonable memory.

3.5 Why Kubernetes pods

Pod = set of containers sharing network namespace + IPC:

  • Containers in the same pod reach each other via localhost
  • Shared volumes (mount namespace shared)
  • Sidecar pattern: app + Envoy proxy in same pod

3.6 Why fsync is critical for DBs

Every database COMMIT involves fsync:

COMMIT in Postgres:
  1. Write WAL record to OS buffer
  2. fsync WAL file → wait for disk
  3. Reply OK to client

Skip fsync (synchronous_commit = off) → ~10x throughput, but committed transactions can be lost on crash.

3.7 Why mmap for memory-mapped files

void *data = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
// Now access file as memory

Used by:

  • LMDB, RocksDB partially
  • Index files in Lucene/Elasticsearch
  • MongoDB's legacy MMAPv1 storage engine

Pros: the OS handles caching, so no double buffering. Cons: page faults make latency unpredictable, and I/O errors surface as signals instead of return codes, which is why mmap is controversial for database storage engines.


4. Security First — OS-level Security

4.1 Principle of Least Privilege

  • Don’t run services as root
  • Use setcap for specific capabilities
  • Drop privileges after init (e.g., bind port 80 as root, then switch to an unprivileged user like ‘nobody’)

4.2 Container security

  • AppArmor / SELinux: Mandatory access control
  • Seccomp: Syscall filtering (block dangerous syscalls)
  • User namespaces: Container UID 0 ≠ host UID 0
  • Read-only root FS: --read-only Docker flag
  • No privileged containers: Only when absolutely needed

4.3 Memory protection

  • ASLR (Address Space Layout Randomization)
  • DEP/NX (No-eXecute on data pages)
  • Stack canaries

These are kernel + compiler features; modern Linux with gcc/clang enables them all by default.

4.4 Side-channel attacks

  • Spectre/Meltdown (2018): exploit speculative execution + cache timing to leak data
  • Mitigations: kernel patches, microcode updates, KPTI (Kernel Page-Table Isolation)
  • Performance cost: 5-30%

5. DevOps — OS Tuning

5.1 Linux tuning for high-performance servers

# /etc/sysctl.conf
 
# Network
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 16384
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_fin_timeout = 30
net.ipv4.ip_local_port_range = 10000 65535
net.ipv4.tcp_tw_reuse = 1
 
# File handles
fs.file-max = 2097152
 
# Memory
vm.swappiness = 1            # Avoid swap
vm.dirty_ratio = 10          # Flush dirty pages aggressively
vm.dirty_background_ratio = 5
 
# Apply
sysctl -p

5.2 Per-process limits (ulimit)

# /etc/security/limits.conf
* soft nofile 1048576
* hard nofile 1048576
* soft nproc 1048576
* hard nproc 1048576

5.3 Monitoring tools

| Tool | Use case |
|---|---|
| top, htop | CPU/memory by process |
| vmstat | Virtual memory stats |
| iostat | I/O statistics |
| iotop | I/O per process |
| pidstat | Per-process stats |
| perf | Performance profiling |
| strace | Syscall tracing |
| ltrace | Library call tracing |
| bpftrace / bcc | eBPF observability |
| ss / netstat | Network connections |
| tcpdump | Network packet capture |
| lsof | Open files |
| sar | Historical stats |

5.4 Brendan Gregg’s USE Method

For each resource, check:

  • Utilization
  • Saturation
  • Errors
# CPU
top                       # Util
uptime                    # Saturation (load avg)
dmesg | grep -i error    # Errors
 
# Memory
free -m                   # Util
vmstat 1                  # Saturation (si/so)
dmesg | grep -i oom
 
# Disk
iostat -x 1               # Util (%util), Saturation (await)
 
# Network
sar -n DEV 1              # Util
ss -s                     # Saturation
ifconfig | grep errors

6. Code Examples

6.1 epoll server in C

// Simple TCP echo server with epoll
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int listen_fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(8080),
                                .sin_addr.s_addr = INADDR_ANY };
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 1024);

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[64];
    while (1) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                // New connection: register it with epoll
                int client = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
            } else {
                // Existing connection: echo back what we read
                char buf[1024];
                ssize_t nread = read(fd, buf, sizeof(buf));
                if (nread > 0) write(fd, buf, nread);
                else { close(fd); epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL); }
            }
        }
    }
}

6.2 Goroutine vs Thread benchmark

package main
 
import (
    "fmt"
    "runtime"
    "sync"
    "time"
)
 
func main() {
    runtime.GOMAXPROCS(8)
 
    const N = 1_000_000
    var wg sync.WaitGroup
 
    start := time.Now()
    for i := 0; i < N; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            time.Sleep(100 * time.Millisecond)  // Simulate I/O
        }()
    }
    wg.Wait()
 
    fmt.Printf("1M goroutines: %v\n", time.Since(start))
    // Result: ~100-200ms (1M OS threads at ~1 MB of stack each would need ~1 TB)
}

6.3 fsync benchmark

import os
import time
 
def benchmark_fsync(n=1000):
    fd = os.open("test.dat", os.O_CREAT | os.O_WRONLY)
 
    # Without fsync
    start = time.time()
    for _ in range(n):
        os.write(fd, b"x" * 100)
    elapsed = time.time() - start
    print(f"No fsync: {n / elapsed:.0f} writes/s")
 
    # With fsync
    start = time.time()
    for _ in range(n):
        os.write(fd, b"x" * 100)
        os.fsync(fd)
    elapsed = time.time() - start
    print(f"With fsync: {n / elapsed:.0f} writes/s")
 
    os.close(fd)
 
benchmark_fsync()
# Typical results:
#   No fsync: 500,000 writes/s
#   With fsync: 200-2000 writes/s (disk dependent)

6.4 Memory-mapped file

import mmap
 
with open("data.bin", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    # Access file as memory
    print(mm[0:100])
    mm[0:5] = b"HELLO"
    mm.close()

7. System Design Diagrams

7.1 Process vs Thread vs Coroutine

flowchart LR
    subgraph Process["Process (heavy)"]
        P1[Process 1<br/>10MB]
        P2[Process 2<br/>10MB]
    end

    subgraph Thread["Threads (medium)"]
        TP[Process]
        T1[Thread 1<br/>1MB]
        T2[Thread 2<br/>1MB]
        T3[Thread 3<br/>1MB]
        TP --> T1
        TP --> T2
        TP --> T3
    end

    subgraph Coroutine["Coroutines (light)"]
        CP[Process]
        CT[OS Thread]
        C1[Coroutine 1<br/>2KB]
        C2[Coroutine 2<br/>2KB]
        C3[Coroutine 3<br/>2KB]
        Cn[... 1M coroutines]
        CP --> CT
        CT --> C1
        CT --> C2
        CT --> C3
        CT --> Cn
    end

7.2 Virtual Memory

flowchart TB
    subgraph Process["Process Address Space"]
        VP1[Virtual Page 0]
        VP2[Virtual Page 1]
        VP3[Virtual Page 2]
        VP4[Virtual Page 3]
    end

    subgraph PageTable["Page Table"]
        PT[V0→F5<br/>V1→F2<br/>V2→Disk<br/>V3→F8]
    end

    subgraph Physical["Physical RAM"]
        F1[Frame 0]
        F2[Frame 2]
        F5[Frame 5]
        F8[Frame 8]
    end

    subgraph Swap["Disk (Swap)"]
        SW[Swap Pages]
    end

    VP1 -.via PT.-> F5
    VP2 -.-> F2
    VP3 -.page fault.-> SW
    VP4 -.-> F8

    style VP3 fill:#ffcdd2

7.3 epoll Event Loop

sequenceDiagram
    participant App
    participant Kernel
    participant Net as Network

    App->>Kernel: epoll_create1()
    App->>Kernel: epoll_ctl(ADD, socket1)
    App->>Kernel: epoll_ctl(ADD, socket2)
    App->>Kernel: epoll_ctl(ADD, socket3)

    loop Event Loop
        App->>Kernel: epoll_wait(timeout)

        Net->>Kernel: Data on socket2
        Kernel-->>App: socket2 ready

        App->>Kernel: read(socket2)
        Kernel-->>App: data

        App->>App: process(data)
    end

7.4 fsync Path

sequenceDiagram
    participant App
    participant PC as Page Cache (RAM)
    participant Journal
    participant Disk

    App->>PC: write(fd, data)
    PC-->>App: returned (data in RAM)
    Note over App: Risk: crash → lose data

    App->>PC: fsync(fd)
    PC->>Journal: write metadata + data
    Journal->>Disk: physical write + flush
    Disk-->>Journal: written
    Journal-->>PC: synced
    PC-->>App: fsync returned

    Note over App,Disk: Now durable across crash

8. Aha Moments & Pitfalls

Aha Moments

#1: Virtual memory enables most “magic”. PagedAttention, mmap, copy-on-write fork, swap — all variations of paging.

#2: Coroutines = userspace scheduling. Cooperative, sub-microsecond switch. Why Go/Python asyncio handle 1M+ connections.

#3: fsync is the bottleneck. 100x slower than RAM write. Database commit throughput = fsync rate.

#4: epoll is OS magic for C10K. 1 thread, 100K connections. Foundation of Nginx, Redis, Node.js, Netty.

#5: Containers = namespaces + cgroups. Not VMs. Same kernel, just isolated view + resource limits.

#6: Page faults are expensive. Major fault = ms penalty. Avoid swap on critical services.

#7: Syscalls have cost (~500 ns). io_uring batches them → near-zero overhead. eBPF eliminates them entirely.

#8: TLB matters. A miss on the virt→phys translation costs ~100 ns. Huge pages reduce TLB pressure (sketch below).
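A sketch of requesting huge pages from user space via madvise(2); MADV_HUGEPAGE asks the kernel's transparent-huge-page machinery to back the range with 2 MB pages:

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t sz = 2UL * 1024 * 1024;    // one 2 MB huge page's worth
    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // One TLB entry now covers 2 MB instead of 4 KB (512x fewer entries)
    if (madvise(p, sz, MADV_HUGEPAGE) != 0)
        perror("madvise");            // kernel may lack THP support

    return 0;
}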

Pitfalls

Pitfall 1: Treating thread = process

Threads share memory → race conditions, deadlock. Need synchronization.

Pitfall 2: Blocking call in async loop

JavaScript: heavy CPU in handler → blocks event loop → all requests stall. Fix: Worker threads / offload.

Pitfall 3: Swap on database

Swap kicks in → query latency jumps from milliseconds to seconds. Fix: vm.swappiness=1 or disable swap.

Pitfall 4: Default ulimits

nofile=1024 → the server hits “too many open files” around 1,000 connections. Fix: raise nofile via ulimit/limits.conf, or setrlimit at startup (sketch below).
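A process can raise its own soft limit up to the hard cap at startup; a sketch with setrlimit(2):

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    getrlimit(RLIMIT_NOFILE, &rl);
    printf("soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    rl.rlim_cur = rl.rlim_max;        // raise soft limit to the hard cap
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");
    return 0;
}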

Pitfall 5: No graceful shutdown

Container killed mid-request → corrupt state. Fix: Handle SIGTERM, drain connections.

Pitfall 6: Ignoring memory fragmentation

Long-running Redis → 10GB allocated, 5GB used. Fix: jemalloc or tcmalloc.

Pitfall 7: Privileged containers

--privileged → full host access. Fix: Specific capabilities only.

Pitfall 8: Process per request (CGI-style)

1,000 req/s × ~10 ms per fork = 10 full CPU cores spent just forking. Fix: worker pool, persistent processes.


| Topic | Connects to |
|---|---|
| Tuan-Bonus-LLM-Serving-Infrastructure | PagedAttention = virtual memory paging |
| Tuan-Bonus-Edge-Wasm-Architecture | V8 isolates, Wasm sandboxes use the OS process model |
| Tuan-13-Monitoring-Observability | eBPF for kernel-level observability |
| Tuan-07-Database-Sharding-Replication | fsync, mmap, B-trees (storage) |
| Tuan-11-Microservices-Pattern | Containers, namespaces, cgroups |
| Tuan-Foundations-Computer-Architecture | Memory hierarchy, cache, NUMA |
| Tuan-Foundations-Database-Internals | Storage engines built on the FS |

References

Books:

  • Operating Systems: Three Easy Pieces (free) — http://pages.cs.wisc.edu/~remzi/OSTEP/
  • The Linux Programming Interface (Michael Kerrisk)
  • Computer Systems: A Programmer’s Perspective (CSAPP)
  • Modern Operating Systems (Tanenbaum)


Next: Tuan-Foundations-Computer-Architecture (CPU, cache, memory, GPU).