Foundations: Operating Systems Essentials
“vLLM PagedAttention is not a new invention — it is virtual memory paging applied to the KV cache. Cloudflare Workers’ V8 isolates are process isolation in user space. Kafka exactly-once relies on fsync barriers. eBPF observability reads kernel syscalls. All the ‘magic’ of system design comes back to 5-6 core OS concepts. An architect who doesn’t understand the OS doesn’t understand the system they are designing.”
Tags: cs-foundations operating-systems fundamentals Student: Hieu (Backend Dev → Architect) Related: Tuan-Bonus-LLM-Serving-Infrastructure · Tuan-Bonus-Edge-Wasm-Architecture · Tuan-13-Monitoring-Observability · Tuan-Foundations-Computer-Architecture
1. Context & Why
Why does a backend dev need to understand the OS?
| Topic em đang học | OS concept underlying |
|---|---|
| vLLM PagedAttention | Virtual memory paging |
| K8s containers | namespaces, cgroups |
| Cloudflare Workers | Process / sandbox isolation |
| eBPF observability | Kernel syscalls, tracepoints |
| fsync trong DB | File system journaling |
| NodeJS event loop | epoll / kqueue |
| Go goroutines | M:N threading |
| Postgres connections | Process per connection vs threading |
| Redis single-thread | I/O multiplexing |
| NVMe IOPS | Block layer, page cache |
Key insight: the OS is the layer between the code you write and the hardware. Understanding it helps you debug, optimize, and design systems correctly.
Primary references
- Operating Systems: Three Easy Pieces (OSTEP) — free book — http://pages.cs.wisc.edu/~remzi/OSTEP/
- The Linux Programming Interface (Michael Kerrisk, 2010)
- Computer Systems: A Programmer’s Perspective (CSAPP, Bryant & O’Hallaron)
- Linux man pages — `man 2 syscalls`
- Brendan Gregg’s site — https://www.brendangregg.com/
2. Deep Dive — Core Concepts
2.1 Process, Thread, Coroutine
3 levels of execution unit:
2.1.1 Process
- Isolated address space (its own virtual memory)
- Heavy: ~10MB+ memory overhead per process
- Slow context switch: ~10 μs (TLB flush, page tables)
- Failure isolated: Process crash doesn’t affect others
- IPC needed for communication: pipes, sockets, shared memory
pid_t pid = fork();
if (pid == 0) {
// Child process
execvp("ls", args);
} else {
// Parent
waitpid(pid, &status, 0);
}

Examples:
- Each browser tab in Chrome (process per tab for isolation)
- PostgreSQL (process per connection by default)
- Apache prefork
2.1.2 Thread
- Shared address space within process
- Lighter: ~1MB stack overhead
- Faster context switch: ~1-2 μs
- NOT failure isolated: 1 thread crash → whole process down
- Direct memory access between threads (need synchronization)
pthread_t tid;
pthread_create(&tid, NULL, worker_fn, arg);
pthread_join(tid, NULL);

Examples:
- Java thread pool (Tomcat, Spring)
- Nginx thread pools (workers themselves are processes; threads handle blocking file I/O)
- The OS threads underneath Go's runtime (GOMAXPROCS)
2.1.3 Coroutine / Green Thread / Fiber
- User-space scheduling (no kernel involvement)
- Very light: ~2KB stack typical
- Sub-microsecond switch
- Cooperative: yields explicitly (await, yield)
- M:N model: M coroutines on N OS threads
go func() {
// Goroutine: M:N scheduled by Go runtime
process(req)
}()

async def handler():
    await db.query()  # cooperative yield

Examples:
- Go goroutines (M:N scheduler)
- Python asyncio
- Kotlin coroutines
- Rust async/await + Tokio
- Erlang/Elixir processes (BEAM VM)
2.1.4 Comparison
| | Process | Thread | Coroutine |
|---|---|---|---|
| Address space | Separate | Shared | Shared |
| Memory overhead | ~10 MB | ~1 MB | ~2 KB |
| Context switch | ~10 μs | ~1 μs | <1 μs |
| Communication | IPC | Shared memory | Channels |
| Crash isolation | ✅ Yes | ❌ No | ❌ No |
| Max instances | Thousands | Tens of thousands | Millions |
| Best for | Isolation, security | CPU-bound parallel | I/O-bound concurrent |
Architectural choice:
- Microservices = process boundary (security, scaling)
- Within service: threads or coroutines for concurrency
- High-concurrency I/O: coroutines (Go, Node, Python asyncio)
- CPU-bound: threads with thread pool
2.2 Virtual Memory
Most important OS concept for systems engineers.
2.2.1 The problem
- Physical RAM is limited & fragmented
- Multiple processes need memory
- Need isolation (process A can’t read process B’s memory)
- Need flexibility (allocate more than physical RAM)
2.2.2 Solution — Virtual Memory
Each process sees its own contiguous virtual address space (48-bit addresses on x86-64).
MMU (Memory Management Unit) translates virtual → physical via page tables.
Process View (virtual):
┌───────────────────────┐ 0xFFFF...
│ Stack (grows down) │
├───────────────────────┤
│ ↓ │
│ │
│ ↑ │
├───────────────────────┤
│ Heap (grows up) │
├───────────────────────┤
│ Data (globals) │
├───────────────────────┤
│ Text (code) │
└───────────────────────┘ 0x0
OS maps virtual pages to physical RAM pages
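You can see this layout directly. A minimal C sketch that prints one address from each region (exact values vary per run because of ASLR; see section 4.3):

```c
#include <stdio.h>
#include <stdlib.h>

int global_var = 42; // Data segment (globals)

int main(void) {
    int local_var = 0;            // Stack
    void *heap_ptr = malloc(64);  // Heap

    printf("Text  (code):   %p\n", (void *)main);
    printf("Data  (global): %p\n", (void *)&global_var);
    printf("Heap:           %p\n", heap_ptr);
    printf("Stack:          %p\n", (void *)&local_var);

    free(heap_ptr);
    return 0;
}
```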
2.2.3 Pages
- Memory divided into fixed-size pages (typically 4KB)
- Page table maps virtual page → physical frame
- Lookup is expensive → TLB (Translation Lookaside Buffer) caches recent translations
Virtual address: [Page Number | Offset]
↓
Page Table
↓
Physical address: [Frame Number | Offset]
TLB miss = ~100ns penalty (page table walk). TLB hit = ~1ns.
2.2.4 Page Faults
When process accesses page not in physical RAM:
| Type | Cause | Cost |
|---|---|---|
| Minor page fault | Page allocated but not mapped yet | ~1-10 μs |
| Major page fault | Page on disk (swap) | ~ms (swap in from disk) |
| Segfault | Invalid access | Process killed |
Implication: Page fault → context switch → severe latency hit.
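To make minor faults concrete, a small Linux-only sketch: freshly mmap'd anonymous memory is demand-paged, so the first touch of each page costs one minor fault, visible via getrusage():

```c
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE); // Typically 4096
    size_t n_pages = 1000;
    struct rusage before, after;

    char *buf = mmap(NULL, n_pages * page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    getrusage(RUSAGE_SELF, &before);
    for (size_t i = 0; i < n_pages; i++)
        buf[i * page] = 1; // First touch: one minor fault per page
    getrusage(RUSAGE_SELF, &after);

    printf("Minor faults: %ld (~%zu pages touched)\n",
           after.ru_minflt - before.ru_minflt, n_pages);
    return 0;
}
```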
# Monitor page faults
$ vmstat 1
procs ... swap io
r b si so bi bo
0 0 0 0 5 0 ← so > 0 = swapping (BAD)

2.2.5 Application: PagedAttention (vLLM)
vLLM PagedAttention (https://arxiv.org/abs/2309.06180) is directly inspired by virtual memory:
Naive KV cache:
Reserve max-length contiguous memory per request
→ 60-80% memory waste
PagedAttention:
Divide KV cache into 16-token blocks (= "pages")
Per-request "page table" maps logical positions → physical blocks
→ 4% memory waste
Same concept as OS virtual memory:
- Logical (per-request) → Physical (GPU memory)
- Page table per request
- Block-level allocation
→ Why it matters: understand virtual memory and you understand PagedAttention. See Tuan-Bonus-LLM-Serving-Infrastructure.
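As a toy illustration only (not vLLM's actual code; the BlockTable type and sizes here are hypothetical), the lookup above is literally page-table translation with 16-token pages:

```c
#include <stdio.h>

#define BLOCK_TOKENS 16  // vLLM uses 16-token KV blocks ("pages")

// Toy per-request block table: logical block index -> physical block id.
typedef struct {
    int physical_block[64]; // Hypothetical cap: 64 logical blocks per request
    int num_blocks;
} BlockTable;

// Translate a token position to (physical block, offset), exactly like a
// virtual address splitting into (page number, offset).
static void translate(const BlockTable *bt, int token_pos,
                      int *phys_block, int *offset) {
    int logical_block = token_pos / BLOCK_TOKENS; // "page number"
    *offset = token_pos % BLOCK_TOKENS;           // "offset"
    *phys_block = bt->physical_block[logical_block];
}

int main(void) {
    BlockTable bt = { .physical_block = {7, 2, 9}, .num_blocks = 3 };
    int pb, off;
    translate(&bt, 37, &pb, &off); // Token 37 -> logical block 2, offset 5
    printf("token 37 -> physical block %d, offset %d\n", pb, off);
    return 0;
}
```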
2.2.6 Swap
When physical RAM full, OS moves cold pages to swap space on disk.
Implications for production:
- Databases never want swap (query latency goes from ~1 ms to seconds)
- Set `vm.swappiness=1` on DB hosts
- Disable swap entirely on Kafka brokers and Cassandra
- Monitor `si`/`so` in `vmstat`
2.3 File Systems
2.3.1 Inodes & Files
- Inode: metadata structure (size, permissions, timestamps, block pointers)
- File: name → inode mapping (in directory)
- Block: smallest disk allocation unit (4KB typical)
Directory entry: "data.txt" → inode #12345
Inode #12345:
- Size: 8 KB
- Permissions: 644
- Block pointers: [block 100, block 101]
- Indirect block (for large files): block 200
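A small sketch of reading that metadata through stat(2) (assumes a data.txt exists, as in the example above):

```c
#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    struct stat st;
    if (stat("data.txt", &st) != 0) { perror("stat"); return 1; }

    printf("Inode:  %lu\n", (unsigned long)st.st_ino);
    printf("Size:   %lld bytes\n", (long long)st.st_size);
    printf("Blocks: %lld (512-byte units)\n", (long long)st.st_blocks);
    printf("Mode:   %o\n", st.st_mode & 0777);
    return 0;
}
```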
2.3.2 fsync — The Bottleneck
Default behavior: write goes to the page cache (RAM) and returns immediately. The OS flushes to disk later.
Problem: Crash before flush → lose data.
Solution: fsync(fd) — block until data on disk.
write(fd, data, len); // ~ μs (page cache only)
fsync(fd); // ~1-10 ms (write to disk)

Database WAL (Write-Ahead Log) uses fsync:
Postgres COMMIT:
1. Write WAL record
2. fsync WAL file
3. Reply "committed" to client
4. Apply changes to data files (later)
Performance implication: synchronous commit throughput is limited by disk fsync latency.
| Storage | fsync latency | Commits/sec |
|---|---|---|
| HDD | 5-15 ms | 100-200 |
| SATA SSD | 0.5-2 ms | 500-2000 |
| NVMe SSD | 0.05-0.5 ms | 2000-20000 |
| Persistent Memory (Optane) | 0.001 ms | 1M+ |
→ Why DB on NVMe matters: Sync commit throughput.
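A minimal sketch of the WAL commit pattern from above in C (illustrative, not Postgres source): append the record, fsync, and only then acknowledge the commit:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Durable commit: append record, fsync, then acknowledge.
int commit(int wal_fd, const char *record) {
    if (write(wal_fd, record, strlen(record)) < 0) return -1;
    if (fsync(wal_fd) < 0) return -1; // Block until on disk: the slow step
    return 0;                          // Only now is it safe to reply "committed"
}

int main(void) {
    int fd = open("wal.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (commit(fd, "INSERT INTO t VALUES (1);\n") == 0)
        printf("committed (durable)\n");
    close(fd);
    return 0;
}
```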
2.3.3 Journaling
Most modern FSes (ext4, XFS, ZFS) journal metadata:
- Write metadata change to journal
- fsync journal
- Apply metadata change
- Mark journal entry done
Crash recovery: Replay journal → consistent state.
Trade-off: 2x writes for metadata, but crash-safe.
2.3.4 Modern File Systems
| FS | Strength | Use case |
|---|---|---|
| ext4 | Mature, default Linux | General |
| XFS | Large files, parallel I/O | Databases, big data |
| ZFS | Checksumming, snapshots, COW | Storage servers, critical data |
| Btrfs | Snapshots, subvolumes | Containers |
| NFS | Network FS | Shared storage |
| EFS (AWS) | Elastic NFS | Cloud shared FS |
| FUSE | User-space FS | Custom (S3FS, etc.) |
2.4 I/O Models
How does code wait for I/O? Four models in practice (the classic taxonomy adds signal-driven I/O as a fifth):
2.4.1 Blocking I/O (default)
read(fd, buf, len); // Thread blocked until data ready

Pros: Simple. Cons: 1 thread per connection → thousands of threads → thousands of context switches.
2.4.2 Non-blocking I/O
fcntl(fd, F_SETFL, O_NONBLOCK);
int n = read(fd, buf, len); // Returns -1 + EAGAIN if no data

Pros: No blocking. Cons: Needs a busy loop or polling.
2.4.3 I/O Multiplexing (select / poll / epoll / kqueue)
One thread monitors many sockets. Kernel notifies when ready.
// Linux: epoll
int epfd = epoll_create1(0);
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
while (1) {
int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
for (int i = 0; i < n; i++) {
// Handle ready event
}
}

This is the C10K solution: 1 thread, 10,000 connections.
Implementations:
- Linux: `epoll` (best, O(1))
- macOS/BSD: `kqueue`
- Windows: IOCP
Used by: Nginx, Redis, Node.js (libuv), Netty, Tokio.
2.4.4 Async I/O (true async)
io_uring (Linux 5.1+, 2019) and Windows IOCP — completion-based (vs readiness-based for epoll).
Magic of io_uring:
- Submit ring (kernel)
- Completion ring
- Zero syscall overhead in fast path
- 2-3x faster than epoll for high IOPS
// Async read with Tokio (epoll-based); io_uring backends exist as
// separate crates such as tokio-uring
let stream = TcpStream::connect("...").await?;
let n = stream.read(&mut buf).await?;

2.4.5 Comparison
| Model | Threads | Concurrency | Best for |
|---|---|---|---|
| Blocking | 1 per conn | ~1K | Simple servers |
| Non-blocking + multiplexing (epoll) | 1 worker | 100K+ | Most servers (Nginx, Redis) |
| Async (io_uring) | 1-N workers | 1M+ | High IOPS storage |
| Thread pool + blocking | N (=cores) | 10K | CPU-bound + some I/O |
2.5 Process Scheduling
OS scheduler decides which process runs on which CPU.
2.5.1 Linux CFS (Completely Fair Scheduler)
- Virtual runtime (vruntime): each task accumulates time used
- Scheduler picks task with lowest vruntime
- Ensures fairness over time
- Default scheduler from Linux 2.6.23 until 6.6, when EEVDF replaced it
2.5.2 Real-time scheduling
- `SCHED_FIFO`, `SCHED_RR`: real-time priority
- Used for low-latency workloads (audio, video, trading)
- Risk: can starve other processes (see the sketch below)
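A minimal sketch of requesting SCHED_FIFO (requires root or CAP_SYS_NICE; the priority value 50 is arbitrary):

```c
#include <sched.h>
#include <stdio.h>

int main(void) {
    struct sched_param sp = { .sched_priority = 50 }; // 1-99 for SCHED_FIFO
    // Fails with EPERM without root / CAP_SYS_NICE.
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }
    printf("Running with real-time FIFO priority\n");
    return 0;
}
```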
2.5.3 CPU affinity
Pin process to specific cores:
cpu_set_t mask;
CPU_ZERO(&mask); // Clear the set first
CPU_SET(2, &mask); // Use core 2
sched_setaffinity(0, sizeof(mask), &mask);

Why pin?
- Keeps L1/L2 caches hot (no cross-core migration)
- NUMA locality
- Real-time deterministic latency
Examples:
- DPDK (Data Plane Development Kit) pins cores
- High-frequency trading
- Database fenced cores
2.5.4 cgroups (Control Groups)
Resource limits per group of processes:
# Limit memory to 1GB
echo 1073741824 > /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes
# Limit CPU to 50% of one core
echo 50000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us

(cgroup v1 paths shown.) Used by: Docker, K8s (resources.limits.cpu/memory).
2.6 Inter-Process Communication (IPC)
5 main IPC mechanisms:
2.6.1 Pipes / FIFOs
- Unidirectional byte stream
- Anonymous (parent-child) or named (FIFO)
- Used by shells: `cat | grep | sort` (see the C sketch below)
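A minimal parent-to-child pipe in C:

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    pipe(fds); // fds[0] = read end, fds[1] = write end

    if (fork() == 0) {              // Child: reads from the pipe
        close(fds[1]);
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
        buf[n] = '\0';
        printf("child got: %s\n", buf);
        _exit(0);
    }

    close(fds[0]);                  // Parent: writes into the pipe
    write(fds[1], "hello", 5);
    close(fds[1]);                  // EOF for the reader
    wait(NULL);
    return 0;
}
```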
2.6.2 Unix domain sockets
- Bidirectional, like TCP but local-only
- Faster than localhost TCP (no TCP/IP overhead)
- Used by Nginx ↔ PHP-FPM, Postgres, and Redis local sockets
2.6.3 Shared memory
- Fastest IPC (zero copy)
- Mmap a file or anonymous region into multiple processes
- Need synchronization (semaphores, futex)
int fd = shm_open("/myshm", O_CREAT | O_RDWR, 0600);
ftruncate(fd, SIZE);
void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);

Used by: PostgreSQL shared buffers (one region shared by all backend processes).
2.6.4 Message queues
- POSIX mq, System V msgq
- Less common in modern code (brokers like Kafka and Redis Streams have largely replaced them)
2.6.5 Signals
- Asynchronous notifications: SIGTERM, SIGKILL, SIGINT, SIGHUP
- Limited info (no payload)
import signal
def handler(signum, frame):
print("Graceful shutdown")
signal.signal(signal.SIGTERM, handler)

Best practice: containers should handle SIGTERM for graceful shutdown (drain connections, flush buffers).
2.7 Containers — namespaces + cgroups
Containers are NOT VMs. They are isolated processes using:
2.7.1 Linux namespaces
| Namespace | Isolates |
|---|---|
| `pid` | Process IDs (container sees its own PIDs) |
| `net` | Network interfaces |
| `mnt` | Mount points |
| `uts` | Hostname |
| `ipc` | IPC resources |
| `user` | User IDs |
| `cgroup` | cgroups view |
# Create new PID namespace
unshare --pid --fork --mount-proc bash
ps # Sees only its own processes

2.7.2 cgroups
Limit resources (CPU, memory, I/O, network).
Container = process(es) in a set of namespaces + cgroups limits + filesystem (overlay).
→ Why fast: No hypervisor, no separate OS kernel. Container = native Linux process with constraints.
2.8 Kernel vs User Space
Kernel space: privileged, direct hardware access. User space: applications, sandboxed.
Syscalls are bridge:
// User-space code
int fd = open("/etc/passwd", O_RDONLY);
// ^---- syscall: trap to kernel mode

Cost: a syscall is ~100-500 ns (mode switch).
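You can measure this yourself. A rough sketch that calls SYS_getpid directly to force a real syscall each iteration (libc may cache or inline some wrappers):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const long N = 1000000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        syscall(SYS_getpid); // One real kernel round-trip per iteration
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.0f ns per syscall\n", ns / N); // Typically ~100-500 ns
    return 0;
}
```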
Why eBPF revolutionary:
- Run sandboxed code in kernel without writing kernel module
- Zero syscall overhead
- Verifier ensures safety
Why io_uring: Submit/complete via shared rings, batch syscalls → near-zero overhead.
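A minimal read through io_uring via the liburing helper library (a sketch; Linux 5.1+, link with -luring):

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0); // Submission + completion rings, 8 entries

    int fd = open("/etc/hostname", O_RDONLY);
    char buf[256];

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0); // Queue the read
    io_uring_submit(&ring);                            // One syscall submits it

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                    // Wait for completion
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```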
2.9 Memory Allocators
User-space allocators (malloc/free) on top of kernel page allocator:
| Allocator | Used by | Property |
|---|---|---|
| glibc malloc (ptmalloc2) | Default Linux | Mature, complex |
| jemalloc | Redis, Cassandra, Firefox | Lower fragmentation |
| tcmalloc (Google) | Chrome, gRPC | Thread-cache, scalable |
| mimalloc (Microsoft) | Modern apps | Performance |
| Rust std alloc | Rust default | System allocator since 1.32 (jemalloc available via crate) |
Production tip: For long-running servers (DBs, caches), switch to jemalloc to reduce memory fragmentation. Redis docs explicitly recommend this.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./my-server

3. Practical Applications — OS in System Design
3.1 Why Redis is single-threaded
Redis primarily uses 1 thread + epoll:
- Each operation is in-memory (μs)
- Multi-threading would need locks → slow
- epoll handles 100K+ concurrent connections
- Background threads handle AOF fsync and lazy deletes
Throughput: 100K-1M ops/sec/instance.
3.2 Why Postgres uses processes
Postgres = process per connection (with pooling via PgBouncer):
- Crash isolation (1 conn crash doesn’t take down DB)
- Mature codebase
- Trade-off: ~10MB/connection → expensive
With PgBouncer: 10K clients → 100 backend connections.
3.3 Why Node.js single-threaded + libuv
Node.js = 1 main thread + libuv thread pool:
- Main thread: JS event loop (epoll-based via libuv)
- I/O offloaded to thread pool (4 threads default)
- CPU-bound work blocks event loop → use Worker Threads
3.4 Why Go’s M:N scheduler
Goroutines:
- M = OS threads (= GOMAXPROCS)
- N = Goroutines (millions possible)
- Cooperative + preemptive (since Go 1.14)
- Network poller built-in (epoll/kqueue)
→ Why Go great for backend: handle 1M concurrent connections with reasonable memory.
3.5 Why Kubernetes pods
Pod = set of containers sharing network namespace + IPC:
- Containers in the same pod can reach each other via localhost
- Shared volumes (mount namespace shared)
- Sidecar pattern: app + Envoy proxy in same pod
3.6 Why fsync is critical for DBs
Every database COMMIT involves fsync:
COMMIT in Postgres:
1. Write WAL record to OS buffer
2. fsync WAL file → wait for disk
3. Reply OK to client
Skip fsync (synchronous_commit = off) → 10x throughput, but lose committed data on crash.
3.7 Why mmap for memory-mapped files
void *data = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
// Now access the file as memory

Used by:
- LMDB, RocksDB partially
- Index files in Lucene/Elasticsearch
- MongoDB's legacy MMAPv1 storage engine
Pros: the OS handles caching, no double buffering. Cons: page faults are unpredictable (an access can stall on disk I/O) and errors surface as SIGBUS; Linus Torvalds has famously argued against mmap for some of these use cases.
4. Security First — OS-level Security
4.1 Principle of Least Privilege
- Don’t run services as root
- Use `setcap` for specific capabilities
- Drop privileges after init: e.g., bind port 80 as root, then drop to `nobody` (see the sketch below)
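A minimal privilege-drop sketch in C (simplified: production code should also call setgroups() to clear supplementary groups):

```c
#include <pwd.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // (Bind privileged port 80 here, while still root.)

    struct passwd *pw = getpwnam("nobody");
    if (!pw) { fprintf(stderr, "no such user\n"); return 1; }

    // Order matters: drop group first, then user, then verify.
    if (setgid(pw->pw_gid) != 0 || setuid(pw->pw_uid) != 0) {
        perror("drop privileges");
        return 1;
    }
    if (setuid(0) == 0) { // Must fail now; if not, the drop leaked
        fprintf(stderr, "privilege drop failed!\n");
        return 1;
    }
    printf("running as uid %d\n", getuid());
    return 0;
}
```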
4.2 Container security
- AppArmor / SELinux: Mandatory access control
- Seccomp: Syscall filtering (block dangerous syscalls; sketch below)
- User namespaces: Container UID 0 ≠ host UID 0
- Read-only root FS: `--read-only` Docker flag
- No privileged containers: only when absolutely needed
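A minimal seccomp sketch using libseccomp (link with -lseccomp); blocking execve is just an example rule:

```c
#include <errno.h>
#include <seccomp.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Allow everything by default, but make execve fail with EPERM.
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(execve), 0);
    seccomp_load(ctx);    // Filter is now active in the kernel
    seccomp_release(ctx); // Frees the userspace context only

    if (execlp("ls", "ls", NULL) < 0)
        perror("execlp"); // Expected: Operation not permitted
    return 0;
}
```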
4.3 Memory protection
- ASLR (Address Space Layout Randomization)
- DEP/NX (No-eXecute on data pages)
- Stack canaries
These are kernel + compiler features; modern Linux with gcc/clang enables all of them by default.
4.4 Side-channel attacks
- Spectre/Meltdown (2018): exploit speculative execution + cache timing to leak data
- Mitigations: kernel patches, microcode updates, KPTI (Kernel Page-Table Isolation)
- Performance cost: 5-30%
5. DevOps — OS Tuning
5.1 Linux tuning for high-performance servers
# /etc/sysctl.conf
# Network
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 16384
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_fin_timeout = 30
net.ipv4.ip_local_port_range = 10000 65535
net.ipv4.tcp_tw_reuse = 1
# File handles
fs.file-max = 2097152
# Memory
vm.swappiness = 1 # Avoid swap
vm.dirty_ratio = 10 # Flush dirty pages aggressively
vm.dirty_background_ratio = 5
# Apply
sysctl -p

5.2 Per-process limits (ulimit)
# /etc/security/limits.conf
* soft nofile 1048576
* hard nofile 1048576
* soft nproc 1048576
* hard nproc 1048576

5.3 Monitoring tools
| Tool | Use case |
|---|---|
| `top`, `htop` | CPU/memory by process |
| `vmstat` | Virtual memory stats |
| `iostat` | I/O statistics |
| `iotop` | I/O per process |
| `pidstat` | Per-process stats |
| `perf` | Performance profiling |
| `strace` | Syscall tracing |
| `ltrace` | Library call tracing |
| `bpftrace` / `bcc` | eBPF observability |
| `ss` / `netstat` | Network connections |
| `tcpdump` | Network packet capture |
| `lsof` | Open files |
| `sar` | Historical stats |
5.4 Brendan Gregg’s USE Method
For each resource, check:
- Utilization
- Saturation
- Errors
# CPU
top # Util
uptime # Saturation (load avg)
dmesg | grep -i error # Errors
# Memory
free -m # Util
vmstat 1 # Saturation (si/so)
dmesg | grep -i oom
# Disk
iostat -x 1 # Util (%util), Saturation (await)
# Network
sar -n DEV 1 # Util
ss -s # Saturation
ifconfig | grep errors

6. Code Examples
6.1 epoll server in C
// Simple TCP echo server with epoll
#include <sys/epoll.h>
#include <sys/socket.h>
// ... boilerplate
int main() {
int listen_fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
bind(listen_fd, ...);
listen(listen_fd, 1024);
int epfd = epoll_create1(0);
struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
struct epoll_event events[64];
while (1) {
int n = epoll_wait(epfd, events, 64, -1);
for (int i = 0; i < n; i++) {
int fd = events[i].data.fd;
if (fd == listen_fd) {
int client = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
} else {
char buf[1024];
int r = read(fd, buf, sizeof(buf)); // Renamed to avoid shadowing the event count
if (r > 0) write(fd, buf, r);
else { close(fd); epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL); }
}
}
}
}

6.2 Goroutine vs Thread benchmark
package main
import (
"fmt"
"runtime"
"sync"
"time"
)
func main() {
runtime.GOMAXPROCS(8)
const N = 1_000_000
var wg sync.WaitGroup
start := time.Now()
for i := 0; i < N; i++ {
wg.Add(1)
go func() {
defer wg.Done()
time.Sleep(100 * time.Millisecond) // Simulate I/O
}()
}
wg.Wait()
fmt.Printf("1M goroutines: %v\n", time.Since(start))
// Result: ~100-200ms (vs hours for 1M OS threads)
}

6.3 fsync benchmark
import os
import time
def benchmark_fsync(n=1000):
fd = os.open("test.dat", os.O_CREAT | os.O_WRONLY)
# Without fsync
start = time.time()
for _ in range(n):
os.write(fd, b"x" * 100)
elapsed = time.time() - start
print(f"No fsync: {n / elapsed:.0f} writes/s")
# With fsync
start = time.time()
for _ in range(n):
os.write(fd, b"x" * 100)
os.fsync(fd)
elapsed = time.time() - start
print(f"With fsync: {n / elapsed:.0f} writes/s")
os.close(fd)
benchmark_fsync()
# Typical results:
# No fsync: 500,000 writes/s
# With fsync: 200-2000 writes/s (disk dependent)

6.4 Memory-mapped file
import mmap
with open("data.bin", "r+b") as f:
mm = mmap.mmap(f.fileno(), 0)
# Access file as memory
print(mm[0:100])
mm[0:5] = b"HELLO"
mm.close()

7. System Design Diagrams
7.1 Process vs Thread vs Coroutine
```mermaid
flowchart LR
  subgraph Process["Process (heavy)"]
    P1[Process 1<br/>10MB]
    P2[Process 2<br/>10MB]
  end
  subgraph Thread["Threads (medium)"]
    TP[Process]
    T1[Thread 1<br/>1MB]
    T2[Thread 2<br/>1MB]
    T3[Thread 3<br/>1MB]
    TP --> T1
    TP --> T2
    TP --> T3
  end
  subgraph Coroutine["Coroutines (light)"]
    CP[Process]
    CT[OS Thread]
    C1[Coroutine 1<br/>2KB]
    C2[Coroutine 2<br/>2KB]
    C3[Coroutine 3<br/>2KB]
    Cn[... 1M coroutines]
    CP --> CT
    CT --> C1
    CT --> C2
    CT --> C3
    CT --> Cn
  end
```
7.2 Virtual Memory
```mermaid
flowchart TB
  subgraph Process["Process Address Space"]
    VP1[Virtual Page 0]
    VP2[Virtual Page 1]
    VP3[Virtual Page 2]
    VP4[Virtual Page 3]
  end
  subgraph PageTable["Page Table"]
    PT[V0→F5<br/>V1→F2<br/>V2→Disk<br/>V3→F8]
  end
  subgraph Physical["Physical RAM"]
    F1[Frame 0]
    F2[Frame 2]
    F5[Frame 5]
    F8[Frame 8]
  end
  subgraph Swap["Disk (Swap)"]
    SW[Swap Pages]
  end
  VP1 -.via PT.-> F5
  VP2 -.-> F2
  VP3 -.page fault.-> SW
  VP4 -.-> F8
  style VP3 fill:#ffcdd2
```
7.3 epoll Event Loop
```mermaid
sequenceDiagram
  participant App
  participant Kernel
  participant Net as Network
  App->>Kernel: epoll_create1()
  App->>Kernel: epoll_ctl(ADD, socket1)
  App->>Kernel: epoll_ctl(ADD, socket2)
  App->>Kernel: epoll_ctl(ADD, socket3)
  loop Event Loop
    App->>Kernel: epoll_wait(timeout)
    Net->>Kernel: Data on socket2
    Kernel-->>App: socket2 ready
    App->>Kernel: read(socket2)
    Kernel-->>App: data
    App->>App: process(data)
  end
```
7.4 fsync Path
```mermaid
sequenceDiagram
  participant App
  participant PC as Page Cache (RAM)
  participant Journal
  participant Disk
  App->>PC: write(fd, data)
  PC-->>App: returned (data in RAM)
  Note over App: Risk: crash → lose data
  App->>PC: fsync(fd)
  PC->>Journal: write metadata + data
  Journal->>Disk: physical write + flush
  Disk-->>Journal: written
  Journal-->>PC: synced
  PC-->>App: fsync returned
  Note over App,Disk: Now durable across crash
```
8. Aha Moments & Pitfalls
Aha Moments
#1: Virtual memory enables most “magic”. PagedAttention, mmap, copy-on-write fork, swap — all variations of paging.
#2: Coroutines = userspace scheduling. Cooperative, sub-microsecond switch. Why Go/Python asyncio handle 1M+ connections.
#3: fsync is the bottleneck. 100x slower than RAM write. Database commit throughput = fsync rate.
#4: epoll is OS magic for C10K. 1 thread, 100K connections. Foundation of Nginx, Redis, Node.js, Netty.
#5: Containers = namespaces + cgroups. Not VMs. Same kernel, just isolated view + resource limits.
#6: Page faults are expensive. Major fault = ms penalty. Avoid swap on critical services.
#7: Syscalls have cost (~500 ns). io_uring batches them → near-zero overhead. eBPF eliminates them entirely.
#8: TLB matters. Cache miss on virt→phys translation = 100ns. Huge pages reduce TLB pressure.
Pitfalls
Pitfall 1: Treating thread = process
Threads share memory → race conditions, deadlock. Need synchronization.
Pitfall 2: Blocking call in async loop
JavaScript: heavy CPU in handler → blocks event loop → all requests stall. Fix: Worker threads / offload.
Pitfall 3: Swap on database
Swap kicks in → query latency goes from microseconds to seconds. Fix: `vm.swappiness=1` or disable swap.
Pitfall 4: Default ulimits
`nofile=1024` → the server fails around 1000 connections. Fix: `ulimit -n 1048576` (see limits.conf in 5.2).
Pitfall 5: No graceful shutdown
Container killed mid-request → corrupt state. Fix: Handle SIGTERM, drain connections.
Pitfall 6: Ignoring memory fragmentation
Long-running Redis → 10GB allocated, 5GB used. Fix: jemalloc or tcmalloc.
Pitfall 7: Privileged containers
`--privileged` → full host access. Fix: grant specific capabilities only.
Pitfall 8: Process per request (CGI-style)
1000 req/s × ~10 ms fork overhead saturates the CPU. Fix: worker pool, persistent processes.
9. Internal Links
| Topic | Connects to |
|---|---|
| Tuan-Bonus-LLM-Serving-Infrastructure | PagedAttention = virtual memory paging |
| Tuan-Bonus-Edge-Wasm-Architecture | V8 isolate, Wasm sandbox use OS process model |
| Tuan-13-Monitoring-Observability | eBPF for kernel-level observability |
| Tuan-07-Database-Sharding-Replication | fsync, mmap, B-tree (storage) |
| Tuan-11-Microservices-Pattern | Containers, namespaces, cgroups |
| Tuan-Foundations-Computer-Architecture | Memory hierarchy, cache, NUMA |
| Tuan-Foundations-Database-Internals | Storage engines build on FS |
References
Books:
- Operating Systems: Three Easy Pieces (free) — http://pages.cs.wisc.edu/~remzi/OSTEP/
- The Linux Programming Interface (Michael Kerrisk)
- Computer Systems: A Programmer’s Perspective (CSAPP)
- Modern Operating Systems (Tanenbaum)
Online:
- Brendan Gregg’s perf tools — https://www.brendangregg.com/
- Linux Kernel Newbies — https://kernelnewbies.org/
- Linux man pages — https://man7.org/linux/man-pages/
Courses:
- MIT 6.S081 Operating Systems Engineering — https://pdos.csail.mit.edu/6.S081/
- CMU 15-410 — https://www.cs.cmu.edu/~410/
- Stanford CS140 — https://www.scs.stanford.edu/~zyedidia/cs140/
Next: Tuan-Foundations-Computer-Architecture — CPU, cache, memory, GPU.