Foundations: Operating Systems Essentials
“vLLM PagedAttention is not a new invention — it is virtual memory paging applied to the KV cache. Cloudflare Workers’ V8 isolates are process isolation in user space. Kafka exactly-once relies on fsync barriers. eBPF observability reads kernel syscalls. All the ‘magic’ of system design comes back to 5-6 core OS concepts. An architect who doesn’t understand the OS doesn’t understand the system they are designing.”
Tags: cs-foundations operating-systems fundamentals Student: Hieu (Backend Dev → Architect) Related: Tuan-Bonus-LLM-Serving-Infrastructure · Tuan-Bonus-Edge-Wasm-Architecture · Tuan-13-Monitoring-Observability · Tuan-Foundations-Computer-Architecture
1. Context & Why
Why does a backend dev need to understand the OS?
| Topic em đang học | OS concept underlying |
|---|---|
| vLLM PagedAttention | Virtual memory paging |
| K8s containers | namespaces, cgroups |
| Cloudflare Workers | Process / sandbox isolation |
| eBPF observability | Kernel syscalls, tracepoints |
| fsync trong DB | File system journaling |
| NodeJS event loop | epoll / kqueue |
| Go goroutines | M:N threading |
| Postgres connections | Process per connection vs threading |
| Redis single-thread | I/O multiplexing |
| NVMe IOPS | Block layer, page cache |
Key insight: the OS is the layer between the code you write and the hardware. Understanding it helps you debug, optimize, and design systems correctly.
Primary references
- Operating Systems: Three Easy Pieces (OSTEP) — free book — http://pages.cs.wisc.edu/~remzi/OSTEP/
- The Linux Programming Interface (Michael Kerrisk, 2010)
- Computer Systems: A Programmer’s Perspective (CSAPP, Bryant & O’Hallaron)
- Linux man pages — `man 2 syscalls`
- Brendan Gregg’s site — https://www.brendangregg.com/
2. Deep Dive — Core Concepts
2.1 Process, Thread, Coroutine
3 levels of execution unit:
2.1.1 Process
- Isolated address space (its own virtual memory)
- Heavy: ~10MB+ memory overhead per process
- Slow context switch: ~10 μs (TLB flush, page tables)
- Failure isolated: Process crash doesn’t affect others
- IPC needed for communication: pipes, sockets, shared memory
pid_t pid = fork();
if (pid == 0) {
// Child process
execvp("ls", args);
} else {
// Parent
waitpid(pid, &status, 0);
}

Examples:
- Each browser tab in Chrome (process per tab for isolation)
- PostgreSQL (process per connection by default)
- Apache prefork
2.1.2 Thread
- Shared address space within process
- Lighter: ~1MB stack overhead
- Faster context switch: ~1-2 μs
- NOT failure isolated: 1 thread crash → whole process down
- Direct memory access between threads (need synchronization)
pthread_t tid;
pthread_create(&tid, NULL, worker_fn, arg);
pthread_join(tid, NULL);

Examples:
- Java thread pool (Tomcat, Spring)
- Nginx thread pools (workers themselves are processes; threads handle blocking file I/O)
- The OS threads underneath Go's runtime (GOMAXPROCS)
2.1.3 Coroutine / Green Thread / Fiber
- User-space scheduling (no kernel involvement)
- Very light: ~2KB stack typical
- Sub-microsecond switch
- Cooperative: yields explicitly (await, yield)
- M:N model: M coroutines on N OS threads
go func() {
// Goroutine: M:N scheduled by Go runtime
process(req)
}()

async def handler():
    await db.query()  # cooperative yield

Examples:
- Go goroutines (M:N scheduler)
- Python asyncio
- Kotlin coroutines
- Rust async/await + Tokio
- Erlang/Elixir processes (BEAM VM)
2.1.4 Comparison
| | Process | Thread | Coroutine |
|---|---|---|---|
| Address space | Separate | Shared | Shared |
| Memory overhead | ~10 MB | ~1 MB | ~2 KB |
| Context switch | ~10 μs | ~1 μs | <1 μs |
| Communication | IPC | Shared memory | Channels |
| Crash isolation | ✅ Yes | ❌ No | ❌ No |
| Max instances | Thousands | Tens of thousands | Millions |
| Best for | Isolation, security | CPU-bound parallel | I/O-bound concurrent |
Architectural choice:
- Microservices = process boundary (security, scaling)
- Within service: threads or coroutines for concurrency
- High-concurrency I/O: coroutines (Go, Node, Python asyncio)
- CPU-bound: threads with thread pool
2.2 Virtual Memory
Most important OS concept for systems engineers.
2.2.1 The problem
- Physical RAM is limited & fragmented
- Multiple processes need memory
- Need isolation (process A can’t read process B’s memory)
- Need flexibility (allocate more than physical RAM)
2.2.2 Solution — Virtual Memory
Each process sees its own contiguous virtual address space (48-bit addresses on x86-64).
MMU (Memory Management Unit) translates virtual → physical via page tables.
Process View (virtual):
┌───────────────────────┐ 0xFFFF...
│ Stack (grows down) │
├───────────────────────┤
│ ↓ │
│ │
│ ↑ │
├───────────────────────┤
│ Heap (grows up) │
├───────────────────────┤
│ Data (globals) │
├───────────────────────┤
│ Text (code) │
└───────────────────────┘ 0x0
OS maps virtual pages to physical RAM pages
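You can see this layout directly. A minimal C sketch that prints one address from each region (exact values vary per run because of ASLR; see section 4.3):

```c
#include <stdio.h>
#include <stdlib.h>

int global_var = 42; // Data segment (globals)

int main(void) {
    int local_var = 0;            // Stack
    void *heap_ptr = malloc(64);  // Heap

    printf("Text  (code):   %p\n", (void *)main);
    printf("Data  (global): %p\n", (void *)&global_var);
    printf("Heap:           %p\n", heap_ptr);
    printf("Stack:          %p\n", (void *)&local_var);

    free(heap_ptr);
    return 0;
}
```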
2.2.3 Pages
- Memory divided into fixed-size pages (typically 4KB)
- Page table maps virtual page → physical frame
- Lookup is expensive → TLB (Translation Lookaside Buffer) caches recent translations
Virtual address: [Page Number | Offset]
↓
Page Table
↓
Physical address: [Frame Number | Offset]
TLB miss = ~100ns penalty (page table walk). TLB hit = ~1ns.
2.2.4 Page Faults
When process accesses page not in physical RAM:
| Type | Cause | Cost |
|---|---|---|
| Minor page fault | Page allocated but not mapped yet | ~1-10 μs |
| Major page fault | Page on disk (swap) | ~ms (swap in from disk) |
| Segfault | Invalid access | Process killed |
Implication: Page fault → context switch → severe latency hit.
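To make minor faults concrete, a small Linux-only sketch: freshly mmap'd anonymous memory is demand-paged, so the first touch of each page costs one minor fault, visible via getrusage():

```c
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE); // Typically 4096
    size_t n_pages = 1000;
    struct rusage before, after;

    char *buf = mmap(NULL, n_pages * page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    getrusage(RUSAGE_SELF, &before);
    for (size_t i = 0; i < n_pages; i++)
        buf[i * page] = 1; // First touch: one minor fault per page
    getrusage(RUSAGE_SELF, &after);

    printf("Minor faults: %ld (~%zu pages touched)\n",
           after.ru_minflt - before.ru_minflt, n_pages);
    return 0;
}
```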
# Monitor page faults
$ vmstat 1
procs ... swap io
r b si so bi bo
0 0 0 0 5 0 ← so > 0 = swapping (BAD)

2.2.5 Application: PagedAttention (vLLM)
vLLM PagedAttention (https://arxiv.org/abs/2309.06180) is directly inspired by virtual memory:
Naive KV cache:
Reserve max-length contiguous memory per request
→ 60-80% memory waste
PagedAttention:
Divide KV cache into 16-token blocks (= "pages")
Per-request "page table" maps logical positions → physical blocks
→ 4% memory waste
Same concept as OS virtual memory:
- Logical (per-request) → Physical (GPU memory)
- Page table per request
- Block-level allocation
→ Why it matters: understand virtual memory and you understand PagedAttention. See Tuan-Bonus-LLM-Serving-Infrastructure.
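As a toy illustration only (not vLLM's actual code; the BlockTable type and sizes here are hypothetical), the lookup above is literally page-table translation with 16-token pages:

```c
#include <stdio.h>

#define BLOCK_TOKENS 16  // vLLM uses 16-token KV blocks ("pages")

// Toy per-request block table: logical block index -> physical block id.
typedef struct {
    int physical_block[64]; // Hypothetical cap: 64 logical blocks per request
    int num_blocks;
} BlockTable;

// Translate a token position to (physical block, offset), exactly like a
// virtual address splitting into (page number, offset).
static void translate(const BlockTable *bt, int token_pos,
                      int *phys_block, int *offset) {
    int logical_block = token_pos / BLOCK_TOKENS; // "page number"
    *offset = token_pos % BLOCK_TOKENS;           // "offset"
    *phys_block = bt->physical_block[logical_block];
}

int main(void) {
    BlockTable bt = { .physical_block = {7, 2, 9}, .num_blocks = 3 };
    int pb, off;
    translate(&bt, 37, &pb, &off); // Token 37 -> logical block 2, offset 5
    printf("token 37 -> physical block %d, offset %d\n", pb, off);
    return 0;
}
```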
2.2.6 Swap
When physical RAM full, OS moves cold pages to swap space on disk.
Implications for production:
- Databases never want swap (query latency goes from ~1 ms to seconds)
- Set `vm.swappiness=1` on DB hosts
- Disable swap entirely on Kafka brokers and Cassandra
- Monitor `si`/`so` in `vmstat`
2.3 File Systems
2.3.1 Inodes & Files
- Inode: metadata structure (size, permissions, timestamps, block pointers)
- File: name → inode mapping (in directory)
- Block: smallest disk allocation unit (4KB typical)
Directory entry: "data.txt" → inode #12345
Inode #12345:
- Size: 8 KB
- Permissions: 644
- Block pointers: [block 100, block 101]
- Indirect block (for large files): block 200
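A small sketch of reading that metadata through stat(2) (assumes a data.txt exists, as in the example above):

```c
#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    struct stat st;
    if (stat("data.txt", &st) != 0) { perror("stat"); return 1; }

    printf("Inode:  %lu\n", (unsigned long)st.st_ino);
    printf("Size:   %lld bytes\n", (long long)st.st_size);
    printf("Blocks: %lld (512-byte units)\n", (long long)st.st_blocks);
    printf("Mode:   %o\n", st.st_mode & 0777);
    return 0;
}
```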
2.3.2 fsync — The Bottleneck
Default behavior: write goes to the page cache (RAM) and returns immediately. The OS flushes to disk later.
Problem: Crash before flush → lose data.
Solution: fsync(fd) — block until data on disk.
write(fd, data, len); // ~ μs (page cache only)
fsync(fd); // ~1-10 ms (write to disk)

Database WAL (Write-Ahead Log) uses fsync:
Postgres COMMIT:
1. Write WAL record
2. fsync WAL file
3. Reply "committed" to client
4. Apply changes to data files (later)
Performance implication: synchronous commit throughput is limited by disk fsync latency.
| Storage | fsync latency | Commits/sec |
|---|---|---|
| HDD | 5-15 ms | 100-200 |
| SATA SSD | 0.5-2 ms | 500-2000 |
| NVMe SSD | 0.05-0.5 ms | 2000-20000 |
| Persistent Memory (Optane) | 0.001 ms | 1M+ |
→ Why DB on NVMe matters: Sync commit throughput.
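A minimal sketch of the WAL commit pattern from above in C (illustrative, not Postgres source): append the record, fsync, and only then acknowledge the commit:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Durable commit: append record, fsync, then acknowledge.
int commit(int wal_fd, const char *record) {
    if (write(wal_fd, record, strlen(record)) < 0) return -1;
    if (fsync(wal_fd) < 0) return -1; // Block until on disk: the slow step
    return 0;                          // Only now is it safe to reply "committed"
}

int main(void) {
    int fd = open("wal.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (commit(fd, "INSERT INTO t VALUES (1);\n") == 0)
        printf("committed (durable)\n");
    close(fd);
    return 0;
}
```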
2.3.3 Journaling
Most modern FSes (ext4, XFS, ZFS) journal metadata:
- Write metadata change to journal
- fsync journal
- Apply metadata change
- Mark journal entry done
Crash recovery: Replay journal → consistent state.
Trade-off: 2x writes for metadata, but crash-safe.
2.3.4 Modern File Systems
| FS | Strength | Use case |
|---|---|---|
| ext4 | Mature, default Linux | General |
| XFS | Large files, parallel I/O | Databases, big data |
| ZFS | Checksumming, snapshots, COW | Storage servers, critical data |
| Btrfs | Snapshots, subvolumes | Containers |
| NFS | Network FS | Shared storage |
| EFS (AWS) | Elastic NFS | Cloud shared FS |
| FUSE | User-space FS | Custom (S3FS, etc.) |
2.4 I/O Models
How does code wait for I/O? Four models in practice (the classic taxonomy adds signal-driven I/O as a fifth):
2.4.1 Blocking I/O (default)
read(fd, buf, len); // Thread blocked until data ready

Pros: Simple. Cons: 1 thread per connection → thousands of threads → thousands of context switches.
2.4.2 Non-blocking I/O
fcntl(fd, F_SETFL, O_NONBLOCK);
int n = read(fd, buf, len); // Returns -1 + EAGAIN if no data

Pros: No blocking. Cons: Needs a busy loop or polling.
2.4.3 I/O Multiplexing (select / poll / epoll / kqueue)
One thread monitors many sockets. Kernel notifies when ready.
// Linux: epoll
int epfd = epoll_create1(0);
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
while (1) {
int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
for (int i = 0; i < n; i++) {
// Handle ready event
}
}

This is the C10K solution: 1 thread, 10,000 connections.
Implementations:
- Linux: `epoll` (best, O(1))
- macOS/BSD: `kqueue`
- Windows: IOCP
Used by: Nginx, Redis, Node.js (libuv), Netty, Tokio.
2.4.4 Async I/O (true async)
io_uring (Linux 5.1+, 2019) and Windows IOCP — completion-based (vs readiness-based for epoll).
Magic of io_uring:
- Submit ring (kernel)
- Completion ring
- Zero syscall overhead in fast path
- 2-3x faster than epoll for high IOPS
// Async read with Tokio (epoll-based); io_uring backends exist as
// separate crates such as tokio-uring
let stream = TcpStream::connect("...").await?;
let n = stream.read(&mut buf).await?;

2.4.5 Comparison
| Model | Threads | Concurrency | Best for |
|---|---|---|---|
| Blocking | 1 per conn | ~1K | Simple servers |
| Non-blocking + multiplexing (epoll) | 1 worker | 100K+ | Most servers (Nginx, Redis) |
| Async (io_uring) | 1-N workers | 1M+ | High IOPS storage |
| Thread pool + blocking | N (=cores) | 10K | CPU-bound + some I/O |
2.5 Process Scheduling
OS scheduler decides which process runs on which CPU.
2.5.1 Linux CFS (Completely Fair Scheduler)
- Virtual runtime (vruntime): each task accumulates time used
- Scheduler picks task with lowest vruntime
- Ensures fairness over time
- Default scheduler from Linux 2.6.23 until 6.6, when EEVDF replaced it
2.5.2 Real-time scheduling
- `SCHED_FIFO`, `SCHED_RR`: real-time priority
- Used for low-latency workloads (audio, video, trading)
- Risk: can starve other processes (see the sketch below)
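A minimal sketch of requesting SCHED_FIFO (requires root or CAP_SYS_NICE; the priority value 50 is arbitrary):

```c
#include <sched.h>
#include <stdio.h>

int main(void) {
    struct sched_param sp = { .sched_priority = 50 }; // 1-99 for SCHED_FIFO
    // Fails with EPERM without root / CAP_SYS_NICE.
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }
    printf("Running with real-time FIFO priority\n");
    return 0;
}
```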
2.5.3 CPU affinity
Pin process to specific cores:
cpu_set_t mask;
CPU_ZERO(&mask); // Clear the set first
CPU_SET(2, &mask); // Use core 2
sched_setaffinity(0, sizeof(mask), &mask);

Why pin?
- Keeps L1/L2 caches hot (no cross-core migration)
- NUMA locality
- Real-time deterministic latency
Examples:
- DPDK (Data Plane Development Kit) pins cores
- High-frequency trading
- Database fenced cores
2.5.4 cgroups (Control Groups)
Resource limits per group of processes:
# Limit memory to 1GB
echo 1073741824 > /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes
# Limit CPU to 50% of one core
echo 50000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us

(cgroup v1 paths shown.) Used by: Docker, K8s (resources.limits.cpu/memory).
2.6 Inter-Process Communication (IPC)
5 main IPC mechanisms:
2.6.1 Pipes / FIFOs
- Unidirectional byte stream
- Anonymous (parent-child) or named (FIFO)
- Used by shells: `cat | grep | sort` (see the C sketch below)
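A minimal parent-to-child pipe in C:

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    pipe(fds); // fds[0] = read end, fds[1] = write end

    if (fork() == 0) {              // Child: reads from the pipe
        close(fds[1]);
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
        buf[n] = '\0';
        printf("child got: %s\n", buf);
        _exit(0);
    }

    close(fds[0]);                  // Parent: writes into the pipe
    write(fds[1], "hello", 5);
    close(fds[1]);                  // EOF for the reader
    wait(NULL);
    return 0;
}
```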
2.6.2 Unix domain sockets
- Bidirectional, like TCP but local-only
- Faster than localhost TCP (no TCP/IP overhead)
- Used by Nginx ↔ PHP-FPM, Postgres, and Redis local sockets
2.6.3 Shared memory
- Fastest IPC (zero copy)
- Mmap a file or anonymous region into multiple processes
- Need synchronization (semaphores, futex)
int fd = shm_open("/myshm", O_CREAT | O_RDWR, 0600);
ftruncate(fd, SIZE);
void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);

Used by: PostgreSQL shared buffers (one region shared by all backend processes).
2.6.4 Message queues
- POSIX mq, System V msgq
- Less common in modern code (brokers like Kafka and Redis Streams have largely replaced them)
2.6.5 Signals
- Asynchronous notifications: SIGTERM, SIGKILL, SIGINT, SIGHUP
- Limited info (no payload)
import signal
def handler(signum, frame):
print("Graceful shutdown")
signal.signal(signal.SIGTERM, handler)

Best practice: containers should handle SIGTERM for graceful shutdown (drain connections, flush buffers).
2.7 Containers — namespaces + cgroups
Containers are NOT VMs. They are isolated processes using:
2.7.1 Linux namespaces
| Namespace | Isolates |
|---|---|
| `pid` | Process IDs (container sees its own PIDs) |
| `net` | Network interfaces |
| `mnt` | Mount points |
| `uts` | Hostname |
| `ipc` | IPC resources |
| `user` | User IDs |
| `cgroup` | cgroups view |
# Create new PID namespace
unshare --pid --fork --mount-proc bash
ps # Sees only its own processes

2.7.2 cgroups
Limit resources (CPU, memory, I/O, network).
Container = process(es) in a set of namespaces + cgroups limits + filesystem (overlay).
→ Why fast: No hypervisor, no separate OS kernel. Container = native Linux process with constraints.
2.8 Kernel vs User Space
Kernel space: privileged, direct hardware access. User space: applications, sandboxed.
Syscalls are bridge:
// User-space code
int fd = open("/etc/passwd", O_RDONLY);
// ^---- syscall: trap to kernel mode

Cost: a syscall is ~100-500 ns (mode switch).
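You can measure this yourself. A rough sketch that calls SYS_getpid directly to force a real syscall each iteration (libc may cache or inline some wrappers):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const long N = 1000000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        syscall(SYS_getpid); // One real kernel round-trip per iteration
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.0f ns per syscall\n", ns / N); // Typically ~100-500 ns
    return 0;
}
```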
Why eBPF revolutionary:
- Run sandboxed code in kernel without writing kernel module
- Zero syscall overhead
- Verifier ensures safety
Why io_uring: Submit/complete via shared rings, batch syscalls → near-zero overhead.
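A minimal read through io_uring via the liburing helper library (a sketch; Linux 5.1+, link with -luring):

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0); // Submission + completion rings, 8 entries

    int fd = open("/etc/hostname", O_RDONLY);
    char buf[256];

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0); // Queue the read
    io_uring_submit(&ring);                            // One syscall submits it

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                    // Wait for completion
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```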
2.9 Memory Allocators
User-space allocators (malloc/free) on top of kernel page allocator:
| Allocator | Used by | Property |
|---|---|---|
| glibc malloc (ptmalloc2) | Default Linux | Mature, complex |
| jemalloc | Redis, Cassandra, Firefox | Lower fragmentation |
| tcmalloc (Google) | Chrome, gRPC | Thread-cache, scalable |
| mimalloc (Microsoft) | Modern apps | Performance |
| Rust std alloc | Rust default | System allocator since 1.32 (jemalloc available via crate) |
Production tip: For long-running servers (DBs, caches), switch to jemalloc to reduce memory fragmentation. Redis docs explicitly recommend this.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./my-server

3. Practical Applications — OS in System Design
3.1 Why Redis is single-threaded
Redis primarily uses 1 thread + epoll:
- Each operation is in-memory (μs)
- Multi-threading would need locks → slow
- epoll handles 100K+ concurrent connections
- Background threads handle AOF fsync and lazy deletes
Throughput: 100K-1M ops/sec/instance.
3.2 Why Postgres uses processes
Postgres = process per connection (with pooling via PgBouncer):
- Crash isolation (1 conn crash doesn’t take down DB)
- Mature codebase
- Trade-off: ~10MB/connection → expensive
With PgBouncer: 10K clients → 100 backend connections.
3.3 Why Node.js single-threaded + libuv
Node.js = 1 main thread + libuv thread pool:
- Main thread: JS event loop (epoll-based via libuv)
- I/O offloaded to thread pool (4 threads default)
- CPU-bound work blocks event loop → use Worker Threads
3.4 Why Go’s M:N scheduler
Goroutines:
- M = OS threads (= GOMAXPROCS)
- N = Goroutines (millions possible)
- Cooperative + preemptive (since Go 1.14)
- Network poller built-in (epoll/kqueue)
→ Why Go great for backend: handle 1M concurrent connections with reasonable memory.
3.5 Why Kubernetes pods
Pod = set of containers sharing network namespace + IPC:
- Containers in the same pod can reach each other via localhost
- Shared volumes (mount namespace shared)
- Sidecar pattern: app + Envoy proxy in same pod
3.6 Why fsync is critical for DBs
Every database COMMIT involves fsync:
COMMIT in Postgres:
1. Write WAL record to OS buffer
2. fsync WAL file → wait for disk
3. Reply OK to client
Skip fsync (synchronous_commit = off) → 10x throughput, but lose committed data on crash.
3.7 Why mmap for memory-mapped files
void *data = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
// Now access the file as memory

Used by:
- LMDB, RocksDB partially
- Index files in Lucene/Elasticsearch
- MongoDB's legacy MMAPv1 storage engine
Pros: the OS handles caching, no double buffering. Cons: page faults are unpredictable (an access can stall on disk I/O) and errors surface as SIGBUS; Linus Torvalds has famously argued against mmap for some of these use cases.
4. Security First — OS-level Security
4.1 Principle of Least Privilege
- Don’t run services as root
- Use `setcap` for specific capabilities
- Drop privileges after init: e.g., bind port 80 as root, then drop to `nobody` (see the sketch below)
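A minimal privilege-drop sketch in C (simplified: production code should also call setgroups() to clear supplementary groups):

```c
#include <pwd.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // (Bind privileged port 80 here, while still root.)

    struct passwd *pw = getpwnam("nobody");
    if (!pw) { fprintf(stderr, "no such user\n"); return 1; }

    // Order matters: drop group first, then user, then verify.
    if (setgid(pw->pw_gid) != 0 || setuid(pw->pw_uid) != 0) {
        perror("drop privileges");
        return 1;
    }
    if (setuid(0) == 0) { // Must fail now; if not, the drop leaked
        fprintf(stderr, "privilege drop failed!\n");
        return 1;
    }
    printf("running as uid %d\n", getuid());
    return 0;
}
```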
4.2 Container security
- AppArmor / SELinux: Mandatory access control
- Seccomp: Syscall filtering (block dangerous syscalls; sketch below)
- User namespaces: Container UID 0 ≠ host UID 0
- Read-only root FS: `--read-only` Docker flag
- No privileged containers: only when absolutely needed
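A minimal seccomp sketch using libseccomp (link with -lseccomp); blocking execve is just an example rule:

```c
#include <errno.h>
#include <seccomp.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Allow everything by default, but make execve fail with EPERM.
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(execve), 0);
    seccomp_load(ctx);    // Filter is now active in the kernel
    seccomp_release(ctx); // Frees the userspace context only

    if (execlp("ls", "ls", NULL) < 0)
        perror("execlp"); // Expected: Operation not permitted
    return 0;
}
```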
4.3 Memory protection
- ASLR (Address Space Layout Randomization)
- DEP/NX (No-eXecute on data pages)
- Stack canaries
These are kernel + compiler features; modern Linux with gcc/clang enables all of them by default.
4.4 Side-channel attacks
- Spectre/Meltdown (2018): exploit speculative execution + cache timing to leak data
- Mitigations: kernel patches, microcode updates, KPTI (Kernel Page-Table Isolation)
- Performance cost: 5-30%
5. DevOps — OS Tuning
5.1 Linux tuning for high-performance servers
# /etc/sysctl.conf
# Network
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 16384
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_fin_timeout = 30
net.ipv4.ip_local_port_range = 10000 65535
net.ipv4.tcp_tw_reuse = 1
# File handles
fs.file-max = 2097152
# Memory
vm.swappiness = 1 # Avoid swap
vm.dirty_ratio = 10 # Flush dirty pages aggressively
vm.dirty_background_ratio = 5
# Apply
sysctl -p

5.2 Per-process limits (ulimit)
# /etc/security/limits.conf
* soft nofile 1048576
* hard nofile 1048576
* soft nproc 1048576
* hard nproc 1048576

5.3 Monitoring tools
| Tool | Use case |
|---|---|
| `top`, `htop` | CPU/memory by process |
| `vmstat` | Virtual memory stats |
| `iostat` | I/O statistics |
| `iotop` | I/O per process |
| `pidstat` | Per-process stats |
| `perf` | Performance profiling |
| `strace` | Syscall tracing |
| `ltrace` | Library call tracing |
| `bpftrace` / `bcc` | eBPF observability |
| `ss` / `netstat` | Network connections |
| `tcpdump` | Network packet capture |
| `lsof` | Open files |
| `sar` | Historical stats |
5.4 Brendan Gregg’s USE Method
For each resource, check:
- Utilization
- Saturation
- Errors
# CPU
top # Util
uptime # Saturation (load avg)
dmesg | grep -i error # Errors
# Memory
free -m # Util
vmstat 1 # Saturation (si/so)
dmesg | grep -i oom
# Disk
iostat -x 1 # Util (%util), Saturation (await)
# Network
sar -n DEV 1 # Util
ss -s # Saturation
ifconfig | grep errors

6. Code Examples
6.1 epoll server in C
// Simple TCP echo server with epoll
#include <sys/epoll.h>
#include <sys/socket.h>
// ... boilerplate
int main() {
int listen_fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
bind(listen_fd, ...);
listen(listen_fd, 1024);
int epfd = epoll_create1(0);
struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
struct epoll_event events[64];
while (1) {
int n = epoll_wait(epfd, events, 64, -1);
for (int i = 0; i < n; i++) {
int fd = events[i].data.fd;
if (fd == listen_fd) {
int client = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
} else {
char buf[1024];
int r = read(fd, buf, sizeof(buf)); // Renamed to avoid shadowing the event count
if (r > 0) write(fd, buf, r);
else { close(fd); epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL); }
}
}
}
}

6.2 Goroutine vs Thread benchmark
package main
import (
"fmt"
"runtime"
"sync"
"time"
)
func main() {
runtime.GOMAXPROCS(8)
const N = 1_000_000
var wg sync.WaitGroup
start := time.Now()
for i := 0; i < N; i++ {
wg.Add(1)
go func() {
defer wg.Done()
time.Sleep(100 * time.Millisecond) // Simulate I/O
}()
}
wg.Wait()
fmt.Printf("1M goroutines: %v\n", time.Since(start))
// Result: ~100-200ms (vs hours for 1M OS threads)
}

6.3 fsync benchmark
import os
import time
def benchmark_fsync(n=1000):
fd = os.open("test.dat", os.O_CREAT | os.O_WRONLY)
# Without fsync
start = time.time()
for _ in range(n):
os.write(fd, b"x" * 100)
elapsed = time.time() - start
print(f"No fsync: {n / elapsed:.0f} writes/s")
# With fsync
start = time.time()
for _ in range(n):
os.write(fd, b"x" * 100)
os.fsync(fd)
elapsed = time.time() - start
print(f"With fsync: {n / elapsed:.0f} writes/s")
os.close(fd)
benchmark_fsync()
# Typical results:
# No fsync: 500,000 writes/s
# With fsync: 200-2000 writes/s (disk dependent)

6.4 Memory-mapped file
import mmap
with open("data.bin", "r+b") as f:
mm = mmap.mmap(f.fileno(), 0)
# Access file as memory
print(mm[0:100])
mm[0:5] = b"HELLO"
mm.close()

7. System Design Diagrams
7.1 Process vs Thread vs Coroutine
```mermaid
flowchart LR
  subgraph Process["Process (heavy)"]
    P1[Process 1<br/>10MB]
    P2[Process 2<br/>10MB]
  end
  subgraph Thread["Threads (medium)"]
    TP[Process]
    T1[Thread 1<br/>1MB]
    T2[Thread 2<br/>1MB]
    T3[Thread 3<br/>1MB]
    TP --> T1
    TP --> T2
    TP --> T3
  end
  subgraph Coroutine["Coroutines (light)"]
    CP[Process]
    CT[OS Thread]
    C1[Coroutine 1<br/>2KB]
    C2[Coroutine 2<br/>2KB]
    C3[Coroutine 3<br/>2KB]
    Cn[... 1M coroutines]
    CP --> CT
    CT --> C1
    CT --> C2
    CT --> C3
    CT --> Cn
  end
```
7.2 Virtual Memory
```mermaid
flowchart TB
  subgraph Process["Process Address Space"]
    VP1[Virtual Page 0]
    VP2[Virtual Page 1]
    VP3[Virtual Page 2]
    VP4[Virtual Page 3]
  end
  subgraph PageTable["Page Table"]
    PT[V0→F5<br/>V1→F2<br/>V2→Disk<br/>V3→F8]
  end
  subgraph Physical["Physical RAM"]
    F1[Frame 0]
    F2[Frame 2]
    F5[Frame 5]
    F8[Frame 8]
  end
  subgraph Swap["Disk (Swap)"]
    SW[Swap Pages]
  end
  VP1 -.via PT.-> F5
  VP2 -.-> F2
  VP3 -.page fault.-> SW
  VP4 -.-> F8
  style VP3 fill:#ffcdd2
```
7.3 epoll Event Loop
```mermaid
sequenceDiagram
  participant App
  participant Kernel
  participant Net as Network
  App->>Kernel: epoll_create1()
  App->>Kernel: epoll_ctl(ADD, socket1)
  App->>Kernel: epoll_ctl(ADD, socket2)
  App->>Kernel: epoll_ctl(ADD, socket3)
  loop Event Loop
    App->>Kernel: epoll_wait(timeout)
    Net->>Kernel: Data on socket2
    Kernel-->>App: socket2 ready
    App->>Kernel: read(socket2)
    Kernel-->>App: data
    App->>App: process(data)
  end
```
7.4 fsync Path
```mermaid
sequenceDiagram
  participant App
  participant PC as Page Cache (RAM)
  participant Journal
  participant Disk
  App->>PC: write(fd, data)
  PC-->>App: returned (data in RAM)
  Note over App: Risk: crash → lose data
  App->>PC: fsync(fd)
  PC->>Journal: write metadata + data
  Journal->>Disk: physical write + flush
  Disk-->>Journal: written
  Journal-->>PC: synced
  PC-->>App: fsync returned
  Note over App,Disk: Now durable across crash
```
8. Aha Moments & Pitfalls
Aha Moments
#1: Virtual memory enables most “magic”. PagedAttention, mmap, copy-on-write fork, swap — all variations of paging.
#2: Coroutines = userspace scheduling. Cooperative, sub-microsecond switch. Why Go/Python asyncio handle 1M+ connections.
#3: fsync is the bottleneck. 100x slower than RAM write. Database commit throughput = fsync rate.
#4: epoll is OS magic for C10K. 1 thread, 100K connections. Foundation of Nginx, Redis, Node.js, Netty.
#5: Containers = namespaces + cgroups. Not VMs. Same kernel, just isolated view + resource limits.
#6: Page faults are expensive. Major fault = ms penalty. Avoid swap on critical services.
#7: Syscalls have cost (~500 ns). io_uring batches them → near-zero overhead. eBPF eliminates them entirely.
#8: TLB matters. Cache miss on virt→phys translation = 100ns. Huge pages reduce TLB pressure.
Pitfalls
Pitfall 1: Treating thread = process
Threads share memory → race conditions, deadlock. Need synchronization.
Pitfall 2: Blocking call in async loop
JavaScript: heavy CPU in handler → blocks event loop → all requests stall. Fix: Worker threads / offload.
Pitfall 3: Swap on database
Swap kicks in → query latency goes from microseconds to seconds. Fix: `vm.swappiness=1` or disable swap.
Pitfall 4: Default ulimits
`nofile=1024` → the server fails around 1000 connections. Fix: `ulimit -n 1048576` (see limits.conf in 5.2).
Pitfall 5: No graceful shutdown
Container killed mid-request → corrupt state. Fix: Handle SIGTERM, drain connections.
Pitfall 6: Ignoring memory fragmentation
Long-running Redis → 10GB allocated, 5GB used. Fix: jemalloc or tcmalloc.
Pitfall 7: Privileged containers
`--privileged` → full host access. Fix: grant specific capabilities only.
Pitfall 8: Process per request (CGI-style)
1000 req/s × ~10 ms fork overhead saturates the CPU. Fix: worker pool, persistent processes.
9. Internal Links
| Topic | Connects to |
|---|---|
| Tuan-Bonus-LLM-Serving-Infrastructure | PagedAttention = virtual memory paging |
| Tuan-Bonus-Edge-Wasm-Architecture | V8 isolate, Wasm sandbox use OS process model |
| Tuan-13-Monitoring-Observability | eBPF for kernel-level observability |
| Tuan-07-Database-Sharding-Replication | fsync, mmap, B-tree (storage) |
| Tuan-11-Microservices-Pattern | Containers, namespaces, cgroups |
| Tuan-Foundations-Computer-Architecture | Memory hierarchy, cache, NUMA |
| Tuan-Foundations-Database-Internals | Storage engines build on FS |
References
Books:
- Operating Systems: Three Easy Pieces (free) — http://pages.cs.wisc.edu/~remzi/OSTEP/
- The Linux Programming Interface (Michael Kerrisk)
- Computer Systems: A Programmer’s Perspective (CSAPP)
- Modern Operating Systems (Tanenbaum)
Online:
- Brendan Gregg’s perf tools — https://www.brendangregg.com/
- Linux Kernel Newbies — https://kernelnewbies.org/
- Linux man pages — https://man7.org/linux/man-pages/
Courses:
- MIT 6.S081 Operating Systems Engineering — https://pdos.csail.mit.edu/6.S081/
- CMU 15-410 — https://www.cs.cmu.edu/~410/
- Stanford CS140 — https://www.scs.stanford.edu/~zyedidia/cs140/
Next: Tuan-Foundations-Computer-Architecture — CPU, cache, memory, GPU.