Foundations: Computer Architecture for Architects
“DDIA Chapter 1: ‘A 10ms disk read = 10 million CPU cycles’. Understanding the memory hierarchy = understanding 90% of performance optimization. Why is Llama-70B inference bound by memory bandwidth rather than compute? Why does false sharing kill performance? Why is the H100 GPU expensive because of HBM3? An architect needs to understand the hardware they build on.”
Tags: cs-foundations computer-architecture performance fundamentals Student: Hieu (Backend Dev → Architect) Related: Tuan-Foundations-OS-Essentials · Tuan-Bonus-LLM-Serving-Infrastructure · Tuan-Bonus-Vector-Database-Internals
1. Context & Why
Why does an architect need to understand Computer Architecture?
| Production problem | Computer Architecture concept |
|---|---|
| Why Llama-70B serving tops out near ~50 tok/s/GPU | Memory bandwidth (HBM3 ~3.35 TB/s) |
| Why cache hits/misses matter ~100x for perf | L1 (1ns) vs DRAM (100ns) |
| Why “false sharing” tanks multithreading | Cache line invalidation |
| Why NUMA-aware scheduling matters | Memory locality across sockets |
| Why GPU ≫ CPU for ML | SIMD parallelism + HBM bandwidth |
| Why ARM Graviton is ~20% cheaper for the same perf | Different uarch, fewer transistors |
| Why io_uring outperforms epoll | Fewer mode switches |
| Why the LMAX Disruptor gets 100x throughput | Mechanical sympathy with caches |
Mechanical sympathy (Martin Thompson, LMAX): “Code that works WITH the hardware, not against it.” This requires hardware knowledge.
Key references
- Computer Systems: A Programmer’s Perspective (CSAPP) — Bryant & O’Hallaron — bible
- Computer Architecture: A Quantitative Approach (Hennessy & Patterson) — comprehensive
- What Every Programmer Should Know About Memory (Drepper, 2007) — https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
- Brendan Gregg’s performance site — https://www.brendangregg.com/
- Agner Fog’s optimization manuals — https://www.agner.org/optimize/
2. Deep Dive — Khái niệm cốt lõi
2.1 The Memory Hierarchy
Most important diagram in Computer Architecture:
Latency Size Cost/GB
┌────────────────┐
│ CPU Registers │ ~0.3 ns ~KB $$$$$
├────────────────┤
│ L1 Cache │ ~1 ns ~32-128 KB $$$$
├────────────────┤
│ L2 Cache │ ~3-4 ns ~256 KB-1MB $$$
├────────────────┤
│ L3 Cache │ ~10-15 ns ~4-64 MB $$
├────────────────┤
│ DRAM (Main) │ ~80-150 ns ~16 GB-2TB $
├────────────────┤
│ NVMe SSD │ ~50-100 μs ~512 GB-32TB ¢
├────────────────┤
│ HDD │ ~5-15 ms ~1-20 TB ¢
├────────────────┤
│ Network (DC) │ ~0.5 ms unlimited ¢
├────────────────┤
│ Network (cont)│ ~150 ms unlimited ¢
└────────────────┘
Key ratios:
- L1 to DRAM: 100x slower
- DRAM to NVMe: 500x slower
- NVMe to HDD: 100x slower
- DRAM to network DC: 5000x slower
Implication: Data locality > algorithmic complexity for performance.
Common quote (Jeff Dean): “If your code is dominated by memory access, big-O analysis lies to you.”
2.2 Cache Lines
Cache works in fixed-size chunks = cache lines (typically 64 bytes).
Reading 1 byte = entire 64-byte line loaded into cache.
Cache line:
| 0 | 1 | 2 | ... | 63 | ← 64 bytes
Implications:
- Spatial locality: accessing nearby memory is fast (already in cache)
- Sequential access ≫ random access
- Struct layout matters: hot fields together
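One way to make spatial locality concrete is to count how many distinct 64-byte lines an access pattern pulls into the cache. A minimal arithmetic sketch (hypothetical byte offsets; nothing here measures real hardware):

```python
# Count distinct 64-byte cache lines touched by an access pattern.
# Pure arithmetic sketch, no real cache is measured here.
CACHE_LINE = 64  # bytes, typical on x86 and most ARM cores

def lines_touched(byte_offsets):
    """Map each byte offset to its cache-line index and deduplicate."""
    return len({off // CACHE_LINE for off in byte_offsets})

# 1024 sequential 4-byte reads cover 4 KB: only 64 line fills
sequential = [i * 4 for i in range(1024)]
# 1024 reads at a 256-byte stride: a new line fill on every read
strided = [i * 256 for i in range(1024)]

print(lines_touched(sequential))  # 64
print(lines_touched(strided))     # 1024
```

Sequential access amortizes one DRAM fetch over 16 reads; the strided pattern pays a fetch per read, which is the arithmetic behind the benchmark in 2.2.1.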
2.2.1 Sequential vs Random access benchmark
```c
// Sequential (cache-friendly)
for (int i = 0; i < N; i++)
    sum += arr[i];          // ~0.5 ns/iter (L1 hit after the first line load)

// Random (cache-hostile)
for (int i = 0; i < N; i++)
    sum += arr[indices[i]]; // ~100 ns/iter (DRAM miss nearly every time)
```
~200x slower for random access. Same algorithmic complexity: O(N).
2.2.2 Why arrays > linked lists
Array: contiguous → cache-friendly → fast. Linked list: scattered → cache-miss → slow.
Even O(N) array iteration outperforms O(log N) random tree traversal for small N.
2.3 False Sharing — Multithreading Killer
Classic bug: Two threads write to different variables in same cache line → cache invalidation storm.
```c
struct Stats {
    int counter_A;   // bytes 0-3
    int counter_B;   // bytes 4-7
} stats;             // Both fields share one 64-byte cache line!

// Thread 1: stats.counter_A++
// Thread 2: stats.counter_B++
// Each write invalidates the other core's copy of the line
// → up to 100x slower than expected
```
Fix — padding to separate cache lines:
```c
struct Stats {
    int counter_A;
    char padding1[60];   // pad counter_A out to a full 64-byte line
    int counter_B;
    char padding2[60];
} __attribute__((aligned(64)));
```
Or use language constructs:
```rust
// Rust
#[repr(align(64))]
struct CacheAligned<T>(T);
```
```java
// Java (@Contended requires -XX:-RestrictContended)
@Contended
public class Counter { volatile long count; }
```
Real-world: the LMAX Disruptor (Martin Thompson) achieves 6M+ ops/sec partly via this kind of cache-aware design.
2.4 NUMA — Non-Uniform Memory Access
Modern multi-socket servers: each socket has local DRAM. Cross-socket access slower.
┌──────────────┐ ┌──────────────┐
│ CPU 0 │ ◄───► │ CPU 1 │ ← Cross-socket
│ (Socket 0) │ │ (Socket 1) │ ~2x slower
│ │ │ │
│ DRAM 0 │ │ DRAM 1 │
│ (local) │ │ (local) │
└──────────────┘ └──────────────┘
Latency:
- Local NUMA node: 100ns
- Remote NUMA node: 200-300ns
Implication: Pin process + memory to same NUMA node.
```bash
# Check NUMA topology
numactl --hardware

# Run a process bound to NUMA node 0 (CPU and memory)
numactl --cpunodebind=0 --membind=0 ./my-server
```
Used by: high-perf databases (Postgres, MySQL), the JVM (`-XX:+UseNUMA`), DPDK.
2.5 SIMD — Single Instruction Multiple Data
Modern CPUs can process multiple data points per instruction:
| Extension | Width | Year | Operations/cycle |
|---|---|---|---|
| MMX | 64-bit | 1997 | 2 × int32 |
| SSE | 128-bit | 1999 | 4 × float32 |
| AVX | 256-bit | 2011 | 8 × float32 |
| AVX-512 | 512-bit | 2016 | 16 × float32 |
```c
// Naive: ~8 cycles of scalar adds
float sum = 0;
for (int i = 0; i < 8; i++) sum += arr[i];

// SIMD with AVX: one 256-bit load covers all 8 floats
__m256 vec = _mm256_loadu_ps(arr);
// ...then _mm256_hadd_ps (or shuffles) for the horizontal sum
```
Used by:
- Numpy (uses BLAS / MKL)
- Image / video codecs
- Cryptography (AES-NI)
- Databases (vectorized execution: ClickHouse, DuckDB)
- ML inference (CPU paths)
2.6 Branch Prediction & Speculation
Pipeline: Modern CPU has 14-20 stage pipeline. Branch instructions are tricky.
Branch predictor: guesses which way branch goes.
- Hit: pipeline flows
- Miss: pipeline flushed, ~10-20 cycle penalty
```c
// Predictable: regular pattern, easy to predict
for (int i = 0; i < N; i++) {
    if (i % 2 == 0) { /* ... */ }   // alternating pattern, the predictor learns it
}

// Unpredictable: random, ~50% mispredict rate
for (int i = 0; i < N; i++) {
    if (random_bool()) { /* ... */ }
}
```
Branchless code (no `if`):
```c
// With a branch
if (a > b) max = a; else max = b;

// Branchless (the compiler often emits cmov for this anyway).
// (b - a) >> 31 is all-ones when b < a and zero otherwise,
// so the expression yields b when b >= a and a when b < a.
int max = b - ((b - a) & ((b - a) >> 31));
```
Production impact: sorted data ≫ unsorted for filtering operations because of the branch predictor.
2.7 The Roofline Model
Plot: arithmetic intensity (ops/byte) vs performance (FLOPS/sec).
Performance
▲
peak ┼────────── ← compute-bound region (limited by FLOPS)
│ /
│ /
│ /
│ / ← memory-bound region (limited by bandwidth)
│ /
│ /
│ /
└─────────────► Arithmetic Intensity (ops/byte)
Application: LLM inference:
- Decode phase: 1 op (multiply) per byte read from memory
- Memory bandwidth limit dominates → can’t speed up via more compute
- This is why HBM3 (3.35 TB/s) matters for H100
Application: ML training:
- Lots of matrix multiplications → high arithmetic intensity
- Compute-bound → benefit from more FLOPS
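The roofline boundary is just min(peak FLOPS, bandwidth × arithmetic intensity). A minimal sketch plugging in the H100 figures quoted in this section (illustrative numbers, not a vendor benchmark):

```python
def roofline(intensity, peak_flops, bandwidth):
    """Attainable FLOP/s for a kernel doing `intensity` FLOPs per byte moved."""
    return min(peak_flops, bandwidth * intensity)

PEAK = 989e12   # H100 FP16 peak, FLOP/s (figure from this section)
BW = 3.35e12    # HBM3 bandwidth, bytes/s

# The ridge point: intensity where memory-bound turns compute-bound
ridge = PEAK / BW  # ~295 FLOPs/byte

# LLM decode: ~1 FLOP per byte of weights read, deeply memory-bound
print(roofline(1, PEAK, BW))    # 3.35e12 FLOP/s, a tiny fraction of peak
# Dense training matmuls: hundreds of FLOPs per byte, compute-bound
print(roofline(400, PEAK, BW))  # 9.89e14 FLOP/s (peak)
```

Decode sits far left of the ridge point, so extra FLOPS are wasted there; only more bandwidth (or fewer bytes per token) helps.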
2.8 GPU Architecture
GPUs = thousands of simple cores, optimized for parallel SIMD.
2.8.1 NVIDIA A100 / H100 architecture
A100:
- 6,912 CUDA cores
- 432 Tensor cores (mixed precision)
- 80 GB HBM2e (2 TB/s bandwidth)
- 312 TFLOPS (FP16 Tensor Core), 624 TFLOPS with sparsity
H100 (2022):
- 14,592 CUDA cores
- 456 Tensor cores
- 80 GB HBM3 (3.35 TB/s bandwidth)
- 989 TFLOPS (FP16), 4 PFLOPS (FP8 with sparsity)
- Transformer Engine (FP8 with auto-scaling)
2.8.2 SM (Streaming Multiprocessor)
- 32 threads execute as warp (lock-step)
- All threads in warp execute same instruction
- Branch divergence within warp → serialize → slow
```c
// Good: all threads in the warp take the same path
if (thread_id < N) compute(thread_id);

// Bad: branch divergence (threads in the warp take different paths and serialize)
if (data[thread_id] > 0) { /* ... */ }
```
2.8.3 Memory hierarchy on GPU
Registers (per thread) ~1 cycle
↓
Shared memory (per SM) ~5-10 cycles
↓
L2 Cache (shared) ~30-50 cycles
↓
HBM (Global memory) ~200-400 cycles
Implication: Maximize shared memory usage. This is why FlashAttention matters: it keeps the attention computation in on-chip shared memory instead of round-tripping through HBM.
2.9 Storage Hierarchy
2.9.1 NVMe SSD
- PCIe 4.0 x4: ~7 GB/s sequential
- PCIe 5.0 x4 (2024+): ~14 GB/s sequential
- IOPS: 1M+ random reads
- Latency: ~50 μs
2.9.2 NVMe-oF (over fabric)
- NVMe over RDMA / TCP
- Disaggregated storage
- Sub-100 μs for remote NVMe
- Used by: AWS Nitro, Snowflake, Aurora
2.9.3 Persistent Memory (PMem / Optane)
- Intel Optane (discontinued 2022 but tech relevant)
- DRAM-like latency, persistent
- Used by: SAP HANA, Redis (modes), Postgres extensions
2.9.4 CXL (Compute Express Link)
- New (2022+) interconnect
- Disaggregated memory pools
- 128GB+ memory expansion via PCIe slot
- Slower than local DRAM, but far faster than NVMe
Future: CXL enables memory-centric (disaggregated-memory) architectures.
2.10 ARM vs x86 — Why Graviton matters
ARM (Graviton, Apple Silicon) vs x86 (Intel/AMD):
| | x86 | ARM |
|---|---|---|
| Architecture | CISC | RISC |
| Power efficiency | Lower | Higher (~30%) |
| Cost | Higher | Lower |
| Cores per chip | Fewer, complex | Many, simpler |
| Best for | Compute-heavy single thread | Throughput, parallel |
AWS Graviton 4 (2024):
- 96 cores @ 2.8 GHz
- 30% faster than previous gen
- 60% less energy than x86
- 20-40% cheaper for same perf
Migration: Most languages compile fine for ARM. Watch for native dependencies (e.g., Python packages with C extensions such as Pillow).
2.11 Latency Numbers — Updated 2024
Jeff Dean’s “Latency Numbers Every Programmer Should Know” (updated):
L1 cache reference 1 ns
Branch mispredict 5 ns
L2 cache reference 4 ns
Mutex lock/unlock 25 ns
Main memory reference 100 ns
Compress 1KB 3,000 ns (3 μs)
Send 1KB over 1 Gbps 10,000 ns (10 μs)
SSD random read 16,000 ns (16 μs)
NVMe random read 50,000 ns (50 μs)
Read 1MB sequential from RAM 250,000 ns (0.25 ms)
Read 1MB sequential from NVMe 500,000 ns (0.5 ms)
Round trip same datacenter 500,000 ns (0.5 ms)
Read 1MB sequential from HDD 5,000,000 ns (5 ms)
Disk seek 10,000,000 ns (10 ms)
Send packet CA→Netherlands 150,000,000 ns (150 ms)
Memorize ratios:
- L1 : DRAM = 1 : 100
- DRAM : NVMe = 1 : 500
- NVMe : HDD = 1 : 100
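These numbers turn directly into capacity estimates. A back-of-envelope sketch using the per-1MB sequential-read figures from the table above:

```python
# Back-of-envelope scan times using the per-1MB figures in the table above.
US_PER_MB = {"RAM": 250, "NVMe": 500, "HDD": 5_000}  # microseconds per sequential MB

def scan_seconds(size_gb, tier):
    """Seconds to read `size_gb` sequentially from the given tier."""
    return size_gb * 1024 * US_PER_MB[tier] / 1e6

for tier in US_PER_MB:
    print(f"10 GB scan from {tier}: {scan_seconds(10, tier):.2f} s")
# RAM ~2.56 s, NVMe ~5.12 s, HDD ~51.2 s. Random access on the
# lower tiers would be orders of magnitude worse still.
```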
3. Practical Applications
3.1 Why LLM serving is memory-bound
Llama-70B FP16 inference:
- Model: 140 GB
- Per token: read entire model weights once
- Bandwidth: 3.35 TB/s (H100 HBM3)
- Theoretical max throughput: 3,350 GB/s ÷ 140 GB = 24 tokens/s
→ Can’t go faster than memory bandwidth allows. Continuous batching helps because multiple requests share weight reads.
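The arithmetic generalizes: decode throughput is bounded by bandwidth ÷ bytes streamed per token, and batching amortizes the weight reads across requests. A minimal sketch with the numbers from this section (it deliberately ignores KV-cache traffic, which lowers the real bound):

```python
def max_tokens_per_s(model_gb, bandwidth_gb_s, batch_size=1):
    """Upper bound on aggregate decode tokens/s when each decode step
    streams the full weights once, shared across the whole batch.
    (Ignores KV-cache reads, so real throughput is lower.)"""
    return bandwidth_gb_s / model_gb * batch_size

# Llama-70B FP16 (~140 GB) on one H100 (HBM3 ~3350 GB/s)
print(max_tokens_per_s(140, 3350))                 # ~24 tok/s at batch 1
print(max_tokens_per_s(140, 3350, batch_size=32))  # ~766 tok/s aggregate
```

This is exactly why continuous batching raises aggregate throughput without making any single request faster.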
3.2 Why ClickHouse is fast
ClickHouse architectural choices:
- Columnar storage → cache-friendly (process column at a time)
- Vectorized execution → SIMD-accelerated
- JIT compilation for hot queries
- Late materialization — don’t read all columns
→ 10-100x faster than row-oriented OLAP databases.
3.3 Why memory bandwidth matters more than CPU GHz
Modern CPUs spend 50-80% of time waiting for memory. Adding more GHz doesn’t help.
Pareto rule for performance:
- Reduce cache misses
- Improve data layout
- Use SIMD
- Then optimize algorithm
- (Adding cores last)
3.4 LMAX Disruptor — Hardware-aware design
Disruptor (2010): Lock-free ring buffer that achieves 6M+ ops/sec.
Tricks:
- Single writer per slot → no synchronization
- Cache-line aligned slots → no false sharing
- Mechanical sympathy with CPU pipeline
- Sequence numbers (no shared lock)
Design pattern: When you NEED max throughput, design with cache in mind.
3.5 Storage tier strategy
| Tier | Storage | Use case | Cost/GB |
|---|---|---|---|
| Hot | DRAM (Redis) | Sub-ms reads | $5-10 |
| Warm | NVMe (Postgres) | Active data | $0.10-0.30 |
| Cool | SATA SSD | Less active | $0.03-0.10 |
| Cold | S3 Standard | Backups | $0.023 |
| Frozen | S3 Glacier | Archive | $0.001-0.004 |
Architecture pattern: Lakehouse uses this — hot data in compute cache, warm in S3 IA, cold in Glacier.
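To see why tiering pays, compare the cost of parking 1 TB at each tier using midpoints of the $/GB ranges in the table above (rough illustration only: the DRAM figure is hardware cost while the S3 figures are monthly, so treat the ratios, not the absolutes):

```python
# Midpoints of the $/GB figures from the tier table above (illustrative).
COST_PER_GB = {
    "Hot (DRAM)":       7.50,
    "Warm (NVMe)":      0.20,
    "Cool (SATA SSD)":  0.065,
    "Cold (S3)":        0.023,
    "Frozen (Glacier)": 0.0025,
}

def tier_cost(tb, tier):
    """Dollar cost of `tb` terabytes at the given tier."""
    return tb * 1024 * COST_PER_GB[tier]

for tier in COST_PER_GB:
    print(f"1 TB {tier}: ${tier_cost(1, tier):,.2f}")
# Hot-to-frozen spans roughly 3000x in cost: the reason tiering policies exist.
```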
4. Performance Testing & Profiling
4.1 Benchmark methodology
Trinity of benchmarking:
- Microbenchmarks: specific functions (`criterion` in Rust, JMH in Java)
- Synthetic load tests: tools like `wrk`, `vegeta`, `k6`
- Production shadow traffic: real-world patterns
Pitfalls:
- Warmup needed (JIT, cache, branch predictor)
- Don’t average — use percentiles
- Watch for thermal throttling
- Disable turbo boost for reproducibility
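“Don’t average — use percentiles” is easy to demonstrate: a single slow outlier barely moves the median but dominates the mean. A minimal sketch on a synthetic latency sample:

```python
import statistics

# Synthetic latency sample: 99 requests at 1 ms plus one 500 ms stall
# (GC pause, cold cache, page fault: pick your poison).
latencies_ms = [1.0] * 99 + [500.0]

mean = statistics.mean(latencies_ms)                  # 5.99 ms
p50 = statistics.median(latencies_ms)                 # 1.0 ms
p99 = statistics.quantiles(latencies_ms, n=100)[98]   # ~495 ms

print(f"mean={mean:.2f} ms  p50={p50:.2f} ms  p99={p99:.2f} ms")
# The mean says "about 6 ms"; the percentiles say the typical request
# takes 1 ms while roughly 1 in 100 takes half a second.
```

Report p50/p95/p99/max in benchmarks; a mean alone hides exactly the tail your users notice.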
4.2 perf — Linux profiling
```bash
# CPU profiling
perf record -F 99 -g ./my-app   # Sample at 99 Hz, with stacks
perf report                     # Interactive view

# Cache miss profiling
perf stat -e cache-misses,cache-references ./my-app

# Cycle attribution
perf annotate -d ./my-app
```
4.3 Brendan Gregg’s flame graphs
```bash
# Capture 30 seconds of stacks, all CPUs, at 99 Hz
perf record -F 99 -ag -- sleep 30

# Generate the flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
```
→ Visual: y-axis = stack depth; frame width = share of samples (≈ time spent); the x-axis ordering is alphabetical, not chronological.
4.4 eBPF-based profiling
```bash
# CPU profile via eBPF (no perf needed)
bpftrace -e 'profile:hz:99 /pid == 1234/ { @[ustack] = count(); }'

# Off-CPU analysis
offcputime -p 1234

# Cache misses by function
llcstat -p 1234
```
Tools: BCC, bpftrace, Pixie, Pyroscope, Parca.
5. Architecture Implications
5.1 When CPU bound
Symptoms: high CPU%, low waiting%, performance scales with cores.
Optimizations:
- Profile, optimize hot loops
- Use SIMD where possible
- Multithreading
- Better algorithm
5.2 When memory bandwidth bound
Symptoms: CPU% medium, but adding cores doesn’t help.
Optimizations:
- Reduce data size (compression, quantization)
- Batch operations
- Improve locality (data layout)
- Use cache-aware algorithms
Examples: LLM inference, in-memory analytics, big data scans.
5.3 When I/O bound
Symptoms: low CPU, high iowait.
Optimizations:
- Async I/O (epoll, io_uring)
- Connection pooling
- Reduce I/O (cache, batch)
- Faster storage (NVMe)
5.4 When network bound
Symptoms: txqueuelen saturated, retransmits.
Optimizations:
- Compression
- HTTP/2 multiplexing
- Connection reuse
- CDN / edge
6. Code Examples
6.1 Cache-friendly vs hostile
```c
// Cache-friendly: row-major iteration (matches C's memory layout)
int matrix[N][N];
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        matrix[i][j] = i + j;
// ~10x faster than column-major!

// Cache-hostile: column-major iteration = stride-N access
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        matrix[i][j] = i + j;
```
6.2 False sharing demo
```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// BAD: false sharing
struct BadCounter {
    a: AtomicU64, // 8 bytes
    b: AtomicU64, // 8 bytes — same cache line!
}

// GOOD: each counter aligned to its own 64-byte line
#[repr(align(64))]
struct PaddedAtomic(AtomicU64);

struct GoodCounter {
    a: PaddedAtomic,
    b: PaddedAtomic,
}

// Benchmark sketch: spawn two threads, one doing
// a.fetch_add(1, Ordering::Relaxed) in a tight loop, the other
// hammering b, and time both layouts.
```
Result: the padded layout is 5-10x faster than the naive one when two threads contend.
6.3 SIMD with Rust
use std::simd::*;
fn sum_simd(arr: &[f32]) -> f32 {
let chunks = arr.chunks_exact(8);
let remainder = chunks.remainder();
let sum_vec = chunks
.map(|chunk| f32x8::from_slice(chunk))
.fold(f32x8::splat(0.0), |acc, v| acc + v);
let sum_remainder: f32 = remainder.iter().sum();
sum_vec.reduce_sum() + sum_remainder
}
// 4-8x faster than scalar sum6.4 NUMA-aware allocation
```c
#include <numa.h>

if (numa_available() >= 0) {
    // Allocate on the local NUMA node
    void *mem = numa_alloc_local(SIZE);
    // ...
    numa_free(mem, SIZE);
}
```
7. System Design Diagrams
7.1 Memory Hierarchy
```mermaid
flowchart TB
    CPU[CPU Core]
    CPU <-->|0.3ns| Reg[Registers]
    Reg <-->|1ns| L1[L1 Cache<br/>32-128 KB]
    L1 <-->|3ns| L2[L2 Cache<br/>256 KB-1 MB]
    L2 <-->|10ns| L3[L3 Cache<br/>4-64 MB shared]
    L3 <-->|100ns| DRAM[DRAM<br/>16 GB-2 TB]
    DRAM <-->|50μs| NVMe[NVMe SSD]
    NVMe <-->|5ms| HDD[HDD]
    NVMe <-->|0.5ms| NetDC[Network DC]
    NetDC <-->|150ms| NetCont[Network Cross-Continent]
    style Reg fill:#1b5e20,color:#fff
    style L1 fill:#2e7d32,color:#fff
    style L2 fill:#43a047,color:#fff
    style L3 fill:#66bb6a,color:#fff
    style DRAM fill:#a5d6a7,color:#000
    style NVMe fill:#fff9c4,color:#000
    style HDD fill:#ffe0b2,color:#000
```
7.2 NUMA Architecture
```mermaid
flowchart LR
    subgraph S0["Socket 0"]
        CPU0[CPU 0<br/>cores 0-7]
        DRAM0[DRAM 0<br/>local]
        CPU0 <--> DRAM0
    end
    subgraph S1["Socket 1"]
        CPU1[CPU 1<br/>cores 8-15]
        DRAM1[DRAM 1<br/>local]
        CPU1 <--> DRAM1
    end
    CPU0 <-.QPI/UPI<br/>cross-socket.-> CPU1
    CPU0 <-.slower.-> DRAM1
    CPU1 <-.slower.-> DRAM0
    style CPU0 fill:#bbdefb
    style CPU1 fill:#c8e6c9
```
7.3 GPU Architecture (Simplified)
```mermaid
flowchart TB
    subgraph GPU["NVIDIA H100 SXM5"]
        subgraph SMs["132 Streaming Multiprocessors"]
            SM1[SM 1<br/>128 cores<br/>Tensor cores]
            SM2[...]
            SM132[SM 132]
        end
        L2[L2 Cache<br/>50 MB shared]
        HBM[HBM3<br/>80 GB<br/>3.35 TB/s]
        SMs --> L2
        L2 --> HBM
    end
    NVLink[NVLink 4<br/>900 GB/s<br/>to other GPUs]
    PCIe[PCIe 5<br/>128 GB/s<br/>to CPU]
    GPU --> NVLink
    GPU --> PCIe
```
7.4 Roofline Model
```mermaid
flowchart LR
    subgraph Compute["Compute-Bound Region"]
        C1[High arithmetic<br/>intensity]
        C2[Performance ≈ peak FLOPS]
        C3[Examples: ML training,<br/>scientific compute]
    end
    subgraph Memory["Memory-Bound Region"]
        M1[Low arithmetic<br/>intensity]
        M2[Performance ≈ bandwidth × intensity]
        M3[Examples: LLM decode,<br/>analytics scans]
    end
    Compute -.compute optimization.-> M1
    Memory -.bandwidth optimization.-> C1
    style Compute fill:#c8e6c9
    style Memory fill:#fff9c4
```
8. Aha Moments & Pitfalls
Aha Moments
#1: Memory hierarchy jumps are 100-500x at each level: L1 to DRAM (100x), DRAM to NVMe (500x), NVMe to HDD (100x). Architecture decisions = deciding where data lives.
#2: Modern CPUs spend 50-80% time waiting for memory. Adding GHz doesn’t help. Reduce cache misses → reduce data size → improve locality.
#3: LLM inference = memory bandwidth bound. 3.35 TB/s ÷ 140 GB = 24 tok/s theoretical max. Continuous batching shares weight reads.
#4: False sharing is silent killer. Two threads writing different vars in same cache line = serialization. Pad to 64 bytes.
#5: NUMA matters at scale. 2-socket server: pin process + memory to same node. Saves 50%+ memory latency.
#6: SIMD can give 4-16x speedup for free. Compiler auto-vectorizes simple loops. Vectorized DBs (DuckDB, ClickHouse) win because of this.
#7: Branch predictor matters. Sorted data >> unsorted. ~20 cycle penalty per misprediction.
#8: GPU is parallel SIMD on steroids. 14K cores in lockstep. Branch divergence kills GPU perf.
Pitfalls
Pitfall 1: O(N) can beat O(log N) for small N
Pointer-chasing O(log N) structures (trees, skip lists) are often slower than an O(N) array scan because of cache misses. Fix: Profile the actual workload, not just Big O.
Pitfall 2: Same cache line struct fields
Multi-threaded counters in same struct → false sharing. Fix: Pad to 64 bytes between thread-local fields.
Pitfall 3: Random access patterns
Hash map with poor distribution → cache miss every lookup. Fix: Better hash, or use array-based DS for small data.
Pitfall 4: Over-rely on CPU GHz
“Just buy faster CPU” — but workload is memory-bound. Fix: Profile, find real bottleneck.
Pitfall 5: NUMA-blind allocation
JVM defaults can spread memory across NUMA nodes → 2x slower. Fix: `-XX:+UseNUMA` for the JVM, `numactl` for processes.
Pitfall 6: Thermal throttling in benchmarks
Hot CPU clocks down → benchmarks unreliable. Fix: Disable turbo boost, monitor temps, use sustained workload.
Pitfall 7: Sequential vs parallel mix
Add cores but workload sequential → no improvement (Amdahl’s law). Fix: Profile parallel speedup curve before scaling.
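Amdahl’s law makes that ceiling concrete: speedup = 1 / ((1 − p) + p/n) for a parallel fraction p on n cores. A quick sketch of the arithmetic:

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n cores when a fraction p of the work parallelizes."""
    return 1 / ((1 - p) + p / n)

# 90% parallel workload: 64 cores deliver under 9x...
print(round(amdahl_speedup(0.90, 64), 2))     # 8.77
# ...and the ceiling as n grows without bound is 1/(1-p) = 10x
print(round(amdahl_speedup(0.90, 10**9), 2))  # 10.0
```

Measure the speedup curve on 2, 4, 8 cores first; extrapolating it tells you whether more cores can ever pay off.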
Pitfall 8: Forgetting the TLB
Random access across a huge working set → TLB misses → ~100ns each. Fix: Huge pages (`madvise(MADV_HUGEPAGE)`), Transparent HugePages.
Pitfall 9: GPU branch divergence
Naive CUDA code with `if/else` → warp serialization. Fix: Algorithm redesign, predicated execution.
Pitfall 10: Assuming HDD = SSD
Algorithms designed for sequential HDD access behave poorly on random-capable NVMe (and vice versa). Fix: Match the algorithm to the storage characteristics.
9. Internal Links
| Topic | Connects to |
|---|---|
| Tuan-Foundations-OS-Essentials | Virtual memory uses MMU + caches |
| Tuan-Bonus-LLM-Serving-Infrastructure | LLM = memory-bandwidth bound (HBM3) |
| Tuan-Bonus-Vector-Database-Internals | Vector search uses SIMD + cache |
| Tuan-Foundations-Database-Internals | Storage hierarchy, B-tree vs LSM |
| Case-Design-Stock-Exchange | LMAX Disruptor, mechanical sympathy |
| Tuan-13-Monitoring-Observability | perf, eBPF for profiling |
References
Books:
- Computer Systems: A Programmer’s Perspective (CSAPP, Bryant & O’Hallaron)
- Computer Architecture: A Quantitative Approach (Hennessy & Patterson)
- Systems Performance (Brendan Gregg)
- The Art of Multiprocessor Programming (Herlihy & Shavit)
Papers:
- Drepper, What Every Programmer Should Know About Memory — https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
- Roofline model paper (Williams et al.)
Online:
- Brendan Gregg — https://www.brendangregg.com/
- Agner Fog — https://www.agner.org/optimize/
- LMAX Disruptor — https://lmax-exchange.github.io/disruptor/
Courses:
- CMU 15-418 Parallel Computer Architecture — http://www.cs.cmu.edu/~418/
- MIT 6.172 Performance Engineering — https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/
Next: Tuan-Foundations-Database-Internals — storage engines (B-tree, LSM), MVCC, query optimizer.