Foundations: Computer Architecture for Architects
“DDIA Chapter 1: ‘A 10ms disk read = 10 million CPU cycles’. Understanding the memory hierarchy = understanding 90% of performance optimization. Why is Llama-70B inference bound by memory bandwidth rather than compute? Why does false sharing kill performance? Why is the H100 GPU expensive because of HBM3? An architect needs to understand the hardware they build on.”
Tags: cs-foundations computer-architecture performance fundamentals Student: Hieu (Backend Dev → Architect) Related: Tuan-Foundations-OS-Essentials · Tuan-Bonus-LLM-Serving-Infrastructure · Tuan-Bonus-Vector-Database-Internals
1. Context & Why
Why does an architect need to understand Computer Architecture?
| Production problem | Computer Architecture concept |
|---|---|
| Why Llama-70B serving tops out near ~50 tok/s/GPU | Memory bandwidth (HBM3 ~3.35 TB/s) |
| Why cache hits/misses matter ~100x for perf | L1 (1ns) vs DRAM (100ns) |
| Why “false sharing” tanks multithreading | Cache line invalidation |
| Why NUMA-aware scheduling matters | Memory locality across sockets |
| Why GPU ≫ CPU for ML | SIMD parallelism + HBM bandwidth |
| Why ARM Graviton is ~20% cheaper for the same perf | Different uarch, fewer transistors |
| Why io_uring outperforms epoll | Fewer mode switches |
| Why the LMAX Disruptor gets 100x throughput | Mechanical sympathy with caches |
Mechanical sympathy (Martin Thompson, LMAX): “Code that works WITH the hardware, not against it.” This requires hardware knowledge.
Key references
- Computer Systems: A Programmer’s Perspective (CSAPP) — Bryant & O’Hallaron — bible
- Computer Architecture: A Quantitative Approach (Hennessy & Patterson) — comprehensive
- What Every Programmer Should Know About Memory (Drepper, 2007) — https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
- Brendan Gregg’s performance site — https://www.brendangregg.com/
- Agner Fog’s optimization manuals — https://www.agner.org/optimize/
2. Deep Dive — Khái niệm cốt lõi
2.1 The Memory Hierarchy
Most important diagram in Computer Architecture:
Latency Size Cost/GB
┌────────────────┐
│ CPU Registers │ ~0.3 ns ~KB $$$$$
├────────────────┤
│ L1 Cache │ ~1 ns ~32-128 KB $$$$
├────────────────┤
│ L2 Cache │ ~3-4 ns ~256 KB-1MB $$$
├────────────────┤
│ L3 Cache │ ~10-15 ns ~4-64 MB $$
├────────────────┤
│ DRAM (Main) │ ~80-150 ns ~16 GB-2TB $
├────────────────┤
│ NVMe SSD │ ~50-100 μs ~512 GB-32TB ¢
├────────────────┤
│ HDD │ ~5-15 ms ~1-20 TB ¢
├────────────────┤
│ Network (DC) │ ~0.5 ms unlimited ¢
├────────────────┤
│ Network (cont)│ ~150 ms unlimited ¢
└────────────────┘
Key ratios:
- L1 to DRAM: 100x slower
- DRAM to NVMe: 500x slower
- NVMe to HDD: 100x slower
- DRAM to network DC: 5000x slower
Implication: Data locality > algorithmic complexity for performance.
Common quote (Jeff Dean): “If your code is dominated by memory access, big-O analysis lies to you.”
2.2 Cache Lines
Cache works in fixed-size chunks = cache lines (typically 64 bytes).
Reading 1 byte = entire 64-byte line loaded into cache.
Cache line:
| 0 | 1 | 2 | ... | 63 | ← 64 bytes
Implications:
- Spatial locality: accessing nearby memory is fast (already in cache)
- Sequential access ≫ random access
- Struct layout matters: hot fields together
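One way to make spatial locality concrete is to count how many distinct 64-byte lines an access pattern pulls into the cache. A minimal arithmetic sketch (hypothetical byte offsets; nothing here measures real hardware):

```python
# Count distinct 64-byte cache lines touched by an access pattern.
# Pure arithmetic sketch, no real cache is measured here.
CACHE_LINE = 64  # bytes, typical on x86 and most ARM cores

def lines_touched(byte_offsets):
    """Map each byte offset to its cache-line index and deduplicate."""
    return len({off // CACHE_LINE for off in byte_offsets})

# 1024 sequential 4-byte reads cover 4 KB: only 64 line fills
sequential = [i * 4 for i in range(1024)]
# 1024 reads at a 256-byte stride: a new line fill on every read
strided = [i * 256 for i in range(1024)]

print(lines_touched(sequential))  # 64
print(lines_touched(strided))     # 1024
```

Sequential access amortizes one DRAM fetch over 16 reads; the strided pattern pays a fetch per read, which is the arithmetic behind the benchmark in 2.2.1.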
2.2.1 Sequential vs Random access benchmark
```c
// Sequential (cache-friendly)
for (int i = 0; i < N; i++)
    sum += arr[i];          // ~0.5 ns/iter (L1 hit after the first line load)

// Random (cache-hostile)
for (int i = 0; i < N; i++)
    sum += arr[indices[i]]; // ~100 ns/iter (DRAM miss nearly every time)
```
~200x slower for random access. Same algorithmic complexity: O(N).
2.2.2 Why arrays > linked lists
Array: contiguous → cache-friendly → fast. Linked list: scattered → cache-miss → slow.
Even O(N) array iteration outperforms O(log N) random tree traversal for small N.
2.3 False Sharing — Multithreading Killer
Classic bug: Two threads write to different variables in same cache line → cache invalidation storm.
```c
struct Stats {
    int counter_A;   // bytes 0-3
    int counter_B;   // bytes 4-7
} stats;             // Both fields share one 64-byte cache line!

// Thread 1: stats.counter_A++
// Thread 2: stats.counter_B++
// Each write invalidates the other core's copy of the line
// → up to 100x slower than expected
```
Fix — padding to separate cache lines:
```c
struct Stats {
    int counter_A;
    char padding1[60];   // pad counter_A out to a full 64-byte line
    int counter_B;
    char padding2[60];
} __attribute__((aligned(64)));
```
Or use language constructs:
```rust
// Rust
#[repr(align(64))]
struct CacheAligned<T>(T);
```
```java
// Java (@Contended requires -XX:-RestrictContended)
@Contended
public class Counter { volatile long count; }
```
Real-world: the LMAX Disruptor (Martin Thompson) achieves 6M+ ops/sec partly via this kind of cache-aware design.
2.4 NUMA — Non-Uniform Memory Access
Modern multi-socket servers: each socket has local DRAM. Cross-socket access slower.
┌──────────────┐ ┌──────────────┐
│ CPU 0 │ ◄───► │ CPU 1 │ ← Cross-socket
│ (Socket 0) │ │ (Socket 1) │ ~2x slower
│ │ │ │
│ DRAM 0 │ │ DRAM 1 │
│ (local) │ │ (local) │
└──────────────┘ └──────────────┘
Latency:
- Local NUMA node: 100ns
- Remote NUMA node: 200-300ns
Implication: Pin process + memory to same NUMA node.
```bash
# Check NUMA topology
numactl --hardware

# Run a process bound to NUMA node 0 (CPU and memory)
numactl --cpunodebind=0 --membind=0 ./my-server
```
Used by: high-perf databases (Postgres, MySQL), the JVM (`-XX:+UseNUMA`), DPDK.
2.5 SIMD — Single Instruction Multiple Data
Modern CPUs can process multiple data points per instruction:
| Extension | Width | Year | Operations/cycle |
|---|---|---|---|
| MMX | 64-bit | 1997 | 2 × int32 |
| SSE | 128-bit | 1999 | 4 × float32 |
| AVX | 256-bit | 2011 | 8 × float32 |
| AVX-512 | 512-bit | 2016 | 16 × float32 |
```c
// Naive: ~8 cycles of scalar adds
float sum = 0;
for (int i = 0; i < 8; i++) sum += arr[i];

// SIMD with AVX: one 256-bit load covers all 8 floats
__m256 vec = _mm256_loadu_ps(arr);
// ...then _mm256_hadd_ps (or shuffles) for the horizontal sum
```
Used by:
- Numpy (uses BLAS / MKL)
- Image / video codecs
- Cryptography (AES-NI)
- Databases (vectorized execution: ClickHouse, DuckDB)
- ML inference (CPU paths)
2.6 Branch Prediction & Speculation
Pipeline: Modern CPU has 14-20 stage pipeline. Branch instructions are tricky.
Branch predictor: guesses which way branch goes.
- Hit: pipeline flows
- Miss: pipeline flushed, ~10-20 cycle penalty
```c
// Predictable: regular pattern, easy to predict
for (int i = 0; i < N; i++) {
    if (i % 2 == 0) { /* ... */ }   // alternating pattern, the predictor learns it
}

// Unpredictable: random, ~50% mispredict rate
for (int i = 0; i < N; i++) {
    if (random_bool()) { /* ... */ }
}
```
Branchless code (no `if`):
```c
// With a branch
if (a > b) max = a; else max = b;

// Branchless (the compiler often emits cmov for this anyway).
// (b - a) >> 31 is all-ones when b < a and zero otherwise,
// so the expression yields b when b >= a and a when b < a.
int max = b - ((b - a) & ((b - a) >> 31));
```
Production impact: sorted data ≫ unsorted for filtering operations because of the branch predictor.
2.7 The Roofline Model
Plot: arithmetic intensity (ops/byte) vs performance (FLOPS/sec).
Performance
▲
peak ┼────────── ← compute-bound region (limited by FLOPS)
│ /
│ /
│ /
│ / ← memory-bound region (limited by bandwidth)
│ /
│ /
│ /
└─────────────► Arithmetic Intensity (ops/byte)
Application: LLM inference:
- Decode phase: 1 op (multiply) per byte read from memory
- Memory bandwidth limit dominates → can’t speed up via more compute
- This is why HBM3 (3.35 TB/s) matters for H100
Application: ML training:
- Lots of matrix multiplications → high arithmetic intensity
- Compute-bound → benefit from more FLOPS
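The roofline boundary is just min(peak FLOPS, bandwidth × arithmetic intensity). A minimal sketch plugging in the H100 figures quoted in this section (illustrative numbers, not a vendor benchmark):

```python
def roofline(intensity, peak_flops, bandwidth):
    """Attainable FLOP/s for a kernel doing `intensity` FLOPs per byte moved."""
    return min(peak_flops, bandwidth * intensity)

PEAK = 989e12   # H100 FP16 peak, FLOP/s (figure from this section)
BW = 3.35e12    # HBM3 bandwidth, bytes/s

# The ridge point: intensity where memory-bound turns compute-bound
ridge = PEAK / BW  # ~295 FLOPs/byte

# LLM decode: ~1 FLOP per byte of weights read, deeply memory-bound
print(roofline(1, PEAK, BW))    # 3.35e12 FLOP/s, a tiny fraction of peak
# Dense training matmuls: hundreds of FLOPs per byte, compute-bound
print(roofline(400, PEAK, BW))  # 9.89e14 FLOP/s (peak)
```

Decode sits far left of the ridge point, so extra FLOPS are wasted there; only more bandwidth (or fewer bytes per token) helps.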
2.8 GPU Architecture
GPUs = thousands of simple cores, optimized for parallel SIMD.
2.8.1 NVIDIA A100 / H100 architecture
A100:
- 6,912 CUDA cores
- 432 Tensor cores (mixed precision)
- 80 GB HBM2e (2 TB/s bandwidth)
- 312 TFLOPS (FP16 Tensor Core), 624 TFLOPS with sparsity
H100 (2022):
- 14,592 CUDA cores
- 456 Tensor cores
- 80 GB HBM3 (3.35 TB/s bandwidth)
- 989 TFLOPS (FP16), 4 PFLOPS (FP8 with sparsity)
- Transformer Engine (FP8 with auto-scaling)
2.8.2 SM (Streaming Multiprocessor)
- 32 threads execute as warp (lock-step)
- All threads in warp execute same instruction
- Branch divergence within warp → serialize → slow
```c
// Good: all threads in the warp take the same path
if (thread_id < N) compute(thread_id);

// Bad: branch divergence (threads in the warp take different paths and serialize)
if (data[thread_id] > 0) { /* ... */ }
```
2.8.3 Memory hierarchy on GPU
Registers (per thread) ~1 cycle
↓
Shared memory (per SM) ~5-10 cycles
↓
L2 Cache (shared) ~30-50 cycles
↓
HBM (Global memory) ~200-400 cycles
Implication: Maximize shared memory usage. This is why FlashAttention matters: it keeps the attention computation in on-chip shared memory instead of round-tripping through HBM.
2.9 Storage Hierarchy
2.9.1 NVMe SSD
- PCIe 4.0 x4: ~7 GB/s sequential
- PCIe 5.0 x4 (2024+): ~14 GB/s sequential
- IOPS: 1M+ random reads
- Latency: ~50 μs
2.9.2 NVMe-oF (over fabric)
- NVMe over RDMA / TCP
- Disaggregated storage
- Sub-100 μs for remote NVMe
- Used by: AWS Nitro, Snowflake, Aurora
2.9.3 Persistent Memory (PMem / Optane)
- Intel Optane (discontinued 2022 but tech relevant)
- DRAM-like latency, persistent
- Used by: SAP HANA, Redis (modes), Postgres extensions
2.9.4 CXL (Compute Express Link)
- New (2022+) interconnect
- Disaggregated memory pools
- 128GB+ memory expansion via PCIe slot
- Slower than local DRAM, but far faster than NVMe
Future: CXL enables memory-centric (disaggregated-memory) architectures.
2.10 ARM vs x86 — Why Graviton matters
ARM (Graviton, Apple Silicon) vs x86 (Intel/AMD):
| | x86 | ARM |
|---|---|---|
| Architecture | CISC | RISC |
| Power efficiency | Lower | Higher (~30%) |
| Cost | Higher | Lower |
| Cores per chip | Fewer, complex | Many, simpler |
| Best for | Compute-heavy single thread | Throughput, parallel |
AWS Graviton 4 (2024):
- 96 cores @ 2.8 GHz
- 30% faster than previous gen
- 60% less energy than x86
- 20-40% cheaper for same perf
Migration: Most languages compile fine for ARM. Watch for native dependencies (e.g., Python packages with C extensions such as Pillow).
2.11 Latency Numbers — Updated 2024
Jeff Dean’s “Latency Numbers Every Programmer Should Know” (updated):
L1 cache reference 1 ns
Branch mispredict 5 ns
L2 cache reference 4 ns
Mutex lock/unlock 25 ns
Main memory reference 100 ns
Compress 1KB 3,000 ns (3 μs)
Send 1KB over 1 Gbps 10,000 ns (10 μs)
SSD random read 16,000 ns (16 μs)
NVMe random read 50,000 ns (50 μs)
Read 1MB sequential from RAM 250,000 ns (0.25 ms)
Read 1MB sequential from NVMe 500,000 ns (0.5 ms)
Round trip same datacenter 500,000 ns (0.5 ms)
Read 1MB sequential from HDD 5,000,000 ns (5 ms)
Disk seek 10,000,000 ns (10 ms)
Send packet CA→Netherlands 150,000,000 ns (150 ms)
Memorize ratios:
- L1 : DRAM = 1 : 100
- DRAM : NVMe = 1 : 500
- NVMe : HDD = 1 : 100
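These numbers turn directly into capacity estimates. A back-of-envelope sketch using the per-1MB sequential-read figures from the table above:

```python
# Back-of-envelope scan times using the per-1MB figures in the table above.
US_PER_MB = {"RAM": 250, "NVMe": 500, "HDD": 5_000}  # microseconds per sequential MB

def scan_seconds(size_gb, tier):
    """Seconds to read `size_gb` sequentially from the given tier."""
    return size_gb * 1024 * US_PER_MB[tier] / 1e6

for tier in US_PER_MB:
    print(f"10 GB scan from {tier}: {scan_seconds(10, tier):.2f} s")
# RAM ~2.56 s, NVMe ~5.12 s, HDD ~51.2 s. Random access on the
# lower tiers would be orders of magnitude worse still.
```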
3. Practical Applications
3.1 Why LLM serving is memory-bound
Llama-70B FP16 inference:
- Model: 140 GB
- Per token: read entire model weights once
- Bandwidth: 3.35 TB/s (H100 HBM3)
- Theoretical max throughput: 3,350 GB/s ÷ 140 GB = 24 tokens/s
→ Can’t go faster than memory bandwidth allows. Continuous batching helps because multiple requests share weight reads.
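The arithmetic generalizes: decode throughput is bounded by bandwidth ÷ bytes streamed per token, and batching amortizes the weight reads across requests. A minimal sketch with the numbers from this section (it deliberately ignores KV-cache traffic, which lowers the real bound):

```python
def max_tokens_per_s(model_gb, bandwidth_gb_s, batch_size=1):
    """Upper bound on aggregate decode tokens/s when each decode step
    streams the full weights once, shared across the whole batch.
    (Ignores KV-cache reads, so real throughput is lower.)"""
    return bandwidth_gb_s / model_gb * batch_size

# Llama-70B FP16 (~140 GB) on one H100 (HBM3 ~3350 GB/s)
print(max_tokens_per_s(140, 3350))                 # ~24 tok/s at batch 1
print(max_tokens_per_s(140, 3350, batch_size=32))  # ~766 tok/s aggregate
```

This is exactly why continuous batching raises aggregate throughput without making any single request faster.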
3.2 Why ClickHouse is fast
ClickHouse architectural choices:
- Columnar storage → cache-friendly (process column at a time)
- Vectorized execution → SIMD-accelerated
- JIT compilation for hot queries
- Late materialization — don’t read all columns
→ 10-100x faster than row-oriented OLAP databases.
3.3 Why memory bandwidth matters more than CPU GHz
Modern CPUs spend 50-80% of time waiting for memory. Adding more GHz doesn’t help.
Pareto rule for performance:
- Reduce cache misses
- Improve data layout
- Use SIMD
- Then optimize algorithm
- (Adding cores last)
3.4 LMAX Disruptor — Hardware-aware design
Disruptor (2010): Lock-free ring buffer that achieves 6M+ ops/sec.
Tricks:
- Single writer per slot → no synchronization
- Cache-line aligned slots → no false sharing
- Mechanical sympathy with CPU pipeline
- Sequence numbers (no shared lock)
Design pattern: When you NEED max throughput, design with cache in mind.
3.5 Storage tier strategy
| Tier | Storage | Use case | Cost/GB |
|---|---|---|---|
| Hot | DRAM (Redis) | Sub-ms reads | $5-10 |
| Warm | NVMe (Postgres) | Active data | $0.10-0.30 |
| Cool | SATA SSD | Less active | $0.03-0.10 |
| Cold | S3 Standard | Backups | $0.023 |
| Frozen | S3 Glacier | Archive | $0.001-0.004 |
Architecture pattern: Lakehouse uses this — hot data in compute cache, warm in S3 IA, cold in Glacier.
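To see why tiering pays, compare the cost of parking 1 TB at each tier using midpoints of the $/GB ranges in the table above (rough illustration only: the DRAM figure is hardware cost while the S3 figures are monthly, so treat the ratios, not the absolutes):

```python
# Midpoints of the $/GB figures from the tier table above (illustrative).
COST_PER_GB = {
    "Hot (DRAM)":       7.50,
    "Warm (NVMe)":      0.20,
    "Cool (SATA SSD)":  0.065,
    "Cold (S3)":        0.023,
    "Frozen (Glacier)": 0.0025,
}

def tier_cost(tb, tier):
    """Dollar cost of `tb` terabytes at the given tier."""
    return tb * 1024 * COST_PER_GB[tier]

for tier in COST_PER_GB:
    print(f"1 TB {tier}: ${tier_cost(1, tier):,.2f}")
# Hot-to-frozen spans roughly 3000x in cost: the reason tiering policies exist.
```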
4. Performance Testing & Profiling
4.1 Benchmark methodology
Trinity of benchmarking:
- Microbenchmarks: specific functions (`criterion` in Rust, JMH in Java)
- Synthetic load tests: tools like `wrk`, `vegeta`, `k6`
- Production shadow traffic: real-world patterns
Pitfalls:
- Warmup needed (JIT, cache, branch predictor)
- Don’t average — use percentiles
- Watch for thermal throttling
- Disable turbo boost for reproducibility
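“Don’t average — use percentiles” is easy to demonstrate: a single slow outlier barely moves the median but dominates the mean. A minimal sketch on a synthetic latency sample:

```python
import statistics

# Synthetic latency sample: 99 requests at 1 ms plus one 500 ms stall
# (GC pause, cold cache, page fault: pick your poison).
latencies_ms = [1.0] * 99 + [500.0]

mean = statistics.mean(latencies_ms)                  # 5.99 ms
p50 = statistics.median(latencies_ms)                 # 1.0 ms
p99 = statistics.quantiles(latencies_ms, n=100)[98]   # ~495 ms

print(f"mean={mean:.2f} ms  p50={p50:.2f} ms  p99={p99:.2f} ms")
# The mean says "about 6 ms"; the percentiles say the typical request
# takes 1 ms while roughly 1 in 100 takes half a second.
```

Report p50/p95/p99/max in benchmarks; a mean alone hides exactly the tail your users notice.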
4.2 perf — Linux profiling
```bash
# CPU profiling
perf record -F 99 -g ./my-app   # Sample at 99 Hz, with stacks
perf report                     # Interactive view

# Cache miss profiling
perf stat -e cache-misses,cache-references ./my-app

# Cycle attribution
perf annotate -d ./my-app
```
4.3 Brendan Gregg’s flame graphs
```bash
# Capture 30 seconds of stacks, all CPUs, at 99 Hz
perf record -F 99 -ag -- sleep 30

# Generate the flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
```
→ Visual: y-axis = stack depth; frame width = share of samples (≈ time spent); the x-axis ordering is alphabetical, not chronological.
4.4 eBPF-based profiling
```bash
# CPU profile via eBPF (no perf needed)
bpftrace -e 'profile:hz:99 /pid == 1234/ { @[ustack] = count(); }'

# Off-CPU analysis
offcputime -p 1234

# Cache misses by function
llcstat -p 1234
```
Tools: BCC, bpftrace, Pixie, Pyroscope, Parca.
5. Architecture Implications
5.1 When CPU bound
Symptoms: high CPU%, low waiting%, performance scales with cores.
Optimizations:
- Profile, optimize hot loops
- Use SIMD where possible
- Multithreading
- Better algorithm
5.2 When memory bandwidth bound
Symptoms: CPU% medium, but adding cores doesn’t help.
Optimizations:
- Reduce data size (compression, quantization)
- Batch operations
- Improve locality (data layout)
- Use cache-aware algorithms
Examples: LLM inference, in-memory analytics, big data scans.
5.3 When I/O bound
Symptoms: low CPU, high iowait.
Optimizations:
- Async I/O (epoll, io_uring)
- Connection pooling
- Reduce I/O (cache, batch)
- Faster storage (NVMe)
5.4 When network bound
Symptoms: txqueuelen saturated, retransmits.
Optimizations:
- Compression
- HTTP/2 multiplexing
- Connection reuse
- CDN / edge
6. Code Examples
6.1 Cache-friendly vs hostile
```c
// Cache-friendly: row-major iteration (matches C's memory layout)
int matrix[N][N];
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        matrix[i][j] = i + j;
// ~10x faster than column-major!

// Cache-hostile: column-major iteration = stride-N access
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        matrix[i][j] = i + j;
```
6.2 False sharing demo
```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// BAD: false sharing
struct BadCounter {
    a: AtomicU64, // 8 bytes
    b: AtomicU64, // 8 bytes — same cache line!
}

// GOOD: each counter aligned to its own 64-byte line
#[repr(align(64))]
struct PaddedAtomic(AtomicU64);

struct GoodCounter {
    a: PaddedAtomic,
    b: PaddedAtomic,
}

// Benchmark sketch: spawn two threads, one doing
// a.fetch_add(1, Ordering::Relaxed) in a tight loop, the other
// hammering b, and time both layouts.
```
Result: the padded layout is 5-10x faster than the naive one when two threads contend.
6.3 SIMD with Rust
use std::simd::*;
fn sum_simd(arr: &[f32]) -> f32 {
let chunks = arr.chunks_exact(8);
let remainder = chunks.remainder();
let sum_vec = chunks
.map(|chunk| f32x8::from_slice(chunk))
.fold(f32x8::splat(0.0), |acc, v| acc + v);
let sum_remainder: f32 = remainder.iter().sum();
sum_vec.reduce_sum() + sum_remainder
}
// 4-8x faster than scalar sum6.4 NUMA-aware allocation
```c
#include <numa.h>

if (numa_available() >= 0) {
    // Allocate on the local NUMA node
    void *mem = numa_alloc_local(SIZE);
    // ...
    numa_free(mem, SIZE);
}
```
7. System Design Diagrams
7.1 Memory Hierarchy
```mermaid
flowchart TB
    CPU[CPU Core]
    CPU <-->|0.3ns| Reg[Registers]
    Reg <-->|1ns| L1[L1 Cache<br/>32-128 KB]
    L1 <-->|3ns| L2[L2 Cache<br/>256 KB-1 MB]
    L2 <-->|10ns| L3[L3 Cache<br/>4-64 MB shared]
    L3 <-->|100ns| DRAM[DRAM<br/>16 GB-2 TB]
    DRAM <-->|50μs| NVMe[NVMe SSD]
    NVMe <-->|5ms| HDD[HDD]
    NVMe <-->|0.5ms| NetDC[Network DC]
    NetDC <-->|150ms| NetCont[Network Cross-Continent]
    style Reg fill:#1b5e20,color:#fff
    style L1 fill:#2e7d32,color:#fff
    style L2 fill:#43a047,color:#fff
    style L3 fill:#66bb6a,color:#fff
    style DRAM fill:#a5d6a7,color:#000
    style NVMe fill:#fff9c4,color:#000
    style HDD fill:#ffe0b2,color:#000
```
7.2 NUMA Architecture
```mermaid
flowchart LR
    subgraph S0["Socket 0"]
        CPU0[CPU 0<br/>cores 0-7]
        DRAM0[DRAM 0<br/>local]
        CPU0 <--> DRAM0
    end
    subgraph S1["Socket 1"]
        CPU1[CPU 1<br/>cores 8-15]
        DRAM1[DRAM 1<br/>local]
        CPU1 <--> DRAM1
    end
    CPU0 <-.QPI/UPI<br/>cross-socket.-> CPU1
    CPU0 <-.slower.-> DRAM1
    CPU1 <-.slower.-> DRAM0
    style CPU0 fill:#bbdefb
    style CPU1 fill:#c8e6c9
```
7.3 GPU Architecture (Simplified)
```mermaid
flowchart TB
    subgraph GPU["NVIDIA H100 SXM5"]
        subgraph SMs["132 Streaming Multiprocessors"]
            SM1[SM 1<br/>128 cores<br/>Tensor cores]
            SM2[...]
            SM132[SM 132]
        end
        L2[L2 Cache<br/>50 MB shared]
        HBM[HBM3<br/>80 GB<br/>3.35 TB/s]
        SMs --> L2
        L2 --> HBM
    end
    NVLink[NVLink 4<br/>900 GB/s<br/>to other GPUs]
    PCIe[PCIe 5<br/>128 GB/s<br/>to CPU]
    GPU --> NVLink
    GPU --> PCIe
```
7.4 Roofline Model
```mermaid
flowchart LR
    subgraph Compute["Compute-Bound Region"]
        C1[High arithmetic<br/>intensity]
        C2[Performance ≈ peak FLOPS]
        C3[Examples: ML training,<br/>scientific compute]
    end
    subgraph Memory["Memory-Bound Region"]
        M1[Low arithmetic<br/>intensity]
        M2[Performance ≈ bandwidth × intensity]
        M3[Examples: LLM decode,<br/>analytics scans]
    end
    Compute -.compute optimization.-> M1
    Memory -.bandwidth optimization.-> C1
    style Compute fill:#c8e6c9
    style Memory fill:#fff9c4
```
8. Aha Moments & Pitfalls
Aha Moments
#1: Memory hierarchy jumps are 100-500x at each level: L1 to DRAM (100x), DRAM to NVMe (500x), NVMe to HDD (100x). Architecture decisions = deciding where data lives.
#2: Modern CPUs spend 50-80% time waiting for memory. Adding GHz doesn’t help. Reduce cache misses → reduce data size → improve locality.
#3: LLM inference = memory bandwidth bound. 3.35 TB/s ÷ 140 GB = 24 tok/s theoretical max. Continuous batching shares weight reads.
#4: False sharing is silent killer. Two threads writing different vars in same cache line = serialization. Pad to 64 bytes.
#5: NUMA matters at scale. 2-socket server: pin process + memory to same node. Saves 50%+ memory latency.
#6: SIMD can give 4-16x speedup for free. Compiler auto-vectorizes simple loops. Vectorized DBs (DuckDB, ClickHouse) win because of this.
#7: Branch predictor matters. Sorted data >> unsorted. ~20 cycle penalty per misprediction.
#8: GPU is parallel SIMD on steroids. 14K cores in lockstep. Branch divergence kills GPU perf.
Pitfalls
Pitfall 1: O(N) can beat O(log N) for small N
Pointer-chasing O(log N) structures (trees, skip lists) are often slower than an O(N) array scan because of cache misses. Fix: Profile the actual workload, not just Big O.
Pitfall 2: Same cache line struct fields
Multi-threaded counters in same struct → false sharing. Fix: Pad to 64 bytes between thread-local fields.
Pitfall 3: Random access patterns
Hash map with poor distribution → cache miss every lookup. Fix: Better hash, or use array-based DS for small data.
Pitfall 4: Over-rely on CPU GHz
“Just buy faster CPU” — but workload is memory-bound. Fix: Profile, find real bottleneck.
Pitfall 5: NUMA-blind allocation
JVM defaults can spread memory across NUMA nodes → 2x slower. Fix: `-XX:+UseNUMA` for the JVM, `numactl` for processes.
Pitfall 6: Thermal throttling in benchmarks
Hot CPU clocks down → benchmarks unreliable. Fix: Disable turbo boost, monitor temps, use sustained workload.
Pitfall 7: Sequential vs parallel mix
Add cores but workload sequential → no improvement (Amdahl’s law). Fix: Profile parallel speedup curve before scaling.
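Amdahl’s law makes that ceiling concrete: speedup = 1 / ((1 − p) + p/n) for a parallel fraction p on n cores. A quick sketch of the arithmetic:

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n cores when a fraction p of the work parallelizes."""
    return 1 / ((1 - p) + p / n)

# 90% parallel workload: 64 cores deliver under 9x...
print(round(amdahl_speedup(0.90, 64), 2))     # 8.77
# ...and the ceiling as n grows without bound is 1/(1-p) = 10x
print(round(amdahl_speedup(0.90, 10**9), 2))  # 10.0
```

Measure the speedup curve on 2, 4, 8 cores first; extrapolating it tells you whether more cores can ever pay off.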
Pitfall 8: Forgetting the TLB
Random access across a huge working set → TLB misses → ~100ns each. Fix: Huge pages (`madvise(MADV_HUGEPAGE)`), Transparent HugePages.
Pitfall 9: GPU branch divergence
Naive CUDA code with `if/else` → warp serialization. Fix: Algorithm redesign, predicated execution.
Pitfall 10: Assuming HDD = SSD
Algorithms designed for sequential HDD access behave poorly on random-capable NVMe (and vice versa). Fix: Match the algorithm to the storage characteristics.
9. Internal Links
| Topic | Connects to |
|---|---|
| Tuan-Foundations-OS-Essentials | Virtual memory uses MMU + caches |
| Tuan-Bonus-LLM-Serving-Infrastructure | LLM = memory-bandwidth bound (HBM3) |
| Tuan-Bonus-Vector-Database-Internals | Vector search uses SIMD + cache |
| Tuan-Foundations-Database-Internals | Storage hierarchy, B-tree vs LSM |
| Case-Design-Stock-Exchange | LMAX Disruptor, mechanical sympathy |
| Tuan-13-Monitoring-Observability | perf, eBPF for profiling |
References
Books:
- Computer Systems: A Programmer’s Perspective (CSAPP, Bryant & O’Hallaron)
- Computer Architecture: A Quantitative Approach (Hennessy & Patterson)
- Systems Performance (Brendan Gregg)
- The Art of Multiprocessor Programming (Herlihy & Shavit)
Papers:
- Drepper, What Every Programmer Should Know About Memory — https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
- Roofline model paper (Williams et al.)
Online:
- Brendan Gregg — https://www.brendangregg.com/
- Agner Fog — https://www.agner.org/optimize/
- LMAX Disruptor — https://lmax-exchange.github.io/disruptor/
Courses:
- CMU 15-418 Parallel Computer Architecture — http://www.cs.cmu.edu/~418/
- MIT 6.172 Performance Engineering — https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/
Next: Tuan-Foundations-Database-Internals — storage engines (B-tree, LSM), MVCC, query optimizer.