Foundations: Mathematics for System Architects
“An architect doesn't need a math PhD. But they must understand: why vector similarity uses cosine, why an A/B test needs 5,000 users, why a Bloom filter has a false-positive rate, why zstd compresses 5x with 1 MB blocks, why CRDTs need the lattice property, why consistent hashing uses modular arithmetic. None of this is advanced mathematics: it is ordinary discrete math, probability, and linear algebra applied to engineering.”
Tags: cs-foundations mathematics probability linear-algebra fundamentals Student: Hieu (Backend Dev → Architect) Related: Tuan-Bonus-LLM-Serving-Infrastructure · Tuan-Bonus-CRDTs-Conflict-Free-Data-Types · Tuan-Bonus-Progressive-Delivery
1. Context & Why
Why does an architect need math?
| System design topic | Underlying math |
|---|---|
| Vector similarity (RAG, semantic search) | Linear algebra: dot product, cosine |
| A/B test sample size | Statistics: hypothesis testing |
| Bloom filter false positive rate | Probability: birthday paradox |
| HyperLogLog cardinality | Probability: random hash |
| Compression ratio | Information theory: entropy |
| CRDT convergence | Discrete math: lattice theory |
| Consistent hashing | Number theory: modular arithmetic |
| Erasure coding (S3) | Linear algebra: Galois fields |
| Cache hit rate model | Probability: Markov chains |
| Capacity planning | Calculus: derivatives, growth rates |
| Embedding dimension | Linear algebra: vector spaces |
| GPU performance (FLOPS) | Linear algebra: matrix multiply |
Key insight: architect-level decisions usually have a math model behind them. Understanding the math means you can predict performance, choose the right algorithm, and debug edge cases.
Primary references
- Mathematics for Machine Learning (Deisenroth, Faisal, Ong, free) — https://mml-book.com/
- Concrete Mathematics (Knuth, Graham, Patashnik) — discrete math classic
- All of Statistics (Wasserman) — probability + stats
- Information Theory, Inference, and Learning Algorithms (MacKay, free) — http://www.inference.org.uk/itila/
- 3Blue1Brown — visual intuition — https://www.3blue1brown.com/
2. Linear Algebra (for AI/ML/Vector Search)
2.1 Vectors & Spaces
Vector = list of numbers, represents point in N-dimensional space.
v = [3, 4] → point in 2D space
w = [1, 0, 0, ..., 1] → 1024-dim embedding
Embedding: ML transforms text/image/anything into N-dim vector.
- BERT: 768 dim
- OpenAI ada-002: 1536 dim
- Custom: 64-2048 dim typical
2.2 Operations You Need
2.2.1 Dot Product
```python
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.dot(a, b)  # 1*4 + 2*5 + 3*6 = 32
```
Geometric meaning: how strongly the vectors point in the same direction.
- Positive: same direction
- Zero: perpendicular (uncorrelated)
- Negative: opposite
2.2.2 Vector Norm (Length)
```python
np.linalg.norm([3, 4])  # 5 (Pythagoras)
```
2.2.3 Cosine Similarity
Most used in vector search:
```python
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cosine_sim([1, 0], [1, 0])   # 1.0 (identical)
cosine_sim([1, 0], [0, 1])   # 0.0 (perpendicular)
cosine_sim([1, 0], [-1, 0])  # -1.0 (opposite)
```
Used by: all RAG systems, recommendation engines, deduplication.
Why cosine over Euclidean: Magnitude doesn’t matter for “meaning similarity” — only direction.
2.2.4 Euclidean Distance
```python
np.linalg.norm(a - b)
```
When to use: when magnitudes matter (e.g., image pixels).
2.2.5 L1 vs L2 Norms
- L1 (Manhattan): Σ|xᵢ|, robust to outliers
- L2 (Euclidean): √(Σxᵢ²), smooth and most common
- L∞ (Chebyshev): max|xᵢ|, the worst single component
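A quick numpy check of all three norms on the same vector (the `ord` parameter selects the norm):

```python
import numpy as np

v = np.array([3.0, -4.0])

l1 = np.linalg.norm(v, ord=1)         # |3| + |-4| = 7
l2 = np.linalg.norm(v)                # sqrt(9 + 16) = 5 (default ord=2)
linf = np.linalg.norm(v, ord=np.inf)  # max(|3|, |4|) = 4

print(l1, l2, linf)  # 7.0 5.0 4.0
```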
2.3 Matrices
Matrix = 2D grid of numbers; an M×N matrix is a linear transformation from N-dim to M-dim space.
```python
M = np.array([[1, 2],
              [3, 4]])
```
2.3.1 Matrix Multiplication
```python
A = np.random.rand(100, 50)  # 100 rows × 50 cols
B = np.random.rand(50, 30)   # 50 rows × 30 cols
C = A @ B                    # 100 × 30
```
Cost: O(N³) for N×N matrices (naive algorithm). This is where most ML compute goes.
Why GPUs dominate: Matrix multiply parallelizes perfectly.
2.3.2 Identity, Inverse, Transpose
- Identity: I, M @ I = M
- Transpose: M^T (rows become columns)
- Inverse: M @ M^-1 = I (only square, full-rank)
2.3.3 Eigenvalues / Eigenvectors
Special vectors v where M @ v = λ × v (just scaled by λ).
Used in: PCA (dimensionality reduction), PageRank, spectral clustering.
2.4 Application: Attention Mechanism (LLM)
Transformer attention is matrix multiplication:
Attention(Q, K, V) = softmax(Q·Kᵀ / √d) · V
Where:
- Q (Query): N × d
- K (Key): N × d
- V (Value): N × d
- Output: N × d
Why memory bandwidth bound: For each token decode, need to read K, V matrices for all previous tokens.
2.5 Application: Embedding Search
```python
# Database has 1M document embeddings (1536-dim)
embeddings = np.random.rand(1_000_000, 1536)
# Query embedding
query = np.random.rand(1536)
# Compute cosine similarity to all
scores = embeddings @ query / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query))
# Top 10
top_k = np.argsort(scores)[-10:]
```
Cost: ~1.5B multiplications for 1M docs, per query. → Why ANN (approximate nearest neighbor) indexes like HNSW exist.
3. Probability & Statistics
3.1 Basics
Probability = number between 0 and 1.
Random variable X = outcome of random process (e.g., dice roll).
Expected value: E[X] = Σ x · P(X = x)
Variance: Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
3.2 Common Distributions
3.2.1 Uniform
All outcomes equally likely: P(x) = 1/n for n possible outcomes.
Example: Random hash, dice roll.
3.2.2 Normal (Gaussian)
Bell curve. Defined by μ (mean) and σ (std dev).
Why important: Central Limit Theorem — sum of many random vars approaches normal.
Examples:
- Network latency distribution (sometimes log-normal)
- Measurement errors
3.2.3 Exponential
Time between events in Poisson process.
Example: Time between requests in stable load.
3.2.4 Poisson
Number of events in fixed interval.
Example: Requests per second arrival.
3.2.5 Power-law / Pareto
80/20 rule. Few big, many small.
Examples:
- File sizes (most small, few huge)
- Web page popularity
- Network traffic per user
3.3 The Birthday Paradox
Question: How many people in room before 50% chance 2 share birthday?
Answer: 23 (not 183!).
Why: with N people there are C(N, 2) = N(N−1)/2 pairs, and each pair matches with probability 1/365.
For N = 23 that is 253 pairs, and P(collision) ≈ 50%.
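The 23-person threshold can be verified by computing the probability that all N birthdays are distinct (a small sketch; leap years ignored):

```python
def birthday_collision_prob(n, days=365):
    """P(at least two of n people share a birthday)."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (days - i) / days  # person i must avoid the first i birthdays
    return 1 - p_distinct

print(round(birthday_collision_prob(22), 3))  # ~0.476
print(round(birthday_collision_prob(23), 3))  # ~0.507 -> crosses 50% at N=23
```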
Implications for engineering:
- UUID collision: 128-bit space (122 random bits in v4), but by the birthday bound only ~2^61 generations are expected before a likely collision.
- Hash collision in 32-bit: expected after ~2^16 ≈ 65,000 inserts.
- Bloom filters: Trade-off based on this.
3.4 Bloom Filters Math
Bloom filter: Probabilistic set, “definitely not in” or “probably in”.
Parameters:
- m = bit array size
- k = number of hash functions
- n = items inserted
False positive rate: p ≈ (1 − e^(−kn/m))^k
Optimal k: k = (m/n) · ln 2, which gives p ≈ 0.6185^(m/n).
Example: Want 1% FP rate for 1M items:
- m ≈ 9.6M bits = 1.2 MB
- k ≈ 7 hash functions
```python
import math

def bloom_size(n, fp_rate):
    m = -(n * math.log(fp_rate)) / (math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return int(m), int(k)

m, k = bloom_size(1_000_000, 0.01)
print(f"m={m} bits ({m/8/1024:.1f} KB), k={k}")
```
Used by: Cassandra (per-SSTable), Bitcoin SPV, Chrome Safe Browsing, RocksDB.
3.5 HyperLogLog — Cardinality Estimation
Problem: Count distinct in stream without storing all items.
HLL idea:
- Hash each item
- Track longest leading-zeros run in any hash
- Cardinality estimate ≈ 2^(max zero-run) (loose intuition; the real estimator combines many registers with a harmonic mean)
Practical: ~16 KB of registers estimates cardinalities in the billions with ~1-2% error.
Used by: Redis (PFCOUNT), BigQuery APPROX_COUNT_DISTINCT, Druid.
3.6 A/B Testing & Hypothesis Testing
Question: Is variant B better than A? (or random noise?)
3.6.1 Null hypothesis
H0: A = B (no difference) H1: A ≠ B
Goal: Reject H0 with confidence.
3.6.2 P-value
P(observe ≥ this difference | H0 true).
- p < 0.05 → significant (5% false positive rate)
- p < 0.01 → highly significant
3.6.3 Sample size formula
For 80% power and 5% significance, detecting an absolute lift δ from baseline p₁ to p₂ = p₁ + δ:
n ≈ (z_{α/2} · √(2p̄(1−p̄)) + z_β · √(p₁(1−p₁) + p₂(1−p₂)))² / δ² per arm, where p̄ = (p₁ + p₂)/2.
Practical: ~10K-100K samples per arm typical for web A/B test.
3.6.4 Common pitfalls
- Peeking: Stop test early when “significant” → false positives.
- Multiple comparisons: Test 20 metrics, 1 will be “significant” by chance.
- Sample ratio mismatch: a 50/50 split that comes out 51/49 signals an assignment bug and invalidates the test.
Tools: Statsig, LaunchDarkly, Eppo handle all this.
3.7 Markov Chains (briefly)
System with states + transition probabilities.
Stationary distribution: Long-run probabilities.
Used in:
- PageRank (Google)
- Caching strategies (LRU vs LFU)
- Queue analysis
3.8 Queueing Theory
Little's Law: L = λ × W
- L = average requests in system
- λ = arrival rate
- W = average wait time
Practical: 1000 req/s, 100ms avg latency → 100 concurrent requests.
M/M/1 queue: single server, Poisson arrivals, exponential service times. Mean time in system W = 1/(μ − λ).
As utilization ρ = λ/μ → 100%, latency → ∞. Why ~80% utilization is the sweet spot.
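The blow-up near full utilization follows directly from the M/M/1 mean time in system, W = 1/(μ − λ). A minimal sketch (the service rate μ = 1000 req/s is an assumed number):

```python
def mm1_latency(arrival_rate, service_rate):
    """M/M/1 mean time in system: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: utilization >= 1")
    return 1.0 / (service_rate - arrival_rate)

mu = 1000.0  # assumed server capacity, req/s
for util in (0.5, 0.8, 0.9, 0.99):
    w_ms = mm1_latency(util * mu, mu) * 1000
    print(f"utilization {util:.0%}: avg latency {w_ms:.1f} ms")
# 50% -> 2 ms, 80% -> 5 ms, 90% -> 10 ms, 99% -> 100 ms
```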
Universal Scalability Law (USL, Neil Gunther):
X(N) = λN / (1 + α(N−1) + βN(N−1))
- α: contention
- β: coherence (crosstalk)
Predicts: the performance ceiling and the knee of the curve. Why scaling to 100x cores ≠ 100x throughput.
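The USL formula X(N) = λN / (1 + α(N−1) + βN(N−1)) can be evaluated directly; with illustrative α = 0.02 and β = 0.0002 (assumed values), throughput peaks around N ≈ 70 and then declines:

```python
def usl_throughput(n, lam=1.0, alpha=0.02, beta=0.0002):
    """Universal Scalability Law: X(N) = lam*N / (1 + alpha*(N-1) + beta*N*(N-1))."""
    return lam * n / (1 + alpha * (n - 1) + beta * n * (n - 1))

for n in (1, 16, 64, 128, 256):
    print(f"N={n:3d}: throughput {usl_throughput(n):.1f}")
# Grows sublinearly, peaks, then DROPS as coherence cost dominates.
```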
3.9 Power Laws & Pareto in Systems
80/20 rule appears everywhere:
- 20% users → 80% traffic
- 20% files → 80% accesses
- 20% bugs → 80% incidents
Implications:
- Cache top 20% → 80% hit rate
- Optimize hot path
- Per-tenant isolation matters (one user = noisy neighbor)
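A sketch of why a small cache works under skewed access: the hit rate from caching only the most popular items when popularity follows a Zipf distribution (the exponent s = 0.8 and catalog size here are assumptions; real numbers vary with the skew):

```python
def zipf_hit_rate(n_items, cache_fraction, s=0.8):
    """Hit rate when caching the top items under Zipf(s) popularity."""
    weights = [1 / (rank ** s) for rank in range(1, n_items + 1)]
    cached = int(n_items * cache_fraction)
    return sum(weights[:cached]) / sum(weights)

n = 100_000
for frac in (0.05, 0.20):
    print(f"cache top {frac:.0%} of items -> hit rate {zipf_hit_rate(n, frac):.0%}")
```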
4. Discrete Mathematics
4.1 Set Theory
Set: Collection of unique elements.
- A ∪ B (union)
- A ∩ B (intersection)
- A − B (difference)
- A × B (Cartesian product)
Database: Tables are sets, JOIN is intersection-like, UNION = set union.
4.2 Graph Theory
Graph G = (V, E): vertices + edges.
Types:
- Directed vs undirected
- Weighted vs unweighted
- Cyclic vs acyclic (DAG)
- Connected vs disconnected
Used in system design:
- Service dependency graphs
- Database query plans (DAG)
- Workflow systems (Airflow DAGs)
- Saga compensations
- Network topology
Algorithms:
- BFS/DFS — connectivity
- Dijkstra — shortest path (network routing)
- Topological sort — task scheduling
- Min-cut/max-flow — capacity planning
4.3 Lattices (for CRDTs)
Lattice (join semi-lattice) = partially ordered set where every pair of elements has a join (∨, least upper bound).
Properties:
- Commutative: a ∨ b = b ∨ a
- Associative: (a ∨ b) ∨ c = a ∨ (b ∨ c)
- Idempotent: a ∨ a = a
These are the same 3 properties CvRDT requires!
Examples:
- Natural numbers with max: max(a,b)
- Sets with union: a ∪ b
- Boolean with OR: a ∨ b
→ This is why G-Counter (max), OR-Set (union) work as CRDTs.
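A tiny sketch: a G-Counter merge is element-wise max over per-node counts, and max satisfies all three lattice laws, so replicas converge regardless of message order or duplication (node names here are illustrative):

```python
def merge(a, b):
    """G-Counter join: element-wise max of per-node counts."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}

x = {"n1": 5, "n2": 1}
y = {"n2": 3, "n3": 7}
z = {"n1": 2, "n3": 9}

assert merge(x, y) == merge(y, x)                      # commutative
assert merge(merge(x, y), z) == merge(x, merge(y, z))  # associative
assert merge(x, x) == x                                # idempotent
print(merge(x, y))  # each node's count at its max: {'n1': 5, 'n2': 3, 'n3': 7}
```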
Reference: Tuan-Bonus-CRDTs-Conflict-Free-Data-Types.
4.4 Modular Arithmetic
a mod n = remainder when a divided by n.
Used in:
- Hashing: `hash(key) % N` for shard selection
- Consistent hashing (ring positions mod 2^32)
- Crypto (RSA mod p×q)
- Round-robin scheduling
Properties:
- (a + b) mod n = ((a mod n) + (b mod n)) mod n
- Distributive: simplifies large number computation
4.5 Combinatorics
Permutations (ordering matters): P(n, k) = n!/(n−k)!
Combinations (ordering doesn't matter): C(n, k) = n!/(k!(n−k)!)
Used in:
- Shuffle sharding (C(N, k) combinations)
- Test case enumeration
- Hash function analysis
Stirling's approximation: n! ≈ √(2πn) · (n/e)^n for large n.
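Shuffle sharding uses the C(N, k) count above. With illustrative numbers (8 workers, 2-worker shards, both assumed), the combinatorics can be checked with `math.comb`:

```python
import math

workers, shard_size = 8, 2

n_shards = math.comb(workers, shard_size)
print(n_shards)  # C(8,2) = 28 distinct 2-worker shards

# Probability that two randomly chosen shards share NO worker: C(6,2) / C(8,2)
p_disjoint = math.comb(workers - shard_size, shard_size) / n_shards
print(f"{p_disjoint:.2f}")  # 15/28 ~ 0.54: most tenant pairs don't overlap at all
```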
5. Information Theory
5.1 Entropy
Information content of a message: H(X) = −Σ p(x) · log₂ p(x)
Unit: bits.
Examples:
- Fair coin: 1 bit per flip
- Biased coin (90% heads): 0.47 bits
- 256 equally likely chars: 8 bits
Property: Maximum entropy when uniform distribution.
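The coin examples above can be reproduced from the entropy formula H = −Σ p·log₂ p (a small sketch):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; p=0 terms contribute 0 by convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))            # 1.0 bit   (fair coin)
print(round(entropy([0.9, 0.1]), 2))  # 0.47 bits (biased coin)
print(entropy([1 / 256] * 256))       # 8.0 bits  (256 equally likely chars)
```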
5.2 Compression Theory
Shannon’s source coding theorem: No lossless compressor can do better than entropy.
Why text compresses well: English entropy ~1-2 bits/char vs naive 8 bits/char → 4-8x compression possible.
Why random data doesn’t compress: Already maximum entropy.
5.3 Compression Algorithms
| Algorithm | Type | Used by |
|---|---|---|
| Huffman | Static, optimal for fixed dist | gzip part |
| LZ77 / LZ78 | Dictionary, sliding window | zip, gzip |
| DEFLATE | LZ77 + Huffman | gzip, zlib, PNG |
| LZ4 | Speed-optimized LZ77 | High-throughput |
| Snappy (Google) | Speed-optimized | Big data |
| Zstd (Facebook) | Modern, balanced | Best general purpose |
| Brotli (Google) | HTTP-optimized | Web |
Performance (typical):
| Algo | Speed | Ratio |
|---|---|---|
| LZ4 | 500 MB/s | 2x |
| Snappy | 250 MB/s | 2x |
| zstd-1 | 200 MB/s | 3x |
| zstd-19 | 5 MB/s | 5x |
| gzip-9 | 30 MB/s | 4x |
5.4 Erasure Coding
Problem: Replication 3x = 200% storage overhead. Can we be more efficient?
Erasure coding (Reed-Solomon):
- Split data into K chunks
- Add M parity chunks
- Need any K of K+M to reconstruct
Example: K=10, M=4 (Reed-Solomon 10+4)
- Storage overhead: 40% (vs 200% for 3-way replication)
- Tolerates loss of 4 chunks
Used by: S3 (~1.5x overhead), Hadoop EC, Ceph, MinIO.
Math foundation: Polynomial interpolation in Galois fields.
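Reed-Solomon itself needs Galois-field arithmetic, but the simplest erasure code, a single XOR parity chunk (K data + 1 parity, RAID-5 style), shows the reconstruction idea in a few lines (a sketch, not production code):

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# K=3 data chunks + 1 parity chunk: survives the loss of any ONE chunk
chunks = [b"AAAA", b"BBBB", b"CCCC"]
parity = b"\x00" * 4
for c in chunks:
    parity = xor_bytes(parity, c)  # parity = A ^ B ^ C

# Lose chunk 1, rebuild it by XOR-ing everything that is left
rebuilt = xor_bytes(xor_bytes(chunks[0], chunks[2]), parity)
print(rebuilt == chunks[1])  # True
```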
5.5 Hash Functions Theory
Cryptographic hash properties:
- Preimage resistance: Given h, hard to find x where hash(x) = h
- Second preimage: Given x, hard to find x’ where hash(x’) = hash(x)
- Collision resistance: Hard to find any x, x’ where hash(x) = hash(x’)
Birthday attack: Collision in 2^(n/2) tries (not 2^n).
- SHA-1 (160-bit): 2^80 birthday bound; broken in practice (real attacks are cheaper)
- SHA-256: 2^128 still strong
Non-cryptographic (faster, no security):
- MurmurHash3, xxHash, CityHash
- Used for: hash tables, consistent hashing, Bloom filter
6. Calculus (briefly)
6.1 Derivatives — Rate of Change
For f(x), derivative f’(x) = how fast f changes.
Used in:
- Gradient descent (ML training)
- Performance optimization (find peak)
- Capacity planning (growth curves)
6.2 Optimization
Gradient descent: θ ← θ − η · ∇f(θ), where η is the learning rate.
Iteratively move opposite to gradient → minimum.
ML training = gradient descent over loss function.
6.3 Big-O & Asymptotic Behavior
| Growth | Typical case |
|---|---|
| O(1) | Constant |
| O(log N) | Logarithmic |
| O(N) | Linear |
| O(N log N) | Sorting |
| O(N²) | Naive pairwise comparison |
| O(N³) | Matrix multiply naive |
| O(2^N) | Exponential (NP-hard) |
Limit notation: f = O(g) means f(N)/g(N) stays bounded as N → ∞.
7. Practical Applications in System Design
7.1 Vector search complexity
Naive cosine search: O(N×D) per query (D = dim).
ANN (HNSW): O(log N) typical, with build cost.
Why important: 1B vectors × 1536 dim naive = infeasible. ANN makes it possible.
7.2 Capacity estimation
Daily users: 100M
Sessions per user: 10
Events per session: 50
Event size: 1 KB
Throughput = 100M × 10 × 50 / 86400s ≈ 580K events/s
Storage/day = 100M × 10 × 50 × 1KB = 50 TB/day
Apply Pareto: Peak = 3-5x average. Apply queueing: Plan for 50-70% utilization.
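The arithmetic above, as code (decimal units; the 4× peak factor is an assumption within the stated 3-5× range):

```python
daily_users = 100e6
sessions_per_user = 10
events_per_session = 50
event_size_bytes = 1_000  # 1 KB

events_per_day = daily_users * sessions_per_user * events_per_session
avg_throughput = events_per_day / 86_400  # events/s
peak_throughput = avg_throughput * 4      # assumed 4x peak factor
storage_tb_per_day = events_per_day * event_size_bytes / 1e12

print(f"avg {avg_throughput:,.0f} ev/s, peak ~{peak_throughput:,.0f} ev/s, "
      f"{storage_tb_per_day:.0f} TB/day")
# ~579K ev/s average, ~2.3M ev/s peak, 50 TB/day
```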
7.3 Cache sizing
Power law: 20% items → 80% traffic.
Cache top 20% items: 80% hit rate. Cache top 5% items: ~50% hit rate (Zipf distribution).
→ Small cache can be very effective if access pattern skewed.
7.4 SLO & error budgets
99.9% SLO = 0.1% errors allowed = 43.2 minutes/month.
Error budget: 1 − SLO (for 99.9%: 0.1% of requests, or 43.2 minutes, per 30-day window).
Burn rate alerting: alert when consuming the budget too fast.
- 1-hour fast-burn window: 14.4× the sustainable rate (would exhaust a 30-day budget in ~2 days)
- 6-hour slow-burn window: a lower multiple (e.g., 2.4-6×) to catch slow leaks
7.5 Reliability math
Series: system works only if all components work: R = R₁ × R₂ × … × Rₙ
10 services × 99.9% each → 0.999^10 ≈ 99% overall (87.6 hours/year of downtime!)
Parallel: system works if any component works: R = 1 − ∏(1 − Rᵢ)
3 replicas × 99% each → 1 − 0.01³ = 99.9999% overall.
Implication: Avoid serial dependencies, embrace redundancy.
7.6 N-modular redundancy
For consensus: need a majority quorum. With N = 2f + 1 nodes, the system tolerates f failures:
- 3 nodes tolerate 1 failure
- 5 nodes tolerate 2 failures
- 7 nodes tolerate 3 failures
→ Why Raft/Paxos deployments use odd N. Reference: Tuan-Bonus-Consensus-Raft-Paxos.
8. Code Examples
8.1 Cosine similarity
```python
import numpy as np

def cosine_similarity(a, b):
    """Range [-1, 1], 1 = identical direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Batch: 1 query vs N docs
def batch_cosine(query, docs):
    """docs: (N, D), query: (D,)"""
    norm_docs = np.linalg.norm(docs, axis=1)
    norm_q = np.linalg.norm(query)
    return (docs @ query) / (norm_docs * norm_q)

# Example: semantic search
docs = np.random.rand(10000, 768)
query = np.random.rand(768)
scores = batch_cosine(query, docs)
top10 = np.argsort(scores)[-10:][::-1]
```
8.2 Bloom filter
```python
import math
import mmh3  # pip install mmh3 (MurmurHash3 bindings)

class BloomFilter:
    def __init__(self, n_items, fp_rate):
        self.size = int(-(n_items * math.log(fp_rate)) / (math.log(2) ** 2))
        self.k = int((self.size / n_items) * math.log(2))
        self.bits = bytearray((self.size + 7) // 8)

    def _hashes(self, item):
        # Double hashing: derive k hashes from 2 base hashes (Kirsch-Mitzenmacher)
        h1 = mmh3.hash(item, 0)
        h2 = mmh3.hash(item, 1)
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h // 8] |= (1 << (h % 8))

    def __contains__(self, item):
        return all(
            self.bits[h // 8] & (1 << (h % 8))
            for h in self._hashes(item)
        )

bf = BloomFilter(n_items=10000, fp_rate=0.01)
bf.add("hello")
print("hello" in bf)  # True
print("world" in bf)  # False (probably)
```
8.3 Sample size for A/B test
```python
import math
from scipy import stats

def sample_size(baseline, mde, alpha=0.05, power=0.8):
    """
    baseline: baseline conversion rate (e.g., 0.10)
    mde: minimum detectable effect (e.g., 0.01 for 1% absolute lift)
    alpha: significance level
    power: 1 - beta (typically 0.8)
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    p1 = baseline
    p2 = baseline + mde
    p_avg = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_avg * (1 - p_avg))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (mde ** 2)
    return math.ceil(n)

# Need to detect a 1% absolute lift on a 10% baseline
n = sample_size(baseline=0.10, mde=0.01)
print(f"Need {n} samples per arm")
# ~14,750 samples per arm
```
8.4 Reliability calculator
```python
def reliability_series(*reliabilities):
    """Reliability of a system requiring ALL components."""
    r = 1.0
    for x in reliabilities:
        r *= x
    return r

def reliability_parallel(*reliabilities):
    """Reliability of a system requiring ANY component."""
    failure_prob = 1.0
    for x in reliabilities:
        failure_prob *= (1 - x)
    return 1 - failure_prob

# 10 microservices in series, each 99.9%
print(f"Series 10×99.9%: {reliability_series(*[0.999]*10) * 100:.3f}%")
# 99.005%

# 3 replicas in parallel, each 99%
print(f"Parallel 3×99%: {reliability_parallel(*[0.99]*3) * 100:.4f}%")
# 99.9999%
```
8.5 HyperLogLog (simplified)
```python
import mmh3  # pip install mmh3

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p  # 16384 registers
        self.registers = [0] * self.m

    def add(self, item):
        h = mmh3.hash(item) & 0xFFFFFFFF
        idx = h >> (32 - self.p)           # first p bits pick the register
        rest = (h << self.p) & 0xFFFFFFFF  # mask to 32 bits (avoids runaway shifts)
        rho = self._leading_zeros(rest) + 1
        self.registers[idx] = max(self.registers[idx], rho)

    def _leading_zeros(self, x):
        if x == 0:
            return 32
        z = 0
        while x & 0x80000000 == 0:
            z += 1
            x <<= 1
        return z

    def count(self):
        # Raw HLL estimator; real implementations add small/large-range corrections
        alpha = 0.7213 / (1 + 1.079 / self.m)
        Z = sum(2 ** -r for r in self.registers)
        return int(alpha * self.m * self.m / Z)

hll = HyperLogLog()
for i in range(100000):
    hll.add(f"user_{i}")
print(f"Estimated: {hll.count()}")  # ~100K, ~1-2% error
```
9. System Design Diagrams
9.1 Vector Similarity
```mermaid
flowchart LR
    Query[Query embedding<br/>768-dim vector]
    DB[(Vector DB<br/>1M embeddings)]
    Query --> Compute[Compute cosine<br/>similarity]
    DB --> Compute
    Compute --> Sort[Sort by similarity]
    Sort --> TopK[Top-K results]
    Note["Naive: O(N·D)<br/>HNSW ANN: O(log N)"]
    style Note fill:#fff9c4
```
9.2 Bloom Filter
```mermaid
flowchart TB
    Item["Item: 'apple'"]
    Item --> H1[Hash 1: 312]
    Item --> H2[Hash 2: 1004]
    Item --> H3[Hash 3: 89]
    H1 --> BitArray
    H2 --> BitArray
    H3 --> BitArray
    subgraph BitArray["Bit array (size m=2048)"]
        Bits["...0010101...01010100...01010..."]
    end
    Lookup{"contains('apple')?"}
    Lookup -->|Check H1| Bits
    Lookup -->|Check H2| Bits
    Lookup -->|Check H3| Bits
    Result["All bits set → 'maybe in'<br/>Any bit unset → 'definitely not'"]
```
9.3 Reliability — Series vs Parallel
```mermaid
flowchart LR
    subgraph Series["Series: ALL must work"]
        S1[99.9%] --> S2[99.9%] --> S3[99.9%]
        SR[Result: 99.7%]
    end
    subgraph Parallel["Parallel: ANY must work"]
        P1[99%]
        P2[99%]
        P3[99%]
        PR[Result: 99.9999%]
    end
    style Series fill:#ffcdd2
    style Parallel fill:#c8e6c9
```
9.4 Universal Scalability Law
```mermaid
flowchart LR
    Throughput[Throughput]
    Cores[Cores N]
    subgraph USL["USL Curve"]
        Linear[Linear scaling<br/>ideal]
        Knee[Contention knee<br/>α causes saturation]
        Decline[Coherence decline<br/>β causes drop]
    end
    Cores --> Linear --> Knee --> Decline
    Note["X(N) = λN / (1 + α(N-1) + βN(N-1))<br/>Adding cores beyond knee<br/>= performance regression"]
    style Note fill:#fff9c4
```
10. Aha Moments & Pitfalls
Aha Moments
#1: Vectors + cosine = semantic search. Foundation of all RAG. Direction matters, not magnitude.
#2: Birthday paradox is everywhere. UUID collision, hash collision, all 2^(n/2) tries. SHA-1 → SHA-256 because of this.
#3: Bloom filter math is elegant. Trade memory for false positive rate. Cassandra, RocksDB use it.
#4: Lattice properties = CRDT properties. Math foundation: commutative, associative, idempotent. Same 3 things.
#5: 80/20 rule for design. Power law everywhere. Cache top 20%, optimize hot path.
#6: Series multiplies reliabilities; parallel multiplies failure probabilities (1−R). 10 microservices × 99.9% = 99% (catastrophic). Embrace redundancy.
#7: Universal Scalability Law predicts decline. More cores → eventually slower. Profile, don’t assume.
#8: Information theory bounds compression. Random data can’t compress. English text compresses 4-8x.
Pitfalls
Pitfall 1: Forget normalize before cosine
Cosine similarity is the dot product of unit vectors. Skip normalization and you get raw dot-product behavior, with magnitude leaking in. Fix: normalize the vectors, or use the full cosine formula (divide by both norms).
Pitfall 2: 32-bit hash for big data
With a 32-bit hash, a birthday collision is expected after only ~2^16 ≈ 65K items. UUID v1 can also collide (clock-based). Fix: 128-bit identifiers (UUID v4) for uniqueness, 256-bit hashes (SHA-256) for security.
Pitfall 3: Peeking in A/B test
Stopping a test as soon as it "looks significant" inflates the false-positive rate. Fix: predetermine the sample size, or use sequential testing methods designed for continuous monitoring (always-valid p-values).
Pitfall 4: Ignore queueing theory
Sizing "100 servers for 100 req/s each" to exactly average load drives utilization toward 1, and queue length (and latency) toward infinity. Fix: plan for 50-70% utilization; keep headroom for bursts.
Pitfall 5: Forget Big-O for big N
O(N²) fine for N=1000, dies at N=1M. Fix: Profile with realistic data sizes.
Pitfall 6: Linear scaling assumption
“10 cores = 10x throughput” → ignore Amdahl’s law, USL. Fix: Profile parallel speedup, identify contention points.
Pitfall 7: Misuse of Bloom filter
Used for “exists” check but app expects definitive yes/no. Fix: Bloom filter = “definitely not” or “probably yes”. Use for negative case (skip lookup).
Pitfall 8: Wrong distribution model
Assume normal distribution for latency → P99 way off. Fix: Latency often log-normal or Pareto. Use percentiles, not mean.
Pitfall 9: Sampling bias
A/B test on logged-in users only → bias. Fix: Random assignment at registration / first visit.
Pitfall 10: Compressing already-compressed data
Apply gzip on JPEG → minimal gain, CPU waste. Fix: Know data type. Random or compressed data ≈ entropy max.
11. Internal Links
| Topic | Connects to |
|---|---|
| Tuan-Bonus-LLM-Serving-Infrastructure | Linear algebra, attention math, FLOPS |
| Tuan-Bonus-CRDTs-Conflict-Free-Data-Types | Lattice theory, abstract algebra |
| Tuan-Bonus-Progressive-Delivery | A/B testing, sample size |
| Tuan-Bonus-Consistency-Models-Isolation | Probability of conflicts |
| Tuan-02-Back-of-the-envelope | Estimation math |
| Tuan-10-Consistent-Hashing | Modular arithmetic |
| Case-Design-S3-Object-Storage | Erasure coding (Galois fields) |
| Case-Design-Search-Autocomplete | Probability, frequency analysis |
References
Books:
- Mathematics for Machine Learning (Deisenroth et al., free) — https://mml-book.com/
- Concrete Mathematics (Knuth) — discrete math
- All of Statistics (Wasserman)
- Information Theory, Inference, and Learning Algorithms (MacKay, free) — http://www.inference.org.uk/itila/
- Probabilistic Programming and Bayesian Methods (Davidson-Pilon, free) — https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
- Introduction to Probability (Blitzstein, MIT)
Online:
- 3Blue1Brown — visual intuition — https://www.3blue1brown.com/
- Khan Academy linear algebra
- Statistics Done Wrong — common pitfalls
Courses:
- MIT 18.06 Linear Algebra (Strang)
- MIT 6.041 Probability
- Stanford CS229 Machine Learning math reviews
Phase B complete (5 Foundations files). Next: Phase C (update References.md).