Foundations: Mathematics for System Architects
“An architect doesn't need a math PhD. But they must understand: why vector similarity uses cosine, why an A/B test needs 5,000 users, why a Bloom filter has a false-positive rate, why zstd compresses 5x with 1 MB blocks, why CRDTs need the lattice property, why consistent hashing uses modular arithmetic. None of this is advanced mathematics: it is ordinary discrete math, probability, and linear algebra applied to engineering.”
Tags: cs-foundations mathematics probability linear-algebra fundamentals Student: Hieu (Backend Dev → Architect) Related: Tuan-Bonus-LLM-Serving-Infrastructure · Tuan-Bonus-CRDTs-Conflict-Free-Data-Types · Tuan-Bonus-Progressive-Delivery
1. Context & Why
Why does an architect need math?
| System design topic | Underlying math |
|---|---|
| Vector similarity (RAG, semantic search) | Linear algebra: dot product, cosine |
| A/B test sample size | Statistics: hypothesis testing |
| Bloom filter false positive rate | Probability: birthday paradox |
| HyperLogLog cardinality | Probability: random hash |
| Compression ratio | Information theory: entropy |
| CRDT convergence | Discrete math: lattice theory |
| Consistent hashing | Number theory: modular arithmetic |
| Erasure coding (S3) | Linear algebra: Galois fields |
| Cache hit rate model | Probability: Markov chains |
| Capacity planning | Calculus: derivatives, growth rates |
| Embedding dimension | Linear algebra: vector spaces |
| GPU performance (FLOPS) | Linear algebra: matrix multiply |
Key insight: architect-level decisions usually have a math model behind them. Understanding the math means you can predict performance, choose the right algorithm, and debug edge cases.
Primary references
- Mathematics for Machine Learning (Deisenroth, Faisal, Ong, free) — https://mml-book.com/
- Concrete Mathematics (Knuth, Graham, Patashnik) — discrete math classic
- All of Statistics (Wasserman) — probability + stats
- Information Theory, Inference, and Learning Algorithms (MacKay, free) — http://www.inference.org.uk/itila/
- 3Blue1Brown — visual intuition — https://www.3blue1brown.com/
2. Linear Algebra (for AI/ML/Vector Search)
2.1 Vectors & Spaces
Vector = list of numbers, represents point in N-dimensional space.
v = [3, 4] → point in 2D space
w = [1, 0, 0, ..., 1] → 1024-dim embedding
Embedding: ML transforms text/image/anything into N-dim vector.
- BERT: 768 dim
- OpenAI ada-002: 1536 dim
- Custom: 64-2048 dim typical
2.2 Operations You Need
2.2.1 Dot Product
```python
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.dot(a, b)  # 1*4 + 2*5 + 3*6 = 32
```
Geometric meaning: how strongly the vectors point in the same direction.
- Positive: same direction
- Zero: perpendicular (uncorrelated)
- Negative: opposite
2.2.2 Vector Norm (Length)
```python
np.linalg.norm([3, 4])  # 5 (Pythagoras)
```
2.2.3 Cosine Similarity
Most used in vector search:
```python
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cosine_sim([1, 0], [1, 0])   # 1.0 (identical)
cosine_sim([1, 0], [0, 1])   # 0.0 (perpendicular)
cosine_sim([1, 0], [-1, 0])  # -1.0 (opposite)
```
Used by: all RAG systems, recommendation engines, deduplication.
Why cosine over Euclidean: Magnitude doesn’t matter for “meaning similarity” — only direction.
2.2.4 Euclidean Distance
```python
np.linalg.norm(a - b)
```
When to use: when magnitudes matter (e.g., image pixels).
2.2.5 L1 vs L2 Norms
- L1 (Manhattan): Σ|xᵢ|, robust to outliers
- L2 (Euclidean): √(Σxᵢ²), smooth and most common
- L∞ (Chebyshev): max|xᵢ|, the worst single component
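A quick numpy check of all three norms on the same vector (the `ord` parameter selects the norm):

```python
import numpy as np

v = np.array([3.0, -4.0])

l1 = np.linalg.norm(v, ord=1)         # |3| + |-4| = 7
l2 = np.linalg.norm(v)                # sqrt(9 + 16) = 5 (default ord=2)
linf = np.linalg.norm(v, ord=np.inf)  # max(|3|, |4|) = 4

print(l1, l2, linf)  # 7.0 5.0 4.0
```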
2.3 Matrices
Matrix = 2D grid of numbers; an M×N matrix is a linear transformation from N-dim to M-dim space.
```python
M = np.array([[1, 2],
              [3, 4]])
```
2.3.1 Matrix Multiplication
```python
A = np.random.rand(100, 50)  # 100 rows × 50 cols
B = np.random.rand(50, 30)   # 50 rows × 30 cols
C = A @ B                    # 100 × 30
```
Cost: O(N³) for N×N matrices (naive algorithm). This is where most ML compute goes.
Why GPUs dominate: Matrix multiply parallelizes perfectly.
2.3.2 Identity, Inverse, Transpose
- Identity: I, M @ I = M
- Transpose: M^T (rows become columns)
- Inverse: M @ M^-1 = I (only square, full-rank)
2.3.3 Eigenvalues / Eigenvectors
Special vectors v where M @ v = λ × v (just scaled by λ).
Used in: PCA (dimensionality reduction), PageRank, spectral clustering.
2.4 Application: Attention Mechanism (LLM)
Transformer attention is matrix multiplication:
Attention(Q, K, V) = softmax(Q·Kᵀ / √d) · V
Where:
- Q (Query): N × d
- K (Key): N × d
- V (Value): N × d
- Output: N × d
Why memory bandwidth bound: For each token decode, need to read K, V matrices for all previous tokens.
2.5 Application: Embedding Search
```python
# Database has 1M document embeddings (1536-dim)
embeddings = np.random.rand(1_000_000, 1536)
# Query embedding
query = np.random.rand(1536)
# Compute cosine similarity to all
scores = embeddings @ query / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query))
# Top 10
top_k = np.argsort(scores)[-10:]
```
Cost: ~1.5B multiplications for 1M docs, per query. → Why ANN (approximate nearest neighbor) indexes like HNSW exist.
3. Probability & Statistics
3.1 Basics
Probability = number between 0 and 1.
Random variable X = outcome of random process (e.g., dice roll).
Expected value: E[X] = Σ x · P(X = x)
Variance: Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
3.2 Common Distributions
3.2.1 Uniform
All outcomes equally likely: P(x) = 1/n for n possible outcomes.
Example: Random hash, dice roll.
3.2.2 Normal (Gaussian)
Bell curve. Defined by μ (mean) and σ (std dev).
Why important: Central Limit Theorem — sum of many random vars approaches normal.
Examples:
- Network latency distribution (sometimes log-normal)
- Measurement errors
3.2.3 Exponential
Time between events in Poisson process.
Example: Time between requests in stable load.
3.2.4 Poisson
Number of events in fixed interval.
Example: Requests per second arrival.
3.2.5 Power-law / Pareto
80/20 rule. Few big, many small.
Examples:
- File sizes (most small, few huge)
- Web page popularity
- Network traffic per user
3.3 The Birthday Paradox
Question: How many people in room before 50% chance 2 share birthday?
Answer: 23 (not 183!).
Why: with N people there are C(N, 2) = N(N−1)/2 pairs, and each pair matches with probability 1/365.
For N = 23 that is 253 pairs, and P(collision) ≈ 50%.
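The 23-person threshold can be verified by computing the probability that all N birthdays are distinct (a small sketch; leap years ignored):

```python
def birthday_collision_prob(n, days=365):
    """P(at least two of n people share a birthday)."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (days - i) / days  # person i must avoid the first i birthdays
    return 1 - p_distinct

print(round(birthday_collision_prob(22), 3))  # ~0.476
print(round(birthday_collision_prob(23), 3))  # ~0.507 -> crosses 50% at N=23
```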
Implications for engineering:
- UUID collision: 128-bit space (122 random bits in v4), but by the birthday bound only ~2^61 generations are expected before a likely collision.
- Hash collision in 32-bit: expected after ~2^16 ≈ 65,000 inserts.
- Bloom filters: Trade-off based on this.
3.4 Bloom Filters Math
Bloom filter: Probabilistic set, “definitely not in” or “probably in”.
Parameters:
- m = bit array size
- k = number of hash functions
- n = items inserted
False positive rate: p ≈ (1 − e^(−kn/m))^k
Optimal k: k = (m/n) · ln 2, which gives p ≈ 0.6185^(m/n).
Example: Want 1% FP rate for 1M items:
- m ≈ 9.6M bits = 1.2 MB
- k ≈ 7 hash functions
```python
import math

def bloom_size(n, fp_rate):
    m = -(n * math.log(fp_rate)) / (math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return int(m), int(k)

m, k = bloom_size(1_000_000, 0.01)
print(f"m={m} bits ({m/8/1024:.1f} KB), k={k}")
```
Used by: Cassandra (per-SSTable), Bitcoin SPV, Chrome Safe Browsing, RocksDB.
3.5 HyperLogLog — Cardinality Estimation
Problem: Count distinct in stream without storing all items.
HLL idea:
- Hash each item
- Track longest leading-zeros run in any hash
- Cardinality estimate ≈ 2^(max zero-run) (loose intuition; the real estimator combines many registers with a harmonic mean)
Practical: ~16 KB of registers estimates cardinalities in the billions with ~1-2% error.
Used by: Redis (PFCOUNT), BigQuery APPROX_COUNT_DISTINCT, Druid.
3.6 A/B Testing & Hypothesis Testing
Question: Is variant B better than A? (or random noise?)
3.6.1 Null hypothesis
H0: A = B (no difference) H1: A ≠ B
Goal: Reject H0 with confidence.
3.6.2 P-value
P(observe ≥ this difference | H0 true).
- p < 0.05 → significant (5% false positive rate)
- p < 0.01 → highly significant
3.6.3 Sample size formula
For 80% power and 5% significance, detecting an absolute lift δ from baseline p₁ to p₂ = p₁ + δ:
n ≈ (z_{α/2} · √(2p̄(1−p̄)) + z_β · √(p₁(1−p₁) + p₂(1−p₂)))² / δ² per arm, where p̄ = (p₁ + p₂)/2.
Practical: ~10K-100K samples per arm typical for web A/B test.
3.6.4 Common pitfalls
- Peeking: Stop test early when “significant” → false positives.
- Multiple comparisons: Test 20 metrics, 1 will be “significant” by chance.
- Sample ratio mismatch: a 50/50 split that comes out 51/49 signals an assignment bug and invalidates the test.
Tools: Statsig, LaunchDarkly, Eppo handle all this.
3.7 Markov Chains (briefly)
System with states + transition probabilities.
Stationary distribution: Long-run probabilities.
Used in:
- PageRank (Google)
- Caching strategies (LRU vs LFU)
- Queue analysis
3.8 Queueing Theory
Little's Law: L = λ × W
- L = average requests in system
- λ = arrival rate
- W = average wait time
Practical: 1000 req/s, 100ms avg latency → 100 concurrent requests.
M/M/1 queue: single server, Poisson arrivals, exponential service times. Mean time in system W = 1/(μ − λ).
As utilization ρ = λ/μ → 100%, latency → ∞. Why ~80% utilization is the sweet spot.
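The blow-up near full utilization follows directly from the M/M/1 mean time in system, W = 1/(μ − λ). A minimal sketch (the service rate μ = 1000 req/s is an assumed number):

```python
def mm1_latency(arrival_rate, service_rate):
    """M/M/1 mean time in system: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: utilization >= 1")
    return 1.0 / (service_rate - arrival_rate)

mu = 1000.0  # assumed server capacity, req/s
for util in (0.5, 0.8, 0.9, 0.99):
    w_ms = mm1_latency(util * mu, mu) * 1000
    print(f"utilization {util:.0%}: avg latency {w_ms:.1f} ms")
# 50% -> 2 ms, 80% -> 5 ms, 90% -> 10 ms, 99% -> 100 ms
```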
Universal Scalability Law (USL, Neil Gunther):
X(N) = λN / (1 + α(N−1) + βN(N−1))
- α: contention
- β: coherence (crosstalk)
Predicts: the performance ceiling and the knee of the curve. Why scaling to 100x cores ≠ 100x throughput.
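The USL formula X(N) = λN / (1 + α(N−1) + βN(N−1)) can be evaluated directly; with illustrative α = 0.02 and β = 0.0002 (assumed values), throughput peaks around N ≈ 70 and then declines:

```python
def usl_throughput(n, lam=1.0, alpha=0.02, beta=0.0002):
    """Universal Scalability Law: X(N) = lam*N / (1 + alpha*(N-1) + beta*N*(N-1))."""
    return lam * n / (1 + alpha * (n - 1) + beta * n * (n - 1))

for n in (1, 16, 64, 128, 256):
    print(f"N={n:3d}: throughput {usl_throughput(n):.1f}")
# Grows sublinearly, peaks, then DROPS as coherence cost dominates.
```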
3.9 Power Laws & Pareto in Systems
80/20 rule appears everywhere:
- 20% users → 80% traffic
- 20% files → 80% accesses
- 20% bugs → 80% incidents
Implications:
- Cache top 20% → 80% hit rate
- Optimize hot path
- Per-tenant isolation matters (one user = noisy neighbor)
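A sketch of why a small cache works under skewed access: the hit rate from caching only the most popular items when popularity follows a Zipf distribution (the exponent s = 0.8 and catalog size here are assumptions; real numbers vary with the skew):

```python
def zipf_hit_rate(n_items, cache_fraction, s=0.8):
    """Hit rate when caching the top items under Zipf(s) popularity."""
    weights = [1 / (rank ** s) for rank in range(1, n_items + 1)]
    cached = int(n_items * cache_fraction)
    return sum(weights[:cached]) / sum(weights)

n = 100_000
for frac in (0.05, 0.20):
    print(f"cache top {frac:.0%} of items -> hit rate {zipf_hit_rate(n, frac):.0%}")
```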
4. Discrete Mathematics
4.1 Set Theory
Set: Collection of unique elements.
- A ∪ B (union)
- A ∩ B (intersection)
- A − B (difference)
- A × B (Cartesian product)
Database: Tables are sets, JOIN is intersection-like, UNION = set union.
4.2 Graph Theory
Graph G = (V, E): vertices + edges.
Types:
- Directed vs undirected
- Weighted vs unweighted
- Cyclic vs acyclic (DAG)
- Connected vs disconnected
Used in system design:
- Service dependency graphs
- Database query plans (DAG)
- Workflow systems (Airflow DAGs)
- Saga compensations
- Network topology
Algorithms:
- BFS/DFS — connectivity
- Dijkstra — shortest path (network routing)
- Topological sort — task scheduling
- Min-cut/max-flow — capacity planning
4.3 Lattices (for CRDTs)
Lattice (join semi-lattice) = partially ordered set where every pair of elements has a join (∨, least upper bound).
Properties:
- Commutative: a ∨ b = b ∨ a
- Associative: (a ∨ b) ∨ c = a ∨ (b ∨ c)
- Idempotent: a ∨ a = a
These are the same 3 properties CvRDT requires!
Examples:
- Natural numbers with max: max(a,b)
- Sets with union: a ∪ b
- Boolean with OR: a ∨ b
→ This is why G-Counter (max), OR-Set (union) work as CRDTs.
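A tiny sketch: a G-Counter merge is element-wise max over per-node counts, and max satisfies all three lattice laws, so replicas converge regardless of message order or duplication (node names here are illustrative):

```python
def merge(a, b):
    """G-Counter join: element-wise max of per-node counts."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}

x = {"n1": 5, "n2": 1}
y = {"n2": 3, "n3": 7}
z = {"n1": 2, "n3": 9}

assert merge(x, y) == merge(y, x)                      # commutative
assert merge(merge(x, y), z) == merge(x, merge(y, z))  # associative
assert merge(x, x) == x                                # idempotent
print(merge(x, y))  # each node's count at its max: {'n1': 5, 'n2': 3, 'n3': 7}
```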
Reference: Tuan-Bonus-CRDTs-Conflict-Free-Data-Types.
4.4 Modular Arithmetic
a mod n = remainder when a divided by n.
Used in:
- Hashing: `hash(key) % N` for shard selection
- Consistent hashing (ring positions mod 2^32)
- Crypto (RSA mod p×q)
- Round-robin scheduling
Properties:
- (a + b) mod n = ((a mod n) + (b mod n)) mod n
- Distributive: simplifies large number computation
4.5 Combinatorics
Permutations (ordering matters): P(n, k) = n!/(n−k)!
Combinations (ordering doesn't matter): C(n, k) = n!/(k!(n−k)!)
Used in:
- Shuffle sharding (C(N, k) combinations)
- Test case enumeration
- Hash function analysis
Stirling's approximation: n! ≈ √(2πn) · (n/e)^n for large n.
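Shuffle sharding uses the C(N, k) count above. With illustrative numbers (8 workers, 2-worker shards, both assumed), the combinatorics can be checked with `math.comb`:

```python
import math

workers, shard_size = 8, 2

n_shards = math.comb(workers, shard_size)
print(n_shards)  # C(8,2) = 28 distinct 2-worker shards

# Probability that two randomly chosen shards share NO worker: C(6,2) / C(8,2)
p_disjoint = math.comb(workers - shard_size, shard_size) / n_shards
print(f"{p_disjoint:.2f}")  # 15/28 ~ 0.54: most tenant pairs don't overlap at all
```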
5. Information Theory
5.1 Entropy
Information content of a message: H(X) = −Σ p(x) · log₂ p(x)
Unit: bits.
Examples:
- Fair coin: 1 bit per flip
- Biased coin (90% heads): 0.47 bits
- 256 equally likely chars: 8 bits
Property: Maximum entropy when uniform distribution.
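The coin examples above can be reproduced from the entropy formula H = −Σ p·log₂ p (a small sketch):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; p=0 terms contribute 0 by convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))            # 1.0 bit   (fair coin)
print(round(entropy([0.9, 0.1]), 2))  # 0.47 bits (biased coin)
print(entropy([1 / 256] * 256))       # 8.0 bits  (256 equally likely chars)
```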
5.2 Compression Theory
Shannon’s source coding theorem: No lossless compressor can do better than entropy.
Why text compresses well: English entropy ~1-2 bits/char vs naive 8 bits/char → 4-8x compression possible.
Why random data doesn’t compress: Already maximum entropy.
5.3 Compression Algorithms
| Algorithm | Type | Used by |
|---|---|---|
| Huffman | Static, optimal for fixed dist | gzip part |
| LZ77 / LZ78 | Dictionary, sliding window | zip, gzip |
| DEFLATE | LZ77 + Huffman | gzip, zlib, PNG |
| LZ4 | Speed-optimized LZ77 | High-throughput |
| Snappy (Google) | Speed-optimized | Big data |
| Zstd (Facebook) | Modern, balanced | Best general purpose |
| Brotli (Google) | HTTP-optimized | Web |
Performance (typical):
| Algo | Speed | Ratio |
|---|---|---|
| LZ4 | 500 MB/s | 2x |
| Snappy | 250 MB/s | 2x |
| zstd-1 | 200 MB/s | 3x |
| zstd-19 | 5 MB/s | 5x |
| gzip-9 | 30 MB/s | 4x |
5.4 Erasure Coding
Problem: Replication 3x = 200% storage overhead. Can we be more efficient?
Erasure coding (Reed-Solomon):
- Split data into K chunks
- Add M parity chunks
- Need any K of K+M to reconstruct
Example: K=10, M=4 (Reed-Solomon 10+4)
- Storage overhead: 40% (vs 200% for 3-way replication)
- Tolerates loss of 4 chunks
Used by: S3 (~1.5x overhead), Hadoop EC, Ceph, MinIO.
Math foundation: Polynomial interpolation in Galois fields.
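Reed-Solomon itself needs Galois-field arithmetic, but the simplest erasure code, a single XOR parity chunk (K data + 1 parity, RAID-5 style), shows the reconstruction idea in a few lines (a sketch, not production code):

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# K=3 data chunks + 1 parity chunk: survives the loss of any ONE chunk
chunks = [b"AAAA", b"BBBB", b"CCCC"]
parity = b"\x00" * 4
for c in chunks:
    parity = xor_bytes(parity, c)  # parity = A ^ B ^ C

# Lose chunk 1, rebuild it by XOR-ing everything that is left
rebuilt = xor_bytes(xor_bytes(chunks[0], chunks[2]), parity)
print(rebuilt == chunks[1])  # True
```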
5.5 Hash Functions Theory
Cryptographic hash properties:
- Preimage resistance: Given h, hard to find x where hash(x) = h
- Second preimage: Given x, hard to find x’ where hash(x’) = hash(x)
- Collision resistance: Hard to find any x, x’ where hash(x) = hash(x’)
Birthday attack: Collision in 2^(n/2) tries (not 2^n).
- SHA-1 (160-bit): 2^80 birthday bound; broken in practice (real attacks are cheaper)
- SHA-256: 2^128 still strong
Non-cryptographic (faster, no security):
- MurmurHash3, xxHash, CityHash
- Used for: hash tables, consistent hashing, Bloom filter
6. Calculus (briefly)
6.1 Derivatives — Rate of Change
For f(x), derivative f’(x) = how fast f changes.
Used in:
- Gradient descent (ML training)
- Performance optimization (find peak)
- Capacity planning (growth curves)
6.2 Optimization
Gradient descent: θ ← θ − η · ∇f(θ), where η is the learning rate.
Iteratively move opposite to gradient → minimum.
ML training = gradient descent over loss function.
6.3 Big-O & Asymptotic Behavior
| Growth | Typical case |
|---|---|
| O(1) | Constant |
| O(log N) | Logarithmic |
| O(N) | Linear |
| O(N log N) | Sorting |
| O(N²) | Naive pairwise comparison |
| O(N³) | Matrix multiply naive |
| O(2^N) | Exponential (NP-hard) |
Limit notation: f = O(g) means f(N)/g(N) stays bounded as N → ∞.
7. Practical Applications in System Design
7.1 Vector search complexity
Naive cosine search: O(N×D) per query (D = dim).
ANN (HNSW): O(log N) typical, with build cost.
Why important: 1B vectors × 1536 dim naive = infeasible. ANN makes it possible.
7.2 Capacity estimation
Daily users: 100M
Sessions per user: 10
Events per session: 50
Event size: 1 KB
Throughput = 100M × 10 × 50 / 86400s ≈ 580K events/s
Storage/day = 100M × 10 × 50 × 1KB = 50 TB/day
Apply Pareto: Peak = 3-5x average. Apply queueing: Plan for 50-70% utilization.
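The arithmetic above, as code (decimal units; the 4× peak factor is an assumption within the stated 3-5× range):

```python
daily_users = 100e6
sessions_per_user = 10
events_per_session = 50
event_size_bytes = 1_000  # 1 KB

events_per_day = daily_users * sessions_per_user * events_per_session
avg_throughput = events_per_day / 86_400  # events/s
peak_throughput = avg_throughput * 4      # assumed 4x peak factor
storage_tb_per_day = events_per_day * event_size_bytes / 1e12

print(f"avg {avg_throughput:,.0f} ev/s, peak ~{peak_throughput:,.0f} ev/s, "
      f"{storage_tb_per_day:.0f} TB/day")
# ~579K ev/s average, ~2.3M ev/s peak, 50 TB/day
```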
7.3 Cache sizing
Power law: 20% items → 80% traffic.
Cache top 20% items: 80% hit rate. Cache top 5% items: ~50% hit rate (Zipf distribution).
→ Small cache can be very effective if access pattern skewed.
7.4 SLO & error budgets
99.9% SLO = 0.1% errors allowed = 43.2 minutes/month.
Error budget: 1 − SLO (for 99.9%: 0.1% of requests, or 43.2 minutes, per 30-day window).
Burn rate alerting: alert when consuming the budget too fast.
- 1-hour fast-burn window: 14.4× the sustainable rate (would exhaust a 30-day budget in ~2 days)
- 6-hour slow-burn window: a lower multiple (e.g., 2.4-6×) to catch slow leaks
7.5 Reliability math
Series: system works only if all components work: R = R₁ × R₂ × … × Rₙ
10 services × 99.9% each → 0.999^10 ≈ 99% overall (87.6 hours/year of downtime!)
Parallel: system works if any component works: R = 1 − ∏(1 − Rᵢ)
3 replicas × 99% each → 1 − 0.01³ = 99.9999% overall.
Implication: Avoid serial dependencies, embrace redundancy.
7.6 N-modular redundancy
For consensus: need a majority quorum. With N = 2f + 1 nodes, the system tolerates f failures:
- 3 nodes tolerate 1 failure
- 5 nodes tolerate 2 failures
- 7 nodes tolerate 3 failures
→ Why Raft/Paxos deployments use odd N. Reference: Tuan-Bonus-Consensus-Raft-Paxos.
8. Code Examples
8.1 Cosine similarity
```python
import numpy as np

def cosine_similarity(a, b):
    """Range [-1, 1], 1 = identical direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Batch: 1 query vs N docs
def batch_cosine(query, docs):
    """docs: (N, D), query: (D,)"""
    norm_docs = np.linalg.norm(docs, axis=1)
    norm_q = np.linalg.norm(query)
    return (docs @ query) / (norm_docs * norm_q)

# Example: semantic search
docs = np.random.rand(10000, 768)
query = np.random.rand(768)
scores = batch_cosine(query, docs)
top10 = np.argsort(scores)[-10:][::-1]
```
8.2 Bloom filter
```python
import math
import mmh3  # pip install mmh3 (MurmurHash3 bindings)

class BloomFilter:
    def __init__(self, n_items, fp_rate):
        self.size = int(-(n_items * math.log(fp_rate)) / (math.log(2) ** 2))
        self.k = int((self.size / n_items) * math.log(2))
        self.bits = bytearray((self.size + 7) // 8)

    def _hashes(self, item):
        # Double hashing: derive k hashes from 2 base hashes (Kirsch-Mitzenmacher)
        h1 = mmh3.hash(item, 0)
        h2 = mmh3.hash(item, 1)
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h // 8] |= (1 << (h % 8))

    def __contains__(self, item):
        return all(
            self.bits[h // 8] & (1 << (h % 8))
            for h in self._hashes(item)
        )

bf = BloomFilter(n_items=10000, fp_rate=0.01)
bf.add("hello")
print("hello" in bf)  # True
print("world" in bf)  # False (probably)
```
8.3 Sample size for A/B test
```python
import math
from scipy import stats

def sample_size(baseline, mde, alpha=0.05, power=0.8):
    """
    baseline: baseline conversion rate (e.g., 0.10)
    mde: minimum detectable effect (e.g., 0.01 for 1% absolute lift)
    alpha: significance level
    power: 1 - beta (typically 0.8)
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    p1 = baseline
    p2 = baseline + mde
    p_avg = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_avg * (1 - p_avg))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (mde ** 2)
    return math.ceil(n)

# Need to detect a 1% absolute lift on a 10% baseline
n = sample_size(baseline=0.10, mde=0.01)
print(f"Need {n} samples per arm")
# ~14,750 samples per arm
```
8.4 Reliability calculator
```python
def reliability_series(*reliabilities):
    """Reliability of a system requiring ALL components."""
    r = 1.0
    for x in reliabilities:
        r *= x
    return r

def reliability_parallel(*reliabilities):
    """Reliability of a system requiring ANY component."""
    failure_prob = 1.0
    for x in reliabilities:
        failure_prob *= (1 - x)
    return 1 - failure_prob

# 10 microservices in series, each 99.9%
print(f"Series 10×99.9%: {reliability_series(*[0.999]*10) * 100:.3f}%")
# 99.005%

# 3 replicas in parallel, each 99%
print(f"Parallel 3×99%: {reliability_parallel(*[0.99]*3) * 100:.4f}%")
# 99.9999%
```
8.5 HyperLogLog (simplified)
```python
import mmh3  # pip install mmh3

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p  # 16384 registers
        self.registers = [0] * self.m

    def add(self, item):
        h = mmh3.hash(item) & 0xFFFFFFFF
        idx = h >> (32 - self.p)           # first p bits pick the register
        rest = (h << self.p) & 0xFFFFFFFF  # mask to 32 bits (avoids runaway shifts)
        rho = self._leading_zeros(rest) + 1
        self.registers[idx] = max(self.registers[idx], rho)

    def _leading_zeros(self, x):
        if x == 0:
            return 32
        z = 0
        while x & 0x80000000 == 0:
            z += 1
            x <<= 1
        return z

    def count(self):
        # Raw HLL estimator; real implementations add small/large-range corrections
        alpha = 0.7213 / (1 + 1.079 / self.m)
        Z = sum(2 ** -r for r in self.registers)
        return int(alpha * self.m * self.m / Z)

hll = HyperLogLog()
for i in range(100000):
    hll.add(f"user_{i}")
print(f"Estimated: {hll.count()}")  # ~100K, ~1-2% error
```
9. System Design Diagrams
9.1 Vector Similarity
```mermaid
flowchart LR
    Query[Query embedding<br/>768-dim vector]
    DB[(Vector DB<br/>1M embeddings)]
    Query --> Compute[Compute cosine<br/>similarity]
    DB --> Compute
    Compute --> Sort[Sort by similarity]
    Sort --> TopK[Top-K results]
    Note["Naive: O(N·D)<br/>HNSW ANN: O(log N)"]
    style Note fill:#fff9c4
```
9.2 Bloom Filter
```mermaid
flowchart TB
    Item["Item: 'apple'"]
    Item --> H1[Hash 1: 312]
    Item --> H2[Hash 2: 1004]
    Item --> H3[Hash 3: 89]
    H1 --> BitArray
    H2 --> BitArray
    H3 --> BitArray
    subgraph BitArray["Bit array (size m=2048)"]
        Bits["...0010101...01010100...01010..."]
    end
    Lookup{"contains('apple')?"}
    Lookup -->|Check H1| Bits
    Lookup -->|Check H2| Bits
    Lookup -->|Check H3| Bits
    Result["All bits set → 'maybe in'<br/>Any bit unset → 'definitely not'"]
```
9.3 Reliability — Series vs Parallel
```mermaid
flowchart LR
    subgraph Series["Series: ALL must work"]
        S1[99.9%] --> S2[99.9%] --> S3[99.9%]
        SR[Result: 99.7%]
    end
    subgraph Parallel["Parallel: ANY must work"]
        P1[99%]
        P2[99%]
        P3[99%]
        PR[Result: 99.9999%]
    end
    style Series fill:#ffcdd2
    style Parallel fill:#c8e6c9
```
9.4 Universal Scalability Law
```mermaid
flowchart LR
    Throughput[Throughput]
    Cores[Cores N]
    subgraph USL["USL Curve"]
        Linear[Linear scaling<br/>ideal]
        Knee[Contention knee<br/>α causes saturation]
        Decline[Coherence decline<br/>β causes drop]
    end
    Cores --> Linear --> Knee --> Decline
    Note["X(N) = λN / (1 + α(N-1) + βN(N-1))<br/>Adding cores beyond knee<br/>= performance regression"]
    style Note fill:#fff9c4
```
10. Aha Moments & Pitfalls
Aha Moments
#1: Vectors + cosine = semantic search. Foundation of all RAG. Direction matters, not magnitude.
#2: Birthday paradox is everywhere. UUID collision, hash collision, all 2^(n/2) tries. SHA-1 → SHA-256 because of this.
#3: Bloom filter math is elegant. Trade memory for false positive rate. Cassandra, RocksDB use it.
#4: Lattice properties = CRDT properties. Math foundation: commutative, associative, idempotent. Same 3 things.
#5: 80/20 rule for design. Power law everywhere. Cache top 20%, optimize hot path.
#6: Series multiplies reliabilities; parallel multiplies failure probabilities (1−R). 10 microservices × 99.9% = 99% (catastrophic). Embrace redundancy.
#7: Universal Scalability Law predicts decline. More cores → eventually slower. Profile, don’t assume.
#8: Information theory bounds compression. Random data can’t compress. English text compresses 4-8x.
Pitfalls
Pitfall 1: Forget normalize before cosine
Cosine similarity is the dot product of unit vectors. Skip normalization and you get raw dot-product behavior, with magnitude leaking in. Fix: normalize the vectors, or use the full cosine formula (divide by both norms).
Pitfall 2: 32-bit hash for big data
With a 32-bit hash, a birthday collision is expected after only ~2^16 ≈ 65K items. UUID v1 can also collide (clock-based). Fix: 128-bit identifiers (UUID v4) for uniqueness, 256-bit hashes (SHA-256) for security.
Pitfall 3: Peeking in A/B test
Stopping a test as soon as it "looks significant" inflates the false-positive rate. Fix: predetermine the sample size, or use sequential testing methods designed for continuous monitoring (always-valid p-values).
Pitfall 4: Ignore queueing theory
Sizing "100 servers for 100 req/s each" to exactly average load drives utilization toward 1, and queue length (and latency) toward infinity. Fix: plan for 50-70% utilization; keep headroom for bursts.
Pitfall 5: Forget Big-O for big N
O(N²) fine for N=1000, dies at N=1M. Fix: Profile with realistic data sizes.
Pitfall 6: Linear scaling assumption
“10 cores = 10x throughput” → ignore Amdahl’s law, USL. Fix: Profile parallel speedup, identify contention points.
Pitfall 7: Misuse of Bloom filter
Used for “exists” check but app expects definitive yes/no. Fix: Bloom filter = “definitely not” or “probably yes”. Use for negative case (skip lookup).
Pitfall 8: Wrong distribution model
Assume normal distribution for latency → P99 way off. Fix: Latency often log-normal or Pareto. Use percentiles, not mean.
Pitfall 9: Sampling bias
A/B test on logged-in users only → bias. Fix: Random assignment at registration / first visit.
Pitfall 10: Compressing already-compressed data
Apply gzip on JPEG → minimal gain, CPU waste. Fix: Know data type. Random or compressed data ≈ entropy max.
11. Internal Links
| Topic | Connects to |
|---|---|
| Tuan-Bonus-LLM-Serving-Infrastructure | Linear algebra, attention math, FLOPS |
| Tuan-Bonus-CRDTs-Conflict-Free-Data-Types | Lattice theory, abstract algebra |
| Tuan-Bonus-Progressive-Delivery | A/B testing, sample size |
| Tuan-Bonus-Consistency-Models-Isolation | Probability of conflicts |
| Tuan-02-Back-of-the-envelope | Estimation math |
| Tuan-10-Consistent-Hashing | Modular arithmetic |
| Case-Design-S3-Object-Storage | Erasure coding (Galois fields) |
| Case-Design-Search-Autocomplete | Probability, frequency analysis |
References
Books:
- Mathematics for Machine Learning (Deisenroth et al., free) — https://mml-book.com/
- Concrete Mathematics (Knuth) — discrete math
- All of Statistics (Wasserman)
- Information Theory, Inference, and Learning Algorithms (MacKay, free) — http://www.inference.org.uk/itila/
- Probabilistic Programming and Bayesian Methods (Davidson-Pilon, free) — https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
- Introduction to Probability (Blitzstein, MIT)
Online:
- 3Blue1Brown — visual intuition — https://www.3blue1brown.com/
- Khan Academy linear algebra
- Statistics Done Wrong — common pitfalls
Courses:
- MIT 18.06 Linear Algebra (Strang)
- MIT 6.041 Probability
- Stanford CS229 Machine Learning math reviews
Phase B complete (5 Foundations files). Next: Phase C (update References.md).