Week 02: Back-of-the-envelope Estimation

“A System Architect does not say ‘a lot’. They say ‘50,000 requests per second at an average of 2 KB per request, which is roughly 100 MB/s of bandwidth’.”

Tags: system-design estimation alex-xu
Prerequisite: Tuan-01-Scale-From-Zero-To-Millions
Related: Tuan-05-Load-Balancer · Tuan-06-Cache-Strategy · Tuan-07-Database-Sharding-Replication · Tuan-16-Design-URL-Shortener

1. Context & Why

Everyday analogy

Hieu, imagine you are opening a pho restaurant. Before renting a space, you need to answer:

How many customers come each day? → QPS (Queries Per Second)
How much pho does each customer eat? → Payload size
How many tables and chairs do you need? → Concurrent connections
How big does the ingredient storeroom need to be? → Storage
How many waiters do you need? → Server instances
How many bowls per minute can the kitchen produce at most? → Throughput

If you rent a space with 10 tables and 500 customers show up every day, the restaurant collapses. If you rent 500 tables and only 10 customers come per day, the rent bankrupts you.

Back-of-the-envelope estimation is exactly this skill: quick arithmetic that leads to sensible architecture decisions before a single line of code is written.

Why does Alex Xu put it in Chapter 2?

Because this is the common language of the System Design Interview. The interviewer does not want to hear “use Redis because it's fast”. They want to hear:

“With 10M DAU, peak QPS is around 20K, and each query needs a lookup under 1 ms. A single Redis node handles 100K ops/s, so one node is enough for reads. But the data size is around 50 GB, so we need a 3-node cluster for memory.”

That is the difference between a Junior and an Architect.

2. Deep Dive: Core Concepts

2.1 Power of 2: The Lookup Table to Memorize

This is the table every System Architect should know by heart:

Power	Approximate value	Practical meaning
$2^{10}$	1 Thousand (1 KB)	A short piece of text
$2^{20}$	1 Million (1 MB)	A medium-quality photo
$2^{30}$	1 Billion (1 GB)	~250 MP3 songs
$2^{40}$	1 Trillion (1 TB)	A large hard drive
$2^{50}$	1 Quadrillion (1 PB)	The entire US Library of Congress, 3 times over
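The "approximate value" column works because every 10 steps in the exponent multiply by 1,024, which is close to 1,000. A minimal sketch of how far the binary value drifts from its decimal name (the helper name `power_of_two_table` is illustrative, not from the source):

```python
def power_of_two_table(exponents=(10, 20, 30, 40, 50)):
    """Return (exponent, exact value, decimal approximation, drift %) rows."""
    rows = []
    for exp in exponents:
        exact = 2 ** exp
        approx = 10 ** (3 * exp // 10)  # 2^10 ~ 10^3, 2^20 ~ 10^6, ...
        drift_pct = (exact - approx) / approx * 100
        rows.append((exp, exact, approx, drift_pct))
    return rows


for exp, exact, approx, drift in power_of_two_table():
    print(f"2^{exp}: {exact:,} vs {approx:,} (+{drift:.1f}%)")
```

The drift grows from about 2.4% at $2^{10}$ to about 12.6% at $2^{50}$, which is still well inside order-of-magnitude accuracy.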

2.2 Latency Numbers Every Programmer Should Know

Source: Jeff Dean (Google), updated for modern hardware (~2024)

Operation	Latency	Comparison
L1 cache reference	1 ns	Glancing down at the keyboard
L2 cache reference	4 ns	Glancing over at a colleague
Main memory reference	100 ns	Walking to the pantry for water
SSD random read	16 μs	Going to the restroom
HDD seek	2 ms	Going to make coffee
Network round trip (same DC)	0.5 ms	Phoning a colleague in the same building
Network round trip (cross-region)	150 ms	Making an international call
Read 1 MB from memory	3 μs	—
Read 1 MB from SSD	49 μs	—
Read 1 MB from HDD	825 μs	—
Read 1 MB from network (1 Gbps)	10 ms	—

Aha Moment #1: For random access, an SSD read (16 μs) is about 125 times faster than an HDD seek (2 ms), and a main memory reference (100 ns) is another 160 times faster than the SSD. That is why caches exist → see Tuan-06-Cache-Strategy.

Aha Moment #2: A network round trip inside one data center (0.5 ms) versus cross-region (150 ms) is a 300x difference. That is why CDNs and multi-region deployments exist → see Tuan-03-Networking-DNS-CDN.
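The ratios follow directly from the latency table in 2.2. A small sketch that recomputes them (the dictionary keys are illustrative names, the values are the table's numbers converted to nanoseconds):

```python
# Latencies from the table in 2.2, expressed in nanoseconds.
LATENCY_NS = {
    "main_memory": 100,
    "ssd_random_read": 16_000,
    "hdd_seek": 2_000_000,
    "rtt_same_dc": 500_000,
    "rtt_cross_region": 150_000_000,
}


def ratio(slow: str, fast: str) -> float:
    """How many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


print(ratio("hdd_seek", "ssd_random_read"))      # 125.0
print(ratio("ssd_random_read", "main_memory"))   # 160.0
print(ratio("rtt_cross_region", "rtt_same_dc"))  # 300.0
```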

2.3 Availability Numbers

Availability	Downtime/year	Downtime/month	Downtime/week
99% (two 9s)	3.65 days	7.31 hours	1.68 hours
99.9% (three 9s)	8.77 hours	43.83 minutes	10.08 minutes
99.99% (four 9s)	52.60 minutes	4.38 minutes	1.01 minutes
99.999% (five 9s)	5.26 minutes	26.30 seconds	6.05 seconds

Most commercial SLAs target 99.9% to 99.99%. Five 9s is extremely expensive and is usually reserved for payment and healthcare systems.
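Each cell in the table is simply the unavailable fraction multiplied by the length of the period. A minimal sketch, assuming a 365.25-day year (the function name `downtime_seconds` is illustrative):

```python
def downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Allowed downtime in seconds for a given availability over a period."""
    return (1 - availability_pct / 100) * period_days * 86_400


for pct in (99.0, 99.9, 99.99, 99.999):
    per_year_min = downtime_seconds(pct, 365.25) / 60
    per_week_s = downtime_seconds(pct, 7)
    print(f"{pct}%: {per_year_min:,.2f} min/year, {per_week_s:,.2f} s/week")
```

Each extra nine divides the downtime budget by 10, which is why the cost of the next nine grows so quickly.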

3. Estimation Framework: A 4-Step Process

Step 1: Identify the inputs (Assumptions)

Always start with these questions:

DAU (Daily Active Users): how many users per day?
Read:Write ratio: is the system read-heavy or write-heavy?
Average size: how many bytes per object/request?
Retention period: how long is data kept?

Step 2: Calculate QPS (Queries Per Second)

$$QPS_{avg} = \frac{DAU \times \text{queries/user/day}}{86{,}400}$$

$$QPS_{peak} = QPS_{avg} \times \text{peak\_multiplier}$$

Rule of thumb: Peak is typically 2x–5x the average. An e-commerce system during a flash sale can reach 10x–50x.

Step 3: Calculate Storage

$$Storage_{daily} = DAU \times \text{writes/user/day} \times \text{avg\_size}$$

$$Storage_{total} = Storage_{daily} \times \text{retention\_days}$$

Step 4: Calculate Bandwidth

$$Bandwidth_{in} = QPS_{write} \times \text{avg\_write\_size}$$

$$Bandwidth_{out} = QPS_{read} \times \text{avg\_read\_size}$$

4. Worked Example: Estimating a URL Shortener

Reference: Tuan-16-Design-URL-Shortener, sdi.anhvy.dev (URL Shortener)

Assumptions

Parameter	Value	Explanation
DAU	100M	Large-scale system (Bitly-like)
URL shortens/user/day	0.1	Not everyone creates a link every day
URL reads/user/day	1	Each user clicks 1 short link per day on average
Read:Write ratio	10:1	Read-heavy system
Avg URL size	500 bytes	Original URL + metadata
Retention	5 years	-

QPS calculation

$$\text{Write QPS}_{avg} = \frac{100\text{M} \times 0.1}{86{,}400} = \frac{10\text{M}}{86{,}400} \approx 116 \text{ writes/s}$$

$$\text{Read QPS}_{avg} = \frac{100\text{M} \times 1}{86{,}400} \approx 1{,}157 \text{ reads/s}$$

$$\text{Read QPS}_{peak} = 1{,}157 \times 3 \approx 3{,}500 \text{ reads/s}$$

Observation: 3,500 read QPS is something an average server with Redis can already handle. There is no need to complicate the architecture yet.

Storage calculation

$$\text{New URLs/day} = 100\text{M} \times 0.1 = 10\text{M URLs/day}$$

$$\text{Storage/day} = 10\text{M} \times 500 \text{ bytes} = 5 \text{ GB/day}$$

$$\text{Storage (5 years)} = 5 \text{ GB} \times 365 \times 5 = 9.125 \text{ TB} \approx 10 \text{ TB}$$

Observation: 10 TB over 5 years is still within reach of a single database cluster (PostgreSQL + partitioning). Sharding is not yet mandatory.

Bandwidth calculation

$$Bandwidth_{in} = 116 \times 500 \text{ bytes} = 58 \text{ KB/s} \approx 0.5 \text{ Mbps}$$

$$Bandwidth_{out} = 3{,}500 \times 500 \text{ bytes} = 1.75 \text{ MB/s} = 14 \text{ Mbps}$$

Observation: Bandwidth is tiny. It is not the bottleneck.

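Cache memory calculation

By the Pareto principle (80/20 rule), 20% of URLs generate 80% of the traffic, so the target is to cache the hottest 20% of one day's reads.

$$\text{Cache size} = 3{,}500 \text{ req/s} \times 86{,}400 \text{ s} \times 20\% \times 500 \text{ bytes}$$

$$= 302\text{M} \times 20\% \times 500 \text{ bytes} \approx 60.5\text{M} \times 500 \text{ bytes} \approx 30 \text{ GB}$$

The full day of reads is 302M × 500 bytes = 151 GB; the hot 20% is about 30 GB. This holds peak QPS for the whole day and treats every read as a distinct URL, so repeated reads of the same URL (deduplication) can only shrink it.

A Redis cluster of 3 nodes × 16 GB per node = 48 GB, which is enough to hold the hot data.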
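A quick way to check the cache arithmetic is to compute the hot set from both the peak and the average read QPS. This is a minimal sketch under the same assumptions as above (500-byte objects, 20% hot fraction, decimal GB); the function name `cache_size_gb` is illustrative:

```python
SECONDS_PER_DAY = 86_400


def cache_size_gb(read_qps: float, avg_object_bytes: int, hot_fraction: float = 0.20) -> float:
    """Hot-set size in decimal GB, assuming every read hits a distinct object."""
    daily_read_bytes = read_qps * SECONDS_PER_DAY * avg_object_bytes
    return daily_read_bytes * hot_fraction / 1e9


peak_based = cache_size_gb(3_500, 500)  # upper bound: peak QPS held all day
avg_based = cache_size_gb(1_157, 500)   # same formula at average QPS
print(f"{peak_based:.1f} GB (peak-based), {avg_based:.1f} GB (average-based)")
```

The peak-based figure is about 30 GB and the average-based one about 10 GB, so a 48 GB cluster leaves headroom in either case.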

Estimation summary

Metric	Value
Write QPS (avg)	~116/s
Read QPS (peak)	~3,500/s
New URLs/day	10M
Storage (5 years)	~10 TB
Bandwidth out (peak)	~14 Mbps
Cache memory	~30 GB

5. Worked Example 2: Estimating a Chat System

Reference: Tuan-17-Design-Chat-System

Assumptions

Parameter	Value
DAU	50M
Messages sent/user/day	40
Avg message size	100 bytes
Group chats avg size	50 members
% messages in groups	60%
Media messages	10% of total, avg 200KB
Retention	Forever

Quick calculation

$$\text{Total messages/day} = 50\text{M} \times 40 = 2\text{B messages/day}$$

$$\text{Write QPS}_{avg} = \frac{2\text{B}}{86{,}400} \approx 23{,}000 \text{ writes/s}$$

$$\text{Write QPS}_{peak} = 23{,}000 \times 3 = 69{,}000 \text{ writes/s}$$

Alert: 69K write QPS makes this a write-heavy system. It needs a Message Queue as a buffer → Tuan-08-Message-Queue.

Text storage/day:

$$2\text{B} \times 100 \text{ bytes} = 200 \text{ GB/day}$$

Media storage/day:

$$2\text{B} \times 10\% \times 200 \text{ KB} = 40 \text{ TB/day}$$

Alert: 40 TB of media per day! This is the real bottleneck. It needs Object Storage (S3) + CDN → Tuan-03-Networking-DNS-CDN.

Text storage/year: $200 \text{ GB} \times 365 = 73 \text{ TB/year}$

Media storage/year: $40 \text{ TB} \times 365 = 14.6 \text{ PB/year}$
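The split between text and media is the key insight of this example, so it is worth recomputing. A minimal sketch using the assumptions from the table above and decimal units (the function name `chat_storage` and its dictionary keys are illustrative):

```python
def chat_storage(dau: int, msgs_per_user: int, text_bytes: int,
                 media_fraction: float, media_bytes: int) -> dict:
    """Daily and yearly storage in decimal units, split by text and media."""
    messages_per_day = dau * msgs_per_user
    text_tb_day = messages_per_day * text_bytes / 1e12
    media_tb_day = messages_per_day * media_fraction * media_bytes / 1e12
    return {
        "messages_per_day": messages_per_day,
        "text_tb_day": text_tb_day,
        "media_tb_day": media_tb_day,
        "text_tb_year": text_tb_day * 365,
        "media_pb_year": media_tb_day * 365 / 1000,
    }


est = chat_storage(50_000_000, 40, 100, 0.10, 200_000)
for key, value in est.items():
    print(f"{key}: {value:,.1f}")
```

Media is only 10% of the messages but roughly 200 times the text volume, which is why it gets its own storage tier.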

6. Security First: Estimation Needs Security Too

6.1 Estimating for Rate Limiting

When calculating QPS, think about the abuse scenario right away:

$$\text{Attacker capacity} = 10{,}000 \text{ bots} \times 100 \text{ req/s} = 1\text{M req/s}$$

If the system only handles 3,500 QPS, a DDoS needs just 0.35% of the botnet's capacity to overload it.

Estimating a rate limit:

$$\text{Rate limit per IP} = \frac{QPS_{peak} \times \text{safety\_margin}}{\text{estimated\_concurrent\_users}}$$

$$= \frac{3{,}500 \times 2}{100{,}000} = 0.07 \text{ req/s/IP} \approx 4 \text{ req/min/IP}$$

Details: Tuan-09-Rate-Limiter
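The two calculations in this section can be sketched as below. The inputs (safety margin of 2, 100,000 concurrent users, 10,000 bots at 100 req/s) are the section's own assumptions; the function name `rate_limit_per_ip` is illustrative:

```python
def rate_limit_per_ip(peak_qps: float, safety_margin: float, concurrent_users: int) -> float:
    """Per-IP budget in requests/second, spreading capacity evenly across users."""
    return peak_qps * safety_margin / concurrent_users


limit_rps = rate_limit_per_ip(3_500, 2, 100_000)
print(f"{limit_rps:.2f} req/s/IP = {limit_rps * 60:.1f} req/min/IP")

# Share of the assumed botnet needed to saturate the system.
botnet_rps = 10_000 * 100
print(f"{3_500 / botnet_rps:.2%} of botnet capacity saturates 3,500 QPS")
```

An even split is the simplest possible policy; real limits are usually tuned per endpoint, which is outside the scope of this estimate.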

6.2 Estimating Storage for Audit Logs

In systems that require compliance (PCI-DSS, HIPAA, SOX):

$$\text{Audit log size} = \text{Total QPS} \times \text{avg\_log\_entry} \times \text{retention}$$

$$= 3{,}500 \times 1 \text{ KB} \times 86{,}400 \times 365 \times 7 \text{ years}$$

$$\approx 770 \text{ TB}$$

Audit logs are often larger than the primary data: here about 770 TB against roughly 10 TB of URL data. They need compression + tiered storage (hot/warm/cold) → see Tuan-15-Data-Security-Encryption.
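The audit log figure can be reproduced with a few lines. This is a sketch in decimal units; the 5:1 compression ratio is a purely hypothetical value to show the effect of compression and does not come from the source:

```python
def audit_log_tb(qps: float, entry_bytes: int, retention_years: int) -> float:
    """Uncompressed audit log volume in decimal TB."""
    seconds = 86_400 * 365 * retention_years
    return qps * entry_bytes * seconds / 1e12


raw_tb = audit_log_tb(3_500, 1_000, 7)
print(f"{raw_tb:.0f} TB raw")

# Hypothetical 5:1 compression ratio, for illustration only.
ASSUMED_COMPRESSION_RATIO = 5
print(f"{raw_tb / ASSUMED_COMPRESSION_RATIO:.0f} TB at an assumed 5:1 compression")
```

Note that the estimate logs at peak QPS around the clock, so it is a conservative upper bound.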

6.3 Key Rotation Estimation

With encryption at rest and key rotation every 90 days:

$$\text{Keys per year} = \lceil 365/90 \rceil = 5 \text{ keys/year}$$

$$\text{Keys in 7 years retention} = 5 \times 7 = 35 \text{ keys}$$

That means managing up to 35 encryption keys while keeping decryption backward compatible, which calls for a KMS (Key Management Service) → Tuan-15-Data-Security-Encryption.
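A small sketch of the key count, showing both the per-year rounding used above and a continuous count over the whole retention window (the function names are illustrative, and the continuous variant is an addition for comparison, not from the source):

```python
import math


def keys_per_year(rotation_days: int) -> int:
    """Keys created in one calendar year, rounded up."""
    return math.ceil(365 / rotation_days)


def keys_upper_bound(rotation_days: int, retention_years: int) -> int:
    """Per-year count times years: the conservative figure."""
    return keys_per_year(rotation_days) * retention_years


def keys_continuous(rotation_days: int, retention_years: int) -> int:
    """Rotations counted continuously across the whole retention window."""
    return math.ceil(365 * retention_years / rotation_days)


print(keys_per_year(90))        # 5
print(keys_upper_bound(90, 7))  # 35
print(keys_continuous(90, 7))   # 29
```

Rounding up per year overcounts slightly, which is the safe direction when sizing key management.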

7. DevOps/Ops-Light — Monitoring Capacity

7.1 Capacity Alert Thresholds

Set alerts based on the estimates:

# prometheus-alerts.yml
groups:
  - name: capacity_planning
    rules:
      - alert: QPSNearCapacity
        expr: sum(rate(http_requests_total[5m])) > 2800  # 80% of the 3,500 peak; sum() aggregates across label series
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "QPS approaching peak capacity ({{ $value }}/s)"
 
      - alert: StorageGrowthAnomaly
        expr: predict_linear(disk_used_bytes[7d], 30*24*3600) > disk_total_bytes * 0.9
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Storage predicted to reach 90% in 30 days"
 
      - alert: CacheHitRateDropped
        expr: rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) < 0.8  # recent hit rate, not the lifetime ratio of the counters
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Cache hit rate dropped below 80%"

7.2 Grafana Dashboard Essentials

Based on the estimates, the dashboard should track:

Panel	Query (PromQL)	Threshold
QPS Realtime	`sum(rate(http_requests_total[1m]))`	Warning: 80% peak, Critical: 95%
P99 Latency	`histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`	< 200ms
Error Rate	`sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`	< 0.1%
Storage Growth	`disk_used_bytes`	Predict 90 days
Cache Hit Rate	`rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))`	> 80%

Details: Tuan-13-Monitoring-Observability
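The warning and critical levels in the QPS panel can be derived from the estimate instead of being hard-coded. A minimal sketch, assuming the 3,500 QPS peak from the URL Shortener example and the `http_requests_total` metric used above (both helper names are illustrative):

```python
def capacity_thresholds(peak_qps: float, warning: float = 0.80, critical: float = 0.95) -> dict:
    """Absolute QPS thresholds derived from the estimated peak."""
    return {"warning": peak_qps * warning, "critical": peak_qps * critical}


def qps_alert_expr(threshold_qps: float, metric: str = "http_requests_total") -> str:
    """PromQL expression comparing the total request rate with a threshold."""
    return f"sum(rate({metric}[5m])) > {threshold_qps:.0f}"


thresholds = capacity_thresholds(3_500)
for level, value in thresholds.items():
    print(level, "->", qps_alert_expr(value))
```

Generating the expressions this way keeps alerts in sync when the estimate is revised.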

8. Code Example — Estimation Calculator

Python: Quick Estimation Tool

"""
Back-of-the-envelope Estimation Calculator
For use in System Design Interview prep
"""
 
from dataclasses import dataclass
from enum import Enum
 
class TimeUnit(Enum):
    SECOND = 1
    MINUTE = 60
    HOUR = 3600
    DAY = 86400
    MONTH = 2592000  # 30 days
    YEAR = 31536000  # 365 days
 
@dataclass
class SystemEstimation:
    name: str
    dau: int
    reads_per_user_per_day: float
    writes_per_user_per_day: float
    avg_read_size_bytes: int
    avg_write_size_bytes: int
    retention_days: int
    peak_multiplier: float = 3.0
    cache_hot_data_pct: float = 0.20  # Pareto 80/20
 
    @property
    def write_qps_avg(self) -> float:
        return (self.dau * self.writes_per_user_per_day) / TimeUnit.DAY.value
 
    @property
    def read_qps_avg(self) -> float:
        return (self.dau * self.reads_per_user_per_day) / TimeUnit.DAY.value
 
    @property
    def read_qps_peak(self) -> float:
        return self.read_qps_avg * self.peak_multiplier
 
    @property
    def write_qps_peak(self) -> float:
        return self.write_qps_avg * self.peak_multiplier
 
    @property
    def storage_per_day_gb(self) -> float:
        # Decimal units (1 GB = 10^9 bytes) so results match the hand calculations.
        return (self.dau * self.writes_per_user_per_day * self.avg_write_size_bytes) / 1e9

    @property
    def storage_total_tb(self) -> float:
        return (self.storage_per_day_gb * self.retention_days) / 1000

    @property
    def bandwidth_in_mbps(self) -> float:
        # bytes/s -> bits/s (x8) -> megabits/s (1 Mbps = 10^6 bits/s).
        return (self.write_qps_peak * self.avg_write_size_bytes * 8) / 1e6

    @property
    def bandwidth_out_mbps(self) -> float:
        return (self.read_qps_peak * self.avg_read_size_bytes * 8) / 1e6

    @property
    def cache_size_gb(self) -> float:
        # Hot set = cache_hot_data_pct of one day's read volume at average QPS.
        daily_read_data = self.read_qps_avg * TimeUnit.DAY.value * self.avg_read_size_bytes
        return (daily_read_data * self.cache_hot_data_pct) / 1e9
 
    def report(self) -> str:
        def fmt(n: float) -> str:
            if n >= 1_000_000:
                return f"{n/1_000_000:.1f}M"
            if n >= 1_000:
                return f"{n/1_000:.1f}K"
            return f"{n:.1f}"
 
        return f"""
╔══════════════════════════════════════════════════╗
║  ESTIMATION REPORT: {self.name:<28} ║
╠══════════════════════════════════════════════════╣
║  DAU:              {fmt(self.dau):>12}               ║
║  Read:Write ratio: {self.reads_per_user_per_day/max(self.writes_per_user_per_day,0.001):>10.0f}:1                 ║
╠──────────────────────────────────────────────────╣
║  QPS (avg)                                       ║
║    Read:           {fmt(self.read_qps_avg):>12} /s              ║
║    Write:          {fmt(self.write_qps_avg):>12} /s              ║
║  QPS (peak, x{self.peak_multiplier:.0f})                                  ║
║    Read:           {fmt(self.read_qps_peak):>12} /s              ║
║    Write:          {fmt(self.write_qps_peak):>12} /s              ║
╠──────────────────────────────────────────────────╣
║  Storage                                         ║
║    Per day:        {self.storage_per_day_gb:>10.2f} GB              ║
║    Total ({self.retention_days//365}yr):     {self.storage_total_tb:>10.2f} TB              ║
╠──────────────────────────────────────────────────╣
║  Bandwidth (peak)                                ║
║    Inbound:        {self.bandwidth_in_mbps:>10.2f} Mbps            ║
║    Outbound:       {self.bandwidth_out_mbps:>10.2f} Mbps            ║
╠──────────────────────────────────────────────────╣
║  Cache (20% hot data)                            ║
║    Size:           {self.cache_size_gb:>10.2f} GB              ║
╚══════════════════════════════════════════════════╝
"""
 
 
# === Usage examples ===
 
if __name__ == "__main__":
    # Case 1: URL Shortener
    url_shortener = SystemEstimation(
        name="URL Shortener",
        dau=100_000_000,
        reads_per_user_per_day=1.0,
        writes_per_user_per_day=0.1,
        avg_read_size_bytes=500,
        avg_write_size_bytes=500,
        retention_days=365 * 5,
        peak_multiplier=3.0,
    )
    print(url_shortener.report())
 
    # Case 2: Chat System
    chat_system = SystemEstimation(
        name="Chat System (text only)",
        dau=50_000_000,
        reads_per_user_per_day=200.0,  # messages are read far more often than they are sent
        writes_per_user_per_day=40.0,
        avg_read_size_bytes=100,
        avg_write_size_bytes=100,
        retention_days=365 * 7,
        peak_multiplier=3.0,
    )
    print(chat_system.report())
 
    # Case 3: Instagram-like (media heavy)
    instagram = SystemEstimation(
        name="Photo Sharing App",
        dau=500_000_000,
        reads_per_user_per_day=50.0,
        writes_per_user_per_day=0.5,
        avg_read_size_bytes=200_000,   # 200KB avg image thumbnail
        avg_write_size_bytes=2_000_000, # 2MB avg uploaded image
        retention_days=365 * 10,
        peak_multiplier=3.0,
    )
    print(instagram.report())

Node.js: Express Middleware for Request Counting to Validate Estimates

// middleware/estimation-validator.js
// Deploy in production to validate the estimates against real traffic
 
const prometheus = require('prom-client');
 
const requestCounter = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
});
 
const requestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
});
 
const requestSizeBytes = new prometheus.Histogram({
  name: 'http_request_size_bytes',
  help: 'HTTP request payload size',
  labelNames: ['method', 'path'],
  buckets: [100, 500, 1000, 5000, 10000, 50000, 100000],
});
 
function estimationValidator(req, res, next) {
  const start = process.hrtime.bigint();
 
  res.on('finish', () => {
    const duration = Number(process.hrtime.bigint() - start) / 1e9;
    const path = req.route?.path || req.path;
 
    requestCounter.inc({ method: req.method, path, status: res.statusCode });
    requestDuration.observe({ method: req.method, path }, duration);
    requestSizeBytes.observe(
      { method: req.method, path },
      parseInt(req.headers['content-length'] || '0', 10)
    );
  });
 
  next();
}
 
module.exports = { estimationValidator, requestCounter, requestDuration };

9. System Design Diagram: Estimation in Context

flowchart TD
    subgraph "Step 1: Gather Requirements"
        A[Functional Requirements] --> B[Non-functional Requirements]
        B --> C["Assumptions<br/>(DAU, Read:Write, Size)"]
    end

    subgraph "Step 2: Estimation"
        C --> D["QPS Calculation<br/>$$QPS = DAU × actions / 86400$$"]
        C --> E["Storage Calculation<br/>$$Storage = writes/day × size × days$$"]
        C --> F["Bandwidth Calculation<br/>$$BW = QPS × payload\_size$$"]
        D --> G["Cache Sizing<br/>$$Cache = daily\_reads × 20\%$$"]
    end

    subgraph "Step 3: Architecture Decision"
        D --> H{QPS < 10K?}
        H -->|Yes| I["Single Server<br/>+ Read Replica"]
        H -->|No| J["Distributed System<br/>+ Load Balancer<br/>+ Sharding"]
        E --> K{Storage < 1TB?}
        K -->|Yes| L["Single DB<br/>+ Partitioning"]
        K -->|No| M["Sharded DB<br/>or NoSQL"]
        G --> N{Cache < 64GB?}
        N -->|Yes| O["Single Redis Node"]
        N -->|No| P["Redis Cluster"]
    end

    subgraph "Step 4: Ops Validation"
        I --> Q["Prometheus Alerts<br/>based on estimated thresholds"]
        J --> Q
        Q --> R["Dashboard:<br/>Actual vs Estimated"]
    end

    style D fill:#f9a825,stroke:#333,stroke-width:2px
    style E fill:#f9a825,stroke:#333,stroke-width:2px
    style F fill:#f9a825,stroke:#333,stroke-width:2px

10. Quick Reference: Numbers to Remember

QPS Benchmarks (single instance)

Component	Capacity	Notes
Nginx (static)	50,000+ req/s	Reverse proxy
Node.js (Express)	5,000–15,000 req/s	Depends on logic complexity
Java (Spring Boot)	10,000–30,000 req/s	With a tuned thread pool
PostgreSQL	5,000–20,000 QPS	Simple queries, good indexing
MySQL	5,000–20,000 QPS	Similar to PostgreSQL
Redis	100,000+ ops/s	In-memory, single thread
Kafka (single broker)	100,000+ msg/s	Throughput-oriented
Elasticsearch	5,000–10,000 QPS	Search queries
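These benchmarks turn a peak QPS estimate into an instance count. A minimal sketch: the 70% target utilization and the minimum of 2 instances for redundancy are assumptions added here for illustration, not figures from the source, and `instances_needed` is a hypothetical helper:

```python
import math


def instances_needed(peak_qps: float, per_instance_qps: float,
                     target_utilization: float = 0.7, min_instances: int = 2) -> int:
    """Instances required to serve peak load while staying under the target utilization."""
    usable_qps = per_instance_qps * target_utilization
    return max(min_instances, math.ceil(peak_qps / usable_qps))


# Chat System peak writes against the low end of the Node.js range.
print(instances_needed(69_000, 5_000))
# URL Shortener peak reads: a single instance would do, redundancy forces two.
print(instances_needed(3_500, 5_000))
```

Using the low end of each benchmark range keeps the estimate conservative.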

Storage Quick Math

Data type	Avg size	1M records	1B records
Short text (tweet)	300 bytes	300 MB	300 GB
JSON document	2 KB	2 GB	2 TB
Image thumbnail	50 KB	50 GB	50 TB
Full image	2 MB	2 TB	2 PB
Video (1 min, 720p)	50 MB	50 TB	50 PB

11. Practice Exercises

Exercise 1: Design Twitter (Estimation)

Assumptions: 300M MAU, 50% DAU, avg 5 tweets/day/active user, avg 2 photos/tweet (100KB each), 100 reads/user/day.

Calculate:

Exercise 2: Design YouTube (Estimation)

Assumptions: 2B MAU, 30% DAU, avg 5 videos watched/day, avg video = 50MB (3 min, 720p), 0.01% users upload 1 video/day.

Calculate:

Upload QPS
Streaming bandwidth
Storage/year
CDN cache sizing

Exercise 3: Design Payment System (Estimation + Security)

Assumptions: 50M DAU, 2 transactions/user/day, avg transaction payload = 1KB, audit logs must be kept for 7 years.

Calculate:

Transaction QPS
Storage (transaction + audit log)
Encryption overhead (AES uses 16-byte blocks, so padding adds up to 16 bytes per message, plus the IV or nonce)
Key rotation schedule

12. Common Pitfalls

Pitfall 1: Forgetting the peak

Wrong: “100M DAU ÷ 86,400 = 1,157 QPS, the server has plenty of headroom.”
Right: Peak = 2x–5x average. With an event (Black Friday, Tết) it can reach 10x–50x. Always design for peak, pay for average.

Pitfall 2: Mixing up units

Mbps (megabits) vs MBps (megabytes). 1 MBps = 8 Mbps. Many people get bandwidth wrong because they forget to convert.
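Two tiny helpers make the conversion explicit. This is a sketch using decimal network units (1 Mbps = 1,000,000 bits/s); the function names are illustrative:

```python
def bytes_per_sec_to_mbps(bytes_per_sec: float) -> float:
    """Convert bytes/second to megabits/second."""
    return bytes_per_sec * 8 / 1e6


def mbps_to_mbytes_per_sec(mbps: float) -> float:
    """Convert megabits/second to megabytes/second."""
    return mbps / 8


print(bytes_per_sec_to_mbps(3_500 * 500))  # 14.0
print(mbps_to_mbytes_per_sec(1_000))       # 125.0
```

The second line is a useful number to remember: a 1 Gbps link carries at most 125 MB/s.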

Pitfall 3: Forgetting metadata & overhead

A row in the DB holds more than the data itself. Add indexes (~30% of data size), WAL logs, replication overhead, and backups. Rule of thumb: multiply the storage estimate by 3x for production.

Pitfall 4: Over-engineering based on the estimate

The estimate shows 1,000 QPS and you immediately stand up a 50-node Kubernetes cluster: that is waste. Start simple, scale when needed → Tuan-01-Scale-From-Zero-To-Millions.

Pitfall 5: Not validating the estimate against reality

An estimate is only a hypothesis. After deploying, measure the real numbers with monitoring → Tuan-13-Monitoring-Observability. If the estimate is off by more than 10x, revisit the assumptions.
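The 10x rule can be expressed as a symmetric error factor, so that overestimates and underestimates are treated the same way. A minimal sketch (the helper names `estimation_error` and `needs_review` are illustrative):

```python
def estimation_error(estimated: float, actual: float) -> float:
    """Factor by which the estimate is off; always >= 1."""
    if estimated <= 0 or actual <= 0:
        raise ValueError("both values must be positive")
    return max(estimated / actual, actual / estimated)


def needs_review(estimated: float, actual: float, threshold: float = 10.0) -> bool:
    """True when the estimate misses reality by more than the threshold factor."""
    return estimation_error(estimated, actual) > threshold


print(needs_review(3_500, 1_200))   # off by about 2.9x: assumptions still hold
print(needs_review(3_500, 50_000))  # off by about 14x: revisit the assumptions
```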

13. Aha Moments: Takeaways

#1: An estimate does not need to be exact. It needs the right order of magnitude. 1,000 vs 1,200 QPS does not matter. But 1,000 vs 100,000 QPS means a completely different architecture.

#2: The two most important numbers are QPS and Storage. Everything else follows from them.

#3: Estimation is a communication tool. Say “the system must handle 50K QPS” and the whole team understands the scale. Say “the system must be fast” and nobody knows what that means.

#4: Security estimation is often forgotten. Audit logs, encryption overhead, rate limiting: all of them consume resources and must be included.

References

Alex Xu, System Design Interview, Chapter 2: Back-of-the-envelope Estimation
Jeff Dean, Numbers Every Programmer Should Know
sdi.anhvy.dev: Vietnamese System Design Reference
Tuan-01-Scale-From-Zero-To-Millions: scaling fundamentals
Tuan-06-Cache-Strategy: why caches exist (latency numbers)
Tuan-09-Rate-Limiter: applying estimation to rate limiting
Tuan-13-Monitoring-Observability: validating estimates with monitoring
Tuan-16-Design-URL-Shortener: a full case study applying estimation

Next week: Tuan-03-Networking-DNS-CDN, understanding the request flow from browser to server

lthieu's notes
