Case Study: Design Metrics Monitoring & Alerting System

“A system without monitoring is like driving at night with the headlights off: everything is fine until you hit a wall. By then it is too late.”

Tags: system-design monitoring alerting metrics time-series case-study alex-xu-vol2
Student: Hieu
Source: Alex Xu, System Design Interview Volume 2, Chapter 6
Prerequisite: Tuan-13-Monitoring-Observability · Tuan-08-Message-Queue · Tuan-06-Cache-Strategy
Related: Tuan-07-Database-Sharding-Replication · Tuan-02-Back-of-the-envelope · Tuan-01-Scale-From-Zero-To-Millions

1. Context & Why: Why Do We Need Metrics Monitoring & Alerting?

1.1 Analogy: The Fire Alarm System in a Building

Hieu, picture a monitoring system as the fire protection system of a skyscraper:

Fire protection component	Monitoring component	Role
Smoke/heat sensor (one per room)	Metrics collection agent (one per server)	Continuously collects signals at the source
Signal wiring	Ingestion pipeline (Kafka)	Carries data from the sensors to the control center
Control center	Time Series Database	Stores and processes every signal
Status display panel	Dashboard (Grafana)	Shows real-time status to observers
Alarm trigger rule (temperature > 80 °C)	Alert rules engine	Evaluates conditions to decide whether to raise an alarm
Siren + flashing lights	Notification channels (PagerDuty, Slack, Email)	Notifies the right people at the right severity
Fire brigade	On-call engineer	The person who actually handles the incident
Automatic sprinkler system	Auto-remediation (auto-scaling, restart)	Reacts automatically before a human steps in

Aha Moment: A good fire protection system is not the one that rings the most; it is the one that rings at the right time, at the right level, and never misses a real fire. Monitoring is the same: alert fatigue (too many false alarms) is just as dangerous as having no alerts at all.

1.2 Why does monitoring matter in production?

Hieu, once your system is deployed to production, there are three questions you must be able to answer at any time:

Is the system healthy? (CPU, memory, disk, network)
Is the business logic working correctly? (request rate, error rate, latency)
When something breaks, who finds out first: you or the customer? (alerting)

Without monitoring, you only learn that the system is down when a customer calls to complain. That is the worst way to discover an incident.

1.3 Scope of the problem

The problem in Alex Xu Vol 2 Chapter 6 focuses on infrastructure metrics monitoring, not application-level tracing or logging. Specifically:

Attribute	In scope	Out of scope
Metrics (CPU, memory, request count, latency)	Yes	-
Logs (structured/unstructured text)	-	Use ELK/Loki
Traces (distributed request flow)	-	Use Jaeger/Zipkin
Alerting (based on metrics)	Yes	-
Dashboard (visualization)	Yes	-
Auto-remediation	-	Advanced, out of scope

Remember: In practice, Metrics + Logs + Traces = Three Pillars of Observability (covered in Tuan-13-Monitoring-Observability). This lesson dives deep into the first pillar: Metrics.

2. Deep Dive — Alex Xu 4-Step Framework

Step 1 — Understand the Problem & Establish Design Scope

2.1 Functional Requirements

Before drawing any diagram, ask the interviewer questions to pin down the scope:

Question	Assumed answer	Why ask
How many servers must the system monitor?	10,000 servers	Determines ingestion throughput
How many metrics does each server emit?	100 metrics/server (CPU, memory, disk, network, custom metrics)	Determines the number of time series
How many time series in total?	1,000,000 time series (10K servers x 100 metrics)	Determines TSDB capacity
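Ingestion rate?	10,000,000 data points/sec (10M/s) as the design target; for reference, 1M series collected every 10 s produce about 100K points/sec at steady state	Determines Kafka throughput and TSDB write capacity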
Collection interval?	Every 10 seconds for infrastructure metrics, every 1 minute for business metrics	Determines data volume
Is alerting needed?	Yes: rule-based alerting with multiple severity levels	Core feature
Is a dashboard needed?	Yes: real-time dashboard with drill-down	Core feature
How long is data retained?	7 days for raw data, up to 3 years for downsampled data	Determines storage strategy
Multi-tenant?	Yes: many teams share the system	Access control, isolation
Anomaly detection?	Out of the initial scope, but extensible	Future feature
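
The assumed answers above can be sanity-checked with a quick back-of-the-envelope calculation. The sketch below is only an illustration: the function name `estimate` and the 16-byte raw point size (8-byte timestamp + 8-byte float) are assumptions made here, not figures from the interview. At a 10-second interval the steady-state rate comes out at about 100K points/sec, so the 10M/s figure is best read as a peak design target.

```python
SERVERS = 10_000
METRICS_PER_SERVER = 100
COLLECTION_INTERVAL_S = 10
RAW_BYTES_PER_POINT = 16  # assumption: 8-byte timestamp + 8-byte float64 value


def estimate(servers=SERVERS, metrics=METRICS_PER_SERVER,
             interval_s=COLLECTION_INTERVAL_S):
    """Return rough capacity numbers for the monitoring system."""
    series = servers * metrics                 # one series per (server, metric)
    points_per_sec = series / interval_s       # steady-state ingestion rate
    points_per_day = points_per_sec * 86_400
    raw_gb_per_day = points_per_day * RAW_BYTES_PER_POINT / 1e9
    return {
        "series": series,
        "points_per_sec": points_per_sec,
        "points_per_day": points_per_day,
        "raw_gb_per_day": raw_gb_per_day,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.2f}")
```

Running the numbers first makes it easier to justify later choices such as Kafka partition counts and the need for compression and downsampling.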

2.2 Non-functional Requirements

Requirement	Target	Notes
Scalability	10K servers, 1M time series, 10M data points/sec	Large-scale system
Availability	99.99% (four 9s) for the alerting pipeline	A missed alert means an undetected incident
Latency: Ingestion	< 30 seconds end-to-end (from collection to queryable)	Near-real-time monitoring
Latency: Alerting	< 1 minute from threshold breach to notification sent	Fast incident detection
Latency: Dashboard	< 2 seconds for typical queries, < 10 seconds for complex queries	UX requirement
Durability	No data points are lost once ingested	At-least-once delivery
Query flexibility	Supports aggregation (sum, avg, percentile), filtering, grouping	PromQL-like
Reliability	No alert is missed, and the same alert is not sent more than twice	Dedup + at-least-once

2.3 Metrics vs Events

Characteristic	Metric	Event (Log)
Nature	Numeric value measured over time	Text describing something that happened
Example	`cpu_usage = 78.5%`	`"User 123 logged in from IP 1.2.3.4"`
Cardinality	Low to medium (metric name + tags)	Very high (every event is unique)
Storage	Compact, compresses well (time series)	Large, hard to compress
Query pattern	Aggregation (avg, sum, percentile over time)	Search, filter, full-text
Suitable database	TSDB (InfluxDB, Prometheus)	Search engine (Elasticsearch)
Retention cost	Cheap (after downsampling)	Expensive (raw text)

Important: This problem focuses on metrics, i.e. numeric time series data. That is why the TSDB (Time Series Database) is the core component.

Step 2 — Propose High-Level Design

2.4 Analogy for the High-Level Architecture

Hieu, think of the monitoring architecture as the body's nervous system:

Collection agents = nerve endings: they sense temperature, pressure, and pain
Ingestion pipeline = nerves: they carry signals to the center
Time series DB = the brain: it stores and processes information
Query service = the cognitive cortex: it answers the question “what is happening?”
Alerting system = reflexes: they react automatically to danger
Dashboard = the health chart: what the doctor reads to make a diagnosis

2.5 Architecture Overview

flowchart TB
    subgraph "Metrics Sources (10K servers)"
        S1["Application Servers<br/>(CPU, Memory, Disk)"]
        S2["Database Servers<br/>(Query latency, Connections)"]
        S3["Message Queues<br/>(Queue depth, Consumer lag)"]
        S4["Load Balancers<br/>(Request rate, Error rate)"]
        S5["Custom Business Metrics<br/>(Orders/sec, Revenue)"]
    end

    subgraph "Collection Layer"
        CA["Collection Agents<br/>(Prometheus exporters /<br/>StatsD / Telegraf)"]
        SD["Service Discovery<br/>(Consul / K8s API /<br/>DNS-based)"]
    end

    subgraph "Ingestion Pipeline"
        KAFKA["Kafka Cluster<br/>(Buffer & Decouple)"]
        SP["Stream Processor<br/>(Flink / Kafka Streams)<br/>Pre-aggregation"]
    end

    subgraph "Storage Layer"
        TSDB[("Time Series Database<br/>(InfluxDB / Prometheus<br/>/ TimescaleDB)")]
        HOT[("Hot Storage<br/>(SSD, recent 7 days)")]
        WARM[("Warm Storage<br/>(HDD, 7 days - 3 months)")]
        COLD[("Cold Storage<br/>(S3/GCS, 3 months - 3 years)")]
    end

    subgraph "Query & Alerting"
        QS["Query Service<br/>(PromQL / InfluxQL)"]
        CACHE[("Query Cache<br/>(Redis)")]
        ARE["Alert Rules Engine<br/>(Evaluate rules<br/>every 15-60s)"]
    end

    subgraph "Notification & Visualization"
        AM["Alert Manager<br/>(Grouping, Dedup,<br/>Silencing, Routing)"]
        DASH["Dashboard<br/>(Grafana)"]
        PD["PagerDuty"]
        SLACK["Slack"]
        EMAIL["Email"]
        WEBHOOK["Webhook"]
    end

    S1 & S2 & S3 & S4 & S5 --> CA
    SD -->|"Discover targets"| CA
    CA -->|"Push metrics"| KAFKA
    KAFKA --> SP
    SP -->|"Write"| TSDB
    TSDB --- HOT
    TSDB --- WARM
    TSDB --- COLD

    TSDB --> QS
    QS --> CACHE
    QS --> DASH
    QS --> ARE
    ARE -->|"FIRING alerts"| AM
    AM --> PD & SLACK & EMAIL & WEBHOOK

    style KAFKA fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style TSDB fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style ARE fill:#f44336,stroke:#333,stroke-width:2px,color:#fff
    style AM fill:#e91e63,stroke:#333,stroke-width:2px,color:#fff
    style DASH fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff

2.6 Component Responsibilities

Component	Responsibility	Technology choices
Collection Agent	Collects metrics from servers/services and formats them as time series	Prometheus Node Exporter, Telegraf, StatsD, collectd
Service Discovery	Knows the list of targets to collect from (pull model)	Consul, Kubernetes API, DNS SRV records, file-based
Kafka Cluster	Buffers metrics data, decouples collection from storage, absorbs bursts	Apache Kafka (partitioned by metric name hash)
Stream Processor	Pre-aggregates raw metrics before they are written to the TSDB	Apache Flink, Kafka Streams, Apache Spark Streaming
Time Series DB	Stores metrics data, indexed by metric name + tags	InfluxDB, Prometheus, TimescaleDB, OpenTSDB, VictoriaMetrics
Query Service	Parses the query language, fetches data from the TSDB, returns results	PromQL engine, InfluxQL, Flux
Query Cache	Caches the results of hot queries	Redis, in-memory LRU cache
Alert Rules Engine	Evaluates alert rules periodically and determines which alerts are firing	Prometheus server (alerting rules), Grafana Alerting
Alert Manager	Grouping, deduplication, silencing, routing alerts	Prometheus Alertmanager
Dashboard	Visualization: graphs, tables, heatmaps	Grafana, Kibana, custom UI

2.7 Data Flow — End to End

sequenceDiagram
    participant Server as Application Server
    participant Agent as Collection Agent
    participant SD as Service Discovery
    participant Kafka as Kafka Cluster
    participant SP as Stream Processor
    participant TSDB as Time Series DB
    participant QS as Query Service
    participant ARE as Alert Rules Engine
    participant AM as Alert Manager
    participant OC as On-Call Engineer

    Note over Server,Agent: Collection Phase (every 10s)
    Agent->>SD: Discover scrape targets
    SD-->>Agent: List of endpoints
    Agent->>Server: Scrape /metrics (pull model)
    Server-->>Agent: Metric data points
    Agent->>Kafka: Produce metric batch

    Note over Kafka,TSDB: Ingestion Phase
    Kafka->>SP: Consume raw metrics
    SP->>SP: Pre-aggregate (10s → 1min)
    SP->>TSDB: Write aggregated data points

    Note over TSDB,OC: Query & Alert Phase
    ARE->>TSDB: Evaluate rule: avg(cpu) > 90% for 5m
    TSDB-->>ARE: Result: true, value = 94.2%
    ARE->>AM: Fire alert: HIGH CPU on server-042
    AM->>AM: Check: grouped? silenced? duplicate?
    AM->>OC: Send PagerDuty notification
    OC-->>AM: Acknowledge alert

    Note over TSDB,QS: Dashboard Phase
    QS->>TSDB: Query: avg(cpu) by (cluster) last 1h
    TSDB-->>QS: Time series result
    QS-->>QS: Cache result (TTL 30s)

Step 3 — Design Deep Dive

2.8 Data Model — Time Series Data

Structure of a data point

Each metric data point has four parts:

Field	Description	Example
Metric name	Name of the metric, describing what is measured	`http_request_total`, `cpu_usage_percent`
Tags / Labels	Key-value pairs used for classification and filtering	`{host="server-042", region="us-east-1", service="payment"}`
Timestamp	Time of the observation (Unix epoch; millisecond precision in practice, while the examples below use seconds for brevity)	`1679000000000`
Value	Numeric value (int64 or float64)	`78.5`

Example data representation

metric_name{label1="value1", label2="value2"} value timestamp

cpu_usage_percent{host="server-042", region="us-east-1", env="prod"} 78.5 1679000000
http_request_total{method="GET", path="/api/orders", status="200"} 150432 1679000000
memory_used_bytes{host="server-042"} 6442450944 1679000000
kafka_consumer_lag{topic="orders", consumer_group="payment-svc"} 2350 1679000000

Time Series = Unique Combination of Metric Name + Tags

Hieu, this is an extremely important concept. A time series is identified by the metric name plus the full set of label pairs:

cpu_usage{host="s1", region="us-east-1"}  → Time series #1
cpu_usage{host="s1", region="eu-west-1"}  → Time series #2  (different region)
cpu_usage{host="s2", region="us-east-1"}  → Time series #3  (different host)

Each time series is a stream of (timestamp, value) pairs. A TSDB is optimized for storing and querying these streams.
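
To make the "metric name + label set" identity concrete, here is a minimal sketch that parses lines in the text format shown earlier and derives a series key. The helper names `parse_line` and `series_key` are hypothetical, and the regular expressions only cover the simple shape used in this lesson (no escaped quotes, no braces inside label values).

```python
import re
from collections import defaultdict

# metric_name{label="value", ...} value timestamp
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)\s+(?P<ts>\d+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line):
    """Split one exposition-style line into (name, labels, value, timestamp)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unparseable metric line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    return m.group("name"), labels, float(m.group("value")), int(m.group("ts"))


def series_key(name, labels):
    """Identity of a time series: metric name + sorted label pairs."""
    return (name, tuple(sorted(labels.items())))


def ingest(lines):
    """Group data points into per-series streams of (timestamp, value)."""
    store = defaultdict(list)
    for line in lines:
        name, labels, value, ts = parse_line(line)
        store[series_key(name, labels)].append((ts, value))
    return store
```

Sorting the label pairs means the order in which labels are written does not matter: the same name and labels always map to the same series.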

Characteristics of Time Series Data

Characteristic	Explanation	Design impact
Append-only	Data points are only appended; in normal operation they are never updated or deleted individually	Simple write path, LSM tree fits well
Write-heavy	Write >> Read (millions of new data points every second)	Optimize the write path, batch writes
Recent data queried most	90%+ of queries target data from the last 24h	Hot/warm/cold tiering, cache recent data
Time-ordered	Data arrives in time order (or close to it)	Sequential writes, time-based partitioning
High cardinality tags	A tag can take a very large number of values (e.g., `pod_id` in Kubernetes)	Cardinality explosion, a killer issue
Compressible	Consecutive values are usually close together (CPU 78.5 → 78.7 → 78.3)	Delta encoding and XOR compression are highly effective

Aha Moment: Time series data has a very distinctive pattern: append-only, write-heavy, time-ordered, compressible. That is why TSDBs exist; relational DBs and general-purpose NoSQL stores do not exploit these properties.

2.9 Collection — Push vs Pull Model

This is one of the most important architectural decisions. There are two main models:

Pull Model (Prometheus-style)

flowchart LR
    subgraph "Targets"
        T1["Server A<br/>/metrics endpoint"]
        T2["Server B<br/>/metrics endpoint"]
        T3["Server C<br/>/metrics endpoint"]
    end

    subgraph "Service Discovery"
        SD["Consul / K8s API"]
    end

    subgraph "Collector"
        PROM["Prometheus Server<br/>(Pull every 15s)"]
    end

    SD -->|"Target list"| PROM
    PROM -->|"HTTP GET /metrics"| T1
    PROM -->|"HTTP GET /metrics"| T2
    PROM -->|"HTTP GET /metrics"| T3

    style PROM fill:#E65100,stroke:#333,stroke-width:2px,color:#fff

How it works: The collector actively issues an HTTP GET to each target's /metrics endpoint at a fixed interval.

Advantages	Disadvantages
Easy to debug: the collector knows exactly which targets are alive and which are dead	Firewall/NAT issues: the collector must be able to reach the target
Built-in health check: if a pull fails, the target is likely down	Scaling the collector: a single Prometheus instance can only scrape so many targets
No complex agent needed: the target only has to expose an HTTP endpoint	Short-lived jobs: a job may finish before it is scraped, so its data is missed
Centralized control: intervals and targets are changed in one place	Network overhead: every scrape is an HTTP request

Push Model (Datadog/StatsD-style)

flowchart LR
    subgraph "Targets"
        T1["Server A<br/>+ Agent"]
        T2["Server B<br/>+ Agent"]
        T3["Server C<br/>+ Agent"]
    end

    subgraph "Gateway"
        GW["Push Gateway /<br/>Collector Endpoint"]
    end

    T1 -->|"Push metrics"| GW
    T2 -->|"Push metrics"| GW
    T3 -->|"Push metrics"| GW
    GW --> KAFKA["Kafka"]

    style GW fill:#1565C0,stroke:#333,stroke-width:2px,color:#fff

How it works: An agent on each server actively sends metrics to a central collector/gateway.

Advantages	Disadvantages
Firewall-friendly: the agent pushes outbound, so no inbound access is needed	Hard to tell when a target is dead: no data ≠ target down (it could be a network issue)
Short-lived jobs OK: a job pushes its metrics as soon as it has them	Thundering herd: many agents pushing at once cause bursts
Agents manage their own collection interval	Config scattered: each agent is configured separately
NAT/multi-DC friendly: the agent knows how to reach the gateway	Needs a more complex agent: buffering, retry, batching

Overall comparison

Criterion	Pull (Prometheus)	Push (Datadog/StatsD)
Health detection	Natural (pull fail = down)	Needs a separate heartbeat
Service discovery	Required (Consul, K8s API)	Not required (agents push on their own)
Firewall	Needs inbound access	Outbound only
Short-lived jobs	Hard (use the Pushgateway workaround)	Easy
Scaling	Horizontal: federation, sharding	Horizontal: gateway + Kafka
Who uses it?	Prometheus, Thanos, VictoriaMetrics	Datadog, New Relic, StatsD, Telegraf

In practice: Large-scale production systems often use a hybrid: pull for infrastructure metrics (the servers are always available), push for business metrics and short-lived jobs. Kafka sits in the middle as a buffer for both.

Service Discovery for the Pull Model

In the pull model, the collector needs to know the full list of targets to scrape. That is why service discovery is needed:

Method	Description	When to use
Static config	Hardcode the target list in a config file	Small, rarely changing environments
File-based	Read the target list from a file that automation keeps updated	Managed by a CM tool (Ansible/Puppet)
Consul	Query the Consul catalog for the list of healthy services	Consul is already in the stack
Kubernetes API	Watch the K8s API for pods with the annotation `prometheus.io/scrape: "true"`	Kubernetes environment
DNS SRV	Resolve SRV records to get host:port	DNS-based service discovery
EC2/GCE API	Query the cloud provider API for the list of instances	Cloud-native environment

2.10 Ingestion Pipeline — Kafka as Buffer

Why put Kafka between Collection and the TSDB?

Hieu, this is a good question. Why not write straight from the agents into the TSDB?

Problem with writing directly	How Kafka solves it
Burst traffic: 1000 servers deployed at once, agents restarting	Kafka absorbs the burst while the TSDB consumes at a steady rate
TSDB downtime: upgrades, heavy compaction	Kafka retains the data and the TSDB catches up afterwards
Fan-out: the same data must go to several places (TSDB, alerting, real-time dashboard)	Multiple Kafka consumer groups
Backpressure: the TSDB is overloaded	Kafka holds the data and consumers adjust their own pace
Data transformation: pre-aggregation is needed before writing	The stream processor consumes from Kafka, transforms, then writes

Ingestion Pipeline Architecture

flowchart LR
    subgraph "Collection Agents"
        A1["Agent 1"]
        A2["Agent 2"]
        AN["Agent N"]
    end

    subgraph "Kafka Cluster"
        direction TB
        T1["Topic: raw-metrics<br/>Partitions: 64<br/>Replication: 3"]
        T2["Topic: aggregated-metrics<br/>Partitions: 32<br/>Replication: 3"]
    end

    subgraph "Stream Processing"
        SP1["Flink Job 1<br/>Counter aggregation<br/>(sum per minute)"]
        SP2["Flink Job 2<br/>Histogram aggregation<br/>(percentile computation)"]
        SP3["Flink Job 3<br/>Rule evaluation<br/>(real-time alerting)"]
    end

    subgraph "Storage"
        TSDB["Time Series DB"]
        ARE["Alert Rules Engine"]
    end

    A1 & A2 & AN -->|"Produce"| T1
    T1 --> SP1 & SP2 & SP3
    SP1 & SP2 -->|"Write aggregated"| T2
    T2 --> TSDB
    SP3 -->|"Trigger alerts"| ARE

    style T1 fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style T2 fill:#FFC107,stroke:#333,stroke-width:2px,color:#000

Kafka Partitioning Strategy

How metrics are partitioned in Kafka has a big impact on performance:

Strategy	How	Pros	Cons
By metric name	`partition_key = hash(metric_name)`	The same metric lands on the same partition → easy to aggregate	Hotspot if one metric has far more data than the others
By host	`partition_key = hash(host_id)`	All metrics of one host on one partition → ordered per host	Hotspot if one host emits many metrics
By metric + host	`partition_key = hash(metric_name + host_id)`	Even distribution, each time series stays on one partition	Requires more partitions
Round-robin	No key, Kafka distributes automatically	Most even distribution	Loses ordering, hard to aggregate

Recommendation: Use hash(metric_name) for a general-purpose system. If stream processing needs to aggregate per host, use hash(metric_name + host_id).
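
The trade-off between the first and third strategies can be seen with a small sketch. It is an illustration under stated assumptions: the function names are hypothetical, and MD5 is used only to get a deterministic hash in plain Python (Kafka's default partitioner uses a different hash function).

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def key_by_metric(metric: str, host: str) -> str:
    # All hosts of one metric share a partition: easy aggregation, hotspot risk.
    return metric


def key_by_metric_and_host(metric: str, host: str) -> str:
    # Each (metric, host) series is pinned to one partition: even spread.
    return f"{metric}|{host}"


if __name__ == "__main__":
    hosts = [f"host-{i}" for i in range(1000)]
    by_metric = {partition_for(key_by_metric("cpu_usage", h), 64) for h in hosts}
    by_both = {partition_for(key_by_metric_and_host("cpu_usage", h), 64) for h in hosts}
    print("partitions used, key = metric:       ", len(by_metric))
    print("partitions used, key = metric + host:", len(by_both))
```

With the metric-only key, every host's `cpu_usage` points land on one partition; with the combined key they spread across the topic while each individual series still keeps its ordering.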

Stream Processor — Pre-aggregation

Pre-aggregation is a crucial step for reducing load on the TSDB:

Operation	Input	Output	Example
Counter sum	1000 data points/sec for `http_request_total`	1 data point/min (sum of 60s window)	150432 requests in last minute
Gauge average	1000 data points/sec for `cpu_usage`	1 data point/min (avg of 60s window)	avg CPU = 78.3% last minute
Histogram merge	1000 latency values/sec	1 histogram per minute (p50, p90, p99)	p99 latency = 245ms last minute
Rate computation	Counter values over time	Rate per second	2,507 req/s

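Benefit: Write volume drops from 10M data points/sec to about 170K data points/sec (~60x, i.e. about 60 input points per series per minute collapsed into one; with a strict 10-second interval the per-series reduction would be 6x), while keeping enough information for 1-minute granularity.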
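
A tumbling-window aggregation of this kind can be sketched in a few lines. This is plain Python standing in for a Flink or Kafka Streams job; the function name `preaggregate` and the tuple layout of the input are assumptions made for illustration.

```python
from collections import defaultdict


def preaggregate(points, window_s=60):
    """Collapse raw points into one aggregate per (series, window).

    points: iterable of (series_key, timestamp_seconds, value).
    'sum' and 'rate_per_s' are meaningful for per-interval counter
    increments; 'avg' is the one to use for gauges.
    """
    buckets = defaultdict(list)
    for series, ts, value in points:
        window_start = ts - ts % window_s      # align to the window boundary
        buckets[(series, window_start)].append(value)

    out = {}
    for key, values in buckets.items():
        total = sum(values)
        out[key] = {
            "sum": total,
            "avg": total / len(values),
            "count": len(values),
            "rate_per_s": total / window_s,
        }
    return out
```

For example, a window whose increments sum to 150432 requests yields `rate_per_s` of about 2507, matching the counter and rate rows in the table.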

2.11 Time Series Database — Core of the System

Why is a Relational DB a poor fit?

Hieu, interviewers ask this a lot. You must be able to explain it clearly:

Aspect	Relational DB (PostgreSQL)	TSDB (InfluxDB, Prometheus)
Write throughput	Roughly 10K-50K inserts/sec (B-tree overhead)	Roughly 500K-2M+ inserts/sec (LSM tree, batch writes)
Compression	Poor for time series (generic compression)	Excellent (Gorilla-style delta-of-delta and XOR), often around 10x
Time-based queries	Full table scan or complex indexing	Native time-based partitioning; a time-range query only touches the matching blocks
Downsampling	Custom logic required, expensive	Built-in continuous queries
Retention	Manual DELETE, vacuum overhead	Built-in retention policies, automatic TTL
Storage efficiency	16+ bytes/data point (timestamp + value, plus row overhead)	~1-2 bytes/data point (after compression)
Cardinality handling	OK (B-tree index)	Optimized specifically for label-based queries

Conclusion: Relational DBs are designed for general-purpose CRUD. TSDBs are purpose-built for append-only, time-ordered, numeric data. The gap can reach 10-100x in write throughput and storage efficiency.

TSDB Options: Comparison

TSDB	Architecture	Query Language	Scaling	Used by
Prometheus	Single-node, local storage	PromQL	Federation / Thanos	Kubernetes ecosystem
InfluxDB	Cluster (enterprise), single-node (OSS)	InfluxQL, Flux	Enterprise clustering	IoT, DevOps
TimescaleDB	PostgreSQL extension	SQL + custom functions	PostgreSQL native replication	Teams that want SQL compatibility
OpenTSDB	On top of HBase	HTTP API	HBase scaling	Hadoop ecosystem
VictoriaMetrics	Single-node + cluster	PromQL compatible	Built-in clustering	Prometheus alternative at scale
ClickHouse	Column-oriented OLAP	SQL	Built-in sharding	Analytics-heavy use cases
Apache Druid	Real-time + batch	SQL-like	Built-in clustering	Real-time analytics

LSM Tree: Why do TSDBs use LSM instead of B-tree?

Property	B-tree (PostgreSQL, MySQL)	LSM Tree (RocksDB, LevelDB)
Write pattern	Random I/O (update page in place)	Sequential I/O (append to WAL + memtable)
Write amplification	High for random inserts (page splits, rebalancing)	Usually lower for append-heavy loads; paid in background compaction (merge sort)
Read performance	Good (O(log n) lookup)	Worse (check memtable + multiple levels)
Space amplification	Low (in-place update)	Higher (multiple copies during compaction)
Best for	Read-heavy, update-heavy	Write-heavy, append-only

Conclusion: Time series data is write-heavy and append-only, so an LSM-style design is the natural choice. That is why Prometheus, InfluxDB, and VictoriaMetrics all use LSM-inspired storage engines (a write-ahead log plus an in-memory buffer, flushed to immutable blocks that are compacted in the background) rather than an update-in-place B-tree.

Compression: The Key to Storage Efficiency

Compression is vital for a TSDB. Without it, storage cost becomes unmanageable.

Gorilla encoding (Facebook, 2015) introduced the delta-of-delta and XOR techniques that most modern TSDBs build on; dictionary and run-length encoding are commonly used alongside them:

Technique	Applies to	How it works	Compression ratio
Delta-of-delta	Timestamps	Store the delta of the delta instead of the absolute timestamp. With a regular interval (every 10s), delta-of-delta = 0 → compresses extremely well	~1-2 bits/timestamp (instead of 64 bits)
XOR encoding	Float values	XOR the current value with the previous one. When values are close, the XOR result has many leading/trailing zeros → compresses well	~1-4 bytes/value (instead of 8 bytes)
Dictionary encoding	Tag values	Map string → integer ID. `"us-east-1"` → `3`	5-10x reduction for tags
Run-length encoding	Repeated values	Store `(value, count)` instead of repeating the value	Good for constant metrics

A visual example of delta-of-delta:

Timestamps:     1679000000  1679000010  1679000020  1679000030  1679000040
Deltas:                     10          10          10          10
Delta-of-delta:                         0           0           0

Instead of storing five 64-bit values (40 bytes), the encoder stores one base timestamp, the first delta, and then a single bit for each delta-of-delta of 0 (3 bits here).
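
The arithmetic behind delta-of-delta can be sketched as follows. This only shows the integer transformation and its inverse; the bit-level packing that Gorilla applies on top (a single `0` bit for a zero delta-of-delta, wider codes for larger ones) is left out, and the function names are hypothetical.

```python
def encode_timestamps(timestamps):
    """Return (base, first_delta, delta_of_deltas) for a list of timestamps."""
    if not timestamps:
        return None, None, []
    base = timestamps[0]
    if len(timestamps) == 1:
        return base, None, []

    first_delta = timestamps[1] - timestamps[0]
    prev_delta = first_delta
    dods = []
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        dods.append(delta - prev_delta)   # 0 whenever the interval is regular
        prev_delta = delta
    return base, first_delta, dods


def decode_timestamps(base, first_delta, dods):
    """Inverse of encode_timestamps."""
    if base is None:
        return []
    out = [base]
    if first_delta is None:
        return out
    delta = first_delta
    out.append(base + delta)
    for dod in dods:
        delta += dod
        out.append(out[-1] + delta)
    return out
```

With a perfectly regular 10-second interval, every delta-of-delta is 0, which is what allows the bit-level encoder to spend a single bit per timestamp.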

A visual example of XOR encoding:

Values (float64):    78.50    78.52    78.48    78.55
XOR with previous:   -        sign, exponent, and high mantissa bits cancel out; only the lower mantissa bits differ

Because the values are close, each XOR result starts with a long run of leading zeros, so only the bits that actually differ need to be stored.
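
To see the effect on real bit patterns, the sketch below XORs consecutive float64 values and counts the leading and trailing zero bits. It illustrates only the XOR step, not the full Gorilla bit packing, and the helper names are hypothetical.

```python
import struct


def float_bits(x: float) -> int:
    """Reinterpret a float64 as its 64-bit integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stream(values):
    """XOR each value's bit pattern with the previous one."""
    bits = [float_bits(v) for v in values]
    return [cur ^ prev for prev, cur in zip(bits, bits[1:])]


def leading_zeros(x: int) -> int:
    return 64 - x.bit_length()


def trailing_zeros(x: int) -> int:
    return 64 if x == 0 else (x & -x).bit_length() - 1


if __name__ == "__main__":
    for x in xor_stream([78.50, 78.52, 78.48, 78.55]):
        lz, tz = leading_zeros(x), trailing_zeros(x)
        print(f"{x:064b}  leading={lz}  trailing={tz}  meaningful={64 - lz - tz}")
```

Identical consecutive values XOR to 0 and cost almost nothing to store. Values that differ only slightly share their sign and exponent bits, so at least the first 12 bits of the XOR are zero; how many bits remain depends on how the decimals happen to be represented in binary.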

Aha Moment: With Gorilla-style encoding, a TSDB reaches roughly 10x compression or better. A raw data point takes ~16 bytes; after compression it shrinks to about 1-2 bytes (the Gorilla paper reports 1.37 bytes on average). That is why a monitoring system can store billions of data points at a reasonable storage cost.

Downsampling: Trade Precision for Storage

Downsampling reduces the resolution of data as it ages:

Tier	Resolution	Retention	Use case
Raw	10 seconds	7 days	Debug real-time issues
1 minute	1 minute	30 days	Recent investigation
5 minutes	5 minutes	90 days	Weekly review
1 hour	1 hour	1 year	Monthly reporting
1 day	1 day	3 years	Annual trends

How downsampling works: A continuous query (or background job) aggregates the raw data:

Raw (10s): 78.5, 78.3, 78.7, 78.4, 78.6, 78.2  (6 points in 1 minute)
      ↓ downsample to 1-minute
1-min: avg=78.45, min=78.2, max=78.7, count=6     (1 aggregated point)

Note: When downsampling, always store min, max, avg, sum, and count, not just avg. With avg alone you lose the spikes (max) and dips (min).
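
One more reason to keep sum and count: coarser tiers can then be built from finer ones without going back to raw data. The sketch below is a minimal illustration; the `Rollup` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollup:
    lo: float      # minimum value in the window
    hi: float      # maximum value in the window
    total: float   # sum of values
    count: int     # number of raw points

    @property
    def avg(self) -> float:
        return self.total / self.count


def rollup_raw(values):
    """Aggregate the raw points of one window (e.g. six 10s points into 1 minute)."""
    return Rollup(min(values), max(values), sum(values), len(values))


def merge(rollups):
    """Combine finer rollups into a coarser one (e.g. five 1-min into one 5-min)."""
    return Rollup(
        lo=min(r.lo for r in rollups),
        hi=max(r.hi for r in rollups),
        total=sum(r.total for r in rollups),
        count=sum(r.count for r in rollups),
    )
```

Merging from sum and count gives the true average. Averaging the averages would be wrong whenever windows hold different numbers of points: windows `[1, 2, 3]` and `[10]` have a true average of 4.0, while the average of their averages is 6.0.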

Storage savings from downsampling:

Tier	Data points/series/day	Reduction vs raw
Raw (10s)	8,640	1x
1 minute	1,440	6x
5 minutes	288	30x
1 hour	24	360x
1 day	1	8,640x

Trade-off: Looking at a CPU graph from 6 months ago at 1-hour resolution, you will not see a 2-minute spike. You will still see the min/max within each hour, which is enough to spot abnormal patterns.

2.12 Query Service — PromQL-like Language

Common Query Patterns

Pattern	Description	PromQL example
Instant query	Current value of a metric	`cpu_usage{host="s1"}`
Range query	Values over a time window	`cpu_usage{host="s1"}[5m]`
Aggregation	Aggregate across labels	`avg(cpu_usage) by (cluster)`
Rate	Per-second rate of increase of a counter	`rate(http_requests_total[5m])`
Top-K	Top servers by a metric	`topk(10, cpu_usage)`
Percentile	Compute a percentile from a histogram	`histogram_quantile(0.99, rate(http_duration_bucket[5m]))`
Math	Arithmetic between metrics	`memory_used / memory_total * 100`
Alert condition	Combined condition	`avg(cpu_usage) by (host) > 90` (the 5-minute hold is set with `for: 5m` in the alert rule, not in PromQL)

Query Optimization

Technique	Description	Effect
Pre-aggregation	Compute aggregated values ahead of time at ingest (recording rules)	The query `avg(cpu) by cluster` becomes a simple lookup instead of a scan over millions of series
Query result caching	Cache query results in Redis (TTL = collection interval)	Dashboards refresh every 30s with the same queries → cache hit
Time-based partitioning	Partition data by time block (e.g., 2-hour blocks in Prometheus)	A `last 1 hour` query reads only 1-2 blocks instead of scanning everything
Inverted index	Index label values → series IDs	Filtering on `{region="us-east-1"}` is fast, no scan needed
Query parallelism	Split query across time ranges, parallel fetch	24h query = 24 parallel 1h queries, merge results
Subquery optimization	Rewrite nested queries into an efficient form	Avoids the N+1 query problem
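
The inverted index row is worth a closer look, since it is what makes label filters cheap. Below is a minimal in-memory sketch; the class name `LabelIndex` and its methods are hypothetical, and a real TSDB would store sorted posting lists on disk rather than Python sets.

```python
from collections import defaultdict


class LabelIndex:
    """Maps (label, value) pairs to the set of series IDs that carry them."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, series_id, name, labels):
        # The metric name is indexed like any other label.
        self._postings[("__name__", name)].add(series_id)
        for key, value in labels.items():
            self._postings[(key, value)].add(series_id)

    def select(self, name, **matchers):
        """Series IDs matching the metric name and every label=value matcher."""
        result = set(self._postings.get(("__name__", name), set()))
        for key, value in matchers.items():
            result &= self._postings.get((key, value), set())
        return result


if __name__ == "__main__":
    idx = LabelIndex()
    idx.add(1, "cpu_usage", {"host": "s1", "region": "us-east-1"})
    idx.add(2, "cpu_usage", {"host": "s1", "region": "eu-west-1"})
    idx.add(3, "cpu_usage", {"host": "s2", "region": "us-east-1"})
    print(idx.select("cpu_usage", region="us-east-1"))
```

A filter such as `cpu_usage{region="us-east-1"}` becomes an intersection of two small sets instead of a scan over every series.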

Recording Rules (Pre-computed Queries)

Recording rules are an important concept for optimizing dashboard performance:

# Recording rule definition
name: cluster:cpu_usage:avg5m
expression: avg(avg_over_time(cpu_usage_percent[5m])) by (cluster)
interval: 1m

Instead of aggregating raw data from 10K servers on every dashboard refresh, the recording rule computes the result once a minute and stores it as a new metric. The dashboard queries the pre-computed metric, so response time drops from several seconds to milliseconds.

2.13 Alerting System — Brain of Monitoring

Alert Rules Engine

The alert rules engine is the most important component: it decides when a human needs to be paged.

Structure of an alert rule:

Field	Description	Example
Name	Alert name	`HighCPUUsage`
Expression	PromQL condition	`avg(cpu_usage{env="prod"}) by (host) > 90`
Duration	How long the condition must hold before firing	`for: 5m` (avoids false alarms from short spikes)
Severity	Severity level	`critical`, `warning`, `info`
Labels	Metadata for routing	`team="platform", service="payment"`
Annotations	Information for the on-call engineer	`summary="Host {{ $labels.host }} CPU > 90% for 5m"`

Alert States — State Machine

stateDiagram-v2
    [*] --> INACTIVE: Rule created
    INACTIVE --> PENDING: Condition becomes true
    PENDING --> FIRING: Condition true for >= duration
    PENDING --> INACTIVE: Condition becomes false<br/>before duration
    FIRING --> RESOLVED: Condition becomes false
    RESOLVED --> INACTIVE: After cooldown period
    RESOLVED --> PENDING: Condition true again
    FIRING --> FIRING: Condition still true<br/>(re-evaluate every interval)

    note right of PENDING
        Alert has not fired yet
        Waiting for the full duration
        E.g.: CPU > 90% for only 2 minutes
        (needs 5 minutes to fire)
    end note

    note right of FIRING
        Alert has fired
        Notification sent
        Keeps sending reminders
        until RESOLVED
    end note

    note right of RESOLVED
        Condition back to normal
        Send "resolved" notification
        Cooldown before reset
    end note

Detailed explanation:

State	Meaning	Action
INACTIVE	The condition has not occurred or has been reset	Do nothing
PENDING	The condition is true but has not yet held for `duration`	Wait. If the condition turns false before the duration elapses → back to INACTIVE
FIRING	The condition has been true for >= `duration`	Send notification, keep evaluating
RESOLVED	The condition turned false after FIRING	Send “resolved” notification

Why is the PENDING state needed? To avoid flapping alerts. A CPU spike to 95% for 30 seconds that falls back to 80% is not worth an alert at 3 a.m. Only when CPU stays above 90% for 5 minutes straight does the alert fire, which cuts false alarms significantly.
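
The state machine can be sketched as a small class that is fed one evaluation result at a time. This is a simplified illustration under stated assumptions: the names `AlertRule` and `evaluate` are hypothetical, the condition is a plain "value above threshold", and the cooldown after RESOLVED is modeled as a single evaluation cycle.

```python
from enum import Enum


class State(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"
    RESOLVED = "resolved"


class AlertRule:
    def __init__(self, threshold: float, for_seconds: int):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.state = State.INACTIVE
        self.pending_since = None

    def evaluate(self, value: float, now: int) -> State:
        """Advance the state machine with one sample taken at time `now` (seconds)."""
        breached = value > self.threshold
        if breached:
            if self.state in (State.INACTIVE, State.RESOLVED):
                self.state = State.PENDING
                self.pending_since = now
            if (self.state is State.PENDING
                    and now - self.pending_since >= self.for_seconds):
                self.state = State.FIRING
        elif self.state is State.FIRING:
            self.state = State.RESOLVED      # send the "resolved" notification here
        else:
            self.state = State.INACTIVE      # short spike or cooldown finished
            self.pending_since = None
        return self.state
```

A spike that clears before `for_seconds` elapses goes PENDING → INACTIVE and never pages anyone, which is exactly the flapping protection described above.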

Alert Manager — Routing & Processing

Alert Manager receives alerts from the Rules Engine and processes them before sending notifications:

flowchart TB
    ARE["Alert Rules Engine<br/>Firing alerts"] --> AM["Alert Manager"]

    subgraph AM_Internal["Alert Manager Processing"]
        direction TB
        GROUP["Grouping<br/>Bundle related alerts"]
        DEDUP["Deduplication<br/>Drop duplicate alerts"]
        SILENCE["Silencing<br/>Mute alerts during maintenance"]
        INHIBIT["Inhibition<br/>Alert A suppresses Alert B"]
        THROTTLE["Throttling<br/>Rate limit notifications"]
        ROUTE["Routing<br/>Send to the right team"]
    end

    AM --> GROUP --> DEDUP --> SILENCE --> INHIBIT --> THROTTLE --> ROUTE

    ROUTE -->|"critical"| PD["PagerDuty<br/>(wake up on-call)"]
    ROUTE -->|"warning"| SLACK["Slack Channel<br/>(#alerts-platform)"]
    ROUTE -->|"info"| EMAIL["Email Digest<br/>(daily summary)"]

    style AM fill:#e91e63,stroke:#333,stroke-width:2px,color:#fff
    style PD fill:#f44336,stroke:#333,stroke-width:2px,color:#fff
    style SLACK fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff

Details of each processing step:

Step	Description	Example
Grouping	Bundles alerts that share the same label set into one notification	50 servers in the same cluster with high CPU → 1 notification: “Cluster X: 50 hosts high CPU”
Deduplication	Does not resend an alert that was already sent (within the repeat interval)	Alert “High CPU server-042” was sent 5 minutes ago → skip (repeat interval = 15m)
Silencing	Mutes alerts for resources under maintenance	Server-042 is being patched → silence all alerts for server-042 from 2am to 4am
Inhibition	A higher-level alert suppresses lower-level ones	Alert “Node down” suppresses “High CPU” and “Disk full” for the same node (once the node is down, the other metrics are meaningless)
Throttling	Rate limit notifications per channel	At most 10 PagerDuty alerts/hour per team → avoids an alert storm
Routing	Routes to the right team/channel based on labels	`team="payment"` + `severity="critical"` → PagerDuty payment on-call
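
The grouping and deduplication steps can be sketched together. The class below is a hypothetical, in-memory stand-in: `group_by` and `repeat_interval_s` loosely mirror the Alertmanager settings of similar names, but nothing here is Alertmanager's actual implementation.

```python
from collections import defaultdict


class Notifier:
    def __init__(self, group_by, repeat_interval_s):
        self.group_by = tuple(group_by)
        self.repeat_interval_s = repeat_interval_s
        self._last_sent = {}   # group key -> time of the last notification

    def process(self, alerts, now):
        """alerts: list of label dicts for firing alerts. Returns notifications to send."""
        groups = defaultdict(list)
        for labels in alerts:
            key = tuple(labels.get(k, "") for k in self.group_by)
            groups[key].append(labels)

        notifications = []
        for key, members in groups.items():
            last = self._last_sent.get(key)
            if last is not None and now - last < self.repeat_interval_s:
                continue   # dedup: this group was notified recently
            self._last_sent[key] = now
            notifications.append({
                "group": dict(zip(self.group_by, key)),
                "count": len(members),
            })
        return notifications
```

Fifty hosts firing `HighCPU` in the same cluster produce one notification with `count` 50, and re-evaluations inside the repeat interval stay silent.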

Escalation Policy

Level	Wait time	Action
L1	0 minutes	PagerDuty alert to primary on-call
L2	15 minutes without ACK	Escalate to secondary on-call
L3	30 minutes without ACK	Escalate to team lead + Slack broadcast
L4	60 minutes without ACK	Escalate to engineering manager + incident channel
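
The escalation table maps directly to a lookup on how long an alert has gone unacknowledged. A minimal sketch, with hypothetical names and the thresholds copied from the table:

```python
# (minutes without ACK, who gets notified), in escalation order
ESCALATION_POLICY = [
    (0, "primary on-call"),
    (15, "secondary on-call"),
    (30, "team lead + Slack broadcast"),
    (60, "engineering manager + incident channel"),
]


def escalation_targets(minutes_unacked: int):
    """Everyone who should have been notified by now."""
    return [target for wait, target in ESCALATION_POLICY if minutes_unacked >= wait]


def current_level(minutes_unacked: int) -> int:
    """1-based escalation level (L1..L4)."""
    return len(escalation_targets(minutes_unacked))
```

Keeping the policy as data rather than code makes it easy to give each team its own thresholds.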

Aha Moment: A good alerting system does not send many alerts; it sends the right alert, to the right person, at the right time, at the right severity. Grouping, dedup, silencing, and inhibition all exist to cut noise and boost signal.

2.14 Visualization — Grafana-like Dashboard

Dashboard Architecture

Component	Role	Details
Dashboard definition	JSON/YAML config describing panels, queries, layout	Version controlled in Git
Panel types	Graph (time series), Singlestat, Table, Heatmap, Gauge	Each panel = one or more queries
Variable/template	Dynamic filters (dropdowns for cluster, host, service)	Query TSDB label values to populate the dropdowns
Auto-refresh	The dashboard refreshes itself every 10-30 seconds	WebSocket or polling
Time range selector	Select a time range: last 15m, 1h, 6h, 24h, 7d, custom	Affects the query range
Drill-down	Click a graph → zoom in; click a host → see details	Links between dashboards
Annotations	Mark events on the timeline (deploy, incident, config change)	Correlate metrics with events

Real-time Streaming vs Polling

Approach	Description	When to use
Polling	Frontend calls the API every X seconds to fetch new data	Simple; fits a refresh interval of 10-30s
WebSocket	Server pushes new data as it arrives	Real-time dashboard (< 5s latency)
Server-Sent Events (SSE)	Unidirectional stream from the server	Simpler than WebSocket, one-way data

In practice: Grafana dashboards refresh by polling at a configurable interval (10s, for example) because monitoring data does not need sub-second latency. Polling is also much simpler than WebSocket in terms of infrastructure.

Dashboard Best Practices

Practice	Description	Reason
USE Method	Utilization, Saturation, Errors per resource	Brendan Gregg’s framework — systematic resource analysis
RED Method	Rate, Errors, Duration per service	Tom Wilkie’s framework — service-level monitoring
Four Golden Signals	Latency, Traffic, Errors, Saturation	Google SRE book — most important metrics
Layered dashboards	Overview → Cluster → Host → Process	Drill-down from high-level to detail
Alerts on dashboard	Show active alerts inline with metrics	Context for investigating

2.15 Storage — Hot/Warm/Cold Tiering

Storage Tiering Architecture

flowchart TB
    subgraph "Hot Tier (Recent 7 days)"
        HOT_SSD["NVMe SSD<br/>Raw data (10s resolution)<br/>In-memory index<br/>Fastest query"]
    end

    subgraph "Warm Tier (7 days - 3 months)"
        WARM_HDD["HDD / Cheap SSD<br/>Downsampled to 1-min<br/>Compressed<br/>Moderate query speed"]
    end

    subgraph "Cold Tier (3 months - 3 years)"
        COLD_S3["Object Storage (S3/GCS)<br/>Downsampled to 1-hour<br/>Heavily compressed<br/>Slow but cheapest"]
    end

    HOT_SSD -->|"After 7 days:<br/>downsample + migrate"| WARM_HDD
    WARM_HDD -->|"After 3 months:<br/>further downsample + migrate"| COLD_S3
    COLD_S3 -->|"After retention:<br/>delete"| DEL["Delete"]

    QS["Query Service"] -->|"Last 7d queries"| HOT_SSD
    QS -->|"7d-3mo queries"| WARM_HDD
    QS -->|"3mo+ queries"| COLD_S3

    style HOT_SSD fill:#f44336,stroke:#333,stroke-width:2px,color:#fff
    style WARM_HDD fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style COLD_S3 fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff

Storage Cost Comparison

Tier	Media	Cost (estimate/GB/month)	Read latency	Use case
Hot	NVMe SSD	$0.20-0.40	< 1ms	Real-time dashboards, alerting
Warm	HDD / SATA SSD	$0.05-0.10	5-50ms	Investigation, weekly reports
Cold	S3/GCS	$0.01-0.02	100ms-1s	Compliance, annual reports

Compaction

Compaction is a background process that merges many small files into fewer large files:

Problem without compaction	What compaction fixes
Thousands of small SSTable files	Merged into a few large files
Reads must check many files (read amplification)	Reads check only a few files
Deleted data still takes space (tombstones)	Tombstoned data is physically removed
Overlapping time ranges between files	Non-overlapping time ranges after compaction

Compaction strategies:

Strategy	Description	Trade-off
Size-tiered	Merge files of similar size	Write-optimized, high space amplification
Leveled	Merge by levels, each level 10x larger than the previous one	Read-optimized, high write amplification
Time-window	Merge files in the same time window	Fits TSDBs (data is naturally partitioned by time)

Retention Policies

Policy	Description	Configuration
Time-based	Delete data older than X days	`retention_period = 365d`
Size-based	Delete the oldest data when total size > threshold	`max_storage = 10TB`
Resolution-based	Downsample data older than X, delete raw	`raw_retention = 7d, 1m_retention = 90d`
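A resolution-based policy boils down to a lookup from data age to tier and resolution. The sketch below assumes the boundaries from the configuration examples above (7 days raw, 90 days at 1-minute resolution, 365 days total retention); `TIERS` and `tier_for_age` are hypothetical names, not settings of any particular TSDB.

```python
from datetime import timedelta

# (maximum age, tier name, stored resolution), checked from youngest to oldest.
TIERS = [
    (timedelta(days=7), "hot", "10s"),
    (timedelta(days=90), "warm", "1m"),
    (timedelta(days=365), "cold", "1h"),
]


def tier_for_age(age):
    """Return (tier, resolution) for data of the given age, or None if past retention."""
    for max_age, tier, resolution in TIERS:
        if age <= max_age:
            return tier, resolution
    return None  # older than retention_period: eligible for deletion


recent = tier_for_age(timedelta(days=1))
last_month = tier_for_age(timedelta(days=30))
last_half_year = tier_for_age(timedelta(days=200))
expired = tier_for_age(timedelta(days=400))
```

A query service could use the same lookup to decide which tier to read for a given time range.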

3. Capacity Estimation

3.1 Assumptions

Parameter	Value	Explanation
Servers monitored	10,000	Medium-large infrastructure
Metrics per server	100	CPU, memory, disk, network, custom
Total time series	1,000,000	10K x 100
Collection interval	10 seconds	Standard for infrastructure
Raw data point size (uncompressed)	16 bytes	8 bytes timestamp + 8 bytes float64 value
Compression ratio	12x	Gorilla encoding average
Tags overhead per series	200 bytes	Average tag set size
Retention: raw	7 days	Hot tier
Retention: 1-minute	90 days	Warm tier
Retention: 1-hour	365 days	Cold tier
Query QPS (dashboard)	500	Upper bound: 50 dashboards x 10 panels, each refreshing every second (section 3.6 uses a 30s refresh instead)
Alert rules	5,000	Evaluated every 15-60 seconds

3.2 Ingestion Rate

$$\text{Data points}_{\text{per second}} = \frac{\text{Total time series}}{\text{Collection interval}} = \frac{1{,}000{,}000}{10} = 100{,}000\ \text{points/sec}$$

Note: The 10M/s figure in the requirements also covers high-cardinality metrics (e.g., per-request latency histograms) and redundancy (push + pull overlap). With pre-aggregation at the agent level:

$$\text{Effective ingestion}_{\text{after pre-agg}} \approx \frac{10{,}000{,}000}{60} \approx 167{,}000\ \text{points/sec}$$

Explanation: For some metrics, pre-aggregation at the agent level collapses 60 raw data points into 1 aggregated point per minute, which cuts volume significantly. A 10s interval yields only 6 points per minute, so the 60:1 factor applies to sources that emit roughly one sample per second.

3.3 Storage Estimation

Raw data (Hot tier, 7 days)

$$\text{Points}_{\text{per day}} = 100{,}000\ \text{points/sec} \times 86{,}400\ \text{sec/day} = 8.64 \times 10^{9}\ \text{points/day}$$

$$\text{Raw size}_{\text{per day}} = 8.64 \times 10^{9} \times 16\ \text{bytes} = 138.24\ \text{GB/day}$$

$$\text{Compressed}_{\text{per day}} = \frac{138.24\ \text{GB}}{12} = 11.52\ \text{GB/day}$$

$$\text{Hot tier storage} = 11.52\ \text{GB/day} \times 7\ \text{days} = 80.64\ \text{GB} \approx 81\ \text{GB}$$

1-minute downsampled (Warm tier, 90 days)

Downsample ratio: 10s → 1min cuts the number of data points by 6x, but each point now stores 5 aggregates (min, max, avg, sum, count), so the net reduction is ~1.2x:

$$\text{Warm daily} = \frac{11.52\ \text{GB/day}}{6} \times 5\ \text{aggregates} = 9.6\ \text{GB/day}$$

$$\text{Warm tier storage} = 9.6\ \text{GB/day} \times 90\ \text{days} = 864\ \text{GB}$$

1-hour downsampled (Cold tier, 365 days)

$$\text{Cold daily} = \frac{9.6\ \text{GB/day}}{60} = 0.16\ \text{GB/day}$$

$$\text{Cold tier storage} = 0.16\ \text{GB/day} \times 365\ \text{days} = 58.4\ \text{GB} \approx 59\ \text{GB}$$

Total storage

$$\text{Total storage} = \text{Hot} + \text{Warm} + \text{Cold} = 81 + 864 + 59 = 1{,}004\ \text{GB} \approx 1\ \text{TB}$$

Comparison without compression: raw storage for 1 year = $138.24\ \text{GB/day} \times 365 \approx 50.5\ \text{TB}$. With compression + downsampling → 1 TB, roughly a 50x reduction.
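The storage arithmetic above is easy to re-check in code. The sketch below simply replays the assumptions from section 3.1 (1M series, 10s interval, 16 bytes/point, 12x compression, 5 aggregates per downsampled point); the variable names are my own, and GB is taken as 10^9 bytes.

```python
SERIES = 1_000_000      # total time series
INTERVAL_S = 10         # collection interval in seconds
POINT_BYTES = 16        # 8-byte timestamp + 8-byte float64 value
COMPRESSION = 12        # assumed average compression ratio
GB = 1e9                # decimal gigabyte

points_per_sec = SERIES / INTERVAL_S
raw_gb_per_day = points_per_sec * 86_400 * POINT_BYTES / GB
compressed_gb_per_day = raw_gb_per_day / COMPRESSION

hot_gb = compressed_gb_per_day * 7                  # 7 days of raw data
warm_gb_per_day = compressed_gb_per_day / 6 * 5     # 6x fewer points, 5 aggregates each
warm_gb = warm_gb_per_day * 90                      # 90 days at 1-minute resolution
cold_gb_per_day = warm_gb_per_day / 60              # 1min -> 1hour
cold_gb = cold_gb_per_day * 365                     # 365 days at 1-hour resolution

total_gb = hot_gb + warm_gb + cold_gb
uncompressed_year_tb = raw_gb_per_day * 365 / 1000

print(f"hot={hot_gb:.2f} GB warm={warm_gb:.0f} GB cold={cold_gb:.1f} GB total={total_gb:.0f} GB")
print(f"uncompressed 1 year = {uncompressed_year_tb:.1f} TB")
```

Without rounding each tier up, the total comes to about 1,003 GB, which agrees with the ~1 TB estimate.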

3.4 Index Storage

Inverted index for labels:

$$\text{Index size} = \text{Total time series} \times \text{Avg tags overhead} = 1{,}000{,}000 \times 200\ \text{bytes} = 200\ \text{MB}$$

The index is small enough to fit entirely in memory → fast label-based filtering.

3.5 Kafka Throughput

$$\text{Kafka ingestion} = 10{,}000{,}000\ \text{points/sec} \times 16\ \text{bytes} = 160\ \text{MB/s}$$

$$\text{Kafka with overhead (headers, keys, batching)} \approx 160\ \text{MB/s} \times 1.5 = 240\ \text{MB/s}$$

$$\text{Kafka partitions needed} = \frac{240\ \text{MB/s}}{10\ \text{MB/s per partition}} = 24\ \text{partitions (minimum)}$$

Recommendation: 64 partitions for headroom and parallelism. A Kafka cluster of 3-5 brokers, each handling ~100 MB/s.
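The partition count follows mechanically from the throughput figures, so it can be wrapped in a small helper for trying other assumptions. The function name `kafka_partitions` and its parameter names are hypothetical; the defaults (1.5x overhead, 10 MB/s per partition) are the planning numbers used above, not measured Kafka limits.

```python
import math


def kafka_partitions(points_per_sec, bytes_per_point=16, overhead=1.5,
                     per_partition_mb_s=10):
    """Return (throughput in MB/s including overhead, minimum partition count)."""
    throughput_mb_s = points_per_sec * bytes_per_point / 1e6 * overhead
    partitions = math.ceil(throughput_mb_s / per_partition_mb_s)
    return throughput_mb_s, partitions


throughput, minimum = kafka_partitions(10_000_000)
print(f"{throughput:.0f} MB/s -> at least {minimum} partitions")
```

The recommended 64 partitions leaves roughly 2.7x headroom over this minimum.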

3.6 Query QPS

$$\text{Dashboard QPS} = 50\ \text{dashboards} \times 10\ \text{panels/dashboard} \times \frac{1}{30\ \text{s refresh}} = 16.7\ \text{QPS}$$

$$\text{Alert evaluation QPS} = \frac{5{,}000\ \text{rules}}{15\ \text{s interval}} = 333\ \text{QPS}$$

$$\text{Ad-hoc query QPS} \approx 50\ \text{QPS (engineers investigating)}$$

$$\text{Total query QPS} = 17 + 333 + 50 = 400\ \text{QPS}$$

$$\text{Peak query QPS} = 400 \times 3 = 1{,}200\ \text{QPS}$$

Observation: The query load is light compared with the write load. A monitoring system is extremely write-heavy: write:read ratio ~ 250:1 (100K writes/s vs 400 reads/s).

3.7 Estimation Summary

Metric	Value
Ingestion rate (raw)	10M data points/sec
Ingestion rate (after pre-agg)	~167K data points/sec
Kafka throughput	~240 MB/s
Hot storage (7d, raw, compressed)	~81 GB
Warm storage (90d, 1-min, compressed)	~864 GB
Cold storage (365d, 1-hour, compressed)	~59 GB
Total storage (1 year)	~1 TB
Uncompressed equivalent	~50.5 TB
Compression + downsampling savings	~50x
Total time series	1M
Index size	~200 MB
Query QPS (peak)	~1,200
Alert rules	5,000
Write:Read ratio	~250:1

4. Security

4.1 Access Control — Multi-tenant Isolation

In a large organization, many teams share one monitoring platform, so security is critical:

Concern	Solution	Details
Team A viewing Team B's metrics	Label-based access control (LBAC)	Each tenant has a label `team="X"`; the query engine automatically injects the filter `{team="X"}` based on user identity
Unauthorized alert config changes	RBAC (Role-Based Access Control)	Roles: viewer (view dashboards), editor (create dashboards/alerts), admin (manage users/data sources)
Cross-tenant data leaks	Namespace isolation	Each tenant has a separate TSDB namespace or a separate database instance
API access	API key + OAuth2	Each service needs an API key to push metrics; each user needs an OAuth2 token to query
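To make the LBAC row concrete, here is a deliberately naive sketch of filter injection for a simple selector of the form `metric{label="value"}`. It is an assumption-laden illustration: `inject_tenant` is a hypothetical helper, it handles only a bare selector (no functions or operators), and it splits matchers on commas, so label values containing commas are not supported. A real query engine would rewrite the parsed query tree instead of a string.

```python
import re

SELECTOR_RE = re.compile(r'([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?')
TEAM_MATCHER_RE = re.compile(r'team\s*(=|!=|=~|!~)')


def inject_tenant(selector, team):
    """Force team="<team>" onto a simple selector, replacing any caller-supplied team matcher."""
    match = SELECTOR_RE.fullmatch(selector.strip())
    if match is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    name, matchers = match.group(1), match.group(2) or ""
    kept = []
    for matcher in matchers.split(","):
        matcher = matcher.strip()
        # Drop user-supplied team matchers so a caller cannot widen their own scope.
        if matcher and not TEAM_MATCHER_RE.match(matcher):
            kept.append(matcher)
    kept.append(f'team="{team}"')
    return f'{name}{{{",".join(kept)}}}'


plain = inject_tenant("cpu_usage", "payment")
filtered = inject_tenant('cpu_usage{host="a"}', "payment")
spoofed = inject_tenant('cpu_usage{team="other",host="a"}', "payment")
```

The important design point is that the tenant filter comes from the authenticated identity, never from the query text supplied by the user.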

Multi-tenancy Architecture Options

Approach	Description	Pros	Cons
Shared cluster + label filter	All tenants on one TSDB, distinguished by a `__tenant_id__` label	Simple, resource efficient	Noisy neighbor, weak security boundary
Namespace per tenant	Each tenant has a separate namespace/database in the same cluster	Balances isolation and efficiency	More complex to manage
Cluster per tenant	Each tenant has its own TSDB cluster	Full isolation	Resource-hungry, operational overhead

Recommendation: Shared cluster + label filter for small teams. Namespace per tenant for production grade. Cluster per tenant only when there are compliance requirements (e.g., PCI-DSS, HIPAA).

4.2 Preventing Alert Spam

Threat	Mitigation
Malicious service pushes fake metrics → triggers false alerts	Authenticate metric sources, validate metric names against an allow-list
Alert rule misconfiguration → alert storm	Alert rule review process (PR-based), dry-run mode, max alerts per rule limit
Notification channel abuse	Rate limit per channel: max 10 PagerDuty/hour, max 100 Slack/hour
Recursive alerting	"Monitoring the monitoring" alerts must not be routed through the same monitoring system

4.3 Securing Notification Channels

Channel	Security measure
PagerDuty	API key rotation every 90 days, IP whitelist
Slack	Webhook URL stored in secret manager (Vault), not in config files
Email	SPF/DKIM/DMARC to prevent spoofing alert emails
Webhook	mTLS between alert manager and webhook endpoint, HMAC signature verification
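HMAC signature verification for webhooks can be done with Python's standard library alone. The sketch below assumes the alert manager and the receiver share a secret and that the signature is a hex-encoded HMAC-SHA256 of the raw request body; the function names, the secret, and the payload are made up for the example.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Sender side: compute a hex HMAC-SHA256 signature over the raw body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, received_signature)


secret = b"example-shared-secret"  # in practice loaded from a secret manager
body = b'{"alertname": "HighCPU", "severity": "critical"}'
signature = sign_payload(secret, body)

genuine = verify_signature(secret, body, signature)
tampered = verify_signature(secret, body.replace(b"critical", b"info"), signature)
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not leak how many leading characters matched.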

4.4 Audit Logging

Every alerting-related change must be audited:

Event	What to log	Retention
Alert rule created/modified/deleted	Who, when, old config, new config	2 years
Silence created/expired	Who, what alerts silenced, duration	1 year
Notification channel config changed	Who, what changed	2 years
Dashboard permission changed	Who, what granted/revoked	1 year
Alert acknowledged/resolved	Who, when, how long to ACK	1 year

Why does audit logging matter? In a post-incident review, a question like "Who silenced the alert for server-042 at 2am without anyone knowing?" is exactly what the audit log answers.

5. DevOps — Monitoring the Monitoring System

5.1 Meta-monitoring — The Chicken-and-Egg Problem

Hieu, here is the classic question: "Who monitors the monitoring system?"

If the monitoring system goes down and nobody notices → all alerts are lost → production goes down and nobody knows.

Strategy	Description
Separate monitoring system	Use a second, simpler monitoring system to watch the main one, e.g. an external check from Pingdom/UptimeRobot
Self-monitoring	The system monitors itself (Prometheus monitoring Prometheus). But if it goes down, self-monitoring goes down with it
Dead man's switch	Send a heartbeat every minute to an external service. If a heartbeat is missed, the external service alerts. This is the most recommended approach
Cross-region monitoring	The monitoring system in region A monitors the one in region B, and vice versa

Best practice: Combine self-monitoring + a dead man's switch (Cronitor, Healthchecks.io, or a simple external cron check).
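The dead man's switch is simple enough to sketch: the external side only remembers when it last heard a heartbeat and alerts when the silence lasts too long. The class below is a hypothetical in-memory model, not the API of Cronitor or Healthchecks.io; the 120s timeout (two missed one-minute heartbeats) is an assumed value.

```python
class DeadMansSwitch:
    """External watchdog: triggers when heartbeats stop arriving."""

    def __init__(self, timeout_s=120):
        self.timeout_s = timeout_s
        self.last_heartbeat_s = None

    def heartbeat(self, now_s):
        """Called by the monitoring system, e.g. once per minute."""
        self.last_heartbeat_s = now_s

    def is_triggered(self, now_s):
        """Called by the external service on its own schedule."""
        if self.last_heartbeat_s is None:
            return True  # never heard from the monitoring system at all
        return now_s - self.last_heartbeat_s > self.timeout_s


switch = DeadMansSwitch(timeout_s=120)
before_first_heartbeat = switch.is_triggered(now_s=0)
switch.heartbeat(now_s=0)
switch.heartbeat(now_s=60)
healthy = switch.is_triggered(now_s=100)     # 40s since the last heartbeat
silent = switch.is_triggered(now_s=300)      # 240s of silence
```

The key property is that the alert fires on the absence of a signal, so it still works when the monitoring system is completely down.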

5.2 Key Operational Metrics to Monitor

Metric	Meaning	Threshold
Ingestion lag	Gap between a data point's timestamp and the moment it becomes queryable	< 30s normal, > 60s warning, > 5m critical
Kafka consumer lag	Number of messages not yet consumed	< 10K normal, > 100K warning, > 1M critical
TSDB write latency P99	Time to write one batch to the TSDB	< 100ms normal, > 500ms warning
TSDB compaction duration	Time a compaction run takes	< 10 min normal, > 30 min warning
TSDB disk usage	Percentage of disk used	< 70% normal, > 80% warning, > 90% critical
Query latency P99	Time to process a query	< 2s normal, > 5s warning, > 10s critical
Alert evaluation duration	Time to evaluate all alert rules	< evaluation_interval (15s)
Notification delivery latency	Time from alert firing to notification sent	< 30s normal, > 60s warning
Active time series	Number of active time series (written to recently)	Monitor for cardinality explosion
Dropped data points	Data points dropped due to overload	0 is ideal, > 0.1% is concerning

5.3 TSDB Compaction Monitoring

Compaction is a critical background task. If compaction falls behind:

Symptom	Root cause	Impact
Compaction queue growing	Write rate > compaction rate	Slower queries (more files to scan)
Disk space increasing faster than expected	Compaction cannot keep up with merges/deletes	Disk full risk
Query latency rising	Too many small files not yet compacted	Read amplification

Actions:

Monitor `compaction_pending_bytes` and `compaction_duration`
Alert if compaction lag > 2x average
Tune compaction concurrency (number of compaction goroutines/threads)
Increase disk IOPS if the disk is the bottleneck
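The "lag > 2x average" rule from the actions above could be checked like this. It is a minimal sketch: `compaction_lag_alert` is a hypothetical helper, the history is assumed to be a list of recent lag samples (for example pending bytes), and a plain mean is used as the baseline.

```python
from statistics import mean


def compaction_lag_alert(history, current, factor=2.0):
    """Return True when the current lag exceeds `factor` times the historical average."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    return current > factor * mean(history)


recent_lag_mb = [100, 120, 110, 90, 80]  # average = 100 MB
normal = compaction_lag_alert(recent_lag_mb, current=150)
lagging = compaction_lag_alert(recent_lag_mb, current=250)
```

A moving window for the history keeps the baseline from drifting upward after a long period of sustained lag.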

5.4 Capacity Planning

Signal	Measurement	Action when the trend worsens
Time series growth rate	Active series count over time	Add TSDB nodes or reduce cardinality
Ingestion rate growth	Data points/sec trend	Scale Kafka + stream processors
Storage growth rate	TB used over time	Add storage, adjust retention, increase downsampling
Query latency trend	P99 latency over weeks	Add query cache, optimize recording rules, add query nodes

5.5 Runbook Essentials

Every alert must link to a runbook. A runbook contains:

Section	Content
What	What this alert means (plain language)
Impact	Business impact if it is not fixed
Diagnosis	Steps to check (dashboard links, queries to run)
Remediation	Steps to fix (restart service, scale up, failover)
Escalation	When to escalate and to whom
Post-incident	Create an incident ticket, schedule a post-mortem

6. Mermaid Diagrams — Architecture Views

6.1 Overall System Architecture (Detailed)

flowchart TB
    subgraph Sources["Metrics Sources"]
        direction LR
        APP["Application<br/>Servers (5K)"]
        DB_SRC["Database<br/>Servers (1K)"]
        MQ_SRC["Message<br/>Queues (500)"]
        LB_SRC["Load<br/>Balancers (200)"]
        K8S["Kubernetes<br/>Pods (3K)"]
        CUSTOM["Custom Business<br/>Metrics"]
    end

    subgraph Collection["Collection Layer"]
        direction LR
        EXPORTER["Prometheus<br/>Exporters"]
        STATSD["StatsD<br/>Agents"]
        TELEGRAF["Telegraf<br/>Agents"]
        SD_COL["Service<br/>Discovery"]
    end

    subgraph Ingestion["Ingestion Pipeline"]
        direction LR
        KAFKA_IN["Kafka Cluster<br/>(5 brokers, 64 partitions)<br/>240 MB/s throughput"]
        FLINK["Apache Flink<br/>(Stream Processing)<br/>Pre-aggregation<br/>Rate computation"]
    end

    subgraph Storage["Storage Layer"]
        direction TB
        TSDB_MAIN["Primary TSDB Cluster<br/>(InfluxDB / VictoriaMetrics)<br/>Write: 167K points/sec"]

        subgraph Tiers["Storage Tiers"]
            direction LR
            HOT_T["Hot (SSD)<br/>7 days<br/>81 GB"]
            WARM_T["Warm (HDD)<br/>90 days<br/>864 GB"]
            COLD_T["Cold (S3)<br/>365 days<br/>59 GB"]
        end
    end

    subgraph QueryAlert["Query & Alerting"]
        direction TB
        QS_MAIN["Query Service<br/>(PromQL Engine)"]
        CACHE_Q["Query Cache<br/>(Redis)"]
        RECORDING["Recording Rules<br/>(Pre-computed)"]
        ALERT_ENGINE["Alert Rules Engine<br/>(5K rules, eval every 15s)"]
        ALERT_MGR["Alert Manager<br/>(Group, Dedup,<br/>Silence, Route)"]
    end

    subgraph Output["Output"]
        direction LR
        GRAFANA["Grafana<br/>Dashboards"]
        PD_OUT["PagerDuty"]
        SLACK_OUT["Slack"]
        EMAIL_OUT["Email"]
        API_OUT["Metrics API<br/>(for automation)"]
    end

    Sources --> Collection
    Collection --> Ingestion
    Ingestion --> Storage
    TSDB_MAIN --- Tiers
    Storage --> QueryAlert
    QS_MAIN --> CACHE_Q
    QS_MAIN --> RECORDING
    QS_MAIN --> GRAFANA
    QS_MAIN --> API_OUT
    ALERT_ENGINE --> ALERT_MGR
    ALERT_MGR --> PD_OUT & SLACK_OUT & EMAIL_OUT
    SD_COL --> Collection

    style KAFKA_IN fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style TSDB_MAIN fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style ALERT_ENGINE fill:#f44336,stroke:#333,stroke-width:2px,color:#fff
    style GRAFANA fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff

6.2 Ingestion Pipeline Detail

flowchart LR
    subgraph Agents["Collection Agents"]
        A1["Agent<br/>server-001"]
        A2["Agent<br/>server-002"]
        AN["Agent<br/>server-N"]
    end

    subgraph Buffer["Kafka Buffer"]
        direction TB
        TOPIC_RAW["Topic: raw-metrics<br/>64 partitions<br/>retention: 24h<br/>replication: 3"]
    end

    subgraph Processing["Stream Processing (Flink)"]
        direction TB
        WINDOW["Tumbling Window<br/>(1 minute)"]
        AGG_SUM["Sum Aggregator<br/>(counters)"]
        AGG_AVG["Average Aggregator<br/>(gauges)"]
        AGG_HIST["Histogram Merger<br/>(distributions)"]
        AGG_RATE["Rate Computer<br/>(counters → rate/sec)"]
    end

    subgraph Output_Kafka["Kafka Output"]
        TOPIC_AGG["Topic: aggregated-metrics<br/>32 partitions"]
    end

    subgraph Write["TSDB Write Path"]
        WAL["Write-Ahead Log<br/>(durability)"]
        MEMTABLE["Memtable<br/>(in-memory, sorted)"]
        FLUSH["Flush to disk<br/>(when memtable full)"]
        SSTABLE["SSTable on disk<br/>(compressed, immutable)"]
    end

    Agents -->|"Batch push<br/>every 10s"| Buffer
    Buffer --> WINDOW
    WINDOW --> AGG_SUM & AGG_AVG & AGG_HIST & AGG_RATE
    AGG_SUM & AGG_AVG & AGG_HIST & AGG_RATE --> Output_Kafka
    Output_Kafka --> WAL
    WAL --> MEMTABLE
    MEMTABLE -->|"Memtable full<br/>(128MB)"| FLUSH
    FLUSH --> SSTABLE

    style TOPIC_RAW fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style MEMTABLE fill:#FFEB3B,stroke:#333,stroke-width:2px,color:#000
    style SSTABLE fill:#90A4AE,stroke:#333,stroke-width:2px,color:#000

6.3 Alerting Flow Detail

flowchart TB
    subgraph RuleEval["Alert Rules Evaluation (every 15s)"]
        RULES["5,000 Alert Rules<br/>(stored in ConfigDB)"]
        EVAL["Rule Evaluator<br/>(parallel evaluation)"]
        TSDB_Q["Query TSDB"]
    end

    subgraph StateM["State Machine per Rule"]
        direction LR
        INACTIVE["INACTIVE"]
        PENDING["PENDING<br/>(waiting for duration)"]
        FIRING["FIRING"]
        RESOLVED["RESOLVED"]
    end

    subgraph AlertMgr["Alert Manager"]
        direction TB
        RECV["Receive firing alerts"]
        GROUP_AM["Group by<br/>(cluster, alertname)"]
        DEDUP_AM["Deduplicate<br/>(same alert within<br/>repeat interval)"]
        SILENCE_AM["Check silences<br/>(maintenance windows)"]
        INHIBIT_AM["Check inhibition<br/>(parent alert<br/>suppresses child)"]
        ROUTE_AM["Route by<br/>severity + team labels"]
    end

    subgraph Notify["Notification"]
        CRIT["CRITICAL<br/>→ PagerDuty<br/>(wake up on-call)"]
        WARN["WARNING<br/>→ Slack #alerts<br/>(business hours)"]
        INFO["INFO<br/>→ Email digest<br/>(daily batch)"]
    end

    subgraph Escalation["Escalation"]
        L1["L1: Primary on-call<br/>(0 min)"]
        L2["L2: Secondary on-call<br/>(+15 min)"]
        L3["L3: Team lead<br/>(+30 min)"]
        L4["L4: Eng Manager<br/>(+60 min)"]
    end

    RULES --> EVAL
    EVAL -->|"PromQL query"| TSDB_Q
    TSDB_Q -->|"Result"| EVAL
    EVAL --> StateM
    FIRING -->|"Fire event"| AlertMgr
    RESOLVED -->|"Resolve event"| AlertMgr
    RECV --> GROUP_AM --> DEDUP_AM --> SILENCE_AM --> INHIBIT_AM --> ROUTE_AM
    ROUTE_AM --> CRIT & WARN & INFO
    CRIT --> L1
    L1 -->|"No ACK in 15m"| L2
    L2 -->|"No ACK in 15m"| L3
    L3 -->|"No ACK in 30m"| L4

    style FIRING fill:#f44336,stroke:#333,stroke-width:2px,color:#fff
    style CRIT fill:#d32f2f,stroke:#333,stroke-width:2px,color:#fff
    style WARN fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style INFO fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff

6.4 Storage Tiering Lifecycle

flowchart LR
    subgraph Ingest["Data Ingestion"]
        NEW["New data point<br/>t = now"]
    end

    subgraph Hot["Hot Tier"]
        HOT_STORE["NVMe SSD<br/>Resolution: 10s<br/>Age: 0 - 7 days<br/>Size: 81 GB"]
    end

    subgraph DS1["Downsampler #1"]
        DOWN1["Aggregate 10s → 1min<br/>(min, max, avg, sum, count)<br/>Runs daily"]
    end

    subgraph Warm["Warm Tier"]
        WARM_STORE["HDD / SATA SSD<br/>Resolution: 1 min<br/>Age: 7 - 90 days<br/>Size: 864 GB"]
    end

    subgraph DS2["Downsampler #2"]
        DOWN2["Aggregate 1min → 1hour<br/>Runs weekly"]
    end

    subgraph Cold["Cold Tier"]
        COLD_STORE["S3 / GCS<br/>Resolution: 1 hour<br/>Age: 90 - 365 days<br/>Size: 59 GB"]
    end

    subgraph Delete["Retention"]
        DEL_PROC["Delete data<br/>older than retention<br/>Runs daily"]
    end

    NEW --> HOT_STORE
    HOT_STORE -->|"Age > 7d"| DS1
    DS1 --> WARM_STORE
    HOT_STORE -->|"Delete raw<br/>after 7d"| DEL_PROC
    WARM_STORE -->|"Age > 90d"| DS2
    DS2 --> COLD_STORE
    WARM_STORE -->|"Delete 1min<br/>after 90d"| DEL_PROC
    COLD_STORE -->|"Age > 365d"| DEL_PROC

    style HOT_STORE fill:#f44336,stroke:#333,stroke-width:2px,color:#fff
    style WARM_STORE fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style COLD_STORE fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff
    style DEL_PROC fill:#9E9E9E,stroke:#333,stroke-width:2px,color:#fff

7. Aha Moments & Pitfalls

7.1 Aha Moments: Key Lessons

Aha #1: Cardinality Explosion Kills TSDB

Problem: Every unique combination of metric name + label values is one time series. If you add a label such as `request_id` (unique per request) to a metric, the number of time series becomes unbounded.

Example:

http_request_duration{method="GET", path="/api/orders", status="200"}
→ 1 time series (OK)

http_request_duration{method="GET", path="/api/orders", status="200", request_id="abc123"}
→ EVERY REQUEST = 1 time series (DISASTER)

At 10K req/s → 10K new time series/s → 864M time series/day. The TSDB will:

OOM (out of memory) because the inverted index balloons
Time out queries because too many series must be scanned
Fall behind on compaction because there are too many files

Rule of thumb: Keep total active time series < 10M. If a label value has cardinality > 1000, it should not be a label; make it a log or trace attribute instead.

How to detect it: Monitor the `active_time_series_count` metric. Alert if it grows > 20% within 1 hour.
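Both halves of this lesson fit in a short sketch: counting series as unique (metric name, label set) pairs, and flagging growth above 20% in an hour. The helper names and the sample shape `(name, labels)` are assumptions for the example.

```python
def series_count(samples):
    """Count unique time series: one per (metric name, sorted label set)."""
    return len({(name, tuple(sorted(labels.items()))) for name, labels in samples})


def cardinality_alert(count_one_hour_ago, count_now, max_growth=0.20):
    """Return True when the active series count grew by more than max_growth in one hour."""
    if count_one_hour_ago == 0:
        return count_now > 0
    return (count_now - count_one_hour_ago) / count_one_hour_ago > max_growth


base = {"method": "GET", "path": "/api/orders", "status": "200"}

# 1,000 requests with bounded labels map to a single series.
bounded = [("http_request_duration", dict(base)) for _ in range(1000)]

# The same requests with a per-request label create one series each.
unbounded = [
    ("http_request_duration", {**base, "request_id": f"req-{i}"})
    for i in range(1000)
]

bounded_series = series_count(bounded)
unbounded_series = series_count(unbounded)
steady = cardinality_alert(1_000_000, 1_100_000)      # +10% in an hour
exploding = cardinality_alert(1_000_000, 1_300_000)   # +30% in an hour
```

The same 1,000 requests produce either 1 series or 1,000 series depending only on whether `request_id` is a label.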

Aha #2: Compression Ratio 10x-50x

Gorilla encoding (delta-of-delta for timestamps + XOR compression for values) shrinks time series data from 16 bytes/point to 1-2 bytes/point. This is why:

TSDBs can store billions of data points on commodity hardware
Relational DBs cannot compete (generic compression reaches only 2-3x)
Monitoring at large scale is feasible in terms of cost

Conditions for good compression:

Data points arrive at regular intervals (delta-of-delta timestamps ~ 0)
Values change little between consecutive points (XOR ~ 0)
Same metric name + tags (same time series)

When compression is poor:

Irregular timestamps (random arrival) → large delta-of-delta
Highly volatile values (random fluctuation) → large XOR
Too many unique time series (each series is too short to compress effectively)
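The two quantities that decide compression quality can be computed directly. This sketch only shows the delta-of-delta and XOR values that Gorilla-style encoding works on; it does not implement the variable-length bit packing that turns small values into 1-2 bytes per point. The function names and sample data are made up for the illustration.

```python
import struct


def delta_of_delta(timestamps):
    """Second-order differences of timestamps; all zeros for perfectly regular arrival."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def float_bits(value):
    """Reinterpret a float64 as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def xor_with_previous(values):
    """XOR each value's bit pattern with the previous one; 0 means the value repeated."""
    bits = [float_bits(v) for v in values]
    return [b ^ a for a, b in zip(bits, bits[1:])]


regular = [1000, 1010, 1020, 1030, 1040]    # one point every 10s
irregular = [1000, 1007, 1031, 1033, 1060]  # random arrival

regular_dod = delta_of_delta(regular)
irregular_dod = delta_of_delta(irregular)
stable_xor = xor_with_previous([78.5, 78.5, 78.5])
volatile_xor = xor_with_previous([78.5, 12.25, 93.125])
```

Regular timestamps and repeated values yield runs of zeros, which an encoder can store in about one bit each; the irregular and volatile cases leave large numbers that need many more bits.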

Aha #3: Downsampling — Trade Precision for Storage

Downsampling is not "losing data"; it converts raw precision into a statistical summary:

Raw (10s)	Downsampled (1h)
78.5, 78.3, 78.7, 78.4, 78.6, 78.2, … (360 points)	avg=78.45, min=75.2, max=82.1, count=360

You lose the ability to know exactly what the CPU was at 14:32:40. But you still know:

The average CPU for that hour (avg)
The highest spike (max)
The lowest dip (min)

For 3-month-old data, that is enough information for most use cases.
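A downsampler is essentially a group-by on time windows that keeps a few aggregates per window. The sketch below is a minimal in-memory version; `downsample` is a hypothetical function, and points are assumed to be `(timestamp_seconds, value)` tuples.

```python
def downsample(points, window_s):
    """Aggregate (timestamp_s, value) points into fixed windows of window_s seconds."""
    buckets = {}
    for ts, value in points:
        window_start = ts - ts % window_s
        buckets.setdefault(window_start, []).append(value)
    return {
        start: {
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "sum": sum(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Two minutes of 10s samples with values cycling through 0.0 .. 5.0.
points = [(i * 10, float(i % 6)) for i in range(12)]
per_minute = downsample(points, window_s=60)
```

Storing `sum` and `count` alongside `avg` lets a later stage (for example 1min → 1hour) compute an exact average of the raw points instead of an average of averages.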

Aha #4: Pull Model Simplifies Health Detection

In the pull model (Prometheus-style), when the collector cannot scrape a target, the target is treated as down (Prometheus records this as `up == 0`). This comes "for free": no separate heartbeat mechanism is needed.

In the push model, when the collector receives no data, it cannot tell whether the target is down, the network has an issue, or the agent crashed. An extra heartbeat/alive check is needed → more complexity.

This is one reason Prometheus won in the Kubernetes ecosystem: K8s service discovery + pull model = an elegant monitoring solution.

Aha #5: Alert on Symptoms, Not Causes

Bad alert (cause)	Good alert (symptom)
"CPU > 90%"	"API latency P99 > 500ms"
"Memory > 80%"	"Error rate > 1%"
"Disk > 70%"	"User-facing 5xx > 0.5%"

Cause-based alerts (CPU, memory) create noise: CPU at 90% can be perfectly normal if the service is still fast. Symptom-based alerts (latency, error rate) tell you that user experience is being affected.

Google SRE principle: “Alert on symptoms that affect users, debug causes with dashboards.”

7.2 Pitfalls: Traps to Avoid

Pitfall #1: Alert Fatigue

Problem: Too many alerts → the on-call engineer starts ignoring alerts → a critical alert is missed → production goes down.

Warning signs:

50% or more of alerts are false positives or not actionable
On-call engineers ACK alerts without investigating
A "mute all" culture

Solutions:

Every alert must be actionable. If the on-call engineer does not need to do anything, it is a log, not an alert
Review alert effectiveness monthly: what percentage are true positives?
Tune thresholds based on historical data
Use a `for` duration to filter out transient spikes
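The effect of a `for` duration is easiest to see as a small state machine. The class below is a simplified sketch using the INACTIVE → PENDING → FIRING → RESOLVED states from this case study; `AlertRule` and its fields are hypothetical, and real rule engines track state per label set rather than per rule.

```python
class AlertRule:
    """Threshold rule that fires only after the condition holds for `for_s` seconds."""

    def __init__(self, threshold, for_s):
        self.threshold = threshold
        self.for_s = for_s
        self.state = "INACTIVE"
        self.pending_since = None

    def evaluate(self, value, now_s):
        breached = value > self.threshold
        if not breached:
            # A firing alert resolves; a pending one is silently dropped.
            self.state = "RESOLVED" if self.state == "FIRING" else "INACTIVE"
            self.pending_since = None
        elif self.state in ("INACTIVE", "RESOLVED"):
            self.state = "PENDING"
            self.pending_since = now_s
        elif self.state == "PENDING" and now_s - self.pending_since >= self.for_s:
            self.state = "FIRING"
        return self.state


rule = AlertRule(threshold=0.9, for_s=300)  # e.g. error ratio > 0.9 for 5 minutes
states = [
    rule.evaluate(0.95, now_s=0),    # breach starts
    rule.evaluate(0.50, now_s=15),   # transient spike is over, nothing was sent
    rule.evaluate(0.95, now_s=30),   # new breach starts
    rule.evaluate(0.95, now_s=330),  # held for 300s
    rule.evaluate(0.50, now_s=345),  # recovered
]
```

The first spike never leaves PENDING, so no notification is sent for it; only the sustained breach reaches FIRING.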

Pitfall #2: Monitoring Only from the Inside

Problem: You monitor everything inside the system but forget to monitor from the user's point of view.

Internal monitoring (needed)	External monitoring (often forgotten)
Server CPU, memory	Synthetic monitoring (fake user requests from outside)
Database connections	Real User Monitoring (RUM): actual user experience
Queue depth	Endpoint availability from external checks
Error logs	SSL certificate expiry

Solution: Combine internal monitoring + external synthetic checks (Pingdom, Datadog Synthetics).

Pitfall #3: TSDB Does Not Scale (Single Point of Failure)

Problem: A single Prometheus instance has limits:

~1M active time series (beyond this → OOM risk)
Single-node storage (disk failure = data loss)
No horizontal scaling

Tiered solutions:

Scale level	Solution
< 1M series	Single Prometheus instance
1M - 10M series	Prometheus federation (hierarchical)
10M - 100M series	Thanos / Cortex / VictoriaMetrics cluster
> 100M series	Dedicated TSDB cluster (InfluxDB Enterprise, M3DB)

Pitfall #4: No Retention Policy → Disk Full

Problem: Forgetting to set retention → data accumulates without bound → disk full → TSDB crashes → monitoring system down → nobody knows production is failing.

Solutions:

Set a retention policy from day 1
Monitor disk usage on TSDB nodes
Auto-delete + downsampling pipeline
Alert when disk > 80%

Pitfall #5: Dashboard Overload

Problem: 200 dashboards that nobody uses, 50 panels on one dashboard that nobody understands.

Solutions:

Each service has one main dashboard with the Four Golden Signals (latency, traffic, errors, saturation)
Layered approach: Overview → Cluster → Service → Instance
Dashboard ownership: every dashboard has an owner (team/person)
Deprecate dashboards that nobody has viewed in 30 days

Pitfall #6: Ignoring Label Naming Conventions

Problem: Team A uses `host`, Team B uses `hostname`, Team C uses `server`. Same concept, different labels → cross-team aggregation is impossible.

Solutions:

Establish a label naming convention early
Enforce it through metric validation at ingestion
Common labels: host, region, env, service, team, cluster
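Enforcement at ingestion can start as a simple alias map that rewrites known variants to the canonical label. This is a sketch under assumptions: `LABEL_ALIASES` and `normalize_labels` are hypothetical names, and rejecting conflicting values is one possible policy among several.

```python
# Known variants mapped to the canonical label name from the convention.
LABEL_ALIASES = {"hostname": "host", "server": "host"}


def normalize_labels(labels):
    """Rewrite aliased label names to their canonical form; reject conflicting values."""
    normalized = {}
    for name, value in labels.items():
        canonical = LABEL_ALIASES.get(name, name)
        if canonical in normalized and normalized[canonical] != value:
            raise ValueError(f"conflicting values for label {canonical!r}")
        normalized[canonical] = value
    return normalized


team_a = normalize_labels({"host": "server-042", "env": "prod"})
team_b = normalize_labels({"hostname": "server-042", "env": "prod"})
team_c = normalize_labels({"server": "server-042", "env": "prod"})
```

After normalization all three teams produce the same label set, so a cross-team aggregation by `host` works.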

8. Summary: Key Design Decisions

Decision	Options	Recommendation	Reason
Collection model	Push vs Pull	Hybrid (Pull + Push gateway)	Pull for infrastructure, Push for ephemeral jobs
Buffer	Direct write vs Kafka	Kafka	Decouple, burst handling, fan-out
TSDB	Relational vs TSDB	TSDB (InfluxDB/VictoriaMetrics/Prometheus)	10-100x better for time series workload
Compression	Generic vs Gorilla	Gorilla encoding	10-50x compression ratio
Downsampling	Keep all raw vs Downsample	Multi-resolution downsampling	Balance precision vs cost
Storage	Single tier vs Multi-tier	Hot/Warm/Cold tiering	Cost optimization
Alerting	Simple threshold vs State machine	State machine (PENDING → FIRING → RESOLVED)	Reduce false alarms
Alert processing	Direct send vs Alert Manager	Alert Manager (group, dedup, silence)	Reduce noise, route correctly
Query	SQL vs PromQL	PromQL-like	Purpose-built for time series
Dashboard	Custom vs Grafana	Grafana	Industry standard, rich ecosystem

9. Interview Tips

Hieu, if you get this problem in an interview, here is the flow to follow:

Step	Time	What to say
Clarify requirements	5 min	How many servers? How many metrics? How long is retention? Is alerting needed?
High-level design	10 min	Draw 5 components: Collection → Kafka → TSDB → Query/Alerting → Dashboard
Deep dive: Data model	5 min	Explain the time series structure and why a TSDB is used rather than a relational DB
Deep dive: Ingestion	5 min	Push vs pull, Kafka as buffer, pre-aggregation
Deep dive: TSDB	5 min	LSM tree, Gorilla compression, downsampling
Deep dive: Alerting	5 min	State machine, Alert Manager (group/dedup/silence)
Estimation	5 min	Write throughput, storage with compression, query QPS
Trade-offs	5 min	Cardinality explosion, monitoring the monitoring, alert fatigue

Pro tip: Interviewers are especially impressed when you mention (1) Gorilla compression, (2) cardinality explosion, and (3) monitoring the monitoring system. These topics show production experience, not just book reading.

10. Internal Links

Topic	Link	How it relates
Monitoring & Observability foundations	Tuan-13-Monitoring-Observability	Theoretical foundations, Three Pillars, SLI/SLO/SLA
Message Queue (Kafka)	Tuan-08-Message-Queue	Kafka is the core component of the ingestion pipeline
Cache Strategy	Tuan-06-Cache-Strategy	Query cache, in-memory index, memtable
Database Sharding & Replication	Tuan-07-Database-Sharding-Replication	TSDB clustering, time-based partitioning
Back-of-the-envelope Estimation	Tuan-02-Back-of-the-envelope	Capacity estimation framework
Scaling fundamentals	Tuan-01-Scale-From-Zero-To-Millions	Horizontal scaling for TSDB cluster
Notification System	Tuan-19-Design-Notification-System	Alert notification delivery (PagerDuty, Slack, Email)

"Good monitoring is not about knowing everything. It is about knowing the right things at the right time and never missing what matters."

Next step: Hieu, after this lesson you should practice setting up Prometheus + Grafana + Alertmanager locally with Docker Compose. Hands-on experience will solidify all the concepts above.
