Case Study: Design Ad Click Event Aggregation
"1 billion clicks a day, every click must be counted exactly once, and results must be ready within minutes. This is a problem where a 0.1% error can cost millions of advertising dollars."
Tags: system-design ad-click event-aggregation streaming exactly-once mapreduce alex-xu-vol2 Student: Hieu Source: Alex Xu — System Design Interview Volume 2, Chapter 3 Prerequisite: Tuan-08-Message-Queue · Tuan-02-Back-of-the-envelope · Tuan-01-Scale-From-Zero-To-Millions Related: Tuan-13-Monitoring-Observability · Case-Design-Metrics-Monitoring-Alerting · Tuan-07-Database-Sharding-Replication
1. Context & Why — Why does Ad Click Event Aggregation matter?
1.1 Analogy: Counting national election votes in real time
Hieu, picture this problem as counting the votes of a national election in real time:
| National election | Ad Click Aggregation | Common trait |
|---|---|---|
| Millions of ballots from across the country | Billions of clicks from around the world | Huge volume, arriving continuously |
| Each ballot counted exactly once — counting twice = fraud | Each click counted exactly once — counting twice = overbilling the advertiser | Exactly-once is a hard requirement |
| Results update continuously — the press wants the leader every minute | The dashboard updates continuously — advertisers want clicks from the last minute | Real-time aggregation with low latency |
| Broken down by region — per province, per city | Broken down by filter — per ad_id, campaign_id, country | Multi-dimensional aggregation |
| Late ballots — a remote office reports slowly | Late events — click events arriving late due to network delay | Late event handling is a major challenge |
| Cross-checks — hand counts must match the machines | Reconciliation — a batch job re-verifies the streaming results | Reconciliation guarantees correctness |
| Double voting — one person votes twice | Click fraud — bots clicking repeatedly | Deduplication must handle this |
Aha Moment: This is not a simple "count the clicks" problem. It is about counting correctly, completely, and fast at a scale of billions of events per day. A small error = real money lost — either the advertiser is overbilled or the platform loses revenue.
1.2 Why is this problem hard?
Hieu, counting looks trivial — but counting at huge scale, in real time, with absolute accuracy is one of the hardest problems in distributed systems:
| Challenge | Explanation |
|---|---|
| Huge volume | 1 billion clicks/day ≈ 11,574 clicks/sec on average; peaks can exceed 50K/sec |
| Real-time | Advertisers want results within minutes, not at end of day |
| Exactly-once | Over-count = wrong invoices. Under-count = lost revenue. Neither is acceptable |
| Late events | Events can arrive 5 minutes, 1 hour, even 1 day late due to mobile networks and CDN delays |
| Multi-dimensional | Must aggregate along many dimensions: ad_id, campaign_id, country, device_type, time window |
| Fault tolerance | No data loss on server crashes or network partitions |
| Hot spots | A viral ad can draw 100x normal traffic — creating a hot partition |
| Consistency vs Latency | Must balance result accuracy against serving speed |
1.3 Business context — Why do advertisers care?
In online advertising (Google Ads, Facebook Ads, TikTok Ads), advertisers pay per CPC (Cost Per Click). Every click = money. Example:
- Advertiser A runs a campaign at CPC = $0.50
- The ad receives 1 million clicks per day
- Total spend = $500,000/day
A 1% counting error on that spend is $5,000/day — roughly $1.8 million/year for this one advertiser. Multiply across millions of advertisers and the number explodes. This is why data accuracy is non-negotiable.
1.4 Problem scope
| Aspect | In scope | Out of scope |
|---|---|---|
| Click event aggregation (count clicks per ad_id, per time window) | Yes | — |
| Real-time query (advertiser dashboard) | Yes | — |
| Filtering (by campaign_id, country) | Yes | — |
| Click fraud detection (bots, click farms) | Related (Section 4) | Separate deep dive |
| Conversion tracking (click → purchase) | — | Different problem |
| Ad serving (choosing which ad to show) | — | Different problem |
| Billing (charging advertisers) | — | Downstream of aggregation |
2. Deep Dive — Alex Xu 4-Step Framework
Step 1 — Understand the Problem & Establish Design Scope
2.1.1 Functional Requirements
| Feature | Details |
|---|---|
| Aggregate ad clicks | Count clicks and unique clicks per ad_id over time windows: 1 minute, 5 minutes, 1 hour |
| Return aggregated data | Query API returns click_count and unique_click_count for an ad_id over a time range |
| Support filtering | Filter results by ad_id, campaign_id, country, device_type |
| Support multi-dimensional aggregation | Aggregate along several dimensions at once: e.g. click count per ad_id per country per minute |
| Store raw events | Keep raw click events for reconciliation and re-processing |
Questions you should ask the interviewer:
| Question | Assumed answer | Why ask |
|---|---|---|
| How many clicks/day? | 1 billion clicks/day | Sets the throughput of the whole system |
| How many active ads? | 2 million ads | Sets the cardinality of the aggregation |
| Which aggregation windows? | 1 minute, 5 minutes, 1 hour | Sets the windowing strategy |
| Latency requirement? | < a few minutes from click to queryable | Decides streaming vs batch |
| Data accuracy? | Exactly-once — no over- or under-counting | Decides delivery semantics |
| Do we store raw data? | Yes — for reconciliation and ad-hoc analysis | Decides the storage strategy |
| How long do we keep raw data? | Raw: 7 days. Aggregated: years | Decides data retention |
| Is click fraud in scope? | Basic dedup yes; advanced fraud detection out of scope | Sets the design boundary |
2.1.2 Non-Functional Requirements
| Requirement | Target | Why |
|---|---|---|
| Throughput | 1B clicks/day, peak 50K clicks/sec | Large, bursty traffic |
| Latency | End-to-end < a few minutes (click to queryable) | Near-real-time dashboard |
| Accuracy | Exactly-once processing | Money is involved |
| Availability | 99.99% for the ingestion pipeline | Lost events = lost revenue |
| Scalability | Scale horizontally as traffic grows | Sale seasons, viral ads |
| Fault tolerance | No data loss on node failure | Zero data loss |
| Idempotency | Re-processing does not change results | Retry safety |
| Queryability | Query response < 1 second for a single ad_id | Dashboard UX |
2.1.3 API Design
API 1: Aggregated click count for an ad_id
GET /v1/ads/{ad_id}/aggregated_count
Parameters:
| Parameter | Type | Description |
|---|---|---|
| from | long (epoch) | Start of the time range |
| to | long (epoch) | End of the time range |
| filter | object | Optional: {campaign_id: "xxx", country: "VN"} |
Response:
| Field | Type | Description |
|---|---|---|
| ad_id | string | ID of the ad |
| click_count | long | Total clicks |
| unique_click_count | long | Clicks from distinct users |
API 2: Top N most-clicked ads in a time window
GET /v1/ads/top_clicked
Parameters:
| Parameter | Type | Description |
|---|---|---|
| count | int | Number of top ads to return (N) |
| window | int | Time window in minutes (1, 5, 60) |
| filter | object | Optional filter |
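Before moving on, a minimal sketch of how these two responses might be modeled. All record and field names here are assumptions for illustration, not part of the book's design:

```java
import java.util.List;

public class AdQueryApi {
    /** Response for GET /v1/ads/{ad_id}/aggregated_count (hypothetical model). */
    public record AggregatedCount(
            String adId,             // "ad_001"
            long from,               // query range start (epoch seconds)
            long to,                 // query range end (epoch seconds)
            long clickCount,         // total clicks in [from, to)
            long uniqueClickCount) { // distinct users (HLL estimate, see 2.3.3)
    }

    /** One entry in the response for GET /v1/ads/top_clicked. */
    public record TopClickedAd(int rank, String adId, long clickCount) {}

    /** Response for GET /v1/ads/top_clicked?count=N&window=... */
    public record TopClickedResponse(int windowMinutes, List<TopClickedAd> ads) {}
}
```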
Step 2 — High-Level Design
2.2.1 Architecture overview
Hieu, this architecture is like the processing line in a seafood plant:
- Fishing boats (Client) land their catch at the port → click events from user browsers/apps
- The port (Load Balancer) routes the catch onto conveyor belts → the Ingestion Service receives events
- Cold storage (Message Queue) keeps the fish fresh before processing → Kafka buffers events
- The processing line (Aggregation Service) cuts, sorts, and packages → MapReduce aggregates clicks
- The finished-goods warehouse (Database) stores the packaged products → the Aggregated DB stores results
- The sales counter (Query Service) serves customers → the Query API returns results to advertisers
flowchart LR subgraph "Data Generation" C1["User Clicks<br/>(Browsers/Apps)"] end subgraph "Log Ingestion" LB["Load Balancer"] IS1["Ingestion<br/>Service 1"] IS2["Ingestion<br/>Service 2"] ISN["Ingestion<br/>Service N"] end subgraph "Message Queue" K["Apache Kafka<br/>(Partitioned by ad_id)"] end subgraph "Streaming Aggregation" AG1["Aggregation Node 1<br/>(Map + Aggregate)"] AG2["Aggregation Node 2<br/>(Map + Aggregate)"] AGN["Aggregation Node N<br/>(Map + Aggregate)"] end subgraph "Second Message Queue (Optional)" K2["Kafka Topic 2<br/>(Aggregated Results)"] end subgraph "Storage" RAW[("Raw Data Store<br/>(Cassandra / S3)")] AGG[("Aggregated DB<br/>(Cassandra / ClickHouse)")] end subgraph "Serving" QS["Query Service"] CACHE[("Redis Cache")] DASH["Advertiser<br/>Dashboard"] end C1 --> LB LB --> IS1 & IS2 & ISN IS1 & IS2 & ISN --> K K --> AG1 & AG2 & AGN K -->|"Raw events"| RAW AG1 & AG2 & AGN --> K2 K2 --> AGG AGG --> QS QS --> CACHE QS --> DASH style K fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff style K2 fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff style AGG fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff style QS fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff
2.2.2 Data Flow — End to End
sequenceDiagram participant User as User Browser participant IS as Ingestion Service participant Kafka as Kafka (Raw Events) participant AGG as Aggregation Service participant Kafka2 as Kafka (Aggregated) participant DB as Aggregated DB participant QS as Query Service participant ADV as Advertiser Dashboard User->>IS: Click event (ad_id, timestamp, user_id, ip, country) IS->>IS: Validate & enrich event IS->>Kafka: Produce to partition (hash ad_id) Note over Kafka,AGG: Streaming Aggregation (per minute window) Kafka->>AGG: Consume batch of events AGG->>AGG: Map: filter + transform AGG->>AGG: Aggregate: count by ad_id per window AGG->>AGG: Reduce: merge partial results AGG->>Kafka2: Produce aggregated result Kafka2->>DB: Write aggregated data Note over DB: (ad_id, window_start, click_count, unique_count) ADV->>QS: GET /v1/ads/123/aggregated_count?from=...&to=... QS->>DB: Query aggregated data DB-->>QS: Results QS-->>ADV: JSON response
2.2.3 Component Responsibilities
| Component | Responsibility | Technology choices |
|---|---|---|
| Ingestion Service | Receive click events from clients, validate, enrich (geo lookup), publish to Kafka | Custom service (Go/Java), stateless, horizontally scalable |
| Kafka (Raw) | Buffer raw events, decouple ingestion from processing, enable replay | Apache Kafka, partitioned by ad_id |
| Raw Data Store | Keep raw events for reconciliation, ad-hoc queries, re-processing | Cassandra (write-heavy) or S3 (cost-effective) |
| Aggregation Service | Stream processing: map, aggregate, reduce within time windows | Apache Flink, Spark Streaming |
| Kafka (Aggregated) | Buffer aggregated results before the DB write, decouple aggregation from storage | Apache Kafka |
| Aggregated DB | Store aggregation results, serve read queries | Cassandra, ClickHouse |
| Query Service | Parse queries, fetch data from the DB, apply filters, return results | Custom service with a caching layer |
| Redis Cache | Cache hot queries (top ads, recent aggregations) | Redis cluster |
2.2.4 Why two Kafka topics?
| Benefit | Explanation |
|---|---|
| Decoupling | The aggregation service and the DB writer scale independently |
| Exactly-once | Kafka transactions guarantee each aggregated result is written exactly once |
| Replay | If the DB writer fails, replay from topic 2 without re-aggregating |
| Multiple consumers | Several downstream systems can read the aggregated data (alerting, billing, analytics) |
Step 3 — Design Deep Dive
2.3.1 Data Model
Raw Click Event
| Field | Type | Description | Example |
|---|---|---|---|
| ad_id | string | ID of the ad | "ad_001" |
| click_timestamp | long (epoch ms) | When the user clicked (event time) | 1678886400000 |
| user_id | string | User ID (cookie/device ID) | "user_abc123" |
| ip | string | User's IP address | "203.113.152.4" |
| country | string | Country code (from IP geo lookup) | "VN" |
| device_type | string | Device type | "mobile" |
| campaign_id | string | Campaign containing this ad | "camp_xyz" |
Average size: ~0.5-1 KB per event.
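As a sketch, the raw click event from the table above as a Java record (field names mirror the table, camelCased):

```java
// Raw click event — one per click, append-only, high volume.
public record RawClickEvent(
        String adId,          // "ad_001"
        long clickTimestamp,  // event time, epoch ms: 1678886400000
        String userId,        // cookie/device ID: "user_abc123"
        String ip,            // "203.113.152.4" — hashed before storage (see 4.2)
        String country,       // "VN", from IP geo lookup at ingestion
        String deviceType,    // "mobile"
        String campaignId) {} // "camp_xyz"
```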
Aggregated Data
| Field | Type | Description | Example |
|---|---|---|---|
| ad_id | string | ID of the ad | "ad_001" |
| window_start | long (epoch) | Start of the time window | 1678886400 |
| window_size | int | Window size (minutes) | 1 |
| click_count | long | Total clicks in the window | 45,231 |
| unique_click_count | long | Clicks from distinct users | 38,102 |
| filter_id | string | Composite key of the filter dimensions | "country=VN" |
Aha Moment: Raw events and aggregated data have completely different characteristics. Raw events are write-heavy, append-only, high volume. Aggregated data is read-heavy, updated in place (upsert), far lower volume. That is why they need different databases — or at least different tables with different access patterns.
Star Schema for Aggregation
Hieu, the aggregation data model resembles a star schema from data warehousing:
| Part | Role | Example |
|---|---|---|
| Fact table | Stores the metric (click_count) | aggregated_clicks (ad_id, window, click_count, unique_count) |
| Dimension tables | Store filter attributes | ads (ad_id, campaign_id, advertiser_id), campaigns (campaign_id, budget) |
Query pattern: join the fact table with dimension tables to filter by campaign_id, advertiser_id, country.
2.3.2 Choosing the Right Database
| Requirement | Raw Data | Aggregated Data |
|---|---|---|
| Write pattern | Append-only, sequential | Upsert (update or insert) |
| Read pattern | Rare (reconciliation, re-processing) | Frequent (dashboard queries) |
| Volume | Huge (~1 TB/day) | Much smaller (a few GB/day) |
| Query pattern | Full scan, time-range scan | Point query (ad_id + time range), top-N |
| Retention | 7 days (then archive) | Years |
| Consistency | Eventual is OK | Strong preferred |
Raw Data → Cassandra or S3
| Option | Pros | Cons |
|---|---|---|
| Cassandra | Write-optimized (LSM-tree), time-series friendly, TTL support, distributed | Limited query flexibility, expensive storage |
| S3 + Parquet | Very cheap, unlimited storage, queryable via Athena/Spark | High latency, unsuitable for real-time queries |
| Hybrid (Cassandra 7 days + S3 archive) | Best of both | More operational complexity |
Conclusion: use Cassandra for raw data with a 7-day TTL, and archive to S3 (Parquet format) for long-term storage. Cassandra's write-heavy performance fits this use case well — partition key ad_id, clustering key click_timestamp.
Aggregated Data → Cassandra or ClickHouse
| Option | Pros | Cons |
|---|---|---|
| Cassandra | Consistent with the raw data store, write-optimized | Slow OLAP queries (weak secondary indexes) |
| ClickHouse | OLAP-optimized, column-oriented, blazing-fast aggregation queries | Different write pattern (batch inserts work better than single upserts) |
| Druid | Real-time OLAP, pre-aggregation at ingestion | Operationally complex |
Conclusion: ClickHouse is the best fit for aggregated data because:
- Column-oriented storage is ideal for aggregation queries (SUM, COUNT, GROUP BY)
- Extremely fast queries for the dashboard use case
- Built-in materialized views can aggregate automatically
- High compression ratios on time-series data
Trade-off: if the team has no ClickHouse experience, use Cassandra for both raw and aggregated data to reduce operational complexity. Performance won't be as good, but it is simpler.
2.3.3 Aggregation Service — MapReduce Model
This is the heart of the system. The aggregation service uses the MapReduce model to process events in three stages:
Stage 1: Map Node (Filter + Transform)
| Task | Details |
|---|---|
| Consume raw events from Kafka | Each Map node reads from one or more Kafka partitions |
| Filter | Drop invalid events (missing ad_id, invalid timestamp) |
| Transform | Enrich data (geo lookup from IP → country), normalize fields |
| Emit key-value | Output: (ad_id, 1) — meaning "this ad was clicked once" |
Stage 2: Aggregate Node (Count by ad_id per window)
| Task | Details |
|---|---|
| Receive (ad_id, 1) from Map | Group by ad_id |
| Aggregate within the time window | Count clicks and unique clicks per window (1 minute) |
| Maintain state | Keep partial aggregations in in-memory state (Flink state backend) |
| Emit partial result | Output: (ad_id, window, partial_count, partial_unique_count) |
Stage 3: Reduce Node (Merge Results)
| Task | Details |
|---|---|
| Receive partial results from several Aggregate nodes | For when an ad_id is split across nodes |
| Merge | Sum all partial_counts; merge HyperLogLog sketches for the unique count |
| Output the final result | (ad_id, window, final_click_count, final_unique_count) |
| Write to Kafka topic 2 | The aggregated result is ready for the DB write |
flowchart TB subgraph "Kafka Partitions" P1["Partition 0<br/>ad_001, ad_005, ad_009..."] P2["Partition 1<br/>ad_002, ad_006, ad_010..."] P3["Partition 2<br/>ad_003, ad_007, ad_011..."] P4["Partition 3<br/>ad_004, ad_008, ad_012..."] end subgraph "Map Nodes" M1["Map Node 1<br/>Filter + Transform<br/>Emit (ad_id, 1)"] M2["Map Node 2<br/>Filter + Transform<br/>Emit (ad_id, 1)"] M3["Map Node 3<br/>Filter + Transform<br/>Emit (ad_id, 1)"] M4["Map Node 4<br/>Filter + Transform<br/>Emit (ad_id, 1)"] end subgraph "Aggregate Nodes (Windowed)" A1["Aggregate Node 1<br/>Window: 1 min<br/>Count by ad_id"] A2["Aggregate Node 2<br/>Window: 1 min<br/>Count by ad_id"] end subgraph "Reduce Nodes" R1["Reduce Node<br/>Merge partial counts<br/>Final result per ad_id"] end subgraph "Output" K2["Kafka Topic 2<br/>(Aggregated Results)"] DB[("ClickHouse")] end P1 --> M1 P2 --> M2 P3 --> M3 P4 --> M4 M1 & M2 --> A1 M3 & M4 --> A2 A1 & A2 --> R1 R1 --> K2 K2 --> DB style M1 fill:#42A5F5,color:#fff style M2 fill:#42A5F5,color:#fff style M3 fill:#42A5F5,color:#fff style M4 fill:#42A5F5,color:#fff style A1 fill:#FFA726,color:#fff style A2 fill:#FFA726,color:#fff style R1 fill:#EF5350,color:#fff
Aha Moment: In practice, with Apache Flink, Map and Aggregate usually merge into one operator. Flink does this automatically via operator chaining to cut network overhead. The 3-stage model is for understanding the concept — the real implementation is more compact.
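A minimal, runnable Flink DataStream sketch of the Map + Aggregate stages, under the assumption that a Kafka source is stubbed out with `fromElements` (the `Click` POJO and the anonymous count function are illustrative, not the book's code):

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ClickAggregationJob {
    // Simplified event; in production this comes from the Kafka deserializer.
    public static class Click {
        public String adId;
        public long timestamp;
        public Click() {}
        public Click(String adId, long timestamp) { this.adId = adId; this.timestamp = timestamp; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // fromElements stands in for a KafkaSource<Click> on the raw topic.
        DataStream<Click> clicks = env
            .fromElements(new Click("ad_001", 1_678_886_400_000L),
                          new Click("ad_001", 1_678_886_401_000L))
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Click>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                    .withTimestampAssigner((c, ts) -> c.timestamp));

        clicks
            .filter(c -> c.adId != null)                          // Map: drop invalid events
            .keyBy(c -> c.adId)                                   // shuffle by ad_id
            .window(TumblingEventTimeWindows.of(Time.minutes(1))) // 1-minute tumbling window
            .aggregate(new AggregateFunction<Click, Long, Long>() { // Aggregate: count per window
                @Override public Long createAccumulator() { return 0L; }
                @Override public Long add(Click c, Long acc) { return acc + 1; }
                @Override public Long getResult(Long acc) { return acc; }
                @Override public Long merge(Long a, Long b) { return a + b; }
            })
            .print(); // sink elided; in the real pipeline this goes to Kafka topic 2

        env.execute("ad-click-aggregation");
    }
}
```

Note how the filter, keyBy, and windowed aggregate chain into a compact pipeline — exactly the operator chaining the Aha Moment describes.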
Unique Click Count — HyperLogLog
Counting unique clicks (distinct user_id) is hard because:
- You cannot keep every user_id in memory (far too many)
- An exact count needs O(n) memory per ad_id per window
Solution: use HyperLogLog (HLL) — a probabilistic data structure:
| Property | Value |
|---|---|
| Memory | ~12 KB per counter (fixed, regardless of element count) |
| Accuracy | ~0.81% error (with 16K registers) |
| Mergeable | Two HLL counters can be merged → perfect for distributed aggregation |
| Trade-off | Gives up exactness, but saves enormous amounts of memory |
With 2 million active ads, one HLL counter each = 2M × 12 KB ≈ 24 GB of memory — entirely feasible.
If an exact unique count is needed (some advertisers pay extra for accuracy), use a bitmap or a Bloom filter at a higher memory cost.
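To make the table concrete, a toy HyperLogLog sketch (my own illustration, not production code): 2^14 = 16,384 one-byte registers ≈ 16 KB here (~12 KB with the usual 6-bit register packing), standard error ≈ 1.04/√m ≈ 0.81%. The small-range (linear counting) correction is omitted for brevity:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HyperLogLog {
    private static final int P = 14;          // register-index bits
    private static final int M = 1 << P;      // 16,384 registers
    private final byte[] registers = new byte[M];

    public void add(String item) {
        long x = hash64(item);
        int idx = (int) (x >>> (64 - P));     // first P bits choose a register
        long rest = x << P;                   // remaining 64 - P bits
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - P + 1);
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    public long estimate() {
        double alpha = 0.7213 / (1 + 1.079 / M); // bias constant for m >= 128
        double sum = 0;
        for (byte r : registers) sum += Math.pow(2, -r);
        return Math.round(alpha * M * M / sum);
    }

    // Register-wise max merges two sketches losslessly — the property that
    // lets Reduce nodes combine partial unique counts from Aggregate nodes.
    public void merge(HyperLogLog other) {
        for (int i = 0; i < M; i++)
            registers[i] = (byte) Math.max(registers[i], other.registers[i]);
    }

    private static long hash64(String s) {
        try {
            byte[] h = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long x = 0;
            for (int i = 0; i < 8; i++) x = (x << 8) | (h[i] & 0xffL);
            return x;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```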
2.3.4 Streaming vs Batching — Lambda vs Kappa Architecture
This is the most important architectural decision in this problem. Hieu, you need to know both schools of thought:
Lambda Architecture (Batch + Speed Layer)
flowchart TB subgraph "Data Source" EVENTS["Click Events"] end subgraph "Batch Layer" BATCH_STORE[("Batch Storage<br/>(HDFS / S3)")] BATCH_PROC["Batch Processing<br/>(Spark / MapReduce)<br/>Chay moi gio hoac moi ngay"] BATCH_VIEW[("Batch View<br/>(accurate, complete)")] end subgraph "Speed Layer" STREAM_PROC["Stream Processing<br/>(Flink / Storm)<br/>Real-time"] RT_VIEW[("Real-time View<br/>(approximate, fast)")] end subgraph "Serving Layer" MERGE["Merge Results<br/>Batch View + Real-time View"] QS["Query Service"] end EVENTS --> BATCH_STORE EVENTS --> STREAM_PROC BATCH_STORE --> BATCH_PROC BATCH_PROC --> BATCH_VIEW STREAM_PROC --> RT_VIEW BATCH_VIEW --> MERGE RT_VIEW --> MERGE MERGE --> QS style BATCH_PROC fill:#4CAF50,color:#fff style STREAM_PROC fill:#FF9800,color:#fff style MERGE fill:#2196F3,color:#fff
| Pros | Cons |
|---|---|
| The batch layer guarantees accuracy | 2 codebases — two pipelines to maintain (batch + streaming) |
| The speed layer guarantees low latency | Duplicated logic — the same aggregation written twice |
| If streaming goes wrong, batch corrects it | Debugging is complicated — which layer is wrong? |
| | Merge logic is not simple |
Kappa Architecture (Streaming Only)
flowchart LR subgraph "Data Source" EVENTS["Click Events"] end subgraph "Streaming Layer (Only Layer)" KAFKA["Kafka<br/>(Immutable Log)"] FLINK["Stream Processing<br/>(Apache Flink)<br/>Real-time Aggregation"] end subgraph "Serving Layer" DB[("Aggregated DB")] QS["Query Service"] end subgraph "Reconciliation (Periodic)" RECON["Batch Reconciliation<br/>(Hourly / Daily)<br/>Verify streaming results"] end EVENTS --> KAFKA KAFKA --> FLINK FLINK --> DB DB --> QS KAFKA -.->|"Replay if needed"| FLINK DB -.-> RECON KAFKA -.-> RECON style KAFKA fill:#FF9800,color:#fff style FLINK fill:#4CAF50,color:#fff style RECON fill:#9E9E9E,color:#fff
| Pros | Cons |
|---|---|
| 1 codebase — only one pipeline to maintain | Needs a mature streaming framework (Flink has exactly-once) |
| Much simpler than Lambda | Re-processing means replaying from Kafka (needs enough retention) |
| Kafka is an immutable log — replay at any time | |
| Apache Flink supports exactly-once processing | |
Alex Xu's conclusion: Kappa architecture is the preferred choice for this problem. Reason: Apache Flink is mature enough to guarantee exactly-once processing. Lambda is only necessary when the streaming framework isn't trustworthy enough — which no longer holds for Flink in 2024+.
Aha Moment: Lambda architecture was born because early streaming frameworks (Storm, Samza) could not guarantee exactly-once. Once Flink solved that, Lambda's reason to exist largely disappeared. Reconciliation (a batch job that re-verifies results) is still needed as a safety net — but that is not Lambda, it is best practice.
2.3.5 Watermarks and Late Events
This is the hardest concept in stream processing, Hieu, and you need to understand it thoroughly:
Event Time vs Processing Time
| Concept | Definition | Example |
|---|---|---|
| Event time | When the event actually happened (the user clicked) | 14:00:00 — the user clicks the ad |
| Processing time | When the system receives the event | 14:00:05 — the event reaches Flink (5 s late due to the network) |
| Ingestion time | When the event enters Kafka | 14:00:02 — Kafka receives the event |
Why does this matter? Because you aggregate by time window. With processing time, a click that happened at 13:59:59 but arrived at 14:00:05 gets counted in the 14:00-14:01 window instead of 13:59-14:00. Wrong window = wrong result.
→ Conclusion: always use event time for aggregation. Processing time is only for system monitoring.
What is a watermark?
A watermark is the system's declaration that "all events with event_time ≤ W have arrived".
Example: watermark = 14:05:00 means the system believes no more events with timestamps before 14:05:00 will arrive. It can close every window before 14:05:00 and emit results.
The problem: how do you know all events have arrived? You can't, not with certainty. That is why a watermark always carries an allowed lateness.
| Watermark strategy | Description | Trade-off |
|---|---|---|
| Punctuated watermark | Emit watermarks based on event data (e.g. max event_time - 5 s) | Suits steady traffic |
| Periodic watermark | Emit watermarks on a timer (e.g. every 200 ms, watermark = max event_time seen - 10 s) | Most common, simple |
| Watermark = max_event_time - allowed_lateness | E.g. max_event_time - 5 minutes | Balances latency and completeness |
Late Event Handling
When an event arrives after the watermark (a late event), there are three ways to handle it:
| Strategy | Description | When to use |
|---|---|---|
| Drop | Ignore the late event | When accuracy doesn't matter much (approximate dashboards) |
| Update aggregation | Reopen the window, update the result, re-emit | When high accuracy is needed — use this for ad clicks |
| Side output | Send late events to a separate stream for later handling | When late events need different processing (reconciliation, alerting) |
For ad clicks, Alex Xu recommends combining Update + Side output, as sketched after the Aha Moment below:
- Late by up to 5 minutes: update the aggregation (reopen the window, recount)
- Late by more than 5 minutes: send to a side output → the reconciliation batch job handles it
Aha Moment: Watermarks are the hardest concept in stream processing because they embody the trade-off between latency and completeness. Too aggressive a watermark (small allowed_lateness) → many late events dropped → wrong results. Too lax (large allowed_lateness) → high latency → a slow dashboard. There is no "correct" value — only a value that fits the business requirement.
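A Flink fragment showing that recommended policy (it assumes the `clicks` stream and POJO from the 2.3.3 sketch, plus an assumed `ClickCountAggregate` function): events up to 5 minutes late re-fire the window; anything later goes to a side output.

```java
// Events <= 5 min late update the window; later stragglers go to a side output.
final OutputTag<Click> lateTag = new OutputTag<Click>("late-clicks") {};

SingleOutputStreamOperator<Long> counts = clicks
    .keyBy(c -> c.adId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .allowedLateness(Time.minutes(5))      // re-fire the window for events <= 5 min late
    .sideOutputLateData(lateTag)           // events later than watermark + 5 min land here
    .aggregate(new ClickCountAggregate()); // assumed AggregateFunction<Click, Long, Long>

// Route the stragglers to a dedicated topic for the reconciliation job.
DataStream<Click> lateClicks = counts.getSideOutput(lateTag);
```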
2.3.6 Exactly-Once Processing
In a distributed system, exactly-once is extremely hard because there are many failure points:
| Failure point | Problem | Solution |
|---|---|---|
| Kafka consumer reads an event | Consumer crashes after processing but before committing the offset | Flink checkpointing (snapshot state + Kafka offset together) |
| Aggregation service | A node crashes mid-processing | Flink savepoints/checkpoints — resume from the last consistent state |
| DB write | Write succeeds but is never acked | Idempotent writes — rewriting the same data doesn't change the result |
| Kafka producer (writing aggregated results) | Producer crashes after Kafka receives but before the ack | Kafka idempotent producer (enable.idempotence=true) |
Exactly-Once End-to-End Flow
flowchart TB subgraph "Source: Kafka Consumer" KC["Kafka Consumer<br/>Track offset per partition"] end subgraph "Processing: Flink" CP["Checkpoint Manager<br/>(Periodic snapshots)"] STATE["Flink State Backend<br/>(RocksDB / Heap)"] OP["Aggregation Operator<br/>(Stateful processing)"] end subgraph "Sink: Kafka Producer + DB" KP["Kafka Producer<br/>(Idempotent + Transactional)"] DB[("ClickHouse<br/>(Idempotent Upsert)")] end subgraph "Checkpoint Flow" BARRIER["Checkpoint Barrier<br/>(injected into stream)"] end KC -->|"Events"| OP OP --> STATE OP --> KP KP --> DB CP -->|"Trigger checkpoint"| BARRIER BARRIER -->|"Snapshot: offset + state + pending writes"| CP CP -->|"On success: commit Kafka offset + flush writes"| KC CP -->|"On success: commit Kafka transaction"| KP style CP fill:#e53935,color:#fff style STATE fill:#1e88e5,color:#fff style OP fill:#43a047,color:#fff
How the Flink checkpoint mechanism works:
- The Checkpoint Manager periodically (e.g. every 30 seconds) injects a checkpoint barrier into the stream
- As the barrier passes through an operator, the operator snapshots its state (partial aggregations, HLL counters) to durable storage (HDFS/S3)
- The Kafka consumer offset at that moment is saved alongside
- When every operator has finished its snapshot → the checkpoint is complete
- If a failure occurs → Flink rolls back to the last successful checkpoint:
- Reset the Kafka consumer offset to the checkpoint offset (re-read events from there)
- Restore operator state from the checkpoint
- Re-process events from the checkpoint offset → the result is identical because the state is identical
Important: exactly-once in Flink does not mean each event is processed only once. It means the final result is as if each event were processed only once. In practice, events may be re-processed (after a checkpoint restore), but the result doesn't change thanks to idempotent writes.
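A fragment, assuming the Flink Kafka connector (1.14+), of the three settings that combine into end-to-end exactly-once (broker address and topic names are placeholders; imports elided):

```java
// 1. Periodic checkpoints in EXACTLY_ONCE mode, every 30 s as described above.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

// 2. A transactional Kafka sink: aggregated results only become visible to
//    downstream consumers when the checkpoint that produced them commits.
KafkaSink<String> sink = KafkaSink.<String>builder()
        .setBootstrapServers("kafka:9092")               // placeholder address
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("aggregated-clicks")
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
        .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
        .setTransactionalIdPrefix("agg-sink")            // required for EXACTLY_ONCE
        .build();

// 3. The DB write must be an idempotent upsert keyed on
//    (ad_id, window_start, filter_id), so that replays after a checkpoint
//    restore overwrite rather than double-count.
```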
2.3.7 Deduplication — the same click sent twice
Unlike exactly-once (a system-side problem), deduplication deals with problems originating on the client side:
| Cause | Description | Example |
|---|---|---|
| Client retry | User clicks, the request times out, the client resends | 1 click → 2 events |
| Double click | User clicks twice quickly | 2 physical clicks that should count as 1 |
| Page reload | User refreshes a page containing the ad | The click event is sent again |
| CDN/Proxy retry | An intermediate server retries the request | 1 click → many events |
Dedup Strategy
| Approach | Description | Trade-off |
|---|---|---|
| Dedup key: (ad_id, user_id, click_timestamp) | Two events sharing all 3 fields = duplicate | Must hold seen keys for the window → memory |
| Dedup window | Only check duplicates within a time span (e.g. 5 minutes) | Duplicates beyond the window slip through |
| Bloom filter | Probabilistic "have I seen this?" with a low false-positive rate | Memory-cheap, but false positives drop some genuine events |
| External dedup store (Redis) | Keep dedup keys in Redis with a TTL | More accurate than a Bloom filter, but adds latency and a dependency |
Conclusion: use a Bloom filter inside the Flink operator with a 5-minute dedup window. A ~0.1% false-positive rate is acceptable (wrongly dropping ~1 in 1,000 genuine clicks — insignificant at 1 billion clicks/day).
Where absolute accuracy is required (billing), use Redis dedup with key {ad_id}:{user_id}:{minute_bucket} and a 10-minute TTL.
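A sketch of the Bloom-filter variant using Guava (the class name and the rotation policy are assumptions; the filter is sized for ~5 minutes of peak traffic, ~58K clicks/sec × 300 s ≈ 17.4M keys, at the ~0.1% false-positive rate the text accepts):

```java
import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class ClickDeduper {
    // One filter per 5-minute dedup window; swap in a fresh filter when the
    // window rolls over (rotation logic elided).
    private final BloomFilter<String> seen = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 18_000_000, 0.001);

    /** Returns true the first time an (ad_id, user_id, timestamp) key appears. */
    public boolean firstSeen(String adId, String userId, long clickTimestamp) {
        String key = adId + ":" + userId + ":" + clickTimestamp;
        if (seen.mightContain(key)) {
            return false; // probably a duplicate (~0.1% chance of a false positive)
        }
        seen.put(key);
        return true;
    }
}
```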
2.3.8 Time Window Types
Flink (and other streaming frameworks) supports several kinds of time window:
Tumbling Window
| Property | Value |
|---|---|
| Definition | Fixed-size, non-overlapping windows |
| Example | 1-minute windows: [14:00-14:01), [14:01-14:02), [14:02-14:03) |
| When to use | Ad click aggregation — counting clicks per minute, per hour |
| Characteristics | Each event belongs to exactly 1 window. Simplest and most efficient |
Sliding Window
| Property | Value |
|---|---|
| Definition | Fixed-size windows that overlap |
| Example | 5-minute window sliding every minute: [14:00-14:05), [14:01-14:06), [14:02-14:07) |
| When to use | Moving averages — e.g. "clicks in the last 5 minutes", refreshed every minute |
| Characteristics | Each event belongs to several windows. Costs more memory than tumbling |
Session Window
| Property | Value |
|---|---|
| Definition | The window closes when no new event arrives within a gap |
| Example | 30-minute gap: events at 14:00, 14:05, 14:10 form one session. An event at 14:50 starts a new session |
| When to use | User behavior analysis — one browsing session of a user |
| Characteristics | Window size is dynamic, not fixed |
For the ad click problem: use tumbling windows of 1 minute, 5 minutes, 1 hour. It is the natural choice because:
- Each click belongs to exactly 1 window → simple, no duplicate counting
- A 5-minute window = the sum of five 1-minute windows → computable from the smaller windows
- A 1-hour window = the sum of sixty 1-minute windows → the cascade continues
Aha Moment: You don't need 3 separate pipelines for 3 window sizes. Aggregate only the 1-minute window (the smallest), then cascade: five 1-minute windows → one 5-minute window, twelve 5-minute windows → one 1-hour window (see the sketch below). This saves significant compute and storage.
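A self-contained sketch of that cascade roll-up (my illustration; note that unique counts cannot be summed this way — they need the HLL merge from 2.3.3):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class CascadeAggregation {
    /** Roll 1-minute click counts up into windows of `sizeMinutes`. */
    static Map<Long, Long> rollUp(Map<Long, Long> perMinute, int sizeMinutes) {
        long sizeMs = sizeMinutes * 60_000L;
        return perMinute.entrySet().stream().collect(Collectors.groupingBy(
                e -> (e.getKey() / sizeMs) * sizeMs,   // align to the larger window start
                TreeMap::new,
                Collectors.summingLong(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        // 1-minute window start (epoch ms) -> click_count, for one ad_id
        Map<Long, Long> perMinute = Map.of(
                0L, 100L, 60_000L, 120L, 120_000L, 80L,
                180_000L, 90L, 240_000L, 110L, 300_000L, 70L);
        System.out.println(rollUp(perMinute, 5));  // {0=500, 300000=70}
        System.out.println(rollUp(perMinute, 60)); // {0=570}
    }
}
```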
2.3.9 Scaling — handling heavy traffic
Kafka Partitioning
| Strategy | Description | Pros/Cons |
|---|---|---|
| Partition by ad_id | Hash(ad_id) % num_partitions | All clicks for one ad land in the same partition → simple aggregation, BUT hot ads create hot partitions |
| Partition by ad_id + minute | Hash(ad_id + minute_bucket) | Better spread, but needs a reduce step to merge |
| Random partition | Round-robin | Most even spread, but aggregation must shuffle data (expensive) |
Conclusion: use partition-by-ad_id as the default, and handle hot partitions separately (Section 2.3.12).
Flink Parallelism
| Component | Parallelism | Why |
|---|---|---|
| Source (Kafka consumer) | = number of Kafka partitions | One consumer reads one partition |
| Map operator | = number of Kafka partitions | 1:1 with the source |
| Aggregate operator | Per throughput | keyBy(ad_id) distributes automatically |
| Sink (Kafka producer) | Per throughput | Can be smaller than the source |
When you need to scale:
- Increase Kafka partitions (e.g. from 100 → 200)
- Raise Flink parallelism to match
- Add consumer group instances
- Flink auto-scaling (reactive mode) — supported since Flink 1.13+
Aggregation Node Horizontal Scaling
Flink automatically distributes state by key group. When parallelism increases:
- The state of key group A migrates from node 1 to node 2
- During migration there is a brief pause (managed via checkpoint/restore)
- No manual resharding as with a database
2.3.10 Fault Tolerance
| Mechanism | Layer | Description |
|---|---|---|
| Flink Checkpoints | Streaming | Periodic snapshot of all operator state + Kafka offsets. Automatic recovery |
| Flink Savepoints | Streaming | Manual checkpoints for planned maintenance (upgrades, config changes) |
| Kafka Consumer Group Rebalancing | Messaging | When a consumer dies, its partitions are redistributed to the remaining consumers |
| Kafka Replication | Messaging | Each partition has 3 replicas — tolerates 2 broker failures |
| DB Replication | Storage | Cassandra/ClickHouse replication factor 3 |
| End-to-end exactly-once | Whole system | Flink checkpoints + Kafka transactions + idempotent DB writes |
Failure scenarios and recovery:
| Scenario | System response |
|---|---|
| A Flink TaskManager crashes | JobManager detects it (heartbeat timeout) → redeploys tasks → restores from the last checkpoint |
| The Flink JobManager crashes | A standby JobManager takes over (HA mode with ZooKeeper) → restores the job from the last checkpoint |
| A Kafka broker crashes | ISR (In-Sync Replicas) → leader election → consumers reconnect → continue from the last committed offset |
| A DB node crashes | Replication → queries route to a replica → repair the node later |
| Network partition (Flink ↔ Kafka) | Consumer timeout → rebalance → reconnect → replay from the last offset |
| An entire data center fails | Cross-DC replication (Kafka MirrorMaker, Cassandra multi-DC) → fail over to another DC |
2.3.11 Reconciliation — Verify Streaming Results
Even with Flink's exactly-once, reconciliation is still needed because of:
- Bugs in the aggregation logic
- Clock skew between servers
- Edge cases in watermark/late-event handling
- Data corruption
Reconciliation Flow
| Step | Description |
|---|---|
| 1. A batch job runs hourly | Reads raw events from Cassandra/S3 for the window just ended |
| 2. Re-aggregate | Recomputes click_count and unique_count with batch processing (Spark) |
| 3. Compare | Compares the batch result with the streaming result in the aggregated DB |
| 4. Detect drift | If the difference exceeds a threshold (e.g. 0.1%) → alert |
| 5. Auto-correct (optional) | Overwrite the streaming result with the batch result (batch is the source of truth) |
| Metric | Threshold | Action |
|---|---|---|
| Drift < 0.01% | Normal | Log only |
| Drift 0.01% - 0.1% | Warning | Alert the team |
| Drift > 0.1% | Critical | Alert + auto-correct + investigate the root cause |
Aha Moment: Reconciliation is a safety net, not the primary mechanism. In a healthy system, reconciliation always matches (drift ≈ 0%). If drift is regularly > 0, there is a bug in the streaming pipeline.
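The comparison step (3-4 above) as a small self-contained sketch, using the drift thresholds from the table:

```java
public class Reconciliation {
    enum Severity { NORMAL, WARNING, CRITICAL }

    /** Classify the drift between the batch recount and the streaming result. */
    static Severity classify(long batchCount, long streamingCount) {
        if (batchCount == 0) return streamingCount == 0 ? Severity.NORMAL : Severity.CRITICAL;
        double drift = Math.abs(batchCount - streamingCount) / (double) batchCount;
        if (drift > 0.001) return Severity.CRITICAL;   // > 0.1%: alert + auto-correct
        if (drift > 0.0001) return Severity.WARNING;   // 0.01% - 0.1%: alert the team
        return Severity.NORMAL;                        // < 0.01%: log only
    }

    public static void main(String[] args) {
        System.out.println(classify(1_000_000, 999_950)); // 0.005% drift -> NORMAL
        System.out.println(classify(1_000_000, 999_200)); // 0.08%  drift -> WARNING
        System.out.println(classify(1_000_000, 985_000)); // 1.5%   drift -> CRITICAL
    }
}
```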
2.3.12 Hot Shard Problem — viral ads
The problem: a viral ad (say, a Super Bowl ad) can draw 100x-1000x normal traffic. With partition-by-ad_id, all clicks for that ad land on one Kafka partition and one Flink subtask → a hot shard → a bottleneck.
| Symptom | Consequence |
|---|---|
| One Kafka partition receives too much data | Consumer lag grows, latency grows |
| One Flink subtask burns too much CPU/memory | Backpressure, checkpoint timeouts |
| Other partitions/subtasks sit idle | Wasted resources |
Solution: Salted Keys + Secondary Aggregation
| Step | Description |
|---|---|
| 1. Detect the hot key | Monitor click rate per ad_id. Above a threshold (e.g. 10x average) → mark as hot |
| 2. Salt the key | Instead of partitioning by ad_id, partition by ad_id + random(0..N), where N = desired shard count |
| 3. First aggregation | Each shard aggregates independently → N partial results for one ad_id |
| 4. Secondary aggregation | One reducer merges the N partial results → the final result for the ad_id |
Example with ad_001 as the hot key, N=10:
- Instead of one partition taking every click for ad_001
- 10 partitions take the clicks: ad_001_0, ad_001_1, …, ad_001_9
- 10 aggregation nodes count independently
- 1 reduce node sums the 10 partial counts
Trade-off:
- Pro: traffic spreads evenly, no more hot shard
- Con: one extra aggregation step, more complex logic, slightly higher latency
- Con: unique counts (HLL) must be merged — but HLL merges cleanly, so this is fine
Aha Moment: The hot shard problem is not unique to this system. It appears whenever you partition by key and one key gets disproportionate traffic. "Salted keys + secondary aggregation" is a general pattern that applies to many other problems.
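A minimal sketch of the salting logic (class and method names are mine; hot-key detection and the partial-count merge are elided):

```java
import java.util.concurrent.ThreadLocalRandom;

public class SaltedKeys {
    static final int N = 10; // number of shards for a hot key

    /** Producer side: spread a hot ad across N Kafka partitions. */
    static String saltIfHot(String adId, boolean isHot) {
        if (!isHot) return adId;                  // normal ads keep their plain key
        int salt = ThreadLocalRandom.current().nextInt(N);
        return adId + "_" + salt;                 // e.g. "ad_001_7"
    }

    /** Reducer side: strip the salt so the N partial counts merge under one key.
     *  Only call this for keys known to have been salted. */
    static String unsalt(String saltedKey) {
        return saltedKey.substring(0, saltedKey.lastIndexOf('_')); // "ad_001_7" -> "ad_001"
    }
}
```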
3. Back-of-the-Envelope Estimation
3.1 Traffic Estimation
Assumptions:
- 1 billion clicks/day
- Peak traffic = 5x average
QPS: 10^9 / 86,400 ≈ 11,574 clicks/sec on average; peak ≈ 5 × 11,574 ≈ 58K clicks/sec.
3.2 Kafka Throughput
Raw event size: ~1 KB (ad_id + timestamp + user_id + ip + country + metadata)
Throughput: ~11,574 × 1 KB ≈ 11 MB/s on average; ~57 MB/s at peak.
Kafka partition estimation:
- Each partition handles ~10 MB/sec (conservative)
- Minimum partitions for peak: 57 / 10 ≈ 6 partitions
- In practice, use 100-200 partitions to scale consumers and give Flink enough parallelism
Kafka retention:
- Raw topic retention: 7 days
- Storage per day: 10^9 events × 1 KB ≈ 1 TB/day
- 7-day retention: ~7 TB
- With replication factor 3: ~21 TB of Kafka cluster storage
3.3 Raw Data Storage
Cassandra holds the 7-day hot window: ~1 TB/day × 7 ≈ 7 TB (7-day TTL). The S3 archive, at roughly 5:1 Parquet compression, adds ~0.2 TB/day ≈ 73 TB/year.
3.4 Aggregated Data Storage
Assumptions:
- 2 million active ads
- Aggregation windows: 1 minute, 5 minutes, 1 hour
- Each aggregated record: ~100 bytes (ad_id + window + counts)
The 1-minute records dominate: 2M ads × 100 bytes = 200 MB/minute ≈ 288 GB/day ≈ 105 TB/year uncompressed.
With a ClickHouse compression ratio of ~10:1: ~10.5 TB/year.
3.5 Memory for Flink State
HyperLogLog for unique counts: 2M active ads × ~12 KB per counter ≈ 24 GB.
Partial aggregation state (1-minute tumbling window): 2M ads × a few hundred bytes per open window ≈ 1 GB.
Total Flink state: ~25 GB — spread across many TaskManagers, entirely feasible.
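The Section 3 arithmetic as runnable code, so each number in the summary table below can be re-derived (constants mirror the assumptions above):

```java
public class Estimation {
    public static void main(String[] args) {
        long clicksPerDay = 1_000_000_000L;
        long avgQps = clicksPerDay / 86_400;             // 11,574 clicks/sec
        long peakQps = avgQps * 5;                       // 57,870 clicks/sec

        int eventBytes = 1_000;                          // ~1 KB per raw event
        double avgMBps = avgQps * (double) eventBytes / 1e6;   // ~11.6 MB/s
        double peakMBps = peakQps * (double) eventBytes / 1e6; // ~57.9 MB/s

        double rawTBperDay = clicksPerDay * (double) eventBytes / 1e12; // 1 TB/day
        double kafkaTB = rawTBperDay * 7 * 3;            // 7-day retention, RF=3 -> 21 TB

        long ads = 2_000_000;
        long aggBytesPerMinute = ads * 100;              // 200 MB of 1-min records/minute
        double aggTBperYear = aggBytesPerMinute * 60.0 * 24 * 365 / 1e12; // ~105 TB
        double clickhouseTB = aggTBperYear / 10;         // 10:1 compression -> ~10.5 TB

        System.out.printf("QPS avg=%d peak=%d | Kafka %.0f/%.0f MB/s | "
                + "Kafka storage %.0f TB | agg %.1f TB/yr%n",
                avgQps, peakQps, avgMBps, peakMBps, kafkaTB, clickhouseTB);
    }
}
```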
3.6 Summary
| Resource | Value |
|---|---|
| Traffic | 11.5K avg / 58K peak clicks/sec |
| Kafka cluster storage | 21 TB (7-day retention, RF=3) |
| Kafka throughput | 11 MB/s avg / 57 MB/s peak |
| Kafka partitions | 100-200 |
| Raw data (Cassandra) | 7 TB (7-day TTL) |
| Raw data (S3 archive) | 73 TB/year (compressed) |
| Aggregated data (ClickHouse) | 10.5 TB/year (compressed) |
| Flink state memory | ~25 GB (distributed) |
| Flink TaskManagers | 10-20 (depending on CPU/memory specs) |
4. Security & Data Privacy
4.1 Click Fraud Detection
Click fraud is the biggest threat to an advertising system. Undetected, advertisers lose money and the platform loses trust.
| Fraud type | Description | Telltale signs |
|---|---|---|
| Bot clicks | Automated scripts clicking ads continuously | Abnormal click rate from one IP, no mouse movement/scrolling |
| Click farms | Groups of people paid to click | Many clicks from the same geo location, identical patterns |
| Competitor clicking | Rivals click your ads to drain your budget | Clicks from the same IP range, no conversions |
| Ad stacking | Several ads stacked on top of each other; one click counts for all of them | Identical click coordinates across several ads |
| Pixel stuffing | An ad rendered at 1x1 pixel that nobody sees, still counted as an "impression" | Impressions with no corresponding clicks |
Anti-Fraud Strategies
| Strategy | Description | Where it runs |
|---|---|---|
| IP-based rate limiting | Cap clicks per IP per ad per time window | Real-time, at the ingestion layer |
| User-agent fingerprinting | Detect bots by browser fingerprint | Real-time, at the ingestion layer |
| Click-through rate (CTR) anomaly | Abnormally high CTR for one ad → suspicious | Near-real-time, at the aggregation layer |
| Geo anomaly | Clicks from countries outside the target audience | Near-real-time |
| Session analysis | Analyze post-click behavior (bounce rate, time on page) | Batch analysis |
| Machine learning model | Train a model to detect fraud patterns | Batch + real-time inference |
Implementation: fraud detection does not run inside the main aggregation pipeline. Instead:
- Raw events flow through a fraud detection service (in parallel with aggregation)
- The fraud service flags suspicious clicks
- Flagged clicks are excluded from aggregation (or counted separately)
- Advertisers can dispute and review flagged clicks
4.2 User Privacy
| Principle | Description |
|---|---|
| Aggregate only | Reports contain aggregated numbers (click_count) only, NEVER individual user data |
| No PII in reports | user_id and ip never appear in aggregated results |
| PII only for dedup | user_id is used only within the dedup window (5 minutes), then discarded |
| IP hashing | IP addresses are hashed before storage (no raw IPs kept) |
| Data minimization | Collect only the data aggregation needs, nothing more |
| GDPR/CCPA compliance | Users may demand deletion. The 7-day raw-data TTL supports this |
| Consent-based tracking | Clicks are tracked only after user consent (cookie consent banner) |
4.3 Data Retention Policies
| Data type | Retention | Why | Storage |
|---|---|---|---|
| Raw click events | 7 days (hot) + 90 days (archive) | Reconciliation + dispute resolution | Cassandra → S3 |
| Aggregated data (1-min) | 1 year | Dashboards + reporting | ClickHouse |
| Aggregated data (1-hour) | 3 years | Long-term analytics | ClickHouse (compressed) |
| Aggregated data (1-day) | Forever | Historical trends | ClickHouse (cold storage) |
| Dedup keys | 10 minutes | Dedup window | Redis (TTL) |
| Fraud detection logs | 1 year | Audits + disputes | S3 |
5. DevOps & Monitoring
5.1 Key metrics to monitor
Kafka Metrics
| Metric | Description | Alert threshold |
|---|---|---|
| Consumer lag | Messages not yet consumed | > 100K messages → Warning, > 1M → Critical |
| Under-replicated partitions | Partitions short of replicas | > 0 → Warning |
| Broker disk usage | Kafka broker disk usage | > 80% → Warning, > 90% → Critical |
| Producer error rate | Error rate when producing messages | > 0.1% → Warning |
| Request latency (p99) | Produce/consume request latency | > 100ms → Warning |
Flink Metrics
| Metric | Description | Alert threshold |
|---|---|---|
| Checkpoint duration | Time to complete one checkpoint | > 30s → Warning, > 60s → Critical |
| Checkpoint failure rate | Share of failed checkpoints | > 0 within 1 hour → Critical |
| Backpressure | An operator is slow and its buffers are full | isBackPressured = true > 5 minutes → Warning |
| Records per second | Flink job throughput | Drop > 50% vs baseline → Warning |
| State size | Size of operator state | Growth > 2x vs baseline → Warning |
| Late events dropped | Number of late events dropped | > 1% of total events → Warning |
| TaskManager failures | TaskManager crashes | > 0 → Alert |
Data Freshness & Accuracy
| Metric | Description | Alert threshold |
|---|---|---|
| End-to-end latency | Time from the click to the data being queryable in the DB | > 5 minutes → Warning, > 10 minutes → Critical |
| Data freshness | Gap between "now" and the newest record's timestamp in the DB | > 3 minutes → Warning |
| Reconciliation drift | % difference between streaming and batch results | > 0.01% → Warning, > 0.1% → Critical |
| Query latency (p99) | Query service latency | > 500ms → Warning, > 2s → Critical |
Database Metrics
| Metric | Description | Alert threshold |
|---|---|---|
| Write latency (p99) | ClickHouse/Cassandra write latency | > 100ms → Warning |
| Disk usage | Storage usage | > 75% → Warning |
| Compaction lag (Cassandra) | Pending compactions | > 10 → Warning |
| Query queue size (ClickHouse) | Queries waiting | > 50 → Warning |
5.2 Alerting Strategy
| Severity | Examples | Notification | Response time |
|---|---|---|---|
| P1 - Critical | Consumer lag > 1M, checkpoint failures, data freshness > 10 min | PagerDuty (phone call) | < 15 minutes |
| P2 - Warning | Consumer lag > 100K, reconciliation drift > 0.01%, high backpressure | Slack + PagerDuty (SMS) | < 1 hour |
| P3 - Info | Traffic spikes, scheduled maintenance | Slack | Next business day |
5.3 Runbook — Common Issues
| Issue | Common causes | How to respond |
|---|---|---|
| Consumer lag spikes suddenly | Traffic spike (viral ad), slow consumer, GC pauses | Scale Flink parallelism, check backpressure, check GC logs |
| Checkpoint failures | State too large, slow storage, network issues | Raise the checkpoint timeout, check state size, check HDFS/S3 health |
| Reconciliation drift | Bug in the aggregation logic, clock skew, late events | Compare raw events with aggregated results, check the watermark config |
| High query latency | ClickHouse overloaded, missing index, large result sets | Add a materialized view, optimize the query, scale read replicas |
| Kafka broker down | Full disk, hardware failure, network partition | Check disks, fail over to a replica, add a new broker |
5.4 Deployment Strategy
| Strategy | Description | When to use |
|---|---|---|
| Blue-green deployment | Run 2 Flink jobs in parallel (old + new), compare results, switch traffic | Major logic changes |
| Canary deployment | Route 5% of traffic to the new version, watch the metrics | Minor changes |
| Savepoint + restart | Take a savepoint of the Flink job, deploy the new version, restore from the savepoint | Config changes, minor updates |
| A/B testing | 2 pipelines process the same data, compare results | Testing new aggregation logic |
6. Mermaid Diagrams — Overview
6.1 High-Level Architecture (Complete)
flowchart TB subgraph "Clients" BROWSER["User Browsers"] APP["Mobile Apps"] end subgraph "Ingestion Layer" LB["Load Balancer<br/>(L7)"] IS["Ingestion Service Cluster<br/>(Stateless, Auto-scale)"] end subgraph "Message Queue Layer" KR["Kafka — Raw Events Topic<br/>(100+ partitions, RF=3)"] end subgraph "Processing Layer" FRAUD["Fraud Detection<br/>Service (Async)"] FLINK["Apache Flink Cluster<br/>Map → Aggregate → Reduce"] end subgraph "Storage Layer" RAW_CASS[("Cassandra<br/>Raw Events<br/>TTL 7 days")] RAW_S3[("S3 Archive<br/>Raw Events<br/>Parquet")] AGG_CH[("ClickHouse<br/>Aggregated Data")] end subgraph "Serving Layer" QS["Query Service"] REDIS[("Redis Cache")] end subgraph "Reconciliation" SPARK["Spark Batch Job<br/>(Hourly)"] end subgraph "Consumers" DASH["Advertiser Dashboard"] BILLING["Billing Service"] ANALYTICS["Analytics Platform"] end BROWSER & APP --> LB LB --> IS IS --> KR KR --> FRAUD KR --> FLINK KR --> RAW_CASS RAW_CASS -.->|"Archive"| RAW_S3 FLINK --> AGG_CH AGG_CH --> QS QS --> REDIS QS --> DASH & BILLING & ANALYTICS RAW_S3 -.-> SPARK AGG_CH -.-> SPARK style KR fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff style FLINK fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff style AGG_CH fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff style FRAUD fill:#f44336,stroke:#333,stroke-width:2px,color:#fff style SPARK fill:#9C27B0,stroke:#333,stroke-width:2px,color:#fff
6.2 Exactly-Once End-to-End Flow (Detailed)
sequenceDiagram participant Kafka as Kafka (Source) participant Flink as Flink Operator participant State as Flink State (RocksDB) participant CM as Checkpoint Manager participant Storage as Durable Storage (HDFS) participant Sink as Kafka (Sink) + DB Note over Kafka,Sink: Normal Processing Kafka->>Flink: Event batch (offset 100-200) Flink->>State: Update aggregation state Flink->>Sink: Emit partial result (uncommitted) Note over CM,Storage: Checkpoint Triggered (every 30s) CM->>Flink: Inject checkpoint barrier Flink->>State: Snapshot current state State->>Storage: Persist state snapshot Flink->>CM: Report: offset=200, state=snapshot_42 CM->>Kafka: Commit consumer offset=200 CM->>Sink: Commit Kafka transaction (flush writes to DB) Note over Kafka,Sink: Failure & Recovery Flink->>Flink: TaskManager CRASH! CM->>Storage: Load last checkpoint (snapshot_42, offset=200) CM->>Kafka: Reset consumer to offset=200 CM->>Flink: Restore state from snapshot_42 Kafka->>Flink: Replay events from offset 200 Flink->>State: Re-process (same state → same result) Flink->>Sink: Re-emit results (idempotent write → no duplicate)
6.3 Lambda vs Kappa Architecture Comparison
flowchart TB subgraph "Lambda Architecture" direction TB L_SRC["Events"] --> L_BATCH["Batch Layer<br/>(Spark, chay moi gio)"] L_SRC --> L_SPEED["Speed Layer<br/>(Flink, real-time)"] L_BATCH --> L_BV["Batch View<br/>(accurate, stale)"] L_SPEED --> L_RV["Real-time View<br/>(fast, approximate)"] L_BV --> L_MERGE["Merge"] L_RV --> L_MERGE L_MERGE --> L_QS["Query"] style L_BATCH fill:#4CAF50,color:#fff style L_SPEED fill:#FF9800,color:#fff style L_MERGE fill:#f44336,color:#fff end subgraph "Kappa Architecture (Preferred)" direction TB K_SRC["Events"] --> K_KAFKA["Kafka<br/>(Immutable Log)"] K_KAFKA --> K_STREAM["Stream Layer<br/>(Flink, real-time<br/>exactly-once)"] K_STREAM --> K_DB["Serving DB"] K_DB --> K_QS["Query"] K_KAFKA -.->|"Replay<br/>if needed"| K_STREAM K_DB -.-> K_RECON["Reconciliation<br/>(Safety net)"] style K_KAFKA fill:#FF9800,color:#fff style K_STREAM fill:#4CAF50,color:#fff style K_RECON fill:#9E9E9E,color:#fff end
6.4 Hot Shard Solution — Salted Keys
flowchart TB subgraph "Problem: Hot Shard" direction LR HOT_AD["ad_001 (viral)<br/>100K clicks/sec"] --> P1_HOT["Partition 0<br/>OVERLOADED"] NORMAL1["ad_002<br/>10 clicks/sec"] --> P2["Partition 1<br/>idle"] NORMAL2["ad_003<br/>15 clicks/sec"] --> P3["Partition 2<br/>idle"] style P1_HOT fill:#f44336,color:#fff end subgraph "Solution: Salted Keys + Secondary Aggregation" direction TB SALT["ad_001 → ad_001_0, ad_001_1, ..., ad_001_9"] subgraph "First Aggregation (Parallel)" S0["ad_001_0<br/>count=10K"] S1["ad_001_1<br/>count=10K"] S9["ad_001_9<br/>count=10K"] end subgraph "Secondary Aggregation (Merge)" REDUCE["Reduce Node<br/>10K x 10 = 100K"] end SALT --> S0 & S1 & S9 S0 & S1 & S9 --> REDUCE style SALT fill:#FF9800,color:#fff style REDUCE fill:#4CAF50,color:#fff end
7. Aha Moments & Pitfalls
7.1 Aha Moments — What Hieu must remember
| # | Insight | Explanation |
|---|---|---|
| 1 | Kappa > Lambda in most cases | Lambda existed because streaming frameworks weren't mature. With Flink's exactly-once, Kappa is simpler and accurate enough. Just add a reconciliation batch job as a safety net — that isn't Lambda |
| 2 | Watermarks are the hardest concept | They embody the latency-vs-completeness trade-off. There is no "correct" value — only one that fits the business. The only way to really understand watermarks: practice with Flink |
| 3 | A hot shard can kill performance | 1 viral ad → 1 overloaded partition → the whole pipeline slows down (backpressure). Salted keys solve it, at the cost of complexity. Monitor click rate per ad_id to detect it early |
| 4 | Reconciliation is a mandatory safety net | Even with streaming exactly-once, a batch job must verify. Logic bugs, clock skew, edge cases — all can cause small discrepancies. Reconciliation catches them before the advertiser does |
| 5 | Exactly-once does not mean "processed once" | It means "the result is as if processed once". Events can be replayed (after a checkpoint restore), but the result doesn't change thanks to idempotent writes. Misreading this concept → wrong designs |
| 6 | Cascade aggregation saves compute | Aggregate the 1-min window, then build 5-min from 5x 1-min and 1-hour from 12x 5-min. No need for 3 separate pipelines — 1 pipeline + cascade |
| 7 | HyperLogLog is the hidden hero | Counts unique users in 12 KB of memory per counter at < 1% error. Without HLL, unique counting at this scale is near-impossible with reasonable memory |
| 8 | Raw data is insurance | Always keep raw events (even though it's expensive). When logic breaks, when you must re-process, when an advertiser disputes — raw data is the only source of truth |
7.2 Pitfalls — Common mistakes
| # | Pitfall | Consequence | How to avoid |
|---|---|---|---|
| 1 | Using processing time instead of event time | A click at 13:59 is counted in the 14:00 window | Always use event time for aggregation; processing time only for system monitoring |
| 2 | Not handling late events | Late clicks are lost → under-count → advertiser complaints | Configure allowed lateness, use a side output for late events |
| 3 | No reconciliation | A streaming bug corrupts results and nobody notices | Run batch reconciliation hourly, alert when drift exceeds the threshold |
| 4 | Partitioning by a random key | Every aggregation node needs every ad's data → excessive shuffling | Partition by ad_id for data locality; salt only when handling hot keys |
| 5 | Checkpoint interval too long | On failure, many events must be replayed → slow recovery, large duplicate window | Checkpoint every 30s-1 minute. Balance overhead against recovery time |
| 6 | Not monitoring consumer lag | The pipeline falls behind unnoticed → stale data → advertisers see old numbers | Consumer lag is metric #1. Alert as soon as it grows abnormally |
| 7 | Fraud detection as a single point of failure | Fraud service down → all clicks accepted → money lost | Run fraud detection async, never blocking the aggregation pipeline. Fallback: flag and review later |
| 8 | Exact counting for unique counts | O(n) memory per ad per window → memory explosion with 2M ads | Use HyperLogLog. Reserve exact counting for billing-critical use cases |
| 9 | Not keeping raw data | No re-processing when logic breaks, no reconciliation | Always keep raw events — Cassandra (7 days) + S3 (archive) |
| 10 | Scaling vertically instead of horizontally | Hardware limits, single point of failure | Kafka partitions + Flink parallelism for horizontal scaling. Vertical only for DB tuning |
7.3 Interview Tips
| Tip | Explanation |
|---|---|
| Start with requirements | Ask clearly: clicks/day, latency target, accuracy requirement. Don't jump straight to a solution |
| Talk about trade-offs | Lambda vs Kappa, HLL vs exact counting, Cassandra vs ClickHouse — the interviewer wants to hear you weigh options, not just pick one |
| Diagram first, details later | High-level flow: Ingestion → Queue → Aggregation → DB → Query. Then dive into each component |
| Mention exactly-once early | It is the hardest requirement here. Explaining Flink checkpoints + idempotent writes leaves a strong impression |
| Hot shards are a bonus point | Many candidates never think of it. Raising it yourself with the salted-keys solution shows real production experience |
| Reconciliation shows you are senior | A junior engineer thinks streaming is enough. A senior knows a safety net is needed. Mentioning the reconciliation batch job proves production experience |
8. Summary Table
| Aspect | Decision | Why |
|---|---|---|
| Architecture | Kappa (streaming only + reconciliation) | Simpler than Lambda; Flink exactly-once is mature |
| Message queue | Apache Kafka | Durable, partitioned, replayable |
| Streaming engine | Apache Flink | Exactly-once, stateful processing, checkpoints/savepoints |
| Raw data store | Cassandra (7 days) + S3 (archive) | Write-heavy, time-series, cost-effective |
| Aggregated data store | ClickHouse | OLAP-optimized, fast aggregation queries |
| Partitioning strategy | By ad_id (salted for hot keys) | Data locality for aggregation |
| Time window | Tumbling 1-min, cascaded to 5-min and 1-hour | Simple, no overlap, efficient |
| Unique counting | HyperLogLog | Memory-efficient, mergeable, ~1% error is OK |
| Deduplication | Bloom filter (5-min window) | Memory-efficient, low false-positive rate |
| Late events | Watermark + 5-min allowed lateness + side output | Balances latency and completeness |
| Reconciliation | Hourly Spark batch job | Safety net, drift detection |
| Fraud detection | Async, parallel pipeline | Never blocks aggregation |
9. Internal Links & Further Reading
Prerequisites (already covered)
- Tuan-01-Scale-From-Zero-To-Millions — Scalability fundamentals
- Tuan-02-Back-of-the-envelope — Estimation skills
- Tuan-08-Message-Queue — Kafka fundamentals, consumer groups, partitions
Related (cross-reference)
- Tuan-13-Monitoring-Observability — A similar pipeline for monitoring (metrics ingestion)
- Case-Design-Metrics-Monitoring-Alerting — A similar problem, but for infrastructure metrics
- Tuan-07-Database-Sharding-Replication — Cassandra partitioning, ClickHouse replication
Concepts to explore further
- Apache Flink — Stateful stream processing, checkpointing, exactly-once semantics
- ClickHouse — Column-oriented OLAP database, materialized views
- HyperLogLog — Probabilistic cardinality estimation
- Kafka Streams vs Flink — when to use which
- Event sourcing — store all events, derive state from them (Kafka as the event log)
"A good system is not one that never fails — it is one that knows when it is wrong and can correct itself. Reconciliation is exactly that self-correction mechanism."