Case Study: Design Ad Click Event Aggregation

"One billion clicks a day, every click must be counted exactly once, and results must be available within minutes. This is a problem where even a 0.1% error can cost millions of advertising dollars."

Tags: system-design ad-click event-aggregation streaming exactly-once mapreduce alex-xu-vol2 Student: Hieu Source: Alex Xu — System Design Interview Volume 2, Chapter 3 Prerequisite: Tuan-08-Message-Queue · Tuan-02-Back-of-the-envelope · Tuan-01-Scale-From-Zero-To-Millions Related: Tuan-13-Monitoring-Observability · Case-Design-Metrics-Monitoring-Alerting · Tuan-07-Database-Sharding-Replication


1. Context & Why — Why does Ad Click Event Aggregation matter?

1.1 Analogy: Counting national election ballots in real time

Hieu, imagine this problem as counting ballots for a national election, in real time:

| National election | Ad Click Aggregation | What they share |
|---|---|---|
| Millions of ballots from all over the country | Billions of clicks from all over the world | Huge volume, arriving continuously |
| Each ballot is counted exactly once — counting it twice = fraud | Each click is counted exactly once — counting it twice = billing the advertiser incorrectly | Exactly-once is a hard requirement |
| Results must update continuously — the press wants to know who is leading every minute | The dashboard must update continuously — advertisers want to know how many clicks happened in the last minute | Real-time aggregation with low latency |
| Split by region — counted per province, per city | Split by filter — counted per ad_id, campaign_id, country | Multi-dimensional aggregation |
| Ballots arriving late — a remote polling office reports slowly | Late events — click events arrive late due to network delay | Late event handling is a major challenge |
| Cross-checking — manual counts must match the machine count | Reconciliation — a batch job re-verifies the streaming results | Reconciliation guarantees accuracy |
| Duplicate voting — one person votes twice | Click fraud — a bot clicks many times | Deduplication has to handle this |

Aha Moment: This is not a simple "count the clicks" problem. It is a problem of counting correctly, completely, and quickly at a scale of billions of events per day. A small error = real money lost — either the advertiser gets overcharged or the platform loses revenue.

1.2 Why is this problem hard?

Hieu, counting sounds simple — but counting at large scale, in real time, and with absolute accuracy is one of the hardest problems in distributed systems:

| Challenge | Explanation |
|---|---|
| Huge volume | 1 billion clicks/day = ~11,574 clicks/sec on average; peaks can exceed 50K/sec |
| Real-time | Advertisers want results within minutes, not at the end of the day |
| Exactly-once | Over-counting = wrong invoices. Under-counting = lost revenue. Neither is acceptable |
| Late events | Events can arrive 5 minutes, 1 hour, even 1 day late due to mobile networks and CDN delays |
| Multi-dimensional | Must aggregate across many dimensions: ad_id, campaign_id, country, device_type, time window |
| Fault tolerance | The system must not lose data on server crashes or network partitions |
| Hot spots | A viral ad can receive 100x normal traffic — creating a hot partition |
| Consistency vs latency | Must balance accuracy against how quickly results are served |

1.3 Business context — Why do advertisers care?

In online advertising (Google Ads, Facebook Ads, TikTok Ads), advertisers pay per CPC (Cost Per Click). Every click = money. For example:

  • Advertiser A runs a campaign with CPC = $0.50
  • The ad receives 1 million clicks in a day
  • Total spend = $500,000/day

If the system miscounts by 1% → roughly a $1.8 million error per year for this one advertiser. Multiply that across millions of advertisers and the number explodes. This is why data accuracy is non-negotiable.

1.4 Scope of the problem

| Aspect | In scope | Out of scope |
|---|---|---|
| Click event aggregation (count clicks per ad_id and time window) | Yes | |
| Real-time query (advertiser views a dashboard) | Yes | |
| Filtering (by campaign_id, country) | Yes | |
| Click fraud detection (bots, click farms) | Related (Section 4) | Separate deep dive |
| Conversion tracking (click → purchase) | | Different problem |
| Ad serving (choosing which ad to display) | | Different problem |
| Billing (charging the advertiser) | | Downstream of aggregation |

2. Deep Dive — Alex Xu 4-Step Framework

Step 1 — Understand the Problem & Establish Design Scope

2.1.1 Functional Requirements

| Feature | Details |
|---|---|
| Aggregate ad clicks | Count clicks and unique clicks per ad_id in time windows of 1 minute, 5 minutes, 1 hour |
| Return aggregated data | Query API returns click_count and unique_click_count for an ad_id over a time range |
| Support filtering | Filter results by ad_id, campaign_id, country, device_type |
| Support multi-dimensional aggregation | Aggregate across several dimensions at once, e.g. click count per ad_id per country per minute |
| Store raw events | Keep raw click events for reconciliation and re-processing |

Questions you should ask the interviewer:

| Question | Assumed answer | Why ask |
|---|---|---|
| How many clicks per day? | 1 billion clicks/day | Determines the throughput of the whole system |
| How many ads are active? | 2 million ads | Determines the cardinality of the aggregation |
| Which aggregation windows? | 1 minute, 5 minutes, 1 hour | Determines the windowing strategy |
| Latency requirement? | < a few minutes from click to queryable | Determines streaming vs batch |
| Data accuracy? | Exactly-once — no over- or under-counting | Determines delivery semantics |
| Do we need to store raw data? | Yes — for reconciliation and ad-hoc analysis | Determines storage strategy |
| How long do we keep raw data? | Raw: 7 days, Aggregated: years | Determines data retention |
| Is click fraud in scope? | Basic dedup yes, advanced fraud detection out of scope | Sets the design boundary |

2.1.2 Non-Functional Requirements

| Requirement | Target | Reason |
|---|---|---|
| Throughput | 1B clicks/day, peak 50K clicks/sec | High volume, bursty traffic |
| Latency | End-to-end < a few minutes (from click to queryable) | Near-real-time dashboard |
| Accuracy | Exactly-once processing | Money is involved |
| Availability | 99.99% for the ingestion pipeline | Lost events = lost revenue |
| Scalability | Scale horizontally as traffic grows | Sale seasons, viral ads |
| Fault tolerance | No data loss on node failure | Zero data loss |
| Idempotency | Re-processing does not change the result | Retry safety |
| Queryability | Query response < 1 second for a single ad_id | Dashboard UX |

2.1.3 API Design

API 1: Aggregated click count for an ad_id

GET /v1/ads/{ad_id}/aggregated_count

Parameters:

| Parameter | Type | Description |
|---|---|---|
| from | long (epoch) | Start of the time range |
| to | long (epoch) | End of the time range |
| filter | object | Optional: {campaign_id: "xxx", country: "VN"} |

Response:

| Field | Type | Description |
|---|---|---|
| ad_id | string | ID of the ad |
| click_count | long | Total number of clicks |
| unique_click_count | long | Number of clicks from distinct users |

API 2: Top N most-clicked ads in a time range

GET /v1/ads/top_clicked

Parameters:

| Parameter | Type | Description |
|---|---|---|
| count | int | Number of top ads to return (N) |
| window | int | Time window in minutes (1, 5, 60) |
| filter | object | Optional filter |
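
To make the query API concrete, here is a minimal client-side sketch of API 1 in Python. The host name, the JSON encoding of the filter parameter, and the use of the requests package are illustrative assumptions, not part of the design:

```python
# Hypothetical call against API 1; assumes the endpoint and fields described above.
import requests  # assumes the third-party `requests` package is installed

resp = requests.get(
    "https://ads-api.example.com/v1/ads/ad_001/aggregated_count",  # host is illustrative
    params={
        "from": 1678886400,              # epoch seconds, start of range
        "to": 1678890000,                # epoch seconds, end of range
        "filter": '{"country": "VN"}',   # filter encoding is an assumption
    },
    timeout=5,
)
resp.raise_for_status()
data = resp.json()
# Expected shape per the response table: ad_id, click_count, unique_click_count
print(data["ad_id"], data["click_count"], data["unique_click_count"])
```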

Step 2 — High-Level Design

2.2.1 Architecture overview

Hieu, this architecture is like the processing line in a seafood plant:

  • Fishing boats (Client) bring fish to the port → click events from user browsers/apps
  • The port (Load Balancer) distributes fish onto conveyor belts → Ingestion Service receives events
  • Cold storage (Message Queue) keeps the fish fresh before processing → Kafka buffers events
  • The processing line (Aggregation Service) cuts, sorts, and packs → MapReduce aggregates clicks
  • The finished-goods warehouse (Database) stores packed products → Aggregated DB stores the results
  • The sales counter (Query Service) serves customers → Query API returns results to advertisers
flowchart LR
    subgraph "Data Generation"
        C1["User Clicks<br/>(Browsers/Apps)"]
    end

    subgraph "Log Ingestion"
        LB["Load Balancer"]
        IS1["Ingestion<br/>Service 1"]
        IS2["Ingestion<br/>Service 2"]
        ISN["Ingestion<br/>Service N"]
    end

    subgraph "Message Queue"
        K["Apache Kafka<br/>(Partitioned by ad_id)"]
    end

    subgraph "Streaming Aggregation"
        AG1["Aggregation Node 1<br/>(Map + Aggregate)"]
        AG2["Aggregation Node 2<br/>(Map + Aggregate)"]
        AGN["Aggregation Node N<br/>(Map + Aggregate)"]
    end

    subgraph "Second Message Queue (Optional)"
        K2["Kafka Topic 2<br/>(Aggregated Results)"]
    end

    subgraph "Storage"
        RAW[("Raw Data Store<br/>(Cassandra / S3)")]
        AGG[("Aggregated DB<br/>(Cassandra / ClickHouse)")]
    end

    subgraph "Serving"
        QS["Query Service"]
        CACHE[("Redis Cache")]
        DASH["Advertiser<br/>Dashboard"]
    end

    C1 --> LB
    LB --> IS1 & IS2 & ISN
    IS1 & IS2 & ISN --> K
    K --> AG1 & AG2 & AGN
    K -->|"Raw events"| RAW
    AG1 & AG2 & AGN --> K2
    K2 --> AGG
    AGG --> QS
    QS --> CACHE
    QS --> DASH

    style K fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style K2 fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style AGG fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style QS fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff

2.2.2 Data Flow — End to End

sequenceDiagram
    participant User as User Browser
    participant IS as Ingestion Service
    participant Kafka as Kafka (Raw Events)
    participant AGG as Aggregation Service
    participant Kafka2 as Kafka (Aggregated)
    participant DB as Aggregated DB
    participant QS as Query Service
    participant ADV as Advertiser Dashboard

    User->>IS: Click event (ad_id, timestamp, user_id, ip, country)
    IS->>IS: Validate & enrich event
    IS->>Kafka: Produce to partition (hash ad_id)

    Note over Kafka,AGG: Streaming Aggregation (per minute window)
    Kafka->>AGG: Consume batch of events
    AGG->>AGG: Map: filter + transform
    AGG->>AGG: Aggregate: count by ad_id per window
    AGG->>AGG: Reduce: merge partial results
    AGG->>Kafka2: Produce aggregated result

    Kafka2->>DB: Write aggregated data
    Note over DB: (ad_id, window_start, click_count, unique_count)

    ADV->>QS: GET /v1/ads/123/aggregated_count?from=...&to=...
    QS->>DB: Query aggregated data
    DB-->>QS: Results
    QS-->>ADV: JSON response

2.2.3 Component Responsibilities

| Component | Responsibility | Technology choices |
|---|---|---|
| Ingestion Service | Receive click events from clients, validate, enrich (geo lookup), publish to Kafka | Custom service (Go/Java), stateless, horizontally scalable |
| Kafka (Raw) | Buffer raw events, decouple ingestion from processing, replay capability | Apache Kafka, partitioned by ad_id |
| Raw Data Store | Store raw events for reconciliation, ad-hoc queries, re-processing | Cassandra (write-heavy), or S3 (cost-effective) |
| Aggregation Service | Stream processing: map, aggregate, reduce within a time window | Apache Flink, Spark Streaming |
| Kafka (Aggregated) | Buffer aggregated results before writing to the DB, decouple aggregation from storage | Apache Kafka |
| Aggregated DB | Store aggregation results, serve read queries | Cassandra, ClickHouse |
| Query Service | Parse the query, fetch data from the DB, apply filters, return results | Custom service with a caching layer |
| Redis Cache | Cache hot queries (top ads, recent aggregations) | Redis cluster |

2.2.4 Why two Kafka topics?

| Benefit | Explanation |
|---|---|
| Decoupling | The aggregation service and the DB writer can scale independently |
| Exactly-once | Kafka transactions ensure each aggregated result is written exactly once |
| Replay | If the DB writer fails, results can be replayed from topic 2 without re-aggregating |
| Multiple consumers | Several downstream systems can read the aggregated data (alerting, billing, analytics) |

Step 3 — Design Deep Dive

2.3.1 Data Model

Raw Click Event
| Field | Type | Description | Example |
|---|---|---|---|
| ad_id | string | ID of the ad | "ad_001" |
| click_timestamp | long (epoch ms) | When the user clicked (event time) | 1678886400000 |
| user_id | string | ID of the user (cookie/device ID) | "user_abc123" |
| ip | string | IP address of the user | "203.113.152.4" |
| country | string | Country code (from IP geo lookup) | "VN" |
| device_type | string | Device type | "mobile" |
| campaign_id | string | Campaign this ad belongs to | "camp_xyz" |

Average size: ~0.5-1 KB per event.

Aggregated Data

| Field | Type | Description | Example |
|---|---|---|---|
| ad_id | string | ID of the ad | "ad_001" |
| window_start | long (epoch) | Start of the time window | 1678886400 |
| window_size | int | Window size (minutes) | 1 |
| click_count | long | Total clicks in the window | 45,231 |
| unique_click_count | long | Clicks from distinct users | 38,102 |
| filter_id | string | Composite key of the filter dimensions | "country=VN" |

Aha Moment: Raw events and aggregated data have completely different characteristics. Raw events are write-heavy, append-only, high volume. Aggregated data is read-heavy, updated in place (upsert), much lower volume. That is why they need different databases — or at the very least different tables with different access patterns.

Star Schema for Aggregation

Hieu, the data model for aggregation resembles a star schema in data warehousing:

| Part | Role | Example |
|---|---|---|
| Fact table | Stores the metrics (click_count) | aggregated_clicks (ad_id, window, click_count, unique_count) |
| Dimension tables | Store filter attributes | ads (ad_id, campaign_id, advertiser_id), campaigns (campaign_id, budget) |

Query pattern: join the fact table with dimension tables to filter by campaign_id, advertiser_id, country.
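
To make the two record shapes concrete, here is a minimal Python sketch of the raw event and the aggregated row. Field names follow the tables above; the class names themselves are only illustrative:

```python
from dataclasses import dataclass

@dataclass
class RawClickEvent:
    """One click as it reaches the ingestion service (append-only, ~1 KB)."""
    ad_id: str
    click_timestamp: int   # epoch millis, event time
    user_id: str
    ip: str
    country: str
    device_type: str
    campaign_id: str

@dataclass
class AggregatedRow:
    """One row in the aggregated store (upserted once per window)."""
    ad_id: str
    window_start: int      # epoch seconds, start of the tumbling window
    window_size: int       # minutes: 1, 5, or 60
    click_count: int
    unique_click_count: int
    filter_id: str         # e.g. "country=VN"
```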

2.3.2 Choosing the Right Database

| Requirement | Raw data | Aggregated data |
|---|---|---|
| Write pattern | Append-only, sequential | Upsert (update or insert) |
| Read pattern | Rare (reconciliation, re-processing) | Frequent (dashboard queries) |
| Volume | Huge (~1 TB/day) | Much smaller (a few GB/day) |
| Query pattern | Full scan, time-range scan | Point query (ad_id + time range), top-N |
| Retention | 7 days (then archive) | Years |
| Consistency | Eventual is fine | Strong preferred |

Raw data → Cassandra or S3

| Option | Pros | Cons |
|---|---|---|
| Cassandra | Write-optimized (LSM-tree), time-series friendly, TTL support, distributed | Limited query flexibility, relatively expensive storage |
| S3 + Parquet | Very cheap, unlimited storage, queryable via Athena/Spark | High latency, not suitable for real-time queries |
| Hybrid (Cassandra 7 days + S3 archive) | Combines both | More operational complexity |

Conclusion: use Cassandra for raw data with a 7-day TTL. Archive to S3 (Parquet format) for long-term storage. Cassandra's write-heavy performance fits this use case well — partition key ad_id, clustering key click_timestamp.

Aggregated data → Cassandra or ClickHouse

| Option | Pros | Cons |
|---|---|---|
| Cassandra | Consistent with the raw data store, write-optimized | Slow OLAP queries (no good secondary indexes) |
| ClickHouse | OLAP-optimized, column-oriented, blazing fast aggregation queries | Different write pattern (batch inserts work better than single upserts) |
| Druid | Real-time OLAP, pre-aggregation at ingestion | Operationally complex |

Conclusion: ClickHouse is the best fit for aggregated data because:

  1. Column-oriented storage is ideal for aggregation queries (SUM, COUNT, GROUP BY)
  2. Extremely fast queries for the dashboard use case
  3. Built-in materialized views can aggregate automatically
  4. High compression ratio for time-series data

Trade-off: if the team has no ClickHouse experience, use Cassandra for both raw and aggregated data to reduce operational complexity. Performance is not as good, but the setup is simpler.

2.3.3 Aggregation Service — MapReduce Model

This is the heart of the system. The aggregation service uses the MapReduce model, processing events in three stages:

Stage 1: Map node (filter + transform)

| Task | Details |
|---|---|
| Receive raw events from Kafka | Each Map node reads from one or more Kafka partitions |
| Filter | Drop invalid events (missing ad_id, invalid timestamp) |
| Transform | Enrich data (geo lookup from IP → country), normalize fields |
| Emit key-value | Output: (ad_id, 1) — meaning "this ad was clicked once" |

Stage 2: Aggregate node (count by ad_id per window)

| Task | Details |
|---|---|
| Receive (ad_id, 1) from Map | Group by ad_id |
| Aggregate within the time window | Count clicks and unique clicks per window (1 minute) |
| Maintain state | Use in-memory state (Flink state backend) to hold partial aggregations |
| Emit partial result | Output: (ad_id, window, partial_count, partial_unique_count) |

Stage 3: Reduce node (merge results)

| Task | Details |
|---|---|
| Receive partial results from several Aggregate nodes | For the case where one ad_id is spread across nodes |
| Merge | Sum all partial_count values, merge HyperLogLog sketches for the unique count |
| Output final result | (ad_id, window, final_click_count, final_unique_count) |
| Write to Kafka topic 2 | The aggregated result is ready to be written to the DB |
flowchart TB
    subgraph "Kafka Partitions"
        P1["Partition 0<br/>ad_001, ad_005, ad_009..."]
        P2["Partition 1<br/>ad_002, ad_006, ad_010..."]
        P3["Partition 2<br/>ad_003, ad_007, ad_011..."]
        P4["Partition 3<br/>ad_004, ad_008, ad_012..."]
    end

    subgraph "Map Nodes"
        M1["Map Node 1<br/>Filter + Transform<br/>Emit (ad_id, 1)"]
        M2["Map Node 2<br/>Filter + Transform<br/>Emit (ad_id, 1)"]
        M3["Map Node 3<br/>Filter + Transform<br/>Emit (ad_id, 1)"]
        M4["Map Node 4<br/>Filter + Transform<br/>Emit (ad_id, 1)"]
    end

    subgraph "Aggregate Nodes (Windowed)"
        A1["Aggregate Node 1<br/>Window: 1 min<br/>Count by ad_id"]
        A2["Aggregate Node 2<br/>Window: 1 min<br/>Count by ad_id"]
    end

    subgraph "Reduce Nodes"
        R1["Reduce Node<br/>Merge partial counts<br/>Final result per ad_id"]
    end

    subgraph "Output"
        K2["Kafka Topic 2<br/>(Aggregated Results)"]
        DB[("ClickHouse")]
    end

    P1 --> M1
    P2 --> M2
    P3 --> M3
    P4 --> M4

    M1 & M2 --> A1
    M3 & M4 --> A2

    A1 & A2 --> R1
    R1 --> K2
    K2 --> DB

    style M1 fill:#42A5F5,color:#fff
    style M2 fill:#42A5F5,color:#fff
    style M3 fill:#42A5F5,color:#fff
    style M4 fill:#42A5F5,color:#fff
    style A1 fill:#FFA726,color:#fff
    style A2 fill:#FFA726,color:#fff
    style R1 fill:#EF5350,color:#fff

Aha Moment: In practice, with Apache Flink, Map and Aggregate are usually merged into a single operator. Flink does this automatically via operator chaining to reduce network overhead. The three-stage model is there to explain the concept — the real implementation is more compact.
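
For intuition, here is a tiny, self-contained Python sketch of the three stages over a handful of events. It is purely conceptual — not Flink code — and uses a plain set for unique users where the real system would use HyperLogLog:

```python
from collections import defaultdict

# Toy input: (ad_id, user_id, event_time_epoch_sec)
events = [
    ("ad_001", "u1", 1678886405), ("ad_001", "u2", 1678886430),
    ("ad_002", "u1", 1678886410), ("ad_001", "u1", 1678886455),
]

WINDOW_SEC = 60  # 1-minute tumbling window

def map_stage(event):
    """Filter + transform, then emit (key, value) — the key includes the window."""
    ad_id, user_id, ts = event
    if not ad_id or ts <= 0:                      # drop invalid events
        return None
    window_start = ts - (ts % WINDOW_SEC)
    return ((ad_id, window_start), (1, user_id))

def aggregate_stage(mapped):
    """Partial aggregation: click count + set of users per (ad_id, window)."""
    partial = defaultdict(lambda: [0, set()])
    for key, (count, user_id) in mapped:
        partial[key][0] += count
        partial[key][1].add(user_id)
    return partial

def reduce_stage(partials):
    """Merge partial results from several aggregate nodes into final rows."""
    merged = defaultdict(lambda: [0, set()])
    for partial in partials:
        for key, (count, users) in partial.items():
            merged[key][0] += count
            merged[key][1] |= users               # with HLL this would be a sketch merge
    return {key: (count, len(users)) for key, (count, users) in merged.items()}

mapped = [m for m in map(map_stage, events) if m is not None]
result = reduce_stage([aggregate_stage(mapped)])
print(result)   # {(ad_id, window_start): (click_count, unique_click_count)}
```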

Unique Click Count — HyperLogLog

Counting unique clicks (distinct user_id values) is hard because:

  • You cannot keep every user_id in memory (there are too many)
  • Exact counting needs O(n) memory per ad_id per window

Solution: use HyperLogLog (HLL) — a probabilistic data structure:

| Property | Value |
|---|---|
| Memory | ~12 KB per counter (fixed, regardless of how many elements) |
| Accuracy | ~0.81% error (with 16K registers) |
| Mergeable | Two HLL counters can be merged → perfect for distributed aggregation |
| Trade-off | Gives up exact accuracy, but saves an enormous amount of memory |

With 2 million active ads, one HLL counter per ad = 2M x 12 KB = ~24 GB of memory — entirely feasible.

If exact unique counts are needed (some advertisers pay extra for accuracy), use bitmaps or exact sets instead, at a much higher memory cost.
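
Here is a toy HyperLogLog in Python, just to show the register update and the cheap register-wise merge that makes distributed unique counting possible. A real deployment would use a hardened library; this sketch trades speed and edge-case corrections for readability:

```python
import hashlib

class SimpleHLL:
    """Simplified HyperLogLog sketch (illustrative only)."""
    def __init__(self, p: int = 14):
        self.p = p
        self.m = 1 << p                      # 2^14 = 16384 registers (~12-16 KB packed)
        self.registers = [0] * self.m

    def _hash64(self, value: str) -> int:
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def add(self, value: str) -> None:
        h = self._hash64(value)
        idx = h & (self.m - 1)               # low p bits pick the register
        rest = h >> self.p                   # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "SimpleHLL") -> None:
        # Register-wise max — this is why partial unique counts merge cheaply.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / z

node_a, node_b = SimpleHLL(), SimpleHLL()
for i in range(50_000):
    node_a.add(f"user_{i}")
for i in range(30_000, 80_000):
    node_b.add(f"user_{i}")
node_a.merge(node_b)
print(round(node_a.estimate()))   # ~80,000 distinct users, typically within ~1-2%
```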

2.3.4 Streaming vs Batching — Lambda vs Kappa Architecture

This is the most important architectural decision in the whole problem. Hieu, you need to understand the two schools of thought:

Lambda Architecture (Batch + Speed Layer)
flowchart TB
    subgraph "Data Source"
        EVENTS["Click Events"]
    end

    subgraph "Batch Layer"
        BATCH_STORE[("Batch Storage<br/>(HDFS / S3)")]
        BATCH_PROC["Batch Processing<br/>(Spark / MapReduce)<br/>Runs hourly or daily"]
        BATCH_VIEW[("Batch View<br/>(accurate, complete)")]
    end

    subgraph "Speed Layer"
        STREAM_PROC["Stream Processing<br/>(Flink / Storm)<br/>Real-time"]
        RT_VIEW[("Real-time View<br/>(approximate, fast)")]
    end

    subgraph "Serving Layer"
        MERGE["Merge Results<br/>Batch View + Real-time View"]
        QS["Query Service"]
    end

    EVENTS --> BATCH_STORE
    EVENTS --> STREAM_PROC
    BATCH_STORE --> BATCH_PROC
    BATCH_PROC --> BATCH_VIEW
    STREAM_PROC --> RT_VIEW
    BATCH_VIEW --> MERGE
    RT_VIEW --> MERGE
    MERGE --> QS

    style BATCH_PROC fill:#4CAF50,color:#fff
    style STREAM_PROC fill:#FF9800,color:#fff
    style MERGE fill:#2196F3,color:#fff

| Pros | Cons |
|---|---|
| The batch layer guarantees accuracy | Two codebases — two pipelines (batch + streaming) to maintain |
| The speed layer guarantees low latency | Duplicated logic — the same aggregation written twice |
| If streaming is wrong, batch corrects it | Hard to debug — which layer is wrong? |
| | The merge logic is not trivial |
Kappa Architecture (Streaming Only)
flowchart LR
    subgraph "Data Source"
        EVENTS["Click Events"]
    end

    subgraph "Streaming Layer (Only Layer)"
        KAFKA["Kafka<br/>(Immutable Log)"]
        FLINK["Stream Processing<br/>(Apache Flink)<br/>Real-time Aggregation"]
    end

    subgraph "Serving Layer"
        DB[("Aggregated DB")]
        QS["Query Service"]
    end

    subgraph "Reconciliation (Periodic)"
        RECON["Batch Reconciliation<br/>(Hourly / Daily)<br/>Verify streaming results"]
    end

    EVENTS --> KAFKA
    KAFKA --> FLINK
    FLINK --> DB
    DB --> QS
    KAFKA -.->|"Replay if needed"| FLINK
    DB -.-> RECON
    KAFKA -.-> RECON

    style KAFKA fill:#FF9800,color:#fff
    style FLINK fill:#4CAF50,color:#fff
    style RECON fill:#9E9E9E,color:#fff

| Pros | Cons |
|---|---|
| One codebase — a single pipeline to maintain | Requires a mature streaming framework (Flink has exactly-once) |
| Much simpler than Lambda | Re-processing means replaying from Kafka (needs enough retention) |
| Kafka is an immutable log — replay at any time | |
| Apache Flink supports exactly-once processing | |

Alex Xu's conclusion: Kappa architecture is the preferred choice for this problem. Reason: Apache Flink is mature enough to guarantee exactly-once processing. Lambda is only necessary when the streaming framework is not trustworthy enough — which is no longer the case with Flink in 2024 and beyond.

Aha Moment: Lambda architecture was born because early streaming frameworks (Storm, Samza) could not guarantee exactly-once. Once Flink solved that, the reason for Lambda's existence largely disappeared. However, reconciliation (a batch job that re-verifies results) is still needed as a safety net — that is not Lambda, it is simply best practice.

2.3.5 Watermarks and Late Events

This is the hardest concept in stream processing. Hieu, you need to understand it well:

Event Time vs Processing Time

| Concept | Definition | Example |
|---|---|---|
| Event time | When the event actually happened (the user clicked) | 14:00:00 — user clicks the ad |
| Processing time | When the system receives the event | 14:00:05 — the event reaches Flink (5 seconds late due to the network) |
| Ingestion time | When the event enters Kafka | 14:00:02 — Kafka receives the event |

Why does this matter? Because you aggregate by time window. If you use processing time, a click that happened at 13:59:59 but arrives at 14:00:05 gets counted in the 14:00-14:01 window instead of 13:59-14:00. Wrong window = wrong result.

Conclusion: always use event time for aggregation. Processing time is only for monitoring the system itself.

What is a watermark?

A watermark is the system's declaration that: "all events with event_time ≤ W have already arrived."

Example: watermark = 14:05:00 means the system believes no more events with a timestamp before 14:05:00 will arrive. The system can then close every window before 14:05:00 and emit results.

Problem: how do you know all events have arrived? You cannot know for sure. That is why a watermark always comes with an allowed lateness.

| Watermark strategy | Description | Trade-off |
|---|---|---|
| Punctuated watermark | Emit a watermark based on the event data itself (e.g. max event_time - 5s) | Suits steady traffic |
| Periodic watermark | Emit a watermark at fixed intervals (e.g. every 200 ms, watermark = max event_time seen - 10s) | Most common, simple |
| Watermark = max_event_time - allowed_lateness | E.g. max_event_time - 5 minutes | Balances latency and completeness |
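
A minimal Python sketch of the periodic-watermark idea: track the maximum event time seen, subtract the allowed lateness, and close every window that ends at or below the watermark. All names are illustrative — this is not the Flink API:

```python
from collections import defaultdict

WINDOW_SEC = 60
ALLOWED_LATENESS_SEC = 300   # 5 minutes, per the strategy table above

open_windows = defaultdict(int)   # (ad_id, window_start) -> click_count
closed = {}                       # finalized windows
max_event_time = 0

def on_event(ad_id: str, event_time: int) -> None:
    """Assign the event to its tumbling window using event time."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    window_start = event_time - (event_time % WINDOW_SEC)
    key = (ad_id, window_start)
    if key in closed:
        # Late event after the window closed: update the result (or send to a side output).
        closed[key] += 1
    else:
        open_windows[key] += 1

def advance_watermark() -> None:
    """Called periodically: close every window fully below the watermark."""
    watermark = max_event_time - ALLOWED_LATENESS_SEC
    for key in [k for k in open_windows if k[1] + WINDOW_SEC <= watermark]:
        closed[key] = open_windows.pop(key)   # emit the final result downstream

on_event("ad_001", 1678886405)
on_event("ad_001", 1678886462)
on_event("ad_001", 1678886950)   # pushes the watermark past the earliest windows
advance_watermark()
print(closed)        # windows whose end is below the watermark are now final
print(open_windows)  # later windows stay open
```
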
Late Event Handling

When an event arrives after the watermark (a late event), there are three ways to handle it:

| Strategy | Description | When to use |
|---|---|---|
| Drop | Ignore the late event | When data accuracy is not critical (approximate dashboards) |
| Update aggregation | Re-open the window, update the result, emit again | When high accuracy is needed — used for ad clicks |
| Side output | Send the late event to a separate stream to handle later | When late events need different handling (reconciliation, alerting) |

For the ad click problem, Alex Xu recommends combining Update + Side output:

  • Late events within 5 minutes: update the aggregation (re-open the window, recount)
  • Late events beyond 5 minutes: send them to a side output → the reconciliation batch job handles them

Aha Moment: the watermark is the hardest concept in stream processing because it embodies the trade-off between latency and completeness. Watermark too eager (small allowed_lateness) → many late events dropped → wrong results. Watermark too conservative (large allowed_lateness) → high latency → the dashboard updates slowly. There is no "correct" value — only a value that fits the business requirement.

2.3.6 Exactly-Once Processing

In a distributed system, exactly-once is extremely hard because there are many failure points:

| Failure point | Problem | Solution |
|---|---|---|
| Kafka consumer reads an event | Consumer crashes after processing but before committing the offset | Flink checkpointing (snapshot state + Kafka offset together) |
| Aggregation service | Node crashes mid-processing | Flink savepoints/checkpoints — resume from the last consistent state |
| Write to the DB | Write succeeds but is never acked | Idempotent writes — writing the same data again does not change the result |
| Kafka producer (writes the aggregated result) | Producer crashes after Kafka receives the message but before the ack | Kafka idempotent producer (enable.idempotence=true) |
Exactly-Once End-to-End Flow
flowchart TB
    subgraph "Source: Kafka Consumer"
        KC["Kafka Consumer<br/>Track offset per partition"]
    end

    subgraph "Processing: Flink"
        CP["Checkpoint Manager<br/>(Periodic snapshots)"]
        STATE["Flink State Backend<br/>(RocksDB / Heap)"]
        OP["Aggregation Operator<br/>(Stateful processing)"]
    end

    subgraph "Sink: Kafka Producer + DB"
        KP["Kafka Producer<br/>(Idempotent + Transactional)"]
        DB[("ClickHouse<br/>(Idempotent Upsert)")]
    end

    subgraph "Checkpoint Flow"
        BARRIER["Checkpoint Barrier<br/>(injected into stream)"]
    end

    KC -->|"Events"| OP
    OP --> STATE
    OP --> KP
    KP --> DB

    CP -->|"Trigger checkpoint"| BARRIER
    BARRIER -->|"Snapshot: offset + state + pending writes"| CP
    CP -->|"On success: commit Kafka offset + flush writes"| KC
    CP -->|"On success: commit Kafka transaction"| KP

    style CP fill:#e53935,color:#fff
    style STATE fill:#1e88e5,color:#fff
    style OP fill:#43a047,color:#fff

How the Flink checkpoint mechanism works:

  1. The checkpoint manager periodically (e.g. every 30 seconds) injects a checkpoint barrier into the stream
  2. As the barrier passes through an operator, the operator snapshots its state (partial aggregations, HLL counters) to durable storage (HDFS/S3)
  3. The Kafka consumer offset at that moment is saved as part of the same snapshot
  4. When every operator has finished its snapshot → the checkpoint is complete
  5. If a failure occurs → Flink rolls back to the last successful checkpoint:
    • Reset the Kafka consumer offset to the checkpoint offset (re-read events from there)
    • Restore operator state from the checkpoint
    • Re-process events from the checkpoint offset → the result is identical because the state is identical

Important: exactly-once in Flink does not mean every event is processed only once. It means the final result is as if every event were processed only once. In practice, events may be processed again (after a checkpoint restore), but the result does not change thanks to idempotent writes.
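
To see why idempotent writes make replays harmless, here is a tiny Python sketch of an upsert keyed by (ad_id, window_start, window_size); a dict stands in for the aggregated DB. Writing the same final result twice — as happens after a checkpoint restore — leaves the store unchanged:

```python
# Keyed upsert: the sink overwrites the row for (ad_id, window_start, window_size)
# instead of appending, so a replayed result after recovery is a no-op.
store = {}

def upsert_aggregate(ad_id: str, window_start: int, window_size: int,
                     click_count: int, unique_click_count: int) -> None:
    key = (ad_id, window_start, window_size)
    store[key] = {"click_count": click_count,
                  "unique_click_count": unique_click_count}

# Normal write, then the same write replayed after a failure/restore:
upsert_aggregate("ad_001", 1678886400, 1, 45231, 38102)
upsert_aggregate("ad_001", 1678886400, 1, 45231, 38102)   # replay → identical state

print(len(store))   # 1 — no duplicate row, no double counting
```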

2.3.7 Deduplication — the same click sent twice

Unlike exactly-once (a problem inside the system), deduplication handles problems coming from the client side:

| Cause | Description | Example |
|---|---|---|
| Client retry | User clicks, the request times out, the client resends | 1 click → 2 events |
| Double click | User clicks twice quickly | 2 real clicks, but only 1 should be counted |
| Page reload | User refreshes the page containing the ad | The click event is sent again |
| CDN/Proxy retry | An intermediate server retries the request | 1 click → multiple events |

Dedup Strategy

| Approach | Description | Trade-off |
|---|---|---|
| Dedup key: (ad_id, user_id, click_timestamp) | Two events with the same 3 fields = duplicate | Must keep the seen keys within the window → memory |
| Dedup window | Only check for duplicates within a time range (e.g. 5 minutes) | Duplicates farther apart than the window are missed |
| Bloom filter | Probabilistic "have I seen this?" with a low false-positive rate | Saves memory, but false positives drop some genuine events |
| External dedup store (Redis) | Store dedup keys in Redis with a TTL | More accurate than a Bloom filter, but adds latency and a dependency |

Conclusion: use a Bloom filter inside the Flink operator with a 5-minute dedup window. A ~0.1% false-positive rate is acceptable (wrongly dropping 1 in 1,000 genuine clicks — insignificant against 1 billion clicks/day).

For cases that need absolute accuracy (billing), use Redis-based dedup with the key {ad_id}:{user_id}:{minute_bucket} and a 10-minute TTL.
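
A minimal sketch of that dedup key with a time-bucketed minute and a TTL, using an in-memory dict in place of Redis (in Redis this would be SET key 1 NX EX 600); the expiry sweep and names are illustrative:

```python
import time

DEDUP_TTL_SEC = 600            # 10-minute TTL, as in the Redis option above
seen = {}                      # dedup_key -> expiry timestamp

def is_duplicate(ad_id: str, user_id: str, click_ts: int, now=None) -> bool:
    """Return True if this click was already seen inside the dedup window."""
    now = time.time() if now is None else now
    # Drop expired keys (Redis would do this automatically via TTL).
    for k in [k for k, exp in seen.items() if exp <= now]:
        del seen[k]
    minute_bucket = click_ts // 60
    key = f"{ad_id}:{user_id}:{minute_bucket}"
    if key in seen:
        return True
    seen[key] = now + DEDUP_TTL_SEC
    return False

print(is_duplicate("ad_001", "user_abc123", 1678886405))   # False — first time
print(is_duplicate("ad_001", "user_abc123", 1678886430))   # True — same minute bucket
print(is_duplicate("ad_001", "user_abc123", 1678886475))   # False — next minute bucket
```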

2.3.8 Time Window Types

Flink (and other streaming frameworks) supports several kinds of time window:

Tumbling Window

| Property | Value |
|---|---|
| Definition | Fixed-size, non-overlapping windows |
| Example | 1-minute windows: [14:00-14:01), [14:01-14:02), [14:02-14:03) |
| When to use | Ad click aggregation — count clicks per minute, per hour |
| Characteristics | Each event belongs to exactly one window. Simplest and most efficient |

Sliding Window

| Property | Value |
|---|---|
| Definition | Fixed-size windows that overlap |
| Example | 5-minute window sliding every minute: [14:00-14:05), [14:01-14:06), [14:02-14:07) |
| When to use | When you need a moving average — e.g. "clicks in the last 5 minutes" refreshed every minute |
| Characteristics | Each event belongs to several windows. Uses more memory than tumbling |

Session Window

| Property | Value |
|---|---|
| Definition | The window closes when no new event arrives within a gap |
| Example | Session gap of 30 minutes: events at 14:00, 14:05, 14:10 form one session. An event at 14:50 starts a new session |
| When to use | User behavior analysis — one browsing session of a user |
| Characteristics | Window size is dynamic, not fixed |

For the ad click problem: use tumbling windows of 1 minute, 5 minutes, and 1 hour. This is the most natural choice because:

  • Each click belongs to exactly one window → simple, no duplicate counting
  • A 5-minute window = the sum of five 1-minute windows → can be computed from the smaller windows
  • A 1-hour window = the sum of sixty 1-minute windows → the cascade continues

Aha Moment: you do not need three separate pipelines for three window sizes. Aggregate only the 1-minute window (the smallest), then cascade the aggregation: five 1-minute windows → one 5-minute window, twelve 5-minute windows → one 1-hour window. This saves a significant amount of compute and storage. (A small sketch follows below.)
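
Here is that sketch: roll finalized 1-minute counts up into 5-minute and 1-hour windows by re-bucketing the window start. Unique counts would be rolled up the same way by merging HLL sketches instead of summing:

```python
from collections import defaultdict

# Finalized 1-minute results: (ad_id, window_start_epoch_sec) -> click_count
one_min = {
    ("ad_001", 1678886400): 120, ("ad_001", 1678886460): 80,
    ("ad_001", 1678886520): 95,  ("ad_001", 1678886580): 110,
    ("ad_001", 1678886640): 60,
}

def cascade(counts: dict, to_sec: int) -> dict:
    """Roll smaller windows up into larger ones by re-bucketing the window start."""
    rolled = defaultdict(int)
    for (ad_id, start), count in counts.items():
        big_start = start - (start % to_sec)
        rolled[(ad_id, big_start)] += count
    return dict(rolled)

five_min = cascade(one_min, 300)    # 5 x 1-min  -> 1 x 5-min
one_hour = cascade(five_min, 3600)  # 12 x 5-min -> 1 x 1-hour
print(five_min)   # {('ad_001', 1678886400): 465}
print(one_hour)   # {('ad_001', 1678885200): 465}
```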

2.3.9 Scaling — Handling large traffic

Kafka Partitioning

| Strategy | Description | Pros/Cons |
|---|---|---|
| Partition by ad_id | hash(ad_id) % num_partitions | All clicks of one ad land on the same partition → aggregation is simple, BUT hot ads create hot partitions |
| Partition by ad_id + minute | hash(ad_id + minute_bucket) | Spreads load better, but needs a reduce step to merge |
| Random partition | Round-robin | Most even distribution, but aggregation must shuffle data (expensive) |

Conclusion: use partition-by-ad_id as the default. Handle hot partitions separately (Section 2.3.12).

Flink Parallelism

| Component | Parallelism | Reason |
|---|---|---|
| Source (Kafka consumer) | = number of Kafka partitions | Each consumer reads one partition |
| Map operator | = number of Kafka partitions | 1:1 with the source |
| Aggregate operator | Depends on throughput | keyBy(ad_id) distributes automatically |
| Sink (Kafka producer) | Depends on throughput | Can be smaller than the source |

When you need to scale:

  1. Increase Kafka partitions (e.g. from 100 → 200)
  2. Increase Flink parallelism accordingly
  3. Increase the number of consumer group instances
  4. Flink auto-scaling (reactive mode) — supported since Flink 1.13+

Aggregation Node Horizontal Scaling

Flink automatically distributes state by key group. When parallelism increases:

  • The state of key group A is migrated from node 1 to node 2
  • During the migration there is a brief pause (managed via checkpoint/restore)
  • No manual resharding is needed, unlike with a database

2.3.10 Fault Tolerance

| Mechanism | Level | Description |
|---|---|---|
| Flink Checkpoints | Streaming | Periodic snapshot of all operator state + Kafka offsets. Automatic recovery |
| Flink Savepoints | Streaming | Manual checkpoints for planned maintenance (upgrades, config changes) |
| Kafka Consumer Group Rebalancing | Messaging | When a consumer dies, its partitions are redistributed to the remaining consumers |
| Kafka Replication | Messaging | Each partition has 3 replicas — tolerates 2 broker failures |
| DB Replication | Storage | Cassandra/ClickHouse replication factor 3 |
| End-to-end exactly-once | Whole system | Flink checkpoints + Kafka transactions + idempotent DB writes |

Failure scenarios and recovery:

| Scenario | System reaction |
|---|---|
| One Flink TaskManager crashes | JobManager detects it (heartbeat timeout) → redeploys tasks → restores from the last checkpoint |
| Flink JobManager crashes | A standby JobManager takes over (HA mode with ZooKeeper) → restores the job from the last checkpoint |
| One Kafka broker crashes | ISR (In-Sync Replicas) → leader election → consumers reconnect → continue from the last committed offset |
| DB node crashes | Replication → queries route to a replica → repair the node later |
| Network partition (Flink ↔ Kafka) | Consumer timeout → rebalance → reconnect → replay from the last offset |
| Entire data center failure | Cross-DC replication (Kafka MirrorMaker, Cassandra multi-DC) → fail over to another DC |

2.3.11 Reconciliation — Verify Streaming Results

Even with Flink's exactly-once, reconciliation is still needed because of:

  • Bugs in the aggregation logic
  • Clock skew between servers
  • Edge cases in watermark/late-event handling
  • Data corruption

Reconciliation Flow

| Step | Description |
|---|---|
| 1. A batch job runs every hour | Reads raw events from Cassandra/S3 for the window that just closed |
| 2. Re-aggregate | Recomputes click_count and unique_count with batch processing (Spark) |
| 3. Compare | Compares the batch result with the streaming result in the aggregated DB |
| 4. Detect drift | If the difference exceeds a threshold (e.g. 0.1%) → alert |
| 5. Auto-correct (optional) | Overwrite the streaming result with the batch result (batch is the source of truth) |

Drift thresholds:

| Drift | Severity | Action |
|---|---|---|
| Drift < 0.01% | Normal | Log only |
| Drift 0.01% - 0.1% | Warning | Alert the team |
| Drift > 0.1% | Critical | Alert + auto-correct + investigate the root cause |

Aha Moment: reconciliation is a safety net, not the primary mechanism. In a healthy system, reconciliation always matches (drift ≈ 0%). If drift is regularly > 0, there is a bug in the streaming pipeline.
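
A minimal sketch of steps 3 and 4: compare the streaming results against the batch-recomputed results per (ad_id, window) and classify the drift using the thresholds from the table above:

```python
def reconcile(streaming: dict, batch: dict, warn=0.0001, critical=0.001):
    """Yield (key, drift, severity) for every window where the two counts differ."""
    for key, batch_count in batch.items():
        stream_count = streaming.get(key, 0)
        if batch_count == 0:
            continue
        drift = abs(stream_count - batch_count) / batch_count
        if drift > critical:
            yield key, drift, "CRITICAL"   # alert + auto-correct + investigate
        elif drift > warn:
            yield key, drift, "WARNING"    # alert the team
        elif drift > 0:
            yield key, drift, "INFO"       # log only

streaming = {("ad_001", 1678886400): 45231, ("ad_002", 1678886400): 9800}
batch     = {("ad_001", 1678886400): 45231, ("ad_002", 1678886400): 10000}
for key, drift, severity in reconcile(streaming, batch):
    print(severity, key, f"{drift:.2%}")   # ad_002 drifts by 2.00% -> CRITICAL
```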

2.3.12 Hot Shard Problem — Viral ads

Problem: a viral ad (e.g. a Super Bowl ad) can receive 100x-1000x normal traffic. Because we partition by ad_id, all clicks for this ad land on one Kafka partition and one Flink subtask → hot shard → bottleneck.

| Symptom | Consequence |
|---|---|
| One Kafka partition receives far too much data | Consumer lag grows, latency grows |
| One Flink subtask uses too much CPU/memory | Backpressure, checkpoint timeouts |
| The other partitions/subtasks sit idle | Wasted resources |

Solution: Salted Keys + Secondary Aggregation

| Step | Description |
|---|---|
| 1. Detect the hot key | Monitor click rate per ad_id. If it exceeds a threshold (e.g. 10x the average) → mark it as hot |
| 2. Salt the key | Instead of partitioning by ad_id, partition by ad_id + random(0..N), with N = the desired number of shards |
| 3. First aggregation | Each shard aggregates independently → N partial results for one ad_id |
| 4. Secondary aggregation | One reducer merges the N partial results → the final result for that ad_id |

Example with ad_001 as the hot key, N=10:

  • Instead of one partition receiving every click for ad_001
  • 10 partitions receive clicks: ad_001_0, ad_001_1, …, ad_001_9
  • 10 aggregation nodes count independently
  • 1 reduce node sums the 10 partial counts

Trade-off:

  • Pro: traffic is spread evenly, no more hot shard
  • Con: one extra aggregation step, more complex logic, slightly higher latency
  • Con: unique counts (HLL) must be merged — but HLL merges cheaply, so this is not a problem

Aha Moment: the hot shard problem is not specific to this system. It appears whenever you partition by key and one key receives disproportionate traffic. The "salted keys + secondary aggregation" approach is a general pattern you can reuse in many other problems.
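
A minimal Python sketch of salting plus secondary aggregation. Hot ad_ids are fanned out across N salted keys (so they spread over partitions), then the partial counts are merged back by stripping the salt; partitioning and counting are reduced to plain dicts for illustration:

```python
import random
from collections import Counter

NUM_SALTS = 10
HOT_ADS = {"ad_001"}            # detected hot keys (e.g. >10x the average click rate)

def partition_key(ad_id: str) -> str:
    """Salt only the hot keys; normal keys keep the plain ad_id for data locality."""
    if ad_id in HOT_ADS:
        return f"{ad_id}_{random.randrange(NUM_SALTS)}"
    return ad_id

# First aggregation: count per (possibly salted) partition key.
clicks = ["ad_001"] * 100_000 + ["ad_002"] * 10
partial = Counter(partition_key(ad_id) for ad_id in clicks)

# Secondary aggregation: strip the salt suffix and merge the partial counts.
final = Counter()
for key, count in partial.items():
    is_salted = any(key.startswith(hot + "_") for hot in HOT_ADS)
    base = key.rsplit("_", 1)[0] if is_salted else key
    final[base] += count

print(final["ad_001"], final["ad_002"])   # 100000 10
```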


3. Back-of-the-Envelope Estimation

3.1 Traffic Estimation

Assumptions:

  • 1 billion clicks/day
  • Peak traffic = 5x average

Average: 10^9 / 86,400 ≈ 11,574 clicks/sec. Peak: ≈ 5 × 11,574 ≈ 58K clicks/sec.

3.2 Kafka Throughput

Raw event size: ~1 KB (ad_id + timestamp + user_id + ip + country + metadata)

Kafka partition estimation:

  • Each partition handles ~10 MB/sec (conservative)
  • Throughput: ~11,574 events/sec × 1 KB ≈ 11 MB/s on average, ≈ 57 MB/s at peak → roughly 6 partitions is the bare minimum at peak
  • In practice use 100-200 partitions to scale consumers and give Flink enough parallelism

Kafka retention:

  • Raw topic retention: 7 days
  • Storage per day: 10^9 events × 1 KB ≈ 1 TB/day
  • 7-day retention: ≈ 7 TB
  • With replication factor 3: ≈ 21 TB of Kafka cluster storage

3.3 Raw Data Storage

  • Cassandra (7-day TTL): ~1 TB/day × 7 ≈ 7 TB
  • S3 archive: ~365 TB/year of raw events, roughly 73 TB/year after Parquet compression

3.4 Aggregated Data Storage

Assumptions:

  • 2 million active ads
  • Aggregation windows: 1 minute, 5 minutes, 1 hour
  • Each aggregated record: ~100 bytes (ad_id + window + counts)

Worst case for 1-minute windows: 2M ads × 1,440 windows/day × ~100 bytes ≈ 288 GB/day ≈ 105 TB/year. With ClickHouse's ~10:1 compression ratio: ≈ 10.5 TB/year.

3.5 Flink State Memory

HyperLogLog for unique counts: 2M ads × ~12 KB per counter ≈ 24 GB

Partial aggregation state (1-minute tumbling window): 2M ads × ~100 bytes ≈ 0.2 GB

Total Flink state memory: ~25 GB — spread across many TaskManagers, entirely feasible.
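
The arithmetic above as a small Python script, using the stated assumptions (decimal units, deliberately rounded):

```python
CLICKS_PER_DAY = 1_000_000_000
PEAK_FACTOR = 5
EVENT_SIZE_BYTES = 1_000          # ~1 KB per raw event
ACTIVE_ADS = 2_000_000
AGG_RECORD_BYTES = 100

avg_qps = CLICKS_PER_DAY / 86_400                                  # ~11,574 clicks/sec
peak_qps = avg_qps * PEAK_FACTOR                                   # ~57,870 clicks/sec
avg_mb_s = avg_qps * EVENT_SIZE_BYTES / 1e6                        # ~11.6 MB/s into Kafka
peak_mb_s = peak_qps * EVENT_SIZE_BYTES / 1e6                      # ~58 MB/s at peak
raw_tb_day = CLICKS_PER_DAY * EVENT_SIZE_BYTES / 1e12              # ~1 TB of raw events/day
kafka_tb = raw_tb_day * 7 * 3                                      # 7-day retention, RF=3 -> 21 TB
agg_tb_year = ACTIVE_ADS * 1440 * AGG_RECORD_BYTES * 365 / 1e12    # ~105 TB/year uncompressed
hll_gb = ACTIVE_ADS * 12_000 / 1e9                                 # ~24 GB of HLL counters

print(f"{avg_qps:,.0f} avg qps / {peak_qps:,.0f} peak qps")
print(f"{avg_mb_s:.0f} MB/s avg / {peak_mb_s:.0f} MB/s peak, {kafka_tb:.0f} TB in Kafka")
print(f"{agg_tb_year:.0f} TB/year aggregated raw, ~{agg_tb_year/10:.1f} TB/year at 10:1 compression")
print(f"{hll_gb:.0f} GB of HyperLogLog state")
```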

3.6 Summary

| Resource | Value |
|---|---|
| Traffic | 11.5K avg / 58K peak clicks/sec |
| Kafka cluster storage | 21 TB (7-day retention, RF=3) |
| Kafka throughput | 11 MB/s avg / 57 MB/s peak |
| Kafka partitions | 100-200 |
| Raw data (Cassandra) | 7 TB (7-day TTL) |
| Raw data (S3 archive) | 73 TB/year (compressed) |
| Aggregated data (ClickHouse) | 10.5 TB/year (compressed) |
| Flink state memory | ~25 GB (distributed) |
| Flink TaskManagers | 10-20 (depending on CPU/memory spec) |

4. Security & Data Privacy

4.1 Click Fraud Detection

Click fraud is the biggest threat to an advertising system. If it goes undetected, advertisers lose money and the platform loses credibility.

| Fraud type | Description | Tell-tale signs |
|---|---|---|
| Bot clicks | Automated scripts click the ad continuously | Abnormal click rate from one IP, no mouse movement/scrolling |
| Click farms | Groups of people paid to click | Many clicks from the same geo location, identical patterns |
| Competitor clicking | A competitor clicks your ads to drain your budget | Clicks from the same IP range, no conversions |
| Ad stacking | Several ads stacked on top of each other; one click is counted for all of them | Identical click coordinates across many ads |
| Pixel stuffing | The ad is rendered at 1x1 pixel; nobody sees it but impressions are counted | Impressions do not correlate with clicks |

Anti-Fraud Strategies

| Strategy | Description | When applied |
|---|---|---|
| IP-based rate limiting | Limit clicks from one IP per ad per time window | Real-time, at the ingestion layer |
| User-agent fingerprinting | Detect bots from browser fingerprints | Real-time, at the ingestion layer |
| Click-through rate (CTR) anomaly | Abnormally high CTR for an ad → suspicious | Near-real-time, at the aggregation layer |
| Geo anomaly | Clicks from countries that do not match the target audience | Near-real-time |
| Session analysis | Analyze behavior after the click (bounce rate, time on page) | Batch analysis |
| Machine learning model | Train a model to detect fraud patterns | Batch + real-time inference |

Implementation: fraud detection is not done inside the main aggregation pipeline. Instead (see the rate-limiting sketch after this list):

  1. Raw events flow through a fraud detection service (in parallel with aggregation)
  2. The fraud service flags suspicious clicks
  3. Flagged clicks are not counted in the aggregation (or are counted separately)
  4. Advertisers can dispute and review flagged clicks
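
As referenced above, a minimal sketch of the simplest real-time check — IP-based rate limiting per (ip, ad_id) in a fixed one-minute window. The threshold of 10 clicks is an assumption; a production system would keep these counters in Redis or inside the stream processor:

```python
from collections import defaultdict

WINDOW_SEC = 60
MAX_CLICKS_PER_WINDOW = 10        # per (ip, ad_id); the threshold is an assumption

counters = defaultdict(int)       # (ip, ad_id, window_start) -> clicks seen

def is_suspicious(ip: str, ad_id: str, event_time: int) -> bool:
    """Flag the click if this IP already hit the per-ad limit in the current window."""
    window_start = event_time - (event_time % WINDOW_SEC)
    key = (ip, ad_id, window_start)
    counters[key] += 1
    return counters[key] > MAX_CLICKS_PER_WINDOW

flags = [is_suspicious("203.113.152.4", "ad_001", 1678886400 + i) for i in range(12)]
print(flags.count(True))   # 2 — the 11th and 12th clicks in the same minute get flagged
```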

4.2 User Privacy

| Principle | Description |
|---|---|
| Aggregate only | Reports contain only aggregated numbers (click_count), NEVER individual user data |
| No PII in reports | user_id and ip never appear in aggregated results |
| PII only for dedup | user_id is used only within the dedup window (5 minutes), then discarded |
| IP hashing | IP addresses are hashed before storage (raw IPs are not kept) |
| Data minimization | Collect only the data needed for aggregation, nothing more |
| GDPR/CCPA compliance | Users can request deletion. The 7-day raw-data TTL supports this |
| Consent-based tracking | Track clicks only when the user has consented (cookie consent banner) |

4.3 Data Retention Policies

| Data type | Retention | Reason | Storage |
|---|---|---|---|
| Raw click events | 7 days (hot) + 90 days (archive) | Reconciliation + dispute resolution | Cassandra → S3 |
| Aggregated data (1-min) | 1 year | Dashboard + reporting | ClickHouse |
| Aggregated data (1-hour) | 3 years | Long-term analytics | ClickHouse (compressed) |
| Aggregated data (1-day) | Forever | Historical trends | ClickHouse (cold storage) |
| Dedup keys | 10 minutes | Dedup window | Redis (TTL) |
| Fraud detection logs | 1 year | Audit + disputes | S3 |

5. DevOps & Monitoring

5.1 Key metrics to monitor

Kafka Metrics

| Metric | Description | Alert threshold |
|---|---|---|
| Consumer lag | Messages not yet consumed | > 100K messages → Warning, > 1M → Critical |
| Under-replicated partitions | Partitions missing replicas | > 0 → Warning |
| Broker disk usage | Disk usage of a Kafka broker | > 80% → Warning, > 90% → Critical |
| Producer error rate | Error rate when producing messages | > 0.1% → Warning |
| Request latency (p99) | Latency of produce/consume requests | > 100ms → Warning |
Flink Metrics

| Metric | Description | Alert threshold |
|---|---|---|
| Checkpoint duration | Time to complete one checkpoint | > 30s → Warning, > 60s → Critical |
| Checkpoint failure rate | Rate of failed checkpoints | > 0 within 1 hour → Critical |
| Backpressure | An operator is slow, buffers are full | isBackPressured = true for > 5 minutes → Warning |
| Records per second | Throughput of the Flink job | Drop > 50% below baseline → Warning |
| State size | Size of the operators' state | Growth > 2x baseline → Warning |
| Late events dropped | Number of late events dropped | > 1% of total events → Warning |
| TaskManager failures | Number of TaskManager crashes | > 0 → Alert |

Data Freshness & Accuracy

| Metric | Description | Alert threshold |
|---|---|---|
| End-to-end latency | Time from the click happening to the data being queryable in the DB | > 5 min → Warning, > 10 min → Critical |
| Data freshness | Gap between "now" and the newest record's timestamp in the DB | > 3 min → Warning |
| Reconciliation drift | % difference between the streaming and batch results | > 0.01% → Warning, > 0.1% → Critical |
| Query latency (p99) | Latency of the query service | > 500ms → Warning, > 2s → Critical |

Database Metrics

| Metric | Description | Alert threshold |
|---|---|---|
| Write latency (p99) | Write latency into ClickHouse/Cassandra | > 100ms → Warning |
| Disk usage | Storage usage | > 75% → Warning |
| Compaction lag (Cassandra) | Pending compactions | > 10 → Warning |
| Query queue size (ClickHouse) | Queries waiting | > 50 → Warning |

5.2 Alerting Strategy

| Severity | Example | Notification | Response time |
|---|---|---|---|
| P1 - Critical | Consumer lag > 1M, checkpoint failures, data freshness > 10 min | PagerDuty (phone call) | < 15 minutes |
| P2 - Warning | Consumer lag > 100K, reconciliation drift > 0.01%, high backpressure | Slack + PagerDuty (SMS) | < 1 hour |
| P3 - Info | Traffic spikes, scheduled maintenance | Slack | Next business day |

5.3 Runbook — Common Issues

| Issue | Common causes | How to respond |
|---|---|---|
| Consumer lag spikes suddenly | Traffic spike (viral ad), slow consumer, GC pause | Scale Flink parallelism, check backpressure, check GC logs |
| Checkpoint failures | State too large, slow storage, network issues | Increase the checkpoint timeout, check state size, check HDFS/S3 health |
| Reconciliation drift | Bug in the aggregation logic, clock skew, late events | Compare raw events with aggregated results, check the watermark config |
| High query latency | ClickHouse overloaded, missing index, large result set | Add a materialized view, optimize the query, scale read replicas |
| Kafka broker down | Disk full, hardware failure, network partition | Check the disk, fail over to a replica, add a new broker |

5.4 Deployment Strategy

| Strategy | Description | When to use |
|---|---|---|
| Blue-green deployment | Run 2 Flink jobs in parallel (old + new), compare results, switch traffic | Major logic changes |
| Canary deployment | Route 5% of traffic to the new version, monitor metrics | Minor changes |
| Savepoint + restart | Take a savepoint of the Flink job, deploy the new version, restore from the savepoint | Config changes, minor updates |
| A/B testing | 2 pipelines process the same data, compare results | Testing new aggregation logic |

6. Mermaid Diagrams — Overview

6.1 High-Level Architecture (Complete)

flowchart TB
    subgraph "Clients"
        BROWSER["User Browsers"]
        APP["Mobile Apps"]
    end

    subgraph "Ingestion Layer"
        LB["Load Balancer<br/>(L7)"]
        IS["Ingestion Service Cluster<br/>(Stateless, Auto-scale)"]
    end

    subgraph "Message Queue Layer"
        KR["Kafka — Raw Events Topic<br/>(100+ partitions, RF=3)"]
    end

    subgraph "Processing Layer"
        FRAUD["Fraud Detection<br/>Service (Async)"]
        FLINK["Apache Flink Cluster<br/>Map → Aggregate → Reduce"]
    end

    subgraph "Storage Layer"
        RAW_CASS[("Cassandra<br/>Raw Events<br/>TTL 7 days")]
        RAW_S3[("S3 Archive<br/>Raw Events<br/>Parquet")]
        AGG_CH[("ClickHouse<br/>Aggregated Data")]
    end

    subgraph "Serving Layer"
        QS["Query Service"]
        REDIS[("Redis Cache")]
    end

    subgraph "Reconciliation"
        SPARK["Spark Batch Job<br/>(Hourly)"]
    end

    subgraph "Consumers"
        DASH["Advertiser Dashboard"]
        BILLING["Billing Service"]
        ANALYTICS["Analytics Platform"]
    end

    BROWSER & APP --> LB
    LB --> IS
    IS --> KR
    KR --> FRAUD
    KR --> FLINK
    KR --> RAW_CASS
    RAW_CASS -.->|"Archive"| RAW_S3
    FLINK --> AGG_CH
    AGG_CH --> QS
    QS --> REDIS
    QS --> DASH & BILLING & ANALYTICS

    RAW_S3 -.-> SPARK
    AGG_CH -.-> SPARK

    style KR fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style FLINK fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style AGG_CH fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff
    style FRAUD fill:#f44336,stroke:#333,stroke-width:2px,color:#fff
    style SPARK fill:#9C27B0,stroke:#333,stroke-width:2px,color:#fff

6.2 Exactly-Once End-to-End Flow (Detailed)

sequenceDiagram
    participant Kafka as Kafka (Source)
    participant Flink as Flink Operator
    participant State as Flink State (RocksDB)
    participant CM as Checkpoint Manager
    participant Storage as Durable Storage (HDFS)
    participant Sink as Kafka (Sink) + DB

    Note over Kafka,Sink: Normal Processing
    Kafka->>Flink: Event batch (offset 100-200)
    Flink->>State: Update aggregation state
    Flink->>Sink: Emit partial result (uncommitted)

    Note over CM,Storage: Checkpoint Triggered (every 30s)
    CM->>Flink: Inject checkpoint barrier
    Flink->>State: Snapshot current state
    State->>Storage: Persist state snapshot
    Flink->>CM: Report: offset=200, state=snapshot_42
    CM->>Kafka: Commit consumer offset=200
    CM->>Sink: Commit Kafka transaction (flush writes to DB)

    Note over Kafka,Sink: Failure & Recovery
    Flink->>Flink: TaskManager CRASH!
    CM->>Storage: Load last checkpoint (snapshot_42, offset=200)
    CM->>Kafka: Reset consumer to offset=200
    CM->>Flink: Restore state from snapshot_42
    Kafka->>Flink: Replay events from offset 200
    Flink->>State: Re-process (same state → same result)
    Flink->>Sink: Re-emit results (idempotent write → no duplicate)

6.3 Lambda vs Kappa Architecture Comparison

flowchart TB
    subgraph "Lambda Architecture"
        direction TB
        L_SRC["Events"] --> L_BATCH["Batch Layer<br/>(Spark, runs hourly)"]
        L_SRC --> L_SPEED["Speed Layer<br/>(Flink, real-time)"]
        L_BATCH --> L_BV["Batch View<br/>(accurate, stale)"]
        L_SPEED --> L_RV["Real-time View<br/>(fast, approximate)"]
        L_BV --> L_MERGE["Merge"]
        L_RV --> L_MERGE
        L_MERGE --> L_QS["Query"]

        style L_BATCH fill:#4CAF50,color:#fff
        style L_SPEED fill:#FF9800,color:#fff
        style L_MERGE fill:#f44336,color:#fff
    end

    subgraph "Kappa Architecture (Preferred)"
        direction TB
        K_SRC["Events"] --> K_KAFKA["Kafka<br/>(Immutable Log)"]
        K_KAFKA --> K_STREAM["Stream Layer<br/>(Flink, real-time<br/>exactly-once)"]
        K_STREAM --> K_DB["Serving DB"]
        K_DB --> K_QS["Query"]
        K_KAFKA -.->|"Replay<br/>if needed"| K_STREAM
        K_DB -.-> K_RECON["Reconciliation<br/>(Safety net)"]

        style K_KAFKA fill:#FF9800,color:#fff
        style K_STREAM fill:#4CAF50,color:#fff
        style K_RECON fill:#9E9E9E,color:#fff
    end

6.4 Hot Shard Solution — Salted Keys

flowchart TB
    subgraph "Problem: Hot Shard"
        direction LR
        HOT_AD["ad_001 (viral)<br/>100K clicks/sec"] --> P1_HOT["Partition 0<br/>OVERLOADED"]
        NORMAL1["ad_002<br/>10 clicks/sec"] --> P2["Partition 1<br/>idle"]
        NORMAL2["ad_003<br/>15 clicks/sec"] --> P3["Partition 2<br/>idle"]

        style P1_HOT fill:#f44336,color:#fff
    end

    subgraph "Solution: Salted Keys + Secondary Aggregation"
        direction TB
        SALT["ad_001 → ad_001_0, ad_001_1, ..., ad_001_9"]

        subgraph "First Aggregation (Parallel)"
            S0["ad_001_0<br/>count=10K"]
            S1["ad_001_1<br/>count=10K"]
            S9["ad_001_9<br/>count=10K"]
        end

        subgraph "Secondary Aggregation (Merge)"
            REDUCE["Reduce Node<br/>10K x 10 = 100K"]
        end

        SALT --> S0 & S1 & S9
        S0 & S1 & S9 --> REDUCE

        style SALT fill:#FF9800,color:#fff
        style REDUCE fill:#4CAF50,color:#fff
    end

7. Aha Moments & Pitfalls

7.1 Aha Moments — things Hieu should remember

| # | Insight | Explanation |
|---|---|---|
| 1 | Kappa > Lambda in most cases | Lambda exists because streaming frameworks were not mature. With Flink exactly-once, Kappa is simpler and accurate enough. Just add a reconciliation batch job as a safety net — that is not Lambda |
| 2 | The watermark is the hardest concept | It embodies the trade-off between latency and completeness. There is no "correct" value — only a value that fits the business. The only way to really understand watermarks: practice with Flink |
| 3 | A hot shard can kill performance | One viral ad → one overloaded partition → the whole pipeline slows down (backpressure). Salted keys are the fix, at the cost of complexity. Monitor click rate per ad_id to detect it early |
| 4 | Reconciliation is a mandatory safety net | Even with streaming exactly-once, a batch job must verify the results. Logic bugs, clock skew, edge cases — all can introduce small errors. Reconciliation catches them before the advertiser does |
| 5 | Exactly-once does not mean "processed once" | It means "the result is as if processed once". Events may be replayed (after a checkpoint restore), but the result does not change thanks to idempotent writes. Misunderstanding this → wrong design |
| 6 | Cascade aggregation saves compute | Aggregate the 1-minute window, then build 5-minute from 5x 1-minute and 1-hour from 12x 5-minute. No need for three separate pipelines — one pipeline + cascading |
| 7 | HyperLogLog is the unsung hero | Counts unique users in ~12 KB of memory per counter with < 1% error. Without HLL, unique counting at this scale is nearly impossible with reasonable memory |
| 8 | Raw data is your insurance | Always store raw events (even though it is expensive). When the logic is wrong, when you need to re-process, when an advertiser disputes a bill — raw data is the only source of truth |

7.2 Pitfalls — common mistakes

| # | Pitfall | Consequence | How to avoid |
|---|---|---|---|
| 1 | Using processing time instead of event time | A click at 13:59 gets counted in the 14:00 window | Always use event time for aggregation. Processing time is for system monitoring only |
| 2 | Not handling late events | Late clicks are lost → under-counting → advertiser complaints | Configure allowed lateness, use a side output for late events |
| 3 | No reconciliation | A streaming bug corrupts the results and nobody notices | Run batch reconciliation hourly, alert when drift exceeds the threshold |
| 4 | Partitioning by a random key | Every aggregation node needs data for every ad → too much shuffling | Partition by ad_id for data locality. Only salt when hot keys demand it |
| 5 | Checkpoint interval too long | On failure, many events must be replayed → slow recovery, large duplicate window | Checkpoint every 30s-1 minute. Balance overhead against recovery time |
| 6 | Not monitoring consumer lag | The pipeline falls behind unnoticed → stale data → advertisers see old numbers | Consumer lag is the #1 metric to monitor. Alert as soon as it grows abnormally |
| 7 | Fraud detection as a single point of failure | Fraud service down → all clicks accepted → money lost | Make fraud detection async; never block the aggregation pipeline. Fallback: flag and review later |
| 8 | Exact counting for unique counts | O(n) memory per ad per window → memory explosion with 2M ads | Use HyperLogLog. Reserve exact counting for billing-critical use cases |
| 9 | Not storing raw data | Cannot re-process when the logic is wrong, cannot reconcile | Always store raw events — Cassandra (7 days) + S3 (archive) |
| 10 | Scaling vertically instead of horizontally | Hardware limits, single point of failure | Kafka partitions + Flink parallelism for horizontal scaling. Vertical only for DB tuning |

7.3 Interview Tips

| Tip | Explanation |
|---|---|
| Start with requirements | Ask clearly: how many clicks/day, latency target, required accuracy. Do not jump straight to a solution |
| Talk about trade-offs | Lambda vs Kappa, HLL vs exact count, Cassandra vs ClickHouse — the interviewer wants to hear you weigh options, not just pick one |
| Diagram first, details later | High-level flow: Ingestion → Queue → Aggregation → DB → Query. Then dive into each component |
| Mention exactly-once early | It is the hardest requirement of the problem. Explaining Flink checkpoints + idempotent writes makes a strong impression |
| Hot shards are bonus points | Many candidates never think of it. Raising it yourself, with the salted-keys solution, tells the interviewer you have real production experience |
| Reconciliation shows seniority | A junior engineer thinks streaming is enough. A senior engineer knows a safety net is needed. Mentioning the reconciliation batch job signals production experience |

8. Summary Table

| Aspect | Decision | Reason |
|---|---|---|
| Architecture | Kappa (streaming only + reconciliation) | Simpler than Lambda, Flink exactly-once is mature |
| Message queue | Apache Kafka | Durable, partitioned, replayable |
| Streaming engine | Apache Flink | Exactly-once, stateful processing, checkpoint/savepoint |
| Raw data store | Cassandra (7 days) + S3 (archive) | Write-heavy, time-series, cost-effective |
| Aggregated data store | ClickHouse | OLAP-optimized, fast aggregation queries |
| Partitioning strategy | By ad_id (salted for hot keys) | Data locality for aggregation |
| Time window | Tumbling 1-min, cascaded to 5-min and 1-hour | Simple, no overlap, efficient |
| Unique counting | HyperLogLog | Memory efficient, mergeable, ~1% error OK |
| Deduplication | Bloom filter (5-min window) | Memory efficient, low false-positive rate |
| Late events | Watermark + 5-min allowed lateness + side output | Balances latency and completeness |
| Reconciliation | Hourly Spark batch job | Safety net, detects drift |
| Fraud detection | Async, parallel pipeline | Does not block aggregation |

Prerequisites (already covered)

Related (cross-references)

Concepts to explore further

  • Apache Flink — stateful stream processing, checkpointing, exactly-once semantics
  • ClickHouse — column-oriented OLAP database, materialized views
  • HyperLogLog — probabilistic cardinality estimation
  • Kafka Streams vs Flink — when to use which
  • Event sourcing — store all events and derive state from them (Kafka as the event log)

"A good system is not one that never fails — it is one that knows when it fails and can correct itself. Reconciliation is exactly that self-correction mechanism."