Case Study: Design a Distributed Message Queue

“The message queue is the backbone of every distributed system. If you only understand how to use one, you are merely a user — understand how to design one and you become an architect.”

Tags: system-design message-queue kafka distributed-log replication alex-xu-vol2 case-study Student: Hieu Source: Alex Xu — System Design Interview Volume 2, Chapter 4 Prerequisite: Tuan-01-Scale-From-Zero-To-Millions · Tuan-02-Back-of-the-envelope · Tuan-08-Message-Queue Related: Tuan-07-Database-Sharding-Replication · Tuan-10-Consistent-Hashing · Tuan-13-Monitoring-Observability · Case-Design-Payment-System


1. Context & Why — Why design a Message Queue from scratch?

1.1 Analogy: A national postal system

Hieu, imagine you are tasked with designing a national postal system from scratch. Not just placing a mailbox on a street corner — the entire system: from the moment a sender drops a letter into the box until the recipient holds it in their hands.

This system has to solve the following problems:

  1. Ingestion: Millions of people send letters every day. Every post office must accept mail quickly, without making senders queue for long. Analogous to producers sending messages to a broker.

  2. Routing: Mail addressed to Hanoi must go to Hanoi, mail addressed to Saigon must go to Saigon. No mix-ups. Analogous to partitioning messages by key — all messages from the same customer must land in the same partition to preserve ordering.

  3. Storage: Before delivery, letters must be stored safely in a warehouse. If the warehouse burns down, copies must exist in another warehouse. Analogous to replication in a message queue — each partition has multiple replicas so no data is lost.

  4. Delivery: The mail carrier must deliver to the right person at the right address. If the recipient is not home, come back and try again later (retry). If delivery keeps failing, move the letter to the dead-mail warehouse (Dead Letter Queue). Analogous to consumers reading and processing messages.

  5. Durability: One lost letter = lost trust in the whole system. Analogous to delivery guarantees — at-least-once, exactly-once.

  6. Deduplication: The same letter must never be delivered to the recipient twice. Analogous to exactly-once semantics with an idempotent producer.

  7. Tracking: Senders want to know whether their letter has arrived. Analogous to offset tracking — the consumer knows how far it has read.

1.2 The from-scratch design problem

In Tuan-08-Message-Queue you learned how to use a message queue (Kafka, RabbitMQ) in system design. Now we will design a distributed message queue from the ground up — like Kafka, Pulsar, or Redpanda.

This is a hard problem because it combines many concepts:

Challenge | Explanation
High throughput | 1 million messages/sec — every message must be received, stored, and delivered
Durability | No message may be lost — even if a broker crashes
Ordering | Messages within the same partition must keep their order
Scalability | Add brokers and partitions as traffic grows
Fault tolerance | A broker dies → the system keeps working normally
Low latency | Producer sends → consumer receives within a few milliseconds
Replay capability | Consumers can re-read messages from any point in time
Exactly-once delivery | In many cases, duplicate processing is not allowed

1.3 Why not use something off the shelf?

A fair question: “Why not just use Kafka?”

  • Interview: This is a design exercise — the interviewer wants to see that you understand why Kafka is designed the way it is, not just how to use it.
  • In practice: Many large companies (LinkedIn, Uber, Confluent) have built or heavily customized their own message queues for specialized requirements.
  • Deeper understanding: Once you know how to design a message queue, you will use one better — you will know which parameters matter, when to pick at-least-once vs exactly-once, and how to tune performance.

Aha Moment: This case study complements Tuan-08-Message-Queue. Week 08 taught you to use an MQ as a component; this one teaches you to build an MQ from scratch. Two complementary perspectives — user vs architect.


2. Deep Dive — Alex Xu 4-Step Framework

Step 1: Requirements — Understand and scope the problem

2.1.1 Functional Requirements

Feature | Description
Producer API | Producers send messages to a specific topic. Supports batching multiple messages per request
Consumer API | Consumers read messages from a topic. Pull-based model (the consumer actively fetches data)
Topic management | Create, delete, list topics. Each topic has multiple partitions
Message retention | Keep messages for a period of time (7 days by default) or up to a size limit
Message replay | Consumers can re-read messages from any offset within the retention period
Consumer group | A group of consumers reading one topic together — each partition is assigned to exactly 1 consumer in the group
Delivery semantics | Supports at-most-once, at-least-once (default), and exactly-once
Message ordering | Ordering is guaranteed within a single partition

2.1.2 Non-Functional Requirements

Requirement | Target | Rationale
Throughput | 1,000,000 messages/sec (write) | Event streaming for large systems (e-commerce, fintech, IoT)
Latency | < 10ms P99 (end-to-end) | Producer send → consumer receive must be fast
Durability | No message loss even with 2 simultaneous broker failures | Data is an asset — a lost message is lost money
Availability | 99.99% uptime | The message queue is the backbone — if it dies, the whole system dies
Scalability | Scale horizontally by adding brokers | Traffic grows → add machines, no redesign needed
Storage efficiency | Store efficiently on disk | Messages are kept for days — disk is the bottleneck
Ordering guarantee | Messages within a partition keep their order | Many use cases require strict ordering (payment events, order state changes)
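Before moving on, a quick back-of-the-envelope check (in the spirit of Tuan-02-Back-of-the-envelope) shows what these targets imply for hardware. The 1 KB average message size and replication factor 3 are assumptions for the sketch, not part of the requirements table:

```python
# Back-of-the-envelope capacity estimate for the stated targets.
# Assumptions (not from the requirements): 1 KB average message, replication factor 3.
msgs_per_sec = 1_000_000
avg_msg_bytes = 1_000          # assumed average message size
replication_factor = 3
retention_days = 7

write_throughput = msgs_per_sec * avg_msg_bytes            # bytes/sec, pre-replication
cluster_write = write_throughput * replication_factor      # bytes/sec across all replicas
storage_needed = cluster_write * retention_days * 86_400   # bytes kept for 7 days

print(f"Ingest:  {write_throughput / 1e9:.1f} GB/s")
print(f"Cluster: {cluster_write / 1e9:.1f} GB/s with replication")
print(f"Storage: {storage_needed / 1e12:.0f} TB for {retention_days} days")
```

Roughly 1 GB/s of ingest and around 1.8 PB of retained data — which is why storage efficiency and sequential disk I/O dominate the rest of the design.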

2.1.3 Out of Scope

To keep this case study focused, we will not design:

  • Schema registry (Avro, Protobuf schema management)
  • Stream processing engine (Kafka Streams, Flink)
  • Multi-datacenter replication (MirrorMaker)
  • Tiered storage (hot/cold storage)

2.1.4 API Design

Producer API:

API | Parameters | Description
produce(topic, message, key?, partition?) | topic: string, message: bytes, key: bytes (optional), partition: int (optional) | Send a message to a topic. If a key is set → hash(key) picks the partition. If a partition is set → send directly to it
produce_batch(topic, messages[]) | topic: string, messages: list of (key, value) | Send many messages at once — reduces network overhead

Consumer API:

API | Parameters | Description
subscribe(topics[], group_id) | topics: list of string, group_id: string | The consumer joins a group and subscribes to topics
poll(timeout) | timeout: ms | Fetch a batch of new messages. Returns a list of messages
commit(offsets) | offsets: map<partition, offset> | Commit offsets — mark how far processing has gotten
seek(partition, offset) | partition: int, offset: long | Jump to an arbitrary position — used for replay
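To make the API surface concrete, here is a minimal sketch of how a client might use it. MessageQueueClient is an illustrative stub standing in for a real broker connection (the single in-memory list is a stand-in for one partition's log); only the method names follow the tables above:

```python
# Illustrative stub of the API surface above — not a real client library.
class MessageQueueClient:
    def __init__(self):
        self._log = []          # stand-in for a single-partition broker log
        self._position = 0

    def produce(self, topic, message, key=None, partition=None):
        self._log.append((key, message))
        return len(self._log) - 1        # offset assigned by the "broker"

    def subscribe(self, topics, group_id):
        self._position = 0               # start of the log for this demo

    def poll(self, timeout_ms):
        batch = self._log[self._position:]
        self._position = len(self._log)
        return batch

    def commit(self, offsets):
        self._committed = offsets        # remember how far we have processed

client = MessageQueueClient()
client.produce("orders", b'{"order_id": 42}', key=b"user-7")
client.subscribe(["orders"], group_id="order-processing")
records = client.poll(timeout_ms=100)
client.commit({0: len(records)})
```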

Step 2: High-Level Design — Overall architecture

2.2.1 Main components

flowchart TB
    subgraph Producers
        P1["Producer 1<br/>(Order Service)"]
        P2["Producer 2<br/>(Payment Service)"]
        P3["Producer 3<br/>(IoT Devices)"]
    end

    subgraph "Broker Cluster"
        B1["Broker 1<br/>Partitions: 0, 3, 6"]
        B2["Broker 2<br/>Partitions: 1, 4, 7"]
        B3["Broker 3<br/>Partitions: 2, 5, 8"]
    end

    subgraph "Metadata Store"
        MS["ZooKeeper / etcd<br/>- Broker registry<br/>- Topic config<br/>- Partition assignment<br/>- Consumer group state"]
    end

    subgraph "Coordinator"
        CO["Coordinator<br/>- Leader election<br/>- Partition assignment<br/>- Consumer rebalancing"]
    end

    subgraph Consumers
        subgraph "Consumer Group A (Order Processing)"
            C1["Consumer A1"]
            C2["Consumer A2"]
            C3["Consumer A3"]
        end
        subgraph "Consumer Group B (Analytics)"
            C4["Consumer B1"]
            C5["Consumer B2"]
        end
    end

    P1 & P2 & P3 --> B1 & B2 & B3
    B1 & B2 & B3 <--> MS
    CO <--> MS
    CO --> B1 & B2 & B3
    B1 & B2 & B3 --> C1 & C2 & C3
    B1 & B2 & B3 --> C4 & C5

    style B1 fill:#1e88e5,color:#fff
    style B2 fill:#1e88e5,color:#fff
    style B3 fill:#1e88e5,color:#fff
    style MS fill:#f57c00,color:#fff
    style CO fill:#7b1fa2,color:#fff

2.2.2 Role of each component

Component | Role | Details
Producer | Sends messages to brokers | Picks a partition (round-robin, key-based hash, custom). Batches and compresses before sending
Broker | Stores and serves messages | Each broker owns a set of partitions. Writes messages to disk (WAL). Serves reads to consumers
Metadata Store | Holds cluster metadata | Broker list, topic config, partition assignment, consumer group state. Uses ZooKeeper or etcd
Coordinator | Orchestrates the cluster | Partition leader election, consumer group rebalancing, broker membership management
Consumer | Reads messages from brokers | Pull-based — the consumer actively fetches data. Tracks offsets to know how far it has read
Consumer Group | Consumers sharing the read work | Each partition is assigned to exactly 1 consumer in the group. Enables parallel processing

2.2.3 Message flow overview

sequenceDiagram
    participant P as Producer
    participant LB as Broker (Leader)
    participant F1 as Broker (Follower 1)
    participant F2 as Broker (Follower 2)
    participant C as Consumer

    P->>LB: 1. Send message batch
    LB->>LB: 2. Write to WAL (disk)
    LB->>F1: 3. Replicate to follower 1
    LB->>F2: 3. Replicate to follower 2
    F1-->>LB: 4. ACK replication
    F2-->>LB: 4. ACK replication
    LB-->>P: 5. ACK to producer (acks=all)

    C->>LB: 6. Poll(offset=100)
    LB->>LB: 7. Read from WAL/page cache
    LB-->>C: 8. Return messages [100-150]
    C->>C: 9. Process messages
    C->>LB: 10. Commit offset=150

Step 3: Deep Dive — Each component in detail

2.3.1 Data Model — The data model

Topic

A topic is a logical channel for organizing messages. Like mail categories at the post office — business mail, personal mail, advertising mail.

Attribute | Description | Example
Name | Topic name — unique identifier | orders, payments, user-events
Partition count | Number of partitions — determines parallelism | 12 partitions
Replication factor | Number of copies per partition | 3 (1 leader + 2 followers)
Retention | How long messages are kept | 7 days (default)
Cleanup policy | How old messages are removed | delete (remove by age) or compact (keep latest per key)
Partition

A partition is the basic unit of parallelism and ordering. Like multiple parallel counters at the post office — each counter works independently.

Each partition is an ordered, immutable sequence of messages. Messages are appended at the end — never modified or deleted (removed only when retention expires).

Key properties:

  • Messages within one partition have a guaranteed order (ordering guarantee)
  • Messages in different partitions have no ordering relative to each other
  • Each partition has one leader and multiple followers (replicas)
  • Only the leader serves reads/writes — followers only replicate
Offset

The offset is the sequence number of a message within a partition. It starts at 0 and increases. Like letter numbers in a mailbox — letter 0, letter 1, letter 2…

Property | Details
Data type | 64-bit integer
Range | 0 to 2^63 - 1 (enough for roughly 292,000 years at 1M msg/sec)
Behavior | Monotonically increasing within a partition
Purpose | The consumer uses the offset to track how far it has read, and can seek to any offset
Message (Record)

Each message consists of the following fields:

Field | Type | Description
Key | bytes (nullable) | Used to pick the partition. E.g. user_id, order_id. Nullable — if null, round-robin
Value | bytes | Message payload. Can be JSON, Avro, Protobuf. The broker doesn't care about the format — it's just bytes
Timestamp | long (epoch ms) | When the message was created (CreateTime) or when the broker received it (LogAppendTime)
Headers | list of key-value | Extra metadata. E.g. trace-id, source-service, content-type
Offset | long | Assigned by the broker when the message is written to the partition. Not set by the producer
Partition | int | The partition the message belongs to
CRC | int | Checksum to detect data corruption

Aha Moment: The broker never parses or interprets message contents. To the broker, a message is just a blob of bytes. This is what makes brokers extremely fast — they only write bytes to disk and read bytes from disk, with no serialization/deserialization overhead.


2.3.2 Broker Storage — How data is stored on disk

This is the most important part of the design. The storage design determines the throughput and durability of the entire system.

Write-Ahead Log (WAL) per Partition

Each partition is stored on disk as an append-only log. Like the post office's ledger — entries are only ever added, never erased or edited.

Why append-only?

  • Sequential disk writes are extremely fast: HDD sequential writes reach 200-300 MB/s (up to ~1000x faster than random writes). SSD sequential writes reach 500-3000 MB/s.
  • No complex locking needed: writes only go to the end → no conflicts
  • Ordering comes for free: messages are written in order → they are read back in the same order
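The append path can be sketched in a few lines. This is a minimal, single-file sketch (no indexes, no segments yet); the length-prefix framing and the AppendOnlyLog name are illustrative choices, not the real on-disk format of any particular broker:

```python
# Minimal append-only log: every write goes to the end of the file,
# which turns all disk I/O into fast sequential writes.
import os
import struct
import tempfile

class AppendOnlyLog:
    def __init__(self, path):
        self.f = open(path, "ab")
        self.next_offset = 0

    def append(self, payload: bytes) -> int:
        # Length-prefixed record so a reader knows where each message ends.
        self.f.write(struct.pack(">I", len(payload)) + payload)
        self.f.flush()                    # hand off to the OS page cache
        offset = self.next_offset
        self.next_offset += 1
        return offset

path = os.path.join(tempfile.mkdtemp(), "00000000000000000000.log")
log = AppendOnlyLog(path)
assert log.append(b"order created") == 0
assert log.append(b"order paid") == 1
```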
Segment Files

Each partition's log is split into multiple segment files. Each segment is one file on disk.

Topic: orders, Partition: 0

/data/orders-0/
    00000000000000000000.log     ← Segment 0: offsets 0 - 999,999
    00000000000000000000.index   ← Offset index for segment 0
    00000000000000000000.timeindex ← Time index for segment 0
    00000000000001000000.log     ← Segment 1: offsets 1,000,000 - 1,999,999
    00000000000001000000.index
    00000000000001000000.timeindex
    00000000000002000000.log     ← Segment 2 (active segment — being written to)
    00000000000002000000.index
    00000000000002000000.timeindex

Why split into segments?

Reason | Explanation
Retention cleanup | Delete old segments to free disk — drop whole files, no per-message scanning
Compaction | Compact old segments — keep only the latest value per key
File system limits | One huge file is slow for the OS to handle. Smaller segments are easier to manage
Recovery | On broker restart, only the active segment needs recovery — older segments were already flushed

Segment configuration:

Config | Default | Description
segment.bytes | 1 GB | Maximum size per segment
segment.ms | 7 days | Maximum age before rolling a new segment
segment.index.bytes | 10 MB | Index file size
Index Files — Finding messages fast

When a consumer wants to read from offset 1,500,000 — how does the broker find the right position on disk without scanning from the beginning?

Offset Index: maps offset → position within the segment file.

Offset Index (sparse — not every offset is stored):
Offset 1,000,000 → Position 0
Offset 1,000,100 → Position 15,200
Offset 1,000,200 → Position 30,800
Offset 1,000,300 → Position 46,100
...

To find offset 1,000,150:

  1. Binary search the index → find the greatest entry below the target: offset 1,000,100 at position 15,200
  2. Scan sequentially from position 15,200 until reaching offset 1,000,150
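The two-step lookup above can be sketched with the standard bisect module, using the sample index entries as data. The sequential-scan step is elided here; the sketch only finds the starting point:

```python
# Sparse offset index lookup: binary-search for the greatest entry <= target,
# then (in a real broker) scan forward from that file position.
import bisect

# (offset, file position) pairs — the sample sparse index from the text.
index = [(1_000_000, 0), (1_000_100, 15_200),
         (1_000_200, 30_800), (1_000_300, 46_100)]

def locate(target_offset):
    offsets = [off for off, _ in index]
    i = bisect.bisect_right(offsets, target_offset) - 1   # nearest entry <= target
    base_offset, position = index[i]
    # From `position`, the broker scans records sequentially until it
    # reaches `target_offset`; here we just return the starting point.
    return base_offset, position

assert locate(1_000_150) == (1_000_100, 15_200)
```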

Timestamp Index: maps timestamp → offset. Used when a consumer wants to read from a specific point in time (e.g. “re-read all messages from 2 days ago”).

Timestamp Index:
Timestamp 1710700000000 → Offset 1,000,000
Timestamp 1710700060000 → Offset 1,000,500
Timestamp 1710700120000 → Offset 1,001,200
...

Aha Moment: The index is sparse (it stores every Nth offset, not every offset). This saves disk and memory. Trade-off: a short scan is needed to find the exact record — but that stretch is usually sitting in the page cache, so it is very fast.

Memory-Mapped Files (mmap) and the Page Cache

This is the “secret” behind a message queue's extreme throughput:

The OS page cache: when the broker writes a message to a file, the data is not written straight to disk. It goes through the OS page cache first:

  1. Producer sends a message → broker writes it into the page cache (RAM)
  2. The OS flushes the page cache to disk asynchronously
  3. Consumer reads the message → served from the page cache (if the data is still cached) → zero disk I/O

Result: in the ideal case (consumers reading recently written messages), all reads happen from RAM — the disk is never touched. This is why a message queue can reach millions of messages/sec on commodity hardware.

Memory-mapped files (mmap): index files are mapped into memory via mmap. This lets the broker access the index like an in-RAM array — no read() system calls.

Technique | Applied to | Benefit
Page cache | Segment files (.log) | Fast writes (async flush), fast reads (from RAM)
mmap | Index files (.index, .timeindex) | Index access at RAM speed
sendfile() / zero-copy | Disk → network transfers | Lower CPU usage, fewer copies between kernel and user space

Zero-copy transfer: when a consumer reads messages, the broker uses the sendfile() system call to move data directly from the page cache to the network socket without copying through user space — eliminating 2 copies and 2 context switches.
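Python exposes this syscall as os.sendfile, so the idea can be demonstrated directly. A sketch assuming a Linux kernel (where the destination of sendfile may be a regular file; a broker would use a socket fd instead):

```python
# os.sendfile wraps the kernel's zero-copy transfer: bytes move from the
# source file's page cache to the destination fd without passing through
# user space. Both fds are regular files here just to keep it runnable.
import os
import tempfile

src = tempfile.NamedTemporaryFile(delete=False)
src.write(b"x" * 1024)          # pretend this is a log segment
src.flush()

dst = tempfile.NamedTemporaryFile(delete=False)   # stands in for a socket

with open(src.name, "rb") as fsrc, open(dst.name, "wb") as fdst:
    sent = os.sendfile(fdst.fileno(), fsrc.fileno(), 0, 1024)

assert sent == 1024
```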

flowchart LR
    subgraph "Traditional (4 copies)"
        D1["Disk"] -->|"1. read()"| K1["Kernel Buffer"]
        K1 -->|"2. copy"| U1["User Buffer"]
        U1 -->|"3. write()"| K2["Socket Buffer"]
        K2 -->|"4. send"| N1["Network"]
    end

    subgraph "Zero-copy (2 copies)"
        D2["Disk/Page Cache"] -->|"1. sendfile()"| K3["Kernel Buffer"]
        K3 -->|"2. send"| N2["Network"]
    end

    style D2 fill:#43a047,color:#fff
    style K3 fill:#43a047,color:#fff
    style N2 fill:#43a047,color:#fff

2.3.3 Producer — Sending messages efficiently

Partitioning Strategy

When the producer sends a message, it must choose which partition will receive it. There are 3 strategies:

1. Round-Robin (no key):

  • Messages are sent to partitions in turn: P0, P1, P2, P0, P1, P2…
  • Pros: even load distribution — no hotspots
  • Cons: no ordering guarantee — messages for the same entity may land in different partitions
  • Use when: ordering doesn't matter (e.g. log messages, metrics)

2. Key-based Hash (with key):

  • partition = hash(key) % num_partitions
  • All messages with the same key → always the same partition
  • Pros: ordering guaranteed per key
  • Cons: possible hotspots when keys are skewed (e.g. a celebrity user generating many events)
  • Use when: per-entity ordering is needed (e.g. key = order_id → all events of one order arrive in order)

3. Custom Partitioner:

  • The producer implements its own partition-selection logic
  • Use when: special logic is needed — e.g. high-priority messages to a dedicated partition, geo-based routing

Pitfall: If the partition count changes (say from 6 to 12), hash(key) % num_partitions changes too. Messages with the same key may start landing in a different partition. This is why the partition count is very hard to change once you are in production. Choose it right from the start!
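The key-based strategy is one line of arithmetic. A sketch using crc32 as a stand-in hash (Kafka's default partitioner actually uses murmur2); it also makes the pitfall above easy to see, since the same key generally maps elsewhere once num_partitions changes:

```python
# Key-based partitioning: same key -> same partition, deterministically.
import zlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    # crc32 as a stand-in hash; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key) % num_partitions

key = b"order-12345"
before = pick_partition(key, 6)    # partition while the topic has 6 partitions
after = pick_partition(key, 12)    # partition after growing to 12 partitions

# Deterministic: repeated calls with the same inputs always agree.
assert pick_partition(key, 6) == before
# `before` and `after` will usually differ — the mapping is not stable
# across partition-count changes, which is exactly the pitfall.
```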

Batching

The producer does not send messages one at a time. It gathers multiple messages into a batch and sends them together.

Config | Default | Description
batch.size | 16 KB | Maximum size of a batch
linger.ms | 0 ms (send immediately) | How long to wait to gather more messages into a batch

Trade-offs:

  • linger.ms = 0: lowest latency, but low throughput (messages are sent one by one)
  • linger.ms = 5-50ms: significantly higher throughput (bigger batches), at the cost of a few extra ms of latency
  • Larger batch.size: fewer network round-trips, but higher memory usage
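The batch.size / linger.ms interplay can be sketched as a tiny buffer that flushes on whichever threshold is hit first. The Batcher class is illustrative — real producers also handle per-partition buffers, retries, and backpressure:

```python
# Producer-side batching sketch: buffer messages until either batch.size
# bytes accumulate or linger.ms elapses, then flush the whole batch at once.
import time

class Batcher:
    def __init__(self, batch_size=16_384, linger_ms=5):
        self.batch_size, self.linger_ms = batch_size, linger_ms
        self.buf, self.nbytes, self.started = [], 0, None
        self.sent_batches = []           # each entry = one network round-trip

    def send(self, msg: bytes):
        if self.started is None:
            self.started = time.monotonic()
        self.buf.append(msg)
        self.nbytes += len(msg)
        waited_ms = (time.monotonic() - self.started) * 1000
        if self.nbytes >= self.batch_size or waited_ms >= self.linger_ms:
            self.flush()

    def flush(self):
        if self.buf:
            self.sent_batches.append(self.buf)
            self.buf, self.nbytes, self.started = [], 0, None

b = Batcher(batch_size=10, linger_ms=1000)
b.send(b"aaaa")       # 4 bytes — buffered, under both thresholds
b.send(b"bbbbbb")     # 10 bytes total — batch.size reached, flushed as one batch
assert b.sent_batches == [[b"aaaa", b"bbbbbb"]]
```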
Compression

Batches are compressed before going over the network. The broker stores the compressed batch as-is — it never decompresses. Only the consumer decompresses.

Algorithm | Compression ratio | CPU usage | Speed | When to use
gzip | Highest (~70-80%) | High | Slow | Bandwidth-limited, large batches
snappy | Medium (~50-60%) | Low | Fast | When balancing compression against CPU
lz4 | Medium (~55-65%) | Very low | Very fast | Default choice — best for most cases
zstd | High (~65-75%) | Medium | Fast | Better compression than lz4, at some extra CPU

Aha Moment: Compression happens at the batch level, not per message. The bigger the batch → the better the compression ratio (more similar data to compress together). This is why raising linger.ms and batch.size saves significant bandwidth.

Acknowledgment (acks)

After sending a message, the producer waits for the broker's acknowledgment (ACK). The ack level trades durability against latency:

acks | Behavior | Durability | Latency | When to use
0 | Producer doesn't wait for an ACK. “Fire and forget” | Lowest — messages can be lost | Lowest | Metrics, logs — loss is acceptable
1 | Wait for the leader's ACK. Leader finishes writing → ACK | Medium — lost if the leader crashes before replicating | Medium | Most cases
all (-1) | Wait for ACKs from all ISR (In-Sync Replicas). Leader + followers finish writing → ACK | Highest — lost only if all ISR crash at once | Highest | Payment events, critical data — must not be lost

Pitfall: acks=all does not mean all replicas. It only waits for all replicas currently in the ISR (In-Sync Replicas). If the ISR has shrunk to just the leader because the followers are lagging → acks=all degenerates into acks=1. This is why you also need to set min.insync.replicas=2.


2.3.4 Consumer — Reading messages efficiently

Consumer Group

A consumer group is the mechanism for dividing work among multiple consumers. Like a team of mail carriers — each covers one area, and nobody duplicates anyone else's work.

Golden rules:

  • Within one consumer group: each partition is assigned to exactly 1 consumer
  • 1 consumer may read multiple partitions
  • If num_consumers > num_partitions → the surplus consumers sit idle
  • If num_consumers < num_partitions → some consumers read multiple partitions
  • Ideal: num_consumers = num_partitions

Multiple consumer groups read the same topic independently of one another:

Topic: orders (6 partitions)

Consumer Group "order-processing":
  Consumer 1 → Partition 0, 1
  Consumer 2 → Partition 2, 3
  Consumer 3 → Partition 4, 5

Consumer Group "analytics":
  Consumer A → Partition 0, 1, 2
  Consumer B → Partition 3, 4, 5

Consumer Group "audit":
  Consumer X → Partition 0, 1, 2, 3, 4, 5  (1 consumer reads everything)

Each group keeps its own offsets. Group “order-processing” may have read up to offset 10,000 while group “analytics” is only at offset 5,000.

Partition Assignment Strategies

When a consumer joins or leaves the group, partitions must be re-assigned. Several strategies exist:

1. Range Assignment:

  • Sort partitions by number and divide them evenly across consumers
  • Example: 6 partitions, 3 consumers → C1: [P0,P1], C2: [P2,P3], C3: [P4,P5]
  • Pros: simple, deterministic
  • Cons: the first consumers may receive more partitions when the division isn't even

2. Round-Robin Assignment:

  • Hand out partitions in turn: C1→P0, C2→P1, C3→P2, C1→P3, C2→P4, C3→P5
  • Pros: more even distribution than range
  • Cons: on rebalance, many partitions get moved

3. Sticky Assignment:

  • Keep as much of the previous assignment as possible; move only the partitions of consumers that left
  • Pros: less disruption on rebalance — consumers keep the state they have built up
  • Cons: more complex to implement

Aha Moment: Sticky assignment is the production best practice. When 1 consumer crashes, only its partitions are reassigned — the other consumers keep theirs. This dramatically shortens the “stop the world” window.
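The first two strategies are simple enough to sketch directly; running both on the 6-partition / 3-consumer example from above reproduces the assignments described:

```python
# Range vs round-robin assignment for partitions over consumers.
def range_assign(partitions, consumers):
    per, extra = divmod(len(partitions), len(consumers))
    out, start = {}, 0
    for i, c in enumerate(consumers):
        count = per + (1 if i < extra else 0)   # first consumers absorb any remainder
        out[c] = partitions[start:start + count]
        start += count
    return out

def round_robin_assign(partitions, consumers):
    out = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        out[consumers[i % len(consumers)]].append(p)
    return out

parts = list(range(6))
cs = ["C1", "C2", "C3"]
assert range_assign(parts, cs) == {"C1": [0, 1], "C2": [2, 3], "C3": [4, 5]}
assert round_robin_assign(parts, cs) == {"C1": [0, 3], "C2": [1, 4], "C3": [2, 5]}
```

With 7 partitions, range_assign gives C1 an extra partition ([0, 1, 2]) — the unevenness noted as its main drawback.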

Offset Management

The consumer must tell the broker how far it has read. This process is called committing the offset.

Auto Commit:

  • The consumer automatically commits offsets every auto.commit.interval.ms (default 5s)
  • Pros: simple, no extra code
  • Cons: can lose messages or process duplicates
    • Consumer reads a message, hasn't finished processing → auto commit fires → consumer crashes → message lost (at-most-once)
    • Consumer finishes processing, hasn't committed yet → crash → rebalance → the new consumer re-reads it (at-least-once, with duplicates)

Manual Commit:

  • The consumer commits only after it has finished processing the message
  • Synchronous commit: commitSync() — blocks until the commit succeeds. Safe but slow
  • Asynchronous commit: commitAsync() — non-blocking. Fast but can fail silently

Best practice: use commitAsync() inside the processing loop, and commitSync() before the consumer shuts down. Combining the two balances performance and safety.
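That pattern looks roughly like this. StubConsumer is an illustrative stand-in (method names mirror the Java client's commitAsync/commitSync); note the convention that the committed offset points at the next message to read:

```python
# The commitAsync-in-loop / commitSync-on-shutdown pattern, with a stub consumer.
class StubConsumer:
    def __init__(self, records):
        self._records, self.committed = records, []

    def poll(self, timeout_ms):
        batch, self._records = self._records, []
        return batch

    def commit_async(self, offsets):
        self.committed.append(("async", offsets))   # fast; may fail silently

    def commit_sync(self, offsets):
        self.committed.append(("sync", offsets))    # blocks until acknowledged

consumer = StubConsumer([(0, 100, b"m1"), (0, 101, b"m2")])   # (partition, offset, value)
offsets = {}
try:
    while True:
        batch = consumer.poll(timeout_ms=100)
        if not batch:
            break
        for partition, offset, value in batch:
            offsets[partition] = offset + 1      # committed offset = NEXT one to read
        consumer.commit_async(dict(offsets))     # non-blocking, in the hot loop
finally:
    consumer.commit_sync(dict(offsets))          # blocking, guaranteed on shutdown
```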

Rebalancing Protocol

When a consumer joins or leaves the group, the system must rebalance — reassign partitions to the consumers.

sequenceDiagram
    participant C1 as Consumer 1
    participant C2 as Consumer 2
    participant C3 as Consumer 3 (new)
    participant GC as Group Coordinator

    Note over C1,GC: Consumer 3 joins group
    C3->>GC: JoinGroup request
    GC->>GC: Trigger rebalance
    GC->>C1: Revoke partitions
    GC->>C2: Revoke partitions
    C1->>C1: Commit offsets, release partitions
    C2->>C2: Commit offsets, release partitions
    C1->>GC: JoinGroup (re-join)
    C2->>GC: JoinGroup (re-join)
    C3->>GC: JoinGroup
    GC->>C1: Assign: Consumer 1 is leader
    Note over C1: Leader computes assignment
    C1->>GC: SyncGroup (assignment plan)
    GC->>C1: SyncGroup response (P0, P1)
    GC->>C2: SyncGroup response (P2, P3)
    GC->>C3: SyncGroup response (P4, P5)
    Note over C1,C3: Resume consuming with new assignment

Eager Rebalancing (traditional):

  • All consumers stop, give up their partitions, then receive a new assignment
  • Problem: “stop the world” — for the duration of the rebalance, no consumer reads anything

Cooperative / Incremental Rebalancing (newer):

  • Only the partitions that must move are revoked. Consumers keep the partitions that don't change
  • Pros: far less disruption — most consumers keep working normally through the rebalance
  • This is the best practice for production systems

Pitfall: Rebalancing is the most dangerous moment in a consumer group's life. During a rebalance, nobody processes messages → consumer lag grows. If rebalances fire continuously (unstable consumers, session timeout set too short) → the system grinds to a halt. Monitoring rebalance frequency is critical.


2.3.5 Replication — Making sure no data is lost

Leader-Follower Model

Each partition has 1 leader and multiple followers. Only the leader accepts reads/writes. Followers only replicate from the leader.

flowchart TB
    subgraph "Topic: orders, Partition 0"
        subgraph "Broker 1"
            L0["Partition 0<br/>LEADER<br/>Offset: 0-15,000"]
        end
        subgraph "Broker 2"
            F0a["Partition 0<br/>FOLLOWER (ISR)<br/>Offset: 0-14,998"]
        end
        subgraph "Broker 3"
            F0b["Partition 0<br/>FOLLOWER (ISR)<br/>Offset: 0-14,995"]
        end
    end

    subgraph "Topic: orders, Partition 1"
        subgraph "Broker 2 "
            L1["Partition 1<br/>LEADER<br/>Offset: 0-12,000"]
        end
        subgraph "Broker 3 "
            F1a["Partition 1<br/>FOLLOWER (ISR)<br/>Offset: 0-11,999"]
        end
        subgraph "Broker 1 "
            F1b["Partition 1<br/>FOLLOWER (ISR)<br/>Offset: 0-11,997"]
        end
    end

    L0 -->|"Replicate"| F0a
    L0 -->|"Replicate"| F0b
    L1 -->|"Replicate"| F1a
    L1 -->|"Replicate"| F1b

    style L0 fill:#e53935,color:#fff
    style L1 fill:#e53935,color:#fff
    style F0a fill:#1e88e5,color:#fff
    style F0b fill:#1e88e5,color:#fff
    style F1a fill:#1e88e5,color:#fff
    style F1b fill:#1e88e5,color:#fff

Why spread leaders across multiple brokers?

  • Partition 0's leader on Broker 1, Partition 1's leader on Broker 2…
  • Even load distribution — no single broker gets overloaded
  • If Broker 1 dies → only the leaders hosted on Broker 1 need failover
ISR (In-Sync Replicas)

The ISR is the set of replicas that are keeping up with the leader. A replica is in the ISR if:

  • It has fetched data from the leader and is not lagging too far behind
  • Config: replica.lag.time.max.ms (default 30s) — a follower that hasn't fetched within 30s is dropped from the ISR

Term | Definition
ISR | The set of replicas in sync with the leader (includes the leader itself)
OSR (Out-of-Sync Replicas) | Lagging replicas — not keeping up with the leader
LEO (Log End Offset) | The offset of the last message on each replica
HW (High Watermark) | The highest offset that all ISR have replicated. Consumers can only read up to the HW

What is the High Watermark?

Leader:    [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
Follower1: [0] [1] [2] [3] [4] [5] [6] [7]
Follower2: [0] [1] [2] [3] [4] [5] [6]
                                        ↑
                                   High Watermark = 6

Consumers can only read up to message offset 6 (the High Watermark). Messages 7, 8, 9 exist on the leader but have not yet been replicated by all ISR → not yet visible to consumers.

Why? If the leader crashes before replicating 7, 8, 9 → a follower becomes the new leader → messages 7, 8, 9 are gone. Had a consumer already read them → data inconsistency.
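The high watermark falls out of the replicas' LEOs directly. A sketch using the diagram's numbers, with LEO taken as one past the last stored message (so follower2, holding messages 0..6, has LEO 7):

```python
# High watermark = the minimum Log End Offset across the ISR: the point up
# to which every in-sync replica has already replicated.
leo = {"leader": 10, "follower1": 8, "follower2": 7}   # LEO = one past last message

high_watermark = min(leo.values())   # offsets [0, 7) are fully replicated
last_readable = high_watermark - 1   # message 6 — matching the diagram above

assert last_readable == 6
```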

min.insync.replicas

This config sets the minimum ISR size required before the leader accepts writes.

Config | Behavior
replication.factor = 3 | Each partition has 3 replicas
min.insync.replicas = 2 | At least 2 replicas must be in the ISR for writes to succeed
acks = all | The producer waits for all ISR to acknowledge

Combined: acks=all + min.insync.replicas=2 + replication.factor=3

  • Producer sends a message → leader writes + at least 1 follower writes → ACK
  • Tolerates 1 broker failure with no data loss and no write outage
  • If 2 brokers die → ISR < min.insync.replicas → the leader rejects writes → the producer gets an error

Aha Moment: min.insync.replicas is the “circuit breaker” of replication. It stops the leader from accepting writes when there aren't enough replicas — preventing the scenario where a message is written only to the leader, the leader dies, and the data is lost.
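The write-acceptance check itself is tiny — the whole mechanism reduces to one comparison on the broker:

```python
# The gate behind min.insync.replicas: the leader rejects produce requests
# once the ISR shrinks below the configured minimum.
MIN_INSYNC_REPLICAS = 2

def accept_write(isr: set) -> bool:
    return len(isr) >= MIN_INSYNC_REPLICAS

assert accept_write({"broker1", "broker2", "broker3"})   # healthy: 3 in ISR
assert accept_write({"broker1", "broker2"})              # one broker down: still OK
assert not accept_write({"broker1"})                     # two down: writes rejected
```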

Unclean Leader Election

When the leader dies and no follower is in the ISR (all followers are lagging) — the system has 2 choices:

Option | Config | Consequence
Wait for an ISR follower to come back online | unclean.leader.election.enable = false | The partition stays unavailable until an ISR follower returns. Durability > Availability
Promote a non-ISR follower to leader | unclean.leader.election.enable = true | The partition recovers immediately, but unreplicated messages are lost. Availability > Durability

The classic trade-off: Durability vs Availability

  • Payment systems, financial data → false (data loss is unacceptable)
  • Metrics, logs → true (losing a few messages is acceptable as long as the system stays up)

2.3.6 Delivery Semantics — Message delivery guarantees

This is one of the hardest and most commonly misunderstood concepts in message queues.

At-Most-Once

Definition: a message is processed at most once. It can be lost, but it is never duplicated.

How to implement:

  • Producer: acks = 0 (don't wait for ACKs)
  • Consumer: commit the offset before processing the message

When to use: metrics, logs, analytics — losing a few data points doesn't change the result.

Risk: a lost message is gone forever — neither the producer nor the consumer knows.

At-Least-Once (Default)

Definition: a message is processed at least once. Never lost, but possibly duplicated.

How to implement:

  • Producer: acks = all + retry on send failure
  • Consumer: process the message first, then commit the offset

When to use: most cases — notifications, email, order processing (with an idempotent consumer).

Risk: a message can be processed twice. Example: the consumer finishes processing but crashes before committing the offset → the message is processed again.

Solution: the consumer must be idempotent — processing the same message twice produces the same result. For example, use the message's unique ID to check “already processed?” before acting on it.
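That "check-then-process" idea can be sketched in a few lines. The in-memory set and the example balance-update handler are illustrative — in production the processed-ID store would be a database table or a cache with a TTL:

```python
# Idempotent consumer sketch: remember processed message IDs so a redelivery
# (normal under at-least-once) does not apply the same side effect twice.
processed_ids = set()        # in production: a DB table or cache with TTL
balance = {"user-7": 0}

def handle(message_id, user, amount):
    if message_id in processed_ids:
        return "skipped"                 # duplicate delivery — no-op
    balance[user] += amount              # the side effect we must not repeat
    processed_ids.add(message_id)
    return "applied"

assert handle("msg-001", "user-7", 100) == "applied"
assert handle("msg-001", "user-7", 100) == "skipped"   # redelivered duplicate
assert balance["user-7"] == 100                        # applied exactly once
```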

Exactly-Once

Definition: a message is processed exactly once. Never lost, never duplicated.

How to implement (the most complex):

  1. Idempotent Producer:

    • The broker assigns each producer a Producer ID (PID)
    • Every message carries a Sequence Number that increases per partition
    • If the broker sees a message with a PID + sequence number it has already accepted → it silently discards it (deduplication)
    • Solves: producer retries sending duplicates → the broker writes only once
  2. Transactional API:

    • The producer starts a transaction → sends messages to multiple partitions → commits or aborts
    • Consumers only read messages from committed transactions (isolation.level = read_committed)
    • Solves: atomic writes across partitions — either all messages are written, or none are
  3. Consumer-side deduplication:

    • Even with an idempotent producer and transactions, the consumer must still process idempotently
    • Reason: the broker's “exactly-once” only guarantees a single write. Consumer-side processing is separate logic — it can fail halfway through

Aha Moment: “Exactly-once” in distributed systems is extremely hard and is usually the combination of several mechanisms: an idempotent producer (deduplicates writes) + transactions (atomic writes) + an idempotent consumer (deduplicates processing). There is no magic button that gives you exactly-once automatically.
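The broker side of the idempotent producer (mechanism 1 above) can be sketched as a sequence-number check per (producer ID, partition). This is a simplified model — real brokers track a window of recent batches, not just the last sequence number:

```python
# Broker-side dedup for the idempotent producer: drop any batch whose
# sequence number was already accepted for that (producer_id, partition).
last_seq = {}    # (producer_id, partition) -> last accepted sequence number

def on_produce(pid, partition, seq, payload, log):
    prev = last_seq.get((pid, partition), -1)
    if seq <= prev:
        return "duplicate"       # a retry of something already written — drop it
    log.append(payload)
    last_seq[(pid, partition)] = seq
    return "written"

log = []
assert on_produce("p1", 0, 0, b"a", log) == "written"
assert on_produce("p1", 0, 1, b"b", log) == "written"
assert on_produce("p1", 0, 1, b"b", log) == "duplicate"   # producer retried
assert log == [b"a", b"b"]                                # written only once
```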


2.3.7 Message Retention — How long to keep messages?

Unlike a traditional message queue (RabbitMQ: a message is deleted once consumed), a distributed message queue keeps messages according to a retention policy — regardless of whether they have been consumed.

Time-Based Retention

Config | Default | Description
retention.ms | 604,800,000 (7 days) | Messages older than 7 days are deleted
retention.minutes | - | Expressed in minutes
retention.hours | 168 | Expressed in hours

When the oldest segment file's timestamp exceeds the retention period → the entire segment file is deleted.

Size-Based Retention

Config | Default | Description
retention.bytes | -1 (unlimited) | Maximum total size per partition

When the combined size of the segments exceeds retention.bytes → the oldest segment is deleted.
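Size-based cleanup operates on whole segments, which is what makes it cheap. A sketch, with segments modeled as (base_offset, size) pairs and the retention limit an arbitrary example value:

```python
# Size-based retention: drop whole segments from the oldest end until the
# partition fits within retention.bytes. Deleting a file is O(1) — no
# per-message scanning.
RETENTION_BYTES = 2_500

def enforce_retention(segments):
    # segments: oldest-first list of (base_offset, size_in_bytes)
    while sum(size for _, size in segments) > RETENTION_BYTES and len(segments) > 1:
        segments.pop(0)          # delete the oldest segment file wholesale
    return segments

segments = [(0, 1_000), (1_000, 1_000), (2_000, 1_000)]   # 3,000 bytes total
assert enforce_retention(segments) == [(1_000, 1_000), (2_000, 1_000)]
```

The len(segments) > 1 guard mirrors the fact that the active segment is never deleted out from under the writer.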

Log Compaction — Keep the latest value per key

This is a special retention policy. Instead of deleting by time, the broker keeps only the most recent message for each key and discards the older ones.

BEFORE compaction:
Key=user1, Value=Hanoi     (offset 0)
Key=user2, Value=Saigon    (offset 1)
Key=user1, Value=Danang    (offset 2)   ← user1 updated their address
Key=user3, Value=Hue       (offset 3)
Key=user2, Value=null      (offset 4)   ← user2 was deleted (tombstone)
Key=user1, Value=Dalat     (offset 5)   ← user1 updated again

AFTER compaction:
Key=user3, Value=Hue       (offset 3)   ← kept as-is
Key=user2, Value=null      (offset 4)   ← tombstone (removed later)
Key=user1, Value=Dalat     (offset 5)   ← only the latest is kept
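The compaction pass above, reduced to its essence as a sketch: keep only the highest-offset record per key (tombstones survive this pass and are removed later):

```python
def compact(log):
    """log: list of (offset, key, value) tuples in offset order."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)   # later offsets overwrite earlier
    return sorted(latest.values())           # back to offset order

log = [
    (0, "user1", "Hanoi"),
    (1, "user2", "Saigon"),
    (2, "user1", "Danang"),
    (3, "user3", "Hue"),
    (4, "user2", None),      # tombstone
    (5, "user1", "Dalat"),
]
compacted = compact(log)
```

Note that offsets are preserved — compaction removes records but never renumbers the survivors, which is why consumers can still seek by offset.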

When to use it?

  • Database changelog (CDC): each row is a key, the value is its latest state
  • Configuration store: each key is a config name, the value is the config value
  • User profile updates: each key is a user_id, the value is the latest profile

Pitfall: Log compaction does not happen immediately. There is a "dirty ratio" threshold. The compaction thread runs in the background and can generate heavy I/O load. In production, monitor disk I/O while compaction runs.


2.3.8 Coordination — Managing the cluster

ZooKeeper (traditional)

ZooKeeper is the external coordination service Kafka used from the beginning to:

| Function | Details |
|---|---|
| Broker registry | Each broker registers with ZooKeeper on startup. ZK knows which brokers are alive |
| Controller election | Elects 1 broker as the "Controller" — responsible for partition leader election |
| Topic/partition metadata | Stores the list of topics, partitions, and replica assignments |
| Consumer group state | (legacy) Stored consumer offsets and group membership. Since Kafka 0.9+, consumer offsets live in the internal __consumer_offsets topic |

Problems with ZooKeeper:

  • One more dependency: ZooKeeper is a separate system that must be deployed and operated on its own
  • Scaling bottleneck: a ZK cluster is typically only 3-5 nodes. Metadata updates become a bottleneck for large clusters (thousands of partitions)
  • Split-brain risk: if ZK and the broker cluster are separated by a network partition, two leaders can exist for the same partition
  • Operational complexity: you must know how to operate 2 systems: Kafka + ZooKeeper
KRaft (ZooKeeper-less) — The new direction

Since Kafka 3.3+, KRaft replaces ZooKeeper with internal Raft-based consensus:

| Aspect | ZooKeeper | KRaft |
|---|---|---|
| Architecture | External ZK cluster | Built-in — metadata stored in Kafka brokers |
| Metadata storage | ZK znodes | Internal __cluster_metadata topic (Raft log) |
| Controller | 1 broker elected by ZK | Quorum of controllers (3-5 brokers with the controller role) |
| Failover time | 10-30 seconds (ZK session timeout) | 1-3 seconds (Raft election) |
| Scaling | ZK is the bottleneck | Metadata replicated via Raft — scales better |
| Operations | 2 clusters (Kafka + ZK) | 1 cluster (Kafka only) |

Controller and metadata management in KRaft:

  • A group of brokers has the controller role (usually 3 or 5)
  • The controllers use Raft consensus to elect a leader and replicate metadata
  • The active controller handles all metadata changes (topic creation, leader election, etc.)
  • The other controllers are followers — ready to take over if the active controller dies

Aha Moment: KRaft is not simply "removing ZooKeeper". It fundamentally changes how Kafka manages metadata — from external coordination to internal log-based coordination. This is the broader trend in distributed systems: from outside to inside, from complexity toward simplicity.


2.3.9 Push vs Pull — Two consumption models

| Aspect | Pull (Kafka model) | Push (RabbitMQ model) |
|---|---|---|
| Who controls the rate? | Consumer — pulls data at its own pace | Broker — pushes data down to the consumer |
| Slow consumer? | No impact on the broker — data stays on disk | Broker must buffer or drop messages |
| Fast consumer? | Pulls continuously, never waits | Broker may not push fast enough |
| Batching | Consumer picks its own optimal batch size | Broker decides the batch — may not fit |
| Long polling | Consumer polls with a timeout — if there is no new data, wait a while, then poll again | Not needed — broker pushes as soon as data arrives |
| Back-pressure | Natural — consumer only pulls when ready | Needs a dedicated mechanism (prefetch count, flow control) |
| Replay | Easy — consumer just seeks to an offset | Hard — messages are deleted after ACK |
| Complexity | Consumer is more complex (manages its own polling) | Consumer is simpler (just receive and process) |

Why does our design choose Pull?

  1. The consumer decides its own rate — it never gets overwhelmed
  2. Optimal batching — the consumer pulls exactly the amount of data that fits
  3. Easy replay — seek to any offset
  4. Natural back-pressure — no special mechanism required
  5. Consumers scale independently — adding a consumer does not affect the broker

Drawback of Pull: when there is no new data, the consumer keeps polling anyway — wasted work. Solution: long polling — the consumer sends a poll request, and the broker holds it until new data arrives or the timeout fires.
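A sketch of the pull loop with long polling, assuming a hypothetical broker whose log is just a Python list — the point is that `fetch` blocks (broker-side) until data arrives or the timeout fires, so an idle consumer does not busy-poll:

```python
import time

def fetch(broker_log, offset, max_wait_s=0.5, min_records=1):
    """Simulated long poll: wait until records exist past `offset` or timeout."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        records = broker_log[offset:]
        if len(records) >= min_records:
            return records
        time.sleep(0.01)              # broker-side wait, not a consumer busy-loop
    return []                         # timeout: no new data

def consume(broker_log, start_offset, handler, polls=3):
    offset = start_offset
    for _ in range(polls):
        for record in fetch(broker_log, offset, max_wait_s=0.05):
            handler(record)
            offset += 1               # commit point: after processing
    return offset

log = ["a", "b", "c"]
seen = []
final_offset = consume(log, 0, seen.append)
```

The real Kafka equivalents of `max_wait_s` and `min_records` are the consumer's `fetch.max.wait.ms` and `fetch.min.bytes` settings.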


2.3.10 Dead Letter Queue (DLQ) — Handling "dead letters"

Problem: the consumer receives a message but cannot process it. For example:

  • The message is corrupt (bad format)
  • Business logic fails (invalid data)
  • A dependency is down (database down — retries keep failing)

If the consumer retries forever, it gets stuck on this one message and never processes the ones behind it. This is called a poison message.

Solution: Dead Letter Queue

Main Topic: orders
  ↓ (consumer reads)
  Consumer: process the message
    ↓ (success) → commit offset, continue
    ↓ (failure #1) → retry
    ↓ (failure #2) → retry
    ↓ (failure #3) → move to DLQ

DLQ Topic: orders-dlq
  ↓ (ops team reviews, fixes, re-publishes)

How to implement a DLQ:

  • The consumer has max.retries (e.g. 3)
  • After retries are exhausted, produce the message to the DLQ topic (e.g. orders-dlq)
  • Commit the original message's offset — the consumer moves on to the next message
  • The DLQ topic is monitored by the ops team
  • After the root cause is fixed, DLQ messages are re-published to the main topic

| Config | Description |
|---|---|
| Max retries | Number of retries before moving to the DLQ |
| Retry delay | Wait time between retries (exponential backoff) |
| DLQ topic name | Convention: {original-topic}-dlq |
| DLQ retention | Usually longer than the main topic (30-90 days) to leave time to investigate |
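The retry-then-DLQ flow above can be sketched as follows. `handler` and `dlq_publish` are hypothetical callables standing in for the business logic and a real producer:

```python
def process_with_dlq(topic, message, handler, dlq_publish, max_retries=3):
    """Returns 'ok' or 'dlq'. dlq_publish(topic, message) sends to the DLQ."""
    for attempt in range(1, max_retries + 1):
        try:
            handler(message)
            return "ok"
        except Exception as err:
            last_error = err          # real code: exponential backoff between tries
    # Retries exhausted: park the poison message in "<topic>-dlq" and move on.
    dlq_publish(f"{topic}-dlq", {"original": message, "error": str(last_error)})
    return "dlq"                      # caller commits the offset either way

dlq = []

def always_fails(msg):
    raise ValueError("bad format")

result = process_with_dlq("orders", "corrupt-msg", always_fails,
                          lambda t, m: dlq.append((t, m)))
```

The crucial design choice is that the offset is committed even on the DLQ path — the consumer never blocks on one bad message.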

Pitfall: Without a DLQ, a poison message will either block the consumer forever (if auto-commit is off) or be lost (if auto-commit is on and the consumer crashes before processing). Both cases are dangerous. Always design a DLQ for production systems.


2.3.11 Scaling — Growing the system

Adding a Broker

When the cluster needs more capacity (disk, CPU, network):

  1. Start the new broker and have it join the cluster
  2. The new broker holds no partitions yet — it is "empty"
  3. An admin runs partition reassignment — moving some partitions from old brokers to the new one
  4. During the move, data is replicated from the old broker to the new broker
  5. Once replication finishes, the new broker takes on its role (leader or follower) and the old broker releases the partition

Problem: partition reassignment consumes bandwidth and disk I/O. In production:

  • Run reassignment outside peak hours
  • Throttle the reassignment rate so production traffic is not affected
  • Monitor replication lag during the reassignment
Adding Partitions

When a topic needs more parallelism (more consumers reading concurrently):

  1. An admin increases the topic's partition count (e.g. from 6 to 12)
  2. The broker creates the new partitions across the brokers
  3. The consumer group rebalances — partitions are reassigned to consumers
  4. New messages are distributed across both old and new partitions

Serious problems:

  • Key ordering breaks: hash(key) % 6 differs from hash(key) % 12. Messages for a key that used to go to P0 may now go to P6. Per-key ordering is broken during the transition.
  • Partitions cannot be decreased: Kafka does not support reducing the partition count. Once increased, there is no going back.
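A quick demonstration of the routing break, using `zlib.crc32` as a stand-in for the producer's hash function (Kafka's default partitioner actually uses murmur2, but the modular-arithmetic effect is identical):

```python
import zlib

def partition_for(key, num_partitions):
    # Deterministic hash of the key, mapped onto a partition.
    return zlib.crc32(key.encode()) % num_partitions

# Which of 100 sample keys land on a DIFFERENT partition after 6 -> 12?
keys = [f"user{i}" for i in range(100)]
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 12)]
# Roughly half of all keys move (those whose hash % 12 >= 6), so per-key
# ordering guarantees break for them during the transition.
```

Mathematically, for any hash h, h % 12 is either h % 6 (key stays) or h % 6 + 6 (key moves to a new partition) — there is no way to grow the count without relocating some keys.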

Pitfall: Partition count is the hardest decision to change in a message queue. Choose too few — you cannot scale. Choose too many — you waste resources (each partition costs memory, file handles, and replication bandwidth). Rule of thumb: start with partition count = 2-3x the expected number of consumers, leaving room to scale.


3. Estimation — Sizing the system

3.1 Throughput Estimation

Assumptions:

| Parameter | Value | Explanation |
|---|---|---|
| Target throughput | 1,000,000 messages/sec | From the requirements |
| Average message size | 1 KB | Typical JSON event |
| Peak/Average ratio | 3x | Flash sales, events |
| Replication factor | 3 | 1 leader + 2 followers |

3.2 Storage Estimation

| Parameter | Value |
|---|---|
| Messages/day | 1,000,000 msg/sec x 86,400 sec = 86.4 billion messages |
| Data/day (raw) | 86.4B x 1 KB = 86.4 TB |
| Compression ratio | ~50% (lz4) |
| Data/day (compressed) | 86.4 TB x 0.5 = 43.2 TB |
| Replication factor | 3 |
| Data/day (with replication) | 43.2 TB x 3 = 129.6 TB |
| Retention | 7 days |
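The storage table above can be checked with quick arithmetic:

```python
# Storage estimate per the stated assumptions (1M msg/s, 1 KB, lz4 ~50%, RF=3).
msgs_per_sec = 1_000_000
msg_kb = 1

raw_tb_per_day = msgs_per_sec * 86_400 * msg_kb / 1e9   # KB -> TB
compressed = raw_tb_per_day * 0.5                       # ~50% with lz4
replicated = compressed * 3                             # replication factor 3
week = replicated * 7                                   # 7-day retention
# raw ~86.4 TB/day, replicated ~129.6 TB/day, 7-day total ~907 TB
```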

3.3 Network Bandwidth Estimation

(Conservatively using uncompressed sizes.)

Inbound (producer → broker): 1,000,000 msg/sec x 1 KB = 1 GB/sec

Replication bandwidth (leader → followers): 1 GB/sec x 2 followers = 2 GB/sec

Outbound (broker → consumers):

Assuming 3 consumer groups, each reading all the data: 1 GB/sec x 3 = 3 GB/sec

Total network per cluster: 1 + 2 + 3 = 6 GB/sec ≈ 48 Gbps → ~2.5 Gbps per broker across ~19 brokers.

Each broker needs a 10 Gbps NIC to have enough headroom.
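The per-broker figure in the summary (~2.5 Gbps) can be reproduced with this back-of-the-envelope, assuming ~19 brokers (from the storage-based estimate) and ignoring compression on the wire:

```python
# Network estimate: 1 GB/s of producer traffic, 2 followers, 3 consumer groups.
inbound = 1.0                      # GB/s: producers -> partition leaders
replication = inbound * 2          # each leader streams to 2 followers
outbound = inbound * 3             # 3 consumer groups, each reads everything

total_gbps = (inbound + replication + outbound) * 8   # GB/s -> Gbps
per_broker_gbps = total_gbps / 19                     # spread over ~19 brokers
# ~48 Gbps cluster-wide, ~2.5 Gbps per broker -> a 10 Gbps NIC leaves headroom
```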

3.4 Consumer Throughput Estimation

| Consumer processing | Time | Messages/sec per consumer |
|---|---|---|
| Simple log/forward | 0.1 ms/msg | 10,000 msg/sec |
| Database write | 1 ms/msg | 1,000 msg/sec |
| Complex processing | 5 ms/msg | 200 msg/sec |
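Dividing the 1M msg/sec target by each per-consumer rate gives the consumer count (and hence minimum partition count) each profile implies — note the complex-processing case needs far more than the summary's DB-write figure:

```python
# Consumers needed for each processing profile at the 1M msg/s target.
target = 1_000_000                               # messages/sec for the topic
per_consumer = {"log": 10_000, "db_write": 1_000, "complex": 200}

needed = {profile: target // rate for profile, rate in per_consumer.items()}
# 100 consumers for logging, 1,000 for DB writes, 5,000 for complex processing
```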

Aha Moment: The number of partitions must be >= the number of consumers. If you need 1,000 consumers, you need at least 1,000 partitions. This is why choosing the partition count is such a critical decision.

3.5 Estimation Summary

| Metric | Value |
|---|---|
| Write throughput | 1 GB/sec (3 GB/sec peak) |
| Total write (with replication) | 3 GB/sec (9 GB/sec peak) |
| Storage (7-day retention, RF=3, compressed) | ~907 TB (~0.9 PB) |
| Number of brokers | ~19+ (storage-based) |
| Network per broker | ~2.5 Gbps (needs a 10 Gbps NIC) |
| Consumers needed | 100 (simple) to 1,000 (DB write) |
| Partitions needed | >= max consumer count |

4. Security — Securing the message queue

4.1 Encryption

4.1.1 Encryption in Transit (TLS)

| Connection | TLS | Description |
|---|---|---|
| Producer → Broker | Yes | TLS 1.2/1.3. Prevents man-in-the-middle reads |
| Broker → Broker (replication) | Yes | Inter-broker replication must be encrypted too |
| Broker → Consumer | Yes | Consumers read messages over TLS |
| Broker → ZooKeeper/Controller | Yes | Metadata communication must be secure |

Note: TLS reduces throughput by roughly 10-30% due to encryption/decryption overhead. On a trusted internal network, PLAINTEXT can be considered for inter-broker replication for performance — but that is a trade-off.

4.1.2 Encryption at Rest

| Method | Description | When to use |
|---|---|---|
| Disk encryption (dm-crypt, LUKS) | Encrypts the whole disk — transparent to Kafka | Standard for every production deployment |
| File system encryption | Encrypts at the file-system level (eCryptfs, fscrypt) | When full disk encryption is unavailable |
| Message-level encryption | Producer encrypts the message body before sending | When the broker must not read the content (multi-tenant, sensitive data) |

Message-level encryption: the producer encrypts with the consumer's public key. The broker only sees ciphertext — it cannot read the data. The consumer decrypts with its private key. Used for:

  • Healthcare data (HIPAA)
  • Financial data (PCI-DSS)
  • Multi-tenant platforms (tenant A cannot read tenant B's data — not even a broker admin)

Pitfall: Message-level encryption breaks log compaction (the broker cannot read the key to compare). Weigh the trade-off between security and functionality.

4.2 Authentication — Verifying identity

| Method | Description | When to use |
|---|---|---|
| SASL/PLAIN | Username/password | Dev/test environments. Not safe for production (the password travels in plain text; must be paired with TLS) |
| SASL/SCRAM | Challenge-response (password never sent) | Production — safer than PLAIN |
| SASL/GSSAPI (Kerberos) | Enterprise SSO | Enterprise environments with Kerberos infrastructure |
| SASL/OAUTHBEARER | OAuth 2.0 tokens | Cloud-native environments, identity-provider integration |
| mTLS | Mutual TLS — client and server both present certificates | Zero-trust environments, service-to-service authentication |

Best practice: use mTLS for service-to-service (producers/consumers are services) and SASL/SCRAM or OAUTHBEARER for human users (admin tools, monitoring).

4.3 Authorization — Access control

ACL (Access Control List) per topic:

| Principal | Resource | Operation | Permission |
|---|---|---|---|
| order-service | Topic: orders | WRITE | ALLOW |
| order-service | Topic: payments | WRITE | DENY |
| analytics-service | Topic: orders | READ | ALLOW |
| analytics-service | Topic: orders | WRITE | DENY |
| admin | Topic: * | ALL | ALLOW |

Principle of least privilege: each service gets exactly the permissions it needs to do its job. The order service only writes to the orders topic and cannot read the payments topic.
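A sketch of evaluating the ACL table above with deny-by-default semantics — the table and function names are illustrative, not a real authorizer API:

```python
# The ACL table, keyed by (principal, topic, operation).
ACLS = {
    ("order-service", "orders", "WRITE"): "ALLOW",
    ("order-service", "payments", "WRITE"): "DENY",
    ("analytics-service", "orders", "READ"): "ALLOW",
    ("analytics-service", "orders", "WRITE"): "DENY",
}

def is_allowed(principal, topic, operation):
    if principal == "admin":
        return True                   # Topic: *, Operation: ALL
    rule = ACLS.get((principal, topic, operation))
    return rule == "ALLOW"            # no matching rule -> deny by default
```

Deny-by-default is what makes least privilege enforceable: a permission absent from the table is a denial, not an oversight.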

4.4 Audit Logging

| Event | Information to log |
|---|---|
| Topic created/deleted | Who, when, topic name, config |
| ACL changed | Who, when, old ACL, new ACL |
| Authentication failure | Who (attempted), when, IP address, reason |
| Admin operation | Who, when, operation (reassignment, config change) |

Audit logs must be shipped to an external system (SIEM, Elasticsearch) — never stored only on the broker (so an attacker cannot erase them).

Aha Moment: Security in message queues is often neglected because "it's an internal system". But if an attacker gains access to the broker, they can read every message from every service. The message queue is the central nervous system — protecting it protects the whole system.


5. DevOps — Operations and monitoring

5.1 Critical Metrics

5.1.1 Broker Health

| Metric | Threshold | Meaning |
|---|---|---|
| Under-replicated partitions | > 0 | Some partitions lack full replication. Alert immediately — data-loss risk |
| ISR shrink rate | > 0 | Followers are being dropped from the ISR. Broker overloaded or network issue |
| Active controller count | != 1 | There must be exactly 1 controller. 0 = nobody coordinating. >1 = split brain |
| Offline partitions | > 0 | Partition has no leader. Critical alert — data cannot be read or written |
| Request handler idle ratio | < 20% | Broker is overloaded — request threads nearly exhausted |

5.1.2 Producer Metrics

| Metric | Threshold | Meaning |
|---|---|---|
| Produce request rate | requests/sec | Producer throughput |
| Produce latency P99 | > 100ms | Slow writes — possibly broker overload or replication lag |
| Record error rate | > 0 | Producer sends are failing — broker rejects or network errors |
| Batch size average | < 1 KB | Batches too small — raise linger.ms to batch more |
| Compression ratio | > 0.8 | Compression ineffective — check the data pattern |

5.1.3 Consumer Metrics — The most important

| Metric | Threshold | Meaning |
|---|---|---|
| Consumer lag | > 10,000 messages | Consumer cannot keep up with producers. This is metric #1 |
| Consumer lag trend | Rising steadily | Consumer slower than producer — will eventually overflow memory/disk if not fixed |
| Commit rate | Sudden drop | Consumer may be stuck or crashed |
| Rebalance rate | > 1/hour | Rebalancing too often — consumer group unstable |
| Poll interval | > max.poll.interval.ms | Consumer gets kicked from the group — processing too slow |

Aha Moment: Consumer lag is the #1 metric for the whole message queue system. If lag keeps growing, messages are processed more slowly than they are produced — eventually retention expires and messages are deleted before the consumer reads them — data loss. Monitoring consumer lag is mandatory.
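Consumer lag per partition is simply the gap between the latest produced offset and the last committed offset. A minimal monitoring sketch (offsets here are made-up illustrative values):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Both args: dict of partition -> offset. Returns per-partition lag."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

# Latest offset the producer has written vs. last offset the group committed.
lags = consumer_lag({0: 1_000_000, 1: 1_000_500},
                    {0:   999_900, 1:   950_500})
total_lag = sum(lags.values())
# Partition 1 is 50,000 messages behind -> well over the 10,000 alert threshold
```

In practice this is what tools that read the __consumer_offsets topic compute for you; the alert fires on the per-partition maximum or the total, and above all on the trend.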

5.1.4 Infrastructure Metrics

| Metric | Threshold | Meaning |
|---|---|---|
| Disk usage | > 80% | Disk nearly full — add disks or reduce retention |
| Disk I/O wait | > 10% | Disk is the bottleneck — move to SSDs or reduce load |
| Network throughput | > 70% capacity | Network is the bottleneck — upgrade the NICs |
| CPU usage | > 70% | Usually compression/decompression |
| JVM GC pause | > 200ms | GC pauses make the broker unresponsive — client timeouts |
| File descriptor count | > 80% ulimit | Each partition + segment + connection costs 1 fd. Running out = broker crash |

5.2 Alerting Strategy

| Severity | Metric | Action |
|---|---|---|
| P1 — Critical | Offline partitions > 0 | Immediately page on-call. Data is inaccessible |
| P1 — Critical | Active controller count != 1 | Page on-call. Cluster has no coordinator |
| P2 — High | Under-replicated partitions > 0 (> 5 min) | Investigate broker health, disk, network |
| P2 — High | Consumer lag rising steadily (> 30 min) | Scale consumers or investigate the bottleneck |
| P3 — Medium | Disk usage > 80% | Plan capacity — add disks or reduce retention |
| P3 — Medium | ISR shrink rate > 0 | Check follower health and replication bandwidth |
| P4 — Low | Produce latency P99 > 100ms | Investigate — may be transient |
| P4 — Low | Rebalance rate > 1/hour | Check consumer stability and session timeout config |

5.3 Operational Runbook

5.3.1 Broker Failure

  1. Alert: under-replicated partitions rising
  2. Check: which broker is gone? (its broker.id is no longer in the cluster)
  3. Automatic: the controller elects new leaders from the ISR for the dead broker's partitions
  4. Action: restart the broker or replace the hardware. Once back online, it automatically rejoins and syncs data
  5. Monitor: ISR count returns to the replication factor — recovery complete

5.3.2 Consumer Lag Spike

  1. Alert: consumer lag > threshold
  2. Check: is the consumer running? What is the processing time per message?
  3. Short-term action: scale consumers (add instances). Note: you need enough partitions
  4. Long-term action: optimize the processing logic, increase the partition count (if necessary)
  5. Monitor: lag drains back toward 0

5.3.3 Disk Full

  1. Alert: disk usage > 80%
  2. Immediate action: reduce retention (e.g. from 7 days to 3) to free disk
  3. Medium-term action: add disks, add brokers
  4. Long-term action: implement tiered storage (hot/cold) or improve compression

5.4 Capacity Planning

| Cadence | Action |
|---|---|
| Weekly | Review consumer lag trends and disk growth rate |
| Monthly | Review throughput growth, plan broker additions |
| Quarterly | Review partition counts, rebalance strategy, retention policy |
| Yearly | Major capacity planning — hardware refresh, architecture review |

Pitfall: Many teams monitor only throughput and latency and forget consumer lag. The system looks "healthy" (no broker errors) while data is silently being lost because consumers cannot keep up and messages are deleted when retention expires. Consumer lag is the "silent killer" of message queues.


6. Mermaid Diagrams — Architecture overview

6.1 Broker Architecture with WAL Segments

flowchart TB
    subgraph "Broker 1"
        subgraph "Partition 0 (Leader)"
            direction TB
            SEG0["Segment 0<br/>Offset 0-999K<br/>📁 00000000.log<br/>📁 00000000.index<br/>📁 00000000.timeindex"]
            SEG1["Segment 1<br/>Offset 1M-1.99M<br/>📁 01000000.log<br/>📁 01000000.index<br/>📁 01000000.timeindex"]
            SEG2["Segment 2 (ACTIVE)<br/>Offset 2M-...<br/>📁 02000000.log<br/>📁 02000000.index<br/>📁 02000000.timeindex"]
            SEG0 --> SEG1 --> SEG2
        end

        subgraph "Write Path"
            PC["Page Cache<br/>(OS RAM)"]
            FL["Flush to Disk<br/>(async)"]
        end

        subgraph "Read Path"
            IDX["Index Lookup<br/>(mmap)"]
            ZC["Zero-Copy<br/>sendfile()"]
        end
    end

    P["Producer"] -->|"Append"| SEG2
    SEG2 -->|"Write"| PC
    PC -->|"Async"| FL

    C["Consumer"] -->|"Seek(offset)"| IDX
    IDX -->|"Position"| PC
    PC -->|"sendfile()"| ZC
    ZC -->|"Data"| C

    style SEG2 fill:#e53935,color:#fff
    style PC fill:#1e88e5,color:#fff
    style ZC fill:#43a047,color:#fff

6.2 Replication Flow — Leader + Followers + ISR

flowchart TB
    subgraph "Partition 0 Replication"
        P["Producer<br/>acks=all"] -->|"1. Send msg"| L["Broker 1<br/>LEADER<br/>LEO: 1000"]

        L -->|"2. Write WAL"| WAL["WAL<br/>(disk)"]
        L -->|"3. Replicate"| F1["Broker 2<br/>FOLLOWER<br/>LEO: 999<br/>✅ IN ISR"]
        L -->|"3. Replicate"| F2["Broker 3<br/>FOLLOWER<br/>LEO: 998<br/>✅ IN ISR"]
        L -.->|"3. Replicate"| F3["Broker 4<br/>FOLLOWER<br/>LEO: 500<br/>❌ OUT OF ISR<br/>(lag > 30s)"]

        F1 -->|"4. ACK"| L
        F2 -->|"4. ACK"| L
        L -->|"5. ACK to Producer<br/>(all ISR confirmed)"| P

        HW["High Watermark = 998<br/>(min LEO of ISR)"]
        L --- HW

        C["Consumer"] -->|"6. Read <= HW"| L
    end

    style L fill:#e53935,color:#fff
    style F1 fill:#1e88e5,color:#fff
    style F2 fill:#1e88e5,color:#fff
    style F3 fill:#757575,color:#fff
    style HW fill:#f57c00,color:#fff

6.3 Consumer Group Rebalancing (Cooperative)

flowchart TB
    subgraph "BEFORE — 2 consumers, 6 partitions"
        direction LR
        C1_B["Consumer 1<br/>P0, P1, P2"]
        C2_B["Consumer 2<br/>P3, P4, P5"]
    end

    subgraph "Consumer 3 joins"
        direction TB
        EV["Rebalance triggered"]
    end

    subgraph "AFTER — 3 consumers, 6 partitions (Cooperative)"
        direction LR
        C1_A["Consumer 1<br/>P0, P1<br/>(keeps P0,P1 — gives up P2)"]
        C2_A["Consumer 2<br/>P3, P4<br/>(keeps P3,P4 — gives up P5)"]
        C3_A["Consumer 3<br/>P2, P5<br/>(receives P2, P5)"]
    end

    C1_B --> EV
    C2_B --> EV
    EV --> C1_A
    EV --> C2_A
    EV --> C3_A

    style EV fill:#f57c00,color:#fff
    style C3_A fill:#43a047,color:#fff

6.4 End-to-End Message Flow (Detailed)

flowchart LR
    subgraph "Producer Side"
        APP["Application"] -->|"1. produce()"| SER["Serializer<br/>(JSON/Avro)"]
        SER -->|"2. serialize"| PART["Partitioner<br/>hash(key) % N"]
        PART -->|"3. assign partition"| BATCH["Batch Buffer<br/>(per partition)"]
        BATCH -->|"4. batch full<br/>or linger.ms"| COMP["Compressor<br/>(lz4/zstd)"]
        COMP -->|"5. compressed batch"| NET1["Network<br/>Send"]
    end

    subgraph "Broker Side"
        NET1 -->|"6. receive"| VAL["Validate<br/>CRC, auth, ACL"]
        VAL -->|"7. append"| WAL2["WAL<br/>(page cache)"]
        WAL2 -->|"8. replicate"| REP["Followers"]
        REP -->|"9. ACK"| ACK["ACK to<br/>Producer"]
    end

    subgraph "Consumer Side"
        POLL["poll()"] -->|"10. fetch"| WAL2
        WAL2 -->|"11. zero-copy"| DECOMP["Decompressor"]
        DECOMP -->|"12. decompress"| DESER["Deserializer"]
        DESER -->|"13. deserialize"| PROC["Process<br/>Message"]
        PROC -->|"14. commit"| OFFSET["Commit<br/>Offset"]
    end

    style BATCH fill:#1e88e5,color:#fff
    style WAL2 fill:#e53935,color:#fff
    style OFFSET fill:#43a047,color:#fff

7. Aha Moments & Pitfalls — What to remember

7.1 Aha Moments

| # | Insight | Explanation |
|---|---|---|
| 1 | The WAL is the foundation | The entire message queue is built on one simple idea: the append-only log. Appending to the end of a file is the fastest thing a disk can do. From there, everything else (replication, retention, replay) follows naturally |
| 2 | ISR balances durability vs availability | The ISR is not "all replicas" — it is "the replicas that are keeping up". The min.insync.replicas config decides how many replicas you can lose and still accept writes |
| 3 | Consumer lag is metric #1 | Broker throughput is high, latency is low — yet consumer lag is growing, so data is slowly being lost. This is the "silent killer". Monitor consumer lag before anything else |
| 4 | Partition count is hard to change | Increasing partitions changes key routing and breaks ordering. Decreasing is impossible. Choosing the right partition count up front is the most important architectural decision |
| 5 | Pull wins for streaming | The consumer decides its own rate, batch size, and when to read. Back-pressure is natural. Replay is easy. This is why Kafka (pull) won event streaming over RabbitMQ (push) |
| 6 | Zero-copy is the performance "secret" | No copying from kernel → user space → kernel. Data goes straight from page cache → network. Lower CPU usage, lower latency. This is how a message queue reaches millions of msg/sec on commodity hardware |
| 7 | The broker does not understand messages | The broker sees only bytes. No parsing, no content validation. This keeps it extremely fast and generic — suitable for any kind of data |
| 8 | Exactly-once is a combination of mechanisms | There is no "magic button". You need idempotent producer + transactions + idempotent consumer. Understanding each mechanism tells you when you actually need exactly-once and when at-least-once is enough |
7.2 Common Pitfalls

| # | Pitfall | Consequence | How to avoid |
|---|---|---|---|
| 1 | Not setting a message key | Round-robin → no ordering. Messages for the same entity land on different partitions | Always set key = entity ID (user_id, order_id) when ordering matters |
| 2 | Too few partitions | Consumers cannot scale. Throughput is capped | Choose partition count = 2-3x the expected number of consumers |
| 3 | acks=0 for critical data | Messages lost when a broker crashes | Use acks=all + min.insync.replicas=2 for all important data |
| 4 | Auto-commit with complex processing | Lost messages or duplicate processing | Use manual commit — commit only after processing completes |
| 5 | No DLQ | A poison message blocks the consumer forever | Always design a DLQ with max retries + exponential backoff |
| 6 | Raising the partition count casually | Key routing changes, per-key ordering breaks | Plan the partition count up front. If you must increase it, accept downtime for re-keying |
| 7 | Not monitoring consumer lag | Data lost when retention expires before the consumer reads it | Alert on consumer lag > threshold. This is a P2 alert |
| 8 | unclean.leader.election = true for financial data | Messages lost when the leader crashes | Disable unclean leader election for every important topic |
| 9 | Rebalance storm | The consumer group rebalances constantly → nobody processes messages | Raise session.timeout.ms and max.poll.interval.ms, use sticky assignment |
| 10 | Unencrypted inter-broker traffic | An attacker inside the internal network can read every message | TLS on every connection, including inter-broker |

8. Summary

8.1 Comparing our design with real Kafka

| Component | Our design | Apache Kafka | Notes |
|---|---|---|---|
| Storage | WAL + segments + index | Same | Kafka uses exactly this model |
| Replication | Leader-follower + ISR | Same | ISR is the Kafka team's innovation |
| Coordination | ZooKeeper/KRaft | ZK → KRaft | Kafka is migrating to KRaft |
| Consumer model | Pull + consumer groups | Same | Pull is Kafka's core design |
| Delivery semantics | At-most/least/exactly once | Same | Exactly-once since Kafka 0.11+ |
| Retention | Time/size/compaction | Same | The same 3 policies |
| Compression | gzip, snappy, lz4, zstd | Same | zstd since Kafka 2.1+ |

8.2 When should you use this message queue architecture?

| Use case | Fit? | Reason |
|---|---|---|
| Event streaming (logs, metrics, clickstream) | Excellent fit | High throughput, retention, replay |
| Microservice communication (async) | Good fit | Decoupling, at-least-once, consumer groups |
| Real-time analytics pipeline | Excellent fit | Multiple consumer groups read the same data |
| Task queue (background jobs) | Workable, but RabbitMQ may be better | The pull model is not optimal for short job queues |
| Request-reply pattern | Poor fit | The pull model is bad for synchronous communication |
| Small-scale system (< 1000 msg/sec) | Overkill | Redis Pub/Sub or SQS is simpler |

8.3 Links to other lessons

| Lesson | Relation |
|---|---|
| Tuan-08-Message-Queue | Complements this one — Week 08 teaches using an MQ, this lesson teaches designing one |
| Tuan-10-Consistent-Hashing | Partition assignment can use consistent hashing to reduce rebalance disruption |
| Tuan-07-Database-Sharding-Replication | The replication model (leader-follower, ISR) mirrors database replication. Sharding mirrors partitioning |
| Tuan-13-Monitoring-Observability | Consumer lag and broker health monitoring are direct applications |
| Tuan-14-AuthN-AuthZ-Security | SASL, mTLS, and ACLs in a message queue are applications of authn/authz patterns |
| Case-Design-Payment-System | A payment system uses a message queue with exactly-once semantics |
| Case-Design-Stock-Exchange | A stock exchange uses a message queue for event sequencing |

"Hieu, after this lesson you know how to design a distributed message queue from scratch — not just how to use one. When an interviewer asks 'Design Kafka', you can confidently justify every design decision: why an append-only log, why the pull model, why ISR, why the partition is the unit of parallelism. That is the difference between a Backend Dev and a System Architect."