Tuần 17: Design a Chat System

“A chat system looks simple — send a message from A to B. But when 50 million people send at once, every millisecond becomes a battle between consistency and availability.”

Tags: system-design chat-system websocket alex-xu case-study Student: Hieu Prerequisite: Tuan-02-Back-of-the-envelope · Tuan-08-Message-Queue · Tuan-03-Networking-DNS-CDN Related: Tuan-19-Design-Notification-System · Tuan-15-Data-Security-Encryption · Tuan-13-Monitoring-Observability · Tuan-07-Database-Sharding-Replication Reference: Alex Xu, System Design Interview — Chapter 12: Design a Chat System


Step 1 — Understand the Problem & Design Scope

1.1 Clarifying Questions

In an interview, always ask before you design. Below are the key questions with assumed answers:

| Question | Answer | Notes |
|---|---|---|
| 1-on-1 chat or group chat? | Both | Groups capped at 100 members |
| Mobile app, web app, or both? | Both | Multi-device support needed |
| How large is the scale? | 50M DAU | See Tuan-02-Back-of-the-envelope |
| Online/offline indicator? | Yes | Presence service |
| Message size limit? | 100,000 characters of text | Similar to Slack |
| Media support (images, files)? | Yes | Images, documents, files |
| End-to-end encryption? | Yes | For 1-on-1 chat |
| Message retention? | Forever | No auto-delete |
| Push notifications when offline? | Yes | FCM/APNs integration |
| Read receipts? | Yes | "Seen" indicator |

1.2 Functional Requirements

  • FR1: 1-on-1 chat with low latency (messages arrive in < 200ms within the same region)
  • FR2: Group chat with up to 100 members
  • FR3: Online/offline presence indicator
  • FR4: Push notifications for offline users
  • FR5: Media sharing (images, files) up to 25MB
  • FR6: Read receipts (delivered, seen)
  • FR7: Message history & search
  • FR8: Multi-device sync (read on phone and laptop at the same time)

1.3 Non-functional Requirements

  • NFR1: High availability — 99.99% uptime (< 52 minutes of downtime/year)
  • NFR2: Low latency — message delivery < 200ms (same region)
  • NFR3: Consistency — messages must stay in order within a conversation
  • NFR4: Durability — no sent message may be lost
  • NFR5: Scalability — 50M DAU, 2B messages/day

1.4 Capacity Estimation

Reuses the figures from Tuan-02-Back-of-the-envelope — Chat System section.

Assumptions:

| Parameter | Value | Rationale |
|---|---|---|
| DAU | 50M | WhatsApp/Telegram-like scale |
| Messages/user/day | 40 | Includes both 1-on-1 and group |
| Avg message size (text) | 100 bytes | UTF-8 encoded |
| % messages in groups | 60% | Group chat dominates |
| Avg group size | 50 members | Delivery fan-out factor |
| Media messages | 10% of total | Avg 200KB per media file |
| Retention | Forever | — |

QPS Calculation:

Messages/day = 50M users × 40 messages/user = 2B messages/day
Average write QPS = 2B / 86,400s ≈ 23K/s
Peak write QPS ≈ 3 × average ≈ 69K/s

Alert: 69K write QPS — this is a write-heavy system. A Message Queue is needed as a buffer → Tuan-08-Message-Queue.

Read QPS (each user reads more than they send; estimated at 5×):

Peak read QPS ≈ 5 × 69K ≈ 347K/s

Storage:

Text: 2B messages/day × 100 bytes = 200 GB/day ≈ 73 TB/year
Media: 2B × 10% × 200KB = 40 TB/day ≈ 14.6 PB/year

Concurrent WebSocket Connections:

Assume ~10% of DAU connected at any moment → 50M × 10% = 5M concurrent connections

Each WebSocket server handles ~65K connections (limited by file descriptors). At minimum:

5M connections / 65K per server ≈ 77 servers

Bandwidth:

Media dominates: 40 TB/day ÷ 86,400s ≈ 460 MB/s ≈ 3.7 Gbps average → peak (×3) ≈ 11 Gbps

Estimation Summary

| Metric | Value |
|---|---|
| Write QPS (peak) | ~69K/s |
| Read QPS (peak) | ~347K/s |
| Concurrent WebSocket connections | ~5M |
| Text storage/year | ~73 TB |
| Media storage/year | ~14.6 PB |
| WebSocket servers (minimum) | ~77 |
| Media bandwidth (peak) | ~11 Gbps |
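The arithmetic behind these numbers is easy to sanity-check with a short script. The constants mirror the assumptions table; the 10% concurrency ratio and ×3 peak factor are the working assumptions of this estimate:

```javascript
// capacity-estimate.js — quick sanity check of the estimation above
const DAU = 50e6;
const MSGS_PER_USER = 40;
const AVG_TEXT_BYTES = 100;
const MEDIA_RATIO = 0.10;
const AVG_MEDIA_BYTES = 200e3;
const PEAK_FACTOR = 3;          // assumption: peak ≈ 3× average
const CONCURRENCY = 0.10;       // assumption: ~10% of DAU online at once

const msgsPerDay = DAU * MSGS_PER_USER;                       // 2B messages/day
const avgWriteQps = msgsPerDay / 86400;                       // ~23K/s
const peakWriteQps = avgWriteQps * PEAK_FACTOR;               // ~69K/s
const peakReadQps = peakWriteQps * 5;                         // reads estimated at 5× writes
const textPerYearTB = msgsPerDay * AVG_TEXT_BYTES * 365 / 1e12;               // ~73 TB
const mediaPerYearPB = msgsPerDay * MEDIA_RATIO * AVG_MEDIA_BYTES * 365 / 1e15; // ~14.6 PB
const concurrentWs = DAU * CONCURRENCY;                       // 5M connections
const wsServers = Math.ceil(concurrentWs / 65000);            // ~77 servers

console.log({ peakWriteQps, peakReadQps, textPerYearTB, mediaPerYearPB, wsServers });
```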

Step 2 — High-Level Design

2.1 Communication Protocols — Choosing a Protocol

This is the most important architectural decision for a chat system. Hieu needs a clear view of the trade-offs:

HTTP Polling

Client → Server: "Any new messages?" (every 3 seconds)
Server → Client: "No." (99% of the time)

| Pros | Cons |
|---|---|
| Simplest approach | Wastes bandwidth & CPU |
| Stateless, easy to scale | Latency = polling interval |
| Works through any firewall | 50M users × 1 poll/3s = 16.7M req/s (!) |

Verdict: Not viable for a chat system at 50M DAU.

Long Polling (Hanging GET)

Client → Server: "Tell me when there's a new message" (request hangs)
Server: (waits until a message arrives or a 30s timeout)
Server → Client: "Here, a new message!"
Client → Server: (immediately sends a new request)

| Pros | Cons |
|---|---|
| More efficient than polling | Sender & receiver may land on different servers |
| HTTP/1.1 compatible | Timeout handling is complex |
| No WebSocket needed | Still consumes resources holding connections |

Verdict: Fallback option when WebSocket is unavailable.

WebSocket (Primary choice)

Client ←→ Server: Full-duplex, persistent connection
Upgrade: websocket (HTTP → WS handshake)
| Pros | Cons |
|---|---|
| Full-duplex (both directions) | Stateful → hard to scale horizontally |
| Lowest latency | Needs sticky sessions or a connection registry |
| Efficient (no HTTP overhead) | Firewalls/proxies may block it |
| Ideal for real-time | Reconnection logic is complex |

Verdict: Primary protocol for the chat system. WebSocket for message delivery, HTTP for everything else (auth, profile, search).

SSE (Server-Sent Events)

Server → Client: One-way stream (server push only)
Client → Server: Still uses HTTP POST to send messages

| Pros | Cons |
|---|---|
| Simpler than WebSocket | One-way only (server → client) |
| Auto-reconnect built in | Needs HTTP POST for client → server |
| Works over HTTP/2 | Limited to 6 connections/domain (HTTP/1.1) |

Verdict: Suitable for notifications/presence, not ideal for full chat.

Protocol Decision Matrix

| Criterion | Polling | Long Polling | WebSocket | SSE |
|---|---|---|---|---|
| Latency | High | Medium | Lowest | Low |
| Bidirectional | No | No | Yes | No |
| Scalability | Easy | Medium | Hard | Medium |
| Battery (mobile) | Bad | Medium | Good | Good |
| Firewall friendly | Good | Good | Medium | Good |

Decision: Use WebSocket for real-time messaging + presence. Use HTTP REST for user profile, search, media upload, group management.

2.2 Service Decomposition

flowchart TB
    subgraph "Client Layer"
        WEB[Web Client]
        MOB[Mobile Client]
    end

    subgraph "API Gateway / Load Balancer"
        LB[Load Balancer<br/>Nginx / ALB]
    end

    subgraph "Stateless Services (HTTP)"
        AUTH[Auth Service<br/>JWT / OAuth2]
        USER[User Service<br/>Profile, Contacts]
        GROUP[Group Service<br/>Group CRUD, Members]
        MEDIA[Media Service<br/>Upload, Download]
        SEARCH[Search Service<br/>Message Search]
        NOTIF[Notification Service<br/>FCM / APNs]
    end

    subgraph "Stateful Services (WebSocket)"
        CS1[Chat Server 1<br/>WebSocket]
        CS2[Chat Server 2<br/>WebSocket]
        CS3[Chat Server N<br/>WebSocket]
    end

    subgraph "Presence Layer"
        PRES[Presence Service<br/>Online/Offline/Typing]
    end

    subgraph "Message Layer"
        MQ[Message Queue<br/>Kafka / RabbitMQ]
        SYNC[Message Sync Service<br/>Offline → Online delivery]
    end

    subgraph "Storage Layer"
        CASS[(Cassandra<br/>Messages)]
        PG[(PostgreSQL<br/>Users, Groups)]
        REDIS[(Redis<br/>Presence, Sessions)]
        S3[(S3 / MinIO<br/>Media Files)]
        ES[(Elasticsearch<br/>Message Search)]
    end

    subgraph "CDN"
        CDN[CloudFront / Cloudflare<br/>Media Delivery]
    end

    WEB & MOB --> LB
    LB --> AUTH & USER & GROUP & MEDIA & SEARCH
    WEB & MOB -.->|WebSocket| CS1 & CS2 & CS3
    CS1 & CS2 & CS3 --> MQ
    MQ --> CS1 & CS2 & CS3
    MQ --> NOTIF
    MQ --> SYNC
    CS1 & CS2 & CS3 --> PRES
    PRES --> REDIS
    CS1 & CS2 & CS3 --> CASS
    USER & GROUP --> PG
    MEDIA --> S3
    S3 --> CDN
    SEARCH --> ES
    SYNC --> CASS
    NOTIF -.->|Push| MOB

2.3 Message Flow — 1-on-1 Chat

sequenceDiagram
    participant A as User A (Sender)
    participant CS1 as Chat Server 1
    participant MQ as Message Queue (Kafka)
    participant CS2 as Chat Server 2
    participant B as User B (Receiver)
    participant DB as Cassandra
    participant NS as Notification Service

    A->>CS1: 1. Send message via WebSocket
    CS1->>CS1: 2. Validate & generate message_id (ULID)
    CS1->>MQ: 3. Publish to message queue
    CS1-->>A: 4. ACK (message accepted)

    par Parallel processing
        MQ->>DB: 5a. Persist message
        MQ->>CS2: 5b. Route to receiver's chat server
    end

    alt User B is ONLINE
        CS2->>B: 6a. Deliver via WebSocket
        B-->>CS2: 7a. ACK (received)
    else User B is OFFLINE
        MQ->>NS: 6b. Trigger push notification
        NS->>B: 7b. Push via FCM/APNs
    end

2.4 Message Flow — Group Chat

sequenceDiagram
    participant A as User A (Sender)
    participant CS1 as Chat Server 1
    participant MQ as Message Queue (Kafka)
    participant FO as Fan-out Service
    participant CS2 as Chat Server 2
    participant CS3 as Chat Server 3
    participant B as User B (Online)
    participant C as User C (Online)
    participant NS as Notification Service
    participant D as User D (Offline)

    A->>CS1: 1. Send group message
    CS1->>MQ: 2. Publish (topic: group_123)
    CS1-->>A: 3. ACK

    MQ->>FO: 4. Fan-out service picks up
    FO->>FO: 5. Lookup group members & their chat servers

    par Fan-out to online members
        FO->>CS2: 6a. Route to User B's server
        CS2->>B: 7a. Deliver via WebSocket
        FO->>CS3: 6b. Route to User C's server
        CS3->>C: 7b. Deliver via WebSocket
    end

    FO->>NS: 6c. Notify offline members
    NS->>D: 7c. Push notification

2.5 Hybrid Approach: Why not just REST?

Hieu, good question: why not use REST for everything?

| Operation | Protocol | Reason |
|---|---|---|
| Send/receive messages | WebSocket | Real-time, bidirectional |
| User login/signup | HTTP REST | One-time, stateless |
| Upload media | HTTP REST (multipart) | Large payload, not real-time |
| Search messages | HTTP REST | Request-response pattern |
| Update profile | HTTP REST | Infrequent, stateless |
| Presence updates | WebSocket (piggyback) | Continuous, low overhead |
| Push notifications | HTTP (server-side) | Server → FCM/APNs |

Step 3 — Deep Dive

3.1 Message Storage — SQL vs NoSQL

This is the most important database decision for the chat system.

Why NOT SQL (PostgreSQL/MySQL)?

| Problem | Explanation |
|---|---|
| Write-heavy workload | 69K writes/s — relational DBs struggle at this level |
| Append-only pattern | Chat messages are almost all INSERT, rarely UPDATE/DELETE |
| Time-series nature | Messages are sequential in time, like time-series data |
| Horizontal scaling | SQL sharding is complex and requires manual re-sharding |
| Storage volume | 73TB of text/year — beyond a single cluster |

Why NoSQL (Cassandra / HBase)?

| Advantage | Explanation |
|---|---|
| Write-optimized | LSM-tree architecture, sequential writes |
| Horizontal scaling | Add nodes; data redistributes automatically |
| Time-series friendly | Partition by channel + sort by timestamp |
| High availability | Tunable consistency (ONE, QUORUM, ALL) |
| Proven at scale | Discord uses Cassandra, Facebook uses HBase |

Message Schema (Cassandra)

-- messages table: partition by channel_id, cluster by message_id (time-ordered)
CREATE TABLE messages (
    channel_id    UUID,          -- 1-on-1 or group channel
    message_id    TIMEUUID,      -- Time-based UUID, sorts by time automatically
    sender_id     UUID,
    content       TEXT,
    content_type  TEXT,           -- 'text', 'image', 'file', 'system'
    media_url     TEXT,           -- NULL if text-only
    created_at    TIMESTAMP,
    updated_at    TIMESTAMP,     -- NULL if never edited
    deleted       BOOLEAN,       -- Soft delete
 
    PRIMARY KEY (channel_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': 1}
  AND default_time_to_live = 0;  -- No TTL, retain forever

Partition strategy explained:

  • Partition key: channel_id — all messages of a conversation live on the same partition
  • Clustering key: message_id (TIMEUUID) — messages sort by time automatically
  • TimeWindowCompactionStrategy — optimized for time-series data, reduces read amplification

Query patterns:

-- Fetch the 50 most recent messages in a channel
SELECT * FROM messages
WHERE channel_id = ?
ORDER BY message_id DESC
LIMIT 50;
 
-- Fetch older messages (pagination/scroll up)
SELECT * FROM messages
WHERE channel_id = ?
AND message_id < ?    -- cursor-based pagination
ORDER BY message_id DESC
LIMIT 50;

Hot Partition Problem

If one group chat is very active (e.g., a 100-member group with 10K messages/day), its partition grows too large.

Solution: time-based bucketing

CREATE TABLE messages_bucketed (
    channel_id    UUID,
    bucket        TEXT,           -- '2026-03', '2026-04' (monthly bucket)
    message_id    TIMEUUID,
    sender_id     UUID,
    content       TEXT,
    content_type  TEXT,
    media_url     TEXT,
    created_at    TIMESTAMP,
 
    PRIMARY KEY ((channel_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

Partition key = (channel_id, bucket) → each month starts a fresh partition, avoiding unbounded growth.
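As an illustration, a bucket key under this scheme is trivial to compute; the 'YYYY-MM' format follows the schema above, and the helper names are assumptions for the sketch:

```javascript
// bucket-key.js — illustrative helpers for the monthly bucketing scheme above
function bucketFor(epochMs) {
  const d = new Date(epochMs);
  const month = String(d.getUTCMonth() + 1).padStart(2, '0');
  return `${d.getUTCFullYear()}-${month}`; // e.g. '2026-03'
}

// To paginate backwards across months, walk buckets newest → oldest:
function previousBucket(bucket) {
  let [year, month] = bucket.split('-').map(Number);
  month -= 1;
  if (month === 0) { month = 12; year -= 1; }
  return `${year}-${String(month).padStart(2, '0')}`;
}
```

When a page of 50 messages is not filled from the current bucket, the client keeps querying `previousBucket(...)` until it is.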

3.2 Message ID Generation — Ordering Guarantee

Why a custom ID instead of auto-increment?

| Auto-increment | Custom ID (Snowflake/ULID) |
|---|---|
| Single point of failure | Distributed generation |
| Bottleneck at large scale | Each server generates its own |
| No timestamp info | Timestamp embedded → sortable |

ULID (Universally Unique Lexicographically Sortable Identifier)

 01ARZ3NDEKTSV4RRFFQ69G5FAV
 |----------|  |------------|
  Timestamp     Randomness
   48 bits       80 bits
   (ms)

Why ULID works well for chat:

  • Lexicographically sortable → messages order themselves by time
  • 128 bits → extremely low collision probability
  • Monotonic — strictly increasing even within the same millisecond
  • No coordination needed — each server generates independently

Snowflake ID (Twitter style)

|  1 bit  |  41 bits   | 10 bits  |  12 bits  |
|  unused | timestamp  | machine  | sequence  |
|         |  (ms)      |    ID    |  number   |
  • 41 bits timestamp → ~69 years
  • 10 bits machine ID → 1024 servers
  • 12 bits sequence → 4096 IDs/ms/server

Decision: Use ULID for message IDs (simpler than Snowflake; no machine-ID coordination needed).
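For comparison, a Snowflake-style generator is small enough to sketch. The field widths follow the layout above; the custom epoch and function names are illustrative, not taken from any particular library:

```javascript
// snowflake.js — illustrative Snowflake-style ID generator (64-bit via BigInt)
// Layout: 41-bit ms timestamp | 10-bit machine ID | 12-bit sequence
const EPOCH = 1577836800000n; // custom epoch (2020-01-01) — an assumption for the sketch

function makeGenerator(machineId) {
  let lastMs = -1n;
  let seq = 0n;
  return function nextId(nowMs = BigInt(Date.now())) {
    if (nowMs === lastMs) {
      seq = (seq + 1n) & 0xFFFn;          // 12-bit sequence wraps at 4096 IDs/ms
    } else {
      seq = 0n;
      lastMs = nowMs;
    }
    return ((nowMs - EPOCH) << 22n) |     // 41 bits of timestamp
           (BigInt(machineId) << 12n) |   // 10 bits of machine ID
           seq;                            // 12 bits of sequence
  };
}
```

IDs from one generator are strictly increasing, so sorting by ID sorts by send time — the same property ULID gives without the machine-ID bookkeeping.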

3.3 Presence Service — Online/Offline Indicator

flowchart TB
    subgraph "Client"
        C[Mobile/Web Client]
    end

    subgraph "Presence Service"
        PS[Presence Server]
        HB[Heartbeat Handler]
    end

    subgraph "Storage"
        REDIS[(Redis<br/>user_id → status, last_active)]
    end

    subgraph "Notification"
        PUB[Redis Pub/Sub<br/>Presence Channel]
    end

    subgraph "Subscribers"
        CS1[Chat Server 1<br/>→ User A's friends online]
        CS2[Chat Server 2<br/>→ User B's friends online]
    end

    C -->|WebSocket connect| PS
    PS -->|Set ONLINE| REDIS
    PS -->|Publish status change| PUB

    C -->|Heartbeat every 5s| HB
    HB -->|Update last_active TTL| REDIS

    C -->|WebSocket disconnect| PS
    PS -->|Set OFFLINE| REDIS
    PS -->|Publish status change| PUB

    PUB --> CS1 & CS2

Heartbeat Mechanism

User connects via WebSocket → status = ONLINE
Every 5 seconds: client sends heartbeat ping
Server: reset TTL in Redis (key: presence:{user_id}, TTL: 30s)

If no heartbeat for 30s → key expires → status = OFFLINE

Redis data structure:

# Presence status
SET presence:user_123 "online" EX 30      # Auto-expire after 30s
SET last_active:user_123 "1679123456789"   # Epoch ms

# Batch check: who in the friend list is online?
MGET presence:user_001 presence:user_002 presence:user_003 ...
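The TTL semantics can be mimicked in a few lines of in-memory code — a stand-in for the Redis commands above, with all names illustrative:

```javascript
// presence.js — in-memory sketch of the Redis TTL semantics above
const TTL_MS = 30_000;
const presence = new Map(); // userId → expiry timestamp (ms)

function heartbeat(userId, nowMs = Date.now()) {
  presence.set(userId, nowMs + TTL_MS); // refresh TTL, like SET presence:{user} EX 30
}

function isOnline(userId, nowMs = Date.now()) {
  const expiry = presence.get(userId);
  if (expiry === undefined || expiry <= nowMs) {
    presence.delete(userId); // lazy expiry, like Redis key expiration
    return false;
  }
  return true;
}

// Batch check, like MGET presence:user_001 presence:user_002 ...
function whoIsOnline(userIds, nowMs = Date.now()) {
  return userIds.filter((id) => isOnline(id, nowMs));
}
```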

Presence Fan-out Problem

When User A logs in (status → ONLINE), who needs to know?

  • All of User A's friends/contacts.

If User A has 500 friends → 500 presence pushes. If 100K users log in at the same time (9 a.m.) → 100K × 500 = 50M presence events/s (!)

Solutions for large fan-out:

| Strategy | Description | When to use |
|---|---|---|
| Eager (fan-out on write) | Push the status change to all friends immediately | Friends list < 500 |
| Lazy (fan-out on read) | Client fetches presence itself when opening a chat | Friends list > 500 |
| Hybrid | Eager for close friends, lazy for the rest | Most systems |
| Subscribe on view | Subscribe to presence only while a chat window is open | Mobile (saves battery) |

Decision: Use Subscribe on view — the client subscribes only to presence for conversations currently visible on screen, drastically reducing event volume.

3.4 Group Chat Optimization

Small Groups (< 100 members): Fan-out on Write

User A sends a message to a group (50 members):
1. The message is written to the messages table (1 write)
2. Fan-out: copy the message_id into each member's inbox (50 writes)
3. Each member receives a push over WebSocket

Total writes per message: 1 + 50 = 51

Pros: fast reads (each user reads only their own inbox), simple. Cons: write amplification (1 message → N writes).

Large Groups (> 100 members, if extended later): Fan-out on Read

User A sends a message to a group (10,000 members):
1. The message is written to the messages table (1 write)
2. NO fan-out, no copies
3. When a member opens the group → query the messages table directly

Total writes per message: 1
Total reads per view: 1 query (but more expensive)

Pros: very fast writes, no amplification. Cons: slower reads, needs caching.

Inbox Table (Fan-out on Write approach)

-- User inbox: each user keeps a list of message pointers
CREATE TABLE user_inbox (
    user_id       UUID,
    channel_id    UUID,
    message_id    TIMEUUID,
    sender_id     UUID,
    preview       TEXT,          -- First 100 chars for notification
    is_read       BOOLEAN,
    created_at    TIMESTAMP,
 
    PRIMARY KEY (user_id, created_at, message_id)  -- message_id breaks ties within one timestamp
) WITH CLUSTERING ORDER BY (created_at DESC, message_id DESC);
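A minimal sketch of the fan-out-on-write path, with in-memory Maps standing in for the Cassandra tables (all names illustrative):

```javascript
// fanout.js — sketch of fan-out on write: 1 message write + N inbox writes
const messages = new Map();   // channelId → [message, ...]   (messages table)
const inboxes = new Map();    // userId → [pointer, ...]      (user_inbox table)

function sendGroupMessage(channelId, members, message) {
  // 1. One write to the messages table
  if (!messages.has(channelId)) messages.set(channelId, []);
  messages.get(channelId).push(message);

  // 2. Fan-out: one inbox write per member (write amplification = members.length)
  for (const userId of members) {
    if (!inboxes.has(userId)) inboxes.set(userId, []);
    inboxes.get(userId).push({
      channelId,
      messageId: message.messageId,
      preview: message.content.slice(0, 100), // first 100 chars, as in the schema
    });
  }
  return 1 + members.length; // total writes, e.g. 51 for a 50-member group
}
```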

3.5 Media Handling — Pre-signed URL Upload

sequenceDiagram
    participant C as Client
    participant API as API Gateway
    participant MS as Media Service
    participant S3 as S3 / Object Storage
    participant TH as Thumbnail Service
    participant CDN as CDN (CloudFront)
    participant CS as Chat Server
    participant R as Receiver

    C->>API: 1. Request upload URL (filename, size, type)
    API->>MS: 2. Validate (file size < 25MB, allowed types)
    MS->>S3: 3. Generate pre-signed URL (expires in 15 min)
    S3-->>MS: 4. Pre-signed URL
    MS-->>C: 5. Return upload URL

    C->>S3: 6. Upload file directly to S3 (bypass backend!)
    S3-->>C: 7. Upload complete (200 OK)

    S3->>TH: 8. S3 Event → trigger thumbnail generation
    TH->>S3: 9. Save thumbnail (compressed, resized)

    C->>CS: 10. Send message with media_url via WebSocket
    CS->>R: 11. Deliver message with CDN URL

    R->>CDN: 12. Load image/file from CDN
    CDN->>S3: 13. Cache miss → fetch from S3
    CDN-->>R: 14. Serve media (cached)

Why Pre-signed URLs?

| Direct upload through backend | Pre-signed URL (S3 direct) |
|---|---|
| Backend becomes the bottleneck | Client uploads straight to S3 |
| Consumes server bandwidth | Server only generates the URL |
| Timeout risk with large files | S3 handles large files natively |
| Hard to scale | S3 scales virtually without limit |

Thumbnail Generation:

Original image (2MB, 3000x2000)
  ↓ S3 Event Notification
Lambda/Worker:
  → Thumbnail small: 150x150 (5KB) — for the chat list
  → Thumbnail medium: 600x400 (50KB) — for the chat view
  → Keep original — for "View full size"

3.6 Read Receipts — Per-user Read Pointer

Data Model

-- Each user keeps one "read pointer" per channel
CREATE TABLE read_receipts (
    channel_id    UUID,
    user_id       UUID,
    last_read_message_id  TIMEUUID,   -- Pointer to the last message read
    updated_at    TIMESTAMP,
 
    PRIMARY KEY (channel_id, user_id)
);

Flow:

1. User B opens the conversation → client sends "mark as read" with last_message_id
2. Server: UPDATE read_receipts SET last_read_message_id = X
3. Server pushes a read-receipt event to User A (the sender)
4. User A's client shows "✓✓ Seen"

Optimization:

  • Batch updates: Don't send a read receipt for EVERY message. Send only when the user scrolls or opens the chat (debounced 2-3 seconds).
  • Group read receipts: Show "Seen by 5 people" instead of fanning out to every member.

Unread Count Calculation

-- Count unread messages for user X in channel Y.
-- CQL has no subqueries, so this takes two round trips:
SELECT last_read_message_id FROM read_receipts
WHERE channel_id = ? AND user_id = ?;
 
SELECT COUNT(*) FROM messages
WHERE channel_id = ?
AND message_id > ?;   -- bind last_read_message_id from the first query

Optimization: cache the unread count in Redis. Each new message → INCR unread:{user_id}:{channel_id}. On read → DEL unread:{user_id}:{channel_id}.
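The INCR/DEL pattern, mimicked in memory (a stand-in for the Redis keys above; names illustrative):

```javascript
// unread-counter.js — in-memory sketch of the Redis INCR/DEL pattern above
const unread = new Map(); // `${userId}:${channelId}` → count

function onNewMessage(userId, channelId) {
  const key = `${userId}:${channelId}`;
  unread.set(key, (unread.get(key) ?? 0) + 1); // like INCR unread:{user}:{channel}
}

function onChannelRead(userId, channelId) {
  unread.delete(`${userId}:${channelId}`);     // like DEL unread:{user}:{channel}
}

function unreadCount(userId, channelId) {
  return unread.get(`${userId}:${channelId}`) ?? 0;
}
```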

3.7 Push Notification Integration

flowchart LR
    subgraph "Message Pipeline"
        MQ[Kafka Topic:<br/>messages]
    end

    subgraph "Notification Service"
        NC[Notification Consumer]
        PC[Presence Check]
        DT[Device Token Store<br/>Redis/PostgreSQL]
        RL[Rate Limiter<br/>Max 1 push/5s per user]
    end

    subgraph "Push Providers"
        FCM[Firebase Cloud Messaging<br/>Android]
        APNS[Apple Push Notification<br/>iOS]
        WPN[Web Push<br/>Browser]
    end

    MQ --> NC
    NC --> PC
    PC -->|User OFFLINE| DT
    DT --> RL
    RL --> FCM & APNS & WPN
    PC -->|User ONLINE| NC
    NC -->|Skip push| NC

    style PC fill:#f9a825

Offline Message Queue:

While user B is offline, messages accumulate. When B comes back online:

1. B connects via WebSocket
2. Chat server queries: messages WHERE channel_id IN (B's channels)
   AND message_id > B's last_synced_message_id
3. Deliver all pending messages (batched)
4. Update B's sync pointer
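Step 2 of this sync can be sketched as a filter over the user's channels; because ULIDs sort lexicographically, plain string comparison works as the cursor (all names are illustrative):

```javascript
// offline-sync.js — replay messages newer than the user's sync pointer
function pendingMessages(channelMessages, userChannels, lastSyncedId) {
  const pending = [];
  for (const channelId of userChannels) {
    for (const msg of channelMessages.get(channelId) ?? []) {
      // ULID string comparison ≡ time comparison (fixed length, lexicographic)
      if (msg.messageId > lastSyncedId) pending.push(msg);
    }
  }
  // Deliver oldest-first so the client appends in order
  return pending.sort((a, b) => (a.messageId < b.messageId ? -1 : 1));
}
```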

3.8 Message Search — Elasticsearch

// Elasticsearch index mapping
{
  "mappings": {
    "properties": {
      "channel_id": { "type": "keyword" },
      "message_id": { "type": "keyword" },
      "sender_id": { "type": "keyword" },
      "content": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "created_at": { "type": "date" },
      "content_type": { "type": "keyword" }
    }
  }
}

Search flow: Message written to Cassandra → CDC (Change Data Capture) → Kafka → Elasticsearch Indexer → ES.

Do not search Cassandra directly — Cassandra supports only primary-key lookups, not full-text search.

3.9 End-to-End Encryption (E2E) — Signal Protocol Overview

flowchart LR
    subgraph "User A"
        A_PK[Public Key A]
        A_SK[Private Key A]
        A_ENC[Encrypt with<br/>shared secret]
    end

    subgraph "Server (Cannot Read)"
        SRV[Server stores<br/>encrypted blob only]
    end

    subgraph "User B"
        B_PK[Public Key B]
        B_SK[Private Key B]
        B_DEC[Decrypt with<br/>shared secret]
    end

    A_PK -->|Key Exchange<br/>X3DH Protocol| B_PK
    B_PK --> A_ENC
    A_ENC -->|Encrypted message| SRV
    SRV -->|Encrypted message| B_DEC
    A_SK --> A_ENC
    B_SK --> B_DEC

    style SRV fill:#ff6666,color:#fff

Signal Protocol flow (simplified):

  1. Key Exchange (X3DH — Extended Triple Diffie-Hellman):

    • Users A and B exchange public keys through the server
    • They derive a shared secret the server does NOT know
  2. Double Ratchet Algorithm:

    • Every message is encrypted with a different key
    • If one key is compromised, only one message is exposed (forward secrecy)
  3. Server's role: relay encrypted blobs only. It cannot read the content.

Trade-off: with E2E encryption, server-side search cannot work. Clients must search locally.


4. Security

4.1 End-to-End Encryption (Details)

| Aspect | Implementation |
|---|---|
| Protocol | Signal Protocol (WhatsApp, Signal) |
| Key exchange | X3DH (Extended Triple Diffie-Hellman) |
| Message encryption | AES-256-GCM |
| Key rotation | Double Ratchet — a fresh key per message |
| Group E2E | Sender Keys protocol (each member has a sender key) |
| Key backup | Client-side encrypted backup (user's passphrase) |

4.2 Message Retention Policy

# retention-policy.yml
retention:
  text_messages:
    1_on_1: forever          # User can still delete manually
    group: forever
    deleted_by_user:
      soft_delete: true      # Mark as deleted, keep 30 days for compliance
      hard_delete_after: 30d
  media:
    original: 1y             # Keep originals for 1 year
    thumbnails: forever
    after_expiry: delete_original_keep_thumbnail
  audit_logs:
    retention: 7y            # Compliance requirement
    storage: cold_storage    # S3 Glacier after 90 days

4.3 Spam & Abuse Detection

Message pipeline:
  User sends message
    → Rate limiter (max 60 messages/minute per user)
    → Content filter (regex: phone numbers, URLs → flag)
    → ML spam classifier (real-time inference, < 10ms)
    → If flagged → hold for review, notify user
    → If clean → deliver normally

| Layer | Technique | Latency Budget |
|---|---|---|
| Rate limiting | Token bucket per user | < 1ms |
| Keyword filter | Bloom filter + regex | < 2ms |
| ML classifier | TensorFlow Serving (text classification) | < 10ms |
| Image scanning | PhotoDNA / perceptual hashing | < 100ms (async) |
| Human review | Queue for flagged content | Async |
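A token-bucket limiter for the 60 messages/minute rule takes only a few lines. This sketch keeps state in memory; a real deployment would keep the bucket in Redis so all chat servers share it:

```javascript
// rate-limit.js — token-bucket sketch for the 60 messages/minute limit above
const CAPACITY = 60;               // burst size
const REFILL_PER_MS = 60 / 60_000; // 60 tokens per minute

const buckets = new Map(); // userId → { tokens, lastMs }

function allowMessage(userId, nowMs = Date.now()) {
  const b = buckets.get(userId) ?? { tokens: CAPACITY, lastMs: nowMs };
  // Refill proportionally to elapsed time, capped at the bucket capacity
  b.tokens = Math.min(CAPACITY, b.tokens + (nowMs - b.lastMs) * REFILL_PER_MS);
  b.lastMs = nowMs;
  if (b.tokens < 1) { buckets.set(userId, b); return false; } // rate-limited
  b.tokens -= 1;
  buckets.set(userId, b);
  return true;
}
```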

4.4 CSAM Scanning — Ethical Considerations

CSAM = Child Sexual Abuse Material. Scanning for it is legally required in many countries.

| Approach | Description | Trade-off |
|---|---|---|
| PhotoDNA (Microsoft) | Hash-based matching against known CSAM database | Privacy-preserving (hashes only, no image viewing) |
| Client-side scanning | Scan on device before upload | Controversial (Apple withdrew it) — conflicts with E2E |
| Server-side scanning | Scan after upload, before delivery | Doesn't work with E2E-encrypted media |
| Report-based | Scan only when a user reports | Misses proactive detection |

Ethical dilemma: E2E encryption protects privacy but also shields criminals. There is no perfect answer. Most platforms (WhatsApp, Telegram) rely on metadata analysis + user reporting instead of breaking E2E.


5. DevOps

5.1 WebSocket Scaling with Sticky Sessions

Problem: a WebSocket is a stateful connection. If the load balancer routes to a different server, the connection is lost.

Solution 1: Sticky Sessions (Layer 7)

# nginx.conf — WebSocket sticky sessions
upstream chat_servers {
    ip_hash;  # Same client IP → same server
    server chat-server-1:8080;
    server chat-server-2:8080;
    server chat-server-3:8080;
}
 
server {
    listen 443 ssl;
    server_name chat.example.com;
 
    location /ws {
        proxy_pass http://chat_servers;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 86400s;  # 24h timeout for WebSocket
        proxy_send_timeout 86400s;
    }
}

Solution 2: Connection Registry (recommended at large scale)

Redis connection registry:
  ws:conn:user_123 → chat-server-7  (TTL: 300s, refresh on heartbeat)
  ws:conn:user_456 → chat-server-2

Routing logic:
  1. Message for user_123
  2. Lookup Redis: user_123 → chat-server-7
  3. Route message to chat-server-7 via internal message bus
  4. chat-server-7 pushes to user_123's WebSocket
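The routing logic above, sketched with an in-memory Map standing in for the Redis registry (names are illustrative; `busPublish` is a placeholder for the internal message bus):

```javascript
// registry.js — in-memory sketch of the Redis connection registry above
const registry = new Map(); // userId → chatServerId (stand-in for ws:conn:{user})

function registerConnection(userId, serverId) {
  // In the real system this key carries a 300s TTL, refreshed on heartbeat
  registry.set(userId, serverId);
}

function routeMessage(userId, payload, busPublish) {
  const serverId = registry.get(userId);
  if (!serverId) return false;               // no entry → user offline → push path
  busPublish(serverId, { userId, payload }); // internal message bus hop
  return true;
}
```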

5.2 Monitoring WebSocket Connections

# prometheus-alerts.yml — Chat System specific
groups:
  - name: chat_system_alerts
    rules:
      - alert: WebSocketConnectionsHigh
        expr: websocket_active_connections > 60000  # 92% of 65K limit
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "WebSocket connections near limit ({{ $value }}/65000)"
 
      - alert: MessageDeliveryLatencyHigh
        expr: histogram_quantile(0.99, rate(message_delivery_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 message delivery latency > 500ms"
 
      - alert: KafkaConsumerLag
        expr: kafka_consumer_group_lag > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka consumer lag > 100K messages, delivery delayed"
 
      - alert: WebSocketReconnectionSpike
        expr: rate(websocket_reconnections_total[5m]) > 1000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "WebSocket reconnection storm detected ({{ $value }}/s)"
 
      - alert: PresenceServiceLatency
        expr: histogram_quantile(0.95, rate(presence_update_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Presence update P95 latency > 100ms"
 
      - alert: CassandraWriteLatency
        expr: cassandra_write_latency_p99 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Cassandra P99 write latency > 50ms"

5.3 Grafana Dashboard Essentials

| Panel | PromQL | Threshold |
|---|---|---|
| Active WebSocket Connections | websocket_active_connections | Warning: 80% capacity |
| Message Throughput | rate(messages_sent_total[1m]) | Compare vs estimation (23K/s avg) |
| Message Delivery Latency P99 | histogram_quantile(0.99, ...) | < 200ms |
| Kafka Consumer Lag | kafka_consumer_group_lag | < 10K |
| Presence Updates/s | rate(presence_updates_total[1m]) | Monitor for fan-out storms |
| Undelivered Messages | undelivered_messages_gauge | Should trend to 0 |
| WebSocket Error Rate | rate(websocket_errors_total[5m]) | < 0.1% |

5.4 Kafka for Message Delivery

# kafka-topics.yml
topics:
  - name: chat.messages.1on1
    partitions: 64          # Partition by channel_id hash
    replication_factor: 3
    config:
      retention.ms: 604800000   # 7 days (messages persisted in Cassandra)
      min.insync.replicas: 2
      compression.type: lz4
 
  - name: chat.messages.group
    partitions: 128         # Higher throughput for group messages
    replication_factor: 3
    config:
      retention.ms: 604800000
      min.insync.replicas: 2
 
  - name: chat.presence
    partitions: 32
    replication_factor: 2   # Presence is less critical
    config:
      retention.ms: 3600000    # 1 hour only
      cleanup.policy: compact  # Keep latest status per user
 
  - name: chat.notifications
    partitions: 32
    replication_factor: 3
    config:
      retention.ms: 86400000   # 1 day

Partition Strategy: Messages for the same channel_id go to the same Kafka partition → ordering guaranteed within a conversation.


6. Code Examples

6.1 Node.js WebSocket Server with Rooms

// chat-server.js
// WebSocket server with rooms (channels), presence heartbeat, message routing
 
const WebSocket = require('ws');
const Redis = require('ioredis');
const { Kafka } = require('kafkajs');
const { ulid } = require('ulid');
 
// --- Config ---
const PORT = process.env.PORT || 8080;
const SERVER_ID = process.env.SERVER_ID || 'chat-server-1';
const HEARTBEAT_INTERVAL = 5000;   // 5 seconds
const HEARTBEAT_TIMEOUT = 30000;   // 30 seconds → offline
 
// --- Connections ---
const redis = new Redis(process.env.REDIS_URL);
const redisSub = new Redis(process.env.REDIS_URL);
 
const kafka = new Kafka({ brokers: [process.env.KAFKA_BROKER || 'localhost:9092'] });
const producer = kafka.producer();
const consumer = kafka.consumer({ groupId: `${SERVER_ID}-group` });
 
// --- State ---
const clients = new Map();      // user_id → WebSocket
const userChannels = new Map(); // user_id → Set<channel_id>
 
// --- WebSocket Server ---
const wss = new WebSocket.Server({ port: PORT });
 
wss.on('connection', async (ws, req) => {
  const userId = authenticateFromToken(req); // Extract from JWT in query/header
  if (!userId) {
    ws.close(4001, 'Unauthorized');
    return;
  }
 
  // Register connection
  clients.set(userId, ws);
  await registerConnection(userId);
  await setPresence(userId, 'online');
 
  console.log(`[${SERVER_ID}] User ${userId} connected. Total: ${clients.size}`);
 
  // Heartbeat setup
  ws.isAlive = true;
  ws.on('pong', () => { ws.isAlive = true; });
 
  // Message handler
  ws.on('message', async (raw) => {
    try {
      const data = JSON.parse(raw);
      await handleMessage(userId, data);
    } catch (err) {
      ws.send(JSON.stringify({ type: 'error', message: 'Invalid message format' }));
    }
  });
 
  // Disconnect handler
  ws.on('close', async () => {
    clients.delete(userId);
    await unregisterConnection(userId);
    await setPresence(userId, 'offline');
    console.log(`[${SERVER_ID}] User ${userId} disconnected. Total: ${clients.size}`);
  });
 
  // Send pending messages (offline → online sync)
  await deliverPendingMessages(userId, ws);
});
 
// --- Message Handling ---
async function handleMessage(senderId, data) {
  switch (data.type) {
    case 'chat_message':
      await handleChatMessage(senderId, data);
      break;
    case 'typing':
      await handleTypingIndicator(senderId, data);
      break;
    case 'read_receipt':
      await handleReadReceipt(senderId, data);
      break;
    case 'heartbeat':
      await refreshPresence(senderId);
      break;
    default:
      console.warn(`Unknown message type: ${data.type}`);
  }
}
 
async function handleChatMessage(senderId, data) {
  const { channelId, content, contentType = 'text', mediaUrl = null } = data;
 
  // Generate time-sortable ID
  const messageId = ulid();
 
  const message = {
    messageId,
    channelId,
    senderId,
    content,
    contentType,
    mediaUrl,
    createdAt: Date.now(),
  };
 
  // ACK to sender immediately
  sendToUser(senderId, {
    type: 'message_ack',
    messageId,
    channelId,
    status: 'accepted',
    timestamp: message.createdAt,
  });
 
  // Publish to Kafka for persistence + delivery
  await producer.send({
    topic: 'chat.messages.1on1',
    messages: [{
      key: channelId,  // Partition by channel → ordering guarantee
      value: JSON.stringify(message),
    }],
  });
}
 
async function handleTypingIndicator(senderId, data) {
  const { channelId } = data;
  // Broadcast typing indicator to other members in channel
  const members = await getChannelMembers(channelId);
  for (const memberId of members) {
    if (memberId !== senderId) {
      sendToUser(memberId, {
        type: 'typing',
        channelId,
        userId: senderId,
        timestamp: Date.now(),
      });
    }
  }
}
 
async function handleReadReceipt(senderId, data) {
  const { channelId, lastReadMessageId } = data;
 
  // Update read pointer in Redis (will be persisted to Cassandra async)
  await redis.set(
    `read_receipt:${channelId}:${senderId}`,
    lastReadMessageId
  );
 
  // Reset unread count
  await redis.del(`unread:${senderId}:${channelId}`);
 
  // Notify other members
  const members = await getChannelMembers(channelId);
  for (const memberId of members) {
    if (memberId !== senderId) {
      sendToUser(memberId, {
        type: 'read_receipt',
        channelId,
        userId: senderId,
        lastReadMessageId,
      });
    }
  }
}
 
// --- Delivery ---
function sendToUser(userId, payload) {
  const ws = clients.get(userId);
  if (ws && ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify(payload));
    return true;
  }
  return false; // User not on this server or offline
}
 
// --- Presence ---
async function setPresence(userId, status) {
  if (status === 'online') {
    await redis.set(`presence:${userId}`, 'online', 'EX', 30);
  } else {
    await redis.del(`presence:${userId}`);
  }
  // Publish presence change for subscribers
  await redis.publish('presence_changes', JSON.stringify({
    userId,
    status,
    timestamp: Date.now(),
  }));
}
 
async function refreshPresence(userId) {
  await redis.expire(`presence:${userId}`, 30);  // Reset presence TTL
  await redis.expire(`ws:conn:${userId}`, 300);  // Keep the connection-registry entry alive too
}
 
// --- Connection Registry ---
async function registerConnection(userId) {
  await redis.set(`ws:conn:${userId}`, SERVER_ID, 'EX', 300);
}
 
async function unregisterConnection(userId) {
  await redis.del(`ws:conn:${userId}`);
}
 
// --- Heartbeat Check ---
const heartbeatInterval = setInterval(() => {
  wss.clients.forEach((ws) => {
    if (!ws.isAlive) {
      ws.terminate(); // No pong since the last check → connection is dead
      return;
    }
    ws.isAlive = false;
    ws.ping();
  });
}, HEARTBEAT_INTERVAL);

// Stop probing when the server shuts down
wss.on('close', () => clearInterval(heartbeatInterval));
 
// --- Kafka Consumer ---
// Each chat server should use its OWN consumer groupId, so that every server
// sees every message and can deliver to its locally connected recipients.
async function startConsumer() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'chat.messages.1on1', fromBeginning: false });
 
  await consumer.run({
    eachMessage: async ({ message }) => {
      const msg = JSON.parse(message.value.toString());
      const { channelId, senderId } = msg;
 
      // Get channel members and deliver
      const members = await getChannelMembers(channelId);
      for (const memberId of members) {
        if (memberId === senderId) continue; // Don't echo back to sender

        const delivered = sendToUser(memberId, {
          type: 'new_message',
          ...msg,
        });

        // Unread count rises for every recipient; their read_receipt resets it
        await redis.incr(`unread:${memberId}:${channelId}`);

        if (!delivered) {
          // Recipient is offline or connected to another server — the sync
          // service and push-notification path (Kafka) take over delivery
        }
      }
    },
  });
}
 
// --- Helper stubs ---
function authenticateFromToken(req) {
  // Extract the JWT from the `token` query param (the client connects with ?token=...)
  // In production: verify the JWT signature and expiry, then return the userId claim
  const url = new URL(req.url, `http://${req.headers.host}`);
  return url.searchParams.get('token'); // Simplified: the raw token stands in for userId
}
 
async function getChannelMembers(channelId) {
  // In production: cache in Redis, fallback to PostgreSQL
  const members = await redis.smembers(`channel:members:${channelId}`);
  return members;
}
 
async function deliverPendingMessages(userId, ws) {
  // In production: query Cassandra for messages after user's last sync point
  console.log(`[${SERVER_ID}] Delivering pending messages to ${userId}`);
}
 
// --- Startup ---
(async () => {
  await producer.connect();
  await startConsumer();
  console.log(`[${SERVER_ID}] Chat server running on port ${PORT}`);
})();
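One limitation of the registry above: `clients` keeps a single socket per user, so a second device silently replaces the first — at odds with FR8 (multi-device sync). A minimal sketch of a multi-socket registry, with sockets mocked as plain objects for illustration:

```javascript
// Multi-device registry sketch: user_id → Set<WebSocket>, so phone and laptop
// can both receive. Sockets are mocked as plain objects here.
const deviceSockets = new Map(); // userId → Set of sockets

function addSocket(userId, ws) {
  if (!deviceSockets.has(userId)) deviceSockets.set(userId, new Set());
  deviceSockets.get(userId).add(ws);
}

function removeSocket(userId, ws) {
  const set = deviceSockets.get(userId);
  if (!set) return;
  set.delete(ws);
  if (set.size === 0) deviceSockets.delete(userId); // last device gone → user offline
}

// sendToUser now fans out to every connected device
function sendToUserAllDevices(userId, payload) {
  let delivered = 0;
  for (const ws of deviceSockets.get(userId) ?? []) {
    ws.send(JSON.stringify(payload)); // production code would check ws.readyState === OPEN
    delivered++;
  }
  return delivered;
}

const phone = { sent: [], send(m) { this.sent.push(m); } };
const laptop = { sent: [], send(m) { this.sent.push(m); } };
addSocket('u1', phone);
addSocket('u1', laptop);
console.log(sendToUserAllDevices('u1', { type: 'new_message' })); // 2
```

Presence also changes with multiple devices: the user is offline only when the *last* socket closes, which the `set.size === 0` check captures.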

6.2 Message Schema — TypeScript Types

// types/message.ts
 
/** Message content types */
type ContentType = 'text' | 'image' | 'file' | 'video' | 'audio' | 'system';
 
/** WebSocket message types (client ↔ server) */
type WSMessageType =
  | 'chat_message'      // Client → Server: send message
  | 'new_message'       // Server → Client: receive message
  | 'message_ack'       // Server → Client: message accepted
  | 'typing'            // Bidirectional: typing indicator
  | 'read_receipt'      // Bidirectional: mark as read
  | 'presence'          // Server → Client: online/offline
  | 'heartbeat'         // Client → Server: keep alive
  | 'error';            // Server → Client: error
 
/** Chat message stored in Cassandra */
interface ChatMessage {
  messageId: string;      // ULID — time-sortable unique ID
  channelId: string;      // UUID of 1-on-1 or group channel
  senderId: string;       // UUID of sender
  content: string;        // Message text (encrypted if E2E)
  contentType: ContentType;
  mediaUrl: string | null;
  thumbnailUrl: string | null;
  createdAt: number;      // Unix timestamp (ms)
  updatedAt: number | null;
  deleted: boolean;
  replyTo: string | null; // messageId of parent (for threads)
}
 
/** Channel (conversation) stored in PostgreSQL */
interface Channel {
  channelId: string;
  type: 'direct' | 'group';
  name: string | null;        // NULL for 1-on-1
  createdBy: string;
  createdAt: number;
  memberCount: number;
  lastMessageId: string | null;
  lastMessageAt: number | null;
}
 
/** Channel membership */
interface ChannelMember {
  channelId: string;
  userId: string;
  role: 'owner' | 'admin' | 'member';
  joinedAt: number;
  lastReadMessageId: string | null;
  notificationMuted: boolean;
}
 
/** Presence status */
interface UserPresence {
  userId: string;
  status: 'online' | 'offline' | 'away';
  lastActiveAt: number;
  device: 'mobile' | 'web' | 'desktop';
}
 
/** WebSocket envelope (all messages wrapped in this) */
interface WSEnvelope<T = unknown> {
  type: WSMessageType;
  payload: T;
  timestamp: number;
  requestId?: string;     // For request-response correlation
}

6.3 Presence Heartbeat — Client Side

// client/presence.js — Browser/React Native WebSocket client
 
class ChatClient {
  constructor(wsUrl, token) {
    this.wsUrl = wsUrl;
    this.token = token;
    this.ws = null;
    this.heartbeatTimer = null;
    this.reconnectAttempts = 0;
    this.maxReconnectAttempts = 10;
    this.baseReconnectDelay = 1000; // 1 second
  }
 
  connect() {
    this.ws = new WebSocket(`${this.wsUrl}?token=${this.token}`);
 
    this.ws.onopen = () => {
      console.log('Connected to chat server');
      this.reconnectAttempts = 0;
      this.startHeartbeat();
    };
 
    this.ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      this.handleMessage(data);
    };
 
    this.ws.onclose = (event) => {
      console.log(`Disconnected: ${event.code} ${event.reason}`);
      this.stopHeartbeat();
      this.scheduleReconnect();
    };
 
    this.ws.onerror = (error) => {
      console.error('WebSocket error:', error);
    };
  }
 
  // --- Heartbeat ---
  startHeartbeat() {
    this.heartbeatTimer = setInterval(() => {
      if (this.ws.readyState === WebSocket.OPEN) {
        this.ws.send(JSON.stringify({ type: 'heartbeat' }));
      }
    }, 5000); // Every 5 seconds
  }
 
  stopHeartbeat() {
    if (this.heartbeatTimer) {
      clearInterval(this.heartbeatTimer);
      this.heartbeatTimer = null;
    }
  }
 
  // --- Reconnection with exponential backoff + jitter ---
  scheduleReconnect() {
    if (this.reconnectAttempts >= this.maxReconnectAttempts) {
      console.error('Max reconnect attempts reached. Please refresh.');
      return;
    }
 
    // Exponential backoff with jitter to prevent thundering herd
    const delay = Math.min(
      this.baseReconnectDelay * Math.pow(2, this.reconnectAttempts),
      30000 // Max 30 seconds
    );
    const jitter = delay * (0.5 + Math.random() * 0.5); // 50-100% of delay
 
    console.log(`Reconnecting in ${Math.round(jitter)}ms (attempt ${this.reconnectAttempts + 1})`);
 
    setTimeout(() => {
      this.reconnectAttempts++;
      this.connect();
    }, jitter);
  }
 
  // --- Send Message ---
  sendMessage(channelId, content, contentType = 'text') {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return false; // Not connected — caller should queue/retry
    this.ws.send(JSON.stringify({
      type: 'chat_message',
      channelId,
      content,
      contentType,
    }));
    return true;
  }

  // --- Mark as Read ---
  markAsRead(channelId, lastReadMessageId) {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return false;
    this.ws.send(JSON.stringify({
      type: 'read_receipt',
      channelId,
      lastReadMessageId,
    }));
    return true;
  }
 
  handleMessage(data) {
    switch (data.type) {
      case 'new_message':
        this.onNewMessage?.(data);
        break;
      case 'message_ack':
        this.onMessageAck?.(data);
        break;
      case 'typing':
        this.onTyping?.(data);
        break;
      case 'read_receipt':
        this.onReadReceipt?.(data);
        break;
      case 'presence':
        this.onPresenceChange?.(data);
        break;
      case 'error':
        console.error('Server error:', data.message);
        break;
    }
  }
}
 
// Usage:
// const chat = new ChatClient('wss://chat.example.com/ws', jwtToken);
// chat.onNewMessage = (msg) => updateUI(msg);
// chat.connect();

7. System Design Diagram — Complete Architecture

flowchart TB
    subgraph "Clients"
        WEB["🖥 Web Client<br/>(React)"]
        IOS["📱 iOS Client<br/>(Swift)"]
        AND["📱 Android Client<br/>(Kotlin)"]
    end

    subgraph "Edge Layer"
        CDN["CDN<br/>(CloudFront)"]
        LB["Load Balancer<br/>(ALB + Nginx)"]
    end

    subgraph "API Layer (HTTP - Stateless)"
        GW["API Gateway<br/>(Kong / Envoy)"]
        AUTH["Auth Service<br/>(JWT + OAuth2)"]
        USER["User Service<br/>(Profile, Contacts)"]
        GROUP["Group Service<br/>(CRUD, Members)"]
        MEDIA["Media Service<br/>(Pre-signed URLs)"]
        SRCH["Search Service<br/>(ES queries)"]
    end

    subgraph "Real-time Layer (WebSocket - Stateful)"
        direction LR
        WS1["Chat Server 1<br/>(WS, 65K conn)"]
        WS2["Chat Server 2<br/>(WS, 65K conn)"]
        WSN["Chat Server N<br/>(WS, 65K conn)"]
    end

    subgraph "Presence Layer"
        PRES["Presence Service<br/>(Heartbeat, Status)"]
    end

    subgraph "Message Bus"
        K1["Kafka<br/>chat.messages"]
        K2["Kafka<br/>chat.presence"]
        K3["Kafka<br/>chat.notifications"]
    end

    subgraph "Processing Layer"
        FAN["Fan-out Service<br/>(Group delivery)"]
        SYNC["Sync Service<br/>(Offline → Online)"]
        IDX["Indexer<br/>(Cassandra → ES)"]
        THUMB["Thumbnail Service<br/>(Image processing)"]
    end

    subgraph "Notification"
        NOTIF["Notification Service"]
        FCM["FCM<br/>(Android)"]
        APNS["APNs<br/>(iOS)"]
        WPUSH["Web Push"]
    end

    subgraph "Data Layer"
        CASS[("Cassandra<br/>Messages<br/>(Write-optimized)")]
        PG[("PostgreSQL<br/>Users, Groups<br/>Channels")]
        RD[("Redis Cluster<br/>Presence, Sessions<br/>Unread counts")]
        S3[("S3<br/>Media files")]
        ES[("Elasticsearch<br/>Message search")]
    end

    WEB & IOS & AND -->|HTTPS| LB
    WEB & IOS & AND -.->|WSS| WS1 & WS2 & WSN
    WEB & IOS & AND -->|Media| CDN

    LB --> GW
    GW --> AUTH & USER & GROUP & MEDIA & SRCH

    WS1 & WS2 & WSN --> K1
    WS1 & WS2 & WSN --> PRES
    PRES --> RD
    PRES --> K2

    K1 --> FAN & SYNC
    K1 --> CASS
    FAN --> WS1 & WS2 & WSN
    FAN --> K3
    K3 --> NOTIF
    NOTIF --> FCM & APNS & WPUSH

    SYNC --> CASS
    IDX --> ES
    CASS --> IDX

    USER & GROUP --> PG
    MEDIA --> S3
    S3 --> CDN
    S3 --> THUMB
    THUMB --> S3

    SRCH --> ES

8. Aha Moments — Key Takeaways

#1 — WebSocket is a game changer but a scaling nightmare: WebSocket gives the lowest latency, but stateful connections make horizontal scaling hard. Solution: a connection registry in Redis plus an internal message bus (Kafka) to route messages between servers.

#2 — Chat is write-heavy, but read-heavy at the delivery layer: 69K write QPS (messages into the DB) versus 347K read QPS (delivery to recipients). Fan-out on write for small groups, fan-out on read for large groups — a hybrid approach.
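A sketch of how a delivery service might pick the strategy per channel; the 50-member cutoff is an assumed threshold, not a value from the chapter:

```javascript
// Hybrid fan-out sketch (the cutoff is an assumption for illustration).
// Small groups: push a copy to each member's inbox at write time (fan-out on write).
// Large groups: write once per channel; readers pull on demand (fan-out on read).
const FANOUT_WRITE_LIMIT = 50; // hypothetical cutoff

function deliveryPlan(memberCount) {
  return memberCount <= FANOUT_WRITE_LIMIT
    ? { strategy: 'fanout-on-write', writesPerMessage: memberCount }
    : { strategy: 'fanout-on-read', writesPerMessage: 1 };
}

console.log(deliveryPlan(8).strategy);   // fanout-on-write
console.log(deliveryPlan(100).strategy); // fanout-on-read
```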

#3 — NoSQL wins for messages: Cassandra/HBase partitioned by channel_id and clustered by message_id (TIMEUUID) gives O(1) lookup per conversation, sequential writes, and natural horizontal scaling. SQL is a poor fit for this workload.

#4 — Presence is deceptively expensive: "User X is online" sounds simple, but fanning out presence changes to the user's entire contact list creates a thundering-herd problem. Solution: subscribe-on-view (track presence only for conversations that are currently open).

#5 — Message ordering is channel-scoped, not global: Global ordering is unnecessary. Messages only need to stay in order within the same conversation. ULID plus Kafka partitioning by channel_id solves this elegantly.
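Why ULIDs work as an ordering key: the first 10 characters encode the millisecond timestamp in Crockford base32, so plain string comparison sorts by server time. A minimal ULID-style generator (production code would use the `ulid` npm package):

```javascript
// Minimal ULID-style ID: 48-bit ms timestamp in Crockford base32 + random tail.
const ENC = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'; // Crockford base32 alphabet

function encodeTime(ms, len = 10) {
  let out = '';
  for (let i = 0; i < len; i++) {
    out = ENC[ms % 32] + out;
    ms = Math.floor(ms / 32);
  }
  return out;
}

function randomPart(len = 16) {
  let out = '';
  for (let i = 0; i < len; i++) out += ENC[Math.floor(Math.random() * 32)];
  return out;
}

function ulid(ms = Date.now()) {
  return encodeTime(ms) + randomPart();
}

// Key property: an ID generated later sorts lexicographically after an earlier
// one, so ORDER BY message_id equals ORDER BY server time within a channel.
const a = ulid(1700000000000);
const b = ulid(1700000000001);
console.log(a < b); // true
```

Because the time prefix has a fixed width, comparison is decided inside it before any random characters are reached.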

#6 — E2E encryption kills server-side features: With E2E enabled, the server cannot search messages, scan content, or generate previews. A clear trade-off between privacy and functionality.

#7 — The pre-signed URL pattern eliminates the media bottleneck: Clients upload directly to S3; the server only generates the URL. No media passes through the chat server, so the backend handles only text.


9. Common Pitfalls

Pitfall 1: WebSocket Reconnection Storms

Wrong: Server restarts → 65K clients reconnect at the same time → the new server is overloaded immediately → crashes again → loop.

Right: Clients MUST implement exponential backoff with jitter:

  • Attempt 1: wait 1s + random(0-0.5s)
  • Attempt 2: wait 2s + random(0-1s)
  • Attempt 3: wait 4s + random(0-2s)
  • …capped at 30s

Jitter spreads reconnects out over time instead of having them all land at once.

Pitfall 2: Presence Fan-out in Large Groups

Wrong: User A comes online → push "A is online" to 500 friends → 500 WebSocket messages. 100K users logging in at the same time (9 AM peak) → 50M presence events.

Right: Subscribe on view — track presence only for conversations currently visible on screen. A user looking at 5 conversations subscribes to 5 presence channels, not 500.
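A sketch of subscribe-on-view on the client side; the `subscribe`/`unsubscribe` callbacks are hypothetical stand-ins for presence pub/sub channel operations:

```javascript
// Subscribe-on-view: track presence only for conversations on screen.
class PresenceViewManager {
  constructor(subscribe, unsubscribe) {
    this.subscribe = subscribe;     // e.g. join a presence pub/sub channel
    this.unsubscribe = unsubscribe; // e.g. leave it
    this.visible = new Set();
  }

  // Called whenever the visible conversation list changes (scroll, tab switch)
  setVisible(channelIds) {
    const next = new Set(channelIds);
    for (const id of next) if (!this.visible.has(id)) this.subscribe(`presence:${id}`);
    for (const id of this.visible) if (!next.has(id)) this.unsubscribe(`presence:${id}`);
    this.visible = next;
  }
}

const subs = new Set();
const mgr = new PresenceViewManager(ch => subs.add(ch), ch => subs.delete(ch));
mgr.setVisible(['c1', 'c2', 'c3', 'c4', 'c5']); // 5 conversations on screen → 5 subscriptions
mgr.setVisible(['c4', 'c5', 'c6']);             // scroll → drop 3, add 1
console.log(subs.size); // 3
```

The server then fans a presence change out only to current subscribers of that channel, not to the user's whole contact list.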

Pitfall 3: Message Ordering in a Distributed System

Wrong: Using the client device's timestamp as the ordering key → clocks are not synchronized → messages display out of order.

Right: The server generates message_id with ULID/Snowflake (server timestamp). Kafka partitions by channel_id, so messages within the same conversation always stay in order. Clients display by message_id, not by client timestamp.

Pitfall 4: Not Handling Duplicate Delivery

Wrong: Network hiccup → the client never receives the ACK → it retries → the message is sent twice → the receiver sees a duplicate.

Right: Idempotency key — the client generates a requestId for each message. The server checks the requestId in Redis (TTL 5 min). If it was already processed, return the existing ACK instead of processing again.
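A sketch of the server-side dedup check, with an in-memory Map standing in for Redis (real code would store the cached ACK under the requestId with a 5-minute TTL, e.g. `SET ... NX EX 300`):

```javascript
// Idempotent message handling: a retried requestId returns the cached ACK
// instead of being processed a second time.
const seen = new Map(); // requestId → cached ACK (TTL 5 min in Redis)

function handleIncoming(requestId, process) {
  if (seen.has(requestId)) {
    return seen.get(requestId); // Duplicate: return the cached ACK, skip processing
  }
  const ack = process();        // First delivery: process once and cache the ACK
  seen.set(requestId, ack);
  return ack;
}

let processed = 0;
const send = () =>
  handleIncoming('req-123', () => ({ status: 'accepted', n: ++processed }));

send();             // first delivery → processed
const ack = send(); // retry after a lost ACK → same ACK, no reprocessing
console.log(processed, ack.n); // 1 1
```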

Pitfall 5: Media Upload Blocking the Chat Flow

Wrong: Uploading a 10MB image keeps the WebSocket connection busy → text messages are blocked behind it.

Right: Media uploads go over a separate HTTP connection (pre-signed URL → direct to S3). The WebSocket carries only text messages and lightweight events. The two channels are independent.

Pitfall 6: Unbounded Group Notifications

Wrong: A group of 100 members: each message triggers 99 push notifications → 1,000 messages/hour = 99K notifications/hour for a single group.

Right: Notification batching — aggregate 5 messages within 30 seconds into one notification: "You have 5 new messages in Group ABC". If the user is online, do NOT send a push notification at all.
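A sketch of a per-user, per-channel notification batcher; `maxBatch` and `windowMs` mirror the 5-messages/30-seconds rule above, and the push function is a stand-in for FCM/APNs:

```javascript
// Batch pushes per (user, channel): flush when `maxBatch` messages accumulate
// or when the `windowMs` timer fires, whichever comes first.
class NotificationBatcher {
  constructor(push, { maxBatch = 5, windowMs = 30000 } = {}) {
    this.push = push;         // e.g. FCM/APNs send function
    this.maxBatch = maxBatch;
    this.windowMs = windowMs;
    this.buffers = new Map(); // `${userId}:${channelId}` → { count, timer }
  }

  add(userId, channelId) {
    const key = `${userId}:${channelId}`;
    let buf = this.buffers.get(key);
    if (!buf) {
      buf = { count: 0, timer: setTimeout(() => this.flush(userId, channelId), this.windowMs) };
      this.buffers.set(key, buf);
    }
    if (++buf.count >= this.maxBatch) this.flush(userId, channelId);
  }

  flush(userId, channelId) {
    const key = `${userId}:${channelId}`;
    const buf = this.buffers.get(key);
    if (!buf) return;
    clearTimeout(buf.timer);
    this.buffers.delete(key);
    this.push(userId, `You have ${buf.count} new messages in ${channelId}`);
  }
}

const sent = [];
const batcher = new NotificationBatcher((u, text) => sent.push(text), { maxBatch: 5 });
for (let i = 0; i < 7; i++) batcher.add('u1', 'group-abc'); // 5 flush immediately, 2 buffered
batcher.flush('u1', 'group-abc');                           // timer would flush the rest
console.log(sent.length); // 2
```

The online check belongs one layer up: consult the presence key before calling `add` at all, so online users never enter the batcher.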


10. Wrap Up — Step 4

Scaling Considerations

| Component | Scaling Strategy |
|---|---|
| Chat Servers (WebSocket) | Horizontal + connection registry (Redis) |
| Cassandra | Add nodes; virtual nodes auto-rebalance |
| Kafka | Add partitions + brokers |
| Redis | Cluster mode (16384 hash slots) |
| Media (S3) | Virtually unlimited |
| Search (ES) | Add data nodes, index sharding |
| Notification | Horizontal + per-user rate limiting |

Multi-region Deployment

Region US-East:
  Chat servers (50%) ← Users in Americas
  Cassandra DC-1 (full replica)
  Kafka cluster (independent)

Region EU-West:
  Chat servers (30%) ← Users in Europe/Africa
  Cassandra DC-2 (full replica)
  Kafka cluster (independent)

Region AP-Southeast:
  Chat servers (20%) ← Users in Asia-Pacific
  Cassandra DC-3 (full replica)
  Kafka cluster (independent)

Cross-region sync:
  Cassandra multi-DC replication (async, eventual consistency)
  Cross-region message routing via dedicated Kafka MirrorMaker

Cost Optimization

| Item | Monthly Cost (estimated, 50M DAU) | Optimization |
|---|---|---|
| WebSocket servers (100 instances) | ~$30K | Spot instances for non-critical |
| Cassandra cluster (50 nodes) | ~$50K | Tiered storage (hot/cold) |
| Kafka cluster (20 brokers) | ~$15K | Managed service (MSK) |
| S3 storage (14.6 PB/year) | ~$300K | S3 Intelligent-Tiering |
| CDN (media delivery) | ~$100K | Cache optimization, compression |
| Redis cluster (presence) | ~$10K | Managed ElastiCache |
| **Total** | **~$505K/month** | |

Future Enhancements

  • Voice/Video calls: WebRTC integration (TURN/STUN servers)
  • Message reactions: Lightweight (just emoji + userId, no separate table needed)
  • Threads: Reply-to with parent_message_id, lazy-loaded
  • Bots/Integrations: Webhook-based bot framework
  • Stories/Status: Ephemeral content (TTL 24h), separate storage
  • AI features: Smart reply suggestions, message summarization


References


Previous week: Tuan-16-Design-URL-Shortener — URL Shortener case study · Next week: Tuan-18-Design-News-Feed — News Feed system with fan-out patterns