Tuần 17: Design a Chat System

“A chat system looks simple — send a message from A to B. But when 50 million people send at once, every millisecond becomes a battle between consistency and availability.”

Tags: system-design chat-system websocket alex-xu case-study Student: Hieu Prerequisite: Tuan-02-Back-of-the-envelope · Tuan-08-Message-Queue · Tuan-03-Networking-DNS-CDN Related: Tuan-19-Design-Notification-System · Tuan-15-Data-Security-Encryption · Tuan-13-Monitoring-Observability · Tuan-07-Database-Sharding-Replication Reference: Alex Xu, System Design Interview — Chapter 12: Design a Chat System


Step 1 — Understand the Problem & Design Scope

1.1 Clarifying Questions

In an interview, always ask before you design. Below are the key questions with assumed answers:

| Question | Answer | Notes |
|---|---|---|
| 1-on-1 chat or group chat? | Both | Groups capped at 100 members |
| Mobile app, web app, or both? | Both | Multi-device support needed |
| How large is the scale? | 50M DAU | See Tuan-02-Back-of-the-envelope |
| Online/offline indicator? | Yes | Presence service |
| Message size limit? | 100,000 characters of text | Similar to Slack |
| Media support (images, files)? | Yes | Images, documents, files |
| End-to-end encryption? | Yes | For 1-on-1 chat |
| Message retention? | Forever | No auto-delete |
| Push notifications when offline? | Yes | FCM/APNs integration |
| Read receipts? | Yes | "Seen" indicator |

1.2 Functional Requirements

  • FR1: 1-on-1 chat with low latency (messages arrive in < 200ms within the same region)
  • FR2: Group chat with up to 100 members
  • FR3: Online/offline presence indicator
  • FR4: Push notifications for offline users
  • FR5: Media sharing (images, files) up to 25MB
  • FR6: Read receipts (delivered, seen)
  • FR7: Message history & search
  • FR8: Multi-device sync (read on phone and laptop at the same time)

1.3 Non-functional Requirements

  • NFR1: High availability — 99.99% uptime (< 52 minutes of downtime/year)
  • NFR2: Low latency — message delivery < 200ms (same region)
  • NFR3: Consistency — messages must stay in order within a conversation
  • NFR4: Durability — no sent message may be lost
  • NFR5: Scalability — 50M DAU, 2B messages/day

1.4 Capacity Estimation

Reuses the figures from Tuan-02-Back-of-the-envelope — Chat System section.

Assumptions:

| Parameter | Value | Rationale |
|---|---|---|
| DAU | 50M | WhatsApp/Telegram-like scale |
| Messages/user/day | 40 | Includes both 1-on-1 and group |
| Avg message size (text) | 100 bytes | UTF-8 encoded |
| % messages in groups | 60% | Group chat dominates |
| Avg group size | 50 members | Delivery fan-out factor |
| Media messages | 10% of total | Avg 200KB per media file |
| Retention | Forever | — |

QPS Calculation:

Messages/day = 50M users × 40 messages/user = 2B messages/day
Average write QPS = 2B / 86,400s ≈ 23K/s
Peak write QPS ≈ 3 × average ≈ 69K/s

Alert: 69K write QPS — this is a write-heavy system. A Message Queue is needed as a buffer → Tuan-08-Message-Queue.

Read QPS (each user reads more than they send; estimated at 5×):

Peak read QPS ≈ 5 × 69K ≈ 347K/s

Storage:

Text: 2B messages/day × 100 bytes = 200 GB/day ≈ 73 TB/year
Media: 2B × 10% × 200KB = 40 TB/day ≈ 14.6 PB/year

Concurrent WebSocket Connections:

Assume ~10% of DAU connected at any moment → 50M × 10% = 5M concurrent connections

Each WebSocket server handles ~65K connections (limited by file descriptors). At minimum:

5M connections / 65K per server ≈ 77 servers

Bandwidth:

Media dominates: 40 TB/day ÷ 86,400s ≈ 460 MB/s ≈ 3.7 Gbps average → peak (×3) ≈ 11 Gbps

Estimation Summary

| Metric | Value |
|---|---|
| Write QPS (peak) | ~69K/s |
| Read QPS (peak) | ~347K/s |
| Concurrent WebSocket connections | ~5M |
| Text storage/year | ~73 TB |
| Media storage/year | ~14.6 PB |
| WebSocket servers (minimum) | ~77 |
| Media bandwidth (peak) | ~11 Gbps |
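The arithmetic behind these numbers is easy to sanity-check with a short script. The constants mirror the assumptions table; the 10% concurrency ratio and ×3 peak factor are the working assumptions of this estimate:

```javascript
// capacity-estimate.js — quick sanity check of the estimation above
const DAU = 50e6;
const MSGS_PER_USER = 40;
const AVG_TEXT_BYTES = 100;
const MEDIA_RATIO = 0.10;
const AVG_MEDIA_BYTES = 200e3;
const PEAK_FACTOR = 3;          // assumption: peak ≈ 3× average
const CONCURRENCY = 0.10;       // assumption: ~10% of DAU online at once

const msgsPerDay = DAU * MSGS_PER_USER;                       // 2B messages/day
const avgWriteQps = msgsPerDay / 86400;                       // ~23K/s
const peakWriteQps = avgWriteQps * PEAK_FACTOR;               // ~69K/s
const peakReadQps = peakWriteQps * 5;                         // reads estimated at 5× writes
const textPerYearTB = msgsPerDay * AVG_TEXT_BYTES * 365 / 1e12;               // ~73 TB
const mediaPerYearPB = msgsPerDay * MEDIA_RATIO * AVG_MEDIA_BYTES * 365 / 1e15; // ~14.6 PB
const concurrentWs = DAU * CONCURRENCY;                       // 5M connections
const wsServers = Math.ceil(concurrentWs / 65000);            // ~77 servers

console.log({ peakWriteQps, peakReadQps, textPerYearTB, mediaPerYearPB, wsServers });
```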

Step 2 — High-Level Design

2.1 Communication Protocols — Choosing a Protocol

This is the most important architectural decision for a chat system. Hieu needs a clear view of the trade-offs:

HTTP Polling

Client → Server: "Any new messages?" (every 3 seconds)
Server → Client: "No." (99% of the time)

| Pros | Cons |
|---|---|
| Simplest approach | Wastes bandwidth & CPU |
| Stateless, easy to scale | Latency = polling interval |
| Works through any firewall | 50M users × 1 poll/3s = 16.7M req/s (!) |

Verdict: Not viable for a chat system at 50M DAU.

Long Polling (Hanging GET)

Client → Server: "Tell me when there's a new message" (request hangs)
Server: (waits until a message arrives or a 30s timeout)
Server → Client: "Here, a new message!"
Client → Server: (immediately sends a new request)

| Pros | Cons |
|---|---|
| More efficient than polling | Sender & receiver may land on different servers |
| HTTP/1.1 compatible | Timeout handling is complex |
| No WebSocket needed | Still consumes resources holding connections |

Verdict: Fallback option when WebSocket is unavailable.

WebSocket (Primary choice)

Client ←→ Server: Full-duplex, persistent connection
Upgrade: websocket (HTTP → WS handshake)
| Pros | Cons |
|---|---|
| Full-duplex (both directions) | Stateful → hard to scale horizontally |
| Lowest latency | Needs sticky sessions or a connection registry |
| Efficient (no HTTP overhead) | Firewalls/proxies may block it |
| Ideal for real-time | Reconnection logic is complex |

Verdict: Primary protocol for the chat system. WebSocket for message delivery, HTTP for everything else (auth, profile, search).

SSE (Server-Sent Events)

Server → Client: One-way stream (server push only)
Client → Server: Still uses HTTP POST to send messages

| Pros | Cons |
|---|---|
| Simpler than WebSocket | One-way only (server → client) |
| Auto-reconnect built in | Needs HTTP POST for client → server |
| Works over HTTP/2 | Limited to 6 connections/domain (HTTP/1.1) |

Verdict: Suitable for notifications/presence, not ideal for full chat.

Protocol Decision Matrix

| Criterion | Polling | Long Polling | WebSocket | SSE |
|---|---|---|---|---|
| Latency | High | Medium | Lowest | Low |
| Bidirectional | No | No | Yes | No |
| Scalability | Easy | Medium | Hard | Medium |
| Battery (mobile) | Bad | Medium | Good | Good |
| Firewall friendly | Good | Good | Medium | Good |

Decision: Use WebSocket for real-time messaging + presence. Use HTTP REST for user profile, search, media upload, group management.

2.2 Service Decomposition

flowchart TB
    subgraph "Client Layer"
        WEB[Web Client]
        MOB[Mobile Client]
    end

    subgraph "API Gateway / Load Balancer"
        LB[Load Balancer<br/>Nginx / ALB]
    end

    subgraph "Stateless Services (HTTP)"
        AUTH[Auth Service<br/>JWT / OAuth2]
        USER[User Service<br/>Profile, Contacts]
        GROUP[Group Service<br/>Group CRUD, Members]
        MEDIA[Media Service<br/>Upload, Download]
        SEARCH[Search Service<br/>Message Search]
        NOTIF[Notification Service<br/>FCM / APNs]
    end

    subgraph "Stateful Services (WebSocket)"
        CS1[Chat Server 1<br/>WebSocket]
        CS2[Chat Server 2<br/>WebSocket]
        CS3[Chat Server N<br/>WebSocket]
    end

    subgraph "Presence Layer"
        PRES[Presence Service<br/>Online/Offline/Typing]
    end

    subgraph "Message Layer"
        MQ[Message Queue<br/>Kafka / RabbitMQ]
        SYNC[Message Sync Service<br/>Offline → Online delivery]
    end

    subgraph "Storage Layer"
        CASS[(Cassandra<br/>Messages)]
        PG[(PostgreSQL<br/>Users, Groups)]
        REDIS[(Redis<br/>Presence, Sessions)]
        S3[(S3 / MinIO<br/>Media Files)]
        ES[(Elasticsearch<br/>Message Search)]
    end

    subgraph "CDN"
        CDN[CloudFront / Cloudflare<br/>Media Delivery]
    end

    WEB & MOB --> LB
    LB --> AUTH & USER & GROUP & MEDIA & SEARCH
    WEB & MOB -.->|WebSocket| CS1 & CS2 & CS3
    CS1 & CS2 & CS3 --> MQ
    MQ --> CS1 & CS2 & CS3
    MQ --> NOTIF
    MQ --> SYNC
    CS1 & CS2 & CS3 --> PRES
    PRES --> REDIS
    CS1 & CS2 & CS3 --> CASS
    USER & GROUP --> PG
    MEDIA --> S3
    S3 --> CDN
    SEARCH --> ES
    SYNC --> CASS
    NOTIF -.->|Push| MOB

2.3 Message Flow — 1-on-1 Chat

sequenceDiagram
    participant A as User A (Sender)
    participant CS1 as Chat Server 1
    participant MQ as Message Queue (Kafka)
    participant CS2 as Chat Server 2
    participant B as User B (Receiver)
    participant DB as Cassandra
    participant NS as Notification Service

    A->>CS1: 1. Send message via WebSocket
    CS1->>CS1: 2. Validate & generate message_id (ULID)
    CS1->>MQ: 3. Publish to message queue
    CS1-->>A: 4. ACK (message accepted)

    par Parallel processing
        MQ->>DB: 5a. Persist message
        MQ->>CS2: 5b. Route to receiver's chat server
    end

    alt User B is ONLINE
        CS2->>B: 6a. Deliver via WebSocket
        B-->>CS2: 7a. ACK (received)
    else User B is OFFLINE
        MQ->>NS: 6b. Trigger push notification
        NS->>B: 7b. Push via FCM/APNs
    end

2.4 Message Flow — Group Chat

sequenceDiagram
    participant A as User A (Sender)
    participant CS1 as Chat Server 1
    participant MQ as Message Queue (Kafka)
    participant FO as Fan-out Service
    participant CS2 as Chat Server 2
    participant CS3 as Chat Server 3
    participant B as User B (Online)
    participant C as User C (Online)
    participant NS as Notification Service
    participant D as User D (Offline)

    A->>CS1: 1. Send group message
    CS1->>MQ: 2. Publish (topic: group_123)
    CS1-->>A: 3. ACK

    MQ->>FO: 4. Fan-out service picks up
    FO->>FO: 5. Lookup group members & their chat servers

    par Fan-out to online members
        FO->>CS2: 6a. Route to User B's server
        CS2->>B: 7a. Deliver via WebSocket
        FO->>CS3: 6b. Route to User C's server
        CS3->>C: 7b. Deliver via WebSocket
    end

    FO->>NS: 6c. Notify offline members
    NS->>D: 7c. Push notification

2.5 Hybrid Approach: Why not just REST?

Hieu, good question: why not use REST for everything?

| Operation | Protocol | Reason |
|---|---|---|
| Send/receive messages | WebSocket | Real-time, bidirectional |
| User login/signup | HTTP REST | One-time, stateless |
| Upload media | HTTP REST (multipart) | Large payload, not real-time |
| Search messages | HTTP REST | Request-response pattern |
| Update profile | HTTP REST | Infrequent, stateless |
| Presence updates | WebSocket (piggyback) | Continuous, low overhead |
| Push notifications | HTTP (server-side) | Server → FCM/APNs |

Step 3 — Deep Dive

3.1 Message Storage — SQL vs NoSQL

This is the most important database decision for the chat system.

Why NOT SQL (PostgreSQL/MySQL)?

| Problem | Explanation |
|---|---|
| Write-heavy workload | 69K writes/s — relational DBs struggle at this level |
| Append-only pattern | Chat messages are almost all INSERT, rarely UPDATE/DELETE |
| Time-series nature | Messages are sequential in time, like time-series data |
| Horizontal scaling | SQL sharding is complex and requires manual re-sharding |
| Storage volume | 73TB of text/year — beyond a single cluster |

Why NoSQL (Cassandra / HBase)?

| Advantage | Explanation |
|---|---|
| Write-optimized | LSM-tree architecture, sequential writes |
| Horizontal scaling | Add nodes; data redistributes automatically |
| Time-series friendly | Partition by channel + sort by timestamp |
| High availability | Tunable consistency (ONE, QUORUM, ALL) |
| Proven at scale | Discord uses Cassandra, Facebook uses HBase |

Message Schema (Cassandra)

-- messages table: partition by channel_id, cluster by message_id (time-ordered)
CREATE TABLE messages (
    channel_id    UUID,          -- 1-on-1 or group channel
    message_id    TIMEUUID,      -- Time-based UUID, sorts by time automatically
    sender_id     UUID,
    content       TEXT,
    content_type  TEXT,           -- 'text', 'image', 'file', 'system'
    media_url     TEXT,           -- NULL if text-only
    created_at    TIMESTAMP,
    updated_at    TIMESTAMP,     -- NULL if never edited
    deleted       BOOLEAN,       -- Soft delete
 
    PRIMARY KEY (channel_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': 1}
  AND default_time_to_live = 0;  -- No TTL, retain forever

Partition strategy explained:

  • Partition key: channel_id — all messages of a conversation live on the same partition
  • Clustering key: message_id (TIMEUUID) — messages sort by time automatically
  • TimeWindowCompactionStrategy — optimized for time-series data, reduces read amplification

Query patterns:

-- Fetch the 50 most recent messages in a channel
SELECT * FROM messages
WHERE channel_id = ?
ORDER BY message_id DESC
LIMIT 50;
 
-- Fetch older messages (pagination/scroll up)
SELECT * FROM messages
WHERE channel_id = ?
AND message_id < ?    -- cursor-based pagination
ORDER BY message_id DESC
LIMIT 50;

Hot Partition Problem

If one group chat is very active (e.g., a 100-member group with 10K messages/day), its partition grows too large.

Solution: time-based bucketing

CREATE TABLE messages_bucketed (
    channel_id    UUID,
    bucket        TEXT,           -- '2026-03', '2026-04' (monthly bucket)
    message_id    TIMEUUID,
    sender_id     UUID,
    content       TEXT,
    content_type  TEXT,
    media_url     TEXT,
    created_at    TIMESTAMP,
 
    PRIMARY KEY ((channel_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

Partition key = (channel_id, bucket) → each month starts a fresh partition, avoiding unbounded growth.
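As an illustration, a bucket key under this scheme is trivial to compute; the 'YYYY-MM' format follows the schema above, and the helper names are assumptions for the sketch:

```javascript
// bucket-key.js — illustrative helpers for the monthly bucketing scheme above
function bucketFor(epochMs) {
  const d = new Date(epochMs);
  const month = String(d.getUTCMonth() + 1).padStart(2, '0');
  return `${d.getUTCFullYear()}-${month}`; // e.g. '2026-03'
}

// To paginate backwards across months, walk buckets newest → oldest:
function previousBucket(bucket) {
  let [year, month] = bucket.split('-').map(Number);
  month -= 1;
  if (month === 0) { month = 12; year -= 1; }
  return `${year}-${String(month).padStart(2, '0')}`;
}
```

When a page of 50 messages is not filled from the current bucket, the client keeps querying `previousBucket(...)` until it is.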

3.2 Message ID Generation — Ordering Guarantee

Why a custom ID instead of auto-increment?

| Auto-increment | Custom ID (Snowflake/ULID) |
|---|---|
| Single point of failure | Distributed generation |
| Bottleneck at large scale | Each server generates its own |
| No timestamp info | Timestamp embedded → sortable |

ULID (Universally Unique Lexicographically Sortable Identifier)

 01ARZ3NDEKTSV4RRFFQ69G5FAV
 |----------|  |------------|
  Timestamp     Randomness
   48 bits       80 bits
   (ms)

Why ULID works well for chat:

  • Lexicographically sortable → messages order themselves by time
  • 128 bits → extremely low collision probability
  • Monotonic — strictly increasing even within the same millisecond
  • No coordination needed — each server generates independently

Snowflake ID (Twitter style)

|  1 bit  |  41 bits   | 10 bits  |  12 bits  |
|  unused | timestamp  | machine  | sequence  |
|         |  (ms)      |    ID    |  number   |
  • 41 bits timestamp → ~69 years
  • 10 bits machine ID → 1024 servers
  • 12 bits sequence → 4096 IDs/ms/server

Decision: Use ULID for message IDs (simpler than Snowflake; no machine-ID coordination needed).
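For comparison, a Snowflake-style generator is small enough to sketch. The field widths follow the layout above; the custom epoch and function names are illustrative, not taken from any particular library:

```javascript
// snowflake.js — illustrative Snowflake-style ID generator (64-bit via BigInt)
// Layout: 41-bit ms timestamp | 10-bit machine ID | 12-bit sequence
const EPOCH = 1577836800000n; // custom epoch (2020-01-01) — an assumption for the sketch

function makeGenerator(machineId) {
  let lastMs = -1n;
  let seq = 0n;
  return function nextId(nowMs = BigInt(Date.now())) {
    if (nowMs === lastMs) {
      seq = (seq + 1n) & 0xFFFn;          // 12-bit sequence wraps at 4096 IDs/ms
    } else {
      seq = 0n;
      lastMs = nowMs;
    }
    return ((nowMs - EPOCH) << 22n) |     // 41 bits of timestamp
           (BigInt(machineId) << 12n) |   // 10 bits of machine ID
           seq;                            // 12 bits of sequence
  };
}
```

IDs from one generator are strictly increasing, so sorting by ID sorts by send time — the same property ULID gives without the machine-ID bookkeeping.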

3.3 Presence Service — Online/Offline Indicator

flowchart TB
    subgraph "Client"
        C[Mobile/Web Client]
    end

    subgraph "Presence Service"
        PS[Presence Server]
        HB[Heartbeat Handler]
    end

    subgraph "Storage"
        REDIS[(Redis<br/>user_id → status, last_active)]
    end

    subgraph "Notification"
        PUB[Redis Pub/Sub<br/>Presence Channel]
    end

    subgraph "Subscribers"
        CS1[Chat Server 1<br/>→ User A's friends online]
        CS2[Chat Server 2<br/>→ User B's friends online]
    end

    C -->|WebSocket connect| PS
    PS -->|Set ONLINE| REDIS
    PS -->|Publish status change| PUB

    C -->|Heartbeat every 5s| HB
    HB -->|Update last_active TTL| REDIS

    C -->|WebSocket disconnect| PS
    PS -->|Set OFFLINE| REDIS
    PS -->|Publish status change| PUB

    PUB --> CS1 & CS2

Heartbeat Mechanism

User connects via WebSocket → status = ONLINE
Every 5 seconds: client sends heartbeat ping
Server: reset TTL in Redis (key: presence:{user_id}, TTL: 30s)

If no heartbeat for 30s → key expires → status = OFFLINE

Redis data structure:

# Presence status
SET presence:user_123 "online" EX 30      # Auto-expire after 30s
SET last_active:user_123 "1679123456789"   # Epoch ms

# Batch check: who in the friend list is online?
MGET presence:user_001 presence:user_002 presence:user_003 ...
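The TTL semantics can be mimicked in a few lines of in-memory code — a stand-in for the Redis commands above, with all names illustrative:

```javascript
// presence.js — in-memory sketch of the Redis TTL semantics above
const TTL_MS = 30_000;
const presence = new Map(); // userId → expiry timestamp (ms)

function heartbeat(userId, nowMs = Date.now()) {
  presence.set(userId, nowMs + TTL_MS); // refresh TTL, like SET presence:{user} EX 30
}

function isOnline(userId, nowMs = Date.now()) {
  const expiry = presence.get(userId);
  if (expiry === undefined || expiry <= nowMs) {
    presence.delete(userId); // lazy expiry, like Redis key expiration
    return false;
  }
  return true;
}

// Batch check, like MGET presence:user_001 presence:user_002 ...
function whoIsOnline(userIds, nowMs = Date.now()) {
  return userIds.filter((id) => isOnline(id, nowMs));
}
```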

Presence Fan-out Problem

When User A logs in (status → ONLINE), who needs to know?

  • All of User A's friends/contacts.

If User A has 500 friends → 500 presence pushes. If 100K users log in at the same time (9 a.m.) → 100K × 500 = 50M presence events/s (!)

Solutions for large fan-out:

| Strategy | Description | When to use |
|---|---|---|
| Eager (fan-out on write) | Push the status change to all friends immediately | Friends list < 500 |
| Lazy (fan-out on read) | Client fetches presence itself when opening a chat | Friends list > 500 |
| Hybrid | Eager for close friends, lazy for the rest | Most systems |
| Subscribe on view | Subscribe to presence only while a chat window is open | Mobile (saves battery) |

Decision: Use Subscribe on view — the client subscribes only to presence for conversations currently visible on screen, drastically reducing event volume.

3.4 Group Chat Optimization

Small Groups (< 100 members): Fan-out on Write

User A sends a message to a group (50 members):
1. The message is written to the messages table (1 write)
2. Fan-out: copy the message_id into each member's inbox (50 writes)
3. Each member receives a push over WebSocket

Total writes per message: 1 + 50 = 51

Pros: fast reads (each user reads only their own inbox), simple. Cons: write amplification (1 message → N writes).

Large Groups (> 100 members, if extended later): Fan-out on Read

User A sends a message to a group (10,000 members):
1. The message is written to the messages table (1 write)
2. NO fan-out, no copies
3. When a member opens the group → query the messages table directly

Total writes per message: 1
Total reads per view: 1 query (but more expensive)

Pros: very fast writes, no amplification. Cons: slower reads, needs caching.

Inbox Table (Fan-out on Write approach)

-- User inbox: each user keeps a list of message pointers
CREATE TABLE user_inbox (
    user_id       UUID,
    channel_id    UUID,
    message_id    TIMEUUID,
    sender_id     UUID,
    preview       TEXT,          -- First 100 chars for notification
    is_read       BOOLEAN,
    created_at    TIMESTAMP,
 
    PRIMARY KEY (user_id, created_at, message_id)  -- message_id breaks ties within one timestamp
) WITH CLUSTERING ORDER BY (created_at DESC, message_id DESC);
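A minimal sketch of the fan-out-on-write path, with in-memory Maps standing in for the Cassandra tables (all names illustrative):

```javascript
// fanout.js — sketch of fan-out on write: 1 message write + N inbox writes
const messages = new Map();   // channelId → [message, ...]   (messages table)
const inboxes = new Map();    // userId → [pointer, ...]      (user_inbox table)

function sendGroupMessage(channelId, members, message) {
  // 1. One write to the messages table
  if (!messages.has(channelId)) messages.set(channelId, []);
  messages.get(channelId).push(message);

  // 2. Fan-out: one inbox write per member (write amplification = members.length)
  for (const userId of members) {
    if (!inboxes.has(userId)) inboxes.set(userId, []);
    inboxes.get(userId).push({
      channelId,
      messageId: message.messageId,
      preview: message.content.slice(0, 100), // first 100 chars, as in the schema
    });
  }
  return 1 + members.length; // total writes, e.g. 51 for a 50-member group
}
```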

3.5 Media Handling — Pre-signed URL Upload

sequenceDiagram
    participant C as Client
    participant API as API Gateway
    participant MS as Media Service
    participant S3 as S3 / Object Storage
    participant TH as Thumbnail Service
    participant CDN as CDN (CloudFront)
    participant CS as Chat Server
    participant R as Receiver

    C->>API: 1. Request upload URL (filename, size, type)
    API->>MS: 2. Validate (file size < 25MB, allowed types)
    MS->>S3: 3. Generate pre-signed URL (expires in 15 min)
    S3-->>MS: 4. Pre-signed URL
    MS-->>C: 5. Return upload URL

    C->>S3: 6. Upload file directly to S3 (bypass backend!)
    S3-->>C: 7. Upload complete (200 OK)

    S3->>TH: 8. S3 Event → trigger thumbnail generation
    TH->>S3: 9. Save thumbnail (compressed, resized)

    C->>CS: 10. Send message with media_url via WebSocket
    CS->>R: 11. Deliver message with CDN URL

    R->>CDN: 12. Load image/file from CDN
    CDN->>S3: 13. Cache miss → fetch from S3
    CDN-->>R: 14. Serve media (cached)

Why Pre-signed URLs?

| Direct upload through backend | Pre-signed URL (S3 direct) |
|---|---|
| Backend becomes the bottleneck | Client uploads straight to S3 |
| Consumes server bandwidth | Server only generates the URL |
| Timeout risk with large files | S3 handles large files natively |
| Hard to scale | S3 scales virtually without limit |

Thumbnail Generation:

Original image (2MB, 3000x2000)
  ↓ S3 Event Notification
Lambda/Worker:
  → Thumbnail small: 150x150 (5KB) — for the chat list
  → Thumbnail medium: 600x400 (50KB) — for the chat view
  → Keep original — for "View full size"

3.6 Read Receipts — Per-user Read Pointer

Data Model

-- Each user keeps one "read pointer" per channel
CREATE TABLE read_receipts (
    channel_id    UUID,
    user_id       UUID,
    last_read_message_id  TIMEUUID,   -- Pointer to the last message read
    updated_at    TIMESTAMP,
 
    PRIMARY KEY (channel_id, user_id)
);

Flow:

1. User B opens the conversation → client sends "mark as read" with last_message_id
2. Server: UPDATE read_receipts SET last_read_message_id = X
3. Server pushes a read-receipt event to User A (the sender)
4. User A's client shows "✓✓ Seen"

Optimization:

  • Batch updates: Don't send a read receipt for EVERY message. Send only when the user scrolls or opens the chat (debounced 2-3 seconds).
  • Group read receipts: Show "Seen by 5 people" instead of fanning out to every member.

Unread Count Calculation

-- Count unread messages for user X in channel Y.
-- CQL has no subqueries, so this takes two round trips:
SELECT last_read_message_id FROM read_receipts
WHERE channel_id = ? AND user_id = ?;
 
SELECT COUNT(*) FROM messages
WHERE channel_id = ?
AND message_id > ?;   -- bind last_read_message_id from the first query

Optimization: cache the unread count in Redis. Each new message → INCR unread:{user_id}:{channel_id}. On read → DEL unread:{user_id}:{channel_id}.
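The INCR/DEL pattern, mimicked in memory (a stand-in for the Redis keys above; names illustrative):

```javascript
// unread-counter.js — in-memory sketch of the Redis INCR/DEL pattern above
const unread = new Map(); // `${userId}:${channelId}` → count

function onNewMessage(userId, channelId) {
  const key = `${userId}:${channelId}`;
  unread.set(key, (unread.get(key) ?? 0) + 1); // like INCR unread:{user}:{channel}
}

function onChannelRead(userId, channelId) {
  unread.delete(`${userId}:${channelId}`);     // like DEL unread:{user}:{channel}
}

function unreadCount(userId, channelId) {
  return unread.get(`${userId}:${channelId}`) ?? 0;
}
```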

3.7 Push Notification Integration

flowchart LR
    subgraph "Message Pipeline"
        MQ[Kafka Topic:<br/>messages]
    end

    subgraph "Notification Service"
        NC[Notification Consumer]
        PC[Presence Check]
        DT[Device Token Store<br/>Redis/PostgreSQL]
        RL[Rate Limiter<br/>Max 1 push/5s per user]
    end

    subgraph "Push Providers"
        FCM[Firebase Cloud Messaging<br/>Android]
        APNS[Apple Push Notification<br/>iOS]
        WPN[Web Push<br/>Browser]
    end

    MQ --> NC
    NC --> PC
    PC -->|User OFFLINE| DT
    DT --> RL
    RL --> FCM & APNS & WPN
    PC -->|User ONLINE| NC
    NC -->|Skip push| NC

    style PC fill:#f9a825

Offline Message Queue:

While user B is offline, messages accumulate. When B comes back online:

1. B connects via WebSocket
2. Chat server queries: messages WHERE channel_id IN (B's channels)
   AND message_id > B's last_synced_message_id
3. Deliver all pending messages (batched)
4. Update B's sync pointer
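Step 2 of this sync can be sketched as a filter over the user's channels; because ULIDs sort lexicographically, plain string comparison works as the cursor (all names are illustrative):

```javascript
// offline-sync.js — replay messages newer than the user's sync pointer
function pendingMessages(channelMessages, userChannels, lastSyncedId) {
  const pending = [];
  for (const channelId of userChannels) {
    for (const msg of channelMessages.get(channelId) ?? []) {
      // ULID string comparison ≡ time comparison (fixed length, lexicographic)
      if (msg.messageId > lastSyncedId) pending.push(msg);
    }
  }
  // Deliver oldest-first so the client appends in order
  return pending.sort((a, b) => (a.messageId < b.messageId ? -1 : 1));
}
```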

3.8 Message Search — Elasticsearch

// Elasticsearch index mapping
{
  "mappings": {
    "properties": {
      "channel_id": { "type": "keyword" },
      "message_id": { "type": "keyword" },
      "sender_id": { "type": "keyword" },
      "content": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "created_at": { "type": "date" },
      "content_type": { "type": "keyword" }
    }
  }
}

Search flow: Message written to Cassandra → CDC (Change Data Capture) → Kafka → Elasticsearch Indexer → ES.

Do not search Cassandra directly — Cassandra supports only primary-key lookups, not full-text search.

3.9 End-to-End Encryption (E2E) — Signal Protocol Overview

flowchart LR
    subgraph "User A"
        A_PK[Public Key A]
        A_SK[Private Key A]
        A_ENC[Encrypt with<br/>shared secret]
    end

    subgraph "Server (Cannot Read)"
        SRV[Server stores<br/>encrypted blob only]
    end

    subgraph "User B"
        B_PK[Public Key B]
        B_SK[Private Key B]
        B_DEC[Decrypt with<br/>shared secret]
    end

    A_PK -->|Key Exchange<br/>X3DH Protocol| B_PK
    B_PK --> A_ENC
    A_ENC -->|Encrypted message| SRV
    SRV -->|Encrypted message| B_DEC
    A_SK --> A_ENC
    B_SK --> B_DEC

    style SRV fill:#ff6666,color:#fff

Signal Protocol flow (simplified):

  1. Key Exchange (X3DH — Extended Triple Diffie-Hellman):

    • Users A and B exchange public keys through the server
    • They derive a shared secret the server does NOT know
  2. Double Ratchet Algorithm:

    • Every message is encrypted with a different key
    • If one key is compromised, only one message is exposed (forward secrecy)
  3. Server's role: relay encrypted blobs only. It cannot read the content.

Trade-off: with E2E encryption, server-side search cannot work. Clients must search locally.


4. Security

4.1 End-to-End Encryption (Details)

| Aspect | Implementation |
|---|---|
| Protocol | Signal Protocol (WhatsApp, Signal) |
| Key exchange | X3DH (Extended Triple Diffie-Hellman) |
| Message encryption | AES-256-GCM |
| Key rotation | Double Ratchet — a fresh key per message |
| Group E2E | Sender Keys protocol (each member has a sender key) |
| Key backup | Client-side encrypted backup (user's passphrase) |

4.2 Message Retention Policy

# retention-policy.yml
retention:
  text_messages:
    1_on_1: forever          # User can still delete manually
    group: forever
    deleted_by_user:
      soft_delete: true      # Mark as deleted, keep 30 days for compliance
      hard_delete_after: 30d
  media:
    original: 1y             # Keep originals for 1 year
    thumbnails: forever
    after_expiry: delete_original_keep_thumbnail
  audit_logs:
    retention: 7y            # Compliance requirement
    storage: cold_storage    # S3 Glacier after 90 days

4.3 Spam & Abuse Detection

Message pipeline:
  User sends message
    → Rate limiter (max 60 messages/minute per user)
    → Content filter (regex: phone numbers, URLs → flag)
    → ML spam classifier (real-time inference, < 10ms)
    → If flagged → hold for review, notify user
    → If clean → deliver normally

| Layer | Technique | Latency Budget |
|---|---|---|
| Rate limiting | Token bucket per user | < 1ms |
| Keyword filter | Bloom filter + regex | < 2ms |
| ML classifier | TensorFlow Serving (text classification) | < 10ms |
| Image scanning | PhotoDNA / perceptual hashing | < 100ms (async) |
| Human review | Queue for flagged content | Async |
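A token-bucket limiter for the 60 messages/minute rule takes only a few lines. This sketch keeps state in memory; a real deployment would keep the bucket in Redis so all chat servers share it:

```javascript
// rate-limit.js — token-bucket sketch for the 60 messages/minute limit above
const CAPACITY = 60;               // burst size
const REFILL_PER_MS = 60 / 60_000; // 60 tokens per minute

const buckets = new Map(); // userId → { tokens, lastMs }

function allowMessage(userId, nowMs = Date.now()) {
  const b = buckets.get(userId) ?? { tokens: CAPACITY, lastMs: nowMs };
  // Refill proportionally to elapsed time, capped at the bucket capacity
  b.tokens = Math.min(CAPACITY, b.tokens + (nowMs - b.lastMs) * REFILL_PER_MS);
  b.lastMs = nowMs;
  if (b.tokens < 1) { buckets.set(userId, b); return false; } // rate-limited
  b.tokens -= 1;
  buckets.set(userId, b);
  return true;
}
```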

4.4 CSAM Scanning — Ethical Considerations

CSAM = Child Sexual Abuse Material. Scanning for it is legally required in many countries.

| Approach | Description | Trade-off |
|---|---|---|
| PhotoDNA (Microsoft) | Hash-based matching against known CSAM database | Privacy-preserving (hashes only, no image viewing) |
| Client-side scanning | Scan on device before upload | Controversial (Apple withdrew it) — conflicts with E2E |
| Server-side scanning | Scan after upload, before delivery | Doesn't work with E2E-encrypted media |
| Report-based | Scan only when a user reports | Misses proactive detection |

Ethical dilemma: E2E encryption protects privacy but also shields criminals. There is no perfect answer. Most platforms (WhatsApp, Telegram) rely on metadata analysis + user reporting instead of breaking E2E.


5. DevOps

5.1 WebSocket Scaling with Sticky Sessions

Problem: a WebSocket is a stateful connection. If the load balancer routes to a different server, the connection is lost.

Solution 1: Sticky Sessions (Layer 7)

# nginx.conf — WebSocket sticky sessions
upstream chat_servers {
    ip_hash;  # Same client IP → same server
    server chat-server-1:8080;
    server chat-server-2:8080;
    server chat-server-3:8080;
}
 
server {
    listen 443 ssl;
    server_name chat.example.com;
 
    location /ws {
        proxy_pass http://chat_servers;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 86400s;  # 24h timeout for WebSocket
        proxy_send_timeout 86400s;
    }
}

Solution 2: Connection Registry (recommended at large scale)

Redis connection registry:
  ws:conn:user_123 → chat-server-7  (TTL: 300s, refresh on heartbeat)
  ws:conn:user_456 → chat-server-2

Routing logic:
  1. Message for user_123
  2. Lookup Redis: user_123 → chat-server-7
  3. Route message to chat-server-7 via internal message bus
  4. chat-server-7 pushes to user_123's WebSocket
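The routing logic above, sketched with an in-memory Map standing in for the Redis registry (names are illustrative; `busPublish` is a placeholder for the internal message bus):

```javascript
// registry.js — in-memory sketch of the Redis connection registry above
const registry = new Map(); // userId → chatServerId (stand-in for ws:conn:{user})

function registerConnection(userId, serverId) {
  // In the real system this key carries a 300s TTL, refreshed on heartbeat
  registry.set(userId, serverId);
}

function routeMessage(userId, payload, busPublish) {
  const serverId = registry.get(userId);
  if (!serverId) return false;               // no entry → user offline → push path
  busPublish(serverId, { userId, payload }); // internal message bus hop
  return true;
}
```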

5.2 Monitoring WebSocket Connections

# prometheus-alerts.yml — Chat System specific
groups:
  - name: chat_system_alerts
    rules:
      - alert: WebSocketConnectionsHigh
        expr: websocket_active_connections > 60000  # 92% of 65K limit
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "WebSocket connections near limit ({{ $value }}/65000)"
 
      - alert: MessageDeliveryLatencyHigh
        expr: histogram_quantile(0.99, rate(message_delivery_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 message delivery latency > 500ms"
 
      - alert: KafkaConsumerLag
        expr: kafka_consumer_group_lag > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka consumer lag > 100K messages, delivery delayed"
 
      - alert: WebSocketReconnectionSpike
        expr: rate(websocket_reconnections_total[5m]) > 1000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "WebSocket reconnection storm detected ({{ $value }}/s)"
 
      - alert: PresenceServiceLatency
        expr: histogram_quantile(0.95, rate(presence_update_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Presence update P95 latency > 100ms"
 
      - alert: CassandraWriteLatency
        expr: cassandra_write_latency_p99 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Cassandra P99 write latency > 50ms"

5.3 Grafana Dashboard Essentials

| Panel | PromQL | Threshold |
|---|---|---|
| Active WebSocket Connections | websocket_active_connections | Warning: 80% capacity |
| Message Throughput | rate(messages_sent_total[1m]) | Compare vs estimation (23K/s avg) |
| Message Delivery Latency P99 | histogram_quantile(0.99, ...) | < 200ms |
| Kafka Consumer Lag | kafka_consumer_group_lag | < 10K |
| Presence Updates/s | rate(presence_updates_total[1m]) | Monitor for fan-out storms |
| Undelivered Messages | undelivered_messages_gauge | Should trend to 0 |
| WebSocket Error Rate | rate(websocket_errors_total[5m]) | < 0.1% |

5.4 Kafka for Message Delivery

# kafka-topics.yml
topics:
  - name: chat.messages.1on1
    partitions: 64          # Partition by channel_id hash
    replication_factor: 3
    config:
      retention.ms: 604800000   # 7 days (messages persisted in Cassandra)
      min.insync.replicas: 2
      compression.type: lz4
 
  - name: chat.messages.group
    partitions: 128         # Higher throughput for group messages
    replication_factor: 3
    config:
      retention.ms: 604800000
      min.insync.replicas: 2
 
  - name: chat.presence
    partitions: 32
    replication_factor: 2   # Presence is less critical
    config:
      retention.ms: 3600000    # 1 hour only
      cleanup.policy: compact  # Keep latest status per user
 
  - name: chat.notifications
    partitions: 32
    replication_factor: 3
    config:
      retention.ms: 86400000   # 1 day

Partition Strategy: Messages for the same channel_id go to the same Kafka partition → ordering guaranteed within a conversation.


6. Code Examples

6.1 Node.js WebSocket Server with Rooms

// chat-server.js
// WebSocket server with rooms (channels), presence heartbeat, message routing
 
const WebSocket = require('ws');
const Redis = require('ioredis');
const { Kafka } = require('kafkajs');
const { ulid } = require('ulid');
 
// --- Config ---
const PORT = process.env.PORT || 8080;
const SERVER_ID = process.env.SERVER_ID || 'chat-server-1';
const HEARTBEAT_INTERVAL = 5000;   // 5 seconds
const HEARTBEAT_TIMEOUT = 30000;   // 30 seconds → offline
 
// --- Connections ---
const redis = new Redis(process.env.REDIS_URL);
const redisSub = new Redis(process.env.REDIS_URL);
 
const kafka = new Kafka({ brokers: [process.env.KAFKA_BROKER || 'localhost:9092'] });
const producer = kafka.producer();
const consumer = kafka.consumer({ groupId: `${SERVER_ID}-group` });
 
// --- State ---
const clients = new Map();      // user_id → WebSocket
const userChannels = new Map(); // user_id → Set<channel_id>
 
// --- WebSocket Server ---
const wss = new WebSocket.Server({ port: PORT });
 
wss.on('connection', async (ws, req) => {
  const userId = authenticateFromToken(req); // Extract from JWT in query/header
  if (!userId) {
    ws.close(4001, 'Unauthorized');
    return;
  }
 
  // Register connection
  clients.set(userId, ws);
  await registerConnection(userId);
  await setPresence(userId, 'online');
 
  console.log(`[${SERVER_ID}] User ${userId} connected. Total: ${clients.size}`);
 
  // Heartbeat setup
  ws.isAlive = true;
  ws.on('pong', () => { ws.isAlive = true; });
 
  // Message handler
  ws.on('message', async (raw) => {
    try {
      const data = JSON.parse(raw);
      await handleMessage(userId, data);
    } catch (err) {
      ws.send(JSON.stringify({ type: 'error', message: 'Invalid message format' }));
    }
  });
 
  // Disconnect handler
  ws.on('close', async () => {
    clients.delete(userId);
    await unregisterConnection(userId);
    await setPresence(userId, 'offline');
    console.log(`[${SERVER_ID}] User ${userId} disconnected. Total: ${clients.size}`);
  });
 
  // Send pending messages (offline → online sync)
  await deliverPendingMessages(userId, ws);
});
 
// --- Message Handling ---
async function handleMessage(senderId, data) {
  switch (data.type) {
    case 'chat_message':
      await handleChatMessage(senderId, data);
      break;
    case 'typing':
      await handleTypingIndicator(senderId, data);
      break;
    case 'read_receipt':
      await handleReadReceipt(senderId, data);
      break;
    case 'heartbeat':
      await refreshPresence(senderId);
      break;
    default:
      console.warn(`Unknown message type: ${data.type}`);
  }
}
 
async function handleChatMessage(senderId, data) {
  const { channelId, content, contentType = 'text', mediaUrl = null } = data;
 
  // Generate time-sortable ID
  const messageId = ulid();
 
  const message = {
    messageId,
    channelId,
    senderId,
    content,
    contentType,
    mediaUrl,
    createdAt: Date.now(),
  };
 
  // ACK to sender immediately
  sendToUser(senderId, {
    type: 'message_ack',
    messageId,
    channelId,
    status: 'accepted',
    timestamp: message.createdAt,
  });
 
  // Publish to Kafka for persistence + delivery
  await producer.send({
    topic: 'chat.messages.1on1',
    messages: [{
      key: channelId,  // Partition by channel → ordering guarantee
      value: JSON.stringify(message),
    }],
  });
}
 
async function handleTypingIndicator(senderId, data) {
  const { channelId } = data;
  // Broadcast typing indicator to other members in channel
  const members = await getChannelMembers(channelId);
  for (const memberId of members) {
    if (memberId !== senderId) {
      sendToUser(memberId, {
        type: 'typing',
        channelId,
        userId: senderId,
        timestamp: Date.now(),
      });
    }
  }
}
 
async function handleReadReceipt(senderId, data) {
  const { channelId, lastReadMessageId } = data;
 
  // Update read pointer in Redis (will be persisted to Cassandra async)
  await redis.set(
    `read_receipt:${channelId}:${senderId}`,
    lastReadMessageId
  );
 
  // Reset unread count
  await redis.del(`unread:${senderId}:${channelId}`);
 
  // Notify other members
  const members = await getChannelMembers(channelId);
  for (const memberId of members) {
    if (memberId !== senderId) {
      sendToUser(memberId, {
        type: 'read_receipt',
        channelId,
        userId: senderId,
        lastReadMessageId,
      });
    }
  }
}
 
// --- Delivery ---
function sendToUser(userId, payload) {
  const ws = clients.get(userId);
  if (ws && ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify(payload));
    return true;
  }
  return false; // User not on this server or offline
}
 
// --- Presence ---
async function setPresence(userId, status) {
  if (status === 'online') {
    await redis.set(`presence:${userId}`, 'online', 'EX', 30);
  } else {
    await redis.del(`presence:${userId}`);
  }
  // Publish presence change for subscribers
  await redis.publish('presence_changes', JSON.stringify({
    userId,
    status,
    timestamp: Date.now(),
  }));
}
 
async function refreshPresence(userId) {
  await redis.expire(`presence:${userId}`, 30);  // Reset presence TTL
  await redis.expire(`ws:conn:${userId}`, 300);  // Keep the connection-registry entry alive too
}
 
// --- Connection Registry ---
async function registerConnection(userId) {
  await redis.set(`ws:conn:${userId}`, SERVER_ID, 'EX', 300);
}
 
async function unregisterConnection(userId) {
  await redis.del(`ws:conn:${userId}`);
}
 
// --- Heartbeat Check ---
const heartbeatInterval = setInterval(() => {
  wss.clients.forEach((ws) => {
    if (!ws.isAlive) {
      ws.terminate(); // No pong since the last check → connection is dead
      return;
    }
    ws.isAlive = false;
    ws.ping();
  });
}, HEARTBEAT_INTERVAL);

// Stop probing when the server shuts down
wss.on('close', () => clearInterval(heartbeatInterval));
 
// --- Kafka Consumer ---
// Each chat server should use its OWN consumer groupId, so that every server
// sees every message and can deliver to its locally connected recipients.
async function startConsumer() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'chat.messages.1on1', fromBeginning: false });
 
  await consumer.run({
    eachMessage: async ({ message }) => {
      const msg = JSON.parse(message.value.toString());
      const { channelId, senderId } = msg;
 
      // Get channel members and deliver
      const members = await getChannelMembers(channelId);
      for (const memberId of members) {
        if (memberId === senderId) continue; // Don't echo back to sender

        const delivered = sendToUser(memberId, {
          type: 'new_message',
          ...msg,
        });

        // Unread count rises for every recipient; their read_receipt resets it
        await redis.incr(`unread:${memberId}:${channelId}`);

        if (!delivered) {
          // Recipient is offline or connected to another server — the sync
          // service and push-notification path (Kafka) take over delivery
        }
      }
    },
  });
}
 
// --- Helper stubs ---
function authenticateFromToken(req) {
  // Extract the JWT from the `token` query param (the client connects with ?token=...)
  // In production: verify the JWT signature and expiry, then return the userId claim
  const url = new URL(req.url, `http://${req.headers.host}`);
  return url.searchParams.get('token'); // Simplified: the raw token stands in for userId
}
 
async function getChannelMembers(channelId) {
  // In production: cache in Redis, fallback to PostgreSQL
  const members = await redis.smembers(`channel:members:${channelId}`);
  return members;
}
 
async function deliverPendingMessages(userId, ws) {
  // In production: query Cassandra for messages after user's last sync point
  console.log(`[${SERVER_ID}] Delivering pending messages to ${userId}`);
}
 
// --- Startup ---
(async () => {
  await producer.connect();
  await startConsumer();
  console.log(`[${SERVER_ID}] Chat server running on port ${PORT}`);
})();
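One limitation of the registry above: `clients` keeps a single socket per user, so a second device silently replaces the first — at odds with FR8 (multi-device sync). A minimal sketch of a multi-socket registry, with sockets mocked as plain objects for illustration:

```javascript
// Multi-device registry sketch: user_id → Set<WebSocket>, so phone and laptop
// can both receive. Sockets are mocked as plain objects here.
const deviceSockets = new Map(); // userId → Set of sockets

function addSocket(userId, ws) {
  if (!deviceSockets.has(userId)) deviceSockets.set(userId, new Set());
  deviceSockets.get(userId).add(ws);
}

function removeSocket(userId, ws) {
  const set = deviceSockets.get(userId);
  if (!set) return;
  set.delete(ws);
  if (set.size === 0) deviceSockets.delete(userId); // last device gone → user offline
}

// sendToUser now fans out to every connected device
function sendToUserAllDevices(userId, payload) {
  let delivered = 0;
  for (const ws of deviceSockets.get(userId) ?? []) {
    ws.send(JSON.stringify(payload)); // production code would check ws.readyState === OPEN
    delivered++;
  }
  return delivered;
}

const phone = { sent: [], send(m) { this.sent.push(m); } };
const laptop = { sent: [], send(m) { this.sent.push(m); } };
addSocket('u1', phone);
addSocket('u1', laptop);
console.log(sendToUserAllDevices('u1', { type: 'new_message' })); // 2
```

Presence also changes with multiple devices: the user is offline only when the *last* socket closes, which the `set.size === 0` check captures.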

6.2 Message Schema — TypeScript Types

// types/message.ts
 
/** Message content types */
type ContentType = 'text' | 'image' | 'file' | 'video' | 'audio' | 'system';
 
/** WebSocket message types (client ↔ server) */
type WSMessageType =
  | 'chat_message'      // Client → Server: send message
  | 'new_message'       // Server → Client: receive message
  | 'message_ack'       // Server → Client: message accepted
  | 'typing'            // Bidirectional: typing indicator
  | 'read_receipt'      // Bidirectional: mark as read
  | 'presence'          // Server → Client: online/offline
  | 'heartbeat'         // Client → Server: keep alive
  | 'error';            // Server → Client: error
 
/** Chat message stored in Cassandra */
interface ChatMessage {
  messageId: string;      // ULID — time-sortable unique ID
  channelId: string;      // UUID of 1-on-1 or group channel
  senderId: string;       // UUID of sender
  content: string;        // Message text (encrypted if E2E)
  contentType: ContentType;
  mediaUrl: string | null;
  thumbnailUrl: string | null;
  createdAt: number;      // Unix timestamp (ms)
  updatedAt: number | null;
  deleted: boolean;
  replyTo: string | null; // messageId of parent (for threads)
}
 
/** Channel (conversation) stored in PostgreSQL */
interface Channel {
  channelId: string;
  type: 'direct' | 'group';
  name: string | null;        // NULL for 1-on-1
  createdBy: string;
  createdAt: number;
  memberCount: number;
  lastMessageId: string | null;
  lastMessageAt: number | null;
}
 
/** Channel membership */
interface ChannelMember {
  channelId: string;
  userId: string;
  role: 'owner' | 'admin' | 'member';
  joinedAt: number;
  lastReadMessageId: string | null;
  notificationMuted: boolean;
}
 
/** Presence status */
interface UserPresence {
  userId: string;
  status: 'online' | 'offline' | 'away';
  lastActiveAt: number;
  device: 'mobile' | 'web' | 'desktop';
}
 
/** WebSocket envelope (all messages wrapped in this) */
interface WSEnvelope<T = unknown> {
  type: WSMessageType;
  payload: T;
  timestamp: number;
  requestId?: string;     // For request-response correlation
}

6.3 Presence Heartbeat — Client Side

// client/presence.js — Browser/React Native WebSocket client
 
class ChatClient {
  constructor(wsUrl, token) {
    this.wsUrl = wsUrl;
    this.token = token;
    this.ws = null;
    this.heartbeatTimer = null;
    this.reconnectAttempts = 0;
    this.maxReconnectAttempts = 10;
    this.baseReconnectDelay = 1000; // 1 second
  }
 
  connect() {
    this.ws = new WebSocket(`${this.wsUrl}?token=${this.token}`);
 
    this.ws.onopen = () => {
      console.log('Connected to chat server');
      this.reconnectAttempts = 0;
      this.startHeartbeat();
    };
 
    this.ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      this.handleMessage(data);
    };
 
    this.ws.onclose = (event) => {
      console.log(`Disconnected: ${event.code} ${event.reason}`);
      this.stopHeartbeat();
      this.scheduleReconnect();
    };
 
    this.ws.onerror = (error) => {
      console.error('WebSocket error:', error);
    };
  }
 
  // --- Heartbeat ---
  startHeartbeat() {
    this.heartbeatTimer = setInterval(() => {
      if (this.ws.readyState === WebSocket.OPEN) {
        this.ws.send(JSON.stringify({ type: 'heartbeat' }));
      }
    }, 5000); // Every 5 seconds
  }
 
  stopHeartbeat() {
    if (this.heartbeatTimer) {
      clearInterval(this.heartbeatTimer);
      this.heartbeatTimer = null;
    }
  }
 
  // --- Reconnection with exponential backoff + jitter ---
  scheduleReconnect() {
    if (this.reconnectAttempts >= this.maxReconnectAttempts) {
      console.error('Max reconnect attempts reached. Please refresh.');
      return;
    }
 
    // Exponential backoff with jitter to prevent thundering herd
    const delay = Math.min(
      this.baseReconnectDelay * Math.pow(2, this.reconnectAttempts),
      30000 // Max 30 seconds
    );
    const jitter = delay * (0.5 + Math.random() * 0.5); // 50-100% of delay
 
    console.log(`Reconnecting in ${Math.round(jitter)}ms (attempt ${this.reconnectAttempts + 1})`);
 
    setTimeout(() => {
      this.reconnectAttempts++;
      this.connect();
    }, jitter);
  }
 
  // --- Send Message ---
  sendMessage(channelId, content, contentType = 'text') {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return false; // Not connected — caller should queue/retry
    this.ws.send(JSON.stringify({
      type: 'chat_message',
      channelId,
      content,
      contentType,
    }));
    return true;
  }

  // --- Mark as Read ---
  markAsRead(channelId, lastReadMessageId) {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return false;
    this.ws.send(JSON.stringify({
      type: 'read_receipt',
      channelId,
      lastReadMessageId,
    }));
    return true;
  }
 
  handleMessage(data) {
    switch (data.type) {
      case 'new_message':
        this.onNewMessage?.(data);
        break;
      case 'message_ack':
        this.onMessageAck?.(data);
        break;
      case 'typing':
        this.onTyping?.(data);
        break;
      case 'read_receipt':
        this.onReadReceipt?.(data);
        break;
      case 'presence':
        this.onPresenceChange?.(data);
        break;
      case 'error':
        console.error('Server error:', data.message);
        break;
    }
  }
}
 
// Usage:
// const chat = new ChatClient('wss://chat.example.com/ws', jwtToken);
// chat.onNewMessage = (msg) => updateUI(msg);
// chat.connect();

7. System Design Diagram — Complete Architecture

flowchart TB
    subgraph "Clients"
        WEB["🖥 Web Client<br/>(React)"]
        IOS["📱 iOS Client<br/>(Swift)"]
        AND["📱 Android Client<br/>(Kotlin)"]
    end

    subgraph "Edge Layer"
        CDN["CDN<br/>(CloudFront)"]
        LB["Load Balancer<br/>(ALB + Nginx)"]
    end

    subgraph "API Layer (HTTP - Stateless)"
        GW["API Gateway<br/>(Kong / Envoy)"]
        AUTH["Auth Service<br/>(JWT + OAuth2)"]
        USER["User Service<br/>(Profile, Contacts)"]
        GROUP["Group Service<br/>(CRUD, Members)"]
        MEDIA["Media Service<br/>(Pre-signed URLs)"]
        SRCH["Search Service<br/>(ES queries)"]
    end

    subgraph "Real-time Layer (WebSocket - Stateful)"
        direction LR
        WS1["Chat Server 1<br/>(WS, 65K conn)"]
        WS2["Chat Server 2<br/>(WS, 65K conn)"]
        WSN["Chat Server N<br/>(WS, 65K conn)"]
    end

    subgraph "Presence Layer"
        PRES["Presence Service<br/>(Heartbeat, Status)"]
    end

    subgraph "Message Bus"
        K1["Kafka<br/>chat.messages"]
        K2["Kafka<br/>chat.presence"]
        K3["Kafka<br/>chat.notifications"]
    end

    subgraph "Processing Layer"
        FAN["Fan-out Service<br/>(Group delivery)"]
        SYNC["Sync Service<br/>(Offline → Online)"]
        IDX["Indexer<br/>(Cassandra → ES)"]
        THUMB["Thumbnail Service<br/>(Image processing)"]
    end

    subgraph "Notification"
        NOTIF["Notification Service"]
        FCM["FCM<br/>(Android)"]
        APNS["APNs<br/>(iOS)"]
        WPUSH["Web Push"]
    end

    subgraph "Data Layer"
        CASS[("Cassandra<br/>Messages<br/>(Write-optimized)")]
        PG[("PostgreSQL<br/>Users, Groups<br/>Channels")]
        RD[("Redis Cluster<br/>Presence, Sessions<br/>Unread counts")]
        S3[("S3<br/>Media files")]
        ES[("Elasticsearch<br/>Message search")]
    end

    WEB & IOS & AND -->|HTTPS| LB
    WEB & IOS & AND -.->|WSS| WS1 & WS2 & WSN
    WEB & IOS & AND -->|Media| CDN

    LB --> GW
    GW --> AUTH & USER & GROUP & MEDIA & SRCH

    WS1 & WS2 & WSN --> K1
    WS1 & WS2 & WSN --> PRES
    PRES --> RD
    PRES --> K2

    K1 --> FAN & SYNC
    K1 --> CASS
    FAN --> WS1 & WS2 & WSN
    FAN --> K3
    K3 --> NOTIF
    NOTIF --> FCM & APNS & WPUSH

    SYNC --> CASS
    IDX --> ES
    CASS --> IDX

    USER & GROUP --> PG
    MEDIA --> S3
    S3 --> CDN
    S3 --> THUMB
    THUMB --> S3

    SRCH --> ES

8. Aha Moments — Key Takeaways

#1 — WebSocket is a game changer but a scaling nightmare: WebSocket gives the lowest latency, but stateful connections make horizontal scaling hard. Solution: a connection registry in Redis plus an internal message bus (Kafka) to route messages between servers.

#2 — Chat is write-heavy, but read-heavy at the delivery layer: 69K write QPS (messages into the DB) versus 347K read QPS (delivery to recipients). Fan-out on write for small groups, fan-out on read for large groups — a hybrid approach.
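A sketch of how a delivery service might pick the strategy per channel; the 50-member cutoff is an assumed threshold, not a value from the chapter:

```javascript
// Hybrid fan-out sketch (the cutoff is an assumption for illustration).
// Small groups: push a copy to each member's inbox at write time (fan-out on write).
// Large groups: write once per channel; readers pull on demand (fan-out on read).
const FANOUT_WRITE_LIMIT = 50; // hypothetical cutoff

function deliveryPlan(memberCount) {
  return memberCount <= FANOUT_WRITE_LIMIT
    ? { strategy: 'fanout-on-write', writesPerMessage: memberCount }
    : { strategy: 'fanout-on-read', writesPerMessage: 1 };
}

console.log(deliveryPlan(8).strategy);   // fanout-on-write
console.log(deliveryPlan(100).strategy); // fanout-on-read
```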

#3 — NoSQL wins for messages: Cassandra/HBase partitioned by channel_id and clustered by message_id (TIMEUUID) gives O(1) lookup per conversation, sequential writes, and natural horizontal scaling. SQL is a poor fit for this workload.

#4 — Presence is deceptively expensive: "User X is online" sounds simple, but fanning out presence changes to the user's entire contact list creates a thundering-herd problem. Solution: subscribe-on-view (track presence only for conversations that are currently open).

#5 — Message ordering is channel-scoped, not global: Global ordering is unnecessary. Messages only need to stay in order within the same conversation. ULID plus Kafka partitioning by channel_id solves this elegantly.
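Why ULIDs work as an ordering key: the first 10 characters encode the millisecond timestamp in Crockford base32, so plain string comparison sorts by server time. A minimal ULID-style generator (production code would use the `ulid` npm package):

```javascript
// Minimal ULID-style ID: 48-bit ms timestamp in Crockford base32 + random tail.
const ENC = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'; // Crockford base32 alphabet

function encodeTime(ms, len = 10) {
  let out = '';
  for (let i = 0; i < len; i++) {
    out = ENC[ms % 32] + out;
    ms = Math.floor(ms / 32);
  }
  return out;
}

function randomPart(len = 16) {
  let out = '';
  for (let i = 0; i < len; i++) out += ENC[Math.floor(Math.random() * 32)];
  return out;
}

function ulid(ms = Date.now()) {
  return encodeTime(ms) + randomPart();
}

// Key property: an ID generated later sorts lexicographically after an earlier
// one, so ORDER BY message_id equals ORDER BY server time within a channel.
const a = ulid(1700000000000);
const b = ulid(1700000000001);
console.log(a < b); // true
```

Because the time prefix has a fixed width, comparison is decided inside it before any random characters are reached.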

#6 — E2E encryption kills server-side features: With E2E enabled, the server cannot search messages, scan content, or generate previews. A clear trade-off between privacy and functionality.

#7 — The pre-signed URL pattern eliminates the media bottleneck: Clients upload directly to S3; the server only generates the URL. No media passes through the chat server, so the backend handles only text.


9. Common Pitfalls

Pitfall 1: WebSocket Reconnection Storms

Wrong: Server restarts → 65K clients reconnect at the same time → the new server is overloaded immediately → crashes again → loop.

Right: Clients MUST implement exponential backoff with jitter:

  • Attempt 1: wait 1s + random(0-0.5s)
  • Attempt 2: wait 2s + random(0-1s)
  • Attempt 3: wait 4s + random(0-2s)
  • …capped at 30s

Jitter spreads reconnects out over time instead of having them all land at once.

Pitfall 2: Presence Fan-out in Large Groups

Wrong: User A comes online → push "A is online" to 500 friends → 500 WebSocket messages. 100K users logging in at the same time (9 AM peak) → 50M presence events.

Right: Subscribe on view — track presence only for conversations currently visible on screen. A user looking at 5 conversations subscribes to 5 presence channels, not 500.
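A sketch of subscribe-on-view on the client side; the `subscribe`/`unsubscribe` callbacks are hypothetical stand-ins for presence pub/sub channel operations:

```javascript
// Subscribe-on-view: track presence only for conversations on screen.
class PresenceViewManager {
  constructor(subscribe, unsubscribe) {
    this.subscribe = subscribe;     // e.g. join a presence pub/sub channel
    this.unsubscribe = unsubscribe; // e.g. leave it
    this.visible = new Set();
  }

  // Called whenever the visible conversation list changes (scroll, tab switch)
  setVisible(channelIds) {
    const next = new Set(channelIds);
    for (const id of next) if (!this.visible.has(id)) this.subscribe(`presence:${id}`);
    for (const id of this.visible) if (!next.has(id)) this.unsubscribe(`presence:${id}`);
    this.visible = next;
  }
}

const subs = new Set();
const mgr = new PresenceViewManager(ch => subs.add(ch), ch => subs.delete(ch));
mgr.setVisible(['c1', 'c2', 'c3', 'c4', 'c5']); // 5 conversations on screen → 5 subscriptions
mgr.setVisible(['c4', 'c5', 'c6']);             // scroll → drop 3, add 1
console.log(subs.size); // 3
```

The server then fans a presence change out only to current subscribers of that channel, not to the user's whole contact list.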

Pitfall 3: Message Ordering in a Distributed System

Wrong: Using the client device's timestamp as the ordering key → clocks are not synchronized → messages display out of order.

Right: The server generates message_id with ULID/Snowflake (server timestamp). Kafka partitions by channel_id, so messages within the same conversation always stay in order. Clients display by message_id, not by client timestamp.

Pitfall 4: Not Handling Duplicate Delivery

Wrong: Network hiccup → the client never receives the ACK → it retries → the message is sent twice → the receiver sees a duplicate.

Right: Idempotency key — the client generates a requestId for each message. The server checks the requestId in Redis (TTL 5 min). If it was already processed, return the existing ACK instead of processing again.
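A sketch of the server-side dedup check, with an in-memory Map standing in for Redis (real code would store the cached ACK under the requestId with a 5-minute TTL, e.g. `SET ... NX EX 300`):

```javascript
// Idempotent message handling: a retried requestId returns the cached ACK
// instead of being processed a second time.
const seen = new Map(); // requestId → cached ACK (TTL 5 min in Redis)

function handleIncoming(requestId, process) {
  if (seen.has(requestId)) {
    return seen.get(requestId); // Duplicate: return the cached ACK, skip processing
  }
  const ack = process();        // First delivery: process once and cache the ACK
  seen.set(requestId, ack);
  return ack;
}

let processed = 0;
const send = () =>
  handleIncoming('req-123', () => ({ status: 'accepted', n: ++processed }));

send();             // first delivery → processed
const ack = send(); // retry after a lost ACK → same ACK, no reprocessing
console.log(processed, ack.n); // 1 1
```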

Pitfall 5: Media Upload Blocking the Chat Flow

Wrong: Uploading a 10MB image keeps the WebSocket connection busy → text messages are blocked behind it.

Right: Media uploads go over a separate HTTP connection (pre-signed URL → direct to S3). The WebSocket carries only text messages and lightweight events. The two channels are independent.

Pitfall 6: Unbounded Group Notifications

Wrong: A group of 100 members: each message triggers 99 push notifications → 1,000 messages/hour = 99K notifications/hour for a single group.

Right: Notification batching — aggregate 5 messages within 30 seconds into one notification: "You have 5 new messages in Group ABC". If the user is online, do NOT send a push notification at all.
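A sketch of a per-user, per-channel notification batcher; `maxBatch` and `windowMs` mirror the 5-messages/30-seconds rule above, and the push function is a stand-in for FCM/APNs:

```javascript
// Batch pushes per (user, channel): flush when `maxBatch` messages accumulate
// or when the `windowMs` timer fires, whichever comes first.
class NotificationBatcher {
  constructor(push, { maxBatch = 5, windowMs = 30000 } = {}) {
    this.push = push;         // e.g. FCM/APNs send function
    this.maxBatch = maxBatch;
    this.windowMs = windowMs;
    this.buffers = new Map(); // `${userId}:${channelId}` → { count, timer }
  }

  add(userId, channelId) {
    const key = `${userId}:${channelId}`;
    let buf = this.buffers.get(key);
    if (!buf) {
      buf = { count: 0, timer: setTimeout(() => this.flush(userId, channelId), this.windowMs) };
      this.buffers.set(key, buf);
    }
    if (++buf.count >= this.maxBatch) this.flush(userId, channelId);
  }

  flush(userId, channelId) {
    const key = `${userId}:${channelId}`;
    const buf = this.buffers.get(key);
    if (!buf) return;
    clearTimeout(buf.timer);
    this.buffers.delete(key);
    this.push(userId, `You have ${buf.count} new messages in ${channelId}`);
  }
}

const sent = [];
const batcher = new NotificationBatcher((u, text) => sent.push(text), { maxBatch: 5 });
for (let i = 0; i < 7; i++) batcher.add('u1', 'group-abc'); // 5 flush immediately, 2 buffered
batcher.flush('u1', 'group-abc');                           // timer would flush the rest
console.log(sent.length); // 2
```

The online check belongs one layer up: consult the presence key before calling `add` at all, so online users never enter the batcher.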


10. Wrap Up — Step 4

Scaling Considerations

| Component | Scaling Strategy |
|---|---|
| Chat Servers (WebSocket) | Horizontal + connection registry (Redis) |
| Cassandra | Add nodes; virtual nodes auto-rebalance |
| Kafka | Add partitions + brokers |
| Redis | Cluster mode (16384 hash slots) |
| Media (S3) | Virtually unlimited |
| Search (ES) | Add data nodes, index sharding |
| Notification | Horizontal + per-user rate limiting |

Multi-region Deployment

Region US-East:
  Chat servers (50%) ← Users in Americas
  Cassandra DC-1 (full replica)
  Kafka cluster (independent)

Region EU-West:
  Chat servers (30%) ← Users in Europe/Africa
  Cassandra DC-2 (full replica)
  Kafka cluster (independent)

Region AP-Southeast:
  Chat servers (20%) ← Users in Asia-Pacific
  Cassandra DC-3 (full replica)
  Kafka cluster (independent)

Cross-region sync:
  Cassandra multi-DC replication (async, eventual consistency)
  Cross-region message routing via dedicated Kafka MirrorMaker

Cost Optimization

| Item | Monthly Cost (estimated, 50M DAU) | Optimization |
|---|---|---|
| WebSocket servers (100 instances) | ~$30K | Spot instances for non-critical |
| Cassandra cluster (50 nodes) | ~$50K | Tiered storage (hot/cold) |
| Kafka cluster (20 brokers) | ~$15K | Managed service (MSK) |
| S3 storage (14.6 PB/year) | ~$300K | S3 Intelligent-Tiering |
| CDN (media delivery) | ~$100K | Cache optimization, compression |
| Redis cluster (presence) | ~$10K | Managed ElastiCache |
| **Total** | **~$505K/month** | |

Future Enhancements

  • Voice/Video calls: WebRTC integration (TURN/STUN servers)
  • Message reactions: Lightweight (just emoji + userId, no separate table needed)
  • Threads: Reply-to with parent_message_id, lazy-loaded
  • Bots/Integrations: Webhook-based bot framework
  • Stories/Status: Ephemeral content (TTL 24h), separate storage
  • AI features: Smart reply suggestions, message summarization


References


Previous week: Tuan-16-Design-URL-Shortener — URL Shortener case study · Next week: Tuan-18-Design-News-Feed — News Feed system with fan-out patterns