Tuần 17: Design a Chat System
“A chat system looks simple — send a message from A to B. But when 50 million people send at once, every millisecond becomes a battle between consistency and availability.”
Tags: system-design chat-system websocket alex-xu case-study Student: Hieu Prerequisite: Tuan-02-Back-of-the-envelope · Tuan-08-Message-Queue · Tuan-03-Networking-DNS-CDN Related: Tuan-19-Design-Notification-System · Tuan-15-Data-Security-Encryption · Tuan-13-Monitoring-Observability · Tuan-07-Database-Sharding-Replication Reference: Alex Xu, System Design Interview — Chapter 12: Design a Chat System
Step 1 — Understand the Problem & Design Scope
1.1 Clarifying Questions
In an interview, always ask before designing. Below are the key questions with assumed answers:
| Question | Answer | Notes |
|---|---|---|
| 1-on-1 chat or group chat? | Both | Groups capped at 100 members |
| Mobile app, web app, or both? | Both | Must support multi-device |
| How large is the scale? | 50M DAU | See Tuan-02-Back-of-the-envelope |
| Online/offline indicator? | Yes | Presence service |
| Message size limit? | 100,000 text characters | Similar to Slack |
| Media support (images, files)? | Yes | Images, documents, files |
| End-to-end encryption? | Yes | For 1-on-1 chat |
| Message retention? | Forever | No auto-deletion |
| Push notification when offline? | Yes | FCM/APNs integration |
| Read receipts? | Yes | "Seen" indicator |
1.2 Functional Requirements
- FR1: 1-on-1 chat with low latency (message delivered in < 200ms within the same region)
- FR2: Group chat with up to 100 members
- FR3: Online/offline presence indicator
- FR4: Push notifications for offline users
- FR5: Media sharing (images, files) up to 25MB
- FR6: Read receipts (delivered, read)
- FR7: Message history & search
- FR8: Multi-device sync (read on phone and laptop simultaneously)
1.3 Non-functional Requirements
- NFR1: High availability — 99.99% uptime (< 52 minutes of downtime/year)
- NFR2: Low latency — message delivery < 200ms (same region)
- NFR3: Consistency — messages must arrive in order within a conversation
- NFR4: Durability — no sent message may be lost
- NFR5: Scalability — 50M DAU, 2B messages/day
1.4 Capacity Estimation
Numbers reused from Tuan-02-Back-of-the-envelope — Chat System section.
Assumptions:
| Parameter | Value | Explanation |
|---|---|---|
| DAU | 50M | WhatsApp/Telegram-like scale |
| Messages/user/day | 40 | Both 1-on-1 and group |
| Avg message size (text) | 100 bytes | UTF-8 encoded |
| % messages in groups | 60% | Group chat dominates |
| Avg group size | 50 members | Delivery fan-out factor |
| Media messages | 10% of total | Avg 200KB per media file |
| Retention | Forever | — |
QPS Calculation:
Total messages = 50M DAU × 40 messages/day = 2B messages/day
Average write QPS = 2B ÷ 86,400s ≈ 23K/s → peak (×3) ≈ 69K/s
Alert: 69K write QPS — this is a write-heavy system. A Message Queue is needed as a buffer → Tuan-08-Message-Queue.
Read QPS (each user reads more than they send, estimated at 5×):
Peak read QPS ≈ 69K × 5 ≈ 347K/s
Storage:
Text: 2B messages/day × 100 bytes ≈ 200 GB/day → ~73 TB/year
Media: 10% × 2B = 200M media/day × 200KB ≈ 40 TB/day → ~14.6 PB/year
Concurrent WebSocket Connections:
Assume ~10% of DAU connected at once → ~5M concurrent connections
Each WebSocket server handles ~65K connections (limited by file descriptors). Minimum needed:
5M ÷ 65K ≈ 77 servers
Bandwidth:
Media: 40 TB/day ÷ 86,400s ≈ 463 MB/s ≈ 3.7 Gbps average → peak (×3) ≈ 11 Gbps
Estimation Summary
| Metric | Value |
|---|---|
| Write QPS (peak) | ~69K/s |
| Read QPS (peak) | ~347K/s |
| Concurrent WebSocket connections | ~5M |
| Text storage/year | ~73 TB |
| Media storage/year | ~14.6 PB |
| WebSocket servers (minimum) | ~77 |
| Media bandwidth (peak) | ~11 Gbps |
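The summary table can be sanity-checked with a few lines of JavaScript. This is a sketch; the ×3 peak factor and the 10% concurrency ratio are the assumptions used in the estimation above:

```javascript
// back-of-envelope.js — reproduce the capacity estimation above
const DAU = 50e6;
const MSGS_PER_USER = 40;
const PEAK_FACTOR = 3; // assumption: peak traffic = 3x average

const msgsPerDay = DAU * MSGS_PER_USER;                  // 2B messages/day
const avgWriteQps = msgsPerDay / 86400;                  // ~23K/s
const peakWriteQps = avgWriteQps * PEAK_FACTOR;          // ~69K/s
const peakReadQps = peakWriteQps * 5;                    // reads ~5x writes → ~347K/s
const textPerYearTB = msgsPerDay * 100 * 365 / 1e12;     // 100 B/msg → ~73 TB/year
const mediaPerDayB = msgsPerDay * 0.10 * 200e3;          // 10% media × 200 KB
const mediaPerYearPB = mediaPerDayB * 365 / 1e15;        // ~14.6 PB/year
const concurrentConns = DAU * 0.10;                      // assumption: 10% online at once
const wsServers = Math.ceil(concurrentConns / 65000);    // ~77 servers
const mediaPeakGbps = mediaPerDayB / 86400 * 8 * PEAK_FACTOR / 1e9; // ~11 Gbps

console.log({
  peakWriteQps: Math.round(peakWriteQps),
  peakReadQps: Math.round(peakReadQps),
  textPerYearTB,
  mediaPerYearPB,
  wsServers,
  mediaPeakGbps: mediaPeakGbps.toFixed(1),
});
```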
Step 2 — High-Level Design
2.1 Communication Protocols — Choosing a Protocol
This is the most important architectural decision for a chat system. Hieu needs a clear grasp of the trade-offs:
HTTP Polling
Client → Server: "Any new messages?" (every 3 seconds)
Server → Client: "No." (99% of the time)
| Pros | Cons |
|---|---|
| Simplest approach | Wastes bandwidth & CPU |
| Stateless, easy to scale | Latency = polling interval |
| Works through any firewall | 50M users × 1 poll/3s ≈ 16.7M req/s (!) |
Verdict: Not viable for a chat system at 50M DAU.
Long Polling (Hanging GET)
Client → Server: "Tell me when there's a new message" (request hangs)
Server: (waits until a message arrives or a 30s timeout)
Server → Client: "Here, a new message!"
Client → Server: (immediately sends a new request)
| Pros | Cons |
|---|---|
| More efficient than polling | Sender & receiver may connect to different servers |
| HTTP/1.1 compatible | Complex timeout handling |
| No WebSocket required | Still consumes resources holding connections open |
Verdict: Fallback option when WebSocket is unavailable.
WebSocket (Primary Choice)
Client ←→ Server: Full-duplex, persistent connection
Upgrade: websocket (HTTP → WS handshake)
| Pros | Cons |
|---|---|
| Full-duplex (bidirectional) | Stateful → hard to scale horizontally |
| Lowest latency | Needs sticky sessions or a connection registry |
| Efficient (no HTTP overhead) | Firewalls/proxies may block it |
| Ideal for real-time | Complex reconnection logic |
Verdict: Primary protocol for the chat system. WebSocket for message delivery, HTTP for everything else (auth, profile, search).
SSE (Server-Sent Events)
Server → Client: One-way stream (server push only)
Client → Server: Still uses HTTP POST to send messages
| Pros | Cons |
|---|---|
| Simpler than WebSocket | One-way only (server → client) |
| Auto-reconnect built in | Needs HTTP POST for client → server |
| Works over HTTP/2 | Limited to 6 connections/domain (HTTP/1.1) |
Verdict: Good fit for notifications/presence, not ideal for full chat.
Protocol Decision Matrix
| Criterion | Polling | Long Polling | WebSocket | SSE |
|---|---|---|---|---|
| Latency | High | Medium | Lowest | Low |
| Bidirectional | No | No | Yes | No |
| Scalability | Easy | Medium | Hard | Medium |
| Battery (mobile) | Poor | Medium | Good | Good |
| Firewall friendly | Good | Good | Medium | Good |
Decision: Use WebSocket for real-time messaging + presence. Use HTTP REST for user profiles, search, media upload, and group management.
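Since WebSocket is the primary protocol, the client needs the reconnection logic listed above as a WebSocket weakness. A minimal sketch of exponential backoff with jitter; the `wss://chat.example.com/ws` URL and delay parameters are illustrative, and `WebSocket` refers to the global constructor (browser, or Node.js 22+):

```javascript
// reconnecting-client.js — client-side WebSocket wrapper sketch with
// exponential backoff (URL and parameters are illustrative)
class ReconnectingChatSocket {
  constructor(url, { baseDelayMs = 1000, maxDelayMs = 30000 } = {}) {
    this.url = url;
    this.baseDelayMs = baseDelayMs;
    this.maxDelayMs = maxDelayMs;
    this.attempt = 0;
  }

  // Delay grows 1s, 2s, 4s, ... capped at maxDelayMs, plus up to 20% jitter
  // so 65K clients don't reconnect in lockstep when a chat server restarts.
  backoffDelay(attempt = this.attempt, jitter = Math.random()) {
    const exp = Math.min(this.baseDelayMs * 2 ** attempt, this.maxDelayMs);
    return exp + exp * 0.2 * jitter;
  }

  connect() {
    this.ws = new WebSocket(this.url); // global WebSocket constructor
    this.ws.onopen = () => { this.attempt = 0; }; // reset backoff on success
    this.ws.onclose = () => {
      const delay = this.backoffDelay();
      this.attempt += 1;
      setTimeout(() => this.connect(), delay);
    };
  }
}
```

Example: `new ReconnectingChatSocket('wss://chat.example.com/ws').connect()`.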
2.2 Service Decomposition
flowchart TB
  subgraph "Client Layer"
    WEB[Web Client]
    MOB[Mobile Client]
  end
  subgraph "API Gateway / Load Balancer"
    LB[Load Balancer<br/>Nginx / ALB]
  end
  subgraph "Stateless Services (HTTP)"
    AUTH[Auth Service<br/>JWT / OAuth2]
    USER[User Service<br/>Profile, Contacts]
    GROUP[Group Service<br/>Group CRUD, Members]
    MEDIA[Media Service<br/>Upload, Download]
    SEARCH[Search Service<br/>Message Search]
    NOTIF[Notification Service<br/>FCM / APNs]
  end
  subgraph "Stateful Services (WebSocket)"
    CS1[Chat Server 1<br/>WebSocket]
    CS2[Chat Server 2<br/>WebSocket]
    CS3[Chat Server N<br/>WebSocket]
  end
  subgraph "Presence Layer"
    PRES[Presence Service<br/>Online/Offline/Typing]
  end
  subgraph "Message Layer"
    MQ[Message Queue<br/>Kafka / RabbitMQ]
    SYNC[Message Sync Service<br/>Offline → Online delivery]
  end
  subgraph "Storage Layer"
    CASS[(Cassandra<br/>Messages)]
    PG[(PostgreSQL<br/>Users, Groups)]
    REDIS[(Redis<br/>Presence, Sessions)]
    S3[(S3 / MinIO<br/>Media Files)]
    ES[(Elasticsearch<br/>Message Search)]
  end
  subgraph "CDN"
    CDN[CloudFront / Cloudflare<br/>Media Delivery]
  end
  WEB & MOB --> LB
  LB --> AUTH & USER & GROUP & MEDIA & SEARCH
  WEB & MOB -.->|WebSocket| CS1 & CS2 & CS3
  CS1 & CS2 & CS3 --> MQ
  MQ --> CS1 & CS2 & CS3
  MQ --> NOTIF
  MQ --> SYNC
  CS1 & CS2 & CS3 --> PRES
  PRES --> REDIS
  CS1 & CS2 & CS3 --> CASS
  USER & GROUP --> PG
  MEDIA --> S3
  S3 --> CDN
  SEARCH --> ES
  SYNC --> CASS
  NOTIF -.->|Push| MOB
2.3 Message Flow — 1-on-1 Chat
sequenceDiagram
  participant A as User A (Sender)
  participant CS1 as Chat Server 1
  participant MQ as Message Queue (Kafka)
  participant CS2 as Chat Server 2
  participant B as User B (Receiver)
  participant DB as Cassandra
  participant NS as Notification Service
  A->>CS1: 1. Send message via WebSocket
  CS1->>CS1: 2. Validate & generate message_id (ULID)
  CS1->>MQ: 3. Publish to message queue
  CS1-->>A: 4. ACK (message accepted)
  par Parallel processing
    MQ->>DB: 5a. Persist message
    MQ->>CS2: 5b. Route to receiver's chat server
  end
  alt User B is ONLINE
    CS2->>B: 6a. Deliver via WebSocket
    B-->>CS2: 7a. ACK (received)
  else User B is OFFLINE
    MQ->>NS: 6b. Trigger push notification
    NS->>B: 7b. Push via FCM/APNs
  end
2.4 Message Flow — Group Chat
sequenceDiagram
  participant A as User A (Sender)
  participant CS1 as Chat Server 1
  participant MQ as Message Queue (Kafka)
  participant FO as Fan-out Service
  participant CS2 as Chat Server 2
  participant CS3 as Chat Server 3
  participant B as User B (Online)
  participant C as User C (Online)
  participant NS as Notification Service
  participant D as User D (Offline)
  A->>CS1: 1. Send group message
  CS1->>MQ: 2. Publish (topic: group_123)
  CS1-->>A: 3. ACK
  MQ->>FO: 4. Fan-out service picks up
  FO->>FO: 5. Lookup group members & their chat servers
  par Fan-out to online members
    FO->>CS2: 6a. Route to User B's server
    CS2->>B: 7a. Deliver via WebSocket
    FO->>CS3: 6b. Route to User C's server
    CS3->>C: 7b. Deliver via WebSocket
  end
  FO->>NS: 6c. Notify offline members
  NS->>D: 7c. Push notification
2.5 Hybrid Approach: Why not just REST?
Hieu, good question: why not use REST for everything?
| Operation | Protocol | Reason |
|---|---|---|
| Send/receive messages | WebSocket | Real-time, bidirectional |
| User login/signup | HTTP REST | One-time, stateless |
| Upload media | HTTP REST (multipart) | Large payload, not real-time |
| Search messages | HTTP REST | Request-response pattern |
| Update profile | HTTP REST | Infrequent, stateless |
| Presence updates | WebSocket (piggyback) | Continuous, low overhead |
| Push notifications | HTTP (server-side) | Server → FCM/APNs |
Step 3 — Deep Dive
3.1 Message Storage — SQL vs NoSQL
This is the most important database decision for the chat system.
Why NOT SQL (PostgreSQL/MySQL)?
| Problem | Explanation |
|---|---|
| Write-heavy workload | 69K writes/s — relational DBs struggle at this level |
| Append-only pattern | Chat messages are INSERT-only, with very few UPDATE/DELETE |
| Time-series nature | Messages are sequential in time, like time-series data |
| Horizontal scaling | SQL sharding is complex and requires manual re-sharding |
| Storage volume | 73TB of text/year — beyond a single cluster |
Why NoSQL (Cassandra / HBase)?
| Advantage | Explanation |
|---|---|
| Write-optimized | LSM-tree architecture, sequential writes |
| Horizontal scaling | Add nodes and data redistributes automatically |
| Time-series friendly | Partition by channel + sort by timestamp |
| High availability | Tunable consistency (ONE, QUORUM, ALL) |
| Proven at scale | Discord uses Cassandra, Facebook uses HBase |
Message Schema (Cassandra)
-- messages table: partition by channel_id, cluster by message_id (time-ordered)
CREATE TABLE messages (
channel_id UUID, -- 1-on-1 or group channel
message_id TIMEUUID, -- Time-based UUID, automatically sorted by time
sender_id UUID,
content TEXT,
content_type TEXT, -- 'text', 'image', 'file', 'system'
media_url TEXT, -- NULL if text-only
created_at TIMESTAMP,
updated_at TIMESTAMP, -- NULL if never edited
deleted BOOLEAN, -- Soft delete
PRIMARY KEY (channel_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC)
AND compaction = {'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': 1}
AND default_time_to_live = 0; -- No TTL, retain forever
Partition strategy explained:
- Partition key: channel_id — all messages of one conversation live on the same partition
- Clustering key: message_id (TIMEUUID) — messages automatically sort by time
- TimeWindowCompactionStrategy — optimized for time-series data, reduces read amplification
Query patterns:
-- Fetch the 50 most recent messages of a channel
SELECT * FROM messages
WHERE channel_id = ?
ORDER BY message_id DESC
LIMIT 50;
-- Fetch older messages (pagination / scroll up)
SELECT * FROM messages
WHERE channel_id = ?
AND message_id < ? -- cursor-based pagination
ORDER BY message_id DESC
LIMIT 50;
Hot Partition Problem
If one group chat is very active (e.g., a 100-member group producing 10K messages/day), that partition grows too large.
Solution: Time-based bucketing
CREATE TABLE messages_bucketed (
channel_id UUID,
bucket TEXT, -- '2026-03', '2026-04' (monthly bucket)
message_id TIMEUUID,
sender_id UUID,
content TEXT,
content_type TEXT,
media_url TEXT,
created_at TIMESTAMP,
PRIMARY KEY ((channel_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
Partition key = (channel_id, bucket) → each month becomes a new partition, avoiding unbounded growth.
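The bucketing scheme can be sketched with two small helpers (the helper names are illustrative, not from the chapter): compute a message's monthly bucket, and enumerate buckets newest-first so a scroll-up query can walk across month boundaries:

```javascript
// bucketing.js — monthly-bucket helpers for the messages_bucketed table
function bucketFor(date) {
  // '2026-03' style bucket, matching the schema comment above
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, '0');
  return `${y}-${m}`;
}

// When paginating backwards ("scroll up"), the client may cross a bucket
// boundary; generate buckets newest-first so a query loop can walk them
// until it has collected enough rows.
function bucketsBetween(from, to) {
  const buckets = [];
  const cur = new Date(Date.UTC(to.getUTCFullYear(), to.getUTCMonth(), 1));
  const floor = new Date(Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), 1));
  while (cur >= floor) {
    buckets.push(bucketFor(cur));
    cur.setUTCMonth(cur.getUTCMonth() - 1);
  }
  return buckets;
}
```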
3.2 Message ID Generation — Ordering Guarantee
Why a custom ID instead of auto-increment?
| Auto-increment | Custom ID (Snowflake/ULID) |
|---|---|
| Single point of failure | Distributed generation |
| Bottleneck at large scale | Each server generates its own |
| No timestamp info | Embedded timestamp → sortable |
ULID (Universally Unique Lexicographically Sortable Identifier)
01ARZ3NDEKTSV4RRFFQ69G5FAV
|----------| |------------|
Timestamp Randomness
48 bits 80 bits
(ms)
ULID advantages for chat:
- Lexicographically sortable → messages automatically sort by time
- 128 bits → extremely low collision probability
- Monotonic — still strictly increasing within the same millisecond
- No coordination needed — each server generates IDs independently
Snowflake ID (Twitter style)
| 1 bit | 41 bits | 10 bits | 12 bits |
| unused | timestamp | machine | sequence |
| | (ms) | ID | number |
- 41 bits timestamp → ~69 years
- 10 bits machine ID → 1024 servers
- 12 bits sequence → 4096 IDs/ms/server
Decision: Use ULID for message IDs (simpler than Snowflake, no machine-ID coordination required).
3.3 Presence Service — Online/Offline Indicator
flowchart TB
  subgraph "Client"
    C[Mobile/Web Client]
  end
  subgraph "Presence Service"
    PS[Presence Server]
    HB[Heartbeat Handler]
  end
  subgraph "Storage"
    REDIS[(Redis<br/>user_id → status, last_active)]
  end
  subgraph "Notification"
    PUB[Redis Pub/Sub<br/>Presence Channel]
  end
  subgraph "Subscribers"
    CS1[Chat Server 1<br/>→ User A's friends online]
    CS2[Chat Server 2<br/>→ User B's friends online]
  end
  C -->|WebSocket connect| PS
  PS -->|Set ONLINE| REDIS
  PS -->|Publish status change| PUB
  C -->|Heartbeat every 5s| HB
  HB -->|Update last_active TTL| REDIS
  C -->|WebSocket disconnect| PS
  PS -->|Set OFFLINE| REDIS
  PS -->|Publish status change| PUB
  PUB --> CS1 & CS2
Heartbeat Mechanism
User connects via WebSocket → status = ONLINE
Every 5 seconds: client sends heartbeat ping
Server: reset TTL in Redis (key: presence:{user_id}, TTL: 30s)
If no heartbeat for 30s → key expires → status = OFFLINE
Redis data structure:
# Presence status
SET presence:user_123 "online" EX 30 # Auto-expire after 30s
SET last_active:user_123 "1679123456789" # Epoch ms
# Batch check: who in the friend list is online?
MGET presence:user_001 presence:user_002 presence:user_003 ...
Presence Fan-out Problem
When User A logs in (status → ONLINE), who needs to know?
- All of User A's friends/contacts.
If User A has 500 friends → 500 push notifications. If 100K users log in at the same time (9 AM) → 100K × 500 = 50M presence events/s (!)
Solutions for large friend lists:
| Strategy | Description | When to use |
|---|---|---|
| Eager (fan-out on write) | Push status changes to all friends immediately | Friends list < 500 |
| Lazy (fan-out on read) | Client fetches presence itself when opening a chat | Friends list > 500 |
| Hybrid | Eager for close friends, lazy for the rest | Most systems |
| Subscribe on view | Subscribe to presence only while a chat window is open | Mobile (saves battery) |
Decision: Use Subscribe on view — the client subscribes only to presence for conversations currently visible on screen. This drastically reduces event volume.
3.4 Group Chat Optimization
Small Groups (< 100 members): Fan-out on Write
User A sends a message to a group (50 members):
1. Message is written to the messages table (1 write)
2. Fan-out: copy the message_id into each member's inbox (50 writes)
3. Each member receives a push via WebSocket
Total writes per message: 1 + 50 = 51
Pros: Fast reads (each user reads only their own inbox), simple. Cons: Write amplification (1 message → N writes).
Large Groups (> 100 members, if extended later): Fan-out on Read
User A sends a message to a group (10,000 members):
1. Message is written to the messages table (1 write)
2. NO fan-out, no copies
3. When a member opens the group → query the messages table directly
Total writes per message: 1
Total reads per view: 1 query (but more expensive)
Pros: Very fast writes, no amplification. Cons: Slower reads, needs caching.
Inbox Table (Fan-out on Write approach)
-- User inbox: each user has a list of message pointers
CREATE TABLE user_inbox (
user_id UUID,
channel_id UUID,
message_id TIMEUUID,
sender_id UUID,
preview TEXT, -- First 100 chars for notification
is_read BOOLEAN,
created_at TIMESTAMP,
PRIMARY KEY (user_id, created_at, message_id) -- message_id breaks ties when two rows share a timestamp
) WITH CLUSTERING ORDER BY (created_at DESC, message_id DESC);
3.5 Media Handling — Pre-signed URL Upload
sequenceDiagram
  participant C as Client
  participant API as API Gateway
  participant MS as Media Service
  participant S3 as S3 / Object Storage
  participant TH as Thumbnail Service
  participant CDN as CDN (CloudFront)
  participant CS as Chat Server
  participant R as Receiver
  C->>API: 1. Request upload URL (filename, size, type)
  API->>MS: 2. Validate (file size < 25MB, allowed types)
  MS->>S3: 3. Generate pre-signed URL (expires in 15 min)
  S3-->>MS: 4. Pre-signed URL
  MS-->>C: 5. Return upload URL
  C->>S3: 6. Upload file directly to S3 (bypass backend!)
  S3-->>C: 7. Upload complete (200 OK)
  S3->>TH: 8. S3 Event → trigger thumbnail generation
  TH->>S3: 9. Save thumbnail (compressed, resized)
  C->>CS: 10. Send message with media_url via WebSocket
  CS->>R: 11. Deliver message with CDN URL
  R->>CDN: 12. Load image/file from CDN
  CDN->>S3: 13. Cache miss → fetch from S3
  CDN-->>R: 14. Serve media (cached)
Why Pre-signed URLs?
| Direct upload via backend | Pre-signed URL (S3 direct) |
|---|---|
| Backend becomes the bottleneck | Client uploads straight to S3 |
| Consumes server bandwidth | Server only generates the URL |
| Timeout risk with large files | S3 handles large files natively |
| Hard to scale | S3 scales virtually without limit |
Thumbnail Generation:
Original image (2MB, 3000x2000)
↓ S3 Event Notification
Lambda/Worker:
→ Thumbnail small: 150x150 (5KB) — for the chat list
→ Thumbnail medium: 600x400 (50KB) — for the chat view
→ Keep original — for "View full size"
3.6 Read Receipts — Per-user Read Pointer
Data Model
-- Each user has one "read pointer" per channel
CREATE TABLE read_receipts (
channel_id UUID,
user_id UUID,
last_read_message_id TIMEUUID, -- Pointer to the last message read
updated_at TIMESTAMP,
PRIMARY KEY (channel_id, user_id)
);
Flow:
1. User B opens the conversation → client sends "mark as read" with last_message_id
2. Server UPDATE read_receipts SET last_read_message_id = X
3. Server pushes a read receipt event to User A (the sender)
4. User A's client shows "✓✓ Read"
Optimization:
- Batch updates: Don't send a read receipt for EVERY message. Send only when the user scrolls or opens the chat (debounce 2-3 seconds).
- Group read receipts: Show "Read by 5 people" instead of fanning out to every member.
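The batch-update idea can be sketched as a small client-side debouncer. The `send` callback stands in for emitting the `read_receipt` WebSocket message and is an assumption of this sketch:

```javascript
// read-receipt-debounce.js — collapse a burst of mark-as-read events into
// one receipt per channel (the `send` callback is a stand-in)
function createReceiptDebouncer(send, delayMs = 2000) {
  const pending = new Map(); // channelId → latest messageId seen
  const timers = new Map();  // channelId → pending timer

  function flush(channelId) {
    const t = timers.get(channelId);
    if (t) clearTimeout(t);
    timers.delete(channelId);
    if (pending.has(channelId)) {
      send(channelId, pending.get(channelId)); // one receipt, newest id only
      pending.delete(channelId);
    }
  }

  return {
    markRead(channelId, messageId) {
      pending.set(channelId, messageId); // later calls overwrite earlier ones
      if (!timers.has(channelId)) {
        timers.set(channelId, setTimeout(() => flush(channelId), delayMs));
      }
    },
    flush, // e.g. call on tab close so the last receipt is not lost
  };
}
```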
Unread Count Calculation
-- Count unread messages for user X in channel Y
SELECT COUNT(*) FROM messages
WHERE channel_id = ?
AND message_id > (
SELECT last_read_message_id FROM read_receipts
WHERE channel_id = ? AND user_id = ?
);
Optimization: Cache the unread count in Redis. Each new message → INCR unread:{user_id}:{channel_id}. On read → DEL unread:{user_id}:{channel_id}.
3.7 Push Notification Integration
flowchart LR
  subgraph "Message Pipeline"
    MQ[Kafka Topic:<br/>messages]
  end
  subgraph "Notification Service"
    NC[Notification Consumer]
    PC[Presence Check]
    DT[Device Token Store<br/>Redis/PostgreSQL]
    RL[Rate Limiter<br/>Max 1 push/5s per user]
  end
  subgraph "Push Providers"
    FCM[Firebase Cloud Messaging<br/>Android]
    APNS[Apple Push Notification<br/>iOS]
    WPN[Web Push<br/>Browser]
  end
  MQ --> NC
  NC --> PC
  PC -->|User OFFLINE| DT
  DT --> RL
  RL --> FCM & APNS & WPN
  PC -->|User ONLINE| NC
  NC -->|Skip push| NC
  style PC fill:#f9a825
Offline Message Queue:
When user B is offline, messages accumulate. When B comes back online:
1. B connects via WebSocket
2. Chat server queries: messages WHERE channel_id IN (B's channels)
AND message_id > B's last_synced_message_id
3. Deliver all pending messages (batched)
4. Update B's sync pointer
3.8 Message Search — Elasticsearch
// Elasticsearch index mapping
{
"mappings": {
"properties": {
"channel_id": { "type": "keyword" },
"message_id": { "type": "keyword" },
"sender_id": { "type": "keyword" },
"content": {
"type": "text",
"analyzer": "standard",
"fields": {
"keyword": { "type": "keyword" }
}
},
"created_at": { "type": "date" },
"content_type": { "type": "keyword" }
}
}
}
Search flow: Message written to Cassandra → CDC (Change Data Capture) → Kafka → Elasticsearch Indexer → ES.
Do not search directly on Cassandra — it only supports primary-key lookups, not full-text search.
3.9 End-to-End Encryption (E2E) — Signal Protocol Overview
flowchart LR
  subgraph "User A"
    A_PK[Public Key A]
    A_SK[Private Key A]
    A_ENC[Encrypt with<br/>shared secret]
  end
  subgraph "Server (Cannot Read)"
    SRV[Server stores<br/>encrypted blob only]
  end
  subgraph "User B"
    B_PK[Public Key B]
    B_SK[Private Key B]
    B_DEC[Decrypt with<br/>shared secret]
  end
  A_PK -->|Key Exchange<br/>X3DH Protocol| B_PK
  B_PK --> A_ENC
  A_ENC -->|Encrypted message| SRV
  SRV -->|Encrypted message| B_DEC
  A_SK --> A_ENC
  B_SK --> B_DEC
  style SRV fill:#ff6666,color:#fff
Signal Protocol flow (simplified):
1. Key Exchange (X3DH — Extended Triple Diffie-Hellman):
- User A and User B exchange public keys through the server
- They derive a shared secret that the server does NOT know
2. Double Ratchet Algorithm:
- Each message is encrypted with a different key
- If one key is compromised, only one message is affected (forward secrecy)
3. Server's role: Only relays encrypted blobs. It cannot read the content.
Trade-off: E2E encryption → server-side search no longer works. Clients must search locally.
4. Security
4.1 End-to-End Encryption (Details)
| Aspect | Implementation |
|---|---|
| Protocol | Signal Protocol (WhatsApp, Signal) |
| Key exchange | X3DH (Extended Triple Diffie-Hellman) |
| Message encryption | AES-256-GCM |
| Key rotation | Double Ratchet — a new key for each message |
| Group E2E | Sender Keys protocol (each member has a sender key) |
| Key backup | Client-side encrypted backup (user's passphrase) |
4.2 Message Retention Policy
# retention-policy.yml
retention:
text_messages:
1_on_1: forever # User can delete manually
group: forever
deleted_by_user:
soft_delete: true # Mark as deleted, keep 30 days for compliance
hard_delete_after: 30d
media:
original: 1y # Keep the original for 1 year
thumbnails: forever
after_expiry: delete_original_keep_thumbnail
audit_logs:
retention: 7y # Compliance requirement
storage: cold_storage # S3 Glacier after 90 days
4.3 Spam & Abuse Detection
Message pipeline:
User sends message
→ Rate limiter (max 60 messages/minute per user)
→ Content filter (regex: phone numbers, URLs → flag)
→ ML spam classifier (real-time inference, < 10ms)
→ If flagged → hold for review, notify user
→ If clean → deliver normally
| Layer | Technique | Latency Budget |
|---|---|---|
| Rate limiting | Token bucket per user | < 1ms |
| Keyword filter | Bloom filter + regex | < 2ms |
| ML classifier | TensorFlow Serving (text classification) | < 10ms |
| Image scanning | PhotoDNA / perceptual hashing | < 100ms (async) |
| Human review | Queue for flagged content | Async |
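The rate-limiting row above (token bucket per user) can be sketched as follows: 60 messages/minute maps to a capacity of 60 tokens refilled at 1 token/second, and the clock is injectable so the behavior is testable:

```javascript
// token-bucket.js — per-user rate limiter sketch (60 msgs/min ⇒
// capacity 60, refill 1 token/s); `now` is injectable for tests
class TokenBucket {
  constructor({ capacity = 60, refillPerSec = 1, now = () => Date.now() } = {}) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;
    this.tokens = capacity;
    this.lastRefill = now();
  }

  tryConsume() {
    // Lazily refill based on time elapsed since the last call.
    const t = this.now();
    const elapsedSec = (t - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // rate limited → reject or queue the message
  }
}
```

The chat server would keep one bucket per connected user (e.g. in the `clients` map) and call `tryConsume()` before accepting each `chat_message`.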
4.4 CSAM Scanning — Ethical Considerations
CSAM = Child Sexual Abuse Material. Scanning is legally mandated in many countries.
| Approach | Description | Trade-off |
|---|---|---|
| PhotoDNA (Microsoft) | Hash-based matching against a known CSAM database | Privacy-preserving (hashes only, images are never viewed) |
| Client-side scanning | Scan on the device before upload | Controversial (Apple withdrew its plan) — conflicts with E2E |
| Server-side scanning | Scan after upload, before delivery | Doesn't work with E2E-encrypted media |
| Report-based | Scan only when a user files a report | Misses proactive detection |
Ethical dilemma: E2E encryption protects privacy but also shields criminals. There is no perfect solution. Most platforms (WhatsApp, Telegram) rely on metadata analysis + user reporting rather than breaking E2E.
5. DevOps
5.1 WebSocket Scaling with Sticky Sessions
Problem: WebSocket connections are stateful. If the load balancer routes a request to a different server, the connection is lost.
Solution 1: Sticky Sessions (Layer 7)
# nginx.conf — WebSocket sticky sessions
upstream chat_servers {
ip_hash; # Same client IP → same server
server chat-server-1:8080;
server chat-server-2:8080;
server chat-server-3:8080;
}
server {
listen 443 ssl;
server_name chat.example.com;
location /ws {
proxy_pass http://chat_servers;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header X-Real-IP $remote_addr;
proxy_read_timeout 86400s; # 24h timeout for WebSocket
proxy_send_timeout 86400s;
}
}
Solution 2: Connection Registry (Recommended at large scale)
Redis connection registry:
ws:conn:user_123 → chat-server-7 (TTL: 300s, refresh on heartbeat)
ws:conn:user_456 → chat-server-2
Routing logic:
1. Message for user_123
2. Lookup Redis: user_123 → chat-server-7
3. Route message to chat-server-7 via internal message bus
4. chat-server-7 pushes to user_123's WebSocket
5.2 Monitoring WebSocket Connections
# prometheus-alerts.yml — Chat System specific
groups:
- name: chat_system_alerts
rules:
- alert: WebSocketConnectionsHigh
expr: websocket_active_connections > 60000 # 92% of 65K limit
for: 2m
labels:
severity: critical
annotations:
summary: "WebSocket connections near limit ({{ $value }}/65000)"
- alert: MessageDeliveryLatencyHigh
expr: histogram_quantile(0.99, rate(message_delivery_seconds_bucket[5m])) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "P99 message delivery latency > 500ms"
- alert: KafkaConsumerLag
expr: kafka_consumer_group_lag > 100000
for: 5m
labels:
severity: critical
annotations:
summary: "Kafka consumer lag > 100K messages, delivery delayed"
- alert: WebSocketReconnectionSpike
expr: rate(websocket_reconnections_total[5m]) > 1000
for: 2m
labels:
severity: warning
annotations:
summary: "WebSocket reconnection storm detected ({{ $value }}/s)"
- alert: PresenceServiceLatency
expr: histogram_quantile(0.95, rate(presence_update_seconds_bucket[5m])) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Presence update P95 latency > 100ms"
- alert: CassandraWriteLatency
expr: cassandra_write_latency_p99 > 50
for: 5m
labels:
severity: warning
annotations:
summary: "Cassandra P99 write latency > 50ms"
5.3 Grafana Dashboard Essentials
| Panel | PromQL | Threshold |
|---|---|---|
| Active WebSocket Connections | websocket_active_connections | Warning: 80% capacity |
| Message Throughput | rate(messages_sent_total[1m]) | Compare vs estimation (23K/s avg) |
| Message Delivery Latency P99 | histogram_quantile(0.99, ...) | < 200ms |
| Kafka Consumer Lag | kafka_consumer_group_lag | < 10K |
| Presence Updates/s | rate(presence_updates_total[1m]) | Monitor for fan-out storms |
| Undelivered Messages | undelivered_messages_gauge | Should trend to 0 |
| WebSocket Error Rate | rate(websocket_errors_total[5m]) | < 0.1% |
5.4 Kafka for Message Delivery
# kafka-topics.yml
topics:
- name: chat.messages.1on1
partitions: 64 # Partition by channel_id hash
replication_factor: 3
config:
retention.ms: 604800000 # 7 days (messages persisted in Cassandra)
min.insync.replicas: 2
compression.type: lz4
- name: chat.messages.group
partitions: 128 # Higher throughput for group messages
replication_factor: 3
config:
retention.ms: 604800000
min.insync.replicas: 2
- name: chat.presence
partitions: 32
replication_factor: 2 # Presence is less critical
config:
retention.ms: 3600000 # 1 hour only
cleanup.policy: compact # Keep latest status per user
- name: chat.notifications
partitions: 32
replication_factor: 3
config:
retention.ms: 86400000 # 1 day
Partition Strategy: Messages for the same channel_id go to the same Kafka partition → ordering guaranteed within a conversation.
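Why keying by channel_id guarantees ordering: the producer hashes the message key to pick a partition, so all messages of a conversation land on one partition and are consumed in order. A sketch with a simple FNV-1a hash (kafkajs' default partitioner actually uses murmur2; this hash is for illustration only):

```javascript
// partitioner.js — illustrates key → partition mapping for per-channel ordering
function fnv1a(str) {
  // FNV-1a 32-bit hash: deterministic, spreads keys across partitions
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function partitionFor(channelId, numPartitions = 64) {
  // Same channelId always maps to the same partition number
  return fnv1a(channelId) % numPartitions;
}
```

Note the corollary: increasing the partition count of an existing topic changes the key → partition mapping, so it should be sized up front (64 and 128 in the config above).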
6. Code Examples
6.1 Node.js WebSocket Server with Rooms
// chat-server.js
// WebSocket server with rooms (channels), presence heartbeat, message routing
const WebSocket = require('ws');
const Redis = require('ioredis');
const { Kafka } = require('kafkajs');
const { ulid } = require('ulid');
// --- Config ---
const PORT = process.env.PORT || 8080;
const SERVER_ID = process.env.SERVER_ID || 'chat-server-1';
const HEARTBEAT_INTERVAL = 5000; // 5 seconds
const HEARTBEAT_TIMEOUT = 30000; // 30 seconds → offline
// --- Connections ---
const redis = new Redis(process.env.REDIS_URL);
const redisSub = new Redis(process.env.REDIS_URL);
const kafka = new Kafka({ brokers: [process.env.KAFKA_BROKER || 'localhost:9092'] });
const producer = kafka.producer();
const consumer = kafka.consumer({ groupId: `${SERVER_ID}-group` });
// --- State ---
const clients = new Map(); // user_id → WebSocket
const userChannels = new Map(); // user_id → Set<channel_id>
// --- WebSocket Server ---
const wss = new WebSocket.Server({ port: PORT });
wss.on('connection', async (ws, req) => {
const userId = authenticateFromToken(req); // Extract from JWT in query/header
if (!userId) {
ws.close(4001, 'Unauthorized');
return;
}
// Register connection
clients.set(userId, ws);
await registerConnection(userId);
await setPresence(userId, 'online');
console.log(`[${SERVER_ID}] User ${userId} connected. Total: ${clients.size}`);
// Heartbeat setup
ws.isAlive = true;
ws.on('pong', () => { ws.isAlive = true; });
// Message handler
ws.on('message', async (raw) => {
try {
const data = JSON.parse(raw);
await handleMessage(userId, data);
} catch (err) {
ws.send(JSON.stringify({ type: 'error', message: 'Invalid message format' }));
}
});
// Disconnect handler
ws.on('close', async () => {
clients.delete(userId);
await unregisterConnection(userId);
await setPresence(userId, 'offline');
console.log(`[${SERVER_ID}] User ${userId} disconnected. Total: ${clients.size}`);
});
// Send pending messages (offline → online sync)
await deliverPendingMessages(userId, ws);
});
// --- Message Handling ---
async function handleMessage(senderId, data) {
switch (data.type) {
case 'chat_message':
await handleChatMessage(senderId, data);
break;
case 'typing':
await handleTypingIndicator(senderId, data);
break;
case 'read_receipt':
await handleReadReceipt(senderId, data);
break;
case 'heartbeat':
await refreshPresence(senderId);
break;
default:
console.warn(`Unknown message type: ${data.type}`);
}
}
async function handleChatMessage(senderId, data) {
const { channelId, content, contentType = 'text', mediaUrl = null } = data;
// Generate time-sortable ID
const messageId = ulid();
const message = {
messageId,
channelId,
senderId,
content,
contentType,
mediaUrl,
createdAt: Date.now(),
};
// ACK to sender immediately
sendToUser(senderId, {
type: 'message_ack',
messageId,
channelId,
status: 'accepted',
timestamp: message.createdAt,
});
// Publish to Kafka for persistence + delivery
await producer.send({
topic: 'chat.messages.1on1',
messages: [{
key: channelId, // Partition by channel → ordering guarantee
value: JSON.stringify(message),
}],
});
}
async function handleTypingIndicator(senderId, data) {
const { channelId } = data;
// Broadcast typing indicator to other members in channel
const members = await getChannelMembers(channelId);
for (const memberId of members) {
if (memberId !== senderId) {
sendToUser(memberId, {
type: 'typing',
channelId,
userId: senderId,
timestamp: Date.now(),
});
}
}
}
async function handleReadReceipt(senderId, data) {
const { channelId, lastReadMessageId } = data;
// Update read pointer in Redis (will be persisted to Cassandra async)
await redis.set(
`read_receipt:${channelId}:${senderId}`,
lastReadMessageId
);
// Reset unread count
await redis.del(`unread:${senderId}:${channelId}`);
// Notify other members
const members = await getChannelMembers(channelId);
for (const memberId of members) {
if (memberId !== senderId) {
sendToUser(memberId, {
type: 'read_receipt',
channelId,
userId: senderId,
lastReadMessageId,
});
}
}
}
// --- Delivery ---
function sendToUser(userId, payload) {
const ws = clients.get(userId);
if (ws && ws.readyState === WebSocket.OPEN) {
ws.send(JSON.stringify(payload));
return true;
}
return false; // User not on this server or offline
}
// --- Presence ---
async function setPresence(userId, status) {
if (status === 'online') {
await redis.set(`presence:${userId}`, 'online', 'EX', 30);
} else {
await redis.del(`presence:${userId}`);
}
// Publish presence change for subscribers
await redis.publish('presence_changes', JSON.stringify({
userId,
status,
timestamp: Date.now(),
}));
}
async function refreshPresence(userId) {
await redis.expire(`presence:${userId}`, 30); // Reset TTL
}
// --- Connection Registry ---
async function registerConnection(userId) {
await redis.set(`ws:conn:${userId}`, SERVER_ID, 'EX', 300);
}
async function unregisterConnection(userId) {
await redis.del(`ws:conn:${userId}`);
}
// --- Heartbeat Check ---
const heartbeatInterval = setInterval(() => {
wss.clients.forEach((ws) => {
if (!ws.isAlive) {
ws.terminate();
return;
}
ws.isAlive = false;
ws.ping();
});
}, HEARTBEAT_INTERVAL);
// --- Kafka Consumer (receive messages routed to this server) ---
async function startConsumer() {
await consumer.connect();
await consumer.subscribe({ topic: 'chat.messages.1on1', fromBeginning: false });
await consumer.run({
eachMessage: async ({ message }) => {
const msg = JSON.parse(message.value.toString());
const { channelId, senderId } = msg;
// Get channel members and deliver
const members = await getChannelMembers(channelId);
for (const memberId of members) {
if (memberId === senderId) continue; // Don't echo back to sender
sendToUser(memberId, {
type: 'new_message',
...msg,
});
// Track unread regardless of realtime delivery; offline members need
// the badge too. The counter is cleared when a read_receipt arrives.
await redis.incr(`unread:${memberId}:${channelId}`);
}
},
});
}
// --- Helper stubs ---
function authenticateFromToken(req) {
// Extract JWT from query param or header, verify, return userId
// In production: verify JWT signature, check expiry
const url = new URL(req.url, `http://${req.headers.host}`);
return url.searchParams.get('userId'); // Simplified for example
}
async function getChannelMembers(channelId) {
// In production: cache in Redis, fallback to PostgreSQL
const members = await redis.smembers(`channel:members:${channelId}`);
return members;
}
async function deliverPendingMessages(userId, ws) {
// In production: query Cassandra for messages after user's last sync point
console.log(`[${SERVER_ID}] Delivering pending messages to ${userId}`);
}
// --- Startup ---
(async () => {
await producer.connect();
await startConsumer();
console.log(`[${SERVER_ID}] Chat server running on port ${PORT}`);
})();

6.2 Message Schema — TypeScript Types
// types/message.ts
/** Message content types */
type ContentType = 'text' | 'image' | 'file' | 'video' | 'audio' | 'system';
/** WebSocket message types (client ↔ server) */
type WSMessageType =
| 'chat_message' // Client → Server: send message
| 'new_message' // Server → Client: receive message
| 'message_ack' // Server → Client: message accepted
| 'typing' // Bidirectional: typing indicator
| 'read_receipt' // Bidirectional: mark as read
| 'presence' // Server → Client: online/offline
| 'heartbeat' // Client → Server: keep alive
| 'error'; // Server → Client: error
/** Chat message stored in Cassandra */
interface ChatMessage {
messageId: string; // ULID — time-sortable unique ID
channelId: string; // UUID of 1-on-1 or group channel
senderId: string; // UUID of sender
content: string; // Message text (encrypted if E2E)
contentType: ContentType;
mediaUrl: string | null;
thumbnailUrl: string | null;
createdAt: number; // Unix timestamp (ms)
updatedAt: number | null;
deleted: boolean;
replyTo: string | null; // messageId of parent (for threads)
}
/** Channel (conversation) stored in PostgreSQL */
interface Channel {
channelId: string;
type: 'direct' | 'group';
name: string | null; // NULL for 1-on-1
createdBy: string;
createdAt: number;
memberCount: number;
lastMessageId: string | null;
lastMessageAt: number | null;
}
/** Channel membership */
interface ChannelMember {
channelId: string;
userId: string;
role: 'owner' | 'admin' | 'member';
joinedAt: number;
lastReadMessageId: string | null;
notificationMuted: boolean;
}
/** Presence status */
interface UserPresence {
userId: string;
status: 'online' | 'offline' | 'away';
lastActiveAt: number;
device: 'mobile' | 'web' | 'desktop';
}
/** WebSocket envelope (all messages wrapped in this) */
interface WSEnvelope<T = unknown> {
type: WSMessageType;
payload: T;
timestamp: number;
requestId?: string; // For request-response correlation
}

6.3 Presence Heartbeat — Client Side
// client/presence.js — Browser/React Native WebSocket client
class ChatClient {
constructor(wsUrl, token) {
this.wsUrl = wsUrl;
this.token = token;
this.ws = null;
this.heartbeatTimer = null;
this.reconnectAttempts = 0;
this.maxReconnectAttempts = 10;
this.baseReconnectDelay = 1000; // 1 second
}
connect() {
this.ws = new WebSocket(`${this.wsUrl}?token=${this.token}`);
this.ws.onopen = () => {
console.log('Connected to chat server');
this.reconnectAttempts = 0;
this.startHeartbeat();
};
this.ws.onmessage = (event) => {
const data = JSON.parse(event.data);
this.handleMessage(data);
};
this.ws.onclose = (event) => {
console.log(`Disconnected: ${event.code} ${event.reason}`);
this.stopHeartbeat();
this.scheduleReconnect();
};
this.ws.onerror = (error) => {
console.error('WebSocket error:', error);
};
}
// --- Heartbeat ---
startHeartbeat() {
this.heartbeatTimer = setInterval(() => {
if (this.ws.readyState === WebSocket.OPEN) {
this.ws.send(JSON.stringify({ type: 'heartbeat' }));
}
}, 5000); // Every 5 seconds
}
stopHeartbeat() {
if (this.heartbeatTimer) {
clearInterval(this.heartbeatTimer);
this.heartbeatTimer = null;
}
}
// --- Reconnection with exponential backoff + jitter ---
scheduleReconnect() {
if (this.reconnectAttempts >= this.maxReconnectAttempts) {
console.error('Max reconnect attempts reached. Please refresh.');
return;
}
// Exponential backoff with jitter to prevent thundering herd
const delay = Math.min(
this.baseReconnectDelay * Math.pow(2, this.reconnectAttempts),
30000 // Max 30 seconds
);
const jitter = delay * (0.5 + Math.random() * 0.5); // 50-100% of delay
console.log(`Reconnecting in ${Math.round(jitter)}ms (attempt ${this.reconnectAttempts + 1})`);
setTimeout(() => {
this.reconnectAttempts++;
this.connect();
}, jitter);
}
// --- Send Message ---
sendMessage(channelId, content, contentType = 'text') {
this.ws.send(JSON.stringify({
type: 'chat_message',
channelId,
content,
contentType,
}));
}
// --- Mark as Read ---
markAsRead(channelId, lastReadMessageId) {
this.ws.send(JSON.stringify({
type: 'read_receipt',
channelId,
lastReadMessageId,
}));
}
handleMessage(data) {
switch (data.type) {
case 'new_message':
this.onNewMessage?.(data);
break;
case 'message_ack':
this.onMessageAck?.(data);
break;
case 'typing':
this.onTyping?.(data);
break;
case 'read_receipt':
this.onReadReceipt?.(data);
break;
case 'presence':
this.onPresenceChange?.(data);
break;
case 'error':
console.error('Server error:', data.message);
break;
}
}
}
// Usage:
// const chat = new ChatClient('wss://chat.example.com/ws', jwtToken);
// chat.onNewMessage = (msg) => updateUI(msg);
// chat.connect();

7. System Design Diagram — Complete Architecture
flowchart TB
    subgraph "Clients"
        WEB["🖥 Web Client<br/>(React)"]
        IOS["📱 iOS Client<br/>(Swift)"]
        AND["📱 Android Client<br/>(Kotlin)"]
    end
    subgraph "Edge Layer"
        CDN["CDN<br/>(CloudFront)"]
        LB["Load Balancer<br/>(ALB + Nginx)"]
    end
    subgraph "API Layer (HTTP - Stateless)"
        GW["API Gateway<br/>(Kong / Envoy)"]
        AUTH["Auth Service<br/>(JWT + OAuth2)"]
        USER["User Service<br/>(Profile, Contacts)"]
        GROUP["Group Service<br/>(CRUD, Members)"]
        MEDIA["Media Service<br/>(Pre-signed URLs)"]
        SRCH["Search Service<br/>(ES queries)"]
    end
    subgraph "Real-time Layer (WebSocket - Stateful)"
        direction LR
        WS1["Chat Server 1<br/>(WS, 65K conn)"]
        WS2["Chat Server 2<br/>(WS, 65K conn)"]
        WSN["Chat Server N<br/>(WS, 65K conn)"]
    end
    subgraph "Presence Layer"
        PRES["Presence Service<br/>(Heartbeat, Status)"]
    end
    subgraph "Message Bus"
        K1["Kafka<br/>chat.messages"]
        K2["Kafka<br/>chat.presence"]
        K3["Kafka<br/>chat.notifications"]
    end
    subgraph "Processing Layer"
        FAN["Fan-out Service<br/>(Group delivery)"]
        SYNC["Sync Service<br/>(Offline → Online)"]
        IDX["Indexer<br/>(Cassandra → ES)"]
        THUMB["Thumbnail Service<br/>(Image processing)"]
    end
    subgraph "Notification"
        NOTIF["Notification Service"]
        FCM["FCM<br/>(Android)"]
        APNS["APNs<br/>(iOS)"]
        WPUSH["Web Push"]
    end
    subgraph "Data Layer"
        CASS[("Cassandra<br/>Messages<br/>(Write-optimized)")]
        PG[("PostgreSQL<br/>Users, Groups<br/>Channels")]
        RD[("Redis Cluster<br/>Presence, Sessions<br/>Unread counts")]
        S3[("S3<br/>Media files")]
        ES[("Elasticsearch<br/>Message search")]
    end
    WEB & IOS & AND -->|HTTPS| LB
    WEB & IOS & AND -.->|WSS| WS1 & WS2 & WSN
    WEB & IOS & AND -->|Media| CDN
    LB --> GW
    GW --> AUTH & USER & GROUP & MEDIA & SRCH
    WS1 & WS2 & WSN --> K1
    WS1 & WS2 & WSN --> PRES
    PRES --> RD
    PRES --> K2
    K1 --> FAN & SYNC
    K1 --> CASS
    FAN --> WS1 & WS2 & WSN
    FAN --> K3
    K3 --> NOTIF
    NOTIF --> FCM & APNS & WPUSH
    SYNC --> CASS
    IDX --> ES
    CASS --> IDX
    USER & GROUP --> PG
    MEDIA --> S3
    S3 --> CDN
    S3 --> THUMB
    THUMB --> S3
    SRCH --> ES
8. Aha Moments — Key Takeaways
#1 — WebSocket is a game changer but a scaling nightmare: WebSocket gives the lowest latency, but stateful connections make horizontal scaling complex. Solution: a connection registry in Redis plus an internal message bus (Kafka) to route between servers.
#2 — Chat is write-heavy, but read-heavy at the delivery tier: 69K write QPS (messages into the DB) versus 347K read QPS (delivery to recipients). Fan-out on write for small groups, fan-out on read for large groups: a hybrid approach.
#3 — NoSQL wins for messages: Cassandra/HBase with partitioning by `channel_id` and clustering by `message_id` (TIMEUUID) gives O(1) lookup per conversation, sequential writes, and natural horizontal scaling. SQL is a poor fit for this workload.
#4 — Presence is deceptively expensive: “User X is online” sounds simple, but fanning presence changes out to the entire contacts list creates a thundering-herd problem. Solution: subscribe-on-view (only track presence for conversations currently open).
#5 — Message ordering is channel-scoped, not global: No global ordering is needed; messages only have to be in order within the same conversation. ULID plus Kafka partitioning by `channel_id` solves this elegantly.
#6 — E2E encryption kills server-side features: With E2E enabled, the server cannot search messages, scan content, or generate previews. A clear trade-off between privacy and functionality.
#7 — Pre-signed URL pattern eliminates the media bottleneck: Clients upload directly to S3; the server only generates the URL. No media passes through the chat servers, so the backend handles text messages only.
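The hybrid fan-out decision from #2 can be sketched as a simple size check. The 64-member cutoff and the function name are illustrative assumptions, not figures from the chapter:

```javascript
// Hybrid fan-out sketch. Small groups: fan-out on write (push a copy to every
// member's delivery queue at send time). Large groups: fan-out on read (store
// the message once; members pull it when they open the conversation).
const FANOUT_WRITE_MAX_MEMBERS = 64; // hypothetical cutoff, tune per workload

function chooseFanoutStrategy(memberCount) {
  return memberCount <= FANOUT_WRITE_MAX_MEMBERS
    ? 'fanout-on-write'  // N deliveries at send time, cheap reads
    : 'fanout-on-read';  // 1 write at send time, members query on open
}
```

With the 100-member group cap in FR2, most conversations fall on the write side; the read path mainly protects against pathological hot groups.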
9. Common Pitfalls
Pitfall 1: WebSocket Reconnection Storms
Wrong: Server restarts → 65K clients reconnect at once → the new server is overloaded immediately → it crashes again → loop.
Right: Clients MUST implement exponential backoff with jitter:
- Attempt 1: wait 1s + random(0-0.5s)
- Attempt 2: wait 2s + random(0-1s)
- Attempt 3: wait 4s + random(0-2s)
- …capped at 30s
Jitter spreads reconnections out over time instead of letting them all land at the same instant.
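The schedule above can be written as a standalone function. Note the production client in section 6.3 uses a slightly different jitter range; this sketch follows the numbers listed here, with `attempt` 1-based:

```javascript
// Reconnect delay: base doubles per attempt (1s, 2s, 4s, ...) capped at 30s,
// plus jitter of up to half the base. `random` is injectable for testing.
function reconnectDelayMs(attempt, random = Math.random) {
  const base = Math.min(1000 * 2 ** (attempt - 1), 30000);
  const jitter = random() * (base / 2); // random(0 .. base/2)
  return base + jitter;
}
```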
Pitfall 2: Presence Fan-out in Large Groups
Wrong: User A comes online → push “A is online” to 500 friends → 500 WebSocket messages. 100K users logging in at the same time (9 AM) → 50M presence events.
Right: Subscribe on view: only track presence for conversations currently visible on screen. A user looking at 5 conversations subscribes to 5 presence channels, not 500.
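A client-side sketch of subscribe-on-view, assuming a hypothetical `transport` object with `subscribe`/`unsubscribe` methods; diffing the visible set keeps subscriptions bounded by what is on screen:

```javascript
// Tracks which conversations are visible and keeps presence subscriptions
// in sync: unsubscribe what scrolled away, subscribe what scrolled in.
class PresenceViewTracker {
  constructor(transport) {
    this.transport = transport; // assumed pub/sub interface
    this.visible = new Set();
  }
  // Call whenever the rendered conversation list changes
  setVisibleChannels(channelIds) {
    const next = new Set(channelIds);
    for (const id of this.visible) {
      if (!next.has(id)) this.transport.unsubscribe(`presence:${id}`);
    }
    for (const id of next) {
      if (!this.visible.has(id)) this.transport.subscribe(`presence:${id}`);
    }
    this.visible = next;
  }
}
```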
Pitfall 3: Message Ordering in a Distributed System
Wrong: Using the timestamp from the client device as the ordering key → clocks are not synchronized → messages display out of order.
Right: The server generates the `message_id` with ULID/Snowflake (server timestamp). Kafka partitions by `channel_id`, so messages within the same conversation are always in order. Clients display by `message_id`, not by client timestamp.
Pitfall 4: Not Handling Duplicate Delivery
Wrong: Network hiccup → the client never receives the ACK → it retries → the message is sent twice → the receiver sees a duplicate.
Right: Idempotency key: the client generates a `requestId` for each message. The server checks the `requestId` in Redis (TTL 5 min). If it was already processed, return the existing ACK instead of processing it again.
Pitfall 5: Media Upload Blocking the Chat Flow
Wrong: Uploading a 10MB image keeps the WebSocket connection busy → text messages are stuck behind it.
Right: Media uploads go over a separate HTTP connection (pre-signed URL → direct to S3). The WebSocket carries only text messages and lightweight events. The two channels are independent.
Pitfall 6: Unbounded Group Notifications
Wrong: A group of 100 members triggers 99 push notifications per message → 1,000 messages/hour = 99K notifications/hour for a single group.
Right: Notification batching: aggregate 5 messages within 30 seconds into one notification: “You have 5 new messages in Group ABC”. If the user is online, do NOT send a push notification at all.
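The batching rule above can be sketched as follows; the flush timer is omitted and `sendPush` is a placeholder for the real FCM/APNs call:

```javascript
// Per-(user, channel) batching: buffer message counts and flush at most once
// per window as a single summary push. Online users are skipped entirely,
// since they see the message in-app.
class NotificationBatcher {
  constructor(sendPush, windowMs = 30000) {
    this.sendPush = sendPush;   // assumed push delivery function
    this.windowMs = windowMs;   // flush() should be driven by a windowMs timer
    this.pending = new Map();   // "userId:channelId" -> count
  }
  add(userId, channelId, userIsOnline) {
    if (userIsOnline) return; // online: in-app delivery only, no push
    const key = `${userId}:${channelId}`;
    this.pending.set(key, (this.pending.get(key) || 0) + 1);
  }
  flush() {
    for (const [key, count] of this.pending) {
      const [userId, channelId] = key.split(':');
      this.sendPush(userId, `You have ${count} new messages in ${channelId}`);
    }
    this.pending.clear();
  }
}
```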
10. Wrap Up — Step 4
Scaling Considerations
| Component | Scaling Strategy |
|---|---|
| Chat Servers (WebSocket) | Horizontal + Connection Registry (Redis) |
| Cassandra | Add nodes, virtual nodes auto-rebalance |
| Kafka | Add partitions + brokers |
| Redis | Cluster mode (16384 hash slots) |
| Media (S3) | Virtually unlimited |
| Search (ES) | Add data nodes, index sharding |
| Notification | Horizontal + rate limiting per user |
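The “16384 hash slots” in the Redis row come from Redis Cluster's key-to-slot mapping: CRC16 (XModem variant) of the key, mod 16384. A hash tag in braces pins related keys to one slot, which matters here if a user's presence and unread keys should live on the same node for multi-key operations. The `{u42}` key shapes below are illustrative, not the schema used earlier in this chapter:

```javascript
// CRC16/XMODEM (poly 0x1021, init 0) as used by Redis Cluster key slotting.
function crc16(str) {
  let crc = 0;
  for (const byte of Buffer.from(str, 'utf8')) {
    crc ^= byte << 8;
    for (let i = 0; i < 8; i++) {
      crc = crc & 0x8000 ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}

function keySlot(key) {
  // If the key contains a non-empty {hash tag}, only the tag is hashed
  const m = key.match(/\{(.+?)\}/);
  const hashed = m ? m[1] : key;
  return crc16(hashed) % 16384;
}
```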
Multi-region Deployment
Region US-East:
Chat servers (50%) ← Users in Americas
Cassandra DC-1 (full replica)
Kafka cluster (independent)
Region EU-West:
Chat servers (30%) ← Users in Europe/Africa
Cassandra DC-2 (full replica)
Kafka cluster (independent)
Region AP-Southeast:
Chat servers (20%) ← Users in Asia-Pacific
Cassandra DC-3 (full replica)
Kafka cluster (independent)
Cross-region sync:
Cassandra multi-DC replication (async, eventual consistency)
Cross-region message routing via dedicated Kafka MirrorMaker
Cost Optimization
| Item | Monthly Cost (estimated, 50M DAU) | Optimization |
|---|---|---|
| WebSocket servers (100 instances) | ~$30K | Spot instances for non-critical |
| Cassandra cluster (50 nodes) | ~$50K | Tiered storage (hot/cold) |
| Kafka cluster (20 brokers) | ~$15K | Managed service (MSK) |
| S3 storage (14.6 PB/year) | ~$300K | S3 Intelligent-Tiering |
| CDN (media delivery) | ~$100K | Cache optimization, compression |
| Redis cluster (presence) | ~$10K | Managed ElastiCache |
| Total | ~$505K/month | — |
Future Enhancements
- Voice/Video calls: WebRTC integration (TURN/STUN servers)
- Message reactions: Lightweight (just emoji + userId, no separate table needed)
- Threads: Reply-to with `parent_message_id`, lazy-loaded
- Bots/Integrations: Webhook-based bot framework
- Stories/Status: Ephemeral content (TTL 24h), separate storage
- AI features: Smart reply suggestions, message summarization
Internal Links
- Tuan-01-Scale-From-Zero-To-Millions — Scaling fundamentals
- Tuan-02-Back-of-the-envelope — Capacity estimation (chat numbers reused here)
- Tuan-03-Networking-DNS-CDN — CDN for media, DNS routing
- Tuan-05-Load-Balancer — WebSocket-aware load balancing
- Tuan-06-Cache-Strategy — Redis caching for presence, unread counts
- Tuan-07-Database-Sharding-Replication — Cassandra partitioning strategy
- Tuan-08-Message-Queue — Kafka for message routing
- Tuan-09-Rate-Limiter — Chat message rate limiting
- Tuan-13-Monitoring-Observability — WebSocket connection monitoring
- Tuan-15-Data-Security-Encryption — E2E encryption, Signal Protocol
- Tuan-16-Design-URL-Shortener — Previous case study
- Tuan-18-Design-News-Feed — Fan-out pattern comparison
- Tuan-19-Design-Notification-System — Push notification deep dive
References
- Alex Xu, System Design Interview — Chapter 12: Design a Chat System
- Discord Engineering Blog — How Discord Stores Billions of Messages
- WhatsApp Architecture — High Scalability
- Signal Protocol — Technical Documentation
- Kafka as Message Broker — Confluent
- Tuan-02-Back-of-the-envelope — Estimation framework & chat numbers
Previous week: Tuan-16-Design-URL-Shortener — URL Shortener case study · Next week: Tuan-18-Design-News-Feed — News Feed system with fan-out patterns