Case Study: Design Nearby Friends (Real-Time Location Sharing)
“Hieu, imagine you are out with a group of friends in Saigon. Everyone is on Zalo, but nobody knows where anyone else is. You open the app — and within one second, the map shows 5 friends near you within a 5 km radius. Behind that ‘simplicity’ is a complex real-time system, where WebSocket meets Pub/Sub meets Redis at a scale of 100 million users.”
Tags: system-design nearby-friends real-time websocket pubsub redis alex-xu case-study vol2 Student: Hieu Prerequisite: Tuan-02-Back-of-the-envelope · Tuan-06-Cache-Strategy · Tuan-10-Consistent-Hashing · Tuan-17-Design-Chat-System Related: Case-Design-Proximity-Service · Tuan-05-Load-Balancer · Tuan-08-Message-Queue · Tuan-13-Monitoring-Observability · Tuan-14-AuthN-AuthZ-Security Reference: Alex Xu, System Design Interview — An Insider’s Guide, Volume 2 — Chapter 2: Nearby Friends
1. Context & Why — Why does Nearby Friends matter?
1.1 Analogy — A group of friends out in the city
Hieu, imagine you and 10 friends are out in Ho Chi Minh City on a Saturday night. Everyone has their own plans — some are eating in District 1, some are shopping at Bitexco, some are over in Thu Duc. You want to know: who is near me, so we can meet up?
The traditional way: you text each person — “Hey, where are you?” — and wait for each reply. 10 people, 10 messages, 10 waits. Some never answer; some answer 30 minutes late, by which time they have already moved somewhere else.
The Nearby Friends way: you open the app and turn on “Nearby Friends”. The app automatically shows on a map: “Minh is on Nguyen Hue, Tuan is on Dong Khoi 500 m away, Lan is on Bui Vien 800 m away.” Refreshed every 30 seconds. You never have to ask anyone — the system does it all automatically.
This is exactly the Nearby Friends feature — similar to Snap Map (Snapchat), Zalo Nearby, Facebook Nearby Friends (discontinued in 2022), or WhatsApp Live Location.
1.2 The engineering problem — Why is this hard?
| Problem | Explanation | Scale |
|---|---|---|
| Real-time location updates | Positions change constantly and must be refreshed every 30 seconds — not a one-shot query like Proximity Service | 10M concurrent users x 1 update per 30s = ~333K updates/s |
| Bidirectional communication | The server must push friends' locations to the client, not just let the client pull | HTTP polling is wasteful; WebSocket is needed |
| Fan-out problem | Each user has 400 friends on average; every location update must notify up to 400 people | 333K updates/s x 400 friends = 133M messages/s |
| Stateful connections | WebSocket is stateful — harder to scale than stateless HTTP | Needs connection management, reconnection, server affinity |
| Privacy sensitive | Location is among the most sensitive personal data — needs opt-in, opt-out, precision control | GDPR, CCPA, personal data protection laws |
| Selective sharing | Only show friends who have opted in and are online, not all 400 friends | Non-trivial filtering logic |
1.3 Comparison with Proximity Service
| Aspect | Proximity Service (Chapter 1) | Nearby Friends (Chapter 2) |
|---|---|---|
| Subjects | Businesses (static) | Friends (dynamic — constantly moving) |
| Update frequency | Businesses rarely change location | Users move every second |
| Communication | Request-response (HTTP) | Bidirectional (WebSocket) |
| Data freshness | Hours of staleness is acceptable | Needs real-time (< 30s) |
| Indexing | Geohash/Quadtree over millions of POIs | No geospatial index needed — only distances between friends |
| Scale challenge | Read-heavy (60K QPS) | Write-heavy + fan-out (133M messages/s) |
| Reference | Case-Design-Proximity-Service | This note |
Key insight: Nearby Friends is NOT Proximity Service applied to people. Proximity Service answers “who is near me?” across all strangers. Nearby Friends only shows friends — you already know the friend list; you only need their current locations.
1.4 Real-World Applications
| App | Feature | Notes |
|---|---|---|
| Snap Map (Snapchat) | Shows friends' locations on a map | Real-time, Bitmoji avatars, Ghost Mode to hide |
| Zalo Nearby | Finds Zalo users near your current location | Different — finds strangers, not just friends |
| WhatsApp Live Location | Shares real-time location with a contact/group | Time-boxed (15 min, 1 h, 8 h) |
| Find My Friends (Apple) | See family/friends' locations | iOS integration, battery-efficient |
| Telegram Live Location | Shares location inside a chat | Similar to WhatsApp |
2. Step 1 — Understand the Problem & Establish Design Scope
2.1 Clarifying Questions
| Question | Answer | Notes |
|---|---|---|
| What is the core feature? | Show the list of friends currently near me, on a map, updated in real time | The single core feature |
| How far is "near"? | A configurable radius, default 5 miles (~8 km) | Configurable radius |
| Update frequency? | Every 30 seconds | Not per-second real-time — 30s is enough |
| How large is the scale? | 100M DAU (Daily Active Users) | Facebook/Zalo scale |
| How many users online at once? | ~10% of DAU at peak = 10M concurrent | Peak hours: 7-10 PM |
| Average number of friends per user? | 400 friends | Facebook's average is ~338; rounded to 400 |
| Is the feature opt-in? | Yes — only users who enable it share their location | Privacy comes first |
| Do we store location history? | No — only the current location | Less storage, more privacy |
| Show distance or exact position? | Both — distance plus position on the map | Client renders |
2.2 Functional Requirements
- FR1: Users can turn Nearby Friends on/off (opt-in/opt-out)
- FR2: When enabled, the app shows friends within a configurable radius (default 5 miles)
- FR3: The list refreshes every 30 seconds without the user refreshing manually
- FR4: Each friend entry shows: name, distance, time of last update
- FR5: Only friends who have also enabled the feature are shown (mutual opt-in)
- FR6: When a user disables the feature or goes offline, they disappear from their friends' lists
2.3 Non-Functional Requirements
| Requirement | Target | Explanation |
|---|---|---|
| Availability | 99.9% | A social feature, not safety-critical like navigation |
| Latency | Location update propagation < 1 second | From a friend updating their location to you receiving it |
| Scalability | 10M concurrent users, 333K location updates/s | Peak traffic |
| Consistency | Eventual consistency is fine | Receiving a position 1-2 seconds late is acceptable |
| Battery efficiency | Must not drain the battery | GPS polled every 30s, not continuously |
| Privacy | Location shared only with opted-in friends | GDPR compliant |
2.4 Estimation — Back-of-the-Envelope
Concurrent users:
100M DAU × 10% online at peak = 10M concurrent users
Location update QPS:
10M users × 1 update / 30 s ≈ 333K updates/s
Every second, the system ingests ~333K location updates from clients. This workload is write-heavy.
WebSocket connections:
If each WebSocket server handles 50K connections (a realistic production figure):
10M connections / 50K per server = 200 WebSocket servers
Pub/Sub fan-out volume:
333K updates/s × 400 friends ≈ 133M messages/s (worst case)
But not all 400 friends are online and opted in. Assuming 10% of friends are online and opted in:
333K updates/s × 40 online friends ≈ 13.3M messages/s
This is the key number: 13.3 million messages per second must be delivered through the Pub/Sub system. It is the single biggest challenge of this design.
Redis memory for the location cache:
10M users × ~100 bytes per entry (user_id, lat, lng, timestamp) ≈ 1 GB
Just 1 GB of Redis memory holds the locations of all 10M users. A single Redis instance (64 GB RAM) handles it easily.
Bandwidth for location updates:
Inbound: 333K updates/s × 100 bytes ≈ 33 MB/s · Outbound: 13.3M messages/s × 100 bytes ≈ 1.33 GB/s
Outbound bandwidth is 40x inbound — the signature of a fan-out system.
Estimation summary:
| Metric | Value |
|---|---|
| Concurrent users (peak) | 10M |
| WebSocket servers (50K conn/server) | 200 |
| Location update QPS | ~333K/s |
| Pub/Sub fan-out messages | ~13.3M/s |
| Redis memory (location cache) | ~1 GB |
| Inbound bandwidth | ~33 MB/s |
| Outbound bandwidth | ~1.33 GB/s |
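The figures in this table can be reproduced with a few lines of arithmetic — a sketch, where the ~100-byte payload size and the 10%-online assumptions come from the estimation above:

```python
# Back-of-the-envelope numbers for Nearby Friends (assumed inputs).
DAU = 100_000_000
concurrent = DAU // 10                 # ~10% of DAU online at peak
UPDATE_INTERVAL_S = 30
AVG_ONLINE_FRIENDS = 40                # 400 friends x 10% online and opted in
PAYLOAD_BYTES = 100                    # assumed size of one location update
CONN_PER_SERVER = 50_000

update_qps = concurrent // UPDATE_INTERVAL_S        # ~333K updates/s
ws_servers = concurrent // CONN_PER_SERVER          # 200 servers
fanout_qps = update_qps * AVG_ONLINE_FRIENDS        # ~13.3M messages/s
cache_gb = concurrent * PAYLOAD_BYTES / 1e9         # ~1 GB location cache
inbound_mb_s = update_qps * PAYLOAD_BYTES / 1e6     # ~33 MB/s
outbound_gb_s = fanout_qps * PAYLOAD_BYTES / 1e9    # ~1.33 GB/s
```

Note how the fan-out multiplier alone turns a modest 33 MB/s of input into more than a gigabyte per second of output.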
3. Step 2 — High-Level Design
3.1 Choosing the Communication Protocol
Before designing the architecture, we must pick the communication protocol. This is the most important decision in the whole problem.
| Option | Description | Pros | Cons | Fit? |
|---|---|---|---|---|
| HTTP Polling | Client sends a GET request every 30s | Simple, stateless | Wastes bandwidth (each request carries ~500 bytes of HTTP headers); 10M requests every 30s = 333K QPS just to poll | No |
| HTTP Long Polling | Client sends a request; the server holds it until new data arrives | Less wasteful than polling | Still one connection per pending request; not truly bidirectional | No |
| Server-Sent Events (SSE) | Server pushes events over HTTP | Simple, built-in browser support | Server → client only, no client → server — and we need both directions | Not enough |
| WebSocket | Full-duplex, bidirectional communication over one TCP connection | Client sends locations, server pushes friend updates — on the same connection. Lightweight (2-6 bytes of overhead per message) | Stateful, harder to scale than HTTP | Yes — this is the choice |
Why WebSocket? Because we need bidirectional communication: the client sends location updates (client → server) and the server pushes friend locations (server → client) over the same connection. WebSocket is the natural choice. See Tuan-17-Design-Chat-System — the chat system uses WebSocket for the same reason.
3.2 High-Level Architecture Overview
The system has three main components:
| Component | Role | Technology |
|---|---|---|
| WebSocket Servers | Maintain persistent client connections, receive location updates, push friend locations | WebSocket (ws/wss) |
| Location Cache | Store each user's latest location | Redis |
| Pub/Sub | Propagate location updates to all online friends | Redis Pub/Sub |
Data flow in brief:
- User A enables the feature → the client opens a WebSocket connection to a server
- The client sends a location update every 30 seconds over the WebSocket
- The server writes the location to Redis (Location Cache)
- The server publishes the location update to User A's Pub/Sub channel
- All of User A's friends subscribed to that channel receive the update
- The friends' WebSocket servers compute the distance → push to each friend's client if within their radius
3.3 High-Level Architecture Diagram
flowchart TB subgraph Clients A["User A<br/>(Mobile App)"] B["User B<br/>(Mobile App)"] C["User C<br/>(Mobile App)"] end subgraph "API Gateway / Load Balancer" LB["Load Balancer<br/>→ [[Tuan-05-Load-Balancer]]"] end subgraph "WebSocket Server Fleet" WS1["WebSocket Server 1<br/>50K connections"] WS2["WebSocket Server 2<br/>50K connections"] WSN["WebSocket Server N<br/>50K connections"] end subgraph "Data Layer" Redis_Cache["Redis — Location Cache<br/>Key: user_id<br/>Value: {lat, lng, timestamp}"] Redis_PubSub["Redis Pub/Sub<br/>Channel per user<br/>Friends subscribe"] end subgraph "Supporting Services" UserSvc["User Service<br/>(Friends list, profile)"] DB[("Database<br/>User data, friend relationships")] end A -->|"WebSocket"| LB B -->|"WebSocket"| LB C -->|"WebSocket"| LB LB --> WS1 LB --> WS2 LB --> WSN WS1 & WS2 & WSN -->|"SET location"| Redis_Cache WS1 & WS2 & WSN -->|"PUBLISH update"| Redis_PubSub WS1 & WS2 & WSN -->|"SUBSCRIBE friend channels"| Redis_PubSub WS1 & WS2 & WSN -->|"Get friends list"| UserSvc UserSvc --> DB style LB fill:#42a5f5,color:#fff style Redis_Cache fill:#ef5350,color:#fff style Redis_PubSub fill:#ff7043,color:#fff style DB fill:#66bb6a,color:#fff
3.4 API Design (WebSocket Messages)
Since we use WebSocket, there are no traditional REST endpoints. Instead, there are message types:
Client → Server Messages:
| Message Type | Payload | Purpose |
|---|---|---|
| location_update | {lat, lng, timestamp} | Send the current position every 30s |
| enable_nearby | {} | Turn the Nearby Friends feature on |
| disable_nearby | {} | Turn the Nearby Friends feature off |
| update_radius | {radius_miles} | Change the display radius |
Server → Client Messages:
| Message Type | Payload | Purpose |
|---|---|---|
| friend_location | {friend_id, lat, lng, timestamp, distance} | A friend's new position |
| friend_offline | {friend_id} | A friend turned the feature off or went offline |
| nearby_friends_list | [{friend_id, lat, lng, distance}, ...] | The full list, sent right after connecting |
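As a sketch of how these message types might be serialized on the wire — the JSON envelope with `type`/`payload` fields is an assumption of this note, not specified by the source:

```python
import json
import time

def make_location_update(lat: float, lng: float) -> str:
    """Client -> server: the current position, sent every 30 s."""
    return json.dumps({"type": "location_update",
                       "payload": {"lat": lat, "lng": lng,
                                   "timestamp": int(time.time())}})

def parse_message(raw: str):
    """Server side: split an incoming frame into (type, payload)."""
    msg = json.loads(raw)
    return msg["type"], msg.get("payload", {})
```

A real deployment would likely use a compact binary encoding instead of JSON to stay near the ~100-byte payload budget, but the envelope idea is the same.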
4. Step 3 — Deep Dive
4.1 Location Update Flow — Step by Step
This is the most important flow in the system. When User A sends a location update, what happens?
sequenceDiagram participant A as User A (Client) participant WS_A as WebSocket Server<br/>(serving User A) participant Cache as Redis<br/>Location Cache participant PubSub as Redis<br/>Pub/Sub participant WS_B as WebSocket Server<br/>(serving User B) participant B as User B (Client) Note over A,B: User A and User B are friends. Both have Nearby Friends enabled. A->>WS_A: location_update {lat: 10.77, lng: 106.70, ts: ...} par Parallel Operations WS_A->>Cache: SET user:A:location {lat, lng, ts}<br/>TTL = 120s and WS_A->>PubSub: PUBLISH channel:user_A {lat, lng, ts} end Note over PubSub: User B has SUBSCRIBEd to channel:user_A<br/>(because B is A's friend and is online) PubSub->>WS_B: Message on channel:user_A {lat, lng, ts} Note over WS_B: WS_B computes the distance between A and B.<br/>If <= B's radius → push to B. WS_B->>B: friend_location {friend_id: A, lat, lng, distance: 1.2km}
Step-by-step details:
| Step | Action | Component | Latency | Notes |
|---|---|---|---|---|
| 1 | Client sends its location over WebSocket | Client → WS Server | ~5ms (LAN) | Binary message, minimal overhead |
| 2a | Write the location to the Redis cache | WS Server → Redis | ~1ms | SET with TTL 120s (2x the update interval) |
| 2b | Publish to the user's Pub/Sub channel | WS Server → Redis Pub/Sub | ~1ms | In parallel with step 2a |
| 3 | Redis fans out to subscribers | Redis Pub/Sub → WS Servers | ~1ms | Every subscriber receives the message |
| 4 | Compute the distance | WS Server (receiver) | ~0.01ms | Haversine formula, CPU-bound |
| 5 | Push to the client if within radius | WS Server → Client | ~5ms | Over the WebSocket connection |
| Total | — | — | ~10-15ms | Extremely fast |
Aha moment: the whole flow, from A's update to B receiving it, takes only ~10-15ms. Users perceive it as “real-time” even though locations only refresh every 30 seconds. The bottleneck is not latency — it is fan-out volume.
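The write path above (steps 2a/2b) can be condensed into a handler sketch. A small in-memory class stands in for Redis here; `FakeRedis` and `handle_location_update` are hypothetical names for illustration:

```python
import time

class FakeRedis:
    """In-memory stand-in for the Redis cache + Pub/Sub (illustration only)."""
    def __init__(self):
        self.store = {}        # key -> (value, expiry timestamp)
        self.subscribers = {}  # channel -> list of callbacks
    def set_with_ttl(self, key, value, ttl_s):
        self.store[key] = (value, time.time() + ttl_s)
    def subscribe(self, channel, callback):
        self.subscribers.setdefault(channel, []).append(callback)
    def publish(self, channel, message):
        for callback in self.subscribers.get(channel, []):
            callback(message)

def handle_location_update(redis, user_id, lat, lng):
    loc = {"lat": lat, "lng": lng, "ts": time.time()}
    # Step 2a: cache the latest location, TTL = 2x the 30 s update interval.
    redis.set_with_ttl(f"user:{user_id}:location", loc, ttl_s=120)
    # Step 2b: fan out to every online friend subscribed to this channel.
    redis.publish(f"channel:user_{user_id}", {"user_id": user_id, **loc})
```

In production the two Redis calls would be issued concurrently against a real client (e.g. SET with EX plus PUBLISH); the structure of the handler is the same.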
4.2 Redis Pub/Sub — The Heart of the System
4.2.1 Channel Design
Core design decision: one Pub/Sub channel per user.
| Design | Description | Pros | Cons |
|---|---|---|---|
| 1 channel per user (chosen) | Channel user:A; friends subscribe | Simple, granular control | Many channels (10M) |
| 1 channel per geohash cell | Channel geo:w3gvk; users in the cell subscribe | Fewer channels | Receives updates from strangers, not just friends — a privacy issue |
| 1 global channel | Every update broadcast to everyone | Simplest | 333K updates/s x 10M subscribers = infeasible |
Why one channel per user? Because Nearby Friends only cares about friends, not everyone in the area. Each user subscribes only to their friends' channels — exactly the people they care about.
4.2.2 Subscribe/Unsubscribe Flow
When User B comes online and enables Nearby Friends:
| Step | Action | Details |
|---|---|---|
| 1 | B connects via WebSocket | Opens a persistent connection |
| 2 | Server fetches B's friends list | Query User Service → DB. Result: [A, C, D, E, …] (400 friends) |
| 3 | Server checks who is online and opted in | Query Redis: EXISTS user:A:location, user:C:location, … |
| 4 | Server subscribes B to the online friends' channels | SUBSCRIBE channel:user_A, channel:user_C, … |
| 5 | Server fetches the online friends' latest locations | MGET user:A:location, user:C:location, … from the Redis cache |
| 6 | Server computes distances and sends the initial list | Push nearby_friends_list to B over WebSocket |
When User B goes offline or disables Nearby Friends:
| Step | Action | Details |
|---|---|---|
| 1 | Server UNSUBSCRIBEs B from all friend channels | UNSUBSCRIBE channel:user_A, channel:user_C, … |
| 2 | Server DELETEs B's location from the cache | DEL user:B:location |
| 3 | Server PUBLISHes an "offline" event to B's channel | PUBLISH channel:user_B {status: "offline"} |
| 4 | B's friends receive the "offline" event | Push friend_offline {friend_id: B} to the friends' clients |
4.2.3 Pub/Sub Fan-out Analysis
This is the most complex part. When User A updates their location, one PUBLISH on channel:user_A fans out to every subscriber — on average ~40 online, opted-in friends.
But not every user has exactly 400 friends. The real distribution is long-tailed:
| User type | Friend count | % of users | Subscribers per channel |
|---|---|---|---|
| Light users | < 100 | 40% | ~10 |
| Average users | 100-500 | 45% | ~40 |
| Heavy users | 500-2000 | 14% | ~100-200 |
| Power users / celebrities | 2000-5000 | 1% | ~500+ |
Pitfall: fan-out explosion for popular users. A user with 5000 friends, 500 of them online → every 30 seconds, one update of theirs fans out into 500 messages. Multiply across 333K updates/s and you can get very large bursts.
Mitigation for popular users: rate-limit the fan-out. If a user has > 500 subscribers, reduce their update frequency from 30s to 60s or 120s. Ordinary users are unaffected.
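A minimal sketch of that rate limit. The 500-subscriber threshold and the 60 s tier come from the text above; the 2000-subscriber tier for the 120 s interval is an added assumption:

```python
def update_interval_s(online_subscribers: int) -> int:
    """Throttle fan-out for popular users: the more online subscribers a
    channel has, the less often its owner's updates are re-published."""
    if online_subscribers > 2000:    # assumed threshold for the slowest tier
        return 120
    if online_subscribers > 500:     # threshold from the design note
        return 60
    return 30                        # ordinary users keep the normal interval
```

The subscriber count per channel is already known to the Pub/Sub layer, so this check is cheap to evaluate on every publish.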
4.2.4 Memory for Redis Pub/Sub Channels
Each subscription costs on the order of 20 bytes of channel-tracking metadata inside Redis: 10M active channels × ~40 online subscribers × ~20 bytes ≈ 7-8 GB for Pub/Sub metadata — acceptable for a Redis cluster.
4.3 WebSocket Connection Management
4.3.1 Connection Lifecycle
stateDiagram-v2 [*] --> Connecting: User opens app,<br/>enables Nearby Friends Connecting --> Connected: WebSocket handshake OK Connected --> Subscribing: Server subscribes to<br/>friend channels Subscribing --> Active: Subscriptions done,<br/>initial list sent Active --> Active: Location updates<br/>every 30s Active --> Reconnecting: Network drop,<br/>server crash Reconnecting --> Connecting: Retry with<br/>exponential backoff Active --> Disconnecting: User disables feature<br/>or closes app Disconnecting --> [*]: Cleanup subscriptions,<br/>delete location cache Reconnecting --> [*]: Max retries exceeded
4.3.2 Connection Parameters
| Parameter | Value | Rationale |
|---|---|---|
| Heartbeat interval | 30 seconds | Matches the location update interval — the heartbeat piggybacks on the location message |
| Connection timeout | 10 seconds | If the connection is not established within 10s → retry |
| Max reconnect attempts | 10 | After 10 attempts → show the user "Unable to connect" |
| Reconnect backoff | Exponential: 1s, 2s, 4s, 8s, …, max 60s | Avoids a thundering herd when a server restarts |
| Idle timeout | 5 minutes without a location update | Assume the user turned GPS off or the app was killed → disconnect and clean up |
| Max message size | 1 KB | A location update needs only ~100 bytes; 1 KB leaves plenty of margin |
4.3.3 Reconnection Strategy
When the WebSocket connection drops (network issue, server crash, app backgrounded), the client must reconnect:
| Step | Action | Details |
|---|---|---|
| 1 | Client detects the disconnect | onclose or onerror event |
| 2 | Wait per the backoff schedule | Exponential: 1s, 2s, 4s, 8s, 16s, 32s, 60s (cap) |
| 3 | Reconnect through the Load Balancer | May land on a different server (the stateful concern!) |
| 4 | Re-authenticate | Send the token over the WebSocket |
| 5 | Server re-subscribes to friend channels | Same as the initial flow |
| 6 | Server sends the initial nearby friends list | Client refreshes its UI |
Important: on reconnect, the client may be routed to a different server (by the Load Balancer). The new server must re-subscribe to all friend channels. That is the cost of reconnection — but since subscribing is O(number_of_friends) and the friends list is cached, the cost is acceptable.
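The backoff schedule from the table can be sketched as follows; the full jitter is an addition beyond the source (a common refinement that keeps clients of a crashed server from all retrying at the same instant):

```python
import random

def backoff_delay_s(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with a 60 s cap and full jitter: the actual wait is
    drawn uniformly from [0, min(cap, base * 2^attempt))."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Without the jitter, every client disconnected by the same server crash would retry on the exact 1s/2s/4s schedule and hammer the load balancer in synchronized waves.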
4.4 Scaling WebSocket Servers
4.4.1 The Challenge — Stateful Connections
WebSocket connections are stateful: each connection is pinned to one specific server. You cannot "load balance per request" as with HTTP. Once the connection is established, every message must go through that same server.
| Problem | Explanation | Solution |
|---|---|---|
| Server failure | A server dies → 50K users lose their connections | Clients auto-reconnect; the LB routes them to other servers |
| Uneven distribution | Server A has 60K connections, server B has 20K | The LB tracks connection counts and routes new connections to the least-loaded server |
| Deployment | Rolling update → server restarts → 50K users reconnect simultaneously | Graceful shutdown: the server warns its clients 30s ahead so they reconnect gradually |
| Memory pressure | Each connection takes ~10-50 KB of memory; 50K connections = 0.5-2.5 GB/server | Monitor memory; cap connections per server |
4.4.2 Server Assignment Strategy
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Random (via LB) | The LB picks a random server for each new connection | Simple | Uneven distribution |
| Least connections (recommended) | The LB picks the server with the fewest connections | Even distribution | The LB must track state |
| Consistent hashing | Hash user_id → server | Reconnects hit the same server (warm cache) | Hard to handle server add/remove |
Use the least-connections strategy for the WebSocket LB. See Tuan-05-Load-Balancer. Consistent hashing (Tuan-10-Consistent-Hashing) could also work, but its complexity is not worth it for this use case — a reconnection only costs a few seconds of re-subscribing.
4.4.3 Graceful Shutdown Flow
When a WebSocket server must be restarted (deploy, maintenance):
| Step | Action | Duration |
|---|---|---|
| 1 | Mark the server as "draining" | The LB stops sending new connections to it |
| 2 | The server sends a "reconnect" signal to all clients | Clients begin reconnecting to other servers |
| 3 | Wait for connections to drain to 0 (or time out) | Max 60 seconds |
| 4 | The server unsubscribes from all Pub/Sub channels | Cleanup |
| 5 | Shut the server down | Safe |
4.5 Scaling Redis Pub/Sub
4.5.1 The Problem — a Single Redis Pub/Sub Is a Bottleneck
A single Redis instance can sustain roughly ~500K Pub/Sub messages/s, but we must deliver 13.3M messages/s. We need multiple Redis instances.
13.3M / 500K ≈ 27 instances → round up to 30 Redis Pub/Sub instances (with buffer).
4.5.2 Sharding Strategy for Pub/Sub Channels
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Hash-based sharding | shard = hash(user_id) % N | Even distribution, deterministic | Resharding when adding/removing nodes |
| Consistent hashing | Use a hash ring | Smooth resharding | More complex |
| Range-based | user_id 1-1M → shard 1, 1M-2M → shard 2 | Simple | Uneven if the user distribution is skewed |
Choice: hash-based sharding (simple, good enough).
Each WebSocket server must know which Redis shard holds user X's channel. Since the hash function is deterministic, every server can compute it without any lookup.
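A sketch of that deterministic mapping; the `shard_for` name is hypothetical, and md5 is used only so every process computes the same shard (Python's built-in `hash()` is randomized per process):

```python
import hashlib

NUM_SHARDS = 30

def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Deterministic channel -> shard mapping: every WebSocket server computes
    the same answer with no lookup service or coordination."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

The trade-off named in the table applies: changing `NUM_SHARDS` remaps most channels at once, which is why a hash ring would be preferred if shards were added or removed frequently.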
flowchart LR subgraph "WebSocket Servers" WS1["WS Server 1"] WS2["WS Server 2"] WS3["WS Server 3"] end subgraph "Redis Pub/Sub Cluster" R1["Redis Shard 1<br/>Channels: user_1, user_31, ..."] R2["Redis Shard 2<br/>Channels: user_2, user_32, ..."] R3["Redis Shard 3<br/>Channels: user_3, user_33, ..."] RN["Redis Shard N<br/>..."] end WS1 -->|"PUBLISH channel:user_1"| R1 WS1 -->|"SUBSCRIBE channel:user_2"| R2 WS2 -->|"PUBLISH channel:user_32"| R2 WS2 -->|"SUBSCRIBE channel:user_3"| R3 WS3 -->|"SUBSCRIBE channel:user_1"| R1 R1 -->|"Fan-out"| WS2 & WS3 R2 -->|"Fan-out"| WS1 & WS3 style R1 fill:#ef5350,color:#fff style R2 fill:#ef5350,color:#fff style R3 fill:#ef5350,color:#fff style RN fill:#ef5350,color:#fff
Note: each WebSocket server may connect to several Redis shards, because one user's friends can live on different shards. WS Server 1, serving User B, must subscribe to channel:user_A on shard 1, channel:user_C on shard 3, and so on.
4.5.3 Redis Pub/Sub Connection Budget
Each WebSocket server connects to every Redis shard (to subscribe/publish). With 200 WS servers and 30 Redis shards: 200 × 30 = 6,000 connections in total, i.e. 200 connections per Redis instance.
Each Redis instance can handle ~10K concurrent connections → comfortable headroom.
4.5.4 Alternative — a Message Queue instead of Redis Pub/Sub?
| Aspect | Redis Pub/Sub | Message Queue (Kafka, RabbitMQ) |
|---|---|---|
| Delivery guarantee | At-most-once (fire-and-forget) | At-least-once (Kafka), at-most-once (RabbitMQ) |
| Persistence | None — messages are lost if the subscriber is offline | Yes — Kafka stores messages on disk |
| Latency | Very low (~1ms) | Higher (~5-50ms depending on configuration) |
| Fan-out | Native — 1 publish, N subscribers receive | Needs consumer groups or topic-per-user |
| Memory | In-memory only | On disk (Kafka) |
| Ordering | Guaranteed per channel | Guaranteed per partition (Kafka) |
Why choose Redis Pub/Sub?
- At-most-once is enough: losing one location update is harmless — a fresh one arrives 30 seconds later. No durability needed.
- Very low latency: Redis Pub/Sub is ~1ms vs Kafka's ~5-50ms. For a real-time feature, that millisecond matters.
- No persistence needed: we don't care about the location from 5 minutes ago, only the latest one.
- Simplicity: Pub/Sub is built into Redis — no extra Kafka cluster to deploy and manage.
Trade-off: if you need guaranteed delivery (e.g. a notification system), use Kafka. For Nearby Friends, Redis Pub/Sub is a perfect fit because we prioritize low latency and simplicity over durability.
4.6 Nearby Friend Calculation — Computing Distance
4.6.1 Server-side vs Client-side Calculation
| Approach | Description | Pros | Cons |
|---|---|---|---|
| Server-side (chosen) | The WS server computes the distance before pushing | Saves bandwidth — only friends within the radius are pushed | The server must know the subscriber's position |
| Client-side | The server pushes all friends' locations; the client filters | Simpler server | Wastes bandwidth — far-away friends get pushed too |
Server-side calculation is chosen because:
- With 40 online friends, the server pushes only the 5-10 within the radius instead of all 40 → saves 75-87% of the bandwidth
- The (mobile) client does less work → saves battery
- The server already has both positions (sender and receiver) in the Redis cache
4.6.2 Haversine Formula
The distance between two points on the Earth's surface:
$d = 2R \arcsin\left(\sqrt{\sin^2\frac{\Delta\varphi}{2} + \cos\varphi_1 \cos\varphi_2 \sin^2\frac{\Delta\lambda}{2}}\right)$
where $R \approx 6371$ km is the Earth's radius, $\varphi$ is latitude, and $\lambda$ is longitude.
Performance: Haversine is O(1) — just a handful of trigonometric operations. Computing 40 distances takes < 0.01ms. Not a bottleneck.
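A minimal implementation of the formula (standard Haversine, nothing project-specific):

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two (lat, lng) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)          # delta latitude
    d_lambda = math.radians(lng2 - lng1)       # delta longitude
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

A quick sanity check: one degree of latitude is ≈ 111 km anywhere on the globe, so `haversine_km(10, 106, 11, 106)` should land just above 111.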
4.6.3 Detailed flow on receiving a friend update
When the WS server serving User B receives an update from channel:user_A:
| Step | Action | Details |
|---|---|---|
| 1 | Receive the Pub/Sub message | {user_id: A, lat: 10.77, lng: 106.70, ts: ...} |
| 2 | Fetch User B's current position | From local memory (cached when B last sent a location update) |
| 3 | Compute the Haversine distance | d = haversine(B.lat, B.lng, A.lat, A.lng) |
| 4 | Compare against B's radius | if d <= B.radius |
| 5a | Within radius → push to B | friend_location {friend_id: A, distance: d} |
| 5b | Outside radius → skip | No push; saves bandwidth |
Optimization: the WS server serving User B keeps B's position in local memory (no Redis query on every message). It is refreshed every 30 seconds when B sends a location update. This is a cache right at the point of use — zero latency.
4.7 Adding/Removing Friends — Dynamic Subscription
When a friend relationship changes (new friend, unfriend), subscriptions must be updated:
4.7.1 Adding a friend
| Step | Action | Trigger |
|---|---|---|
| 1 | User A and User B become friends | Friend request accepted |
| 2 | A notification is sent to A's and B's WS servers | Via an internal message queue |
| 3 | A's WS server subscribes to channel:user_B | If B is online and opted in |
| 4 | B's WS server subscribes to channel:user_A | If A is online and opted in |
| 5 | Compute distances and push if needed | Both A and B receive each other's position |
4.7.2 Unfriending
| Step | Action | Trigger |
|---|---|---|
| 1 | User A unfriends User B | UI action |
| 2 | A notification is sent to A's and B's WS servers | Via an internal message queue |
| 3 | A's WS server unsubscribes from channel:user_B | |
| 4 | B's WS server unsubscribes from channel:user_A | |
| 5 | Push friend_offline to both A and B | Removed from the nearby list |
Note: subscribe/unsubscribe on Redis Pub/Sub is O(1) — extremely fast. No performance impact.
4.8 Handling Inactive Users — TTL and Cleanup
4.8.1 The Problem
A user can go inactive for many reasons:
| Situation | What the system sees | Handling |
|---|---|---|
| User closes the app | WebSocket disconnect event | Clean up immediately |
| App killed by the OS | WebSocket disconnect event (possibly delayed) | Clean up immediately |
| User enters a dead zone | No location updates, missed heartbeats | Detect and clean up |
| User leaves the phone in one place | Location updates keep arriving (position unchanged) | Nothing to do — still active |
| User turns GPS off | The app has no GPS data → stops sending locations | Detect and notify the user |
4.8.2 TTL Strategy
| Data | TTL | Rationale |
|---|---|---|
| Location cache (Redis) | 120 seconds (2x the update interval) | No update for 2 cycles → the user is offline |
| WebSocket heartbeat | 60 seconds | Server pings, client pongs. 2 consecutive missed pings → disconnect |
| Pub/Sub subscription | No TTL — cleaned up on disconnect | Subscribe/unsubscribe is explicit |
Flow when a user goes inactive:
sequenceDiagram participant A as User A (Client) participant WS as WebSocket Server participant Cache as Redis Cache participant PubSub as Redis Pub/Sub participant Friends as Friends' WS Servers Note over A,WS: User A loses network / closes the app WS->>WS: Heartbeat timeout (60s, 2 missed pings) WS->>WS: Close WebSocket connection par Cleanup WS->>Cache: DEL user:A:location and WS->>PubSub: UNSUBSCRIBE all friend channels and WS->>PubSub: PUBLISH channel:user_A {status: "offline"} end PubSub->>Friends: User A offline notification Friends->>Friends: Remove A from nearby list,<br/>push friend_offline to clients
4.8.3 Redis TTL as a Safety Net
Even if the server never gets to clean up (e.g. it crashes), the Redis TTL on the location key deletes the data automatically: SET user:A:location {...} EX 120 makes the key expire on its own after 120 seconds.
Friends will then see the user as "last updated 2 minutes ago" → the client can dim them or hide them from the list.
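Client-side, the same safety net can be sketched as a staleness check; the names and the "stale" rendering rule are illustrative:

```python
STALE_AFTER_S = 120   # matches the cache TTL: 2x the 30 s update interval

def presence(last_update_ts, now):
    """Client-side safety net: even if the server missed its cleanup, a
    location older than the TTL is rendered as stale (dimmed or hidden)."""
    return "active" if now - last_update_ts <= STALE_AFTER_S else "stale"
```

Using the same 120 s constant on both sides keeps the server's cache expiry and the client's rendering rule in agreement.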
4.9 Geohash Optimization — Reducing Computation
4.9.1 The Problem
For every location update from a friend, the WS server computes a Haversine distance. If a user has 40 online friends, each updating every 30s, each user triggers: 40 calculations per 30 seconds (~1.3/s).
40 calculations per 30 seconds is tiny — not a bottleneck yet. But with 50K users on a single WS server: 50K users × 40 updates / 30s ≈ 67K calculations/s.
Haversine is cheap (well under a microsecond per call), so 67K calculations/s costs only tens of milliseconds of CPU time per second. Not a bottleneck.
Conclusion: at a scale of 10M concurrent users, geohash optimization is NOT necessary — brute-force Haversine is fast enough. However, if the scale grows to 100M+ concurrent users, or friend counts reach 5000+, geohash optimization may become necessary.
4.9.2 Geohash Optimization (if needed)
If optimization becomes necessary (at extreme scale):
| Technique | Description | Savings |
|---|---|---|
| Pre-filter by geohash | Each user gets a geohash (computed from lat/lng). Compute Haversine only for friends with nearby geohashes | Skips the 80-90% of friends who are far away |
| Lazy calculation | Recompute only when a friend's geohash changes (skip if they stay in the same cell) | Cuts 90%+ of calculations for stationary friends |
| Batch calculation | Batch updates and compute once every 5-10 seconds instead of per update | Reduces CPU spikes |
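The pre-filter idea can be sketched with a fixed lat/lng grid standing in for geohash cells — a simplification: real geohash uses base-32 prefixes, but the same-or-adjacent-cell test is the same idea, and the 0.1° cell size is an assumption:

```python
import math

CELL_DEG = 0.1   # ~11 km of latitude per cell; assumed cell size

def cell(lat, lng):
    """Coarse grid cell, used like a geohash-prefix comparison."""
    return (math.floor(lat / CELL_DEG), math.floor(lng / CELL_DEG))

def might_be_nearby(a, b):
    """Pre-filter: only pairs in the same or adjacent cells need the exact
    Haversine computation; all other friends are skipped outright."""
    (r1, c1), (r2, c2) = cell(*a), cell(*b)
    return abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1
```

Checking adjacent cells as well as the same cell avoids false negatives for two friends standing on opposite sides of a cell boundary.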
4.10 Multi-Region Architecture
4.10.1 Why Multi-Region?
| Reason | Details |
|---|---|
| Latency | A user in Vietnam connecting to a US server sees ~200ms latency; WebSocket updates feel slow |
| Availability | One region down → the whole system down. Multi-region → failover |
| Data sovereignty | GDPR requires EU users' data to stay in the EU |
| User distribution | 100M DAU spread worldwide — a single region cannot serve them all |
4.10.2 Regional Architecture
flowchart TB subgraph "Region: US-East" US_LB["Load Balancer"] US_WS["WebSocket Servers<br/>60 servers"] US_Redis["Redis Cluster<br/>(Cache + Pub/Sub)"] end subgraph "Region: EU-West" EU_LB["Load Balancer"] EU_WS["WebSocket Servers<br/>50 servers"] EU_Redis["Redis Cluster<br/>(Cache + Pub/Sub)"] end subgraph "Region: AP-Southeast" AP_LB["Load Balancer"] AP_WS["WebSocket Servers<br/>90 servers"] AP_Redis["Redis Cluster<br/>(Cache + Pub/Sub)"] end subgraph "Cross-Region" Bridge["Cross-Region<br/>Message Bridge<br/>(for cross-region friend pairs)"] end US_Redis <-->|"Cross-region<br/>friend updates"| Bridge EU_Redis <-->|"Cross-region<br/>friend updates"| Bridge AP_Redis <-->|"Cross-region<br/>friend updates"| Bridge style Bridge fill:#ff9800,color:#000 style US_Redis fill:#ef5350,color:#fff style EU_Redis fill:#ef5350,color:#fff style AP_Redis fill:#ef5350,color:#fff
4.10.3 Cross-Region Friend Pairs
The problem: User A is in Vietnam (AP-Southeast), User B is in the US (US-East). They are friends and both have Nearby Friends enabled. How does A receive B's location?
| Approach | Description | Pros | Cons |
|---|---|---|---|
| Cross-region Pub/Sub bridge (chosen) | When A updates, publish in the AP region; a message bridge forwards it to the US region for B's subscribers | Regions stay isolated; the bridge carries only cross-region pairs | Extra latency (~100-200ms cross-region) |
| Global Pub/Sub | One Pub/Sub cluster serving the whole world | Simple | Single point of failure, high latency for remote users |
| Ignore cross-region | Only show friends in the same region | Simplest | Bad UX — friends abroad never appear |
Trade-off: cross-region friends receive location updates ~100-200ms later (the message must cross the inter-region network). But since nearby friends are usually in the same city (same region), the vast majority of updates are intra-region and very fast.
In practice: most friend pairs who both enable Nearby Friends are in the same area (who turns on Nearby Friends to watch a friend 10,000 km away?). Cross-region pairs are an edge case — higher latency there is acceptable.
4.10.4 Region Assignment
Users are assigned to the nearest region based on IP or GPS location:
| User location | Region | Notes |
|---|---|---|
| Vietnam, Thailand, Indonesia | AP-Southeast (Singapore) | RTT ~10-30ms |
| US, Canada | US-East or US-West | RTT ~10-50ms |
| Europe | EU-West (Ireland/Frankfurt) | RTT ~10-30ms |
| Japan, South Korea | AP-Northeast (Tokyo) | RTT ~10-20ms |
DNS-based routing (e.g., AWS Route 53 latency-based routing) automatically sends each user to the lowest-latency region.
5. Estimation — Deep Dive Numbers
5.1 WebSocket Server Capacity Planning
Add a 20% buffer for failures and maintenance:
Memory per server:
Adding the OS, the application, and Redis connections: ~4 GB total. A server with 8 GB of RAM is enough.
5.2 Redis Cluster Sizing
Location Cache (separate cluster):
One Redis instance (with replication) is enough.
Pub/Sub Cluster:
5.3 Network Bandwidth
Inbound (client → server):
Per server: — negligible.
Outbound (server → client):
Per server: — acceptable (servers typically have a 1-10 Gbps NIC).
Internal (server ↔ Redis):
Plus cache reads/writes:
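The elided figures in this section can be re-derived from numbers stated elsewhere in the document (10M concurrent users, one update per 30 s, ~40 online friends per user, 50K connections per server); the ~100-byte message size is an assumption for illustration. A rough back-of-the-envelope sketch:

```python
# Back-of-the-envelope capacity check, using the document's own figures.
# All values are rough estimates, not measurements.

concurrent_users = 10_000_000
update_interval_s = 30
online_friends_avg = 40
conn_per_server = 50_000
msg_size_bytes = 100          # assumed payload: {lat, lng, ts} plus framing

updates_per_s = concurrent_users // update_interval_s        # ~333K updates/s
fanout_msgs_per_s = updates_per_s * online_friends_avg       # ~13.3M messages/s

servers_needed = concurrent_users // conn_per_server         # 200 servers
servers_with_buffer = int(servers_needed * 1.2)              # 240 with 20% buffer

# Internal bandwidth: publishes into Pub/Sub plus fan-out deliveries
internal_Bps = (updates_per_s + fanout_msgs_per_s) * msg_size_bytes

print(updates_per_s)                          # 333333
print(fanout_msgs_per_s)                      # 13333320
print(servers_with_buffer)                    # 240
print(round(internal_Bps / 1e9, 2), "GB/s")   # 1.37 GB/s
```

The ~1.37 GB/s result is consistent with the ~1.4 GB/s internal-bandwidth row in the capacity summary.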
5.4 Capacity Summary
| Resource | Quantity | Spec |
|---|---|---|
| WebSocket servers | 240 | 8 GB RAM, 4 vCPU |
| Redis (Location Cache) | 3 (1 primary + 2 replicas) | 4 GB RAM |
| Redis (Pub/Sub shards) | 30 | 2 GB RAM each |
| Total Redis memory | ~67 GB | Cache (1 GB) + Pub/Sub (60 GB) |
| Network bandwidth (internal) | ~1.4 GB/s | 10 Gbps network |
| Database (User/Friends) | 3 (1 primary + 2 replicas) | Standard PostgreSQL |
6. Security — Protecting Location Privacy
6.1 Opt-in / Opt-out — Principle #1
| Principle | Implementation | Details |
|---|---|---|
| Default OFF | Nearby Friends is off by default | Users must enable it explicitly. Never turn it on automatically |
| Granular control | Let users choose who to share with | "Share with all friends" vs "Share with Close Friends list only" |
| Easy OFF | One tap to turn off | Not buried under Settings → Privacy → Location → Nearby Friends → Off |
| Auto OFF | Turns off automatically after a set time | Options: off after 1h, 4h, 8h — like WhatsApp Live Location |
| Visual indicator | Make sharing unmistakable | Status-bar icon, periodic reminder "You are sharing your location" |
6.2 Location Precision Control — Fuzzing
| Level | Precision | Use case | Implementation |
|---|---|---|---|
| Exact | ~10m (GPS accuracy) | Close friends, family | Send raw GPS coordinates |
| Approximate | ~500m | Regular friends | Round lat/lng to 2 decimal places |
| City-level | ~10km | Acquaintances | Send only the city name, no coordinates |
| Hidden | N/A | No sharing | Send no location; unsubscribe |
Implementation: server-side fuzzing. The client sends raw GPS; the server applies the precision level before publishing to Pub/Sub. The client never needs to know the logic — and cannot bypass it.
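A minimal server-side fuzzing sketch. The `PRECISION_DECIMALS` mapping and function name are illustrative, not from the source; as a rule of thumb, 1 decimal place of latitude is ~11 km, 2 is ~1.1 km, 3 is ~110 m.

```python
# Server-side location fuzzing: degrade precision before publishing.
# Decimal counts per level are assumptions for illustration.
PRECISION_DECIMALS = {
    "exact": None,        # pass raw coordinates through
    "approximate": 2,     # ~500m-scale rounding error
    "city": 1,            # very coarse; a real system might send a city name
}

def fuzz_location(lat: float, lng: float, level: str):
    """Return (lat, lng) rounded to the precision allowed for this level,
    or None if the level hides the location entirely."""
    if level == "hidden":
        return None
    decimals = PRECISION_DECIMALS[level]
    if decimals is None:
        return (lat, lng)
    return (round(lat, decimals), round(lng, decimals))

# Nguyen Hue, Ho Chi Minh City
print(fuzz_location(10.7769, 106.7009, "exact"))        # (10.7769, 106.7009)
print(fuzz_location(10.7769, 106.7009, "approximate"))  # (10.78, 106.7)
print(fuzz_location(10.7769, 106.7009, "hidden"))       # None
```

Because the rounding happens server-side just before the PUBLISH, a modified client cannot leak more precision than its friends are entitled to.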
6.3 Stalking Prevention
| Threat | Mitigation | Details |
|---|---|---|
| Continuous tracking | Rate-limit visibility | User X cannot view Y's location more than once per minute (client-side throttle) |
| Location-history inference | Store no history | The server keeps only the latest location. No history API. The client shows real-time data only |
| Fake accounts | Verification | Require phone verification before Nearby Friends can be enabled |
| Harassment | Block + Report | Block → automatic two-way unsubscribe. Report → reviewed by the trust & safety team |
| Ghost Mode | Hide your location while still seeing friends | With Ghost Mode on, user A still subscribes to friends' channels (and sees them) but publishes no location — friends do not see A |
| Invisible to specific people | Per-friend setting | "Hide my location from Tuan" → unsubscribe Tuan from my channel |
6.4 Rate Limiting Location Updates
| Tier | Limit | Rationale |
|---|---|---|
| Per user | Min interval: 1 location update / 10 s | Stops clients from sending too often (battery drain, bandwidth) |
| Per user | Max 1 update / 30 s (expected) | Normal operation |
| Per server | 100K updates/s | Protects Redis from overload |
| Burst | 3 updates within 5 s (right after enabling) | Allows an initial burst to get a first position quickly |
See Tuan-09-Rate-Limiter for a Token Bucket implementation.
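A minimal token-bucket sketch matching the per-user limits above: a capacity of 3 covers the initial burst, and a refill of one token per 10 s enforces the minimum interval. Class and parameter names are illustrative.

```python
import time

class TokenBucket:
    """Per-user limiter: burst of 3 updates, then ~1 token every 10 s."""
    def __init__(self, capacity=3, refill_per_s=0.1, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_s = refill_per_s   # 0.1 token/s == 1 update / 10 s
        self.tokens = float(capacity)      # start full: allows the initial burst
        self.now = now                     # injectable clock, for testing
        self.last = now()

    def allow(self) -> bool:
        t = self.now()
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.refill_per_s)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Simulated clock: three updates pass immediately, the fourth is rejected,
# and one more is allowed after 10 simulated seconds.
clock = [0.0]
bucket = TokenBucket(now=lambda: clock[0])
print([bucket.allow() for _ in range(4)])  # [True, True, True, False]
clock[0] = 10.0
print(bucket.allow())                      # True
```

The per-server 100K updates/s cap would be a separate, coarser bucket in front of Redis.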
6.5 Data Retention — No Location History
| Data | Retention | Rationale |
|---|---|---|
| Current location (Redis) | TTL 120 s | Only the latest matters; deleted automatically |
| Location updates (logs) | Not stored | Privacy — there is no reason to keep them |
| Aggregated analytics | 90 days | "How many users enable Nearby Friends?" — contains no location data |
| Pub/Sub messages | 0 — fire and forget | Redis Pub/Sub does not persist |
GDPR Article 5(1)(e): personal data may be kept only as long as necessary for the purposes of processing. Nearby Friends needs only the current location — nothing needs to be retained.
6.6 GDPR and Data Protection Compliance
| Requirement | Implementation |
|---|---|
| Lawful basis | Consent (opt-in), not legitimate interest — location data is highly sensitive |
| Right to be forgotten | On a deletion request: clear the location cache, unsubscribe from all channels, delete account data |
| Data portability | Export the list of friends location was shared with (no location data, since none is stored) |
| Purpose limitation | Location is used only for the Nearby Friends feature — never for advertising, analytics, or third-party sharing |
| Data minimization | Collect only lat, lng, timestamp. No altitude, speed, or heading |
| Data Protection Impact Assessment (DPIA) | Mandatory — large-scale processing of location data |
| Data Processing Agreement | With the cloud provider (AWS/GCP) |
7. DevOps & Monitoring
7.1 Key Metrics — WebSocket Health
| Metric | Alert Threshold | Dashboard | Meaning |
|---|---|---|---|
| ws_connection_count (per server) | > 55K (110% of capacity) | Gauge per server | Server nearly full — scale out |
| ws_connection_count (total) | < 8M (80% of expected) | Single number | Possible problem — users failing to connect |
| ws_connection_churn_rate | > 5K/min per server | Time series | Many reconnections → network or server issue |
| ws_handshake_latency_p99 | > 500ms | Histogram | Slow connections → check the LB, check TLS |
| ws_message_send_errors | > 0.1% | Percentage | Messages failing to reach clients |
7.2 Key Metrics — Redis Pub/Sub
| Metric | Alert Threshold | Dashboard | Meaning |
|---|---|---|---|
| pubsub_channels_count | > 12M (120% of expected) | Gauge | More channels → more users online (good) or a channel leak (bad) |
| pubsub_messages_per_second | > 15M/s (>113% of expected) | Time series | Near capacity → add shards |
| pubsub_subscribers_per_channel (max) | > 1000 | Histogram | Popular user → may need fan-out rate limiting |
| redis_memory_used (per shard) | > 80% of max memory | Gauge | Increase memory or add shards |
| redis_connected_clients (per shard) | > 8K | Gauge | Many connections from WS servers |
7.3 Key Metrics — Location Propagation
| Metric | Alert Threshold | Dashboard | Meaning |
|---|---|---|---|
| location_propagation_latency_p50 | > 50ms | Histogram | Median — must stay < 50ms |
| location_propagation_latency_p99 | > 500ms | Histogram | Worst case — must stay < 500ms |
| location_propagation_latency_p999 | > 2s | Histogram | Extreme tail — investigate if > 2s |
| location_update_rate | < 200K/s (< 60% of expected) | Time series | Users not sending updates → client bug? |
| nearby_friends_count (avg per user) | N/A (informational) | Histogram | Average number of nearby friends a user sees |
7.4 Key Metrics — Business Health
| Metric | Alert Threshold | Dashboard | Meaning |
|---|---|---|---|
| feature_opt_in_rate | N/A (informational) | Percentage | Share of DAU with Nearby Friends enabled |
| avg_session_duration | N/A (informational) | Time series | How long users keep the feature open |
| error_rate_by_type | > 1% for any error type | Stacked bar | Error breakdown: auth, timeout, redis, etc. |
7.5 Alerting Rules
| Alert | Condition | Severity | Action |
|---|---|---|---|
| WebSocket server overloaded | connections > 55K for 5 min | P2 | Auto-scale, add servers |
| Redis Pub/Sub throughput high | > 90% capacity for 10 min | P1 | Add shards, page on-call |
| Location propagation slow | p99 > 1s for 5 min | P1 | Check Redis, check network |
| Connection churn spike | churn > 10K/min for 3 min | P2 | Check network, check DNS, check LB |
| Redis shard down | shard unreachable for 1 min | P1 | Auto-failover, page on-call |
| Cross-region bridge lag | > 5s for 3 min | P2 | Check inter-region network |
See Tuan-13-Monitoring-Observability for alerting best practices and runbook structure.
7.6 Deployment Strategy
| Component | Strategy | Rationale |
|---|---|---|
| WebSocket servers | Rolling update with graceful drain | Stateful connections — must drain before shutdown |
| Redis Pub/Sub | Add shards without downtime (resharding) | Must not be restarted — subscriptions would be lost |
| Redis Cache | Blue-green with replication | Fail over to a replica if the primary goes down |
| Configuration changes (radius, TTL) | Feature flags (LaunchDarkly/Unleash) | Change behavior without a deploy |
| New features | Canary 5% → 25% → 100% | Catch problems early |
7.7 Load Testing Considerations
| Scenario | Test method | Target |
|---|---|---|
| 50K WebSocket connections per server | Locust/k6 WebSocket load test | Verify server handles 50K connections |
| 333K location updates/s | Distributed load generators | Verify Redis Pub/Sub throughput |
| Fan-out explosion (user with 5000 friends) | Synthetic test with high-fan-out users | Verify no cascading failures |
| Server crash during peak | Kill 1 WS server, observe reconnection | Verify reconnection < 30s |
| Redis shard failure | Kill 1 Redis shard, observe failover | Verify auto-failover < 10s |
8. Diagrams
8.1 Complete Location Update Flow
```mermaid
flowchart TB
    subgraph "1. Client sends location"
        Client_A["User A Mobile App"]
        GPS["GPS Module<br/>lat: 10.7769<br/>lng: 106.7009"]
        GPS --> Client_A
    end
    subgraph "2. WebSocket Server receives"
        WS_A["WS Server 1<br/>(serving User A)"]
    end
    subgraph "3. Parallel writes"
        Redis_Cache["Redis Cache<br/>SET user:A:location<br/>{lat, lng, ts}<br/>TTL=120s"]
        Redis_PubSub["Redis Pub/Sub<br/>PUBLISH channel:user_A<br/>{lat, lng, ts}"]
    end
    subgraph "4. Fan-out to subscribers"
        WS_B["WS Server 2<br/>(serving User B)<br/>subscribed to channel:user_A"]
        WS_C["WS Server 3<br/>(serving User C)<br/>subscribed to channel:user_A"]
        WS_D["WS Server 1<br/>(serving User D)<br/>subscribed to channel:user_A"]
    end
    subgraph "5. Distance check + push"
        Check_B["B at 10.78, 106.71<br/>d = 1.2km < 5mi ✓"]
        Check_C["C at 10.90, 106.85<br/>d = 20km > 5mi ✗"]
        Check_D["D at 10.77, 106.70<br/>d = 0.1km < 5mi ✓"]
    end
    subgraph "6. Client receives"
        Client_B["User B sees:<br/>A is 1.2km away"]
        Client_D["User D sees:<br/>A is 0.1km away"]
    end
    Client_A -->|"WebSocket"| WS_A
    WS_A --> Redis_Cache
    WS_A --> Redis_PubSub
    Redis_PubSub --> WS_B
    Redis_PubSub --> WS_C
    Redis_PubSub --> WS_D
    WS_B --> Check_B
    WS_C --> Check_C
    WS_D --> Check_D
    Check_B -->|"Push"| Client_B
    Check_C -->|"Skip — out of radius"| X["(no push)"]
    Check_D -->|"Push"| Client_D
    style Redis_Cache fill:#ef5350,color:#fff
    style Redis_PubSub fill:#ff7043,color:#fff
    style Check_C fill:#e0e0e0,color:#999
    style X fill:#e0e0e0,color:#999
```
8.2 Redis Pub/Sub Fan-out Detail
```mermaid
flowchart LR
    subgraph "User A updates location"
        A_Update["User A<br/>location_update<br/>{lat, lng}"]
    end
    subgraph "Redis Pub/Sub Shard 3<br/>(channel:user_A hashed to shard 3)"
        Channel_A["channel:user_A<br/>Subscribers: B, C, D, E, F"]
    end
    subgraph "Fan-out (5 messages)"
        M1["→ WS Server 2 (for B)"]
        M2["→ WS Server 3 (for C)"]
        M3["→ WS Server 1 (for D)"]
        M4["→ WS Server 4 (for E)"]
        M5["→ WS Server 2 (for F)"]
    end
    A_Update -->|"PUBLISH"| Channel_A
    Channel_A --> M1
    Channel_A --> M2
    Channel_A --> M3
    Channel_A --> M4
    Channel_A --> M5
    style Channel_A fill:#ef5350,color:#fff
    style A_Update fill:#42a5f5,color:#fff
```
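The "channel hashed to shard" step can be sketched in a few lines. The hashing scheme (MD5 modulo shard count) is an illustrative assumption; any stable hash works, as long as every WS server computes the same mapping.

```python
import hashlib

NUM_SHARDS = 30  # the design's Pub/Sub shard count

def shard_for_channel(channel: str) -> int:
    """Hash-based sharding: deterministic, so publishers and subscribers
    independently map channel:user_A to the same Redis Pub/Sub shard
    with no coordination service."""
    digest = hashlib.md5(channel.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

shard = shard_for_channel("channel:user_A")
print(0 <= shard < NUM_SHARDS)                      # True
print(shard == shard_for_channel("channel:user_A")) # True: stable mapping
```

The document deliberately picks this simple modulo scheme over consistent hashing (see the architecture-decisions summary): shard count changes are rare, and determinism is what matters here.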
8.3 WebSocket Server Scaling
```mermaid
flowchart TB
    subgraph "Load Balancer Layer"
        LB["L7 Load Balancer<br/>Least Connections strategy<br/>WebSocket upgrade support"]
    end
    subgraph "WebSocket Server Fleet (240 servers)"
        subgraph "AZ-1 (80 servers)"
            WS1_1["WS 1<br/>48K conn"]
            WS1_2["WS 2<br/>50K conn"]
            WS1_N["WS ...<br/>49K conn"]
        end
        subgraph "AZ-2 (80 servers)"
            WS2_1["WS 81<br/>50K conn"]
            WS2_2["WS 82<br/>47K conn"]
            WS2_N["WS ...<br/>50K conn"]
        end
        subgraph "AZ-3 (80 servers)"
            WS3_1["WS 161<br/>49K conn"]
            WS3_2["WS 162<br/>50K conn"]
            WS3_N["WS ...<br/>48K conn"]
        end
    end
    subgraph "Auto Scaling"
        ASG["Auto Scaling Group<br/>Min: 200 | Max: 400<br/>Target: 80% connection capacity"]
    end
    LB --> WS1_1 & WS1_2 & WS1_N
    LB --> WS2_1 & WS2_2 & WS2_N
    LB --> WS3_1 & WS3_2 & WS3_N
    ASG -.->|"Scale out/in"| WS1_N & WS2_N & WS3_N
    style LB fill:#42a5f5,color:#fff
    style ASG fill:#ff9800,color:#000
```
Why 3 Availability Zones? If one AZ goes down, the two remaining AZs still hold 67% of the fleet. Combined with the 20% buffer (240 servers vs the 200 needed), the system survives a single-AZ failure.
8.4 Full System Architecture — Production Grade
```mermaid
flowchart TB
    subgraph "Client Layer"
        iOS["iOS App"]
        Android["Android App"]
    end
    subgraph "Edge Layer"
        CDN["CDN<br/>(static assets)"]
        DNS["DNS<br/>Latency-based routing"]
    end
    subgraph "Gateway Layer"
        ALB["Application Load Balancer<br/>WebSocket support<br/>TLS termination"]
        AuthZ["Auth Service<br/>JWT validation"]
    end
    subgraph "Application Layer"
        WS_Fleet["WebSocket Server Fleet<br/>240 servers across 3 AZs<br/>50K connections/server"]
    end
    subgraph "Data Layer"
        Redis_Cache2["Redis Cluster<br/>Location Cache<br/>3 nodes (1P+2R)"]
        Redis_PS["Redis Pub/Sub Cluster<br/>30 shards"]
    end
    subgraph "Supporting Services"
        User_Svc["User Service<br/>(Friends list)"]
        Notif_Svc["Notification Service<br/>(push notifications)"]
        Config_Svc["Config Service<br/>(feature flags)"]
    end
    subgraph "Storage"
        PG[("PostgreSQL<br/>Users, Friends<br/>1P + 2R")]
    end
    subgraph "Observability"
        Metrics["Prometheus + Grafana"]
        Logs["ELK Stack"]
        Traces["Jaeger/Zipkin"]
    end
    iOS & Android --> DNS
    DNS --> ALB
    ALB -->|"WS Upgrade"| AuthZ
    AuthZ -->|"Valid token"| WS_Fleet
    WS_Fleet --> Redis_Cache2
    WS_Fleet --> Redis_PS
    WS_Fleet --> User_Svc
    User_Svc --> PG
    WS_Fleet -.->|"User offline > 5min"| Notif_Svc
    WS_Fleet -.->|"Metrics"| Metrics
    WS_Fleet -.->|"Logs"| Logs
    WS_Fleet -.->|"Traces"| Traces
    Config_Svc -.->|"Feature flags"| WS_Fleet
    style ALB fill:#42a5f5,color:#fff
    style Redis_Cache2 fill:#ef5350,color:#fff
    style Redis_PS fill:#ff7043,color:#fff
    style PG fill:#66bb6a,color:#fff
    style Metrics fill:#ab47bc,color:#fff
```
9. Aha Moments & Pitfalls
9.1 Aha Moments — Key insights
Aha 1: Pub/Sub is far simpler than polling
The problem: how does User B learn that User A just updated their location?
The naive way: polling — every 30 seconds, B's server queries Redis for the locations of all ~40 online friends.
The Pub/Sub way: B subscribes to friends' channels. When A updates, B is notified automatically.
Pub/Sub cuts Redis read load from 13.3M/s to 333K/s — a 40x reduction. That is the power of event-driven architecture.
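The 40x figure follows directly from the design's own numbers; a quick sanity check:

```python
# Polling: every 30 s, each of 10M users' servers reads ~40 friend locations.
# Pub/Sub: every 30 s, each user produces a single publish.
concurrent_users = 10_000_000
online_friends_avg = 40
interval_s = 30

polling_reads_per_s = concurrent_users * online_friends_avg // interval_s
pubsub_writes_per_s = concurrent_users // interval_s

print(polling_reads_per_s)                          # 13333333  (~13.3M/s)
print(pubsub_writes_per_s)                          # 333333    (~333K/s)
print(polling_reads_per_s // pubsub_writes_per_s)   # 40
```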
Aha 2: WebSocket is stateful — and that's OK
Many backend devs fear stateful services because they are hard to scale. But for real-time features, stateful is mandatory — there is no other way to maintain a persistent bidirectional connection.
Lesson: not everything has to be stateless. Stateful services have their place — the key is knowing how to scale them (connection draining, least-connections LB, graceful shutdown).
Aha 3: Redis Pub/Sub doesn't persist — and that's a feature
If a location update is lost because Pub/Sub is fire-and-forget, no harm done — a fresh update arrives 30 seconds later. The "no persistence" property of Redis Pub/Sub turns from weakness into strength: no disk usage, no cleanup, no retention policy.
Compare: Kafka persists every message and needs offset management, disk space, and compaction. For Nearby Friends, all of that is unnecessary overhead.
Aha 4: Fan-out is the bottleneck, not latency
Each location update takes only ~10-15ms to propagate — extremely fast. The bottleneck is the number of messages to fan out: 13.3M/s. This is a throughput problem, not a latency problem.
Interview takeaway: when the interviewer asks "how would you optimize?", don't answer "reduce latency". Answer "reduce fan-out" or "increase the throughput of the Pub/Sub layer".
Aha 5: Server-side filtering saves significant bandwidth
Push only the friends within the radius instead of all friends. With ~40 online friends on average but only 5-10 within the radius:
Server-side distance calculation saves ~75% of outbound bandwidth. For mobile users on 4G/5G data caps, this matters a lot.
Aha 6: No geospatial index needed
Unlike the Proximity Service (which needs a geohash/quadtree to search 100M businesses), Nearby Friends only checks the distance between a user and ~40 friends. 40 Haversine calculations are trivial — no spatial index required.
The key difference: Proximity Service is "find strangers within a radius" (search over a large dataset). Nearby Friends is "check the positions of people you already know" (lookup of known entities). Search needs an index. Lookup does not.
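A self-contained sketch of that per-friend distance check. The 5 km radius, coordinates, and function names are illustrative:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two (lat, lng) points, in km."""
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lng2 - lng1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def friends_in_radius(me, friends, radius_km=5.0):
    """Brute-force filter — fine for ~40 friends, no spatial index needed."""
    return [fid for fid, (lat, lng) in friends.items()
            if haversine_km(me[0], me[1], lat, lng) <= radius_km]

me = (10.7769, 106.7009)           # Nguyen Hue, HCMC
friends = {
    "B": (10.78, 106.71),          # ~1 km away
    "C": (10.90, 106.85),          # ~20 km away
}
print(friends_in_radius(me, friends))   # ['B']
```

Even at 400 friends per user this is microseconds of CPU, which is why the design can skip geohash entirely.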
9.2 Pitfalls — Common mistakes
Pitfall 1: Using HTTP polling instead of WebSocket
HTTP polling is the biggest mistake for Nearby Friends. Every 30 seconds, 10M clients send an HTTP request → 333K QPS. Each request carries ~500 bytes of HTTP headers. And the server cannot push — clients can only pull.
Fix: WebSocket — persistent connection, bidirectional, minimal overhead (2-6 bytes per message).
Pitfall 2: Unhandled fan-out explosion
One user with 5000 friends, 500 of them online → 500 messages fanned out every 30 seconds. If 1000 such users update at the same time:
the spike can overload Redis Pub/Sub and affect the whole system.
Fix: rate-limit fan-out for users with many friends. Or batch: coalesce popular users' updates and publish them as one bundled message every 5-10 seconds instead of one message per update.
Pitfall 3: Mishandling WebSocket reconnection
Server restart, network blip, app backgrounded → the connection drops. If clients fail to reconnect, or all reconnect at once (thundering herd), the system can overload.
Fix: exponential backoff with jitter. Client retries at 1s + random(0-500ms), then 2s + random, 4s + random, … capped at 60s.
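The retry schedule above can be sketched directly; the function name and attempt cap are illustrative:

```python
import random

def backoff_delays(max_attempts=7, base=1.0, cap=60.0, jitter=0.5):
    """Exponential backoff with jitter: min(cap, base * 2^attempt) plus
    random(0..jitter) seconds, so reconnecting clients spread out and
    avoid a thundering herd. Returns the delay before each retry."""
    delays = []
    for attempt in range(max_attempts):
        delay = min(cap, base * (2 ** attempt)) + random.uniform(0, jitter)
        delays.append(delay)
    return delays

# Roughly 1s, 2s, 4s, 8s, 16s, 32s, 60s, each with up to 0.5s of jitter.
print([round(d, 2) for d in backoff_delays()])
```

The jitter term is the important part: without it, every client that dropped at the same instant would retry at the same instant too.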
Pitfall 4: Forgetting cleanup when a user goes offline
The user closes the app but the server never cleans up: the location cache entry lingers (TTL not yet expired), the Pub/Sub channel stays active, and friends still see the user as "online".
Fix: three layers of cleanup:
- WebSocket disconnect event → immediate cleanup
- Missed heartbeat → detect and clean up
- Redis TTL → safety net that removes stale data automatically
Pitfall 5: Storing location history "for later"
"We might build a feature on it someday" — and then a GDPR audit discovers you have been storing the location of 100M users every 30 seconds without consent.
Fix: don't store what you don't need. Nearby Friends needs only the current location. If location history is ever needed, design it as a separate feature with its own consent.
Pitfall 6: Ignoring the precision-vs-privacy trade-off
Sending raw GPS coordinates (10 decimal places, ~0.1mm precision) to all friends. Nobody needs to know your position to the nearest 0.1mm.
Fix: round coordinates to 3-4 decimal places (~10-100m precision). Accurate enough to show on a map; not accurate enough to tell which room of a building you are in.
Pitfall 7: Not testing fan-out at production scale
A dev test with 100 users, 10 friends each → fine. Production with 10M users, 400 friends each → Redis Pub/Sub overload.
Fix: load test with realistic numbers before launch. Simulate 10M connections, 333K updates/s, 13.3M fan-out messages/s.
10. Summary — Decision Framework
10.1 When to use the Pub/Sub pattern?
| Situation | Recommendation |
|---|---|
| Real-time updates to known recipients | Pub/Sub — subscribers are known in advance |
| High fan-out, but messages need not be durable | Redis Pub/Sub — fire-and-forget |
| Guaranteed delivery required | Kafka/RabbitMQ instead of Redis Pub/Sub |
| 1-to-1 messaging | Direct WebSocket, no Pub/Sub needed |
10.2 When to use WebSocket?
| Situation | Recommendation |
|---|---|
| Server must push data to the client | WebSocket |
| Client → server only | Plain HTTP is enough |
| Server → client only | SSE (Server-Sent Events) may suffice |
| Both directions + low latency | WebSocket — this is Nearby Friends |
| Low-frequency updates (every few minutes) | Long polling may suffice |
10.3 Architecture Decisions Summary
| Decision | Chosen | Alternative | Why |
|---|---|---|---|
| Communication protocol | WebSocket | HTTP Polling, SSE | Bidirectional, low overhead |
| Location propagation | Redis Pub/Sub | Kafka, RabbitMQ | Low latency, no persistence needed |
| Location storage | Redis (in-memory cache) | Database | Only the latest location matters; TTL auto-deletes |
| Distance calculation | Server-side Haversine | Client-side, Geohash pre-filter | Saves bandwidth; 40 calculations are trivial |
| Channel design | 1 channel per user | Per geohash, per friend pair | Granular, privacy-friendly |
| Scaling WebSocket | Least-connections LB | Consistent hashing | Simple, low reconnection cost |
| Scaling Pub/Sub | Hash-based sharding | Consistent hashing | Simple, deterministic |
| Multi-region | Regional clusters + cross-region bridge | Global cluster | Low latency locally; bridge for remote pairs |
11. Internal Links — Connections to other notes
| Topic | Link | Relevance |
|---|---|---|
| Proximity Service (businesses) | Case-Design-Proximity-Service | Static vs dynamic location comparison, geohash indexing |
| Chat System (WebSocket) | Tuan-17-Design-Chat-System | WebSocket management, connection lifecycle, message delivery |
| Consistent Hashing | Tuan-10-Consistent-Hashing | Shard assignment for Redis Pub/Sub, server routing |
| Load Balancer | Tuan-05-Load-Balancer | WebSocket-aware LB, least-connections strategy |
| Message Queue | Tuan-08-Message-Queue | Pub/Sub vs Message Queue trade-offs |
| Cache Strategy | Tuan-06-Cache-Strategy | Redis caching patterns, TTL strategy |
| Rate Limiter | Tuan-09-Rate-Limiter | Rate limiting location updates, fan-out throttling |
| Monitoring | Tuan-13-Monitoring-Observability | WebSocket metrics, Pub/Sub monitoring, alerting |
| Security | Tuan-14-AuthN-AuthZ-Security | JWT for WebSocket auth, privacy controls |
| Database Replication | Tuan-07-Database-Sharding-Replication | Redis replication, PostgreSQL for user data |
12. Interview Tips
12.1 Answer structure
| Step | Content | Time |
|---|---|---|
| 1. Clarify requirements | Ask about scale, features, constraints | 3-5 min |
| 2. High-level design | WebSocket + Redis Cache + Redis Pub/Sub | 5-7 min |
| 3. Deep dive | Pick 2-3 topics to go deep on (Pub/Sub fan-out, scaling WebSocket, privacy) | 15-20 min |
| 4. Wrap up | Trade-offs, alternatives, monitoring | 3-5 min |
12.2 Questions interviewers often ask
| Question | How to answer |
|---|---|
| "Why not HTTP polling?" | Bandwidth waste, server must push, bidirectional need |
| "How do you scale WebSocket?" | Least-connections LB, graceful drain, auto-scaling |
| "What if Redis Pub/Sub goes down?" | Graceful degradation — the nearby list freezes; the client shows "last updated X min ago" |
| "How do you handle popular users?" | Rate-limit fan-out, batch updates, reduce update frequency |
| "Privacy concerns?" | Opt-in, fuzzing, no history, GDPR, Ghost Mode |
| "Why not geohash?" | Only ~40 friends — brute-force Haversine is fast enough; no spatial index needed |
| "Alternatives to Redis Pub/Sub?" | Kafka (if durability is needed), a custom in-memory Pub/Sub (for ultra-low latency) |
12.3 Bonus Points
| Topic | Details | Impression |
|---|---|---|
| Battery optimization | Location updates every 30s instead of continuous tracking. Use the significant-location-change APIs on iOS/Android | Shows mobile awareness |
| Graceful degradation | Redis down → serve stale data, label it "approximate" | Shows resilience thinking |
| A/B testing | Test 30s vs 60s update intervals, test default radius values | Shows product thinking |
| Cost estimation | 240 WS servers ≈ $17K/month | Shows business awareness |
"Nearby Friends is not a hard algorithms problem — there is no geohash, no quadtree, no Dijkstra. It is hard because of real-time communication at scale: 10 million WebSocket connections, 333 thousand location updates per second, 13 million Pub/Sub messages per second. It is a problem of infrastructure and engineering trade-offs, not of mathematics."