Tuần Bonus: CRDTs — Conflict-free Replicated Data Types
“5 người cùng edit Google Docs trong 5 quốc gia. Internet ai cũng chập chờn. Nhưng văn bản cuối cùng luôn convergent — không ai mất câu nào, không ai phải resolve conflict thủ công. Magic? Không, đó là CRDT — và Figma, Notion, Apple Notes, Linear, Yjs đều dùng pattern này.”
Tags: system-design crdt distributed-systems consistency collaboration bonus Student: Hieu (Backend Dev → Architect) Prerequisite: Tuan-Bonus-Consistency-Models-Isolation · Tuan-Bonus-Consensus-Raft-Paxos Liên quan: Tuan-07-Database-Sharding-Replication · Tuan-20-Design-Key-Value-Store · Case-Design-Nearby-Friends
1. Context & Why
Analogy đời thường — 5 người ghi sổ chung
Hieu, tưởng tượng em và 4 người bạn cùng quản lý sổ tay tổng hợp doanh thu cửa hàng. Nguyên tắc:
- Không có “boss” quyết định ai đúng (no central authority)
- Mỗi người có bản sao sổ riêng
- Internet chập chờn → có lúc không liên lạc được
- Cuối ngày, tất cả phải có cùng tổng kết doanh thu
Có 2 cách:
Cách 1 — Lock & coordinate:
- Trước khi ghi, “raise hand” hỏi: “Ai đang ghi?”
- Đợi tất cả đồng ý → ghi → release
- Vấn đề: 1 người mất internet → cả nhóm pause
Cách 2 — CRDT pattern:
- Mỗi người có sổ riêng, ghi tự do, không hỏi ai
- Khi gặp nhau (sync), so sánh sổ và merge tự động
- Quy tắc merge có tính chất toán học đảm bảo: bất kể merge order nào, kết quả cuối cùng giống nhau
- Vấn đề: chỉ work với structures cụ thể (counter, set, register, sequence)
CRDT (Conflict-free Replicated Data Type) là cách 2. Nó là class các data structures được thiết kế sao cho merge tự động đảm bảo convergence.
Tại sao Backend Dev cần hiểu CRDT?
| Lý do | Ví dụ thực tế |
|---|---|
| Multi-master DB | Riak, Redis CRDT (Active-Active) — không có “primary” |
| Collaborative editing | Figma (multiplayer design), Notion (docs), Linear (issues) |
| Offline-first apps | Apple Notes, Bear, Obsidian Sync — edit offline, merge khi online |
| Mobile sync | iOS CloudKit, signal messenger, WhatsApp |
| Edge computing | Cloudflare Workers, Vercel — replicate state to edge |
| Decentralized apps | Crdts là foundation cho local-first software |
Key insight: CRDT trade-off là giảm expressiveness (chỉ support specific data types) để có automatic merge without coordination. Nếu app của em fit với CRDT shape → có thể eliminate toàn bộ consensus layer.
Tại sao Alex Xu không đi sâu vào CRDT?
Alex Xu vol 1+2 nhắc CRDTs ở mức 1 dòng (trong Chat System chapter). Nhưng CRDTs là active research với production application ngày càng nhiều. Figma, Notion, Apple Notes đều base trên CRDT — đây là gap critical cho hệ thống hiện đại.
Tham chiếu chính
- CRDT survey paper — Shapiro et al., A comprehensive study of Convergent and Commutative Replicated Data Types (2011) — https://hal.inria.fr/inria-00555588/document
- CRDT.tech — Tài liệu interactive — https://crdt.tech/
- Martin Kleppmann’s research — https://martin.kleppmann.com/papers.html (especially Automerge papers)
- Yjs documentation — https://docs.yjs.dev/
- Riak CRDTs — https://docs.riak.com/riak/kv/latest/developing/data-types/
2. Deep Dive — Khái niệm cốt lõi
2.1 Tại sao CRDT — bài toán
Scenario: Multi-master database với 3 replicas. User A update X=10 trên Replica 1. Cùng lúc User B update X=20 trên Replica 2. Khi sync, X = ?
Solutions truyền thống:
| Approach | Vấn đề |
|---|---|
| Last-Write-Wins (LWW) | Mất 1 update; clock drift gây bug |
| Quorum (Cassandra) | Vẫn cần coordinate trong quorum |
| Manual conflict resolution | Bad UX (cho user) |
| 2PC | Block on partition |
CRDT solution: Design data structure sao cho A ⊕ B = B ⊕ A và idempotent. Không có conflict — mọi state đều merge được.
2.2 Hai loại CRDT
2.2.1 State-based CRDT (CvRDT — Convergent)
Cơ chế: Mỗi replica maintain full state. Khi sync, replicas exchange state và merge.
Replica 1: state_1
Replica 2: state_2
Sync: state_merged = merge(state_1, state_2)
Yêu cầu (cho convergence):
- Commutative:
merge(a, b) = merge(b, a) - Associative:
merge(merge(a, b), c) = merge(a, merge(b, c)) - Idempotent:
merge(a, a) = a
3 properties này = “semilattice” — domain math trừu tượng nhưng quan trọng.
Pros:
- Đơn giản về reasoning
- Network có thể lossy (mất message OK, sẽ resync)
Cons:
- State có thể lớn (phải gửi full state)
- Có optimization: delta-state CRDT chỉ gửi delta
2.2.2 Operation-based CRDT (CmRDT — Commutative)
Cơ chế: Replicas exchange operations (deltas). Mỗi op apply locally + broadcast.
Replica 1: apply op1 → broadcast op1
Replica 2: receive op1 → apply op1
Yêu cầu:
- Reliable broadcast: Mọi op phải reach mọi replica
- Causal delivery: Op causally dependent phải arrive theo order
- Commutative: Concurrent ops commute
Pros:
- Smaller messages (chỉ ops, không state)
- Có thể compose ops (e.g., delta CRDTs)
Cons:
- Cần reliable + causal broadcast
- Dropped op = inconsistency
2.3 Foundational CRDTs
2.3.1 G-Counter (Grow-only Counter)
Use case: Page view counter, like count, distributed metric.
Constraint: Chỉ tăng, không giảm.
State: Vector [c_1, c_2, ..., c_N] với c_i = count từ replica i.
class GCounter:
def __init__(self, num_replicas, my_id):
self.counts = [0] * num_replicas
self.my_id = my_id
def increment(self, n=1):
self.counts[self.my_id] += n
def value(self):
return sum(self.counts)
def merge(self, other):
for i in range(len(self.counts)):
self.counts[i] = max(self.counts[i], other.counts[i])Tại sao max() đảm bảo convergence:
- Commutative: max(a, b) = max(b, a) ✓
- Associative: max(max(a,b), c) = max(a, max(b,c)) ✓
- Idempotent: max(a, a) = a ✓
Ví dụ:
R1: counts=[5, 0, 0], increment → [6, 0, 0]
R2: counts=[0, 3, 0], increment → [0, 4, 0]
Merge: [max(6,0), max(0,4), max(0,0)] = [6, 4, 0] → value = 10
2.3.2 PN-Counter (Positive-Negative Counter)
Use case: Counter cần increment + decrement (e.g., shopping cart quantity, balance).
Trick: Combine 2 G-Counters — one for increments (P), one for decrements (N).
class PNCounter:
def __init__(self, num_replicas, my_id):
self.p = GCounter(num_replicas, my_id)
self.n = GCounter(num_replicas, my_id)
def increment(self, k=1):
self.p.increment(k)
def decrement(self, k=1):
self.n.increment(k)
def value(self):
return self.p.value() - self.n.value()
def merge(self, other):
self.p.merge(other.p)
self.n.merge(other.n)2.3.3 LWW-Register (Last-Write-Wins Register)
Use case: Single value với occasional updates (e.g., user profile field).
State: (value, timestamp). Merge: keep value với timestamp lớn nhất.
class LWWRegister:
def __init__(self):
self.value = None
self.timestamp = 0 # Hybrid Logical Clock recommended
def write(self, value, timestamp):
if timestamp > self.timestamp:
self.value = value
self.timestamp = timestamp
def merge(self, other):
if other.timestamp > self.timestamp:
self.value = other.value
self.timestamp = other.timestampVấn đề: Tie-break khi 2 timestamps bằng nhau. Solutions:
- Tie-break by replica ID (deterministic)
- Use HLC (Hybrid Logical Clock) → unique per-replica
Limitation: LWW LOSES UPDATES. Nếu 2 user update đồng thời → 1 update bị “thua”. Không suitable cho collaborative editing.
2.3.4 OR-Set (Observed-Remove Set)
Use case: Distributed set với add + remove (e.g., shopping cart items, task list, friends list).
Vấn đề of naive set: Nếu A add X, B remove X concurrently → final state ?
Naive 2P-Set (two-phase set):
class TwoPhaseSet:
"""Add to A, remove to T (tombstones). Once removed, can't re-add."""
def __init__(self):
self.added = set()
self.removed = set() # tombstones
def contains(self, x):
return x in self.added and x not in self.removed
def add(self, x):
self.added.add(x)
def remove(self, x):
if x in self.added:
self.removed.add(x)
def merge(self, other):
self.added |= other.added
self.removed |= other.removedVấn đề: không thể re-add sau khi remove (forever tombstoned).
OR-Set giải quyết: Mỗi add tag với unique ID. Remove chỉ remove các unique IDs đã observe.
import uuid
class ORSet:
def __init__(self):
# element → set of unique add tags
self.adds = {} # {x: {tag1, tag2, ...}}
self.removes = {} # {x: {tag1, tag2, ...}} (tombstoned tags)
def contains(self, x):
active_tags = self.adds.get(x, set()) - self.removes.get(x, set())
return len(active_tags) > 0
def add(self, x):
tag = str(uuid.uuid4())
self.adds.setdefault(x, set()).add(tag)
def remove(self, x):
# Remove only the tags we've seen
tags = self.adds.get(x, set()).copy()
self.removes.setdefault(x, set()).update(tags)
def merge(self, other):
for x, tags in other.adds.items():
self.adds.setdefault(x, set()).update(tags)
for x, tags in other.removes.items():
self.removes.setdefault(x, set()).update(tags)Magic: Concurrent add and remove resolve correctly:
- A removes X (sees tag T1), B adds X concurrently (creates tag T2)
- A’s remove only tombstones T1
- After merge: adds = {T1, T2}, removes = {T1} → active = {T2} → X is in set (B’s add wins)
Use cases real-world:
- Cart items (add/remove products)
- Tag/label assignments
- Friend/follow lists
2.3.5 RGA (Replicated Growable Array) — for ordered text
Use case: Collaborative text editing (Google Docs, Figma).
Problem: Text = sequence of characters. Insert/delete at any position. 2 user concurrent edits → must converge to same final text.
Naive approach: Use string position (index 5). Bug: A inserts at position 5, B inserts at position 5 → after A’s insert, B’s “position 5” mean different thing.
RGA solution: Each character has unique ID + reference to previous character ID (linked list).
Initial: "ABC"
A:1 → B:2 → C:3
Replica 1 inserts X after B:
A:1 → B:2 → X:1.5 → C:3
Replica 2 inserts Y after B (concurrently):
A:1 → B:2 → Y:1.6 → C:3
Merge: A:1 → B:2 → [X:1.5, Y:1.6] → C:3
Tie-break by ID → "ABXYC" or "ABYXC" (deterministic)
Implementation: Each char = (char, unique_id, parent_id). Merge inserts both X and Y after B, sort by ID.
2.3.6 Yjs / Y-CRDT — Production text CRDT
Yjs (https://yjs.dev) là production CRDT library cho text + structured data, được dùng bởi:
- Figma (multiplayer design)
- Linear (issue tracking)
- Sanity (CMS)
- Notion (parts of)
- JupyterLab (RTC notebooks)
Y-Doc structure:
Y.Doc
├── Y.Text (collaborative string)
├── Y.Array (ordered list)
├── Y.Map (key-value)
├── Y.XmlElement (rich text DOM)
// Yjs example
import * as Y from 'yjs'
import { WebsocketProvider } from 'y-websocket'
const ydoc = new Y.Doc()
const provider = new WebsocketProvider('wss://demos.yjs.dev', 'my-room', ydoc)
const ytext = ydoc.getText('chapter-1')
// Local edit
ytext.insert(0, 'Hello, ')
ytext.insert(7, 'World!')
// Observe remote changes
ytext.observe(event => {
console.log('Text changed:', ytext.toString())
})Yjs underlying algorithm: YATA (Yet Another Transformation Approach) — variant of RGA.
Performance:
- 1M operations: ~100ms merge time
- Memory ~10x text size (with tombstones)
- Garbage collection để reduce overhead
2.3.7 Automerge — alternative to Yjs
Automerge (https://automerge.org/) — CRDT library by Martin Kleppmann.
Differences from Yjs:
- More structured (JSON-like documents)
- Pure JavaScript (Yjs has Rust core)
- Better for complex documents (Y.Doc trees)
- Slightly slower than Yjs
Use cases: Local-first software, offline-first mobile apps.
Tham chiếu:
- Local-first software manifesto — https://www.inkandswitch.com/local-first/
- Martin Kleppmann’s CRDT papers — https://martin.kleppmann.com/papers.html
2.4 CRDT Limitations
2.4.1 Tombstones — garbage problem
OR-Set, RGA, Yjs all use tombstones (remove markers). Khi delete, tombstone vẫn ở trong state forever (cần để reject re-deliver).
Result: State grows indefinitely.
Solutions:
- Causal stability: Khi tombstone đã propagate to all replicas → safe to delete
- Periodic GC: Yjs có GC mode (deletes tombstones after threshold)
- Trade-off: GC giảm size nhưng có thể lose history
2.4.2 Cannot enforce invariants
CRDT không thể: “Total balance ≥ 0”, “Inventory ≥ 0”, “Max 5 admins”.
Reason: Each replica makes decision independently. 2 replicas concurrent withdraw → mỗi replica thấy balance OK → both succeed → balance < 0.
Solutions:
- Use consensus (Raft) cho invariants
- Or escrow tokens: Pre-allocate budget per replica → each replica spend within budget
- Or bounded counter CRDT: research, ngày càng phổ biến
2.4.3 Causal context required
Operation-based CRDT requires causal delivery → cần vector clocks hoặc HLC.
Cost: Vector clock O(N) bytes per operation. Optimizations:
- Dotted version vector
- Compact vector clock (compress runs)
- HLC (Hybrid Logical Clock) — fixed size
2.5 CRDT vs Operational Transformation (OT)
OT là pre-CRDT approach cho collaborative editing (Google Docs uses OT).
| OT | CRDT | |
|---|---|---|
| Conflict resolution | Transform operations (complex) | Commutative ops (simpler) |
| Server requirement | Central server (relay + transform) | Optional (P2P possible) |
| Math foundation | Operational | Algebraic |
| Implementations | Google Docs, ShareJS | Yjs, Automerge, Riak |
| Complexity | Hard to implement correctly | Easier (compose primitives) |
| Performance | Better for centralized | Better for P2P |
Trend 2024-2026: New systems prefer CRDT (Figma, Linear). Legacy stays on OT (Google Docs).
2.6 Real-world implementations
2.6.1 Riak (Database with CRDT support)
# Riak CRDT data types: Counter, Set, Map, Register, Flag
import riak
client = riak.RiakClient()
bucket = client.bucket_type('counters').bucket('page_views')
counter = bucket.new('homepage')
counter.increment(1)
counter.store()
# Active-active across regions: each region writes locally, merges async2.6.2 Redis CRDT (Redis Enterprise)
Redis Enterprise có Active-Active CRDB với CRDT support cho:
- Strings, Counters (G-Counter, PN-Counter)
- Sets (OR-Set)
- Hashes (Map of LWW-Registers)
- Sorted Sets, Lists, Streams
Multi-region active-active without conflicts.
2.6.3 Figma’s multiplayer
Figma uses CRDT cho live collaboration (10+ designers same file):
- Each shape has unique ID
- Properties (color, position) are LWW-Registers with HLC
- Tree structure (parent-child) uses RGA-like ordering
- Server is relay + state snapshot, not coordinator
Tham chiếu: Figma engineering blog — https://www.figma.com/blog/how-figmas-multiplayer-technology-works/
2.6.4 Notion’s blocks
Notion uses CRDT cho block-level editing:
- Each block has UUID
- Block tree = ordered list (RGA)
- Block content = Y.Text
- Sync via WebSocket + persistent server state
2.6.5 Apple Notes
iOS Notes uses CRDT cho cross-device sync:
- Edit offline (CloudKit cache)
- Sync when online → automatic merge
- No “conflict resolution dialog” (unlike old Dropbox)
3. Estimation — CRDT Overhead
3.1 Memory overhead
| CRDT | Memory ratio (vs raw data) |
|---|---|
| G-Counter (5 replicas) | 5x (vector of N ints) |
| OR-Set | 2-3x (each element + tags) |
| LWW-Register | 1.1x (value + timestamp) |
| Y.Text | 5-10x (with tombstones), 2-3x with GC |
| Automerge document | 5-15x |
Example: 100KB text document → Y.Text uses ~500KB memory (5x).
3.2 Network bandwidth
State-based: Send full state on sync (large) Operation-based: Send only ops (small per-op, ~50-200 bytes) Delta-state: Hybrid (send deltas of state, not full)
Yjs sync protocol:
- State vector exchange: ~100 bytes
- Update message: ~40 bytes/op + content
- Compression (LZ4): 2-5x reduction
Bandwidth for 5 collaborators editing Y.Text:
- ~1 KB/s/collaborator at active editing
- Negligible during idle
3.3 Convergence time
For N replicas in fully-connected network:
- Gossip protocol: O(log N) rounds for full convergence
- WebSocket-based (Yjs): Sub-second for small docs, seconds for large
- P2P (libp2p): variable, depends on topology
3.4 Garbage collection cost
Y.Text with 1M edit operations:
- Without GC: 100MB+ memory, slow
- With GC after 10K ops: 10MB memory, fast
- GC pause: ~50-100ms per collection
4. Security First — CRDT Threats
4.1 Trust model
Default CRDT assumes all replicas are non-Byzantine (don’t lie). Attacker compromise 1 replica → can:
- Inject fake operations with backdated timestamps (LWW exploit)
- Generate huge tombstones to DoS storage
- Resurrect deleted data
Mitigation:
- Authenticated CRDT operations: HMAC sign each op với key per-user
- Op validation: Reject ops with timestamps > now() + drift
- Quotas: Limit per-user operations/min
4.2 LWW timestamp attacks
Attack: Attacker writes ("malicious_value", 9999999999) → wins LWW forever.
Mitigation:
- Use server-issued timestamps (not client-generated)
- HLC (Hybrid Logical Clock) bounded by server time
- Reject timestamps > now() + reasonable drift (5 minutes)
4.3 Tombstone DoS
Attack: Attacker spam add/remove pairs → tombstones explode.
Mitigation:
- Per-user op rate limits
- GC tombstones aggressively
- Monitor state size; alert if grow unexpectedly
4.4 Replay attacks
Attack: Capture op, replay later → cause unintended state.
Mitigation:
- Each op has unique nonce
- Replicas track seen nonces (per-user, per-time-window)
- HMAC signature includes timestamp → bounded replay window
4.5 Encryption
End-to-end encrypted CRDTs:
- Difficult — server can’t validate ops without seeing content
- Solutions: ZK-CRDT (research), trust client validation
- Y.js has experiment with encrypted updates
5. DevOps — Vận hành CRDT systems
5.1 Yjs server với y-websocket
// server.js
const WebSocket = require('ws')
const http = require('http')
const { setupWSConnection } = require('y-websocket/bin/utils')
const server = http.createServer()
const wss = new WebSocket.Server({ server })
wss.on('connection', (conn, req) => {
setupWSConnection(conn, req, {
docName: req.url.slice(1).split('?')[0],
gc: true // Enable garbage collection
})
})
server.listen(1234, () => {
console.log('Yjs WebSocket server on port 1234')
})Persistence with LevelDB:
const { LeveldbPersistence } = require('y-leveldb')
const ldb = new LeveldbPersistence('./yjs-data')
setupWSConnection(conn, req, {
docName: 'mydoc',
persistence: ldb
})5.2 Production deployment
# docker-compose.yml
version: "3.8"
services:
yjs-server:
build: .
ports:
- "1234:1234"
volumes:
- yjs-data:/app/yjs-data
environment:
NODE_ENV: production
deploy:
replicas: 3 # multiple instances
restart_policy:
condition: on-failure
redis:
image: redis:7
# Use Redis pub/sub for cross-instance sync
ports:
- "6379:6379"
volumes:
yjs-data:Multi-instance challenge: y-websocket per-instance state. Need:
- Sticky sessions (load balancer route same doc to same instance), OR
- Redis pub/sub to broadcast updates between instances, OR
- CRDT-as-Service (e.g., Cloudflare Durable Objects)
5.3 Monitoring
Key metrics:
- Active connections per doc
- State size per doc (memory)
- Update frequency (ops/sec/doc)
- GC efficiency (tombstones cleaned)
- Sync latency (op → broadcast → ack)
Prometheus alerts:
- alert: YjsHighMemoryPerDoc
expr: yjs_doc_state_size_bytes > 50000000 # 50 MB
annotations:
summary: "Doc {{ $labels.doc }} state is {{ $value | humanize1024 }}"
- alert: YjsHighOpRate
expr: rate(yjs_ops_total[5m]) > 10000
annotations:
summary: "Suspicious op rate {{ $value }} ops/s — possible attack"
- alert: YjsTombstoneAccumulation
expr: yjs_tombstone_count > 100000
annotations:
summary: "Run GC: {{ $value }} tombstones in {{ $labels.doc }}"5.4 Backup & Recovery
State snapshots: Periodically dump full state.
// Backup
const state = Y.encodeStateAsUpdate(ydoc)
fs.writeFileSync('backup.bin', state)
// Restore
const state = fs.readFileSync('backup.bin')
const newDoc = new Y.Doc()
Y.applyUpdate(newDoc, state)Frequency: Hourly snapshots + continuous WAL.
5.5 Sharding
Một Yjs doc lớn → load balance via:
- Per-doc sharding: hash(doc_id) → instance
- Sticky sessions: ensure same doc on same instance
- Hot doc replication: replicate hot docs to N instances
6. Code Implementation
6.1 G-Counter từ scratch
"""
G-Counter (state-based CRDT) — Python implementation
"""
class GCounter:
def __init__(self, num_replicas: int, my_id: int):
if my_id < 0 or my_id >= num_replicas:
raise ValueError("my_id out of range")
self.counts = [0] * num_replicas
self.my_id = my_id
self.num_replicas = num_replicas
def increment(self, n: int = 1):
if n < 0:
raise ValueError("G-Counter only supports increment")
self.counts[self.my_id] += n
def value(self) -> int:
return sum(self.counts)
def merge(self, other: "GCounter"):
if self.num_replicas != other.num_replicas:
raise ValueError("Replica count mismatch")
for i in range(self.num_replicas):
self.counts[i] = max(self.counts[i], other.counts[i])
def __repr__(self):
return f"GCounter({self.counts})"
# Demo: 3 replicas, partition then heal
def demo_gcounter():
print("=== G-Counter Demo ===")
r1 = GCounter(3, 0)
r2 = GCounter(3, 1)
r3 = GCounter(3, 2)
# Phase 1: each replica increments independently
r1.increment(5)
r2.increment(3)
r3.increment(7)
print(f"R1: {r1}, value={r1.value()}")
print(f"R2: {r2}, value={r2.value()}")
print(f"R3: {r3}, value={r3.value()}")
# Phase 2: gossip merge (any order)
r1.merge(r2)
r3.merge(r1)
r2.merge(r3)
r1.merge(r3)
# All should converge
print(f"\nAfter sync:")
print(f"R1: {r1}, value={r1.value()}")
print(f"R2: {r2}, value={r2.value()}")
print(f"R3: {r3}, value={r3.value()}")
assert r1.value() == r2.value() == r3.value() == 15
print("✓ Converged to 15")
demo_gcounter()6.2 OR-Set implementation
"""
OR-Set (Observed-Remove Set) — supports concurrent add+remove correctly.
"""
import uuid
from typing import Any
class ORSet:
def __init__(self):
# element → set of unique tags (each add creates new tag)
self._adds: dict[Any, set[str]] = {}
# element → set of removed tags
self._removes: dict[Any, set[str]] = {}
def add(self, x):
tag = str(uuid.uuid4())
self._adds.setdefault(x, set()).add(tag)
def remove(self, x):
# Tombstone all currently observed tags
observed = self._adds.get(x, set()).copy()
if observed:
self._removes.setdefault(x, set()).update(observed)
def contains(self, x) -> bool:
active_tags = self._adds.get(x, set()) - self._removes.get(x, set())
return len(active_tags) > 0
def elements(self) -> set:
return {x for x in self._adds if self.contains(x)}
def merge(self, other: "ORSet"):
for x, tags in other._adds.items():
self._adds.setdefault(x, set()).update(tags)
for x, tags in other._removes.items():
self._removes.setdefault(x, set()).update(tags)
# Demo: concurrent add + remove
def demo_orset():
print("\n=== OR-Set Demo ===")
# Initial: A and B both have empty set
a = ORSet()
b = ORSet()
# Both add "milk" (different tags!)
a.add("milk")
b.add("milk")
print(f"A: {a.elements()}")
print(f"B: {b.elements()}")
# Concurrent: A removes milk, B adds bread + milk again
a.remove("milk")
b.add("bread")
b.add("milk") # Re-add with new tag
print(f"\nBefore merge:")
print(f"A: {a.elements()}")
print(f"B: {b.elements()}")
# Merge: A's remove only tombstones tags it observed
# B's later add creates new unobserved tag → wins
a.merge(b)
b.merge(a)
print(f"\nAfter merge:")
print(f"A: {a.elements()}")
print(f"B: {b.elements()}")
# Both contain {"milk", "bread"} — A's remove was overridden by B's later add
assert "milk" in a.elements() and "bread" in a.elements()
print("✓ Converged with re-added 'milk' winning")
demo_orset()6.3 LWW-Register với HLC
"""
LWW-Register with Hybrid Logical Clock (HLC) for safer timestamps.
"""
import time
from dataclasses import dataclass
@dataclass
class HLC:
"""Hybrid Logical Clock — physical + logical components."""
physical: int # ms
logical: int
def __lt__(self, other):
if self.physical != other.physical:
return self.physical < other.physical
return self.logical < other.logical
class HLCClock:
def __init__(self):
self.last = HLC(int(time.time() * 1000), 0)
def now(self) -> HLC:
wall = int(time.time() * 1000)
if wall > self.last.physical:
self.last = HLC(wall, 0)
else:
self.last = HLC(self.last.physical, self.last.logical + 1)
return HLC(self.last.physical, self.last.logical)
def update(self, remote: HLC) -> HLC:
wall = int(time.time() * 1000)
max_phys = max(wall, self.last.physical, remote.physical)
if max_phys == self.last.physical == remote.physical:
new_logical = max(self.last.logical, remote.logical) + 1
elif max_phys == self.last.physical:
new_logical = self.last.logical + 1
elif max_phys == remote.physical:
new_logical = remote.logical + 1
else:
new_logical = 0
self.last = HLC(max_phys, new_logical)
return HLC(self.last.physical, self.last.logical)
class LWWRegister:
def __init__(self, replica_id: str, clock: HLCClock):
self.value = None
self.timestamp: HLC = HLC(0, 0)
self.replica_id = replica_id
self.clock = clock
def write(self, value):
self.value = value
self.timestamp = self.clock.now()
def merge(self, other: "LWWRegister"):
if other.timestamp > self.timestamp or (
other.timestamp == self.timestamp
and other.replica_id > self.replica_id # tie-break
):
self.value = other.value
self.timestamp = other.timestamp
self.clock.update(other.timestamp)
def demo_lww():
print("\n=== LWW-Register Demo ===")
clock_a = HLCClock()
clock_b = HLCClock()
a = LWWRegister("A", clock_a)
b = LWWRegister("B", clock_b)
a.write("Alice's value")
time.sleep(0.001)
b.write("Bob's value")
print(f"A: value={a.value}, ts={a.timestamp}")
print(f"B: value={b.value}, ts={b.timestamp}")
a.merge(b)
b.merge(a)
print(f"\nAfter merge:")
print(f"A: {a.value}")
print(f"B: {b.value}")
assert a.value == b.value
print("✓ Converged")
demo_lww()6.4 Yjs với React (Frontend)
import * as Y from 'yjs'
import { WebsocketProvider } from 'y-websocket'
import { useEffect, useState } from 'react'
function CollaborativeEditor() {
const [text, setText] = useState('')
const [ydoc] = useState(() => new Y.Doc())
const [ytext] = useState(() => ydoc.getText('content'))
useEffect(() => {
const provider = new WebsocketProvider('wss://server.example.com', 'doc-id', ydoc)
const observer = () => setText(ytext.toString())
ytext.observe(observer)
return () => {
ytext.unobserve(observer)
provider.destroy()
}
}, [])
const handleChange = (e) => {
const newText = e.target.value
// Calculate diff and apply to Y.Text
ydoc.transact(() => {
ytext.delete(0, ytext.length)
ytext.insert(0, newText)
})
}
return (
<textarea
value={text}
onChange={handleChange}
style={{ width: '100%', height: '400px' }}
/>
)
}7. System Design Diagrams
7.1 CRDT vs Consensus
flowchart LR subgraph Consensus["Consensus (Raft)"] C1[Leader] -.write.-> C2[Follower 1] C1 -.write.-> C3[Follower 2] C4[Client] --> C1 Note1["Cần leader<br/>Latency = quorum RTT<br/>Strong consistency"] end subgraph CRDT["CRDT (Eventually Convergent)"] D1[Replica 1] <-->|gossip| D2[Replica 2] D2 <-->|gossip| D3[Replica 3] D1 <-->|gossip| D3 D4[Client] --> D1 Note2["No leader<br/>Local writes immediate<br/>Eventual consistency"] end style Note1 fill:#fff9c4 style Note2 fill:#c8e6c9
7.2 OR-Set Concurrent Add + Remove
sequenceDiagram participant A as Replica A participant B as Replica B Note over A,B: Initial: both have {} (empty) A->>A: add("milk")<br/>tag=t1 B->>B: add("milk")<br/>tag=t2 Note over A: state.adds = {milk: {t1}} Note over B: state.adds = {milk: {t2}} A->>A: remove("milk") Note over A: state.removes = {milk: {t1}}<br/>(only tombstones t1, doesn't see t2) B->>B: add("bread")<br/>tag=t3 Note over B: state.adds = {milk: {t2}, bread: {t3}} A->B: sync B->A: sync Note over A,B: After merge:<br/>adds = {milk: {t1, t2}, bread: {t3}}<br/>removes = {milk: {t1}}<br/>active = {milk: {t2}, bread: {t3}}<br/>→ {milk, bread}
7.3 Yjs Architecture
flowchart TB subgraph Client1["Client 1 (Browser)"] Y1[Y.Doc] Y1text[Y.Text 'content'] Y1 --> Y1text end subgraph Client2["Client 2 (Browser)"] Y2[Y.Doc] Y2text[Y.Text 'content'] Y2 --> Y2text end subgraph Server["Yjs WebSocket Server"] WS[y-websocket relay] Persistence[(LevelDB / Redis)] end Client1 <-->|"WebSocket<br/>updates"| WS Client2 <-->|"WebSocket<br/>updates"| WS WS --> Persistence Note["Server is RELAY only<br/>Not a coordinator<br/>State converges via CRDT merge"] style Note fill:#fff9c4
7.4 G-Counter Convergence
flowchart TD subgraph Initial["Initial State"] R1a["R1: [0,0,0]"] R2a["R2: [0,0,0]"] R3a["R3: [0,0,0]"] end subgraph Independent["Independent increments"] R1b["R1: [5,0,0]<br/>+5"] R2b["R2: [0,3,0]<br/>+3"] R3b["R3: [0,0,7]<br/>+7"] end subgraph Sync1["Sync R1↔R2"] R1c["R1: [5,3,0]<br/>value=8"] R2c["R2: [5,3,0]<br/>value=8"] end subgraph Sync2["Sync all"] R1d["R1: [5,3,7]<br/>value=15 ✓"] R2d["R2: [5,3,7]<br/>value=15 ✓"] R3d["R3: [5,3,7]<br/>value=15 ✓"] end Initial --> Independent Independent --> Sync1 Sync1 --> Sync2 style R1d fill:#c8e6c9 style R2d fill:#c8e6c9 style R3d fill:#c8e6c9
8. Aha Moments & Pitfalls
Aha Moments
#1: CRDT trade off expressiveness for automatic merge. Bạn không thể CRDT mọi thứ — chỉ specific shapes (counter, set, register, sequence). Nhưng cho những shape đó, bạn được automatic merge mà không cần coordination.
#2: 3 properties (commutative, associative, idempotent) là magic. Với 3 properties này, bất kể order nào em merge, kết quả đều giống. Đây là math foundation của CRDTs.
#3: Tombstones là cái giá phải trả. Để support remove + concurrent re-add, OR-Set tích lũy tombstones. State grows. GC giúp nhưng có trade-off (lose history).
#4: CRDT không thay thế consensus. CRDT cho convergence, không cho invariants. Cần “balance ≥ 0” → vẫn cần Raft. Nhưng cho “shopping cart items”, “page view counter”, “collaborative text” → CRDT > Raft (no leader, no coordination).
#5: Server có thể là pure relay. Yjs server không “decide” anything — chỉ relay updates và persist state. CRDT logic ở client. Khác với traditional server where server arbitrates.
#6: Local-first = CRDT-first. Mọi local-first software (Apple Notes, Obsidian Sync, Linear) underlie CRDTs. Edit offline, sync khi online → automatic merge, no conflict dialogs.
#7: HLC > LWW timestamps. Hybrid Logical Clock fixed size (16 bytes), gần với wall clock (debug dễ), nhưng đảm bảo causality. CockroachDB, YugabyteDB, MongoDB cluster time đều dùng HLC.
Pitfalls
Pitfall 1: Dùng CRDT cho invariant
Sai: Wallet balance là PN-Counter → 2 user concurrent withdraw → balance < 0. Đúng: Invariant cần consensus (Raft) hoặc escrow tokens.
Pitfall 2: LWW cho collaborative editing
Sai: Profile field = LWW, OK. Nhưng dùng LWW cho text → mất paragraph khi 2 user edit cùng lúc. Đúng: Text dùng RGA / Y.Text. LWW chỉ cho atomic single-value updates.
Pitfall 3: Quên garbage collection
Sai: Yjs document grow vô tận → 1 năm sau, 100MB cho 1KB visible text. Đúng: Enable GC trong y-websocket. Periodic state snapshot + clear history.
Pitfall 4: Client-generated timestamps
Sai: Client clock có thể skew → timestamp future → wins LWW forever. Đúng: Server-issued timestamps hoặc HLC bounded by server time.
Pitfall 5: 2P-Set (can’t re-add)
Sai: Use 2P-Set → “user removed friend X, can never add X back”. Đúng: Use OR-Set. 2P-Set chỉ cho true “irrevocable” state.
Pitfall 6: Partition + tombstone DoS
Sai: Replica isolated 30 ngày → return → state diff 100MB → bandwidth + memory crash. Đúng: Limit max state size. Forced GC. Anti-entropy throttling.
Pitfall 7: Vector clock scalability
Sai: 1000 replicas → vector clock 1000 entries per op → bandwidth explosion. Đúng: Use HLC (fixed size) hoặc dotted version vector. Hoặc shard CRDT to limit replica set.
Pitfall 8: Nghĩ Yjs/Automerge là silver bullet
Sai: Migrate everything to Yjs. Performance regress với large docs (>10MB). Đúng: CRDT phù hợp cho real-time collab, offline-first. Không phù hợp cho heavy mutation, large monolithic docs.
Pitfall 9: No backup
Sai: Yjs in-memory → server restart → lose everything. Đúng: Persistent layer (LevelDB, Redis, S3 snapshots).
Pitfall 10: Multi-instance không sync
Sai: 3 Yjs server instances behind LB → user A on instance 1, user B on instance 2 → not sync. Đúng: Sticky sessions OR Redis pub/sub broadcast OR Cloudflare Durable Objects (per-doc instance).
9. Internal Links
| Topic | Liên hệ |
|---|---|
| Tuan-Bonus-Consistency-Models-Isolation | CRDT = một dạng eventual consistency với automatic merge |
| Tuan-Bonus-Consensus-Raft-Paxos | Raft cho strong consistency; CRDT cho high availability |
| Tuan-07-Database-Sharding-Replication | Multi-master replication → CRDT khi không thể coordinate |
| Tuan-20-Design-Key-Value-Store | Dynamo dùng vector clocks; tương tự CRDTs |
| Tuan-17-Design-Chat-System | Chat presence, typing indicators có thể dùng CRDT |
| Case-Design-Nearby-Friends | Real-time location update — CRDT cho LWW position |
| Case-Design-Google-Drive | Collaborative editing → CRDT (Yjs/OT) |
Tham khảo
Foundational papers:
- Shapiro et al., A comprehensive study of Convergent and Commutative Replicated Data Types (Inria 2011) — https://hal.inria.fr/inria-00555588/document
- Bieniusa et al., An Optimized Conflict-free Replicated Set (2012) — OR-Set
- Kleppmann & Beresford, A Conflict-Free Replicated JSON Datatype (2017) — Automerge foundation
- Roh et al., Replicated abstract data types (2011) — CmRDT vs CvRDT formal
Books & resources:
- Roberto Vitillo, Understanding Distributed Systems — chapter on CRDTs
- Kleppmann, DDIA Ch.5 (Replication) — CRDTs section
- Designing data structures for collaborative apps (Matthew Weidner blog) — https://mattweidner.com/2022/02/10/collaborative-data-design.html
- crdt.tech — interactive tutorials — https://crdt.tech/
Production systems:
- Yjs documentation — https://docs.yjs.dev/
- Automerge docs — https://automerge.org/
- Riak data types — https://docs.riak.com/riak/kv/latest/developing/data-types/
- Redis Enterprise CRDB — https://redis.io/docs/latest/operate/rs/databases/active-active/
Engineering blogs:
- Figma — How Figma’s multiplayer technology works — https://www.figma.com/blog/how-figmas-multiplayer-technology-works/
- Linear — Real-time sync (Slug architecture) — https://linear.app/blog/scaling-the-linear-sync-engine
- Notion — Data model behind Notion’s flexibility — https://www.notion.so/blog/data-model-behind-notion
- Ink & Switch — Local-first software — https://www.inkandswitch.com/local-first/
- Cloudflare Durable Objects (CRDT-friendly platform) — https://blog.cloudflare.com/introducing-workers-durable-objects/
Courses & talks:
- Martin Kleppmann talks — CRDTs: The Hard Parts — https://martin.kleppmann.com/2020/07/06/crdt-hard-parts-hydra.html
- Strange Loop talks về CRDTs — YouTube
Hoàn thành toàn bộ Batch bonus chapters: Consensus → Consistency Models → Outbox → CRDTs.
Quay lại Tuan-20-Design-Key-Value-Store với hiểu biết sâu hơn về vector clock + CRDT để áp dụng vào DynamoDB-style design.