Bonus Week: Multi-Region Active-Active & Globally Distributed SQL
“In 2012, Google launched Spanner — distributed SQL with external consistency, using atomic clocks to synchronize time across data centers. In 2024, AWS launched Aurora DSQL — bringing that concept to the mass market with a 99.999% SLA. Together with CockroachDB, YugabyteDB, and TiDB, a new category has formed: globally distributed SQL with strong consistency.”
Tags: system-design multi-region distributed-sql aurora-dsql spanner cockroachdb disaster-recovery bonus
Student: Hieu (Backend Dev → Architect)
Prerequisite: Tuan-07-Database-Sharding-Replication · Tuan-Bonus-Consensus-Raft-Paxos · Tuan-Bonus-Consistency-Models-Isolation
Related: Case-Design-Payment-System · Case-Design-Stock-Exchange · Tuan-Bonus-Multi-Tenancy-SaaS-Patterns
1. Context & Why
Everyday analogy — a multinational bank
Hieu, imagine an international bank with branches in Hanoi, Tokyo, San Francisco, and London. Its customers:
- VIPs travel for work → must be able to withdraw money at any branch
- Balances must be accurate globally — a withdrawal in Tokyo must hit the Hanoi ledger immediately
- One branch burns down → the bank must keep operating normally
- Compliance: EU customers' data must stay in the EU (GDPR), US customers' data in the US
This is Multi-Region Active-Active — every region is active (accepts writes), data is globally consistent, and the system tolerates losing one region.
What it is not:
- Active-Passive: one primary region, the others on standby. Failover = downtime + data loss.
- Multi-region read replicas: writes go to a single region → cross-region write latency.
- Sharding by region: EU/US customers are isolated from each other, no cross-region transactions possible.
Why does a backend dev need to understand this?
| Reason | Consequence |
|---|---|
| Outage costs | AWS US-EAST-1 outages (2017, 2021, 2024) → every single-region app goes down |
| Compliance | GDPR data residency, China Cybersecurity Law, India DPDPA |
| Global latency | A user in Vietnam calling a US API: 300ms RTT → unacceptable for real-time |
| Disaster recovery | Earthquake, fire, ransomware → off-region backups are required |
| 2024-2026 distributed SQL maturity | Aurora DSQL (Dec 2024), Spanner GA, CockroachDB → no excuse not to use it |
Why doesn't Alex Xu cover this in depth?
Alex Xu Vol 1+2 (2020-2022) predate distributed SQL maturing for the mass market: CockroachDB Cloud GA in 2020, Spanner pricing reform in 2023, Aurora DSQL in Dec 2024. This is an evolution of the last 2-3 years.
Primary references
- Spanner paper (Google, 2012) — https://research.google/pubs/spanner-googles-globally-distributed-database-2/
- Aurora DSQL launch (re:Invent 2024) — https://aws.amazon.com/blogs/aws/introducing-amazon-aurora-dsql/
- CockroachDB tech blog — https://www.cockroachlabs.com/blog/
- AWS Multi-site Active/Active — https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/
- Calvin paper (deterministic distributed transactions) — http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf
2. Deep Dive — Khái niệm cốt lõi
2.1 Disaster Recovery Strategies — Spectrum
RPO (Recovery Point Objective)        RTO (Recovery Time Objective)
"How much data can we lose?"          "How long does recovery take?"
Backup/Restore ──────── hours ──────── hours-days CHEAP
Pilot Light ──────── minutes ──────── hours ↓
Warm Standby ──────── seconds ──────── minutes ↓
Multi-Site Active-Active ──── ~zero ──────── ~zero EXPENSIVE
| Strategy | RPO | RTO | Cost | Complexity |
|---|---|---|---|---|
| Backup/Restore | Hours | Hours-days | $ | Low |
| Pilot Light | Minutes | Hours | $$ | Medium |
| Warm Standby | Seconds | Minutes | $$$ | Medium |
| Active-Active | ~0 (sync rep) | ~0 (auto failover) | $$$$ | High |
When to choose which:
- Internal tools, dev/staging: Backup/Restore
- Customer-facing non-critical: Pilot Light or Warm Standby
- Revenue-critical (e-commerce, banking, payment): Active-Active
- Mission-critical (healthcare, aviation): Active-Active + chaos engineering
2.2 The Hard Problem — Why Multi-Region Active-Active is Hard
Light is slow: the speed of light is 300,000 km/s. US East to Asia ≈ 12,000 km → a ~40ms one-way physical minimum (and more in practice, since light in fiber travels at roughly 2/3 c).
Round trip latencies (typical):
Same DC: 0.5 ms
Same region: 2-5 ms
Cross-region (US): 50-80 ms
Cross-continent: 100-180 ms
The problem (for strong consistency):
- Synchronous cross-region replication: 100-200ms write latency → bad UX
- Async replication: risk of data loss if a region fails before it replicates
3 fundamental approaches:
- Async with conflict resolution (CRDT, LWW): available, weak consistency
- Sync with consensus (Raft, Paxos): consistent but slow
- TrueTime / atomic clocks: External consistency, fast (Spanner, DSQL)
2.3 Spanner — TrueTime External Consistency
Spanner (Google, 2012) was the first production globally distributed SQL database with strong consistency.
Key innovation: TrueTime
- Atomic clocks + GPS receivers in every data center
- Returns `TT.now()` = an `[earliest, latest]` interval (~7ms uncertainty)
- Guarantees: `TT.after(t)` returns true only after `t` has passed
Spanner commit protocol:
1. Acquire write timestamp T_commit ≥ TT.now().latest
2. Wait until TT.after(T_commit) = true (avg 7ms)
3. Apply commit
4. Reply to client
Result: External consistency
If T1 commits before T2 starts → T1.commit_ts < T2.commit_ts
Even across regions
External consistency = strongest possible: linearizability + serializability + real-time order across regions.
Cost: 7ms commit wait + 1-2 RTT cross-region for Paxos = ~30ms write latency multi-region.
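To make the commit-wait mechanics concrete, here is a toy Python simulation of the four-step protocol above. It fakes TrueTime with the local wall clock plus an assumed ~7ms uncertainty window — real TrueTime derives its bound from atomic clocks and GPS — so treat it as an illustration, not an implementation.

```python
import time

EPS = 0.0035  # assumed half-width of the uncertainty interval (~7ms window)

def tt_now() -> tuple[float, float]:
    """TrueTime-style answer: the true time lies within [earliest, latest]."""
    t = time.time()
    return (t - EPS, t + EPS)

def tt_after(t: float) -> bool:
    """True only once t has definitely passed on every clock."""
    return tt_now()[0] > t

def commit(apply_fn) -> float:
    _, latest = tt_now()
    t_commit = latest                 # 1. T_commit >= TT.now().latest
    while not tt_after(t_commit):     # 2. commit wait (~the uncertainty window)
        time.sleep(0.001)
    apply_fn()                        # 3. apply
    return t_commit                   # 4. reply to client with the timestamp

start = time.time()
commit(lambda: None)
print(f"commit wait was {(time.time() - start) * 1000:.1f} ms")
```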
2.4 Aurora DSQL — Spanner for AWS (2024)
Launched at AWS re:Invent 2024, GA Apr 2025.
Key features:
- Active-Active multi-region by default
- Strong consistency across regions (like Spanner)
- Amazon Time Sync Service (atomic clocks, free on EC2 since 2023!) — no manual setup
- PostgreSQL-compatible — drop-in replacement for many apps
- 99.999% SLA (5 minutes downtime/year)
- Serverless: scale to zero, no cluster management
- OCC (Optimistic Concurrency Control) — no pessimistic locks; conflicting transactions abort at commit and must be retried (see the sketch after this list)
- Disaggregated storage: a separate storage layer that scales independently
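Because DSQL is OCC, a write transaction can abort at COMMIT when it conflicts with a concurrent one, and the application is expected to retry. A minimal retry loop might look like the sketch below — the DSN is a placeholder, and it assumes the conflict surfaces as a standard PostgreSQL serialization error through psycopg:

```python
import time
import psycopg
from psycopg.errors import SerializationFailure

DSN = "host=example.dsql.us-east-1.on.aws dbname=postgres sslmode=require"  # placeholder

def transfer_with_retry(from_id: str, to_id: str, amount: int, max_retries: int = 5):
    """Retry on OCC conflict: one of two conflicting transactions aborts at commit."""
    for attempt in range(max_retries):
        try:
            with psycopg.connect(DSN) as conn:
                with conn.transaction():
                    with conn.cursor() as cur:
                        cur.execute(
                            "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                            (amount, from_id),
                        )
                        cur.execute(
                            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                            (amount, to_id),
                        )
            return  # committed
        except SerializationFailure:
            # Conflict detected at commit time — back off and retry
            time.sleep(0.05 * 2 ** attempt)
    raise RuntimeError("transaction kept conflicting after retries")

# transfer_with_retry("acc-1", "acc-2", 100)  # usage, once DSN points at a real cluster
```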
Architecture (high level):
┌────────────────────────────────────────────┐
│ Aurora DSQL (Region A) │
│ Compute: query routers (stateless) │
│ Storage: distributed log + KV store │
│ Consensus: across regions │
└──────────────────────┬─────────────────────┘
│ sync replication
│ (atomic clock-coordinated)
┌──────────────────────┴─────────────────────┐
│ Aurora DSQL (Region B) │
│ Same arch, write/read locally │
└────────────────────────────────────────────┘
Cost (2026 pricing):
- $4.00 / Distributed Processing Unit (DPU) hour
- $0.33 / GB-month storage
- Cheaper than Spanner for most workloads
Limitations:
- Currently no foreign keys and no triggers (among other missing PostgreSQL features)
- Max DB size 100 TB
- Limited to AWS regions
2.5 CockroachDB / YugabyteDB / TiDB
2.5.1 CockroachDB
- Origin: ex-Google engineers (2014) — modeled after Spanner
- HLC instead of atomic clocks: Hybrid Logical Clock with bounded skew
- Multi-active: every node accepts reads/writes
- Survival goals: configurable to survive zone or region failures
- Production: DoorDash, Comcast, Netflix, eBay
- Source-available (BSL license)
-- CockroachDB multi-region table
CREATE DATABASE myapp;
USE myapp;
-- Set survival goal (requires the database to have at least 3 regions added)
ALTER DATABASE myapp SURVIVE REGION FAILURE;
-- Region-aware table
CREATE TABLE users (
    id UUID DEFAULT gen_random_uuid(),
    region crdb_internal_region NOT NULL DEFAULT default_to_database_primary_region(gateway_region()),
    name TEXT,
    PRIMARY KEY (region, id)
)
LOCALITY REGIONAL BY ROW AS region;
-- Each row pinned to the user's home region for low-latency local reads
2.5.2 YugabyteDB
- Origin: ex-Facebook engineers (2017)
- YSQL (PostgreSQL-compatible) + YCQL (Cassandra-compatible)
- Raft per shard (tablet)
- Production: General Motors, Wells Fargo, Justuno
2.5.3 TiDB
- Origin: PingCAP (2015), China-focused initially
- MySQL-compatible
- HTAP (OLTP + OLAP via TiFlash columnar)
- Production: ByteDance, Pinterest, Square
2.6 Comparison Matrix
| Feature | Aurora DSQL | Spanner | CockroachDB | YugabyteDB | TiDB |
|---|---|---|---|---|---|
| Vendor | AWS | Google Cloud | OSS + Cloud | OSS + Cloud | OSS + Cloud |
| Time source | Amazon Time Sync (atomic) | TrueTime (atomic+GPS) | HLC | HLC | HLC |
| External consistency | ✅ | ✅ | ⚠️ (serializable, not strict serializable) | ✅ | ⚠️ |
| Multi-region active | ✅ Built-in | ✅ Built-in | ✅ Configurable | ✅ Configurable | ✅ Configurable |
| SQL dialect | PostgreSQL | Custom (Spanner SQL) | PostgreSQL | PostgreSQL | MySQL |
| HTAP | No | Yes (limited) | No | No | Yes |
| Self-hosted | No (AWS managed) | No (GCP managed) | Yes | Yes | Yes |
| Best for | AWS shop | Google shop, global apps | OSS preference | Polyglot (SQL + NoSQL) | MySQL migration |
2.7 Routing Strategies
The problem: a user in Vietnam — which region should they call?
2.7.1 DNS-based Routing (Route 53)
api.myapp.com → DNS query →
If user in Asia → returns IP of ap-southeast-1
If user in Europe → returns IP of eu-west-1
If user in Americas → returns IP of us-east-1
Latency-based routing: Route 53 returns the endpoint with the lowest latency. Geolocation routing: route by country.
Pros: simple, free. Cons: DNS TTL caching → slow failover (5-10 min)
2.7.2 AWS Global Accelerator (Anycast)
api.myapp.com → Single static IP
→ Anycast (BGP) routes to nearest edge
→ Edge connects to healthy regional endpoint
Pros:
- Sub-30s failover (no DNS cache)
- Better performance than DNS
- Single IP simpler for clients
Cons:
- $0.018/hour per accelerator
- Vendor-specific (AWS only)
2.7.3 CDN-level routing (Cloudflare, Fastly)
Request → Cloudflare edge (250+ locations)
→ Worker decides routing logic
→ Forwards to optimal regional backend
Pros:
- Edge logic (Workers, Compute@Edge)
- Best UX (closest to user)
- Built-in DDoS, WAF
Cons:
- Vendor lock-in
- Worker cold start consideration
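For intuition, the sketch below does on the client side what Route 53 latency-based routing does in DNS: probe each regional endpoint and pick the fastest. The hostnames are placeholders; real systems measure continuously and respect health checks rather than probing per request.

```python
import socket
import time

ENDPOINTS = {  # placeholder hostnames
    "us-east-1": "api-us-east.myapp.com",
    "eu-west-1": "api-eu-west.myapp.com",
    "ap-southeast-1": "api-ap-southeast.myapp.com",
}

def probe(host: str, port: int = 443, timeout: float = 1.0) -> float:
    """TCP connect time as a rough RTT proxy; inf if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return float("inf")

def pick_region() -> str:
    """Choose the region with the lowest measured connect latency."""
    latencies = {region: probe(host) for region, host in ENDPOINTS.items()}
    return min(latencies, key=latencies.get)

# print(pick_region())  # e.g. "ap-southeast-1" for a user in Vietnam
```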
2.8 Conflict Resolution
Active-Active needs a strategy for when concurrent writes happen.
2.8.1 Strong Consistency (Spanner-style)
- Sync consensus via atomic clocks
- No conflicts possible (all writes serialized globally)
- Cost: 30-100ms write latency
2.8.2 LWW (Last-Write-Wins)
- Each write tagged with HLC timestamp
- Higher timestamp wins
- Risk: Lose updates, clock skew bugs
- Use for: non-critical metadata (user preferences, timestamps)
2.8.3 CRDT-based (Riak, Redis CRDB)
- Mathematical merge guarantees convergence
- Use for: counters, sets, registers
- Reference: Tuan-Bonus-CRDTs-Conflict-Free-Data-Types
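A minimal example of the CRDT idea — a G-Counter, where each region only increments its own slot and merge takes the element-wise max, so replicas converge no matter the order in which they sync. A sketch only; production systems use Riak, Redis CRDB, and friends.

```python
class GCounter:
    """Grow-only counter CRDT: per-region slots, merge = element-wise max."""

    def __init__(self, region: str):
        self.region = region
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # max is commutative, associative, idempotent → guaranteed convergence
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

us, eu = GCounter("us-east-1"), GCounter("eu-west-1")
us.increment(3)
eu.increment(2)
us.merge(eu)
eu.merge(us)
assert us.value() == eu.value() == 5  # both replicas converge
```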
2.8.4 Application-level (Custom)
- App detects conflict, resolves with business logic
- Example: “merge two carts” = union of items (see the sketch after this list)
- Most flexible, most work
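A sketch of that cart merge, with the tie-break on overlapping items (keep the larger quantity) made explicit as a business rule — the rule itself is an assumption, not a universal answer:

```python
def merge_carts(cart_a: dict[str, int], cart_b: dict[str, int]) -> dict[str, int]:
    """Union of items; on overlap keep the max quantity (a business decision)."""
    merged = dict(cart_a)
    for item, qty in cart_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

# Two regions saw different updates during a partition:
print(merge_carts({"book": 1, "pen": 2}, {"pen": 1, "mug": 1}))
# {'book': 1, 'pen': 2, 'mug': 1}
```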
2.8.5 Comparison
| Strategy | Consistency | Latency | Use case |
|---|---|---|---|
| Sync consensus (Spanner) | Strong | High (~50ms) | Banking, critical |
| LWW | Eventual | Low | Metadata |
| CRDT | Eventual (convergent) | Low | Counters, sets |
| Custom | Varies | Medium | Domain-specific |
2.9 Split-Brain Prevention
Risk: network partition → two regions both believe they are primary → writes diverge → data corruption.
Mitigations:
2.9.1 Quorum-based (most common)
3 regions: A, B, C
Quorum = majority = 2
If A isolated from B+C:
A: minority (1) → cannot accept writes
B+C: majority (2) → accepts writes
When network heals:
A reconciles from B+C (replays missed transactions)
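The write-acceptance rule above reduces to a strict-majority check, as in this small sketch (region names are illustrative):

```python
REGIONS = {"us-east-1", "eu-west-1", "ap-southeast-1"}

def can_accept_writes(reachable: set[str]) -> bool:
    """True only for the majority side of a partition — prevents split brain."""
    return len(reachable & REGIONS) > len(REGIONS) // 2

assert can_accept_writes({"eu-west-1", "ap-southeast-1"})  # majority: B+C
assert not can_accept_writes({"us-east-1"})                # minority: A alone
```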
2.9.2 STONITH (Shoot The Other Node In The Head)
- Hardware-level fencing: powered off via management API
- Used in HA clusters (Pacemaker)
- Less common in cloud (use cloud APIs instead)
2.9.3 Lease-based
- Primary holds time-bounded lease
- Must renew or lose primacy
- If can’t communicate → step down automatically
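A minimal lease sketch: the primary keeps serving writes only while its lease is fresh, so a partitioned primary steps down by construction once renewals stop. Timings and the renewal mechanism are illustrative assumptions.

```python
import time

class LeaseHolder:
    """Time-bounded primacy: no renewal → automatic step-down."""

    def __init__(self, ttl: float = 5.0):
        self.ttl = ttl
        self.expires_at = 0.0

    def renew(self, granted: bool) -> None:
        """Called periodically; `granted` means the lease authority answered."""
        if granted:
            self.expires_at = time.monotonic() + self.ttl

    def is_primary(self) -> bool:
        # During a partition, renewals fail and the lease simply lapses
        return time.monotonic() < self.expires_at

holder = LeaseHolder(ttl=5.0)
holder.renew(granted=True)
assert holder.is_primary()
```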
2.10 Cost Considerations
Multi-region active-active is expensive. For a 100 GB DB, 1M tx/day, 3 regions:
| Cost component | Single-region | Active-Active 3-region |
|---|---|---|
| Compute (DB instances) | $500/month | $1,500/month (3x) |
| Storage | $50/month | $150/month (3 copies) |
| Cross-region data transfer | $0 | $200-500/month (replication) |
| Routing (Global Accelerator) | $0 | $50-200/month |
| Monitoring/observability | $50/month | $150/month |
| Total | $600/month | $2,050-2,500/month |
→ ~3-4x the cost. The ROI question: is ~$1,500/month extra worth the higher availability? It depends on the revenue impact of downtime.
Rule of thumb: multi-region is cost-effective when one hour of downtime costs more than one month of the additional spend. For most SaaS at $100K+/month revenue → worth it.
3. Estimation — Multi-Region Capacity
3.1 Replication bandwidth
Scenario: 1000 transactions/sec, average 5KB write per transaction, 3 regions full mesh.
Outbound from each region = 1000 × 5KB = 5 MB/s per replica peer
With 3 regions full mesh: 5 MB/s × 2 peers = 10 MB/s outbound per region
Cross-region bandwidth at AWS pricing ($0.02/GB):
10 MB/s × 86,400 s × 30 days = 25 TB/month per region
3 regions × 25 TB = 75 TB/month total
75 TB × $0.02/GB = $1,500/month for replication alone
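The same back-of-envelope as a reusable Python function. The $0.02/GB rate comes from the text above; actual AWS cross-region rates vary by region pair. (The results land slightly under the prose figures because this version uses binary units throughout.)

```python
def replication_cost(tps: int, write_kb: float, regions: int,
                     usd_per_gb: float = 0.02) -> dict:
    """Full-mesh replication bandwidth and transfer cost estimate."""
    peers = regions - 1                           # full mesh: send to every peer
    mb_per_s = tps * write_kb / 1024 * peers      # outbound per region
    tb_per_month = mb_per_s * 86_400 * 30 / 1_048_576
    usd_per_month = tb_per_month * regions * 1024 * usd_per_gb  # TB → GB → $
    return {
        "outbound_MBps_per_region": round(mb_per_s, 1),
        "TB_per_month_per_region": round(tb_per_month, 1),
        "total_replication_usd_per_month": round(usd_per_month),
    }

print(replication_cost(tps=1000, write_kb=5, regions=3))
# {'outbound_MBps_per_region': 9.8, 'TB_per_month_per_region': 24.1,
#  'total_replication_usd_per_month': 1483}
```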
3.2 Read latency budget
P95 latency target: 100ms for user actions.
Latency breakdown (US user → US region):
Network DNS/TLS: 20ms
App server processing: 30ms
DB query: 30ms
Network return: 20ms
Total: 100ms ✓
Cross-region penalty (US user → EU region):
DNS+TLS: 20ms (same)
App: 30ms
Cross-region DB read: 100ms (RTT)
Return: 20ms
Total: 170ms ✗ (over budget)
→ A read-local pattern is mandatory.
3.3 RTO/RPO targets
| Tier | RTO | RPO | Strategy |
|---|---|---|---|
| Tier 1 (payment) | < 1 min | 0 (zero loss) | Sync replication multi-region |
| Tier 2 (e-commerce) | < 5 min | < 1 min | Async replication + auto failover |
| Tier 3 (analytics) | < 1 hour | < 1 hour | Daily snapshots cross-region |
| Tier 4 (logs) | < 1 day | < 1 day | Backup to cold storage |
3.4 Failover testing budget
Game day pattern: Simulate region failure quarterly.
Cost per game day:
Engineer time: 5 engineers × 4 hours × $100/h = $2,000
Potential customer impact (production): $0 if done right
Tools (chaos engineering): included
Total: $2,000/quarter = $8,000/year
ROI: 1 prevented production outage = saves $50K-500K depending on scale.
4. Security First
4.1 Data residency & sovereignty
Compliance requirements:
- GDPR (EU): EU citizen data must reside in EU
- China Cybersecurity Law: Chinese data must reside in China
- India DPDPA: Critical personal data must reside in India
- HIPAA (US): PHI must follow specific guidelines
Implementation patterns:
-- CockroachDB multi-region with row-level locality
CREATE TABLE users (
    id UUID,
    home_region crdb_internal_region NOT NULL,
    pii_data JSONB,
    PRIMARY KEY (home_region, id)
)
LOCALITY REGIONAL BY ROW AS home_region;
-- EU users → eu-west-1, US users → us-east-1
-- Single SQL surface, but data physically separated
4.2 Cross-region encryption
Mandatory:
- TLS 1.3 for inter-region replication
- KMS keys per region (no single global key)
- Customer-managed keys (CMK) for compliance
# Terraform: per-region KMS
resource "aws_kms_key" "us_east" {
provider = aws.us-east-1
description = "Aurora DSQL encryption key US East"
}
resource "aws_kms_key" "eu_west" {
provider = aws.eu-west-1
description = "Aurora DSQL encryption key EU West"
}
4.3 IAM cross-account / cross-region
Principle of least privilege: Each region has separate IAM roles.
Region A app → IAM role A → Aurora DSQL A only
Region B app → IAM role B → Aurora DSQL B only
No app has cross-region admin access.
Replication uses dedicated service role with minimal scope.
4.4 Audit logging
Every cross-region transaction must be logged:
- Source region, destination region
- Transaction ID, timestamp
- User/role identity
- Data classification
Forward to centralized SIEM (Splunk, Datadog, Wazuh) for compliance audits.
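A minimal shape for such an audit event, covering the fields listed above. The field names and the plain `logging` setup are assumptions — adapt the emitter to your Splunk/Datadog/Wazuh pipeline.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("audit")

def audit_cross_region_tx(source_region: str, dest_region: str,
                          principal: str, data_class: str) -> None:
    """Emit one structured record per cross-region transaction."""
    audit_log.info(json.dumps({
        "event": "cross_region_tx",
        "tx_id": str(uuid.uuid4()),
        "ts": datetime.now(timezone.utc).isoformat(),
        "source_region": source_region,
        "dest_region": dest_region,
        "principal": principal,
        "data_classification": data_class,
    }))

audit_cross_region_tx("us-east-1", "eu-west-1", "role/app-replicator", "pii")
```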
4.5 Disaster recovery testing security
Game day must include:
- Verify failover doesn’t expose unauthorized data
- Confirm encryption keys valid in DR region
- Test access controls survive failover
- Validate audit log integrity
5. DevOps — Operating Multi-Region
5.1 Aurora DSQL setup (Terraform)
provider "aws" {
alias = "primary"
region = "us-east-1"
}
provider "aws" {
alias = "secondary"
region = "us-west-2"
}
# Primary cluster
resource "aws_dsql_cluster" "primary" {
provider = aws.primary
  multi_region_properties {
    # Witness must be a third region, distinct from both peered cluster regions
    witness_region = "us-west-1"
  }
tags = {
Name = "primary-us-east-1"
}
}
# Secondary cluster
resource "aws_dsql_cluster" "secondary" {
provider = aws.secondary
  multi_region_properties {
    # Same witness region as the primary cluster
    witness_region = "us-west-1"
  }
tags = {
Name = "secondary-us-west-2"
}
}
# Link clusters for active-active
resource "aws_dsql_cluster_peering" "main" {
provider = aws.primary
cluster_id = aws_dsql_cluster.primary.id
peer_cluster_arns = [aws_dsql_cluster.secondary.arn]
}
5.2 Application connection pattern
"""
Multi-region aware DB client with automatic failover.
"""
import os
import psycopg
from contextlib import contextmanager
class MultiRegionDB:
def __init__(self):
# Primary endpoint based on user region
self.endpoints = {
"us-east-1": "primary.dsql-cluster.amazonaws.com",
"us-west-2": "secondary.dsql-cluster.amazonaws.com",
"eu-west-1": "tertiary.dsql-cluster.amazonaws.com",
}
self.current_region = os.getenv("AWS_REGION", "us-east-1")
@contextmanager
def connection(self):
"""Try local region first, fall back to others."""
order = [self.current_region] + [
r for r in self.endpoints if r != self.current_region
]
        for region in order:
            try:
                conn = psycopg.connect(
                    host=self.endpoints[region],
                    user="app",
                    password=self._get_iam_token(region),
                    dbname="postgres",
                    sslmode="require",
                    connect_timeout=2,
                )
            except (psycopg.OperationalError, TimeoutError) as e:
                print(f"Failed connect {region}: {e}")
                continue
            try:
                # Only connection errors trigger failover; app errors propagate
                yield conn
            finally:
                conn.close()
            return
        raise RuntimeError("All regions unreachable")
def _get_iam_token(self, region):
# Aurora DSQL uses IAM auth tokens
import boto3
client = boto3.client("dsql", region_name=region)
return client.generate_db_connect_auth_token(
cluster_endpoint=self.endpoints[region]
)
db = MultiRegionDB()
order_id = "ord-123"  # example lookup key
with db.connection() as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM orders WHERE id = %s", (order_id,))
5.3 Health check & failover
# Route 53 health check + failover
resource "aws_route53_health_check" "primary" {
fqdn = "api-us-east.myapp.com"
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = "3"
request_interval = "30"
tags = {
Name = "primary-health"
}
}
resource "aws_route53_record" "api_primary" {
zone_id = var.zone_id
name = "api.myapp.com"
type = "A"
set_identifier = "primary"
failover_routing_policy {
type = "PRIMARY"
}
health_check_id = aws_route53_health_check.primary.id
alias {
name = aws_lb.primary.dns_name
zone_id = aws_lb.primary.zone_id
evaluate_target_health = true
}
}
resource "aws_route53_record" "api_secondary" {
zone_id = var.zone_id
name = "api.myapp.com"
type = "A"
set_identifier = "secondary"
failover_routing_policy {
type = "SECONDARY"
}
alias {
name = aws_lb.secondary.dns_name
zone_id = aws_lb.secondary.zone_id
evaluate_target_health = true
}
}
5.4 Monitoring metrics
groups:
- name: multi_region_alerts
rules:
- alert: CrossRegionReplicationLag
expr: dsql_replication_lag_seconds > 5
for: 5m
labels: { severity: warning }
annotations:
summary: "Replication lag {{ $value }}s between regions"
- alert: RegionUnhealthy
expr: up{job="api", region=~".+"} == 0
for: 2m
labels: { severity: critical }
annotations:
summary: "Region {{ $labels.region }} unreachable"
- alert: SplitBrainSuspected
expr: |
count(dsql_is_primary == 1) by (cluster) > 1
for: 1m
labels: { severity: critical }
annotations:
summary: "Multiple primaries detected — split brain!"
- alert: HighFailoverFrequency
expr: changes(dsql_primary_region[1h]) > 3
labels: { severity: warning }
annotations:
summary: "Failover happened {{ $value }} times in 1h"5.5 Game day procedure
#!/bin/bash
# game-day-region-failure.sh
# Simulate us-east-1 failure quarterly
echo "Game Day: Simulating US-EAST-1 failure"
echo "Expected: us-west-2 takes over, RTO < 5min"
# 1. Block traffic to us-east-1 (security groups are default-deny,
#    so revoke the allow rule; step 4 re-adds it)
aws elbv2 modify-target-group-attributes \
  --target-group-arn $US_EAST_TG \
  --attributes Key=deregistration_delay.timeout_seconds,Value=0
aws ec2 revoke-security-group-ingress \
  --group-id $US_EAST_SG \
  --protocol -1 --source-group $ALLOWED_SG
# 2. Watch failover
echo "Waiting for failover..."
START=$(date +%s)
while true; do
if curl -sf https://api.myapp.com/health | grep -q '"region":"us-west-2"'; then
END=$(date +%s)
echo "Failover complete: $(($END - $START))s"
break
fi
sleep 5
done
# 3. Verify data consistency
psql -h secondary.dsql-cluster.amazonaws.com -c "
SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '5 min';
"
# 4. Restore us-east-1 (re-authorize the allow rule revoked in step 1)
aws ec2 authorize-security-group-ingress ...
# 5. Run reconciliation report
echo "Game Day complete. RTO: $(($END - $START))s. Generate report."6. Code Implementation
6.1 CockroachDB region-aware app
"""
CockroachDB region-aware Python application.
Uses gateway region for low-latency local reads.
"""
import os
import psycopg
from psycopg.rows import dict_row
class RegionAwareDB:
def __init__(self):
self.region = os.getenv("CRDB_REGION", "us-east-1")
self.dsn = os.getenv("CRDB_DSN")
def connect(self):
return psycopg.connect(
self.dsn,
row_factory=dict_row,
options=f"--cluster_name=mycluster --search_path=public",
)
def get_user(self, user_id: str) -> dict:
"""Read user from local region (low latency)."""
with self.connect() as conn:
with conn.cursor() as cur:
cur.execute("""
SELECT * FROM users
WHERE id = %s
AND home_region = %s
""", (user_id, self.region))
return cur.fetchone()
def transfer_money(self, from_user: str, to_user: str, amount: int):
"""Cross-region transfer (requires consensus)."""
with self.connect() as conn:
with conn.transaction():
with conn.cursor() as cur:
                    # CockroachDB transactions run at SERIALIZABLE isolation;
                    # reads are strongly consistent by default
cur.execute("""
UPDATE accounts
SET balance = balance - %s
WHERE user_id = %s AND balance >= %s
""", (amount, from_user, amount))
if cur.rowcount == 0:
raise ValueError("Insufficient funds")
cur.execute("""
UPDATE accounts
SET balance = balance + %s
WHERE user_id = %s
""", (amount, to_user))
db = RegionAwareDB()
user = db.get_user("user-123") # Low latency, local read
db.transfer_money("user-123", "user-456", 100)  # Strong consistency, may cross regions
6.2 Failover-aware HTTP middleware
import os
import time

from fastapi import FastAPI, Request
from starlette.middleware.base import BaseHTTPMiddleware
app = FastAPI()
class FailoverAwareMiddleware(BaseHTTPMiddleware):
"""Add region info to responses, monitor cross-region calls."""
async def dispatch(self, request: Request, call_next):
start = time.time()
region = os.getenv("AWS_REGION", "unknown")
response = await call_next(request)
elapsed = time.time() - start
response.headers["X-Region"] = region
response.headers["X-Response-Time"] = f"{elapsed:.3f}s"
# Alert if response time > 200ms (suggests cross-region call)
if elapsed > 0.2:
await self._log_slow_request(request, region, elapsed)
return response
async def _log_slow_request(self, request, region, elapsed):
# Track slow requests for analysis
print(f"[SLOW] {region} {request.url.path} took {elapsed:.3f}s")
app.add_middleware(FailoverAwareMiddleware)
6.3 Custom conflict resolution (LWW)
"""
Application-level Last-Write-Wins for cross-region conflicts.
"""
import json
import uuid
from datetime import datetime
class LWWConflictResolver:
def __init__(self, db):
self.db = db
def update_user_profile(self, user_id: str, data: dict):
"""Update with LWW timestamp for cross-region safety."""
timestamp = datetime.utcnow().isoformat() + "Z"
update_id = str(uuid.uuid4())
with self.db.connect() as conn:
with conn.cursor() as cur:
# Check current timestamp; only update if newer
cur.execute("""
UPDATE users
SET profile_data = %s,
last_modified = %s,
last_modified_by = %s
WHERE id = %s
AND (last_modified IS NULL OR last_modified < %s)
RETURNING id, last_modified
""", (
json.dumps(data),
timestamp,
update_id,
user_id,
timestamp,
))
result = cur.fetchone()
if result is None:
print(f"Update rejected: newer write exists for {user_id}")
return False
                return True
7. System Design Diagrams
7.1 Active-Active Architecture
flowchart TB
    subgraph Global["Global Layer"]
        DNS[Route 53<br/>Latency-based routing]
        CDN[CloudFront / Cloudflare]
    end
    subgraph US["US-EAST-1"]
        USL[Load Balancer]
        USA[App Tier]
        USDB[(Aurora DSQL<br/>US Primary)]
    end
    subgraph EU["EU-WEST-1"]
        EUL[Load Balancer]
        EUA[App Tier]
        EUDB[(Aurora DSQL<br/>EU Primary)]
    end
    subgraph ASIA["AP-SOUTHEAST-1"]
        ASL[Load Balancer]
        ASA[App Tier]
        ASDB[(Aurora DSQL<br/>Asia Primary)]
    end
    UserUS[US Users] --> CDN --> DNS
    UserEU[EU Users] --> CDN
    UserASIA[Asia Users] --> CDN
    DNS -->|nearest| USL
    DNS -->|nearest| EUL
    DNS -->|nearest| ASL
    USL --> USA --> USDB
    EUL --> EUA --> EUDB
    ASL --> ASA --> ASDB
    USDB <-.sync replication.-> EUDB
    EUDB <-.sync replication.-> ASDB
    USDB <-.sync replication.-> ASDB
    style USDB fill:#4caf50,color:#fff
    style EUDB fill:#4caf50,color:#fff
    style ASDB fill:#4caf50,color:#fff
7.2 Failover Sequence
sequenceDiagram
    participant U as User
    participant DNS as Route 53
    participant US as US Region
    participant EU as EU Region
    participant HC as Health Checks
    Note over US,EU: Normal operation
    U->>DNS: Resolve api.myapp.com
    DNS-->>U: us-east-1 IP (lowest latency)
    U->>US: Request
    US-->>U: Response
    Note over US: ⚡ Region failure ⚡
    HC->>US: Health probe
    Note over HC: 3 consecutive failures<br/>(90 seconds)
    HC->>DNS: Mark us-east-1 unhealthy
    DNS->>DNS: Remove from rotation
    Note over U: Next request
    U->>DNS: Resolve api.myapp.com
    DNS-->>U: eu-west-1 IP (next best)
    U->>EU: Request
    EU-->>U: Response
    Note over US,EU: Total RTO: 90-120 seconds
7.3 Spanner-style Commit Wait
sequenceDiagram
    participant Client
    participant Coord as Coordinator (Region A)
    participant TT as TrueTime API
    participant RegB as Replica (Region B)
    participant RegC as Replica (Region C)
    Client->>Coord: BEGIN; UPDATE x = 5; COMMIT;
    Coord->>TT: now()
    TT-->>Coord: [t_earliest, t_latest]
    Coord->>Coord: T_commit = t_latest
    par Replicate to majority
        Coord->>RegB: Prepare T_commit
        Coord->>RegC: Prepare T_commit
    end
    RegB-->>Coord: ack
    RegC-->>Coord: ack
    Note over Coord: Commit Wait<br/>until TT.after(T_commit)
    Coord->>TT: after(T_commit)?
    TT-->>Coord: true
    Coord->>Coord: Apply commit
    Coord-->>Client: 200 OK
    Note over Client,RegC: Total: ~30-50ms (RTT + ~7ms wait)
7.4 Split-Brain Prevention via Quorum
flowchart TB
    subgraph Before["Before Partition: 3 regions, full mesh"]
        A1[Region A] <--> B1[Region B]
        B1 <--> C1[Region C]
        A1 <--> C1
    end
    subgraph Partition["⚡ Partition: A isolated"]
        A2[Region A<br/>Minority - 1 node]
        B2[Region B<br/>Majority - 2 nodes]
        C2[Region C<br/>Majority - 2 nodes]
        B2 <--> C2
        A2 -.X.- B2
        A2 -.X.- C2
        AStatus[A: cannot accept writes<br/>read-only mode]
        BCStatus[B+C: continue as primary<br/>accept writes]
    end
    subgraph After["After Heal: A reconciles"]
        A3[Region A<br/>Replays missed transactions<br/>from B/C]
        B3[Region B]
        C3[Region C]
        A3 <--> B3
        B3 <--> C3
        A3 <--> C3
    end
    Before --> Partition --> After
    style A2 fill:#ffcdd2
    style B2 fill:#c8e6c9
    style C2 fill:#c8e6c9
8. Aha Moments & Pitfalls
Aha Moments
#1: Aurora DSQL = Spanner for the mass market. Before 2024, only Spanner offered atomic-clock external consistency. Aurora DSQL democratizes the technology — drop-in PostgreSQL with a 99.999% SLA across regions.
#2: Atomic clocks are free on AWS. Amazon Time Sync Service (2023) provides microsecond-accurate time for free on EC2. This is the enabling technology behind DSQL.
#3: Active-Active is not binary. There is a spectrum: full active-active (every region writes), regional active (each region owns a subset), read-anywhere-write-primary. Pick the right level for the use case.
#4: The speed of light is a physical limit. Cross-region sync replication cannot get below ~30ms. The architecture must either accept the latency cost or relax consistency.
#5: Read-local is mandatory for UX. An APAC user cannot wait 200ms for every read. Pattern: local read replicas + synchronous writes to the primary, or CockroachDB locality-aware tables.
#6: Split-brain is rare but catastrophic. One data-corruption incident = trust lost forever. Quorum-based and lease-based fencing are the two main defenses.
#7: DNS failover is slow (5-10 min). For RTO < 1 min, use Anycast (Global Accelerator) or CDN-level routing.
#8: Cost is 3-4x single-region. Justify it with business impact, not "best practice". SaaS at $100K+/month → worth it. Internal tools → maybe not.
Pitfalls
Pitfall 1: Thinking active-passive is enough
Wrong: "A warm standby is enough" → failover during an outage takes an hour and loses data. Right: for revenue-critical systems, only active-active gets you near-zero downtime.
Pitfall 2: Same KMS key across regions
Wrong: one KMS key for all 3 regions → a key compromise = total loss. Right: per-region KMS, customer-managed keys.
Pitfall 3: Async replication for critical writes
Wrong: a payment ledger on async replication → one region fails = recent transactions lost. Right: sync replication with Spanner/DSQL, or app-level 2PC.
Pitfall 4: Ignoring data residency
Wrong: EU user data automatically replicates to the US → GDPR violation, fines. Right: row-level locality (CockroachDB), tagged tables, region pinning.
Pitfall 5: No game day testing
Wrong: "Failover should work" — never actually tested → a real outage discovers the bugs. Right: quarterly game days that simulate region failure and measure RTO.
Pitfall 6: Cross-region calls in tight loops
Wrong: the app makes 100 sequential cross-region calls → 10 seconds of latency. Right: batch, prefetch, use local caches. Cross-region calls are expensive.
Pitfall 7: Trusting DNS TTL
Wrong: set TTL=300s and expect failover in 5 min → some clients cache for an hour. Right: use Anycast / Global Accelerator for sub-30s failover.
Pitfall 8: Forgetting reconciliation after a partition heals
Wrong: the partition heals, the app carries on → diverged data persists. Right: auto-reconciliation (DSQL/CRDB) or a documented manual procedure.
Pitfall 9: No backup beyond replication
Wrong: "Replication is our backup" — ransomware encrypts the data → every replica is encrypted too. Right: point-in-time backups + cross-region + immutable storage.
Pitfall 10: Underestimating cost
Wrong: "Multi-region is just 2x the cost" → the bill arrives at 4x because of data transfer. Right: calculate cross-region transfer carefully. Use private connectivity (Direct Connect, ExpressRoute).
9. Internal Links
| Topic | Relation |
|---|---|
| Tuan-07-Database-Sharding-Replication | Foundation; multi-region is the extreme case |
| Tuan-Bonus-Consensus-Raft-Paxos | Underlying consensus protocols for DSQL |
| Tuan-Bonus-Consistency-Models-Isolation | External consistency, linearizability |
| Tuan-Bonus-CRDTs-Conflict-Free-Data-Types | Alternative conflict resolution for async replication |
| Tuan-Bonus-Multi-Tenancy-SaaS-Patterns | Tenant-region affinity |
| Case-Design-Payment-System | Cross-border payments require multi-region |
| Case-Design-Stock-Exchange | Geo-distributed exchanges |
| Tuan-13-Monitoring-Observability | Cross-region monitoring, replication lag |
References
Papers:
- Spanner (Google, 2012) — https://research.google/pubs/spanner-googles-globally-distributed-database-2/
- Calvin: Fast Distributed Transactions (Yale, 2012) — http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf
- CockroachDB transaction model — https://www.cockroachlabs.com/docs/stable/architecture/transaction-layer.html
Engineering blogs:
- AWS, Introducing Amazon Aurora DSQL (re:Invent 2024) — https://aws.amazon.com/blogs/aws/introducing-amazon-aurora-dsql/
- AWS, Multi-site Active/Active DR Architecture — https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/
- CockroachDB, Living without atomic clocks — https://www.cockroachlabs.com/blog/living-without-atomic-clocks/
- Yugabyte, Geo-distributed deployment — https://docs.yugabyte.com/preview/explore/multi-region-deployments/
Talks:
- AWS re:Invent 2024 DAT427 — Aurora DSQL deep dive
- Spanner talks at SIGMOD, OSDI
Tools:
- Aurora DSQL — https://aws.amazon.com/rds/aurora/dsql/
- CockroachDB — https://www.cockroachlabs.com/
- YugabyteDB — https://www.yugabyte.com/
- TiDB — https://www.pingcap.com/tidb/
- Spanner — https://cloud.google.com/spanner
Next: Tuan-Bonus-Multi-Tenancy-SaaS-Patterns — tenant isolation patterns for SaaS, complementing multi-region.