Bonus Week: Multi-Region Active-Active & Globally Distributed SQL
“In 2012, Google launched Spanner — distributed SQL with external consistency, using atomic clocks to synchronize time across data centers. In 2024, AWS launched Aurora DSQL — bringing that concept to the mass market with a 99.999% SLA. Together with CockroachDB, YugabyteDB, and TiDB, a new category has formed: globally distributed SQL with strong consistency.”
Tags: system-design multi-region distributed-sql aurora-dsql spanner cockroachdb disaster-recovery bonus
Student: Hieu (Backend Dev → Architect)
Prerequisite: Tuan-07-Database-Sharding-Replication · Tuan-Bonus-Consensus-Raft-Paxos · Tuan-Bonus-Consistency-Models-Isolation
Related: Case-Design-Payment-System · Case-Design-Stock-Exchange · Tuan-Bonus-Multi-Tenancy-SaaS-Patterns
1. Context & Why
Everyday analogy — a multinational bank
Hieu, imagine an international bank with branches in Hanoi, Tokyo, San Francisco, and London. Its customers:
- VIPs travel for work → must be able to withdraw money at any branch
- Balances must be accurate globally — a withdrawal in Tokyo must hit the Hanoi ledger immediately
- One branch burns down → the bank must keep operating normally
- Compliance: EU customers' data must stay in the EU (GDPR), US customers' data in the US
This is Multi-Region Active-Active — every region is active (accepts writes), data is globally consistent, and the system tolerates losing one region.
What it is not:
- Active-Passive: one primary region, the others on standby. Failover = downtime + data loss.
- Multi-region read replicas: writes go to a single region → cross-region write latency.
- Sharding by region: EU/US customers are isolated from each other, no cross-region transactions possible.
Why does a backend dev need to understand this?
| Reason | Consequence |
|---|---|
| Outage costs | AWS US-EAST-1 outages (2017, 2021, 2024) → every single-region app goes down |
| Compliance | GDPR data residency, China Cybersecurity Law, India DPDPA |
| Global latency | A user in Vietnam calling a US API: 300ms RTT → unacceptable for real-time |
| Disaster recovery | Earthquake, fire, ransomware → off-region backups are required |
| 2024-2026 distributed SQL maturity | Aurora DSQL (Dec 2024), Spanner GA, CockroachDB → no excuse not to use it |
Why doesn't Alex Xu cover this in depth?
Alex Xu Vol 1+2 (2020-2022) predate distributed SQL maturing for the mass market: CockroachDB Cloud GA in 2020, Spanner pricing reform in 2023, Aurora DSQL in Dec 2024. This is an evolution of the last 2-3 years.
Primary references
- Spanner paper (Google, 2012) — https://research.google/pubs/spanner-googles-globally-distributed-database-2/
- Aurora DSQL launch (re:Invent 2024) — https://aws.amazon.com/blogs/aws/introducing-amazon-aurora-dsql/
- CockroachDB tech blog — https://www.cockroachlabs.com/blog/
- AWS Multi-site Active/Active — https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/
- Calvin paper (deterministic distributed transactions) — http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf
2. Deep Dive — Khái niệm cốt lõi
2.1 Disaster Recovery Strategies — Spectrum
RPO (Recovery Point Objective)        RTO (Recovery Time Objective)
"How much data can we lose?"          "How long does recovery take?"
Backup/Restore ──────── hours ──────── hours-days CHEAP
Pilot Light ──────── minutes ──────── hours ↓
Warm Standby ──────── seconds ──────── minutes ↓
Multi-Site Active-Active ──── ~zero ──────── ~zero EXPENSIVE
| Strategy | RPO | RTO | Cost | Complexity |
|---|---|---|---|---|
| Backup/Restore | Hours | Hours-days | $ | Low |
| Pilot Light | Minutes | Hours | $$ | Medium |
| Warm Standby | Seconds | Minutes | $$$ | Medium |
| Active-Active | ~0 (sync rep) | ~0 (auto failover) | $$$$ | High |
When to choose which:
- Internal tools, dev/staging: Backup/Restore
- Customer-facing non-critical: Pilot Light or Warm Standby
- Revenue-critical (e-commerce, banking, payment): Active-Active
- Mission-critical (healthcare, aviation): Active-Active + chaos engineering
2.2 The Hard Problem — Why Multi-Region Active-Active is Hard
Light is slow: the speed of light is 300,000 km/s. US East to Asia ≈ 12,000 km → a ~40ms one-way physical minimum (and more in practice, since light in fiber travels at roughly 2/3 c).
Round trip latencies (typical):
Same DC: 0.5 ms
Same region: 2-5 ms
Cross-region (US): 50-80 ms
Cross-continent: 100-180 ms
The problem (for strong consistency):
- Synchronous cross-region replication: 100-200ms write latency → bad UX
- Async replication: risk of data loss if a region fails before it replicates
3 fundamental approaches:
- Async with conflict resolution (CRDT, LWW): available, weak consistency
- Sync with consensus (Raft, Paxos): consistent but slow
- TrueTime / atomic clocks: External consistency, fast (Spanner, DSQL)
2.3 Spanner — TrueTime External Consistency
Spanner (Google, 2012) was the first production globally distributed SQL database with strong consistency.
Key innovation: TrueTime
- Atomic clocks + GPS receivers in every data center
- Returns `TT.now()` = an `[earliest, latest]` interval (~7ms uncertainty)
- Guarantees: `TT.after(t)` returns true only after `t` has passed
Spanner commit protocol:
1. Acquire write timestamp T_commit ≥ TT.now().latest
2. Wait until TT.after(T_commit) = true (avg 7ms)
3. Apply commit
4. Reply to client
Result: External consistency
If T1 commits before T2 starts → T1.commit_ts < T2.commit_ts
Even across regions
External consistency = strongest possible: linearizability + serializability + real-time order across regions.
Cost: 7ms commit wait + 1-2 RTT cross-region for Paxos = ~30ms write latency multi-region.
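To make the commit-wait mechanics concrete, here is a toy Python simulation of the four-step protocol above. It fakes TrueTime with the local wall clock plus an assumed ~7ms uncertainty window — real TrueTime derives its bound from atomic clocks and GPS — so treat it as an illustration, not an implementation.

```python
import time

EPS = 0.0035  # assumed half-width of the uncertainty interval (~7ms window)

def tt_now() -> tuple[float, float]:
    """TrueTime-style answer: the true time lies within [earliest, latest]."""
    t = time.time()
    return (t - EPS, t + EPS)

def tt_after(t: float) -> bool:
    """True only once t has definitely passed on every clock."""
    return tt_now()[0] > t

def commit(apply_fn) -> float:
    _, latest = tt_now()
    t_commit = latest                 # 1. T_commit >= TT.now().latest
    while not tt_after(t_commit):     # 2. commit wait (~the uncertainty window)
        time.sleep(0.001)
    apply_fn()                        # 3. apply
    return t_commit                   # 4. reply to client with the timestamp

start = time.time()
commit(lambda: None)
print(f"commit wait was {(time.time() - start) * 1000:.1f} ms")
```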
2.4 Aurora DSQL — Spanner for AWS (2024)
Launched at AWS re:Invent 2024, GA Apr 2025.
Key features:
- Active-Active multi-region by default
- Strong consistency across regions (like Spanner)
- Amazon Time Sync Service (atomic clocks, free on EC2 since 2023!) — no manual setup
- PostgreSQL-compatible — drop-in replacement for many apps
- 99.999% SLA (5 minutes downtime/year)
- Serverless: scale to zero, no cluster management
- OCC (Optimistic Concurrency Control) — no pessimistic locks; conflicting transactions abort at commit and must be retried (see the sketch after this list)
- Disaggregated storage: a separate storage layer that scales independently
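Because DSQL is OCC, a write transaction can abort at COMMIT when it conflicts with a concurrent one, and the application is expected to retry. A minimal retry loop might look like the sketch below — the DSN is a placeholder, and it assumes the conflict surfaces as a standard PostgreSQL serialization error through psycopg:

```python
import time
import psycopg
from psycopg.errors import SerializationFailure

DSN = "host=example.dsql.us-east-1.on.aws dbname=postgres sslmode=require"  # placeholder

def transfer_with_retry(from_id: str, to_id: str, amount: int, max_retries: int = 5):
    """Retry on OCC conflict: one of two conflicting transactions aborts at commit."""
    for attempt in range(max_retries):
        try:
            with psycopg.connect(DSN) as conn:
                with conn.transaction():
                    with conn.cursor() as cur:
                        cur.execute(
                            "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                            (amount, from_id),
                        )
                        cur.execute(
                            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                            (amount, to_id),
                        )
            return  # committed
        except SerializationFailure:
            # Conflict detected at commit time — back off and retry
            time.sleep(0.05 * 2 ** attempt)
    raise RuntimeError("transaction kept conflicting after retries")

# transfer_with_retry("acc-1", "acc-2", 100)  # usage, once DSN points at a real cluster
```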
Architecture (high level):
┌────────────────────────────────────────────┐
│ Aurora DSQL (Region A) │
│ Compute: query routers (stateless) │
│ Storage: distributed log + KV store │
│ Consensus: across regions │
└──────────────────────┬─────────────────────┘
│ sync replication
│ (atomic clock-coordinated)
┌──────────────────────┴─────────────────────┐
│ Aurora DSQL (Region B) │
│ Same arch, write/read locally │
└────────────────────────────────────────────┘
Cost (2026 pricing):
- $4.00 / Distributed Processing Unit (DPU) hour
- $0.33 / GB-month storage
- Cheaper than Spanner for most workloads
Limitations:
- Currently no foreign keys and no triggers (among other missing PostgreSQL features)
- Max DB size 100 TB
- Limited to AWS regions
2.5 CockroachDB / YugabyteDB / TiDB
2.5.1 CockroachDB
- Origin: ex-Google engineers (2014) — modeled after Spanner
- HLC instead of atomic clocks: Hybrid Logical Clock with bounded skew
- Multi-active: every node accepts reads/writes
- Survival goals: configurable to survive zone or region failures
- Production: DoorDash, Comcast, Netflix, eBay
- Source-available (BSL license)
-- CockroachDB multi-region table
CREATE DATABASE myapp;
USE myapp;
-- Set survival goal (requires the database to have at least 3 regions added)
ALTER DATABASE myapp SURVIVE REGION FAILURE;
-- Region-aware table
CREATE TABLE users (
    id UUID DEFAULT gen_random_uuid(),
    region crdb_internal_region NOT NULL DEFAULT default_to_database_primary_region(gateway_region()),
    name TEXT,
    PRIMARY KEY (region, id)
)
LOCALITY REGIONAL BY ROW AS region;
-- Each row pinned to the user's home region for low-latency local reads
2.5.2 YugabyteDB
- Origin: ex-Facebook engineers (2017)
- YSQL (PostgreSQL-compatible) + YCQL (Cassandra-compatible)
- Raft per shard (tablet)
- Production: General Motors, Wells Fargo, Justuno
2.5.3 TiDB
- Origin: PingCAP (2015), China-focused initially
- MySQL-compatible
- HTAP (OLTP + OLAP via TiFlash columnar)
- Production: ByteDance, Pinterest, Square
2.6 Comparison Matrix
| Feature | Aurora DSQL | Spanner | CockroachDB | YugabyteDB | TiDB |
|---|---|---|---|---|---|
| Vendor | AWS | Google Cloud | OSS + Cloud | OSS + Cloud | OSS + Cloud |
| Time source | Amazon Time Sync (atomic) | TrueTime (atomic+GPS) | HLC | HLC | HLC |
| External consistency | ✅ | ✅ | ⚠️ (serializable, not strict serializable) | ✅ | ⚠️ |
| Multi-region active | ✅ Built-in | ✅ Built-in | ✅ Configurable | ✅ Configurable | ✅ Configurable |
| SQL dialect | PostgreSQL | Custom (Spanner SQL) | PostgreSQL | PostgreSQL | MySQL |
| HTAP | No | Yes (limited) | No | No | Yes |
| Self-hosted | No (AWS managed) | No (GCP managed) | Yes | Yes | Yes |
| Best for | AWS shop | Google shop, global apps | OSS preference | Polyglot (SQL + NoSQL) | MySQL migration |
2.7 Routing Strategies
The problem: a user in Vietnam — which region should they call?
2.7.1 DNS-based Routing (Route 53)
api.myapp.com → DNS query →
If user in Asia → returns IP of ap-southeast-1
If user in Europe → returns IP of eu-west-1
If user in Americas → returns IP of us-east-1
Latency-based routing: Route 53 returns the endpoint with the lowest latency. Geolocation routing: route by country.
Pros: simple, free. Cons: DNS TTL caching → slow failover (5-10 min)
2.7.2 AWS Global Accelerator (Anycast)
api.myapp.com → Single static IP
→ Anycast (BGP) routes to nearest edge
→ Edge connects to healthy regional endpoint
Pros:
- Sub-30s failover (no DNS cache)
- Better performance than DNS
- Single IP simpler for clients
Cons:
- $0.018/hour per accelerator
- Vendor-specific (AWS only)
2.7.3 CDN-level routing (Cloudflare, Fastly)
Request → Cloudflare edge (250+ locations)
→ Worker decides routing logic
→ Forwards to optimal regional backend
Pros:
- Edge logic (Workers, Compute@Edge)
- Best UX (closest to user)
- Built-in DDoS, WAF
Cons:
- Vendor lock-in
- Worker cold start consideration
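For intuition, the sketch below does on the client side what Route 53 latency-based routing does in DNS: probe each regional endpoint and pick the fastest. The hostnames are placeholders; real systems measure continuously and respect health checks rather than probing per request.

```python
import socket
import time

ENDPOINTS = {  # placeholder hostnames
    "us-east-1": "api-us-east.myapp.com",
    "eu-west-1": "api-eu-west.myapp.com",
    "ap-southeast-1": "api-ap-southeast.myapp.com",
}

def probe(host: str, port: int = 443, timeout: float = 1.0) -> float:
    """TCP connect time as a rough RTT proxy; inf if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return float("inf")

def pick_region() -> str:
    """Choose the region with the lowest measured connect latency."""
    latencies = {region: probe(host) for region, host in ENDPOINTS.items()}
    return min(latencies, key=latencies.get)

# print(pick_region())  # e.g. "ap-southeast-1" for a user in Vietnam
```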
2.8 Conflict Resolution
Active-Active needs a strategy for when concurrent writes happen.
2.8.1 Strong Consistency (Spanner-style)
- Sync consensus via atomic clocks
- No conflicts possible (all writes serialized globally)
- Cost: 30-100ms write latency
2.8.2 LWW (Last-Write-Wins)
- Each write tagged with HLC timestamp
- Higher timestamp wins
- Risk: Lose updates, clock skew bugs
- Use for: non-critical metadata (user preferences, timestamps)
2.8.3 CRDT-based (Riak, Redis CRDB)
- Mathematical merge guarantees convergence
- Use for: counters, sets, registers
- Reference: Tuan-Bonus-CRDTs-Conflict-Free-Data-Types
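A minimal example of the CRDT idea — a G-Counter, where each region only increments its own slot and merge takes the element-wise max, so replicas converge no matter the order in which they sync. A sketch only; production systems use Riak, Redis CRDB, and friends.

```python
class GCounter:
    """Grow-only counter CRDT: per-region slots, merge = element-wise max."""

    def __init__(self, region: str):
        self.region = region
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # max is commutative, associative, idempotent → guaranteed convergence
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

us, eu = GCounter("us-east-1"), GCounter("eu-west-1")
us.increment(3)
eu.increment(2)
us.merge(eu)
eu.merge(us)
assert us.value() == eu.value() == 5  # both replicas converge
```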
2.8.4 Application-level (Custom)
- App detects conflict, resolves with business logic
- Example: “merge two carts” = union of items (see the sketch after this list)
- Most flexible, most work
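A sketch of that cart merge, with the tie-break on overlapping items (keep the larger quantity) made explicit as a business rule — the rule itself is an assumption, not a universal answer:

```python
def merge_carts(cart_a: dict[str, int], cart_b: dict[str, int]) -> dict[str, int]:
    """Union of items; on overlap keep the max quantity (a business decision)."""
    merged = dict(cart_a)
    for item, qty in cart_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

# Two regions saw different updates during a partition:
print(merge_carts({"book": 1, "pen": 2}, {"pen": 1, "mug": 1}))
# {'book': 1, 'pen': 2, 'mug': 1}
```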
2.8.5 Comparison
| Strategy | Consistency | Latency | Use case |
|---|---|---|---|
| Sync consensus (Spanner) | Strong | High (~50ms) | Banking, critical |
| LWW | Eventual | Low | Metadata |
| CRDT | Eventual (convergent) | Low | Counters, sets |
| Custom | Varies | Medium | Domain-specific |
2.9 Split-Brain Prevention
Risk: network partition → two regions both believe they are primary → writes diverge → data corruption.
Mitigations:
2.9.1 Quorum-based (most common)
3 regions: A, B, C
Quorum = majority = 2
If A isolated from B+C:
A: minority (1) → cannot accept writes
B+C: majority (2) → accepts writes
When network heals:
A reconciles from B+C (replays missed transactions)
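The write-acceptance rule above reduces to a strict-majority check, as in this small sketch (region names are illustrative):

```python
REGIONS = {"us-east-1", "eu-west-1", "ap-southeast-1"}

def can_accept_writes(reachable: set[str]) -> bool:
    """True only for the majority side of a partition — prevents split brain."""
    return len(reachable & REGIONS) > len(REGIONS) // 2

assert can_accept_writes({"eu-west-1", "ap-southeast-1"})  # majority: B+C
assert not can_accept_writes({"us-east-1"})                # minority: A alone
```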
2.9.2 STONITH (Shoot The Other Node In The Head)
- Hardware-level fencing: powered off via management API
- Used in HA clusters (Pacemaker)
- Less common in cloud (use cloud APIs instead)
2.9.3 Lease-based
- Primary holds time-bounded lease
- Must renew or lose primacy
- If can’t communicate → step down automatically
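A minimal lease sketch: the primary keeps serving writes only while its lease is fresh, so a partitioned primary steps down by construction once renewals stop. Timings and the renewal mechanism are illustrative assumptions.

```python
import time

class LeaseHolder:
    """Time-bounded primacy: no renewal → automatic step-down."""

    def __init__(self, ttl: float = 5.0):
        self.ttl = ttl
        self.expires_at = 0.0

    def renew(self, granted: bool) -> None:
        """Called periodically; `granted` means the lease authority answered."""
        if granted:
            self.expires_at = time.monotonic() + self.ttl

    def is_primary(self) -> bool:
        # During a partition, renewals fail and the lease simply lapses
        return time.monotonic() < self.expires_at

holder = LeaseHolder(ttl=5.0)
holder.renew(granted=True)
assert holder.is_primary()
```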
2.10 Cost Considerations
Multi-region active-active is expensive. For a 100 GB DB, 1M tx/day, 3 regions:
| Cost component | Single-region | Active-Active 3-region |
|---|---|---|
| Compute (DB instances) | $500/month | $1,500/month (3x) |
| Storage | $50/month | $150/month (3 copies) |
| Cross-region data transfer | $0 | $200-500/month (replication) |
| Routing (Global Accelerator) | $0 | $50-200/month |
| Monitoring/observability | $50/month | $150/month |
| Total | $600/month | $2,050-2,500/month |
→ ~3-4x the cost. The ROI question: is ~$1,500/month extra worth the higher availability? It depends on the revenue impact of downtime.
Rule of thumb: multi-region is cost-effective when one hour of downtime costs more than one month of the additional spend. For most SaaS at $100K+/month revenue → worth it.
3. Estimation — Multi-Region Capacity
3.1 Replication bandwidth
Scenario: 1000 transactions/sec, average 5KB write per transaction, 3 regions full mesh.
Outbound from each region = 1000 × 5KB = 5 MB/s per replica peer
With 3 regions full mesh: 5 MB/s × 2 peers = 10 MB/s outbound per region
Cross-region bandwidth at AWS pricing ($0.02/GB):
10 MB/s × 86,400 s × 30 days = 25 TB/month per region
3 regions × 25 TB = 75 TB/month total
75 TB × $0.02/GB = $1,500/month for replication alone
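The same back-of-envelope as a reusable Python function. The $0.02/GB rate comes from the text above; actual AWS cross-region rates vary by region pair. (The results land slightly under the prose figures because this version uses binary units throughout.)

```python
def replication_cost(tps: int, write_kb: float, regions: int,
                     usd_per_gb: float = 0.02) -> dict:
    """Full-mesh replication bandwidth and transfer cost estimate."""
    peers = regions - 1                           # full mesh: send to every peer
    mb_per_s = tps * write_kb / 1024 * peers      # outbound per region
    tb_per_month = mb_per_s * 86_400 * 30 / 1_048_576
    usd_per_month = tb_per_month * regions * 1024 * usd_per_gb  # TB → GB → $
    return {
        "outbound_MBps_per_region": round(mb_per_s, 1),
        "TB_per_month_per_region": round(tb_per_month, 1),
        "total_replication_usd_per_month": round(usd_per_month),
    }

print(replication_cost(tps=1000, write_kb=5, regions=3))
# {'outbound_MBps_per_region': 9.8, 'TB_per_month_per_region': 24.1,
#  'total_replication_usd_per_month': 1483}
```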
3.2 Read latency budget
P95 latency target: 100ms for user actions.
Latency breakdown (US user → US region):
Network DNS/TLS: 20ms
App server processing: 30ms
DB query: 30ms
Network return: 20ms
Total: 100ms ✓
Cross-region penalty (US user → EU region):
DNS+TLS: 20ms (same)
App: 30ms
Cross-region DB read: 100ms (RTT)
Return: 20ms
Total: 170ms ✗ (over budget)
→ A read-local pattern is mandatory.
3.3 RTO/RPO targets
| Tier | RTO | RPO | Strategy |
|---|---|---|---|
| Tier 1 (payment) | < 1 min | 0 (zero loss) | Sync replication multi-region |
| Tier 2 (e-commerce) | < 5 min | < 1 min | Async replication + auto failover |
| Tier 3 (analytics) | < 1 hour | < 1 hour | Daily snapshots cross-region |
| Tier 4 (logs) | < 1 day | < 1 day | Backup to cold storage |
3.4 Failover testing budget
Game day pattern: Simulate region failure quarterly.
Cost per game day:
Engineer time: 5 engineers × 4 hours × $100/h = $2,000
Potential customer impact (production): $0 if done right
Tools (chaos engineering): included
Total: $2,000/quarter = $8,000/year
ROI: 1 prevented production outage = saves $50K-500K depending on scale.
4. Security First
4.1 Data residency & sovereignty
Compliance requirements:
- GDPR (EU): EU citizen data must reside in EU
- China Cybersecurity Law: Chinese data must reside in China
- India DPDPA: Critical personal data must reside in India
- HIPAA (US): PHI must follow specific guidelines
Implementation patterns:
-- CockroachDB multi-region with row-level locality
CREATE TABLE users (
    id UUID,
    home_region crdb_internal_region NOT NULL,
    pii_data JSONB,
    PRIMARY KEY (home_region, id)
)
LOCALITY REGIONAL BY ROW AS home_region;
-- EU users → eu-west-1, US users → us-east-1
-- Single SQL surface, but data physically separated
4.2 Cross-region encryption
Mandatory:
- TLS 1.3 for inter-region replication
- KMS keys per region (no single global key)
- Customer-managed keys (CMK) for compliance
# Terraform: per-region KMS
resource "aws_kms_key" "us_east" {
provider = aws.us-east-1
description = "Aurora DSQL encryption key US East"
}
resource "aws_kms_key" "eu_west" {
provider = aws.eu-west-1
description = "Aurora DSQL encryption key EU West"
}
4.3 IAM cross-account / cross-region
Principle of least privilege: Each region has separate IAM roles.
Region A app → IAM role A → Aurora DSQL A only
Region B app → IAM role B → Aurora DSQL B only
No app has cross-region admin access.
Replication uses dedicated service role with minimal scope.
4.4 Audit logging
Every cross-region transaction must be logged:
- Source region, destination region
- Transaction ID, timestamp
- User/role identity
- Data classification
Forward to centralized SIEM (Splunk, Datadog, Wazuh) for compliance audits.
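A minimal shape for such an audit event, covering the fields listed above. The field names and the plain `logging` setup are assumptions — adapt the emitter to your Splunk/Datadog/Wazuh pipeline.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("audit")

def audit_cross_region_tx(source_region: str, dest_region: str,
                          principal: str, data_class: str) -> None:
    """Emit one structured record per cross-region transaction."""
    audit_log.info(json.dumps({
        "event": "cross_region_tx",
        "tx_id": str(uuid.uuid4()),
        "ts": datetime.now(timezone.utc).isoformat(),
        "source_region": source_region,
        "dest_region": dest_region,
        "principal": principal,
        "data_classification": data_class,
    }))

audit_cross_region_tx("us-east-1", "eu-west-1", "role/app-replicator", "pii")
```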
4.5 Disaster recovery testing security
Game day must include:
- Verify failover doesn’t expose unauthorized data
- Confirm encryption keys valid in DR region
- Test access controls survive failover
- Validate audit log integrity
5. DevOps — Operating Multi-Region
5.1 Aurora DSQL setup (Terraform)
provider "aws" {
alias = "primary"
region = "us-east-1"
}
provider "aws" {
alias = "secondary"
region = "us-west-2"
}
# Primary cluster
resource "aws_dsql_cluster" "primary" {
provider = aws.primary
  multi_region_properties {
    # Witness must be a third region, distinct from both peered cluster regions
    witness_region = "us-west-1"
  }
tags = {
Name = "primary-us-east-1"
}
}
# Secondary cluster
resource "aws_dsql_cluster" "secondary" {
provider = aws.secondary
  multi_region_properties {
    # Same witness region as the primary cluster
    witness_region = "us-west-1"
  }
tags = {
Name = "secondary-us-west-2"
}
}
# Link clusters for active-active
resource "aws_dsql_cluster_peering" "main" {
provider = aws.primary
cluster_id = aws_dsql_cluster.primary.id
peer_cluster_arns = [aws_dsql_cluster.secondary.arn]
}
5.2 Application connection pattern
"""
Multi-region aware DB client with automatic failover.
"""
import os
import psycopg
from contextlib import contextmanager
class MultiRegionDB:
def __init__(self):
# Primary endpoint based on user region
self.endpoints = {
"us-east-1": "primary.dsql-cluster.amazonaws.com",
"us-west-2": "secondary.dsql-cluster.amazonaws.com",
"eu-west-1": "tertiary.dsql-cluster.amazonaws.com",
}
self.current_region = os.getenv("AWS_REGION", "us-east-1")
@contextmanager
def connection(self):
"""Try local region first, fall back to others."""
order = [self.current_region] + [
r for r in self.endpoints if r != self.current_region
]
        for region in order:
            try:
                conn = psycopg.connect(
                    host=self.endpoints[region],
                    user="app",
                    password=self._get_iam_token(region),
                    dbname="postgres",
                    sslmode="require",
                    connect_timeout=2,
                )
            except (psycopg.OperationalError, TimeoutError) as e:
                print(f"Failed connect {region}: {e}")
                continue
            try:
                # Only connection errors trigger failover; app errors propagate
                yield conn
            finally:
                conn.close()
            return
        raise RuntimeError("All regions unreachable")
def _get_iam_token(self, region):
# Aurora DSQL uses IAM auth tokens
import boto3
client = boto3.client("dsql", region_name=region)
return client.generate_db_connect_auth_token(
cluster_endpoint=self.endpoints[region]
)
db = MultiRegionDB()
order_id = "ord-123"  # example lookup key
with db.connection() as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM orders WHERE id = %s", (order_id,))
5.3 Health check & failover
# Route 53 health check + failover
resource "aws_route53_health_check" "primary" {
fqdn = "api-us-east.myapp.com"
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = "3"
request_interval = "30"
tags = {
Name = "primary-health"
}
}
resource "aws_route53_record" "api_primary" {
zone_id = var.zone_id
name = "api.myapp.com"
type = "A"
set_identifier = "primary"
failover_routing_policy {
type = "PRIMARY"
}
health_check_id = aws_route53_health_check.primary.id
alias {
name = aws_lb.primary.dns_name
zone_id = aws_lb.primary.zone_id
evaluate_target_health = true
}
}
resource "aws_route53_record" "api_secondary" {
zone_id = var.zone_id
name = "api.myapp.com"
type = "A"
set_identifier = "secondary"
failover_routing_policy {
type = "SECONDARY"
}
alias {
name = aws_lb.secondary.dns_name
zone_id = aws_lb.secondary.zone_id
evaluate_target_health = true
}
}
5.4 Monitoring metrics
groups:
- name: multi_region_alerts
rules:
- alert: CrossRegionReplicationLag
expr: dsql_replication_lag_seconds > 5
for: 5m
labels: { severity: warning }
annotations:
summary: "Replication lag {{ $value }}s between regions"
- alert: RegionUnhealthy
expr: up{job="api", region=~".+"} == 0
for: 2m
labels: { severity: critical }
annotations:
summary: "Region {{ $labels.region }} unreachable"
- alert: SplitBrainSuspected
expr: |
count(dsql_is_primary == 1) by (cluster) > 1
for: 1m
labels: { severity: critical }
annotations:
summary: "Multiple primaries detected — split brain!"
- alert: HighFailoverFrequency
expr: changes(dsql_primary_region[1h]) > 3
labels: { severity: warning }
annotations:
summary: "Failover happened {{ $value }} times in 1h"5.5 Game day procedure
#!/bin/bash
# game-day-region-failure.sh
# Simulate us-east-1 failure quarterly
echo "Game Day: Simulating US-EAST-1 failure"
echo "Expected: us-west-2 takes over, RTO < 5min"
# 1. Block traffic to us-east-1 (security groups are default-deny,
#    so revoke the allow rule; step 4 re-adds it)
aws elbv2 modify-target-group-attributes \
  --target-group-arn $US_EAST_TG \
  --attributes Key=deregistration_delay.timeout_seconds,Value=0
aws ec2 revoke-security-group-ingress \
  --group-id $US_EAST_SG \
  --protocol -1 --source-group $ALLOWED_SG
# 2. Watch failover
echo "Waiting for failover..."
START=$(date +%s)
while true; do
if curl -sf https://api.myapp.com/health | grep -q '"region":"us-west-2"'; then
END=$(date +%s)
echo "Failover complete: $(($END - $START))s"
break
fi
sleep 5
done
# 3. Verify data consistency
psql -h secondary.dsql-cluster.amazonaws.com -c "
SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '5 min';
"
# 4. Restore us-east-1 (re-authorize the allow rule revoked in step 1)
aws ec2 authorize-security-group-ingress ...
# 5. Run reconciliation report
echo "Game Day complete. RTO: $(($END - $START))s. Generate report."6. Code Implementation
6.1 CockroachDB region-aware app
"""
CockroachDB region-aware Python application.
Uses gateway region for low-latency local reads.
"""
import os
import psycopg
from psycopg.rows import dict_row
class RegionAwareDB:
def __init__(self):
self.region = os.getenv("CRDB_REGION", "us-east-1")
self.dsn = os.getenv("CRDB_DSN")
def connect(self):
return psycopg.connect(
self.dsn,
row_factory=dict_row,
options=f"--cluster_name=mycluster --search_path=public",
)
def get_user(self, user_id: str) -> dict:
"""Read user from local region (low latency)."""
with self.connect() as conn:
with conn.cursor() as cur:
cur.execute("""
SELECT * FROM users
WHERE id = %s
AND home_region = %s
""", (user_id, self.region))
return cur.fetchone()
def transfer_money(self, from_user: str, to_user: str, amount: int):
"""Cross-region transfer (requires consensus)."""
with self.connect() as conn:
with conn.transaction():
with conn.cursor() as cur:
                    # CockroachDB transactions run at SERIALIZABLE isolation;
                    # reads are strongly consistent by default
cur.execute("""
UPDATE accounts
SET balance = balance - %s
WHERE user_id = %s AND balance >= %s
""", (amount, from_user, amount))
if cur.rowcount == 0:
raise ValueError("Insufficient funds")
cur.execute("""
UPDATE accounts
SET balance = balance + %s
WHERE user_id = %s
""", (amount, to_user))
db = RegionAwareDB()
user = db.get_user("user-123") # Low latency, local read
db.transfer_money("user-123", "user-456", 100)  # Strong consistency, may cross regions
6.2 Failover-aware HTTP middleware
import os
import time

from fastapi import FastAPI, Request
from starlette.middleware.base import BaseHTTPMiddleware
app = FastAPI()
class FailoverAwareMiddleware(BaseHTTPMiddleware):
"""Add region info to responses, monitor cross-region calls."""
async def dispatch(self, request: Request, call_next):
start = time.time()
region = os.getenv("AWS_REGION", "unknown")
response = await call_next(request)
elapsed = time.time() - start
response.headers["X-Region"] = region
response.headers["X-Response-Time"] = f"{elapsed:.3f}s"
# Alert if response time > 200ms (suggests cross-region call)
if elapsed > 0.2:
await self._log_slow_request(request, region, elapsed)
return response
async def _log_slow_request(self, request, region, elapsed):
# Track slow requests for analysis
print(f"[SLOW] {region} {request.url.path} took {elapsed:.3f}s")
app.add_middleware(FailoverAwareMiddleware)
6.3 Custom conflict resolution (LWW)
"""
Application-level Last-Write-Wins for cross-region conflicts.
"""
import json
import uuid
from datetime import datetime
class LWWConflictResolver:
def __init__(self, db):
self.db = db
def update_user_profile(self, user_id: str, data: dict):
"""Update with LWW timestamp for cross-region safety."""
timestamp = datetime.utcnow().isoformat() + "Z"
update_id = str(uuid.uuid4())
with self.db.connect() as conn:
with conn.cursor() as cur:
# Check current timestamp; only update if newer
cur.execute("""
UPDATE users
SET profile_data = %s,
last_modified = %s,
last_modified_by = %s
WHERE id = %s
AND (last_modified IS NULL OR last_modified < %s)
RETURNING id, last_modified
""", (
json.dumps(data),
timestamp,
update_id,
user_id,
timestamp,
))
result = cur.fetchone()
if result is None:
print(f"Update rejected: newer write exists for {user_id}")
return False
                return True
7. System Design Diagrams
7.1 Active-Active Architecture
flowchart TB
    subgraph Global["Global Layer"]
        DNS[Route 53<br/>Latency-based routing]
        CDN[CloudFront / Cloudflare]
    end
    subgraph US["US-EAST-1"]
        USL[Load Balancer]
        USA[App Tier]
        USDB[(Aurora DSQL<br/>US Primary)]
    end
    subgraph EU["EU-WEST-1"]
        EUL[Load Balancer]
        EUA[App Tier]
        EUDB[(Aurora DSQL<br/>EU Primary)]
    end
    subgraph ASIA["AP-SOUTHEAST-1"]
        ASL[Load Balancer]
        ASA[App Tier]
        ASDB[(Aurora DSQL<br/>Asia Primary)]
    end
    UserUS[US Users] --> CDN --> DNS
    UserEU[EU Users] --> CDN
    UserASIA[Asia Users] --> CDN
    DNS -->|nearest| USL
    DNS -->|nearest| EUL
    DNS -->|nearest| ASL
    USL --> USA --> USDB
    EUL --> EUA --> EUDB
    ASL --> ASA --> ASDB
    USDB <-.sync replication.-> EUDB
    EUDB <-.sync replication.-> ASDB
    USDB <-.sync replication.-> ASDB
    style USDB fill:#4caf50,color:#fff
    style EUDB fill:#4caf50,color:#fff
    style ASDB fill:#4caf50,color:#fff
7.2 Failover Sequence
sequenceDiagram
    participant U as User
    participant DNS as Route 53
    participant US as US Region
    participant EU as EU Region
    participant HC as Health Checks
    Note over US,EU: Normal operation
    U->>DNS: Resolve api.myapp.com
    DNS-->>U: us-east-1 IP (lowest latency)
    U->>US: Request
    US-->>U: Response
    Note over US: ⚡ Region failure ⚡
    HC->>US: Health probe
    Note over HC: 3 consecutive failures<br/>(90 seconds)
    HC->>DNS: Mark us-east-1 unhealthy
    DNS->>DNS: Remove from rotation
    Note over U: Next request
    U->>DNS: Resolve api.myapp.com
    DNS-->>U: eu-west-1 IP (next best)
    U->>EU: Request
    EU-->>U: Response
    Note over US,EU: Total RTO: 90-120 seconds
7.3 Spanner-style Commit Wait
sequenceDiagram
    participant Client
    participant Coord as Coordinator (Region A)
    participant TT as TrueTime API
    participant RegB as Replica (Region B)
    participant RegC as Replica (Region C)
    Client->>Coord: BEGIN; UPDATE x = 5; COMMIT;
    Coord->>TT: now()
    TT-->>Coord: [t_earliest, t_latest]
    Coord->>Coord: T_commit = t_latest
    par Replicate to majority
        Coord->>RegB: Prepare T_commit
        Coord->>RegC: Prepare T_commit
    end
    RegB-->>Coord: ack
    RegC-->>Coord: ack
    Note over Coord: Commit Wait<br/>until TT.after(T_commit)
    Coord->>TT: after(T_commit)?
    TT-->>Coord: true
    Coord->>Coord: Apply commit
    Coord-->>Client: 200 OK
    Note over Client,RegC: Total: ~30-50ms (RTT + ~7ms wait)
7.4 Split-Brain Prevention via Quorum
flowchart TB
    subgraph Before["Before Partition: 3 regions, full mesh"]
        A1[Region A] <--> B1[Region B]
        B1 <--> C1[Region C]
        A1 <--> C1
    end
    subgraph Partition["⚡ Partition: A isolated"]
        A2[Region A<br/>Minority - 1 node]
        B2[Region B<br/>Majority - 2 nodes]
        C2[Region C<br/>Majority - 2 nodes]
        B2 <--> C2
        A2 -.X.- B2
        A2 -.X.- C2
        AStatus[A: cannot accept writes<br/>read-only mode]
        BCStatus[B+C: continue as primary<br/>accept writes]
    end
    subgraph After["After Heal: A reconciles"]
        A3[Region A<br/>Replays missed transactions<br/>from B/C]
        B3[Region B]
        C3[Region C]
        A3 <--> B3
        B3 <--> C3
        A3 <--> C3
    end
    Before --> Partition --> After
    style A2 fill:#ffcdd2
    style B2 fill:#c8e6c9
    style C2 fill:#c8e6c9
8. Aha Moments & Pitfalls
Aha Moments
#1: Aurora DSQL = Spanner for the mass market. Before 2024, only Spanner offered atomic-clock external consistency. Aurora DSQL democratizes the technology — drop-in PostgreSQL with a 99.999% SLA across regions.
#2: Atomic clocks are free on AWS. Amazon Time Sync Service (2023) provides microsecond-accurate time for free on EC2. This is the enabling technology behind DSQL.
#3: Active-Active is not binary. There is a spectrum: full active-active (every region writes), regional active (each region owns a subset), read-anywhere-write-primary. Pick the right level for the use case.
#4: The speed of light is a physical limit. Cross-region sync replication cannot get below ~30ms. The architecture must either accept the latency cost or relax consistency.
#5: Read-local is mandatory for UX. An APAC user cannot wait 200ms for every read. Pattern: local read replicas + synchronous writes to the primary, or CockroachDB locality-aware tables.
#6: Split-brain is rare but catastrophic. One data-corruption incident = trust lost forever. Quorum-based and lease-based fencing are the two main defenses.
#7: DNS failover is slow (5-10 min). For RTO < 1 min, use Anycast (Global Accelerator) or CDN-level routing.
#8: Cost is 3-4x single-region. Justify it with business impact, not "best practice". SaaS at $100K+/month → worth it. Internal tools → maybe not.
Pitfalls
Pitfall 1: Thinking active-passive is enough
Wrong: "A warm standby is enough" → failover during an outage takes an hour and loses data. Right: for revenue-critical systems, only active-active gets you near-zero downtime.
Pitfall 2: Same KMS key across regions
Wrong: one KMS key for all 3 regions → a key compromise = total loss. Right: per-region KMS, customer-managed keys.
Pitfall 3: Async replication for critical writes
Wrong: a payment ledger on async replication → one region fails = recent transactions lost. Right: sync replication with Spanner/DSQL, or app-level 2PC.
Pitfall 4: Ignoring data residency
Wrong: EU user data automatically replicates to the US → GDPR violation, fines. Right: row-level locality (CockroachDB), tagged tables, region pinning.
Pitfall 5: No game day testing
Wrong: "Failover should work" — never actually tested → a real outage discovers the bugs. Right: quarterly game days that simulate region failure and measure RTO.
Pitfall 6: Cross-region calls in tight loops
Wrong: the app makes 100 sequential cross-region calls → 10 seconds of latency. Right: batch, prefetch, use local caches. Cross-region calls are expensive.
Pitfall 7: Trusting DNS TTL
Wrong: set TTL=300s and expect failover in 5 min → some clients cache for an hour. Right: use Anycast / Global Accelerator for sub-30s failover.
Pitfall 8: Forgetting reconciliation after a partition heals
Wrong: the partition heals, the app carries on → diverged data persists. Right: auto-reconciliation (DSQL/CRDB) or a documented manual procedure.
Pitfall 9: No backup beyond replication
Wrong: "Replication is our backup" — ransomware encrypts the data → every replica is encrypted too. Right: point-in-time backups + cross-region + immutable storage.
Pitfall 10: Underestimating cost
Wrong: "Multi-region is just 2x the cost" → the bill arrives at 4x because of data transfer. Right: calculate cross-region transfer carefully. Use private connectivity (Direct Connect, ExpressRoute).
9. Internal Links
| Topic | Relation |
|---|---|
| Tuan-07-Database-Sharding-Replication | Foundation; multi-region is the extreme case |
| Tuan-Bonus-Consensus-Raft-Paxos | Underlying consensus protocols for DSQL |
| Tuan-Bonus-Consistency-Models-Isolation | External consistency, linearizability |
| Tuan-Bonus-CRDTs-Conflict-Free-Data-Types | Alternative conflict resolution for async replication |
| Tuan-Bonus-Multi-Tenancy-SaaS-Patterns | Tenant-region affinity |
| Case-Design-Payment-System | Cross-border payments require multi-region |
| Case-Design-Stock-Exchange | Geo-distributed exchanges |
| Tuan-13-Monitoring-Observability | Cross-region monitoring, replication lag |
References
Papers:
- Spanner (Google, 2012) — https://research.google/pubs/spanner-googles-globally-distributed-database-2/
- Calvin: Fast Distributed Transactions (Yale, 2012) — http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf
- CockroachDB transaction model — https://www.cockroachlabs.com/docs/stable/architecture/transaction-layer.html
Engineering blogs:
- AWS, Introducing Amazon Aurora DSQL (re:Invent 2024) — https://aws.amazon.com/blogs/aws/introducing-amazon-aurora-dsql/
- AWS, Multi-site Active/Active DR Architecture — https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/
- CockroachDB, Living without atomic clocks — https://www.cockroachlabs.com/blog/living-without-atomic-clocks/
- Yugabyte, Geo-distributed deployment — https://docs.yugabyte.com/preview/explore/multi-region-deployments/
Talks:
- AWS re:Invent 2024 DAT427 — Aurora DSQL deep dive
- Spanner talks at SIGMOD, OSDI
Tools:
- Aurora DSQL — https://aws.amazon.com/rds/aurora/dsql/
- CockroachDB — https://www.cockroachlabs.com/
- YugabyteDB — https://www.yugabyte.com/
- TiDB — https://www.pingcap.com/tidb/
- Spanner — https://cloud.google.com/spanner
Next: Tuan-Bonus-Multi-Tenancy-SaaS-Patterns — tenant isolation patterns for SaaS, complementing multi-region.