Bonus Week: Outbox Pattern, CDC & Saga Orchestration

“Junior dev: ‘After saving the order, send a message to Kafka.’ Senior dev: ‘What if the DB commits but the Kafka send fails?’ Architect: ‘That is the dual-write problem, and it is exactly why the Outbox Pattern exists.’”

Tags: system-design outbox cdc debezium saga microservices event-driven bonus Student: Hieu (Backend Dev → Architect) Prerequisite: Tuan-08-Message-Queue · Tuan-11-Microservices-Pattern · Tuan-Bonus-Consistency-Models-Isolation Related: Case-Design-Payment-System · Case-Design-Digital-Wallet · Case-Design-Distributed-Message-Queue · Case-Design-Hotel-Reservation-System


1. Context & Why

Everyday analogy — the post office and the ledger

Hieu, imagine you run an online shop. Every time a new order arrives, you must:

  1. Write it in the ledger (database): “Order #123, 500K”
  2. Email the warehouse (message queue): “Pack order #123”

The problem: these two actions are not in the same transaction. There are four scenarios:

| Scenario | Ledger | Email | Consequence |
| --- | --- | --- | --- |
| ✅ Happy | OK | OK | Everything is fine |
| ❌ A | OK | Fail | The order exists but the warehouse never hears about it → lost customer |
| ❌ B | Fail | OK | The warehouse packs an order that has no record → lost money |
| ❌ C | OK | Sent twice | The warehouse packs the order twice → over-shipping |

How often this happens in production: roughly 0.01-1% of writes without the right pattern. At 1M orders/day, that is 100-10,000 broken orders per day. This is not an edge case; it is the dual-write problem, and it strikes every time you “save to the DB, then send a message”.

The Outbox Pattern solves it with one simple rule:

Write to only one place (the database). A separate process reads from there and pushes to the message queue.

Why does a backend dev need to understand the Outbox?

| Reason | Example |
| --- | --- |
| Every event-driven microservice hits this | Order service → Kafka → Inventory, Notification, Analytics |
| Transactional consistency | 2PC across Postgres + Kafka is not possible |
| At-least-once guarantee | Messages are never lost |
| Audit trail | The outbox table is a complete event log |
| Decoupling | The producer does not depend on Kafka availability |
| Industry-standard CDC | Debezium and Kafka Connect are built around this pattern |

Why doesn't Alex Xu go deep into the Outbox?

Alex Xu vol 1+2 treats Saga and microservices at a conceptual level. But the implementation details — how to guarantee that:

  • the message is always published if the DB commits
  • nothing is published if the DB rolls back
  • consumers are idempotent
  • ordering is preserved

— are where 80% of production systems go wrong at first, ending in inconsistency, lost events, and duplicate processing. The Outbox Pattern is the canonical answer that every senior microservice engineer must know.

Key references


2. Deep Dive — Core Concepts

2.1 The Dual-Write Problem

Naive approach:

def create_order(order_data):
    # Write 1: Database
    order = db.execute("INSERT INTO orders ... RETURNING id")
 
    # Write 2: Message queue
    kafka.send("orders", {"id": order.id, ...})
 
    return order

Failure modes:

Time →

Mode A: DB commit, Kafka fail
  T0: BEGIN
  T1: INSERT order ✓
  T2: COMMIT ✓ (data persisted)
  T3: kafka.send() → NETWORK ERROR
  Result: Order in DB, no event sent → downstream systems out of sync

Mode B: Kafka succeed, DB rollback
  T0: BEGIN
  T1: INSERT order ✓
  T2: kafka.send() ✓ (event sent!)
  T3: ... some error → ROLLBACK
  Result: Event sent for non-existent order → ghost data

Mode C: Process crash between
  T0: COMMIT ✓
  T1: kafka.send() called
  T2: ⚡ Process crashes ⚡
  Result: Sent or not? Depends on TCP buffer state. Cannot retry safely.

Mode D: Send retry creates duplicate
  T0: kafka.send() → timeout (but actually succeeded)
  T1: Retry → 2 events for same order
  Result: Downstream processes order twice
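
Failure mode D is easy to reproduce in a few lines. A toy sketch (the `FlakyBroker` class is invented for illustration): the send actually lands, the ack times out, and a blind retry delivers the same event twice.

```python
class FlakyBroker:
    """Delivers the message, then raises as if the ack timed out."""
    def __init__(self):
        self.delivered = []
        self._fail_next_ack = True

    def send(self, event):
        self.delivered.append(event)          # the write actually lands
        if self._fail_next_ack:
            self._fail_next_ack = False
            raise TimeoutError("ack lost")    # caller cannot tell it landed

def create_order_naive(broker, order_id):
    # "save DB then send" with a blind retry: no outbox, no idempotency key
    for _ in range(2):                        # 1 attempt + 1 retry
        try:
            broker.send({"event": "OrderCreated", "id": order_id})
            return
        except TimeoutError:
            continue                          # retry re-sends the same event

broker = FlakyBroker()
create_order_naive(broker, "ord-123")
print(len(broker.delivered))  # → 2: downstream sees the order twice
```

With an idempotency key on the event, the second delivery would be deduplicated by the consumer instead of processed twice.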

Key insight: 2PC between Postgres and Kafka is not feasible in practice (Kafka is not XA-compliant; distributed locking is far too expensive). A different pattern is needed.

2.2 Outbox Pattern — The Principle

The rule: commit inside a single atomic boundary (the database transaction). A separate process reads from the DB and publishes to the MQ.

┌──────────────────────────────────────────────────────────┐
│                   Application                              │
│                                                            │
│   BEGIN TRANSACTION                                        │
│   ├─ INSERT INTO orders (...) VALUES (...)                 │
│   ├─ INSERT INTO outbox (event_type, payload, created_at)  │
│   COMMIT  ← one atomic boundary                            │
└──────────────────────────────────────────────────────────┘
                         │
                         │ poll/CDC
                         ▼
┌──────────────────────────────────────────────────────────┐
│                  Outbox Relay                              │
│                                                            │
│   Read unpublished events from outbox table                │
│   → Publish to Kafka                                       │
│   → Mark as published (or delete)                          │
└──────────────────────────────────────────────────────────┘
                         │
                         │ Kafka
                         ▼
                ┌────────────────────┐
                │ Downstream Consumers│
                │ - Inventory         │
                │ - Notification      │
                │ - Analytics         │
                └────────────────────┘

Guaranteed properties:

  1. At-least-once: if the DB commits, the event is eventually published (the relay retries indefinitely)
  2. No ghost events: if the DB rolls back, the outbox row does not exist, so nothing is published
  3. Ordering: per-aggregate order can be guaranteed (via the partition key)
  4. Idempotency: consumers must be idempotent (duplicates are possible)

2.3 Outbox Table Schema

CREATE TABLE outbox (
    id              BIGSERIAL PRIMARY KEY,
    aggregate_type  TEXT NOT NULL,      -- 'Order', 'Payment', 'User'
    aggregate_id    TEXT NOT NULL,      -- entity ID
    event_type      TEXT NOT NULL,      -- 'OrderCreated', 'PaymentCompleted'
    payload         JSONB NOT NULL,     -- Event body
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
 
    -- For polling-based relay
    published_at    TIMESTAMPTZ,        -- NULL = not yet published
 
    -- For ordering
    sequence_number BIGINT GENERATED ALWAYS AS IDENTITY,
 
    -- For tracing
    trace_id        TEXT,
    correlation_id  TEXT
);
 
-- Key index: look up unpublished events
CREATE INDEX idx_outbox_unpublished
    ON outbox (created_at)
    WHERE published_at IS NULL;
 
-- Per-aggregate ordering
CREATE INDEX idx_outbox_aggregate
    ON outbox (aggregate_type, aggregate_id, sequence_number);

Insert pattern:

BEGIN;
 
-- Business logic
INSERT INTO orders (id, customer_id, total)
VALUES ('ord-123', 'cust-1', 500000);
 
-- Outbox event
INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
VALUES (
    'Order',
    'ord-123',
    'OrderCreated',
    '{"id": "ord-123", "customer_id": "cust-1", "total": 500000}'::jsonb
);
 
COMMIT;

2.4 Two implementation styles

2.4.1 Style 1: Polling (simple, higher latency)

Mechanism: a worker polls the outbox table every N seconds, publishes the unpublished events, and marks them as published.

async def outbox_worker():
    while True:
        async with db.transaction():
            # Lock rows so two workers never process the same batch
            rows = await db.fetch("""
                SELECT id, aggregate_type, aggregate_id, event_type, payload
                FROM outbox
                WHERE published_at IS NULL
                ORDER BY id
                LIMIT 100
                FOR UPDATE SKIP LOCKED
            """)
 
            for row in rows:
                try:
                    await kafka.send(
                        topic=topic_for(row['aggregate_type']),
                        key=row['aggregate_id'],
                        value=row['payload'],
                    )
                    await db.execute(
                        "UPDATE outbox SET published_at = NOW() WHERE id = $1",
                        row['id']
                    )
                except KafkaError:
                    # Don't mark as published; will retry next iteration
                    break
 
        await asyncio.sleep(1.0)  # Poll interval

Pros:

  • Simple, no extra infra
  • Easy to debug (just SELECT outbox)
  • Works with any DB

Cons:

  • Latency = poll interval (e.g., 1s)
  • DB load: continuous polling
  • Doesn’t scale well past ~10K events/s

2.4.2 Style 2: CDC (Change Data Capture)

Mechanism: Debezium tails the database's WAL (Write-Ahead Log) and streams changes into Kafka as events in real time.

┌─────────────┐   WAL    ┌──────────────┐   Kafka   ┌──────────┐
│ PostgreSQL  │ ────────►│   Debezium   │ ─────────►│   Kafka  │
│             │  changes │ (Kafka Conn) │  events   │          │
└─────────────┘          └──────────────┘           └──────────┘
                                                          │
                                                          ▼
                                                  ┌──────────────┐
                                                  │  Consumers   │
                                                  └──────────────┘

Pros:

  • Latency < 100ms (real-time)
  • No DB polling load
  • Scales to 100K+ events/s
  • Captures ALL changes (not just outbox)

Cons:

  • Extra infrastructure (Kafka Connect cluster)
  • Complex setup (logical replication slots)
  • Schema evolution challenges
  • Operational overhead

2.4.3 Hybrid: Outbox table + CDC

Best of both worlds (the setup Debezium recommends):

  • The app writes to the outbox table (a schema you control)
  • Debezium streams only the outbox table, through the Outbox Event Router SMT
  • Kafka topic structure: events.<aggregateType> (e.g., events.Order, events.Payment)

# debezium-connector.json
{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "debezium",
    "database.dbname": "ordersdb",
    "table.include.list": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.route.by.field": "aggregate_type",
    "transforms.outbox.route.topic.replacement": "events.${routedByValue}",
    "transforms.outbox.table.expand.json.payload": "true"
  }
}

Reference: https://debezium.io/documentation/reference/stable/transformations/outbox-event-router.html

2.5 Outbox Cleanup

The outbox table grows without bound unless it is cleaned up. Strategies:

| Strategy | Pros | Cons |
| --- | --- | --- |
| Delete after publish | Small table | Loses the audit trail; race-condition risk |
| TTL (delete rows older than 7 days) | Balanced | Needs a periodic job |
| Archive to cold storage | Full audit trail | More complex |
| Partition by month | Old partitions drop easily | Needs PostgreSQL 11+ declarative partitioning |

Recommended: a 7-30 day TTL plus a monthly archive.

-- Cleanup job (cron daily)
DELETE FROM outbox
WHERE published_at < NOW() - INTERVAL '7 days';

2.6 Idempotent Consumer — Mandatory

Because the outbox guarantees at-least-once delivery (duplicates can happen), consumers must be idempotent.

Pattern 1: Unique key check

def handle_order_created(event):
    try:
        db.execute(
            "INSERT INTO inventory_reservations (event_id, ...) VALUES (...)",
            event['event_id'], ...
        )
    except UniqueViolation:
        # Already processed; skip
        log.info(f"Event {event['event_id']} already processed")

Pattern 2: Outbox-style on consumer side

CREATE TABLE processed_events (
    event_id TEXT PRIMARY KEY,
    processed_at TIMESTAMPTZ DEFAULT NOW()
);
 
-- In handler
BEGIN;
INSERT INTO processed_events (event_id) VALUES ('evt-123');
-- ... business logic ...
COMMIT;

Pattern 3: Versioning

def update_inventory(event):
    db.execute("""
        UPDATE inventory
        SET quantity = %s, version = %s
        WHERE product_id = %s AND version < %s
    """, event['new_qty'], event['version'], event['product_id'], event['version'])
    # A stale version is a no-op

2.7 Saga Pattern — Distributed Transactions

The problem: a single business transaction spans multiple services. 2PC is not an option, so the steps must be coordinated.

Saga: split the transaction into a chain of local transactions, where each step has a compensating action in case a later step fails.

Saga "Place Order":
  Step 1: Order Service — create order (compensate: cancel order)
  Step 2: Inventory Service — reserve items (compensate: release items)
  Step 3: Payment Service — charge customer (compensate: refund)
  Step 4: Shipping Service — schedule delivery (compensate: cancel shipment)

If Step 4 fails:
  → Compensate Step 3 (refund)
  → Compensate Step 2 (release items)
  → Compensate Step 1 (cancel order)
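
The saga above can be sketched as a list of (action, compensation) pairs; when a step fails, the completed steps are undone in reverse. A minimal illustration with stand-in functions instead of real service calls:

```python
def run_saga(steps, fail_at=None):
    """steps: list of (name, action, compensate). Returns the action log."""
    log, done = [], []
    for name, action, compensate in steps:
        if name == fail_at:                   # simulate this step failing
            log.append(f"{name}:FAILED")
            for _, _, prev_comp in reversed(done):
                prev_comp(log)                # backward recovery, reverse order
            return log
        action(log)
        done.append((name, action, compensate))
    return log

def step(name):   # action factory: pretend to call the service
    return lambda log: log.append(name)
def undo(name):   # compensation factory
    return lambda log: log.append(f"undo:{name}")

steps = [
    ("create_order",   step("create_order"),   undo("create_order")),
    ("reserve_items",  step("reserve_items"),  undo("reserve_items")),
    ("charge_payment", step("charge_payment"), undo("charge_payment")),
    ("ship",           step("ship"),           undo("ship")),
]

print(run_saga(steps, fail_at="ship"))
# Compensations run in reverse: charge_payment, reserve_items, create_order
```

A real orchestrator would persist `done` durably so a crash mid-saga can resume or compensate; this sketch only shows the control flow.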

2.8 Saga Orchestration vs Choreography

2.8.1 Choreography — Event-driven, no “boss”

Each service publishes an event when it finishes; other services subscribe and decide their own next action.

Order Service ──[OrderCreated]──► Kafka
                                     │
              ┌──────────────────────┼──────────────────────┐
              ▼                      ▼                      ▼
    Inventory Service       Payment Service         Notification Service
    ──[ItemsReserved]──►    ──[PaymentCharged]──►  (no further events)
              │                      │
              └──────────┬───────────┘
                         ▼
                    Order Service
                  ──[OrderFulfilled]──►

Pros:

  • Loose coupling
  • Easy to add new services
  • No SPOF
  • Naturally scales

Cons:

  • Hard to debug (“where is the order?”)
  • Hard to track end-to-end status
  • Compensation logic spread across services
  • Cyclic dependencies risk

Latency: Sum of each step’s processing time + Kafka delivery (~50-100ms per step).

2.8.2 Orchestration — A “boss” coordinator

A Saga Orchestrator drives the flow, calling each service over RPC in sequence.

                      ┌─────────────────────┐
                      │  Saga Orchestrator  │
                      │  (state machine)    │
                      └──────────┬──────────┘
                                 │
            ┌────────────────────┼────────────────────┐
            ▼                    ▼                    ▼
    Order Service        Inventory Service    Payment Service
    (RPC: create)        (RPC: reserve)       (RPC: charge)

Pros:

  • Easy to track status (all in orchestrator)
  • Compensation logic centralized
  • Easy to debug (one place to look)
  • Can implement complex flows (parallel, conditional)

Cons:

  • The orchestrator is a SPOF (needs HA)
  • Tighter coupling
  • The orchestrator can grow into a “god class”

Implementation tools:

  • Temporal (https://temporal.io/) — a fork of Uber's Cadence by its original authors; production-grade
  • Camunda (BPMN-based)
  • AWS Step Functions (serverless)
  • Netflix Conductor

Latency: one round trip per step (~10-50ms RPC) plus state persistence.

2.8.3 Detailed comparison

| Criterion | Choreography | Orchestration |
| --- | --- | --- |
| Coupling | Loose | Tight (services know the orchestrator) |
| Latency per step | 50-100ms (Kafka) | 10-50ms (RPC) |
| Total latency (4 steps) | 200-400ms | 80-200ms |
| Debugging | Hard (needs distributed tracing) | Easy (inspect orchestrator state) |
| Adding a new service | Easy (subscribe to events) | Modify the orchestrator |
| Testing | Hard (needs the full event chain) | Easy (mock the services) |
| Compensation | Spread across services | Centralized |
| Use case | Open ecosystems, many teams | Enterprise workflows, regulatory |
| Tooling | Kafka + custom code | Temporal, AWS Step Functions |

Recommendations:

  • Choreography for ≤3 services and a simple flow
  • Orchestration for ≥4 services or complex compensation
  • A hybrid is possible: the orchestrator calls services, and services publish events downstream

2.9 Saga Compensation Patterns

2.9.1 Backward Recovery (rollback)

Compensate by undoing previous steps in reverse order.

saga = OrderSaga()
try:
    saga.create_order()         # Step 1
    saga.reserve_inventory()    # Step 2
    saga.charge_payment()       # Step 3 — FAIL
except PaymentFailed:
    saga.release_inventory()    # Compensate 2
    saga.cancel_order()         # Compensate 1

2.9.2 Forward Recovery (retry + alternative)

Some operations can’t be undone (e.g., email sent). Try alternative path.

try:
    saga.send_confirmation_email()
except EmailServiceDown:
    saga.send_sms_instead()  # Forward path

2.9.3 Pivot Transactions

Some saga steps are a “point of no return” — after them, only forward recovery is possible.

Step 1: Reserve items     ← compensatable
Step 2: Charge payment    ← compensatable (refund)
Step 3: Pivot — Ship      ← NOT compensatable (already shipped)
Step 4: Send tracking     ← retryable forward

After the pivot there is no rolling back; the saga must be completed forward.
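
A sketch of that rule (the step names and the `run_with_pivot` helper are illustrative, not from the text): failures before the pivot roll the saga back, failures after it are retried forward.

```python
def run_with_pivot(steps, pivot, flaky):
    """steps: ordered step names; flaky: steps that fail once, then succeed."""
    log, done, failures = [], [], dict.fromkeys(flaky, 1)
    for name in steps:
        while failures.get(name, 0) > 0:      # this attempt fails
            failures[name] -= 1
            if steps.index(name) <= steps.index(pivot):
                # at or before the pivot: backward recovery is still allowed
                for prev in reversed(done):
                    log.append(f"undo:{prev}")
                return log + [f"{name}:aborted"]
            log.append(f"{name}:retry")       # past the pivot: forward only
        log.append(name)
        done.append(name)
    return log

steps = ["reserve_items", "charge_payment", "ship", "send_tracking"]
print(run_with_pivot(steps, pivot="ship", flaky=["send_tracking"]))
# send_tracking fails once but is retried forward; nothing is rolled back
```

If instead `charge_payment` (before the pivot) fails, the sketch compensates `reserve_items` and aborts, matching the backward-recovery flow in 2.9.1.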

2.10 Inbox Pattern — Idempotency on the Receiver

The counterpart of the Outbox: the Inbox guarantees idempotency on the consumer side.

CREATE TABLE inbox (
    event_id    TEXT PRIMARY KEY,
    received_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    processed   BOOLEAN NOT NULL DEFAULT false,
    payload     JSONB
);
def consume(event):
    with db.transaction():
        try:
            db.execute(
                "INSERT INTO inbox (event_id, payload) VALUES (%s, %s)",
                event['id'], event['payload']
            )
        except UniqueViolation:
            return  # Already received
 
        # Process event
        process_business_logic(event)
 
        db.execute(
            "UPDATE inbox SET processed = true WHERE event_id = %s",
            event['id']
        )

2.11 Real-world implementations

| System | Approach |
| --- | --- |
| Stripe | Idempotency keys on the API; outbox internally for webhooks |
| Shopify | Outbox + Kafka for inventory events |
| Uber | Cadence (now Temporal) for orchestration |
| Netflix | Conductor for orchestration; choreography in places |
| Amazon | Step Functions for many internal workflows |
| GitHub | Outbox for webhook delivery |
| Confluent | Standard recommendation: Outbox + Debezium CDC |

3. Estimation — Latency & Throughput

3.1 Outbox latency budget

Polling style (poll interval 1s):

  • P50: 500ms (random in poll window)
  • P99: 1100ms (poll + Kafka send + ack)

CDC style (Debezium):

  • P50: 50ms
  • P99: 200ms

Comparison:

| Component | Polling | CDC |
| --- | --- | --- |
| Detect change | 0-1000ms | <10ms (WAL tail) |
| Read | 5ms (DB query) | 1ms (in-memory) |
| Publish to Kafka | 5ms | 5ms |
| Total P50 | ~510ms | ~16ms |
| Total P99 | ~1100ms | ~200ms |

3.2 Throughput Capacity

Polling:

  • Single worker: ~5K events/s (limited by DB query overhead)
  • With multiple workers + FOR UPDATE SKIP LOCKED: ~20-50K events/s
  • Beyond that: DB IOPS becomes bottleneck

CDC:

  • Debezium single connector: ~10-30K events/s
  • Multiple connectors (sharded): 100K+ events/s
  • Limited by Kafka producer + DB WAL throughput

3.3 Saga latency

Example: 4-step saga

| Pattern | Step latency | Total (sequential) |
| --- | --- | --- |
| Choreography (Kafka) | ~80ms (publish + consume + process) | ~320ms |
| Orchestration (RPC) | ~30ms (RPC + processing) | ~120ms |
| Choreography (parallel) | n/a (sequential by nature) | ~320ms |
| Orchestration (parallel steps 2-3) | step1 + max(step2, step3) + step4 | ~90ms |

Aha: orchestration is faster but more tightly coupled; choreography is slower but more decoupled.
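
The totals in the table follow from simple addition; a quick check using the rough per-step figures above (~80ms per Kafka hop, ~30ms per RPC):

```python
def sequential(step_ms, steps=4):
    # sequential saga: total latency is the sum of all step latencies
    return step_ms * steps

def parallel_2_3(step_ms):
    # step1 + max(step2, step3) + step4, assuming equal step latencies
    return step_ms + max(step_ms, step_ms) + step_ms

print(sequential(80))      # choreography, 4 steps: 320 ms
print(sequential(30))      # orchestration, 4 steps: 120 ms
print(parallel_2_3(30))    # orchestration, steps 2-3 in parallel: 90 ms
```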

3.4 Outbox table size estimation

Scenario: 1000 orders/s, 5 events per order, 1KB per event, 7-day retention.

Events/day = 1000 × 5 × 86400 = 432M events/day
Storage/day = 432M × 1KB = 432 GB/day
Retention = 432 × 7 = 3.024 TB

That is a lot of data. You will need:

  • Partitioning (monthly)
  • Compression (TOAST)
  • A cleanup job
  • Or: keep only unpublished rows and archive the published ones
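
The arithmetic above, wrapped in a small helper so it can be re-run for other workloads (decimal units, as in the estimate: 1 KB event, 1 GB = 10^6 KB):

```python
def outbox_storage(orders_per_s, events_per_order, event_kb, retention_days):
    """Return (events/day, GB/day, total TB over the retention window)."""
    events_per_day = orders_per_s * events_per_order * 86_400
    gb_per_day = events_per_day * event_kb / 1_000_000   # KB → GB (decimal)
    return events_per_day, gb_per_day, gb_per_day * retention_days / 1000

# The scenario's assumptions: 1000 orders/s, 5 events/order, 1 KB, 7 days
events, gb_day, tb_total = outbox_storage(1000, 5, 1, 7)
print(events)              # 432000000 events/day
print(gb_day)              # 432.0 GB/day
print(round(tb_total, 3))  # 3.024 TB over 7 days
```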

3.5 Saga state storage (Orchestration)

A Temporal-style example:

  • 1M concurrent sagas at ~10KB of state each → 10GB
  • Add history: ~100 events/saga × 500 bytes = 50KB/saga → 50GB total
  • A 5-node Cassandra cluster at ~10GB/node fits comfortably

4. Security First — Outbox Vulnerabilities

4.1 Outbox Table = Sensitive Data Store

The outbox holds the full event payload, which can include:

  • PII (customer info)
  • Financial data (amounts, account numbers)
  • Auth tokens (do not do that!)

Mitigation:

-- Encrypt payload at rest
CREATE EXTENSION pgcrypto;
 
INSERT INTO outbox (..., payload_encrypted)
VALUES (..., pgp_sym_encrypt(payload::text, current_setting('app.encryption_key')));
 
-- Reader decrypts
SELECT pgp_sym_decrypt(payload_encrypted, ...)::jsonb FROM outbox;

Or, better: store a reference instead of the full data:

// BAD
{"event": "PaymentCompleted", "card_number": "4111111111111111"}
 
// GOOD
{"event": "PaymentCompleted", "payment_id": "pay-123"}
// Consumer fetches full data via authenticated API

4.2 Event Tampering — Signing

If consumers trust event payloads blindly, an attacker can inject malicious events.

Mitigation: HMAC signing

import hmac, hashlib, json
 
def sign_event(payload, secret_key):
    payload_bytes = json.dumps(payload, sort_keys=True).encode()
    signature = hmac.new(secret_key, payload_bytes, hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}
 
def verify_event(event, secret_key):
    payload_bytes = json.dumps(event['payload'], sort_keys=True).encode()
    expected = hmac.new(secret_key, payload_bytes, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, event['signature']):
        raise SecurityError("Invalid signature")

4.3 Saga Compensation = Attack Surface

Compensation actions usually run with elevated privileges (e.g., refund, cancel order). An attacker who can trigger a fake “failure” can trigger compensation and walk away with a free refund.

Mitigation:

  • Authenticate inter-service calls (mTLS)
  • Validate compensation preconditions (never refund when amount = 0)
  • Audit-log every compensation
  • Rate-limit compensations per user/account
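
The precondition and rate-limit points can be made concrete with a guard in front of the compensation handler. A hypothetical sketch (`Refund` and `validate_refund` are invented names, not from the text):

```python
from dataclasses import dataclass

@dataclass
class Refund:
    account: str
    amount: int    # amount to refund
    charged: int   # amount originally charged

def validate_refund(refund, recent_refunds_by_account, max_per_hour=3):
    """Reject compensations that fail basic preconditions or rate limits."""
    if refund.amount <= 0:
        return "rejected: non-positive amount"
    if refund.amount > refund.charged:
        return "rejected: exceeds original charge"
    if recent_refunds_by_account.get(refund.account, 0) >= max_per_hour:
        return "rejected: rate limit"
    return "ok"

print(validate_refund(Refund("acct-1", 0, 500), {}))               # non-positive
print(validate_refund(Refund("acct-1", 900, 500), {}))             # exceeds charge
print(validate_refund(Refund("acct-1", 100, 500), {"acct-1": 3}))  # rate limit
print(validate_refund(Refund("acct-1", 100, 500), {"acct-1": 1}))  # ok
```

Every rejection should also be audit-logged, since a burst of rejected compensations is itself a signal of an attack.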

4.4 Replay Attack

An attacker captures an event and replays it repeatedly, triggering duplicate side effects.

Mitigation:

  • Mandatory idempotency keys on the consumer
  • Event TTL: reject events older than 24h
  • Monotonic sequence-number checks
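
The TTL and sequence-number checks can be combined into a single acceptance test on the consumer. A sketch (in production, `last_seq` would live in durable storage, e.g. the inbox table):

```python
import time

TTL_SECONDS = 24 * 3600

def accept(event, last_seq, now=None):
    """Return (accepted, new_last_seq) for an incoming event."""
    now = now or time.time()
    if now - event["ts"] > TTL_SECONDS:
        return False, last_seq            # older than the TTL: likely a replay
    if event["seq"] <= last_seq:
        return False, last_seq            # sequence did not advance: duplicate
    return True, event["seq"]

now = 1_700_000_000
ok, seq = accept({"ts": now - 10, "seq": 5}, last_seq=4, now=now)
print(ok, seq)    # True 5
ok, seq = accept({"ts": now - 10, "seq": 5}, last_seq=5, now=now)
print(ok, seq)    # False 5  (replayed sequence number)
ok, seq = accept({"ts": now - 200_000, "seq": 6}, last_seq=5, now=now)
print(ok, seq)    # False 5  (older than the 24h TTL)
```

The strict `seq <= last_seq` check also drops legitimately reordered events, so it fits per-aggregate ordering (section 2.2) rather than a globally unordered stream.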

4.5 Audit Trail

The outbox table is a natural audit trail, but it needs:

  • Append-only access (REVOKE UPDATE, DELETE ON outbox FROM application_user)
  • A separate read-only role for auditors
  • Forwarding to an immutable log (S3 with Object Lock, blockchain)

5. DevOps — Operating the Outbox & Saga

5.1 Docker Compose: Postgres + Debezium + Kafka

# docker-compose.yml
version: "3.8"
 
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: ordersdb
    command:
      - "postgres"
      - "-c"
      - "wal_level=logical"
      - "-c"
      - "max_replication_slots=10"
      - "-c"
      - "max_wal_senders=10"
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
 
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
 
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092,PLAINTEXT_HOST://localhost:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    ports:
      - "9093:9093"
 
  debezium:
    image: debezium/connect:2.5
    depends_on: [kafka, postgres]
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: 1
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_statuses
    ports:
      - "8083:8083"
 
volumes:
  pgdata:

Setup steps:

# 1. Start
docker compose up -d
 
# 2. Create outbox table
docker compose exec postgres psql -U app -d ordersdb -c "
  CREATE TABLE outbox (
    id BIGSERIAL PRIMARY KEY,
    aggregate_type TEXT NOT NULL,
    aggregate_id TEXT NOT NULL,
    event_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
  );
"
 
# 3. Register Debezium connector
curl -X POST http://localhost:8083/connectors -H 'Content-Type: application/json' -d '{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.user": "app",
    "database.password": "secret",
    "database.dbname": "ordersdb",
    "topic.prefix": "ordersdb",
    "table.include.list": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.route.by.field": "aggregate_type",
    "transforms.outbox.route.topic.replacement": "events.${routedByValue}",
    "plugin.name": "pgoutput"
  }
}'
 
# 4. Test: insert and verify
docker compose exec postgres psql -U app -d ordersdb -c "
  INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
  VALUES ('Order', 'ord-1', 'OrderCreated', '{\"id\":\"ord-1\"}'::jsonb);
"
 
# Check Kafka
docker compose exec kafka kafka-console-consumer \
  --bootstrap-server kafka:9092 \
  --topic events.Order \
  --from-beginning

5.2 Prometheus Metrics & Alerts

groups:
  - name: outbox_alerts
    rules:
      # Outbox lag growing → relay can't keep up
      - alert: OutboxLagHigh
        expr: outbox_unpublished_count > 10000
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Outbox has {{ $value }} unpublished events"
          description: "Relay lagging — check Debezium/worker health"
 
      # Old unpublished events
      - alert: OutboxOldestUnpublishedHigh
        expr: outbox_oldest_unpublished_seconds > 300
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Oldest unpublished event is {{ $value }}s old"
 
      # Debezium connector failed
      - alert: DebeziumConnectorDown
        expr: kafka_connect_connector_status{state!="RUNNING"} > 0
        for: 1m
        labels: { severity: critical }
 
      # WAL replication slot lag
      - alert: PostgresReplicationSlotLag
        expr: pg_replication_slot_lag_bytes > 1073741824   # 1 GB
        for: 10m
        labels: { severity: warning }
        annotations:
          description: "Slot {{ $labels.slot_name }} has {{ $value | humanize1024 }}B lag"
 
      # Saga compensation rate spike
      - alert: SagaCompensationRateHigh
        expr: rate(saga_compensation_total[5m]) > 10
        for: 5m
        labels: { severity: warning }
        annotations:
          description: "{{ $value }}/s sagas compensating — investigate downstream failures"

5.3 Grafana Dashboard

| Panel | Query | Purpose |
| --- | --- | --- |
| Outbox lag | outbox_unpublished_count | Detect a lagging relay |
| Events published/s | rate(outbox_events_published_total[5m]) | Throughput |
| P99 publish latency | histogram_quantile(0.99, rate(outbox_publish_duration_bucket[5m])) | SLO |
| Debezium connector status | kafka_connect_connector_status | Health |
| WAL replication lag | pg_replication_slot_lag_bytes | DB health |
| Saga in-flight | saga_in_flight_count | Capacity |
| Saga success rate | rate(saga_completed_total[5m]) / rate(saga_started_total[5m]) | Quality |

5.4 Disaster Scenarios

Scenario A: Debezium connector down

Symptom: Outbox events accumulate, no Kafka events. Recovery:

  1. Check Connect cluster health
  2. Restart connector: curl -X POST http://localhost:8083/connectors/outbox-connector/restart
  3. If the WAL slot is bloated → consider dropping and recreating the slot (changes in the gap disappear from the stream; recover them by re-snapshotting the outbox table)

Scenario B: Kafka cluster down

Symptom: Debezium retries indefinitely. Recovery:

  • Outbox table accumulates → check disk space
  • Once Kafka recovers, Debezium auto-resumes from last offset

Scenario C: Outbox table full disk

Symptom: INSERT INTO outbox fails → application errors. Recovery:

  • Run cleanup: DELETE FROM outbox WHERE published_at < NOW() - INTERVAL '1 day'
  • If the rows are still unpublished → fix the relay first (deleting them would lose events)

Scenario D: Saga orchestrator crash

Symptom: In-flight sagas frozen. Recovery:

  • With Temporal: workers reconnect, sagas resume from last state
  • Without Temporal: depends on state storage. If using DB, manual recovery.

5.5 Testing Saga End-to-End

import time

import pytest
import requests
from testcontainers.compose import DockerCompose
 
@pytest.fixture(scope="module")
def saga_env():
    with DockerCompose(".", compose_file_name="docker-compose.test.yml") as env:
        env.wait_for("http://localhost:8083/connectors")
        yield env
 
def test_order_saga_happy_path(saga_env):
    response = requests.post("http://order-service/orders", json={...})
    order_id = response.json()['id']
 
    # Wait for saga completion
    deadline = time.time() + 30
    while time.time() < deadline:
        order = requests.get(f"http://order-service/orders/{order_id}").json()
        if order['status'] == 'completed':
            return
        time.sleep(0.5)
 
    pytest.fail("Saga did not complete in 30s")
 
def test_order_saga_payment_failure_compensates(saga_env):
    # Simulate payment failure
    requests.post("http://payment-service/_test/fail-next", json={"count": 1})
 
    response = requests.post("http://order-service/orders", json={...})
    order_id = response.json()['id']
 
    time.sleep(5)
 
    # Verify compensation
    order = requests.get(f"http://order-service/orders/{order_id}").json()
    assert order['status'] == 'cancelled'
 
    inventory = requests.get(f"http://inventory-service/items/{...}").json()
    assert inventory['reserved'] == 0  # Released

6. Code Implementation

6.1 Python Outbox Pattern

"""
Outbox Pattern — Python implementation with Postgres + Kafka
"""
 
import asyncio
import json
import logging
from contextlib import asynccontextmanager
from datetime import datetime
from uuid import uuid4
 
import asyncpg
from aiokafka import AIOKafkaProducer
 
log = logging.getLogger(__name__)
 
 
class OutboxPublisher:
    """Single transaction boundary: business write + outbox event."""
 
    def __init__(self, pool: asyncpg.Pool):
        self.pool = pool
 
    @asynccontextmanager
    async def transaction(self):
        async with self.pool.acquire() as conn:
            async with conn.transaction():
                yield OutboxTransaction(conn)
 
 
class OutboxTransaction:
    def __init__(self, conn):
        self.conn = conn
 
    async def execute(self, query, *args):
        return await self.conn.execute(query, *args)
 
    async def fetch(self, query, *args):
        return await self.conn.fetch(query, *args)
 
    async def emit(self, aggregate_type: str, aggregate_id: str,
                   event_type: str, payload: dict, trace_id: str = None):
        """Emit an event to the outbox. It is published after commit."""
        # Let the DB assign the BIGSERIAL id, matching the schema in 2.3
        row = await self.conn.fetchrow("""
            INSERT INTO outbox
                (aggregate_type, aggregate_id, event_type, payload, trace_id)
            VALUES ($1, $2, $3, $4, $5)
            RETURNING id
        """, aggregate_type, aggregate_id, event_type,
            json.dumps(payload), trace_id)
        return row['id']
 
 
# === Business Logic ===
 
class OrderService:
    def __init__(self, publisher: OutboxPublisher):
        self.publisher = publisher
 
    async def create_order(self, customer_id: str, items: list, total: int):
        order_id = str(uuid4())
 
        async with self.publisher.transaction() as txn:
            # Business write
            await txn.execute("""
                INSERT INTO orders (id, customer_id, total, status)
                VALUES ($1, $2, $3, 'pending')
            """, order_id, customer_id, total)
 
            await txn.execute("""
                INSERT INTO order_items (order_id, product_id, quantity)
                SELECT $1, product_id, quantity FROM jsonb_to_recordset($2::jsonb)
                AS x(product_id TEXT, quantity INT)
            """, order_id, json.dumps(items))
 
            # Outbox event in same transaction
            await txn.emit(
                aggregate_type='Order',
                aggregate_id=order_id,
                event_type='OrderCreated',
                payload={
                    'id': order_id,
                    'customer_id': customer_id,
                    'items': items,
                    'total': total,
                    'created_at': datetime.utcnow().isoformat(),
                }
            )
 
        return order_id
 
 
# === Polling Relay ===
 
class OutboxRelay:
    """Polls outbox table and publishes to Kafka."""
 
    def __init__(self, pool: asyncpg.Pool, kafka_servers: str,
                 batch_size: int = 100, poll_interval: float = 1.0):
        self.pool = pool
        self.kafka_servers = kafka_servers
        self.batch_size = batch_size
        self.poll_interval = poll_interval
        self.producer: AIOKafkaProducer = None
 
    async def start(self):
        self.producer = AIOKafkaProducer(
            bootstrap_servers=self.kafka_servers,
            enable_idempotence=True,
            acks='all',
        )
        await self.producer.start()
 
        try:
            await self._loop()
        finally:
            await self.producer.stop()
 
    async def _loop(self):
        while True:
            try:
                published = await self._publish_batch()
                if published == 0:
                    await asyncio.sleep(self.poll_interval)
            except Exception:
                log.exception("Relay error")
                await asyncio.sleep(self.poll_interval)
 
    async def _publish_batch(self) -> int:
        async with self.pool.acquire() as conn:
            async with conn.transaction():
                # SKIP LOCKED: multiple workers can run concurrently
                rows = await conn.fetch("""
                    SELECT id, aggregate_type, aggregate_id, event_type, payload, trace_id
                    FROM outbox
                    WHERE published_at IS NULL
                    ORDER BY created_at, id  -- UUID ids are random; created_at keeps rough FIFO
                    LIMIT $1
                    FOR UPDATE SKIP LOCKED
                """, self.batch_size)
 
                if not rows:
                    return 0
 
                published_ids = []
                for row in rows:
                    topic = f"events.{row['aggregate_type']}"
                    headers = [
                        ('event_type', row['event_type'].encode()),
                        ('trace_id', (row['trace_id'] or '').encode()),
                    ]
                    try:
                        await self.producer.send_and_wait(
                            topic=topic,
                            key=row['aggregate_id'].encode(),
                            value=row['payload'].encode() if isinstance(row['payload'], str)
                                  else json.dumps(row['payload']).encode(),
                            headers=headers,
                        )
                        published_ids.append(row['id'])
                    except Exception:
                        log.exception(f"Failed to publish event {row['id']}")
                        break
 
                if published_ids:
                    await conn.execute("""
                        UPDATE outbox SET published_at = NOW()
                        WHERE id = ANY($1::uuid[])  -- ids are UUIDs (see emit), not bigint
                    """, published_ids)
 
                return len(published_ids)
 
 
# === Idempotent Consumer ===
 
class IdempotentConsumer:
    """Consumer with inbox pattern."""
 
    def __init__(self, pool: asyncpg.Pool):
        self.pool = pool
 
    async def handle(self, event: dict, processor):
        event_id = event['id']
 
        async with self.pool.acquire() as conn:
            async with conn.transaction():
                # Try to mark as received
                try:
                    await conn.execute("""
                        INSERT INTO inbox (event_id) VALUES ($1)
                    """, event_id)
                except asyncpg.UniqueViolationError:
                    log.info(f"Event {event_id} already processed")
                    return
 
                # Process
                await processor(conn, event)
 
 
# === Main ===
 
async def main():
    pool = await asyncpg.create_pool(
        "postgresql://app:secret@localhost/ordersdb",
        min_size=2, max_size=10,
    )
 
    publisher = OutboxPublisher(pool)
    order_service = OrderService(publisher)
 
    # Create order
    order_id = await order_service.create_order(
        customer_id='cust-1',
        items=[{'product_id': 'p1', 'quantity': 2}],
        total=500000,
    )
    print(f"Created order {order_id}")
 
    # Start relay (in production: separate process)
    relay = OutboxRelay(pool, "localhost:9093")
    await relay.start()
 
 
if __name__ == "__main__":
    asyncio.run(main())

6.2 Saga Orchestration with Temporal (skeleton)

"""
Saga Orchestration with Temporal
Reference: https://temporal.io/
"""
 
import time
from datetime import timedelta

from temporalio import workflow, activity
from temporalio.client import Client
from temporalio.common import RetryPolicy
from temporalio.worker import Worker
 
 
# === Activities (idempotent steps) ===
# Note: order_api, inventory_api, payment_api, shipping_api are assumed
# service clients; they are not defined in this skeleton.
 
@activity.defn
async def create_order(customer_id: str, items: list, total: int) -> str:
    # Call OrderService API
    order_id = await order_api.create(customer_id, items, total)
    return order_id
 
@activity.defn
async def cancel_order(order_id: str):
    await order_api.cancel(order_id)
 
@activity.defn
async def reserve_inventory(order_id: str, items: list) -> str:
    reservation_id = await inventory_api.reserve(order_id, items)
    return reservation_id
 
@activity.defn
async def release_inventory(reservation_id: str):
    await inventory_api.release(reservation_id)
 
@activity.defn
async def charge_payment(order_id: str, amount: int) -> str:
    payment_id = await payment_api.charge(order_id, amount)
    return payment_id
 
@activity.defn
async def refund_payment(payment_id: str):
    await payment_api.refund(payment_id)
 
@activity.defn
async def schedule_shipping(order_id: str):
    await shipping_api.schedule(order_id)
 
 
# === Workflow (Orchestrator) ===
 
@workflow.defn
class PlaceOrderSaga:
    @workflow.run
    async def run(self, customer_id: str, items: list, total: int):
        order_id = None
        reservation_id = None
        payment_id = None
 
        try:
            # Step 1: Create order
            order_id = await workflow.execute_activity(
                create_order, args=[customer_id, items, total],
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(maximum_attempts=3),
            )
 
            # Step 2: Reserve inventory
            reservation_id = await workflow.execute_activity(
                reserve_inventory, args=[order_id, items],
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(maximum_attempts=5),
            )
 
            # Step 3: Charge payment
            payment_id = await workflow.execute_activity(
                charge_payment, args=[order_id, total],
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(maximum_attempts=3),
            )
 
            # Step 4 (pivot): Schedule shipping — no compensation after this
            await workflow.execute_activity(
                schedule_shipping, args=[order_id],
                start_to_close_timeout=timedelta(seconds=60),
                retry_policy=RetryPolicy(maximum_attempts=10),
            )
 
            return {"order_id": order_id, "status": "completed"}
 
        except Exception as e:
            # Compensate in reverse order
            workflow.logger.error(f"Saga failed: {e}, compensating")
 
            if payment_id:
                await workflow.execute_activity(
                    refund_payment, args=[payment_id],
                    start_to_close_timeout=timedelta(seconds=30),
                )
 
            if reservation_id:
                await workflow.execute_activity(
                    release_inventory, args=[reservation_id],
                    start_to_close_timeout=timedelta(seconds=30),
                )
 
            if order_id:
                await workflow.execute_activity(
                    cancel_order, args=[order_id],
                    start_to_close_timeout=timedelta(seconds=30),
                )
 
            return {"order_id": order_id, "status": "cancelled", "reason": str(e)}
 
 
# === Worker setup ===
 
async def run_worker():
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="order-saga",
        workflows=[PlaceOrderSaga],
        activities=[create_order, cancel_order, reserve_inventory,
                    release_inventory, charge_payment, refund_payment,
                    schedule_shipping],
    )
    await worker.run()
 
 
# === Trigger saga ===
 
async def start_saga(customer_id, items, total):
    client = await Client.connect("localhost:7233")
    handle = await client.start_workflow(
        PlaceOrderSaga.run,
        args=[customer_id, items, total],
        # Note: a timestamp-based id defeats Temporal's duplicate-workflow
        # protection; prefer a deterministic id (e.g. a client request id).
        id=f"order-saga-{customer_id}-{int(time.time())}",
        task_queue="order-saga",
    )
    return await handle.result()

7. System Design Diagrams

7.1 Outbox Pattern — Big Picture

flowchart TB
    Client[Client] -->|POST /orders| App[Application]

    subgraph Atomic["Single DB Transaction"]
        OrderTbl[(orders table)]
        OutboxTbl[(outbox table)]
    end

    App --> OrderTbl
    App --> OutboxTbl

    subgraph Relay["Outbox Relay"]
        Poller[Polling Worker<br/>or Debezium CDC]
    end

    OutboxTbl -.read.-> Poller
    Poller -->|publish| Kafka[(Kafka)]

    Kafka --> Inv[Inventory Service]
    Kafka --> Notif[Notification Service]
    Kafka --> Analytics[Analytics Service]

    style Atomic fill:#e1f5fe,stroke:#01579b
    style Relay fill:#fff9c4,stroke:#f57f17

7.2 Outbox vs Naive Dual-Write

flowchart LR
    subgraph Bad["NAIVE (broken)"]
        direction TB
        B1[BEGIN tx] --> B2[INSERT order]
        B2 --> B3[COMMIT]
        B3 --> B4[kafka.send<br/>⚡ may fail ⚡]
        B4 --> B5[Inconsistent!]
    end

    subgraph Good["OUTBOX (correct)"]
        direction TB
        G1[BEGIN tx] --> G2[INSERT order]
        G2 --> G3[INSERT outbox]
        G3 --> G4[COMMIT atomic]
        G4 --> G5[Relay reads outbox<br/>publishes to Kafka<br/>retries until success]
        G5 --> G6[Consistent ✓]
    end

    style Bad fill:#ffcdd2,color:#000
    style Good fill:#c8e6c9,color:#000

7.3 CDC with Debezium

sequenceDiagram
    participant App
    participant PG as PostgreSQL
    participant WAL as PG WAL
    participant Deb as Debezium
    participant K as Kafka
    participant C as Consumer

    App->>PG: BEGIN
    App->>PG: INSERT orders
    App->>PG: INSERT outbox
    App->>PG: COMMIT
    PG->>WAL: write WAL records

    Deb->>WAL: tail (logical replication)
    WAL-->>Deb: change event
    Deb->>Deb: filter outbox table<br/>apply EventRouter SMT
    Deb->>K: publish to events.Order

    K-->>C: deliver event
    C->>C: process (idempotent)

7.4 Saga Choreography

flowchart LR
    O[Order Service] -->|OrderCreated| K1[(Kafka: events.Order)]
    K1 --> I[Inventory Service]
    K1 --> P[Payment Service]
    I -->|ItemsReserved| K2[(Kafka: events.Inventory)]
    P -->|PaymentCharged| K3[(Kafka: events.Payment)]
    K2 --> S[Shipping Service]
    K3 --> S
    S -->|OrderShipped| K4[(Kafka: events.Shipping)]
    K4 --> O

    style O fill:#bbdefb
    style I fill:#c8e6c9
    style P fill:#fff9c4
    style S fill:#ffe0b2

7.5 Saga Orchestration

sequenceDiagram
    participant C as Client
    participant Orch as Saga Orchestrator
    participant O as Order Service
    participant I as Inventory Service
    participant P as Payment Service
    participant S as Shipping Service

    C->>Orch: PlaceOrder(customer, items, total)

    Orch->>O: createOrder()
    O-->>Orch: order_id

    Orch->>I: reserveItems(order_id, items)
    I-->>Orch: reservation_id

    Orch->>P: chargePayment(order_id, total)
    alt payment success
        P-->>Orch: payment_id
        Orch->>S: scheduleShipping(order_id)
        S-->>Orch: tracking_id
        Orch-->>C: completed
    else payment fail
        P-->>Orch: error
        Note over Orch: Compensate in reverse
        Orch->>I: releaseItems(reservation_id)
        Orch->>O: cancelOrder(order_id)
        Orch-->>C: cancelled
    end

7.6 Inbox Pattern (Consumer Idempotency)

flowchart TB
    K[(Kafka)] --> C[Consumer]

    subgraph TX["Single DB Transaction"]
        Inbox[(inbox table)]
        Business[(business tables)]
    end

    C -->|"INSERT inbox<br/>(event_id)"| Inbox
    Inbox -->|"if duplicate<br/>UNIQUE error"| Skip[Skip — already processed]
    Inbox -->|"if new"| Business
    Business --> Commit[COMMIT]

    style TX fill:#e1f5fe
    style Skip fill:#ffe0b2
    style Commit fill:#c8e6c9

8. Aha Moments & Pitfalls

Aha Moments

#1: The dual-write problem has no fix without a pattern. 2PC between Postgres and Kafka is not viable in production. Outbox turns 2 writes into 1 atomic write + 1 async publish.

#2: Outbox is not hard to implement: it is just one extra INSERT in the same transaction. The hard part is operationalizing it: relay HA, lag monitoring, cleanup, idempotent consumers.

#3: Debezium = Outbox in CDC mode. You do not need to hand-code a polling worker: Debezium reads the WAL in real time and routes events by the aggregate_type field.

#4: An idempotent consumer is a MUST. Outbox gives at-least-once delivery, so consumers will receive duplicates and must handle them. The common pattern: an inbox table with a UNIQUE event_id.

#5: Saga ≠ distributed transaction. A saga does not provide ACID; intermediate states are visible (e.g., the order exists and inventory is reserved, but payment has not been charged yet). The UI must be designed for this.

#6: Choreography vs orchestration is not binary. A hybrid works: an orchestrator for the main flow, choreography for side effects (notifications, analytics).

#7: Compensations must be idempotent. Refunding twice loses money; cancelling an order twice is harmless. Design every compensation operation to be idempotent (UPDATE ... WHERE status = ...).

#8: The pivot transaction is the point of no return. Once shipped, you cannot undo. Identify the pivot explicitly when designing a saga.
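The compensable/pivot split in #7 and #8 can be written down as data. A minimal sketch (the Step class and its kind labels are illustrative, not a real library) mirroring PlaceOrderSaga above:

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    kind: str  # "compensable" (has an undo) or "pivot" (point of no return)

# Mirrors PlaceOrderSaga above: every step before the pivot must be undoable.
saga = [
    Step("create_order", "compensable"),       # undo: cancel_order
    Step("reserve_inventory", "compensable"),  # undo: release_inventory
    Step("charge_payment", "compensable"),     # undo: refund_payment
    Step("schedule_shipping", "pivot"),        # no undo; retried until it succeeds
]

pivot = next(i for i, s in enumerate(saga) if s.kind == "pivot")
# Design check: everything before the pivot has a compensation.
assert all(s.kind == "compensable" for s in saga[:pivot])
```

Writing the step table out like this makes the pivot explicit in code review, before anyone argues about it in an incident.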

Pitfalls — Common Mistakes

Pitfall 1: Sending the Kafka message, then committing the DB

# BAD
kafka.send(...)  # Sent before commit
db.commit()       # Commit may fail
 
# GOOD: outbox
db.execute("INSERT INTO outbox ...")
db.commit()
# Relay publishes
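To make the contrast concrete, here is a runnable sketch of the outbox write path, using sqlite3 as a stand-in for Postgres (table shapes simplified from the schema above): the business row and the outbox row commit or roll back together.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total INT)")
conn.execute("CREATE TABLE outbox (id TEXT PRIMARY KEY, event_type TEXT, payload TEXT)")

def create_order(total: int) -> str:
    order_id = str(uuid.uuid4())
    with conn:  # one atomic transaction covers both INSERTs
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "OrderCreated",
             json.dumps({"id": order_id, "total": total})),
        )
    return order_id

create_order(500_000)
orders = conn.execute("SELECT count(*) FROM orders").fetchone()[0]
events = conn.execute("SELECT count(*) FROM outbox").fetchone()[0]
assert orders == events == 1  # never a business row without its event, or vice versa
```

If either INSERT fails, `with conn:` rolls back both, so kịch bản A and B from the table above simply cannot happen at the write side.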

Pitfall 2: No index on the outbox table

The relay queries WHERE published_at IS NULL; without a partial index that is a full table scan, so polling slows down and the backlog grows.

-- Fix
CREATE INDEX idx_outbox_unpublished ON outbox (created_at) WHERE published_at IS NULL;

Pitfall 3: Hoping for 2PC with Kafka

Kafka does not support XA. Do not try to build 2PC across Postgres and Kafka; use the outbox.

Pitfall 4: Forgetting cleanup

The outbox grows without bound → 100M rows → slow queries → the relay falls behind.

-- Fix: daily cleanup job
DELETE FROM outbox WHERE published_at < NOW() - INTERVAL '7 days';

Pitfall 5: Non-idempotent consumer

A network glitch makes the relay republish → duplicates. The consumer processes the event twice → double-charges the customer.

# Fix: inbox pattern
try:
    db.execute("INSERT INTO inbox (event_id) VALUES (%s)", event['id'])
except UniqueViolation:
    return  # Skip
process(event)
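The snippet above is pseudocode; a runnable sketch of the same inbox pattern, with sqlite3 standing in for Postgres and a hypothetical balance update as the business write:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inbox (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE balances (customer TEXT PRIMARY KEY, amount INT)")
conn.execute("INSERT INTO balances VALUES ('cust-1', 0)")

def handle(event: dict) -> bool:
    """Apply the event exactly once; return False on a duplicate delivery."""
    try:
        with conn:  # inbox insert + business write in one transaction
            conn.execute("INSERT INTO inbox VALUES (?)", (event["id"],))
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE customer = ?",
                (event["total"], event["customer"]),
            )
        return True
    except sqlite3.IntegrityError:  # UNIQUE violation on event_id
        return False

evt = {"id": "evt-1", "customer": "cust-1", "total": 500}
first, second = handle(evt), handle(evt)
balance = conn.execute(
    "SELECT amount FROM balances WHERE customer = 'cust-1'").fetchone()[0]
assert (first, second, balance) == (True, False, 500)
```

The key point: the inbox insert and the business write share one transaction, so a crash between them cannot leave the event half-applied.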

Pitfall 6: Non-idempotent saga compensation

# BAD
def refund(payment_id):
    payment = get(payment_id)
    refund_amount(payment.amount)  # Called twice → 2x refund!
 
# GOOD: conditional state transition guards the side effect
def refund(payment_id):
    rowcount = db.execute("""
        UPDATE payments SET status='refunded', refunded_at=NOW()
        WHERE id = %s AND status = 'charged'
    """, payment_id)
    if rowcount == 0:
        return  # Already refunded, skip
    payment = get(payment_id)
    refund_amount(payment.amount)

Pitfall 7: Using choreography for a complex flow

Six services, each publishing events and listening to 2-3 others → event spaghetti. Nobody can tell what state an order is in.

Fix: switch to orchestration (Temporal, Step Functions) for complex flows. Choreography fits flows of ≤ 3 services, or pure side effects.

Pitfall 8: Outbox publishing loses ordering

Neither default polling nor Debezium guarantees per-aggregate order if events are published in parallel. An Inventory event arriving before its Order event confuses the consumer.

Fix: use aggregate_id as the Kafka partition key → all events of the same aggregate go to the same partition → ordered.
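Why keying by aggregate_id works: the partitioner hashes the message key, so the same key always maps to the same partition. A small simulation of that idea (Kafka's default partitioner actually uses murmur2; md5 here just illustrates the principle):

```python
import hashlib

def partition_for(aggregate_id: str, num_partitions: int = 12) -> int:
    # Deterministic hash of the key: the same aggregate always
    # maps to the same partition.
    digest = hashlib.md5(aggregate_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

events = [
    ("order-42", "OrderCreated"),
    ("order-7", "OrderCreated"),
    ("order-42", "OrderPaid"),
    ("order-42", "OrderShipped"),
]
partitions = [partition_for(agg) for agg, _ in events]
# All order-42 events share one partition, so consumers see them in order.
assert partitions[0] == partitions[2] == partitions[3]
```

This is exactly what the relay above does with `key=row['aggregate_id'].encode()`: Kafka guarantees order only within a partition, so the key choice is the ordering guarantee.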

Pitfall 9: WAL slot bloat

If the Debezium connector is down for a long time, its replication slot lags, Postgres keeps retaining WAL files, the disk fills up, and the DB can crash.

Fix: monitor replication slot lag (e.g., pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) from pg_replication_slots). Alert when lag exceeds ~1GB. Dropping the slot is acceptable only if losing the CDC position is acceptable (events can be resynced from the outbox table).

Pitfall 10: Outbox event payload changes

Schema evolution: renaming or removing a field in an event breaks existing consumers (purely additive changes are usually safe, provided consumers tolerate unknown fields).

Fix: use a schema registry (Confluent Schema Registry, Apicurio) and keep schemas backward- and forward-compatible. Avro or Protobuf help enforce this.
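Even with a registry, the consumer side should be a "tolerant reader": read required fields directly, and give newer optional fields defaults so old events still parse. A sketch (field names illustrative, not tied to a real registry schema):

```python
def parse_order_event(payload: dict) -> dict:
    # Tolerant reader: required fields read directly; fields added in
    # later schema versions fall back to defaults for old events.
    return {
        "order_id": payload["id"],
        "total": payload["total"],
        "currency": payload.get("currency", "VND"),  # added in a later version
        "coupon": payload.get("coupon"),             # optional, may be absent
    }

v1 = {"id": "o-1", "total": 500000}                                   # old event
v2 = {"id": "o-2", "total": 99000, "currency": "USD", "coupon": "SALE10"}
assert parse_order_event(v1)["currency"] == "VND"
assert parse_order_event(v2)["coupon"] == "SALE10"
```

This mirrors what Avro's reader-schema defaults do for you automatically; hand-written JSON consumers have to apply the same discipline themselves.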


Outbox & Saga across the weeks

| Week | Connection |
|---|---|
| Tuan-08-Message-Queue | Kafka transactional producer is an alternative to outbox; ISR replication |
| Tuan-11-Microservices-Pattern | Saga is a core microservice pattern; Outbox enables event-driven design |
| Tuan-Bonus-Consensus-Raft-Paxos | Temporal uses Cassandra/Postgres + leader election (not Raft) |
| Tuan-Bonus-Consistency-Models-Isolation | Outbox = one atomic single-DB transaction; consumers need idempotency |
| Case-Design-Payment-System | Payments use outbox + saga for the refund flow |
| Case-Design-Digital-Wallet | Wallet updates + event sourcing (similar to outbox) |
| Case-Design-Hotel-Reservation-System | Booking saga: reserve room → charge → confirm |
| Case-Design-Distributed-Message-Queue | Kafka transactional API |
| Tuan-13-Monitoring-Observability | Monitor outbox lag, in-flight sagas, compensation rate |



This completes the Batch A bonus chapters: Tuan-Bonus-Consensus-Raft-Paxos · Tuan-Bonus-Consistency-Models-Isolation · Tuan-Bonus-Outbox-Pattern

Back to: Tuan-11-Microservices-Pattern to apply these patterns in microservice design.