Bonus Week: Progressive Delivery — Argo Rollouts, Flagger, Feature Flags

“Traditional CI/CD: deploy → 100% of users → if bug → roll back everything. Progressive Delivery: deploy → 1% of users → measure → 5% → measure → 25% → 100%. Bug detected? Automatic rollback at 5%. This is the evolution from ‘deploy and hope’ to ‘deploy and verify’.”

Tags: system-design cicd progressive-delivery canary feature-flags bonus
Student: Hieu (Backend Dev → Architect)
Prerequisite: Tuan-12-CICD-Pipeline · Tuan-13-Monitoring-Observability
Related: Tuan-11-Microservices-Pattern · Tuan-Bonus-Platform-Engineering-IDP


1. Context & Why

Everyday analogy — Launching a new recipe

Hieu, imagine you run a chain of 100 phở restaurants and want to roll out a new recipe:

Option 1 — Big bang launch (traditional CI/CD):

  • Day 1: Switch the recipe at all 100 restaurants at once
  • Customers complain: “this phở is terrible!”
  • All 100 restaurants lose revenue on the same day
  • Revert to the old recipe → 200 restaurant-days affected

Option 2 — Progressive launch (Progressive Delivery):

  • Day 1: Change the recipe at 1 restaurant
  • Measure: revenue, complaints, NPS
  • Day 2: If OK → 5 restaurants
  • Day 3: If still OK → 25 restaurants
  • Day 7: 100% if the metrics look good
  • Bug found at 5 restaurants → roll back only those 5

Progressive Delivery = Continuous Deployment + Risk Reduction + Automated Decisions.

Why should a Backend Dev care?

| Reason | Consequence without it |
| --- | --- |
| Reduce blast radius | Bug hits 100% of users instead of 1% |
| Automated rollback | Detect + rollback in seconds vs hours |
| Real-world validation | Pre-prod tests miss issues that 1% of prod traffic reveals |
| Decouple deploy from release | Deploy code without enabling the feature yet |
| DORA metrics | Higher deploy frequency, lower MTTR |
| A/B testing built-in | Validate hypotheses with traffic subsets |

Why doesn't Alex Xu go deeper?

Alex Xu Vol 1+2 cover basic CI/CD (blue-green, canary) but not automated analysis, feature flag platforms, or progressive rollout patterns. These are a 2020+ evolution.



2. Deep Dive — Core Concepts

2.1 Deployment Strategies Spectrum

                        Risk    Speed   Complexity
Big Bang               ────█    █───   ─
Blue-Green             ──██     ██──   ██
Rolling                ─███     ██──   ██
Canary (manual)        ████     ███─   ███
Canary (automated)     ████     ████   ████
A/B Testing            ████     ████   █████
Shadow / Mirror        █████    ─███   █████

2.1.1 Big Bang

v1: 100% → v2: 100% (instant cutover)

When: stateful systems where running two versions side by side is hard (e.g., some DB schema migrations). Avoid for: stateless services.

2.1.2 Blue-Green

Blue (v1) running, Green (v2) deployed
Switch load balancer: 100% Blue → 100% Green
Keep Blue for instant rollback

Pros: Instant rollback, zero downtime.
Cons: 2x infrastructure during deploy, no gradual validation.

2.1.3 Rolling Update (K8s default)

10 pods running v1
Replace 1 pod at a time:
  Pod 1: v1 → v2
  Pod 2: v1 → v2
  ...
At any time: mix of v1/v2 pods serving traffic

Pros: Default in K8s, gradual.
Cons: No automated analysis, all-or-nothing per pod.

2.1.4 Canary

v1: 99%, v2: 1%
v1: 95%, v2: 5%
v1: 75%, v2: 25%
v1: 50%, v2: 50%
v1: 0%, v2: 100%

Pros: Gradual exposure, can pause at any %.
Cons: Manual decisions; needs infrastructure (service mesh, ingress).

2.1.5 A/B Testing

Same as canary but split based on USER attributes:
  v1: existing users
  v2: new users in segment X

Measure business metrics (conversion, engagement)

Pros: Validate hypotheses with real users.
Cons: Needs an experimentation platform and statistical rigor.

2.1.6 Shadow / Mirror

v1: receives 100% traffic, returns response
v2: receives 100% mirrored traffic, response discarded
Compare v1 vs v2 outputs (correctness, performance)

Pros: Test against real traffic without user impact.
Cons: 2x infrastructure; side-effect concerns (mirrored requests must not trigger real writes, emails, payments).

2.2 Argo Rollouts — Kubernetes-native Progressive Delivery

Argo Rollouts replaces the standard K8s Deployment with a Rollout CRD and supports canary + blue-green strategies with automated analysis.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 10
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: myorg/payment:v2.0.0
          ports: [{ containerPort: 8080 }]
 
  strategy:
    canary:
      steps:
        - setWeight: 5     # 5% traffic to v2
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 25    # 25%
        - pause: { duration: 30m }
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency
        - setWeight: 50
        - pause: { duration: 30m }
        - setWeight: 100   # full rollout
 
      canaryService: payment-canary    # Routes v2 traffic
      stableService: payment-stable    # Routes v1 traffic
 
      trafficRouting:
        istio:
          virtualService:
            name: payment-vs
            routes: [primary]

Steps execution:

  1. Deploy 1 canary pod with v2
  2. Route 5% traffic to canary
  3. Pause 10 minutes
  4. Run analysis (query Prometheus)
  5. If pass → next step; if fail → rollback automatically
  6. Continue progression

2.3 AnalysisTemplate — Automated Verification

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status!~"5..",
              version="canary"
            }[5m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              version="canary"
            }[5m]))
 
    - name: p99-latency
      interval: 1m
      successCondition: result[0] <= 0.5
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service-name}}",
                version="canary"
              }[5m])) by (le)
            )

Decision logic:

  • Run the query every interval (1m here)
  • If failed measurements exceed failureLimit → abort the rollout and trigger rollback
  • Otherwise the analysis passes and the rollout progresses to the next step
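This failure-budget logic is simple enough to sketch. The class below is a toy model with invented names, not Argo's actual implementation (Argo's exact semantics are in its docs):

```python
from dataclasses import dataclass

@dataclass
class AnalysisRun:
    """Toy model of a metric-gated analysis run."""
    failure_limit: int = 3
    failed: int = 0

    def record(self, measurement_ok: bool) -> None:
        if not measurement_ok:
            self.failed += 1

    @property
    def aborted(self) -> bool:
        # Simplified rule: abort once the failure budget is used up
        return self.failed >= self.failure_limit

run = AnalysisRun(failure_limit=3)
for ok in [True, False, True, False, False]:  # 3 failed measurements
    run.record(ok)

print(run.aborted)  # True: failure budget exhausted, rollback triggered
```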

2.4 Flagger — Service Mesh-native

Flagger (Weaveworks/Flux): Similar concept, integrates with Istio/Linkerd/AWS App Mesh natively.

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment
 
  service:
    port: 80
    targetPort: 8080
    gateways: [public-gateway]
 
  analysis:
    interval: 1m
    threshold: 5     # max 5 failed checks
    maxWeight: 50
    stepWeight: 10
 
    metrics:
      - name: request-success-rate
        thresholdRange: { min: 99 }
        interval: 1m
 
      - name: request-duration
        thresholdRange: { max: 500 }    # ms
        interval: 30s
 
    webhooks:
      - name: smoke-test
        type: pre-rollout
        url: http://test-runner/run-smoke
 
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://payment-canary/"

Differences from Argo Rollouts:

  • Flagger more integrated with service mesh
  • Argo Rollouts more flexible (custom analysis providers)
  • Both production-grade

2.5 Feature Flags — Decouple Deploy from Release

Concept: Deploy code with feature OFF. Enable feature later via flag.

# Without feature flag
def get_recommendations(user_id):
    return new_ml_algorithm(user_id)  # Released to all immediately
 
# With feature flag
def get_recommendations(user_id):
    if feature_flag.enabled("new_recommendations", user_id=user_id):
        return new_ml_algorithm(user_id)
    return old_algorithm(user_id)

Use cases:

  1. Gradual rollout: 1% → 10% → 100%
  2. Kill switch: Disable broken feature instantly
  3. A/B testing: Compare variants
  4. Targeted rollout: Enable for specific users (beta testers)
  5. Trunk-based development: Merge unfinished features behind flag

2.6 Feature Flag Platforms

| Tool | Type | Best for |
| --- | --- | --- |
| LaunchDarkly | Commercial SaaS | Enterprise, full features |
| Statsig | SaaS, freemium | Experimentation focus |
| Unleash | OSS + cloud | Self-host preference |
| Flagsmith | OSS + cloud | Privacy-conscious teams |
| OpenFeature (CNCF) | Vendor-neutral spec | Avoiding lock-in |
| Cloudflare Workers KV | DIY simple flags | Existing CF infra |
| Custom DB | Self-built | Small scale, simple needs |

2.7 OpenFeature — Standardization

OpenFeature (CNCF): Vendor-neutral feature flag SDK.

# Python OpenFeature SDK
from openfeature import api
from openfeature.evaluation_context import EvaluationContext
from openfeature.contrib.provider.flagsmith import FlagsmithProvider
 
# Configure provider (swappable)
api.set_provider(FlagsmithProvider(env_key="..."))
 
# Use anywhere
client = api.get_client("my-app")
 
ctx = EvaluationContext(
    targeting_key=user_id,
    attributes={"country": "VN"},
)
if client.get_boolean_value("new_checkout", False, ctx):
    new_checkout_flow()
else:
    old_checkout_flow()

Magic: switching from LaunchDarkly to Unleash means changing only the provider; no application code changes.

2.8 Targeting & Rollout Strategies

2.8.1 Percentage rollout

flag: new_checkout
default: false
targets:
  - rule: percentage(10)  # 10% random users
    value: true

2.8.2 Attribute-based

flag: premium_features
default: false
targets:
  - rule: user.plan == "enterprise"
    value: true
  - rule: user.country in ["US", "VN", "JP"]
    value: true
  - rule: user.id in ["beta-tester-1", "beta-tester-2"]
    value: true

2.8.3 Context-based

flag: heavy_query_optimization
targets:
  - rule: request.path == "/reports"
    value: true
  - rule: request.method == "POST" && request.size > 1MB
    value: true

2.8.4 Sticky bucketing

Important: User in 10% bucket today should stay in 10% tomorrow (consistent UX).

import mmh3  # MurmurHash3: fast and deterministic across processes/languages

# Hash user_id deterministically so bucket assignment is sticky
def in_bucket(user_id: str, percentage: int) -> bool:
    hash_val = mmh3.hash(f"new_checkout:{user_id}") % 100
    return hash_val < percentage

2.9 Experimentation (A/B Testing)

Beyond rollout: measure business metrics.

from statsig import statsig, StatsigUser, StatsigEvent
 
statsig.initialize("...")
user = StatsigUser(user_id)
 
# Get experiment variant (the Statsig server SDK takes the user first)
config = statsig.get_experiment(user, "new_checkout_design")
button_color = config.get("button_color", "blue")
button_text = config.get("button_text", "Buy Now")
 
# Render UI with variant
render_button(color=button_color, text=button_text)
 
# Log conversion event
if user_purchased:
    statsig.log_event(StatsigEvent(user, "purchase", value=order_total))

Experiment platform computes:

  • Sample size per variant
  • Statistical significance
  • Lift in conversion
  • Confidence intervals

Tools: Statsig, LaunchDarkly Experiments, Eppo, Amplitude Experiment.
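The core significance check for a two-variant experiment is a two-proportion z-test; a minimal sketch (sample numbers are invented):

```python
import math

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# 10,000 users per variant: control converts 10.0%, treatment 11.0%
z = two_proportion_ztest(1000, 10_000, 1100, 10_000)
print(round(z, 2))  # 2.31, and |z| > 1.96 means significant at the 95% level
```

With a much smaller sample (say 50 users per variant), the same lift gives |z| well below 1.96, which is exactly the trap described in Pitfall 6.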

2.10 Rollback Strategies

2.10.1 Automated rollback (Argo Rollouts/Flagger)

analysis:
  failureLimit: 3   # abort after 3 failed measurements
  inconclusiveLimit: 5

When abort triggered:

  1. Stop progression
  2. Route 100% traffic back to stable
  3. Notify team
  4. Keep canary pod for debugging (configurable)

2.10.2 Feature flag kill switch

# Bug detected → disable feature instantly
# (pseudocode; in practice a call to the flag platform's admin API or UI)
launchdarkly.update_flag("new_checkout", enabled=False)
# Effect propagates within seconds, no deploy needed

Power: Sub-30-second rollback vs 10-minute deploy rollback.

2.10.3 Database rollback considerations

Problem: Rolling back code is easy, rolling back DB schema is hard.

Pattern: Expand-Contract:

Migration v1 → v2:
  Phase 1 (Expand): Add new column, code writes to both old & new
  Phase 2: Backfill new column from old
  Phase 3: Code reads from new
  Phase 4 (Contract): Remove old column

Each phase deployable & rollback-safe.
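A runnable sketch of the four phases using SQLite (table and column names are invented; DROP COLUMN needs SQLite 3.35+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('An'), ('Binh')")

# Phase 1 (Expand): add the new column; old code keeps working untouched
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# New code dual-writes to both columns while both exist
conn.execute("INSERT INTO users (fullname, display_name) VALUES ('Chi', 'Chi')")

# Phase 2: backfill rows written by old code (idempotent, rollback-safe)
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Phase 3 (code-only change): reads switch to the new column
rows = [r[0] for r in conn.execute("SELECT display_name FROM users ORDER BY id")]
print(rows)  # ['An', 'Binh', 'Chi']

# Phase 4 (Contract): drop the old column only after nothing reads or writes it
conn.execute("ALTER TABLE users DROP COLUMN fullname")
```

At every point in between, both the old and the new code version can run against the schema, which is what makes each deploy rollback-safe.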

2.11 DORA Metrics

4 key metrics measure DevOps performance:

| Metric | Elite performers |
| --- | --- |
| Deploy frequency | On-demand (multiple per day) |
| Lead time for changes | < 1 day from commit to prod |
| Change failure rate | 0-15% |
| MTTR | < 1 hour |

Progressive Delivery improves all 4:

  • Deploy frequency ↑ (lower risk per deploy)
  • Lead time ↓ (auto-pipeline)
  • Change failure rate ↓ (catch issues at 5%)
  • MTTR ↓ (auto-rollback)

3. Estimation

3.1 Time saved by automation

Manual canary (without Argo Rollouts):

  • Engineer monitors metrics manually: 2h per deploy
  • 10 deploys/week × 2h × 5 engineers = 100h/week
  • $200K/year just monitoring time

Automated (Argo Rollouts):

  • Engineer reviews dashboard occasionally: 0.5h/deploy
  • Saves 75h/week = $150K/year
  • Plus prevents bad deploys (avg incident cost $50K)
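Making the arithmetic above explicit (the $40/h rate is an assumption chosen to match the ≈$200K/year figure):

```python
DEPLOYS_PER_WEEK = 10
ENGINEERS = 5
HOURLY_RATE = 40      # assumed $/h, so 100 h/week ≈ $200K/year
WEEKS_PER_YEAR = 50

manual_hours = 2.0 * DEPLOYS_PER_WEEK * ENGINEERS  # 100 h/week of manual watching
auto_hours = 0.5 * DEPLOYS_PER_WEEK * ENGINEERS    # 25 h/week of dashboard review
saved_hours = manual_hours - auto_hours

yearly_savings = saved_hours * HOURLY_RATE * WEEKS_PER_YEAR
print(saved_hours, yearly_savings)  # 75.0 150000.0
```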

3.2 Risk reduction

Without progressive delivery:

  • 1% bad deploy rate × 10 deploys/week × 50 weeks = 5 incidents/year
  • Average impact: 30 min × 100K users × $X cost = $$$

With progressive delivery:

  • Same bad deploy rate but caught at 5% traffic
  • Impact: 5 min × 5K users = 95% reduction

3.3 Feature flag overhead

  • ~5-10% perf overhead from flag evaluation
  • Mitigations: SDK cache, edge evaluation
  • Cost: commercial SaaS (LaunchDarkly etc.) is priced per seat/MAU; self-hosting (e.g., Unleash) has $0 license cost

4. Security First

4.1 Flag evaluation auth

Threat: Attacker manipulates flag → enable hidden feature.

Mitigations:

  • API key per environment
  • Restrict who can change flags (RBAC)
  • Audit log every flag change
  • Use cryptographic verification (signed flags)

4.2 Sensitive data in flag context

Don’t include PII in flag context (cached, logged):

# BAD: PII ends up in SDK caches and logs
client.get_boolean_value("flag", context={"email": "user@example.com"})
 
# GOOD: stable pseudonymous ID (Python's builtin hash() is salted per process, so use hashlib)
client.get_boolean_value("flag", context={"user_id": hashlib.sha256(email.encode()).hexdigest()})

4.3 Flag debt

Problem: Flags accumulate over time (100s of dead flags).

Risks:

  • Old flags = old code paths = potential bugs
  • Hard to reason about behavior
  • Audit nightmare

Solution: Flag lifecycle management

  • Tag flags with owner, created_at, sunset_date
  • Auto-alert when flag > 90 days old
  • Quarterly cleanup ritual

4.4 Canary security testing

Run security scans on canary before promotion:

  • DAST (OWASP ZAP)
  • Container scan (Trivy)
  • API contract test

Fail rollout if security regression.


5. DevOps — Implementation

5.1 Argo Rollouts setup

# Install
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f \
  https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
 
# Install kubectl plugin
brew install argoproj/tap/kubectl-argo-rollouts
 
# Use
kubectl argo rollouts get rollout payment-service
kubectl argo rollouts pause payment-service
kubectl argo rollouts promote payment-service
kubectl argo rollouts abort payment-service

5.2 Service mesh setup (Istio)

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-vs
spec:
  hosts: [payment]
  http:
    - name: primary
      route:
        - destination:
            host: payment
            subset: stable
          weight: 100
        - destination:
            host: payment
            subset: canary
          weight: 0     # Argo Rollouts updates this
 
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment
spec:
  host: payment
  subsets:
    - name: stable
      labels: { version: stable }
    - name: canary
      labels: { version: canary }

5.3 Feature flag (Unleash) setup

# docker-compose
services:
  unleash:
    image: unleashorg/unleash-server
    environment:
      DATABASE_URL: "postgres://unleash:password@db/unleash"
    ports: ["4242:4242"]
 
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: unleash
      POSTGRES_PASSWORD: password
      POSTGRES_DB: unleash

5.4 Application integration

from UnleashClient import UnleashClient
 
client = UnleashClient(
    url="http://unleash:4242/api/",
    app_name="my-app",
    custom_headers={"Authorization": API_TOKEN}
)
client.initialize_client()
 
 
def get_recommendations(user_id):
    if client.is_enabled(
        "new_recommendations",
        context={"userId": user_id}
    ):
        return new_algorithm(user_id)
    return old_algorithm(user_id)

5.5 Monitoring

groups:
  - name: progressive_delivery
    rules:
      - alert: RolloutAborted
        expr: rollout_phase{phase="Aborted"} == 1
        for: 1m
        annotations:
          summary: "Rollout {{ $labels.name }} aborted automatically"
 
      - alert: RolloutPaused
        expr: rollout_phase{phase="Paused"} == 1
        for: 1h
        annotations:
          summary: "Rollout {{ $labels.name }} paused > 1h"
 
      - alert: HighFlagEvaluationLatency
        expr: |
          histogram_quantile(0.99, rate(flag_eval_duration_bucket[5m])) > 0.05
        annotations:
          summary: "Flag SDK P99 > 50ms"
 
      - alert: TooManyOldFlags
        expr: count(flag_age_days > 90) > 50
        annotations:
          summary: "{{ $value }} flags > 90 days old. Cleanup needed."

6. Code Implementation

6.1 Custom feature flag service

"""
Lightweight feature flag service.
For when LaunchDarkly is overkill.
"""
 
import hashlib
import json
from typing import Any
import redis
 
 
class FeatureFlags:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
 
    def is_enabled(self, flag: str, user_id: str = None,
                   context: dict = None) -> bool:
        """Check if flag enabled for user/context."""
        rules = self._get_rules(flag)
        if not rules:
            return False
 
        if rules.get("enabled") is False:
            return False  # Kill switch
 
        # Check user-specific overrides
        if user_id and user_id in rules.get("enabled_users", []):
            return True
 
        # Check segment rules
        if context:
            for segment in rules.get("segments", []):
                if self._matches_segment(context, segment):
                    return True
 
        # Percentage rollout (sticky)
        rollout = rules.get("rollout_percentage", 0)
        if rollout > 0 and user_id:
            return self._in_bucket(user_id, flag, rollout)
 
        return False
 
    def _get_rules(self, flag: str):
        data = self.redis.get(f"flag:{flag}")
        return json.loads(data) if data else None
 
    def _matches_segment(self, context: dict, segment: dict) -> bool:
        for key, expected in segment.items():
            if context.get(key) != expected:
                return False
        return True
 
    def _in_bucket(self, user_id: str, flag: str, percentage: int) -> bool:
        # Deterministic hashing for sticky assignment
        h = hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest()
        bucket = int(h[:8], 16) % 100
        return bucket < percentage
 
    def update_flag(self, flag: str, rules: dict):
        """Admin API: update flag rules."""
        self.redis.set(f"flag:{flag}", json.dumps(rules))
        # TTL not set = persistent
 
 
# Usage
ff = FeatureFlags(redis.Redis())
 
# Set up rule
ff.update_flag("new_checkout", {
    "enabled": True,
    "rollout_percentage": 25,
    "enabled_users": ["beta-1", "beta-2"],
    "segments": [
        {"plan": "enterprise"},
        {"country": "VN"}
    ]
})
 
# Use
if ff.is_enabled("new_checkout", user_id="user-123",
                 context={"plan": "enterprise"}):
    new_checkout()
else:
    old_checkout()

6.2 Canary deployment manual orchestration

"""
Manual canary if not using Argo Rollouts.
"""
 
import time
import requests
from dataclasses import dataclass
 
 
@dataclass
class CanaryStep:
    weight: int
    duration_min: int
 
 
class CanaryOrchestrator:
    def __init__(self, prometheus_url: str, k8s_client):
        self.prom = prometheus_url
        self.k8s = k8s_client
 
    async def deploy(self, service: str, new_version: str,
                     steps: list[CanaryStep]):
        # 1. Deploy canary pod
        await self.k8s.apply_manifest({
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": f"{service}-canary"},
            "spec": {
                "replicas": 1,
                "template": {
                    "spec": {
                        "containers": [{
                            "image": f"myorg/{service}:{new_version}"
                        }]
                    }
                }
            }
        })
 
        # 2. Progressive rollout
        for step in steps:
            print(f"Routing {step.weight}% to canary...")
            await self._update_traffic_split(service, step.weight)
 
            # Wait out the step (time.sleep blocks the event loop; use asyncio.sleep in real async code)
            time.sleep(step.duration_min * 60)
 
            # Analyze
            success_rate = await self._query_prometheus(f"""
                sum(rate(http_requests_total{{
                    service="{service}",
                    version="canary",
                    status!~"5.."
                }}[5m])) /
                sum(rate(http_requests_total{{
                    service="{service}",
                    version="canary"
                }}[5m]))
            """)
 
            if success_rate < 0.95:
                print(f"FAILED at {step.weight}% — rolling back")
                await self._update_traffic_split(service, 0)
                await self.k8s.delete(f"{service}-canary")
                raise Exception("Canary failed analysis")
 
            print(f"OK at {step.weight}% — success_rate={success_rate}")
 
        # 3. Promote
        print("Promoting canary to stable")
        await self._promote(service, new_version)
        await self.k8s.delete(f"{service}-canary")
 
    async def _update_traffic_split(self, service, weight):
        # Update Istio VirtualService
        ...
 
    async def _query_prometheus(self, query):
        resp = requests.get(f"{self.prom}/api/v1/query", params={"query": query})
        return float(resp.json()["data"]["result"][0]["value"][1])

7. System Design Diagrams

7.1 Canary Rollout Flow

sequenceDiagram
    participant Dev
    participant CI
    participant Argo as Argo Rollouts
    participant Mesh as Istio
    participant Prom as Prometheus

    Dev->>CI: Push v2.0.0
    CI->>Argo: Update Rollout image
    Argo->>Argo: Spawn canary pod
    Argo->>Mesh: Set 5% traffic to canary

    Note over Argo: Wait 10min

    loop Every 1min
        Argo->>Prom: Query success_rate
        Prom-->>Argo: 0.97
        Note over Argo: ✓ pass
    end

    Argo->>Mesh: Set 25% traffic to canary

    Note over Argo: Wait 30min

    loop Every 1min
        Argo->>Prom: Query
        Prom-->>Argo: 0.92
        Note over Argo: ✗ fail (3 in row)
    end

    Argo->>Mesh: Rollback: 0% to canary
    Argo->>Argo: Delete canary pod
    Argo->>Dev: Slack alert: rollout aborted

7.2 Blue-Green vs Canary

flowchart TB
    subgraph BG["Blue-Green"]
        BGUser[Users]
        BGUser --> BGLB[Load Balancer]
        BGLB -->|100%| BGBlue[Blue v1]
        BGLB -.0%.-> BGGreen[Green v2 ready]

        Note1[Switch atomically:<br/>0% Blue → 100% Green]
    end

    subgraph Canary["Canary"]
        CUser[Users]
        CUser --> CMesh[Service Mesh]
        CMesh -->|95%| CStable[Stable v1]
        CMesh -->|5%| CCanary[Canary v2]

        Note2[Gradually shift weight:<br/>5% → 25% → 50% → 100%]
    end

    style Note1 fill:#fff9c4
    style Note2 fill:#c8e6c9

7.3 Feature Flag Decision Tree

flowchart TD
    Request[Request] --> Get[Get flag value]

    Get --> Cache{In SDK cache?}
    Cache -->|Yes, valid| Return[Return cached value]
    Cache -->|No| Fetch[Fetch from server]

    Fetch --> Eval{Evaluate rules}

    Eval --> Kill{Kill switch?}
    Kill -->|enabled=false| Default[Return default]

    Kill -->|enabled=true| User{User-specific override?}
    User -->|Yes| Override[Return override value]

    User -->|No| Segment{Match segment?}
    Segment -->|Yes| SegmentVal[Return segment value]

    Segment -->|No| Bucket{In rollout bucket?}
    Bucket -->|Yes| Enabled[Return true]
    Bucket -->|No| Default

    Return --> App[Application logic]
    Override --> App
    SegmentVal --> App
    Enabled --> App
    Default --> App

    style Default fill:#ffcdd2
    style Enabled fill:#c8e6c9

7.4 Decoupling Deploy from Release

gantt
    title Deploy vs Release Timeline
    dateFormat YYYY-MM-DD
    axisFormat %m-%d

    section Code
    Develop feature        :2026-01-01, 14d
    Code merged + deployed :milestone, 2026-01-15, 0d

    section Hidden
    Internal testing       :2026-01-15, 7d
    Beta users (1%)        :2026-01-22, 7d
    Wider beta (10%)       :2026-01-29, 14d

    section Released
    50% rollout            :2026-02-12, 7d
    100% rollout           :milestone, 2026-02-19, 0d
    Remove feature flag    :2026-03-19, 0d

    section Big Bang (legacy)
    Develop                :crit, 2026-01-01, 14d
    Release to all         :milestone, crit, 2026-01-15, 0d

8. Aha Moments & Pitfalls

Aha Moments

#1: Deploy ≠ Release. Code can be deployed but feature OFF. This decouples engineering velocity from product launch decisions.

#2: Automated analysis = unbiased decision. Humans biased toward “ship it”. Automated metrics-based rollout = objective gate.

#3: 5% canary catches 95% of bugs. Issues that don’t reproduce in staging often surface at 5% real traffic.

#4: Feature flag = kill switch. Production incident? Disable feature in 30 seconds, no deploy. Faster than rollback.

#5: Sticky bucketing matters for UX. User getting feature today, not tomorrow = bad. Hash deterministically.

#6: DORA metrics correlate with business. Higher deploy frequency + lower failure rate = better profitability (Accelerate research).

#7: Progressive delivery + feature flags = compound benefit. Combined: deploy continuously, release gradually, rollback instantly.

#8: Flag debt is real. 100+ stale flags = liability. Lifecycle management mandatory.

Pitfalls

Pitfall 1: No analysis on canary

Deploy 5% but no metrics check → just slow rollout. Fix: AnalysisTemplate with success_rate + latency.

Pitfall 2: Flag in tight loop

if flag.enabled() { ... } called 1000x/request → SDK overhead. Fix: Evaluate once per request, cache.
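One common fix is a request-scoped wrapper so hot loops hit a dict instead of the SDK. A sketch; the SDK below is a stand-in that just counts evaluations:

```python
class CountingSDK:
    """Stand-in for a real flag SDK; counts evaluations to show the effect."""
    calls = 0

    def is_enabled(self, flag: str, context: dict) -> bool:
        CountingSDK.calls += 1
        return True

class RequestScopedFlags:
    """Evaluate each flag at most once per request."""
    def __init__(self, sdk, user_id: str):
        self.sdk, self.user_id = sdk, user_id
        self._cache: dict[str, bool] = {}

    def enabled(self, flag: str) -> bool:
        if flag not in self._cache:
            self._cache[flag] = self.sdk.is_enabled(flag, {"userId": self.user_id})
        return self._cache[flag]

flags = RequestScopedFlags(CountingSDK(), "user-123")
for _ in range(1000):                     # hot loop inside a single request
    flags.enabled("new_recommendations")
print(CountingSDK.calls)  # 1
```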

Pitfall 3: Different bucket each visit

User in 10% today, 50% tomorrow → confusing UX. Fix: Sticky bucketing via hash(user_id).

Pitfall 4: No flag cleanup

200 flags accumulate, 80% dead. Fix: Owner + sunset_date. Quarterly cleanup.

Pitfall 5: Flags for permanent config

“Should we use Postgres or MySQL” — this is config, not flag. Fix: Flags = temporary. Permanent decisions in config files.

Pitfall 6: No statistical rigor

A/B test “showed lift” but n=50 users. Fix: Statistical significance, sample size calc.

Pitfall 7: Canary without traffic

5% canary at 3am = 0 actual users → no signal. Fix: Require minimum traffic for analysis.
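A minimal traffic gate (names and thresholds are illustrative): report "inconclusive" rather than "pass" when the sample is too small. The inconclusiveLimit shown in section 2.10.1 plays a similar role in Argo.

```python
def analysis_verdict(successes: int, total: int,
                     min_requests: int = 100, threshold: float = 0.95) -> str:
    """Gate canary analysis on a minimum sample size."""
    if total < min_requests:
        return "inconclusive"  # 3am canary with 5 requests: no signal, don't promote
    return "pass" if successes / total >= threshold else "fail"

print(analysis_verdict(5, 5))       # inconclusive
print(analysis_verdict(980, 1000))  # pass
print(analysis_verdict(900, 1000))  # fail
```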

Pitfall 8: Database not rollback-safe

Code v2 changes schema → can’t rollback to v1. Fix: Expand-Contract pattern.

Pitfall 9: Manual rollback

Bug detected at 50% → 30 minutes to manual rollback. Fix: Automated rollback on metric failure.

Pitfall 10: Feature flag for security

“Disable login for attackers” via flag → flag service is now critical path. Fix: Use rate limiting / WAF, not flags.


| Topic | Relation |
| --- | --- |
| Tuan-12-CICD-Pipeline | Foundation; Progressive Delivery adds verification + automation |
| Tuan-13-Monitoring-Observability | Metrics drive the automated analysis |
| Tuan-11-Microservices-Pattern | Service mesh enables canary routing |
| Tuan-14-AuthN-AuthZ-Security | Flag-based security gates |
| Tuan-Bonus-Platform-Engineering-IDP | Self-service deploys via an IDP |

References

Books:

  • Continuous Delivery (Humble & Farley, 2010)
  • Accelerate (Forsgren, Humble, Kim, 2018)
  • Feature Flag Best Practices (LaunchDarkly e-book)


Next up: Tuan-Bonus-Edge-Wasm-Architecture — Edge computing with WebAssembly.