Bonus Week: Progressive Delivery — Argo Rollouts, Flagger, Feature Flags

“Traditional CI/CD: deploy → 100% of users → if bug → roll back everything. Progressive Delivery: deploy → 1% of users → measure → 5% → measure → 25% → 100%. Bug detected? Automatic rollback at 5%. This is the evolution from ‘deploy and hope’ to ‘deploy and verify’.”

Tags: system-design cicd progressive-delivery canary feature-flags bonus
Student: Hieu (Backend Dev → Architect)
Prerequisite: Tuan-12-CICD-Pipeline · Tuan-13-Monitoring-Observability
Related: Tuan-11-Microservices-Pattern · Tuan-Bonus-Platform-Engineering-IDP


1. Context & Why

Everyday analogy — Launching a new recipe

Hieu, imagine you run a chain of 100 phở restaurants and want to roll out a new recipe:

Option 1 — Big bang launch (traditional CI/CD):

  • Day 1: Switch the recipe at all 100 restaurants at once
  • Customers complain: “this phở is terrible!”
  • All 100 restaurants lose revenue on the same day
  • Revert to the old recipe → 200 restaurant-days affected

Option 2 — Progressive launch (Progressive Delivery):

  • Day 1: Change the recipe at 1 restaurant
  • Measure: revenue, complaints, NPS
  • Day 2: If OK → 5 restaurants
  • Day 3: If still OK → 25 restaurants
  • Day 7: 100% if the metrics look good
  • Bug found at 5 restaurants → roll back only those 5

Progressive Delivery = Continuous Deployment + Risk Reduction + Automated Decisions.

Why should a Backend Dev care?

| Reason | Consequence without it |
| --- | --- |
| Reduce blast radius | Bug hits 100% of users instead of 1% |
| Automated rollback | Detect + rollback in seconds vs hours |
| Real-world validation | Pre-prod tests miss issues that 1% of prod traffic reveals |
| Decouple deploy from release | Deploy code without enabling the feature yet |
| DORA metrics | Higher deploy frequency, lower MTTR |
| A/B testing built-in | Validate hypotheses with traffic subsets |

Why doesn't Alex Xu go deeper?

Alex Xu Vol 1+2 cover basic CI/CD (blue-green, canary) but not automated analysis, feature flag platforms, or progressive rollout patterns. These are a 2020+ evolution.



2. Deep Dive — Core Concepts

2.1 Deployment Strategies Spectrum

                        Risk    Speed   Complexity
Big Bang               ────█    █───   ─
Blue-Green             ──██     ██──   ██
Rolling                ─███     ██──   ██
Canary (manual)        ████     ███─   ███
Canary (automated)     ████     ████   ████
A/B Testing            ████     ████   █████
Shadow / Mirror        █████    ─███   █████

2.1.1 Big Bang

v1: 100% → v2: 100% (instant cutover)

When: stateful systems where running two versions side by side is hard (e.g., some DB schema migrations). Avoid for: stateless services.

2.1.2 Blue-Green

Blue (v1) running, Green (v2) deployed
Switch load balancer: 100% Blue → 100% Green
Keep Blue for instant rollback

Pros: Instant rollback, zero downtime.
Cons: 2x infrastructure during deploy, no gradual validation.

2.1.3 Rolling Update (K8s default)

10 pods running v1
Replace 1 pod at a time:
  Pod 1: v1 → v2
  Pod 2: v1 → v2
  ...
At any time: mix of v1/v2 pods serving traffic

Pros: Default in K8s, gradual.
Cons: No automated analysis, all-or-nothing per pod.

2.1.4 Canary

v1: 99%, v2: 1%
v1: 95%, v2: 5%
v1: 75%, v2: 25%
v1: 50%, v2: 50%
v1: 0%, v2: 100%

Pros: Gradual exposure, can pause at any %.
Cons: Manual decisions; needs infrastructure (service mesh, ingress).

2.1.5 A/B Testing

Same as canary but split based on USER attributes:
  v1: existing users
  v2: new users in segment X

Measure business metrics (conversion, engagement)

Pros: Validate hypotheses with real users.
Cons: Needs an experimentation platform and statistical rigor.

2.1.6 Shadow / Mirror

v1: receives 100% traffic, returns response
v2: receives 100% mirrored traffic, response discarded
Compare v1 vs v2 outputs (correctness, performance)

Pros: Test against real traffic without user impact.
Cons: 2x infrastructure; side-effect concerns (mirrored requests must not trigger real writes, emails, payments).

2.2 Argo Rollouts — Kubernetes-native Progressive Delivery

Argo Rollouts replaces the standard K8s Deployment with a Rollout CRD and supports canary + blue-green strategies with automated analysis.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 10
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: myorg/payment:v2.0.0
          ports: [{ containerPort: 8080 }]
 
  strategy:
    canary:
      steps:
        - setWeight: 5     # 5% traffic to v2
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 25    # 25%
        - pause: { duration: 30m }
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency
        - setWeight: 50
        - pause: { duration: 30m }
        - setWeight: 100   # full rollout
 
      canaryService: payment-canary    # Routes v2 traffic
      stableService: payment-stable    # Routes v1 traffic
 
      trafficRouting:
        istio:
          virtualService:
            name: payment-vs
            routes: [primary]

Steps execution:

  1. Deploy 1 canary pod with v2
  2. Route 5% traffic to canary
  3. Pause 10 minutes
  4. Run analysis (query Prometheus)
  5. If pass → next step; if fail → rollback automatically
  6. Continue progression

2.3 AnalysisTemplate — Automated Verification

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status!~"5..",
              version="canary"
            }[5m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              version="canary"
            }[5m]))
 
    - name: p99-latency
      interval: 1m
      successCondition: result[0] <= 0.5
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service-name}}",
                version="canary"
              }[5m])) by (le)
            )

Decision logic:

  • Run the query every interval (1m here)
  • If failed measurements exceed failureLimit → abort the rollout and trigger rollback
  • Otherwise the analysis passes and the rollout progresses to the next step
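This failure-budget logic is simple enough to sketch. The class below is a toy model with invented names, not Argo's actual implementation (Argo's exact semantics are in its docs):

```python
from dataclasses import dataclass

@dataclass
class AnalysisRun:
    """Toy model of a metric-gated analysis run."""
    failure_limit: int = 3
    failed: int = 0

    def record(self, measurement_ok: bool) -> None:
        if not measurement_ok:
            self.failed += 1

    @property
    def aborted(self) -> bool:
        # Simplified rule: abort once the failure budget is used up
        return self.failed >= self.failure_limit

run = AnalysisRun(failure_limit=3)
for ok in [True, False, True, False, False]:  # 3 failed measurements
    run.record(ok)

print(run.aborted)  # True: failure budget exhausted, rollback triggered
```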

2.4 Flagger — Service Mesh-native

Flagger (Weaveworks/Flux): Similar concept, integrates with Istio/Linkerd/AWS App Mesh natively.

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment
 
  service:
    port: 80
    targetPort: 8080
    gateways: [public-gateway]
 
  analysis:
    interval: 1m
    threshold: 5     # max 5 failed checks
    maxWeight: 50
    stepWeight: 10
 
    metrics:
      - name: request-success-rate
        thresholdRange: { min: 99 }
        interval: 1m
 
      - name: request-duration
        thresholdRange: { max: 500 }    # ms
        interval: 30s
 
    webhooks:
      - name: smoke-test
        type: pre-rollout
        url: http://test-runner/run-smoke
 
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://payment-canary/"

Differences from Argo Rollouts:

  • Flagger more integrated with service mesh
  • Argo Rollouts more flexible (custom analysis providers)
  • Both production-grade

2.5 Feature Flags — Decouple Deploy from Release

Concept: Deploy code with feature OFF. Enable feature later via flag.

# Without feature flag
def get_recommendations(user_id):
    return new_ml_algorithm(user_id)  # Released to all immediately
 
# With feature flag
def get_recommendations(user_id):
    if feature_flag.enabled("new_recommendations", user_id=user_id):
        return new_ml_algorithm(user_id)
    return old_algorithm(user_id)

Use cases:

  1. Gradual rollout: 1% → 10% → 100%
  2. Kill switch: Disable broken feature instantly
  3. A/B testing: Compare variants
  4. Targeted rollout: Enable for specific users (beta testers)
  5. Trunk-based development: Merge unfinished features behind flag

2.6 Feature Flag Platforms

| Tool | Type | Best for |
| --- | --- | --- |
| LaunchDarkly | Commercial SaaS | Enterprise, full features |
| Statsig | SaaS, freemium | Experimentation focus |
| Unleash | OSS + cloud | Self-host preference |
| Flagsmith | OSS + cloud | Privacy-conscious teams |
| OpenFeature (CNCF) | Vendor-neutral spec | Avoiding lock-in |
| Cloudflare Workers KV | DIY simple flags | Existing CF infra |
| Custom DB | Self-built | Small scale, simple needs |

2.7 OpenFeature — Standardization

OpenFeature (CNCF): Vendor-neutral feature flag SDK.

# Python OpenFeature SDK
from openfeature import api
from openfeature.evaluation_context import EvaluationContext
from openfeature.contrib.provider.flagsmith import FlagsmithProvider
 
# Configure provider (swappable)
api.set_provider(FlagsmithProvider(env_key="..."))
 
# Use anywhere
client = api.get_client("my-app")
 
ctx = EvaluationContext(
    targeting_key=user_id,
    attributes={"country": "VN"},
)
if client.get_boolean_value("new_checkout", False, ctx):
    new_checkout_flow()
else:
    old_checkout_flow()

Magic: switching from LaunchDarkly to Unleash means changing only the provider; no application code changes.

2.8 Targeting & Rollout Strategies

2.8.1 Percentage rollout

flag: new_checkout
default: false
targets:
  - rule: percentage(10)  # 10% random users
    value: true

2.8.2 Attribute-based

flag: premium_features
default: false
targets:
  - rule: user.plan == "enterprise"
    value: true
  - rule: user.country in ["US", "VN", "JP"]
    value: true
  - rule: user.id in ["beta-tester-1", "beta-tester-2"]
    value: true

2.8.3 Context-based

flag: heavy_query_optimization
targets:
  - rule: request.path == "/reports"
    value: true
  - rule: request.method == "POST" && request.size > 1MB
    value: true

2.8.4 Sticky bucketing

Important: User in 10% bucket today should stay in 10% tomorrow (consistent UX).

import mmh3  # MurmurHash3: fast and deterministic across processes/languages

# Hash user_id deterministically so bucket assignment is sticky
def in_bucket(user_id: str, percentage: int) -> bool:
    hash_val = mmh3.hash(f"new_checkout:{user_id}") % 100
    return hash_val < percentage

2.9 Experimentation (A/B Testing)

Beyond rollout: measure business metrics.

from statsig import statsig, StatsigUser, StatsigEvent
 
statsig.initialize("...")
user = StatsigUser(user_id)
 
# Get experiment variant (the Statsig server SDK takes the user first)
config = statsig.get_experiment(user, "new_checkout_design")
button_color = config.get("button_color", "blue")
button_text = config.get("button_text", "Buy Now")
 
# Render UI with variant
render_button(color=button_color, text=button_text)
 
# Log conversion event
if user_purchased:
    statsig.log_event(StatsigEvent(user, "purchase", value=order_total))

Experiment platform computes:

  • Sample size per variant
  • Statistical significance
  • Lift in conversion
  • Confidence intervals

Tools: Statsig, LaunchDarkly Experiments, Eppo, Amplitude Experiment.
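The core significance check for a two-variant experiment is a two-proportion z-test; a minimal sketch (sample numbers are invented):

```python
import math

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# 10,000 users per variant: control converts 10.0%, treatment 11.0%
z = two_proportion_ztest(1000, 10_000, 1100, 10_000)
print(round(z, 2))  # 2.31, and |z| > 1.96 means significant at the 95% level
```

With a much smaller sample (say 50 users per variant), the same lift gives |z| well below 1.96, which is exactly the trap described in Pitfall 6.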

2.10 Rollback Strategies

2.10.1 Automated rollback (Argo Rollouts/Flagger)

analysis:
  failureLimit: 3   # abort after 3 failed measurements
  inconclusiveLimit: 5

When abort triggered:

  1. Stop progression
  2. Route 100% traffic back to stable
  3. Notify team
  4. Keep canary pod for debugging (configurable)

2.10.2 Feature flag kill switch

# Bug detected → disable feature instantly
# (pseudocode; in practice a call to the flag platform's admin API or UI)
launchdarkly.update_flag("new_checkout", enabled=False)
# Effect propagates within seconds, no deploy needed

Power: Sub-30-second rollback vs 10-minute deploy rollback.

2.10.3 Database rollback considerations

Problem: Rolling back code is easy, rolling back DB schema is hard.

Pattern: Expand-Contract:

Migration v1 → v2:
  Phase 1 (Expand): Add new column, code writes to both old & new
  Phase 2: Backfill new column from old
  Phase 3: Code reads from new
  Phase 4 (Contract): Remove old column

Each phase deployable & rollback-safe.
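A runnable sketch of the four phases using SQLite (table and column names are invented; DROP COLUMN needs SQLite 3.35+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('An'), ('Binh')")

# Phase 1 (Expand): add the new column; old code keeps working untouched
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# New code dual-writes to both columns while both exist
conn.execute("INSERT INTO users (fullname, display_name) VALUES ('Chi', 'Chi')")

# Phase 2: backfill rows written by old code (idempotent, rollback-safe)
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Phase 3 (code-only change): reads switch to the new column
rows = [r[0] for r in conn.execute("SELECT display_name FROM users ORDER BY id")]
print(rows)  # ['An', 'Binh', 'Chi']

# Phase 4 (Contract): drop the old column only after nothing reads or writes it
conn.execute("ALTER TABLE users DROP COLUMN fullname")
```

At every point in between, both the old and the new code version can run against the schema, which is what makes each deploy rollback-safe.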

2.11 DORA Metrics

4 key metrics measure DevOps performance:

| Metric | Elite performers |
| --- | --- |
| Deploy frequency | On-demand (multiple per day) |
| Lead time for changes | < 1 day from commit to prod |
| Change failure rate | 0-15% |
| MTTR | < 1 hour |

Progressive Delivery improves all 4:

  • Deploy frequency ↑ (lower risk per deploy)
  • Lead time ↓ (auto-pipeline)
  • Change failure rate ↓ (catch issues at 5%)
  • MTTR ↓ (auto-rollback)

3. Estimation

3.1 Time saved by automation

Manual canary (without Argo Rollouts):

  • Engineer monitors metrics manually: 2h per deploy
  • 10 deploys/week × 2h × 5 engineers = 100h/week
  • $200K/year just monitoring time

Automated (Argo Rollouts):

  • Engineer reviews dashboard occasionally: 0.5h/deploy
  • Saves 75h/week = $150K/year
  • Plus prevents bad deploys (avg incident cost $50K)
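Making the arithmetic above explicit (the $40/h rate is an assumption chosen to match the ≈$200K/year figure):

```python
DEPLOYS_PER_WEEK = 10
ENGINEERS = 5
HOURLY_RATE = 40      # assumed $/h, so 100 h/week ≈ $200K/year
WEEKS_PER_YEAR = 50

manual_hours = 2.0 * DEPLOYS_PER_WEEK * ENGINEERS  # 100 h/week of manual watching
auto_hours = 0.5 * DEPLOYS_PER_WEEK * ENGINEERS    # 25 h/week of dashboard review
saved_hours = manual_hours - auto_hours

yearly_savings = saved_hours * HOURLY_RATE * WEEKS_PER_YEAR
print(saved_hours, yearly_savings)  # 75.0 150000.0
```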

3.2 Risk reduction

Without progressive delivery:

  • 1% bad deploy rate × 10 deploys/week × 50 weeks = 5 incidents/year
  • Average impact: 30 min × 100K users × $X cost = $$$

With progressive delivery:

  • Same bad deploy rate but caught at 5% traffic
  • Impact: 5 min × 5K users = 95% reduction

3.3 Feature flag overhead

  • ~5-10% perf overhead from flag evaluation
  • Mitigations: SDK cache, edge evaluation
  • Cost: commercial SaaS (LaunchDarkly etc.) is priced per seat/MAU; self-hosting (e.g., Unleash) has $0 license cost

4. Security First

4.1 Flag evaluation auth

Threat: Attacker manipulates flag → enable hidden feature.

Mitigations:

  • API key per environment
  • Restrict who can change flags (RBAC)
  • Audit log every flag change
  • Use cryptographic verification (signed flags)

4.2 Sensitive data in flag context

Don’t include PII in flag context (cached, logged):

# BAD: PII ends up in SDK caches and logs
client.get_boolean_value("flag", context={"email": "user@example.com"})
 
# GOOD: stable pseudonymous ID (Python's builtin hash() is salted per process, so use hashlib)
client.get_boolean_value("flag", context={"user_id": hashlib.sha256(email.encode()).hexdigest()})

4.3 Flag debt

Problem: Flags accumulate over time (100s of dead flags).

Risks:

  • Old flags = old code paths = potential bugs
  • Hard to reason about behavior
  • Audit nightmare

Solution: Flag lifecycle management

  • Tag flags with owner, created_at, sunset_date
  • Auto-alert when flag > 90 days old
  • Quarterly cleanup ritual

4.4 Canary security testing

Run security scans on canary before promotion:

  • DAST (OWASP ZAP)
  • Container scan (Trivy)
  • API contract test

Fail rollout if security regression.


5. DevOps — Implementation

5.1 Argo Rollouts setup

# Install
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f \
  https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
 
# Install kubectl plugin
brew install argoproj/tap/kubectl-argo-rollouts
 
# Use
kubectl argo rollouts get rollout payment-service
kubectl argo rollouts pause payment-service
kubectl argo rollouts promote payment-service
kubectl argo rollouts abort payment-service

5.2 Service mesh setup (Istio)

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-vs
spec:
  hosts: [payment]
  http:
    - name: primary
      route:
        - destination:
            host: payment
            subset: stable
          weight: 100
        - destination:
            host: payment
            subset: canary
          weight: 0     # Argo Rollouts updates this
 
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment
spec:
  host: payment
  subsets:
    - name: stable
      labels: { version: stable }
    - name: canary
      labels: { version: canary }

5.3 Feature flag (Unleash) setup

# docker-compose
services:
  unleash:
    image: unleashorg/unleash-server
    environment:
      DATABASE_URL: "postgres://unleash:password@db/unleash"
    ports: ["4242:4242"]
 
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: unleash
      POSTGRES_PASSWORD: password
      POSTGRES_DB: unleash

5.4 Application integration

from UnleashClient import UnleashClient
 
client = UnleashClient(
    url="http://unleash:4242/api/",
    app_name="my-app",
    custom_headers={"Authorization": API_TOKEN}
)
client.initialize_client()
 
 
def get_recommendations(user_id):
    if client.is_enabled(
        "new_recommendations",
        context={"userId": user_id}
    ):
        return new_algorithm(user_id)
    return old_algorithm(user_id)

5.5 Monitoring

groups:
  - name: progressive_delivery
    rules:
      - alert: RolloutAborted
        expr: rollout_phase{phase="Aborted"} == 1
        for: 1m
        annotations:
          summary: "Rollout {{ $labels.name }} aborted automatically"
 
      - alert: RolloutPaused
        expr: rollout_phase{phase="Paused"} == 1
        for: 1h
        annotations:
          summary: "Rollout {{ $labels.name }} paused > 1h"
 
      - alert: HighFlagEvaluationLatency
        expr: |
          histogram_quantile(0.99, rate(flag_eval_duration_bucket[5m])) > 0.05
        annotations:
          summary: "Flag SDK P99 > 50ms"
 
      - alert: TooManyOldFlags
        expr: count(flag_age_days > 90) > 50
        annotations:
          summary: "{{ $value }} flags > 90 days old. Cleanup needed."

6. Code Implementation

6.1 Custom feature flag service

"""
Lightweight feature flag service.
For when LaunchDarkly is overkill.
"""
 
import hashlib
import json
from typing import Any
import redis
 
 
class FeatureFlags:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
 
    def is_enabled(self, flag: str, user_id: str = None,
                   context: dict = None) -> bool:
        """Check if flag enabled for user/context."""
        rules = self._get_rules(flag)
        if not rules:
            return False
 
        if rules.get("enabled") is False:
            return False  # Kill switch
 
        # Check user-specific overrides
        if user_id and user_id in rules.get("enabled_users", []):
            return True
 
        # Check segment rules
        if context:
            for segment in rules.get("segments", []):
                if self._matches_segment(context, segment):
                    return True
 
        # Percentage rollout (sticky)
        rollout = rules.get("rollout_percentage", 0)
        if rollout > 0 and user_id:
            return self._in_bucket(user_id, flag, rollout)
 
        return False
 
    def _get_rules(self, flag: str):
        data = self.redis.get(f"flag:{flag}")
        return json.loads(data) if data else None
 
    def _matches_segment(self, context: dict, segment: dict) -> bool:
        for key, expected in segment.items():
            if context.get(key) != expected:
                return False
        return True
 
    def _in_bucket(self, user_id: str, flag: str, percentage: int) -> bool:
        # Deterministic hashing for sticky assignment
        h = hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest()
        bucket = int(h[:8], 16) % 100
        return bucket < percentage
 
    def update_flag(self, flag: str, rules: dict):
        """Admin API: update flag rules."""
        self.redis.set(f"flag:{flag}", json.dumps(rules))
        # TTL not set = persistent
 
 
# Usage
ff = FeatureFlags(redis.Redis())
 
# Set up rule
ff.update_flag("new_checkout", {
    "enabled": True,
    "rollout_percentage": 25,
    "enabled_users": ["beta-1", "beta-2"],
    "segments": [
        {"plan": "enterprise"},
        {"country": "VN"}
    ]
})
 
# Use
if ff.is_enabled("new_checkout", user_id="user-123",
                 context={"plan": "enterprise"}):
    new_checkout()
else:
    old_checkout()

6.2 Canary deployment manual orchestration

"""
Manual canary if not using Argo Rollouts.
"""
 
import time
import requests
from dataclasses import dataclass
 
 
@dataclass
class CanaryStep:
    weight: int
    duration_min: int
 
 
class CanaryOrchestrator:
    def __init__(self, prometheus_url: str, k8s_client):
        self.prom = prometheus_url
        self.k8s = k8s_client
 
    async def deploy(self, service: str, new_version: str,
                     steps: list[CanaryStep]):
        # 1. Deploy canary pod
        await self.k8s.apply_manifest({
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": f"{service}-canary"},
            "spec": {
                "replicas": 1,
                "template": {
                    "spec": {
                        "containers": [{
                            "image": f"myorg/{service}:{new_version}"
                        }]
                    }
                }
            }
        })
 
        # 2. Progressive rollout
        for step in steps:
            print(f"Routing {step.weight}% to canary...")
            await self._update_traffic_split(service, step.weight)
 
            # Wait out the step (time.sleep blocks the event loop; use asyncio.sleep in real async code)
            time.sleep(step.duration_min * 60)
 
            # Analyze
            success_rate = await self._query_prometheus(f"""
                sum(rate(http_requests_total{{
                    service="{service}",
                    version="canary",
                    status!~"5.."
                }}[5m])) /
                sum(rate(http_requests_total{{
                    service="{service}",
                    version="canary"
                }}[5m]))
            """)
 
            if success_rate < 0.95:
                print(f"FAILED at {step.weight}% — rolling back")
                await self._update_traffic_split(service, 0)
                await self.k8s.delete(f"{service}-canary")
                raise Exception("Canary failed analysis")
 
            print(f"OK at {step.weight}% — success_rate={success_rate}")
 
        # 3. Promote
        print("Promoting canary to stable")
        await self._promote(service, new_version)
        await self.k8s.delete(f"{service}-canary")
 
    async def _update_traffic_split(self, service, weight):
        # Update Istio VirtualService
        ...
 
    async def _query_prometheus(self, query):
        resp = requests.get(f"{self.prom}/api/v1/query", params={"query": query})
        return float(resp.json()["data"]["result"][0]["value"][1])

7. System Design Diagrams

7.1 Canary Rollout Flow

sequenceDiagram
    participant Dev
    participant CI
    participant Argo as Argo Rollouts
    participant Mesh as Istio
    participant Prom as Prometheus

    Dev->>CI: Push v2.0.0
    CI->>Argo: Update Rollout image
    Argo->>Argo: Spawn canary pod
    Argo->>Mesh: Set 5% traffic to canary

    Note over Argo: Wait 10min

    loop Every 1min
        Argo->>Prom: Query success_rate
        Prom-->>Argo: 0.97
        Note over Argo: ✓ pass
    end

    Argo->>Mesh: Set 25% traffic to canary

    Note over Argo: Wait 30min

    loop Every 1min
        Argo->>Prom: Query
        Prom-->>Argo: 0.92
        Note over Argo: ✗ fail (3 in row)
    end

    Argo->>Mesh: Rollback: 0% to canary
    Argo->>Argo: Delete canary pod
    Argo->>Dev: Slack alert: rollout aborted

7.2 Blue-Green vs Canary

flowchart TB
    subgraph BG["Blue-Green"]
        BGUser[Users]
        BGUser --> BGLB[Load Balancer]
        BGLB -->|100%| BGBlue[Blue v1]
        BGLB -.0%.-> BGGreen[Green v2 ready]

        Note1[Switch atomically:<br/>0% Blue → 100% Green]
    end

    subgraph Canary["Canary"]
        CUser[Users]
        CUser --> CMesh[Service Mesh]
        CMesh -->|95%| CStable[Stable v1]
        CMesh -->|5%| CCanary[Canary v2]

        Note2[Gradually shift weight:<br/>5% → 25% → 50% → 100%]
    end

    style Note1 fill:#fff9c4
    style Note2 fill:#c8e6c9

7.3 Feature Flag Decision Tree

flowchart TD
    Request[Request] --> Get[Get flag value]

    Get --> Cache{In SDK cache?}
    Cache -->|Yes, valid| Return[Return cached value]
    Cache -->|No| Fetch[Fetch from server]

    Fetch --> Eval{Evaluate rules}

    Eval --> Kill{Kill switch?}
    Kill -->|enabled=false| Default[Return default]

    Kill -->|enabled=true| User{User-specific override?}
    User -->|Yes| Override[Return override value]

    User -->|No| Segment{Match segment?}
    Segment -->|Yes| SegmentVal[Return segment value]

    Segment -->|No| Bucket{In rollout bucket?}
    Bucket -->|Yes| Enabled[Return true]
    Bucket -->|No| Default

    Return --> App[Application logic]
    Override --> App
    SegmentVal --> App
    Enabled --> App
    Default --> App

    style Default fill:#ffcdd2
    style Enabled fill:#c8e6c9

7.4 Decoupling Deploy from Release

gantt
    title Deploy vs Release Timeline
    dateFormat YYYY-MM-DD
    axisFormat %m-%d

    section Code
    Develop feature        :2026-01-01, 14d
    Code merged + deployed :milestone, 2026-01-15, 0d

    section Hidden
    Internal testing       :2026-01-15, 7d
    Beta users (1%)        :2026-01-22, 7d
    Wider beta (10%)       :2026-01-29, 14d

    section Released
    50% rollout            :2026-02-12, 7d
    100% rollout           :milestone, 2026-02-19, 0d
    Remove feature flag    :2026-03-19, 0d

    section Big Bang (legacy)
    Develop                :crit, 2026-01-01, 14d
    Release to all         :milestone, crit, 2026-01-15, 0d

8. Aha Moments & Pitfalls

Aha Moments

#1: Deploy ≠ Release. Code can be deployed but feature OFF. This decouples engineering velocity from product launch decisions.

#2: Automated analysis = unbiased decision. Humans biased toward “ship it”. Automated metrics-based rollout = objective gate.

#3: 5% canary catches 95% of bugs. Issues that don’t reproduce in staging often surface at 5% real traffic.

#4: Feature flag = kill switch. Production incident? Disable feature in 30 seconds, no deploy. Faster than rollback.

#5: Sticky bucketing matters for UX. User getting feature today, not tomorrow = bad. Hash deterministically.

#6: DORA metrics correlate with business. Higher deploy frequency + lower failure rate = better profitability (Accelerate research).

#7: Progressive delivery + feature flags = compound benefit. Combined: deploy continuously, release gradually, rollback instantly.

#8: Flag debt is real. 100+ stale flags = liability. Lifecycle management mandatory.

Pitfalls

Pitfall 1: No analysis on canary

Deploy 5% but no metrics check → just slow rollout. Fix: AnalysisTemplate with success_rate + latency.

Pitfall 2: Flag in tight loop

if flag.enabled() { ... } called 1000x/request → SDK overhead. Fix: Evaluate once per request, cache.
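One common fix is a request-scoped wrapper so hot loops hit a dict instead of the SDK. A sketch; the SDK below is a stand-in that just counts evaluations:

```python
class CountingSDK:
    """Stand-in for a real flag SDK; counts evaluations to show the effect."""
    calls = 0

    def is_enabled(self, flag: str, context: dict) -> bool:
        CountingSDK.calls += 1
        return True

class RequestScopedFlags:
    """Evaluate each flag at most once per request."""
    def __init__(self, sdk, user_id: str):
        self.sdk, self.user_id = sdk, user_id
        self._cache: dict[str, bool] = {}

    def enabled(self, flag: str) -> bool:
        if flag not in self._cache:
            self._cache[flag] = self.sdk.is_enabled(flag, {"userId": self.user_id})
        return self._cache[flag]

flags = RequestScopedFlags(CountingSDK(), "user-123")
for _ in range(1000):                     # hot loop inside a single request
    flags.enabled("new_recommendations")
print(CountingSDK.calls)  # 1
```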

Pitfall 3: Different bucket each visit

User in 10% today, 50% tomorrow → confusing UX. Fix: Sticky bucketing via hash(user_id).

Pitfall 4: No flag cleanup

200 flags accumulate, 80% dead. Fix: Owner + sunset_date. Quarterly cleanup.

Pitfall 5: Flags for permanent config

“Should we use Postgres or MySQL” — this is config, not flag. Fix: Flags = temporary. Permanent decisions in config files.

Pitfall 6: No statistical rigor

A/B test “showed lift” but n=50 users. Fix: Statistical significance, sample size calc.

Pitfall 7: Canary without traffic

5% canary at 3am = 0 actual users → no signal. Fix: Require minimum traffic for analysis.
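A minimal traffic gate (names and thresholds are illustrative): report "inconclusive" rather than "pass" when the sample is too small. The inconclusiveLimit shown in section 2.10.1 plays a similar role in Argo.

```python
def analysis_verdict(successes: int, total: int,
                     min_requests: int = 100, threshold: float = 0.95) -> str:
    """Gate canary analysis on a minimum sample size."""
    if total < min_requests:
        return "inconclusive"  # 3am canary with 5 requests: no signal, don't promote
    return "pass" if successes / total >= threshold else "fail"

print(analysis_verdict(5, 5))       # inconclusive
print(analysis_verdict(980, 1000))  # pass
print(analysis_verdict(900, 1000))  # fail
```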

Pitfall 8: Database not rollback-safe

Code v2 changes schema → can’t rollback to v1. Fix: Expand-Contract pattern.

Pitfall 9: Manual rollback

Bug detected at 50% → 30 minutes to manual rollback. Fix: Automated rollback on metric failure.

Pitfall 10: Feature flag for security

“Disable login for attackers” via flag → flag service is now critical path. Fix: Use rate limiting / WAF, not flags.


| Topic | Relation |
| --- | --- |
| Tuan-12-CICD-Pipeline | Foundation; Progressive Delivery adds verification + automation |
| Tuan-13-Monitoring-Observability | Metrics drive the automated analysis |
| Tuan-11-Microservices-Pattern | Service mesh enables canary routing |
| Tuan-14-AuthN-AuthZ-Security | Flag-based security gates |
| Tuan-Bonus-Platform-Engineering-IDP | Self-service deploys via an IDP |

References

Books:

  • Continuous Delivery (Humble & Farley, 2010)
  • Accelerate (Forsgren, Humble, Kim, 2018)
  • Feature Flag Best Practices (LaunchDarkly e-book)


Next up: Tuan-Bonus-Edge-Wasm-Architecture — Edge computing with WebAssembly.