Bonus Week: Platform Engineering & Internal Developer Platform (IDP)

“Year 1: every dev sets up K8s, CI/CD, and monitoring on their own → 2 weeks to ship the first feature. Year 2: an IDP with golden paths → a new dev ships a feature in 2 hours. That is ‘platform as a product’: Platform Engineering is not ‘the new DevOps’, it is the discipline of designing DevEx at scale.”

Tags: system-design platform-engineering idp backstage devex bonus
Student: Hieu (Backend Dev → Architect)
Prerequisite: Tuan-11-Microservices-Pattern · Tuan-12-CICD-Pipeline
Related: Tuan-13-Monitoring-Observability · Tuan-Bonus-FinOps-Cloud-Unit-Economics


1. Context & Why

Everyday analogy: a self-sufficient industrial park

Hieu, imagine you run 100 small startups in the same industrial park. There are two models:

Model 1 (every startup fends for itself):

  • Startup A arranges its own electricity, water, internet, and security guard
  • Startup B does the same
  • 100 startups × 5 setup tasks × 2 weeks = 1,000 weeks just to get started
  • Every startup spends money and time on “non-core” work

Model 2 (the industrial park offers shared services):

  • The park has already set up electricity, water, internet, security, and fire safety
  • A startup only needs to “register, sign the contract, plug in” → started in 1 day
  • A “self-service catalog” for every utility
  • 100 startups × 1 day = 100 days total

This is exactly what Platform Engineering is: building the “industrial park” for dev teams. The Internal Developer Platform (IDP) is that self-service portal.

Why does a Backend Dev need to understand Platform Engineering?

Reason | Consequence
53% of organizations use an IDP (Port 2025 report) | Industry standard; not adopting means falling behind
DevEx = retention | Bad DevEx → engineer churn ($150K/hire)
Cognitive load | Backend devs should not need to know K8s, Terraform, and Prometheus in detail
Cost vs value | Investing in a platform → 2-5x team productivity
Career path | “Platform Engineer” is a hot role for 2024-2026

Why doesn't Alex Xu cover this?

Alex Xu's Vol 1 and 2 discuss CI/CD and K8s but do not cover the operating model of infrastructure. Platform Engineering is an organizational pattern, not a tool, and that is a gap at the architect level.



2. Deep Dive: Core Concepts

2.1 Team Topologies — Foundation

Team Topologies (Skelton & Pais 2019) defines 4 team types:

Team type | Role
Stream-aligned | Customer-facing, ship features (most teams)
Platform | Provide internal capabilities for stream teams
Enabling | Coach stream teams on new tech (temporary)
Complicated subsystem | Specialized expertise (e.g., ML, cryptography)

Interaction modes:

  • X-as-a-Service: Platform team provides service, stream consumes (most common)
  • Collaboration: Two teams work together (limited duration)
  • Facilitating: Enabling team helps stream team adopt new tech

2.2 Platform as a Product

Critical mindset shift: Internal platform = product, dev teams = customers.

Product disciplines apply:

  • User research: Interview devs, identify pain points
  • Roadmap: Prioritize features by impact
  • Adoption metrics: Are devs using the platform?
  • NPS / CSAT: Are devs happy?
  • Iterate: Continuous improvement

Anti-pattern: “Build it, they will come”. Force devs to use → resentment, shadow IT.

Right mindset: “Make the right way the easy way” → devs want to use platform.

2.3 Golden Paths

Golden Path = opinionated, well-supported way to do common task.

Example: “Deploy a new microservice”

Without golden path (10 days):
  1. Decide language (Go? Python? Node?)
  2. Setup repo, CI, linters
  3. Write Dockerfile from scratch
  4. Setup K8s manifests
  5. Configure ingress, certs
  6. Setup monitoring (Prometheus scrape, dashboards)
  7. Setup logging (Loki, log format)
  8. Setup tracing (OpenTelemetry SDK)
  9. Setup secrets (Vault integration)
  10. Setup CI/CD pipeline
  11. Code review process
  12. Deploy to staging, prod

With golden path (1 day):
  $ idp create-service --template=python-api --name=my-service
  → Repo created with template (Dockerfile, helm chart, monitoring)
  → CI/CD pipeline auto-configured
  → Service registered in catalog
  → Developer just writes business logic
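
The golden-path flow above can be sketched as a tiny scaffolder. This is a minimal illustration under stated assumptions, not the behavior of a real `idp` CLI: `SKELETON`, `create_service`, and the in-memory `catalog` dict are hypothetical names invented for the example.

```python
import tempfile
from pathlib import Path
from string import Template

# Hypothetical skeleton: relative path -> file body with $name / $owner placeholders.
SKELETON = {
    "Dockerfile": "FROM python:3.12-slim\nLABEL service=$name\n",
    "catalog-info.yaml": "kind: Component\nmetadata:\n  name: $name\nspec:\n  owner: $owner\n",
    "app/main.py": 'SERVICE_NAME = "$name"\n',
}


def create_service(name: str, owner: str, root: Path, catalog: dict) -> Path:
    """Render the skeleton into root/name and register the service in a catalog dict."""
    repo = root / name
    for rel_path, body in SKELETON.items():
        target = repo / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(Template(body).substitute(name=name, owner=owner))
    # Registration is just a dict entry here; a real platform would call its catalog API.
    catalog[name] = {"owner": owner, "path": str(repo)}
    return repo


if __name__ == "__main__":
    catalog = {}
    with tempfile.TemporaryDirectory() as tmp:
        repo = create_service("my-service", "team-payments", Path(tmp), catalog)
        print(sorted(p.relative_to(repo).as_posix() for p in repo.rglob("*") if p.is_file()))
    print(catalog["my-service"]["owner"])
```

The point of the sketch is the shape of the contract: one call with a name and an owner yields a repo layout plus a catalog entry, and the developer never touches the individual setup steps.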

Key principles:

  • Opinionated: Strong defaults (specific language, framework)
  • Paved: Well-supported (docs, on-call, examples)
  • Optional: Devs can deviate if needed (but harder)
  • Versioned: v1 → v2 with migration path

Common golden paths:

  • New microservice
  • New frontend app
  • New data pipeline
  • New ML model serving
  • Database migration

2.4 Internal Developer Platform (IDP) Components

┌─────────────────────────────────────────────────────┐
│              Developer Portal UI                      │
│  (Backstage, Port, Cortex)                           │
│                                                       │
│  - Service Catalog                                    │
│  - TechDocs                                           │
│  - Software Templates (scaffolder)                    │
│  - Plugins (CI status, on-call, costs)               │
└──────────────────────┬──────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
   ┌────────┐    ┌─────────┐    ┌─────────┐
   │Source  │    │  CI/CD  │    │Infra as │
   │Control │    │         │    │Code     │
   │GitHub/ │    │ Argo,   │    │ Terra-  │
   │GitLab  │    │ Jenkins │    │ form,   │
   │        │    │         │    │ Cross-  │
   │        │    │         │    │ plane   │
   └────────┘    └─────────┘    └─────────┘

   ┌────────────────────────────────────────┐
   │          Underlying Infrastructure      │
   │   K8s, Cloud (AWS/GCP/Azure),          │
   │   DBs, Monitoring, Logging              │
   └────────────────────────────────────────┘

2.4.1 Service Catalog

What every dev team needs: “What services exist? Who owns them? What do they depend on?”

Backstage Catalog:

# catalog-info.yaml — committed to repo
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  description: Handles payment processing
  annotations:
    backstage.io/techdocs-ref: dir:.
    pagerduty.com/integration-key: PE-12345
    prometheus.io/dashboard: https://grafana/d/payment
spec:
  type: service
  lifecycle: production
  owner: team-payments
  system: checkout
  dependsOn:
    - resource:postgres-payments
    - component:fraud-detection
  providesApis:
    - payment-api-v1

Auto-discovery: with a discovery provider configured, Backstage scans the organization's repos for catalog-info.yaml files.

Visualizations: Service dependency graph, ownership map.
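
To make the three catalog questions concrete (what exists, who owns it, what it depends on), here is a small sketch that walks `dependsOn` edges. It is not Backstage's API: the `CATALOG` dict, the `fraud-model-store` resource, and both function names are assumptions made up for illustration.

```python
from collections import deque

# Hypothetical catalog entries, mirroring the owner/dependsOn fields of catalog-info.yaml.
CATALOG = {
    "component:payment-service": {
        "owner": "team-payments",
        "dependsOn": ["resource:postgres-payments", "component:fraud-detection"],
    },
    "component:fraud-detection": {
        "owner": "team-fraud",
        "dependsOn": ["resource:fraud-model-store"],
    },
    "resource:postgres-payments": {"owner": "team-payments", "dependsOn": []},
    "resource:fraud-model-store": {"owner": "team-fraud", "dependsOn": []},
}


def transitive_dependencies(catalog: dict, ref: str) -> list:
    """Breadth-first walk over dependsOn edges, returning refs in visit order."""
    seen, order, queue = {ref}, [], deque([ref])
    while queue:
        current = queue.popleft()
        for dep in catalog.get(current, {}).get("dependsOn", []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


def owners_involved(catalog: dict, ref: str) -> set:
    """Owners of the entity itself and of everything it depends on."""
    refs = [ref] + transitive_dependencies(catalog, ref)
    return {catalog[r]["owner"] for r in refs if r in catalog}


if __name__ == "__main__":
    print(transitive_dependencies(CATALOG, "component:payment-service"))
    print(sorted(owners_involved(CATALOG, "component:payment-service")))
```

During an incident, a query like `owners_involved` is what lets on-call find every team that may need to be paged without asking around.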

2.4.2 TechDocs

Documentation lives with the code, not in a separate wiki.

my-service/
├── catalog-info.yaml
├── mkdocs.yml
├── docs/
│   ├── index.md
│   ├── architecture.md
│   ├── runbook.md
│   └── api.md
└── src/

Backstage TechDocs plugin auto-renders Markdown → searchable docs site.

Benefit: Docs versioned with code. Update code → update docs in same PR.

2.4.3 Software Templates (Scaffolder)

Self-service service creation.

# template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: python-microservice
  title: Python Microservice
  description: Create new Python service with FastAPI
spec:
  type: service   # spec.type is required for kind: Template
  parameters:
    - title: Basic info
      properties:
        name:
          type: string
          title: Service name
        description:
          type: string
        owner:
          type: string
          ui:field: OwnerPicker
 
  steps:
    - id: fetch-template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
 
    - id: publish
      action: publish:github
      input:
        repoUrl: github.com?repo=${{ parameters.name }}&owner=myorg
        defaultBranch: main
 
    - id: register
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}

Result: Dev clicks button → repo created with template, registered in catalog, CI configured.

2.4.4 Crossplane / OpenTofu — Infrastructure as platform service

Crossplane: Provision cloud resources via K8s CRDs.

# Postgres database via Crossplane composition
apiVersion: db.example.org/v1alpha1
kind: PostgreSQLDatabase
metadata:
  name: payments-db
spec:
  parameters:
    storageGB: 100
    tier: production
    region: us-east-1
  compositionSelector:
    matchLabels:
      provider: aws
      tier: production

Behind the scenes: Crossplane creates RDS instance, security group, secret in K8s.

Why? Devs use familiar K8s YAML, platform team controls compositions.

2.5 Backstage vs Port vs Cortex vs OpsLevel

Tool | Origin | Strengths | Best for
Backstage | Spotify (open source) | Most flexible, plugin ecosystem | Engineering-heavy orgs
Port | Israeli startup | No-code, fast time-to-value | Mid-size, less custom
Cortex | US startup | Service quality scorecards | Quality-focused orgs
OpsLevel | US startup | DevEx maturity model | Ops-focused orgs
Humanitec | Score.dev | Workload-centric, multi-cloud | Enterprise, multi-cloud
CNOE (CNCF) | Adobe et al. | OSS reference architecture | Vendor-neutral preference

2.6 Adoption Patterns

Common failure: Build platform 18 months → 0 adoption.

Right approach (Camille Fournier):

  1. Start with 1-2 stream teams as design partners
  2. Solve their top 3 pain points (don’t build big bang)
  3. Make it easy to adopt (auto-migration tools)
  4. Measure adoption, iterate
  5. Expand to more teams

Adoption metrics:

  • % services in catalog
  • % services using golden path template
  • DevEx survey scores (NPS, satisfaction)
  • Time to first deploy (new dev)
  • Incident frequency (lower with platform)
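
The first few adoption metrics above reduce to simple ratios over a service inventory. A minimal sketch, assuming a hypothetical inventory format (`SERVICES`, `HOURS_TO_FIRST_DEPLOY`, and the function names are not from any real tool):

```python
from statistics import median

# Hypothetical inventory: one record per deployed service.
SERVICES = [
    {"name": "payment-service", "in_catalog": True, "from_template": True},
    {"name": "fraud-detection", "in_catalog": True, "from_template": False},
    {"name": "legacy-billing", "in_catalog": False, "from_template": False},
    {"name": "search-api", "in_catalog": True, "from_template": True},
]

# Hypothetical hours from each new developer's start to their first deploy.
HOURS_TO_FIRST_DEPLOY = [6, 30, 4, 9]


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal; 0.0 when there is nothing to measure."""
    return round(100 * part / whole, 1) if whole else 0.0


def adoption_report(services: list, hours_to_first_deploy: list) -> dict:
    """Summarize catalog coverage, golden-path usage, and onboarding speed."""
    total = len(services)
    return {
        "pct_in_catalog": pct(sum(s["in_catalog"] for s in services), total),
        "pct_from_template": pct(sum(s["from_template"] for s in services), total),
        "median_hours_to_first_deploy": median(hours_to_first_deploy),
    }


if __name__ == "__main__":
    print(adoption_report(SERVICES, HOURS_TO_FIRST_DEPLOY))
```

The median is used for time to first deploy so that one unusually slow onboarding does not dominate the number.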

2.7 Score.dev — Workload Specification

Score (open spec): Cloud-native workload definition agnostic of platform.

# score.yaml — describe workload portably
apiVersion: score.dev/v1b1
metadata:
  name: my-service
 
containers:
  app:
    image: myorg/my-service:latest
    variables:
      PORT: "8080"
      DB_URL: ${resources.db.uri}
 
resources:
  db:
    type: postgres
 
service:
  ports:
    web:
      port: 80
      targetPort: 8080

Translate to:

  • Local: score-compose generate → Docker Compose file
  • K8s: score-k8s generate → Kubernetes manifests (older setups used score-helm → Helm values)
  • Humanitec: native consumption

Goal: Dev writes 1 spec, deploys anywhere.

2.8 GitOps for Platform

Platform configuration = Git repo. Apply via ArgoCD/Flux.

platform-config/
├── teams/
│   ├── team-payments/
│   │   ├── members.yaml
│   │   ├── services.yaml
│   │   └── permissions.yaml
│   └── team-fraud/
├── golden-paths/
│   ├── python-api/
│   └── go-cli/
├── policies/
│   ├── opa/                # Open Policy Agent rules
│   └── kyverno/
└── infra/
    ├── shared/             # Shared infrastructure (DBs, queues)
    └── per-tenant/

Changes via PR: Platform changes reviewed like code.

2.9 Policy as Code (OPA, Kyverno)

Enforce platform standards.

# policy: deployments must have resource limits
package kubernetes.admission
 
deny[msg] {
    input.request.kind.kind == "Deployment"
    container := input.request.object.spec.template.spec.containers[_]
    not container.resources.limits.cpu
    msg := sprintf("Container %v missing CPU limit", [container.name])
}

# Kyverno: enforce ownership label
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-owner-label
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-owner-label
      match:
        any:
          - resources:
              kinds: ["Deployment"]
      validate:
        message: "Deployments must have 'owner' label"
        pattern:
          metadata:
            labels:
              owner: "?*"
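
To show what the two policies above check, here is the same logic as a plain Python function over an admission-request-shaped dict. It is only a reading aid under the assumption that the request mirrors `input.request` in the Rego rule; it is not how OPA or Kyverno evaluate policies, and `deny_reasons` is a made-up name.

```python
def deny_reasons(request: dict) -> list:
    """Return denial messages for a Kubernetes admission-style request dict."""
    if request.get("kind", {}).get("kind") != "Deployment":
        return []
    obj = request.get("object", {})
    reasons = []
    # Mirrors the Kyverno pattern owner: "?*" (label present and non-empty).
    if not obj.get("metadata", {}).get("labels", {}).get("owner"):
        reasons.append("Deployments must have 'owner' label")
    pod_spec = obj.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        # Mirrors the Rego rule: deny when resources.limits.cpu is absent.
        if not container.get("resources", {}).get("limits", {}).get("cpu"):
            reasons.append(f"Container {container['name']} missing CPU limit")
    return reasons


if __name__ == "__main__":
    request = {
        "kind": {"kind": "Deployment"},
        "object": {
            "metadata": {"labels": {}},
            "spec": {"template": {"spec": {"containers": [
                {"name": "app", "resources": {"limits": {"memory": "256Mi"}}},
                {"name": "sidecar", "resources": {"limits": {"cpu": "100m"}}},
            ]}}},
        },
    }
    for reason in deny_reasons(request):
        print(reason)
```

An empty list means the request would be admitted; each string corresponds to one violated rule.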

2.10 Anti-Patterns

Anti-pattern 1: Platform as Ivory Tower

Platform team builds without listening to dev needs. Fix: Embed platform engineers in stream teams initially.

Anti-pattern 2: Mandatory adoption from Day 1

“Use platform or get fired”. Fix: Voluntary adoption, make it 10x better than alternatives.

Anti-pattern 3: Platform as DevOps Rebrand

Same DevOps team, just renamed. Fix: Platform mindset = product mindset. Hire product manager.

Anti-pattern 4: Tool-first

“Let’s deploy Backstage!” without understanding why. Fix: Start with user research. Tool comes after problem.

Anti-pattern 5: All-in-one mega platform

Try to platform-ize everything immediately. Fix: Start with 1-2 golden paths, expand based on demand.


3. Estimation

3.1 Platform team size

Rule of thumb: 1 platform engineer per 10-30 stream-aligned engineers.

Org size | Stream eng | Platform eng
Startup (50) | 30 | 2-3
Growth (200) | 150 | 8-15
Scale (1000) | 700 | 30-70
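
The rule of thumb can be turned into a quick calculation. This is a sketch of the arithmetic only; the function name and the choice to round up are assumptions.

```python
import math


def platform_team_range(stream_engineers: int, low_ratio: int = 10, high_ratio: int = 30) -> tuple:
    """Return (min, max) platform engineers for the 1-per-10-to-30 rule of thumb."""
    if stream_engineers <= 0:
        return (0, 0)
    smallest = math.ceil(stream_engineers / high_ratio)  # 1 platform eng per 30 stream engs
    largest = math.ceil(stream_engineers / low_ratio)    # 1 platform eng per 10 stream engs
    return (smallest, largest)


if __name__ == "__main__":
    for size in (30, 150, 700):
        print(size, platform_team_range(size))
```

For 30, 150, and 700 stream-aligned engineers this gives (1, 3), (5, 15), and (24, 70); the ranges in the table sit in the upper part of each band.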

3.2 Time investment

To ship MVP IDP:

  • Service catalog + TechDocs: 1-2 months
  • 1-2 golden paths: 2-3 months
  • Self-service infrastructure: 3-6 months
  • Mature multi-team adoption: 12-18 months

Cost trade-off:

  • Investment: 4-8 platform engineers × 12 months × 1-2M per engineer-month
  • Savings: 100 stream engineers × 20% productivity gain × 3M per engineer per year
  • ROI: Year 2 onwards
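
The cost trade-off above works out as follows. This sketch assumes that “1-2M” is a cost per platform engineer per month and “3M/year” is the value of one stream engineer's output per year, both in the same (unstated) currency unit; `platform_roi` is a hypothetical helper.

```python
def platform_roi(platform_engineers: int, months: int, cost_per_engineer_month: float,
                 stream_engineers: int, productivity_gain: float,
                 value_per_engineer_year: float) -> tuple:
    """Return (investment, annual_savings, payback_years), all in millions of the same unit."""
    investment = platform_engineers * months * cost_per_engineer_month
    annual_savings = stream_engineers * productivity_gain * value_per_engineer_year
    payback_years = investment / annual_savings
    return investment, annual_savings, payback_years


if __name__ == "__main__":
    # Low end: 4 engineers at 1M per month; high end: 8 engineers at 2M per month.
    for engineers, monthly_cost in ((4, 1.0), (8, 2.0)):
        investment, savings, payback = platform_roi(engineers, 12, monthly_cost, 100, 0.20, 3.0)
        print(f"investment={investment:.0f}M savings={savings:.0f}M/year payback={payback:.1f} years")
```

With these figures the investment ranges from 48M to 192M against 60M of savings per year, so payback lands between roughly 0.8 and 3.2 years depending on team size and cost.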

3.3 Adoption metrics

Healthy IDP:

  • > 80% services in catalog
  • > 60% new services use templates
  • DevEx NPS > 40
  • Time to first deploy: < 1 day for new dev

Unhealthy:

  • < 20% adoption (platform built but unused)
  • DevEx NPS < 0
  • Stream teams build shadow platforms
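
A sketch of how these thresholds could be checked automatically. NPS is computed the standard way (percentage of scores 9-10 minus percentage of scores 0-6). Treating the catalog and template figures as lower bounds, and equating “adoption” with catalog coverage, are assumptions; the function names are hypothetical.

```python
def nps(scores: list) -> float:
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("need at least one survey response")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


def idp_health(pct_in_catalog: float, pct_from_template: float,
               nps_score: float, days_to_first_deploy: float) -> str:
    """Classify an IDP as healthy, unhealthy, or in between using the thresholds above."""
    healthy = (pct_in_catalog > 80 and pct_from_template > 60
               and nps_score > 40 and days_to_first_deploy < 1)
    unhealthy = pct_in_catalog < 20 or nps_score < 0
    if healthy:
        return "healthy"
    if unhealthy:
        return "unhealthy"
    return "in between"


if __name__ == "__main__":
    survey = [10, 9, 9, 8, 7, 6, 10, 3, 9, 10]
    score = nps(survey)
    print(score, idp_health(85, 70, score, 0.5))
```

Shadow platforms, the third unhealthy signal, cannot be read off a dashboard; they usually surface in user interviews.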

4. Security First

4.1 RBAC across platform

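# Illustrative RBAC policy (pseudo-config): Backstage itself expresses permission
# policies in code or through an RBAC plugin, so treat the keys below as a sketch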
permissions:
  - resource: catalog-entity
    actions:
      - read: allow
      - update:
          conditions:
            - rule: IS_OWNER
              params:
                claims:
                  - sub
  - resource: scaffolder-template
    actions:
      - execute:
          conditions:
            - rule: HAS_GROUP
              params:
                claims:
                  - groups
              expected: developers
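
The intent of the policy above can be sketched as a default-deny check. This is a reading of the pseudo-config, not Backstage's permission framework: `is_allowed`, the `user` claims dict, and the `resource` dict shape are all assumptions.

```python
def is_allowed(user: dict, action: str, resource: dict) -> bool:
    """Evaluate the policy: open reads, owner-only updates, group-gated templates."""
    kind = resource["kind"]
    if kind == "catalog-entity":
        if action == "read":
            return True
        if action == "update":
            # IS_OWNER: the entity's owner must match the user's `sub` claim.
            return resource.get("owner") == user.get("sub")
    if kind == "scaffolder-template" and action == "execute":
        # HAS_GROUP: the user's `groups` claim must contain "developers".
        return "developers" in user.get("groups", [])
    return False  # default deny for anything not listed


if __name__ == "__main__":
    alice = {"sub": "alice", "groups": ["developers"]}
    entity = {"kind": "catalog-entity", "owner": "bob"}
    template = {"kind": "scaffolder-template"}
    print(is_allowed(alice, "read", entity))
    print(is_allowed(alice, "update", entity))
    print(is_allowed(alice, "execute", template))
```

Default deny matters here: an action that nobody thought to list is rejected rather than silently permitted.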

4.2 Secret management

Platform must integrate with secret store (Vault, AWS Secrets Manager).

# Service template includes secret integration
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-service-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: my-service-secrets
  data:
    - secretKey: db-password
      remoteRef:
        key: services/my-service/db-password

4.3 Supply chain security

  • SBOM (Software Bill of Materials) for every service
  • Image signing (Cosign, Sigstore)
  • Dependency scanning (Snyk, Dependabot)
  • Policy enforcement (no critical CVEs in production)

4.4 Audit trail

Every platform action logged:

  • Who triggered template scaffolder
  • What changed in catalog
  • Who deployed to production

Forward to SIEM for compliance.
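
One way to make platform actions SIEM-friendly is to emit each one as a single JSON line. A minimal sketch; the field names and the `audit_event` helper are assumptions, not a standard schema.

```python
import json
from datetime import datetime, timezone


def audit_event(actor: str, action: str, target: str, **details) -> str:
    """Serialize one platform action as a single JSON line for shipping to a SIEM."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,      # who triggered the action
        "action": action,    # e.g. scaffolder.execute, catalog.update, deploy.production
        "target": target,    # the template, entity, or service affected
        "details": details,  # free-form context such as a commit SHA
    }
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    print(audit_event("hieu", "scaffolder.execute", "template:fastapi-service",
                      service="my-service"))
```

One event per line keeps the log greppable and lets a log shipper forward it without a custom parser.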


5. DevOps: Operating the Platform

5.1 Backstage deployment

# docker-compose.yml — local dev
version: "3"
services:
  backstage:
    image: backstage:latest   # assumes an image built locally from your Backstage app
    ports:
      - "3000:3000"
      - "7007:7007"
    environment:
      POSTGRES_HOST: db
      POSTGRES_PORT: 5432
      POSTGRES_USER: backstage
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      GITHUB_TOKEN: ${GITHUB_TOKEN}
    depends_on: [db]
 
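  db:
    image: postgres:15
    environment:
      # Must match POSTGRES_USER on the backstage service, otherwise login fails
      POSTGRES_USER: backstage
      POSTGRES_PASSWORD: ${DB_PASSWORD}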
    volumes:
      - bs-data:/var/lib/postgresql/data
 
volumes:
  bs-data:

Production: K8s deployment with HA postgres, Redis cache, OAuth integration.

5.2 Catalog auto-discovery

# app-config.yaml
catalog:
  rules:
    - allow: [Component, System, API, Resource, Location, Group, User]
 
  locations:
    # Static location
    - type: file
      target: ../../examples/entities.yaml
 
    # GitHub org auto-discovery (legacy processor form; prefer the provider below)
    - type: github-discovery
      target: https://github.com/myorg/*/blob/-/catalog-info.yaml
 
  providers:
    # Entity discovery provider (GitHub catalog backend module)
    github:
      myorg:
        organization: myorg
        catalogPath: /catalog-info.yaml
        schedule:
          frequency: { minutes: 30 }
          timeout: { minutes: 3 }

5.3 Plugins ecosystem

Common plugins:

  • CI/CD: GitHub Actions, GitLab CI, ArgoCD
  • Monitoring: Grafana, Prometheus, Datadog
  • On-call: PagerDuty, Opsgenie
  • Cost: Kubecost, Vantage
  • Security: Snyk, Dependabot
  • Infra: AWS, Crossplane, Terraform

5.4 Metrics

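# Prometheus alerting rules; the backstage_* metric names are illustrative and
# depend on how your Backstage instance exports metrics
groups: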
  - name: idp_metrics
    rules:
      - alert: BackstageDown
        expr: up{job="backstage"} == 0
        for: 5m
 
      - alert: CatalogIngestionLag
        expr: backstage_catalog_processing_duration_seconds > 600
        for: 30m
 
      - alert: ScaffolderFailures
        expr: rate(backstage_scaffolder_task_failed_total[1h]) > 0.1
        for: 30m

Custom DevEx metrics:

  • services_in_catalog_total
  • golden_path_usage_total
  • template_executions_total
  • time_to_first_deploy_seconds (per dev)

5.5 Roll out plan

Phase 1 (Months 1-3): Pilot
  - 1-2 design partner teams
  - Service catalog only
  - Validate value

Phase 2 (Months 4-6): Expand
  - All teams onboarded to catalog
  - 1-2 golden paths (most common service type)
  - TechDocs for top services

Phase 3 (Months 7-12): Mature
  - 5+ golden paths
  - Self-service infrastructure
  - Cost dashboards
  - On-call integration

Phase 4 (Year 2+): Optimize
  - DevEx metrics-driven improvement
  - Multi-cluster, multi-cloud
  - Advanced governance

6. Code Implementation

6.1 Custom Backstage plugin

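// plugins/cost-tracker/src/plugin.tsx (JSX requires the .tsx extension)
import React from 'react';
import useAsync from 'react-use/lib/useAsync';
import { Card, CardContent, CardHeader, Typography } from '@material-ui/core';
import {
  createApiRef,
  createPlugin,
  createRouteRef,
  useApi,
} from '@backstage/core-plugin-api';
import { Entity } from '@backstage/catalog-model';
 
// Shape of the cost data; the backing API (Kubecost, Vantage, etc.) is an assumption
type EntityCost = { total: number; compute: number; storage: number; network: number };
interface CostApi {
  getEntityCost(entity: Entity): Promise<EntityCost>;
}
export const costApiRef = createApiRef<CostApi>({ id: 'plugin.cost-tracker.api' });
 
export const costTrackerPlugin = createPlugin({
  id: 'cost-tracker',
  routes: {
    root: createRouteRef({ id: 'cost-tracker' }),
  },
});
 
// Hook: fetch cost asynchronously so the card can show loading and error states
const useEntityCost = (entity: Entity) => {
  const costApi = useApi(costApiRef);
  return useAsync(() => costApi.getEntityCost(entity), [entity]);
};
 
// Component to display cost on entity page
export const CostCard = ({ entity }: { entity: Entity }) => {
  const { value: cost, loading, error } = useEntityCost(entity);
  if (loading || error || !cost) {
    return <Typography>{error ? 'Cost unavailable' : 'Loading cost...'}</Typography>;
  }
  return (
    <Card>
      <CardHeader title="Monthly Cost" />
      <CardContent>
        <Typography variant="h3">${cost.total}</Typography>
        <Typography>Compute: ${cost.compute}</Typography>
        <Typography>Storage: ${cost.storage}</Typography>
        <Typography>Network: ${cost.network}</Typography>
      </CardContent>
    </Card>
  );
};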

6.2 Golden path template (FastAPI service)

# template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: fastapi-service
  title: FastAPI Microservice
spec:
  type: service
  parameters:
    - title: Service Info
      properties:
        name: { type: string, title: Service Name }
        owner: { type: string, ui:field: OwnerPicker }
        description: { type: string }
 
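    - title: Database
      properties:
        useDatabase:
          type: boolean
          title: Need Postgres?
      # Show dbSize only when useDatabase is true (JSON Schema dependencies)
      dependencies:
        useDatabase:
          oneOf:
            - properties:
                useDatabase: { const: false }
            - properties:
                useDatabase: { const: true }
                dbSize:
                  type: string
                  enum: [small, medium, large]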
 
  steps:
    - id: fetch-skeleton
      action: fetch:template
      input:
        url: ./skeleton
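        values:
          name: ${{ parameters.name }}
          owner: ${{ parameters.owner }}
          # description is referenced by the skeleton's catalog-info.yaml
          description: ${{ parameters.description }}
          useDatabase: ${{ parameters.useDatabase }}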
 
    - id: publish-github
      action: publish:github
      input:
        repoUrl: github.com?owner=myorg&repo=${{ parameters.name }}
        description: ${{ parameters.description }}
        repoVisibility: internal
 
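    - id: provision-db
      if: ${{ parameters.useDatabase }}
      # aws:rds:create is not a built-in action; it assumes a custom scaffolder
      # action registered by the platform team
      action: aws:rds:create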
      input:
        dbName: ${{ parameters.name }}-db
        dbSize: ${{ parameters.dbSize }}
        region: us-east-1
 
    - id: register-catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish-github.output.repoContentsUrl }}

# skeleton/${{values.name}}/app/main.py
"""${{ values.name }} - FastAPI service"""
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
 
app = FastAPI(title="${{ values.name }}")
 
# Auto-monitoring
Instrumentator().instrument(app).expose(app)
FastAPIInstrumentor.instrument_app(app)
 
@app.get("/")
async def root():
    return {"service": "${{ values.name }}"}
 
@app.get("/health")
async def health():
    return {"status": "ok"}

# skeleton/${{values.name}}/catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: ${{ values.name }}
  description: ${{ values.description }}
  annotations:
    github.com/project-slug: myorg/${{ values.name }}
    backstage.io/techdocs-ref: dir:.
    pagerduty.com/integration-key: REPLACE_ME
    prometheus.io/dashboard: https://grafana/d/service-template
spec:
  type: service
  lifecycle: experimental
  owner: ${{ values.owner }}
  system: my-system

# skeleton/${{values.name}}/.github/workflows/ci.yml
name: CI
on:
  push: { branches: [main] }
  pull_request:
 
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install -r requirements.txt
      - run: pytest
      - run: ruff check .
 
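  build:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write   # needed to push to ghcr.io with GITHUB_TOKEN
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          # raw blocks keep the scaffolder from rendering GitHub Actions expressions
          username: {% raw %}${{ github.actor }}{% endraw %}
          password: {% raw %}${{ secrets.GITHUB_TOKEN }}{% endraw %}
      - run: |
          docker build -t ghcr.io/myorg/${{ values.name }}:latest .
          docker push ghcr.io/myorg/${{ values.name }}:latest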

6.3 Crossplane composition for Postgres

# composition for Postgres in K8s
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: postgres-aws
  labels:
    provider: aws
    tier: production
spec:
  compositeTypeRef:
    apiVersion: db.example.org/v1alpha1
    kind: PostgreSQLDatabase
  resources:
    - name: rds-instance
      base:
        apiVersion: rds.aws.crossplane.io/v1alpha1
        kind: DBInstance
        spec:
          forProvider:
            engine: postgres
            engineVersion: "15.4"
            dbInstanceClass: db.t3.medium
            allocatedStorage: 100
            multiAZ: true
            backupRetentionPeriod: 7
            deletionProtection: true
      patches:
        - fromFieldPath: spec.parameters.storageGB
          toFieldPath: spec.forProvider.allocatedStorage
        - fromFieldPath: spec.parameters.tier
          toFieldPath: spec.forProvider.dbInstanceClass
          transforms:
            - type: map
              map:
                small: db.t3.small
                medium: db.t3.medium
                production: db.r6i.xlarge
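
To see what the two patches in this composition do, here is a simplified model of patch-and-transform in Python. It is a sketch of the idea only, not Crossplane's implementation (which also handles patch policies, type conversion, and missing fields); `BASE`, `CLAIM`, `PATCHES`, and the helper names are hypothetical.

```python
import copy

# Base managed resource, reduced to the two fields the patches touch.
BASE = {"spec": {"forProvider": {"allocatedStorage": 100, "dbInstanceClass": "db.t3.medium"}}}

# What a developer submits (the PostgreSQLDatabase resource from section 2.4.4).
CLAIM = {"spec": {"parameters": {"storageGB": 250, "tier": "production"}}}

PATCHES = [
    {"from": "spec.parameters.storageGB", "to": "spec.forProvider.allocatedStorage"},
    {
        "from": "spec.parameters.tier",
        "to": "spec.forProvider.dbInstanceClass",
        "map": {"small": "db.t3.small", "medium": "db.t3.medium", "production": "db.r6i.xlarge"},
    },
]


def get_path(obj: dict, path: str):
    """Read a dotted field path such as spec.parameters.tier."""
    for key in path.split("."):
        obj = obj[key]
    return obj


def set_path(obj: dict, path: str, value) -> None:
    """Write a dotted field path, creating intermediate dicts as needed."""
    keys = path.split(".")
    for key in keys[:-1]:
        obj = obj.setdefault(key, {})
    obj[keys[-1]] = value


def render_managed_resource(base: dict, claim: dict, patches: list) -> dict:
    """Copy the base, then apply each patch (with an optional map transform)."""
    rendered = copy.deepcopy(base)
    for patch in patches:
        value = get_path(claim, patch["from"])
        if "map" in patch:
            value = patch["map"][value]
        set_path(rendered, patch["to"], value)
    return rendered


if __name__ == "__main__":
    print(render_managed_resource(BASE, CLAIM, PATCHES))
```

The developer only states intent (`tier: production`, a storage size); the mapping from tier to instance class stays under the platform team's control.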

7. System Design Diagrams

7.1 Team Topologies

flowchart TB
    subgraph Platform["Platform Teams"]
        DataPlatform[Data Platform Team]
        InfraPlatform[Infra Platform Team]
        SecPlatform[Security Platform Team]
    end

    subgraph Stream["Stream-Aligned Teams"]
        Payments[Payments Team]
        Checkout[Checkout Team]
        Catalog[Catalog Team]
        Search[Search Team]
    end

    subgraph Enabling["Enabling Teams"]
        SRE[SRE Coaches]
        Architects[Architecture Council]
    end

    subgraph Subsystem["Complicated Subsystem"]
        ML[ML Platform]
        Crypto[Crypto/PKI]
    end

    Stream -->|consume X-as-a-Service| Platform
    Enabling -.facilitate.-> Stream
    Subsystem -->|specialized service| Stream

    style Platform fill:#bbdefb
    style Stream fill:#c8e6c9
    style Enabling fill:#fff9c4
    style Subsystem fill:#ffe0b2

7.2 IDP Layered Architecture

flowchart TB
    Dev[Developer] --> Portal[Developer Portal<br/>Backstage / Port]

    Portal --> Catalog[Service Catalog]
    Portal --> Templates[Software Templates]
    Portal --> Docs[TechDocs]
    Portal --> Insights[Insights & Scorecards]

    Templates --> Scaffolder[Scaffolder Engine]
    Scaffolder --> SCM[GitHub / GitLab]
    Scaffolder --> CICD[CI/CD Pipelines]
    Scaffolder --> Infra[Crossplane / Terraform]

    SCM --> Apps[Application Code]
    CICD --> Deploy[Deploy to K8s]
    Infra --> Cloud[Cloud Resources]

    Apps --> Deploy
    Deploy --> Runtime[K8s Cluster]
    Cloud --> Runtime

    Runtime --> Observ[Observability<br/>Prometheus / Grafana]
    Observ --> Insights

    style Portal fill:#bbdefb
    style Scaffolder fill:#c8e6c9

7.3 Golden Path Flow

sequenceDiagram
    participant Dev as Developer
    participant Portal as IDP Portal
    participant Scaff as Scaffolder
    participant GitHub
    participant CICD
    participant K8s
    participant Catalog

    Dev->>Portal: Click "New Service" template
    Portal->>Dev: Form (name, owner, options)
    Dev->>Portal: Submit

    Portal->>Scaff: Execute template
    Scaff->>GitHub: Create repo
    Scaff->>GitHub: Push skeleton code
    Scaff->>CICD: Configure pipeline
    Scaff->>K8s: Create namespace
    Scaff->>Catalog: Register entity

    Catalog-->>Portal: Service appears
    Portal-->>Dev: Done! Repo URL, dashboard link

    Note over Dev,Catalog: 5 minutes vs 5 days previously

7.4 Team-Platform Interaction Modes

flowchart LR
    subgraph TS1["Stream Team"]
        DevA[Developer A]
    end

    subgraph TP["Platform Team"]
        PE[Platform Engineer]
    end

    subgraph Modes["Interaction Modes"]
        XaaS["X-as-a-Service<br/>(default mode)<br/>Self-service portal,<br/>docs, low-friction"]
        Collab["Collaboration<br/>(temporary)<br/>Joint work on<br/>new pattern"]
        Facilitate["Facilitating<br/>(coaching)<br/>Help adopt<br/>new tech"]
    end

    DevA -->|90% interactions| XaaS
    DevA <-->|5%| Collab
    DevA <-.5%.-> Facilitate

    XaaS --> PE
    Collab --> PE
    Facilitate --> PE

    style XaaS fill:#c8e6c9
    style Collab fill:#fff9c4
    style Facilitate fill:#bbdefb

8. Aha Moments & Pitfalls

Aha Moments

#1: Platform = product, devs = customers. This is the most important mindset shift. Apply product disciplines: research, roadmap, NPS.

#2: Golden paths > flexibility. Strong opinions with good defaults > “you can use anything”. Reducing cognitive load is the main value.

#3: Self-service > tickets. A dev opens a ticket, “give me a K8s namespace” → 2 days. A self-service template → 2 minutes. Time saved = team velocity.

#4: Catalog là source of truth. Service ownership, dependencies, runbooks — all in catalog. On-call can find anything in 30 seconds.

#5: Adoption is hard. Built ≠ used. It takes a continuous sales effort from the platform team. Make it 10x better than DIY.

#6: Team Topologies trumps tools. Right team structure > best tool. Wrong team structure can’t be fixed by Backstage.

#7: TechDocs with code. Docs live in the repo, are versioned, and are updated in the same PR. No more outdated wiki.

#8: Platform engineering ≠ DevOps rebrand. Different mindset. DevOps = “you build it, you run it”. Platform = “we provide tools so you build it well”.

Pitfalls

Pitfall 1: Build first, ask later

Spend 18 months → 5% adoption. Fix: Start with 1-2 design partners, MVP fast, iterate.

Pitfall 2: Force adoption

Mandate platform → resentment, shadow IT. Fix: Make it 10x better, voluntary adoption.

Pitfall 3: One platform fits all

Try to satisfy every team’s needs → bloat. Fix: 80/20 rule. Solve common cases well. Allow exceptions.

Pitfall 4: No product manager

Platform team without PM → no roadmap, no user research. Fix: Hire dedicated platform PM.

Pitfall 5: Tool worship

“We adopted Backstage!” → not used. Fix: Tool serves people. Start with problem, end with tool.

Pitfall 6: No DevEx metrics

Don’t know if platform is working. Fix: NPS quarterly, time-to-first-deploy, adoption metrics.

Pitfall 7: Platform team isolation

Platform team in vacuum, away from stream teams. Fix: Embed engineers in stream teams initially. Office hours.

Pitfall 8: Reinventing wheels

Build custom service catalog instead of Backstage. Fix: Adopt OSS, customize. Don’t compete with category leaders.

Pitfall 9: No security baked in

Devs use platform but bypass security controls. Fix: Make secure way the easy way. Policy-as-code enforces.

Pitfall 10: Underestimate ongoing investment

Build once, expect to last forever. Fix: Continuous investment. Tech debt accumulates fast.


Topic | Connection
Tuan-11-Microservices-Pattern | Microservices need a platform; the service catalog tracks them
Tuan-12-CICD-Pipeline | Golden paths automate CI/CD setup
Tuan-13-Monitoring-Observability | Platform integrates monitoring
Tuan-14-AuthN-AuthZ-Security | RBAC across platform
Tuan-Bonus-FinOps-Cloud-Unit-Economics | Cost dashboard in IDP
Tuan-Bonus-Progressive-Delivery | Deploy strategy via platform

References

Books:

  • Team Topologies (Skelton & Pais, 2nd ed 2024) — https://teamtopologies.com/
  • Platform Engineering (Camille Fournier & Ian Nowland, 2024)
  • The DevOps Handbook (Kim, Humble, Debois & Willis, 2016)



Next: Tuan-Bonus-FinOps-Cloud-Unit-Economics. FinOps complements Platform Engineering with a cost lens.