Tuần 04 — Query Optimization & EXPLAIN

“Đọc EXPLAIN ANALYZE giống đọc bản đồ. Người mới nhìn thấy chỉ là một mớ chữ. Người có kinh nghiệm thấy được con đường, các nút thắt, và tại sao xe kẹt ở đâu. Tuần này dạy bạn cách đọc bản đồ đó.”

Tags: database postgresql explain query-optimization performance Thời lượng: 7 ngày (5-7h/ngày) Prerequisites: Tuan-03-Indexing-Mastery (hiểu index để hiểu plan) Liên quan: Tuan-09-DB-Observability-Tuning (pg_stat_statements) · Tuan-05-Transactions-Locking

1. Context & Why

1.1 EXPLAIN ANALYZE là dụng cụ chẩn đoán quan trọng nhất của DB

Khi query slow, có 3 cách phản ứng:

Add index “phòng hờ” — guessing → 50% miss
Rewrite query “có thể nhanh hơn” — guessing → 50% miss
Đọc EXPLAIN ANALYZE → biết chính xác bottleneck → fix targeted → 95% hit

Senior engineer dành 80% thời gian DB cho việc đọc plan, 20% cho việc fix.

1.2 Mục tiêu tuần

Đọc EXPLAIN ANALYZE output trong 30 giây tìm ra bottleneck
Hiểu cost model: pages × cost + tuples × cost
Phân biệt và biết khi nào dùng: Seq Scan, Index Scan, Index-Only, Bitmap, Sort, Hash
Đọc 3 join algorithms: nested loop, hash, merge — và biết khi nào planner chọn sai
Statistics: hiểu default_statistics_target, pg_statistic, extended statistics
Tránh được 10 anti-pattern phổ biến nhất
Dùng tools: explain.depesz.com, pev2, pgMustard

1.3 Tham chiếu

PostgreSQL Documentation: EXPLAIN — https://www.postgresql.org/docs/current/sql-explain.html
PostgreSQL Documentation: Performance Tips — https://www.postgresql.org/docs/current/performance-tips.html
Use The Index, Luke! Ch. “The Join Operation”
explain.depesz.com — paste EXPLAIN, get colored analysis
dalibo/pev2 — visual plan viewer
pgMustard — AI-assisted EXPLAIN analysis (commercial)
Bruce Momjian — “Explaining the Postgres Query Optimizer” slides

2. EXPLAIN — Syntax & Options

2.1 Forms

EXPLAIN <query>;                    -- chỉ plan, không chạy
EXPLAIN ANALYZE <query>;             -- chạy thật, đo thời gian thật
EXPLAIN (ANALYZE, BUFFERS) <query>;  -- thêm I/O stats
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT JSON) <query>;

2.2 Options đầy đủ

Option	Mặc định	Mục đích
`ANALYZE`	off	Chạy query, đo actual time/rows
`BUFFERS`	off	Thống kê shared/local buffers hit, read
`VERBOSE`	off	Output column list, schema-qualified
`COSTS`	on	Hiện estimated cost
`TIMING`	on	Per-node timing (cần ANALYZE)
`SUMMARY`	on if ANALYZE	Tổng planning/execution time
`SETTINGS`	off (PG12+)	List non-default settings ảnh hưởng plan
`WAL`	off (PG13+)	WAL records generated by DML
`GENERIC_PLAN`	off (PG16+)	Plan cho parameterized query không có giá trị thật
`FORMAT`	text	text/xml/json/yaml

Recommended để debug:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS, FORMAT TEXT) <query>;

2.3 Đọc output cơ bản

                                                  QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=0.43..8.55 rows=20 width=64) (actual time=0.025..0.075 rows=20 loops=1)
   ->  Index Scan using idx_orders_user_placed on orders  (cost=0.43..406.32 rows=1000 width=64)
                                                          (actual time=0.024..0.071 rows=20 loops=1)
         Index Cond: (user_id = 42)
 Planning Time: 0.123 ms
 Execution Time: 0.105 ms

Đọc:

Node tree từ trong ra ngoài (deepest first)
cost=startup..total — abstract unit (xem section 3)
rows=N — estimate
actual time=startup..total ms — đo thực
rows=N loops=M — node chạy M lần, mỗi lần produce ~N rows
Planning Time — thời gian build plan
Execution Time — thời gian chạy

2.4 Đọc BUFFERS

Index Scan using idx_x on orders (cost=...) (actual time=...)
  Buffers: shared hit=1234 read=56 dirtied=2 written=0

shared hit — pages tìm thấy trong shared_buffers
shared read — pages phải đọc từ disk (hoặc OS cache)
shared dirtied — pages bị modify (kể cả từ HOT update set hint bits)
temp read/written — temp files (sort spill, hash spill)

Key insight: read cao = cache miss → tune shared_buffers hoặc query khác.

2.5 EXPLAIN ANALYZE side effects

⚠️ EXPLAIN ANALYZE INSERT/UPDATE/DELETE thực sự thay đổi data. Wrap trong transaction nếu chỉ test:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS) UPDATE orders SET status='paid' WHERE id=1;
ROLLBACK;

⚠️ ANALYZE đo per-row timing → overhead lớn (5-30%). Cẩn thận khi đọc số tuyệt đối — relative comparison đáng tin hơn.

3. Cost Model — Tiền tệ ảo của planner

3.1 Đơn vị cost

Cost là arbitrary unit. Định nghĩa relative theo seq_page_cost = 1.0. Mọi cost khác là multiplier.

seq_page_cost        = 1.0     # 1 sequential page
random_page_cost     = 4.0     # 1 random page (4x SD legacy; tune 1.1 cho SSD)
cpu_tuple_cost       = 0.01    # process 1 tuple
cpu_index_tuple_cost = 0.005   # process 1 index tuple
cpu_operator_cost    = 0.0025  # 1 operator/function call
parallel_setup_cost  = 1000    # spawn parallel workers
parallel_tuple_cost  = 0.1     # send 1 tuple from worker to leader

3.2 Công thức ước tính

Seq Scan:

cost = relation_pages × seq_page_cost + reltuples × cpu_tuple_cost

Index Scan:

cost = log(N) × random_page_cost                            (index pages)
     + estimated_matching_rows × random_page_cost           (heap fetches)
     + estimated_matching_rows × cpu_tuple_cost

3.3 Vì sao `random_page_cost = 4.0` sai trên SSD

Default từ era HDD (random ~10x slower than sequential). Modern SSD: random ≈ sequential.

random_page_cost = 1.1     # SSD
effective_io_concurrency = 200  # SSD parallel I/O

Tác động:

random_page_cost cao → planner avoid index → seq scan
Tune xuống → planner chuộng index → đúng intent với SSD

3.4 Cost vs Actual time — khi nào mismatch

Cost: 1000
Actual time: 5000 ms

Lý do thường gặp:

Bad row estimate — planner predict 1000 rows, thực tế 100000
Stats stale — chưa ANALYZE
Correlated columns — Postgres assume independent (xem section 6)
Data distribution skew — vài value chiếm 90% rows

→ Tune stats trước khi tune query.

4. Plan Nodes — Bestiary

graph TB
    subgraph "Scan nodes"
        SS[Seq Scan]
        IS[Index Scan]
        IOS[Index Only Scan]
        BHS[Bitmap Heap Scan + Bitmap Index Scan]
        TS[Tid Scan]
        FS[Foreign Scan]
    end

    subgraph "Join nodes"
        NL[Nested Loop]
        HJ[Hash Join]
        MJ[Merge Join]
    end

    subgraph "Aggregate"
        AGG[Aggregate / GroupAggregate]
        HASHAGG[HashAggregate]
    end

    subgraph "Other"
        SORT[Sort]
        LIMIT[Limit]
        MAT[Materialize]
        HASH[Hash]
        GATHER[Gather / Gather Merge - parallel]
        CTE[CTE Scan]
        SUB[SubPlan / InitPlan]
    end

4.1 Seq Scan

Đọc tuần tự toàn bộ heap. Fastest cho:

Bảng nhỏ (< 1000 rows, fit 1 page)
Query trả >5-10% rows
Không có index match
Hot table fully cached

Seq Scan on orders  (cost=0.00..21370.00 rows=1000000 width=64)
  Filter: (status = 'pending'::text)
  Rows Removed by Filter: 750000

Rows Removed by Filter — đọc rows, filter ra ngoài. Nếu lớn → cần index.

4.2 Index Scan

Đi vào index → tìm ctid → fetch heap.

Index Scan using idx_orders_user on orders (cost=...) (actual rows=20 loops=1)
  Index Cond: (user_id = 42)
  Filter: (status = 'pending'::text)
  Rows Removed by Filter: 30

Index Cond — filter dùng index (efficient)
Filter — filter post-fetch heap (recheck)

Nếu nhiều Rows Removed by Filter → consider include filter column trong index.

4.3 Index Only Scan

Đọc chỉ từ index, không fetch heap. Cần:

Mọi column query reference có trong index (key hoặc INCLUDE)
VM bit “all-visible” (Tuần 01)

Index Only Scan using idx_orders_cov on orders (cost=...) (actual rows=20 loops=1)
  Index Cond: (user_id = 42)
  Heap Fetches: 0          ← perfect index-only scan

Heap Fetches: 0 = thực sự index-only. > 0 = phải fetch heap cho rows không all-visible → VACUUM more.

4.4 Bitmap Heap Scan + Bitmap Index Scan

Combine nhiều index, deduplicate bằng bitmap.

Bitmap Heap Scan on orders (cost=...) (actual rows=5000 loops=1)
  Recheck Cond: ((user_id = 42) OR (user_id = 43))
  Heap Blocks: exact=1234
  ->  BitmapOr (cost=...)
        ->  Bitmap Index Scan on idx_user
              Index Cond: (user_id = 42)
        ->  Bitmap Index Scan on idx_user
              Index Cond: (user_id = 43)

Hoặc combine 2 conditions:

Bitmap Heap Scan on orders
  Recheck Cond: ((user_id = 42) AND (status = 'pending'))
  ->  BitmapAnd
        ->  Bitmap Index Scan on idx_user
        ->  Bitmap Index Scan on idx_status

Khi nào planner pick Bitmap:

Trung gian giữa Index Scan (very few rows) và Seq Scan (many rows)
Combine 2+ indexes
Khi rows lớn nhưng vẫn nhỏ hơn full table

4.5 Nested Loop Join

Nested Loop  (cost=...) (actual rows=20 loops=1)
  ->  Index Scan using idx_users_pkey on users   (1 row)
        Index Cond: (id = 42)
  ->  Index Scan using idx_orders_user on orders  (20 rows, loops=1)
        Index Cond: (user_id = users.id)

Cho mỗi row của outer, scan inner. Tốt khi:

Outer ít rows (1-100)
Inner có index trên join key
⚠️ Quadratic nếu outer nhiều → chậm khủng khiếp

4.6 Hash Join

Hash Join  (cost=...) (actual rows=100000 loops=1)
  Hash Cond: (orders.user_id = users.id)
  ->  Seq Scan on orders
  ->  Hash  (cost=...)
        Buckets: 65536  Batches: 1  Memory Usage: 4096kB
        ->  Seq Scan on users

Build hash table từ smaller side, probe từ larger side.

Tốt khi:

Join 2 bảng to
Không index trên join column
Có RAM cho hash table

Spill: nếu hash table > work_mem → spill to disk (batches > 1) → chậm:

Batches: 4  Memory Usage: 16384kB  Disk Usage: 32MB

→ Tăng work_mem.

4.7 Merge Join

Merge Join  (cost=...) (actual rows=100000)
  Merge Cond: (a.id = b.a_id)
  ->  Index Scan ... (sorted by a.id)
  ->  Index Scan ... (sorted by b.a_id)

Cả 2 sides sorted theo join column. Walk through đồng thời.

Tốt khi:

Cả 2 sides đã sorted (index hoặc Sort cheap)
Join keys unique → no duplicate work

4.8 Sort

Sort  (cost=...) (actual rows=1000 loops=1)
  Sort Key: orders.created_at DESC
  Sort Method: quicksort  Memory: 256kB

Methods:

quicksort — fit in work_mem
external sort Disk: NNNkB — spill, slow
top-N heapsort — for LIMIT k

Tip: Nếu thấy external sort, tăng work_mem cho session/query.

4.9 HashAggregate vs GroupAggregate

HashAggregate (cost=...) (actual rows=100 loops=1)
  Group Key: user_id
  ->  Seq Scan on orders

-- vs

GroupAggregate (cost=...) (actual rows=100 loops=1)
  Group Key: user_id
  ->  Sort  (Sort Key: user_id)
        ->  Seq Scan on orders

HashAggregate: hash table, fast, fit memory
GroupAggregate: cần sorted input, dùng khi data đã sorted hoặc cần ORDER BY group

4.10 Gather / Gather Merge — parallelism

Gather  (cost=...) (actual rows=1000 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on orders  (rows=500 loops=3)

Parallel query (PG9.6+):

loops=3 = leader + 2 workers
Total rows = N × loops
Khi nào planner pick parallel: bảng to + có CPU + cost > parallel_setup_cost

Settings:

max_parallel_workers_per_gather = 2
max_parallel_workers = 8
parallel_setup_cost = 1000
min_parallel_table_scan_size = 8MB

4.11 CTE Scan / InitPlan / SubPlan

WITH active_users AS MATERIALIZED (
    SELECT * FROM users WHERE active = true
)
SELECT * FROM orders o JOIN active_users u ON o.user_id = u.id;

Hash Join
  ->  Seq Scan on orders
  ->  CTE Scan on active_users
        ->  Seq Scan on users (rows=10000)

PG12 behavior change: CTE inline by default (như subquery), trừ khi:

Recursive CTE
Marked MATERIALIZED
Referenced multiple times

Trước PG12: CTE always materialized → optimization fence, slow. Pattern cũ dùng để “force planner” giờ cần explicit MATERIALIZED.

5. Cách đọc EXPLAIN ANALYZE — Recipe 30 giây

5.1 Workflow

flowchart TD
    A[Get EXPLAIN ANALYZE BUFFERS] --> B[Tìm execution time]
    B --> C{>100ms cho OLTP?}
    C -->|No| C1[Có thể OK, move on]
    C -->|Yes| D[Tìm node tốn thời gian nhất<br/>actual time biggest]
    D --> E{Estimate vs Actual rows mismatch?}
    E -->|Yes 10x+| E1[Stats stale<br/>ANALYZE + recheck]
    E -->|No| F[Check node type]
    F --> G{Seq Scan trên bảng lớn?}
    G -->|Yes| G1[Add index hoặc rewrite]
    F --> H{Sort spill to disk?}
    H -->|Yes| H1[Tăng work_mem hoặc add index]
    F --> I{Nested Loop với outer >10K rows?}
    I -->|Yes| I1[Force Hash/Merge join]
    F --> J{Heap Fetches > 0 trong Index Only Scan?}
    J -->|Yes| J1[VACUUM table]
    F --> K{Buffers read >> hit?}
    K -->|Yes| K1[Cache cold - tune shared_buffers hoặc query thường xuyên hơn]

5.2 Worked example

Query:

EXPLAIN (ANALYZE, BUFFERS)
SELECT u.name, count(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.country = 'VN'
GROUP BY u.id, u.name
ORDER BY order_count DESC
LIMIT 100;

Plan:

Limit  (cost=84421.05..84421.30 rows=100) (actual time=2103..2103 rows=100 loops=1)
  ->  Sort  (cost=84421.05..84671.05 rows=100000) (actual time=2095..2098 rows=100 loops=1)
        Sort Key: (count(o.id)) DESC
        Sort Method: top-N heapsort  Memory: 41kB
        ->  HashAggregate  (cost=82500..83000 rows=100000) (actual time=1900..2050 rows=100000 loops=1)
              Group Key: u.id
              Buffers: shared hit=10000 read=5000
              ->  Hash Right Join  (cost=...) (actual time=300..1500 rows=500000 loops=1)
                    Hash Cond: (o.user_id = u.id)
                    Buffers: shared hit=8000 read=4500
                    ->  Seq Scan on orders o  (cost=...) (actual rows=1000000 loops=1)
                          Buffers: shared hit=5000 read=4000
                    ->  Hash  (rows=100000)
                          Buckets: 131072  Memory: 8192kB
                          ->  Seq Scan on users u  (cost=...) (actual rows=100000 loops=1)
                                Filter: (country = 'VN')
                                Rows Removed by Filter: 400000
                                Buffers: shared hit=3000 read=500
 Planning Time: 0.5 ms
 Execution Time: 2103 ms

Đọc:

Total time 2103ms — quá chậm cho OLTP
Nút thắt: Seq Scan on orders (1M rows, ~1200ms estimated)
Seq Scan on users: filter country = 'VN', removed 400K rows — cần index trên country
Buffers: shared read=4000 trên orders → cold cache

Action:

CREATE INDEX idx_users_country ON users(country);
CREATE INDEX idx_orders_user ON orders(user_id);
ANALYZE;

Rerun. Check new plan — có thể chuyển sang Hash Join với Index Scan, time giảm xuống ~100ms.

5.3 Common red flags

Red flag	Ý nghĩa	Hành động
`Estimate rows`: 1, `Actual rows`: 10000	Stats wrong	ANALYZE, hoặc extended stats
`Seq Scan` trên huge table	Missing index	Add index
`Rows Removed by Filter`: 10000+	Filter sau index	Composite/partial index
`Heap Fetches`: large in Index Only Scan	VM stale	VACUUM
`Sort Method: external sort`	Spill to disk	Tăng work_mem hoặc add index
`Batches: 4+` trong Hash	Hash spill	Tăng work_mem
`loops` >1000 với Index Scan	Repeated nested loop	Restructure / better join order
`shared read >> hit`	Cache cold	Warmup hoặc tune shared_buffers
`Planning Time` >> Execution Time	Plan cache miss / complex query	Prepared statement

6. Statistics — Bạn bí mật của planner

6.1 pg_statistic

Planner ước cardinality dựa trên pg_statistic. Mỗi column lưu:

null_frac — % NULL
avg_width — bytes per value
n_distinct — distinct values estimate (-1 = unique)
most_common_vals (MCV) + most_common_freqs — top frequent values
histogram_bounds — distribution buckets
correlation — physical vs logical order

SELECT attname, null_frac, n_distinct, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

6.2 default_statistics_target

Số buckets trong histogram + MCV count. Default 100. Tăng cho column quan trọng có distribution phức tạp:

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;

Trade-off:

Cao hơn → planner accurate hơn → faster plans
Cao hơn → ANALYZE chậm hơn, planning time tăng

Pattern 2024: 1000 cho column hot (filter, join, group), giữ default cho rest.

6.3 Correlated columns — extended statistics (PG10+)

Planner default assume column independent:

SELECT * FROM addresses WHERE city = 'San Francisco' AND state = 'CA';
-- Planner: P(city=SF) × P(state=CA) ≈ 0.001 × 0.05 = 0.00005
-- Reality: 100% SF in CA → P = 0.001 (not 0.00005)

Extended statistics fix:

CREATE STATISTICS stats_addr (dependencies, ndistinct, mcv)
ON city, state FROM addresses;
 
ANALYZE addresses;

Type:

ndistinct — distinct combos
dependencies — functional dependency (state → country)
mcv (PG12+) — multi-column MCV list

6.4 ANALYZE timing

Autovacuum cũng analyze:

autovacuum_analyze_scale_factor = 0.1 (10% rows changed → analyze)
autovacuum_analyze_threshold = 50

High-write table → tune sớm hơn:

ALTER TABLE events SET (autovacuum_analyze_scale_factor = 0.02);

Manual ANALYZE sau bulk operation:

ANALYZE orders;
ANALYZE VERBOSE orders;   -- in progress
ANALYZE orders (column1, column2);  -- specific columns PG14+

7. Join Algorithms Deep Dive

7.1 Nested Loop

For each row in outer:
    For each row in inner matching join cond:
        Output combination

Cost: outer_rows × inner_cost_per_lookup

Tốt khi: outer rows small (1-100) AND inner has index on join key.

Tệ khi: outer >10000 → quadratic. Đa số “query đang nhanh, đột nhiên chậm” do data growth làm nested loop unfit.

7.2 Hash Join

Build hash table from smaller side (in memory)
For each row in other side:
    Lookup hash table
    Output matches

Cost: O(N + M) linear.

Tốt khi: both sides large, không có sortable index.

Tệ khi: build side > work_mem → spill batches → slow.

work_mem = 64MB  -- session override khi cần big hash

7.3 Merge Join

Sort both sides by join key (or use sorted index)
Walk both in parallel

Cost: O(N log N + M log M) if need sort, O(N + M) if already sorted.

Tốt khi: cả 2 sides đã sorted (vd: index scan trên join key).

Tệ khi: cần sort cả 2 sides huge → expensive.

7.4 Khi nào planner pick sai

Trường hợp kinh điển:

SELECT * FROM orders o JOIN users u ON o.user_id = u.id WHERE u.country = 'VN';

Stats nói “100 users trong VN” (correct). Planner pick Nested Loop:

Nested Loop
  Seq Scan on users (filter country='VN') → 100 rows
  Index Scan on orders by user_id (per outer row) → 100 lookups, mỗi ~100 rows = 10000 total

Sau 1 năm, “VN users” tăng lên 100K. Planner stats stale, vẫn ước 100. Vẫn pick Nested Loop:

Nested Loop
  100K outer rows × 100 inner lookups each = 10M operations
  ← Slow!

Fix:

ANALYZE → planner update → pick Hash Join
Hoặc extended stats nếu correlation issue

7.5 Force / hint joins

Postgres không có hint syntax. Workaround for debugging:

SET enable_nestloop = off;
EXPLAIN ANALYZE <query>;
SET enable_nestloop = on;

Production: fix root cause (stats), không tắt feature.

8. CTE Optimization — PG12 behavior change

8.1 Pre-PG12: CTE là optimization fence

-- Pre-PG12
WITH active_users AS (
    SELECT * FROM users WHERE active = true  -- always materialized
)
SELECT * FROM active_users WHERE id = 1;
-- Planner: scan all active users, filter id=1 (slow!)

8.2 PG12+: CTE inline by default

-- PG12+
WITH active_users AS (
    SELECT * FROM users WHERE active = true  -- inlined!
)
SELECT * FROM active_users WHERE id = 1;
-- Planner: Index Scan on users WHERE active=true AND id=1 (fast)

Inline khi:

Non-recursive
Referenced 1 lần
Không có side effect

8.3 Khi cần force materialize

WITH expensive_calc AS MATERIALIZED (
    SELECT ... FROM ... -- heavy aggregation
)
SELECT ... FROM expensive_calc WHERE ... -- nhiều reference
UNION
SELECT ... FROM expensive_calc WHERE ...;

MATERIALIZED keyword: tính 1 lần, reuse.

8.4 Recursive CTE

WITH RECURSIVE category_tree AS (
    SELECT id, name, parent_id, 0 AS depth FROM categories WHERE parent_id IS NULL
    UNION ALL
    SELECT c.id, c.name, c.parent_id, ct.depth + 1
    FROM categories c JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT * FROM category_tree WHERE depth <= 5;

Plan:

CTE category_tree
  ->  Recursive Union
        ->  Seq Scan on categories  -- anchor
        ->  Hash Join                 -- iteration
              ->  WorkTable Scan on category_tree
              ->  Seq Scan on categories

Pitfall: vô tận khi data có cycle. Add CYCLE clause (PG14+) hoặc depth limit.

9. Window Functions — Phân tích trong SQL

9.1 Anatomy

function_name([args]) OVER (
    [PARTITION BY expr_list]
    [ORDER BY expr_list]
    [frame_clause]
)

3 parts:

PARTITION BY: divide rows into groups (like GROUP BY but kept separate)
ORDER BY: within partition, ordering matters for ranking + frames
Frame: subset of partition relative to current row (sum-so-far, moving avg, etc.)

9.2 Function categories

Ranking

row_number()                   -- 1, 2, 3 sequential
rank()                         -- 1, 2, 2, 4 (gaps after tie)
dense_rank()                   -- 1, 2, 2, 3 (no gaps)
percent_rank()                 -- (rank - 1) / (count - 1)
cume_dist()                    -- cumulative distribution
ntile(N)                       -- divide into N buckets

Offset / position

lag(col, offset, default)      -- previous row value
lead(col, offset, default)     -- next row value
first_value(col)               -- first in frame
last_value(col)                -- last in frame
nth_value(col, n)              -- nth in frame

Aggregates as window

sum(col), avg(col), min(col), max(col), count(*), etc.
-- Plus string_agg, array_agg, etc. with window

9.3 Frames — the underused gem

Default frame when ORDER BY present: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (running total).

Three frame modes:

ROWS — physical rows

-- Moving average over last 7 rows
avg(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)

RANGE — by value

-- Sum within $100 of current
sum(amount) OVER (ORDER BY amount RANGE BETWEEN 100 PRECEDING AND 100 FOLLOWING)
 
-- 7-day window by date
sum(value) OVER (ORDER BY date RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW)

GROUPS (PG11+) — peer groups

-- 3 distinct date groups before current
avg(value) OVER (ORDER BY date GROUPS BETWEEN 3 PRECEDING AND CURRENT ROW)

EXCLUDE (PG11+)

sum(value) OVER (
    ORDER BY id
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    EXCLUDE CURRENT ROW       -- or TIES, GROUP, NO OTHERS
)

Frame bounds

UNBOUNDED PRECEDING / FOLLOWING
N PRECEDING / FOLLOWING
CURRENT ROW

9.4 Trong plan

WindowAgg  (cost=...) (actual rows=1M loops=1)
  ->  Sort
        Sort Key: user_id, placed_at
        ->  Seq Scan on orders

Window function cần sort theo PARTITION BY + ORDER BY. Index khớp có thể skip sort:

CREATE INDEX idx_orders_user_placed ON orders(user_id, placed_at);

Multiple windows with same PARTITION + ORDER share Sort:

-- 1 Sort, multiple WindowAgg
SELECT
    sum(total) OVER w,
    avg(total) OVER w,
    rank() OVER w
FROM orders
WINDOW w AS (PARTITION BY user_id ORDER BY placed_at);

9.5 Patterns

Top N per group (3 ways)

-- A. Window function
SELECT * FROM (
    SELECT *, row_number() OVER (PARTITION BY user_id ORDER BY placed_at DESC) AS rn
    FROM orders
) t WHERE rn <= 3;
 
-- B. LATERAL join (often faster for small N)
SELECT u.id, o.* FROM users u, LATERAL (
    SELECT * FROM orders WHERE user_id = u.id
    ORDER BY placed_at DESC LIMIT 3
) o;
 
-- C. DISTINCT ON (Postgres-specific)
SELECT DISTINCT ON (user_id) * FROM orders ORDER BY user_id, placed_at DESC;
-- Only 1 per group; for N > 1 use A or B

Benchmark which is fastest for your data.

Lag / Lead (compare with previous)

-- Time between orders
SELECT
    id, placed_at,
    placed_at - lag(placed_at) OVER (PARTITION BY user_id ORDER BY placed_at) AS gap_since_last
FROM orders;
 
-- Did value change?
SELECT id, value,
    CASE WHEN value <> lag(value) OVER (ORDER BY ts) THEN 'changed' ELSE 'same' END AS status
FROM measurements;

Running total / cumulative

SELECT id, sum(total_cents) OVER (PARTITION BY user_id ORDER BY placed_at) AS lifetime_so_far
FROM orders;

Moving average

SELECT date, value,
    avg(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma_7day
FROM metrics;

Percentile / quartile

SELECT name, score,
    ntile(4) OVER (ORDER BY score) AS quartile,
    percent_rank() OVER (ORDER BY score) AS pct_rank
FROM students;

Gaps & islands (consecutive runs)

-- Find runs of consecutive days
WITH numbered AS (
    SELECT date,
        date - (row_number() OVER (ORDER BY date))::int * interval '1 day' AS group_key
    FROM events
)
SELECT min(date) AS run_start, max(date) AS run_end, count(*) AS days
FROM numbered GROUP BY group_key;

Sessionization

-- Group events into sessions (>30min gap = new session)
WITH gaps AS (
    SELECT *,
        CASE WHEN ts - lag(ts) OVER (PARTITION BY user_id ORDER BY ts) > interval '30 min'
             OR lag(ts) OVER (PARTITION BY user_id ORDER BY ts) IS NULL
        THEN 1 ELSE 0 END AS is_new_session
    FROM events
),
sessions AS (
    SELECT *, sum(is_new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_id
    FROM gaps
)
SELECT user_id, session_id, count(*) FROM sessions GROUP BY user_id, session_id;

Last value with NULL handling

-- last non-null status
SELECT id, ts, status,
    last_value(status) OVER (
        PARTITION BY user_id ORDER BY ts
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS effective_status
FROM history WHERE status IS NOT NULL;

9.6 Window vs Group By

Group By	Window
Reduce rows	Keep all rows
Aggregate per group	Per-row computation with context
`SELECT count(*) GROUP BY user_id`	`SELECT , count() OVER (PARTITION BY user_id)`
Use for summary	Use for ranking, comparing, running totals

9.7 Performance tips

Match index to PARTITION BY + ORDER BY → avoid Sort
Reuse named windows (WINDOW w AS ...)
Filter before window when possible (use CTE/subquery)
ROWS frame faster than RANGE in some cases
LATERAL often beats window for top-N per group with small N

10. LATERAL Join — superpower

SELECT u.id, latest.placed_at, latest.total_cents
FROM users u
LATERAL (
    SELECT placed_at, total_cents FROM orders WHERE user_id = u.id
    ORDER BY placed_at DESC LIMIT 1
) latest;

Cho mỗi user, top-1 latest order. Cú pháp này LATERAL cho phép subquery reference outer columns.

Useful cho:

Top-N per group (alternative to window function)
Correlated subquery rewrite
JSON expand per row

11. Anti-patterns

11.1 `SELECT *` thay vì column cụ thể

→ Block index-only scan (cần index INCLUDE mọi column) → Network bandwidth waste → Refactoring nightmare (drop column nhưng app vẫn select *)

11.2 OFFSET pagination

-- Page 100, 20 per page
SELECT * FROM orders ORDER BY placed_at DESC OFFSET 2000 LIMIT 20;
-- Plan: scan 2020 rows, throw 2000, return 20

→ Use keyset pagination (Tuần 03 section 10.9).

11.3 `function(column) = value`

-- BAD
SELECT * FROM users WHERE lower(email) = 'a@x.com';
-- Index trên (email) không dùng được
 
-- FIX 1: expression index
CREATE INDEX idx_users_email_lower ON users(lower(email));
 
-- FIX 2: store lowercase, hoặc citext type

11.4 Implicit type cast

-- column id is bigint
SELECT * FROM users WHERE id = '42';   -- text cast to bigint
SELECT * FROM users WHERE id = 42.0;   -- numeric cast

Có thể prevent index use trong edge case. Always match types.

11.5 `NOT IN` với nullable column

-- BAD - returns no rows if subquery has NULL
SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM blocked);
 
-- GOOD
SELECT * FROM users WHERE NOT EXISTS (SELECT 1 FROM blocked WHERE blocked.user_id = users.id);

NULL semantics trong NOT IN là gốc của bug kinh điển.

11.6 ORM N+1

Mỗi parent row → 1 query con. 100 users → 101 queries.

→ Use JOIN, batch loading, DataLoader pattern. Sẽ học Tuan-16-ORM-CQRS-Multi-Tenancy.

11.7 `count(*)` trên huge table

SELECT count(*) FROM orders;  -- scan 100M rows

→ Approximate từ pg_class.reltuples:

SELECT reltuples::bigint FROM pg_class WHERE relname = 'orders';

→ Denormalized counter (Tuần 02).

11.8 `LIKE '%abc%'` mà không có trigram index

→ Seq scan. Add pg_trgm GIN index.

11.9 Subquery uncorrelated trong WHERE

-- BAD: subquery runs once for every parent row potentially
SELECT * FROM orders WHERE total > (SELECT avg(total) FROM orders);
 
-- BETTER: precompute
WITH stats AS MATERIALIZED (SELECT avg(total) AS avg_t FROM orders)
SELECT o.* FROM orders o, stats WHERE o.total > stats.avg_t;

11.10 OR conditions across columns

-- Often defeats index
SELECT * FROM orders WHERE user_id = 1 OR seller_id = 1;
 
-- BETTER: UNION
SELECT * FROM orders WHERE user_id = 1
UNION
SELECT * FROM orders WHERE seller_id = 1;

Mỗi side dùng index riêng, planner combine bằng Bitmap.

12. Tools

12.1 explain.depesz.com

Paste EXPLAIN ANALYZE output → colored highlight slow nodes.

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) <query>;
-- Copy output → paste vào website

12.2 pev2 (Postgres EXPLAIN Visualizer 2)

https://github.com/dalibo/pev2

Visual tree, percentage time per node, statistics inline. Self-host hoặc dùng demo.

Paste JSON format:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT JSON) <query>;

12.3 pgMustard

Commercial. AI-assisted analysis với recommendation cụ thể.

12.4 auto_explain

Extension auto log slow queries’ plan.

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 1000;     -- ms
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;
SET auto_explain.log_format = 'json';

Trong postgresql.conf:

shared_preload_libraries = 'auto_explain, pg_stat_statements'
auto_explain.log_min_duration = 1000
auto_explain.log_analyze = on

Mọi query >1s sẽ log plan → tìm slow query mà không chạy lại.

12.5 pg_stat_statements

Aggregate query stats. Sẽ đào sâu Tuần 09. Preview:

SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;

13. Lab — 7-day deep dive

13.1 Setup data

Reuse e-commerce schema từ Tuần 02. Bulk insert larger:

-- 10M orders, 500K users, 200K products
-- See Tuần 03 lab section

13.2 Day 1-2: Plan reading basics

Run 20 different queries, EXPLAIN ANALYZE each, identify:

Total time
Bottleneck node
Estimate vs actual mismatch

13.3 Day 3: Statistics deep dive

-- Check current stats
SELECT * FROM pg_stats WHERE tablename = 'orders';
 
-- Tune
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;
 
-- Verify change in plan

Reproduce planner getting it wrong:

INSERT INTO orders ... 100K rows ...
-- Don't ANALYZE
EXPLAIN <query>;
-- Bad estimate
ANALYZE orders;
EXPLAIN <query>;
-- Better

13.4 Day 4: Joins

Compare 3 join algorithms forced:

SET enable_hashjoin = off;
SET enable_mergejoin = off;
EXPLAIN ANALYZE <query>;
-- Nested loop only
 
-- Repeat for hash, merge
RESET enable_hashjoin;
RESET enable_mergejoin;

13.5 Day 5: Window functions

Build dashboard query: top spender per country, running total per user, rank changes month over month.

13.6 Day 6: Plan visualizers

Paste plans into depesz, pev2. Practice reading visual representation.

13.7 Day 7: Capstone

Take a real query slow in production (or create one):

Get baseline EXPLAIN ANALYZE BUFFERS
Identify bottleneck
Hypothesis fix
Implement (index, rewrite, stats tune)
Compare plans
Document decision

14. Self-check

EXPLAIN vs EXPLAIN ANALYZE — khác nhau gì? Khi nào dùng cái nào?
Cost trong EXPLAIN — đơn vị gì? Default random_page_cost = 4.0 sai khi nào?
3 loại scan: Seq, Index, Index Only. Khi nào planner pick mỗi loại?
Heap Fetches > 0 trong Index Only Scan — nguyên nhân? Fix?
3 join algorithm — Nested Loop, Hash, Merge. Cho mỗi cái, khi nào tốt nhất?
Sort Method: external sort — vấn đề gì? Fix?
Estimate rows = 100, Actual rows = 100K. Lý do và cách fix?
CTE pre-PG12 vs PG12+ — khác biệt? Khi nào dùng MATERIALIZED?
WHERE lower(email) = 'a@x.com' không dùng index trên email. Tại sao? Fix?
Khi nào pick window function vs GROUP BY?

15. Tiếp theo

Bài tiếp: Tuan-05-Transactions-Locking — sau khi master read path, master write path + concurrency.

Đọc:

Use The Index, Luke! full
Postgres docs Ch.14 Performance Tips
Database Internals (Petrov) Ch. Query Execution

Tuần 04 hoàn thành. Read the plan, not the query. Cập nhật: 2026-05-16

lthieu's notes

Explorer

Tuan-04-Query-Optimization-EXPLAIN