Tuần 04 — Query Optimization & EXPLAIN
“Đọc EXPLAIN ANALYZE giống đọc bản đồ. Người mới nhìn thấy chỉ là một mớ chữ. Người có kinh nghiệm thấy được con đường, các nút thắt, và tại sao xe kẹt ở đâu. Tuần này dạy bạn cách đọc bản đồ đó.”
Tags: database postgresql explain query-optimization performance Thời lượng: 7 ngày (5-7h/ngày) Prerequisites: Tuan-03-Indexing-Mastery (hiểu index để hiểu plan) Liên quan: Tuan-09-DB-Observability-Tuning (pg_stat_statements) · Tuan-05-Transactions-Locking
1. Context & Why
1.1 EXPLAIN ANALYZE là dụng cụ chẩn đoán quan trọng nhất của DB
Khi query slow, có 3 cách phản ứng:
- Add index “phòng hờ” — guessing → 50% miss
- Rewrite query “có thể nhanh hơn” — guessing → 50% miss
- Đọc EXPLAIN ANALYZE → biết chính xác bottleneck → fix targeted → 95% hit
Senior engineer dành 80% thời gian DB cho việc đọc plan, 20% cho việc fix.
1.2 Mục tiêu tuần
- Đọc EXPLAIN ANALYZE output trong 30 giây tìm ra bottleneck
- Hiểu cost model: pages × cost + tuples × cost
- Phân biệt và biết khi nào dùng: Seq Scan, Index Scan, Index-Only, Bitmap, Sort, Hash
- Đọc 3 join algorithms: nested loop, hash, merge — và biết khi nào planner chọn sai
- Statistics: hiểu
default_statistics_target,pg_statistic, extended statistics - Tránh được 10 anti-pattern phổ biến nhất
- Dùng tools: explain.depesz.com, pev2, pgMustard
1.3 Tham chiếu
- PostgreSQL Documentation: EXPLAIN — https://www.postgresql.org/docs/current/sql-explain.html
- PostgreSQL Documentation: Performance Tips — https://www.postgresql.org/docs/current/performance-tips.html
- Use The Index, Luke! Ch. “The Join Operation”
- explain.depesz.com — paste EXPLAIN, get colored analysis
- dalibo/pev2 — visual plan viewer
- pgMustard — AI-assisted EXPLAIN analysis (commercial)
- Bruce Momjian — “Explaining the Postgres Query Optimizer” slides
2. EXPLAIN — Syntax & Options
2.1 Forms
EXPLAIN <query>; -- chỉ plan, không chạy
EXPLAIN ANALYZE <query>; -- chạy thật, đo thời gian thật
EXPLAIN (ANALYZE, BUFFERS) <query>; -- thêm I/O stats
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT JSON) <query>;2.2 Options đầy đủ
| Option | Mặc định | Mục đích |
|---|---|---|
ANALYZE | off | Chạy query, đo actual time/rows |
BUFFERS | off | Thống kê shared/local buffers hit, read |
VERBOSE | off | Output column list, schema-qualified |
COSTS | on | Hiện estimated cost |
TIMING | on | Per-node timing (cần ANALYZE) |
SUMMARY | on if ANALYZE | Tổng planning/execution time |
SETTINGS | off (PG12+) | List non-default settings ảnh hưởng plan |
WAL | off (PG13+) | WAL records generated by DML |
GENERIC_PLAN | off (PG16+) | Plan cho parameterized query không có giá trị thật |
FORMAT | text | text/xml/json/yaml |
Recommended để debug:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS, FORMAT TEXT) <query>;2.3 Đọc output cơ bản
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────
Limit (cost=0.43..8.55 rows=20 width=64) (actual time=0.025..0.075 rows=20 loops=1)
-> Index Scan using idx_orders_user_placed on orders (cost=0.43..406.32 rows=1000 width=64)
(actual time=0.024..0.071 rows=20 loops=1)
Index Cond: (user_id = 42)
Planning Time: 0.123 ms
Execution Time: 0.105 ms
Đọc:
- Node tree từ trong ra ngoài (deepest first)
cost=startup..total— abstract unit (xem section 3)rows=N— estimateactual time=startup..total ms— đo thựcrows=N loops=M— node chạy M lần, mỗi lần produce ~N rowsPlanning Time— thời gian build planExecution Time— thời gian chạy
2.4 Đọc BUFFERS
Index Scan using idx_x on orders (cost=...) (actual time=...)
Buffers: shared hit=1234 read=56 dirtied=2 written=0
shared hit— pages tìm thấy trong shared_buffersshared read— pages phải đọc từ disk (hoặc OS cache)shared dirtied— pages bị modify (kể cả từ HOT update set hint bits)temp read/written— temp files (sort spill, hash spill)
Key insight: read cao = cache miss → tune shared_buffers hoặc query khác.
2.5 EXPLAIN ANALYZE side effects
⚠️ EXPLAIN ANALYZE INSERT/UPDATE/DELETE thực sự thay đổi data. Wrap trong transaction nếu chỉ test:
BEGIN;
EXPLAIN (ANALYZE, BUFFERS) UPDATE orders SET status='paid' WHERE id=1;
ROLLBACK;⚠️ ANALYZE đo per-row timing → overhead lớn (5-30%). Cẩn thận khi đọc số tuyệt đối — relative comparison đáng tin hơn.
3. Cost Model — Tiền tệ ảo của planner
3.1 Đơn vị cost
Cost là arbitrary unit. Định nghĩa relative theo seq_page_cost = 1.0. Mọi cost khác là multiplier.
seq_page_cost = 1.0 # 1 sequential page
random_page_cost = 4.0 # 1 random page (4x SD legacy; tune 1.1 cho SSD)
cpu_tuple_cost = 0.01 # process 1 tuple
cpu_index_tuple_cost = 0.005 # process 1 index tuple
cpu_operator_cost = 0.0025 # 1 operator/function call
parallel_setup_cost = 1000 # spawn parallel workers
parallel_tuple_cost = 0.1 # send 1 tuple from worker to leader3.2 Công thức ước tính
Seq Scan:
cost = relation_pages × seq_page_cost + reltuples × cpu_tuple_cost
Index Scan:
cost = log(N) × random_page_cost (index pages)
+ estimated_matching_rows × random_page_cost (heap fetches)
+ estimated_matching_rows × cpu_tuple_cost
3.3 Vì sao random_page_cost = 4.0 sai trên SSD
Default từ era HDD (random ~10x slower than sequential). Modern SSD: random ≈ sequential.
random_page_cost = 1.1 # SSD
effective_io_concurrency = 200 # SSD parallel I/OTác động:
random_page_costcao → planner avoid index → seq scan- Tune xuống → planner chuộng index → đúng intent với SSD
3.4 Cost vs Actual time — khi nào mismatch
Cost: 1000
Actual time: 5000 ms
Lý do thường gặp:
- Bad row estimate — planner predict 1000 rows, thực tế 100000
- Stats stale — chưa ANALYZE
- Correlated columns — Postgres assume independent (xem section 6)
- Data distribution skew — vài value chiếm 90% rows
→ Tune stats trước khi tune query.
4. Plan Nodes — Bestiary
graph TB subgraph "Scan nodes" SS[Seq Scan] IS[Index Scan] IOS[Index Only Scan] BHS[Bitmap Heap Scan + Bitmap Index Scan] TS[Tid Scan] FS[Foreign Scan] end subgraph "Join nodes" NL[Nested Loop] HJ[Hash Join] MJ[Merge Join] end subgraph "Aggregate" AGG[Aggregate / GroupAggregate] HASHAGG[HashAggregate] end subgraph "Other" SORT[Sort] LIMIT[Limit] MAT[Materialize] HASH[Hash] GATHER[Gather / Gather Merge - parallel] CTE[CTE Scan] SUB[SubPlan / InitPlan] end
4.1 Seq Scan
Đọc tuần tự toàn bộ heap. Fastest cho:
- Bảng nhỏ (< 1000 rows, fit 1 page)
- Query trả >5-10% rows
- Không có index match
- Hot table fully cached
Seq Scan on orders (cost=0.00..21370.00 rows=1000000 width=64)
Filter: (status = 'pending'::text)
Rows Removed by Filter: 750000
Rows Removed by Filter — đọc rows, filter ra ngoài. Nếu lớn → cần index.
4.2 Index Scan
Đi vào index → tìm ctid → fetch heap.
Index Scan using idx_orders_user on orders (cost=...) (actual rows=20 loops=1)
Index Cond: (user_id = 42)
Filter: (status = 'pending'::text)
Rows Removed by Filter: 30
Index Cond— filter dùng index (efficient)Filter— filter post-fetch heap (recheck)
Nếu nhiều Rows Removed by Filter → consider include filter column trong index.
4.3 Index Only Scan
Đọc chỉ từ index, không fetch heap. Cần:
- Mọi column query reference có trong index (key hoặc INCLUDE)
- VM bit “all-visible” (Tuần 01)
Index Only Scan using idx_orders_cov on orders (cost=...) (actual rows=20 loops=1)
Index Cond: (user_id = 42)
Heap Fetches: 0 ← perfect index-only scan
Heap Fetches: 0 = thực sự index-only. > 0 = phải fetch heap cho rows không all-visible → VACUUM more.
4.4 Bitmap Heap Scan + Bitmap Index Scan
Combine nhiều index, deduplicate bằng bitmap.
Bitmap Heap Scan on orders (cost=...) (actual rows=5000 loops=1)
Recheck Cond: ((user_id = 42) OR (user_id = 43))
Heap Blocks: exact=1234
-> BitmapOr (cost=...)
-> Bitmap Index Scan on idx_user
Index Cond: (user_id = 42)
-> Bitmap Index Scan on idx_user
Index Cond: (user_id = 43)
Hoặc combine 2 conditions:
Bitmap Heap Scan on orders
Recheck Cond: ((user_id = 42) AND (status = 'pending'))
-> BitmapAnd
-> Bitmap Index Scan on idx_user
-> Bitmap Index Scan on idx_status
Khi nào planner pick Bitmap:
- Trung gian giữa Index Scan (very few rows) và Seq Scan (many rows)
- Combine 2+ indexes
- Khi rows lớn nhưng vẫn nhỏ hơn full table
4.5 Nested Loop Join
Nested Loop (cost=...) (actual rows=20 loops=1)
-> Index Scan using idx_users_pkey on users (1 row)
Index Cond: (id = 42)
-> Index Scan using idx_orders_user on orders (20 rows, loops=1)
Index Cond: (user_id = users.id)
Cho mỗi row của outer, scan inner. Tốt khi:
- Outer ít rows (1-100)
- Inner có index trên join key
- ⚠️ Quadratic nếu outer nhiều → chậm khủng khiếp
4.6 Hash Join
Hash Join (cost=...) (actual rows=100000 loops=1)
Hash Cond: (orders.user_id = users.id)
-> Seq Scan on orders
-> Hash (cost=...)
Buckets: 65536 Batches: 1 Memory Usage: 4096kB
-> Seq Scan on users
Build hash table từ smaller side, probe từ larger side.
Tốt khi:
- Join 2 bảng to
- Không index trên join column
- Có RAM cho hash table
Spill: nếu hash table > work_mem → spill to disk (batches > 1) → chậm:
Batches: 4 Memory Usage: 16384kB Disk Usage: 32MB
→ Tăng work_mem.
4.7 Merge Join
Merge Join (cost=...) (actual rows=100000)
Merge Cond: (a.id = b.a_id)
-> Index Scan ... (sorted by a.id)
-> Index Scan ... (sorted by b.a_id)
Cả 2 sides sorted theo join column. Walk through đồng thời.
Tốt khi:
- Cả 2 sides đã sorted (index hoặc Sort cheap)
- Join keys unique → no duplicate work
4.8 Sort
Sort (cost=...) (actual rows=1000 loops=1)
Sort Key: orders.created_at DESC
Sort Method: quicksort Memory: 256kB
Methods:
quicksort— fit inwork_memexternal sort Disk: NNNkB— spill, slowtop-N heapsort— forLIMIT k
Tip: Nếu thấy external sort, tăng work_mem cho session/query.
4.9 HashAggregate vs GroupAggregate
HashAggregate (cost=...) (actual rows=100 loops=1)
Group Key: user_id
-> Seq Scan on orders
-- vs
GroupAggregate (cost=...) (actual rows=100 loops=1)
Group Key: user_id
-> Sort (Sort Key: user_id)
-> Seq Scan on orders
- HashAggregate: hash table, fast, fit memory
- GroupAggregate: cần sorted input, dùng khi data đã sorted hoặc cần ORDER BY group
4.10 Gather / Gather Merge — parallelism
Gather (cost=...) (actual rows=1000 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on orders (rows=500 loops=3)
Parallel query (PG9.6+):
loops=3= leader + 2 workers- Total rows = N × loops
- Khi nào planner pick parallel: bảng to + có CPU + cost >
parallel_setup_cost
Settings:
max_parallel_workers_per_gather = 2
max_parallel_workers = 8
parallel_setup_cost = 1000
min_parallel_table_scan_size = 8MB4.11 CTE Scan / InitPlan / SubPlan
WITH active_users AS MATERIALIZED (
SELECT * FROM users WHERE active = true
)
SELECT * FROM orders o JOIN active_users u ON o.user_id = u.id;Hash Join
-> Seq Scan on orders
-> CTE Scan on active_users
-> Seq Scan on users (rows=10000)
PG12 behavior change: CTE inline by default (như subquery), trừ khi:
- Recursive CTE
- Marked
MATERIALIZED - Referenced multiple times
Trước PG12: CTE always materialized → optimization fence, slow. Pattern cũ dùng để “force planner” giờ cần explicit MATERIALIZED.
5. Cách đọc EXPLAIN ANALYZE — Recipe 30 giây
5.1 Workflow
flowchart TD A[Get EXPLAIN ANALYZE BUFFERS] --> B[Tìm execution time] B --> C{>100ms cho OLTP?} C -->|No| C1[Có thể OK, move on] C -->|Yes| D[Tìm node tốn thời gian nhất<br/>actual time biggest] D --> E{Estimate vs Actual rows mismatch?} E -->|Yes 10x+| E1[Stats stale<br/>ANALYZE + recheck] E -->|No| F[Check node type] F --> G{Seq Scan trên bảng lớn?} G -->|Yes| G1[Add index hoặc rewrite] F --> H{Sort spill to disk?} H -->|Yes| H1[Tăng work_mem hoặc add index] F --> I{Nested Loop với outer >10K rows?} I -->|Yes| I1[Force Hash/Merge join] F --> J{Heap Fetches > 0 trong Index Only Scan?} J -->|Yes| J1[VACUUM table] F --> K{Buffers read >> hit?} K -->|Yes| K1[Cache cold - tune shared_buffers hoặc query thường xuyên hơn]
5.2 Worked example
Query:
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.name, count(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.country = 'VN'
GROUP BY u.id, u.name
ORDER BY order_count DESC
LIMIT 100;Plan:
Limit (cost=84421.05..84421.30 rows=100) (actual time=2103..2103 rows=100 loops=1)
-> Sort (cost=84421.05..84671.05 rows=100000) (actual time=2095..2098 rows=100 loops=1)
Sort Key: (count(o.id)) DESC
Sort Method: top-N heapsort Memory: 41kB
-> HashAggregate (cost=82500..83000 rows=100000) (actual time=1900..2050 rows=100000 loops=1)
Group Key: u.id
Buffers: shared hit=10000 read=5000
-> Hash Right Join (cost=...) (actual time=300..1500 rows=500000 loops=1)
Hash Cond: (o.user_id = u.id)
Buffers: shared hit=8000 read=4500
-> Seq Scan on orders o (cost=...) (actual rows=1000000 loops=1)
Buffers: shared hit=5000 read=4000
-> Hash (rows=100000)
Buckets: 131072 Memory: 8192kB
-> Seq Scan on users u (cost=...) (actual rows=100000 loops=1)
Filter: (country = 'VN')
Rows Removed by Filter: 400000
Buffers: shared hit=3000 read=500
Planning Time: 0.5 ms
Execution Time: 2103 ms
Đọc:
- Total time 2103ms — quá chậm cho OLTP
- Nút thắt: Seq Scan on orders (1M rows, ~1200ms estimated)
- Seq Scan on users: filter
country = 'VN', removed 400K rows — cần index trêncountry - Buffers: shared read=4000 trên orders → cold cache
Action:
CREATE INDEX idx_users_country ON users(country);
CREATE INDEX idx_orders_user ON orders(user_id);
ANALYZE;Rerun. Check new plan — có thể chuyển sang Hash Join với Index Scan, time giảm xuống ~100ms.
5.3 Common red flags
| Red flag | Ý nghĩa | Hành động |
|---|---|---|
Estimate rows: 1, Actual rows: 10000 | Stats wrong | ANALYZE, hoặc extended stats |
Seq Scan trên huge table | Missing index | Add index |
Rows Removed by Filter: 10000+ | Filter sau index | Composite/partial index |
Heap Fetches: large in Index Only Scan | VM stale | VACUUM |
Sort Method: external sort | Spill to disk | Tăng work_mem hoặc add index |
Batches: 4+ trong Hash | Hash spill | Tăng work_mem |
loops >1000 với Index Scan | Repeated nested loop | Restructure / better join order |
shared read >> hit | Cache cold | Warmup hoặc tune shared_buffers |
Planning Time >> Execution Time | Plan cache miss / complex query | Prepared statement |
6. Statistics — Bạn bí mật của planner
6.1 pg_statistic
Planner ước cardinality dựa trên pg_statistic. Mỗi column lưu:
null_frac— % NULLavg_width— bytes per valuen_distinct— distinct values estimate (-1 = unique)most_common_vals(MCV) +most_common_freqs— top frequent valueshistogram_bounds— distribution bucketscorrelation— physical vs logical order
SELECT attname, null_frac, n_distinct, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';6.2 default_statistics_target
Số buckets trong histogram + MCV count. Default 100. Tăng cho column quan trọng có distribution phức tạp:
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;Trade-off:
- Cao hơn → planner accurate hơn → faster plans
- Cao hơn → ANALYZE chậm hơn, planning time tăng
Pattern 2024: 1000 cho column hot (filter, join, group), giữ default cho rest.
6.3 Correlated columns — extended statistics (PG10+)
Planner default assume column independent:
SELECT * FROM addresses WHERE city = 'San Francisco' AND state = 'CA';
-- Planner: P(city=SF) × P(state=CA) ≈ 0.001 × 0.05 = 0.00005
-- Reality: 100% SF in CA → P = 0.001 (not 0.00005)Extended statistics fix:
CREATE STATISTICS stats_addr (dependencies, ndistinct, mcv)
ON city, state FROM addresses;
ANALYZE addresses;Type:
ndistinct— distinct combosdependencies— functional dependency (state → country)mcv(PG12+) — multi-column MCV list
6.4 ANALYZE timing
Autovacuum cũng analyze:
autovacuum_analyze_scale_factor = 0.1(10% rows changed → analyze)autovacuum_analyze_threshold = 50
High-write table → tune sớm hơn:
ALTER TABLE events SET (autovacuum_analyze_scale_factor = 0.02);Manual ANALYZE sau bulk operation:
ANALYZE orders;
ANALYZE VERBOSE orders; -- in progress
ANALYZE orders (column1, column2); -- specific columns PG14+7. Join Algorithms Deep Dive
7.1 Nested Loop
For each row in outer:
For each row in inner matching join cond:
Output combination
Cost: outer_rows × inner_cost_per_lookup
Tốt khi: outer rows small (1-100) AND inner has index on join key.
Tệ khi: outer >10000 → quadratic. Đa số “query đang nhanh, đột nhiên chậm” do data growth làm nested loop unfit.
7.2 Hash Join
Build hash table from smaller side (in memory)
For each row in other side:
Lookup hash table
Output matches
Cost: O(N + M) linear.
Tốt khi: both sides large, không có sortable index.
Tệ khi: build side > work_mem → spill batches → slow.
work_mem = 64MB -- session override khi cần big hash7.3 Merge Join
Sort both sides by join key (or use sorted index)
Walk both in parallel
Cost: O(N log N + M log M) if need sort, O(N + M) if already sorted.
Tốt khi: cả 2 sides đã sorted (vd: index scan trên join key).
Tệ khi: cần sort cả 2 sides huge → expensive.
7.4 Khi nào planner pick sai
Trường hợp kinh điển:
SELECT * FROM orders o JOIN users u ON o.user_id = u.id WHERE u.country = 'VN';Stats nói “100 users trong VN” (correct). Planner pick Nested Loop:
Nested Loop
Seq Scan on users (filter country='VN') → 100 rows
Index Scan on orders by user_id (per outer row) → 100 lookups, mỗi ~100 rows = 10000 total
Sau 1 năm, “VN users” tăng lên 100K. Planner stats stale, vẫn ước 100. Vẫn pick Nested Loop:
Nested Loop
100K outer rows × 100 inner lookups each = 10M operations
← Slow!
Fix:
- ANALYZE → planner update → pick Hash Join
- Hoặc extended stats nếu correlation issue
7.5 Force / hint joins
Postgres không có hint syntax. Workaround for debugging:
SET enable_nestloop = off;
EXPLAIN ANALYZE <query>;
SET enable_nestloop = on;Production: fix root cause (stats), không tắt feature.
8. CTE Optimization — PG12 behavior change
8.1 Pre-PG12: CTE là optimization fence
-- Pre-PG12
WITH active_users AS (
SELECT * FROM users WHERE active = true -- always materialized
)
SELECT * FROM active_users WHERE id = 1;
-- Planner: scan all active users, filter id=1 (slow!)8.2 PG12+: CTE inline by default
-- PG12+
WITH active_users AS (
SELECT * FROM users WHERE active = true -- inlined!
)
SELECT * FROM active_users WHERE id = 1;
-- Planner: Index Scan on users WHERE active=true AND id=1 (fast)Inline khi:
- Non-recursive
- Referenced 1 lần
- Không có side effect
8.3 Khi cần force materialize
WITH expensive_calc AS MATERIALIZED (
SELECT ... FROM ... -- heavy aggregation
)
SELECT ... FROM expensive_calc WHERE ... -- nhiều reference
UNION
SELECT ... FROM expensive_calc WHERE ...;MATERIALIZED keyword: tính 1 lần, reuse.
8.4 Recursive CTE
WITH RECURSIVE category_tree AS (
SELECT id, name, parent_id, 0 AS depth FROM categories WHERE parent_id IS NULL
UNION ALL
SELECT c.id, c.name, c.parent_id, ct.depth + 1
FROM categories c JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT * FROM category_tree WHERE depth <= 5;Plan:
CTE category_tree
-> Recursive Union
-> Seq Scan on categories -- anchor
-> Hash Join -- iteration
-> WorkTable Scan on category_tree
-> Seq Scan on categories
Pitfall: vô tận khi data có cycle. Add CYCLE clause (PG14+) hoặc depth limit.
9. Window Functions — Phân tích trong SQL
9.1 Anatomy
function_name([args]) OVER (
[PARTITION BY expr_list]
[ORDER BY expr_list]
[frame_clause]
)3 parts:
- PARTITION BY: divide rows into groups (like GROUP BY but kept separate)
- ORDER BY: within partition, ordering matters for ranking + frames
- Frame: subset of partition relative to current row (sum-so-far, moving avg, etc.)
9.2 Function categories
Ranking
row_number() -- 1, 2, 3 sequential
rank() -- 1, 2, 2, 4 (gaps after tie)
dense_rank() -- 1, 2, 2, 3 (no gaps)
percent_rank() -- (rank - 1) / (count - 1)
cume_dist() -- cumulative distribution
ntile(N) -- divide into N bucketsOffset / position
lag(col, offset, default) -- previous row value
lead(col, offset, default) -- next row value
first_value(col) -- first in frame
last_value(col) -- last in frame
nth_value(col, n) -- nth in frameAggregates as window
sum(col), avg(col), min(col), max(col), count(*), etc.
-- Plus string_agg, array_agg, etc. with window9.3 Frames — the underused gem
Default frame when ORDER BY present: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (running total).
Three frame modes:
ROWS — physical rows
-- Moving average over last 7 rows
avg(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)RANGE — by value
-- Sum within $100 of current
sum(amount) OVER (ORDER BY amount RANGE BETWEEN 100 PRECEDING AND 100 FOLLOWING)
-- 7-day window by date
sum(value) OVER (ORDER BY date RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW)GROUPS (PG11+) — peer groups
-- 3 distinct date groups before current
avg(value) OVER (ORDER BY date GROUPS BETWEEN 3 PRECEDING AND CURRENT ROW)EXCLUDE (PG11+)
sum(value) OVER (
ORDER BY id
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
EXCLUDE CURRENT ROW -- or TIES, GROUP, NO OTHERS
)Frame bounds
UNBOUNDED PRECEDING / FOLLOWINGN PRECEDING / FOLLOWINGCURRENT ROW
9.4 Trong plan
WindowAgg (cost=...) (actual rows=1M loops=1)
-> Sort
Sort Key: user_id, placed_at
-> Seq Scan on orders
Window function cần sort theo PARTITION BY + ORDER BY. Index khớp có thể skip sort:
CREATE INDEX idx_orders_user_placed ON orders(user_id, placed_at);Multiple windows with same PARTITION + ORDER share Sort:
-- 1 Sort, multiple WindowAgg
SELECT
sum(total) OVER w,
avg(total) OVER w,
rank() OVER w
FROM orders
WINDOW w AS (PARTITION BY user_id ORDER BY placed_at);9.5 Patterns
Top N per group (3 ways)
-- A. Window function
SELECT * FROM (
SELECT *, row_number() OVER (PARTITION BY user_id ORDER BY placed_at DESC) AS rn
FROM orders
) t WHERE rn <= 3;
-- B. LATERAL join (often faster for small N)
SELECT u.id, o.* FROM users u, LATERAL (
SELECT * FROM orders WHERE user_id = u.id
ORDER BY placed_at DESC LIMIT 3
) o;
-- C. DISTINCT ON (Postgres-specific)
SELECT DISTINCT ON (user_id) * FROM orders ORDER BY user_id, placed_at DESC;
-- Only 1 per group; for N > 1 use A or BBenchmark which is fastest for your data.
Lag / Lead (compare with previous)
-- Time between orders
SELECT
id, placed_at,
placed_at - lag(placed_at) OVER (PARTITION BY user_id ORDER BY placed_at) AS gap_since_last
FROM orders;
-- Did value change?
SELECT id, value,
CASE WHEN value <> lag(value) OVER (ORDER BY ts) THEN 'changed' ELSE 'same' END AS status
FROM measurements;Running total / cumulative
SELECT id, sum(total_cents) OVER (PARTITION BY user_id ORDER BY placed_at) AS lifetime_so_far
FROM orders;Moving average
SELECT date, value,
avg(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma_7day
FROM metrics;Percentile / quartile
SELECT name, score,
ntile(4) OVER (ORDER BY score) AS quartile,
percent_rank() OVER (ORDER BY score) AS pct_rank
FROM students;Gaps & islands (consecutive runs)
-- Find runs of consecutive days
WITH numbered AS (
SELECT date,
date - (row_number() OVER (ORDER BY date))::int * interval '1 day' AS group_key
FROM events
)
SELECT min(date) AS run_start, max(date) AS run_end, count(*) AS days
FROM numbered GROUP BY group_key;Sessionization
-- Group events into sessions (>30min gap = new session)
WITH gaps AS (
SELECT *,
CASE WHEN ts - lag(ts) OVER (PARTITION BY user_id ORDER BY ts) > interval '30 min'
OR lag(ts) OVER (PARTITION BY user_id ORDER BY ts) IS NULL
THEN 1 ELSE 0 END AS is_new_session
FROM events
),
sessions AS (
SELECT *, sum(is_new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_id
FROM gaps
)
SELECT user_id, session_id, count(*) FROM sessions GROUP BY user_id, session_id;Last value with NULL handling
-- last non-null status
SELECT id, ts, status,
last_value(status) OVER (
PARTITION BY user_id ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS effective_status
FROM history WHERE status IS NOT NULL;9.6 Window vs Group By
| Group By | Window |
|---|---|
| Reduce rows | Keep all rows |
| Aggregate per group | Per-row computation with context |
SELECT count(*) GROUP BY user_id | SELECT *, count(*) OVER (PARTITION BY user_id) |
| Use for summary | Use for ranking, comparing, running totals |
9.7 Performance tips
- Match index to
PARTITION BY+ORDER BY→ avoid Sort - Reuse named windows (
WINDOW w AS ...) - Filter before window when possible (use CTE/subquery)
ROWSframe faster thanRANGEin some casesLATERALoften beats window for top-N per group with small N
10. LATERAL Join — superpower
SELECT u.id, latest.placed_at, latest.total_cents
FROM users u
LATERAL (
SELECT placed_at, total_cents FROM orders WHERE user_id = u.id
ORDER BY placed_at DESC LIMIT 1
) latest;Cho mỗi user, top-1 latest order. Cú pháp này LATERAL cho phép subquery reference outer columns.
Useful cho:
- Top-N per group (alternative to window function)
- Correlated subquery rewrite
- JSON expand per row
11. Anti-patterns
11.1 SELECT * thay vì column cụ thể
→ Block index-only scan (cần index INCLUDE mọi column) → Network bandwidth waste → Refactoring nightmare (drop column nhưng app vẫn select *)
11.2 OFFSET pagination
-- Page 100, 20 per page
SELECT * FROM orders ORDER BY placed_at DESC OFFSET 2000 LIMIT 20;
-- Plan: scan 2020 rows, throw 2000, return 20→ Use keyset pagination (Tuần 03 section 10.9).
11.3 function(column) = value
-- BAD
SELECT * FROM users WHERE lower(email) = 'a@x.com';
-- Index trên (email) không dùng được
-- FIX 1: expression index
CREATE INDEX idx_users_email_lower ON users(lower(email));
-- FIX 2: store lowercase, hoặc citext type11.4 Implicit type cast
-- column id is bigint
SELECT * FROM users WHERE id = '42'; -- text cast to bigint
SELECT * FROM users WHERE id = 42.0; -- numeric castCó thể prevent index use trong edge case. Always match types.
11.5 NOT IN với nullable column
-- BAD - returns no rows if subquery has NULL
SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM blocked);
-- GOOD
SELECT * FROM users WHERE NOT EXISTS (SELECT 1 FROM blocked WHERE blocked.user_id = users.id);NULL semantics trong NOT IN là gốc của bug kinh điển.
11.6 ORM N+1
Mỗi parent row → 1 query con. 100 users → 101 queries.
→ Use JOIN, batch loading, DataLoader pattern. Sẽ học Tuan-16-ORM-CQRS-Multi-Tenancy.
11.7 count(*) trên huge table
SELECT count(*) FROM orders; -- scan 100M rows→ Approximate từ pg_class.reltuples:
SELECT reltuples::bigint FROM pg_class WHERE relname = 'orders';→ Denormalized counter (Tuần 02).
11.8 LIKE '%abc%' mà không có trigram index
→ Seq scan. Add pg_trgm GIN index.
11.9 Subquery uncorrelated trong WHERE
-- BAD: subquery runs once for every parent row potentially
SELECT * FROM orders WHERE total > (SELECT avg(total) FROM orders);
-- BETTER: precompute
WITH stats AS MATERIALIZED (SELECT avg(total) AS avg_t FROM orders)
SELECT o.* FROM orders o, stats WHERE o.total > stats.avg_t;11.10 OR conditions across columns
-- Often defeats index
SELECT * FROM orders WHERE user_id = 1 OR seller_id = 1;
-- BETTER: UNION
SELECT * FROM orders WHERE user_id = 1
UNION
SELECT * FROM orders WHERE seller_id = 1;Mỗi side dùng index riêng, planner combine bằng Bitmap.
12. Tools
12.1 explain.depesz.com
Paste EXPLAIN ANALYZE output → colored highlight slow nodes.
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) <query>;
-- Copy output → paste vào website12.2 pev2 (Postgres EXPLAIN Visualizer 2)
https://github.com/dalibo/pev2
Visual tree, percentage time per node, statistics inline. Self-host hoặc dùng demo.
Paste JSON format:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT JSON) <query>;12.3 pgMustard
Commercial. AI-assisted analysis với recommendation cụ thể.
12.4 auto_explain
Extension auto log slow queries’ plan.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 1000; -- ms
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;
SET auto_explain.log_format = 'json';Trong postgresql.conf:
shared_preload_libraries = 'auto_explain, pg_stat_statements'
auto_explain.log_min_duration = 1000
auto_explain.log_analyze = onMọi query >1s sẽ log plan → tìm slow query mà không chạy lại.
12.5 pg_stat_statements
Aggregate query stats. Sẽ đào sâu Tuần 09. Preview:
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;13. Lab — 7-day deep dive
13.1 Setup data
Reuse e-commerce schema từ Tuần 02. Bulk insert larger:
-- 10M orders, 500K users, 200K products
-- See Tuần 03 lab section13.2 Day 1-2: Plan reading basics
Run 20 different queries, EXPLAIN ANALYZE each, identify:
- Total time
- Bottleneck node
- Estimate vs actual mismatch
13.3 Day 3: Statistics deep dive
-- Check current stats
SELECT * FROM pg_stats WHERE tablename = 'orders';
-- Tune
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;
-- Verify change in planReproduce planner getting it wrong:
INSERT INTO orders ... 100K rows ...
-- Don't ANALYZE
EXPLAIN <query>;
-- Bad estimate
ANALYZE orders;
EXPLAIN <query>;
-- Better13.4 Day 4: Joins
Compare 3 join algorithms forced:
SET enable_hashjoin = off;
SET enable_mergejoin = off;
EXPLAIN ANALYZE <query>;
-- Nested loop only
-- Repeat for hash, merge
RESET enable_hashjoin;
RESET enable_mergejoin;13.5 Day 5: Window functions
Build dashboard query: top spender per country, running total per user, rank changes month over month.
13.6 Day 6: Plan visualizers
Paste plans into depesz, pev2. Practice reading visual representation.
13.7 Day 7: Capstone
Take a real query slow in production (or create one):
- Get baseline EXPLAIN ANALYZE BUFFERS
- Identify bottleneck
- Hypothesis fix
- Implement (index, rewrite, stats tune)
- Compare plans
- Document decision
14. Self-check
- EXPLAIN vs EXPLAIN ANALYZE — khác nhau gì? Khi nào dùng cái nào?
- Cost trong EXPLAIN — đơn vị gì? Default
random_page_cost = 4.0sai khi nào? - 3 loại scan: Seq, Index, Index Only. Khi nào planner pick mỗi loại?
- Heap Fetches > 0 trong Index Only Scan — nguyên nhân? Fix?
- 3 join algorithm — Nested Loop, Hash, Merge. Cho mỗi cái, khi nào tốt nhất?
Sort Method: external sort— vấn đề gì? Fix?- Estimate rows = 100, Actual rows = 100K. Lý do và cách fix?
- CTE pre-PG12 vs PG12+ — khác biệt? Khi nào dùng
MATERIALIZED? WHERE lower(email) = 'a@x.com'không dùng index trênemail. Tại sao? Fix?- Khi nào pick window function vs GROUP BY?
15. Tiếp theo
Bài tiếp: Tuan-05-Transactions-Locking — sau khi master read path, master write path + concurrency.
Đọc:
- Use The Index, Luke! full
- Postgres docs Ch.14 Performance Tips
- Database Internals (Petrov) Ch. Query Execution
Tuần 04 hoàn thành. Read the plan, not the query. Cập nhật: 2026-05-16