Case Study: Design Web Crawler — Full System Design Interview Walkthrough

“Web crawler giống như con nhện giăng tơ trên Internet — nó bò từ trang này sang trang khác, dệt nên một tấm bản đồ khổng lồ của toàn bộ World Wide Web. Google bot crawl hàng tỷ trang mỗi tháng để xây dựng search index mà chúng ta dùng hàng ngày.”

Tags: system-design case-study web-crawler alex-xu interview chapter-9 Student: Hieu Prerequisite: Tuan-01-Scale-From-Zero-To-Millions · Tuan-02-Back-of-the-envelope Liên quan: Tuan-08-Message-Queue · Tuan-09-Rate-Limiter · Tuan-10-Consistent-Hashing · Tuan-03-Networking-DNS-CDN · Tuan-06-Cache-Strategy · Tuan-07-Database-Sharding-Replication

Context & Why — Tại sao Web Crawler quan trọng?

Con nhện giăng tơ trên Internet

Hieu, hãy tưởng tượng Internet là một khu rừng khổng lồ với hàng tỷ trang web. Mỗi trang web là một chiếc lá, và các hyperlink là những sợi tơ nối chúng lại với nhau. Web crawler (hay còn gọi là spider, bot) chính là con nhện — nó bắt đầu từ một điểm, bò theo sợi tơ (link) từ lá này sang lá khác, và ghi nhớ mọi thứ nó đã thấy.

Google bot — con nhện nổi tiếng nhất — crawl khoảng 130 nghìn tỷ trang (tính đến 2024) để xây dựng Google Search Index. Mỗi khi em Google một thứ gì đó, em đang tìm kiếm trong cái “bản đồ” mà con nhện này đã vẽ ra.

Ai dùng Web Crawler?

Use Case	Ví dụ thực tế	Mô tả
Search Engine Indexing	Googlebot, Bingbot	Thu thập nội dung web để xây search index
Web Archiving	Wayback Machine (Internet Archive)	Lưu trữ lịch sử Internet
Web Mining	Fintech, Market Research	Thu thập dữ liệu giá cả, tin tức, reviews
Web Monitoring	Brand monitoring, SEO tools	Theo dõi thay đổi trên các website
Copyright Detection	YouTube Content ID (web phiên bản)	Phát hiện nội dung vi phạm bản quyền

Tại sao câu hỏi này khó?

Web crawler nghe đơn giản — “cứ download HTML rồi follow link” — nhưng thực tế đụng vào gần như mọi vấn đề kinh điển của distributed systems:

Thách thức	Liên quan đến
Phải crawl hàng tỷ trang	Scalability, distributed computing
Không được DDoS website	Rate limiting, politeness
Nội dung trùng lặp	Deduplication, hashing
URL vô tận (spider trap)	Graph traversal, trap detection
DNS lookup chậm	Caching, DNS resolver optimization
Lưu trữ petabytes dữ liệu	Distributed storage, compression
Fault tolerance	Checkpointing, graceful recovery

Aha Moment: Web crawler là bài toán “đơn giản mà không đơn giản”. Nó test khả năng em nhìn thấy edge cases và trade-offs trong một hệ thống tưởng chừng straightforward.

Overview — Bốn bước của Alex Xu

Alex Xu chia mọi System Design Interview thành 4 bước:

Bước	Tên gọi	Thời gian (45 phút)
1	Understand the Problem & Establish Design Scope	~5 phút
2	Propose High-Level Design	~15 phút
3	Design Deep Dive	~20 phút
4	Wrap Up	~5 phút

Aha Moment: Đừng nhảy thẳng vào vẽ diagram. Interviewer muốn thấy em biết hỏi đúng câu hỏi trước khi thiết kế. Với web crawler, scope rất rộng — phải narrow down ngay từ đầu.

Step 1 — Understand the Problem & Establish Design Scope

1.1 Functional Requirements (Yêu cầu chức năng)

Hieu, câu đầu tiên em hỏi interviewer phải là: “Mục đích chính của crawler này là gì?” Bởi vì web crawler cho search engine khác hoàn toàn web crawler cho price monitoring.

Câu hỏi clarify với interviewer:

Câu hỏi	Trả lời / Giả định
Mục đích chính của crawler?	Search engine indexing — thu thập HTML content
Cần crawl bao nhiêu trang/tháng?	1 Billion (1B) pages/month
Chỉ crawl HTML hay cả media (images, videos, PDF)?	Chỉ HTML
Có cần xử lý nội dung JavaScript-rendered không?	Không, chỉ static HTML (simplify scope)
Nội dung mới crawl hay cần re-crawl?	Cả hai — crawl trang mới + re-crawl trang đã thay đổi
Lưu trữ bao lâu?	5 năm
Xử lý trang trùng nội dung?	Có, phải dedup
Crawl chỉ một domain hay toàn bộ Internet?	Toàn bộ Internet (open web)

Tóm tắt Functional Requirements:

Crawl — Bắt đầu từ seed URLs, follow links, download HTML content
Dedup — Không lưu trữ nội dung trùng lặp (cả URL và content)
Store — Lưu HTML content vào distributed storage
Re-crawl — Định kỳ quay lại crawl trang đã thay đổi
Politeness — Tuân thủ robots.txt, không gửi quá nhiều request đến cùng một host

1.2 Non-Functional Requirements (Yêu cầu phi chức năng)

Requirement	Target	Giải thích
Scalability (Khả năng mở rộng)	1B pages/month, scale horizontally	Phải handle toàn bộ open web
Robustness (Tính bền vững)	Handle failures gracefully	Server crash, network timeout, malformed HTML, spider traps
Politeness (Lịch sự)	Respect robots.txt, rate limit per host	Không được DDoS bất kỳ website nào
Extensibility (Khả năng mở rộng chức năng)	Dễ thêm module mới	Thêm image crawler, PDF crawler sau này
Freshness (Tính cập nhật)	Ưu tiên re-crawl trang hay thay đổi	News sites crawl thường xuyên hơn static pages
Efficiency (Hiệu quả)	Tối ưu bandwidth, storage, compute	Mỗi byte tải về đều tốn tiền

1.3 Capacity Estimation (Ước lượng năng lực)

Tham chiếu chi tiết: Tuan-02-Back-of-the-envelope

Assumptions (Giả thiết)

Thông số	Giá trị	Giải thích
Pages to crawl/month	1 Billion (1B)	Yêu cầu từ interviewer
Average HTML page size	500 KB	Bao gồm HTML + metadata
Average URL length	100 bytes	Trung bình
Metadata per page	500 bytes	Timestamp, headers, status
Storage retention	5 năm	Lưu trữ lâu dài
Total unique URLs to track	10 Billion (10B)	Nhiều URL hơn pages vì có URL chưa crawl

QPS — Pages per Second

P a g es / seco n d = \frac{1 , 000 , 000 , 000}{30 \times 24 \times 3 , 600} = \frac{1 B}{2 , 592 , 000} \approx 400 p a g es / s

P e ak p a g es / seco n d = 400 \times 3 \approx 1, 200 p a g es / s

Nhận xét: 400 pages/second trung bình — nghe ít nhưng mỗi page cần DNS lookup + TCP connection + download + parse + extract links. Pipeline phải rất hiệu quả.

Bandwidth

B an d w i d t h_{a vg} = 400 p a g es / s \times 500 K B = 200 MB / s = 1.6 G b p s

B an d w i d t h_{p e ak} = 1, 200 p a g es / s \times 500 K B = 600 MB / s = 4.8 G b p s

Nhận xét: 1.6 Gbps trung bình — cần nhiều máy với đường truyền tốt. Một máy đơn lẻ không đủ bandwidth.

Storage

St or a g e_{m o n t h l y} = 1 B \times 500 K B = 500 TB / m o n t h

St or a g e_{5 ye a rs} = 500 TB \times 12 \times 5 = 30 PB

Nhận xét: 30 PB trong 5 năm — đây là lý do phải dùng distributed file system (HDFS, S3) và nén dữ liệu. Với compression ratio ~5x, thực tế cần khoảng 6 PB raw storage.

St or a g e_{co m p resse d} = \frac{30 PB}{5} = 6 PB

URL Frontier Size

Giả sử tại bất kỳ thời điểm nào, URL Frontier chứa khoảng 100M URLs đang chờ crawl:

U R L F ro n t i er m e m ory = 100, 000, 000 \times 100 b y t es = 10 GB

Nhận xét: 10 GB — vừa đủ để giữ trong memory, nhưng thực tế URL Frontier sẽ dùng kết hợp memory + disk.

Bloom Filter Size cho 10B URLs

Bloom filter dùng để kiểm tra “URL này đã thấy chưa?” với false positive rate $p$ :

m = - \frac{n \times ln ( p )}{( ln 2 ) ^{2}}

Với $n = 10 B$ URLs và $p = 1%$ (false positive rate):

m = - \frac{10 , 000 , 000 , 000 \times ln ( 0.01 )}{( ln 2 ) ^{2}} = - \frac{10 B \times ( - 4.605 )}{0.4805}

m \approx 95.8 \times 1 0^{9} bi t s \approx 12 GB

Số hash functions tối ưu:

k = \frac{m}{n} \times ln 2 = \frac{95.8 B}{10 B} \times 0.693 \approx 7 ha s h f u n c t i o n s

Nhận xét: 12 GB cho Bloom filter — hoàn toàn fit trong RAM. Đây là lý do Bloom filter được ưa chuộng cho URL dedup: kiểm tra O(1) với memory rất nhỏ so với lưu toàn bộ 10B URLs.

Tổng kết Estimation

Metric	Giá trị
Avg pages/second	~400
Peak pages/second	~1,200
Avg bandwidth	1.6 Gbps
Peak bandwidth	4.8 Gbps
Monthly storage (raw)	500 TB
5-year storage (compressed)	~6 PB
URL Frontier memory	~10 GB
Bloom filter for 10B URLs	~12 GB

Step 2 — Propose High-Level Design

2.1 Luồng chính của Web Crawler

Hieu, web crawler hoạt động theo một vòng lặp (loop) liên tục. Alex Xu mô tả flow chính như sau:

Seed URLs → URL Frontier → HTML Downloader → Content Parser
→ Content Seen? → Link Extractor → URL Filter → URL Seen?
→ URL Frontier (quay lại vòng lặp)

Mỗi component có một vai trò rõ ràng:

Component	Vai trò	Input	Output
Seed URLs	Điểm bắt đầu crawl	Danh sách URLs ban đầu	URLs đưa vào Frontier
URL Frontier	Hàng đợi URLs cần crawl	URLs mới + URLs re-crawl	URL tiếp theo để download
HTML Downloader	Tải HTML từ Internet	URL	Raw HTML content
Content Parser	Validate + parse HTML	Raw HTML	Parsed HTML hoặc discard
Content Seen?	Kiểm tra nội dung trùng	Parsed HTML	Pass (mới) hoặc skip (trùng)
Content Storage	Lưu trữ HTML	Parsed HTML đã qua dedup	Persisted content
Link Extractor	Trích xuất URLs từ HTML	Parsed HTML	Danh sách URLs mới
URL Filter	Lọc URLs không hợp lệ	Raw URLs	Filtered URLs
URL Seen?	Kiểm tra URL đã crawl chưa	Filtered URL	Pass (chưa thấy) hoặc skip

2.2 Web Crawler Architecture — Mermaid Diagram

flowchart TD
    SEED["Seed URLs<br/>(Starting points)"]
    FRONTIER["URL Frontier<br/>(Priority + Politeness Queues)"]
    DNS["DNS Resolver<br/>(with Cache)"]
    DOWNLOADER["HTML Downloader<br/>(Distributed Workers)"]
    ROBOTS["robots.txt<br/>Cache"]
    PARSER["Content Parser<br/>(Validate HTML)"]
    DEDUP_CONTENT["Content Seen?<br/>(SimHash / MinHash)"]
    STORAGE["Content Storage<br/>(Distributed FS)"]
    EXTRACTOR["Link Extractor<br/>(Extract URLs from HTML)"]
    FILTER["URL Filter<br/>(Blacklist, Extension filter)"]
    DEDUP_URL["URL Seen?<br/>(Bloom Filter)"]

    SEED --> FRONTIER
    FRONTIER --> DNS
    DNS --> DOWNLOADER
    ROBOTS -.->|"Check rules"| DOWNLOADER
    DOWNLOADER --> PARSER
    PARSER -->|"Invalid HTML"| DISCARD1["Discard"]
    PARSER -->|"Valid HTML"| DEDUP_CONTENT
    DEDUP_CONTENT -->|"Already seen<br/>(duplicate)"| DISCARD2["Discard"]
    DEDUP_CONTENT -->|"New content"| STORAGE
    DEDUP_CONTENT -->|"New content"| EXTRACTOR
    EXTRACTOR --> FILTER
    FILTER --> DEDUP_URL
    DEDUP_URL -->|"New URL"| FRONTIER
    DEDUP_URL -->|"Already seen"| DISCARD3["Discard"]

    style FRONTIER fill:#0d6efd,stroke:#333,color:#fff
    style DOWNLOADER fill:#198754,stroke:#333,color:#fff
    style DEDUP_CONTENT fill:#dc3545,stroke:#333,color:#fff
    style DEDUP_URL fill:#dc3545,stroke:#333,color:#fff
    style STORAGE fill:#6f42c1,stroke:#333,color:#fff
    style DNS fill:#fd7e14,stroke:#333,color:#fff

2.3 Seed URLs — Điểm bắt đầu

Seed URLs là tập URLs ban đầu mà crawler bắt đầu bò. Chiến lược chọn seed URLs ảnh hưởng lớn đến hiệu quả:

Chiến lược	Mô tả	Khi nào dùng
By popularity	Top sites theo Alexa/Similarweb ranking	General-purpose crawler
By topic/category	URLs theo danh mục (news, shopping, education)	Topic-focused crawler
By geography	URLs theo quốc gia (VN: vnexpress.net, dantri.com.vn)	Regional crawler
By domain diversity	Phân bố đều across domains	Đảm bảo coverage rộng

Tip cho interview: Em có thể nói “For a general-purpose crawler, I’d start with top 10,000 most popular domains from Alexa ranking, distributed across different categories and geographies.”

2.4 BFS vs DFS Traversal

Web crawler về bản chất là bài toán graph traversal — Internet là một đồ thị khổng lồ, mỗi trang là một node, mỗi link là một edge.

Tiêu chí	BFS (Breadth-First Search)	DFS (Depth-First Search)
Cách hoạt động	Crawl tất cả links ở level hiện tại trước khi đi sâu	Đi sâu theo một path cho đến khi không còn link
Queue type	FIFO Queue	Stack (LIFO)
Ưu điểm	Khám phá rộng, tìm nhanh trang quan trọng	Đơn giản implement
Nhược điểm	Queue lớn, có thể chậm tìm trang sâu	Dễ rơi vào spider trap, không đảm bảo politeness
Dùng cho web crawler?	Thường dùng BFS (có biến thể)	Hầu như không dùng DFS thuần

Vấn đề của BFS thuần:

Không có priority — Trang quan trọng (CNN.com) được đối xử ngang trang vô danh
Impolite — BFS có thể gửi nhiều request đồng thời đến cùng một host
Không tính recrawl — BFS chỉ crawl một lần, không xử lý trang thay đổi

Aha Moment: Vì vậy crawler thực tế dùng modified BFS — BFS kết hợp priority queue và politeness constraints. Đây chính là URL Frontier mà ta sẽ deep dive ở Step 3.

Step 3 — Design Deep Dive

3.1 URL Frontier — Bộ não của Crawler

URL Frontier không phải chỉ là một queue đơn giản. Nó là component phức tạp nhất trong crawler, kết hợp two concerns:

Priority — URL nào nên crawl trước?
Politeness — Không được gửi quá nhiều request đến cùng host

3.1.1 Priority Queue — Crawl cái gì trước?

Không phải tất cả URLs đều quan trọng như nhau. CNN.com homepage quan trọng hơn một blog post random. Cách tính priority:

Factor	Weight	Giải thích
PageRank	Cao	Trang được nhiều trang khác link đến → quan trọng
Update frequency	Cao	News sites thay đổi liên tục → cần crawl thường xuyên
Domain authority	Trung bình	.edu, .gov thường có authority cao hơn
Freshness	Trung bình	Trang lâu chưa crawl lại → ưu tiên re-crawl
Depth from seed	Thấp	Trang gần seed URL thường quan trọng hơn trang sâu

Kiến trúc Priority Queue:

Prioritizer nhận URL → tính priority score → đưa vào queue tương ứng
Có nhiều queues (Q1, Q2, …, Qn) với priority khác nhau
Queue Selector chọn URL từ queue nào tiếp theo (queues priority cao được chọn thường xuyên hơn)

3.1.2 Politeness Queue — Đừng DDoS ai cả

Đây là constraint quan trọng nhất mà nhiều ứng viên quên. Nếu crawler gửi 100 requests/second đến cùng một host → đó là DDoS attack, không phải crawling.

Nguyên tắc: Mỗi host chỉ được crawl bởi một worker tại một thời điểm, với delay giữa các request (thường 1-10 giây).

Kiến trúc Politeness Queue:

Mỗi host có riêng một queue (per-host queue)
Queue Router nhận URL → hash hostname → đưa vào đúng queue
Mỗi queue gắn với một worker thread duy nhất
Worker tuân thủ crawl delay (thường lấy từ robots.txt hoặc default 1s)

3.1.3 URL Frontier Detail — Mermaid Diagram

flowchart TD
    subgraph "Prioritization Layer"
        IN["Incoming URLs"]
        PRIOR["Prioritizer<br/>(PageRank, Freshness,<br/>Domain Authority)"]
        PQ1["Priority Q1<br/>(Highest)"]
        PQ2["Priority Q2<br/>(High)"]
        PQ3["Priority Q3<br/>(Medium)"]
        PQN["Priority QN<br/>(Low)"]
        SEL["Queue Selector<br/>(Weighted Random)"]

        IN --> PRIOR
        PRIOR --> PQ1
        PRIOR --> PQ2
        PRIOR --> PQ3
        PRIOR --> PQN
        PQ1 --> SEL
        PQ2 --> SEL
        PQ3 --> SEL
        PQN --> SEL
    end

    subgraph "Politeness Layer"
        ROUTER["Queue Router<br/>(hash hostname → queue)"]
        HQ1["Host Queue 1<br/>(cnn.com)"]
        HQ2["Host Queue 2<br/>(bbc.co.uk)"]
        HQ3["Host Queue 3<br/>(vnexpress.net)"]
        HQN["Host Queue N<br/>(...)"]
        W1["Worker 1<br/>(delay: 1s)"]
        W2["Worker 2<br/>(delay: 2s)"]
        W3["Worker 3<br/>(delay: 1s)"]
        WN["Worker N<br/>(delay: 3s)"]

        SEL --> ROUTER
        ROUTER --> HQ1
        ROUTER --> HQ2
        ROUTER --> HQ3
        ROUTER --> HQN
        HQ1 --> W1
        HQ2 --> W2
        HQ3 --> W3
        HQN --> WN
    end

    W1 --> OUT["To HTML Downloader"]
    W2 --> OUT
    W3 --> OUT
    WN --> OUT

    style PRIOR fill:#0d6efd,stroke:#333,color:#fff
    style SEL fill:#0d6efd,stroke:#333,color:#fff
    style ROUTER fill:#198754,stroke:#333,color:#fff
    style OUT fill:#fd7e14,stroke:#333,color:#fff

3.1.4 Freshness — Re-crawl Strategy

Web thay đổi liên tục. Một crawler tốt phải biết khi nào cần re-crawl một trang:

Chiến lược	Cách hoạt động	Ưu/Nhược
Recrawl all periodically	Crawl lại mọi trang sau X ngày	Đơn giản nhưng lãng phí bandwidth
Prioritize by change frequency	Trang thay đổi thường → crawl thường	Hiệu quả nhưng cần tracking history
Use HTTP headers	Dùng `Last-Modified`, `ETag`, `If-Modified-Since`	Tiết kiệm bandwidth, server phải support
Adaptive recrawl	Machine learning predict change frequency	Phức tạp nhưng tối ưu nhất

Adaptive Recrawl trong thực tế:

Lần crawl trước: content hash = abc123
Lần crawl này: content hash = abc123 (không đổi) → tăng recrawl interval (ví dụ: 7 ngày → 14 ngày)
Lần crawl tiếp: content hash = xyz789 (đã đổi) → giảm recrawl interval (ví dụ: 14 ngày → 3 ngày)

3.2 HTML Downloader — Tải nội dung

HTML Downloader là component chịu trách nhiệm tải HTML từ Internet. Nó đụng vào nhiều vấn đề thực tế.

3.2.1 robots.txt Compliance

robots.txt là file đặt ở root domain (ví dụ: https://example.com/robots.txt) mà website owner dùng để “nói” với crawler: “Đừng crawl những URL này.”

Ví dụ robots.txt:

User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 10

User-agent: Googlebot
Allow: /
Crawl-delay: 1

Directive	Ý nghĩa
`User-agent: *`	Áp dụng cho tất cả crawlers
`Disallow: /admin/`	Không được crawl /admin/
`Crawl-delay: 10`	Đợi 10 giây giữa mỗi request
`Allow: /`	Googlebot được crawl toàn bộ

Cách xử lý trong crawler:

Trước khi crawl bất kỳ URL nào → kiểm tra robots.txt của domain đó
Cache robots.txt (TTL ~24h) để không phải tải lại liên tục
Tuân thủ Crawl-delay nếu có
Nếu robots.txt trả 5xx → coi như disallow all (conservative approach)

3.2.2 Distributed Downloading

Một máy đơn lẻ không thể tải 400 pages/second liên tục. Crawler cần phân tán across multiple machines và regions:

Chiến lược	Giải thích
Geographic distribution	Đặt crawler workers ở nhiều regions (US, EU, Asia) → giảm latency khi crawl websites ở cùng region
Worker pool	Mỗi region có pool of workers, mỗi worker chạy trên separate machine
URL assignment	Dùng consistent hashing để assign URLs cho workers → đảm bảo cùng host luôn được crawl bởi cùng worker (politeness)
Connection pooling	Reuse TCP connections đến cùng host → giảm overhead
Async I/O	Non-blocking I/O để một worker có thể handle nhiều downloads đồng thời

3.2.3 DNS Resolver — Bottleneck ẩn

DNS resolution là bottleneck lớn nhất mà ít người nhận ra. Mỗi URL cần DNS lookup để chuyển hostname thành IP address, và DNS lookup mất 10-200ms.

Với 400 pages/second, nếu mỗi DNS lookup mất 100ms:

D NS t h ro ug h p u t = \frac{1 , 000 m s}{100 m s} = 10 l oo k u p s / s (p er t h re a d)

→ Cần 40 threads chỉ riêng cho DNS! Và đây là trường hợp trung bình.

Giải pháp DNS Bottleneck:

Giải pháp	Mô tả	Hiệu quả
Local DNS cache	Cache DNS results trong crawler (TTL-aware)	Giảm 80-90% DNS lookups
Pre-fetching	DNS lookup trước khi thực sự cần download	Giấu latency
Custom DNS resolver	Chạy local DNS resolver (unbound, CoreDNS) thay vì dùng ISP DNS	Kiểm soát cache, higher throughput
Batch DNS	Gom nhiều DNS queries và gửi cùng lúc	Tăng throughput

Aha Moment: DNS là bottleneck mà nhiều ứng viên quên hoàn toàn. Nếu em mention DNS caching trong interview, interviewer sẽ ấn tượng vì nó cho thấy em hiểu production challenges.

3.2.4 Handling Timeouts & Retries

Không phải mọi download đều thành công. Crawler phải handle gracefully:

Scenario	Response	Action
Connection timeout (10s)	Không kết nối được	Retry với exponential backoff (1s → 2s → 4s), max 3 retries
Read timeout (30s)	Kết nối thành công nhưng tải quá lâu	Cancel, đưa URL lại queue với priority thấp
HTTP 4xx	Client error (404, 403)	Không retry — URL invalid hoặc bị chặn
HTTP 5xx	Server error	Retry sau delay, mark host as potentially down
HTTP 301/302	Redirect	Follow redirect (max 5 hops), lưu final URL
HTTP 200	Success	Tiếp tục pipeline
Quá lớn (>10MB)	HTML page bất thường	Truncate hoặc skip

3.3 Content Parser & Validation

Sau khi download HTML, cần validate và parse:

Check	Mục đích
Malformed HTML	Loại bỏ HTML không hợp lệ hoặc corrupt
Encoding detection	Xác định charset (UTF-8, ISO-8859-1, etc.)
Language detection	Nếu chỉ muốn crawl tiếng Việt/Anh
Content extraction	Tách nội dung chính khỏi boilerplate (nav, footer, ads)
Size check	Skip pages quá nhỏ (<1KB, có thể empty) hoặc quá lớn

3.4 Content Deduplication — Content Seen?

Trên Internet, ~30% trang web có nội dung trùng lặp hoặc gần trùng (mirror sites, syndicated content, copy-paste). Lưu trữ duplicate content lãng phí storage và bandwidth.

3.4.1 Tại sao không dùng hash thông thường?

MD5/SHA-256 hash: Chỉ phát hiện exact duplicate. Nếu hai trang khác nhau 1 ký tự → hash hoàn toàn khác.
Nhưng web content thường near-duplicate — cùng bài viết nhưng khác ads, timestamps, sidebar.

3.4.2 SimHash — Fingerprinting cho near-duplicate detection

SimHash tạo ra một “fingerprint” sao cho hai documents tương tự nhau sẽ có fingerprint tương tự.

Cách SimHash hoạt động (high-level):

Tokenize document thành features (words, n-grams)
Hash mỗi feature thành bit vector
Weighted sum tất cả bit vectors
Binarize: positive → 1, negative → 0 → SimHash fingerprint

So sánh hai documents:

Tính Hamming distance (số bit khác nhau) giữa hai SimHash fingerprints
Nếu Hamming distance $\leq k$ (thường $k = 3$ cho 64-bit SimHash) → near-duplicate

H ammin g d i s t an ce (A, B) = p o p co u n t (A \oplus B)

Phương pháp	Phát hiện exact dup	Phát hiện near-dup	Memory per doc	So sánh
MD5	Yes	No	16 bytes	O(1)
SHA-256	Yes	No	32 bytes	O(1)
SimHash	Yes	Yes	8 bytes (64-bit)	O(1)
MinHash (Jaccard)	Yes	Yes	~100 bytes (100 hashes)	O(k)

3.4.3 MinHash — Jaccard Similarity Estimation

MinHash là phương pháp khác để ước lượng Jaccard similarity giữa hai documents:

J (A, B) = \frac{∣ A \cap B ∣}{∣ A \cup B ∣}

Với $A$ và $B$ là tập hợp shingles (n-grams) của hai documents.

Cách MinHash hoạt động:

Tạo $k$ hash functions ngẫu nhiên
Với mỗi hash function, tính min hash value across all shingles → MinHash signature
So sánh signatures: tỉ lệ positions giống nhau $\approx$ Jaccard similarity

Threshold	Ý nghĩa	Action
$J > 0.9$	Near-duplicate (rất giống)	Skip, không lưu
$0.5 < J \leq 0.9$	Partially similar	Có thể lưu nhưng đánh dấu
$J \leq 0.5$	Khác nhau đủ	Lưu cả hai

3.4.4 Content Dedup Flow — Mermaid Diagram

flowchart TD
    HTML["Parsed HTML Content"]
    FP["Compute Fingerprint<br/>(SimHash 64-bit)"]
    LOOKUP["Lookup Fingerprint<br/>in Fingerprint Store"]
    CHECK{"Hamming Distance<br/> ≤ 3?"}
    NEW["New Content<br/>→ Store fingerprint<br/>→ Save to Content Storage<br/>→ Pass to Link Extractor"]
    DUP["Near-Duplicate<br/>→ Discard content<br/>→ Still extract links<br/>(optional)"]

    HTML --> FP
    FP --> LOOKUP
    LOOKUP --> CHECK
    CHECK -->|"No match found<br/>or distance > 3"| NEW
    CHECK -->|"Match found<br/>distance ≤ 3"| DUP

    style FP fill:#0d6efd,stroke:#333,color:#fff
    style CHECK fill:#ffc107,stroke:#333
    style NEW fill:#198754,stroke:#333,color:#fff
    style DUP fill:#dc3545,stroke:#333,color:#fff

3.5 URL Deduplication — URL Seen?

Trước khi thêm URL mới vào Frontier, phải kiểm tra xem URL này đã crawl rồi chưa hoặc đã nằm trong queue rồi chưa.

3.5.1 Tại sao Bloom Filter?

Với 10B URLs, lưu toàn bộ vào HashSet cần:

M e m or y_{H a s h S e t} = 10 B \times 100 b y t es = 1 TB

→ Không fit trong RAM! Bloom filter chỉ cần 12 GB (tính ở phần Estimation) với 1% false positive.

Trade-off của Bloom Filter:

Aspect	Bloom Filter	HashSet
Memory	O(m) — rất nhỏ	O(n * size_per_entry) — rất lớn
False Positive	Có (tunable, ví dụ 1%)	Không
False Negative	Không bao giờ	Không
Delete support	Không (trừ Counting Bloom)	Có

Quan trọng: Bloom filter không bao giờ cho false negative. Nghĩa là nếu Bloom filter nói “chưa thấy URL này” → chắc chắn chưa thấy. Nếu nó nói “đã thấy” → 99% đúng, 1% false positive. Với web crawler, false positive chỉ có nghĩa là skip một URL — chấp nhận được.

3.5.2 URL Normalization

Trước khi check Bloom filter, URL phải được normalize để tránh crawl cùng một trang nhiều lần:

Vấn đề	Ví dụ	Normalized
Trailing slash	`example.com/page` vs `example.com/page/`	`example.com/page`
Protocol	`http://` vs `https://`	Chọn `https://` nếu có
www prefix	`www.example.com` vs `example.com`	Chọn non-www (hoặc canonical)
Fragment	`example.com/page#section1`	Remove fragment → `example.com/page`
Query params order	`?a=1&b=2` vs `?b=2&a=1`	Sort alphabetically → `?a=1&b=2`
Percent encoding	`%7E` vs `~`	Decode unreserved chars → `~`
Default port	`example.com:80`	Remove default port → `example.com`
Case	`Example.COM/Page`	Lowercase hostname → `example.com/Page` (path case-sensitive!)
Tracking params	`?utm_source=google&page=1`	Remove tracking params (utm_*)

Aha Moment: URL normalization là chi tiết nhỏ nhưng cực kỳ quan trọng. Không normalize → crawler sẽ crawl cùng trang hàng chục lần với các URL “trông khác nhưng cùng nội dung”. Lãng phí bandwidth và storage.

3.6 Content Storage — Lưu ở đâu?

3.6.1 Storage Architecture

Với 500 TB/month, cần distributed file system:

Storage Layer	Lưu gì	Technology
Metadata Store	URL, timestamp, HTTP headers, crawl status	Distributed DB (Cassandra, HBase)
Content Store	Raw HTML content (compressed)	Distributed FS (HDFS, S3, GCS)
URL Store	Bloom filter, URL → fingerprint mapping	In-memory (Redis) + persistent backup
Fingerprint Store	SimHash fingerprints	Distributed DB hoặc in-memory store

3.6.2 Storage Optimization

Kỹ thuật	Mô tả	Tiết kiệm
Compression	gzip/zstd compress HTML	~5x reduction
Dedup at storage	Chỉ lưu content mới (SimHash filter)	~30% reduction
Tiered storage	Hot data → SSD, Cold data → HDD/S3	Giảm chi phí
TTL-based cleanup	Xóa content quá cũ (>5 năm)	Free up space

3.7 Scalability — Mở rộng Crawler

3.7.1 Multiple Crawl Workers

Mỗi crawl worker là một process/container chạy trên separate machine:

W or k er s_{n ee d e d} = \frac{P e ak p a g es / s}{P a g es / s p er w or k er}

Giả sử mỗi worker handle 10 pages/s (including DNS, download, parse):

W or k er s_{n ee d e d} = \frac{1 , 200}{10} = 120 w or k ers

3.7.2 Consistent Hashing cho URL Assignment

Dùng consistent hashing (Tuan-10-Consistent-Hashing) để assign URLs cho workers dựa trên hostname:

Lợi ích	Giải thích
Politeness tự nhiên	Cùng hostname → cùng worker → dễ enforce rate limit
DNS cache hiệu quả	Cùng worker crawl cùng domain → DNS cache hit rate cao
robots.txt cache	Mỗi worker chỉ cần cache robots.txt cho domains nó phụ trách
Rebalance khi scale	Thêm/bớt worker chỉ ảnh hưởng tối thiểu URL assignment

3.7.3 Multi-region Deployment

flowchart TB
    subgraph "Region: US-East"
        F_US["URL Frontier<br/>(US URLs)"]
        W_US1["Worker 1"]
        W_US2["Worker 2"]
        W_USN["Worker N"]
        F_US --> W_US1
        F_US --> W_US2
        F_US --> W_USN
    end

    subgraph "Region: EU-West"
        F_EU["URL Frontier<br/>(EU URLs)"]
        W_EU1["Worker 1"]
        W_EU2["Worker 2"]
        W_EUN["Worker N"]
        F_EU --> W_EU1
        F_EU --> W_EU2
        F_EU --> W_EUN
    end

    subgraph "Region: Asia-Pacific"
        F_AP["URL Frontier<br/>(Asia URLs)"]
        W_AP1["Worker 1"]
        W_AP2["Worker 2"]
        W_APN["Worker N"]
        F_AP --> W_AP1
        F_AP --> W_AP2
        F_AP --> W_APN
    end

    subgraph "Central"
        CS["Content Storage<br/>(S3 / HDFS)"]
        DEDUP["Global Dedup Service"]
    end

    W_US1 --> CS
    W_EU1 --> CS
    W_AP1 --> CS
    W_US1 --> DEDUP
    W_EU1 --> DEDUP
    W_AP1 --> DEDUP

    style CS fill:#6f42c1,stroke:#333,color:#fff
    style DEDUP fill:#dc3545,stroke:#333,color:#fff

Lợi ích multi-region:

Crawl websites ở Asia từ Asia region → giảm latency 200-300ms per request
Tránh cross-continent bandwidth → giảm chi phí network
Nếu một region down → các regions khác vẫn tiếp tục crawl

3.8 Spider Traps & Problematic Content

3.8.1 Spider Trap là gì?

Spider trap là website (cố ý hoặc vô ý) tạo ra vô hạn URLs, khiến crawler bò mãi không thoát ra được.

Loại trap	Ví dụ	Giải thích
Infinite loop	`/a/b/a/b/a/b/...`	Trang A link đến B, B link đến A
Dynamic URL generation	`/page?id=1`, `/page?id=2`, … `/page?id=99999999`	Server tạo vô hạn pages
Calendar trap	`/calendar/2025/01/01`, `/calendar/2025/01/02`, …	Calendar component tạo URL cho mỗi ngày đến vô tận
Session ID in URL	`/page?session=abc123`, `/page?session=def456`	Mỗi visit tạo URL mới
Soft 404	Mọi URL đều trả HTTP 200 với nội dung “Page not found”	Server không trả 404 đúng cách

3.8.2 Anti-Trap Strategies

Chiến lược	Mô tả	Hiệu quả
Max URL depth	Giới hạn crawl depth (ví dụ: tối đa 15 levels từ seed)	Tránh bò quá sâu
Max pages per domain	Giới hạn số trang crawl per domain (ví dụ: 100K pages)	Ngăn một domain chiếm hết resources
URL length limit	Skip URLs dài hơn 2048 ký tự	URLs dài thường là trap
URL pattern detection	Phát hiện repeating patterns (`/a/b/a/b/`)	Heuristic-based
Manual blacklist	Đội ops maintain danh sách domains/patterns cần block	Catch-all cuối cùng
Content similarity	Nếu N trang liên tiếp từ cùng domain có content gần giống → nghi trap	Dùng SimHash
Crawl budget per domain	Mỗi domain được “ngân sách” crawl cố định per cycle	Fair resource allocation

Aha Moment: Spider traps là lý do chính DFS không được dùng cho web crawling. DFS sẽ bò sâu vào trap và không bao giờ quay lại. BFS với depth limit an toàn hơn nhiều.

3.8.3 Problematic Content khác

Vấn đề	Mô tả	Giải pháp
JavaScript-rendered pages	SPA (React, Angular) — HTML trống, content render bằng JS	Headless browser (Puppeteer, Playwright) — tốn resources hơn nhiều
Login walls	Content sau login form	Skip hoặc dùng authenticated crawling (phức tạp)
CAPTCHAs	Anti-bot protection	Skip — đây là tín hiệu website không muốn bị crawl
Rate limiting responses	HTTP 429 Too Many Requests	Back off, giảm crawl rate cho domain đó
Huge files	PDF 500MB, video embedded	Set max download size, filter by Content-Type header
Encoding issues	Trang tiếng Trung, Nhật với encoding lạ	Detect charset từ HTTP header + meta tag + content sniffing

Step 4 — Wrap Up

4.1 Security Considerations (Bảo mật)

4.1.1 robots.txt Compliance — Tôn trọng chủ nhà

Đây không chỉ là kỹ thuật mà còn là vấn đề pháp lý và đạo đức:

Aspect	Chi tiết
Bắt buộc tuân thủ	robots.txt là “gentleman’s agreement” — không bắt buộc kỹ thuật nhưng vi phạm có thể bị kiện
Cache robots.txt	Tải một lần, cache 24h. Đừng tải robots.txt trước MỖI request
Crawl-delay	Tuân thủ `Crawl-delay` directive. Nếu không có, default 1-5 giây
User-agent identification	Crawler phải tự xưng danh qua User-Agent header (ví dụ: `HieuBot/1.0`)

4.1.2 Crawl Rate Limiting — Đừng DDoS

Không có crawl rate limiting = DDoS attack không chủ đích. Xem chi tiết Tuan-09-Rate-Limiter.

Loại limit	Giá trị khuyến nghị	Mục đích
Per-host request rate	1 request/1-10 giây	Không overwhelm bất kỳ server nào
Per-host concurrent connections	1 connection	Politeness
Global request rate	Tuỳ bandwidth budget	Kiểm soát tổng tải
Per-host daily limit	10K-100K pages	Ngăn crawl quá nhiều từ một domain

4.1.3 Legal Considerations — Pháp lý

Vấn đề pháp lý	Chi tiết	Giải pháp
GDPR (EU)	Crawl có thể thu thập personal data (email, tên, địa chỉ trên web pages)	Không index/store personal data. Nếu lưu → phải cho phép “right to be forgotten”
Copyright	HTML content có bản quyền	Crawler thu thập để index, KHÔNG republish. Fair use doctrine
Computer Fraud and Abuse Act (CFAA)	Truy cập “unauthorized” vào hệ thống	Tuân thủ robots.txt = implicit authorization. Bypass = rủi ro pháp lý
Terms of Service	Nhiều website cấm crawling trong ToS	Rủi ro bị kiện nếu vi phạm ToS (case: LinkedIn vs hiQ Labs)
DMCA	Nội dung được bảo vệ bản quyền	Cơ chế DMCA takedown request

Aha Moment: Nhiều ứng viên focus vào kỹ thuật mà quên hoàn toàn khía cạnh pháp lý. Mention GDPR và robots.txt compliance trong interview cho thấy em hiểu bài toán end-to-end, không chỉ engineering.

4.1.4 Malicious Content Handling

Mối đe dọa	Mô tả	Giải pháp
Malware in HTML	Trang web chứa mã độc	Sandbox environment cho parser, không execute JavaScript trong crawler
Phishing pages	Trang giả mạo ngân hàng, Google, etc.	Integrate phishing detection (Google Safe Browsing API) trước khi index
SEO spam	Trang chứa toàn keyword spam, hidden text	Content quality scoring — nếu score thấp → giảm priority hoặc skip
Link farms	Network of sites link đến nhau để tăng PageRank	Phát hiện bằng graph analysis — cluster of sites với unnatural linking patterns
Cloaking	Server trả content khác cho crawler vs user	Random User-Agent rotation, so sánh crawl result vs actual page

4.1.5 Phishing Page Detection

Phishing pages giả mạo websites hợp pháp để lừa người dùng. Crawler có thể giúp phát hiện:

Technique	Mô tả
URL analysis	Domain tương tự domain hợp pháp (googIe.com vs google.com, paypa1.com vs paypal.com)
Content similarity	So sánh content với trang gốc bằng SimHash — nếu rất giống nhưng domain khác → nghi phishing
SSL certificate check	Phishing sites thường dùng free SSL hoặc self-signed cert
Domain age	Domain mới tạo (<30 ngày) + nội dung giống trang nổi tiếng → red flag
Blacklist check	Check domain against known phishing databases (PhishTank, OpenPhish)

4.2 DevOps — Deployment, Monitoring, Operations

4.2.1 Monitoring — Theo dõi sức khỏe Crawler

Tham chiếu: Tuan-13-Monitoring-Observability

Metric	Mô tả	Alert Threshold
Crawl rate (pages/second)	Tốc độ crawl hiện tại	< 200 pages/s (dưới 50% target)
Success rate	% requests trả về HTTP 200	< 80%
Error rate by type	% HTTP 4xx, 5xx, timeout	4xx > 20%, 5xx > 10%, timeout > 5%
Queue depth	Số URLs trong Frontier	> 500M (queue đang phình) hoặc < 1M (sắp hết việc)
Download latency (P50, P99)	Thời gian tải mỗi trang	P99 > 10 giây
DNS resolution time	Thời gian DNS lookup	P99 > 500ms
Duplicate rate	% content bị detect là duplicate	> 50% (có thể crawler đang bị trap)
Bloom filter false positive rate	Estimated FP rate	> 5% (cần resize)
Storage growth rate	TB/day mới thêm	Deviation > 50% so với expected
Worker health	CPU, memory, network I/O per worker	CPU > 90%, Memory > 85%

Dashboard layout gợi ý:

Panel	Metric	Visualization
Top-left	Crawl rate (pages/s) real-time	Time-series line chart
Top-right	Success/Error rate	Stacked area chart
Middle-left	Queue depth over time	Line chart with threshold lines
Middle-right	Top 10 domains by crawl volume	Bar chart
Bottom-left	Download latency P50/P95/P99	Histogram
Bottom-right	Worker status grid	Heatmap (green/yellow/red)

4.2.2 Distributed Deployment

Component	Deployment	Scaling Strategy
URL Frontier	Stateful service, sharded by hostname hash	Scale shards khi queue depth tăng
Crawl Workers	Stateless containers (K8s pods)	Horizontal auto-scaling based on queue depth
DNS Resolver	Per-region caching resolver	Scale with worker count
Content Parser	Stateless, CPU-intensive	Scale with crawl rate
Dedup Service	Stateful (Bloom filter, fingerprint store)	Scale memory, partition Bloom filter
Content Storage	S3/HDFS cluster	Add nodes khi storage đầy
Metadata DB	Cassandra/HBase cluster	Add nodes, consistent hashing tự rebalance

4.2.3 Graceful Shutdown

Khi cần restart/update crawler, không được kill process ngay lập tức:

Bước	Action	Lý do
1	Stop accepting new URLs từ Frontier	Không nhận thêm việc mới
2	Finish in-flight downloads (timeout 30s)	Không lãng phí downloads đã bắt đầu
3	Save Frontier state to disk	URLs trong queue không bị mất
4	Save Bloom filter snapshot	Không phải rebuild từ đầu
5	Flush parsed content đến storage	Không mất data đã parse
6	Shutdown workers	Clean exit

4.2.4 Checkpoint & Resume

Crawler chạy liên tục 24/7. Khi crash (và SẼ crash), phải resume được:

State cần checkpoint	Frequency	Storage
URL Frontier	Mỗi 5 phút	Disk (RocksDB / local SSD)
Bloom filter	Mỗi 15 phút	Disk snapshot
Crawl progress per domain	Real-time (mỗi page)	Metadata DB
Worker assignment	Mỗi khi thay đổi	Coordination service (ZooKeeper/etcd)

Recovery flow:

Worker crash → Coordination service detect (heartbeat timeout)
Reassign URLs of dead worker → healthy workers (consistent hashing makes this efficient)
New worker starts → load latest Bloom filter snapshot + Frontier state
Resume crawling from last checkpoint

Aha Moment: Checkpoint/resume là thứ phân biệt production crawler với toy crawler. Không ai muốn restart crawl 10B URLs từ đầu vì một process crash.

4.2.5 Alerting Strategy

Severity	Condition	Action
P0 — Critical	Crawl rate = 0 (toàn bộ crawler dừng)	Page on-call engineer ngay lập tức
P1 — High	Crawl rate < 50% target liên tục > 15 phút	Notify team, investigate trong 30 phút
P2 — Medium	Error rate > 20% cho specific domain cluster	Auto-reduce crawl rate, notify team
P3 — Low	Bloom filter FP rate > 3%	Schedule resize trong sprint tiếp

Aha Moments & Pitfalls — Những điều cần nhớ

Aha Moment 1: Politeness vs Speed — Trade-off cốt lõi

Hieu, đây là trade-off trung tâm của web crawler. Em muốn crawl nhanh (1B pages/month) nhưng đồng thời phải lịch sự (không DDoS bất kỳ ai).

Nếu quá nhanh	Nếu quá chậm
Website block IP crawler	Không đạt target 1B pages/month
Bị kiện (CFAA violation)	Content index không fresh
Reputation damage	Search results quality giảm
Website sập → crawler cũng không crawl được	Tốn tiền server chạy lâu

Giải pháp: Nhanh tổng thể nhưng chậm với từng host. Crawl song song hàng nghìn hosts, mỗi host 1 request/vài giây.

Aha Moment 2: Spider Traps — Hố đen của Crawler

Spider traps có thể “nuốt” toàn bộ resources của crawler nếu không phòng bị. Ví dụ thực tế: một calendar widget có thể tạo URLs cho mọi ngày từ năm 1970 đến 9999 — hàng triệu URLs, tất cả đều hợp lệ nhưng vô nghĩa.

Cách phòng: Defense in depth — kết hợp max depth + max pages per domain + URL pattern detection + manual blacklist. Không có silver bullet.

Aha Moment 3: URL Normalization — Chi tiết nhỏ, impact lớn

Không normalize URLs → crawler có thể crawl cùng trang hàng chục lần:

http://Example.COM/page
https://www.example.com/page/
https://example.com/page?utm_source=google
https://example.com/page#section1

Tất cả là cùng một trang! Normalize trước khi đưa vào Bloom filter.

Aha Moment 4: DNS Bottleneck — Kẻ thù giấu mặt

DNS lookup tưởng “có gì đâu, mỗi cái 100ms” — nhưng nhân với 400 pages/second = 40,000ms DNS time mỗi giây. Không cache DNS = crawler chạy chậm hơn 10x.

Giải pháp đơn giản: local DNS cache + custom DNS resolver + pre-fetching. Đa số DNS entries có TTL dài (hours-days), nên cache hit rate rất cao.

Aha Moment 5: Content Near-Duplicate — 30% Internet là copy-paste

Trên Internet, cùng một bài báo xuất hiện trên hàng chục websites (syndicated content). Lưu tất cả = lãng phí 30% storage. SimHash cho phép detect near-duplicate với chỉ 8 bytes per document — cực kỳ hiệu quả.

Pitfall 1: Dùng DFS thay BFS

DFS sẽ bò sâu vào một domain/path và không quay lại. Kết hợp với spider trap → crawler bị stuck mãi mãi. Luôn dùng BFS (hoặc modified BFS với priority).

Pitfall 2: Không handle HTTP redirects đúng cách

Redirect chains có thể tạo loop: A → B → C → A. Phải set max redirect hops (thường 5-10) và track visited URLs trong chain.

Pitfall 3: Không set timeout cho downloads

Một server chậm (hoặc cố tình) có thể giữ connection mở mãi mãi → worker bị block. Luôn set connection timeout (10s) và read timeout (30s).

Pitfall 4: Quên xử lý character encoding

Trang web có thể dùng UTF-8, ISO-8859-1, EUC-KR, Shift_JIS, etc. Parse sai encoding → content bị garbled → dedup không hoạt động đúng → lưu duplicates.

Pitfall 5: Single point of failure ở URL Frontier

Nếu URL Frontier chạy trên một máy duy nhất → máy đó chết = toàn bộ crawler dừng. Frontier phải được sharded và replicated.

Tổng kết — Checklist cho Interview

Khi trả lời “Design a Web Crawler” trong interview, Hieu cần cover:

#	Topic	Must mention
1	Clarify requirements	Scale (1B pages), purpose, HTML only vs media
2	High-level flow	Seed → Frontier → Download → Parse → Dedup → Extract → Loop
3	URL Frontier	Priority queue + Politeness (per-host queue)
4	Politeness	robots.txt, crawl delay, per-host rate limit
5	Dedup	Content dedup (SimHash) + URL dedup (Bloom filter)
6	DNS optimization	Local cache, custom resolver
7	Robustness	Spider traps, timeouts, retries, checkpoint/resume
8	Scalability	Distributed workers, consistent hashing, multi-region
9	Storage	Distributed FS, compression, tiered storage
10	Estimation	Pages/s, bandwidth, storage, Bloom filter size

Liên kết kiến thức

Concept trong bài	Tuần đã học	Áp dụng cụ thể
Message Queue	Tuan-08-Message-Queue	URL Frontier chính là một message queue phức tạp
Rate Limiter	Tuan-09-Rate-Limiter	Politeness = rate limiting per host
Consistent Hashing	Tuan-10-Consistent-Hashing	URL assignment cho workers, sharding Frontier
DNS & CDN	Tuan-03-Networking-DNS-CDN	DNS resolution optimization
Cache Strategy	Tuan-06-Cache-Strategy	DNS cache, robots.txt cache, content cache
Database Sharding	Tuan-07-Database-Sharding-Replication	Metadata DB sharding, content storage

Appendix — So sánh Web Crawler vs các Case Study khác

Aspect	URL Shortener	Chat System	News Feed	Web Crawler
Read/Write pattern	Read-heavy	Write-heavy (messages)	Read-heavy (feed)	Write-heavy (crawl + store)
Core challenge	Hashing, uniqueness	Real-time delivery	Fan-out, ranking	Politeness, dedup, scale
Storage	Small (URLs)	Medium (messages)	Medium (posts)	Massive (petabytes of HTML)
Latency requirement	Very low (<50ms)	Very low (<100ms)	Low (<500ms)	Not critical (batch)
Distributed?	Optional at small scale	Yes (WebSocket servers)	Yes (fan-out workers)	Always (mandatory)
Unique patterns	Base62 encoding	WebSocket, presence	Push vs Pull model	BFS traversal, Bloom filter

“Web crawler là bài toán mà simplicity deceives — nhìn đơn giản nhưng mỗi component đều ẩn chứa complexity. Em sẽ không bao giờ nhìn Google Search giống cách cũ nữa sau khi hiểu được bao nhiêu engineering đằng sau việc ‘crawl mấy trang web’.”

lthieu's notes

Explorer

Case-Design-Web-Crawler