Week 19: Design a Notification System
“A good notification earns a thank-you from the user. A bad one gets the app uninstalled. The line is thinner than you think.”
Tags: system-design notification case-study alex-xu Student: Hieu Prerequisite: Tuan-08-Message-Queue · Tuan-09-Rate-Limiter · Tuan-06-Cache-Strategy Related: Tuan-17-Design-Chat-System · Tuan-18-Design-News-Feed · Tuan-13-Monitoring-Observability · Tuan-14-CI-CD-Pipeline
Step 1 — Understand the Problem & Establish Design Scope
1.1 Functional Requirements
Hieu, before drawing any diagram, you must pin down the scope with the interviewer. These are the important questions:
| Question | Assumed answer | Why ask it |
|---|---|---|
| Which notification types must the system support? | Push notifications (iOS/Android), SMS, email | Each channel has its own architecture |
| Who/what triggers a notification? | Event-driven: new message, order update, marketing campaign, system alert | Separates transactional from marketing traffic |
| Is there a real-time requirement? | Push & SMS: near real-time (< 5s). Email: best-effort (< 5 min) | Drives queue priority |
| Can users opt out per channel? | Yes — user preferences per channel, per notification type | Requires a User Preference Service |
| Are there priority levels? | Urgent (OTP, security alert), Normal (order update), Low (marketing) | Priority queue design |
| Is analytics needed? | Yes — delivery rate, open rate, click-through rate | Analytics pipeline |
| Multi-language support? | Yes — i18n via a template engine | Template design |
1.2 Non-functional Requirements
| Requirement | Target | Notes |
|---|---|---|
| Scalability | 100M DAU, 500M notifications/day | Large-scale system |
| Availability | 99.9% (three 9s) | ≈ 8.77 hours of downtime/year |
| Latency | Urgent: < 3s, Normal: < 30s, Low: best-effort | SLA per priority |
| Reliability | At-least-once delivery, ideally exactly-once | Deduplication mechanism |
| Compliance | CAN-SPAM (email), GDPR (EU), TCPA (SMS) | Legal requirements |
1.3 Capacity Estimation
Assumptions
| Parameter | Value | Explanation |
|---|---|---|
| DAU | 100M | Facebook/Uber-scale system |
| Total notifications/day | 500M | ~5 notifications/user/day on average |
| Channel breakdown | Push: 60%, Email: 30%, SMS: 10% | Push is cheapest, SMS most expensive |
| Avg notification payload | Push: 1 KB, Email: 50 KB, SMS: 0.2 KB | Including metadata |
| Analytics event size | 0.5 KB per notification | Delivery + open + click tracking |
| Retention | Notification log: 90 days, Analytics: 1 year | Compliance + business intelligence |
Notification Volume per Channel
QPS Calculation
Why a 5x peak multiplier? Marketing campaigns fire in bulk, flash sales, breaking news — all of them create burst traffic.
Worker Count Estimation
Assume each worker handles 100 notifications/s (including template rendering, the provider API call, and retry logic):
Rule of thumb: provision 1.5x the peak worker count as headroom → ~450 workers.
Queue Depth Estimation
If the downstream provider (APNs/FCM/SES) stalls for 30 seconds:
Kafka partitions for throughput: each partition handles ~10K msg/s → the push channel needs about 3 partitions at peak.
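Putting the QPS, worker, and queue-depth arithmetic in one place — a sketch reproducing the summary-table figures (the round numbers come from rounding peak QPS up to 30K before sizing):

```python
import math

NOTIFS_PER_DAY = 500_000_000
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 5        # marketing blasts, flash sales, breaking news
WORKER_RATE = 100          # notifications/s per worker (rendering + provider call + retry)
HEADROOM = 1.5
PUSH_SHARE = 0.60
PARTITION_RATE = 10_000    # msg/s one Kafka partition can absorb

avg_qps = NOTIFS_PER_DAY / SECONDS_PER_DAY            # ≈ 5,787 → "~5,800/s"
peak_qps = avg_qps * PEAK_MULTIPLIER                  # ≈ 28,935 → "~30,000/s"
peak_workers = math.ceil(peak_qps / WORKER_RATE)      # 290 exact; ~300 with rounded peak
provisioned_workers = math.ceil(300 * HEADROOM)       # 300 × 1.5 = 450 workers
queue_depth_30s = 30_000 * 30                         # 900K messages if provider stalls 30s
push_partitions = math.ceil(peak_qps * PUSH_SHARE / PARTITION_RATE)  # ≈ 1.7 → 2; 3 adds headroom

print(round(avg_qps), round(peak_qps), provisioned_workers, queue_depth_30s, push_partitions)
```

Note the exact partition math gives 2; provisioning 3 (as in the text) leaves room for skewed partitions and rebalances.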
Storage Estimation
Notification log (90 days):
Analytics data (1 year):
Email content storage (90 days):
Alert: email content dominates storage! It needs compression + object storage (S3) instead of a DB.
Bandwidth Estimation
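The storage and bandwidth figures can be reconstructed from the assumptions table — a sketch using decimal units (1 KB = 10³ B), matching the summary below:

```python
KB, TB = 1e3, 1e12          # decimal units, as in the estimates
DAILY = 500e6               # notifications/day

log_90d_tb = DAILY * 1 * KB * 90 / TB           # 1 KB log row each → 45 TB
analytics_1y_tb = DAILY * 0.5 * KB * 365 / TB   # 0.5 KB event each → 91.25 TB
email_90d_tb = 150e6 * 50 * KB * 90 / TB        # 30% of volume at 50 KB → 675 TB (dominates!)

# Peak egress bandwidth: channel share × peak QPS × payload size
peak_qps = 30_000
bytes_per_s = peak_qps * (0.6 * 1 + 0.3 * 50 + 0.1 * 0.2) * KB
mb_per_s = bytes_per_s / 1e6                    # ≈ 469 MB/s (~3.8 Gbps), email-dominated

print(log_90d_tb, analytics_1y_tb, email_90d_tb, round(mb_per_s))
```

The egress number explains why email content belongs in S3 behind a CDN-style pipeline rather than flowing through the database tier.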
Estimation Summary
| Metric | Value |
|---|---|
| Total QPS (avg) | ~5,800/s |
| Total QPS (peak) | ~30,000/s |
| Push notifications/day | 300M |
| Email notifications/day | 150M |
| SMS notifications/day | 50M |
| Workers needed (peak + headroom) | ~450 |
| Queue depth (30s buffer) | ~900K messages |
| Notification log storage (90d) | ~45 TB |
| Analytics storage (1yr) | ~91 TB |
| Email content storage (90d) | ~675 TB |
Step 2 — Propose High-Level Design
2.1 Analogy (An everyday example)
Hieu, picture the notification system as a national postal service:
- Event producers = letter writers (the services in the system)
- Notification service = the central post office — sorts, checks addresses, stamps
- Template engine = the printing press — fills details into pre-made letter templates
- User preference service = the subscription register — who wants mail, who doesn't
- Rate limiter = the inspector — won't let too many letters go to one person
- Message queues = trucks sorted by region (push/SMS/email)
- Delivery workers = mail carriers — each kind knows how to deliver its own mail
- Dead letter queue = the undeliverable-mail warehouse — retry later
2.2 Architecture Overview
```mermaid
flowchart TB
    subgraph "Event Producers"
        EP1["Order Service"]
        EP2["Chat Service"]
        EP3["Marketing Service"]
        EP4["Auth Service<br/>(OTP, Security)"]
        EP5["Scheduler<br/>(Cron Jobs)"]
    end
    subgraph "Notification Platform"
        NS["Notification Service<br/>(API Gateway)"]
        VAL["Validation &<br/>Deduplication"]
        UPS["User Preference<br/>Service"]
        TE["Template Engine<br/>(i18n)"]
        RL["Rate Limiter"]
        PQ["Priority Router"]
    end
    subgraph "Message Queues (Kafka)"
        direction LR
        Q_PUSH_U["Push Queue<br/>(Urgent)"]
        Q_PUSH_N["Push Queue<br/>(Normal)"]
        Q_EMAIL_U["Email Queue<br/>(Urgent)"]
        Q_EMAIL_N["Email Queue<br/>(Normal)"]
        Q_SMS_U["SMS Queue<br/>(Urgent)"]
        Q_SMS_N["SMS Queue<br/>(Normal)"]
    end
    subgraph "Delivery Workers"
        PW["Push Workers<br/>(FCM / APNs)"]
        EW["Email Workers<br/>(SES / SendGrid)"]
        SW["SMS Workers<br/>(Twilio)"]
    end
    subgraph "External Providers"
        FCM["Firebase Cloud<br/>Messaging"]
        APNS["Apple Push<br/>Notification Service"]
        SES["Amazon SES /<br/>SendGrid"]
        TWILIO["Twilio /<br/>MessageBird"]
    end
    subgraph "Data Stores"
        DB[("Notification Log<br/>(Cassandra)")]
        REDIS[("User Preferences<br/>+ Device Tokens<br/>(Redis)")]
        S3[("Email Templates<br/>+ Content<br/>(S3)")]
        ANALYTICS[("Analytics<br/>(ClickHouse)")]
    end
    subgraph "Reliability"
        DLQ["Dead Letter Queue"]
        RETRY["Retry Scheduler"]
    end
    EP1 & EP2 & EP3 & EP4 & EP5 --> NS
    NS --> VAL
    VAL --> UPS
    UPS --> TE
    TE --> RL
    RL --> PQ
    PQ --> Q_PUSH_U & Q_PUSH_N
    PQ --> Q_EMAIL_U & Q_EMAIL_N
    PQ --> Q_SMS_U & Q_SMS_N
    Q_PUSH_U & Q_PUSH_N --> PW
    Q_EMAIL_U & Q_EMAIL_N --> EW
    Q_SMS_U & Q_SMS_N --> SW
    PW --> FCM & APNS
    EW --> SES
    SW --> TWILIO
    PW & EW & SW --> DB
    PW & EW & SW --> ANALYTICS
    PW & EW & SW -->|"Failed"| DLQ
    DLQ --> RETRY
    RETRY --> Q_PUSH_N & Q_EMAIL_N & Q_SMS_N
    NS --> REDIS
    TE --> S3
    style NS fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style PQ fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style DLQ fill:#f44336,stroke:#333,stroke-width:2px,color:#fff
```
2.3 Component Responsibilities
| Component | Responsibility | Technology |
|---|---|---|
| Notification Service | API gateway; receives requests from producers; validation | Go / Java microservice |
| Validation & Dedup | Schema validation, idempotency check (idempotency key) | In-process + Redis |
| User Preference Service | Checks opt-in/opt-out per user, per channel, per type | Redis Hash |
| Template Engine | Renders notification content from template + parameters, i18n | Handlebars / Jinja2 |
| Rate Limiter | Per-user + per-channel rate limits, spam protection | Redis + Token Bucket |
| Priority Router | Routes each notification to the right queue by priority + channel | In-process logic |
| Message Queues | Decouple producers & consumers, buffer burst traffic | Kafka (one topic per channel × priority; see the topic list in 3.2) |
| Delivery Workers | Consume from queues, call external providers, handle responses | Stateless workers, autoscaled |
| Dead Letter Queue | Holds notifications that fail after max retries | Kafka DLQ topic |
| Notification Log | Stores the state of each notification (created, sent, delivered, opened) | Cassandra (write-heavy, time-series) |
| Analytics Service | Aggregates delivery/open/click metrics, A/B testing | ClickHouse / Druid |
2.4 Notification Delivery Flow
```mermaid
sequenceDiagram
    participant Producer as Event Producer
    participant NS as Notification Service
    participant Redis as Redis (Prefs + Dedup)
    participant TE as Template Engine
    participant RL as Rate Limiter
    participant Kafka as Kafka Queue
    participant Worker as Delivery Worker
    participant Provider as External Provider<br/>(FCM/SES/Twilio)
    participant DB as Notification Log
    Producer->>NS: POST /api/v1/notifications<br/>{user_id, type, channel, data, priority}
    NS->>Redis: Check idempotency key
    Redis-->>NS: Not duplicate
    NS->>Redis: Get user preferences
    Redis-->>NS: {push: true, email: true, sms: false}
    Note over NS: User opted out of SMS → skip SMS channel
    NS->>TE: Render template(type, locale, data)
    TE-->>NS: Rendered content
    NS->>RL: Check rate limit(user_id, channel)
    RL-->>NS: Allowed (under limit)
    NS->>Kafka: Publish to push_normal topic
    NS->>Kafka: Publish to email_normal topic
    NS->>DB: Log status = QUEUED
    Kafka->>Worker: Consume message
    Worker->>Provider: Send notification
    Provider-->>Worker: Success (message_id)
    Worker->>DB: Update status = SENT
    Worker->>DB: Log provider_message_id
    Note over Provider: Provider delivers to device/inbox
    Provider-->>Worker: Delivery callback (webhook)
    Worker->>DB: Update status = DELIVERED
```
Step 3 — Design Deep Dive
3.1 Notification Template Engine
Why templates?
Instead of each service composing notification content on its own (inconsistent, hard to maintain), centralize the templates:
// Template: order_shipped
// Locale: vi
Chào {{user_name}}, đơn hàng #{{order_id}} đã được giao cho {{carrier}}.
Theo dõi tại: {{tracking_url}}
// Locale: en
Hi {{user_name}}, your order #{{order_id}} has been shipped via {{carrier}}.
Track it here: {{tracking_url}}
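What the template engine does with `{{param}}` placeholders can be sketched in a few lines, including the `required_params` check (a minimal sketch — real engines like Handlebars/Jinja2 add escaping, partials, etc.; the carrier name is illustrative):

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def render(template: str, params: dict) -> str:
    """Substitute {{name}} placeholders; fail fast on missing parameters."""
    missing = set(PLACEHOLDER.findall(template)) - params.keys()
    if missing:
        raise ValueError(f"missing template params: {sorted(missing)}")
    return PLACEHOLDER.sub(lambda m: str(params[m.group(1)]), template)

msg = render(
    "Hi {{user_name}}, your order #{{order_id}} has been shipped via {{carrier}}.",
    {"user_name": "Hieu", "order_id": 12345, "carrier": "GHN"},
)
```

Failing fast on a missing parameter at enqueue time is much cheaper than discovering a half-rendered notification on a user's lock screen.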
Template Storage Design
{
"template_id": "order_shipped",
"version": 3,
"channels": {
"push": {
"vi": {
"title": "Đơn hàng đã gửi!",
"body": "Đơn #{{order_id}} đang trên đường đến bạn."
},
"en": {
"title": "Order Shipped!",
"body": "Order #{{order_id}} is on its way."
}
},
"email": {
"vi": {
"subject": "Đơn hàng #{{order_id}} đã được gửi",
"html_template_s3_key": "templates/order_shipped/vi/v3.html"
},
"en": {
"subject": "Your order #{{order_id}} has shipped",
"html_template_s3_key": "templates/order_shipped/en/v3.html"
}
},
"sms": {
"vi": {
"body": "Don hang #{{order_id}} da gui. Theo doi: {{tracking_url}}"
},
"en": {
"body": "Order #{{order_id}} shipped. Track: {{tracking_url}}"
}
}
},
"required_params": ["user_name", "order_id", "carrier", "tracking_url"],
"created_at": "2026-01-15T10:00:00Z"
}

Aha Moment #1: Template versioning matters. When a template is updated, notifications already sitting in the queue still use the old version — so the `template_version` must be stored in the message payload.
i18n Strategy
- The user profile stores a `preferred_locale` (defaulting to the device locale)
- Fallback chain: `user_locale → country_default → en`
- SMS avoids Unicode to save segments (Unicode SMS = 70 chars/segment vs GSM-7 = 160 chars/segment)
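The fallback chain reduces to a small resolver over the locales a template actually provides — a sketch with assumed names (`country_default` would come from per-country configuration):

```python
def resolve_locale(available: set, user_locale, country_default: str = "en") -> str:
    """Walk user_locale → country_default → en, returning the first available locale."""
    for candidate in (user_locale, country_default, "en"):
        if candidate and candidate in available:
            return candidate
    raise KeyError(f"no usable locale among {sorted(available)}")
```

For a template shipped in `{"vi", "en"}`: a Vietnamese user resolves to `vi`; a French user whose country default is `de` falls through to `en`.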
3.2 Priority Queue Design
Priority Levels
| Priority | Use Cases | SLA | Queue |
|---|---|---|---|
| URGENT | OTP, security alert, payment confirmation | < 3 seconds | Dedicated urgent queue, pre-allocated workers |
| NORMAL | Order update, message notification, social interaction | < 30 seconds | Standard queue |
| LOW | Marketing campaign, weekly digest, recommendations | Best-effort (< 5 min) | Low-priority queue, off-peak delivery |
Implementation: Separate Queues (Not Priority Field)
Why separate Kafka topics instead of a single queue with a priority field?
- Resource isolation: the urgent queue gets a dedicated worker pool — a marketing burst can't affect OTP delivery
- Independent scaling: urgent workers stay always-on (never autoscale to 0); marketing workers scale to zero off-peak
- Different retry policies: Urgent → aggressive retry (3 attempts within 10s). Low → gentle retry (3 attempts over 1 hour)
Kafka topics:
├── notification.push.urgent (3 partitions, replication=3)
├── notification.push.normal (6 partitions, replication=3)
├── notification.push.low (3 partitions, replication=2)
├── notification.email.urgent (3 partitions, replication=3)
├── notification.email.normal (6 partitions, replication=3)
├── notification.email.low (3 partitions, replication=2)
├── notification.sms.urgent (3 partitions, replication=3)
├── notification.sms.normal (3 partitions, replication=3)
├── notification.sms.low (2 partitions, replication=2)
└── notification.dlq (3 partitions, replication=3)
3.3 Rate Limiting — Fighting Notification Fatigue
Why rate-limit notifications?
- A user getting 50 pushes/day uninstalls the app
- Too much email → ISPs flag it as spam → deliverability drops for the entire domain
- Too much SMS → costs explode + TCPA violations
Rate Limit Tiers
| Tier | Limit | Scope | Example |
|---|---|---|---|
| Per-user per-channel | 10 push/hour, 5 email/day, 2 SMS/day | One specific user | User A gets at most 10 push notifications/hour |
| Per-user total | 30 notifications/day | All channels combined | User A gets at most 30 notifications/day in total |
| Per-type | 1/event_type/hour | Same notification type | Don't send “someone liked your post” more than once an hour |
| Global per-channel | 100K email/hour | Whole system | Avoids ISP throttling |
Exception: URGENT priority (OTP, security alerts) bypasses the rate limiter. A user requesting an OTP must receive it immediately — never block it.
Redis Implementation
# Per-user per-channel rate limit (fixed window — the window bucket is part of the key)
Key: rate:{user_id}:{channel}:{window}
Value: count
TTL: window duration
# Example: User 12345, push channel, hourly window
Key: rate:12345:push:2026031810
Value: 7
TTL: 3600
# Check: if value < 10 → allowed, INCR
# if value >= 10 → rejected, queue for later or drop
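The same counting logic, mirrored with an in-memory stand-in so it is testable (a sketch — in production the dict becomes Redis `INCR` + `EXPIRE`; the window bucket baked into the key is what makes this a fixed window):

```python
import time

class FixedWindowRateLimiter:
    """One counter per (user, channel, window bucket) — same shape as the Redis keys."""

    def __init__(self, window_s: int, max_count: int, clock=time.time):
        self.window_s = window_s
        self.max_count = max_count
        self.clock = clock
        self.counters = {}  # in-memory stand-in for Redis

    def allow(self, user_id: str, channel: str) -> bool:
        bucket = int(self.clock() // self.window_s)   # hour number, like 2026031810
        key = (user_id, channel, bucket)              # rate:{user_id}:{channel}:{window}
        if self.counters.get(key, 0) >= self.max_count:
            return False                              # rejected: queue for later or drop
        self.counters[key] = self.counters.get(key, 0) + 1
        return True

limiter = FixedWindowRateLimiter(window_s=3600, max_count=10)
results = [limiter.allow("12345", "push") for _ in range(11)]  # 10 allowed, 11th rejected
```

A fixed window admits up to 2x the limit across a boundary (10 at 10:59 + 10 at 11:00); a true sliding window (Redis sorted set of timestamps) closes that gap at higher cost.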
3.4 Retry Mechanism & Dead Letter Queue
Retry Strategy: Exponential Backoff
```mermaid
flowchart TD
    A["Worker picks message<br/>from Kafka"] --> B{"Send to Provider"}
    B -->|"Success"| C["Update status = SENT<br/>ACK message"]
    B -->|"Fail (timeout,<br/>5xx, rate limit)"| D{"Retry count<br/>< max_retries?"}
    D -->|"Yes"| E["Calculate backoff:<br/>delay = min(base × 2^retry, max_delay)<br/>+ random jitter"]
    E --> F["Re-publish to same topic<br/>with incremented retry_count<br/>and scheduled_at = now + delay"]
    F --> A
    D -->|"No (exhausted)"| G["Publish to DLQ"]
    G --> H["Alert + Manual Review"]
    B -->|"Fail (4xx client error,<br/>invalid token)"| I["Permanent failure<br/>Update status = FAILED"]
    I --> J["Clean up: remove<br/>invalid device token"]
    style G fill:#f44336,stroke:#333,stroke-width:2px,color:#fff
    style C fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style I fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
```
Retry Configuration per Priority
| Priority | Max Retries | Base Delay | Max Delay | Total Window |
|---|---|---|---|---|
| URGENT | 5 | 1s | 30s | ~1 min |
| NORMAL | 3 | 30s | 5 min | ~15 min |
| LOW | 3 | 5 min | 1 hour | ~3 hours |
Backoff Formula
Jitter matters: without jitter, all retries land at the same instant → thundering herd on the provider API.
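The backoff formula from the flowchart, as code (the 20% jitter fraction is an assumed value; the injectable `rng` keeps it deterministic for testing):

```python
import random

def backoff_delay(retry: int, base: float, max_delay: float,
                  jitter_frac: float = 0.2, rng=random.random) -> float:
    """delay = min(base × 2^retry, max_delay) + random jitter."""
    delay = min(base * (2 ** retry), max_delay)
    return delay + rng() * delay * jitter_frac  # jitter de-synchronizes retry storms

# NORMAL priority: base 30s, cap 5 min (rng pinned to 0 → pure exponential curve)
no_jitter = [backoff_delay(r, 30, 300, rng=lambda: 0.0) for r in range(4)]
```

With real jitter each worker picks a slightly different delay, so retries from a mass failure spread out instead of hammering the provider in lockstep.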
Dead Letter Queue (DLQ) Processing
Messages in the DLQ are handled by manual review or automated recovery:
- Invalid device token → remove the token from Redis, mark the user's device inactive
- Provider outage → bulk re-process once the provider recovers
- Permanent failure → log, alert, move to archive
3.5 Push Notification — FCM/APNs Integration
Device Token Management
Each user can have multiple devices, and each device has a device token (registration token) issued by FCM or APNs.
Redis Hash — device tokens per user:
Key: devices:{user_id}
Fields:
{device_id_1}: {"token": "fcm_token_abc", "platform": "android", "app_version": "3.2.1", "last_active": "2026-03-17T10:00:00Z"}
{device_id_2}: {"token": "apns_token_xyz", "platform": "ios", "app_version": "3.1.0", "last_active": "2026-03-18T08:30:00Z"}
Token Lifecycle
| Event | Action |
|---|---|
| User installs app | Register device token → Redis |
| User opens app | Refresh the token (tokens can change) |
| Push fails with InvalidRegistration (FCM) or BadDeviceToken (APNs) | Remove the token from Redis |
| User uninstalls app | Token becomes invalid → detected on the next failed push |
| User logs out | Dissociate the token from the user (but keep it — re-associate on the next login) |
Silent Push (Data-only Push)
Not every push shows a notification banner. Silent push is used to:
- Trigger background data sync
- Update badge count
- Invalidate local cache
// FCM silent push payload
{
"to": "device_token",
"data": {
"type": "badge_update",
"unread_count": 5
}
// No "notification" key → silent push
}

3.6 Email — Deliverability & Integration
Email Architecture
Notification Service → Email Worker → Email Sending Service (SES/SendGrid)
↓
SMTP Relay / API
↓
Recipient ISP (Gmail, Outlook, Yahoo)
↓
User's Inbox (or Spam folder!)
Deliverability Essentials — SPF, DKIM, DMARC
| Protocol | Purpose | DNS Record |
|---|---|---|
| SPF (Sender Policy Framework) | Declares which IPs may send email for the domain | TXT "v=spf1 include:amazonses.com -all" |
| DKIM (DomainKeys Identified Mail) | Signs email with a private key; receivers verify with the public key published in DNS | TXT record containing the DKIM public key |
| DMARC (Domain-based Message Auth) | Policy when SPF/DKIM fail: none, quarantine, reject | TXT "v=DMARC1; p=reject; rua=mailto:dmarc@example.com" |
Aha Moment #2: If SPF + DKIM + DMARC aren't set up correctly, notification emails land in the spam folder. And once the domain gets blacklisted by an ISP, recovery takes weeks. This is why email deliverability is a skill of its own.
Bounce Handling
| Bounce Type | Example | Action |
|---|---|---|
| Hard bounce | Email address doesn't exist (550) | Remove the email from the user profile; NEVER send to it again |
| Soft bounce | Mailbox full (452), temporary server error | Retry 3 times, then mark as soft-bounced |
| Complaint | User clicks “Report Spam” | Unsubscribe immediately, log the complaint |
ISPs track bounce and complaint rates. If the bounce rate exceeds 5% or the complaint rate exceeds 0.1%, the domain gets throttled or blacklisted.
SES/SendGrid Integration
- Use a dedicated IP instead of a shared IP (control your own reputation)
- IP warming: a new IP starts slowly (100 emails/day → ramp up → full volume over 4-6 weeks)
- Feedback loop: SES SNS notifications for bounces/complaints → auto-update user preferences
3.7 SMS — Cost Optimization
SMS Pricing Reality
| Route | Price/SMS (USD) | Cost at 50M SMS/day |
|---|---|---|
| US (short code) | $0.0075 | $375,000/day |
| US (long code) | $0.0050 | $250,000/day |
| Vietnam | $0.04 | $2,000,000/day |
| India | $0.003 | $150,000/day |
Aha Moment #3: SMS is the most expensive channel — thousands of times the cost of a push notification (which is nearly free). That's why SMS should be reserved for high-value notifications (OTP, security, payment).
Cost Optimization Strategies
- Channel preference cascade: push first → email → SMS (only if urgent and the user has no push token)
- Smart routing: short code for urgent US traffic (higher throughput at $0.0075); long code ($0.0050) for the rest — cheaper but slower
- SMS content optimization: GSM-7 encoding (160 chars/segment) instead of Unicode (70 chars/segment)
- Batching: aggregate similar SMS (1 combined SMS instead of 3 separate ones)
- Geographic routing: a local provider per region (Twilio US, MessageBird EU, a local provider for VN)
3.8 Deduplication — Exactly-Once Delivery
Why dedup?
- Network failure after a successful send → producer retries → user receives it twice
- Kafka consumer crashes after sending but before committing the offset → re-process → duplicate
- A marketing campaign accidentally triggered twice → user receives a double email
Idempotency Key Design
Idempotency key = hash(event_source + event_id + user_id + channel + notification_type)
Example:
event_source: "order_service"
event_id: "order_12345_shipped"
user_id: "user_67890"
channel: "push"
notification_type: "order_shipped"
Key: SHA256("order_service:order_12345_shipped:user_67890:push:order_shipped")
= "a1b2c3d4..."
Redis SET with NX + TTL (note: plain SETNX does not accept EX — use SET with the NX flag):
SET dedup:{idempotency_key} 1 NX EX 86400 (24h TTL)
→ If OK (key didn't exist) → process notification
→ If nil (key already exists) → skip (duplicate)
Trade-off: too short a TTL → missed duplicates. Too long → Redis memory bloat. 24h is a sweet spot for most use cases.
3.9 Analytics — Measuring Notification Effectiveness
Metrics Pipeline
Notification sent → Delivery callback → Open/Read event → Click event → Conversion event
↓ ↓ ↓ ↓ ↓
SENT DELIVERED OPENED CLICKED CONVERTED
Key Metrics
| Metric | Formula | Target | How measured |
|---|---|---|---|
| Delivery Rate | delivered / sent | > 95% (push), > 98% (email) | Provider callback |
| Open Rate | opened / delivered | > 15% (push), > 20% (email) | Tracking pixel (email), app open event (push) |
| Click-through Rate (CTR) | clicked / opened | > 5% | Redirect URL tracking |
| Opt-out Rate | unsubscribed / delivered | < 0.5% | Unsubscribe link click |
| Conversion Rate | converted / clicked | Varies by campaign | Deep link attribution |
A/B Testing
- Split users into buckets (control vs variant)
- Test: notification title, body, send time, channel
- Track metrics per variant → pick winner
- Example: “Đơn hàng đã gửi!” vs “Đơn hàng đang trên đường đến bạn!” → which has higher open rate?
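Bucket assignment must be deterministic per (user, experiment) so a user always sees the same variant; a common sketch hashes the two together (function names and the 100-bucket granularity are assumptions):

```python
import hashlib

def bucket(user_id: str, experiment: str, n_buckets: int = 100) -> int:
    """Stable bucket in [0, n_buckets): same user + experiment → same bucket."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

def variant(user_id: str, experiment: str, control_pct: int = 50) -> str:
    return "control" if bucket(user_id, experiment) < control_pct else "variant"
```

Hashing the experiment name into the key re-shuffles users across experiments, so being in one test's control group doesn't bias assignment in the next test.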
3.10 User Preference Store
Redis Hash Design
Key: prefs:{user_id}
Fields:
push_enabled: "true"
email_enabled: "true"
sms_enabled: "false"
push_marketing: "false" # opt-out marketing push
push_social: "true" # opt-in social push
push_transactional: "true" # opt-in transactional push
email_marketing: "false"
email_transactional: "true"
quiet_hours_start: "22:00" # local time
quiet_hours_end: "08:00"
timezone: "Asia/Ho_Chi_Minh"
locale: "vi"
Opt-out Handling — Legal Compliance
| Regulation | Requirement | Implementation |
|---|---|---|
| CAN-SPAM (US, Email) | Unsubscribe link in every email, processed within 10 days | List-Unsubscribe header, one-click unsubscribe |
| GDPR (EU) | Explicit consent, right to erasure | Double opt-in, preference deletion API |
| TCPA (US, SMS) | Explicit written consent before sending marketing SMS | Consent record + timestamp in the DB |
3.11 Batch Notification — Aggregation
Problem
A user gets 10 likes on a post within 5 minutes → send 10 notifications? No!
Solution: Event Aggregation
Window: 5 minutes
Events:
10:00 - Alice liked your post
10:01 - Bob liked your post
10:03 - Charlie liked your post
Aggregated notification (at 10:05):
"Alice, Bob, Charlie liked your post"
If > 3 people:
"Alice, Bob, and 8 others liked your post"
Implementation
- Tumbling window (fixed 5-min intervals): Simple, predictable
- Session window (gap-based): Send when no new events for 2 minutes
- Storage: Redis Sorted Set per user per event type, score = timestamp
- Aggregation worker runs every window interval, collects events, sends single notification
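The aggregated message text itself follows the rule in the example above: list up to three actors, then collapse the rest (a sketch — window collection via the Redis sorted set is assumed to have happened upstream):

```python
def aggregate_like_text(actors: list, action: str = "liked your post") -> str:
    """'Alice, Bob, Charlie liked your post' or 'Alice, Bob, and 8 others ...'."""
    if not actors:
        raise ValueError("nothing to aggregate")
    if len(actors) <= 3:
        return f"{', '.join(actors)} {action}"
    # Keep the two most recognizable names, collapse the remainder into a count
    return f"{actors[0]}, {actors[1]}, and {len(actors) - 2} others {action}"
```

Ordering `actors` by social proximity (mutual friends first) before truncating makes the two surviving names far more likely to drive an open.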
3.12 Timezone-Aware Delivery
Problem
Marketing notification scheduled for 9:00 AM — but user is in which timezone?
Solution
User timezone: "Asia/Ho_Chi_Minh" (UTC+7)
Scheduled send: 9:00 AM local time
Server calculation:
UTC time = 9:00 AM - 7h = 2:00 AM UTC
For user in "America/New_York" (UTC-4):
UTC time = 9:00 AM + 4h = 1:00 PM UTC
- Quiet hours: Do not send between 22:00 - 08:00 local time (buffer in queue, send at 08:00)
- Exception: URGENT priority ignores quiet hours (OTP/security alert at 3 AM is fine)
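The same calculation with `zoneinfo`, which also absorbs the DST shifts that manual ±offset arithmetic misses (a sketch; assumes IANA tzdata is available on the host):

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo

def utc_send_time(day: date, user_tz: str, local_hour: int = 9) -> datetime:
    """UTC instant at which local_hour:00 occurs in user_tz on the given day."""
    local = datetime.combine(day, time(local_hour), tzinfo=ZoneInfo(user_tz))
    return local.astimezone(timezone.utc)

hcm = utc_send_time(date(2026, 3, 18), "Asia/Ho_Chi_Minh")  # 02:00 UTC (UTC+7)
nyc = utc_send_time(date(2026, 3, 18), "America/New_York")  # 13:00 UTC (EDT, UTC-4)
```

Note New York resolves to UTC-4 here only because mid-March falls inside daylight saving time; hardcoding the offset would break twice a year.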
4. Security — Notification-Specific Threats
4.1 Notification Spoofing Prevention
Threat: an attacker sends fake notification requests impersonating an internal service.
Mitigations:
- Service-to-service authentication: mTLS or JWT with a service identity
- API key per producer: each service gets its own rate-limited API key
- Request signing: HMAC signature over the payload → the Notification Service verifies it before processing
- Allow-list: only registered event types from registered producer services are accepted
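The request-signing mitigation can be sketched with stdlib `hmac` (the canonical JSON serialization is an assumption; a real scheme would also sign a timestamp/nonce to block replays):

```python
import hashlib
import hmac
import json

def sign_payload(payload: dict, secret: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_payload(payload: dict, signature: str, secret: bytes) -> bool:
    """Constant-time comparison — never compare signatures with ==."""
    return hmac.compare_digest(sign_payload(payload, secret), signature)
```

The producer sends the signature in a header; the Notification Service recomputes it with that producer's shared secret, so a tampered payload or an unknown sender fails verification.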
4.2 Phishing Detection in Email
Threat: a compromised internal service sends phishing email through the notification system.
Mitigations:
- Template-only policy: email content MUST use pre-approved templates. No arbitrary HTML allowed.
- URL allowlist: links in email may only point to approved domains
- Content scanning: scan email content for known phishing patterns before sending
- DMARC enforcement: `p=reject` → mail spoofing the domain is rejected by receiving ISPs
4.3 PII in Notifications
Threat: the push notification preview on the lock screen can leak sensitive information.
Mitigations:
- Content masking: “You received a transfer of ****5,000,000 VND” instead of the full details
- Hidden preview: iOS/Android can hide notification content on the lock screen
- Encrypted push payload: encrypt the payload; the app decrypts it once the user unlocks the device
4.4 Encrypted Push Payload
// Standard push (visible on lock screen):
{
"title": "Chuyển khoản thành công",
"body": "Bạn nhận 5,000,000 VND từ Nguyễn Văn A" ← PII leak!
}
// Encrypted push:
{
"data": {
"encrypted_payload": "aes256gcm:iv:ciphertext:tag",
"key_id": "user_key_v3"
}
// App decrypts after biometric auth
}
5. DevOps — Production Operations
5.1 Kafka Configuration for Notification Ingestion
# kafka-notification-topics.yml
topics:
- name: notification.push.urgent
partitions: 3
replication_factor: 3
config:
retention.ms: 86400000 # 24h
min.insync.replicas: 2 # Strong durability for urgent
max.message.bytes: 1048576 # 1MB max
- name: notification.push.normal
partitions: 6
replication_factor: 3
config:
retention.ms: 259200000 # 72h
min.insync.replicas: 2
- name: notification.email.normal
partitions: 6
replication_factor: 3
config:
retention.ms: 259200000
min.insync.replicas: 2
- name: notification.sms.urgent
partitions: 3
replication_factor: 3
config:
retention.ms: 86400000
min.insync.replicas: 2
- name: notification.dlq
partitions: 3
replication_factor: 3
config:
retention.ms: 604800000 # 7 days — manual review window
min.insync.replicas: 2

5.2 Worker Autoscaling
# k8s-hpa-push-workers.yml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: push-worker-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: push-worker
minReplicas: 20 # Always-on baseline
maxReplicas: 200 # Peak capacity
metrics:
- type: External
external:
metric:
name: kafka_consumergroup_lag
selector:
matchLabels:
topic: notification.push.normal
target:
type: AverageValue
averageValue: "1000" # Scale up when lag > 1000 per pod
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70  # autoscaling/v2 field name (targetAverageUtilization is the old API)
behavior:
scaleUp:
stabilizationWindowSeconds: 60 # React fast to burst
policies:
- type: Percent
value: 100
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300 # Slow scale-down to avoid flapping
policies:
- type: Percent
value: 25
periodSeconds: 120

5.3 Monitoring & Alerting
# prometheus-notification-alerts.yml
groups:
- name: notification_delivery
rules:
- alert: DeliveryRateDropped
expr: |
(
sum(rate(notification_delivered_total[5m]))
/
sum(rate(notification_sent_total[5m]))
) < 0.90
for: 5m
labels:
severity: critical
annotations:
summary: "Notification delivery rate dropped below 90%"
description: "Current delivery rate: {{ $value | humanizePercentage }}"
- alert: UrgentQueueLagHigh
expr: kafka_consumergroup_lag{topic=~"notification.*.urgent"} > 5000
for: 1m
labels:
severity: critical
annotations:
summary: "Urgent notification queue lag is high ({{ $value }})"
- alert: DLQGrowing
expr: rate(kafka_topic_messages_in_total{topic="notification.dlq"}[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "Dead letter queue receiving {{ $value }} msg/s"
- alert: SMSCostSpike
expr: |
sum(rate(sms_sent_total[1h])) * 0.0075 * 24 > 100000
for: 15m
labels:
severity: critical
annotations:
summary: "Projected daily SMS cost exceeds $100K"
- alert: EmailBounceRateHigh
expr: |
sum(rate(email_bounce_total{type="hard"}[1h]))
/
sum(rate(email_sent_total[1h]))
> 0.05
for: 10m
labels:
severity: critical
annotations:
summary: "Email hard bounce rate exceeds 5% — risk of domain blacklist"
- alert: PushTokenInvalidRate
expr: |
sum(rate(push_token_invalid_total[1h]))
/
sum(rate(push_sent_total[1h]))
> 0.10
for: 10m
labels:
severity: warning
annotations:
summary: "10%+ push tokens are invalid — device token cleanup needed"

5.4 Grafana Dashboard Panels
| Panel | PromQL | Threshold |
|---|---|---|
| Notifications/sec by channel | sum(rate(notification_sent_total[1m])) by (channel) | Compare vs estimation |
| Delivery rate by channel | delivered / sent by channel | Push > 95%, Email > 98% |
| P99 notification latency | histogram_quantile(0.99, notification_e2e_duration_seconds_bucket) | Urgent < 3s |
| Kafka consumer lag | kafka_consumergroup_lag by topic | Urgent < 1K, Normal < 100K |
| DLQ message count | kafka_topic_partition_current_offset{topic="notification.dlq"} | < 1000 |
| SMS daily cost projection | sum(sms_sent_total) * cost_per_sms * (86400/elapsed) | < $50K/day |
| Email bounce rate | hard_bounce / sent | < 2% |
| Opt-out rate (rolling 7d) | unsubscribe_total / delivered_total | < 0.5% |
5.5 SES/SendGrid Operational Notes
- Dedicated IP pool: separate IPs for transactional vs marketing email (separate reputations)
- IP warming schedule: Day 1: 200 emails → Day 7: 10K → Day 14: 50K → Day 30: full volume
- Suppression list: SES auto-maintains bounce/complaint list — sync to internal preference store
- Sending quota: SES default = 200 emails/sec. Request increase gradually.
- Dashboard: monitor the SES console for sending statistics, bounce/complaint rates, and the reputation dashboard
6. Code Examples
6.1 Python: Notification Service with Priority Queue
"""
Notification Service — Core processing pipeline
Handles validation, deduplication, preference check, template rendering,
rate limiting, and queue routing.
"""
import hashlib
import json
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import redis
from confluent_kafka import Producer
class Channel(Enum):
PUSH = "push"
EMAIL = "email"
SMS = "sms"
class Priority(Enum):
URGENT = "urgent"
NORMAL = "normal"
LOW = "low"
@dataclass
class NotificationRequest:
event_source: str
event_id: str
user_id: str
channels: list[Channel]
notification_type: str
priority: Priority
template_params: dict
idempotency_key: Optional[str] = None
scheduled_at: Optional[float] = None
def __post_init__(self):
if not self.idempotency_key:
# Auto-generate idempotency key
raw = f"{self.event_source}:{self.event_id}:{self.user_id}"
self.idempotency_key = hashlib.sha256(raw.encode()).hexdigest()
@dataclass
class NotificationMessage:
"""Message published to Kafka after processing."""
notification_id: str
user_id: str
channel: Channel
priority: Priority
rendered_content: dict
device_tokens: list[str] = field(default_factory=list)
email_address: str = ""
phone_number: str = ""
retry_count: int = 0
created_at: float = field(default_factory=time.time)
class NotificationService:
"""
Core notification processing pipeline.
Flow: Validate → Dedup → Check Preferences → Render Template
→ Rate Limit → Route to Queue
"""
DEDUP_TTL = 86400 # 24 hours
RATE_LIMITS = {
Channel.PUSH: {"window": 3600, "max_count": 10}, # 10/hour
Channel.EMAIL: {"window": 86400, "max_count": 5}, # 5/day
Channel.SMS: {"window": 86400, "max_count": 2}, # 2/day
}
def __init__(self, redis_client: redis.Redis, kafka_producer: Producer):
self.redis = redis_client
self.producer = kafka_producer
def process(self, request: NotificationRequest) -> dict:
"""Main entry point — process a notification request."""
results = {}
# Step 1: Deduplication check
if self._is_duplicate(request.idempotency_key):
return {"status": "skipped", "reason": "duplicate"}
for channel in request.channels:
result = self._process_channel(request, channel)
results[channel.value] = result
return {"status": "processed", "channels": results}
def _is_duplicate(self, idempotency_key: str) -> bool:
"""Check Redis for duplicate using SETNX."""
key = f"dedup:{idempotency_key}"
was_set = self.redis.set(key, 1, nx=True, ex=self.DEDUP_TTL)
return not was_set # None = key existed = duplicate
def _process_channel(self, request: NotificationRequest, channel: Channel) -> dict:
# Step 2: Check user preferences
if not self._check_preference(request.user_id, channel, request.notification_type):
return {"status": "skipped", "reason": "user_opted_out"}
# Step 3: Check quiet hours (skip for URGENT)
if request.priority != Priority.URGENT:
if self._in_quiet_hours(request.user_id):
return {"status": "deferred", "reason": "quiet_hours"}
# Step 4: Render template
locale = self._get_user_locale(request.user_id)
content = self._render_template(
request.notification_type, channel, locale, request.template_params
)
# Step 5: Rate limit check (URGENT bypasses)
if request.priority != Priority.URGENT:
if not self._check_rate_limit(request.user_id, channel):
return {"status": "rate_limited", "reason": "too_many_notifications"}
# Step 6: Build message and publish to Kafka
message = NotificationMessage(
notification_id=f"{request.idempotency_key}:{channel.value}",
user_id=request.user_id,
channel=channel,
priority=request.priority,
rendered_content=content,
)
# Enrich with delivery target
if channel == Channel.PUSH:
message.device_tokens = self._get_device_tokens(request.user_id)
if not message.device_tokens:
return {"status": "skipped", "reason": "no_device_tokens"}
elif channel == Channel.EMAIL:
message.email_address = self._get_email(request.user_id)
elif channel == Channel.SMS:
message.phone_number = self._get_phone(request.user_id)
topic = f"notification.{channel.value}.{request.priority.value}"
self._publish(topic, message)
return {"status": "queued", "topic": topic}
def _check_preference(self, user_id: str, channel: Channel, notif_type: str) -> bool:
"""Check user preference from Redis hash."""
prefs = self.redis.hgetall(f"prefs:{user_id}")
if not prefs:
return True # Default: all enabled
# Check channel-level opt-out
channel_key = f"{channel.value}_enabled"
if prefs.get(channel_key, b"true") == b"false":
return False
# Check type-level opt-out (e.g., push_marketing)
type_key = f"{channel.value}_{notif_type}"
if prefs.get(type_key, b"true") == b"false":
return False
return True
    def _in_quiet_hours(self, user_id: str) -> bool:
        """Check if the current time falls within the user's quiet hours."""
        from datetime import datetime   # local imports keep this snippet self-contained
        from zoneinfo import ZoneInfo
        prefs = self.redis.hgetall(f"prefs:{user_id}")
        quiet_start = prefs.get(b"quiet_hours_start", b"").decode()  # e.g. "22:00"
        quiet_end = prefs.get(b"quiet_hours_end", b"").decode()      # e.g. "07:00"
        if not quiet_start or not quiet_end:
            return False
        tz = prefs.get(b"timezone", b"UTC").decode()
        # Convert current UTC time to the user's local wall-clock time
        now_local = datetime.now(ZoneInfo(tz)).strftime("%H:%M")
        if quiet_start <= quiet_end:
            return quiet_start <= now_local < quiet_end
        # Window crosses midnight, e.g. 22:00 → 07:00
        return now_local >= quiet_start or now_local < quiet_end
    def _check_rate_limit(self, user_id: str, channel: Channel) -> bool:
        """Fixed-window rate limit using Redis INCR + TTL (one counter per time bucket)."""
config = self.RATE_LIMITS[channel]
window = int(time.time()) // config["window"]
key = f"rate:{user_id}:{channel.value}:{window}"
count = self.redis.incr(key)
if count == 1:
self.redis.expire(key, config["window"])
return count <= config["max_count"]
def _get_user_locale(self, user_id: str) -> str:
locale = self.redis.hget(f"prefs:{user_id}", "locale")
return locale.decode() if locale else "en"
def _render_template(
self, notif_type: str, channel: Channel, locale: str, params: dict
) -> dict:
"""
Render notification template with parameters.
In production: load template from S3/cache, use Jinja2/Handlebars.
"""
# Simplified — in production this loads from template store
templates = {
("order_shipped", Channel.PUSH, "vi"): {
"title": "Don hang da gui!",
"body": "Don #{order_id} dang tren duong den ban.",
},
("order_shipped", Channel.PUSH, "en"): {
"title": "Order Shipped!",
"body": "Order #{order_id} is on its way.",
},
}
template = templates.get((notif_type, channel, locale), {})
rendered = {}
for key, value in template.items():
for param_name, param_value in params.items():
value = value.replace(f"{{{param_name}}}", str(param_value))
rendered[key] = value
return rendered
def _get_device_tokens(self, user_id: str) -> list[str]:
devices = self.redis.hgetall(f"devices:{user_id}")
tokens = []
for device_id, device_data in devices.items():
data = json.loads(device_data)
tokens.append(data["token"])
return tokens
    def _get_email(self, user_id: str) -> str:
        email = self.redis.hget(f"user:{user_id}", "email")
        return email.decode() if email else ""

    def _get_phone(self, user_id: str) -> str:
        phone = self.redis.hget(f"user:{user_id}", "phone")
        return phone.decode() if phone else ""
def _publish(self, topic: str, message: NotificationMessage):
payload = json.dumps({
"notification_id": message.notification_id,
"user_id": message.user_id,
"channel": message.channel.value,
"priority": message.priority.value,
"content": message.rendered_content,
"device_tokens": message.device_tokens,
"email_address": message.email_address,
"phone_number": message.phone_number,
"retry_count": message.retry_count,
"created_at": message.created_at,
})
self.producer.produce(topic, value=payload.encode("utf-8"))
        self.producer.flush()
6.2 Python: FCM Push Notification Sender
"""
Push Notification Worker — Consumes from Kafka, sends via FCM/APNs.
Handles retries with exponential backoff.
"""
import json
import time
import random
import logging
from dataclasses import dataclass
import requests
from confluent_kafka import Consumer, Producer
logger = logging.getLogger(__name__)
@dataclass
class RetryConfig:
max_retries: int
base_delay_seconds: float
max_delay_seconds: float
RETRY_CONFIGS = {
"urgent": RetryConfig(max_retries=5, base_delay_seconds=1, max_delay_seconds=30),
"normal": RetryConfig(max_retries=3, base_delay_seconds=30, max_delay_seconds=300),
"low": RetryConfig(max_retries=3, base_delay_seconds=300, max_delay_seconds=3600),
}
class FCMPushSender:
"""Send push notifications via Firebase Cloud Messaging HTTP v1 API."""
FCM_URL = "https://fcm.googleapis.com/v1/projects/{project_id}/messages:send"
def __init__(self, project_id: str, service_account_token: str):
self.url = self.FCM_URL.format(project_id=project_id)
self.headers = {
"Authorization": f"Bearer {service_account_token}",
"Content-Type": "application/json",
}
def send(self, device_token: str, title: str, body: str, data: dict = None) -> dict:
"""
Send push notification to a single device via FCM.
Returns: {"success": True, "message_id": "..."} or {"success": False, "error": "..."}
"""
payload = {
"message": {
"token": device_token,
"notification": {
"title": title,
"body": body,
},
}
}
if data:
payload["message"]["data"] = {k: str(v) for k, v in data.items()}
try:
response = requests.post(
self.url, headers=self.headers, json=payload, timeout=5
)
if response.status_code == 200:
return {"success": True, "message_id": response.json().get("name")}
elif response.status_code == 404:
# Invalid device token — permanent failure
return {
"success": False,
"error": "INVALID_TOKEN",
"retriable": False,
}
elif response.status_code == 429:
# Rate limited by FCM
return {
"success": False,
"error": "RATE_LIMITED",
"retriable": True,
}
else:
return {
"success": False,
"error": f"HTTP_{response.status_code}",
"retriable": response.status_code >= 500,
}
except requests.exceptions.Timeout:
return {"success": False, "error": "TIMEOUT", "retriable": True}
except requests.exceptions.ConnectionError:
return {"success": False, "error": "CONNECTION_ERROR", "retriable": True}
class PushWorker:
"""
Kafka consumer worker that processes push notification messages.
Handles retry with exponential backoff and DLQ routing.
"""
def __init__(
self,
consumer: Consumer,
producer: Producer,
fcm_sender: FCMPushSender,
redis_client,
):
self.consumer = consumer
self.producer = producer
self.fcm = fcm_sender
self.redis = redis_client
def run(self):
"""Main consumer loop."""
topics = [
"notification.push.urgent",
"notification.push.normal",
"notification.push.low",
]
self.consumer.subscribe(topics)
logger.info(f"Push worker subscribed to {topics}")
while True:
msg = self.consumer.poll(timeout=1.0)
if msg is None:
continue
if msg.error():
logger.error(f"Consumer error: {msg.error()}")
continue
try:
self._process_message(json.loads(msg.value().decode("utf-8")))
self.consumer.commit(msg)
except Exception as e:
logger.exception(f"Failed to process message: {e}")
def _process_message(self, message: dict):
"""Process a single notification message."""
priority = message["priority"]
retry_count = message.get("retry_count", 0)
retry_config = RETRY_CONFIGS[priority]
for token in message["device_tokens"]:
result = self.fcm.send(
device_token=token,
title=message["content"].get("title", ""),
body=message["content"].get("body", ""),
data={"notification_id": message["notification_id"]},
)
if result["success"]:
logger.info(
f"Push sent: {message['notification_id']} → {token[:20]}..."
)
self._update_status(message["notification_id"], "SENT")
elif not result.get("retriable", False):
# Permanent failure — e.g., invalid token
logger.warning(
f"Permanent failure for token {token[:20]}: {result['error']}"
)
self._remove_invalid_token(message["user_id"], token)
self._update_status(message["notification_id"], "FAILED")
elif retry_count < retry_config.max_retries:
# Retriable failure — schedule retry
delay = self._calculate_backoff(
retry_count, retry_config.base_delay_seconds,
retry_config.max_delay_seconds,
)
logger.info(
f"Scheduling retry #{retry_count + 1} in {delay:.1f}s "
f"for {message['notification_id']}"
)
message["retry_count"] = retry_count + 1
message["scheduled_at"] = time.time() + delay
topic = f"notification.push.{priority}"
self.producer.produce(
topic, value=json.dumps(message).encode("utf-8")
)
self.producer.flush()
else:
# Exhausted retries — send to DLQ
logger.error(
f"Max retries exhausted for {message['notification_id']}"
)
self.producer.produce(
"notification.dlq",
value=json.dumps(message).encode("utf-8"),
)
self.producer.flush()
self._update_status(message["notification_id"], "DLQ")
@staticmethod
def _calculate_backoff(
retry_count: int, base_delay: float, max_delay: float
) -> float:
"""Exponential backoff with jitter."""
delay = min(base_delay * (2 ** retry_count), max_delay)
jitter = random.uniform(0, 1.0) # Up to 1 second jitter
return delay + jitter
def _remove_invalid_token(self, user_id: str, token: str):
"""Remove invalid device token from Redis."""
devices = self.redis.hgetall(f"devices:{user_id}")
for device_id, device_data in devices.items():
data = json.loads(device_data)
if data.get("token") == token:
self.redis.hdel(f"devices:{user_id}", device_id)
logger.info(f"Removed invalid token for user {user_id}")
break
def _update_status(self, notification_id: str, status: str):
"""Update notification status in log (Cassandra in production)."""
# In production: write to Cassandra notification_log table
logger.info(f"Status update: {notification_id} → {status}")6.3 Python: Email Sender with Retry and Bounce Handling
"""
Email Worker — Sends emails via Amazon SES with bounce handling.
"""
import json
import logging
from dataclasses import dataclass
import boto3
from botocore.exceptions import ClientError
logger = logging.getLogger(__name__)
@dataclass
class EmailResult:
success: bool
message_id: str = ""
error: str = ""
error_type: str = "" # "hard_bounce", "soft_bounce", "throttle", "other"
retriable: bool = False
class SESEmailSender:
"""Send emails via Amazon SES with proper error categorization."""
def __init__(self, region: str = "us-east-1", config_set: str = "notification"):
self.client = boto3.client("ses", region_name=region)
self.config_set = config_set
def send(
self,
to_address: str,
subject: str,
html_body: str,
from_address: str = "noreply@notifications.example.com",
reply_to: str = None,
headers: dict = None,
) -> EmailResult:
"""Send a single email via SES."""
try:
message = {
"Subject": {"Data": subject, "Charset": "UTF-8"},
"Body": {
"Html": {"Data": html_body, "Charset": "UTF-8"},
},
}
kwargs = {
"Source": from_address,
"Destination": {"ToAddresses": [to_address]},
"Message": message,
"ConfigurationSetName": self.config_set,
}
if reply_to:
kwargs["ReplyToAddresses"] = [reply_to]
# Add List-Unsubscribe header for CAN-SPAM compliance
if headers and "List-Unsubscribe" in headers:
kwargs["Tags"] = [
{"Name": "unsubscribe_url", "Value": headers["List-Unsubscribe"]}
]
response = self.client.send_email(**kwargs)
return EmailResult(
success=True,
message_id=response["MessageId"],
)
except ClientError as e:
error_code = e.response["Error"]["Code"]
if error_code == "MessageRejected":
# Typically: email address on suppression list
return EmailResult(
success=False,
error=str(e),
error_type="hard_bounce",
retriable=False,
)
elif error_code == "Throttling":
return EmailResult(
success=False,
error="SES rate limit exceeded",
error_type="throttle",
retriable=True,
)
elif error_code == "AccountSendingPausedException":
# SES suspended sending — critical alert needed
logger.critical("SES account sending is paused!")
return EmailResult(
success=False,
error="SES sending paused",
error_type="other",
retriable=False,
)
else:
return EmailResult(
success=False,
error=str(e),
error_type="other",
retriable=True,
)
except Exception as e:
return EmailResult(
success=False,
error=str(e),
error_type="other",
retriable=True,
)
class BounceHandler:
"""
Process SES bounce/complaint notifications via SNS webhook.
Updates user preferences to prevent future sends to invalid addresses.
"""
def __init__(self, redis_client):
self.redis = redis_client
def handle_sns_notification(self, sns_message: dict):
"""Process SNS notification from SES feedback."""
notif_type = sns_message.get("notificationType")
if notif_type == "Bounce":
self._handle_bounce(sns_message["bounce"])
elif notif_type == "Complaint":
self._handle_complaint(sns_message["complaint"])
elif notif_type == "Delivery":
self._handle_delivery(sns_message["delivery"])
def _handle_bounce(self, bounce: dict):
bounce_type = bounce["bounceType"] # "Permanent" or "Transient"
for recipient in bounce["bouncedRecipients"]:
email = recipient["emailAddress"]
if bounce_type == "Permanent":
# Hard bounce — disable email for this user
logger.warning(f"Hard bounce for {email} — disabling email")
self._disable_email_for_user(email)
else:
# Soft bounce — log, but don't disable yet
logger.info(f"Soft bounce for {email} — monitoring")
def _handle_complaint(self, complaint: dict):
"""User clicked 'Report Spam' — immediately unsubscribe."""
for recipient in complaint["complainedRecipients"]:
email = recipient["emailAddress"]
logger.warning(f"Spam complaint from {email} — unsubscribing immediately")
self._disable_email_for_user(email)
def _handle_delivery(self, delivery: dict):
"""Successful delivery confirmation."""
for recipient in delivery["recipients"]:
logger.info(f"Email delivered to {recipient}")
def _disable_email_for_user(self, email: str):
"""
Find user by email and disable email notifications.
In production: lookup user by email in DB, then update Redis prefs.
"""
# Simplified — in production: DB lookup
# user_id = db.query("SELECT id FROM users WHERE email = ?", email)
# self.redis.hset(f"prefs:{user_id}", "email_enabled", "false")
        pass
7. Aha Moments — Summary
#1 — Template versioning: When a template is updated, notifications already in the queue still use the old version. Always embed template_version in the message; otherwise users can receive notifications in a broken format when a template changes mid-flight.
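As a concrete illustration, here is a minimal, hypothetical sketch of pinning the template version at enqueue time; the names (TEMPLATE_STORE, CURRENT_VERSION, enqueue, render) are illustrative and not part of the system above:

```python
import json

# Hypothetical template store keyed by (type, version)
TEMPLATE_STORE = {
    ("order_shipped", "v1"): "Order #{order_id} is on its way.",
    ("order_shipped", "v2"): "Good news! Order #{order_id} has shipped.",
}
CURRENT_VERSION = {"order_shipped": "v2"}

def enqueue(notif_type: str, params: dict) -> str:
    """Embed template_version in the queued message at enqueue time."""
    return json.dumps({
        "type": notif_type,
        "template_version": CURRENT_VERSION[notif_type],  # pinned here
        "params": params,
    })

def render(message: str) -> str:
    """Worker renders with the pinned version, not whatever is current now."""
    msg = json.loads(message)
    body = TEMPLATE_STORE[(msg["type"], msg["template_version"])]
    for k, v in msg["params"].items():
        body = body.replace(f"{{{k}}}", str(v))
    return body
```

Because the worker looks up the pinned version, a template update after enqueue cannot change how an in-flight message renders.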
#2 — Email deliverability is its own ecosystem: SPF + DKIM + DMARC + dedicated IP + IP warming + bounce handling + complaint handling. Skip any one step → email lands in spam. Recovering from a blacklist takes weeks.
#3 — SMS is the most expensive channel: Push notifications are essentially free (FCM/APNs do not charge per message). SMS costs on the order of $0.01-0.04 per message depending on route; at 50M SMS/day the bill runs from hundreds of thousands to millions of USD per day. This is why the preference cascade (push > email > SMS) is mandatory.
#4 — Deduplication window trade-off: TTL too short → duplicates slip through and annoy users. TTL too long → Redis memory bloat. 24h with SHA-256 keys (32 bytes raw) for 500M notifications ≈ 16 GB of Redis → acceptable.
#5 — Separate queues per priority: With a single queue plus a priority field, a marketing blast will delay OTPs. Separate queues with dedicated worker pools give resource isolation: urgent notifications are never affected by bulk sends.
#6 — Quiet hours + timezone: A marketing notification at 3 AM = rage uninstall. But an OTP at 3 AM MUST go out. A priority-based quiet-hours bypass is the kind of subtle design interviewers appreciate.
#7 — Event aggregation reduces notification fatigue: “10 people liked your post” (1 notification) >> 10 x “X liked your post” (10 notifications). A 5-minute aggregation window backed by a Redis Sorted Set is a production-ready pattern.
8. Common Pitfalls
Pitfall 1: Notification Fatigue
Wrong: Send a notification for every event — “A liked your post”, “B liked your post”, “C liked your post”… Right: Aggregate events. Rate limit per user. Respect quiet hours. Give users granular opt-outs (per type, per channel). Every notification must earn the right to be sent.
Pitfall 2: Thundering Herd on Mass Notification
Wrong: A marketing campaign fires 100M push notifications at once → a 100x spike over normal QPS → Kafka lag, FCM rate limiting, worker crashes. Right: Staggered delivery — spread the 100M notifications over 30-60 minutes. Rate limit marketing sends at the global level. Use separate Kafka partitions for marketing vs transactional traffic.
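The staggering math can be sketched as a simple even spread; the function name and batch model here are illustrative assumptions:

```python
def stagger_schedule(total: int, window_seconds: float,
                     batch_size: int) -> list[tuple[float, int]]:
    """Split `total` sends into batches spread evenly across `window_seconds`.
    Returns (offset_seconds, count) pairs; each batch is published at its offset."""
    if total <= 0:
        return []
    num_batches = -(-total // batch_size)       # ceiling division
    interval = window_seconds / num_batches     # gap between batch publishes
    batches = []
    remaining = total
    for i in range(num_batches):
        count = min(batch_size, remaining)
        batches.append((i * interval, count))
        remaining -= count
    return batches
```

Each (offset, count) pair becomes a delayed publish; a global rate limit on the marketing topic adds a second safety net.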
Pitfall 3: SMS Cost Explosion
Wrong: Default to SMS for every notification type. Right: SMS only for URGENT (OTP, security). Preference cascade: push first → email → SMS as a last resort. Budget alerts when projected daily SMS cost exceeds a threshold. Geographic routing (local SMS providers are cheaper than Twilio in many countries).
Pitfall 4: Timezone-Unaware Delivery
Wrong: Schedule the “morning notification” at 9 AM UTC → users in Vietnam get it at 4 PM, users on the US West Coast at 1 AM. Right: Store each user’s timezone. Convert the scheduled time to the user’s local time. Buffer in the queue if it falls within quiet hours and send when quiet hours end.
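The conversion is a few lines with Python's zoneinfo; the function name is illustrative:

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo

def to_utc_send_time(day: date, local_hhmm: str, tz_name: str) -> datetime:
    """Convert 'send at HH:MM in the user's local time' to a UTC timestamp."""
    hour, minute = map(int, local_hhmm.split(":"))
    # Attach the user's zone to the local wall-clock time, then shift to UTC
    local_dt = datetime.combine(day, time(hour, minute), tzinfo=ZoneInfo(tz_name))
    return local_dt.astimezone(timezone.utc)
```

For example, 09:00 in Asia/Ho_Chi_Minh (UTC+7) maps to 02:00 UTC, while 09:00 in America/Los_Angeles maps to 16:00 UTC during daylight saving time — the same "morning" campaign needs per-timezone scheduling.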
Pitfall 5: Ignoring Device Token Lifecycle
Wrong: Store a device token once and use it forever. Right: Tokens change on app reinstall, OS update, or token refresh. When FCM returns InvalidRegistration → delete the token. Refresh the token on every app open. Otherwise push failure rates climb steadily over time (token decay).
Pitfall 6: Single Point of Failure in Template Engine
Wrong: The template engine is a single service — when it goes down, all notifications fail. Right: Cache rendered templates. Fall back to plain text when the template service is unavailable. The template engine should be stateless and horizontally scalable. Pre-render marketing templates before the campaign.
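One way to sketch the fallback chain — render_fn and the cache shape are illustrative assumptions, not the system's actual interfaces:

```python
from typing import Callable

def render_with_fallback(render_fn: Callable[[str, dict], dict],
                         cache: dict, key: str, params: dict) -> dict:
    """Try the live template service; on failure serve the last cached render,
    and as a final resort emit a plain-text shell so delivery never hard-fails."""
    try:
        content = render_fn(key, params)
        cache[key] = content          # refresh cache on every successful render
        return content
    except Exception:
        if key in cache:
            return cache[key]         # stale-but-usable cached render
        return {"title": "Notification", "body": str(params)}  # plain-text fallback
```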
Pitfall 7: No Idempotency → Duplicate Notifications
Wrong: Retrying on a network timeout without a duplicate check → the user receives “Your order has shipped” 3 times. Right: Idempotency key = hash(event_source + event_id + user_id + channel). SETNX in Redis with a TTL. Producer retries become safe because duplicates are filtered out.
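A sketch of the key derivation plus a first-writer-wins check; InMemoryDeduper is an illustrative stand-in for `SET key 1 NX EX ttl` in Redis:

```python
import hashlib

def idempotency_key(event_source: str, event_id: str,
                    user_id: str, channel: str) -> str:
    """Derive a stable dedup key: the same event always hashes to the same key,
    so producer retries can be filtered downstream."""
    raw = f"{event_source}:{event_id}:{user_id}:{channel}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

class InMemoryDeduper:
    """Stand-in for Redis SET NX — the first claim wins, repeats are duplicates.
    (No TTL here; the real Redis version expires keys after the dedup window.)"""

    def __init__(self):
        self.seen: set[str] = set()

    def claim(self, key: str) -> bool:
        if key in self.seen:
            return False
        self.seen.add(key)
        return True
```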
Step 4 — Wrap Up
Architecture Summary
| Layer | Components | Key Design Decision |
|---|---|---|
| Ingestion | Notification Service API | Validation + dedup + preference check before anything enters the queue |
| Processing | Template Engine, Rate Limiter, Priority Router | Separate concerns; each component scales independently |
| Queueing | Kafka (9 topics: 3 channels x 3 priorities) | Separate queues for resource isolation |
| Delivery | Push/Email/SMS Workers | Stateless, autoscaled based on Kafka lag |
| Reliability | Retry + DLQ | Exponential backoff + jitter; permanent vs transient failures |
| Data | Cassandra (log), Redis (prefs + dedup), ClickHouse (analytics) | Right tool for each access pattern |
Scalability Dimensions
| If you need to scale… | Solution |
|---|---|
| More notifications/sec | Add Kafka partitions + more workers |
| More channels (WhatsApp, Telegram) | Add a new Kafka topic + a new worker type — plug-in architecture |
| More notification types | Add new template — zero code change |
| Global deployment | Multi-region Kafka + region-local workers + local SMS providers |
What interviewers rate highly
- Priority-based queueing with resource isolation (not a single queue + priority field)
- Rate limiting at multiple levels (per-user, per-channel, per-type, global)
- Effectively exactly-once delivery via idempotency keys (at-least-once transport + dedup)
- Email deliverability awareness (SPF/DKIM/DMARC, bounce handling, IP warming)
- Cost optimization for the SMS channel
- Timezone-aware delivery with quiet hours
- Event aggregation for batch notifications
- Analytics pipeline kept separate from the delivery path
References
- Alex Xu, System Design Interview — Chapter 10: Design a Notification System
- Firebase Cloud Messaging Documentation
- Amazon SES Developer Guide
- Twilio SMS API
- Tuan-02-Back-of-the-envelope — Estimation fundamentals
- Tuan-06-Cache-Strategy — Redis caching patterns
- Tuan-08-Message-Queue — Kafka queue architecture
- Tuan-09-Rate-Limiter — Rate limiting algorithms
- Tuan-13-Monitoring-Observability — Monitoring delivery metrics
- Tuan-14-CI-CD-Pipeline — Deploying notification workers
- Tuan-15-Data-Security-Encryption — PII protection in notifications
- Tuan-17-Design-Chat-System — Related real-time system design
- Tuan-18-Design-News-Feed — Related event-driven system design
Next week: Tuan-20-Design-Key-Value-Store — Distributed KV Store design deep dive