Bonus — Audit Competition Playbook (Code4rena, Sherlock, Cantina, Hats)
“A private audit pays for a calendar and a reputation. A contest pays for one finding. Both are useful, neither is the other. The auditors who compound the fastest in 2024–2026 are the ones who treat contests as a calibration loop — short feedback cycles, public judges, peer benchmarks — and feed what they learn back into private engagements. Don’t compete to earn, compete to learn what 60-odd other people see in the same code that you didn’t.”
Tags: web3-security methodology audit-competition code4rena sherlock cantina hats solodit leaderboard
Learner: Past Tuan-15-Audit-Methodology-Tooling and Tuan-16-Report-Writing-Capstone → ready to enter the public arena
Time: 4–5 days lesson + an ongoing 12-month practice loop
Related: Tuan-15-Audit-Methodology-Tooling · Tuan-16-Report-Writing-Capstone · Tuan-Bonus-Bug-Bounty-Immunefi · severity-rubric-immunefi-c4 · audit-checklist-master · Tuan-05-Vulnerability-Classes-Part-1
1. Context & Why
1.1 Why a competition tier exists at all
Until ~2021, smart-contract audits were exclusively a private-engagement business: a protocol paid one firm a fixed fee for a fixed window of attention. The model has three structural weaknesses for the protocol:
- Diversity of attack imagination is bounded by the team size. A four-auditor team has four mental models. The bug they all miss is the bug that ships.
- Auditors face no payoff distribution that selects for the best-on-the-day. Whether you find zero High findings or three, you get the same fee.
- No public signal on auditor quality — clients hire by brand reputation and word-of-mouth, both lagging indicators.
Audit competitions (also called “contests” or “crowdsourced security reviews”) flip the structure: a protocol posts code and a prize pool; anywhere from 30 to 400+ independent researchers (“wardens” / “watsons” / “researchers” depending on platform) review in parallel for 5–30 days; judges classify and de-duplicate findings; payout is proportional to severity, uniqueness, and warden contribution.
For the researcher the model has three structural advantages:
- Faster feedback loop than private audits. Submit a finding → judge decides in 2–6 weeks → you see your severity vs. peers’ severities vs. judge ruling. Calibration data accumulates per finding.
- Real money tied to peer-relative performance. A unique High in a mid-size pool can pay roughly $10–25k; the same warden doing the same work for the same protocol on a private retainer might bill 1–2 days of work.
- Reputation builds publicly. Leaderboard standings, Solodit author profiles, Sherlock watson rank, Cantina researcher rank are searchable. A 6-month run of consistent placements is enough to start landing private leads.
These three things compound. Most senior independent auditors in 2024–2026 (trust1995, hansfriese, GalloDaSballo, cmichel, pashov, 0xRajeev, dirk_y, kalou, etc.; names current at the lesson’s writing date) used competitions as the first 12–24 months of their career, then mixed contests with private engagements once private rates became competitive.
The honest framing: most wardens lose money. Top decile makes a living; top centile makes a top-firm partner-track income. The median Code4rena warden in a given contest earns less than the cost of their time at minimum wage. The lesson is not “compete and you’ll get rich” — it’s “compete to learn, and the income will follow whoever calibrates fastest”.
1.2 What this chapter covers
By the end you can:
- Pick which platform fits a given week’s available time and your current skill bracket.
- Estimate expected payout from a contest before committing time (the ROI calculation).
- Scout a protocol pre-contest (docs, prior audits, novelty estimate) in 60–90 minutes.
- Allocate time within a contest using a phased pass model (recon → module → cross-cutting → write-up).
- Write a finding in the style judges accept — title, severity rationale, impact, PoC, recommendation — and defend it through escalations.
- Read 10 invalidated findings from a recent contest and explain why each was downgraded or rejected.
- Calibrate your severity calls against Solodit’s aggregated rulings — closing the gap between “I thought it was High” and “the judge ruled Medium”.
- Recognize the anti-patterns (spam, vague impact, missing PoC, wrong rubric) that classify a finding as low-effort or invalid.
1.3 Primary references
| Source | URL | Notes |
|---|---|---|
| Code4rena Docs | https://docs.code4rena.com/ | Submission, judging, severity rubric. Read end-to-end before first contest. |
| Code4rena Submission Guidelines | https://docs.code4rena.com/competitions/submission-guidelines | The single most-referenced page. Bookmark it. |
| Code4rena Severity Categorization | https://docs.code4rena.com/competitions/judging/severity-categorization | The rubric you’ll be judged against. |
| Sherlock Docs | https://docs.sherlock.xyz/ | Lead auditor + watson model; stricter rubric. |
| Sherlock Judging Criteria | https://docs.sherlock.xyz/audits/judging/judging | Defines what counts as High vs Medium (Low isn’t paid). |
| Sherlock Audits Calendar | https://audits.sherlock.xyz/contests | Active + upcoming contests. |
| Cantina Docs | https://docs.cantina.xyz/ | Marketplace + competitive reviews. |
| Cantina Competitions | https://cantina.xyz/competitions | Active contests; also lists private review marketplace. |
| Solodit | https://solodit.cyfrin.io/ | Cross-platform finding aggregator. Single best calibration tool in the industry. |
| Hats Finance | https://hats.finance/ | Continuous audits + bug bounty hybrid; less standardized rubric. |
| Hats Audit Competitions | https://app.hats.finance/audit-competitions | Active Hats contests. |
| Code4rena Zenith | https://code4rena.com/zenith | Curated, invite-only / vetted-researcher contests by C4. |
| Cantina × Spearbit | https://cantina.xyz/welcome | Cantina now runs Spearbit-style competitive + curated reviews. |
| C4 escalations / appeals process | https://docs.code4rena.com/competitions/judging/escalations | The “appeal a judge ruling” flow. Most expensive page if you skip it. |
Many platforms iterate their rubrics every few quarters. Treat anything quoted in §3–§4 as a snapshot — re-read the source links before each contest. [verify] any specific dollar figure, percentage threshold, or rule clause against the live docs.
2. The Competition Tier — Platforms and How They Differ
2.1 At-a-glance
| Platform | Started | Format | Severity tiers paid | Pool size (typical 2025–26 [verify]) | Pool model | Judge model |
|---|---|---|---|---|---|---|
| Code4rena (C4) | 2021 | Open contest, 5–14 days | High / Medium (Low + QA bundled in QA report) | 1M+ | Sponsor-funded prize pool | C4 judge pool (paid C4 judges) |
| Sherlock | 2022 | Open contest, 3–10 days, lead-auditor framing | High / Medium only | 500k | Sponsor pool + Sherlock pays watsons | Sherlock-internal lead judges |
| Cantina | 2023 | Open competition + curated/private reviews | High / Medium / Low | 1M+ | Sponsor pool | Cantina senior researchers + Spearbit-affiliated leads |
| Hats Finance | 2022 | Continuous + competition mode | Critical / High / Med / Low (per project) | 300k+ | Mix sponsor + project token | Hats triage + project committee |
| Immunefi (bounty) | 2020 | Continuous bug bounty (not contest) | Critical / High / Med / Low | n/a (per-bounty programs up to $50M) | Per-program continuous pool | Project + Immunefi triage |
Immunefi is not a contest platform; it’s listed for context. See Tuan-Bonus-Bug-Bounty-Immunefi.
2.2 Code4rena (“C4”) — the volume leader
Format: a public, open-entry contest typically running 5 to 14 days, occasionally longer for larger codebases. Anyone can submit findings as a “warden”. Sponsors pay an upfront prize pool plus a per-finding judge fee. Code4rena runs many contests per month, so the absolute warden volume is the highest in the industry.
Severity tiers:
| Tier | C4 description (paraphrased; check live rubric) |
|---|---|
| High (H / 3-H-XX) | Loss of funds, broken core protocol functionality, or any state corruption that compromises invariants in a way attackers can realistically exploit |
| Medium (M / 3-M-XX) | Risk arises only under specific conditions (external state, market, governance) or breaks a non-critical function; assets not directly stealable but value can leak |
| Low (L) | Issues worth noting but not directly exploitable; design quality concerns; bundled into a per-warden QA report rather than paid per-finding |
| Gas / QA | Optimizations or non-security observations; bundled into a per-warden Gas report |
Submission format (recent C4 site uses a structured form):
- Title, Severity (warden-proposed), Lines of code linked to the file/commit in scope, Vulnerability detail (free text Markdown), Impact, Tools used, Recommended mitigation.
Categories of warden output:
- Per-finding submissions (H/M) — paid based on slot share (see §6).
- One QA report per warden — paid in tiered bracketed grades (typically Grade A / B / C and “no award”).
- One Gas report per warden — same tiering.
Warden tiers (C4-specific terminology; structure shifts every ~12 months, [verify] at submission time):
- New / unranked → Certified (“Cwarden”) tier earned through consistent placements.
- Zenith is a separate vetted-researcher track (see §9.3) where C4 invites top performers into curated contests, often closer in shape to private engagements.
Idiosyncrasies:
- The “C4 method” of judging is famously contentious for new wardens: severity decisions reflect judge interpretation of the published rubric, sometimes accompanied by a one-line rationale. Judges can downgrade aggressively. The escalations process (§7.4) is where you challenge a call — and where most new wardens lose money by not using it correctly.
- Duplicates (“dupes”) — same finding by multiple wardens are merged into one issue, and the prize for that finding is split across the dupe group with a slot-share formula (§6). Your unique find pays vastly more than your fifth-shared find. Originality is rewarded structurally.
- Primary vs supporting — for a duplicate group, the judge picks a single best-written submission as the “primary”; that warden gets a slot bonus. Worth optimizing for.
- Selective audits / Pro audits — C4 also runs a Pro / private tier alongside open contests, but the bread-and-butter is open competition.
2.3 Sherlock — the lead-auditor model
Format: a contest typically 3 to 10 days. Sherlock introduced two innovations:
- Lead auditor: each contest has a designated “lead senior watson” who plays a quasi-judge role and is paid extra for the responsibility.
- Watson pool: many independent researchers (“watsons”) compete in parallel as in C4.
Severity tiers (only High and Medium are paid — Sherlock famously does not pay Low or informational):
| Tier | Sherlock criteria (verbatim-ish; check live docs [verify]) |
|---|---|
| High | A bug that causes loss of funds without extensive prior external conditions, and the loss meets impact bars (typically >1% AND >$10 of principal/yield) |
| Medium | Loss requiring specific conditions, or breaks core functionality, with thresholds (typically >0.01% AND >$10) |
Key distinguishing rule: likelihood is not considered for validity. If an attack is theoretically possible and meets the impact bar, it’s valid even if difficult to execute. This makes Sherlock the most favorable platform for researchers on hard-to-execute bugs, but the harshest at rejecting findings that miss the precise impact thresholds.
Additional Sherlock-specific judging rules (snapshot — read live docs for canonical text):
- Admin functions assumed used correctly unless the contest README explicitly says otherwise.
- Front-running on public-mempool chains is in scope; on Arbitrum / Optimism / private-mempool chains, front-running is out of scope because there is no public mempool to front-run on.
- Stale Chainlink price findings typically invalid unless paired with a concrete consumer impact (i.e., not “this could happen” but “and here is how the protocol gets drained”).
- Storage-gap omissions in upgradeable contracts typically invalid unless complex inheritance is present.
- DoS findings: must lock funds >7 days OR impact a time-sensitive function — both → High, either → Medium.
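The DoS rule and the loss thresholds quoted above can be written down as a small pre-submission self-check. This is a minimal sketch of the snapshot as stated in this section, not Sherlock's canonical logic; the function names and the "Below bar" label are made up for illustration, and the live docs remain the authority.

```python
# Snapshot thresholds as quoted in this section (assumption: they may have changed).
LOSS_BAR_PCT = {"High": 1.0, "Medium": 0.01}  # minimum loss, in percent
MIN_LOSS_USD = 10

def dos_severity(lock_days: float, hits_time_sensitive_fn: bool) -> str:
    """Both conditions -> High, exactly one -> Medium, neither -> below the bar."""
    locks_long = lock_days > 7
    if locks_long and hits_time_sensitive_fn:
        return "High"
    if locks_long or hits_time_sensitive_fn:
        return "Medium"
    return "Below bar"

def meets_loss_bar(tier: str, loss_pct: float, loss_usd: float) -> bool:
    """Check only the numeric impact bar; external-condition rules are not modelled."""
    return loss_pct > LOSS_BAR_PCT[tier] and loss_usd > MIN_LOSS_USD

print(dos_severity(lock_days=10, hits_time_sensitive_fn=False))
print(meets_loss_bar("High", loss_pct=0.5, loss_usd=1000))
```

Running a candidate finding through a check like this before writing it up makes it obvious when the severity argument rests on a threshold the finding does not actually meet.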
Watson rank and “Senior Watson” status — accrue through consistent valid findings; senior status unlocks lead-auditor opportunities.
Authorship and Solodit — Sherlock is the primary upstream into Solodit’s aggregated finding feed. Most Sherlock findings end up indexed there, so calibration data is plentiful. (Code4rena and Cantina also feed Solodit but with different latency / coverage.)
2.4 Cantina — competitive + curated, spans bigger protocols
Format: Cantina (Spearbit’s competitive-audit platform) runs a mix of:
- Open competitions — similar shape to C4/Sherlock; anyone can submit.
- Curated reviews — top researchers are invited; smaller researcher pool, higher per-researcher payout, closer to a hybrid between contest and private engagement.
- Marketplace for private engagements — Cantina acts as broker between protocols and vetted reviewers.
Severity tiers: High / Medium / Low (Low often paid, unlike Sherlock; [verify] per contest).
Idiosyncrasies:
- Cantina’s curated competitions tend to bring bigger protocol names to competition (post-merge LST protocols, major restaking infrastructure, large lending markets) — the pools are often the largest in the industry, but the researcher pool is also higher-skill.
- Severity rubric is closer to Code4rena than Sherlock; expect “likelihood matters” judging.
- Researcher rank on Cantina builds towards invite eligibility for curated reviews — the path is similar to C4 Zenith.
If you’ve placed Top-10 across 6+ C4/Sherlock contests, expect Cantina invitations to follow.
2.5 Hats / Hats Pro — continuous + competition hybrid
Format: two modes.
- Continuous audit competitions: an evolving codebase has a long-running open bounty (weeks to months), with the protocol’s own deployment / TVL behind it.
- Discrete audit competitions: fixed-window contests like C4.
Severity tiers: Critical / High / Medium / Low (per project; rubric varies more than other platforms — [verify] per program).
Idiosyncrasies:
- Reward sometimes paid in project token rather than stablecoin — adds price-volatility exposure that doesn’t exist on C4/Sherlock USDC-denominated pools.
- Triage and dispute process is less standardized than the C4/Sherlock pipeline; reading prior Hats finding reports for the specific program is essential before submitting.
- Audit + bug-bounty boundary blurrier — sometimes the same finding can be reported during a contest or as a continuous bounty, with different reward sizes.
- Project committee involvement in judging means downgrading-via-political-disagreement is more common; document everything.
2.6 Choosing where to enter as a new warden
| Goal | Best platform |
|---|---|
| Earliest possible calibration on a single finding (lowest barrier) | Code4rena open contests |
| Strictest rubric, fewest opinion-based downgrades | Sherlock |
| Most signal on report-writing quality | Cantina (longer prose expected) |
| Largest pools at higher difficulty | Cantina curated / C4 Zenith |
| Stomach for token-denominated reward + less-standardized triage | Hats |
| Continuous engagement with one codebase (not contest) | Hats continuous or Immunefi |
Pragmatic order for the first 12 months (one recommendation; many viable paths):
- Months 1–3: 2–3 Code4rena contests (low pressure, lots of volume, fastest calibration). Aim for any valid finding — even a Medium with 30-way duplicate teaches a lot.
- Months 4–6: 1–2 Sherlock contests (stricter rubric trains precision). Mix C4 in between.
- Months 7–9: One Cantina open competition. Start submitting QA + Gas reports to learn the “polish” side.
- Months 10–12: Begin Zenith / curated competitions if invited; otherwise continue C4/Sherlock. Start an Immunefi continuous bounty in parallel.
The order matters less than the cadence — one contest per month with a written calibration retrospective after each is worth ten contests done without reflection.
3. The ROI Question — How to Decide Whether to Compete at All
3.1 The math nobody publishes
Expected payout from a single contest is not “I will find a High and earn 5 figures”. It is a probability-weighted distribution. A back-of-the-envelope:
E[payout] = P(find ≥1 valid M+) × E[$ per M+ found | finding]
+ P(find ≥1 H) × E[$ per H found | finding]
− OpportunityCost(hours_committed)
For a representative mid-pool open contest:
| Variable | Realistic 2025–26 [verify] |
|---|---|
| Pool size | $100k |
| Wardens (active submitters) | 60–120 |
| Findings issued | 80–200 (across H/M/L/QA) |
| Wardens with ≥1 valid M+ | 20–40 (i.e., 30–50% find something) |
| Wardens with ≥1 valid H | 5–15 |
| Wardens with a unique solo H | 2–6 |
| Top warden’s share of pool | 20–35% (often one warden hits multiple H+M) |
| Median submitter’s share | 0–1% |
| Median warden net earnings (hours @ $100 opportunity cost) | Negative |
If you spend ~80 hours on the contest at a $50/hour opportunity cost, you have put in $4,000 of time. Median outcome: at most about $500 in finding payouts. Top-decile outcome: up to about $25k. Top-1% (one or two unique H + best-written): up to about $80k.
The distribution is fat-tailed. Expected value calculations only become favorable once your find rate and unique-find rate cross some platform-dependent threshold. For most researchers, this threshold is reached after roughly 6–12 contests of practice (per anecdotal reports across the industry — [verify] with your own tracking).
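The back-of-the-envelope formula from §3.1 can be turned into a few lines of Python so you can plug in your own tracked numbers. This is a sketch only: every probability and dollar figure in the example call is a hypothetical placeholder, not data from any contest.

```python
def expected_net_payout(
    p_medium: float,        # P(find >= 1 valid M+)
    usd_per_medium: float,  # E[$ per M+ found | finding]
    p_high: float,          # P(find >= 1 H)
    usd_per_high: float,    # E[$ per H found | finding]
    hours: float,           # hours committed to the contest
    hourly_cost: float,     # what an hour of your time is worth elsewhere
) -> float:
    """Probability-weighted payout minus the opportunity cost of the hours."""
    gross = p_medium * usd_per_medium + p_high * usd_per_high
    return gross - hours * hourly_cost

# Hypothetical newcomer: 40% chance of a Medium worth $500, 10% chance of a
# High worth $5,000, 80 hours at $50/hour. Prints a value of about -3300.
print(expected_net_payout(0.4, 500, 0.1, 5_000, hours=80, hourly_cost=50))
```

A negative result is the normal case early on; the point of the calculation is to see how far the find probabilities have to move before the sign flips.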
3.2 Find rate vs unique-find rate
Two distinct metrics matter:
- Find rate = (your valid M+ findings) / (total M+ findings in the contest)
- Unique-find rate = (your unique solo findings) / (your valid M+ findings)
Top wardens land 5–15% find rate consistently, with 30–50% of their finds being unique in moderately-attended contests. That’s the income-producing combination: enough volume to participate in many dupe groups, plus enough novelty to occasionally own a finding.
A new warden landing 1–2% find rate with mostly heavy-dupe findings will net almost nothing. Don’t be discouraged — the learning per finding is far higher in the first 10 contests; income arrives later.
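Both metrics are simple ratios, and computing them the same way after every contest keeps the trend comparable. A minimal sketch, with hypothetical counts in the example:

```python
def find_rate(my_valid: int, total_valid: int) -> float:
    """Share of the contest's valid M+ findings that you also found."""
    return my_valid / total_valid if total_valid else 0.0

def unique_find_rate(my_unique: int, my_valid: int) -> float:
    """Share of your own valid M+ findings that nobody else reported."""
    return my_unique / my_valid if my_valid else 0.0

# Hypothetical contest: 40 valid M+ issues overall, you hit 3, one of them solo.
print(f"find rate:        {find_rate(3, 40):.1%}")
print(f"unique-find rate: {unique_find_rate(1, 3):.1%}")
```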
3.3 Hours-per-find calibration
Track this across contests. A simple ledger:
contest: <name> hours worked: 65
H found: 1 (dupe of 4) M found: 2 (1 solo, 1 dupe of 7)
QA report: Grade B
gross payout: $1,820
$/hour gross: $28
net (after 30% opportunity cost adjustment): $19/hour
After 6 contests, you’ll see your $/hour trend. If it stays below roughly $50/hour with rising hour counts, the calibration target is severity precision (over-rating Lows as Mediums) or recon discipline (spending hours in the wrong module).
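The arithmetic behind the ledger entry is small enough to script, which makes it easy to keep every contest on the same basis. A sketch using the sample figures from the ledger above; the field names are illustrative, and the 30% adjustment is simply the one the ledger uses.

```python
from dataclasses import dataclass

@dataclass
class LedgerEntry:
    contest: str
    hours: float
    gross_payout_usd: float

    def gross_per_hour(self) -> float:
        return self.gross_payout_usd / self.hours

    def net_per_hour(self, adjustment: float = 0.30) -> float:
        """Gross hourly rate reduced by the opportunity-cost adjustment."""
        return self.gross_per_hour() * (1 - adjustment)

entry = LedgerEntry(contest="<name>", hours=65, gross_payout_usd=1_820)
print(f"gross: ${entry.gross_per_hour():.2f}/h")  # 1820 / 65 = 28
print(f"net:   ${entry.net_per_hour():.2f}/h")    # about 19.6; the ledger shows $19
```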
3.4 The opportunity-cost framing for working auditors
If you’re already earning private-audit rates, a contest can mean roughly $15k of foregone billable time. In that case the contest has to gross >$15k to break even on cash, plus deliver some learning value to break even on career. This is a high bar: for most established private auditors, contests are a complement (one per quarter for calibration) rather than a primary income.
For a 1st-year independent auditor without private clients yet, opportunity cost is closer to $0 — contests are the highest-ROI use of time available because they also build the public track record needed to land private work.
4. Pre-Contest Scouting (60–90 minutes, before committing time)
4.1 The scouting checklist
Before spending any review time, spend an hour answering:
- Pool size and contest length — does the time budget plausibly justify the pool? (See §3.)
- Code line count and complexity — read the contest README. Compute SLOC. Apply complexity multipliers from Tuan-15-Audit-Methodology-Tooling §3.2.
- Number of wardens already signed up — if listed; some platforms show count.
- Prior audits — has the protocol been audited before? Read those reports first.
- Protocol category — AMM? Lending? Vault? Bridge? Restaking? Match against your strongest area.
- Novelty estimate — is this a Uniswap V2 fork (well-known surface, low edge for you) or a novel curve mechanism (high edge if math is your strength)?
- Identifiable senior wardens / lead auditors competing — public commitment via Twitter / Discord. The denser the senior pack, the harder unique finds become.
- Sponsor responsiveness — is the team active on Discord answering questions during the contest? Active sponsors → fewer rejected findings via “we assumed this away”.
If even 3 of these flags are unfavorable, consider skipping in favor of the next contest. Contest selection is a major skill — top wardens skip 60–80% of available contests.
4.2 Reading prior audits
Most C4/Sherlock/Cantina contests list prior audits in the README. Spend 30 minutes per prior report:
| What to look for | Why |
|---|---|
| Severity distribution of prior findings | High-severity-heavy = protocol has structural complexity, fertile ground; or already-cleaned = low edge |
| Categories of bugs found | Repeated reentrancy / oracle / access control — what’s their developmental weakness? |
| Specific functions / modules flagged | A re-audit of “previously-found” code rarely repays time; the previously-clean modules are the new attack surface |
| Acknowledged / wontfix issues | Often a standing rejection of the exact issue category that others will re-submit and have invalidated. Read these especially carefully. |
| Time elapsed since last audit | Lots of code added since? That’s the high-yield diff. |
Anti-pattern: submitting a finding that was acknowledged in a prior audit report. Judges WILL flag this as out-of-scope or invalid, and you’ve wasted a submission slot. Always scan prior reports first.
4.3 Identifying novel vs reused-library components
Most protocols are 60–90% standard library code (OpenZeppelin, Solmate, Uniswap-V2-style math) with a 10–40% novel slice. Bugs heavily concentrate in:
- The novel slice itself — fresh code with no prior audit history.
- The integration glue between novel code and standard libraries — the call sites where assumptions cross.
- Upgrade hooks that modify standard library behavior.
Spend a triage hour: grep for import "@openzeppelin/, import "@uniswap/, import "@solmate/. Subtract from total SLOC. The residue is the high-edge surface — focus there.
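The grep-and-subtract triage can be roughed out in Python. This is a sketch under stated assumptions: the comment stripping is naive (it ignores `//` inside string literals), the three library prefixes are the ones named above, and the `src` directory in the usage note is the common Foundry layout rather than something every repo follows.

```python
import re
from pathlib import Path

# Matches imports such as: import {ERC20} from "@openzeppelin/contracts/...";
LIB_IMPORT = re.compile(r"^\s*import\b[^;]*[\"']@(openzeppelin|uniswap|solmate)/", re.M)

def count_sloc(source: str) -> int:
    """Count non-blank lines after removing /* */ and // comments (rough)."""
    no_block = re.sub(r"/\*.*?\*/", "", source, flags=re.S)
    no_line = re.sub(r"//.*", "", no_block)
    return sum(1 for line in no_line.splitlines() if line.strip())

def library_families(source: str) -> set[str]:
    """Which of the standard-library families this file imports."""
    return set(LIB_IMPORT.findall(source))

def scan(root: str) -> list[tuple[str, int, set[str]]]:
    """Per-file SLOC and library imports for every .sol file under root."""
    rows = []
    for path in sorted(Path(root).rglob("*.sol")):
        text = path.read_text(encoding="utf-8", errors="replace")
        rows.append((str(path), count_sloc(text), library_families(text)))
    return rows
```

Calling `scan("src")` on a checked-out repo gives a per-file table: files with many lines and few library imports are candidates for the novel slice, while thin wrappers around imported contracts are mostly integration glue.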
4.4 Estimating your edge
A practical question to ask before signing up:
“What does my background give me that 80% of other wardens don’t have?”
| Background | Edge |
|---|---|
| Strong math (e.g., engineering / quant / cryptography) | Curve mechanisms, AMM math, interest-rate models, slippage proofs |
| Strong systems / OS background | Gas economics, ordering subtleties, MEV |
| Heavy DeFi-using personal experience | Integration risk, real-world failure mode intuition |
| Solana / Move / Cairo / Cosmos experience (and the contest covers any of these) | Massive — non-EVM competitive pool is much smaller |
| L2 / bridge / cross-chain background | High in bridge contests, which have few experts |
| Frontend / supply-chain / OPSEC | Niche but useful in dApp-scope contests (rare) |
If your edge intersects the contest’s category, your unique-find rate goes up. If it doesn’t, you’re competing on raw thoroughness against more specialized wardens.
4.5 The decision form
Fill it out before committing:
Contest: <name>
Platform: <C4 / Sherlock / Cantina / Hats>
Pool: $<amount>
Length: <days>
SLOC (in scope): <SLOC, novel-only>
Prior audits: <count, latest date>
Category: <AMM / lending / vault / bridge / restaking / ...>
Novel slice: <approx % of code>
My edge: <bullet points>
Time I'll commit: <hours; cap at 1.5× planned>
Expected find rate: <%>
Decision: <enter / skip / wait for next>
Notes: <flagged risks, e.g., "team unresponsive on Discord">
Keep these in a logbook. After 12 contests, your prediction accuracy on find rate will be calibrated. That’s an enormously valuable artifact.
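One way to measure that prediction accuracy is to compare the "Expected find rate" field of each form with the rate you actually achieved. A minimal sketch; the logbook values below are invented for illustration.

```python
from statistics import mean

def calibration(predicted: list[float], actual: list[float]) -> tuple[float, float]:
    """Return (bias, mean absolute error) of predicted vs. actual find rates.

    Positive bias means you tend to over-estimate your find rate.
    """
    errors = [p - a for p, a in zip(predicted, actual)]
    return mean(errors), mean(abs(e) for e in errors)

# Hypothetical logbook: predicted vs. achieved find rate over four contests.
predicted = [0.05, 0.05, 0.03, 0.04]
actual = [0.01, 0.02, 0.03, 0.05]
bias, mae = calibration(predicted, actual)
print(f"bias {bias:+.3f}, mean absolute error {mae:.3f}")
```

Tracking bias separately from absolute error shows whether you are consistently optimistic or just noisy, and the two call for different corrections.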
5. During-Contest Workflow — Phased Pass Model
Adapt the methodology from Tuan-15-Audit-Methodology-Tooling §2.2 to a contest’s time pressure. The phase model below is for an 8-day open contest; adjust ratios proportionally for 5-day or 21-day windows.
5.1 The 8-day phase budget
Day 1 (4h): Recon — README, docs, prior audits, system overview, threat model v0
Day 1–2 (6h): Tool sweep — Slither / Aderyn / build coverage report
Day 2–4 (24h): Module-by-module manual review (3–6 modules at 4–6h each)
Day 5–6 (12h): Cross-cutting passes — access control, invariants, oracle/MEV, economic
Day 6 (6h): Fuzz harness + targeted PoC bursts
Day 7 (8h): Write-up burst — convert finding notes into submission-quality drafts
Day 8 (6h): Final polish, severity calibration, submit
Total: ~66h (realistic for a 60–80h committed window)
Note: judges have observed for years that the quality of submissions plateaus after about 80 hours per warden on a typical 8-day open contest. Going beyond costs more than it earns. Cap your hours; protect calibration capacity for the next contest.
5.2 Phase 1 — Recon (4 hours, Day 1 morning)
Goal: build the same scoping artifact a private audit would produce, but in a quarter of the time.
Tasks:
- Read the README end-to-end — twice. The first time for “what is this”, the second for any hidden gotchas (“we acknowledge X is out of scope”, “the keeper is assumed honest”, etc.).
- Read the protocol docs / whitepaper — 30 minutes max. The point is to know the intended invariants, not to memorize architecture.
- Read all prior audit reports in full (see §4.2).
- Build the file/contract dependency tree: `slither . --print human-summary,contract-summary,inheritance-graph` produces it in seconds.
- Identify entry points: `slither . --print entry-points` lists every external/public function. Print this list and physically check off as you review.
- Write a 1-page threat model in your own notes file: actors, trust boundaries, top-5 invariants to verify. (Even if rough, the act of writing it forces structure.)
- Identify novel slice (see §4.3). Mark modules as “deep dive” or “skim”.
Output: scoping notes file with entry-point list, threat model v0, ranked module list (most novel / most likely-buggy first).
5.3 Phase 2 — Tool sweep (6 hours, Day 1–2)
Reference Tuan-15-Audit-Methodology-Tooling §8–§9. For a contest:
- Slither + Aderyn: run both, dump output to files. Don’t immediately submit Slither findings; most are noise. But scan them: any uncontroversial high-confidence finding (e.g., `controlled-delegatecall`) deserves immediate verification.
- Build a Foundry harness early, even a stub. You’ll want to write PoCs throughout the contest, not at the end.
- Foundry coverage: `forge coverage` tells you which functions tests don’t cover. Untested code is high-value review territory.
- Echidna / Medusa: skip in 5-day contests (too much config overhead). Use in 14-day contests for late-stage invariant verification.
- Halmos: only if a math-heavy module justifies it. Math libraries (TickMath-likes, mulDiv variants) are great targets.
The output of this phase: a triaged tool report + a working Foundry harness pre-loaded with the protocol.
5.4 Phase 3 — Module pass (24 hours, Day 2–4)
For each module (3–6 modules of 4–6 hours each):
Apply the three-pass structure from Tuan-15-Audit-Methodology-Tooling §7.1:
- Top-down (the user’s path) — for each external entry, trace who can call it, validate inputs, read+write state, external calls, events, gas behavior. Drop questions into your finding journal as you go.
- Bottom-up (the state’s path) — list every state variable; identify all writers; check consistency.
- Heuristics-on-sight — every smell from Tuan-15-Audit-Methodology-Tooling §7.3 triggers a 5× pace slowdown.
The 60-second finding-journal habit: every time something feels off, spend 60 seconds writing the question into a Markdown file. Don’t try to resolve immediately. By end of contest you’ll have 50–200 journal entries; ~10–20% become findings.
Pace target on first contest: one external function fully reviewed per 30 minutes on novel code. Speeds up as familiarity grows.
5.5 Phase 4 — Cross-cutting passes (12 hours, Day 5–6)
Once per-module work is done, run patterns across the codebase:
| Pass | What to check |
|---|---|
| Access control matrix | Every privileged function × every role. Spreadsheet. Any gaps? Any over-grants? |
| Invariant sweep | Take your top-10 invariants (from threat model); test each via Foundry assertion. Anything that fails or you can’t quickly verify is a candidate finding. |
| Oracle / price source | Every consumption of an external price (Chainlink, Uniswap V2/V3, Curve, custom). Stale, manipulable, mis-decimaled, mis-denominated? |
| Math direction | Every division. Rounding favors who? Consistent with documented spec? Donation-attack vulnerable? |
| Reentrancy surface | Every external call followed by state writes. CEI or nonReentrant? Cross-function? Read-only via view exposed to consumers? |
| Token integration | SafeERC20 throughout? Fee-on-transfer aware? Rebasing aware? Approve race? |
| Time / block sensitivity | Every block.timestamp / block.number. Manipulation surface? Cross-chain inconsistency? |
| Event coverage | Every state-changing function emits an event sufficient for off-chain reconstruction? |
| Initialization / upgrade hooks | _disableInitializers present? Storage gap? Storage layout preserved? |
| MEV / front-running | Any function whose order-dependence creates value-extraction? Slippage protection? |
Cross-cutting is where senior wardens land uniques — fast pattern recognition across the code surface. New wardens often skip this phase (“I’m still reviewing modules”); the senior fix is to stop module review on Day 4 hard and switch to cross-cutting even if module pass feels incomplete.
5.6 Phase 5 — PoC bursts (within phases 3–4)
Whenever a finding crystallizes — immediately write the Foundry PoC. Don’t batch.
Reasons:
- A PoC validates the bug exists. Sometimes it doesn’t, and you save the write-up time.
- A PoC produces numbers you’ll cite in the impact section.
- A PoC is the strongest defense against severity downgrades.
- A PoC sometimes reveals a second finding nearby (“while attacking X I noticed Y”).
Speed target: 30–90 minutes per PoC for a clean bug. If it’s taking >2 hours, either the bug isn’t real or the harness needs work — step back.
5.7 Phase 6 — Write-up burst (Day 7, 8 hours)
Convert finding-journal entries into submission-shaped Markdown. Apply the template in §6.
This is also when you kill doubtful findings. Half of your journal entries don’t survive a careful re-read. Better to submit 5 confident H/M findings than 20 mixed-quality ones — judges read submission quality as a per-warden signal, and a wall of Low-disguised-as-Medium hurts your reputation on the platform.
5.8 Phase 7 — Submit, polish, defend
Final pass:
- Every Medium+ has a working PoC linked.
- Every severity claim has explicit reference to the platform rubric.
- Every impact statement includes numbers (USD or % of TVL).
- Every recommendation includes a specific code fix, not generalities.
- No typos in the title or first paragraph — those determine reading order.
- No duplicate-from-prior-audit content (re-check §4.2).
- QA report assembled with all Low findings as a single multi-section document.
- Gas report assembled if you have ≥3 gas optimizations worth submitting.
Submit at least 4 hours before the deadline: the C4 / Sherlock / Cantina submission systems get DDoS-grade traffic in the final hour, and submission failures happen.
After submission: watch the post-judging period (typically 2–6 weeks). Use escalations (§7.4) when applicable. This is where downgrade calls get reversed if you defend them well.
6. Writing a Finding That Judges Accept
6.1 The universal template
Across C4, Sherlock, Cantina, the strongest findings share a structure:
# <Title — function name + bug class>
## Summary
<One-sentence summary: who can do what to whom under what conditions.>
## Severity
<Proposed: High / Medium / Low — *with explicit rubric reference*>
## Vulnerability Details
<Code excerpt with line numbers; precise explanation of the bug;
state transitions / invariants violated; assumptions broken.>
## Impact
<Who loses what under what realistic conditions; numerical bounds
(USD, % of TVL, % of fees, time-to-execute).>
## Proof of Concept
<Foundry test or step-by-step reproduction; concrete numbers.>
## Tools Used
<Manual / Foundry / Slither / Halmos / ...>
## Recommended Mitigation
<Specific code change. Patch-style diff if possible. Anticipate
side effects of the fix.>
## References
<Prior similar findings; Solodit links; spec / docs.>
Some platforms (C4) have stricter forms; some (Sherlock) have looser ones. The template works as a baseline you can compress or expand.
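As a rough aid for the final pass, the template above can be turned into a small pre-submission check. The sketch below assumes a finding is drafted as Markdown with `## ` section headings named exactly as in the template; `missing_sections` and the sample draft are hypothetical names for illustration, not part of any platform's tooling.

```python
import re

# Section names copied from the template above.
REQUIRED_SECTIONS = [
    "Summary", "Severity", "Vulnerability Details", "Impact",
    "Proof of Concept", "Tools Used", "Recommended Mitigation", "References",
]

def missing_sections(finding_md: str) -> list[str]:
    """Return the template sections that have no '## <name>' heading in the draft."""
    found = {
        m.group(1)
        for m in re.finditer(r"^##[ \t]+(.+?)[ \t]*$", finding_md, re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in found]

# Hypothetical half-written draft.
draft = """# Vault.redeem() allows draining via first-depositor donation attack
## Summary
...
## Severity
...
## Impact
...
"""

print(missing_sections(draft))
```

Run against the half-written draft, the check lists the five sections still to be written. It only checks structure; whether each section is convincing is still a manual read.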
6.2 Title — “function name + bug class”
A great title is grep-able: another auditor scanning Solodit for a category should find your finding by keyword search.
| Weak title | Strong title |
|---|---|
| "Vault can be drained" | "Vault.redeem() allows draining via first-depositor donation attack" |
| "Issue with rounding" | "Rounding in convertToShares() favors user, causing slow drain of pool assets" |
| "Oracle problem" | "Oracle.getPrice() uses spot price on UniV2; manipulable for ~$200k cost given current pool depth" |
| "Reentrancy" | "Read-only reentrancy in Pool.virtualPrice() allows consumer protocols (e.g., Bank.deposit()) to mis-price during removeLiquidity()" |
The strong titles communicate function + class + immediate consequence. Judges form a first impression from titles; titles also help in dedupe (two wardens with the same strong title cluster instantly).
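The "function + class" rule is mechanical enough to lint. The sketch below is a heuristic only: the keyword list is a hypothetical starting point, and `title_problems` is an illustrative helper name. It flags titles that lack a function reference such as `redeem()` or a recognizable bug-class keyword.

```python
import re

# Hypothetical keyword list; extend it with the bug classes you actually report.
BUG_CLASS_KEYWORDS = (
    "reentrancy", "rounding", "donation", "spot price", "overflow", "access control",
)

def title_problems(title: str) -> list[str]:
    """Return a list of heuristic complaints about a finding title."""
    problems = []
    # A function reference looks like redeem() or convertToShares().
    if not re.search(r"\b\w+\(\)", title):
        problems.append("no function name")
    if not any(keyword in title.lower() for keyword in BUG_CLASS_KEYWORDS):
        problems.append("no bug class keyword")
    return problems

print(title_problems("Vault can be drained"))
print(title_problems("Rounding in convertToShares() favors user, causing slow drain of pool assets"))
```

The weak title from the table fails both checks and the strong one passes. A lint like this catches only omissions; it cannot tell whether the stated consequence is accurate.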
6.3 Severity — justify, don’t assert
A severity claim without a rubric reference is an assertion. With one, it’s an argument.
## Severity
**High** under Code4rena's rubric §3-H ("Assets can be stolen, lost, or
compromised directly; or there's a valid attack path with realistic
assumptions").
Impact: direct loss of user deposits, attack path is single-tx,
attacker capital ≤ flash-loan accessible amount (~$50M on Aave today),
no admin / governance dependency.
Likelihood: any caller — no privileged role required. PoC executes
in a single transaction.
This wording defends against the most common downgrade: “Medium because too hard to execute”. You preempt it with the flash-loan availability argument and the single-tx PoC.
For Sherlock, replace the Code4rena reference with the Sherlock threshold for the severity you claim (for High: loss >1% AND >$10 of principal). For Cantina, follow their published rubric (closer to C4 in 2025–26 [verify]).
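A threshold of this shape is easy to check before you write the severity line. The sketch below uses the figures quoted above (>1% and >$10 of principal) as defaults; treat both as parameters to re-verify against the live rubric. The helper name `meets_loss_threshold` is hypothetical.

```python
def meets_loss_threshold(loss_usd: float, principal_usd: float,
                         min_fraction: float = 0.01, min_usd: float = 10.0) -> bool:
    """True only when the loss exceeds BOTH the relative and the absolute floor."""
    if principal_usd <= 0:
        return False
    return loss_usd / principal_usd > min_fraction and loss_usd > min_usd

print(meets_loss_threshold(200, 10_000))  # 2% of principal and $200
print(meets_loss_threshold(50, 10_000))   # only 0.5% of principal
print(meets_loss_threshold(5, 100))       # 5% of principal but only $5
```

Only the first case clears both floors. The other two show why the Impact section should state the loss both as a percentage and as a dollar amount.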
6.4 Impact — numbers, not adjectives
| Weak impact | Strong impact |
|---|---|
| "Significant fund loss is possible" | "At block 18,500,000 an attacker can drain $3.8M (90% of pool) at a flash-loan cost of 0 fee (Aave). Net profit ~$3.8M." |
| "Users could be harmed" | "Each depositor in the bottom decile (deposits <$500) loses ~3% of principal to rounding accumulation over a 90-day holding period." |
| "This is a critical issue" | "Critical: a permanent freeze of all funds. The pause setter has no inverse and no role-rotation; recovery requires redeploy." |
Specific numbers move severity calls upward. Vague impact moves them downward.
If you don’t have concrete numbers (because the bug is conceptual rather than exploitable for cash), be explicit:
“Impact is bounded by the number of users with pendingRewards > 0 at the time of the upgrade; at writing, this is ~1,200 users with median pending reward of 1,800.”
That’s still numerical. Even bounded enumeration beats “this could affect users”.
6.5 PoC — prefer Foundry; show inputs and outputs
A Foundry PoC is the standard. Anything else (Hardhat, conceptual sequence, “exploit script not included”) loses credibility instantly.
Template:
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

import "forge-std/Test.sol";
import "../src/Vault.sol";
import "../src/MockERC20.sol";

contract Exploit_DonationAttack is Test {
    Vault vault;
    MockERC20 asset;
    address victim = address(0xBEEF);
    address attacker = address(this);

    function setUp() public {
        asset = new MockERC20("USDC", "USDC", 6);
        vault = new Vault(IERC20(address(asset)));
        // Seed an empty vault scenario; we are the first depositor (attacker)
    }

    function test_donationAttack_drains_victim_deposit() public {
        // 1. Attacker deposits the minimum (1 wei)
        asset.mint(attacker, 1);
        asset.approve(address(vault), 1);
        uint256 attackerShares = vault.deposit(1, attacker);
        assertEq(attackerShares, 1, "first-deposit 1:1 ratio");

        // 2. Attacker donates a large amount directly to vault (no mint)
        uint256 donation = 1_000e6; // 1,000 USDC
        asset.mint(attacker, donation);
        asset.transfer(address(vault), donation);

        // 3. Victim deposits 500 USDC, expects ~500 worth of shares
        asset.mint(victim, 500e6);
        vm.startPrank(victim);
        asset.approve(address(vault), 500e6);
        uint256 victimShares = vault.deposit(500e6, victim);
        vm.stopPrank();

        // 4. Due to rounding, victim shares are 0; all value is
        //    absorbed by the attacker's single share
        assertEq(victimShares, 0, "victim received zero shares: finding!");

        // 5. Attacker redeems, walking away with both deposits
        uint256 attackerOut = vault.redeem(attackerShares, attacker, attacker);
        emit log_named_uint("attacker walks with USDC (6 dp)", attackerOut);
        assertGt(attackerOut, 1_500e6, "attacker should take >1500 USDC");
    }
}
Run output included in the submission:
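[PASS] test_donationAttack_drains_victim_deposit() (gas: 217891)
Logs:
attacker walks with USDC (6 dp): 1500000001
The numbers make the finding undeniable: the attacker redeems 1,500,000,001 micro-USDC (about 1,500 USDC) against an outlay of 1,000,000,001 (the 1 wei deposit plus the 1,000 USDC donation), so the victim's entire 500 USDC deposit ends up as attacker profit.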
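Before trusting a logged number, it helps to replay the integer arithmetic off-chain. The sketch below assumes the vault prices shares as `assets * totalSupply / totalAssets` with a 1:1 first deposit (as the PoC's first assertion implies), and compares it with a virtual-offset formula of the kind discussed in §6.6. The function names and the offset of 6 are illustrative assumptions, not the contest code.

```python
USDC = 10**6  # one USDC in 6-decimal base units, as in the PoC

def naive_to_shares(assets, supply, total):
    # 1:1 on the first deposit, floor division afterwards.
    return assets if supply == 0 else assets * supply // total

def naive_to_assets(shares, supply, total):
    return shares * total // supply

OFFSET = 6  # hypothetical _decimalsOffset

def offset_to_shares(assets, supply, total):
    return assets * (supply + 10**OFFSET) // (total + 1)

def offset_to_assets(shares, supply, total):
    return shares * (total + 1) // (supply + 10**OFFSET)

def replay(to_shares, to_assets):
    """Replay PoC steps 1-5; return (victim shares, attacker payout, attacker P&L)."""
    supply = total = 0
    attacker_shares = to_shares(1, supply, total)          # step 1: 1 wei deposit
    supply += attacker_shares
    total += 1
    total += 1_000 * USDC                                  # step 2: donation, no mint
    victim_shares = to_shares(500 * USDC, supply, total)   # step 3: victim deposit
    supply += victim_shares
    total += 500 * USDC
    attacker_out = to_assets(attacker_shares, supply, total)  # step 5: redeem
    outlay = 1 + 1_000 * USDC
    return victim_shares, attacker_out, attacker_out - outlay

print(replay(naive_to_shares, naive_to_assets))    # -> (0, 1500000001, 500000000)
print(replay(offset_to_shares, offset_to_assets))
```

Under the naive formula the replay reproduces the logged 1500000001 and a profit equal to the victim's 500 USDC. Under the offset formula the victim receives a non-zero share count and the attacker redeems less than the outlay, which is the property a mitigation test should pin down.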
For higher-severity findings (especially flash-loan-amplified ones), fork mainnet:
function setUp() public {
    vm.createSelectFork(vm.envString("MAINNET_RPC"), 18_500_000);
    vault = Vault(0x...real address...);
}
A fork-test PoC reproduces against real on-chain state and is the highest-credibility evidence.
6.6 Recommendation — specific, with anticipated side effects
## Recommended Mitigation
Replace `convertToShares()` with the virtual-offset pattern (as in OpenZeppelin's `ERC4626.sol` v5+):
```solidity
function convertToShares(uint256 assets) public view returns (uint256) {
    // Virtual shares plus one virtual asset make the share price hard to inflate by donation.
    uint256 supply = totalSupply() + 10 ** _decimalsOffset;
    uint256 assetsBase = totalAssets() + 1;
    return (assets * supply) / assetsBase;
}
```
Where `_decimalsOffset = 6` (or appropriate for the asset’s decimals).
Side effects to verify:
- Initial deposits no longer mint shares 1:1 with assets: shares are scaled by `10**_decimalsOffset`, and a small fraction of any donated value accrues to the virtual shares. This is intended and stays in the protocol as reserve.
- Existing depositors are unaffected if migration sets the offset correctly.
- The donation attack vector is bounded to `10**_decimalsOffset` worth of asset; defenders accept this small loss as the price of mitigation.
Tests to add:
- A test that attacker cost to drain a 1-wei depositor is bounded to ≥ `10**_decimalsOffset` of asset.
- A test that the virtual offset reserve is non-zero after first deposit.
A recommendation that ends with **side effects + tests to add** signals senior auditor. A recommendation that ends with "use safe math" or "consider adding a check" signals junior.
### 6.7 The "common downgrade" patterns and how to preempt them
| Common downgrade | How to preempt |
|-----------------|----------------|
| "Requires admin error" | Either argue admin compromise is a documented risk **or** show the path doesn't need admin. Make the path explicit. |
| "Requires governance compromise" | Argue with flash-loan governance: how much voting power costs how much capital; whether timelock is sufficient. Use numbers from the contest's deployed config. |
| "Requires external preconditions unlikely on mainnet" | Show the preconditions are routine (e.g., "in the last 30 days, ETH/USD on UniV3 has hit this state 17 times"). Solodit + on-chain history. |
| "Design choice, not a bug" | Cite the spec / docs / NatSpec that contradicts the behavior. If docs and code disagree, *that's* the finding (doc bug or code bug, one of them). |
| "Out of scope" | Pre-check the README scope rules. Don't waste a submission slot on out-of-scope. |
| "Dupe of <other warden>" | This is downgrade only for *unique-share*; the finding is still valid. Don't treat it as failure — it's normal. The fix is novelty, not better writing of the same bug. |
| "Insufficient PoC" | A working Foundry test is the gold standard; supplemental scripts only as an aid to it. |
| "Theoretical, no realistic exploit" | Provide the realistic-exploit scenario in the Impact section with mainnet numbers. |
### 6.8 Worked example — a finding from notes to submission
**Stage 1 — notes (from the finding journal)**:
> "Vault.sol L142 — `convertToShares` divides assets by totalAssets() — first depositor can manipulate. donation attack? mint 1 share, then transfer huge token to vault, next depositor gets 0 shares due to rounding. classic OZ4626 pattern. did the team add the virtual offset? no, they're using the naive form. check inheritance — extends ERC20 directly, not ERC4626. so they wrote it themselves. confirmed: vulnerable."
**Stage 2 — confirmed by PoC** (the test from §6.5).
**Stage 3 — submission**:
```markdown
# `Vault.deposit()` allows first-depositor donation attack — subsequent depositors receive zero shares
## Summary
A donation attack against `Vault.convertToShares()` lets a first depositor who holds
a single share inflate the share price by direct transfer, so that later deposits
smaller than the price of one share mint zero shares and accrue to the attacker.
## Severity
**High** under Code4rena §3-H. Direct loss of user funds, no privileged role required,
attacker capital fully recoverable (1 wei deposit plus a donation that is redeemed back).
## Vulnerability Details
`Vault.convertToShares()` computes shares as `assets * totalSupply / totalAssets()`.
When `totalSupply == 1` and `totalAssets()` is inflated by a direct token transfer
(donation), a user depositing `assets < totalAssets() / totalSupply` receives zero
shares and contributes their deposit pro-rata to the existing single share — all of
which is owned by the attacker.
This is the well-known ERC-4626 first-deposit / donation attack. The standard
mitigation is the virtual-share offset adopted by OpenZeppelin in `ERC4626` v5+;
the protocol's custom implementation does not include it.
## Impact
Every user whose deposit is smaller than `totalAssets() / totalSupply` (the price of
one share after the donation) receives 0 shares and loses 100% of their deposit. For
an attacker donation of 1,000 USDC into a fresh vault, every subsequent deposit of
1,000 USDC or less mints 0 shares; total loss = sum of victim deposits, captured by
the attacker.
The PoC drains a 500 USDC victim deposit. The attacker's 1,000 USDC donation is
recovered on redeem, so the attacker's net cost is gas only.
## Proof of Concept
[See `test/exploit/DonationAttack.t.sol` — full test in submission.]
[forge test output as above, showing 1,500,000,001 micro-USDC walked.]
## Tools Used
Foundry; manual review of `Vault.sol:142–168`.
## Recommended Mitigation
Adopt OpenZeppelin's virtual-offset pattern in `ERC4626` v5+...
[as in §6.6]
## References
- [OpenZeppelin ERC4626 docs](https://docs.openzeppelin.com/contracts/api/token/ERC20#ERC4626)
- [The "donation" attack on ERC-4626 (Akshay Srivastav)](https://mixbytes.io/blog/overview-of-the-inflation-attack)
- Solodit: [Code4rena Y2024-AAVE-V3 finding M-04](https://solodit.cyfrin.io/?...) [verify]
- [[Tuan-15-Audit-Methodology-Tooling]] §6.3 — ERC-4626 invariants
```
**Stage 4 — judge ruling**: probably accepted as High (or downgraded to Medium if dupes are heavy and primary went to a better-written submission). If downgraded for reasons you disagree with, escalate (§7.4) with explicit rubric reference.
7. Judging Culture — What Makes a Finding Valid, Invalid, or Downgraded
7.1 What invalidates a finding
Across all platforms, common invalidation reasons:
| Reason | Description |
|---|---|
| Out of scope | The bug is in a file/contract the contest README explicitly excluded. Always re-check. |
| Requires admin error (as primary) | “Admin sets feeBps to >10000” — admin is assumed competent unless README says otherwise. Sherlock and C4 both apply this strictly. |
| Requires governance compromise (as primary) | Same logic. Unless the bug is in the governance mechanism, governance-attack-needed-to-trigger doesn’t pass. |
| Documented behavior / design choice | If the README, spec, or NatSpec explicitly describes the behavior as intended, it’s not a bug. Doc-vs-code disagreements are findings; doc-acknowledged design choices are not. |
| Negligible impact | Sherlock’s strict thresholds; even C4 will dismiss findings where the loss is sub-dust. |
| Theoretical without realistic exploit | “Could be exploited if X” where X never happens in practice. Provide on-chain evidence X happens. |
| Already reported in prior audit & acknowledged / wontfix | Re-submitting these wastes a slot. Always read prior audits. |
| Compiler / dependency bug | Unless the contest README explicitly includes them. |
| Front-running on private-mempool chains | Sherlock explicitly treats this as out of scope on Arbitrum / Optimism / Base / etc. |
7.2 What downgrades a finding
Even valid findings get downgraded:
| Original | Downgraded to | Why |
|---|---|---|
| High | Medium | Requires specific market state (e.g., particular oracle update timing) |
| High | Medium | Requires specific user behavior (e.g., a user signs an unusual permit) |
| Medium | Low | Impact is real but bounded to one user’s small deposit, no protocol-wide effect |
| Medium | QA | Behavior is suboptimal but not a value-loss vector |
| Any | Invalid | Misunderstanding of how Solidity / EVM / a library works |
You’ll see “downgrade to QA” a lot in C4. It’s the judge saying “this is a useful observation but not severity-paying”. Don’t treat it as personal — fold into your QA report for future contests.
7.3 “Dupe wars” — primary vs supporting
In a dupe group of 10 wardens reporting the same bug, the judge designates one as primary (best write-up) and 9 as supporting. The pool’s share for that finding is then split via a slot-share formula favoring the primary.
For C4 specifically (formula has evolved; [verify] at the time of contest):
slot share for primary = base_share × primary_bonus_multiplier
slot share per supporting = base_share / (n_supporting + 1) (rough; varies)
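Numerically: in a 10-way dupe of a $12k finding, an even split would pay $1.2k each, while a primary designation lifts the payout to roughly $1.5k–$2.5k [verify]. Quality of write-up is income-multiplicative.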
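To see how much the primary designation is worth, a weighted split is enough. The sketch below is a simplified model, not C4's published formula: each supporting warden gets weight 1.0, the primary gets a hypothetical `primary_weight`, and the finding's payout is divided by total weight. The dollar amount in the example is hypothetical.

```python
def split_dupe_group(finding_payout: float, n_wardens: int,
                     primary_weight: float = 1.5) -> dict[str, float]:
    """Weighted split of one finding's payout across a dupe group.

    Supporting wardens weigh 1.0 each; the single primary weighs primary_weight.
    """
    total_weight = primary_weight + (n_wardens - 1)
    supporting = finding_payout / total_weight
    return {"primary": primary_weight * supporting, "supporting": supporting}

# Hypothetical 10-way group sharing a 12,000 USD finding at two primary weights.
for weight in (1.5, 2.0):
    shares = split_dupe_group(12_000, 10, primary_weight=weight)
    print(weight, round(shares["primary"]), round(shares["supporting"]))
```

Even under this toy model the primary earns several hundred dollars more than a supporting warden on the same bug, so polishing the write-up is paid work.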
To win primary:
- Title clarity — judge skims titles when picking primary.
- Severity rationale — explicit rubric reference, defended in advance.
- Working PoC — judges have stopped picking PoC-less submissions as primary across most contests.
- Recommendation quality — specific code fix, anticipating side effects.
- No filler — every paragraph adds information.
The trade-off: don’t optimize only for primary — submitting more findings is also high-EV. The right strategy depends on contest length and your speed.
7.4 Appeals / escalations — the most undervalued process
Every platform has an appeals/escalations window after the preliminary judgment:
- Code4rena: “Post-Judging QA” period; wardens can file escalations on specific findings via the contest’s GitHub issues, citing rubric. The judge or a higher-tier reviewer revisits.
- Sherlock: explicit “Escalation Period”; watsons file via the contest dashboard. Senior watson + lead auditor + Sherlock team adjudicates.
- Cantina: dispute window with comment threading; senior researchers and Cantina staff adjudicate.
Escalation success rate [verify with recent data] across platforms: 15–30% of escalations succeed in changing the ruling. That’s high — much higher than most wardens assume. The reason: judges are time-limited and sometimes ship rulings with one-line rationales that don’t survive a careful re-read.
Conditions under which to escalate:
- Severity was downgraded with a one-line rationale you can rebut with specific rubric language.
- Your finding was marked dupe with another that has a different root cause (de-dupe error).
- You were marked supporting in a dupe group where your write-up is objectively better-developed (PoC + numbers vs prose-only).
- Your finding was marked invalid because of “assumption X” that the contest README does not state.
- A judge appears to have misunderstood the technical claim — produce a clarifying PoC.
Conditions under which to not escalate:
- You disagree with severity but can’t cite the rubric.
- You think your finding “should be more important” but offer no new evidence.
- You’re trying to convert dupe→primary via writing skill alone (sometimes accepted on platforms; usually not).
Tone of escalations matters. Cite the rubric verbatim, attach the PoC link, keep it short. Avoid emotional language. Judges are auditors too — meet them in the same register.
7.5 Long-tail: the “cancer” problem
A community term for: spamming low-effort findings hoping a few survive judging. Some wardens submit 30–50 findings, most invalid, on the bet that the judge can’t quickly invalidate all of them. Platforms have responded:
- C4: introduced “insufficient quality” penalties that reduce a warden’s pool share if too many findings are obviously invalid.
- Sherlock: tracks watson submission quality across contests; a bad track record reduces future visibility.
- Cantina: invitations to curated contests depend on quality history.
The takeaway for you: aim for high signal density. Five carefully-written H/M findings beats twenty mixed ones, both in EV (because of slot-share math) and in reputation (because judges remember).
8. Leaderboard Math — How Pools Pay Out
8.1 Code4rena slot-share formula (snapshot — [verify] against live docs)
C4’s slot-share has evolved through several revisions. Roughly:
Each finding has a fixed slot value (function of severity & pool):
H = 10 slots, M = 3 slots (illustrative — verify per contest)
Per finding payout = (pool_for_HM × slots_for_this_finding) / total_slots
Per warden share of finding =
if primary: slots × primary_share_multiplier / total_warden_count_in_group
if supporting: slots × supporting_share_multiplier / total_warden_count_in_group
QA + Gas reports: separate sub-pool (often 5–15% of total)
The structure favors:
- Severity (H > M > L in non-linear ratio).
- Uniqueness (n_warden_count_in_group = 1 → maximum per-warden share).
- Primary status (multiplier ~1.5–2× supporting share).
- Volume (more findings = more slot accumulation, even if no uniques).
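Implication: a single solo H can out-earn several deduplicated findings. The dollar figures of this example are incomplete here (8k, 1k–4k, 2k–4k), so recompute them from the contest's actual pool [verify].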
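The first two lines of the formula above can be sketched directly. The example uses the illustrative slot values from the text (H = 10, M = 3), a hypothetical pool of 50,000 USD, and a plain even split inside a dupe group; the primary and supporting multipliers are left out because their live values need verifying per contest.

```python
SLOTS = {"H": 10, "M": 3}  # illustrative slot values from the text

def finding_payouts(pool_for_hm: float, severities: list[str]) -> list[float]:
    """Payout per finding = pool * slots_for_this_finding / total_slots."""
    total_slots = sum(SLOTS[s] for s in severities)
    return [pool_for_hm * SLOTS[s] / total_slots for s in severities]

# Hypothetical contest: two Highs and four Mediums share a 50,000 USD H/M pool.
severities = ["H", "H", "M", "M", "M", "M"]
payouts = finding_payouts(50_000, severities)

solo_high = payouts[0]          # warden is the only finder
five_way_high = payouts[1] / 5  # same severity, even split across 5 finders
print(solo_high, five_way_high, payouts[2])
```

In this toy contest a solo High pays 15,625, a five-way High pays 3,125 per warden, and a solo Medium pays 4,687.50. That is the uniqueness effect the list above describes.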
8.2 Sherlock payout structure
Sherlock pays based on:
- High vs Medium tier (different per-finding pools).
- Number of valid finders per finding (split).
- Lead senior watson bonus if applicable to the contest.
Sherlock historically paid out closer to:
per-finding pool for a H = ~$20k–$50k (function of total pool & severity mix)
split across n finders, with senior watson bonus 5–10% off the top
[verify] with the latest published payout examples; Sherlock has changed formulas multiple times.
8.3 Tiered rewards and brackets
QA / Gas reports use bracketed grades instead of slot share:
| C4 QA report grades (illustrative) | Pay (% of QA sub-pool) |
|---|---|
| Grade A (top ~10% of QA reports) | 25–35% |
| Grade B (next ~25%) | 10–15% |
| Grade C (next ~30%) | 3–6% |
| No award | 0 |
Even a Grade B QA report can pay roughly $1k–$1.5k. Writing a coherent QA report is high-ROI for ~3–4 hours of work: judges read structure, not volume.
8.4 Hyped vs unhyped contests
A contest’s attendance (number of wardens) is the strongest predictor of your per-finding share:
- High-hype contests (major protocols, large pools, public hype on Twitter): 150–400 wardens. Solo finds rare; dupe groups deep; per-warden EV often lower than mid-pool contests.
- Mid-hype contests (mid-cap protocol, $50–200k pool, moderate Twitter): 40–120 wardens. The sweet spot — depth enough to have prizes, sparse enough to win uniques.
- Low-hype contests (small pool, niche category, weak marketing): 10–40 wardens. Easy uniques but small absolute pool. High $/finding, low total income.
Strategic implication: top wardens often skip the most hyped contests in favor of mid-hype contests, where their edge is more rewarded. New wardens benefit from low-hype contests as low-pressure calibration.
8.5 Per-platform “where’s the money in 2025–26”
Rough averages [verify per quarter]:
- C4: largest absolute pool $, broadest warden field, lowest per-warden EV in hyped contests, best for cadence.
- Sherlock: smaller pools, stricter rubric, higher per-finding EV when valid, fewer dupe-group splits.
- Cantina: largest individual prizes for top wardens in curated contests, hardest entry for new wardens, best for established researchers.
- Hats: lower volume, token-denominated reward, more variable.
A pragmatic 12-month income model (illustrative, not commitment):
| Period | Effort | Realistic gross (USD) |
|---|---|---|
| Months 1–3 (3 contests) | 200 hours | 3,000 (likely net negative vs opp cost) |
| Months 4–6 (3 contests) | 200 hours | 10,000 |
| Months 7–9 (3 contests) | 200 hours | 25,000 |
| Months 10–12 (4 contests) | 250 hours | 60,000 |
By month 12 a calibrated warden is netting positive after opportunity cost; by month 18–24 the trajectory inflects sharply if combined with private engagement leads.
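The ramp is easier to judge as an hourly rate. The sketch below takes the table's illustrative figures at face value (gross in USD, effort in hours) and computes the implied rate per quarter and for the year. It is arithmetic on the table, not a forecast.

```python
# (period, hours, gross USD) copied from the illustrative table above.
ramp = [
    ("Months 1-3", 200, 3_000),
    ("Months 4-6", 200, 10_000),
    ("Months 7-9", 200, 25_000),
    ("Months 10-12", 250, 60_000),
]

hourly = [gross / hours for _, hours, gross in ramp]
total_hours = sum(hours for _, hours, _ in ramp)
total_gross = sum(gross for _, _, gross in ramp)

for (period, _, _), rate in zip(ramp, hourly):
    print(f"{period}: {rate:.0f} USD/hour")
print(f"Year: {total_gross} USD over {total_hours} hours = {total_gross / total_hours:.0f} USD/hour")
```

On these numbers the implied rate climbs from 15 to 240 USD/hour across the year. Comparing each quarter's rate against your own opportunity cost shows where the "net negative" months end.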
9. Calibration Practice — Solodit, Past Findings, the Daily Habit
9.1 Solodit as the central calibration tool
Solodit (https://solodit.cyfrin.io/) aggregates findings from Code4rena, Sherlock, Cantina, Spearbit, and other sources. It’s free.
For a serious warden, Solodit is the daily reading habit:
- Filter by platform, severity, protocol type, year.
- Read raw findings as they were submitted (with judge ruling).
- Compare your gut-call severity to the actual ruling.
The 100-finding study (the Lab in §11.1): read 100 findings, write down what severity you’d assign before scrolling to the ruling, and tabulate your hit rate. After 100 findings, your match rate reveals your calibration baseline:
- <40% match: severity calls are not yet aligned with platform conventions. Re-read the rubrics. Re-do the exercise.
- 40–60%: typical for first-quarter wardens. Continue practice.
- 60–75%: competition-ready. You’ll have predictable severity submissions.
- >75%: senior-level calibration. Your escalations will succeed at higher rates.
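The tabulation step is a few lines of code. The sketch below scores a list of (your guess, judge ruling) pairs and maps the match rate to the bands above. The band labels are shortened, the sample data is invented for illustration, and the text leaves the treatment of exact boundaries (60%, 75%) open, so the choice here is an assumption.

```python
def match_rate(pairs):
    """pairs: (my severity guess, judge ruling) tuples; returns percent matched."""
    matches = sum(1 for mine, ruling in pairs if mine == ruling)
    return 100 * matches / len(pairs)

def calibration_band(rate_pct):
    # Boundary handling is a choice made here, not something the bands specify.
    if rate_pct < 40:
        return "not yet aligned"
    if rate_pct <= 60:
        return "typical first-quarter"
    if rate_pct <= 75:
        return "competition-ready"
    return "senior-level"

# Invented sample of 10 findings (a real study would use 100).
sample = [
    ("H", "H"), ("M", "L"), ("M", "M"), ("H", "M"), ("L", "L"),
    ("M", "M"), ("H", "H"), ("M", "invalid"), ("L", "L"), ("M", "M"),
]

rate = match_rate(sample)
print(rate, calibration_band(rate))
```

Keeping the raw pairs rather than only the percentage lets you later check which direction your misses lean.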
9.2 Reading the famous wardens
Public author profiles on Solodit / C4 / Sherlock — sample (current as of late 2025 — [verify] since names rotate and rankings shift):
- trust1995 — Sherlock-heavy; concise write-ups; strong math edge.
- hansfriese — Code4rena top warden across many quarters; comprehensive QA reports.
- GalloDaSballo — extremely prolific across platforms; known for fork-test PoCs.
- cmichel — Solo-warden archetype; cross-platform.
- pashov — Solo and team; runs Pashov Audit Group, which sells private engagements; ex top-C4.
- 0xRajeev / Rajeev — Methodology-focused write-ups; useful for studying style.
- dirk_y — Defi-deep; Sherlock high ratings.
- kalou — Frequent unique finds; minimalist write-up style.
Pick 3 and read 5 of each warden’s findings. Note their consistencies — title format, PoC style, rubric reference style. Pattern-match the consistencies into your template.
9.3 Watching judges’ rulings as a stream
Each platform’s judge decisions are public:
- Code4rena: contest pages list final findings with severity. Compare against the warden’s submitted severity (often visible in the issue history). Patterns emerge in what gets downgraded.
- Sherlock: published findings with escalation history. Read escalation threads — these are gold for learning rubric interpretation.
- Cantina: detailed findings with judge / sponsor comment threads.
A weekly habit: 30 minutes scanning new rulings for one or two patterns. Over a quarter, your sense of “what passes vs fails” becomes nearly explicit.
9.4 Mock-judging exercise
For 10 findings in any recently closed contest:
- Read the title and severity claim only. Write your severity guess.
- Read the detail + PoC. Update your guess.
- Read the impact + recommendation. Lock in your final guess.
- Reveal the actual ruling. Tabulate.
When you and the judge disagree, write down why — in one sentence. Reasons usually cluster:
- “I missed the dependency on admin error.”
- “I didn’t recognize the loss threshold was below Sherlock’s 0.01%.”
- “The PoC seemed weak to me but the judge accepted the conceptual chain.”
Patterns in your own disagreement modes are the most actionable calibration insight you can produce.
9.5 The retrospective journal
After every contest, before checking your earnings:
# Contest <name> — Retrospective
## Findings submitted
| ID | Title | My severity | Judge ruling | Status | Notes |
|----|------|-------------|--------------|--------|-------|
| 1 | ... | H | H, dupe of 8 | accepted | tight PoC saved this |
| 2 | ... | M | invalid | rejected | required admin error — should have caught |
| 3 | ... | M | M, primary | accepted | unique find — virtual-offset deep dive |
## What I missed
- <bug class>: <why I missed it; e.g., didn't review module X enough>
- ...
## What I over-reported
- <finding>: <why it was weak; e.g., theoretical without realistic conditions>
## Calibration delta
- I called <n> findings High that were Medium → severity inflation
- I called <n> findings Medium that were High → severity deflation
- I missed <n> findings entirely
## Process changes for next contest
1. ...
2. ...
A 30-minute retrospective after each contest, accumulated over 12 contests, is the single most valuable thing you can do for your career.
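The "Calibration delta" section of the journal can be filled in mechanically from the findings table. The sketch below assumes severities are recorded as `H`, `M`, `L` or `QA` and that anything else in the ruling column (for example `invalid`) counts as a rejection. The rows are sample data in the spirit of the template, and the helper name is hypothetical.

```python
RANK = {"QA": 0, "L": 1, "M": 2, "H": 3}

def calibration_delta(rows):
    """rows: (my severity, judge ruling) pairs from the retrospective table."""
    delta = {"matched": 0, "inflated": 0, "deflated": 0, "invalid": 0}
    for mine, ruling in rows:
        if ruling not in RANK:
            delta["invalid"] += 1      # rejected outright
        elif RANK[mine] > RANK[ruling]:
            delta["inflated"] += 1     # I claimed more than the judge granted
        elif RANK[mine] < RANK[ruling]:
            delta["deflated"] += 1     # I claimed less than the judge granted
        else:
            delta["matched"] += 1
    return delta

# Sample rows: three mirror the template table, the fourth is an invented downgrade.
rows = [("H", "H"), ("M", "invalid"), ("M", "M"), ("H", "M")]
print(calibration_delta(rows))
```

Tracked across contests, a persistently high `inflated` count points at severity inflation, while a high `invalid` count points back to the scope and precondition checks in §7.1.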
10. Anti-Patterns (avoid; add to master checklist)
A consolidated list, drawing on platform docs, judge culture, and the lessons in §6–§7.
10.1 Submission-quality anti-patterns
- Vague “could lead to” language. Specific cause → specific consequence. “Could potentially affect users in some scenarios” reads as no-finding.
- Missing PoC for Medium-or-above. Conceptual chains rarely survive judging at M+.
- Wrong rubric reference. Citing Immunefi tiers on a Sherlock contest gets you downgraded. Match the platform.
- Spam findings. Submitting 30+ Lows hoping one is upgraded — invites quality penalties.
- Long-form prose without structure. Judges skim. Use headers, bullets, code blocks.
- PoC without numerical output. Show the actual values your test prints — they tell the story.
- Recommendation that says “use a check”. Specific code, specific function name, specific patch.
- Title that names the file but not the bug class. Strong titles include both.
- Impact without TVL or % loss. Numbers move severity.
- Severity claim without rubric language. “I think this is High” is not an argument.
- Same finding split across multiple submissions. Consolidate; multiple submissions of the same root cause are merged anyway, and look like spam.
- Out-of-scope findings. Always re-check scope at submission time, not during review.
- Findings already covered in a prior audit. Always read prior audit reports.
10.2 Process-level anti-patterns
- Submitting the day-of deadline. Submission systems fail; submit ≥4 hours early.
- Skipping the retrospective. Calibration only happens when you reflect.
- Trying to “win” every contest. Selection is a skill; skipping bad contests is part of the job.
- Reading no Solodit findings. Most calibration data is free.
- Not using escalations. 15–30% of escalations succeed; not appealing is leaving money on the table.
- Treating contests as primary income from month 1. Plan for 12-month ramp; ignore monthly P&L.
- Working >80 hours/contest without breaks. Quality plateaus; mental health collapses.
- Not building a finding-journal habit. Memory fails by Day 5 of an 8-day contest.
- No fork-test capability. For oracle / lending / AMM bugs, fork-test is the difference between “could happen” and “demonstrably worth $3.8M”.
10.3 Career-level anti-patterns
- Competing in isolation indefinitely. Senior wardens engage with the community (Twitter, Discord, conferences) — leads come from visibility.
- Ignoring private-audit opportunities once leaderboard-ranked. Hybrid (contest + private) maximizes income from month 12 onwards.
- Not specializing. Generalists plateau around 60th percentile. Specializing in one of: AMM math / lending / cross-chain / restaking / LST / non-EVM produces inflection.
- Not documenting your portfolio. A public list of valid findings (with link to Solodit) is the auditor’s resume.
11. Lab — Three Exercises for Calibration
11.1 Lab 1 — One closed Code4rena contest, hunt-and-compare
Goal: Pick one closed Code4rena contest on Solodit. Read the README and scope. Spend a timeboxed 4 hours hunting. Compare your finds to the published reports. Calibrate.
Steps:
- Choose contest — pick something mid-pool ($50–150k), category you’re not yet expert in. Examples: a lending market, a vault, a small cross-chain bridge. Avoid hyped megaprotocols (signal-to-noise too low for a first run).
- Freeze a fork — clone the contest repo at the contest commit. Don’t peek at findings yet.
- 4-hour timer — apply the phased model (§5) compressed: 30 min recon, 30 min tool sweep, 2.5 hours manual review (skip module-by-module rigor; speed-pass each entry point), 30 min write-up.
- Write 1–3 findings in submission format (§6). Self-assigned severity, justified.
- Reveal: open the contest’s findings page on Solodit. Compare:
- Did you find any of the published findings? Match by root cause, not title.
- Did you find any not in the published findings? (Almost certainly false positives — but verify.)
- Did you miss findings that, with hindsight, you should have found? Note category.
- Retrospective: 30 minutes. What was your find rate vs the median warden in that contest? What pattern recurs in your misses?
The first time you do this, expect 0–1 valid finds out of 4 hours matching the published set. That’s normal. Run the lab 3 more times across the next month; track find rate over time.
11.2 Lab 2 — Re-write one of your own historical bugs as a Code4rena-style finding
Goal: convert a bug you found in earlier lessons (e.g., the reentrancy PoC in Tuan-05-Vulnerability-Classes-Part-1 §7.4, or the donation-attack PoC in Tuan-15-Audit-Methodology-Tooling §18.3) into a submission-quality finding.
Apply the template in §6.1. Specifically:
- Title that names function + bug class.
- Severity claim with explicit C4 rubric reference.
- Impact with realistic mainnet-style numbers (even synthetic — use the seeded amounts from the lab).
- Working Foundry PoC, copy-pasted with output log.
- Recommendation with specific code change + side effects + tests-to-add.
- References to Solodit-indexed similar findings (find one matching root cause).
Submission test: have a peer (or yourself, after 48 hours of detachment) read it cold. Can they understand the bug in <2 minutes? Can they reproduce the PoC in <10 minutes from the repo? Can they write the fix from the recommendation without asking clarifying questions?
If any answer is no, iterate.
11.3 Lab 3 — Read 10 invalidated findings from a recent contest
Goal: build intuition for why findings get rejected.
Steps:
- Pick a recent (within last 6 months) closed contest with a published findings page that includes invalid / dupe / out-of-scope rulings. Code4rena’s archived contest pages include these; Sherlock’s escalation logs are excellent.
- Read 10 invalidated findings in detail (not just the ruling — the full submission + ruling rationale).
- For each, write the one-sentence reason for invalidation.
- Group reasons. Common buckets:
- Out-of-scope (~20–30%).
- Requires admin / governance / external precondition (~25–35%).
- Misunderstanding of Solidity / EVM / library behavior (~10–20%).
- Already in prior audit / acknowledged (~10–15%).
- Negligible impact / sub-threshold (~10–20%).
- Dupe (not invalidated, just merged) — note these separately.
- Output: a one-page summary of your top-3 most-common invalidation reasons in this contest, with a “preempt this in my next submission by…” for each.
This lab takes ~3 hours. After running it across 3 contests, your own submission’s invalid-rate drops markedly — typically from 40–60% (new wardens) to 10–20% (calibrated wardens).
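The grouping step in Lab 3 amounts to a frequency count. The sketch below tallies one-sentence reasons that you have already normalized into bucket labels and prints the top three. The ten labels are invented sample data, not results from a real contest.

```python
from collections import Counter

# Invented sample: one normalized bucket label per invalidated finding.
reasons = [
    "admin precondition", "out of scope", "admin precondition", "negligible impact",
    "out of scope", "admin precondition", "EVM misunderstanding", "out of scope",
    "negligible impact", "admin precondition",
]

top_three = Counter(reasons).most_common(3)
for reason, count in top_three:
    print(f"{reason}: {count}/{len(reasons)}")
```

The three labels it prints are the ones to pair with a "preempt this in my next submission by…" line in the one-page summary.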
11.4 Lab 4 — (stretch) Submit to a live contest
The previous labs simulate; this one is real.
Pick a live open contest with at least 5 days remaining. Apply §4 (scout), §5 (workflow), §6 (write-ups). Submit at least one finding (even a Medium or QA).
After the judging period, run §9.5 (retrospective journal). Compare your predicted severity to the judge’s call. Note dupe count. Note your hour count and your gross.
The first contest’s earnings are nearly irrelevant. The lab outcome is: you have a baseline. After three contests, you have a trend.
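The baseline and trend this lab asks for come down to three numbers: severity match rate against the judge, gross per hour, and net after opportunity cost. A minimal sketch of that bookkeeping follows. The `ContestRecord` structure, the function names, and every figure in the demo are hypothetical; the journal format in §9.5 may differ.

```python
from dataclasses import dataclass


@dataclass
class ContestRecord:
    name: str
    hours: float
    gross_usd: float
    # One (predicted_severity, judged_severity) pair per submitted finding.
    calls: list


def severity_match_rate(records):
    """Fraction of findings where the predicted severity equals the judge's call."""
    calls = [pair for record in records for pair in record.calls]
    if not calls:
        return 0.0
    matches = sum(1 for predicted, judged in calls if predicted == judged)
    return matches / len(calls)


def gross_per_hour(records):
    """Total gross divided by total hours across all recorded contests."""
    hours = sum(record.hours for record in records)
    if hours == 0:
        return 0.0
    return sum(record.gross_usd for record in records) / hours


def net_after_opportunity_cost(record, hourly_rate_usd):
    """Gross minus what the same hours would have earned at hourly_rate_usd."""
    return record.gross_usd - record.hours * hourly_rate_usd


if __name__ == "__main__":
    # Made-up journal entries, for illustration only.
    journal = [
        ContestRecord("contest-a", 100, 400.0, [("High", "Medium"), ("Medium", "Medium")]),
        ContestRecord("contest-b", 60, 900.0, [("Medium", "Medium"), ("Medium", "Invalid")]),
    ]
    print(f"match rate: {severity_match_rate(journal):.0%}")
    print(f"gross per hour: ${gross_per_hour(journal):.2f}")
    print(f"net, contest-a at $50/hr: ${net_after_opportunity_cost(journal[0], 50):.0f}")
```

One contest gives a single point. Appending a record after each retrospective is what turns the baseline into a trend.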
12. Trade-offs & Open Debates
| Decision | Option A | Option B | Auditor’s view |
|---|---|---|---|
| Volume vs uniqueness | Submit 15 findings (mix of confidence) | Submit 5 high-confidence findings | Depends on contest length. Short contests reward volume; long contests reward uniqueness. Track $/finding by mode over 6 contests. |
| Pre-contest scouting time | 30 min “good enough” | 90 min thorough | 90 min on contests you commit to; skip the contest entirely if 30 min reveals red flags. |
| Tool reliance | Heavy (Slither + custom detectors + fuzz) | Light (manual-only) | Heavy for module-coverage, light for cross-cutting. Tools catch the easy; manual catches the hard. |
| Specialization | Generalist across all categories | Specialize in 1–2 (e.g., AMM + lending) | Specialize from month 6 onward. Generalists peak at 60th percentile; specialists land 90th+ in their niche. |
| Platform selection | All-in on one (e.g., C4) | Diversify across C4 + Sherlock + Cantina | Diversify after month 6. Different platforms reward different submission styles, and the contrasting signals are useful. |
| QA report effort | Skip it (focus H/M) | Polish it (Grade A target) | Polish it. 3–4 hours for $1–3k of Grade B is excellent ROI; skipping QA is leaving money on the table. |
| PoC quality | Minimal “shows the bug” | Polished fork-test with realistic numbers | Polished for Medium+; minimal for Low. The PoC is severity-defending evidence. |
| Escalations | Skip (“waste of time”) | Escalate aggressively | Escalate carefully. ~30% success rate is high; don’t escalate without rubric language; don’t escalate without new evidence. |
| Public sharing of process | Private until established | Tweet / blog / Discord | Public from day 1, modestly. Visibility compounds; the auditors who hit 6-figure annuals all built audiences alongside their finding portfolios. |
| Income mix in year 1 | 100% contests | Mix with private leads | If unproven publicly, 100% contests until you have a 6-finding Solodit portfolio. Then mix. |
13. Quiz (≥80% to advance)
- Q: A new warden hits an 8-day Code4rena contest with 100 hours committed. Pool is […].
  A: […] 300–500 […] against a $50/hr opportunity cost ($5,000 for 100 hours): heavily negative. Calibration value is still high if they run a retrospective.
- Q: Sherlock’s severity rubric: your finding causes a 0.005% loss of user fees under a precise market condition. Severity?
  A: Invalid. Sherlock’s Medium threshold is 0.01% AND >$10 of fees. 0.005% is below the threshold. The finding may still appear in QA-equivalent notes (not paid on Sherlock).
- Q: A contest README says: “Owner is assumed to act honestly except for the specific oracle-rotation function, which is in scope for Owner-induced attacks.” You find that any function with an `onlyOwner` modifier and a call to `transferFrom` can drain the protocol. Severity?
  A: Out of scope for any owner abuse other than oracle rotation. The contest README narrowed Owner-attack scope to oracle rotation; other owner-induced attacks are out of scope. File as QA or note it for the protocol’s information; don’t expect payment.
- Q: You write up a finding as High. The judge rules Medium, citing “requires specific market condition (UniV3 pool depth < $X)”. The README didn’t state this assumption. Do you escalate?
  A: Yes. The escalation cites that the depth condition isn’t in the README and demonstrates from on-chain history that the condition is met on a recurring basis (e.g., “in the last 90 days, this depth condition was met N times”). At a 15–30% escalation success rate, this is clearly EV-positive.
- Q: What’s the slot-share intuition for why a unique High pays much more than a same-severity 10-dupe High?
  A: The pool for that finding is divided across warden contributors using a formula where supporting wardens share a fraction (often 1 / (n+1) or similar). A solo finder captures the entire slot value; 10 dupes split a slightly larger primary-bonus-augmented pool, but each supporting share is ~10× smaller than the solo. Uniqueness is income-multiplicative.
- Q: Code4rena vs Sherlock: which is “harder” for a finding to be valid?
  A: Both are strict, in different ways. Code4rena considers likelihood (a high-impact, very-unlikely bug may be downgraded). Sherlock does not consider likelihood for validity but has tighter loss thresholds (>0.01% AND >$10 minimum). Practical answer: high-impact-low-likelihood bugs tend to pass Sherlock; high-likelihood-medium-impact bugs tend to pass Code4rena. Choose platform partly by your bug’s profile.
- Q: You spend 90 minutes scouting a contest. The README is thin, the team is unresponsive on Discord, the protocol forks Compound V2 with minor changes, and 200 wardens have already signed up. Decision?
  A: Skip. Unresponsive team → ambiguous rulings; well-known Compound fork → low novelty edge; 200 wardens → deep dupe groups on the obvious bugs. Selection is a skill; better contests are coming.
- Q: For a Foundry PoC of a flash-loan-amplified oracle attack, why is a fork test much stronger than a mock test?
  A: A mock test uses arbitrary numbers, so judges discount “10x manipulation cost” as theoretical. A fork test uses real on-chain pool depths and flash-loan capacities, producing concrete USD numbers the judge can verify. Same finding, dramatically stronger evidence.
- Q: A judge marks your finding “dupe of #42” but #42 has a different root cause (different function, different bug class); only the user-visible symptom (fund loss) is similar. Action?
  A: Escalate as a de-dupe error. The argument is “the finding at #42 has root cause RC1; my finding has root cause RC2; the fixes are different code in different files.” Provide a working PoC showing your bug exists even after a hypothetical fix to #42. Most de-dupe escalations succeed when the auditor can show non-overlapping root causes.
- Q: After 12 contests, your $/hr is $30 gross, your unique-find rate is 5%, and your severity match rate against judges is 50%. What’s the next move?
  A: Calibration is the bottleneck. A severity match rate of 50% is below the 60% inflection point. Spend a focused month on Solodit reading (Lab §11.1 style), with explicit before/after severity prediction. Continue at 1 contest/month, but de-emphasize hours-per-contest and emphasize study. If after 4 more contests the match rate isn’t ≥60%, reconsider whether the auditor career fits or whether you’d be better off as a Solidity dev / DeFi engineer.
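The slot-share question in the quiz can be made concrete with a few lines of arithmetic. The sketch below assumes a decay-and-split formula of the shape weight × decay^(n − 1) / n, which is the general shape Code4rena's public docs have used for award slices. The severity weight of 10, the decay of 0.85, and the function name are assumptions for illustration, and primary-report bonuses are ignored. Check the platform's current docs before relying on any of these values.

```python
def per_warden_share(severity_weight, n_finders, decay=0.85):
    """Share units each finder receives when n_finders report the same issue.

    Assumed shape: severity_weight * decay ** (n_finders - 1) / n_finders.
    The decay value is an assumption, and any bonus for the report selected
    as primary is left out.
    """
    if n_finders < 1:
        raise ValueError("n_finders must be at least 1")
    return severity_weight * decay ** (n_finders - 1) / n_finders


if __name__ == "__main__":
    high_weight = 10  # assumed weight for a High finding
    solo = per_warden_share(high_weight, 1)
    for n in (1, 2, 5, 10):
        share = per_warden_share(high_weight, n)
        print(f"{n:>2} finder(s): {share:.3f} share units each, solo/this = {solo / share:.1f}x")
```

With `decay=1.0` the function reduces to a plain even split, where 10 finders each get a tenth of the solo share. Any decay below 1 widens that gap, so each added duplicate costs a finder more than the even split alone would suggest.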
14. Bonus Deliverables
- Decision-form template (§4.5) filled for at least 3 hypothetical contests, with go/no-go reasoning.
- Re-writeup of one of your own historical bugs from Weeks 5–14 in full Code4rena-style finding format.
- Solodit calibration study: 100 findings read with pre-ruling severity predictions; final tabulated match rate.
- Invalidated-findings analysis from one recent contest (Lab §11.3).
- First live contest submission (Lab §11.4) + retrospective journal.
- Updated audit-checklist-master with this chapter’s anti-patterns.
15. Where this leads
Two parallel arcs from here:
Public-arena loop:
- Pick one contest per month using §4’s selection criteria.
- Run the §5 phased workflow.
- Submit using the §6 template.
- Escalate when warranted (§7.4).
- Retrospective journal (§9.5).
- Solodit study between contests.
Over 12 months this produces a portfolio. The portfolio produces leads. The leads produce private-engagement income at rates Tuan-15-Audit-Methodology-Tooling §3.3 quotes.
Bug-bounty parallel: Tuan-Bonus-Bug-Bounty-Immunefi covers the continuous-bug-bounty side. Many top wardens track a single major protocol’s Immunefi program continuously while doing contests. The bounty’s higher per-finding payout (commonly scaled to around 10% of funds at risk, with caps of $1M+ on large programs) rewards the unique critical that contests sometimes don’t surface.
Eventually, hybrid:
Income shape after 18–24 months (representative):
- ~30% competitive contests
- ~50% private retainer / boutique audits
- ~15% Immunefi / continuous bug bounty
- ~5% speaking / writing / consulting
The contests stay in the mix because they’re calibration. The day a senior auditor stops calibrating against the community is the day their judgement starts to drift — and the bugs they miss get progressively more expensive when missed in private work.
The market is a feedback loop. Stay in it.
Last updated: 2026-05-16
See also: Roadmap · References · Tuan-15-Audit-Methodology-Tooling · Tuan-16-Report-Writing-Capstone · Tuan-Bonus-Bug-Bounty-Immunefi · severity-rubric-immunefi-c4 · audit-checklist-master · Tuan-05-Vulnerability-Classes-Part-1