Bonus — Audit Competition Playbook (Code4rena, Sherlock, Cantina, Hats)

“A private audit pays for a calendar and a reputation. A contest pays for one finding. Both are useful, neither is the other. The auditors who compound the fastest in 2024–2026 are the ones who treat contests as a calibration loop — short feedback cycles, public judges, peer benchmarks — and feed what they learn back into private engagements. Don’t compete to earn, compete to learn what 60-odd other people see in the same code that you didn’t.”

Tags: web3-security methodology audit-competition code4rena sherlock cantina hats solodit leaderboard
Learner: Past Tuan-15-Audit-Methodology-Tooling and Tuan-16-Report-Writing-Capstone → ready to enter the public arena
Time: 4–5 days lesson + an ongoing 12-month practice loop
Related: Tuan-15-Audit-Methodology-Tooling · Tuan-16-Report-Writing-Capstone · Tuan-Bonus-Bug-Bounty-Immunefi · severity-rubric-immunefi-c4 · audit-checklist-master · Tuan-05-Vulnerability-Classes-Part-1

1. Context & Why

1.1 Why a competition tier exists at all

Until ~2021, smart-contract audits were exclusively a private-engagement business: a protocol paid one firm a fixed fee for a fixed window of attention. The model has three structural weaknesses for the protocol:

Diversity of attack imagination is bounded by the team size. A four-auditor team has four mental models. The bug they all miss is the bug that ships.
Auditors face no payoff distribution that selects for the best-on-the-day. Whether you find zero High findings or three, you get the same fee.
No public signal on auditor quality — clients hire by brand reputation and word-of-mouth, both lagging indicators.

Audit competitions (also called “contests” or “crowdsourced security reviews”) flip the structure: a protocol posts code and a prize pool; anywhere from 30 to 400+ independent researchers (“wardens” / “watsons” / “researchers” depending on platform) review in parallel for 5–30 days; judges classify and de-duplicate findings; payout is proportional to severity, uniqueness, and warden contribution.

For the researcher the model has three structural advantages:

Faster feedback loop than private audits. Submit a finding → judge decides in 2–6 weeks → you see your severity vs. peers’ severities vs. judge ruling. Calibration data accumulates per finding.
Real money tied to peer-relative performance. A unique High in a $200k pool with 80 wardens can pay $10–25k; the same warden doing the same work for the same protocol on a private retainer might bill 1–2 days of work.
Reputation builds publicly. Leaderboard standings, Solodit author profiles, Sherlock watson rank, Cantina researcher rank are searchable. A 6-month run of consistent placements is enough to start landing private leads.

These three things compound. Most senior independent auditors in 2024–2026 (trust1995, hansfriese, GalloDaSballo, cmichel, pashov, 0xRajeev, dirk_y, kalou, etc.; names current at the lesson’s writing date) used competitions as the first 12–24 months of their career, then mixed contests with private engagements once private rates became competitive.

The honest framing: most wardens lose money. Top decile makes a living; top centile makes a top-firm partner-track income. The median Code4rena warden in a given contest earns less than the cost of their time at minimum wage. The lesson is not “compete and you’ll get rich” — it’s “compete to learn, and the income will follow whoever calibrates fastest”.

1.2 What this chapter covers

By the end you can:

Pick which platform fits a given week’s available time and your current skill bracket.
Estimate expected payout from a contest before committing time (the ROI calculation).
Scout a protocol pre-contest (docs, prior audits, novelty estimate) in 60–90 minutes.
Allocate time within a contest using a phased pass model (recon → module → cross-cutting → write-up).
Write a finding in the style judges accept — title, severity rationale, impact, PoC, recommendation — and defend it through escalations.
Read 10 invalidated findings from a recent contest and explain why each was downgraded or rejected.
Calibrate your severity calls against Solodit’s aggregated rulings — closing the gap between “I thought it was High” and “the judge ruled Medium”.
Recognize the anti-patterns (spam, vague impact, missing PoC, wrong rubric) that classify a finding as low-effort or invalid.

1.3 Primary references

Source	URL	Notes
Code4rena Docs	https://docs.code4rena.com/	Submission, judging, severity rubric. Read end-to-end before first contest.
Code4rena Submission Guidelines	https://docs.code4rena.com/competitions/submission-guidelines	The single most-referenced page. Bookmark it.
Code4rena Severity Categorization	https://docs.code4rena.com/competitions/judging/severity-categorization	The rubric you’ll be judged against.
Sherlock Docs	https://docs.sherlock.xyz/	Lead auditor + watson model; stricter rubric.
Sherlock Judging Criteria	https://docs.sherlock.xyz/audits/judging/judging	Defines what counts as High vs Medium (Low isn’t paid).
Sherlock Audits Calendar	https://audits.sherlock.xyz/contests	Active + upcoming contests.
Cantina Docs	https://docs.cantina.xyz/	Marketplace + competitive reviews.
Cantina Competitions	https://cantina.xyz/competitions	Active contests; also lists private review marketplace.
Solodit	https://solodit.cyfrin.io/	Cross-platform finding aggregator. Single best calibration tool in the industry.
Hats Finance	https://hats.finance/	Continuous audits + bug bounty hybrid; less standardized rubric.
Hats Audit Competitions	https://app.hats.finance/audit-competitions	Active Hats contests.
Code4rena Zenith	https://code4rena.com/zenith	Curated, invite-only / vetted-researcher contests by C4.
Cantina × Spearbit	https://cantina.xyz/welcome	Cantina now runs Spearbit-style competitive + curated reviews.
C4 escalations / appeals process	https://docs.code4rena.com/competitions/judging/escalations	The “appeal a judge ruling” flow. Most expensive page if you skip it.

Many platforms iterate their rubrics every few quarters. Treat anything quoted in §3–§4 as a snapshot — re-read the source links before each contest. [verify] any specific dollar figure, percentage threshold, or rule clause against the live docs.

2. The Competition Tier — Platforms and How They Differ

2.1 At-a-glance

Platform	Started	Format	Severity tiers paid	Pool size (typical 2025–26 [verify])	Pool model	Judge model
Code4rena (C4)	2021	Open contest, 5–14 days	High / Medium (Low + QA bundled in QA report)	$30k–$1M+	Sponsor-funded prize pool	C4 judge pool (paid C4 judges)
Sherlock	2022	Open contest, 3–10 days, lead-auditor framing	High / Medium only	$30k–$500k	Sponsor pool + Sherlock pays watsons	Sherlock-internal lead judges
Cantina	2023	Open competition + curated/private reviews	High / Medium / Low	$50k–$1M+	Sponsor pool	Cantina senior researchers + Spearbit-affiliated leads
Hats Finance	2022	Continuous + competition mode	Critical / High / Med / Low (per project)	$50k–$300k+	Mix sponsor + project token	Hats triage + project committee
Immunefi (bounty)	2020	Continuous bug bounty (not contest)	Critical / High / Med / Low	n/a (per-bounty programs up to $50M)	Per-program continuous pool	Project + Immunefi triage

Immunefi is not a contest platform; it’s listed for context. See Tuan-Bonus-Bug-Bounty-Immunefi.

2.2 Code4rena (“C4”) — the volume leader

Format: a public, open-entry contest typically running 5 to 14 days, occasionally longer for larger codebases. Anyone can submit findings as a “warden”. Sponsors pay an upfront prize pool plus a per-finding judge fee. Code4rena runs many contests per month, so the absolute warden volume is the highest in the industry.

Severity tiers:

Tier	C4 description (paraphrased; check live rubric)
High (H / 3-H-XX)	Loss of funds, broken core protocol functionality, or any state corruption that compromises invariants in a way attackers can realistically exploit
Medium (M / 3-M-XX)	Risk arises only under specific conditions (external state, market, governance) or breaks a non-critical function; assets not directly stealable but value can leak
Low (L)	Issues worth noting but not directly exploitable; design quality concerns; bundled into a per-warden QA report rather than paid per-finding
Gas / QA	Optimizations or non-security observations; bundled into a per-warden Gas report

Submission format (recent C4 site uses a structured form):

Title, Severity (warden-proposed), Lines of code linked to the file/commit in scope, Vulnerability detail (free text Markdown), Impact, Tools used, Recommended mitigation.

Categories of warden output:

Per-finding submissions (H/M) — paid based on slot share (see §6).
One QA report per warden — paid in tiered bracketed grades (typically Grade A / B / C and “no award”).
One Gas report per warden — same tiering.

Warden tiers (C4-specific terminology; structure shifts every ~12 months, [verify] at submission time):

New / unranked → Certified (“Cwarden”) tier earned through consistent placements.
Zenith is a separate vetted-researcher track (9-3-zenith-track) where C4 invites top performers into curated contests, often closer in shape to private engagements.

Idiosyncrasies:

The “C4 method” of judging is famously contentious for new wardens: severity decisions reflect judge interpretation of the published rubric, sometimes accompanied by a one-line rationale. Judges can downgrade aggressively. The escalations process (§7.4) is where you challenge a call — and where most new wardens lose money by not using it correctly.
Duplicates (“dupes”): the same finding submitted by multiple wardens is merged into one issue, and the prize for that finding is split across the dupe group with a slot-share formula (§6). Your unique find pays vastly more than your fifth-shared find. Originality is rewarded structurally.
Primary vs supporting — for a duplicate group, the judge picks a single best-written submission as the “primary”; that warden gets a slot bonus. Worth optimizing for.
Selective audits / Pro audits — C4 also runs a Pro / private tier alongside open contests, but the bread-and-butter is open competition.
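The effect of dupe-splitting on a single finding can be sketched numerically. This is a minimal illustration, not the platform's canonical formula: the severity weights (10 for High, 3 for Medium), the 0.85 decay factor, and the function name `finding_shares` are all assumptions chosen to show the shape of a slot-share split, and the primary-submission bonus is ignored. [verify] the live C4 docs for the actual parameters.

```python
def finding_shares(severity: str, dupes: int, decay: float = 0.85) -> float:
    """Shares earned by ONE warden for a finding split `dupes` ways.

    Assumed model: weight * decay^(dupes - 1) / dupes.
    Weights and decay are illustrative, not canonical platform values.
    """
    weights = {"H": 10.0, "M": 3.0}  # assumed severity weights
    if dupes < 1:
        raise ValueError("dupes must be >= 1")
    return weights[severity] * decay ** (dupes - 1) / dupes


# A solo High versus the same High shared by five wardens.
solo_high = finding_shares("H", 1)
fifth_high = finding_shares("H", 5)
print(f"solo High: {solo_high:.2f} shares")
print(f"5-way dupe High: {fifth_high:.2f} shares per warden")
```

Under these assumed parameters a solo High is worth roughly ten times a five-way duplicate of the same High, which is the structural reward for originality described above.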

2.3 Sherlock — the lead-auditor model

Format: a contest typically 3 to 10 days. Sherlock introduced two innovations:

Lead auditor: each contest has a designated “lead senior watson” who plays a quasi-judge role and is paid extra for the responsibility.
Watson pool: many independent researchers (“watsons”) compete in parallel as in C4.

Severity tiers (only High and Medium are paid — Sherlock famously does not pay Low or informational):

Tier	Sherlock criteria (verbatim-ish; check live docs [verify])
High	A bug that causes loss of funds without extensive prior external conditions, and the loss meets impact bars (typically >1% AND >$10 of principal/yield)
Medium	Loss requiring specific conditions, or breaks core functionality, with thresholds (typically >0.01% AND >$10)

Key distinguishing rule: likelihood is not considered for validity. If an attack is theoretically possible and meets the impact bar, it’s valid even if difficult to execute. This makes Sherlock the most researcher-friendly platform for hard-to-execute bugs, but the harshest in rejecting findings that miss the precise impact thresholds.

Additional Sherlock-specific judging rules (snapshot — read live docs for canonical text):

Admin functions assumed used correctly unless the contest README explicitly says otherwise.
Front-running on public mempool chains in-scope; on Arbitrum / Optimism / private-mempool chains, front-running is out-of-scope because there is no public mempool to front-run on.
Stale Chainlink price findings typically invalid unless paired with a concrete consumer impact (i.e., not “this could happen” but “and here is how the protocol gets drained”).
Storage-gap omissions in upgradeable contracts typically invalid unless complex inheritance is present.
DoS findings: must lock funds >7 days OR impact a time-sensitive function — both → High, either → Medium.
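As a pre-submission self-check, the thresholds quoted in this section can be encoded as a small classifier. This is a sketch of the snapshot rules as paraphrased here (the >1% / >0.01% loss bars, the >$10 floor, and the 7-day DoS rule); the function names and the boolean `needs_external_conditions` are hypothetical simplifications, and real judging involves far more context than these inputs capture.

```python
def classify_loss(loss_pct: float, loss_usd: float,
                  needs_external_conditions: bool) -> str:
    """Map a loss-of-funds finding onto the quoted High / Medium bars."""
    if loss_usd <= 10:
        return "invalid"  # below the quoted $10 floor
    if loss_pct > 1 and not needs_external_conditions:
        return "High"
    if loss_pct > 0.01:
        return "Medium"
    return "invalid"


def classify_dos(lock_days: float, time_sensitive: bool) -> str:
    """Quoted DoS rule: both conditions -> High, either -> Medium."""
    long_lock = lock_days > 7
    if long_lock and time_sensitive:
        return "High"
    if long_lock or time_sensitive:
        return "Medium"
    return "invalid"


print(classify_loss(5.0, 1_000, needs_external_conditions=False))
print(classify_loss(5.0, 1_000, needs_external_conditions=True))
print(classify_dos(lock_days=3, time_sensitive=True))
```

Running a draft finding through a check like this before writing it up makes it harder to miss an impact threshold by accident.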

Watson rank and “Senior Watson” status — accrue through consistent valid findings; senior status unlocks lead-auditor opportunities.

Authorship and Solodit — Sherlock is the primary upstream into Solodit’s aggregated finding feed. Most Sherlock findings end up indexed there, so calibration data is plentiful. (Code4rena and Cantina also feed Solodit but with different latency / coverage.)

2.4 Cantina — competitive + curated, spans bigger protocols

Format: Cantina (Spearbit’s competitive-audit platform) runs a mix of:

Open competitions — similar shape to C4/Sherlock; anyone can submit.
Curated reviews — top researchers are invited; smaller researcher pool, higher per-researcher payout, closer to a hybrid between contest and private engagement.
Marketplace for private engagements — Cantina acts as broker between protocols and vetted reviewers.

Severity tiers: High / Medium / Low (Low often paid, unlike Sherlock; [verify] per contest).

Idiosyncrasies:

Cantina’s curated competitions tend to bring bigger protocol names to competition (post-merge LST protocols, major restaking infrastructure, large lending markets) — the pools are often the largest in the industry, but the researcher pool is also higher-skill.
Severity rubric is closer to Code4rena than Sherlock; expect “likelihood matters” judging.
Researcher rank on Cantina builds towards invite eligibility for curated reviews — the path is similar to C4 Zenith.

If you’ve placed Top-10 across 6+ C4/Sherlock contests, expect Cantina invitations to follow.

2.5 Hats / Hats Pro — continuous + competition hybrid

Format: two modes.

Continuous audit competitions: an evolving codebase has a long-running open bounty (weeks to months), with the protocol’s own deployment / TVL behind it.
Discrete audit competitions: fixed-window contests like C4.

Severity tiers: Critical / High / Medium / Low (per project; rubric varies more than other platforms — [verify] per program).

Idiosyncrasies:

Reward sometimes paid in project token rather than stablecoin — adds price-volatility exposure that doesn’t exist on C4/Sherlock USDC-denominated pools.
Triage and dispute process is less standardized than the C4/Sherlock pipeline; reading prior Hats finding reports for the specific program is essential before submitting.
The audit / bug-bounty boundary is blurrier: sometimes the same finding can be reported during a contest or as a continuous bounty, with different reward sizes.
Project committee involvement in judging means downgrading-via-political-disagreement is more common; document everything.

2.6 Choosing where to enter as a new warden

Goal	Best platform
Earliest possible calibration on a single finding (lowest barrier)	Code4rena open contests
Strictest rubric, fewest opinion-based downgrades	Sherlock
Most signal on report-writing quality	Cantina (longer prose expected)
Largest pools at higher difficulty	Cantina curated / C4 Zenith
Stomach for token-denominated reward + less-standardized triage	Hats
Continuous engagement with one codebase (not contest)	Hats continuous or Immunefi

Pragmatic order for the first 12 months (one recommendation; many viable paths):

Months 1–3: 2–3 Code4rena contests (low pressure, lots of volume, fastest calibration). Aim for any valid finding; even a Medium in a 30-way duplicate group teaches a lot.
Months 4–6: 1–2 Sherlock contests (stricter rubric trains precision). Mix C4 in between.
Months 7–9: One Cantina open competition. Start submitting QA + Gas reports to learn the “polish” side.
Months 10–12: Begin Zenith / curated competitions if invited; otherwise continue C4/Sherlock. Start an Immunefi continuous bounty in parallel.

The order matters less than the cadence — one contest per month with a written calibration retrospective after each is worth ten contests done without reflection.

3. The ROI Question — How to Decide Whether to Compete at All

3.1 The math nobody publishes

Expected payout from a single contest is not “I will find a High and earn 5 figures”. It is a probability-weighted distribution. A back-of-the-envelope:

E[payout] =   P(find ≥1 valid M+) × E[$ per M+ found | finding]
            + P(find ≥1 H)         × E[$ per H found  | finding]
            − OpportunityCost(hours_committed)

For a representative mid-pool open contest:

Variable	Realistic 2025–26 [verify]
Pool size	$100k
Wardens (active submitters)	60–120
Findings issued	80–200 (across H/M/L/QA)
Wardens with ≥1 valid M+	20–40 (i.e., 30–50% find something)
Wardens with ≥1 valid H	5–15
Wardens with a unique solo H	2–6
Top warden’s share of pool	20–35% (often one warden hits multiple H+M)
Median submitter’s share	0–1%
Median warden net earnings (hours @ $100 opportunity cost)	Negative

If you spend ~80 hours on the contest and have a $50/hour opportunity cost, you’ve invested $4,000 of time. Median outcome: $0–$500 in finding payouts. Top-decile outcome: $5k–$25k. Top-1% (one or two unique H + best-written): $30k–$80k.

The distribution is fat-tailed. Expected value calculations only become favorable once your find rate and unique-find rate cross some platform-dependent threshold. For most researchers, this threshold is reached after roughly 6–12 contests of practice (per anecdotal reports across the industry — [verify] with your own tracking).
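The back-of-the-envelope formula from this section translates directly into a few lines of Python. This is a sketch under stated assumptions: the probabilities and per-finding dollar amounts in the example call are made-up placeholders, not measured data, and should be replaced with figures from your own tracking.

```python
def expected_net_payout(p_medium_plus: float, usd_per_medium_plus: float,
                        p_high: float, usd_per_high: float,
                        hours: float, hourly_opportunity_cost: float) -> float:
    """E[payout] as written above: two probability-weighted terms
    minus the opportunity cost of the hours committed."""
    gross = p_medium_plus * usd_per_medium_plus + p_high * usd_per_high
    return gross - hours * hourly_opportunity_cost


# Hypothetical inputs for a newer warden: 80 hours at $50/hour.
net = expected_net_payout(
    p_medium_plus=0.40, usd_per_medium_plus=800,
    p_high=0.10, usd_per_high=6_000,
    hours=80, hourly_opportunity_cost=50,
)
print(f"expected net: ${net:,.0f}")
```

With these placeholder inputs the gross expectation is $920 against $4,000 of time, so the expected net is about −$3,080. That is the negative-median picture described above, and it only flips once the probabilities or per-finding payouts rise.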

3.2 Find rate vs unique-find rate

Two distinct metrics matter:

Find rate = (your valid M+ findings) / (total M+ findings in the contest)
Unique-find rate = (your unique solo findings) / (your valid M+ findings)

Top wardens land 5–15% find rate consistently, with 30–50% of their finds being unique in moderately-attended contests. That’s the income-producing combination: enough volume to participate in many dupe groups, plus enough novelty to occasionally own a finding.

A new warden landing 1–2% find rate with mostly heavy-dupe findings will net almost nothing. Don’t be discouraged — the learning per finding is far higher in the first 10 contests; income arrives later.
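The two metrics are simple ratios, and computing them the same way after every contest keeps the numbers comparable. A minimal sketch follows; the counts in the example are hypothetical.

```python
def find_rate(my_valid: int, contest_total_valid: int) -> float:
    """Your valid M+ findings divided by all M+ findings in the contest."""
    if contest_total_valid <= 0:
        raise ValueError("contest must have at least one valid M+ finding")
    return my_valid / contest_total_valid


def unique_find_rate(my_unique: int, my_valid: int) -> float:
    """Your solo findings divided by your valid M+ findings."""
    return my_unique / my_valid if my_valid else 0.0


# Hypothetical contest: 25 valid M+ issues overall, you landed 3, 1 of them solo.
fr = find_rate(3, 25)
ufr = unique_find_rate(1, 3)
print(f"find rate: {fr:.0%}, unique-find rate: {ufr:.0%}")
```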

3.3 Hours-per-find calibration

Track this across contests. A simple ledger:

contest:                <name>            hours worked: 65
H found:                1 (dupe of 4)     M found: 2 (1 solo, 1 dupe of 7)
QA report:              Grade B
gross payout:           $1,820
$/hour gross:           $28
net (after 30% opportunity cost adjustment): $19/hour

After 6 contests, you’ll see your $/hour trend. If after 10–12 contests you’re still at <$50/hour with rising hour counts, the calibration target is severity-precision (over-rating Lows as Mediums) or recon discipline (spending hours in the wrong module).
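The ledger arithmetic is easy to automate so the trend is visible at a glance. The sketch below assumes a hypothetical `ContestLedger` record and treats the 30% opportunity-cost adjustment from the sample ledger as a flat haircut on the gross hourly figure; the second entry is made up to show a trend.

```python
from dataclasses import dataclass


@dataclass
class ContestLedger:
    name: str
    hours: float
    gross_payout: float

    def gross_per_hour(self) -> float:
        return self.gross_payout / self.hours

    def net_per_hour(self, adjustment: float = 0.30) -> float:
        """Gross hourly rate after the flat opportunity-cost haircut."""
        return self.gross_per_hour() * (1 - adjustment)


entries = [
    ContestLedger("sample-ledger", hours=65, gross_payout=1_820),
    ContestLedger("hypothetical-next", hours=70, gross_payout=2_800),
]
for entry in entries:
    print(f"{entry.name}: ${entry.gross_per_hour():.2f}/h gross, "
          f"${entry.net_per_hour():.2f}/h net")
```

For the sample ledger this gives $28.00/hour gross and $19.60/hour net, which the ledger shows rounded down to $19.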

3.4 The opportunity-cost framing for working auditors

If you’re already earning $1,500/day on private engagements, an 80-hour contest costs $15k of foregone billable time. The contest has to gross >$15k to break even on cash, plus deliver some learning-value to break even on career. This is a high bar: for most established private auditors, contests are a complement (one per quarter for calibration) rather than a primary income.

For a 1st-year independent auditor without private clients yet, opportunity cost is closer to $0 — contests are the highest-ROI use of time available because they also build the public track record needed to land private work.

4. Pre-Contest Scouting (60–90 minutes, before committing time)

4.1 The scouting checklist

Before spending any review time, spend an hour answering:

Pool size and contest length — does the time budget plausibly justify the pool? (See §3.)
Code line count and complexity — read the contest README. Compute SLOC. Apply complexity multipliers from Tuan-15-Audit-Methodology-Tooling §3.2.
Number of wardens already signed up — if listed; some platforms show count.
Prior audits — has the protocol been audited before? Read those reports first.
Protocol category — AMM? Lending? Vault? Bridge? Restaking? Match against your strongest area.
Novelty estimate — is this a Uniswap V2 fork (well-known surface, low edge for you) or a novel curve mechanism (high edge if math is your strength)?
Identifiable senior wardens / lead auditors competing — public commitment via Twitter / Discord. The denser the senior pack, the harder unique finds become.
Sponsor responsiveness — is the team active on Discord answering questions during the contest? Active sponsors → fewer rejected findings via “we assumed this away”.

If even 3 of these flags are unfavorable, consider skipping in favor of the next contest. Contest selection is a major skill — top wardens skip 60–80% of available contests.
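The "three unfavorable flags" rule can be written down as a tiny decision helper. This is a sketch only: the flag names are hypothetical labels for the eight checklist questions, and the threshold of 3 is the rule of thumb stated above rather than a calibrated value.

```python
SCOUTING_FLAGS = (
    "pool_vs_time", "sloc_complexity", "warden_count", "prior_audits",
    "category_fit", "novelty", "senior_density", "sponsor_responsiveness",
)


def scouting_decision(unfavorable: set[str], skip_threshold: int = 3) -> str:
    """Return 'skip' once enough checklist items look unfavorable."""
    unknown = unfavorable - set(SCOUTING_FLAGS)
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    return "skip" if len(unfavorable) >= skip_threshold else "enter"


print(scouting_decision({"novelty", "senior_density"}))
print(scouting_decision({"novelty", "senior_density", "category_fit"}))
```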

4.2 Reading prior audits

Most C4/Sherlock/Cantina contests list prior audits in the README. Spend 30 minutes per prior report:

What to look for	Why
Severity distribution of prior findings	High-severity-heavy = protocol has structural complexity, fertile ground; or already-cleaned = low edge
Categories of bugs found	Repeated reentrancy / oracle / access control — what’s their developmental weakness?
Specific functions / modules flagged	A re-audit of “previously-found” code rarely repays time; the previously-clean modules are the new attack surface
Acknowledged / wontfix issues	Often marks the exact issue category that others will re-submit and have invalidated. Read these especially carefully.
Time elapsed since last audit	Lots of code added since? That’s the high-yield diff.

Anti-pattern: submitting a finding that was acknowledged in a prior audit report. Judges WILL flag this as out-of-scope or invalid, and you’ve wasted a submission slot. Always scan prior reports first.

4.3 Identifying novel vs reused-library components

Most protocols are 60–90% standard library code (OpenZeppelin, Solmate, Uniswap-V2-style math) with a 10–40% novel slice. Bugs heavily concentrate in:

The novel slice itself — fresh code with no prior audit history.
The integration glue between novel code and standard libraries — the call sites where assumptions cross.
Upgrade hooks that modify standard library behavior.

Spend a triage hour: grep for `import "@openzeppelin/`, `import "@uniswap/`, `import "@solmate/`. Subtract from total SLOC. The residue is the high-edge surface; focus there.
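The triage step can be scripted roughly as follows. This is a sketch under several assumptions: the comment-skipping SLOC count is a crude heuristic, the `lib/` and `node_modules/` vendored-directory prefixes are a common Foundry/npm layout rather than something every repo follows, and the in-memory `files` mapping stands in for reading `.sol` files from disk.

```python
import re

STANDARD_PREFIXES = ("@openzeppelin/", "@uniswap/", "@solmate/")
# Matches both `import "path";` and `import {X} from "path";`
IMPORT_RE = re.compile(r'''import\s+(?:[^"']*\sfrom\s+)?["']([^"']+)["']''')


def sloc(source: str) -> int:
    """Count lines that are neither blank nor comment-only (rough heuristic)."""
    return sum(
        1 for line in source.splitlines()
        if line.strip() and not line.strip().startswith(("//", "/*", "*"))
    )


def standard_imports(source: str) -> list[str]:
    """Import paths that point at the well-known libraries."""
    return [p for p in IMPORT_RE.findall(source) if p.startswith(STANDARD_PREFIXES)]


def novel_sloc(files: dict[str, str],
               vendored_dirs: tuple[str, ...] = ("lib/", "node_modules/")) -> int:
    """Total SLOC minus everything under the assumed vendored directories."""
    return sum(sloc(src) for path, src in files.items()
               if not path.startswith(vendored_dirs))


vault_source = "\n".join([
    "pragma solidity ^0.8.20;",
    "",
    'import {ERC20} from "@solmate/tokens/ERC20.sol";',
    'import "./Math.sol";',
    "// share accounting lives here",
    "contract Vault {}",
])
files = {
    "src/Vault.sol": vault_source,
    "lib/solmate/tokens/ERC20.sol": "contract ERC20 {\n}\n",
}
print(standard_imports(files["src/Vault.sol"]), novel_sloc(files))
```

The list of standard imports per file also hints at where the integration glue sits: files that lean on a library but wrap or override it deserve a closer look than files that use it untouched.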

4.4 Estimating your edge

A practical question to ask before signing up:

“What does my background give me that 80% of other wardens don’t have?”

Background	Edge
Strong math (e.g., engineering / quant / cryptography)	Curve mechanisms, AMM math, interest-rate models, slippage proofs
Strong systems / OS background	Gas economics, ordering subtleties, MEV
Heavy DeFi-using personal experience	Integration risk, real-world failure mode intuition
Solana / Move / Cairo / Cosmos experience (and the contest covers any of these)	Massive — non-EVM competitive pool is much smaller
L2 / bridge / cross-chain background	High in bridge contests, which have few experts
Frontend / supply-chain / OPSEC	Niche but useful in dApp-scope contests (rare)

If your edge intersects the contest’s category, your unique-find rate goes up. If it doesn’t, you’re competing on raw thoroughness against more specialized wardens.

4.5 The decision form

Fill it out before committing:

Contest:         <name>
Platform:        <C4 / Sherlock / Cantina / Hats>
Pool:            $<amount>
Length:          <days>
SLOC (in scope): <SLOC, novel-only>
Prior audits:    <count, latest date>
Category:        <AMM / lending / vault / bridge / restaking / ...>
Novel slice:     <approx % of code>
My edge:         <bullet points>
Time I'll commit: <hours; cap at 1.5× planned>
Expected find rate: <%>
Decision:        <enter / skip / wait for next>
Notes:           <flagged risks, e.g., "team unresponsive on Discord">

Keep these in a logbook. After 12 contests, your prediction accuracy on find rate will be calibrated. That’s an enormously valuable artifact.
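One way to measure prediction accuracy from the logbook is the mean absolute error between the find rate you wrote on the form and the one you actually achieved. The record type, the metric choice, and the sample entries below are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class LogbookEntry:
    contest: str
    predicted_find_rate: float  # from the decision form, as a fraction
    actual_find_rate: float     # measured after judging, as a fraction


def mean_absolute_error(entries: list[LogbookEntry]) -> float:
    """Average gap between predicted and actual find rate."""
    if not entries:
        raise ValueError("logbook is empty")
    gaps = [abs(e.predicted_find_rate - e.actual_find_rate) for e in entries]
    return sum(gaps) / len(gaps)


# Hypothetical logbook with three contests.
logbook = [
    LogbookEntry("contest-a", 0.05, 0.02),
    LogbookEntry("contest-b", 0.04, 0.04),
    LogbookEntry("contest-c", 0.03, 0.06),
]
error = mean_absolute_error(logbook)
print(f"mean absolute error: {error:.3f}")
```

A shrinking error over successive contests is the sign that your enter/skip decisions are becoming trustworthy.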

5. During-Contest Workflow — Phased Pass Model

Adapt the methodology from Tuan-15-Audit-Methodology-Tooling §2.2 to a contest’s time pressure. The phase model below is for an 8-day open contest; adjust ratios proportionally for 5-day or 21-day windows.

5.1 The 8-day phase budget

Day 1 (4h):       Recon — README, docs, prior audits, system overview, threat model v0
Day 1–2 (6h):    Tool sweep — Slither / Aderyn / build coverage report
Day 2–4 (24h):   Module-by-module manual review (3–6 modules at 4–6h each)
Day 5–6 (12h):   Cross-cutting passes — access control, invariants, oracle/MEV, economic
Day 6 (6h):      Fuzz harness + targeted PoC bursts
Day 7 (8h):      Write-up burst — convert finding notes into submission-quality drafts
Day 8 (6h):      Final polish, severity calibration, submit
Total: ~66h (realistic for a 60–80h committed window)

Note: judges have observed for years that the quality of submissions plateaus after about 80 hours per warden on a typical 8-day open contest. Going beyond costs more than it earns. Cap your hours; protect calibration capacity for the next contest.
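Adjusting the ratios proportionally for a different window is a one-line scaling of the 8-day budget. The sketch below copies the phase hours from the budget above; the phase labels and the rounding to one decimal are presentation choices, not part of the method.

```python
BASE_BUDGET_HOURS = {
    "recon": 4,
    "tool sweep": 6,
    "module review": 24,
    "cross-cutting": 12,
    "fuzz + PoC": 6,
    "write-up": 8,
    "polish + submit": 6,
}


def scale_budget(total_hours: float) -> dict[str, float]:
    """Keep each phase's share of the 66h baseline, scaled to `total_hours`."""
    base_total = sum(BASE_BUDGET_HOURS.values())
    return {phase: round(hours * total_hours / base_total, 1)
            for phase, hours in BASE_BUDGET_HOURS.items()}


# Example: a shorter contest with roughly 40 hours available.
for phase, hours in scale_budget(40).items():
    print(f"{phase:<16} {hours:>5.1f}h")
```

Scaling keeps module review at a bit over a third of the total, so a shorter window shrinks every phase instead of silently dropping the cross-cutting passes.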

5.2 Phase 1 — Recon (4 hours, Day 1 morning)

Goal: build the same scoping artifact a private audit would produce, but in a quarter of the time.

Tasks:

Read the README end-to-end — twice. The first time for “what is this”, the second for any hidden gotchas (“we acknowledge X is out of scope”, “the keeper is assumed honest”, etc.).
Read the protocol docs / whitepaper — 30 minutes max. The point is to know the intended invariants, not to memorize architecture.
Read all prior audit reports in full (see §4.2).
Build the file/contract dependency tree: `slither . --print human-summary,contract-summary,inheritance-graph` produces it in seconds.
Identify entry points: `slither . --print entry-points` lists every external/public function. Print this list and physically check off as you review.
Write a 1-page threat model in your own notes file: actors, trust boundaries, top-5 invariants to verify. (Even if rough — the act of writing it forces structure.)
Identify novel slice (see §4.3). Mark modules as “deep dive” or “skim”.

Output: scoping notes file with entry-point list, threat model v0, ranked module list (most novel / most likely-buggy first).

5.3 Phase 2 — Tool sweep (6 hours, Day 1–2)

Reference Tuan-15-Audit-Methodology-Tooling §8–§9. For a contest:

Slither + Aderyn: run both, dump output to files. Don’t immediately submit Slither findings — most are noise. But scan them: any uncontroversial high-confidence finding (e.g., controlled-delegatecall) deserves immediate verification.
Build a Foundry harness early — even a stub. You’ll want to write PoCs throughout the contest, not at the end.
Foundry coverage: `forge coverage` tells you which functions tests don’t cover. Untested code is high-value review territory.
Echidna / Medusa: skip in 5-day contests (too much config overhead). Use in 14-day contests for late-stage invariant verification.
Halmos: only if a math-heavy module justifies it. Math libraries (TickMath-likes, mulDiv variants) are great targets.

The output of this phase: a triaged tool report + a working Foundry harness pre-loaded with the protocol.

5.4 Phase 3 — Module pass (24 hours, Day 2–4)

For each module (3–6 modules of 4–6 hours each):

Apply the three-pass structure from Tuan-15-Audit-Methodology-Tooling §7.1:

Top-down (the user’s path) — for each external entry, trace who can call it, validate inputs, read+write state, external calls, events, gas behavior. Drop questions into your finding journal as you go.
Bottom-up (the state’s path) — list every state variable; identify all writers; check consistency.
Heuristics-on-sight — every smell from Tuan-15-Audit-Methodology-Tooling §7.3 triggers a 5× pace slowdown.

The 60-second finding-journal habit: every time something feels off, spend 60 seconds writing the question into a Markdown file. Don’t try to resolve immediately. By end of contest you’ll have 50–200 journal entries; ~10–20% become findings.

Pace target on first contest: one external function fully reviewed per 30 minutes on novel code. Speeds up as familiarity grows.

5.5 Phase 4 — Cross-cutting passes (12 hours, Day 5–6)

Once per-module work is done, run patterns across the codebase:

Pass	What to check
Access control matrix	Every privileged function × every role. Spreadsheet. Any gaps? Any over-grants?
Invariant sweep	Take your top-10 invariants (from threat model); test each via Foundry assertion. Anything that fails or you can’t quickly verify is a candidate finding.
Oracle / price source	Every consumption of an external price (Chainlink, Uniswap V2/V3, Curve, custom). Stale, manipulable, mis-decimaled, mis-denominated?
Math direction	Every division. Rounding favors who? Consistent with documented spec? Donation-attack vulnerable?
Reentrancy surface	Every external call followed by state writes. CEI or `nonReentrant`? Cross-function? Read-only via view exposed to consumers?
Token integration	SafeERC20 throughout? Fee-on-transfer aware? Rebasing aware? Approve race?
Time / block sensitivity	Every `block.timestamp` / `block.number`. Manipulation surface? Cross-chain inconsistency?
Event coverage	Every state-changing function emits an event sufficient for off-chain reconstruction?
Initialization / upgrade hooks	`_disableInitializers` present? Storage gap? Storage layout preserved?
MEV / front-running	Any function whose order-dependence creates value-extraction? Slippage protection?

Cross-cutting is where senior wardens land uniques — fast pattern recognition across the code surface. New wardens often skip this phase (“I’m still reviewing modules”); the senior fix is to stop module review on Day 4 hard and switch to cross-cutting even if module pass feels incomplete.

5.6 Phase 5 — PoC bursts (within phases 3–4)

Whenever a finding crystallizes — immediately write the Foundry PoC. Don’t batch.

Reasons:

A PoC validates the bug exists. Sometimes it doesn’t, and you save the write-up time.
A PoC produces numbers you’ll cite in the impact section.
A PoC is the strongest defense against severity downgrades.
A PoC sometimes reveals a second finding nearby (“while attacking X I noticed Y”).

Speed target: 30–90 minutes per PoC for a clean bug. If it’s taking >2 hours, either the bug isn’t real or the harness needs work — step back.

5.7 Phase 6 — Write-up burst (Day 7, 8 hours)

Convert finding-journal entries into submission-shaped Markdown. Apply the template in §6.

This is also when you kill doubtful findings. Half of your journal entries don’t survive a careful re-read. Better to submit 5 confident H/M findings than 20 mixed-quality ones — judges read submission quality as a per-warden signal, and a wall of Low-disguised-as-Medium hurts your reputation on the platform.

5.8 Phase 7 — Submit, polish, defend

Final pass:

Every Medium+ has a working PoC linked.
Every severity claim has explicit reference to the platform rubric.
Every impact statement includes numbers (USD or % of TVL).
Every recommendation includes a specific code fix, not generalities.
No typos in the title or first paragraph — those determine reading order.
No duplicate-from-prior-audit content (re-check §4.2).
QA report assembled with all Low findings as a single multi-section document.
Gas report assembled if you have ≥3 gas optimizations worth submitting.

Submit before the deadline by ≥4 hours — the C4 / Sherlock / Cantina submission systems get DDoS-grade traffic in the final hour and submission failures happen.

After submission: watch the post-judging period (typically 2–6 weeks). Use escalations (§7.4) when applicable. This is where downgrade calls get reversed if you defend them well.

6. Writing a Finding That Judges Accept

6.1 The universal template

Across C4, Sherlock, Cantina, the strongest findings share a structure:

# <Title — function name + bug class>
 
## Summary
<One-sentence summary: who can do what to whom under what conditions.>
 
## Severity
<Proposed: High / Medium / Low — *with explicit rubric reference*>
 
## Vulnerability Details
<Code excerpt with line numbers; precise explanation of the bug;
state transitions / invariants violated; assumptions broken.>
 
## Impact
<Who loses what under what realistic conditions; numerical bounds
(USD, % of TVL, % of fees, time-to-execute).>
 
## Proof of Concept
<Foundry test or step-by-step reproduction; concrete numbers.>
 
## Tools Used
<Manual / Foundry / Slither / Halmos / ...>
 
## Recommended Mitigation
<Specific code change. Patch-style diff if possible. Anticipate
side effects of the fix.>
 
## References
<Prior similar findings; Solodit links; spec / docs.>

Some platforms (C4) have stricter forms; some (Sherlock) have looser. The template works as a baseline you can compress or expand.
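
Before submitting, it can help to check mechanically that a draft has every section of the template. The sketch below is a minimal illustration, assuming findings are kept as Markdown files with `## ` section headings exactly as in the template above; `missing_sections` and the sample `draft` are hypothetical names, not part of any platform's tooling.

```python
import re

# Section headings taken from the template above; adjust per platform form.
REQUIRED_SECTIONS = [
    "Summary",
    "Severity",
    "Vulnerability Details",
    "Impact",
    "Proof of Concept",
    "Tools Used",
    "Recommended Mitigation",
    "References",
]


def missing_sections(markdown_text: str) -> list[str]:
    """Return the template sections that have no '## <name>' heading in the draft."""
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+?)\s*$", markdown_text, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in found]


# Hypothetical half-finished draft: three sections written, five still to do.
draft = """# `Vault.redeem()` first-depositor donation attack

## Summary
...

## Severity
...

## Impact
...
"""

print(missing_sections(draft))
```

Run it over the finding journal during the write-up burst; anything it lists is a section a judge will notice is absent.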

6.2 Title — “function name + bug class”

A great title is grep-able: another auditor scanning Solodit for a category should find your finding by keyword search.

Weak title	Strong title
“Vault can be drained”	“`Vault.redeem()` allows draining via first-depositor donation attack”
“Issue with rounding”	“Rounding in `convertToShares()` favors user, causing slow drain of pool assets”
“Oracle problem”	“`Oracle.getPrice()` uses spot price on UniV2; manipulable for ~$200k cost given current pool depth”
“Reentrancy”	“Read-only reentrancy in `Pool.virtualPrice()` allows consumer protocols (e.g., `Bank.deposit()`) to mis-price during `removeLiquidity()`”

The strong titles communicate function + class + immediate consequence. Judges form a first impression from titles; titles also help in dedupe (two wardens with the same strong title cluster instantly).

6.3 Severity — justify, don’t assert

A severity claim without a rubric reference is an assertion. With one, it’s an argument.

## Severity
 
**High** under Code4rena's rubric §3-H ("Assets can be stolen, lost, or
compromised directly; or there's a valid attack path with realistic
assumptions").
 
Impact: direct loss of user deposits, attack path is single-tx,
attacker capital ≤ flash-loan accessible amount (~$50M on Aave today),
no admin / governance dependency.
 
Likelihood: any caller — no privileged role required. PoC executes
in a single transaction.

This wording defends against the most common downgrade — “Medium because too hard to execute”. You preempt it with the flash-loan availability argument and the single-tx PoC.

For Sherlock, replace the Code4rena reference with the Sherlock rubric thresholds (for High: a loss of >1% AND >$10 of principal; the Medium bar is lower, at 0.01%). For Cantina, follow their published rubric (closer to C4 in 2025–26 [verify]).

6.4 Impact — numbers, not adjectives

Weak impact	Strong impact
“Significant fund loss is possible”	“At block 18,500,000 the pool holds $4.2M of USDC. A single-tx attacker drains $3.8M (90% of pool) at a flash-loan cost of $1,200 in gas + $0 fee (Aave). Net profit ~$3.8M.”
“Users could be harmed”	“Each depositor in the bottom decile (deposits <$500) loses ~3% of principal to rounding accumulation over a 90-day holding period.”
“This is a critical issue”	“Critical: a permanent freeze of all funds. The `pause` setter has no inverse and no role-rotation; recovery requires redeploy.”

Specific numbers move severity calls upward. Vague impact moves them downward.

If you don’t have concrete numbers (because the bug is conceptual rather than exploitable for cash), be explicit:

“Impact is bounded by the number of users with >0 pendingRewards at the time of the upgrade — at writing, this is ~1,200 users with median pending reward of $42, max $1,800.”

That’s still numerical. Even bounded enumeration beats “this could affect users”.

6.5 PoC — prefer Foundry; show inputs and outputs

A Foundry PoC is the standard. Anything else (Hardhat, conceptual sequence, “exploit script not included”) loses credibility instantly.

Template:

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;
 
import "forge-std/Test.sol";
import "../src/Vault.sol";
import "../src/MockERC20.sol";
 
contract Exploit_DonationAttack is Test {
    Vault vault;
    MockERC20 asset;
    address victim = address(0xBEEF);
    address attacker = address(this);
 
    function setUp() public {
        asset = new MockERC20("USDC", "USDC", 6);
        vault = new Vault(IERC20(address(asset)));
        // Seed an empty vault scenario; we are the first depositor (attacker)
    }
 
    function test_donationAttack_drains_victim_deposit() public {
        // 1. Attacker deposits the minimum (1 wei)
        asset.mint(attacker, 1);
        asset.approve(address(vault), 1);
        uint256 attackerShares = vault.deposit(1, attacker);
        assertEq(attackerShares, 1, "first-deposit 1:1 ratio");
 
        // 2. Attacker donates a large amount directly to vault (no mint)
        uint256 donation = 1_000e6;  // 1,000 USDC
        asset.mint(attacker, donation);
        asset.transfer(address(vault), donation);
 
        // 3. Victim deposits 500 USDC, expects ~500 worth of shares
        asset.mint(victim, 500e6);
        vm.startPrank(victim);
        asset.approve(address(vault), 500e6);
        uint256 victimShares = vault.deposit(500e6, victim);
        vm.stopPrank();
 
        // 4. Due to rounding, victim shares are 0 — all value
        //    absorbed proportionally by attacker's single share
        assertEq(victimShares, 0, "victim received zero shares — finding!");
 
        // 5. Attacker redeems, walking away with both deposits
        uint256 attackerOut = vault.redeem(attackerShares, attacker, attacker);
        emit log_named_uint("attacker walks with USDC (6 dp)", attackerOut);
        assertGt(attackerOut, 1_500e6, "attacker should take >1500 USDC");
    }
}

Run output included in the submission:

[PASS] test_donationAttack_drains_victim_deposit() (gas: 217,891)
Logs:
  attacker walks with USDC (6 dp): 1500000001

The numbers make the finding undeniable: the attacker redeems 1,500,000,001 micro-USDC (1,500.000001 USDC) after putting in 1 wei plus the 1,000 USDC donation, so the victim’s entire 500 USDC deposit ends up with the attacker.
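
The share arithmetic behind that log line can be re-derived off-chain before you trust the test. The sketch below mirrors the naive `convertToShares()` formula with integer floor division; the 1:1 first-deposit branch and the helper name `convert_to_shares` are assumptions about the vault under test, not code from it.

```python
def convert_to_shares(assets: int, total_supply: int, total_assets: int) -> int:
    """Naive share pricing: 1:1 on first deposit, then floor(assets * supply / totalAssets)."""
    if total_supply == 0:
        return assets
    return assets * total_supply // total_assets


USDC = 10**6  # 6 decimals, as in the PoC

total_supply, total_assets = 0, 0

# 1. Attacker deposits 1 wei and receives 1 share.
attacker_shares = convert_to_shares(1, total_supply, total_assets)
total_supply += attacker_shares
total_assets += 1

# 2. Attacker donates 1,000 USDC directly: assets grow, supply does not.
total_assets += 1_000 * USDC

# 3. Victim deposits 500 USDC; floor division rounds the minted shares to zero.
victim_shares = convert_to_shares(500 * USDC, total_supply, total_assets)
total_supply += victim_shares
total_assets += 500 * USDC

# 4. Attacker redeems the only share in existence.
attacker_out = attacker_shares * total_assets // total_supply
attacker_in = 1 + 1_000 * USDC

print(victim_shares)               # 0
print(attacker_out)                # 1500000001
print(attacker_out - attacker_in)  # 500000000, i.e. the victim's 500 USDC
```

If the off-chain numbers and the Foundry log disagree, the harness is wrong or the vault does something the notes missed; either way, find out before submitting.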

For higher-severity findings (especially flash-loan-amplified ones), fork mainnet:

function setUp() public {
    vm.createSelectFork(vm.envString("MAINNET_RPC"), 18_500_000);
    vault = Vault(0x...real address...);
}

A fork-test PoC reproduces against real on-chain state and is the highest-credibility evidence.

6.6 Recommendation — specific, with anticipated side effects

## Recommended Mitigation
 
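Replace `convertToShares()` with the virtual-offset pattern (as in OpenZeppelin's `ERC4626.sol`, v4.9 and later):
 
```solidity
// Virtual shares and one virtual asset keep the share price from being
// inflated by a direct donation while the real supply is tiny.
function convertToShares(uint256 assets) public view returns (uint256) {
    uint256 supply = totalSupply() + 10**_decimalsOffset; // real + virtual shares
    uint256 assetsBase = totalAssets() + 1;                // real + 1 virtual asset
    return (assets * supply) / assetsBase;                 // rounds down, in the vault's favor
}
```

Where _decimalsOffset = 6 (or appropriate for the asset’s decimals).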

Side effects to verify:

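The first deposit mints 10**_decimalsOffset shares per unit of asset rather than 1:1, so share balances carry more decimals than the asset; check every integration that assumes a 1:1 share price.
Existing depositors are unaffected if migration sets the offset correctly.
The donation attack becomes uneconomical rather than impossible: to round a victim’s deposit down to zero shares, the attacker must donate roughly 10**_decimalsOffset times that deposit, and part of the donation accrues to the virtual shares, which nobody can redeem.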

Tests to add:

A test that attacker cost to drain a 1-wei depositor is bounded to ≥10**_decimalsOffset of asset.
A test that the virtual offset reserve is non-zero after first deposit.


A recommendation that ends with side effects + tests to add signals a senior auditor. A recommendation that ends with "use safe math" or "consider adding a check" signals a junior one.
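
To anticipate side effects, replay the attack against the proposed fix before recommending it. The sketch below applies the virtual-offset formula to the same scenario as the PoC (1 wei deposit, 1,000 USDC donation, 500 USDC victim deposit). The redeem-side formula and the offset of 6 are assumptions chosen to mirror the `convertToShares()` patch; a real vault's rounding may differ.

```python
OFFSET = 6                    # assumed _decimalsOffset
VIRTUAL_SHARES = 10**OFFSET   # virtual shares added to totalSupply
VIRTUAL_ASSETS = 1            # virtual asset added to totalAssets
USDC = 10**6


def to_shares(assets: int, total_supply: int, total_assets: int) -> int:
    """Deposit side: floor(assets * (supply + virtual) / (totalAssets + 1))."""
    return assets * (total_supply + VIRTUAL_SHARES) // (total_assets + VIRTUAL_ASSETS)


def to_assets(shares: int, total_supply: int, total_assets: int) -> int:
    """Redeem side (assumed symmetric): floor(shares * (totalAssets + 1) / (supply + virtual))."""
    return shares * (total_assets + VIRTUAL_ASSETS) // (total_supply + VIRTUAL_SHARES)


total_supply, total_assets = 0, 0

# Attacker deposits 1 wei, then donates 1,000 USDC directly to the vault.
attacker_shares = to_shares(1, total_supply, total_assets)
total_supply += attacker_shares
total_assets += 1
total_assets += 1_000 * USDC

# Victim deposits 500 USDC and now receives a non-zero share balance.
victim_shares = to_shares(500 * USDC, total_supply, total_assets)
total_supply += victim_shares
total_assets += 500 * USDC

attacker_in = 1 + 1_000 * USDC
attacker_out = to_assets(attacker_shares, total_supply, total_assets)
victim_value = to_assets(victim_shares, total_supply, total_assets)

print(victim_shares)               # 999999
print(attacker_out < attacker_in)  # True: the attack now loses money
print(500 * USDC - victim_value)   # victim's rounding loss in micro-USDC
```

In this run the attacker recovers only about half of the donation and the victim keeps almost all of the 500 USDC, which is the behavior the "tests to add" should pin down.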

6.7 The "common downgrade" patterns and how to preempt them

| Common downgrade | How to preempt |
|-----------------|----------------|
| "Requires admin error" | Either argue admin compromise is a documented risk **or** show the path doesn't need admin. Make the path explicit. |
| "Requires governance compromise" | Argue with flash-loan governance: how much voting power costs how much capital; whether timelock is sufficient. Use numbers from the contest's deployed config. |
| "Requires external preconditions unlikely on mainnet" | Show the preconditions are routine (e.g., "in the last 30 days, ETH/USD on UniV3 has hit this state 17 times"). Solodit + on-chain history. |
| "Design choice, not a bug" | Cite the spec / docs / NatSpec that contradicts the behavior. If docs and code disagree, *that's* the finding (doc bug or code bug, one of them). |
| "Out of scope" | Pre-check the README scope rules. Don't waste a submission slot on out-of-scope. |
| "Dupe of <other warden>" | This is downgrade only for *unique-share*; the finding is still valid. Don't treat it as failure — it's normal. The fix is novelty, not better writing of the same bug. |
| "Insufficient PoC" | A working Foundry test is the gold standard; supplemental scripts only as an aid to it. |
| "Theoretical, no realistic exploit" | Provide the realistic-exploit scenario in the Impact section with mainnet numbers. |

6.8 Worked example — a finding from notes to submission

**Stage 1 — notes (from the finding journal)**:

> "Vault.sol L142 — `convertToShares` divides assets by totalAssets() — first depositor can manipulate. donation attack? mint 1 share, then transfer huge token to vault, next depositor gets 0 shares due to rounding. classic OZ4626 pattern. did the team add the virtual offset? no, they're using the naive form. check inheritance — extends ERC20 directly, not ERC4626. so they wrote it themselves. confirmed: vulnerable."

**Stage 2 — confirmed by PoC** (the test from §6.5).

**Stage 3 — submission**:

```markdown
# `Vault.deposit()` allows first-depositor donation attack — subsequent depositors receive zero shares

## Summary
A donation attack against `Vault.convertToShares()` lets a first depositor with
1 wei of asset siphon the deposits of all subsequent users until the share-pricing
becomes too coarse for them to mint any shares.

## Severity
**High** under Code4rena §3-H. Direct loss of user funds, no privileged role required,
attacker capital is recoverable (1 wei deposit plus a donation returned on redeem); net cost is gas.

## Vulnerability Details
`Vault.convertToShares()` computes shares as `assets * totalSupply / totalAssets()`.
When `totalSupply == 1` and `totalAssets()` is inflated by a direct token transfer
(donation), a user depositing `assets < totalAssets() / totalSupply` receives zero
shares and contributes their deposit pro-rata to the existing single share — all of
which is owned by the attacker.

This is the well-known ERC-4626 first-deposit / donation attack. The standard
mitigation is the virtual-share offset adopted by OpenZeppelin in `ERC4626` (v4.9+);
the protocol's custom implementation does not include it.

## Impact
Every user whose deposit is smaller than `totalAssets() / totalSupply` (roughly the
attacker's donation while the attacker holds the only share) receives 0 shares and
loses 100% of their deposit. For an attacker donation of 1,000 USDC and a fresh vault,
every subsequent deposit of up to 1,000 USDC mints 0 shares; total loss = sum of
victim deposits, captured by the attacker.

PoC: the victim's 500 USDC deposit is captured in full. The attacker fronts 1 wei plus a
1,000 USDC donation, redeems 1,500.000001 USDC, and nets +500 USDC minus gas.

## Proof of Concept
[See `test/exploit/DonationAttack.t.sol` — full test in submission.]

[forge test output as above, showing 1,500,000,001 micro-USDC walked.]

## Tools Used
Foundry; manual review of `Vault.sol:142–168`.

## Recommended Mitigation
Adopt OpenZeppelin's virtual-offset pattern (`ERC4626`, v4.9+)...
[as in §6.6]

## References
- [OpenZeppelin ERC4626 docs](https://docs.openzeppelin.com/contracts/api/token/ERC20#ERC4626)
- [The "donation" attack on ERC-4626 (Akshay Srivastav)](https://mixbytes.io/blog/overview-of-the-inflation-attack)
- Solodit: [Code4rena Y2024-AAVE-V3 finding M-04](https://solodit.cyfrin.io/?...) [verify]
- [[Tuan-15-Audit-Methodology-Tooling]] §6.3 — ERC-4626 invariants
```

Stage 4 — judge ruling: probably accepted as High. If dupes are heavy, primary may go to a better-written submission, which cuts your share but not the severity. If the finding is downgraded for reasons you disagree with, escalate (§7.4) with explicit rubric reference.

7. Judging Culture — What Makes a Finding Valid, Invalid, or Downgraded

7.1 What invalidates a finding

Across all platforms, common invalidation reasons:

Reason	Description
Out of scope	The bug is in a file/contract the contest README explicitly excluded. Always re-check.
Requires admin error (as primary)	“Admin sets `feeBps` to >10000” — admin is assumed competent unless README says otherwise. Sherlock and C4 both apply this strictly.
Requires governance compromise (as primary)	Same logic. Unless the bug is in the governance mechanism, governance-attack-needed-to-trigger doesn’t pass.
Documented behavior / design choice	If the README, spec, or NatSpec explicitly describes the behavior as intended, it’s not a bug. Doc-vs-code disagreements are findings; doc-acknowledged design choices are not.
Negligible impact	Sherlock’s strict thresholds; even C4 will dismiss findings where the loss is sub-dust.
Theoretical without realistic exploit	“Could be exploited if X” where X never happens in practice. Provide on-chain evidence X happens.
Already reported in prior audit & acknowledged / wontfix	Re-submitting these wastes a slot. Always read prior audits.
Compiler / dependency bug	Unless the contest README explicitly includes them.
Front-running on private-mempool chains	Sherlock explicitly OoS on Arbitrum / Optimism / Base / etc.

7.2 What downgrades a finding

Even valid findings get downgraded:

Original	Downgraded to	Why
High	Medium	Requires specific market state (e.g., particular oracle update timing)
High	Medium	Requires specific user behavior (e.g., a user signs an unusual permit)
Medium	Low	Impact is real but bounded to one user’s small deposit, no protocol-wide effect
Medium	QA	Behavior is suboptimal but not a value-loss vector
Any	Invalid	Misunderstanding of how Solidity / EVM / a library works

You’ll see “downgrade to QA” a lot in C4. It’s the judge saying “this is a useful observation but not severity-paying”. Don’t treat it as personal — fold into your QA report for future contests.

7.3 “Dupe wars” — primary vs supporting

In a dupe group of 10 wardens reporting the same bug, the judge designates one as primary (best write-up) and 9 as supporting. The pool’s share for that finding is then split via a slot-share formula favoring the primary.

For C4 specifically (formula has evolved; [verify] at the time of contest):

slot share for primary    = base_share × primary_bonus_multiplier
slot share per supporting = base_share / (n_supporting + 1)   (rough; varies)

Numerically: in a 10-way dupe of a $30k-share finding, the primary might take $12k–$18k and each supporting takes $1.5k–$2.5k. Quality of write-up is income-multiplicative.

To win primary:

Title clarity — judge skims titles when picking primary.
Severity rationale — explicit rubric reference, defended in advance.
Working PoC — judges have stopped picking PoC-less submissions as primary across most contests.
Recommendation quality — specific code fix, anticipating side effects.
No filler — every paragraph adds information.

The trade-off: don’t optimize only for primary — submitting more findings is also high-EV. The right strategy depends on contest length and your speed.

7.4 Appeals / escalations — the most undervalued process

Every platform has an appeals/escalations window after the preliminary judgment:

Code4rena: “Post-Judging QA” period; wardens can file escalations on specific findings via the contest’s GitHub issues, citing rubric. The judge or a higher-tier reviewer revisits.
Sherlock: explicit “Escalation Period”; watsons file via the contest dashboard. Senior watson + lead auditor + Sherlock team adjudicates.
Cantina: dispute window with comment threading; senior researchers and Cantina staff adjudicate.

Escalation success rate [verify with recent data] across platforms: 15–30% of escalations succeed in changing the ruling. That’s high — much higher than most wardens assume. The reason: judges are time-limited and sometimes ship rulings with one-line rationales that don’t survive a careful re-read.

Conditions under which to escalate:

Severity was downgraded with a one-line rationale you can rebut with specific rubric language.
Your finding was marked dupe with another that has a different root cause (de-dupe error).
You were marked supporting in a dupe group where your write-up is objectively better-developed (PoC + numbers vs prose-only).
Your finding was marked invalid because of “assumption X” that the contest README does not state.
A judge appears to have misunderstood the technical claim — produce a clarifying PoC.

Conditions under which to not escalate:

You disagree with severity but can’t cite the rubric.
You think your finding “should be more important” but offer no new evidence.
You’re trying to convert dupe→primary via writing skill alone (sometimes accepted on platforms; usually not).

Tone of escalations matters. Cite the rubric verbatim, attach the PoC link, keep it short. Avoid emotional language. Judges are auditors too — meet them in the same register.

7.5 Long-tail: the “cancer” problem

A community term for: spamming low-effort findings hoping a few survive judging. Some wardens submit 30–50 findings, most invalid, on the bet that the judge can’t quickly invalidate all of them. Platforms have responded:

C4: introduced “insufficient quality” penalties that reduce a warden’s pool share if too many findings are obviously invalid.
Sherlock: tracks watson submission quality across contests; a bad track record reduces future visibility.
Cantina: invitations to curated contests depend on quality history.

The takeaway for you: aim for high signal density. Five carefully written H/M findings beat twenty mixed ones, both in EV (because of slot-share math) and in reputation (because judges remember).

8. Leaderboard Math — How Pools Pay Out

8.1 Code4rena payout structure

C4’s slot-share formula has evolved through several revisions. Roughly:

Each finding has a fixed slot value (function of severity & pool):
  H = 10 slots,  M = 3 slots  (illustrative — verify per contest)

Per finding payout = (pool_for_HM × slots_for_this_finding) / total_slots

Per warden share of finding =
  if primary:    slots × primary_share_multiplier / total_warden_count_in_group
  if supporting: slots × supporting_share_multiplier / total_warden_count_in_group

QA + Gas reports: separate sub-pool (often 5–15% of total)

The structure favors:

Severity (H > M > L in non-linear ratio).
Uniqueness (total_warden_count_in_group = 1 → maximum per-warden share).
Primary status (multiplier ~1.5–2× supporting share).
Volume (more findings = more slot accumulation, even if no uniques).

Implication: a single solo H in a $100k pool with 80 wardens can pay $8k–$15k. A 10-way dupe H pays $1k–$2.5k for supporting, $4k–$7k for primary. A unique M might pay $2k–$4k.
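
To see how duplication and primary status interact, here is a small sketch of one slot-share style formula. The severity weights (10 and 3) come from the illustration above; the decay factor of 0.85 per extra duplicate and the 30% bonus for the submission selected for the report are assumptions recalled from one published version of C4's awarding rules, so verify them against the contest's own documentation. The contest in the example is hypothetical.

```python
def finding_shares(weight: float, n_dupes: int, decay: float = 0.85,
                   report_bonus: float = 0.30) -> tuple[float, float]:
    """Return (primary_shares, supporting_shares) for one finding.

    weight       : severity weight (10 for High, 3 for Medium, as in the text)
    n_dupes      : wardens in the duplicate group, including the primary
    decay        : ASSUMED penalty factor applied per additional duplicate
    report_bonus : ASSUMED bonus for the submission selected as primary
    """
    base = weight * decay ** (n_dupes - 1) / n_dupes
    return base * (1 + report_bonus), base


def payouts(pool_usd: float, findings: list[tuple[str, float, int]]) -> dict:
    """findings: (label, weight, n_dupes). Returns {label: (primary_usd, supporting_usd)}."""
    per_finding = [(label, n, *finding_shares(w, n)) for label, w, n in findings]
    total = sum(primary + supporting * (n - 1) for _, n, primary, supporting in per_finding)
    return {
        label: (pool_usd * primary / total, pool_usd * supporting / total)
        for label, n, primary, supporting in per_finding
    }


# Hypothetical contest: $100k H/M pool and four findings.
contest = [("H-solo", 10, 1), ("H-dupe10", 10, 10), ("M-solo", 3, 1), ("M-dupe5", 3, 5)]
for label, (primary_usd, supporting_usd) in payouts(100_000, contest).items():
    print(f"{label}: primary ${primary_usd:,.0f}, each supporting ${supporting_usd:,.0f}")
```

The absolute dollar figures depend entirely on how many other findings share the pool; the shape to notice is that a solo finding keeps its whole weight while a deep dupe group is both penalized and split.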

8.2 Sherlock payout structure

Sherlock pays based on:

High vs Medium tier (different per-finding pools).
Number of valid finders per finding (split).
Lead senior watson bonus if applicable to the contest.

Sherlock historically paid out closer to:

per-finding pool for a H = ~$20k–$50k (function of total pool & severity mix)
split across n finders, with senior watson bonus 5–10% off the top

[verify] with the latest published payout examples; Sherlock has changed formulas multiple times.

8.3 Tiered rewards and brackets

QA / Gas reports use bracketed grades instead of slot share:

C4 QA report grades (illustrative)	Pay (% of QA sub-pool)
Grade A (top ~10% of QA reports)	25–35%
Grade B (next ~25%)	10–15%
Grade C (next ~30%)	3–6%
No award	0

Even Grade B QA on a $100k contest with a 10% QA sub-pool pays $1k–$1.5k. Writing a coherent QA report is high-ROI for ~3–4 hours of work; judges read structure, not volume.

8.4 Hyped vs unhyped contests

A contest’s attendance (number of wardens) is the strongest predictor of your per-finding share:

High-hype contests (major protocols, large pools, public hype on Twitter): 150–400 wardens. Solo finds rare; dupe groups deep; per-warden EV often lower than mid-pool contests.
Mid-hype contests (mid-cap protocol, $50–200k pool, moderate Twitter): 40–120 wardens. The sweet spot — depth enough to have prizes, sparse enough to win uniques.
Low-hype contests (small pool, niche category, weak marketing): 10–40 wardens. Easy uniques but small absolute pool. High $/finding, low total income.

Strategic implication: top wardens often skip the most hyped contests in favor of mid-hype contests, where their edge is more rewarded. New wardens benefit from low-hype contests as low-pressure calibration.

8.5 Per-platform “where’s the money in 2025–26”

Rough averages [verify per quarter]:

C4: largest absolute pool $, broadest warden field, lowest per-warden EV in hyped contests, best for cadence.
Sherlock: smaller pools, stricter rubric, higher per-finding EV when valid, fewer dupe-group splits.
Cantina: largest individual prizes for top wardens in curated contests, hardest entry for new wardens, best for established researchers.
Hats: lower volume, token-denominated reward, more variable.

A pragmatic 12-month income model (illustrative, not commitment):

Period	Effort	Realistic gross
Months 1–3 (3 contests)	200 hours	$500–$3,000 (likely net negative vs opportunity cost)
Months 4–6 (3 contests)	200 hours	$2,000–$10,000
Months 7–9 (3 contests)	200 hours	$5,000–$25,000
Months 10–12 (4 contests)	250 hours	$15,000–$60,000

By month 12 a calibrated warden is netting positive after opportunity cost; by month 18–24 the trajectory inflects sharply if combined with private engagement leads.
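
The ramp is easier to judge as an hourly rate. The sketch below converts the illustrative table into $/hour and compares it to a $50/hour opportunity cost, the figure the quiz in §13 uses; substitute your own number.

```python
OPPORTUNITY_COST = 50  # USD per hour; assumption borrowed from the quiz

# (period, hours, gross low, gross high) from the illustrative income model
PERIODS = [
    ("Months 1-3", 200, 500, 3_000),
    ("Months 4-6", 200, 2_000, 10_000),
    ("Months 7-9", 200, 5_000, 25_000),
    ("Months 10-12", 250, 15_000, 60_000),
]


def hourly_range(hours: int, gross_low: int, gross_high: int) -> tuple[float, float]:
    """Gross earnings per hour at the low and high end of the range."""
    return gross_low / hours, gross_high / hours


for label, hours, low, high in PERIODS:
    low_rate, high_rate = hourly_range(hours, low, high)
    if low_rate >= OPPORTUNITY_COST:
        verdict = "above"
    elif high_rate >= OPPORTUNITY_COST:
        verdict = "straddles"
    else:
        verdict = "below"
    print(f"{label}: ${low_rate:.2f}-${high_rate:.2f}/hr ({verdict} ${OPPORTUNITY_COST}/hr)")
```

Only the last period clears the opportunity cost across its whole range, which is what the month-12 claim rests on.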

9. Calibration Practice — Solodit, Past Findings, the Daily Habit

9.1 Solodit as the central calibration tool

Solodit (https://solodit.cyfrin.io/) aggregates findings from Code4rena, Sherlock, Cantina, Spearbit, and other sources. It’s free.

For a serious warden, Solodit is the daily reading habit:

Filter by platform, severity, protocol type, year.
Read raw findings as they were submitted (with judge ruling).
Compare your gut-call severity to the actual ruling.

The 100-finding study (the Lab in §11.1): read 100 findings, write down what severity you’d assign before scrolling to the ruling, and tabulate your hit rate. After 100 findings, your match rate reveals your calibration baseline:

<40% match: severity calls are not yet aligned with platform conventions. Re-read the rubrics. Re-do the exercise.
40–60%: typical for first-quarter wardens. Continue practice.
60–75%: competition-ready. You’ll have predictable severity submissions.
>75%: senior-level calibration. Your escalations will succeed at higher rates.

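Tabulating the 100-finding study takes a few lines. The sketch below is one way to do it, assuming you log each finding as a (your call, judge ruling) pair; the band boundaries follow the list above, with exact edge values (40%, 60%, 75%) assigned by assumption.

```python
def match_rate(calls: list[tuple[str, str]]) -> float:
    """calls: (my_severity, judge_ruling) pairs. Returns the fraction that match."""
    if not calls:
        raise ValueError("need at least one finding")
    return sum(1 for mine, ruling in calls if mine == ruling) / len(calls)


def calibration_band(rate: float) -> str:
    """Map a match rate to the bands described above."""
    if rate < 0.40:
        return "not yet aligned: re-read the rubrics and redo the exercise"
    if rate < 0.60:
        return "typical first-quarter warden: continue practice"
    if rate <= 0.75:
        return "competition-ready"
    return "senior-level calibration"


# Hypothetical log: 63 matching calls, 37 where High was ruled Medium.
log = [("H", "H")] * 63 + [("H", "M")] * 37
rate = match_rate(log)
print(f"{rate:.0%} -> {calibration_band(rate)}")  # 63% -> competition-ready
```
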
9.2 Reading the famous wardens

Public author profiles on Solodit / C4 / Sherlock — sample (current as of late 2025 — [verify] since names rotate and rankings shift):

trust1995 — Sherlock-heavy; concise write-ups; strong math edge.
hansfriese — Code4rena top warden across many quarters; comprehensive QA reports.
GalloDaSballo — extremely prolific across platforms; known for fork-test PoCs.
cmichel — Solo-warden archetype; cross-platform.
pashov — Solo and team; runs Pashov Audit Group which sells private engagements; ex top-C4.
0xRajeev / Rajeev — Methodology-focused write-ups; useful for studying style.
dirk_y — Defi-deep; Sherlock high ratings.
kalou — Frequent unique finds; minimalist write-up style.

Pick 3 and read 5 of each warden’s findings. Note their consistencies — title format, PoC style, rubric reference style. Pattern-match the consistencies into your template.

9.3 Watching judges’ rulings as a stream

Each platform’s judge decisions are public:

Code4rena: contest pages list final findings with severity. Compare against the warden’s submitted severity (often visible in the issue history). Patterns emerge in what gets downgraded.
Sherlock: published findings with escalation history. Read escalation threads — these are gold for learning rubric interpretation.
Cantina: detailed findings with judge / sponsor comment threads.

A weekly habit: 30 minutes scanning new rulings for one or two patterns. Over a quarter, your sense of “what passes vs fails” becomes nearly explicit.

9.4 Mock-judging exercise

For 10 findings from any recently closed contest:

Read the title and severity claim only. Write your severity guess.
Read the detail + PoC. Update your guess.
Read the impact + recommendation. Lock in your final guess.
Reveal the actual ruling. Tabulate.

When you and the judge disagree, write down why — in one sentence. Reasons usually cluster:

“I missed the dependency on admin error.”
“I didn’t recognize the loss threshold was below Sherlock’s 0.01%.”
“The PoC seemed weak to me but the judge accepted the conceptual chain.”

Patterns in your own disagreement modes are the most actionable calibration insight you can produce.

9.5 The retrospective journal

After every contest, before checking your earnings:

# Contest <name> — Retrospective
 
## Findings submitted
| ID | Title | My severity | Judge ruling | Status | Notes |
|----|------|-------------|--------------|--------|-------|
| 1 | ... | H | H, dupe of 8 | accepted | tight PoC saved this |
| 2 | ... | M | invalid | rejected | required admin error — should have caught |
| 3 | ... | M | M, primary  | accepted | unique find — virtual-offset deep dive |
 
## What I missed
- <bug class>: <why I missed it; e.g., didn't review module X enough>
- ...
 
## What I over-reported
- <finding>: <why it was weak; e.g., theoretical without realistic conditions>
 
## Calibration delta
- I called <n> findings High that were Medium → severity inflation
- I called <n> findings Medium that were High → severity deflation
- I missed <n> findings entirely
 
## Process changes for next contest
1. ...
2. ...

A 30-minute retrospective after each contest, accumulated over 12 contests, is the single most valuable thing you can do for your career.
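
The "Calibration delta" section can be filled in mechanically from the findings table. A minimal sketch, assuming each row is reduced to a (my severity, judge ruling) pair and that severities order as invalid < QA < M < H; the three sample rows are the ones in the template above.

```python
# Assumed ordering of outcomes, lowest to highest.
RANK = {"invalid": 0, "QA": 1, "M": 2, "H": 3}


def calibration_delta(rows: list[tuple[str, str]]) -> dict[str, int]:
    """Count calls that were too high (inflated), too low (deflated), or matched."""
    inflated = sum(1 for mine, ruling in rows if RANK[mine] > RANK[ruling])
    deflated = sum(1 for mine, ruling in rows if RANK[mine] < RANK[ruling])
    matched = len(rows) - inflated - deflated
    return {"inflated": inflated, "deflated": deflated, "matched": matched}


# (my severity, judge ruling) for findings 1-3 in the template table
rows = [("H", "H"), ("M", "invalid"), ("M", "M")]
print(calibration_delta(rows))  # {'inflated': 1, 'deflated': 0, 'matched': 2}
```

Summed over 12 contests, the inflated and deflated counts show which direction your severity calls drift.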

10. Anti-Patterns (avoid; add to master checklist)

A consolidated list, drawing on platform docs, judge culture, and the lessons in §6–§7.

10.1 Submission-quality anti-patterns

10.2 Process-level anti-patterns

Submitting on deadline day. Submission systems fail; submit ≥4 hours early.
Skipping the retrospective. Calibration only happens when you reflect.
Trying to “win” every contest. Selection is a skill; skipping bad contests is part of the job.
Reading no Solodit findings. Most calibration data is free.
Not using escalations. 15–30% of escalations succeed; not appealing is leaving money on the table.
Treating contests as primary income from month 1. Plan for 12-month ramp; ignore monthly P&L.
Working >80 hours/contest without breaks. Quality plateaus; mental health collapses.
Not building a finding-journal habit. Memory fails by Day 5 of an 8-day contest.
No fork-test capability. For oracle / lending / AMM bugs, fork-test is the difference between “could happen” and “demonstrably worth $3.8M”.

10.3 Career-level anti-patterns

Competing in isolation indefinitely. Senior wardens engage with the community (Twitter, Discord, conferences) — leads come from visibility.
Ignoring private-audit opportunities once leaderboard-ranked. Hybrid (contest + private) maximizes income from month 12 onwards.
Not specializing. Generalists plateau around 60th percentile. Specializing in one of: AMM math / lending / cross-chain / restaking / LST / non-EVM produces inflection.
Not documenting your portfolio. A public list of valid findings (with link to Solodit) is the auditor’s resume.

11. Lab — Three Exercises for Calibration

11.1 Lab 1 — One closed Code4rena contest, hunt-and-compare

Goal: Pick one closed Code4rena contest on Solodit. Read the README and scope. Spend a timeboxed 4 hours hunting. Compare your finds to the published reports. Calibrate.

Steps:

Choose contest — pick something mid-pool ($50–150k), category you’re not yet expert in. Examples: a lending market, a vault, a small cross-chain bridge. Avoid hyped megaprotocols (signal-to-noise too low for a first run).
Freeze a fork — clone the contest repo at the contest commit. Don’t peek at findings yet.
4-hour timer — apply the phased model (§5) compressed: 30 min recon, 30 min tool sweep, 2.5 hours manual review (skip module-by-module rigor; speed-pass each entry point), 30 min write-up.
Write 1–3 findings in submission format (§6). Self-assigned severity, justified.
Reveal: open the contest’s findings page on Solodit. Compare:
- Did you find any of the published findings? Match by root cause, not title.
- Did you find any not in the published findings? (Almost certainly false positives — but verify.)
- Did you miss findings that, with hindsight, you should have found? Note category.
Retrospective: 30 minutes. What was your find rate vs the median warden in that contest? What pattern recurs in your misses?

The first time you do this, expect 0–1 valid finds out of 4 hours matching the published set. That’s normal. Run the lab 3 more times across the next month; track find rate over time.

11.2 Lab 2 — Re-write one of your own historical bugs as a Code4rena-style finding

Goal: convert a bug you found in earlier lessons (e.g., the reentrancy PoC in Tuan-05-Vulnerability-Classes-Part-1 §7.4, or the donation-attack PoC in Tuan-15-Audit-Methodology-Tooling §18.3) into a submission-quality finding.

Apply the template in §6.1. Specifically:

Title that names function + bug class.
Severity claim with explicit C4 rubric reference.
Impact with realistic mainnet-style numbers (even synthetic — use the seeded amounts from the lab).
Working Foundry PoC, copy-pasted with output log.
Recommendation with specific code change + side effects + tests-to-add.
References to Solodit-indexed similar findings (find one matching root cause).

Submission test: have a peer (or yourself, after 48 hours of detachment) read it cold. Can they understand the bug in <2 minutes? Can they reproduce the PoC in <10 minutes from the repo? Can they write the fix from the recommendation without asking clarifying questions?

If any answer is no, iterate.

11.3 Lab 3 — Read 10 invalidated findings from a recent contest

Goal: build intuition for why findings get rejected.

Steps:

Pick a recent (within last 6 months) closed contest with a published findings page that includes invalid / dupe / out-of-scope rulings. Code4rena’s archived contest pages include these; Sherlock’s escalation logs are excellent.
Read 10 invalidated findings in detail (not just the ruling — the full submission + ruling rationale).
For each, write the one-sentence reason for invalidation.
Group reasons. Common buckets:
- Out-of-scope (~20–30%).
- Requires admin / governance / external precondition (~25–35%).
- Misunderstanding of Solidity / EVM / library behavior (~10–20%).
- Already in prior audit / acknowledged (~10–15%).
- Negligible impact / sub-threshold (~10–20%).
- Dupe (not invalidated, just merged) — note these separately.
Output: a one-page summary of your top-3 most-common invalidation reasons in this contest, with a “preempt this in my next submission by…” for each.

This lab takes ~3 hours. After running it across 3 contests, the invalid rate of your own submissions drops markedly, typically from 40–60% (new wardens) to 10–20% (calibrated wardens).

11.4 Lab 4 — (stretch) Submit to a live contest

The previous labs simulate; this one is real.

Pick a live open contest with at least 5 days remaining. Apply §4 (scout), §5 (workflow), §6 (write-ups). Submit at least one finding (even a Medium or QA).

After the judging period, run §9.5 (retrospective journal). Compare your predicted severity to the judge’s call. Note dupe count. Note your hour count and your gross.

The first contest’s earnings are nearly irrelevant. The lab outcome is: you have a baseline. After three contests, you have a trend.

12. Trade-offs & Open Debates

Decision	Option A	Option B	Auditor’s view
Volume vs uniqueness	Submit 15 findings (mix of confidence)	Submit 5 high-confidence findings	Depends on contest length. Short contests reward volume; long contests reward uniqueness. Track $/finding by mode over 6 contests.
Pre-contest scouting time	30 min “good enough”	90 min thorough	90 min on contests you commit to; skip the contest entirely if 30 min reveals red flags.
Tool reliance	Heavy (Slither + custom detectors + fuzz)	Light (manual-only)	Heavy for module-coverage, light for cross-cutting. Tools catch the easy; manual catches the hard.
Specialization	Generalist across all categories	Specialize in 1–2 (e.g., AMM + lending)	Specialize from month 6 onward. Generalists peak at 60th percentile; specialists land 90th+ in their niche.
Platform selection	All-in on one (e.g., C4)	Diversify across C4 + Sherlock + Cantina	Diversify after month 6. Different platforms reward different submission styles; each platform’s signal is useful.
QA report effort	Skip it (focus H/M)	Polish it (Grade A target)	Polish it. 3–4 hours for $1–3k of Grade B is excellent ROI; skipping QA is leaving money on the table.
PoC quality	Minimal “shows the bug”	Polished fork-test with realistic numbers	Polished for Medium+; minimal for Low. The PoC is severity-defending evidence.
Escalations	Skip (“waste of time”)	Escalate aggressively	Escalate carefully. Even a ~30% success rate is on the high side; don’t escalate without rubric language or without new evidence.
Public sharing of process	Private until established	Tweet / blog / Discord	Public from day 1, modestly. Visibility compounds; the auditors who hit 6-figure annuals all built audiences alongside their finding portfolios.
Income mix in year 1	100% contests	Mix with private leads	If unproven publicly, 100% contests until you have a 6-finding Solodit portfolio. Then mix.
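The Escalations row reduces to a short expected-value check. A minimal sketch, assuming you can estimate the payout uplift a successful escalation would bring and the hours it costs to write; the function names, the $2,000 uplift, the 2 hours, and the $50/hr rate are illustrative assumptions, not platform figures:

```python
def escalation_ev(success_prob, payout_uplift_usd, hours, hourly_cost=50.0):
    """Expected net value of one escalation, in USD."""
    if not 0.0 <= success_prob <= 1.0:
        raise ValueError("success_prob must be between 0 and 1")
    return success_prob * payout_uplift_usd - hours * hourly_cost


def break_even_uplift(success_prob, hours, hourly_cost=50.0):
    """Smallest payout uplift at which the escalation stops losing money."""
    if not 0.0 < success_prob <= 1.0:
        raise ValueError("success_prob must be in (0, 1]")
    return hours * hourly_cost / success_prob


if __name__ == "__main__":
    # Hypothetical case: 30% chance, $2,000 uplift if upheld, 2 hours of writing.
    print(escalation_ev(0.30, 2000, 2))
    print(break_even_uplift(0.30, 2))
```

The break-even figure is the useful one in practice: if the uplift you are arguing for is below it, the escalation is a cost even at an optimistic success rate.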

13. Quiz (≥80% to advance)

Q: A new warden hits an 8-day Code4rena contest with 100 hours committed. Pool is $200k, 150 wardens submitted, 80 of them landed at least one valid finding. The new warden lands 2 dupes (one of 10, one of 18). Their gross earnings, rough order of magnitude? A: A few hundred dollars. Dupe-of-10 and dupe-of-18 on a typical M-severity each yield $300–$900 to a supporting warden. Total: probably $500–$1,500. Net of $50/hr opportunity cost ($5,000): heavily negative. Calibration value still high if they run a retrospective.
Q: Sherlock’s severity rubric — your finding causes a 0.005% loss of user fees under a precise market condition. Severity? A: Invalid. Sherlock’s Medium threshold is 0.01% AND >$10 of fees. 0.005% is below the threshold. The finding may still appear in QA-equivalent notes (not paid on Sherlock).
Q: A contest README says: “Owner is assumed to act honestly except for the specific oracle-rotation function, which is in scope for Owner-induced attacks.” You find that any function with an onlyOwner modifier and a call to transferFrom can drain the protocol. Severity? A: Out-of-scope for non-oracle-rotation owner abuse. The contest README narrowed Owner-attack scope to oracle-rotation; other owner-induced attacks are out of scope. File as QA or note for the protocol’s information; don’t expect payment.
Q: You write up a finding as High. The judge rules Medium, citing “requires specific market condition (UniV3 pool depth < $X)”. The README didn’t state this assumption. Do you escalate? A: Yes. The escalation cites that the depth condition isn’t in the README; demonstrate from on-chain history that the condition is met on a recurring basis (e.g., “in the last 90 days, this pool depth was met N times”). 15–30% escalation success rate makes this clearly EV-positive.
Q: What’s the slot-share intuition for why a unique High pays much more than a same-severity 10-dupe High? A: The pool for that finding is divided across warden contributors using a formula where supporting wardens share a fraction (often 1 / (n+1) or similar). A solo finder captures the entire slot value; 10 dupes split a slightly larger primary-bonus-augmented pool, but each supporting share is ~10× smaller than the solo. Uniqueness is income-multiplicative.
Q: Code4rena vs Sherlock — which is “harder” for a finding to be valid? A: Both are strict, in different ways. Code4rena considers likelihood (a high-impact, very-unlikely bug may be downgraded). Sherlock does not consider likelihood for validity but has tighter loss thresholds (>0.01% AND >$10 minimum). Practical answer: high-impact-low-likelihood bugs tend to pass Sherlock; high-likelihood-medium-impact bugs tend to pass Code4rena. Choose platform partly by your bug’s profile.
Q: You spend 90 minutes scouting a contest. The README is thin, the team is unresponsive on Discord, the protocol forks Compound V2 with minor changes, and 200 wardens have already signed up. Decision? A: Skip. Unresponsive team → ambiguous rulings; well-known Compound fork → low novelty edge; 200 wardens → deep dupe groups on the obvious bugs. Selection is a skill; better contests are coming.
Q: For a Foundry PoC of a flash-loan-amplified oracle attack, why is a fork test much stronger than a mock test? A: A mock test uses arbitrary numbers — judges discount “10x manipulation cost” as theoretical. A fork test uses real on-chain pool depths and flash-loan capacities, producing concrete USD numbers the judge can verify. Same finding, dramatically stronger evidence.
Q: A judge marks your finding “dupe of #42” but #42 has a different root cause (different function, different bug class) — only the user-visible symptom (fund loss) is similar. Action? A: Escalate as a de-dupe error. The fix is “the finding at #42 has root cause RC1; my finding has root cause RC2; the fixes are different code in different files.” Provide a working PoC showing your bug exists even after a hypothetical fix to #42. Most de-dupe escalations succeed when the auditor can show non-overlapping root causes.
Q: After 12 contests, your $/hr is $30 gross, your unique-find rate is 5%, your severity match-rate against judges is 50%. What’s the next move? A: Calibration is the bottleneck. Severity match rate of 50% is below the 60% inflection point. Spend a focused month on Solodit reading (Lab §11.1 style), with explicit before/after severity prediction. Continue 1 contest/month but de-emphasize hours-per-contest, emphasize study. If after 4 more contests the match rate isn’t ≥60%, reconsider whether the auditor career fits or whether you’d be better as a Solidity dev / DeFi engineer.
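The slot-share intuition from the quiz can be checked with a few lines of arithmetic. The sketch below assumes a rule of the form weight × decay^(n−1) / n shares per finder, modelled on the shape Code4rena’s award documentation has used. The weight of 10 for a High, the decay value, and the omission of any selected-report bonus are all assumptions to verify against the live docs before relying on the numbers:

```python
def shares_per_finder(n_finders, severity_weight=10.0, decay=0.9):
    """Shares each finder receives when n_finders report the same issue.

    Assumed rule: the issue's total weight shrinks by `decay` for every
    extra finder, and what remains is split equally among all finders.
    """
    if n_finders < 1:
        raise ValueError("n_finders must be at least 1")
    return severity_weight * decay ** (n_finders - 1) / n_finders


def solo_to_dupe_ratio(n_finders, decay=0.9):
    """How many times larger a solo share is than a share in an n-way dupe."""
    return shares_per_finder(1, decay=decay) / shares_per_finder(n_finders, decay=decay)


if __name__ == "__main__":
    for decay in (1.0, 0.9):
        ratio = solo_to_dupe_ratio(10, decay=decay)
        print(f"decay={decay}: solo share is {ratio:.1f}x a dupe-of-10 share")
```

With decay=1.0 (a plain equal split) a dupe-of-10 share is exactly one tenth of the solo share; any decay below 1 widens the gap further, which is why uniqueness multiplies income rather than adding to it.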

14. Bonus Deliverables

Decision-form template (§4.5) filled for at least 3 hypothetical contests, with go/no-go reasoning.
Re-writeup of one of your own historical bugs from Weeks 5–14 in full Code4rena-style finding format.
Solodit calibration study: 100 findings read with pre-ruling severity predictions; final tabulated match rate.
Invalidated-findings analysis from one recent contest (Lab §11.3).
First live contest submission (Lab §11.4) + retrospective journal.
Updated audit-checklist-master with this chapter’s anti-patterns.

15. Where this leads

Two parallel arcs from here:

Public-arena loop:

Pick one contest per month using §4’s selection criteria.
Run the §5 phased workflow.
Submit using the §6 template.
Escalate when warranted (§7.4).
Retrospective journal (§9.5).
Solodit study between contests.

Over 12 months this produces a portfolio. The portfolio produces leads. The leads produce private-engagement income at rates Tuan-15-Audit-Methodology-Tooling §3.3 quotes.

Bug-bounty parallel: Tuan-Bonus-Bug-Bounty-Immunefi covers the continuous-bug-bounty side. Many top wardens run an Immunefi continuous program for a single major protocol while doing contests — the bounty’s higher per-finding payout (10% of TVL, up to $1M+) rewards the unique critical the contests sometimes don’t surface.

Eventually, hybrid:

Income shape after 18–24 months (representative):
  ~30%  competitive contests
  ~50%  private retainer / boutique audits
  ~15%  Immunefi / continuous bug bounty
  ~5%   speaking / writing / consulting

The contests stay in the mix because they’re calibration. The day a senior auditor stops calibrating against the community is the day their judgement starts to drift — and the bugs they miss get progressively more expensive when missed in private work.

The market is a feedback loop. Stay in it.

Last updated: 2026-05-16
See also: Roadmap · References · Tuan-15-Audit-Methodology-Tooling · Tuan-16-Report-Writing-Capstone · Tuan-Bonus-Bug-Bounty-Immunefi · severity-rubric-immunefi-c4 · audit-checklist-master · Tuan-05-Vulnerability-Classes-Part-1

