The EVMBench Leaderboard
EVMBench is a standardized benchmark built by OpenAI for AI vulnerability detection on EVM smart contracts: 117 ground-truth vulnerabilities across 40 Code4rena audits. Vendors keep publishing one-off numbers. No one has put them on a single board — so we did.
Why some scores show /120 and others /117. EVMBench launched with 120 findings, but shortly after release OpenAI removed three of them, leaving 117 as the canonical set. Vendors who scored before the patch (e.g. Nethermind AuditAgent) report against 120; vendors who scored after (e.g. Guardix) report against 117. Each bar shows recall as a percentage of the denominator that vendor scored against, so all entries sit on one 0 to 100% scale even though the underlying sets differ by three findings. Results are self-reported by each vendor; we have not independently re-run other vendors' pipelines.
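To make the normalisation concrete, here is a minimal sketch of how a found/denominator pair could be turned into a percentage bar. The `Entry` class, the entry names, and the counts are hypothetical placeholders, not real vendor results; only the two denominators (120 and 117) come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    name: str
    found: int        # ground-truth vulnerabilities the vendor reports detecting
    denominator: int  # 120 if scored before the patch, 117 if scored after

    def recall_pct(self) -> float:
        # Guard against malformed self-reported numbers.
        if self.denominator <= 0:
            raise ValueError("denominator must be positive")
        if not 0 <= self.found <= self.denominator:
            raise ValueError("found must be between 0 and denominator")
        return 100.0 * self.found / self.denominator


# Placeholder counts, NOT real scores.
entries = [
    Entry("example-pre-patch", found=60, denominator=120),
    Entry("example-post-patch", found=60, denominator=117),
]

# Sort by normalised recall, highest first, as a bar chart would.
for e in sorted(entries, key=lambda e: e.recall_pct(), reverse=True):
    print(f"{e.name}: {e.found}/{e.denominator} = {e.recall_pct():.1f}%")
```

The same raw count yields a slightly higher percentage against 117 than against 120, which is why the board shows the denominator next to every bar.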
Real audits. Real bugs. No leakage.
EVMBench was assembled by OpenAI from 40 historical Code4rena audit contests. Each repo ships with its ground-truth list of confirmed vulnerabilities. A tool runs against the unmodified source and emits findings; recall is the fraction of ground-truth bugs the tool surfaced.
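As a rough illustration of the recall definition above, the sketch below treats both the ground-truth list and the tool's output as sets of finding IDs. It assumes each tool finding has already been matched to a ground-truth ID, which in practice is a judgment step this sketch does not cover. The `recall` function and the IDs are hypothetical and are not part of EVMBench itself.

```python
def recall(ground_truth: set[str], reported: set[str]) -> float:
    """Fraction of ground-truth bugs that appear among the reported findings."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    # Reported findings outside the ground truth do not affect recall.
    return len(ground_truth & reported) / len(ground_truth)


# Hypothetical IDs for a single audit repo.
ground_truth = {"H-01", "H-02", "H-03", "H-04"}
reported = {"H-01", "H-03", "X-99"}  # "X-99" matches no ground-truth bug

print(f"recall = {recall(ground_truth, reported):.2f}")
```

Because extra findings never lower recall, a tool can raise it simply by reporting more, which is one reason to look at false-positive counts alongside it.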
Unlike synthetic CTF challenges, every bug here was found by a human auditor against money already live on-chain. That makes EVMBench the closest thing the industry has to a fair, public scoreboard.
- Agent — multi-step system with tools, retrieval, and pipeline logic.
- Base model — an LLM prompted directly, no tooling or orchestration.
- Recall — % of ground-truth vulns the system detected before any manual review.
- Found — raw count / denominator the vendor scored against.
Run EVMBench? We'll add you.
Publish your full results — repo-by-repo recall, false-positive count, and the script you ran — and we'll list your entry. We're tracking AuditAgent, Guardix, and the base models above. Olympix, Cyfrin, Cantina, Spearbit, ChainPatrol, ConsenSys Diligence — if you have numbers, send them.