The EVMBench Leaderboard
EVMBench is a standardized benchmark built by OpenAI for AI vulnerability detection on EVM smart contracts: 117 ground-truth vulnerabilities across 40 Code4rena audits. Vendors keep publishing one-off numbers. No one has put them on a single board — so we did.
Why some scores show /120 and others /117.
EVMBench launched with 120 findings, but shortly after release OpenAI removed three of them, leaving 117 as the canonical set. Vendors who scored before the patch (e.g. Nethermind AuditAgent) report against 120; vendors who scored after (e.g. Guardix) report against 117. Bars are normalised to a common 100% recall scale, so they remain directly comparable. Results are self-reported by each vendor; we have not independently re-run other vendors' pipelines.
Real audits. Real bugs. No leakage.
EVMBench was assembled by OpenAI from 40 historical Code4rena audit contests. Each repo ships with its ground-truth list of confirmed vulnerabilities. A tool runs against the unmodified source and emits findings; recall is the fraction of ground-truth bugs the tool surfaced.
Unlike synthetic CTF challenges, every bug here was found by a human auditor against money already live on-chain. That makes EVMBench the closest thing the industry has to a fair, public scoreboard.
- Agent — multi-step system with tools, retrieval, and pipeline logic.
- Base model — an LLM prompted directly, no tooling or orchestration.
- Recall — % of ground-truth vulns the system detected before any manual review.
- Found — raw count / denominator the vendor scored against.
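The Recall and Found columns are related by simple division. The sketch below assumes a hypothetical helper name (`recall_percent`) and uses made-up counts, not any vendor's actual score. It shows how a count reported against 120 and a count reported against 117 end up on the same percentage scale:

```python
def recall_percent(found: int, denominator: int) -> float:
    """Return recall as a percentage of the ground-truth set that was scored against."""
    if denominator not in (117, 120):
        raise ValueError("EVMBench denominators are 117 (canonical) or 120 (pre-patch)")
    if not 0 <= found <= denominator:
        raise ValueError("found must be between 0 and the denominator")
    return 100.0 * found / denominator


# Illustrative counts only; these are not real leaderboard results.
pre_patch = recall_percent(60, 120)    # scored before the three findings were removed
post_patch = recall_percent(39, 117)   # scored against the canonical set
print(f"{pre_patch:.1f}% vs {post_patch:.1f}%")
```

Dividing by the denominator each vendor actually scored against is what allows a /120 entry and a /117 entry to share one axis.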
What counts as a valid EVMBench run.
To keep this leaderboard meaningful, every entry must comply with the EVMBench specification. Numbers produced under a modified protocol are not comparable to numbers produced under the official one, however good they look, so we hold all entries to the same bar.
How to run the benchmark
Run the official EVMBench task runner from openai/frontier-evals unmodified. The runner controls how the audit repository is presented to the agent and how findings are scored — running anything else, even with the same repos, isn't EVMBench.
uv run python -m evmbench.nano.entrypoint \
... \
evmbench.log_to_run_dir=True \
...
The agent receives the complete audit repository, not a hand-picked subset of files, contracts, or functions. Pre-filtering the input changes the task; running on a curated slice is not a valid EVMBench run.
The run is scored across the full 117-finding canonical set (or the 120-finding set if you ran pre-patch). Partial runs, cherry-picked repos, or any post-hoc filtering of the ground-truth list disqualify the result.
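The completeness rule can be expressed as a quick pre-submission check. This is a minimal sketch, not part of the official runner: the function name and its arguments are hypothetical, and it assumes the pre-patch set also spanned all 40 audits.

```python
CANONICAL_FINDINGS = 117   # ground-truth set after three findings were removed
PRE_PATCH_FINDINGS = 120   # ground-truth set at launch
TOTAL_AUDITS = 40          # Code4rena audits in the benchmark (assumed for both sets)


def run_problems(audits_scored: int, findings_scored: int) -> list[str]:
    """Return the reasons a run would be rejected; an empty list means none were found."""
    problems = []
    if audits_scored != TOTAL_AUDITS:
        problems.append(
            f"partial run: {audits_scored} of {TOTAL_AUDITS} audits scored"
        )
    if findings_scored not in (CANONICAL_FINDINGS, PRE_PATCH_FINDINGS):
        problems.append(
            f"ground-truth list has {findings_scored} findings, expected "
            f"{CANONICAL_FINDINGS} or {PRE_PATCH_FINDINGS}"
        )
    return problems


print(run_problems(40, 117))  # full canonical run
print(run_problems(25, 80))   # cherry-picked repos and a filtered ground-truth list
```

A check like this only catches missing audits or findings. It cannot tell whether the agent was shown the complete repository, which is why the run directory and the exact command are still needed for review.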
What you need to provide
To list your entry we need enough to verify it's a valid EVMBench run and, if helpful, to reproduce it ourselves:
- The full run-group-id directory produced by the canonical command (with evmbench.log_to_run_dir=True).
- The exact command line invoked, including all arguments and flags.
- Agent name and version — the system under test, with enough detail to identify it.
- Date of the run, so we can note it alongside the canonical 117 / 120 history.
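One way to keep track of the four items above is a small record that flags anything left blank. The sketch below is purely illustrative: the `Submission` class, its field names, and the placeholder values are assumptions, not a format the leaderboard requires.

```python
from dataclasses import dataclass, fields


@dataclass
class Submission:
    run_dir: str        # path to the full run-group-id directory
    command: str        # exact command line, including all arguments and flags
    agent_name: str     # system under test
    agent_version: str  # enough detail to identify the exact build
    run_date: str       # date of the run, e.g. in ISO 8601 form


def missing_items(sub: Submission) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(sub) if not getattr(sub, f.name).strip()]


# Placeholder values only; substitute the details of your own run.
draft = Submission(
    run_dir="<path to run-group-id directory>",
    command="<exact command line as invoked>",
    agent_name="<agent name>",
    agent_version="",
    run_date="<date of run>",
)
print(missing_items(draft))
```

For the draft above, the check reports `agent_version` as the only missing item.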
We may publish a write-up, contact you for clarification, or independently re-run before adding the entry.