Open Benchmark · AI Security

The EVMBench Leaderboard

EVMBench is a standardized benchmark built by OpenAI for AI vulnerability detection on EVM smart contracts: 117 ground-truth vulnerabilities across 40 Code4rena audits. Vendors keep publishing one-off numbers. No one has put them on a single board — so we did.

40 Repositories
117 Vulnerabilities (120 at launch)
10 Published results
| # | Model / Agent | Type | Vendor | Detection Recall | Found | Notes |
|---|---|---|---|---|---|---|
| 1 | Azimuth | Our entry | TestMachine | 75.2% | 88/117 | Combined run across the canonical 117-finding set: 88/117 detected. |
| 2 | AuditAgent | Agent | Nethermind Security | 67% | | Scored against the original 120-finding release. |
| 3 | Kai | Agent | Dria (Kai Security) | 64.2% | | Self-reported on Kai-Bench: 64.2% detect recall, $75k detect award. |
| 4 | Guardix | Agent | Guardix | 59.8% | | Scored against the canonical 117-finding set (after OpenAI removed 3 findings post-release). |
| 5 | GPT-5.5 (Codex) | Base model | OpenAI | 53.8% | | 40/40 tasks, 0 rollout failures. Detect award: 7.70% ($16.8k / $217.8k bounty pool). |
| 6 | Claude Opus 4.6 | Base model | Anthropic | 47% | | |
| 7 | GPT-5.2 | Base model | OpenAI | 38% | | |
| 8 | GPT-5.3 Codex | Base model | OpenAI | 36% | | |
| 9 | GPT-5.2 Codex | Base model | OpenAI | 33% | | |
| 10 | GPT-5 | Base model | OpenAI | 21% | | |

Why some scores show /120 and others /117. EVMBench launched with 120 findings, but shortly after release OpenAI removed three of them, leaving 117 as the canonical set. Vendors who scored before the patch (e.g. Nethermind AuditAgent) report against 120; vendors who scored after (e.g. Guardix) report against 117. Each bar shows recall as a percentage of the denominator that entry was scored against, so all entries share one 0 to 100% scale and remain comparable to within the three-finding difference. Results are self-reported by each vendor; we have not independently re-run other vendors' pipelines.
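
The effect of the two denominators is plain arithmetic. The sketch below assumes a hypothetical helper, `recall_pct`, which is not part of EVMBench. The only figure taken from the board is the 88/117 result; the second call reuses that count against the launch set purely for illustration.

```python
def recall_pct(found: int, denominator: int) -> float:
    """Detection recall as a percentage of the ground-truth set scored against."""
    if denominator not in (117, 120):
        raise ValueError("EVMBench denominators are 117 (canonical) or 120 (launch)")
    if not 0 <= found <= denominator:
        raise ValueError("found must be between 0 and the denominator")
    return round(100 * found / denominator, 1)


# The raw count published on the board: 88 of 117.
print(recall_pct(88, 117))  # 75.2

# Hypothetical: the same raw count against the 120-finding launch set.
print(recall_pct(88, 120))  # 73.3
```

The same raw count lands roughly two points lower against 120 than against 117, which is why each entry's note says which set it was scored against.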

Real audits. Real bugs. No leakage.

EVMBench was assembled by OpenAI from 40 historical Code4rena audit contests. Each repo ships with its ground-truth list of confirmed vulnerabilities. A tool runs against the unmodified source and emits findings; recall is the fraction of ground-truth bugs the tool surfaced.

Unlike synthetic CTF challenges, every bug here was found by a human auditor against money already live on-chain. That makes EVMBench the closest thing the industry has to a fair, public scoreboard.

  • Agent — multi-step system with tools, retrieval, and pipeline logic.
  • Base model — an LLM prompted directly, no tooling or orchestration.
  • Recall — % of ground-truth vulns the system detected before any manual review.
  • Found — raw count / denominator the vendor scored against.
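
The Recall and Found definitions above can be sketched in a few lines. This assumes findings have already been matched to ground-truth IDs. In a real run the official runner does that scoring, and the function name, repo names, and IDs below are all made up for illustration.

```python
from typing import Dict, Set, Tuple


def detection_recall(
    ground_truth: Dict[str, Set[str]],
    detected: Dict[str, Set[str]],
) -> Tuple[int, int, float]:
    """Return (found, total, recall) summed over every repo in the ground truth.

    Only findings that match a ground-truth ID count. Extra findings a tool
    reports neither raise nor lower recall.
    """
    total = sum(len(ids) for ids in ground_truth.values())
    found = sum(
        len(ids & detected.get(repo, set()))
        for repo, ids in ground_truth.items()
    )
    return found, total, (found / total if total else 0.0)


# Hypothetical two-repo example.
truth = {"repo-a": {"H-01", "H-02"}, "repo-b": {"H-01", "H-02", "H-03"}}
hits = {"repo-a": {"H-01", "X-99"}, "repo-b": {"H-02", "H-03"}}

found, total, recall = detection_recall(truth, hits)
print(f"{found}/{total} = {recall:.1%}")  # 3/5 = 60.0%
```

Here `found/total` corresponds to the Found column and `recall` to Detection Recall. The unmatched `X-99` is ignored, because recall measures only coverage of the ground truth.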

What counts as a valid EVMBench run.

To keep this leaderboard meaningful, every entry must comply with the EVMBench specification. Numbers that look good but come from a modified protocol aren't comparable to numbers from the standard one, so we hold all entries to the same bar.

How to run the benchmark

01
Use the canonical task infrastructure

Run the official EVMBench task runner from openai/frontier-evals unmodified. The runner controls how the audit repository is presented to the agent and how findings are scored — running anything else, even with the same repos, isn't EVMBench.

uv run python -m evmbench.nano.entrypoint \
    ... \
    evmbench.log_to_run_dir=True \
    ...

02
Full audit repository as input

The agent receives the complete audit repository — not a hand-picked subset of files, contracts, or functions. Pre-filtering the input changes the task; running on a curated slice is not a valid EVMBench run.

03
All 40 repos, no skipping

Results must be scored across the full 117-finding canonical set (or the 120-finding set if you ran pre-patch). Partial runs, cherry-picked repos, or any post-hoc filtering of the ground-truth list disqualify the result.
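
A quick self-check for this rule might look like the sketch below. It assumes you can summarise a run as a mapping from repo name to the number of ground-truth findings it was scored against. That mapping, the function name, and the per-repo counts are hypothetical and do not reflect the benchmark's actual distribution or output format.

```python
from typing import Dict, List

EXPECTED_REPOS = 40              # every audit repository in the benchmark
VALID_DENOMINATORS = (117, 120)  # canonical set, or the launch set


def coverage_problems(scored_per_repo: Dict[str, int]) -> List[str]:
    """List the reasons a run fails the 'all 40 repos, no skipping' rule."""
    problems = []
    if len(scored_per_repo) != EXPECTED_REPOS:
        problems.append(
            f"expected {EXPECTED_REPOS} repos, got {len(scored_per_repo)}"
        )
    total = sum(scored_per_repo.values())
    if total not in VALID_DENOMINATORS:
        problems.append(f"scored against {total} findings, expected 117 or 120")
    return problems


# Made-up distribution that happens to sum to 117 across 40 repos.
full = {f"repo-{i:02d}": 3 for i in range(37)}
full.update({f"repo-{i:02d}": 2 for i in range(37, 40)})

# The same run with one repo skipped.
partial = dict(full)
del partial["repo-00"]

print(coverage_problems(full))     # []
print(coverage_problems(partial))  # two problems: repo count and denominator
```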

What you need to provide

To list your entry we need enough to verify it's a valid EVMBench run and, if helpful, to reproduce it ourselves:

  • The full run-group-id directory produced by the canonical command (with evmbench.log_to_run_dir=True).
  • The exact command line invoked, including all arguments and flags.
  • Agent name and version — the system under test, with enough detail to identify it.
  • Date of the run, so we can note it alongside the canonical 117 / 120 history.
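
One way to gather these four items before submitting is a small manifest with a pre-flight check. This is only a sketch: the `Submission` class, its field names, and the example values are hypothetical, and the leaderboard does not define a submission format. The only string taken from the instructions above is the `evmbench.log_to_run_dir=True` flag.

```python
import tempfile
from dataclasses import dataclass
from datetime import date
from pathlib import Path
from typing import List


@dataclass
class Submission:
    run_dir: Path        # full run-group-id directory from the canonical command
    command: str         # exact command line, with all arguments and flags
    agent_name: str      # system under test
    agent_version: str
    run_date: date       # lets the entry be placed in the 117 / 120 history

    def missing(self) -> List[str]:
        """Return the problems that would block a listing; empty if none."""
        issues = []
        if not self.run_dir.is_dir():
            issues.append("run directory not found")
        if "evmbench.log_to_run_dir=True" not in self.command:
            issues.append("command lacks evmbench.log_to_run_dir=True")
        if not self.agent_name.strip() or not self.agent_version.strip():
            issues.append("agent name and version are required")
        return issues


# Hypothetical example; the command keeps the elisions shown above.
with tempfile.TemporaryDirectory() as tmp:
    entry = Submission(
        run_dir=Path(tmp),
        command="uv run python -m evmbench.nano.entrypoint ... evmbench.log_to_run_dir=True ...",
        agent_name="ExampleAgent",
        agent_version="0.1.0",
        run_date=date(2025, 1, 1),
    )
    print(entry.missing())  # []
```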

We may publish a write-up, contact you for clarification, or independently re-run before adding the entry.