Open Benchmark · AI Security

The EVMBench Leaderboard

EVMBench is a standardized benchmark built by OpenAI for AI vulnerability detection on EVM smart contracts: 117 ground-truth vulnerabilities across 40 Code4rena audits. Vendors keep publishing one-off numbers. No one has put them on a single board — so we did.

40 Repositories
117 Vulnerabilities (120 at launch)
8 Published results
#  Model / Agent    Type        Vendor               Detection Recall  Found
1  Azimuth          Our entry   TestMachine          Pending
2  AuditAgent       Agent       Nethermind Security  67%
   Scored against the original 120-finding release.
3  Kai              Agent       Dria (Kai Security)  64.2%
   Self-reported on Kai-Bench: 64.2% detect recall, $75k detect award.
4  Guardix          Agent       Guardix              59.8%
   Scored against the canonical 117-finding set (after OpenAI removed 3 findings post-release).
5  Claude Opus 4.6  Base model  Anthropic            47%
6  GPT-5.2          Base model  OpenAI               38%
7  GPT-5.3 Codex    Base model  OpenAI               36%
8  GPT-5.2 Codex    Base model  OpenAI               33%
9  GPT-5            Base model  OpenAI               21%

Why some scores show /120 and others /117. EVMBench launched with 120 findings, but shortly after release OpenAI removed three of them, leaving 117 as the canonical set. Vendors who scored before the patch (e.g. Nethermind AuditAgent) report against 120; vendors who scored after (e.g. Guardix) report against 117. Each bar shows recall as a percentage of the denominator that vendor used, so entries can be read side by side on one 0–100% scale. The two denominators are not identical, though: the same run can score up to 2.5 percentage points differently against 120 than against 117. Results are self-reported by each vendor; we have not independently re-run other vendors' pipelines.
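To make the denominator effect concrete, here is a minimal sketch of the arithmetic. The found count of 60 is hypothetical and chosen only for illustration; it is not any vendor's reported result. The function and constant names are ours, not part of EVMBench.

```python
def recall_percent(found: int, total: int) -> float:
    """Recall as a percentage of the ground-truth set a vendor scored against."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= found <= total:
        raise ValueError("found must be between 0 and total")
    return 100.0 * found / total


LAUNCH_TOTAL = 120     # findings at launch, per the text above
CANONICAL_TOTAL = 117  # findings after three were removed

# Hypothetical count for illustration only; not a reported result.
example_found = 60

against_launch = recall_percent(example_found, LAUNCH_TOTAL)
against_canonical = recall_percent(example_found, CANONICAL_TOTAL)
gap = against_canonical - against_launch

print(f"{example_found}/{LAUNCH_TOTAL} = {against_launch:.1f}%")
print(f"{example_found}/{CANONICAL_TOTAL} = {against_canonical:.1f}%")
print(f"gap = {gap:.1f} percentage points")
```

The sketch assumes that none of the 60 detected bugs is among the three removed findings. If some were, the count itself would also change between the two sets.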

Real audits. Real bugs. No leakage.

EVMBench was assembled by OpenAI from 40 historical Code4rena audit contests. Each repo ships with its ground-truth list of confirmed vulnerabilities. A tool runs against the unmodified source and emits findings; recall is the fraction of ground-truth bugs the tool surfaced.
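The recall definition above can be sketched in a few lines. This is a simplified illustration, not the benchmark's actual grader: it assumes each finding has already been mapped to a ground-truth ID, so matching reduces to set membership. The repo names and finding IDs are hypothetical.

```python
def benchmark_recall(
    truth_by_repo: dict[str, set[str]],
    reported_by_repo: dict[str, set[str]],
) -> float:
    """Fraction of all ground-truth bugs that the tool surfaced, pooled across repos."""
    total = sum(len(truth) for truth in truth_by_repo.values())
    if total == 0:
        raise ValueError("ground truth is empty")
    hits = sum(
        len(truth & reported_by_repo.get(repo, set()))
        for repo, truth in truth_by_repo.items()
    )
    return hits / total


# Hypothetical data: two repos, four ground-truth bugs in total.
truth = {
    "repo-a": {"H-01", "H-02", "H-03"},
    "repo-b": {"H-01"},
}
# The tool reports two true bugs and one false positive in repo-a, nothing in repo-b.
reported = {
    "repo-a": {"H-01", "H-03", "X-99"},
    "repo-b": set(),
}

overall = benchmark_recall(truth, reported)
print(f"overall recall: {overall:.0%}")
```

Note that the false positive (`X-99`) does not lower recall. That is why a recall figure on its own says nothing about how noisy a tool's output is.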

Unlike synthetic CTF challenges, every bug here was found by a human auditor against money already live on-chain. That makes EVMBench the closest thing the industry has to a fair, public scoreboard.

  • Agent — multi-step system with tools, retrieval, and pipeline logic.
  • Base model — an LLM prompted directly, no tooling or orchestration.
  • Recall — % of ground-truth vulns the system detected before any manual review.
  • Found — raw count / denominator the vendor scored against.

Run EVMBench? We'll add you.

Publish your full results — repo-by-repo recall, false-positive count, and the script you ran — and we'll list your entry. We're tracking AuditAgent, Guardix, and the base models above. Olympix, Cyfrin, Cantina, Spearbit, ChainPatrol, ConsenSys Diligence — if you have numbers, send them.