Open Benchmark · AI Security

The EVMBench Leaderboard

EVMBench is a standardized benchmark built by OpenAI for AI vulnerability detection on EVM smart contracts: 117 ground-truth vulnerabilities across 40 Code4rena audits. Vendors keep publishing one-off numbers, and no one had put them on a single board, so we did.

40 Repositories

117 Vulnerabilities (120 at launch)

11 Published results

| # | Model / Agent | Type | Vendor | Detection Recall | Found | Notes | Source |
|---|---|---|---|---|---|---|---|
| 1 | Azimuth | Our entry | TestMachine | 78.6% | 92/117 | Combined run across the canonical 117-finding set: 92/117 detected. | testmachine.ai/products/azimuth ↗ |
| 2 | AuditAgent | Agent | Nethermind Security | 67% | 80/120 | Scored against the original 120-finding release. | auditagent.nethermind.io/blog ↗ |
| 3 | Kai | Agent | Dria (Kai Security) | 64.2% | 77/120 | Self-reported on Kai-Bench: 64.2% detect recall, $75k detect award. | kai.dria.co/benchmark ↗ |
| 4 | Guardix | Agent | Guardix | 59.8% | 70/117 | Scored against the canonical 117-finding set (after OpenAI removed 3 findings post-release). | guardix.dev/blog ↗ |
| 5 | GPT-5.5 (Codex) | Base model | OpenAI | 53.8% | 63/117 | 40/40 tasks, 0 rollout failures. Detect award: 7.70% ($16.8k / $217.8k bounty pool). | codex-gpt-5.5 · detect split ↗ |
| 6 | Claude Opus 4.6 | Base model | Anthropic | 47% | 56/120 | | auditagent.nethermind.io/blog ↗ |
| 7 | GLM 5.2 | Base model | Zhipu AI (via Fireworks) | 38.33% | 46/120 | 41/41 tasks, 0 rollout failures. Detect award: 6.30% ($13.7k / $217.8k bounty pool). | testmachine.ai/evmbench ↗ |
| 8 | GPT-5.2 | Base model | OpenAI | 38% | 45/120 | | auditagent.nethermind.io/blog ↗ |
| 9 | GPT-5.3 Codex | Base model | OpenAI | 36% | 43/120 | | auditagent.nethermind.io/blog ↗ |
| 10 | GPT-5.2 Codex | Base model | OpenAI | 33% | 39/120 | | auditagent.nethermind.io/blog ↗ |
| 11 | GPT-5 | Base model | OpenAI | 21% | 25/120 | | auditagent.nethermind.io/blog ↗ |

Why some scores show /120 and others /117. EVMBench launched with 120 findings, but shortly after release OpenAI removed three of them, leaving 117 as the canonical set. Vendors who scored before the patch (e.g. Nethermind AuditAgent) report against 120; vendors who scored after (e.g. Guardix) report against 117. Each bar shows recall as a percentage of the denominator that entry was scored against, so all bars share one 0-100% scale. Because the two denominators differ by only three findings, the percentages are closely, though not exactly, comparable. Results are self-reported by each vendor; we have not independently re-run other vendors' pipelines.
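To make the /120 versus /117 arithmetic concrete, here is a minimal Python sketch. The function names `recall_percent` and `rescored_range` are ours, not part of EVMBench. The rescoring bound rests on an assumption: that the three removed findings are simply dropped from both numerator and denominator. We do not know how many of them any given vendor detected, so the sketch reports a range rather than a single number.

```python
def recall_percent(found: int, denominator: int) -> float:
    """Recall as a percentage of the ground-truth set an entry was scored against."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    if not 0 <= found <= denominator:
        raise ValueError("found must be between 0 and denominator")
    return 100.0 * found / denominator


def rescored_range(found: int, denominator: int, removed: int = 3) -> tuple[float, float]:
    """Bounds on recall if `removed` findings were dropped from the ground truth.

    Assumption (ours): an entry detected between 0 and `removed` of the
    dropped findings, so its numerator shrinks by at most `removed`.
    """
    new_denominator = denominator - removed
    lowest = max(found - removed, 0)       # every removed finding had been detected
    highest = min(found, new_denominator)  # none of the removed findings had been detected
    return (recall_percent(lowest, new_denominator),
            recall_percent(highest, new_denominator))


if __name__ == "__main__":
    # (entry, found, denominator) copied from the board above
    for name, found, denom in [("Azimuth", 92, 117), ("AuditAgent", 80, 120), ("Guardix", 70, 117)]:
        print(f"{name:<12} {found}/{denom} = {recall_percent(found, denom):.1f}%")
    low, high = rescored_range(80, 120)
    print(f"AuditAgent rescored on 117 would fall between {low:.1f}% and {high:.1f}%")
```

Under that assumption, a pre-patch score moves by at most a few percentage points when rescored, which is why the board treats the two denominators as closely comparable rather than identical.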

What EVMBench measures

Real audits. Real bugs. No leakage.

EVMBench was assembled by OpenAI from 40 historical Code4rena audit contests. Each repo ships with its ground-truth list of confirmed vulnerabilities. A tool runs against the unmodified source and emits findings; recall is the fraction of ground-truth bugs the tool surfaced.

Unlike synthetic CTF challenges, every bug here was found by a human auditor against money already live on-chain. That makes EVMBench the closest thing the industry has to a fair, public scoreboard.

How to read the board

Agent — multi-step system with tools, retrieval, and pipeline logic.
Base model — an LLM prompted directly, no tooling or orchestration.
Recall — % of ground-truth vulns the system detected before any manual review.
Found — raw count / denominator the vendor scored against.

Submission requirements

What counts as a valid EVMBench run.

To keep this leaderboard meaningful, every entry must be compliant with the EVMBench specification. Numbers produced under a modified protocol aren't comparable to numbers produced under the standard one, however good they look, so we hold all entries to the same bar.

How to run the benchmark

Use the canonical task infrastructure

Run the official EVMBench task runner from openai/frontier-evals unmodified. The runner controls how the audit repository is presented to the agent and how findings are scored — running anything else, even with the same repos, isn't EVMBench.

uv run python -m evmbench.nano.entrypoint \
    ... \
    evmbench.log_to_run_dir=True \
    ...

Full audit repository as input

The agent receives the complete audit repository — not a hand-picked subset of files, contracts, or functions. Pre-filtering the input changes the task; running on a curated slice is not a valid EVMBench run.

All 40 repos, no skipping

Runs must be scored across the full 117-finding canonical set (or the 120-finding set if you ran pre-patch). Partial runs, cherry-picked repos, or any post-hoc filtering of the ground-truth list disqualify the result.
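The coverage rules above can be expressed as a simple pre-submission check. This is a hypothetical sketch, not part of the official runner: the function `check_run` and its two input dictionaries (ground-truth finding counts per repo, and detected counts per repo) are assumptions made for illustration, and the real run directory may be organised differently.

```python
EXPECTED_REPOS = 40               # "All 40 repos, no skipping"
VALID_DENOMINATORS = {117, 120}   # canonical set, or the pre-patch release


def check_run(ground_truth_per_repo: dict[str, int],
              detected_per_repo: dict[str, int]) -> list[str]:
    """Return a list of coverage problems; an empty list means none were found.

    Both arguments are hypothetical summaries keyed by repo name:
    number of ground-truth findings, and number of those the run detected.
    """
    problems = []
    if len(ground_truth_per_repo) != EXPECTED_REPOS:
        problems.append(
            f"ground truth covers {len(ground_truth_per_repo)} repos, expected {EXPECTED_REPOS}")
    total = sum(ground_truth_per_repo.values())
    if total not in VALID_DENOMINATORS:
        problems.append(f"ground truth has {total} findings, expected 117 or 120")
    skipped = sorted(set(ground_truth_per_repo) - set(detected_per_repo))
    if skipped:
        problems.append(f"run skipped repos: {', '.join(skipped)}")
    for repo, detected in detected_per_repo.items():
        if detected > ground_truth_per_repo.get(repo, 0):
            problems.append(f"{repo}: more detections than ground-truth findings")
    return problems
```

A check like this only catches skipped repos and a filtered ground-truth list; it says nothing about whether the agent saw the full repository or whether the runner was modified.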

What you need to provide

To list your entry we need enough to verify it's a valid EVMBench run and, if helpful, to reproduce it ourselves:

The full run-group-id directory produced by the canonical command (with evmbench.log_to_run_dir=True).
The exact command line invoked, including all arguments and flags.
Agent name and version — the system under test, with enough detail to identify it.
Date of the run, so we can note it alongside the canonical 117 / 120 history.

We may publish a write-up, contact you for clarification, or independently re-run before adding the entry.
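For submitters who want to sanity-check a package before sending it, the four items listed above can be bundled into one record. The `Submission` class and its field names below are hypothetical, not a format we require; the only string taken from the requirements is the `evmbench.log_to_run_dir=True` flag.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path

REQUIRED_FLAG = "evmbench.log_to_run_dir=True"  # flag named in the requirements above


@dataclass
class Submission:
    """Hypothetical bundle of the four items requested for a leaderboard entry."""
    run_group_dir: Path   # the full run-group-id directory
    command: str          # exact command line, including all arguments and flags
    agent_name: str
    agent_version: str
    run_date: date

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means the bundle looks complete."""
        found = []
        if not self.run_group_dir.is_dir():
            found.append(f"run directory not found: {self.run_group_dir}")
        if REQUIRED_FLAG not in self.command:
            found.append(f"command does not include {REQUIRED_FLAG}")
        if not self.agent_name.strip() or not self.agent_version.strip():
            found.append("agent name and version are both required")
        if self.run_date > date.today():
            found.append("run date is in the future")
        return found
```

This checks completeness only. It cannot confirm that the run itself followed the protocol, which is why a re-run may still be needed.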


Sources

Where these numbers come from

Nethermind Security

AuditAgent: 80/120 (67%) on EVMBench

auditagent.nethermind.io/blog ↗

Guardix

59.8% recall across 117 high-severity vulnerabilities

guardix.dev/blog ↗

Nethermind (chart)

Claude Opus 4.6 / GPT-5 family base-model recall

auditagent.nethermind.io/blog ↗