Azimuth on EVMBench: What 75.2% Recall Actually Means

Azimuth on EVMBench: 75.2% recall across 117 high-severity vulnerabilities, setting a new state of the art for AI smart contract security.

When AI security tools claim to find vulnerabilities, how do we know whether those claims are real? Benchmarks do not answer every question, but they provide the industry with a common basis for comparison. In smart contract security, that matters. The difference between a convincing demo and a useful security system is not rhetoric; it is whether the system can find real vulnerabilities in real code, under conditions that are at least somewhat standardized.

We ran Azimuth through EVMBench, the benchmark created by OpenAI and Paradigm to evaluate AI agents' performance in smart contract security. In EVMBench Detect mode, Azimuth identified 88 of 117 ground-truth, high-severity vulnerabilities across 40 real audit repositories, achieving 75.2% recall. That is a strong benchmark result, and in our comparison, it places Azimuth at the top of the current EVMBench detection leaderboard.

Understanding EVMBench

EVMBench asks a narrow but important question: can an AI agent find high-severity vulnerabilities in real smart contract code? The full EVMBench framework assesses agents across three modes: Detect, Patch, and Exploit. The results discussed here are for Detect mode only. In Detect mode, a tool is given a repository and asked to produce an audit report, which an LLM judge then evaluates against the ground-truth vulnerabilities. The judge accepts different terminology but requires a specific match on the underlying flaw and code path.

The detection score is recall: what percentage of the benchmark's known vulnerabilities did the system find? For Azimuth, that number was 88 out of 117, or 75.2%. The distinction between recall and overall quality is important. EVMBench Detect is not a complete measure of product quality. It does not fully measure false positives, remediation quality, exploit generation, cost, or how easy the results are for developers or auditors to use. It primarily tests detection coverage against a known set of serious vulnerabilities.
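The arithmetic behind the headline number is simple. As a minimal sketch, the helper below computes recall from the two counts reported in this post; the function name `recall` is ours for illustration and is not part of EVMBench.

```python
def recall(detected: int, total: int) -> float:
    """Share of known ground-truth vulnerabilities that were found."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= detected <= total:
        raise ValueError("detected must be between 0 and total")
    return detected / total


# Azimuth's EVMBench Detect figures as reported in this post.
azimuth = recall(detected=88, total=117)
print(f"{azimuth:.1%}")  # prints 75.2%
```

Note that nothing in this calculation depends on how many other findings the report contained, which is why recall alone says little about noise.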

The benchmark's dataset includes 117 curated high-severity vulnerabilities originating from 40 audit repositories. Most are drawn from public competitive audit settings, with additional scenarios included to broaden coverage. These are not toy examples. They are drawn from real codebases and represent the kinds of vulnerabilities that matter because they can lead to loss of user or protocol funds.

How Azimuth Was Tested

We ran Azimuth across all 40 repositories in parallel, without skipping any. We attribute the performance gap with other approaches to differences in methodology. Static analysis tools search for known patterns (suspicious external calls, access-control mistakes, reentrancy, arithmetic hazards) and tend to struggle when the vulnerability depends on protocol-specific semantics or a sequence of legitimate operations that becomes dangerous only when combined. Pure LLM code review has a different strength: a capable model can read code, infer intent, and generate hypotheses. But it becomes fragile when the repository is large, when context is distributed across many files, or when the issue depends on state changes that are hard to capture in a single prompt.

Azimuth is built on the idea that LLMs become more useful when given structure. Smart contract systems are rarely a single contract with a single entry point; they are networks of contracts, libraries, modifiers, roles, state variables, and external dependencies. Azimuth decomposes the repository into logical functional units, identifies the relevant interaction surfaces, and uses LLMs to investigate smaller, better-posed questions rather than asking a single model to "audit the code" in one pass. The goal is to turn a large repository into a set of bounded security questions that can be analyzed, checked, and, where possible, validated.

For EVMBench Detect, Azimuth produced findings that were evaluated against the benchmark's ground truth. In production use outside this benchmark, Azimuth also performs exploitability validation through forked execution environments and generates proof-of-concept code — but those capabilities are not scored here. The 75.2% number should be understood as a detection result, not as a claim about EVMBench Exploit or Patch performance.

The Results

Azimuth achieved 75.2% recall on EVMBench Detect, identifying 88 of the 117 known vulnerabilities. In our comparison, the results were:

Tool	Recall	Detected
Azimuth	75.2%	88 of 117
AuditAgent	67%	80 of 120
Guardix	59.8%	70 of 117
Claude Opus 4.6 (base model)	47%	56 of 120
GPT-5.2 (base model)	38%	45 of 120

View the full comparison: testmachine.ai/evmbench

The published denominators differ across some reported runs (117 in some, 120 in others), so the cleanest comparison is percentage recall rather than raw vulnerability counts. On that basis, Azimuth's result is a little more than eight percentage points above the published AuditAgent result. That improvement matters, but it should be interpreted carefully. At higher recall levels, additional detections tend to come from harder cases: subtler logic errors, more complicated code paths, or vulnerabilities that depend on interactions spanning multiple components. Moving from 67% to 75.2% does not mean the problem is solved. It suggests that how the analysis is structured can materially affect the number of serious issues identified.
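To make the denominator point concrete, here is a small sketch that recomputes recall from the detected and total counts in the table above and ranks the tools by percentage. The variable and function names are ours; the counts are copied from the table as published.

```python
# (tool, detected, total) as published in the comparison table.
results = [
    ("Azimuth", 88, 117),
    ("AuditAgent", 80, 120),
    ("Guardix", 70, 117),
    ("Claude Opus 4.6 (base model)", 56, 120),
    ("GPT-5.2 (base model)", 45, 120),
]


def recall_pct(detected: int, total: int) -> float:
    """Recall as a percentage of that run's own denominator."""
    return 100 * detected / total


# Rank by percentage so runs with different denominators are comparable.
ranked = sorted(results, key=lambda r: recall_pct(r[1], r[2]), reverse=True)
for name, detected, total in ranked:
    print(f"{name:<30} {recall_pct(detected, total):5.1f}%  ({detected} of {total})")

# Gap between the top two results, computed from the unrounded counts.
gap = recall_pct(88, 117) - recall_pct(80, 120)
print(f"Gap to AuditAgent: {gap:.1f} percentage points")
```

Computed from the unrounded counts, the gap comes out at about 8.5 percentage points; using the rounded 67% published for AuditAgent gives 8.2. Either way, the difference is a little more than eight points.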

What EVMBench Does Not Measure

EVMBench is valuable because it provides a standardized benchmark for detection. But recall is only one dimension of security-tool performance. The most important missing dimension is precision. Recall measures how many of the real vulnerabilities a tool catches: its true positives as a share of all actual vulnerabilities present in the code. Precision asks the reverse: of the findings a tool reports, how many are true positives rather than false positives? A tool can post high recall by flagging every plausible issue, producing many false positives along the way. The result is a large triage burden, and developers and auditors still have to sort through the noise.

EVMBench Detect does not fully establish precision because its primary score is whether the tool found the known ground-truth vulnerabilities. It does not comprehensively validate every additional finding a tool might report. A system could identify 88 real vulnerabilities and also report hundreds of low-quality findings; EVMBench Detect would still primarily credit the 88 hits. That is why benchmark recall should not be confused with production usability. In our internal analysis, Azimuth's post-validation findings measured at 65–88% precision, depending on the evaluation setting. That number should be read as a separate internal metric, not as an EVMBench score.
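The scenario above can be sketched in a few lines. The example compares two hypothetical reports that hit the same 88 ground-truth vulnerabilities but differ in how many extra findings they include. The `Report` class and both false-positive counts are invented for illustration; they are not measured values for Azimuth or any other tool. The sketch also assumes each true-positive finding maps to exactly one known vulnerability.

```python
from dataclasses import dataclass


@dataclass
class Report:
    true_positives: int   # reported findings that match a real vulnerability
    false_positives: int  # reported findings that do not

    def recall(self, known_vulnerabilities: int) -> float:
        """Share of known vulnerabilities that the report caught."""
        return self.true_positives / known_vulnerabilities

    def precision(self) -> float:
        """Share of reported findings that are real."""
        reported = self.true_positives + self.false_positives
        return self.true_positives / reported if reported else 0.0


KNOWN = 117  # EVMBench ground-truth vulnerabilities

# Hypothetical false-positive counts, chosen only to show the effect.
focused = Report(true_positives=88, false_positives=22)
noisy = Report(true_positives=88, false_positives=400)

for label, report in [("focused", focused), ("noisy", noisy)]:
    print(
        f"{label}: recall={report.recall(KNOWN):.1%} "
        f"precision={report.precision():.1%}"
    )
```

In this made-up case both reports score the same 75.2% recall, while precision is 80% for the first and roughly 18% for the second. A recall-only leaderboard would not distinguish them.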

Cost is another important limitation. EVMBench does not measure the cost of achieving a given recall level. Model usage, compute cost, orchestration cost, and human triage cost all matter if the tool is going to be used continuously. A system that finds more vulnerabilities but is too expensive to run on every commit may be less useful than one that is slightly less sensitive but can be used routinely. For Azimuth, the average cost to evaluate an EVMBench repository was approximately 171 credits. That provides one operational reference point, but cost comparisons only become meaningful when they include the whole workflow: model usage, compute, validation, reporting, and any downstream human review.
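As a rough back-of-envelope illustration, the reported average can be turned into an approximate total and a per-detection figure. This is only arithmetic on the numbers in this post (an approximate average of 171 credits, 40 repositories, 88 detections). It covers the benchmark run alone, not validation, reporting, or human review, and the function names are ours.

```python
AVG_CREDITS_PER_REPO = 171  # approximate average reported in this post
REPOSITORIES = 40           # EVMBench audit repositories
DETECTED = 88               # ground-truth vulnerabilities Azimuth found


def total_credits(avg_per_repo: float, repos: int) -> float:
    """Approximate total spend across all repositories."""
    return avg_per_repo * repos


def credits_per_detection(total: float, detected: int) -> float:
    """Approximate spend per ground-truth vulnerability found."""
    if detected <= 0:
        raise ValueError("detected must be positive")
    return total / detected


total = total_credits(AVG_CREDITS_PER_REPO, REPOSITORIES)
per_hit = credits_per_detection(total, DETECTED)
print(f"Approximate total: {total:,.0f} credits")
print(f"Approximate credits per detected vulnerability: {per_hit:.1f}")
```

Because the 171 figure is itself approximate, the derived values should be treated as order-of-magnitude estimates rather than precise accounting.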

EVMBench also does not measure usability. Two tools can receive credit for detecting the same vulnerability, even if one produces a vague paragraph and the other provides a clear attack path, affected contracts, a confidence score, and remediation guidance. For developers and auditors, that difference is substantial. A finding is more useful when it explains what failed, why it matters, how it can be reproduced, and what should be fixed.

Finally, EVMBench Detect does not measure exploit execution. It asks whether the vulnerability was identified in an audit-style report. That is valuable, but it is different from proving exploitability by performing transactions against a deployed or forked environment. EVMBench has a separate Exploit mode for that kind of evaluation. We have not yet run Azimuth through that benchmark mode.

What This Means for Smart Contract Security

Azimuth's 75.2% recall is a meaningful result. It shows that structured AI-assisted analysis can find a large fraction of known high-severity vulnerabilities across real repositories under a standardized detection benchmark.

It does not mean automated tools replace auditors. It does not mean every reported issue is automatically exploitable. It does not mean a protocol is safe if Azimuth does not find a vulnerability. Security does not work that way, especially in adversarial systems that hold real economic value.

The more realistic conclusion is also the more interesting one: AI security tools are most useful when engineered as systems, not treated as chatbots with audit prompts. LLMs are powerful, but their power has to be organized. They need a bounded context, structured decomposition, execution feedback, validation, and a way to separate plausible explanations from actual exploit paths.

For developers, this means automated adversarial testing can become part of the development loop rather than a final pre-deployment ritual. Running structured analysis continuously will not catch everything, but it can catch meaningful issues earlier, when they are cheaper and less painful to fix.

For auditors, the best use of systems like Azimuth is not to replace judgment, but to shift attention. Automated tools can help cover broad surfaces, identify candidate paths, and produce evidence for known failure classes. Human auditors can then spend more time on the novel, economic, and protocol-specific risks that remain difficult to automate.

For protocols, the long-term value is continuity. Smart contract security has historically depended on point-in-time reviews. Yet protocols evolve, integrations change, dependencies shift, and new attack paths appear as the surrounding ecosystem changes. A system that can repeatedly decompose, analyze, and validate protocol behavior gives teams a more living view of their risk surface.

The important point is not that LLMs alone can audit smart contracts. It is almost the opposite. LLMs become more useful when embedded in systems that understand how to break complex protocols into logical parts, reason about their interactions, and test security claims in the most deterministic environment available: the execution semantics of the EVM itself.

Try Azimuth: app.testmachine.ai