The research is in: LLMs cannot secure smart contracts. Here's why execution-driven reinforcement learning is the only reliable approach.

Overview

When we first published this post in late 2024, our argument that large language models are fundamentally insufficient for smart contract security was based on first principles and early observations. Since then, a wave of peer-reviewed research from teams at Georgia Tech, the University of Sydney, and elsewhere has confirmed this thesis with hard data. The numbers are worse than even we expected.

Crypto losses from hacks and scams reached $1.7 billion in just the first four months of 2025—already surpassing the $1.49 billion total for all of 2024. Immunefi CEO Mitchell Amador called it "the worst quarter for hacks in history." According to Halborn's Top 100 DeFi Hacks Report, attackers are expanding their focus to gaming protocols and Layer 2 chains, with off-chain vulnerabilities accounting for a growing share of losses. The attack surface is expanding, and the tools the industry relies on are not keeping up.

This updated post incorporates the latest research to make a clear, evidence-backed case: LLMs are useful assistants, but they are not security tools. TestMachine's reinforcement learning approach—which discovers vulnerabilities by actually executing exploits in a simulated environment—remains the only method that delivers zero false positives and catches the vulnerabilities that static analysis misses.

The Rise of LLMs in Security Tooling

Large language models have become remarkably capable. Models like GPT-4, Claude, and their successors can generate code, explain complex systems, and even reason about software architecture with surprising fluency. Naturally, the security community has explored whether these models can be applied to smart contract auditing—a domain where human expertise is scarce, expensive, and slow.

The appeal is obvious. LLMs have been trained on vast corpora of code and text, including Solidity documentation, audit reports, and known vulnerability patterns. They can process an entire contract in seconds and produce plausible-sounding vulnerability assessments. Several startups and open-source projects now offer LLM-powered security scanning, and many developers use ChatGPT or similar tools as a first pass before engaging human auditors.

We do not dismiss these advances. LLMs are genuinely useful for code comprehension, documentation, and even generating initial test scaffolding. TestMachine itself uses LLMs downstream—once our RL engine discovers a real exploit, we leverage language models to explain the attack path to developers and generate unit tests. This is an appropriate use of the technology: synthesizing and communicating information, not making security judgments.

The problem arises when LLMs are treated as security oracles—trusted to determine whether a contract is safe or vulnerable. The research now conclusively shows that they cannot do this reliably.

What the Research Actually Shows

A series of studies published in 2025 has systematically evaluated LLM performance on smart contract vulnerability detection. The results paint a consistent picture of a technology that is fundamentally mismatched with the demands of security analysis.

Precision is catastrophically low. Wang et al. (2025), published in ACM Transactions on Software Engineering and Methodology, found that GPT-4 achieves only 22.6% precision when pinpointing smart contract vulnerabilities. GPT-3.5 scored 19.7%, and GPT-4o scored 20.2%. While recall can reach 88.2% for GPT-4, the low precision means roughly 4 out of every 5 issues flagged by the model are false positives. A tool that cries wolf 80% of the time is not a security tool—it is a noise generator that wastes developer time and erodes trust in automated analysis.
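To make these numbers concrete, here is a small arithmetic sketch. The confusion-matrix counts below are hypothetical, chosen only to match the reported 22.6% precision and 88.2% recall figures:

```python
# Hypothetical worked example of what 22.6% precision means in practice.
# The counts below are illustrative, not drawn from the paper.

def precision(true_pos, false_pos):
    """Fraction of flagged issues that are real vulnerabilities."""
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    """Fraction of real vulnerabilities that get flagged."""
    return true_pos / (true_pos + false_neg)

# A scan that flags 100 issues at roughly GPT-4's reported precision:
flagged = 100
real_findings = round(flagged * 0.226)   # ~23 genuine bugs
false_alarms = flagged - real_findings   # ~77 wasted triage efforts

print(f"{false_alarms} of {flagged} flagged issues are false positives")
```

High recall with low precision means the real bugs are buried in the triage queue: every genuine finding costs roughly three false alarms' worth of developer attention to reach.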

Modern contracts break LLMs even further. A separate study (Wang et al., January 2025) examined how LLMs perform on contracts written in Solidity v0.8 and later, which include built-in overflow protections and other safety features. It found that even well-designed prompts only reduce false-positive rates by approximately 60%, and recall for detecting vulnerabilities in these modern contracts dropped to just 13% compared to older versions. The study revealed that LLMs rely heavily on recognizing patterns from established libraries rather than understanding the actual logic of a contract. When the code deviates from familiar patterns—as novel contracts inevitably do—the models fail silently.

No single LLM is consistently reliable. Yuan et al. (November 2025) at Georgia Tech developed LLMBugScanner, which combines multiple LLMs in an ensemble approach to improve detection. Their key finding was that individual pre-trained LLMs exhibit inconsistent predictions across vulnerability types—no single model consistently outperforms others. Even their ensemble approach, which combines the outputs of multiple models, only achieves 60% top-5 detection accuracy on real CVE-labeled contracts. If the best-case scenario requires running multiple LLMs and still only catches 60% of known vulnerabilities, the technology is not ready for production security.

LLMs require supervision, not autonomy. Ferraro et al. (July 2025), published in Expert Systems with Applications, conducted a broad evaluation of LLMs in smart contract development and concluded plainly: "LLMs require security auditing and are not yet suitable for unsupervised deployment." This is not a fringe opinion—it is the consensus of the research community.

Why LLMs Fail at Security

The research findings are not surprising when you understand how LLMs work. These models are trained to predict the next token in a sequence based on statistical patterns in their training data. They do not execute code. They do not track state changes. They do not reason about the consequences of a sequence of transactions interacting with a live contract.

Smart contract vulnerabilities are fundamentally about behavior, not syntax. A contract can look perfectly reasonable in its source code while harboring a devastating exploit that only manifests through a specific sequence of state transitions. Reentrancy attacks, flash loan exploits, governance manipulation, and oracle manipulation all depend on dynamic interactions between contracts and external state—exactly the kind of reasoning that LLMs cannot perform.

LLMs pattern-match against known vulnerability templates. When they encounter code that resembles a previously documented exploit, they flag it—often correctly for the well-known cases, but also frequently for benign code that happens to share syntactic similarity. And when they encounter a novel vulnerability that doesn't match any known template, they miss it entirely. This is why recall drops to 13% on modern Solidity contracts: the patterns have changed, and the models haven't adapted.

Independent Validation: The A1 System

Perhaps the most compelling validation of TestMachine's approach comes from independent academic research. Zhou (July 2025) at the University of Sydney developed A1, an agentic system for smart contract exploit generation. The paper's conclusion is direct: "naïvely prompted LLMs generate unverified vulnerability speculations, leading to high false positive rates."

The A1 system achieved a 62.96% success rate specifically by adopting an execution-driven validation approach—it eliminates false positives by only reporting vulnerabilities that produce actual profitable exploits. This is precisely the methodology TestMachine has used since its inception: don't speculate about what might be vulnerable, prove it by executing the exploit.
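The principle is simple enough to sketch in a few lines. The function names below are hypothetical—this is neither A1's nor TestMachine's actual code—and a stub simulator stands in for a forked-chain execution environment:

```python
# Sketch of execution-gated reporting (hypothetical names): a finding is
# reported only if replaying its transaction sequence on a simulated
# contract yields a profit for the attacker.

def validate_findings(candidates, simulate):
    """Keep only candidates whose exploit actually turns a profit.

    candidates: list of (label, tx_sequence) pairs
    simulate:   function(tx_sequence) -> attacker profit after replay
    """
    confirmed = []
    for label, txs in candidates:
        profit = simulate(txs)
        if profit > 0:                  # proof, not speculation
            confirmed.append((label, profit))
    return confirmed

# A stub simulator standing in for a real forked-chain environment:
def toy_simulate(txs):
    balance = 0
    for tx in txs:
        balance += tx.get("delta", 0)   # net value flow to attacker
    return balance

candidates = [
    ("speculative-reentrancy", [{"delta": 0}]),          # unverified guess
    ("proven-drain", [{"delta": -10}, {"delta": 110}]),  # pays 10, takes 110
]
print(validate_findings(candidates, toy_simulate))
# Only the exploit that executes profitably survives the gate.
```

The false-positive rate of anything downstream of this filter is zero by construction: speculation that does not execute never reaches the report.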

The A1 paper independently validates the core insight behind TestMachine: execution feedback is essential, not just code analysis. The difference is that TestMachine has been operationalizing this principle at production scale for years, continuously monitoring over a million tokens with reinforcement learning agents that learn and adapt from every interaction.

TestMachine Is Different

TestMachine uses reinforcement learning to actively discover vulnerabilities by executing exploit attempts against smart contracts in a high-fidelity simulated environment. This is not code review. This is adversarial testing—the same approach a real attacker would use, but conducted safely and systematically before deployment.

The RL agent interacts with contract logic by constructing sequences of transactions, observing state changes, and receiving reward signals when exploits succeed. Over many iterations, the agent learns to discover attack paths that drain tokens, violate access controls, manipulate governance, or exploit economic logic. These are not hypothetical vulnerabilities flagged by pattern matching—they are proven exploits with concrete transaction sequences.
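As a cartoon of that loop—illustrative only, with reward-guided random sampling standing in for the learned policy and a two-line toy standing in for a contract—note that a bug requiring a specific two-step sequence is found through execution feedback, not by reading the code:

```python
# Minimal sketch of execution-driven exploit search (illustrative only;
# random sampling stands in for TestMachine's learned policy).

import random

class ToyContract:
    """Simulated contract with a two-step bug: 'skim' leaks one unit
    of value, but only when called immediately after 'unlock'."""
    def __init__(self):
        self.unlocked = False

    def step(self, action):
        if action == "unlock":
            self.unlocked = True
            return 0                    # no value extracted yet
        reward = 1 if (action == "skim" and self.unlocked) else 0
        self.unlocked = False
        return reward                   # reward = value extracted

ACTIONS = ["deposit", "transfer", "unlock", "skim"]

def search_for_exploit(episodes=500, horizon=2, seed=0):
    """Sample transaction sequences; keep the most profitable trace."""
    rng = random.Random(seed)
    best_seq, best_reward = [], 0
    for _ in range(episodes):
        env, seq, reward = ToyContract(), [], 0
        for _ in range(horizon):
            action = rng.choice(ACTIONS)
            seq.append(action)
            reward += env.step(action)  # observe the state change
        if reward > best_reward:
            best_seq, best_reward = seq, reward
    return best_seq, best_reward

exploit, profit = search_for_exploit()
```

The structural point is what carries over to the real system: reward arrives only when an exploit actually executes, so every reported attack path comes with a concrete, replayable transaction sequence.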

This execution-driven approach has several fundamental advantages over LLM-based analysis:

  • Zero false positives by construction. TestMachine either executes a successful exploit or it doesn't. There is no ambiguity, no speculation, and no noise. Every finding is a proven vulnerability with a reproducible attack path. Compare this to GPT-4's 22.6% precision, where 4 out of 5 flagged issues are false alarms.
  • State-aware reasoning. The RL agent tracks the full state of the contract across a sequence of transactions. It can discover vulnerabilities that only emerge through multi-step interactions—flash loan attacks, reentrancy chains, cross-contract exploits—that are invisible to static code analysis.
  • Adaptation to novel code. Unlike LLMs, whose performance degrades sharply on modern Solidity contracts that deviate from training data patterns, TestMachine's agents learn from the contract's actual behavior. They don't need to have seen a similar vulnerability before—they discover new ones through exploration.
  • Continuous monitoring. Smart contracts evolve. Proxies get re-pointed, permissions change, and new interactions emerge. TestMachine continuously re-evaluates contracts as they change, catching behavioral shifts that a one-time LLM audit would miss entirely.
  • Quantified security metrics. TestMachine produces measurable attack success rates that allow developers to track risk reduction across contract versions. This turns security from a binary "audited/not audited" judgment into a continuous, quantifiable process.
  • LLMs where they belong. TestMachine uses LLMs downstream for what they're good at: once the RL engine discovers a real exploit, language models generate clear explanations of the attack path, suggest fixes, and produce unit tests that encode the vulnerability. This is LLMs as communication tools, not security oracles.
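The metrics point above is simple enough to show with invented numbers: a success rate per contract version yields a trend line rather than a pass/fail stamp.

```python
# Hypothetical attack-success-rate tracking across contract versions.
# All numbers are invented for illustration.

def success_rate(successes, attempts):
    """Fraction of simulated attack attempts that extracted value."""
    return successes / attempts

history = [
    ("v1.0", success_rate(42, 1000)),   # before fixes
    ("v1.1", success_rate(7, 1000)),    # after the first patch round
    ("v1.2", success_rate(0, 1000)),    # no successful exploits found
]
for version, rate in history:
    print(f"{version}: {rate:.1%} of attack attempts succeeded")
```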

The research is now unambiguous. LLMs achieve 22.6% precision at best, drop to 13% recall on modern contracts, and even multi-model ensembles only reach 60% detection on known vulnerabilities. Meanwhile, execution-driven approaches like TestMachine and the independently developed A1 system eliminate false positives entirely by grounding every finding in actual exploit execution. As the scale of crypto losses continues to grow—$1.7 billion in early 2025 alone—the industry cannot afford to rely on tools that generate more noise than signal. Smart contract security demands proof, not prediction.