Bug Bounty Programs and AI-Generated Vulnerabilities: The Fog of Slop Problem

Why Do Web3 Protocols Keep Dismissing Multimillion-Dollar Vulnerabilities?

It’s becoming increasingly clear that responsible disclosure in Web3 security is failing. Just in the past three months, ZetaChain, THORChain, and KelpDAO have lost millions. In every case the danger was disclosed responsibly and known before funds were stolen, and in every case the vulnerability was filed away until it blew up.

ZetaChain dismissed a bug bounty report as intended behavior, then watched that exact behavior drain $334K from its gateway.
THORChain received a critical finding and quietly patched it, but on a branch that had not shipped by the time of a ~$10.7M loss.
KelpDAO's $292M loss came through a single-verifier bridge setup via LayerZero that was already known to be dangerous and had been flagged earlier.

So why do responsibly reported issues keep turning into multimillion-dollar exploits?

Why Bug Bounty Programs Are Shutting Down

AI is wreaking havoc on bug bounty programs. "Slop" submissions (plausible-looking, AI-generated vulnerability reports with nothing real behind them) are overwhelming bounty platforms and protocol engineers alike. In Web3, THORChain and Code4rena have both shut their programs down. In the Web2 world, curl has also closed its bounty program, citing slop. And yet the danger of exploits is more real and present than ever.

Security teams have finite engineering resources, and the influx of reports is effectively limitless. Today's models churn out authoritative-sounding "critical" vulnerabilities instantly, yet most lack merit. When a finding lands in the nuanced zone, where it takes real judgment and time to tell whether it's real, senior developers have to burn valuable time debunking or validating it, time taken from building the protocol itself.

Why Dismissal is the Rational Option

Historically, surfacing a "critical" vulnerability required significant technical mastery and time. That friction served as a quality signal: when generating reports was costly, confidence usually implied validity, making deep investigation a profitable trade. LLMs have decoupled that relationship. They have reduced the marginal cost of producing a convincing "critical" finding to nearly zero, while the cost of verification remains high. In this environment, quick dismissal becomes a rational response: it is the only way a team can get through an otherwise insurmountable pile of reports.

The problem is that a finding worth dismissing and one that will bankrupt you look identical on intake: both show up as a confident "critical." And the dangerous one is the more expensive one to disprove, because real vulnerabilities are rarely a single obvious flaw. They're often cases where smaller issues compound into a dangerous composition. Dismissal is the reactive measure of a bounty program stuck in the pre-LLM era. And it's exactly how you get drained.
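The squeeze described above can be sketched with a little arithmetic. The numbers below are illustrative assumptions, not data from any bounty program: a team with a fixed weekly review budget, a fixed cost per full verification, and reports that cannot be told apart at intake.

```python
def review_coverage(reports_per_week: int, hours_per_review: float,
                    reviewer_hours: float) -> float:
    """Fraction of incoming reports the team can fully verify in a week."""
    if reports_per_week <= 0:
        return 1.0
    hours_needed = reports_per_week * hours_per_review
    return min(1.0, reviewer_hours / hours_needed)


def real_findings_dismissed(reports_per_week: int, valid_rate: float,
                            coverage: float) -> float:
    """Expected number of real findings dismissed without verification.

    Assumes reports look identical at intake, so the reviewed subset is
    effectively a random sample of everything submitted.
    """
    return reports_per_week * valid_rate * (1.0 - coverage)


# Hypothetical figures: 40 reviewer-hours a week, 4 hours per verification.
pre_llm = review_coverage(reports_per_week=10, hours_per_review=4.0,
                          reviewer_hours=40.0)
post_llm = review_coverage(reports_per_week=500, hours_per_review=4.0,
                           reviewer_hours=40.0)
missed = real_findings_dismissed(500, valid_rate=0.01, coverage=post_llm)

print(f"pre-LLM coverage:  {pre_llm:.0%}")
print(f"post-LLM coverage: {post_llm:.0%}")
print(f"real findings dismissed per week: {missed:.1f}")
```

Under these made-up inputs, the same team goes from verifying every report to verifying about one in fifty, and most of the real findings in the pile are dismissed unread. The exact figures matter less than the shape: verification cost is fixed while submission volume is not.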

Azimuth's Investigation, and How TestMachine is Closing the Gap on Verification

Case Study: ZetaChain's $334K Gateway Exploit

ZetaChain wrote an unusually candid post-mortem of its April 26 gateway exploit, so we pointed Azimuth at the gateway contract to see what it would reconstruct. The full run is here.

It found the destination-side half of the vector, and flagged the part it couldn't see. This is the path it reconstructed:

A cross-chain message reaches GatewayEVM.execute via the TSS signer, with the sender zeroed so the arbitrary-call branch runs. (In the live exploit this was triggered through an ungated call(): the off-chain component read the call event and then called execute, a step the run correctly marked as outside the contract it had.)
The destination is set to any ERC-20, and the data encodes approve(attacker, large_amount).
Inside _executeArbitraryCall, the selector filter passes because approve is neither onCall nor onRevert, and the call runs as GatewayEVM, granting the attacker an allowance from the gateway's own address.
Because execute has no approval-reset step, that allowance persists for future token arrivals.
The live exploit instead encoded transferFrom(victim, attacker, amount) against wallets that had already granted the gateway standing unlimited approvals. This is the same primitive, and the run also names it in its step about the missing reset.
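The path above can be sketched as a toy model. This is not ZetaChain's code: the Token class, the execute_arbitrary_call helper, and the account names are hypothetical stand-ins, and the filter matches on method names rather than real 4-byte selectors. It only illustrates why a deny-list that blocks onCall and onRevert still lets approve and transferFrom run with the gateway as the caller.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Minimal ERC-20-like ledger: balances and allowances only."""
    balances: dict = field(default_factory=dict)
    allowances: dict = field(default_factory=dict)  # (owner, spender) -> amount

    def approve(self, caller, spender, amount):
        self.allowances[(caller, spender)] = amount

    def transfer_from(self, caller, owner, to, amount):
        allowed = self.allowances.get((owner, caller), 0)
        if allowed < amount or self.balances.get(owner, 0) < amount:
            raise ValueError("insufficient allowance or balance")
        self.allowances[(owner, caller)] = allowed - amount
        self.balances[owner] -= amount
        self.balances[to] = self.balances.get(to, 0) + amount


GATEWAY = "gateway"                  # hypothetical account name
DENY_LIST = {"onCall", "onRevert"}   # deny-list: anything else is allowed


def execute_arbitrary_call(token, method, *args):
    """Toy arbitrary-call branch: runs `method` on `token` as the gateway."""
    if method in DENY_LIST:
        raise PermissionError(f"{method} is blocked")
    handlers = {"approve": token.approve, "transferFrom": token.transfer_from}
    handlers[method](GATEWAY, *args)  # the caller is the gateway itself


token = Token()

# Variant reconstructed by the run: the gateway approves the attacker,
# nothing resets the allowance, and tokens that arrive later can be pulled.
execute_arbitrary_call(token, "approve", "attacker", 10**18)
token.balances[GATEWAY] = 500
token.transfer_from("attacker", GATEWAY, "attacker", 500)

# Variant used live: a victim has a standing approval to the gateway,
# so the gateway is made to call transferFrom on the attacker's behalf.
token.balances["victim"] = 300
token.approve("victim", GATEWAY, 2**256 - 1)
execute_arbitrary_call(token, "transferFrom", "victim", "attacker", 300)

print(token.balances)
```

Both variants rely on the same primitive: the gateway makes an attacker-chosen token call under its own address.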

Set against ZetaChain's post-mortem, the destination-side defects line up: the deny-list that lets transferFrom and approve through, and the lingering approvals. Azimuth’s run reached both from the contract alone. The main piece it missed was reachability, which requires additional context about how the off-chain observer behaves when the gateway’s call() function is called.

While Azimuth identified the core logic enabling the drain, it failed to refine that logic into the actual attack vector, which is why it classified the vulnerability as a “low” severity finding.

Where Azimuth’s Gaps Remain: Levels of Verification

The investigation makes clear that a finding has to be proven at several levels: that the exploit is possible, and that it is worth caring about. Azimuth found the textbook code smell, that arbitrary calldata is risky, but it suggested a weak path to profit, choosing to encode an “approve” call rather than the more direct and more frightening “transferFrom.” It also correctly noted that the function was role-gated, but it didn’t probe who held that role: an automated system whose code was open source and whose messages could be manipulated. Finally, it didn’t look at the economic implications of the vulnerability, such as finding addresses that had standing unlimited pre-approvals.

Building Better Vulnerability Verification: Ensuring Criticals are Actually Criticals

We’re working on a rubric for hypothesized vulnerabilities that examines assumptions, trigger conditions, reachability, and attacker economics. These assessments will come from analyzing not just the logic but also the state of the deployed code, and from probing deeper into the connecting codepaths and data flows that live outside a protocol’s smart contracts.
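As a rough illustration of what such a rubric could look like in code, here is a minimal sketch. It is not TestMachine's implementation: the Hypothesis class, its field names, and the pass/fail checks are all hypothetical, chosen only to mirror the four dimensions named above.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """A hypothesized vulnerability plus the evidence attached to it."""
    title: str
    assumptions_stated: bool      # are the preconditions written down?
    trigger_demonstrated: bool    # is there a concrete call sequence?
    reachable: bool               # can an unprivileged attacker get there?
    attacker_profit_usd: float    # estimated profit; 0.0 if not assessed


def unmet_checks(h: Hypothesis) -> list:
    """Return the rubric dimensions this hypothesis has not yet satisfied."""
    checks = {
        "assumptions": h.assumptions_stated,
        "trigger conditions": h.trigger_demonstrated,
        "reachability": h.reachable,
        "attacker economics": h.attacker_profit_usd > 0,
    }
    return [name for name, ok in checks.items() if not ok]


# Illustrative reading of a contract-only analysis: the call path is
# described, but reachability and economics are left unproven.
finding = Hypothesis(
    title="Arbitrary call can run approve() as the gateway",
    assumptions_stated=True,
    trigger_demonstrated=True,
    reachable=False,
    attacker_profit_usd=0.0,
)

print(unmet_checks(finding))
```

Listing the unmet dimensions, rather than collapsing everything into one severity label, keeps the open questions visible to whoever reviews the finding next.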

We are also adding a feature for generating and evaluating live PoCs, using the states they collect to assess the feasibility and economics of exploitation against real onchain data.
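One simple form the economic side of that evaluation could take is an exposure estimate over wallets with standing approvals. The sketch below is an assumption about how such a check might be written, not a description of the feature itself: the function names and the holder records are hypothetical, and in practice the balances and allowances would come from onchain reads rather than hard-coded values.

```python
UNLIMITED = 2**256 - 1  # the conventional "infinite approval" value


def drainable_units(balance: int, allowance: int) -> int:
    """Token units a spender could pull from one holder right now."""
    return min(balance, allowance)


def total_exposure_usd(holders, price_usd: float, decimals: int) -> float:
    """Sum the drainable amount across holders and convert it to USD."""
    units = sum(drainable_units(h["balance"], h["allowance"]) for h in holders)
    return units / 10**decimals * price_usd


# Hypothetical snapshot for a 6-decimal token priced at $1.00.
holders = [
    {"address": "0xaaa", "balance": 1_000 * 10**6, "allowance": UNLIMITED},
    {"address": "0xbbb", "balance": 500 * 10**6, "allowance": 200 * 10**6},
    {"address": "0xccc", "balance": 0, "allowance": UNLIMITED},
]

exposure = total_exposure_usd(holders, price_usd=1.0, decimals=6)
print(f"exposure: ${exposure:,.2f}")
```

An unlimited approval on an empty wallet contributes nothing, and a capped approval limits the loss to the cap, so the estimate depends on live state rather than on the contract logic alone.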

Resolution won’t come from intensifying triage efforts or placing more faith in the next "critical" alert. The solution lies in redefining what a finding is, so that it arrives coupled with its own verifiable evidence and explicit assumptions. When findings are scored before they reach human eyes, reviewers can focus on whether a substantiated claim is reachable rather than debating whether the assertion itself is legitimate. For vulnerabilities residing within the contract logic, the compounding defects dismissed as intended behavior until they prove otherwise, this shift is the boundary between neglect and resolution. ZetaChain’s exploit was already in its queue. A more rigorous evidentiary standard is the only way to make such bugs impossible to overlook.

Try Azimuth: app.testmachine.ai