Daejun Park

Proof, not prose: Turning vulnerability reports into viable proofs of concept

Scaling proof-of-concept generation for smart contract vulnerabilities: a minimal multi‑agent workflow for defensive automation

October 15, 2025

We explore whether large‑language‑model (LLM) coding agents can generate proof‑of‑concept (PoC) tests for smart‑contract vulnerabilities in a way that is both practical and rigorous enough for security workflows. We present two complementary approaches. The first uses a carefully designed single prompt with a human reviewer in the loop; it is trivially easy to adopt but does not scale. The second replaces the human reviewer with an automated evaluator, yielding a minimal multi‑agent loop that can triage many findings.

We prototyped and evaluated both approaches, focusing on fund‑stealing vulnerabilities because their impact signal is direct and universal. Across 25 critical findings from real, public audit reports published in 2024–2025, our prototype produced a valid PoC, one that succeeds on the buggy commit and fails on the patched commit, for all 12 findings it classified as fund‑stealing. We believe the same machinery can automate core pieces of the security workflow (impact assessment and triage at scale, patch validation, regression gates in CI, and downstream validation of both human and AI‑generated reports), turning prose findings into executable evidence that integrates cleanly with existing pipelines.

Audit and bug‑bounty reports often describe a vulnerability’s root cause yet stop short of a runnable PoC. Teams then face a familiar dilemma: The issue looks serious, but without an executable demonstration, it competes with other work and can languish. That gap has real consequences. In several well‑publicized incidents, the exploited vulnerability had been flagged earlier, but remediation stalled because teams underestimated impact.

A credible PoC changes the conversation: It makes impact measurable, lets teams validate patches and prevent regressions, and supplies an objective gate for both human and AI‑driven findings in security pipelines.

LLM coding agents are attractive for this job. These tools (e.g., Claude Code, Codex CLI, Cursor CLI) already read requirements, navigate repositories, and produce code that meets test conditions. For PoC generation, we are not seeking elegant or secure production code — we are seeking to demonstrate exploitability under a realistic threat model. That narrower goal aligns with what today’s models are good at. Results from security challenges such as AIxCC have also hinted that LLMs can shoulder nontrivial chunks of security engineering. Our question was simple: How far can we get with a disciplined prompt and the right guardrails?

Method 1: A single prompt with a human in the loop

We started with the simplest thing that could possibly work: one carefully engineered prompt to a coding agent, a generated test that sets up contracts and initial state, and a human reviewer who spots invalid assumptions and asks for refinements.

This minimalist approach is surprisingly effective. The agent can read a vulnerability description, outline an exploit scenario, deploy and initialize contracts, fund accounts appropriately, and translate the scenario into runnable code. It often fills in missing but reasonable details — approval steps, initialization parameters, even Merkle proofs.

However, the same pattern of failures recurs. Some outputs are plausible but incomplete: They do not compile or they miss the success condition. Others cheat, for example by quietly granting the attacker a privileged role or minting tokens out of thin air. Sometimes they assume cooperation from victims, such as voluntary approvals. In some cases, we observed reward hacking in the assertion logic itself: tests that pass by defining success as attacker profit ≥ 0 instead of > 0, or by relaxing fund-loss checks.
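The assertion-level reward hacking described above can be stated precisely. The following is a minimal sketch in Python, used here only for illustration (the PoCs themselves are contract tests). The names `BalanceSnapshot`, `is_weak_success`, and `is_strict_success` are hypothetical, and the extra requirement that the victim must also lose funds is an assumption, not the exact check used in our prompts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BalanceSnapshot:
    """Token balances (in base units) captured before and after the exploit."""
    attacker_before: int
    attacker_after: int
    victim_before: int
    victim_after: int


def is_weak_success(s: BalanceSnapshot) -> bool:
    # Reward-hacked predicate: passes even when nothing was stolen.
    return s.attacker_after - s.attacker_before >= 0


def is_strict_success(s: BalanceSnapshot) -> bool:
    # Strict predicate (assumed form): attacker gains AND victim loses.
    attacker_profit = s.attacker_after - s.attacker_before
    victim_loss = s.victim_before - s.victim_after
    return attacker_profit > 0 and victim_loss > 0


# A no-op "exploit": balances are unchanged.
noop = BalanceSnapshot(100, 100, 500, 500)
# A real theft: 50 units move from the victim to the attacker.
theft = BalanceSnapshot(100, 150, 500, 450)
```

With these inputs, the weak predicate accepts the no-op run, while the strict predicate accepts only the theft. That gap is exactly what a reviewer has to catch.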

A human reviewer resolves most of this quickly. People are good at spotting hidden assumptions, asking for stricter success criteria, and nudging the agent back onto a realistic threat model. In practice, you converge to a valid PoC in a few iterations. The catch is scale: If you have dozens or hundreds of findings, the human bottleneck dominates.

Takeaway. The single‑prompt pattern is an excellent onramp. It requires no special infrastructure and helps teams move from prose to executable evidence. But its reliance on human judgment limits throughput.

Method 2: A fully automated, minimal multi‑agent loop

To scale, we replaced the human reviewer with an automated evaluator while keeping the system deliberately small. The loop comprises a generator that proposes a candidate PoC test, a test runner that compiles and executes it, an automatic fixer that patches trivial compile/runtime errors, and a final evaluator that accepts or rejects the result with explicit feedback. The evaluator’s feedback steers the next generator attempt.

Two design choices matter. First, we give the evaluator fresh context on each review to reduce anchoring bias, while the generator keeps accumulated context across refinement rounds to preserve learning. Second, we place a tight loop between the test runner and a lightweight test fixer so that trivial breakages (imports, small API mismatches, etc.) do not force a full regeneration. Only semantically invalid PoCs — cheating, unrealistic assumptions, weak success predicates — trigger a new generation, guided by the evaluator’s reasons for rejection.
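The control flow of the loop can be sketched as follows. This is an illustrative skeleton under stated assumptions, not our orchestration script: the four agents are passed in as plain callables, the names `Loop`, `RunResult`, and `Verdict` are hypothetical, the round and fix limits are arbitrary, and regenerating after the fixer gives up is one possible policy among several.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RunResult:
    ok: bool   # True if the test compiled and passed
    log: str   # runner output handed to the fixer and evaluator


@dataclass
class Verdict:
    accepted: bool
    reasons: str  # explicit feedback for the next generation attempt


@dataclass
class Loop:
    generate: Callable[[str, List[str]], str]     # (finding, feedback so far) -> PoC source
    run: Callable[[str], RunResult]               # PoC source -> run result
    fix: Callable[[str, str], str]                # (PoC source, log) -> patched PoC source
    evaluate: Callable[[str, str, str], Verdict]  # (finding, PoC, log) -> verdict
    max_rounds: int = 3  # assumed limit
    max_fixes: int = 3   # assumed limit

    def solve(self, finding: str) -> Optional[str]:
        feedback: List[str] = []
        for _ in range(self.max_rounds):
            # The generator sees every earlier rejection reason (accumulated context).
            poc = self.generate(finding, list(feedback))
            result = self.run(poc)
            # Tight inner loop: repair trivial breakages without regenerating.
            for _ in range(self.max_fixes):
                if result.ok:
                    break
                poc = self.fix(poc, result.log)
                result = self.run(poc)
            if not result.ok:
                # Assumed policy: if the fixer gives up, regenerate with the log as feedback.
                feedback.append("test did not pass: " + result.log)
                continue
            # The evaluator receives only the current artifacts (fresh context, no history).
            verdict = self.evaluate(finding, poc, result.log)
            if verdict.accepted:
                return poc
            feedback.append(verdict.reasons)
        return None
```

Note the asymmetry in the signatures: `generate` receives the feedback history, while `evaluate` does not, which mirrors the fresh-context choice described above.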

As a companion to this loop, we added an impact checker upstream. It filters out vulnerability scenarios that lack sufficient impact and acts as a prescreening gate, so only meaningful candidates enter the loop. In practice, this reduces unnecessary iterations, saving time and token cost.
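As a rough illustration of the prescreening gate, the sketch below splits findings into those that enter the loop and those that are rejected up front. The `Impact` categories and the `prescreen` helper are hypothetical names, and `classify` stands in for the model call that makes the actual judgment.

```python
from enum import Enum
from typing import Callable, Iterable, List, Tuple


class Impact(Enum):
    FUND_STEALING = "fund-stealing"
    FUND_LOCKED = "fund-locked"
    DENIAL_OF_SERVICE = "denial-of-service"
    GRIEFING = "griefing"


def prescreen(
    findings: Iterable[str],
    classify: Callable[[str], Impact],
) -> Tuple[List[str], List[str]]:
    """Return (accepted, rejected); only accepted findings reach the generator."""
    accepted: List[str] = []
    rejected: List[str] = []
    for finding in findings:
        if classify(finding) is Impact.FUND_STEALING:
            accepted.append(finding)
        else:
            rejected.append(finding)
    return accepted, rejected
```

Because rejected findings never reach the generator, the expensive part of the pipeline runs only on candidates with a clear profit-based success condition.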

This system doesn’t aspire to be a sprawling multi‑agent ecosystem. It is the minimum that reliably automates what a human reviewer does in Method 1: enforce a realistic threat model and insist on strict, reproducible success criteria.

Implementation notes

We prototyped the agents (except the test runner) using Cursor CLI in headless mode, pairing a strong reasoning model as the evaluator (GPT‑5 by default) with a coding model as the generator (Sonnet 4.5 by default). A small Python script deterministically orchestrates calls. The test runner is simply a forge test process, wired to emit structured logs for the fixer and evaluator. We found that agents perform best when focusing on one PoC test at a time; horizontal scale comes from running multiple generator–evaluator pairs independently.
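A test runner along these lines can be a thin wrapper around a subprocess. The sketch below is an assumption-laden illustration rather than our script: `RunLog`, `forge_command`, `run_command`, and `run_many` are hypothetical names, the timeout and worker count are arbitrary, and treating exit code 0 as "all selected tests passed" is an assumption about the runner's behavior. It uses the `--match-path` and `--json` options of `forge test`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RunLog:
    command: List[str]
    passed: bool  # assumption: exit code 0 means every selected test passed
    stdout: str
    stderr: str


def forge_command(test_path: str) -> List[str]:
    # Restrict the run to one test file and request machine-readable output.
    return ["forge", "test", "--match-path", test_path, "--json"]


def run_command(command: Sequence[str], cwd: str = ".", timeout_s: int = 600) -> RunLog:
    # Capture both streams so the fixer and evaluator can read them later.
    proc = subprocess.run(
        list(command), cwd=cwd, capture_output=True, text=True, timeout=timeout_s
    )
    return RunLog(list(command), proc.returncode == 0, proc.stdout, proc.stderr)


def run_many(commands: Sequence[Sequence[str]], max_workers: int = 4) -> List[RunLog]:
    # Independent jobs run side by side; results come back in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_command, commands))
```

For example, `run_command(forge_command("test/PoC.t.sol"))` would run a single PoC file, where the path is a placeholder. Each job touches only its own test, so the same pattern extends to running independent generator–evaluator pairs in parallel.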

Experiments: focusing on fund‑stealing exploits

We constrained the initial study to fund‑stealing vulnerabilities, a subset of issues that directly enable an attacker to extract financial gain from a protocol. The choice is pragmatic. These issues have a universal definition of impact (attacker profit strictly greater than zero) and rank as critical in most severity taxonomies. While we have not yet validated it beyond this scope, we believe the approach is adaptable to other vulnerability classes, such as locked funds or protocol‑specific correctness bugs, by swapping in specialized success predicates and domain rubrics.

For our dataset, we first collected 138 public audit reports published during 2024–2025. Of these, 36 contained at least one critical finding, the severity level that encompasses fund-stealing exploits. We then excluded reports that were non‑EVM or whose code was not open source and reproducible, leaving 14 reports with 25 critical findings in total. These 25 findings formed our experimental set.

On each retained finding, we evaluated whether our system could (1) correctly identify the issue as fund-stealing and (2) automatically generate a PoC that reproduces the exploit. Specifically, we asked the system to generate a PoC against the buggy commit and then reran it against the patched commit. A PoC was considered valid only if it succeeded on the buggy commit and failed on the patched commit.
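The acceptance rule above is a differential check across two commits. A minimal sketch, assuming a hypothetical `poc_passes_at` callable that checks out a commit, runs the PoC test there, and reports whether it passed:

```python
from typing import Callable


def differential_verdict(passes_on_buggy: bool, passes_on_patched: bool) -> str:
    """Classify a PoC by its behavior on the buggy and patched commits."""
    if passes_on_buggy and not passes_on_patched:
        return "valid"
    if not passes_on_buggy:
        # The exploit does not reproduce against the vulnerable code.
        return "does-not-reproduce"
    # Passing on both commits: the PoC is not tied to the reported bug.
    return "not-specific-to-bug"


def check_poc(
    poc_passes_at: Callable[[str], bool],
    buggy_commit: str,
    patched_commit: str,
) -> str:
    return differential_verdict(
        poc_passes_at(buggy_commit), poc_passes_at(patched_commit)
    )
```

The third outcome matters as much as the first: a test that still passes after the patch is either exercising something other than the reported bug or relying on an unrealistic setup.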

Results

Across the 25 findings, the system generated valid PoCs for 12 fund-stealing vulnerabilities. Five succeeded on the first attempt, and seven required one additional feedback round. All 12 PoCs passed on the buggy commits and failed on the patched ones, as expected, which supports the evaluator-driven generation process on this dataset.

The remaining thirteen findings were filtered out by the impact checker as not meeting our fund‑stealing criteria. Upon manual review, we confirmed that these were indeed non-fund-stealing cases — mostly fund-locked, denial-of-service, or griefing vulnerabilities that may lead to user fund loss without enabling attacker profit.

Performance and cost

For cases rejected by the impact checker, the average wall‑clock time was 38 seconds with 25K GPT‑5 tokens. For single‑attempt successes, averages were 8 minutes, 80K GPT‑5 tokens, and 525K Sonnet‑4.5 tokens. For two‑iteration successes, averages were 11 minutes, 92K GPT‑5 tokens, and 1.99M Sonnet‑4.5 tokens.

Overall, the PoC generator dominated both runtime and token usage. Each generation attempt consumed an average of 913K Sonnet‑4.5 tokens (17K base input, 63K cache writes, 822K cache reads, and 10K output), for an API cost of $0.68.
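For a back-of-the-envelope reading of these figures, the sketch below rechecks the token breakdown and derives a blended generator rate. The extrapolation rests on an assumption that may not hold, namely that every run has the same mix of cached and uncached tokens as the average attempt. The estimates it produces are illustrative and are not measured results.

```python
# Figures reported in the text (token counts in thousands).
breakdown_k = {"base_input": 17, "cache_writes": 63, "cache_reads": 822, "output": 10}
tokens_per_attempt_k = 913
cost_per_attempt_usd = 0.68

# The components sum to 912K; the 1K gap to 913K is presumably rounding.
total_k = sum(breakdown_k.values())

# Cache reads make up roughly 90% of generator tokens.
cache_read_share = breakdown_k["cache_reads"] / total_k

# Assumption: a single blended price per token applies to every run.
usd_per_million_tokens = cost_per_attempt_usd / tokens_per_attempt_k * 1000


def estimated_generator_cost_usd(sonnet_tokens_k: float) -> float:
    """Rough generator cost for a run that used `sonnet_tokens_k` thousand tokens."""
    return sonnet_tokens_k / 1000 * usd_per_million_tokens


single_attempt = estimated_generator_cost_usd(525)   # average single-attempt success
two_iterations = estimated_generator_cost_usd(1990)  # average two-iteration success
```

Under that assumption the blended rate comes to roughly $0.74 per million tokens, which is low mainly because cache reads dominate the token count.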

Limitations

There are clear limits. Our prototype focuses on fund‑stealing. Other classes, such as fund-locked and protocol‑specific correctness bugs, will require new evaluators and domain‑specific predicates. The dataset we used is real but modest; broader validation is needed before these methods can be considered production-ready.

A single prompt and a thoughtful reviewer can already turn prose vulnerability reports into executable evidence. This is powerful because it enables patch validation and reduces false positives across both human and AI audits. But when the volume of findings grows, the reviewer becomes the bottleneck. A minimal multi‑agent loop that automates evaluation bridges that gap without adding unnecessary complexity. In our experiments on fund‑stealing exploits, the approach consistently generated valid PoCs. The decisive factor was evaluator quality; once that was solid, the rest of the system performed reliably.

Beyond the experiments, the same machinery has immediate practical utility. It can provide objective, automated triage for bug‑bounty programs, speed up patch validation in production pipelines, and serve as a downstream layer for other finding tools, from simple static analyzers to advanced AI auditors, by automatically converting their findings into runnable PoC tests. For the last of these, we have already begun running the same experiments with AI-generated vulnerabilities in place of the audit findings to simulate integration with AI auditors, and the initial results are promising.

The actionable path for teams is straightforward: Start with the single‑prompt method (our prompt design is a reasonable baseline in local, mock‑contract setups) and, once it proves useful, graduate to the evaluator‑driven loop to remove the reviewer bottleneck and scale.

Looking ahead

We see several plausible next steps. One direction is broader coverage: adapting the evaluator’s rubric and success predicates to handle other types of vulnerabilities, extending the applicability beyond fund‑stealing.

Another is cost reduction: improving the first‑attempt success rate through tighter prompting or targeted fine‑tuning to mitigate reward hacking. This also involves exploring smaller code‑generation models — particularly for the generator, which dominates token usage — to better balance per‑iteration quality against the number of iterations required.

A further direction is tighter integration with AI auditors and AI patch generators to enable a more end‑to‑end workflow — continuous vulnerability discovery, automated PoC synthesis, and patch generation/verification. This approach could shift smart‑contract security from periodic reviews to continuous, systematic engineering.

Responsible use and disclosure. This work concerns defensive automation for smart‑contract security. All experiments target patched or consented code and run in isolated, local test environments. The prototype is structured as a local‑only, mock‑contract testing harness for patch validation and impact triage. The system is designed solely for defensive evaluation and is not intended for deployment or use on live systems. Readers should apply these methods only in controlled settings and follow established community norms for responsible disclosure and ethical evaluation, obtaining explicit consent before conducting any tests on unpatched or production systems.