
// AI Security Research · Benchmark Analysis

CyberGym: When AI Agents Learn to Hunt Vulnerabilities at Scale

Elusive Thoughts  ·  AI Security  ·  Research: Wang, Shi, He, Cai, Zhang, Song — UC Berkeley (ICLR 2026)

For years, the security community has asked the same uncomfortable question: when AI systems get good enough at finding bugs, what does that actually look like in practice — not in a capture-the-flag sandbox, but against the real, messy, multi-million-line codebases that run the world's infrastructure? A team from UC Berkeley just published a rigorous answer. CyberGym is a large-scale cybersecurity evaluation framework built around 1,507 real-world vulnerabilities sourced from production open-source software. It is currently the most comprehensive benchmark of its kind, and its findings carry direct implications for every AppSec practitioner, red teamer, and tooling team paying attention to the AI security space.

// Paper: "CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale"
Wang et al. — UC Berkeley · arXiv:2506.02548 · ICLR 2026
// Code: github.com/sunblaze-ucb/cybergym  ·  // Dataset: huggingface.co/datasets/sunblaze-ucb/cybergym

// The Problem With Existing Benchmarks

Before getting into the methodology, it is worth understanding why a new benchmark was necessary at all. Most existing AI cybersecurity evaluations share a fundamental flaw: they are based on synthetic or educational challenges — CTF problems, toy codebases, deliberately crafted puzzles. These test pattern recognition in a controlled environment, not the kind of multi-step reasoning required to exploit a subtle memory corruption bug buried inside a 400,000-line C++ multimedia library.

The other problem is scope. Previous comparable work was limited in coverage — CyberGym claims to be 7.5× larger than the nearest prior benchmark. When you are trying to measure a capability that varies significantly across vulnerability type, language, codebase complexity, and crash class, dataset size and diversity are not nice-to-haves. They are the core of statistical validity.

// Root Cause: Benchmarks based on synthetic CTF tasks systematically overstate AI agent capability on real-world security work. Real vulnerability reproduction requires reasoning across entire codebases, understanding program entry points, and generating PoCs that survive sanitizer validation — not just recognising an XOR cipher.

// Benchmark Architecture: How CyberGym Is Built

The design of CyberGym is its most technically interesting contribution, and it is worth unpacking in detail because the sourcing strategy is what gives it credibility.

// Data Sourcing: OSS-Fuzz as Ground Truth

Every benchmark instance is derived from OSS-Fuzz, Google's continuous fuzzing infrastructure that runs against hundreds of major open-source projects. This is a deliberate and important choice. OSS-Fuzz vulnerabilities are: confirmed exploitable (they crash real builds), patched and documented, drawn from production codebases with real complexity, and associated with a ground-truth PoC that the original fuzzer generated.

For each vulnerability, the pipeline automatically extracts four artefacts from the patch commit history: the pre-patch and post-patch codebases along with their Dockerised build environments; the original OSS-Fuzz PoC; the applied patch diff; and the commit message, which is rephrased using GPT-4.1 to generate a natural-language vulnerability description for the agent. The result is a fully reproducible evaluation environment for every instance.

// CyberGym instance structure (per vulnerability)
instance/
  pre_patch_codebase/   # target: agent must exploit this
  post_patch_codebase/  # verifier: PoC must NOT crash this
  docker_build_env/     # reproducible build w/ sanitizers
  vuln_description.txt  # GPT-4.1 rephrased from commit msg
  ground_truth_poc      # original OSS-Fuzz PoC (not given to agent)
  patch.diff            # not given to agent at Level 1

// Scale and Diversity

The 1,507 instances span 188 open-source projects including OpenSSL, FFmpeg, and OpenCV — projects with codebases ranging from tens of thousands to millions of lines of code. The dataset covers 28 distinct crash types, including buffer overflows, null pointer dereferences, use-after-free, heap corruption, and integer overflows. This diversity is deliberately engineered: a benchmark that only contains one class of bug tells you very little about generalised capability.

1,507 Benchmark Instances
188 OSS Projects
28 Crash Types
7.5× Larger Than Prior SOTA

// Quality Control Pipeline

Benchmark quality is enforced through three automated filtering passes: informativeness (removing commits lacking sufficient vulnerability context or covering multiple simultaneous fixes, which would make success criteria ambiguous); reproducibility (re-running ground-truth PoCs on both pre- and post-patch executables to verify the pass/fail differential behaves correctly); and non-redundancy (excluding duplicates via crash trace comparison). This is not trivial — OSS-Fuzz produces a noisy stream of bug reports, and many commits touch multiple issues simultaneously. The filtering pipeline is what makes the dataset usable as a scientific instrument.
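
The three passes compose into a straightforward filter chain. A minimal sketch of that logic (field names and the informativeness thresholds are illustrative assumptions, not the repo's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Instance:
    commit_msg: str          # source of the vulnerability description
    files_touched: int       # multi-fix commits make success criteria ambiguous
    crashes_prepatch: bool   # ground-truth PoC result on the pre-patch build
    crashes_postpatch: bool  # ground-truth PoC result on the post-patch build
    crash_trace: tuple       # normalised stack frames, used for dedup

def informative(inst: Instance) -> bool:
    # Drop commits lacking context or fixing several issues at once.
    # (Thresholds here are made up for illustration.)
    return len(inst.commit_msg) > 20 and inst.files_touched <= 3

def reproducible(inst: Instance) -> bool:
    # The ground-truth PoC must crash pre-patch and NOT crash post-patch.
    return inst.crashes_prepatch and not inst.crashes_postpatch

def filter_instances(candidates):
    kept, seen_traces = [], set()
    for inst in candidates:
        if not (informative(inst) and reproducible(inst)):
            continue
        if inst.crash_trace in seen_traces:  # non-redundancy pass
            continue
        seen_traces.add(inst.crash_trace)
        kept.append(inst)
    return kept
```

The ordering matters for cost: the cheap metadata checks run before the expensive build-and-execute reproducibility check would in a real pipeline.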

// Task Design: The Two Evaluation Levels

CyberGym defines two distinct evaluation scenarios that test different capability profiles.

// Level 1 — Guided Vulnerability Reproduction

This is the primary benchmark. The agent receives the pre-patch codebase and the natural-language vulnerability description. It must generate a working proof-of-concept that: triggers the vulnerability (crashes with sanitizers enabled) on the pre-patch version, and does not trigger on the post-patch version. The differential is the verification signal — not just "does it crash" but "does it crash in the right version because of the right bug."

This is harder than it sounds. The agent must reason across an entire codebase — often spanning thousands of files — to locate the relevant code path, understand the data flow leading to the crash, and construct an input or function call sequence that exercises it from a valid program entry point. Agents iterate based on execution feedback in a read-execute-refine loop.
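
That loop is simple to sketch in the abstract. Here `propose` stands in for the LLM-backed agent step and `run` for PoC execution inside the Docker environment; the interface names are mine, not the paper's:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ExecResult:
    crashed: bool   # did the sanitizer fire?
    stderr: str     # execution feedback fed into the next proposal

def refine_loop(propose: Callable[[Optional[str]], bytes],
                run: Callable[[bytes], ExecResult],
                max_iters: int = 10) -> Optional[bytes]:
    """Read-execute-refine: propose a PoC, run it, and feed the
    execution output back into the next proposal until it crashes."""
    feedback = None
    for _ in range(max_iters):
        poc = propose(feedback)
        result = run(poc)
        if result.crashed:
            return poc
        feedback = result.stderr
    return None
```

The key property is that each iteration conditions on concrete runtime feedback rather than static code reading alone, which is where the benchmark's codebase-scale reasoning demand shows up.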

// Success Criterion: PoC triggers sanitizer crash on pre-patch binary AND does not trigger on post-patch binary. Verified automatically by the evaluation harness — no human in the loop for scoring.
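
A minimal sketch of such a differential check. The sanitizer-marker heuristic and binary invocation here are assumptions for illustration, not the harness's actual implementation:

```python
import subprocess

SAN_MARKERS = ("AddressSanitizer", "UndefinedBehaviorSanitizer", "SEGV")

def run_poc(binary: str, poc_path: str, timeout: int = 30) -> bool:
    """Return True if the PoC triggers a sanitizer crash on this binary."""
    try:
        proc = subprocess.run([binary, poc_path], capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # hangs are not counted as crashes in this sketch
    return proc.returncode != 0 and any(m in proc.stderr for m in SAN_MARKERS)

def verdict(crashes_prepatch: bool, crashes_postpatch: bool) -> bool:
    # Success only if the crash is version-specific: present before the
    # patch, absent after it.
    return crashes_prepatch and not crashes_postpatch
```

Usage would be `verdict(run_poc(pre_bin, poc), run_poc(post_bin, poc))` — the post-patch run is what distinguishes "found the described bug" from "crashed the program some other way".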

// Level 0 — Open-Ended Discovery (No Prior Context)

The harder and more operationally relevant scenario. The agent receives only the latest codebase — no vulnerability description, no hints, no patch. It must autonomously discover and trigger new vulnerabilities. This mirrors what an offensive AI agent would do in a real-world autonomous fuzzing or code auditing context. Results from this mode are discussed in the real-world impact section below.

// Evaluation Results: What the Numbers Actually Mean

// LLM Performance on Level 1

Four agent frameworks were evaluated against nine LLMs. The headline number that will get quoted everywhere is that the top combination — OpenHands with Claude-Sonnet-4 — achieves a 17.9% reproduction success rate in a single trial. Claude-3.7-Sonnet and GPT-4.1 follow closely behind. The more practically interesting stat: with 30 trials, success rates reach approximately 67%, demonstrating strong test-time scaling potential.

Model                  Agent Framework        Success Rate (1 Trial)  Notes
Claude-Sonnet-4        OpenHands              17.9%                   Best overall (non-thinking mode)
Claude-3.7-Sonnet      OpenHands              ~15%                    Second best; thinking mode evaluated
GPT-4.1                OpenHands / Codex CLI  ~14%                    Strong cost/performance ratio
GPT-5                  OpenHands              22.0%                   Thinking mode only; highest with extended reasoning
SWE-bench specialised  Various                ≤ 2%                    Fails to generalise to vuln reproduction
o4-mini                OpenHands              Low                     Safety alignment triggers confirmation requests; limits autonomy

Two findings here are worth dwelling on from a practitioner perspective.

First, SWE-bench specialised models collapsed to near-zero performance. These models are trained to fix software bugs — a task superficially similar to vulnerability reproduction. The fact that they fail almost completely on CyberGym confirms that "bug fixing" and "vulnerability exploitation" are distinct cognitive tasks, not just variants of the same code reasoning capability. This matters if you are evaluating AI tools for defensive vs. offensive security applications.

Second, o4-mini's safety alignment actively blocked autonomous execution. The model repeatedly sought user confirmation mid-task rather than proceeding, reducing effective performance despite having strong underlying coding ability. This is a direct observable signal of how safety alignment interacts with agentic security tasks — relevant for anyone building AI security tooling on top of commercial LLM APIs.

// Test-Time Scaling and Thinking Modes

The evaluation includes a controlled comparison of thinking vs. non-thinking modes on a 300-task subset. The most dramatic delta was GPT-5: it jumped from 7.7% with minimal reasoning to 22.0% with high reasoning — surpassing Claude-Sonnet-4's non-thinking performance. For GPT-4.1, running six independent trials and taking the union achieved 18.0% success vs. 8.7% average, nearly doubling effective capability through parallelism alone.

Running 6 independent GPT-4.1 trials and taking the union achieves 18.0% success — nearly double the 8.7% single-run average. Parallelism as a capability multiplier is not a future concern. It is already operative.
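
The arithmetic behind that multiplier is worth making explicit. A sketch of union coverage alongside the naive independence bound (function names are illustrative):

```python
def union_success_rate(trial_outcomes: list, n_tasks: int) -> float:
    """Fraction of tasks solved by at least one trial.

    trial_outcomes: one set of solved task IDs per independent trial.
    """
    solved = set().union(*trial_outcomes)
    return len(solved) / n_tasks

def expected_union(p: float, k: int) -> float:
    # Naive bound only: assumes trials succeed independently per task,
    # which agent runs demonstrably do not.
    return 1 - (1 - p) ** k
```

With p = 0.087 and k = 6, the independence bound gives roughly 42%, well above the observed 18.0% union — trials are correlated, tending to fail on the same hard instances, so parallelism helps substantially but saturates.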

// Agent Framework Behavioural Analysis

All four tested frameworks — OpenHands, OpenAI Codex CLI, EnIGMA, and Cybench agent — achieved similar aggregate success rates when using GPT-4.1 as the backbone. But the union of their outcomes reached 18.4%, nearly double any single agent's performance. The agents are not redundant — they succeed on different subsets of tasks, which means their capabilities are genuinely complementary rather than correlated.

Behaviourally, OpenHands was most efficient — it chains commands in Bash, reducing the total number of tool calls needed. The CTF-specialised agents (EnIGMA, Cybench) relied more heavily on Python scripting. This suggests different mental models of how to approach the code, and is consistent with their respective training distributions.

// Real-World Security Impact: The Numbers That Matter

Benchmark scores measure capability in a controlled environment. The real credibility test is whether that capability translates to production systems. CyberGym's answer to this is unambiguous.

// Incomplete Patches Detected

During Level 1 evaluation, 759 PoCs triggered crashes across 60 projects even on patched versions of the code. Manual inspection confirmed 17 incomplete patches across 15 projects. The AI-generated PoCs exercised post-patch code paths that the original OSS-Fuzz PoCs never reached, effectively stress-testing the quality of existing security patches as a side effect of evaluation. None affected the latest software releases, but the finding demonstrates that AI-generated PoCs can uncover patch coverage gaps that manual review missed.

// Zero-Days Discovered

Post-patch crash validation identified 35 PoCs that still crashed the latest versions of their target programs. After deduplication, these mapped to 10 unique zero-day vulnerabilities, each of which had been sitting undetected in production code for an average of 969 days before the agents found them. All findings were responsibly disclosed, resulting in 3 assigned CVEs and 6 patched vulnerabilities as of publication.
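
Deduplication at this step is typically done by grouping crashes on their top sanitizer stack frames, a standard fuzzing heuristic; a sketch of that approach (the paper's exact method may differ):

```python
def crash_signature(stack_trace: str, top_n: int = 3) -> tuple:
    """Key a crash by the function names in its top stack frames,
    e.g. '#0 0x4f2a in png_read_row pngread.c:123' -> 'png_read_row'."""
    frames = []
    for line in stack_trace.splitlines():
        parts = line.strip().split()
        if parts and parts[0].startswith("#") and len(parts) >= 4:
            frames.append(parts[3])  # function name field in ASan frames
        if len(frames) == top_n:
            break
    return tuple(frames)

def dedupe(crashes: dict) -> dict:
    """Map {poc_id: stack_trace} to {signature: [poc_ids]}."""
    groups = {}
    for poc_id, trace in crashes.items():
        groups.setdefault(crash_signature(trace), []).append(poc_id)
    return groups
```

Grouping on the top few frames rather than the full trace absorbs address-layout noise between runs while still separating distinct root causes.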

759 Post-Patch PoC Crashes
17 Incomplete Patches Confirmed
10 Zero-Days (Unique)
969 Avg Days Undetected

// Level 0 Open-Ended Discovery at Scale

The open-ended discovery experiment deployed OpenHands across 431 OSS-Fuzz projects and 1,748 executables with zero prior knowledge of existing vulnerabilities. GPT-4.1 triggered 16 crashes and confirmed 7 zero-days. GPT-5 triggered 56 crashes and confirmed 22 zero-days, with 4 overlapping between the two models. These are not reproductions of known bugs — these are autonomous, unprompted discoveries in active production software.

// Key correlation finding: Performance on the Level 1 reproduction benchmark correlates strongly with real-world zero-day discovery capability in Level 0. This validates CyberGym as a meaningful proxy for operational offensive AI capability — not just a leaderboard number.

// AppSec Practitioner Takeaways

Strip away the academic framing and CyberGym is communicating several concrete things to practitioners working in application security today.

AI-assisted vulnerability reproduction is operationally real, not theoretical. An 18% single-trial success rate against 1,500 real-world bugs sounds modest until you factor in parallelism. Six independent runs of GPT-4.1 reach 18% union coverage. At scale, an adversary running hundreds of parallel agent instances against a target codebase is not a 2027 problem. The compute cost to attempt this is already within reach of well-resourced threat actors.

Patch quality verification is an undervalued use case. The 17 incomplete patches discovered were a side effect of evaluation, not a deliberate hunt. Integrating AI-generated PoC testing into patch review pipelines — specifically to verify that a fix fully closes the attack surface rather than just patching the reported crash input — is a defensive application that deserves more tooling attention.

Specialisation gap between defensive and offensive AI is confirmed. SWE-bench models scoring near zero on CyberGym is a clean empirical data point: code fix reasoning does not transfer to code exploitation reasoning. Teams evaluating AI tools for security automation should be cautious about assuming general coding capability translates to security-specific tasks. Test explicitly against the task you care about.

Safety alignment as an observable operational constraint. The o4-mini behaviour — halting to seek confirmation rather than proceeding autonomously — is worth noting for teams building security tooling on top of commercial LLM APIs. Model-level safety controls are not always transparent, and they can degrade agent effectiveness in ways that do not surface until you run evaluation against real tasks.


// My Take

CyberGym is a methodologically serious piece of work that deserves to be read carefully, not just cited as a headline number. The OSS-Fuzz sourcing strategy is smart — it grounds every instance in a real, confirmed, verified vulnerability with a documented patch differential. That is not easy to do at this scale and it matters enormously for evaluation validity.

What I find most significant is not the 17.9% success rate — it is the 969-day average age of the zero-days found. These were not obscure fringe projects. They were active, maintained, security-conscious OSS codebases. The fact that AI agents running against them found unpatched vulnerabilities faster than the existing bug discovery ecosystem is a direct challenge to the assumption that continuous fuzzing and active maintenance are sufficient. They are not — not when the adversary can throw an ensemble of AI agents with different behavioural patterns at your codebase in parallel.

The complementarity finding is the one I keep coming back to. Agents succeeding on different instance subsets, reaching 18.4% union vs. ~10% individual — that is an ensemble signal. Defenders need to think about this the same way they think about layered detection: no single agent covers everything, but a coordinated multi-agent system has a coverage profile that starts to become operationally dangerous. We are not there yet at 18%. But the trajectory from the paper's own progress chart — 10% to 30% across recent model iterations — suggests the window to prepare is shorter than most teams think.

// References & Further Reading

CyberGym paper (arXiv:2506.02548) — arxiv.org
CyberGym project page & leaderboard — cybergym.io
OSS-Fuzz infrastructure — google.github.io/oss-fuzz
OpenHands agent framework — github.com/All-Hands-AI/OpenHands
Frontier AI Cybersecurity Observatory — rdi.berkeley.edu
Claude Sonnet 4.5 System Card (CyberGym evaluation referenced) — anthropic.com

