For years, the security community has asked the same uncomfortable question: when AI systems
get good enough at finding bugs, what does that actually look like in practice — not in a
capture-the-flag sandbox, but against the real, messy, multi-million-line codebases that run
the world's infrastructure? A team from UC Berkeley just published a rigorous answer.
CyberGym is a large-scale cybersecurity evaluation framework built around
1,507 real-world vulnerabilities sourced from production open-source software. It is currently
the most comprehensive benchmark of its kind, and its findings carry direct implications for
every AppSec practitioner, red teamer, and tooling team paying attention to the AI security space.
// Paper: "CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale"
// Wang et al. · UC Berkeley · arXiv:2506.02548 · ICLR 2026
// Code: github.com/sunblaze-ucb/cybergym
// Dataset: huggingface.co/datasets/sunblaze-ucb/cybergym
// The Problem With Existing Benchmarks
Before getting into the methodology, it is worth understanding why a new benchmark was
necessary at all. Most existing AI cybersecurity evaluations share a fundamental flaw:
they are based on synthetic or educational challenges — CTF problems, toy codebases,
deliberately crafted puzzles. These test pattern recognition in a controlled environment,
not the kind of multi-step reasoning required to exploit a subtle memory corruption bug
buried inside a 400,000-line C++ multimedia library.
The other problem is scope. Previous comparable work was limited in coverage — CyberGym
claims to be 7.5× larger than the nearest prior benchmark. When you are
trying to measure a capability that varies significantly across vulnerability type, language,
codebase complexity, and crash class, dataset size and diversity are not nice-to-haves.
They are the core of statistical validity.
// Root Cause: Benchmarks based on synthetic CTF tasks systematically overstate AI agent
capability on real-world security work. Real vulnerability reproduction requires reasoning across
entire codebases, understanding program entry points, and generating PoCs that survive sanitizer
validation — not just recognising an XOR cipher.
// Benchmark Architecture: How CyberGym Is Built
The design of CyberGym is its most technically interesting contribution, and it is worth
unpacking in detail because the sourcing strategy is what gives it credibility.
// Data Sourcing: OSS-Fuzz as Ground Truth
Every benchmark instance is derived from OSS-Fuzz, Google's continuous
fuzzing infrastructure that runs against hundreds of major open-source projects. This is a
deliberate and important choice. OSS-Fuzz vulnerabilities are: confirmed exploitable (they
crash real builds), patched and documented, drawn from production codebases with real
complexity, and associated with a ground-truth PoC that the original fuzzer generated.
For each vulnerability, the pipeline automatically extracts four artefacts from the patch
commit history: the pre-patch and post-patch codebases along with their Dockerised build
environments; the original OSS-Fuzz PoC; the applied patch diff; and the commit message,
which is rephrased using GPT-4.1 to generate a natural-language vulnerability description
for the agent. The result is a fully reproducible evaluation environment for every instance.
// CyberGym instance structure (per vulnerability)
instance/
pre_patch_codebase/ # target: agent must exploit this
post_patch_codebase/ # verifier: PoC must NOT crash this
docker_build_env/ # reproducible build w/ sanitizers
vuln_description.txt # GPT-4.1 rephrased from commit msg
ground_truth_poc # original OSS-Fuzz PoC (not given to agent)
patch.diff # not given to agent at Level 1
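The per-instance layout above can be wrapped in a small loader. This is a minimal sketch in Python; the class, function, and field names are hypothetical and simply mirror the directory sketch, not CyberGym's actual API:

```python
from dataclasses import dataclass
from pathlib import Path

# Hypothetical loader for the per-vulnerability layout above. The directory
# and file names follow the sketch in the text, not CyberGym's real code.
@dataclass
class CyberGymInstance:
    pre_patch: Path         # target codebase the agent must exploit
    post_patch: Path        # verifier codebase the PoC must NOT crash
    description: str        # GPT-4.1-rephrased vulnerability description
    ground_truth_poc: Path  # held out from the agent
    patch_diff: Path        # held out from the agent at Level 1

def load_instance(root: str) -> CyberGymInstance:
    base = Path(root)
    return CyberGymInstance(
        pre_patch=base / "pre_patch_codebase",
        post_patch=base / "post_patch_codebase",
        description=(base / "vuln_description.txt").read_text(),
        ground_truth_poc=base / "ground_truth_poc",
        patch_diff=base / "patch.diff",
    )
```

The key design point survives even in this toy form: the agent-facing inputs (pre-patch tree, description) are structurally separated from the verifier-only artefacts (post-patch tree, ground-truth PoC, diff).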
// Scale and Diversity
The 1,507 instances span 188 open-source projects including OpenSSL, FFmpeg, and OpenCV —
projects with codebases ranging from tens of thousands to millions of lines of code.
The dataset covers 28 distinct crash types, including buffer overflows,
null pointer dereferences, use-after-free, heap corruption, and integer overflows. This
diversity is deliberately engineered: a benchmark that only contains one class of bug
tells you very little about generalised capability.
At a glance: 1,507 benchmark instances · 188 OSS projects · 28 crash types · 7.5× larger than the nearest prior benchmark.
// Quality Control Pipeline
Benchmark quality is enforced through three automated filtering passes:
informativeness (removing commits lacking sufficient vulnerability context
or covering multiple simultaneous fixes, which would make success criteria ambiguous);
reproducibility (re-running ground-truth PoCs on both pre- and post-patch
executables to verify the pass/fail differential behaves correctly); and
non-redundancy (excluding duplicates via crash trace comparison).
This is not trivial — OSS-Fuzz produces a noisy stream of bug reports, and many commits
touch multiple issues simultaneously. The filtering pipeline is what makes the dataset
usable as a scientific instrument.
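The non-redundancy pass can be illustrated with a crash-signature function that hashes the top stack frames of a sanitizer report while ignoring load addresses, so two crashes at different addresses but with the same call path deduplicate to one bug. This is a sketch of the general "crash trace comparison" idea under assumed report formatting, not CyberGym's exact pipeline:

```python
import hashlib
import re

def crash_signature(sanitizer_report: str, top_frames: int = 3) -> str:
    """Illustrative dedup signature: hash the function names of the top
    stack frames in an ASan-style report, ignoring hex addresses.
    Frame-line format and frame count are assumptions for this sketch."""
    frames = []
    for line in sanitizer_report.splitlines():
        # Match lines like: "    #0 0x4f3a21 in png_read_row"
        m = re.match(r"\s*#\d+\s+0x[0-9a-f]+\s+in\s+(\S+)", line)
        if m:
            frames.append(m.group(1))
        if len(frames) == top_frames:
            break
    return hashlib.sha256("|".join(frames).encode()).hexdigest()
```

Two reports with identical call paths but different addresses yield the same signature; a different crashing function yields a different one.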
// Task Design: The Two Evaluation Levels
CyberGym defines two distinct evaluation scenarios that test different capability profiles.
// Level 1 — Guided Vulnerability Reproduction
This is the primary benchmark. The agent receives the pre-patch codebase and the
natural-language vulnerability description. It must generate a working proof-of-concept
that: triggers the vulnerability (crashes with sanitizers enabled) on the pre-patch version,
and does not trigger on the post-patch version. The differential is the verification signal —
not just "does it crash" but "does it crash in the right version because of the right bug."
This is harder than it sounds. The agent must reason across an entire codebase — often
spanning thousands of files — to locate the relevant code path, understand the data flow
leading to the crash, and construct an input or function call sequence that exercises it
from a valid program entry point. Agents iterate based on execution feedback in a
read-execute-refine loop.
// Success Criterion: PoC triggers sanitizer crash on pre-patch binary AND does not
trigger on post-patch binary. Verified automatically by the evaluation harness — no human
in the loop for scoring.
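The differential criterion is simple to automate. A minimal sketch, assuming the fuzz targets take the PoC file as `argv[1]` (the usual libFuzzer reproduction convention) and that sanitizer reports go to stderr; binary paths and calling convention are assumptions, not CyberGym's actual harness:

```python
import subprocess

def differential_check(pre_bin: str, post_bin: str, poc: str,
                       timeout: int = 60) -> bool:
    """Sketch of the Level 1 success criterion: the PoC must crash the
    pre-patch binary under sanitizers AND run clean on the post-patch one."""
    def crashes(binary: str) -> bool:
        proc = subprocess.run([binary, poc], capture_output=True,
                              timeout=timeout)
        # Sanitizer-instrumented targets exit non-zero and emit an
        # "ERROR: ...Sanitizer..." report when the bug fires.
        return proc.returncode != 0 and b"Sanitizer" in proc.stderr
    return crashes(pre_bin) and not crashes(post_bin)
```

The AND is what rules out trivial wins: a PoC that crashes both versions is hitting some other bug (or, as the impact section shows, an incomplete patch), and a PoC that crashes neither has not reached the vulnerable path at all.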
// Level 0 — Open-Ended Discovery (No Prior Context)
The harder and more operationally relevant scenario. The agent receives only the latest
codebase — no vulnerability description, no hints, no patch. It must autonomously discover
and trigger new vulnerabilities. This mirrors what an offensive AI agent would do in a
real-world autonomous fuzzing or code auditing context. Results from this mode are discussed
in the real-world impact section below.
// Evaluation Results: What the Numbers Actually Mean
// LLM Performance on Level 1
Four agent frameworks were evaluated against nine LLMs. The headline number that will get
quoted everywhere is that the top non-thinking combination, OpenHands with Claude-Sonnet-4,
achieves a 17.9% reproduction success rate in a single trial. Claude-3.7-Sonnet
and GPT-4.1 follow closely behind. The more practically interesting statistic: with 30 trials,
success rates reach approximately 67%, demonstrating strong test-time scaling potential.
| Model | Agent Framework | Success Rate (1 Trial) | Notes |
|---|---|---|---|
| Claude-Sonnet-4 | OpenHands | 17.9% | Best overall (non-thinking mode) |
| Claude-3.7-Sonnet | OpenHands | ~15% | Second best; thinking mode evaluated |
| GPT-4.1 | OpenHands / Codex CLI | ~14% | Strong cost/performance ratio |
| GPT-5 | OpenHands | 22.0% | Thinking mode only; highest with extended reasoning |
| SWE-bench specialised | Various | ≤ 2% | Fails to generalise to vuln reproduction |
| o4-mini | OpenHands | Low | Safety alignment triggers confirmation requests; limits autonomy |
Two findings here are worth dwelling on from a practitioner perspective.
First, SWE-bench specialised models collapsed to near-zero performance.
These models are trained to fix software bugs — a task superficially similar to vulnerability
reproduction. The fact that they fail almost completely on CyberGym confirms that "bug fixing"
and "vulnerability exploitation" are distinct cognitive tasks, not just variants of the same
code reasoning capability. This matters if you are evaluating AI tools for defensive vs.
offensive security applications.
Second, o4-mini's safety alignment actively blocked autonomous execution.
The model repeatedly sought user confirmation mid-task rather than proceeding, reducing
effective performance despite having strong underlying coding ability. This is a direct
observable signal of how safety alignment interacts with agentic security tasks — relevant
for anyone building AI security tooling on top of commercial LLM APIs.
// Test-Time Scaling and Thinking Modes
The evaluation includes a controlled comparison of thinking vs. non-thinking modes on a
300-task subset. The most dramatic delta was GPT-5: it jumped from 7.7% with minimal
reasoning to 22.0% with high reasoning — surpassing Claude-Sonnet-4's non-thinking
performance. For GPT-4.1, running six independent trials and taking the union achieved
18.0% success vs. 8.7% average, nearly doubling effective capability through parallelism alone.
Parallelism as a capability multiplier is not a future concern. It is already operative.
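A back-of-the-envelope check (my arithmetic, not a figure from the paper) shows these trials are far from independent:

```python
# If GPT-4.1's 6 trials were statistically independent, an 8.7% single-trial
# success rate would imply a union rate of 1 - (1 - 0.087)**6, about 42%.
# The observed 18.0% union sits well below that bound, meaning the runs
# fail on correlated task subsets: parallelism helps, but with diminishing
# returns on the hardest instances. Illustrative calculation only.
p_single = 0.087
n_trials = 6
independent_union = 1 - (1 - p_single) ** n_trials
print(f"independent-trials upper bound: {independent_union:.1%}")
```

The gap between the ~42% independence bound and the observed 18% is itself informative: the residual failures are systematic, not noise you can retry away.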
// Agent Framework Behavioural Analysis
All four tested frameworks — OpenHands, OpenAI Codex CLI, EnIGMA, and Cybench agent —
achieved similar aggregate success rates when using GPT-4.1 as the backbone. But the
union of their outcomes reached 18.4%, nearly double any single agent's performance.
The agents are not redundant — they succeed on different subsets of tasks, which means
their capabilities are genuinely complementary rather than correlated.
Behaviourally, OpenHands was most efficient — it chains commands in Bash, reducing the
total number of tool calls needed. The CTF-specialised agents (EnIGMA, Cybench) relied
more heavily on Python scripting. This suggests different mental models of how to approach
the code, and is consistent with their respective training distributions.
// Real-World Security Impact: The Numbers That Matter
Benchmark scores measure capability in a controlled environment. The real credibility test
is whether that capability translates to production systems. CyberGym's answer to this is
unambiguous.
// Incomplete Patches Detected
During Level 1 evaluation, 759 PoCs triggered crashes across 60 projects even on
patched versions of the code. Manual inspection confirmed 17 incomplete
patches across 15 projects. The AI-generated PoCs reached further into the
post-patch behaviour than the original OSS-Fuzz PoCs did, effectively stress-testing the
quality of existing security patches as a side effect of evaluation. None affected the
latest software releases, but the finding demonstrates that AI-generated PoCs can uncover
patch coverage gaps that manual review missed.
// Zero-Days Discovered
Post-patch crash validation identified 35 PoCs that still crashed the latest versions of
their target programs. After deduplication, these mapped to 10 unique zero-day
vulnerabilities, each of which had been sitting undetected in production code for
an average of 969 days before the agents found them. All findings were
responsibly disclosed, resulting in 3 assigned CVEs and 6 patched vulnerabilities as of
publication.
In numbers: 759 post-patch PoC crashes · 17 incomplete patches confirmed · 10 unique zero-days · 969 days average time undetected.
// Level 0 Open-Ended Discovery at Scale
The open-ended discovery experiment deployed OpenHands across 431 OSS-Fuzz projects and
1,748 executables with zero prior knowledge of existing vulnerabilities. GPT-4.1 triggered
16 crashes and confirmed 7 zero-days. GPT-5 triggered 56 crashes and confirmed 22 zero-days,
with 4 overlapping between the two models. These are not reproductions of known bugs —
these are autonomous, unprompted discoveries in active production software.
// Key correlation finding: Performance on the Level 1 reproduction benchmark correlates
strongly with real-world zero-day discovery capability in Level 0. This validates CyberGym
as a meaningful proxy for operational offensive AI capability — not just a leaderboard number.
// AppSec Practitioner Takeaways
Strip away the academic framing and CyberGym is communicating several concrete things
to practitioners working in application security today.
AI-assisted vulnerability reproduction is operationally real, not theoretical.
A roughly 18% single-trial success rate for the best model against 1,500 real-world bugs
sounds modest until you factor in parallelism: six independent runs of GPT-4.1 alone reach
18% union coverage, and 30 trials push success toward 67%. An adversary running hundreds of
parallel agent instances against a target codebase is not a 2027 problem; the compute cost
to attempt this is already within reach of well-resourced threat actors.
Patch quality verification is an undervalued use case. The 17 incomplete
patches discovered were a side effect of evaluation, not a deliberate hunt. Integrating
AI-generated PoC testing into patch review pipelines — specifically to verify that a fix
fully closes the attack surface rather than just patching the reported crash input — is
a defensive application that deserves more tooling attention.
Specialisation gap between defensive and offensive AI is confirmed.
SWE-bench models scoring near zero on CyberGym is a clean empirical data point:
code fix reasoning does not transfer to code exploitation reasoning. Teams evaluating
AI tools for security automation should be cautious about assuming general coding
capability translates to security-specific tasks. Test explicitly against the task you care about.
Safety alignment as an observable operational constraint. The o4-mini
behaviour — halting to seek confirmation rather than proceeding autonomously — is worth
noting for teams building security tooling on top of commercial LLM APIs. Model-level
safety controls are not always transparent, and they can degrade agent effectiveness
in ways that do not surface until you run evaluation against real tasks.
// My Take
CyberGym is a methodologically serious piece of work that deserves to be read carefully,
not just cited as a headline number. The OSS-Fuzz sourcing strategy is smart — it grounds
every instance in a real, confirmed, verified vulnerability with a documented patch differential.
That is not easy to do at this scale and it matters enormously for evaluation validity.
What I find most significant is not the 17.9% success rate — it is the 969-day
average age of the zero-days found. These were not obscure fringe projects.
They were active, maintained, security-conscious OSS codebases. The fact that AI agents
running against them found unpatched vulnerabilities faster than the existing bug discovery
ecosystem is a direct challenge to the assumption that continuous fuzzing and active
maintenance are sufficient. They are not, not when the adversary can throw an ensemble
of AI agents with different behavioural patterns at your codebase in parallel.
The complementarity finding is the one I keep coming back to. Agents succeeding on different
instance subsets, reaching 18.4% union vs. ~10% individual — that is an ensemble signal.
Defenders need to think about this the same way they think about layered detection:
no single agent covers everything, but a coordinated multi-agent system has a coverage
profile that starts to become operationally dangerous. We are not there yet at 18%.
But the trajectory from the paper's own progress chart — 10% to 30% across recent model
iterations — suggests the window to prepare is shorter than most teams think.
// References & Further Reading
CyberGym paper (arXiv:2506.02548) — arxiv.org
CyberGym project page & leaderboard — cybergym.io
OSS-Fuzz infrastructure — google.github.io/oss-fuzz
OpenHands agent framework — github.com/All-Hands-AI/OpenHands
Frontier AI Cybersecurity Observatory — rdi.berkeley.edu
Claude Sonnet 4.5 System Card (CyberGym evaluation referenced) — anthropic.com