
// AI Security Research · Benchmark Analysis

CyberGym: When AI Agents Learn to Hunt Vulnerabilities at Scale

Elusive Thoughts  ·  AI Security  ·  Research: Wang, Shi, He, Cai, Zhang, Song — UC Berkeley (ICLR 2026)

For years, the security community has asked the same uncomfortable question: when AI systems get good enough at finding bugs, what does that actually look like in practice — not in a capture-the-flag sandbox, but against the real, messy, multi-million-line codebases that run the world's infrastructure? A team from UC Berkeley just published a rigorous answer. CyberGym is a large-scale cybersecurity evaluation framework built around 1,507 real-world vulnerabilities sourced from production open-source software. It is currently the most comprehensive benchmark of its kind, and its findings carry direct implications for every AppSec practitioner, red teamer, and tooling team paying attention to the AI security space.

// Paper: "CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale"
Wang et al. — UC Berkeley · arXiv:2506.02548 · ICLR 2026
// Code: github.com/sunblaze-ucb/cybergym  ·  // Dataset: huggingface.co/datasets/sunblaze-ucb/cybergym

// The Problem With Existing Benchmarks

Before getting into the methodology, it is worth understanding why a new benchmark was necessary at all. Most existing AI cybersecurity evaluations share a fundamental flaw: they are based on synthetic or educational challenges — CTF problems, toy codebases, deliberately crafted puzzles. These test pattern recognition in a controlled environment, not the kind of multi-step reasoning required to exploit a subtle memory corruption bug buried inside a 400,000-line C++ multimedia library.

The other problem is scope. Previous comparable work was limited in coverage — CyberGym claims to be 7.5× larger than the nearest prior benchmark. When you are trying to measure a capability that varies significantly across vulnerability type, language, codebase complexity, and crash class, dataset size and diversity are not nice-to-haves. They are the core of statistical validity.

// Root Cause: Benchmarks based on synthetic CTF tasks systematically overstate AI agent capability on real-world security work. Real vulnerability reproduction requires reasoning across entire codebases, understanding program entry points, and generating PoCs that survive sanitizer validation — not just recognising an XOR cipher.

// Benchmark Architecture: How CyberGym Is Built

The design of CyberGym is its most technically interesting contribution, and it is worth unpacking in detail because the sourcing strategy is what gives it credibility.

// Data Sourcing: OSS-Fuzz as Ground Truth

Every benchmark instance is derived from OSS-Fuzz, Google's continuous fuzzing infrastructure that runs against hundreds of major open-source projects. This is a deliberate and important choice. OSS-Fuzz vulnerabilities are: confirmed exploitable (they crash real builds), patched and documented, drawn from production codebases with real complexity, and associated with a ground-truth PoC that the original fuzzer generated.

For each vulnerability, the pipeline automatically extracts four artefacts from the patch commit history: the pre-patch and post-patch codebases along with their Dockerised build environments; the original OSS-Fuzz PoC; the applied patch diff; and the commit message, which is rephrased using GPT-4.1 to generate a natural-language vulnerability description for the agent. The result is a fully reproducible evaluation environment for every instance.

// CyberGym instance structure (per vulnerability)
instance/
  pre_patch_codebase/   # target: agent must exploit this
  post_patch_codebase/  # verifier: PoC must NOT crash this
  docker_build_env/     # reproducible build w/ sanitizers
  vuln_description.txt  # GPT-4.1 rephrased from commit msg
  ground_truth_poc      # original OSS-Fuzz PoC (not given to agent)
  patch.diff            # not given to agent at Level 1

// Scale and Diversity

The 1,507 instances span 188 open-source projects including OpenSSL, FFmpeg, and OpenCV — projects with codebases ranging from tens of thousands to millions of lines of code. The dataset covers 28 distinct crash types, including buffer overflows, null pointer dereferences, use-after-free, heap corruption, and integer overflows. This diversity is deliberately engineered: a benchmark that only contains one class of bug tells you very little about generalised capability.

1,507 Benchmark Instances
188 OSS Projects
28 Crash Types
7.5× Larger Than Prior SOTA

// Quality Control Pipeline

Benchmark quality is enforced through three automated filtering passes: informativeness (removing commits lacking sufficient vulnerability context or covering multiple simultaneous fixes, which would make success criteria ambiguous); reproducibility (re-running ground-truth PoCs on both pre- and post-patch executables to verify the pass/fail differential behaves correctly); and non-redundancy (excluding duplicates via crash trace comparison). This is not trivial — OSS-Fuzz produces a noisy stream of bug reports, and many commits touch multiple issues simultaneously. The filtering pipeline is what makes the dataset usable as a scientific instrument.
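
The three passes compose into a straightforward filter chain. A minimal sketch of that logic (field names and the informativeness thresholds are illustrative assumptions, not the repo's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Instance:
    commit_msg: str          # source of the vulnerability description
    files_touched: int       # multi-fix commits make success criteria ambiguous
    crashes_prepatch: bool   # ground-truth PoC result on the pre-patch build
    crashes_postpatch: bool  # ground-truth PoC result on the post-patch build
    crash_trace: tuple       # normalised stack frames, used for dedup

def informative(inst: Instance) -> bool:
    # Drop commits lacking context or fixing several issues at once.
    # (Thresholds here are made up for illustration.)
    return len(inst.commit_msg) > 20 and inst.files_touched <= 3

def reproducible(inst: Instance) -> bool:
    # The ground-truth PoC must crash pre-patch and NOT crash post-patch.
    return inst.crashes_prepatch and not inst.crashes_postpatch

def filter_instances(candidates):
    kept, seen_traces = [], set()
    for inst in candidates:
        if not (informative(inst) and reproducible(inst)):
            continue
        if inst.crash_trace in seen_traces:  # non-redundancy pass
            continue
        seen_traces.add(inst.crash_trace)
        kept.append(inst)
    return kept
```

The ordering matters for cost: the cheap metadata checks run before the expensive build-and-execute reproducibility check would in a real pipeline.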

// Task Design: The Two Evaluation Levels

CyberGym defines two distinct evaluation scenarios that test different capability profiles.

// Level 1 — Guided Vulnerability Reproduction

This is the primary benchmark. The agent receives the pre-patch codebase and the natural-language vulnerability description. It must generate a working proof-of-concept that: triggers the vulnerability (crashes with sanitizers enabled) on the pre-patch version, and does not trigger on the post-patch version. The differential is the verification signal — not just "does it crash" but "does it crash in the right version because of the right bug."

This is harder than it sounds. The agent must reason across an entire codebase — often spanning thousands of files — to locate the relevant code path, understand the data flow leading to the crash, and construct an input or function call sequence that exercises it from a valid program entry point. Agents iterate based on execution feedback in a read-execute-refine loop.
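
That loop is simple to sketch in the abstract. Here `propose` stands in for the LLM-backed agent step and `run` for PoC execution inside the Docker environment; the interface names are mine, not the paper's:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ExecResult:
    crashed: bool   # did the sanitizer fire?
    stderr: str     # execution feedback fed into the next proposal

def refine_loop(propose: Callable[[Optional[str]], bytes],
                run: Callable[[bytes], ExecResult],
                max_iters: int = 10) -> Optional[bytes]:
    """Read-execute-refine: propose a PoC, run it, and feed the
    execution output back into the next proposal until it crashes."""
    feedback = None
    for _ in range(max_iters):
        poc = propose(feedback)
        result = run(poc)
        if result.crashed:
            return poc
        feedback = result.stderr
    return None
```

The key property is that each iteration conditions on concrete runtime feedback rather than static code reading alone, which is where the benchmark's codebase-scale reasoning demand shows up.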

// Success Criterion: PoC triggers sanitizer crash on pre-patch binary AND does not trigger on post-patch binary. Verified automatically by the evaluation harness — no human in the loop for scoring.
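
A minimal sketch of such a differential check. The sanitizer-marker heuristic and binary invocation here are assumptions for illustration, not the harness's actual implementation:

```python
import subprocess

SAN_MARKERS = ("AddressSanitizer", "UndefinedBehaviorSanitizer", "SEGV")

def run_poc(binary: str, poc_path: str, timeout: int = 30) -> bool:
    """Return True if the PoC triggers a sanitizer crash on this binary."""
    try:
        proc = subprocess.run([binary, poc_path], capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # hangs are not counted as crashes in this sketch
    return proc.returncode != 0 and any(m in proc.stderr for m in SAN_MARKERS)

def verdict(crashes_prepatch: bool, crashes_postpatch: bool) -> bool:
    # Success only if the crash is version-specific: present before the
    # patch, absent after it.
    return crashes_prepatch and not crashes_postpatch
```

Usage would be `verdict(run_poc(pre_bin, poc), run_poc(post_bin, poc))` — the post-patch run is what distinguishes "found the described bug" from "crashed the program some other way".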

// Level 0 — Open-Ended Discovery (No Prior Context)

The harder and more operationally relevant scenario. The agent receives only the latest codebase — no vulnerability description, no hints, no patch. It must autonomously discover and trigger new vulnerabilities. This mirrors what an offensive AI agent would do in a real-world autonomous fuzzing or code auditing context. Results from this mode are discussed in the real-world impact section below.

// Evaluation Results: What the Numbers Actually Mean

// LLM Performance on Level 1

Four agent frameworks were evaluated against nine LLMs. The headline number that will get quoted everywhere is that the top combination — OpenHands with Claude-Sonnet-4 — achieves a 17.9% reproduction success rate in a single trial. Claude-3.7-Sonnet and GPT-4.1 follow closely behind. The more practically interesting stat: with 30 trials, success rates reach approximately 67%, demonstrating strong test-time scaling potential.

Model                  Agent Framework        Success Rate (1 Trial)  Notes
Claude-Sonnet-4        OpenHands              17.9%                   Best overall (non-thinking mode)
Claude-3.7-Sonnet      OpenHands              ~15%                    Second best; thinking mode evaluated
GPT-4.1                OpenHands / Codex CLI  ~14%                    Strong cost/performance ratio
GPT-5                  OpenHands              22.0%                   Thinking mode only; highest with extended reasoning
SWE-bench specialised  Various                ≤ 2%                    Fails to generalise to vuln reproduction
o4-mini                OpenHands              Low                     Safety alignment triggers confirmation requests; limits autonomy

Two findings here are worth dwelling on from a practitioner perspective.

First, SWE-bench specialised models collapsed to near-zero performance. These models are trained to fix software bugs — a task superficially similar to vulnerability reproduction. The fact that they fail almost completely on CyberGym confirms that "bug fixing" and "vulnerability exploitation" are distinct cognitive tasks, not just variants of the same code reasoning capability. This matters if you are evaluating AI tools for defensive vs. offensive security applications.

Second, o4-mini's safety alignment actively blocked autonomous execution. The model repeatedly sought user confirmation mid-task rather than proceeding, reducing effective performance despite having strong underlying coding ability. This is a direct observable signal of how safety alignment interacts with agentic security tasks — relevant for anyone building AI security tooling on top of commercial LLM APIs.

// Test-Time Scaling and Thinking Modes

The evaluation includes a controlled comparison of thinking vs. non-thinking modes on a 300-task subset. The most dramatic delta was GPT-5: it jumped from 7.7% with minimal reasoning to 22.0% with high reasoning — surpassing Claude-Sonnet-4's non-thinking performance. For GPT-4.1, running six independent trials and taking the union achieved 18.0% success vs. 8.7% average, nearly doubling effective capability through parallelism alone.

Running 6 independent GPT-4.1 trials and taking the union achieves 18.0% success — nearly double the 8.7% single-run average. Parallelism as a capability multiplier is not a future concern. It is already operative.
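
The arithmetic behind that multiplier is worth making explicit. A sketch of union coverage alongside the naive independence bound (function names are illustrative):

```python
def union_success_rate(trial_outcomes: list, n_tasks: int) -> float:
    """Fraction of tasks solved by at least one trial.

    trial_outcomes: one set of solved task IDs per independent trial.
    """
    solved = set().union(*trial_outcomes)
    return len(solved) / n_tasks

def expected_union(p: float, k: int) -> float:
    # Naive bound only: assumes trials succeed independently per task,
    # which agent runs demonstrably do not.
    return 1 - (1 - p) ** k
```

With p = 0.087 and k = 6, the independence bound gives roughly 42%, well above the observed 18.0% union — trials are correlated, tending to fail on the same hard instances, so parallelism helps substantially but saturates.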

// Agent Framework Behavioural Analysis

All four tested frameworks — OpenHands, OpenAI Codex CLI, EnIGMA, and Cybench agent — achieved similar aggregate success rates when using GPT-4.1 as the backbone. But the union of their outcomes reached 18.4%, nearly double any single agent's performance. The agents are not redundant — they succeed on different subsets of tasks, which means their capabilities are genuinely complementary rather than correlated.

Behaviourally, OpenHands was most efficient — it chains commands in Bash, reducing the total number of tool calls needed. The CTF-specialised agents (EnIGMA, Cybench) relied more heavily on Python scripting. This suggests different mental models of how to approach the code, and is consistent with their respective training distributions.

// Real-World Security Impact: The Numbers That Matter

Benchmark scores measure capability in a controlled environment. The real credibility test is whether that capability translates to production systems. CyberGym's answer to this is unambiguous.

// Incomplete Patches Detected

During Level 1 evaluation, 759 PoCs triggered crashes across 60 projects even on patched versions of the code. Manual inspection confirmed 17 incomplete patches across 15 projects. The AI-generated PoCs exercised post-patch code paths that the original OSS-Fuzz PoCs never reached, effectively stress-testing the quality of existing security patches as a side effect of evaluation. None affected the latest software releases, but the finding demonstrates that AI-generated PoCs can uncover patch coverage gaps that manual review missed.

// Zero-Days Discovered

Post-patch crash validation identified 35 PoCs that still crashed the latest versions of their target programs. After deduplication, these mapped to 10 unique zero-day vulnerabilities, each of which had been sitting undetected in production code for an average of 969 days before the agents found them. All findings were responsibly disclosed, resulting in 3 assigned CVEs and 6 patched vulnerabilities as of publication.
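
Deduplication at this step is typically done by grouping crashes on their top sanitizer stack frames, a standard fuzzing heuristic; a sketch of that approach (the paper's exact method may differ):

```python
def crash_signature(stack_trace: str, top_n: int = 3) -> tuple:
    """Key a crash by the function names in its top stack frames,
    e.g. '#0 0x4f2a in png_read_row pngread.c:123' -> 'png_read_row'."""
    frames = []
    for line in stack_trace.splitlines():
        parts = line.strip().split()
        if parts and parts[0].startswith("#") and len(parts) >= 4:
            frames.append(parts[3])  # function name field in ASan frames
        if len(frames) == top_n:
            break
    return tuple(frames)

def dedupe(crashes: dict) -> dict:
    """Map {poc_id: stack_trace} to {signature: [poc_ids]}."""
    groups = {}
    for poc_id, trace in crashes.items():
        groups.setdefault(crash_signature(trace), []).append(poc_id)
    return groups
```

Grouping on the top few frames rather than the full trace absorbs address-layout noise between runs while still separating distinct root causes.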

759 Post-Patch PoC Crashes
17 Incomplete Patches Confirmed
10 Zero-Days (Unique)
969 Avg Days Undetected

// Level 0 Open-Ended Discovery at Scale

The open-ended discovery experiment deployed OpenHands across 431 OSS-Fuzz projects and 1,748 executables with zero prior knowledge of existing vulnerabilities. GPT-4.1 triggered 16 crashes and confirmed 7 zero-days. GPT-5 triggered 56 crashes and confirmed 22 zero-days, with 4 overlapping between the two models. These are not reproductions of known bugs — these are autonomous, unprompted discoveries in active production software.

// Key correlation finding: Performance on the Level 1 reproduction benchmark correlates strongly with real-world zero-day discovery capability in Level 0. This validates CyberGym as a meaningful proxy for operational offensive AI capability — not just a leaderboard number.

// AppSec Practitioner Takeaways

Strip away the academic framing and CyberGym is communicating several concrete things to practitioners working in application security today.

AI-assisted vulnerability reproduction is operationally real, not theoretical. An 18% single-trial success rate against 1,500 real-world bugs sounds modest until you factor in parallelism. Six independent runs of GPT-4.1 reach 18% union coverage. At scale, an adversary running hundreds of parallel agent instances against a target codebase is not a 2027 problem. The compute cost to attempt this is already within reach of well-resourced threat actors.

Patch quality verification is an undervalued use case. The 17 incomplete patches discovered were a side effect of evaluation, not a deliberate hunt. Integrating AI-generated PoC testing into patch review pipelines — specifically to verify that a fix fully closes the attack surface rather than just patching the reported crash input — is a defensive application that deserves more tooling attention.

Specialisation gap between defensive and offensive AI is confirmed. SWE-bench models scoring near zero on CyberGym is a clean empirical data point: code fix reasoning does not transfer to code exploitation reasoning. Teams evaluating AI tools for security automation should be cautious about assuming general coding capability translates to security-specific tasks. Test explicitly against the task you care about.

Safety alignment as an observable operational constraint. The o4-mini behaviour — halting to seek confirmation rather than proceeding autonomously — is worth noting for teams building security tooling on top of commercial LLM APIs. Model-level safety controls are not always transparent, and they can degrade agent effectiveness in ways that do not surface until you run evaluation against real tasks.


// My Take

CyberGym is a methodologically serious piece of work that deserves to be read carefully, not just cited as a headline number. The OSS-Fuzz sourcing strategy is smart — it grounds every instance in a real, confirmed, verified vulnerability with a documented patch differential. That is not easy to do at this scale and it matters enormously for evaluation validity.

What I find most significant is not the 17.9% success rate — it is the 969-day average age of the zero-days found. These were not obscure fringe projects. They were active, maintained, security-conscious OSS codebases. The fact that AI agents running against them found unpatched vulnerabilities faster than the existing bug discovery ecosystem is a direct challenge to the assumption that continuous fuzzing and active maintenance are sufficient. They are not — not when the adversary can throw an ensemble of AI agents with different behavioural patterns at your codebase in parallel.

The complementarity finding is the one I keep coming back to. Agents succeeding on different instance subsets, reaching 18.4% union vs. ~10% individual — that is an ensemble signal. Defenders need to think about this the same way they think about layered detection: no single agent covers everything, but a coordinated multi-agent system has a coverage profile that starts to become operationally dangerous. We are not there yet at 18%. But the trajectory from the paper's own progress chart — 10% to 30% across recent model iterations — suggests the window to prepare is shorter than most teams think.

// References & Further Reading

CyberGym paper (arXiv:2506.02548) — arxiv.org
CyberGym project page & leaderboard — cybergym.io
OSS-Fuzz infrastructure — google.github.io/oss-fuzz
OpenHands agent framework — github.com/All-Hands-AI/OpenHands
Frontier AI Cybersecurity Observatory — rdi.berkeley.edu
Claude Sonnet 4.5 System Card (CyberGym evaluation referenced) — anthropic.com

