CyberGym: When AI Agents Learn to Hunt Vulnerabilities at Scale

AI Security Research · Benchmark Analysis
Wang, Shi, He, Cai, Zhang, Song (UC Berkeley), ICLR 2026

For years, the security community has asked the same uncomfortable question: when AI systems get good enough at finding bugs, what does that actually look like in practice? Not in a capture-the-flag sandbox, but against the real, messy, multi-million-line codebases that run the world's infrastructure?

A team from UC Berkeley just published a rigorous answer. CyberGym is a large-scale cybersecurity evaluation framework built around 1,507 real-world vulnerabilities sourced from production open-source software. It is currently the most comprehensive benchmark of its kind, and its findings carry direct implications for every AppSec practitioner, red teamer, and tooling team paying attention ...