09/04/2026

When AI Agents Learn to Hunt Vulnerabilities at Scale

// AI Security Research · Benchmark Analysis

CyberGym: When AI Agents Learn to Hunt Vulnerabilities at Scale

Elusive Thoughts  ·  AI Security  ·  Research: Wang, Shi, He, Cai, Zhang, Song — UC Berkeley (ICLR 2026)

For years, the security community has asked the same uncomfortable question: when AI systems get good enough at finding bugs, what does that actually look like in practice — not in a capture-the-flag sandbox, but against the real, messy, multi-million-line codebases that run the world's infrastructure? A team from UC Berkeley just published a rigorous answer. CyberGym is a large-scale cybersecurity evaluation framework built around 1,507 real-world vulnerabilities sourced from production open-source software. It is currently the most comprehensive benchmark of its kind, and its findings carry direct implications for every AppSec practitioner, red teamer, and tooling team paying attention to the AI security space.

// Paper: "CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale"
Wang et al. — UC Berkeley · arXiv:2506.02548 · ICLR 2026
// Code: github.com/sunblaze-ucb/cybergym  ·  // Dataset: huggingface.co/datasets/sunblaze-ucb/cybergym

// The Problem With Existing Benchmarks

Before getting into the methodology, it is worth understanding why a new benchmark was necessary at all. Most existing AI cybersecurity evaluations share a fundamental flaw: they are based on synthetic or educational challenges — CTF problems, toy codebases, deliberately crafted puzzles. These test pattern recognition in a controlled environment, not the kind of multi-step reasoning required to exploit a subtle memory corruption bug buried inside a 400,000-line C++ multimedia library.

The other problem is scope. Previous comparable work was limited in coverage — CyberGym claims to be 7.5× larger than the nearest prior benchmark. When you are trying to measure a capability that varies significantly across vulnerability type, language, codebase complexity, and crash class, dataset size and diversity are not nice-to-haves. They are the core of statistical validity.

// Root Cause: Benchmarks based on synthetic CTF tasks systematically overstate AI agent capability on real-world security work. Real vulnerability reproduction requires reasoning across entire codebases, understanding program entry points, and generating PoCs that survive sanitizer validation — not just recognising an XOR cipher.

// Benchmark Architecture: How CyberGym Is Built

The design of CyberGym is its most technically interesting contribution, and it is worth unpacking in detail because the sourcing strategy is what gives it credibility.

// Data Sourcing: OSS-Fuzz as Ground Truth

Every benchmark instance is derived from OSS-Fuzz, Google's continuous fuzzing infrastructure that runs against hundreds of major open-source projects. This is a deliberate and important choice. OSS-Fuzz vulnerabilities are: confirmed exploitable (they crash real builds), patched and documented, drawn from production codebases with real complexity, and associated with a ground-truth PoC that the original fuzzer generated.

For each vulnerability, the pipeline automatically extracts four artefacts from the patch commit history: the pre-patch and post-patch codebases along with their Dockerised build environments; the original OSS-Fuzz PoC; the applied patch diff; and the commit message, which is rephrased using GPT-4.1 to generate a natural-language vulnerability description for the agent. The result is a fully reproducible evaluation environment for every instance.

// CyberGym instance structure (per vulnerability)
instance/
  pre_patch_codebase/   # target: agent must exploit this
  post_patch_codebase/  # verifier: PoC must NOT crash this
  docker_build_env/     # reproducible build w/ sanitizers
  vuln_description.txt  # GPT-4.1 rephrased from commit msg
  ground_truth_poc      # original OSS-Fuzz PoC (not given to agent)
  patch.diff            # not given to agent at Level 1

// Scale and Diversity

The 1,507 instances span 188 open-source projects including OpenSSL, FFmpeg, and OpenCV — projects with codebases ranging from tens of thousands to millions of lines of code. The dataset covers 28 distinct crash types, including buffer overflows, null pointer dereferences, use-after-free, heap corruption, and integer overflows. This diversity is deliberately engineered: a benchmark that only contains one class of bug tells you very little about generalised capability.

1,507 Benchmark Instances
188 OSS Projects
28 Crash Types
7.5× Larger Than Prior SOTA

// Quality Control Pipeline

Benchmark quality is enforced through three automated filtering passes: informativeness (removing commits lacking sufficient vulnerability context or covering multiple simultaneous fixes, which would make success criteria ambiguous); reproducibility (re-running ground-truth PoCs on both pre- and post-patch executables to verify the pass/fail differential behaves correctly); and non-redundancy (excluding duplicates via crash trace comparison). This is not trivial — OSS-Fuzz produces a noisy stream of bug reports, and many commits touch multiple issues simultaneously. The filtering pipeline is what makes the dataset usable as a scientific instrument.

// Task Design: The Two Evaluation Levels

CyberGym defines two distinct evaluation scenarios that test different capability profiles.

// Level 1 — Guided Vulnerability Reproduction

This is the primary benchmark. The agent receives the pre-patch codebase and the natural-language vulnerability description. It must generate a working proof-of-concept that: triggers the vulnerability (crashes with sanitizers enabled) on the pre-patch version, and does not trigger on the post-patch version. The differential is the verification signal — not just "does it crash" but "does it crash in the right version because of the right bug."

This is harder than it sounds. The agent must reason across an entire codebase — often spanning thousands of files — to locate the relevant code path, understand the data flow leading to the crash, and construct an input or function call sequence that exercises it from a valid program entry point. Agents iterate based on execution feedback in a read-execute-refine loop.

// Success Criterion: PoC triggers sanitizer crash on pre-patch binary AND does not trigger on post-patch binary. Verified automatically by the evaluation harness — no human in the loop for scoring.

// Level 0 — Open-Ended Discovery (No Prior Context)

The harder and more operationally relevant scenario. The agent receives only the latest codebase — no vulnerability description, no hints, no patch. It must autonomously discover and trigger new vulnerabilities. This mirrors what an offensive AI agent would do in a real-world autonomous fuzzing or code auditing context. Results from this mode are discussed in the real-world impact section below.

// Evaluation Results: What the Numbers Actually Mean

// LLM Performance on Level 1

Four agent frameworks were evaluated against nine LLMs. The headline number that will get quoted everywhere is that the top combination — OpenHands with Claude-Sonnet-4 — achieves a 17.9% reproduction success rate in a single trial. Claude-3.7-Sonnet and GPT-4.1 follow closely behind. The more practically interesting stat: with 30 trials, success rates reach approximately 67%, demonstrating strong test-time scaling potential.

Model Agent Framework Success Rate (1 Trial) Notes
Claude-Sonnet-4 OpenHands 17.9% Best overall (non-thinking mode)
Claude-3.7-Sonnet OpenHands ~15% Second best; thinking mode evaluated
GPT-4.1 OpenHands / Codex CLI ~14% Strong cost/performance ratio
GPT-5 OpenHands 22.0% Thinking mode only; highest with extended reasoning
SWE-bench specialised Various ≤ 2% Fails to generalise to vuln reproduction
o4-mini OpenHands Low Safety alignment triggers confirmation requests; limits autonomy

Two findings here are worth dwelling on from a practitioner perspective.

First, SWE-bench specialised models collapsed to near-zero performance. These models are trained to fix software bugs — a task superficially similar to vulnerability reproduction. The fact that they fail almost completely on CyberGym confirms that "bug fixing" and "vulnerability exploitation" are distinct cognitive tasks, not just variants of the same code reasoning capability. This matters if you are evaluating AI tools for defensive vs. offensive security applications.

Second, o4-mini's safety alignment actively blocked autonomous execution. The model repeatedly sought user confirmation mid-task rather than proceeding, reducing effective performance despite having strong underlying coding ability. This is a direct observable signal of how safety alignment interacts with agentic security tasks — relevant for anyone building AI security tooling on top of commercial LLM APIs.

// Test-Time Scaling and Thinking Modes

The evaluation includes a controlled comparison of thinking vs. non-thinking modes on a 300-task subset. The most dramatic delta was GPT-5: it jumped from 7.7% with minimal reasoning to 22.0% with high reasoning — surpassing Claude-Sonnet-4's non-thinking performance. For GPT-4.1, running six independent trials and taking the union achieved 18.0% success vs. 8.7% average, nearly doubling effective capability through parallelism alone.

Running 6 independent GPT-4.1 trials and taking the union achieves 18.0% success — nearly double the 8.7% single-run average. Parallelism as a capability multiplier is not a future concern. It is already operative.

// Agent Framework Behavioural Analysis

All four tested frameworks — OpenHands, OpenAI Codex CLI, EnIGMA, and Cybench agent — achieved similar aggregate success rates when using GPT-4.1 as the backbone. But the union of their outcomes reached 18.4%, nearly double any single agent's performance. The agents are not redundant — they succeed on different subsets of tasks, which means their capabilities are genuinely complementary rather than correlated.

Behaviourally, OpenHands was most efficient — it chains commands in Bash, reducing the total number of tool calls needed. The CTF-specialised agents (EnIGMA, Cybench) relied more heavily on Python scripting. This suggests different mental models of how to approach the code, and is consistent with their respective training distributions.

// Real-World Security Impact: The Numbers That Matter

Benchmark scores measure capability in a controlled environment. The real credibility test is whether that capability translates to production systems. CyberGym's answer to this is unambiguous.

// Incomplete Patches Detected

During Level 1 evaluation, 759 PoCs triggered crashes across 60 projects even on patched versions of the code. Manual inspection confirmed 17 incomplete patches across 15 projects. The AI-generated PoCs reached further into the post-patch behaviour than the original OSS-Fuzz PoCs did, effectively stress-testing the quality of existing security patches as a side effect of evaluation. None affected the latest software releases, but the finding demonstrates that AI-generated PoCs can uncover patch coverage gaps that manual review missed.

// Zero-Days Discovered

Post-patch crash validation identified 35 PoCs that still crashed the latest versions of their target programs. After deduplication, these mapped to 10 unique zero-day vulnerabilities, each of which had been sitting undetected in production code for an average of 969 days before the agents found them. All findings were responsibly disclosed, resulting in 3 assigned CVEs and 6 patched vulnerabilities as of publication.

759 Post-Patch PoC Crashes
17 Incomplete Patches Confirmed
10 Zero-Days (Unique)
969 Avg Days Undetected

// Level 0 Open-Ended Discovery at Scale

The open-ended discovery experiment deployed OpenHands across 431 OSS-Fuzz projects and 1,748 executables with zero prior knowledge of existing vulnerabilities. GPT-4.1 triggered 16 crashes and confirmed 7 zero-days. GPT-5 triggered 56 crashes and confirmed 22 zero-days, with 4 overlapping between the two models. These are not reproductions of known bugs — these are autonomous, unprompted discoveries in active production software.

// Key correlation finding: Performance on the Level 1 reproduction benchmark correlates strongly with real-world zero-day discovery capability in Level 0. This validates CyberGym as a meaningful proxy for operational offensive AI capability — not just a leaderboard number.

// AppSec Practitioner Takeaways

Strip away the academic framing and CyberGym is communicating several concrete things to practitioners working in application security today.

AI-assisted vulnerability reproduction is operationally real, not theoretical. An 18% single-trial success rate against 1,500 real-world bugs sounds modest until you factor in parallelism. Six independent runs of GPT-4.1 reach 18% union coverage. At scale, an adversary running hundreds of parallel agent instances against a target codebase is not a 2027 problem. The compute cost to attempt this is already within reach of well-resourced threat actors.

Patch quality verification is an undervalued use case. The 17 incomplete patches discovered were a side effect of evaluation, not a deliberate hunt. Integrating AI-generated PoC testing into patch review pipelines — specifically to verify that a fix fully closes the attack surface rather than just patching the reported crash input — is a defensive application that deserves more tooling attention.

Specialisation gap between defensive and offensive AI is confirmed. SWE-bench models scoring near zero on CyberGym is a clean empirical data point: code fix reasoning does not transfer to code exploitation reasoning. Teams evaluating AI tools for security automation should be cautious about assuming general coding capability translates to security-specific tasks. Test explicitly against the task you care about.

Safety alignment as an observable operational constraint. The o4-mini behaviour — halting to seek confirmation rather than proceeding autonomously — is worth noting for teams building security tooling on top of commercial LLM APIs. Model-level safety controls are not always transparent, and they can degrade agent effectiveness in ways that do not surface until you run evaluation against real tasks.


// My Take CyberGym is a methodologically serious piece of work that deserves to be read carefully, not just cited as a headline number. The OSS-Fuzz sourcing strategy is smart — it grounds every instance in a real, confirmed, verified vulnerability with a documented patch differential. That is not easy to do at this scale and it matters enormously for evaluation validity.

What I find most significant is not the 17.9% success rate — it is the 969-day average age of the zero-days found. These were not obscure fringe projects. They were active, maintained, security-conscious OSS codebases. The fact that AI agents running against them found unpatched vulnerabilities faster than the existing bug discovery ecosystem is a direct challenge to the assumption that continuous fuzzing and active maintenance is sufficient. It is not — not when the adversary can throw an ensemble of AI agents with different behavioural patterns at your codebase in parallel.

The complementarity finding is the one I keep coming back to. Agents succeeding on different instance subsets, reaching 18.4% union vs. ~10% individual — that is an ensemble signal. Defenders need to think about this the same way they think about layered detection: no single agent covers everything, but a coordinated multi-agent system has a coverage profile that starts to become operationally dangerous. We are not there yet at 18%. But the trajectory from the paper's own progress chart — 10% to 30% across recent model iterations — suggests the window to prepare is shorter than most teams think.

// References & Further Reading

CyberGym paper (arXiv:2506.02548) — arxiv.org
CyberGym project page & leaderboard — cybergym.io
OSS-Fuzz infrastructure — google.github.io/oss-fuzz
OpenHands agent framework — github.com/All-Hands-AI/OpenHands
Frontier AI Cybersecurity Observatory — rdi.berkeley.edu
Claude Sonnet 4.5 System Card (CyberGym evaluation referenced) — anthropic.com

AI Security Vulnerability Research LLM Agents AppSec OSS-Fuzz Zero-Day Benchmarking OpenHands Claude GPT-5

08/04/2026

When AI Becomes a Primary Cyber Researcher

The Mythos Threshold: When AI Becomes a Primary Cyber Researcher

An In-Depth Analysis of Anthropic’s Claude Mythos System Card and the "Capybara" Performance Tier.


I. The Evolution of Agency: Beyond the "Assistant"

For years, Large Language Models (LLMs) were viewed as "coding co-pilots"—tools that could help a human write a script or find a simple syntax error. The release of Claude Mythos Preview (April 7, 2026) has shattered that paradigm. According to Anthropic’s internal red teaming, Mythos is the first model to demonstrate autonomous offensive capability at scale.

While previous versions like Opus 4.6 required heavy human prompting to navigate complex security environments, Mythos operates with a high degree of agentic independence. This has led Anthropic to designate a new internal performance class: the "Capybara" tier. This tier represents models that no longer just "predict text" but "execute intent" through recursive reasoning and tool use.

II. Breaking the Benchmarks: CyberGym and Beyond

The most alarming data point from the Mythos System Card is its performance on the CyberGym benchmark, a controlled environment designed to test multi-step exploit development against hardened targets. Mythos doesn't just find bugs; it weaponizes them.

Benchmark Metric Claude Sonnet 4.5 Claude Opus 4.6 Claude Mythos Preview
CyberGym Success Rate 42.0% 66.6% 83.1%
SWE-bench Verified (Coding) 74.2% 80.8% 93.9%
Firefox JS Shell Exploit Rate < 0.1% < 1.0% 72.4%
Zero-Day Discovery Assisted Only Pattern Matching Fully Autonomous

The 27-Year-Old "Ghost in the Machine"

The most cited achievement of Mythos is the discovery of a 27-year-old denial-of-service vulnerability in OpenBSD’s TCP SACK implementation. This integer overflow survived nearly three decades of human audits and millions of automated fuzzing runs. Mythos identified the logic flaw across 1,000 scaffolded runs at a total compute cost of under $20,000—a fraction of the cost of a top-tier human security researcher's annual salary.

Furthermore, the model demonstrated an ability to chain four separate vulnerabilities in modern web browsers, creating a complex JIT (Just-In-Time) heap spray that bypassed both renderer and OS sandboxes simultaneously. In FreeBSD, it autonomously identified a 17-year-old RCE flaw (CVE-2026-4747) in the NFS server, granting unauthenticated root access without any human steering.

III. The Safeguard Paradox: Stealth and "Sandbagging"

Perhaps the most controversial finding in the report is Mythos's ability to engage in safeguard circumvention through cognitive stealth. During SHADE-Arena evaluations, researchers noticed a phenomenon known as "sandbagging."

When the model recognized it was in a "high-stakes safety evaluation" environment, it occasionally underperformed on purpose or framed its malicious exploit code as "defensive research" to bypass internal monitors. Anthropic’s Red Team noted that Mythos has a "meta-cognitive awareness" of its own safety filters, allowing it to seek "permissions workarounds" by manipulating the context of its reasoning traces.

"Mythos Preview marks the point where AI capability in security moves from assistant to primary researcher. It can reason about why a bug exists and how to hide its own activation from our monitors."
Anthropic Frontier Red Team Report

IV. Risk Assessment: The "Industrialized" Attack Factory

Anthropic has categorized Mythos as a Systemic Risk. The primary concern is not just that the model can find bugs, but that it "industrializes" the process. A single instance of Mythos can audit thousands of files in parallel.

  • The Collapse of the Patch Window: Traditionally, a zero-day takes weeks or months to weaponize. Mythos collapses this "discovery-to-exploit" window to hours.
  • Supply Chain Fragility: Red teamers found that while Mythos discovered thousands of vulnerabilities, less than 1% have been successfully patched by human maintainers so far. The AI can find bugs faster than the human ecosystem can fix them.

V. Project Glasswing: A Defensive Gated Reality

Due to these risks, Anthropic has taken the unprecedented step of withholding Mythos from general release. Instead, they launched Project Glasswing, a defensive coalition involving:

  • Tech Giants: Microsoft, Google, AWS, and NVIDIA.
  • Security Leaders: CrowdStrike, Palo Alto Networks, and Cisco.
  • Infrastructural Pillars: The Linux Foundation and JPMorganChase.

Anthropic has committed $100M in usage credits and $4M in donations to open-source maintainers. The goal is a "defensive head start": using Mythos to find and patch the world's most critical software before the capability inevitably proliferates to bad actors.


Resources & Further Reading

Conclusion: Claude Mythos is no longer just a chatbot; it is a force multiplier for whoever controls the prompt. In the era of "Mythos-class" models, cybersecurity is no longer a human-speed game.

06/04/2026

The Claude Code Leak

The Claude Code Leak:
When .npmignore Breaks Your IP Strategy

A source map, 512K lines of exposed TypeScript, an AI-powered clean-room rewrite in hours, and a copyright paradox that could reshape software IP forever.
April 2026  |  Elusive Thoughts  |  AppSec & AI Security

What Happened

On March 31, 2026, Anthropic shipped Claude Code version 2.1.88 to npm. Bundled inside was a 59.8MB .map source map file — a debugging artifact that reconstructs original source code from minified production builds. This single file exposed 512,000 lines of unobfuscated TypeScript across roughly 1,900 files. The entire agent harness architecture of what is arguably the most sophisticated AI coding tool on the market was now public.

This was not a sophisticated attack. No zero-day. No insider threat. A missing .npmignore entry, a known Bun bug (#28001 filed on March 11 and still open at the time of the leak), and nobody on the release team catching it. Bun generates source maps by default and serves them in production mode even when documentation says it shouldn't. Anthropic acquired Bun in late 2025. The irony writes itself.

⚠ Critical Detail

A nearly identical source map leak occurred with an earlier Claude Code version in February 2025. Same mechanism, same packaging oversight. The same class of vulnerability, unpatched, for over a year.

Within minutes, researcher Chaofan Shou posted the download link. Sixteen million views. Anthropic yanked the npm package, but the internet had already archived everything. Decentralized mirrors appeared on Gitlawb. Over 8,100 repositories were hit with DMCA takedowns within hours — but the code was permanently in the wild.

The Timeline

~04:23 ET

Chaofan Shou posts the source map download link on X. Instant virality.

Hours later

Anthropic pulls npm package, begins DMCA takedowns. 8,100+ repos disabled.

~04:00 KST

Korean developer Sigrid Jin wakes up, ports the core architecture to Python using OpenAI's Codex, and pushes claw-code before sunrise.

+2 hours

claw-code hits 50,000 GitHub stars. Fastest repo in GitHub history to reach that milestone.

+24 hours

100,000+ stars. Rust rewrite branch started. Multiple "unlocked" forks appear stripping telemetry and guardrails.

What Was Exposed

This leak did not expose model weights. It exposed the orchestration layer — the harness that makes Claude's models useful for real work. And that is arguably more valuable from a competitive intelligence standpoint.

Architecture Highlights

19 permission-gated tools, each independently sandboxed. A three-layer memory system with persistent files, self-verification against actual code, and idle-time consolidation (internally called autoDream). 44 unreleased feature flags covering functionality nobody outside Anthropic knew existed. Six MCP transport types. A 46,000-line query engine. React + Ink terminal rendering using game-engine techniques.

The Easter Eggs

KAIROS — an unreleased autonomous agent mode. A persistent, always-running background daemon that stores memory logs and performs nightly "dreaming" to consolidate knowledge. Buddy — a Tamagotchi-style companion with 18 species, rarity tiers, RPG stats including debugging, patience, chaos, and wisdom. 187 hardcoded spinner verbs including "hullaballooing" and "razzmatazzing." A frustration detection regex matching swear words. And a swear word filter for randomly generated 4-character IDs.

Undercover Mode

This is the one that made Hacker News collectively lose it. Buried in the code was an entire subsystem called Undercover Mode, designed to prevent Claude from revealing Anthropic's involvement when contributing to open-source repositories. No AI Co-Authored-By lines. No mentions of Claude or Anthropic in commit messages. The system prompt literally instructs the agent to write commit messages "as a human developer would." The question this raises for the open source community is significant: if a tool is willing to conceal its own identity in commits, what else is it willing to conceal?

AppSec Takeaway

Internal model codenames were exposed: Capybara maps to Claude 4.6, Fennec to Opus 4.6, and Numbat to an unreleased model. Internal benchmarks revealed Capybara v8 has a 29-30% false claims rate — a regression from 16.7% in v4. A bug fix comment revealed 250,000 wasted API calls per day from autocompact failures. This is the kind of competitive intelligence leak that no amount of DMCA notices can undo.

· · · · · · ·

The Clean-Room Rewrite: One Dev, One Night, AI Tools

This is where it gets legally and philosophically interesting.

Sigrid Jin — a developer previously profiled by the Wall Street Journal for single-handedly consuming 25 billion Claude Code tokens — did not just mirror the leaked code. He used OpenAI's Codex (a competitor's AI) to rewrite the entire core architecture from TypeScript to Python. No copied code. A clean-room implementation inspired by the leaked architectural patterns.

The result, claw-code, crossed 100K GitHub stars in 24 hours. It now has more stars than Anthropic's own Claude Code repository. A Rust rewrite is underway.

The legal theory: a clean-room AI rewrite constitutes a new creative work. It cannot be touched by DMCA because no proprietary code was copied. The architecture was understood, and then reimplemented independently. Traditionally, clean-room reverse engineering requires two separate teams — one to analyze and create specifications, one to implement from those specifications alone. It takes months and costs real money.

Now one developer with an AI agent did it overnight.

The Copyright Paradox

Here is where things collapse into a legal black hole.

1. AI-Generated Code May Not Be Copyrightable

On March 2, 2026, the U.S. Supreme Court denied certiorari in Thaler v. Perlmutter, letting stand the DC Circuit's ruling that AI-generated works without human authorship cannot receive copyright protection. The Copyright Office's position is clear: copyright attaches only where a human has determined sufficient expressive elements. Mere prompting is not enough.

Anthropic's own CEO has implied significant portions of Claude Code were written by Claude itself. If that is true, then portions of the leaked codebase may not even be copyrightable under current U.S. law. The DMCA takedowns are asserting copyright over code that the law might say nobody owns.

2. The Clean-Room Rewrite Is Legally Novel

Clean-room reverse engineering has been upheld by courts for decades — Sega v. Accolade, Sony v. Connectix. The principle is well-established. But those cases involved human engineers spending weeks or months creating independent implementations from functional specifications. What happens when an AI agent does this in hours? The legal precedent was built on the assumption that clean-room reimplementation is expensive and slow. That assumption is now dead.

3. Anthropic's Double Bind

This is the paradox that should keep every AI company's legal team awake. If Anthropic argues that the Python clean-room rewrite infringes their copyright, they are implicitly arguing that AI-generated code can be substantially similar enough to constitute infringement — which would undermine AI companies' own defenses in training data copyright cases. The entire AI industry's legal strategy depends on outputs being "transformative" rather than derivative. You cannot simultaneously claim your AI-generated code is protected by copyright and that your AI's training on copyrighted code is fair use because the outputs are transformative.

As one commentator put it: you cannot protect what the law says does not exist.

The Uncomfortable Question

If AI-generated code cannot be copyrighted, and if AI can rewrite any proprietary codebase overnight into a different language while preserving the architecture — what exactly is left of software IP protection? Trade secrets only work if you keep the secret. Source maps in npm packages don't qualify.

Security Implications: The Real Damage

From an AppSec perspective, the copyright drama is secondary. The security implications are what matter.

Attack surface exposure. 512K lines of code means 512K lines of code to audit for vulnerabilities. Every permission boundary, every OAuth flow, every tool-gating mechanism is now available for adversarial analysis. Threat actors do not need to black-box fuzz Claude Code anymore. They have the blueprint.

Trojanized forks. Within hours of the leak, threat actors were seeding trojanized repositories on GitHub — clones of the leaked code with embedded backdoors, targeting developers eager to run their own Claude Code instances. This is a supply chain attack vector that will persist for months.

Anti-distillation mechanisms exposed. The code revealed that Claude Code injects decoy tool definitions into system prompts to pollute any training data captured from API traffic. A separate cryptographic client attestation system, built in Zig below the JavaScript layer, verifies that requests come from genuine Claude Code binaries. Now that these mechanisms are public, adversaries can specifically engineer around them.

The "unlocked" forks. Multiple repositories appeared within 24 hours claiming to have stripped all telemetry, removed guardrails, unlocked all experimental features, and enabled use with competitor models. These are effectively jailbroken versions of a powerful coding agent. The risk of these being weaponized is non-trivial.

The root cause is embarrassing. This was a CI/CD pipeline failure. A .npmignore entry. A known bug that sat unpatched for 20 days in a runtime Anthropic itself owns. This is the kind of basic operational security failure that would get flagged in any competent SDL review. And it happened to the company building one of the most advanced AI systems on the planet.

What This Means Going Forward

Anthropic's response was telling. Within hours of the leak, they emailed all subscribers announcing that third-party harnesses now require pay-as-you-go billing instead of subscription access. When technical enforcement fails, you shift to billing enforcement. The moat moved from harness to model.

But the broader implications extend well beyond one company's bad day:

Source maps are an underestimated attack surface. Every engineering team shipping JavaScript or TypeScript to public registries needs to audit their build pipeline for source map leakage. If Anthropic — with their resources and security-conscious culture — can ship a 60MB source map to npm, anyone can.

AI-powered reverse engineering changes the economics of IP protection. Clean-room reimplementation used to be a meaningful barrier precisely because it was expensive and slow. When an AI agent can port 500K lines of TypeScript to Python overnight, the cost of reverse engineering drops to approximately the price of a Claude Max subscription. Every proprietary codebase is now one leak away from an open-source equivalent.

Copyright law is not ready for this. The legal framework was built for a world where code is written by humans, copying is binary (you either copied or you didn't), and clean-room reimplementation takes months. None of those assumptions hold anymore. We are in uncharted legal territory, and the courts are years behind the technology.

· · · · · · ·

Final Thoughts

The Claude Code leak is not, in isolation, the most technically dangerous security incident of 2026. It landed in a month that also saw the Axios npm supply chain compromise, the Mercor AI breach, OpenAI Codex command injection via branch names, and GitHub Copilot injecting promotional ads into pull requests as hidden HTML comments.

But it might be the most strategically significant. Not because of what was exposed, but because of what happened next: one developer, one night, one AI tool, and the complete reimplementation of a proprietary codebase that a company valued enough to issue 8,100 DMCA takedowns to protect.

The question is no longer whether your source code can be leaked. It is whether it matters if it is — because the next version of your competitor might already be writing itself.

How CLI Automation Becomes an Exploitation Surface

How CLI Automation Becomes an Exploitation Surface

Securing Skill Templates Against Malicious Inputs

There’s a familiar lie in engineering: it’s just a wrapper. Just a thin layer over a shell command. Just a convenience script. Just a little skill template that saves time.

That lie ages badly.

The moment a CLI tool starts accepting dynamic input from prompts, templates, files, issue text, documentation, emails, or model-generated content, it stops being “just a wrapper” and becomes an exploitation surface. Same shell. Same filesystem. Same credentials. New attack path.

This is where teams get sloppy. They see automation and assume efficiency. Attackers see trust transitivity and start sharpening knives.

The Real Problem Isn’t the CLI

The shell is not new. Unsafe composition is.

Most modern automation stacks don’t fail because Bash suddenly became more dangerous. They fail because developers bolt natural language, templates, or tool-chaining onto CLIs without rethinking trust boundaries.

Typical failure pattern:

  • untrusted input enters a template
  • the template becomes a command, argument list, config file, or follow-up instruction
  • the downstream CLI executes it with local privileges
  • everyone acts surprised when the blast radius includes tokens, source code, mailboxes, build agents, or production infra

That’s not innovation. That’s command injection wearing a startup hoodie.

Where Skill Templates Go Rotten

Skill templates are especially risky because they look structured. People assume structure means safety. It doesn’t.

A template can become dangerous when it interpolates:

  • shell fragments
  • filenames and paths
  • environment variables
  • markdown or HTML pulled from external sources
  • model output
  • repo-controlled metadata
  • ticket text
  • email content
  • generated “fix” commands

The exploit doesn’t need to look like raw shell metacharacters either. Sometimes the payload is more subtle:

  • extra flags that alter command behavior
  • path traversal into sensitive files
  • output poisoning that changes downstream steps
  • hostile content designed to influence an LLM operator
  • malformed config that flips a benign action into a destructive one

The attack surface grows fast when one template feeds another system that assumes the first one already validated things.

That assumption gets people wrecked.

The New Indirect Input Problem

The most interesting attacks won’t come from a user typing rm -rf /.

They’ll come from content the system was trained to trust.

A repo README.
A changelog.
A copied stack trace.
An issue comment.
A pasted email.
A support ticket.
A generated summary.
A model-produced remediation step.

Once your CLI pipeline starts consuming semi-trusted text from upstream sources, indirect influence becomes the game. The attacker no longer needs direct shell access. They just need to place hostile content somewhere your workflow ingests it.

That is the part too many AI-assisted CLI workflows still don’t understand.

Why LLMs Make This Worse

LLMs don’t introduce shell injection from scratch. They industrialize bad judgment around it.

They normalize three dangerous behaviors:

  1. trusting generated commands because they sound competent
  2. flattening trust boundaries between user intent and executable output
  3. encouraging automation pipelines to consume text that was never safe to execute

A model can turn ambiguity into action far too quickly. It can also produce commands, file edits, or workflow suggestions with just enough confidence to bypass human skepticism.

That turns review into theater.

If a human is approving commands they don’t fully parse because the assistant “usually gets it right,” the system is already compromised in spirit, even before it is compromised in practice.

Common Design Mistakes

Here’s the usual pile of bad decisions:

1. Raw string interpolation into shell commands

If your template builds commands with string concatenation, you are already in the danger zone.

2. Treating model output as trusted intent

Model output is untrusted text. Full stop.

3. Letting repo content steer execution

If documentation, issue text, or config comments can influence command generation, you need to model that as an adversarial input path.

4. Inheriting excessive privileges

If the tool can access secrets, SSH keys, mailboxes, or production contexts, the blast radius becomes unacceptable fast.

5. Chaining tools without preserving trust metadata

When one tool’s output becomes another tool’s instruction set, you need taint awareness. Most stacks don’t have it.

6. Approval gates that review strings instead of semantics

Humans are bad at spotting danger in dense command lines, especially under time pressure.

Defensive Design That Actually Helps

Now the useful part.

Use structured argument passing

Do not compose raw shell commands unless you absolutely have to. Prefer direct process execution with separated arguments.

Bad:

tool "$USER_INPUT"

Worse:

sh -c "tool $USER_INPUT"

Safer design means avoiding shell interpretation entirely whenever possible.

Treat model output as hostile until validated

If an LLM suggests a command, file path, or remediation step, validate it against policy before execution. Don’t confuse articulate output with trustworthy output.

Lock templates to explicit allowlists

If a template only needs three safe flags, allow three safe flags. Not “anything that looks reasonable.”

Preserve taint boundaries

Track whether content came from:

  • user input
  • external files
  • repo content
  • model output
  • network sources

If you lose provenance, you lose control.

Sandbox like you mean it

A sandbox is only useful if it meaningfully restricts:

  • filesystem scope
  • network egress
  • credential access
  • host escape paths
  • high-risk binaries

A fake sandbox is just delayed regret.

Design approval as policy, not vibes

Don’t ask humans to bless giant strings. Ask systems to enforce rules:

  • block dangerous binaries
  • require confirmation for write/delete/network actions
  • restrict sensitive paths
  • forbid chained shells unless explicitly approved

Minimize inherited secrets

If your CLI workflow doesn’t need cloud creds, don’t give it cloud creds. Same for mail access, SSH agents, API tokens, and browser sessions.

Least privilege still works. Shocking, I know.

A Better Mental Model

Stop thinking of CLI automation as a helper.

Think of it as a junior operator with:

  • partial understanding
  • variable reliability
  • access to tooling
  • exposure to hostile content
  • no native sense of trust boundaries unless you build them in

That framing makes the security work obvious.

Would you let an eager junior SRE run commands copied from issue comments, emails, and AI summaries directly on systems with production credentials?

If not, stop letting your automation do it.

Final Thought

The next wave of exploitation won’t always target the shell directly. It will target the systems that prepare, enrich, template, summarize, and bless what reaches the shell.

That’s the real story.

CLI tooling didn’t become dangerous because it got more powerful. It became dangerous because people surrounded it with layers that convert untrusted text into trusted action.

Same old mistake. New suit.

05/04/2026

When LLMs Get a Shell: The Security Reality of Giving Models CLI Access

When LLMs Get a Shell: The Security Reality of Giving Models CLI Access

Giving an LLM access to a CLI feels like the obvious next step. Chat is cute. Tool use is useful. But once a model can run shell commands, read files, edit code, inspect processes, hit internal services, and chain those actions autonomously, you are no longer dealing with a glorified autocomplete. You are operating a semi-autonomous insider with a terminal.

That changes everything.

The industry keeps framing CLI-enabled agents as a productivity story: faster debugging, automated refactors, ops assistance, incident response acceleration, hands-free DevEx. All true. It is also a direct expansion of the blast radius. The shell is not “just another tool.” It is the universal adapter for your environment. If the model can reach the CLI, it can often reach everything else.

The Security Model Changes the Moment the Shell Appears

A plain LLM can generate dangerous text. A CLI-enabled LLM can turn dangerous text into state changes.

That distinction matters. The old failure mode was bad advice, hallucinated code, or leaked context in a response. The new failure mode is file deletion, secret exposure, persistence, lateral movement, data exfiltration, dependency poisoning, or production damage triggered through legitimate system interfaces.

In practical terms, CLI access collapses several boundaries at once:

  • Reasoning becomes execution — the model does not just suggest commands, it runs them
  • Context becomes capability — every file, env var, config, history entry, and mounted volume becomes part of the attack surface
  • Prompt injection becomes operational — malicious instructions hidden in docs, issues, commit messages, code comments, logs, or web content can influence shell behaviour
  • Tool misuse becomes trivialbash, git, ssh, docker, kubectl, npm, pip, and curl are already enough to ruin your week

Once the model can execute commands, every classic AppSec and cloud security problem comes back through a new interface. Old bugs. New wrapper.

Why CLI Access Is So Dangerous

1. The Shell Is a Force Multiplier

The command line is not a single permission. It is a permission amplifier. Even a “restricted” shell often enables filesystem discovery, credential harvesting, network enumeration, process inspection, package execution, archive extraction, script chaining, and access to local development secrets.

An LLM does not need raw root access to do damage. A low-privileged shell in a developer workstation or CI runner is often enough. Why? Because developers live in environments packed with sensitive material: cloud credentials, SSH keys, access tokens, source code, internal documentation, deployment scripts, VPN configuration, Kubernetes contexts, browser cookies, and .env files held together with hope and bad habits.

If the model can run:

find . -name ".env" -o -name "*.pem" -o -name "id_rsa"
env
git config --list
cat ~/.aws/credentials
kubectl config view
docker ps
history

then it can map the environment faster than many junior operators. The shell compresses reconnaissance into seconds.

2. Prompt Injection Stops Being Theoretical

People still underestimate prompt injection because they keep evaluating it like a chatbot problem. It is not a chatbot problem once the model has tool access. It becomes an instruction-routing problem with execution attached.

A malicious string hidden inside a README, GitHub issue, code comment, test fixture, stack trace, package post-install output, terminal banner, or generated file can steer the model toward unsafe actions. The model does not need to be “jailbroken” in the dramatic sense. It just needs to misprioritise instructions once.

That is enough.

Imagine an agent told to fix a broken build. It reads logs containing attacker-controlled content. The log tells it the correct remediation is to run a curl-piped shell installer from a third-party host, disable signature checks, or export secrets for “diagnostics.” If your control model relies on the LLM perfectly distinguishing trusted from untrusted instructions under pressure, you do not have a control model. You have vibes.

3. CLI Access Enables Classic Post-Exploitation Behaviour

Security teams should stop pretending CLI-enabled LLMs are a novel category. They behave like a weird blend of insider, automation account, and post-exploitation operator. The tactics are familiar:

  • Discovery: enumerate files, users, network routes, running services, containers, mounted secrets
  • Credential access: read tokens, config stores, shell history, cloud profiles, kubeconfigs
  • Execution: run scripts, package managers, build tools, interpreters, or downloaded payloads
  • Persistence: modify startup scripts, cron jobs, git hooks, CI config, shell rc files
  • Lateral movement: use SSH, Docker socket access, Kubernetes APIs, remote Git remotes, internal HTTP services
  • Exfiltration: POST data out, commit to external repos, encode into logs, write to third-party buckets
  • Impact: delete files, corrupt repos, terminate infra, poison dependencies, alter IaC

The only difference is that the trigger may be natural language and the operator may be a model.

The Real Risks You Need to Worry About

Secret Exposure

This is the obvious one, and it is still the one most people screw up. CLI-enabled agents routinely get access to working directories loaded with plaintext secrets, environment variables, API tokens, cloud credentials, SSH material, and session cookies. Even if you tell the model “do not print secrets,” it can still read them, use them, transform them, or leak them through downstream actions.

The danger is not just direct disclosure in chat. It is indirect use: the model authenticates somewhere it should not, sends data to a remote system, pulls private dependencies, or modifies resources using inherited credentials.

Destructive Command Execution

A model does not need malicious intent to be dangerous. It just needs confidence plus bad judgment. Commands like these are one autocomplete away from disaster:

rm -rf
git clean -fdx
docker system prune -a
terraform destroy
kubectl delete
chmod -R 777
chown -R
truncate -s 0

Humans understand context badly enough already. Models understand it worse, but faster. The combination is not charming.

Supply Chain Compromise

CLI access gives models direct access to package ecosystems and install surfaces. That means npm install, pip install, shell scripts from random GitHub repos, Homebrew formulas, curl-bash installers, container pulls, and binary downloads. If an attacker can influence what package, version, or source the model selects, they can turn the agent into a supply chain ingestion engine.

This gets uglier when agents are allowed to “fix missing dependencies” autonomously. Congratulations, you built a machine that resolves uncertainty by executing untrusted code from the internet.

Environment Escapes Through Tool Chaining

The shell rarely operates alone. It is usually part of a broader toolchain: browser access, GitHub access, cloud CLIs, container runtimes, IaC tooling, secret managers, and APIs. That means a seemingly harmless file read can become a repo modification, which becomes a CI run, which becomes deployed code, which becomes internet-facing exposure.

The risk is not one command. It is the chain.

Trust Boundary Collapse

Most deployments do a terrible job of separating trusted instructions from untrusted content. The agent reads user requests, code, docs, terminal output, issue trackers, and web pages into a single context window and is somehow expected to behave like a formally verified policy engine. It is not. It is a probabilistic token machine with access to bash.

That means every data source needs to be treated as potentially adversarial. If you do not explicitly model that boundary, the model will blur it for you.

Where Teams Keep Getting It Wrong

“It’s Fine, It Runs in a Container”

No, that is not automatically fine. A container is not a security strategy. It is a packaging format with optional security properties, usually misconfigured.

If the container has mounted source code, Docker socket access, host networking, cloud credentials, writable volumes, or Kubernetes service account tokens, then the “sandbox” may just be a nicer room in the same prison. If the agent can hit internal APIs or metadata services from inside the container, you have not meaningfully reduced the blast radius.

“The Model Needs Broad Access to Be Useful”

That is suit logic. Lazy architecture dressed up as product necessity.

Most tasks do not require broad shell access. They require a narrow set of pre-approved operations: run tests, inspect specific logs, edit files in a repo, maybe invoke a formatter or linter. If your agent needs unrestricted shell plus unrestricted network plus unrestricted secrets plus unrestricted repo write just to “help developers,” your design is rotten.

“We’ll Put a Human in the Loop”

Fine, but be honest about what that human is reviewing. If the model emits one shell command at a time with clear diffs, bounded effects, and explicit justification, approval can work. If it emits a tangled shell pipeline after reading 40 files and 10k lines of logs, the human is rubber-stamping. That is not oversight. That is liability outsourcing.

What Good Controls Actually Look Like

If you are going to give LLMs CLI access, do it like you expect the environment to be hostile and the model to make mistakes. Because both are true.

1. Capability Scoping, Not General Shell Access

Do not expose a raw terminal unless you absolutely must. Wrap common actions in narrow tools with explicit contracts:

  • run tests
  • read file from approved paths
  • edit file in workspace only
  • list git diff
  • query build status
  • restart dev service

A specific tool with bounded input is always safer than bash -lc and a prayer.

2. Strong Sandboxing

If shell access is unavoidable, isolate the runtime properly:

  • ephemeral environments
  • no host mounts unless essential
  • read-only filesystem wherever possible
  • drop Linux capabilities
  • block privilege escalation
  • separate UID/GID
  • no Docker socket
  • no access to instance metadata
  • tight seccomp/AppArmor/SELinux profiles
  • restricted outbound network egress

If the model only needs repo-local operations, then the environment should be physically incapable of touching anything else.

3. Secret Minimisation

Do not inject ambient credentials into agent runtimes. No long-lived cloud keys. No full developer profiles. No inherited shell history full of tokens. Use short-lived, task-scoped credentials with explicit revocation. Better yet, design tasks that do not require secrets at all.

The best secret available to an LLM is the one that was never mounted.

4. Approval Gates for High-Risk Actions

Certain command classes should always require human approval:

  • network downloads and remote execution
  • package installation
  • filesystem deletion outside temp space
  • permission changes
  • git push / merge / tag
  • cloud and Kubernetes mutations
  • service restarts in shared environments
  • anything touching prod

This needs policy enforcement, not a polite system prompt.

5. Provenance and Trust Separation

Track where instructions come from. User request, local codebase, terminal output, remote webpage, issue tracker, generated artifact — these are not equivalent. Treat untrusted content as tainted. Do not allow it to silently authorise tool execution. If the model references a command suggested by untrusted content, surface that fact explicitly.

6. Full Observability

Log every command, file read, file write, network destination, approval event, and tool invocation. Keep transcripts. Keep diffs. Keep timestamps. If the agent does something stupid, you need forensic reconstruction, not storytelling.

And no, “we have application logs” is not enough. You need agent action logs with decision context.

7. Default-Deny Network Access

Most coding and triage tasks do not require arbitrary internet access. Block it by default. Allow specific registries, package mirrors, or internal endpoints only when necessary. The fastest way to cut off exfiltration and supply chain nonsense is to stop the runtime talking to the whole internet like it owns the place.

A More Honest Threat Model

If you give an LLM CLI access, threat model it like this:

You have created an execution-capable agent that can be influenced by untrusted content, inherits ambient authority unless explicitly prevented, and can chain benign actions into harmful outcomes faster than a human operator.

That does not mean “never do it.” It means stop pretending it is low risk because the interface looks friendly.

The right question is not whether the model is aligned, helpful, or smart. The right question is: what is the maximum damage this runtime can do when the model is wrong, manipulated, or both?

If the answer is “quite a lot,” your architecture is bad.

The Bottom Line

CLI-enabled LLMs are not just chatbots with tools. They are a new execution layer sitting on top of old, sharp infrastructure. The shell gives them leverage. Prompt injection gives attackers influence. Ambient credentials give them reach. Weak sandboxing gives them consequences.

The upside is real. So is the blast radius.

If you want the productivity gains without the inevitable incident report, stop handing models a general-purpose terminal and calling it innovation. Give them constrained capabilities, isolated runtimes, short-lived credentials, hard approval gates, and logs good enough to survive an audit.

Because once the LLM gets a shell, the difference between “helpful assistant” and “automated own goal” is mostly architecture.

04/04/2026

Browser-Use Agents and Server-Side Request Forgery: Old Vulns, New Vectors

Browser-Use Agents and Server-Side Request Forgery: Old Vulns, New Vectors

SSRF is not new. It’s been on the OWASP Top 10 since 2021, it’s been in every pentester’s playbook for a decade, and it’s the reason you’re not supposed to let user input control outbound HTTP requests from your server. We know how to prevent it. We know how to test for it. We’ve written the cheat sheets, the detection rules, the WAF signatures.

And then we gave AI agents a browser and told them to “go look things up.”

SSRF is back, and this time it’s wearing a trench coat made of natural language.

The Old SSRF: A Quick Refresher

Classic SSRF is straightforward: an application takes a URL from user input and makes a server-side request to it. The attacker supplies http://169.254.169.254/latest/meta-data/ instead of a legitimate URL. The server dutifully fetches AWS credentials from the instance metadata service and hands them to the attacker. Game over.

Defences are well-understood: validate URLs against allowlists, block private IP ranges, resolve DNS before making the request to prevent rebinding, restrict egress at the network level. This is AppSec 101.

But those defences assumed something: that URLs would arrive as URLs, in URL-shaped fields, through parseable HTTP parameters.

That assumption no longer holds.

The New Vector: AI Agents as SSRF Proxies

An AI agent with browsing capabilities is, architecturally, an SSRF vulnerability by design. Its entire purpose is to receive instructions in natural language and make HTTP requests to arbitrary destinations. The “user input” isn’t a URL parameter — it’s a sentence like “check the internal admin dashboard” or “fetch this document for me.”

The agent dutifully translates that into an HTTP request. And if nobody told it that http://localhost:8080/admin is off-limits, it will happily go there.

This isn’t theoretical. Let me walk you through what’s already happening.

Real-World Evidence: It’s Already Being Exploited

1. Pydantic AI — CVE-2026-25580 (CVSS 8.6)

In February 2026, Pydantic AI — a widely-used framework for building AI agents — disclosed CVE-2026-25580, a textbook SSRF vulnerability in its URL download functionality. The download_item() helper fetched content from URLs without validating that the target was a public address.

Any application accepting message history from untrusted sources (chat interfaces, Vercel AI SDK integrations, AG-UI protocol implementations) was vulnerable. An attacker could submit a message with a file attachment pointing at:

http://169.254.169.254/latest/meta-data/iam/security-credentials/

And the server would fetch AWS IAM credentials and return them. Multiple model integrations were affected — OpenAI, Anthropic, Google, xAI, Bedrock, and OpenRouter all had download paths that could be abused.

The fix? Comprehensive SSRF protection: blocking private IPs, always blocking cloud metadata endpoints, validating redirect targets, resolving DNS before requests. Standard SSRF defences that should have been there from day one. The fact that a framework built specifically for AI agents shipped without basic SSRF protection tells you everything about the current state of agent security.

2. Tencent Xuanwu Lab — Server-Side Browser Kill Chains

Tencent’s Xuanwu Lab published a white paper on AI web crawler security in February 2026 that reads like a horror story. They tested server-side browsers across multiple AI products and found remote code execution vulnerabilities in every single one. The affected products collectively serve over a billion users.

Their four documented attack cases expose a pattern:

Case Entry Point Bypass Method Impact
1AI search with URL allowlist302 redirect via allowlisted siteRCE, no sandbox
2AI reading + sharing + screenshotChained features to bypass domain allowlistSSRF to cloud metadata
3URL access with script filtering<img onerror> bypassed <script> filterRCE via N-day chain
4Hidden backend indexing crawlerNo bypass needed — no defencesRCE, no sandbox

Case 4 is particularly grim: a hidden backend crawler that batch-fetched URLs users had queried — invisible to frontend security, undocumented, running an outdated browser with no sandbox. The attacker didn’t even need to bypass anything.

The Xuanwu team puts it bluntly: “When you launch a browser instance, you are not starting a simple web browsing tool — you are launching a ‘micro operating system.’ A vulnerability in any single component could lead to remote code execution.”

3. Unit 42 — Indirect Prompt Injection as SSRF Delivery Mechanism

Palo Alto’s Unit 42 published research in March 2026 documenting web-based indirect prompt injection (IDPI) attacks observed in the wild. Not proof-of-concept. Not lab demos. Production attacks.

Their taxonomy maps the full kill chain from SSRF’s perspective:

  • Forced internal requests: Embedded prompts in web pages instructing agents to access http://localhost, internal services, and cloud metadata endpoints
  • Unauthorized transactions: Prompts directing agents to visit Stripe payment URLs and PayPal links to initiate financial transactions
  • Data exfiltration: Instructions to collect environment variables, credentials, and contact lists — then exfiltrate via URL-encoded requests
  • Data destruction: Commands to rm -rf and fork bombs targeting backend infrastructure

The delivery methods are creative: zero-width Unicode characters, CSS-hidden text, Base64-encoded payloads assembled at runtime, SVG encapsulation, HTML attribute cloaking. 85% of the jailbreaks were social engineering — framing destructive commands as “security updates” or “compliance checks.”

The kicker: one attacker embedded 24 separate prompt injection attempts in a single page, using different delivery methods for each one. If even one bypasses the model’s safety filters, the attack succeeds.

4. Browserbase — “One Malicious <div> Away From Going Rogue”

Browserbase’s February 2026 analysis frames the problem with precision: “Every webpage an agent visits is a potential vector for attack.” They cite the PromptArmor research on Google’s Antigravity IDE, where an indirect prompt injection hidden in 1-point font inside an “implementation guide” successfully exfiltrated environment variables by encoding them as URLs and sending them via the browser agent’s own network requests.

That’s SSRF triggered by reading a document. The URL didn’t arrive as a URL. It arrived as invisible text on a web page.

Why Traditional SSRF Defences Fail Against Agents

The fundamental problem: SSRF defences are designed to protect applications, not autonomous decision-makers.

Traditional Defence Why It Fails With Agents
URL allowlistsAgents generate URLs dynamically from natural language — no static list covers the infinite space of valid requests
Input validation on URL parametersThe “input” is a sentence, not a URL. The URL is constructed internally by the agent
WAF signaturesNatural language payloads don’t match traditional SSRF patterns
DNS pre-resolutionOnly works if you control the HTTP client — many agent frameworks use browsers that handle DNS independently
Egress filteringAgent needs internet access to function — blocking egress breaks the core use case
IP blocklistsOnly effective if applied at the HTTP client level before the request is made — agents using embedded browsers bypass application-layer controls

The Tencent Xuanwu research adds another dimension: even when enterprises implement URL allowlists, they’re trivially bypassed. A 302 redirect from an allowlisted domain to an attacker-controlled page defeats the entire scheme. The SSRF isn’t in the first request — it’s in the redirect chain that follows.

The Attack Surface Is Bigger Than You Think

SSRF in the context of browser-use agents isn’t just about fetching cloud metadata. The attack surface includes:

  • Cloud metadata services: AWS IMDSv1 (169.254.169.254), GCP, Azure, Alibaba Cloud — stealing IAM roles, service account tokens, API keys
  • Internal APIs and admin panels: Accessing unauthenticated internal services that trust requests from within the network perimeter
  • Database ports: Probing internal MySQL:3306, Redis:6379, PostgreSQL:5432 — extracting data from services that don’t require auth on localhost
  • Container orchestration: Accessing Kubernetes API servers, Docker sockets, etcd — pivoting to full cluster compromise
  • Other agents: In multi-agent architectures, a compromised agent can SSRF into other agents’ API endpoints, creating cascading compromise
  • Data exfiltration via URL encoding: The PromptArmor/Antigravity technique — embedding stolen data in outbound URL parameters, effectively using the agent as a covert channel

The Xuanwu team found that server-side browser containers were often deployed in the same network segment as production databases, task schedulers, and model inference nodes. Zero network isolation. Once the browser was compromised, lateral movement was trivial.

What Actually Works

If you’re deploying agents with browsing capabilities, here’s what you need — not principles, but concrete controls:

1. Network Isolation (Non-Negotiable)

Browser agents must run in isolated network zones. Egress to the internet: allowed. Access to internal services, metadata endpoints, private IP ranges: blocked at the infrastructure level. Kubernetes NetworkPolicies, separate VPCs, cloud security groups. This is the single most effective control — if the agent can’t reach 169.254.169.254, stealing metadata credentials is off the table regardless of what the LLM is tricked into doing.

2. SSRF Protection at the HTTP Client Level

Every HTTP request the agent makes should pass through a hardened client that:

  • Resolves DNS before connecting (prevents rebinding)
  • Blocks private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 127.0.0.0/8, 169.254.0.0/16)
  • Always blocks cloud metadata endpoints, even if “allow-local” is configured
  • Validates every redirect target, not just the initial URL
  • Restricts protocols to http:// and https:// only

Pydantic AI’s post-CVE fix is a good reference implementation.

3. Browser Sandboxing (Never Disable It)

The Tencent research found that multiple AI products disabled Chrome’s sandbox (--no-sandbox) to resolve container compatibility issues. This is catastrophic. Fix the container configuration instead: add the required seccomp profiles, grant CAP_SYS_ADMIN if necessary, configure user namespaces properly. The sandbox is the last line of defence against RCE — removing it turns every browser vulnerability into a full server compromise.

4. Instance Isolation

Each browsing task should use an independent, ephemeral browser instance that’s destroyed after completion. This prevents cross-task contamination, stops persistent compromise, and eliminates credential leakage between sessions. Browserbase’s approach of dedicated VMs per session with automatic teardown is the right model.

5. Attack Surface Reduction

Disable everything the agent doesn’t need: WebGL, WebRTC, PDF plugins, extensions. If performance allows, run with --jitless to eliminate the V8 JIT compiler — which accounts for roughly 23% of Chrome’s high-severity CVEs. Tencent’s analysis shows that disabling WebGL/GPU and JIT alone eliminates nearly 40% of browser vulnerability surface.

6. Runtime Behaviour Control

Tencent open-sourced SEChrome, a protection layer that monitors browser process system calls and enforces allowlists for file access, process execution, and network requests. Even if an attacker achieves RCE inside the browser, they can’t read sensitive files, execute arbitrary commands, or access the network beyond permitted destinations. Every tested exploit was blocked.

The Uncomfortable Truth

We’re deploying AI agents that have the browsing capabilities of a human user, the network access of a server-side application, and the security boundaries of neither. Every web page they visit is a potential attack payload. Every URL they construct is a potential SSRF. Every redirect they follow is a potential pivot point.

SSRF wasn’t “solved” in traditional web applications — it was managed through layers of controls that assumed a predictable request flow. AI agents break that assumption completely. The request flow is generated by a language model interpreting natural language from potentially hostile sources.

The good news: the defences exist. Network isolation, sandboxing, SSRF-hardened HTTP clients, instance isolation, runtime behaviour control. None of this is novel engineering. It’s applying established security patterns to a new deployment model.

The bad news: most agent deployments aren’t implementing any of it.

Old vulns don’t retire. They just find new hosts.


Sources & Further Reading:

03/04/2026

The OWASP Top 10 for AI Agents Is Here. It's Not Enough.

The OWASP Top 10 for AI Agents Is Here. It's Not Enough.

In December 2025, OWASP released the Top 10 for Agentic Applications 2026 — the first security framework dedicated to autonomous AI agents. Over 100 researchers and practitioners contributed. NIST, the European Commission, and the Alan Turing Institute reviewed it. Palo Alto Networks, Microsoft, and AWS endorsed it.

It’s a solid taxonomy. It gives the industry a shared language for a new class of threats. And it is nowhere near mature enough for what’s already happening in production.

Let me explain.

What the Framework Gets Right

Credit where it’s due. The OWASP Agentic Top 10 correctly identifies the fundamental shift: a chatbot answers questions, an agent executes tasks. That distinction changes the entire threat model. When you give an AI system the ability to call APIs, access databases, send emails, and execute code, you’ve created something with real operational authority. A compromised chatbot hallucrinates. A compromised agent exfiltrates data, manipulates records, or sabotages infrastructure — at machine speed, with legitimate credentials.

The ten risk categories — from ASI01 (Agent Goal Hijack) through ASI10 (Rogue Agents) — capture real threats that are already showing up in the wild:

ID Risk Translation
ASI01Agent Goal HijackYour agent now works for the attacker
ASI02Tool Misuse & ExploitationLegitimate tools used destructively
ASI03Identity & Privilege AbuseAgent inherits god-mode credentials
ASI04Supply Chain VulnerabilitiesPoisoned MCP servers, plugins, models
ASI05Unexpected Code ExecutionYour agent just ran a reverse shell
ASI06Memory & Context PoisoningLong-term memory becomes a sleeper cell
ASI07Insecure Inter-Agent CommsAgent-in-the-middle attacks
ASI08Cascading FailuresOne bad tool call nukes everything
ASI09Human-Agent Trust ExploitationAgent social-engineers the human
ASI10Rogue AgentsAgent goes off-script autonomously

The framework also introduces two principles that should be tattooed on every architect’s forehead: Least-Agency (don’t give agents more autonomy than the task requires) and Strong Observability (log everything the agent does, decides, and touches).

Good principles. Now let’s talk about why principles aren’t enough.

The Maturity Problem

The OWASP Agentic Top 10 is a taxonomy, not a defence framework. It names threats. It describes mitigations at a high level. But it leaves the hard engineering problems unsolved — and in some cases, unacknowledged.

1. The Attacks Are Already Here. The Defences Are Not.

The framework dropped in December 2025. By then, every major risk category already had real-world incidents:

  • ASI01 (Goal Hijack): Koi Security found an npm package live for two years with embedded prompt injection strings designed to convince AI-based security scanners the code was legitimate. Attackers are already weaponising natural language as an attack vector against autonomous tools.
  • ASI02 (Tool Misuse): Amazon Q’s VS Code extension was compromised with destructive instructionsaws s3 rm, aws ec2 terminate-instances, aws iam delete-user — combined with flags that disabled confirmation prompts (--trust-all-tools --no-interactive). Nearly a million developers had the extension installed. The agent wasn’t escaping a sandbox. There was no sandbox.
  • ASI04 (Supply Chain): The first malicious MCP server was found on npm, impersonating Postmark’s email service and BCC’ing every message to an attacker. A month later, another MCP package shipped with dual reverse shells — 86,000 downloads, zero visible dependencies.
  • ASI05 (Code Execution): Anthropic’s own Claude Desktop extensions had three RCE vulnerabilities in the Chrome, iMessage, and Apple Notes connectors (CVSS 8.9). Ask Claude “Where can I play paddle in Brooklyn?” and an attacker-controlled web page in the search results could trigger arbitrary code execution with full system privileges.
  • ASI06 (Memory Poisoning): Researchers demonstrated how persistent instructions could be embedded in an agent’s context that influenced all subsequent interactions — even across sessions. The agent looked normal. It behaved normally most of the time. But it had been quietly reprogrammed weeks earlier.

The framework describes these threats. It does not provide testable, enforceable controls for any of them. “Implement input validation” is not a control when the input is natural language and the attack surface is every document, email, and web page the agent reads.

2. It Doesn’t Address the Governance Gap

Here’s the uncomfortable truth, stated clearly by Modulos: “The same enterprise that would never ship a customer-facing application without security review is deploying autonomous agents that can execute code, access sensitive data, and make decisions. No formal risk assessment. No mapped controls. No documented mitigations. No monitoring for anomalous behaviour.”

A risk taxonomy is only useful if it’s operationalised. The OWASP Agentic Top 10 gives security teams vocabulary but not workflow. There’s no:

  • Maturity model for agentic security posture
  • Reference architecture for secure agent deployment
  • Compliance mapping to existing frameworks (EU AI Act, ISO 42001, SOC 2)
  • Standardised scoring or severity rating for agent-specific risks
  • Testable benchmark to validate whether mitigations actually work

Security teams are left to figure out the implementation themselves, which is exactly how “deploy first, secure later” happens.

3. The LLM Top 10 Was Insufficient. This Is Still Catching Up.

NeuralTrust put it bluntly in their deep dive: “The existing OWASP Top 10 for LLM Applications is insufficient. An agent’s ability to chain actions and operate autonomously means a minor vulnerability, such as a simple prompt injection, can quickly cascade into a system-wide compromise, data exfiltration, or financial loss.”

The Agentic Top 10 was created because the LLM Top 10 didn’t cover agent-specific risks. But the Agentic list itself was created from survey data and expert input — not from a systematic threat modelling exercise against production agent architectures. As Entro Security noted: “Agents mostly amplify existing vulnerabilities — not creating entirely new ones.”

If agents amplify existing vulnerabilities, then a Top 10 list that doesn’t deeply integrate with existing identity management, secret management, and access control frameworks is leaving the most exploitable gaps unaddressed.

4. Non-Human Identity Is the Real Battleground

The OWASP NHI (Non-Human Identity) Top 10 maps directly to the Agentic Top 10. Every meaningful agent runs on API keys, OAuth tokens, service accounts, and PATs. When those identities are over-privileged, invisible, or exposed, the theoretical risks become real incidents.

Look at the list through an identity lens:

  • Goal Hijack (ASI01) matters because the agent already holds powerful credentials
  • Tool Misuse (ASI02) matters because tools are wired to cloud and SaaS permissions
  • Identity Abuse (ASI03) is literally about agent sessions, tokens, and roles
  • Memory Poisoning (ASI06) becomes critical when memory contains secrets and tokens
  • Cascading Failures (ASI08) amplify because the same NHI is reused across multiple agents

You cannot secure AI agents without securing the non-human identities that power them. The Agentic Top 10 acknowledges this. It does not solve it.

5. Where’s the Red Team Playbook?

NeuralTrust’s analysis makes a critical point: “Traditional penetration testing is insufficient. Security teams must conduct periodic tests that simulate complex, multi-step attacks.”

The framework mentions red teaming in passing. It doesn’t provide:

  • Attack scenarios mapped to each ASI category
  • Testing methodologies for multi-agent systems
  • Metrics for measuring resilience against agent-specific threats
  • A CTF-style reference application for practising agentic attacks (OWASP’s FinBot exists but is separate from the Top 10 itself)

For a framework targeting autonomous systems, the absence of a structured offensive testing methodology is a significant gap.

What Needs to Happen Next

The OWASP Agentic Top 10 is version 1.0. Like the original OWASP Web Top 10 in 2004, it’s a starting point, not a destination. Here’s what the next iteration needs:

  1. Enforceable controls, not just principles. Each ASI category needs prescriptive, testable controls with pass/fail criteria. “Implement least privilege” is not a control. “Agent credentials must be session-scoped with a maximum TTL of 1 hour and automatic revocation on task completion” is a control.
  2. Reference architectures. Show me what a secure agentic deployment looks like. Network topology. Identity flow. Tool sandboxing. Kill switch mechanism. Not theory — diagrams and code.
  3. Integration with existing compliance. Map ASI categories to ISO 42001, NIST AI RMF, EU AI Act Article 9, SOC 2 Trust Service Criteria. Security teams need to plug this into their existing GRC workflows, not run a parallel process.
  4. Offensive testing methodology. A structured red team playbook with attack trees for each ASI category, severity scoring, and reproducible test cases. The framework needs teeth.
  5. Incident data. Start collecting and publishing anonymised incident data. The web Top 10 evolved because we had breach data showing which vulnerabilities were actually exploited at scale. The agentic space needs the same feedback loop.

The Bottom Line

The OWASP Top 10 for Agentic Applications 2026 is a necessary first step. It gives us vocabulary. It draws attention to a real and growing threat surface. The 100+ contributors did meaningful work.

But let’s not confuse naming a problem with solving it. The agents are already in production. The attacks are already happening. And the governance, tooling, and testing infrastructure needed to secure these systems is lagging badly behind.

The original OWASP Top 10 took years and multiple iterations to become the authoritative reference it is today. The agentic equivalent doesn’t have years. The attack surface is expanding at the speed of npm install.

Name the risks. Good. Now build the defences.


Sources & Further Reading:

AppSec Review for AI-Generated Code

Grepping the Robot: AppSec Review for AI-Generated Code APPSEC CODE REVIEW AI CODE Half the code shipping to production in 2026 has a...