Elusive Thoughts: CLAUDINI :: When the Agent Writes Its Own Adversarial Attacks

18/05/2026

CLAUDINI :: When the Agent Writes Its Own Adversarial Attacks

red-teamautoresearchadversarial-mlagentic-ai

A paper landed on arXiv recently that should change how AppSec engineers think about red-teaming in 2026. The setup is mundane on its face. A sandboxed Claude Opus 4.6 was deployed via the Claude Code CLI on a compute cluster with unrestricted permissions, including the ability to submit GPU jobs. The task was not to perform an attack. The task was to produce, iterate on, and improve a discrete optimization algorithm that generates adversarial suffixes against an LLM.

The agent did not write a jailbreak prompt. The agent wrote the algorithm that writes jailbreak prompts. Then it ran the algorithm. Then it measured the outputs. Then it modified the algorithm. Then it ran the modified version. Then it iterated.

State-of-the-art results on token-forcing attacks against multiple frontier models. The agent's name in the paper is Claudini. The pipeline pattern, borrowed from Karpathy's autoresearch experiments earlier in 2026, generalizes far beyond LLM jailbreaking.

What is actually new here

Manually-authored adversarial attacks on LLMs have existed since 2022. GCG, the Greedy Coordinate Gradient attack, has been the canonical example for two years. Researchers have published improved variants every few months. Each variant requires a human researcher to think through the optimization landscape, propose a new approach, implement it, test it.

The new step is removing the human from that loop.

Claudini's pipeline closes the iteration cycle. The agent proposes optimization variants. The agent implements them. The agent submits GPU jobs to test them. The agent reads the results. The agent identifies what worked, what didn't, and what to try next. There is no human gating decision between iterations. The cycle time is set by the compute budget, not by the researcher's attention span.

When the cycle time of "publish a new SOTA attack" drops from months to days, the threat landscape changes structurally.

The generalization

The pipeline pattern — autonomous agent + unrestricted compute + measurable objective + iteration loop — is not specific to adversarial ML. It applies to any offensive research domain where the objective can be expressed as a fitness function the agent can evaluate.

Examples that scale to the same pipeline with minimal modification:

Vulnerability discovery in closed-source binaries. Fitness function: number of distinct crashes produced by fuzzing inputs. Agent iterates fuzzer harnesses and grammar definitions.
Exploit primitive chaining. Fitness function: progress toward arbitrary read/write or code execution given a known set of primitives. Agent iterates exploit construction strategies.
Phishing campaign optimization. Fitness function: click-through rate on simulated victims. Agent iterates pretexting strategies. (This is the example that should worry everyone the most.)
Side-channel attack research. Fitness function: signal-to-noise ratio in measurement traces. Agent iterates instrumentation and analysis pipelines.
Adversarial ML against deployed defensive models. Fitness function: evasion rate against a target classifier. Agent iterates evasion strategies.
Cryptanalytic attack search. Fitness function: any of the standard cryptanalytic objectives. Agent iterates analytical approaches.

Some of these are harder to set up than others. None of them are theoretically blocked. The constraint is compute budget and access to the target, not human researcher time.

The defender side

The same pipeline runs in defense. The vendor running Mythos against their own pre-release Firefox build is one example. The internal red team running Claudini-shaped pipelines against the company's own production models is another.

The asymmetry: defenders run the pipeline against their own systems, with full source access and full deployment context. Attackers run the pipeline against the defender's systems, with whatever access prompt injection or external reconnaissance affords them. Both sides scale with compute. The side with more compute, better target understanding, and faster iteration loops wins.

The optimistic framing: defenders have structural advantages — source access, deployment access, faster feedback loops on their own systems. The pessimistic framing: attackers do not need to win every iteration; they need to win one.

What this changes for AppSec

The traditional pentest model — a human assessor with a week of engagement time, a defined scope, and a manual workflow — does not scale against autoresearch-style attackers. The defender cannot match attacker iteration rate using purely manual processes.

The defender response options:

Run autoresearch defensively. Continuous adversarial testing of deployed AI systems. The pipeline that finds the bug should be the one you run, not the one the attacker runs.
Architect for resilience rather than perfection. Assume the model will be jailbroken eventually. Design the system around the assumption that the model is adversarial. Output validation, tool-call gating, sandbox containment.
Invest in detection rather than prevention at the model layer. The model will produce occasional adversarial outputs. The system should notice when it does.
Re-architect the bug bounty surface. Reward novel attack categories, not individual attack instances. The instance might be replicated 10x by an autoresearch pipeline; the category is what is actually new.

The point

Claudini is a research paper, not a productionized attack tool. The pipeline pattern it demonstrates is not a research-only curiosity. The infrastructure required — a frontier LLM, a CLI agent, a GPU budget — is commodity. The expertise required is willingness to spend the API budget and write the harness, not specialist offensive research credentials.

That is the worrying part. The barrier to running a Claudini-shaped pipeline against a target of your choice is access to credit. That is a much lower barrier than the barrier to becoming an experienced offensive researcher.

The window where autoresearch is a curiosity is closing. Get familiar with the pipeline now, on the defensive side, against your own systems. The version of the pipeline that gets pointed at you is coming whether you are prepared or not.

Elusive Thoughts

18/05/2026

CLAUDINI :: When the Agent Writes Its Own Adversarial Attacks

CLAUDINI :: When the Agent Writes Its Own Adversarial Attacks

What is actually new here

The generalization

The defender side

What this changes for AppSec

The point

JadePuffer: Anatomy of an Agentic Ransomware Attack That Ran an LLM as Its Operator

New tool repo

My Other Blogs