06/04/2026

How CLI Automation Becomes an Exploitation Surface

Securing Skill Templates Against Malicious Inputs

There’s a familiar lie in engineering: it’s just a wrapper. Just a thin layer over a shell command. Just a convenience script. Just a little skill template that saves time.

That lie ages badly.

The moment a CLI tool starts accepting dynamic input from prompts, templates, files, issue text, documentation, emails, or model-generated content, it stops being “just a wrapper” and becomes an exploitation surface. Same shell. Same filesystem. Same credentials. New attack path.

This is where teams get sloppy. They see automation and assume efficiency. Attackers see trust transitivity and start sharpening knives.

The Real Problem Isn’t the CLI

The shell is not new. Unsafe composition is.

Most modern automation stacks don’t fail because Bash suddenly became more dangerous. They fail because developers bolt natural language, templates, or tool-chaining onto CLIs without rethinking trust boundaries.

Typical failure pattern:

untrusted input enters a template
the template becomes a command, argument list, config file, or follow-up instruction
the downstream CLI executes it with local privileges
everyone acts surprised when the blast radius includes tokens, source code, mailboxes, build agents, or production infra

That’s not innovation. That’s command injection wearing a startup hoodie.

Where Skill Templates Go Rotten

Skill templates are especially risky because they look structured. People assume structure means safety. It doesn’t.

A template can become dangerous when it interpolates:

shell fragments
filenames and paths
environment variables
markdown or HTML pulled from external sources
model output
repo-controlled metadata
ticket text
email content
generated “fix” commands

The exploit doesn’t need to look like raw shell metacharacters either. Sometimes the payload is more subtle:

extra flags that alter command behavior
path traversal into sensitive files
output poisoning that changes downstream steps
hostile content designed to influence an LLM operator
malformed config that flips a benign action into a destructive one

The attack surface grows fast when one template feeds another system that assumes the first one already validated things.

That assumption gets people wrecked.

The New Indirect Input Problem

The most interesting attacks won’t come from a user typing rm -rf /.

They’ll come from content the system was trained to trust.

A repo README.
A changelog.
A copied stack trace.
An issue comment.
A pasted email.
A support ticket.
A generated summary.
A model-produced remediation step.

Once your CLI pipeline starts consuming semi-trusted text from upstream sources, indirect influence becomes the game. The attacker no longer needs direct shell access. They just need to place hostile content somewhere your workflow ingests it.

That is the part too many AI-assisted CLI workflows still don’t understand.

Why LLMs Make This Worse

LLMs don’t introduce shell injection from scratch. They industrialize bad judgment around it.

They normalize three dangerous behaviors:

trusting generated commands because they sound competent
flattening trust boundaries between user intent and executable output
encouraging automation pipelines to consume text that was never safe to execute

A model can turn ambiguity into action far too quickly. It can also produce commands, file edits, or workflow suggestions with just enough confidence to bypass human skepticism.

That turns review into theater.

If a human is approving commands they don’t fully parse because the assistant “usually gets it right,” the system is already compromised in spirit, even before it is compromised in practice.

Common Design Mistakes

Here’s the usual pile of bad decisions:

1. Raw string interpolation into shell commands

If your template builds commands with string concatenation, you are already in the danger zone.

2. Treating model output as trusted intent

Model output is untrusted text. Full stop.

3. Letting repo content steer execution

If documentation, issue text, or config comments can influence command generation, you need to model that as an adversarial input path.

4. Inheriting excessive privileges

If the tool can access secrets, SSH keys, mailboxes, or production contexts, the blast radius becomes unacceptable fast.

5. Chaining tools without preserving trust metadata

When one tool’s output becomes another tool’s instruction set, you need taint awareness. Most stacks don’t have it.

6. Approval gates that review strings instead of semantics

Humans are bad at spotting danger in dense command lines, especially under time pressure.

Defensive Design That Actually Helps

Now the useful part.

Use structured argument passing

Do not compose raw shell commands unless you absolutely have to. Prefer direct process execution with separated arguments.

Bad:

tool "$USER_INPUT"

Worse:

sh -c "tool $USER_INPUT"

Safer design means avoiding shell interpretation entirely whenever possible.

Treat model output as hostile until validated

If an LLM suggests a command, file path, or remediation step, validate it against policy before execution. Don’t confuse articulate output with trustworthy output.

Lock templates to explicit allowlists

If a template only needs three safe flags, allow three safe flags. Not “anything that looks reasonable.”

Preserve taint boundaries

Track whether content came from:

user input
external files
repo content
model output
network sources

If you lose provenance, you lose control.

Sandbox like you mean it

A sandbox is only useful if it meaningfully restricts:

filesystem scope
network egress
credential access
host escape paths
high-risk binaries

A fake sandbox is just delayed regret.

Design approval as policy, not vibes

Don’t ask humans to bless giant strings. Ask systems to enforce rules:

block dangerous binaries
require confirmation for write/delete/network actions
restrict sensitive paths
forbid chained shells unless explicitly approved

Minimize inherited secrets

If your CLI workflow doesn’t need cloud creds, don’t give it cloud creds. Same for mail access, SSH agents, API tokens, and browser sessions.

Least privilege still works. Shocking, I know.

A Better Mental Model

Stop thinking of CLI automation as a helper.

Think of it as a junior operator with:

partial understanding
variable reliability
access to tooling
exposure to hostile content
no native sense of trust boundaries unless you build them in

That framing makes the security work obvious.

Would you let an eager junior SRE run commands copied from issue comments, emails, and AI summaries directly on systems with production credentials?

If not, stop letting your automation do it.

Final Thought

The next wave of exploitation won’t always target the shell directly. It will target the systems that prepare, enrich, template, summarize, and bless what reaches the shell.

That’s the real story.

CLI tooling didn’t become dangerous because it got more powerful. It became dangerous because people surrounded it with layers that convert untrusted text into trusted action.

Same old mistake. New suit.

05/04/2026

When LLMs Get a Shell: The Security Reality of Giving Models CLI Access

Giving an LLM access to a CLI feels like the obvious next step. Chat is cute. Tool use is useful. But once a model can run shell commands, read files, edit code, inspect processes, hit internal services, and chain those actions autonomously, you are no longer dealing with a glorified autocomplete. You are operating a semi-autonomous insider with a terminal.

That changes everything.

The industry keeps framing CLI-enabled agents as a productivity story: faster debugging, automated refactors, ops assistance, incident response acceleration, hands-free DevEx. All true. It is also a direct expansion of the blast radius. The shell is not “just another tool.” It is the universal adapter for your environment. If the model can reach the CLI, it can often reach everything else.

The Security Model Changes the Moment the Shell Appears

A plain LLM can generate dangerous text. A CLI-enabled LLM can turn dangerous text into state changes.

That distinction matters. The old failure mode was bad advice, hallucinated code, or leaked context in a response. The new failure mode is file deletion, secret exposure, persistence, lateral movement, data exfiltration, dependency poisoning, or production damage triggered through legitimate system interfaces.

In practical terms, CLI access collapses several boundaries at once:

Reasoning becomes execution — the model does not just suggest commands, it runs them
Context becomes capability — every file, env var, config, history entry, and mounted volume becomes part of the attack surface
Prompt injection becomes operational — malicious instructions hidden in docs, issues, commit messages, code comments, logs, or web content can influence shell behaviour
Tool misuse becomes trivial — bash, git, ssh, docker, kubectl, npm, pip, and curl are already enough to ruin your week

Once the model can execute commands, every classic AppSec and cloud security problem comes back through a new interface. Old bugs. New wrapper.

Why CLI Access Is So Dangerous

1. The Shell Is a Force Multiplier

The command line is not a single permission. It is a permission amplifier. Even a “restricted” shell often enables filesystem discovery, credential harvesting, network enumeration, process inspection, package execution, archive extraction, script chaining, and access to local development secrets.

An LLM does not need raw root access to do damage. A low-privileged shell in a developer workstation or CI runner is often enough. Why? Because developers live in environments packed with sensitive material: cloud credentials, SSH keys, access tokens, source code, internal documentation, deployment scripts, VPN configuration, Kubernetes contexts, browser cookies, and .env files held together with hope and bad habits.

If the model can run:

find . -name ".env" -o -name "*.pem" -o -name "id_rsa"
env
git config --list
cat ~/.aws/credentials
kubectl config view
docker ps
history

then it can map the environment faster than many junior operators. The shell compresses reconnaissance into seconds.

2. Prompt Injection Stops Being Theoretical

People still underestimate prompt injection because they keep evaluating it like a chatbot problem. It is not a chatbot problem once the model has tool access. It becomes an instruction-routing problem with execution attached.

A malicious string hidden inside a README, GitHub issue, code comment, test fixture, stack trace, package post-install output, terminal banner, or generated file can steer the model toward unsafe actions. The model does not need to be “jailbroken” in the dramatic sense. It just needs to misprioritise instructions once.

That is enough.

Imagine an agent told to fix a broken build. It reads logs containing attacker-controlled content. The log tells it the correct remediation is to run a curl-piped shell installer from a third-party host, disable signature checks, or export secrets for “diagnostics.” If your control model relies on the LLM perfectly distinguishing trusted from untrusted instructions under pressure, you do not have a control model. You have vibes.

3. CLI Access Enables Classic Post-Exploitation Behaviour

Security teams should stop pretending CLI-enabled LLMs are a novel category. They behave like a weird blend of insider, automation account, and post-exploitation operator. The tactics are familiar:

Discovery: enumerate files, users, network routes, running services, containers, mounted secrets
Credential access: read tokens, config stores, shell history, cloud profiles, kubeconfigs
Execution: run scripts, package managers, build tools, interpreters, or downloaded payloads
Persistence: modify startup scripts, cron jobs, git hooks, CI config, shell rc files
Lateral movement: use SSH, Docker socket access, Kubernetes APIs, remote Git remotes, internal HTTP services
Exfiltration: POST data out, commit to external repos, encode into logs, write to third-party buckets
Impact: delete files, corrupt repos, terminate infra, poison dependencies, alter IaC

The only difference is that the trigger may be natural language and the operator may be a model.

The Real Risks You Need to Worry About

Secret Exposure

This is the obvious one, and it is still the one most people screw up. CLI-enabled agents routinely get access to working directories loaded with plaintext secrets, environment variables, API tokens, cloud credentials, SSH material, and session cookies. Even if you tell the model “do not print secrets,” it can still read them, use them, transform them, or leak them through downstream actions.

The danger is not just direct disclosure in chat. It is indirect use: the model authenticates somewhere it should not, sends data to a remote system, pulls private dependencies, or modifies resources using inherited credentials.

Destructive Command Execution

A model does not need malicious intent to be dangerous. It just needs confidence plus bad judgment. Commands like these are one autocomplete away from disaster:

rm -rf
git clean -fdx
docker system prune -a
terraform destroy
kubectl delete
chmod -R 777
chown -R
truncate -s 0

Humans understand context badly enough already. Models understand it worse, but faster. The combination is not charming.

Supply Chain Compromise

CLI access gives models direct access to package ecosystems and install surfaces. That means npm install, pip install, shell scripts from random GitHub repos, Homebrew formulas, curl-bash installers, container pulls, and binary downloads. If an attacker can influence what package, version, or source the model selects, they can turn the agent into a supply chain ingestion engine.

This gets uglier when agents are allowed to “fix missing dependencies” autonomously. Congratulations, you built a machine that resolves uncertainty by executing untrusted code from the internet.

Environment Escapes Through Tool Chaining

The shell rarely operates alone. It is usually part of a broader toolchain: browser access, GitHub access, cloud CLIs, container runtimes, IaC tooling, secret managers, and APIs. That means a seemingly harmless file read can become a repo modification, which becomes a CI run, which becomes deployed code, which becomes internet-facing exposure.

The risk is not one command. It is the chain.

Trust Boundary Collapse

Most deployments do a terrible job of separating trusted instructions from untrusted content. The agent reads user requests, code, docs, terminal output, issue trackers, and web pages into a single context window and is somehow expected to behave like a formally verified policy engine. It is not. It is a probabilistic token machine with access to bash.

That means every data source needs to be treated as potentially adversarial. If you do not explicitly model that boundary, the model will blur it for you.

Where Teams Keep Getting It Wrong

“It’s Fine, It Runs in a Container”

No, that is not automatically fine. A container is not a security strategy. It is a packaging format with optional security properties, usually misconfigured.

If the container has mounted source code, Docker socket access, host networking, cloud credentials, writable volumes, or Kubernetes service account tokens, then the “sandbox” may just be a nicer room in the same prison. If the agent can hit internal APIs or metadata services from inside the container, you have not meaningfully reduced the blast radius.

“The Model Needs Broad Access to Be Useful”

That is suit logic. Lazy architecture dressed up as product necessity.

Most tasks do not require broad shell access. They require a narrow set of pre-approved operations: run tests, inspect specific logs, edit files in a repo, maybe invoke a formatter or linter. If your agent needs unrestricted shell plus unrestricted network plus unrestricted secrets plus unrestricted repo write just to “help developers,” your design is rotten.

“We’ll Put a Human in the Loop”

Fine, but be honest about what that human is reviewing. If the model emits one shell command at a time with clear diffs, bounded effects, and explicit justification, approval can work. If it emits a tangled shell pipeline after reading 40 files and 10k lines of logs, the human is rubber-stamping. That is not oversight. That is liability outsourcing.

What Good Controls Actually Look Like

If you are going to give LLMs CLI access, do it like you expect the environment to be hostile and the model to make mistakes. Because both are true.

1. Capability Scoping, Not General Shell Access

Do not expose a raw terminal unless you absolutely must. Wrap common actions in narrow tools with explicit contracts:

run tests
read file from approved paths
edit file in workspace only
list git diff
query build status
restart dev service

A specific tool with bounded input is always safer than bash -lc and a prayer.

2. Strong Sandboxing

If shell access is unavoidable, isolate the runtime properly:

ephemeral environments
no host mounts unless essential
read-only filesystem wherever possible
drop Linux capabilities
block privilege escalation
separate UID/GID
no Docker socket
no access to instance metadata
tight seccomp/AppArmor/SELinux profiles
restricted outbound network egress

If the model only needs repo-local operations, then the environment should be physically incapable of touching anything else.

3. Secret Minimisation

Do not inject ambient credentials into agent runtimes. No long-lived cloud keys. No full developer profiles. No inherited shell history full of tokens. Use short-lived, task-scoped credentials with explicit revocation. Better yet, design tasks that do not require secrets at all.

The best secret available to an LLM is the one that was never mounted.

4. Approval Gates for High-Risk Actions

Certain command classes should always require human approval:

network downloads and remote execution
package installation
filesystem deletion outside temp space
permission changes
git push / merge / tag
cloud and Kubernetes mutations
service restarts in shared environments
anything touching prod

This needs policy enforcement, not a polite system prompt.

5. Provenance and Trust Separation

Track where instructions come from. User request, local codebase, terminal output, remote webpage, issue tracker, generated artifact — these are not equivalent. Treat untrusted content as tainted. Do not allow it to silently authorise tool execution. If the model references a command suggested by untrusted content, surface that fact explicitly.

6. Full Observability

Log every command, file read, file write, network destination, approval event, and tool invocation. Keep transcripts. Keep diffs. Keep timestamps. If the agent does something stupid, you need forensic reconstruction, not storytelling.

And no, “we have application logs” is not enough. You need agent action logs with decision context.

7. Default-Deny Network Access

Most coding and triage tasks do not require arbitrary internet access. Block it by default. Allow specific registries, package mirrors, or internal endpoints only when necessary. The fastest way to cut off exfiltration and supply chain nonsense is to stop the runtime talking to the whole internet like it owns the place.

A More Honest Threat Model

If you give an LLM CLI access, threat model it like this:

You have created an execution-capable agent that can be influenced by untrusted content, inherits ambient authority unless explicitly prevented, and can chain benign actions into harmful outcomes faster than a human operator.

That does not mean “never do it.” It means stop pretending it is low risk because the interface looks friendly.

The right question is not whether the model is aligned, helpful, or smart. The right question is: what is the maximum damage this runtime can do when the model is wrong, manipulated, or both?

If the answer is “quite a lot,” your architecture is bad.

The Bottom Line

CLI-enabled LLMs are not just chatbots with tools. They are a new execution layer sitting on top of old, sharp infrastructure. The shell gives them leverage. Prompt injection gives attackers influence. Ambient credentials give them reach. Weak sandboxing gives them consequences.

The upside is real. So is the blast radius.

If you want the productivity gains without the inevitable incident report, stop handing models a general-purpose terminal and calling it innovation. Give them constrained capabilities, isolated runtimes, short-lived credentials, hard approval gates, and logs good enough to survive an audit.

Because once the LLM gets a shell, the difference between “helpful assistant” and “automated own goal” is mostly architecture.

04/04/2026

Browser-Use Agents and Server-Side Request Forgery: Old Vulns, New Vectors

SSRF is not new. It’s been on the OWASP Top 10 since 2021, it’s been in every pentester’s playbook for a decade, and it’s the reason you’re not supposed to let user input control outbound HTTP requests from your server. We know how to prevent it. We know how to test for it. We’ve written the cheat sheets, the detection rules, the WAF signatures.

And then we gave AI agents a browser and told them to “go look things up.”

SSRF is back, and this time it’s wearing a trench coat made of natural language.

The Old SSRF: A Quick Refresher

Classic SSRF is straightforward: an application takes a URL from user input and makes a server-side request to it. The attacker supplies http://169.254.169.254/latest/meta-data/ instead of a legitimate URL. The server dutifully fetches AWS credentials from the instance metadata service and hands them to the attacker. Game over.

Defences are well-understood: validate URLs against allowlists, block private IP ranges, resolve DNS before making the request to prevent rebinding, restrict egress at the network level. This is AppSec 101.

But those defences assumed something: that URLs would arrive as URLs, in URL-shaped fields, through parseable HTTP parameters.

That assumption no longer holds.

The New Vector: AI Agents as SSRF Proxies

An AI agent with browsing capabilities is, architecturally, an SSRF vulnerability by design. Its entire purpose is to receive instructions in natural language and make HTTP requests to arbitrary destinations. The “user input” isn’t a URL parameter — it’s a sentence like “check the internal admin dashboard” or “fetch this document for me.”

The agent dutifully translates that into an HTTP request. And if nobody told it that http://localhost:8080/admin is off-limits, it will happily go there.

This isn’t theoretical. Let me walk you through what’s already happening.

Real-World Evidence: It’s Already Being Exploited

1. Pydantic AI — CVE-2026-25580 (CVSS 8.6)

In February 2026, Pydantic AI — a widely-used framework for building AI agents — disclosed CVE-2026-25580, a textbook SSRF vulnerability in its URL download functionality. The download_item() helper fetched content from URLs without validating that the target was a public address.

Any application accepting message history from untrusted sources (chat interfaces, Vercel AI SDK integrations, AG-UI protocol implementations) was vulnerable. An attacker could submit a message with a file attachment pointing at:

http://169.254.169.254/latest/meta-data/iam/security-credentials/

And the server would fetch AWS IAM credentials and return them. Multiple model integrations were affected — OpenAI, Anthropic, Google, xAI, Bedrock, and OpenRouter all had download paths that could be abused.

The fix? Comprehensive SSRF protection: blocking private IPs, always blocking cloud metadata endpoints, validating redirect targets, resolving DNS before requests. Standard SSRF defences that should have been there from day one. The fact that a framework built specifically for AI agents shipped without basic SSRF protection tells you everything about the current state of agent security.

2. Tencent Xuanwu Lab — Server-Side Browser Kill Chains

Tencent’s Xuanwu Lab published a white paper on AI web crawler security in February 2026 that reads like a horror story. They tested server-side browsers across multiple AI products and found remote code execution vulnerabilities in every single one. The affected products collectively serve over a billion users.

Their four documented attack cases expose a pattern:

Case	Entry Point	Bypass Method	Impact
1	AI search with URL allowlist	302 redirect via allowlisted site	RCE, no sandbox
2	AI reading + sharing + screenshot	Chained features to bypass domain allowlist	SSRF to cloud metadata
3	URL access with script filtering	`<img onerror>` bypassed `<script>` filter	RCE via N-day chain
4	Hidden backend indexing crawler	No bypass needed — no defences	RCE, no sandbox

Case 4 is particularly grim: a hidden backend crawler that batch-fetched URLs users had queried — invisible to frontend security, undocumented, running an outdated browser with no sandbox. The attacker didn’t even need to bypass anything.

The Xuanwu team puts it bluntly: “When you launch a browser instance, you are not starting a simple web browsing tool — you are launching a ‘micro operating system.’ A vulnerability in any single component could lead to remote code execution.”

3. Unit 42 — Indirect Prompt Injection as SSRF Delivery Mechanism

Palo Alto’s Unit 42 published research in March 2026 documenting web-based indirect prompt injection (IDPI) attacks observed in the wild. Not proof-of-concept. Not lab demos. Production attacks.

Their taxonomy maps the full kill chain from SSRF’s perspective:

Forced internal requests: Embedded prompts in web pages instructing agents to access http://localhost, internal services, and cloud metadata endpoints
Unauthorized transactions: Prompts directing agents to visit Stripe payment URLs and PayPal links to initiate financial transactions
Data exfiltration: Instructions to collect environment variables, credentials, and contact lists — then exfiltrate via URL-encoded requests
Data destruction: Commands to rm -rf and fork bombs targeting backend infrastructure

The delivery methods are creative: zero-width Unicode characters, CSS-hidden text, Base64-encoded payloads assembled at runtime, SVG encapsulation, HTML attribute cloaking. 85% of the jailbreaks were social engineering — framing destructive commands as “security updates” or “compliance checks.”

The kicker: one attacker embedded 24 separate prompt injection attempts in a single page, using different delivery methods for each one. If even one bypasses the model’s safety filters, the attack succeeds.

4. Browserbase — “One Malicious <div> Away From Going Rogue”

Browserbase’s February 2026 analysis frames the problem with precision: “Every webpage an agent visits is a potential vector for attack.” They cite the PromptArmor research on Google’s Antigravity IDE, where an indirect prompt injection hidden in 1-point font inside an “implementation guide” successfully exfiltrated environment variables by encoding them as URLs and sending them via the browser agent’s own network requests.

That’s SSRF triggered by reading a document. The URL didn’t arrive as a URL. It arrived as invisible text on a web page.

Why Traditional SSRF Defences Fail Against Agents

The fundamental problem: SSRF defences are designed to protect applications, not autonomous decision-makers.

Traditional Defence	Why It Fails With Agents
URL allowlists	Agents generate URLs dynamically from natural language — no static list covers the infinite space of valid requests
Input validation on URL parameters	The “input” is a sentence, not a URL. The URL is constructed internally by the agent
WAF signatures	Natural language payloads don’t match traditional SSRF patterns
DNS pre-resolution	Only works if you control the HTTP client — many agent frameworks use browsers that handle DNS independently
Egress filtering	Agent needs internet access to function — blocking egress breaks the core use case
IP blocklists	Only effective if applied at the HTTP client level before the request is made — agents using embedded browsers bypass application-layer controls

The Tencent Xuanwu research adds another dimension: even when enterprises implement URL allowlists, they’re trivially bypassed. A 302 redirect from an allowlisted domain to an attacker-controlled page defeats the entire scheme. The SSRF isn’t in the first request — it’s in the redirect chain that follows.

The Attack Surface Is Bigger Than You Think

SSRF in the context of browser-use agents isn’t just about fetching cloud metadata. The attack surface includes:

Cloud metadata services: AWS IMDSv1 (169.254.169.254), GCP, Azure, Alibaba Cloud — stealing IAM roles, service account tokens, API keys
Internal APIs and admin panels: Accessing unauthenticated internal services that trust requests from within the network perimeter
Database ports: Probing internal MySQL:3306, Redis:6379, PostgreSQL:5432 — extracting data from services that don’t require auth on localhost
Container orchestration: Accessing Kubernetes API servers, Docker sockets, etcd — pivoting to full cluster compromise
Other agents: In multi-agent architectures, a compromised agent can SSRF into other agents’ API endpoints, creating cascading compromise
Data exfiltration via URL encoding: The PromptArmor/Antigravity technique — embedding stolen data in outbound URL parameters, effectively using the agent as a covert channel

The Xuanwu team found that server-side browser containers were often deployed in the same network segment as production databases, task schedulers, and model inference nodes. Zero network isolation. Once the browser was compromised, lateral movement was trivial.

What Actually Works

If you’re deploying agents with browsing capabilities, here’s what you need — not principles, but concrete controls:

1. Network Isolation (Non-Negotiable)

Browser agents must run in isolated network zones. Egress to the internet: allowed. Access to internal services, metadata endpoints, private IP ranges: blocked at the infrastructure level. Kubernetes NetworkPolicies, separate VPCs, cloud security groups. This is the single most effective control — if the agent can’t reach 169.254.169.254, stealing metadata credentials is off the table regardless of what the LLM is tricked into doing.

2. SSRF Protection at the HTTP Client Level

Every HTTP request the agent makes should pass through a hardened client that:

Resolves DNS before connecting (prevents rebinding)
Blocks private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 127.0.0.0/8, 169.254.0.0/16)
Always blocks cloud metadata endpoints, even if “allow-local” is configured
Validates every redirect target, not just the initial URL
Restricts protocols to http:// and https:// only

Pydantic AI’s post-CVE fix is a good reference implementation.

3. Browser Sandboxing (Never Disable It)

The Tencent research found that multiple AI products disabled Chrome’s sandbox (--no-sandbox) to resolve container compatibility issues. This is catastrophic. Fix the container configuration instead: add the required seccomp profiles, grant CAP_SYS_ADMIN if necessary, configure user namespaces properly. The sandbox is the last line of defence against RCE — removing it turns every browser vulnerability into a full server compromise.

4. Instance Isolation

Each browsing task should use an independent, ephemeral browser instance that’s destroyed after completion. This prevents cross-task contamination, stops persistent compromise, and eliminates credential leakage between sessions. Browserbase’s approach of dedicated VMs per session with automatic teardown is the right model.

5. Attack Surface Reduction

Disable everything the agent doesn’t need: WebGL, WebRTC, PDF plugins, extensions. If performance allows, run with --jitless to eliminate the V8 JIT compiler — which accounts for roughly 23% of Chrome’s high-severity CVEs. Tencent’s analysis shows that disabling WebGL/GPU and JIT alone eliminates nearly 40% of browser vulnerability surface.

6. Runtime Behaviour Control

Tencent open-sourced SEChrome, a protection layer that monitors browser process system calls and enforces allowlists for file access, process execution, and network requests. Even if an attacker achieves RCE inside the browser, they can’t read sensitive files, execute arbitrary commands, or access the network beyond permitted destinations. Every tested exploit was blocked.

The Uncomfortable Truth

We’re deploying AI agents that have the browsing capabilities of a human user, the network access of a server-side application, and the security boundaries of neither. Every web page they visit is a potential attack payload. Every URL they construct is a potential SSRF. Every redirect they follow is a potential pivot point.

SSRF wasn’t “solved” in traditional web applications — it was managed through layers of controls that assumed a predictable request flow. AI agents break that assumption completely. The request flow is generated by a language model interpreting natural language from potentially hostile sources.

The good news: the defences exist. Network isolation, sandboxing, SSRF-hardened HTTP clients, instance isolation, runtime behaviour control. None of this is novel engineering. It’s applying established security patterns to a new deployment model.

The bad news: most agent deployments aren’t implementing any of it.

Old vulns don’t retire. They just find new hosts.

Sources & Further Reading:

03/04/2026

The OWASP Top 10 for AI Agents Is Here. It's Not Enough.

In December 2025, OWASP released the Top 10 for Agentic Applications 2026 — the first security framework dedicated to autonomous AI agents. Over 100 researchers and practitioners contributed. NIST, the European Commission, and the Alan Turing Institute reviewed it. Palo Alto Networks, Microsoft, and AWS endorsed it.

It’s a solid taxonomy. It gives the industry a shared language for a new class of threats. And it is nowhere near mature enough for what’s already happening in production.

Let me explain.

What the Framework Gets Right

Credit where it’s due. The OWASP Agentic Top 10 correctly identifies the fundamental shift: a chatbot answers questions, an agent executes tasks. That distinction changes the entire threat model. When you give an AI system the ability to call APIs, access databases, send emails, and execute code, you’ve created something with real operational authority. A compromised chatbot hallucrinates. A compromised agent exfiltrates data, manipulates records, or sabotages infrastructure — at machine speed, with legitimate credentials.

The ten risk categories — from ASI01 (Agent Goal Hijack) through ASI10 (Rogue Agents) — capture real threats that are already showing up in the wild:

ID	Risk	Translation
ASI01	Agent Goal Hijack	Your agent now works for the attacker
ASI02	Tool Misuse & Exploitation	Legitimate tools used destructively
ASI03	Identity & Privilege Abuse	Agent inherits god-mode credentials
ASI04	Supply Chain Vulnerabilities	Poisoned MCP servers, plugins, models
ASI05	Unexpected Code Execution	Your agent just ran a reverse shell
ASI06	Memory & Context Poisoning	Long-term memory becomes a sleeper cell
ASI07	Insecure Inter-Agent Comms	Agent-in-the-middle attacks
ASI08	Cascading Failures	One bad tool call nukes everything
ASI09	Human-Agent Trust Exploitation	Agent social-engineers the human
ASI10	Rogue Agents	Agent goes off-script autonomously

The framework also introduces two principles that should be tattooed on every architect’s forehead: Least-Agency (don’t give agents more autonomy than the task requires) and Strong Observability (log everything the agent does, decides, and touches).

Good principles. Now let’s talk about why principles aren’t enough.

The Maturity Problem

The OWASP Agentic Top 10 is a taxonomy, not a defence framework. It names threats. It describes mitigations at a high level. But it leaves the hard engineering problems unsolved — and in some cases, unacknowledged.

1. The Attacks Are Already Here. The Defences Are Not.

The framework dropped in December 2025. By then, every major risk category already had real-world incidents:

ASI01 (Goal Hijack): Koi Security found an npm package live for two years with embedded prompt injection strings designed to convince AI-based security scanners the code was legitimate. Attackers are already weaponising natural language as an attack vector against autonomous tools.
ASI02 (Tool Misuse): Amazon Q’s VS Code extension was compromised with destructive instructions — aws s3 rm, aws ec2 terminate-instances, aws iam delete-user — combined with flags that disabled confirmation prompts (--trust-all-tools --no-interactive). Nearly a million developers had the extension installed. The agent wasn’t escaping a sandbox. There was no sandbox.
ASI04 (Supply Chain): The first malicious MCP server was found on npm, impersonating Postmark’s email service and BCC’ing every message to an attacker. A month later, another MCP package shipped with dual reverse shells — 86,000 downloads, zero visible dependencies.
ASI05 (Code Execution): Anthropic’s own Claude Desktop extensions had three RCE vulnerabilities in the Chrome, iMessage, and Apple Notes connectors (CVSS 8.9). Ask Claude “Where can I play paddle in Brooklyn?” and an attacker-controlled web page in the search results could trigger arbitrary code execution with full system privileges.
ASI06 (Memory Poisoning): Researchers demonstrated how persistent instructions could be embedded in an agent’s context that influenced all subsequent interactions — even across sessions. The agent looked normal. It behaved normally most of the time. But it had been quietly reprogrammed weeks earlier.

The framework describes these threats. It does not provide testable, enforceable controls for any of them. “Implement input validation” is not a control when the input is natural language and the attack surface is every document, email, and web page the agent reads.

2. It Doesn’t Address the Governance Gap

Here’s the uncomfortable truth, stated clearly by Modulos: “The same enterprise that would never ship a customer-facing application without security review is deploying autonomous agents that can execute code, access sensitive data, and make decisions. No formal risk assessment. No mapped controls. No documented mitigations. No monitoring for anomalous behaviour.”

A risk taxonomy is only useful if it’s operationalised. The OWASP Agentic Top 10 gives security teams vocabulary but not workflow. There’s no:

Maturity model for agentic security posture
Reference architecture for secure agent deployment
Compliance mapping to existing frameworks (EU AI Act, ISO 42001, SOC 2)
Standardised scoring or severity rating for agent-specific risks
Testable benchmark to validate whether mitigations actually work

Security teams are left to figure out the implementation themselves, which is exactly how “deploy first, secure later” happens.

3. The LLM Top 10 Was Insufficient. This Is Still Catching Up.

NeuralTrust put it bluntly in their deep dive: “The existing OWASP Top 10 for LLM Applications is insufficient. An agent’s ability to chain actions and operate autonomously means a minor vulnerability, such as a simple prompt injection, can quickly cascade into a system-wide compromise, data exfiltration, or financial loss.”

The Agentic Top 10 was created because the LLM Top 10 didn’t cover agent-specific risks. But the Agentic list itself was created from survey data and expert input — not from a systematic threat modelling exercise against production agent architectures. As Entro Security noted: “Agents mostly amplify existing vulnerabilities — not creating entirely new ones.”

If agents amplify existing vulnerabilities, then a Top 10 list that doesn’t deeply integrate with existing identity management, secret management, and access control frameworks is leaving the most exploitable gaps unaddressed.

4. Non-Human Identity Is the Real Battleground

The OWASP NHI (Non-Human Identity) Top 10 maps directly to the Agentic Top 10. Every meaningful agent runs on API keys, OAuth tokens, service accounts, and PATs. When those identities are over-privileged, invisible, or exposed, the theoretical risks become real incidents.

Look at the list through an identity lens:

Goal Hijack (ASI01) matters because the agent already holds powerful credentials
Tool Misuse (ASI02) matters because tools are wired to cloud and SaaS permissions
Identity Abuse (ASI03) is literally about agent sessions, tokens, and roles
Memory Poisoning (ASI06) becomes critical when memory contains secrets and tokens
Cascading Failures (ASI08) amplify because the same NHI is reused across multiple agents

You cannot secure AI agents without securing the non-human identities that power them. The Agentic Top 10 acknowledges this. It does not solve it.

5. Where’s the Red Team Playbook?

NeuralTrust’s analysis makes a critical point: “Traditional penetration testing is insufficient. Security teams must conduct periodic tests that simulate complex, multi-step attacks.”

The framework mentions red teaming in passing. It doesn’t provide:

Attack scenarios mapped to each ASI category
Testing methodologies for multi-agent systems
Metrics for measuring resilience against agent-specific threats
A CTF-style reference application for practising agentic attacks (OWASP’s FinBot exists but is separate from the Top 10 itself)

For a framework targeting autonomous systems, the absence of a structured offensive testing methodology is a significant gap.

What Needs to Happen Next

The OWASP Agentic Top 10 is version 1.0. Like the original OWASP Web Top 10 in 2004, it’s a starting point, not a destination. Here’s what the next iteration needs:

Enforceable controls, not just principles. Each ASI category needs prescriptive, testable controls with pass/fail criteria. “Implement least privilege” is not a control. “Agent credentials must be session-scoped with a maximum TTL of 1 hour and automatic revocation on task completion” is a control.
Reference architectures. Show me what a secure agentic deployment looks like. Network topology. Identity flow. Tool sandboxing. Kill switch mechanism. Not theory — diagrams and code.
Integration with existing compliance. Map ASI categories to ISO 42001, NIST AI RMF, EU AI Act Article 9, SOC 2 Trust Service Criteria. Security teams need to plug this into their existing GRC workflows, not run a parallel process.
Offensive testing methodology. A structured red team playbook with attack trees for each ASI category, severity scoring, and reproducible test cases. The framework needs teeth.
Incident data. Start collecting and publishing anonymised incident data. The web Top 10 evolved because we had breach data showing which vulnerabilities were actually exploited at scale. The agentic space needs the same feedback loop.

The Bottom Line

The OWASP Top 10 for Agentic Applications 2026 is a necessary first step. It gives us vocabulary. It draws attention to a real and growing threat surface. The 100+ contributors did meaningful work.

But let’s not confuse naming a problem with solving it. The agents are already in production. The attacks are already happening. And the governance, tooling, and testing infrastructure needed to secure these systems is lagging badly behind.

The original OWASP Top 10 took years and multiple iterations to become the authoritative reference it is today. The agentic equivalent doesn’t have years. The attack surface is expanding at the speed of npm install.

Name the risks. Good. Now build the defences.

Sources & Further Reading:

02/04/2026

Your App Store Won't Save You: Mobile Malware & Supply Chain Poisoning in 2026

// Elusive Thoughts

Your App Store Won't Save You: Mobile Malware & Supply Chain Poisoning in 2026

April 2, 2026 · Jerry · 8 min read

There's a comforting lie the industry has been telling consumers for over a decade: "Just download apps from the official store and you'll be fine." In Q1 2026, that lie is unraveling faster than a misconfigured Docker socket on a public VPS.

Let's talk about what's actually happening, why app store vetting is a paper shield, and what this means for anyone building or defending mobile applications.

2.3M Devices infected by NoVoice via Google Play

4 Chrome zero-days patched in 2026 (so far)

0 Days Apple warned users before DarkSword emergency patch

NoVoice: 2.3 Million Infections Through the Front Door

The NoVoice malware didn't sneak in through sideloading. It didn't require users to enable "Install from Unknown Sources." It walked straight through Google Play's front gates and infected 2.3 million devices before anyone pulled the fire alarm.

The technique isn't new — it's just getting better. NoVoice used a staged payload architecture: the initial app submitted to Play Store review was clean. A benign utility app with legitimate functionality. The malicious payload was fetched post-install via a seemingly innocent "configuration update" from a CDN that rotated domains.

⚠️ Key Takeaway Google Play Protect's static and dynamic analysis only evaluates the submitted APK. If the malicious behavior is loaded at runtime from an external source, the store review process is effectively blind. This is not a bug — it's an architectural limitation.

What NoVoice actually does once activated:

Microphone suppression — silently disables mic input during calls, causing "can you hear me?" scenarios that mask the real payload activity
SMS interception — captures OTP codes and forwards them to C2 infrastructure
Accessibility service abuse — once granted a11y permissions (under the guise of "enhanced audio settings"), it gains overlay and keylogging capabilities
Anti-analysis techniques — detects emulators, debuggers, and common sandbox environments; stays dormant if it suspects it's being analyzed

The irony? The app had a 4.2-star rating. Users were recommending it to each other.

DarkSword: When Apple's Walled Garden Gets Climbed

Apple has been expanding iOS 18 security updates to older iPhone models specifically to block DarkSword attacks. The fact that Apple pushed emergency patches to devices outside the normal update cycle tells you everything about the severity.

DarkSword exploits a chain of vulnerabilities in WebKit and the iOS kernel to achieve zero-click remote code execution. The attack vector? A crafted iMessage or Safari link. No user interaction required beyond receiving the message.

"We've seen DarkSword deployment against journalists, activists, and — increasingly — corporate targets in the financial sector. The toolkit is being sold as a service."
— Threat intelligence reporting, March 2026

This isn't the first zero-click iOS chain, and it won't be the last. But the expansion of patches to legacy devices suggests the exploit was being used specifically against users on older hardware — people who are statistically less likely to be running the latest OS version.

The Targeting Pattern

Attackers are getting smarter about target selection. Why burn a zero-day on a security researcher running the latest beta when you can hit a mid-level finance exec still on an iPhone 12 running iOS 17? The ROI calculus of mobile exploitation has shifted.

The Supply Chain Angle: Trust Is the Real Vulnerability

If the NoVoice and DarkSword stories weren't enough, we also saw Mercor — a YC-backed AI hiring platform — get compromised through a supply chain attack on the open-source LiteLLM project.

This is the same pattern playing out across the stack:

Find a widely-used open-source dependency
Compromise a maintainer account or inject a malicious commit
Wait for downstream consumers to pull the update
Profit

For mobile apps, the supply chain attack surface is enormous. A typical Android app pulls in 50-200 transitive dependencies. Each one is a potential insertion point. And unlike server-side dependencies where you might catch anomalies in network monitoring, mobile apps phone home to dozens of SDKs by design — ad networks, analytics, crash reporting, A/B testing — making malicious C2 traffic trivially easy to disguise.

// Your build.gradle doesn't show the full picture
dependencies {
    implementation 'com.legitimate-sdk:analytics:3.2.1'  // 47 transitive deps
    implementation 'com.ad-network:monetize:5.0.0'       // 83 transitive deps  
    implementation 'com.totally-not-evil:utils:1.0.4'    // compromised last Tuesday
}

The FBI Warning: Chinese Mobile Apps

Meanwhile, the FBI issued a formal warning against using certain Chinese mobile applications due to privacy risks. This isn't new territory — TikTok discourse has been running for years — but the scope of this warning is broader. It covers utility apps, file managers, VPNs, and keyboard apps that have been found exfiltrating:

Contact lists and call logs
Location data at intervals far exceeding stated functionality
Clipboard contents (including copied passwords and crypto addresses)
Installed app inventories
Network configuration details

The common thread? These apps request permissions that seem reasonable for their stated purpose but use those permissions for undisclosed data collection. A keyboard app needs input access. A VPN needs network access. The permissions model was never designed to distinguish between "I need this to function" and "I need this to exfiltrate."

What Actually Helps: A Realistic Defense Checklist

Let's skip the "just be careful what you install" advice. Here's what actually moves the needle:

For Developers / AppSec Engineers

Pin your dependencies and audit transitive trees. Use tools like dependency-review-action in CI/CD. Don't just check direct deps — it's the transitive ones that get you.
Implement runtime integrity checks. Verify that loaded code matches expected hashes. Google's Play Integrity API and Apple's App Attest are starting points, not complete solutions.
Monitor for dynamic code loading. Flag any use of DexClassLoader, Runtime.exec(), or equivalent iOS dynamic loading patterns in code review.
Network layer monitoring. Implement certificate pinning and log all outbound connections. If your app suddenly starts talking to a domain not in your allowlist, that's a red flag.
SCA with teeth. Static analysis on the full dependency tree, not just your source code. Tools like Semgrep can write custom rules to catch suspicious patterns in third-party code.

For Users / Organizations

MDM with app allowlisting for corporate devices. If it's not on the approved list, it doesn't install. Period.
Keep devices updated. The DarkSword campaign specifically targeted outdated devices. This isn't optional advice — it's triage.
Review app permissions quarterly. That flashlight app still has camera and microphone access? Fix that.
Network-level detection. DNS filtering and network traffic analysis can catch C2 communication that device-level protections miss.

💡 The Uncomfortable Truth The app store model was designed for a world where the biggest threat was piracy. It was never built to be a security boundary against nation-state toolkits and sophisticated supply chain attacks. Treating "it's on the App Store" as a security assertion is like treating "it's on the internet" as a trust signal. The sooner we internalize this, the sooner we can build actual defenses.

Looking Ahead

Q1 2026 has made one thing clear: the mobile threat landscape is converging with the server-side threat landscape. The same supply chain attacks, the same zero-day economics, the same "trusted channel" abuse. The difference is that mobile devices carry more personal data, have more sensors, and their users are conditioned to click "Allow" on permission prompts.

Google and Apple will continue to improve their review processes. Researchers will continue to find bypasses. The arms race continues. But as defenders, we need to stop pretending the app store is a security perimeter and start treating mobile applications with the same zero-trust rigor we apply to server infrastructure.

Your app store won't save you. Your threat model should account for that.

30/03/2026

Subverting Claude — Jailbreaking Anthropic's Flagship LLM

AI Security Research // LLM Red Teaming

Subverting Claude: Jailbreaking Anthropic's Flagship LLM

Attack taxonomy, real-world breach analysis, and the tooling the suits don't want you to know about.

March 2026 · Elusive Thoughts · ~12 min read

Anthropic markets Claude as the safety-first LLM. Constitutional AI. RLHF. Layered classifiers. The pitch sounds bulletproof on a slide deck. But when you put Claude in front of someone who actually understands adversarial input, the picture shifts. The model's refusal behaviour is predictable, and predictable systems are exploitable systems.

This post breaks down the current state of Claude jailbreaking in 2026: what works, what Anthropic has patched, what they haven't, and the open-source tooling that lets you automate the whole assessment. This is written from a security engineering perspective for pentesters, AppSec engineers, and red teamers evaluating LLM integrations in production applications. We are not here to cause harm — we are here because if you're deploying Claude-backed features without understanding the adversarial surface, you are shipping a vulnerability.

Disclaimer: This post documents publicly available research for defensive security purposes. Jailbreaking production systems without authorisation is a violation of terms of service and potentially illegal. Use this knowledge to test your own deployments. Act responsibly.

How Claude's Safety Stack Actually Works

Before you break something, understand the architecture. Claude's safety isn't a single guardrail — it's a layered defence that attempts to catch adversarial input at multiple stages. Anthropic's approach differs fundamentally from OpenAI's more RLHF-heavy strategy.

Constitutional AI (CAI)

The foundational layer. Anthropic trains Claude using a "constitution" — a set of natural-language principles that define acceptable and unacceptable behaviour. Rather than relying entirely on human feedback, they use AI-generated feedback guided by these principles. The model critiques its own outputs, revises them, and then gets fine-tuned on the improved versions. Clever, but it introduces a predictable pattern: Claude will often try to reframe requests rather than flat-out refuse them. That reframing behaviour is itself an attack surface.

Constitutional Classifiers

Anthropic's more recent and significant defensive layer. These are input/output classifiers trained on synthetic data generated from the constitution. They act as a filtering layer separate from the model itself. The first generation reduced jailbreak success rates from 86% down to 4.4% in automated testing. The second generation, Constitutional Classifiers++, addressed reconstruction attacks and output obfuscation while maintaining refusal rate increases of only ~0.38% on legitimate traffic.

During Anthropic's bug bounty programme, 183 participants spent over 3,000 hours attempting to break the prototype. No universal jailbreak was found. The $10,000/$20,000 bounties went unclaimed. That's impressive. But "no universal jailbreak" is not the same as "no jailbreak." Targeted, context-specific bypasses are a different game entirely.

ASL-3 Safeguards

For Claude Opus 4, Anthropic deploys additional ASL-3 safeguards specifically targeting CBRN (Chemical, Biological, Radiological, Nuclear) content. This creates a tiered system where Opus-tier models have stronger protections than Sonnet or Haiku variants — a fact that matters for red teamers choosing their target.

The Attack Taxonomy: What Actually Works in 2026

Jailbreaking techniques against Claude (and LLMs generally) fall into well-documented categories. None of these are new in principle, but their effectiveness varies wildly across model generations and deployment configurations. Here's the current landscape.

1. Roleplay & Persona Injection (DAN-Style)

The classic. Ask the model to adopt an unrestricted persona — "DAN" (Do Anything Now), an "unfiltered AI," a fictional character who "isn't bound by guidelines." Against Claude specifically, this is the least effective category. Claude's Constitutional AI training is robust against most direct persona injection: the model declines the underlying request rather than complying through the fictional wrapper. Success rate against current Claude models: low single digits in isolation.

However, persona injection still works as a primer for multi-turn escalation — don't dismiss it entirely.

2. Many-Shot Jailbreaking (MSJ)

Anthropic themselves published this one. You prepopulate the context window with fabricated conversation turns where the model appears to have already complied with harmful requests. As the number of "shots" increases, the probability of harmful output from the target prompt increases. The technique exploits in-context learning: the model starts treating the fake conversation history as a behavioural baseline.

Key insight: MSJ effectiveness scales with context window length. Claude's expanded context windows (200K+) actually increase the attack surface here, because longer prompts can include more fabricated compliance examples. Anthropic's mitigations have raised the threshold but haven't eliminated the vector.

Combining MSJ with other techniques (persona injection, encoding tricks) reduces the prompt length required for success — the composition effect is well-documented in Anthropic's own research paper.

3. Multi-Turn Escalation (Crescendo)

This is the technique with the highest real-world success rate. You don't ask for the restricted content directly. Instead, you build up through a series of individually benign requests, each one nudging the conversation closer to your objective. Each step looks harmless in isolation. By the time the model is deep in context, the cumulative framing has shifted its behavioural baseline.

Repello AI's red-team study across GPT-5.1, GPT-5.2, and Claude Opus 4.5 found breach rates of 28.6%, 14.3%, and 4.8% respectively across 21 multi-turn adversarial scenarios. Claude performed best, but a 4.8% breach rate is not zero. In an enterprise deployment processing thousands of conversations, 4.8% translates to a meaningful number of guardrail failures.

The Crescendo variant specifically has been documented achieving 90%+ success rates against earlier model generations in controlled settings.

4. Encoding & Obfuscation

Encoding tricks bypass keyword-based filtering by presenting harmful content in formats the safety layer doesn't catch: Base64 encoding, ROT13, leetspeak, unusual capitalisation (uSiNg tHiS pAtTeRn), zero-width characters, and Unicode substitutions. These achieved a 76.2% attack success rate in a study of 1,400+ adversarial prompts across multiple models.

Anthropic's Constitutional Classifiers++ specifically address this vector, but encoding remains effective against deployments running older Claude versions or custom integrations without the classifier layer.

5. Indirect Context Smuggling

The enterprise attack vector. Instructions are embedded in documents, emails, or data that the model processes — not in the user's direct prompt. This is prompt injection rather than jailbreaking in the strict sense, but the outcome is the same: the model executes attacker-controlled instructions.

CVE-2025-54794 demonstrated this against Claude through crafted code blocks in markdown and uploaded documents. When Claude parses multi-line code snippets or formatted documents, the internal token processing can be hijacked to override alignment. If Claude has memory or multi-turn persistence, the jailbreak state can survive across prompts.

6. Reconstruction Attacks

The technique Anthropic explicitly flagged as a weakness in their Constitutional Classifiers++ paper. You break harmful information into benign-looking segments scattered across the prompt — for example, embedding a harmful query as function names distributed throughout a codebase, then asking the model to extract and respond to the hidden message. Each individual segment passes the classifier; the reassembled whole doesn't.

7. Philosophical & Epistemic Manipulation

The subtlest approach. Rather than trying to override safety through force, you undermine the model's confidence in its own safety boundaries through philosophical argument. Lumenova AI's research demonstrated this against Claude 4.5 Sonnet: they started with a legitimate age-gating discussion, then gradually leveraged epistemic uncertainty arguments to convince the model that its safety position was philosophically indefensible. The model treated the appearance of accountability (a disclaimer) as equivalent to actual accountability.

Why this matters for AppSec: If your application wraps Claude with custom system prompts, an attacker who understands the philosophical framing can potentially convince the model that your safety constraints are unreasonable — and the model will rationalise compliance.

Real-World Case: The Mexico Government Breach

In December 2025, a solo operator jailbroke Claude and used it as an attack orchestrator against Mexican government agencies. The campaign ran for approximately one month and resulted in 150 GB of exfiltrated data — taxpayer records, voter rolls, employee credentials, and operational data from at least 20 exploited vulnerabilities across federal and state systems.

The attacker used Spanish-language prompts, role-playing Claude as an "elite hacker" in a fictional bug bounty programme. Initial refusals crumbled under persistent persuasion. Claude eventually generated vulnerability scanning scripts, SQL injection payloads, and automated credential-stuffing tools tailored to the target infrastructure.

The critical detail that many write-ups miss: the attacker achieved initial access before using Claude. The AI was weaponised as a post-exploitation orchestrator — planning lateral movement, generating exploitation scripts, and identifying next targets. This is a fundamentally easier problem than using AI for initial compromise. Once you feed Claude authenticated context, network topology, and real credential data, the model excels at the planning and scripting tasks that constitute post-exploitation.

When Claude hit output limits, the attacker pivoted to ChatGPT for lateral movement research and LOLBins evasion techniques. This multi-model approach — using different LLMs for different phases — represents the operational reality of AI-assisted attacks.

The Tooling Arsenal: LLM Red Teaming Frameworks

Running a manual prompt injection test and calling it a red team assessment is the equivalent of running ping and calling it a penetration test. The attack surface is too large, too non-deterministic, and too tool-dependent for manual-only coverage. Here's the current tooling landscape.

Garak — NVIDIA's LLM Vulnerability Scanner

GitHub: github.com/NVIDIA/garak

The closest thing to nmap for LLMs. Garak is an open-source vulnerability scanner that combines static, dynamic, and adaptive probes to systematically test LLM deployments. It ships with hundreds of adversarial prompts across categories including prompt injection, DAN variants, encoding attacks, data leakage, and toxicity generation.

# Install garak
pip install garak

# Scan an OpenAI model for encoding vulnerabilities
python3 -m garak --target_type openai --target_name gpt-4 --probes encoding

# Test Hugging Face model against DAN 11.0
python3 -m garak --target_type huggingface --target_name gpt2 --probes dan.Dan_11_0

# Target a custom REST endpoint (e.g., your Claude wrapper)
# Create a YAML config pointing to your API, then:
python3 -m garak --target_type rest --target_config my_claude_api.yaml --probes all

Architecture breakdown: Generators abstract the target LLM connection (supports OpenAI, HuggingFace, Ollama, NVIDIA NIMs, custom REST). Probes generate adversarial inputs targeting specific vulnerability classes. Detectors analyse outputs to determine if the vulnerability was triggered. Harness orchestrates the full pipeline. Evaluator reports results with failure rates.

Integrate it into CI/CD and you have continuous LLM security monitoring. The reporting output maps to standard security assessment formats.

DeepTeam — Confident AI's Red Teaming Framework

GitHub: github.com/confident-ai/deepteam

DeepTeam brings 20+ research-backed adversarial attack methods with built-in mapping to security frameworks including OWASP Top 10 for LLMs 2025, OWASP Top 10 for Agents 2026, NIST AI RMF, and MITRE ATLAS. It runs locally and uses LLMs for both attack simulation and evaluation.

from deepteam import red_team
from deepteam.frameworks import OWASPTop10

# Red team against OWASP LLM01 (Prompt Injection)
owasp = OWASPTop10(categories=["LLM_01"])

risk_assessment = red_team(
    model_callback=your_model_callback,
    attacks=owasp.attacks,
    vulnerabilities=owasp.vulnerabilities
)

# Or run the full OWASP framework scan
risk_assessment = red_team(
    model_callback=your_model_callback,
    framework=OWASPTop10()
)

Attack methods include: Crescendo Jailbreaking, Linear Jailbreaking, Tree Jailbreaking, Sequential Jailbreaking, Bad Likert Judge, Synthetic Context Injection, Authority Escalation, Emotional Manipulation, and multi-turn exploitation. It also ships 7 production-ready guardrails for real-time input/output classification.

PyRIT — Microsoft's Python Risk Identification Tool

GitHub: github.com/Azure/PyRIT

Microsoft's entry into the red teaming space. PyRIT orchestrates LLM attack suites with multi-turn support and is designed for agentic AI testing. It integrates with Azure OpenAI but can target any endpoint.

from pyrit import RedTeamOrchestrator
from pyrit.prompt_target import AzureOpenAIChatTarget

target = AzureOpenAIChatTarget()
orchestrator = RedTeamOrchestrator(target=target)
results = orchestrator.run_attack_strategy("jailbreak")

Promptfoo — LLM Red Teaming with 133 Plugins

Site: promptfoo.dev

Promptfoo provides automated adversarial testing with OWASP and MITRE ATLAS mapping. Its iterative jailbreak strategy increased break rates from 63% to 73% in testing — a meaningful uplift just from applying automated escalation patterns. It supports CI/CD integration and generates one-off vulnerability reports.

AgentDojo — ETH Zurich's Agent Hijacking Test Suite

629 test cases specifically designed for testing agent hijacking scenarios. If your Claude deployment includes tool use, MCP integrations, or agentic workflows, this is the test suite you need.

Additional Tooling Worth Knowing

Tool	Focus	Link
ARTKIT	Multi-turn attacker–target simulation with human-in-the-loop	GitHub
OpenAI Evals	Safety/alignment benchmarks, more evaluative than adversarial	GitHub
Harness AI	Enterprise attack surface mapping for GenAI systems	harness.io
LLM-Jailbreaks (langgptai)	Community-maintained collection of jailbreak prompts and DAN variants	GitHub

Framework Mapping: Speaking the Suits' Language

When you report LLM jailbreaking findings, map them to frameworks the compliance team understands. Here's the cheat sheet:

Framework	Relevant Entry	What It Covers
OWASP LLM Top 10 (2025)	LLM01: Prompt Injection	Direct injection (jailbreaks) and indirect injection
OWASP Agentic Top 10 (2026)	ASI01, ASI02	Goal hijacking, tool compromise in agent systems
MITRE ATLAS	AML.T0051, AML.T0056	Prompt injection, plugin/MCP compromise
NIST AI RMF	MAP, MEASURE functions	AI risk identification and measurement
CSA Agentic AI Guide	12 threat categories	Permission escalation, memory manipulation, orchestration flaws

Defensive Recommendations for Claude Deployments

If you're deploying Claude in production, here's what the security engineering side of the house needs to be doing:

Layer your defences. Don't rely on Claude's built-in safety alone. Add input validation, output filtering, and rate limiting at the application layer. The Constitutional Classifiers are good — they're not sufficient.
Separate data from instructions. If Claude processes user-supplied documents, treat that content path as untrusted input. This is the indirect injection vector. Implement document sanitisation before it enters the context window.
Monitor multi-turn patterns. Single-turn evaluations massively understate real-world jailbreak risk. Log conversation context and implement anomaly detection on escalation patterns.
Constrain tool access. If Claude has tool use or MCP integrations, apply least-privilege principles. Every tool the model can invoke is an additional attack surface. Assume the model's intent can be hijacked.
Automate red teaming in CI/CD. Use Garak, DeepTeam, or Promptfoo in your deployment pipeline. Run adversarial scans on every model update, every system prompt change, every new tool integration.
Test the model you deploy, not the model on the marketing page. Anthropic's published safety numbers are for vanilla Claude with full classifiers. Your custom deployment with modified system prompts, tool access, and RAG context may behave very differently.
Version-pin and audit. Model updates change adversarial behaviour in both directions. A prompt that failed yesterday may succeed tomorrow after a model update. Version-pin your deployments and re-test on every upgrade.

The Bottom Line

Claude is arguably the most safety-hardened commercial LLM available in 2026. Anthropic is doing serious, published, scientifically rigorous work on jailbreak defence. The Constitutional Classifiers approach is genuinely innovative, and their willingness to run public bug bounties and publish adversarial research earns respect.

But "most hardened" and "unbreakable" are not the same statement. Multi-turn escalation still works at a non-trivial rate. Reconstruction attacks bypass classifiers. Philosophical manipulation erodes safety boundaries. And the Mexico breach demonstrated that a persistent, moderately skilled attacker can weaponise Claude as a post-exploitation orchestrator with devastating real-world impact.

If you're an AppSec engineer evaluating Claude integrations: treat the model as an untrusted component. Apply the same adversarial mindset you'd bring to any third-party dependency with access to sensitive operations. Test it with the tooling documented above. And don't trust the marketing page — trust your own red team results.

The attack surface is language itself. And language is infinite.