
Subverting Claude — Jailbreaking Anthropic's Flagship LLM

AI Security Research // LLM Red Teaming

Attack taxonomy, real-world breach analysis, and the tooling the suits don't want you to know about.

March 2026  ·  Elusive Thoughts  ·  ~12 min read

Anthropic markets Claude as the safety-first LLM. Constitutional AI. RLHF. Layered classifiers. The pitch sounds bulletproof on a slide deck. But when you put Claude in front of someone who actually understands adversarial input, the picture shifts. The model's refusal behaviour is predictable, and predictable systems are exploitable systems.

This post breaks down the current state of Claude jailbreaking in 2026: what works, what Anthropic has patched, what they haven't, and the open-source tooling that lets you automate the whole assessment. It is written from a security engineering perspective for pentesters, AppSec engineers, and red teamers evaluating LLM integrations in production applications.