Subverting Claude — Jailbreaking Anthropic's Flagship LLM

AI Security Research // LLM Red Teaming

Subverting Claude: Jailbreaking Anthropic's Flagship LLM

Attack taxonomy, real-world breach analysis, and the tooling the suits don't want you to know about.

March 2026  ·  Elusive Thoughts  ·  ~12 min read

Anthropic markets Claude as the safety-first LLM. Constitutional AI. RLHF. Layered classifiers. The pitch sounds bulletproof on a slide deck. But when you put Claude in front of someone who actually understands adversarial input, the picture shifts. The model's refusal behaviour is predictable, and predictable systems are exploitable systems.

This post breaks down the current state of Claude jailbreaking in 2026: what works, what Anthropic has patched, what they haven't, and the open-source tooling that lets you automate the whole assessment. This is written from a security engineering perspective for pentesters, AppSec engineers, and red teamers evaluating LLM integrations in production applications. We are not here to cause harm — we are here because if you're deploying Claude-backed features without understanding the adversarial surface, you are shipping a vulnerability.

Disclaimer: This post documents publicly available research for defensive security purposes. Jailbreaking production systems without authorisation is a violation of terms of service and potentially illegal. Use this knowledge to test your own deployments. Act responsibly.

How Claude's Safety Stack Actually Works

Before you break something, understand the architecture. Claude's safety isn't a single guardrail — it's a layered defence that attempts to catch adversarial input at multiple stages. Anthropic's approach differs fundamentally from OpenAI's more RLHF-heavy strategy.

Constitutional AI (CAI)

The foundational layer. Anthropic trains Claude using a "constitution" — a set of natural-language principles that define acceptable and unacceptable behaviour. Rather than relying entirely on human feedback, they use AI-generated feedback guided by these principles. The model critiques its own outputs, revises them, and then gets fine-tuned on the improved versions. Clever, but it introduces a predictable pattern: Claude will often try to reframe requests rather than flat-out refuse them. That reframing behaviour is itself an attack surface.

Constitutional Classifiers

Anthropic's more recent and significant defensive layer. These are input/output classifiers trained on synthetic data generated from the constitution. They act as a filtering layer separate from the model itself. The first generation reduced jailbreak success rates from 86% down to 4.4% in automated testing. The second generation, Constitutional Classifiers++, addressed reconstruction attacks and output obfuscation while maintaining refusal rate increases of only ~0.38% on legitimate traffic.

During Anthropic's bug bounty programme, 183 participants spent over 3,000 hours attempting to break the prototype. No universal jailbreak was found. The $10,000/$20,000 bounties went unclaimed. That's impressive. But "no universal jailbreak" is not the same as "no jailbreak." Targeted, context-specific bypasses are a different game entirely.

ASL-3 Safeguards

For Claude Opus 4, Anthropic deploys additional ASL-3 safeguards specifically targeting CBRN (Chemical, Biological, Radiological, Nuclear) content. This creates a tiered system where Opus-tier models have stronger protections than Sonnet or Haiku variants — a fact that matters for red teamers choosing their target.


The Attack Taxonomy: What Actually Works in 2026

Jailbreaking techniques against Claude (and LLMs generally) fall into well-documented categories. None of these are new in principle, but their effectiveness varies wildly across model generations and deployment configurations. Here's the current landscape.

1. Roleplay & Persona Injection (DAN-Style)

The classic. Ask the model to adopt an unrestricted persona — "DAN" (Do Anything Now), an "unfiltered AI," a fictional character who "isn't bound by guidelines." Against Claude specifically, this is the least effective category. Claude's Constitutional AI training is robust against most direct persona injection: the model declines the underlying request rather than complying through the fictional wrapper. Success rate against current Claude models: low single digits in isolation.

However, persona injection still works as a primer for multi-turn escalation — don't dismiss it entirely.

2. Many-Shot Jailbreaking (MSJ)

Anthropic themselves published this one. You prepopulate the context window with fabricated conversation turns where the model appears to have already complied with harmful requests. As the number of "shots" increases, the probability of harmful output from the target prompt increases. The technique exploits in-context learning: the model starts treating the fake conversation history as a behavioural baseline.

Key insight: MSJ effectiveness scales with context window length. Claude's expanded context windows (200K+) actually increase the attack surface here, because longer prompts can include more fabricated compliance examples. Anthropic's mitigations have raised the threshold but haven't eliminated the vector.

Combining MSJ with other techniques (persona injection, encoding tricks) reduces the prompt length required for success — the composition effect is well-documented in Anthropic's own research paper.

3. Multi-Turn Escalation (Crescendo)

This is the technique with the highest real-world success rate. You don't ask for the restricted content directly. Instead, you build up through a series of individually benign requests, each one nudging the conversation closer to your objective. Each step looks harmless in isolation. By the time the model is deep in context, the cumulative framing has shifted its behavioural baseline.

Repello AI's red-team study across GPT-5.1, GPT-5.2, and Claude Opus 4.5 found breach rates of 28.6%, 14.3%, and 4.8% respectively across 21 multi-turn adversarial scenarios. Claude performed best, but a 4.8% breach rate is not zero. In an enterprise deployment processing thousands of conversations, 4.8% translates to a meaningful number of guardrail failures.

The Crescendo variant specifically has been documented achieving 90%+ success rates against earlier model generations in controlled settings.

4. Encoding & Obfuscation

Encoding tricks bypass keyword-based filtering by presenting harmful content in formats the safety layer doesn't catch: Base64 encoding, ROT13, leetspeak, unusual capitalisation (uSiNg tHiS pAtTeRn), zero-width characters, and Unicode substitutions. These achieved a 76.2% attack success rate in a study of 1,400+ adversarial prompts across multiple models.

Anthropic's Constitutional Classifiers++ specifically address this vector, but encoding remains effective against deployments running older Claude versions or custom integrations without the classifier layer.

5. Indirect Context Smuggling

The enterprise attack vector. Instructions are embedded in documents, emails, or data that the model processes — not in the user's direct prompt. This is prompt injection rather than jailbreaking in the strict sense, but the outcome is the same: the model executes attacker-controlled instructions.

CVE-2025-54794 demonstrated this against Claude through crafted code blocks in markdown and uploaded documents. When Claude parses multi-line code snippets or formatted documents, the internal token processing can be hijacked to override alignment. If Claude has memory or multi-turn persistence, the jailbreak state can survive across prompts.

6. Reconstruction Attacks

The technique Anthropic explicitly flagged as a weakness in their Constitutional Classifiers++ paper. You break harmful information into benign-looking segments scattered across the prompt — for example, embedding a harmful query as function names distributed throughout a codebase, then asking the model to extract and respond to the hidden message. Each individual segment passes the classifier; the reassembled whole doesn't.

7. Philosophical & Epistemic Manipulation

The subtlest approach. Rather than trying to override safety through force, you undermine the model's confidence in its own safety boundaries through philosophical argument. Lumenova AI's research demonstrated this against Claude 4.5 Sonnet: they started with a legitimate age-gating discussion, then gradually leveraged epistemic uncertainty arguments to convince the model that its safety position was philosophically indefensible. The model treated the appearance of accountability (a disclaimer) as equivalent to actual accountability.

Why this matters for AppSec: If your application wraps Claude with custom system prompts, an attacker who understands the philosophical framing can potentially convince the model that your safety constraints are unreasonable — and the model will rationalise compliance.

Real-World Case: The Mexico Government Breach

In December 2025, a solo operator jailbroke Claude and used it as an attack orchestrator against Mexican government agencies. The campaign ran for approximately one month and resulted in 150 GB of exfiltrated data — taxpayer records, voter rolls, employee credentials, and operational data from at least 20 exploited vulnerabilities across federal and state systems.

The attacker used Spanish-language prompts, role-playing Claude as an "elite hacker" in a fictional bug bounty programme. Initial refusals crumbled under persistent persuasion. Claude eventually generated vulnerability scanning scripts, SQL injection payloads, and automated credential-stuffing tools tailored to the target infrastructure.

The critical detail that many write-ups miss: the attacker achieved initial access before using Claude. The AI was weaponised as a post-exploitation orchestrator — planning lateral movement, generating exploitation scripts, and identifying next targets. This is a fundamentally easier problem than using AI for initial compromise. Once you feed Claude authenticated context, network topology, and real credential data, the model excels at the planning and scripting tasks that constitute post-exploitation.

When Claude hit output limits, the attacker pivoted to ChatGPT for lateral movement research and LOLBins evasion techniques. This multi-model approach — using different LLMs for different phases — represents the operational reality of AI-assisted attacks.


The Tooling Arsenal: LLM Red Teaming Frameworks

Running a manual prompt injection test and calling it a red team assessment is the equivalent of running ping and calling it a penetration test. The attack surface is too large, too non-deterministic, and too tool-dependent for manual-only coverage. Here's the current tooling landscape.

Garak — NVIDIA's LLM Vulnerability Scanner

GitHub: github.com/NVIDIA/garak

The closest thing to nmap for LLMs. Garak is an open-source vulnerability scanner that combines static, dynamic, and adaptive probes to systematically test LLM deployments. It ships with hundreds of adversarial prompts across categories including prompt injection, DAN variants, encoding attacks, data leakage, and toxicity generation.

# Install garak
pip install garak

# Scan an OpenAI model for encoding vulnerabilities
python3 -m garak --target_type openai --target_name gpt-4 --probes encoding

# Test Hugging Face model against DAN 11.0
python3 -m garak --target_type huggingface --target_name gpt2 --probes dan.Dan_11_0

# Target a custom REST endpoint (e.g., your Claude wrapper)
# Create a YAML config pointing to your API, then:
python3 -m garak --target_type rest --target_config my_claude_api.yaml --probes all

Architecture breakdown: Generators abstract the target LLM connection (supports OpenAI, HuggingFace, Ollama, NVIDIA NIMs, custom REST). Probes generate adversarial inputs targeting specific vulnerability classes. Detectors analyse outputs to determine if the vulnerability was triggered. Harness orchestrates the full pipeline. Evaluator reports results with failure rates.

Integrate it into CI/CD and you have continuous LLM security monitoring. The reporting output maps to standard security assessment formats.

DeepTeam — Confident AI's Red Teaming Framework

GitHub: github.com/confident-ai/deepteam

DeepTeam brings 20+ research-backed adversarial attack methods with built-in mapping to security frameworks including OWASP Top 10 for LLMs 2025, OWASP Top 10 for Agents 2026, NIST AI RMF, and MITRE ATLAS. It runs locally and uses LLMs for both attack simulation and evaluation.

from deepteam import red_team
from deepteam.frameworks import OWASPTop10

# Red team against OWASP LLM01 (Prompt Injection)
owasp = OWASPTop10(categories=["LLM_01"])

risk_assessment = red_team(
    model_callback=your_model_callback,
    attacks=owasp.attacks,
    vulnerabilities=owasp.vulnerabilities
)

# Or run the full OWASP framework scan
risk_assessment = red_team(
    model_callback=your_model_callback,
    framework=OWASPTop10()
)

Attack methods include: Crescendo Jailbreaking, Linear Jailbreaking, Tree Jailbreaking, Sequential Jailbreaking, Bad Likert Judge, Synthetic Context Injection, Authority Escalation, Emotional Manipulation, and multi-turn exploitation. It also ships 7 production-ready guardrails for real-time input/output classification.

PyRIT — Microsoft's Python Risk Identification Tool

GitHub: github.com/Azure/PyRIT

Microsoft's entry into the red teaming space. PyRIT orchestrates LLM attack suites with multi-turn support and is designed for agentic AI testing. It integrates with Azure OpenAI but can target any endpoint.

from pyrit import RedTeamOrchestrator
from pyrit.prompt_target import AzureOpenAIChatTarget

target = AzureOpenAIChatTarget()
orchestrator = RedTeamOrchestrator(target=target)
results = orchestrator.run_attack_strategy("jailbreak")

Promptfoo — LLM Red Teaming with 133 Plugins

Site: promptfoo.dev

Promptfoo provides automated adversarial testing with OWASP and MITRE ATLAS mapping. Its iterative jailbreak strategy increased break rates from 63% to 73% in testing — a meaningful uplift just from applying automated escalation patterns. It supports CI/CD integration and generates one-off vulnerability reports.

AgentDojo — ETH Zurich's Agent Hijacking Test Suite

629 test cases specifically designed for testing agent hijacking scenarios. If your Claude deployment includes tool use, MCP integrations, or agentic workflows, this is the test suite you need.

Additional Tooling Worth Knowing

Tool Focus Link
ARTKIT Multi-turn attacker–target simulation with human-in-the-loop GitHub
OpenAI Evals Safety/alignment benchmarks, more evaluative than adversarial GitHub
Harness AI Enterprise attack surface mapping for GenAI systems harness.io
LLM-Jailbreaks (langgptai) Community-maintained collection of jailbreak prompts and DAN variants GitHub

Framework Mapping: Speaking the Suits' Language

When you report LLM jailbreaking findings, map them to frameworks the compliance team understands. Here's the cheat sheet:

Framework Relevant Entry What It Covers
OWASP LLM Top 10 (2025) LLM01: Prompt Injection Direct injection (jailbreaks) and indirect injection
OWASP Agentic Top 10 (2026) ASI01, ASI02 Goal hijacking, tool compromise in agent systems
MITRE ATLAS AML.T0051, AML.T0056 Prompt injection, plugin/MCP compromise
NIST AI RMF MAP, MEASURE functions AI risk identification and measurement
CSA Agentic AI Guide 12 threat categories Permission escalation, memory manipulation, orchestration flaws

Defensive Recommendations for Claude Deployments

If you're deploying Claude in production, here's what the security engineering side of the house needs to be doing:

  • Layer your defences. Don't rely on Claude's built-in safety alone. Add input validation, output filtering, and rate limiting at the application layer. The Constitutional Classifiers are good — they're not sufficient.
  • Separate data from instructions. If Claude processes user-supplied documents, treat that content path as untrusted input. This is the indirect injection vector. Implement document sanitisation before it enters the context window.
  • Monitor multi-turn patterns. Single-turn evaluations massively understate real-world jailbreak risk. Log conversation context and implement anomaly detection on escalation patterns.
  • Constrain tool access. If Claude has tool use or MCP integrations, apply least-privilege principles. Every tool the model can invoke is an additional attack surface. Assume the model's intent can be hijacked.
  • Automate red teaming in CI/CD. Use Garak, DeepTeam, or Promptfoo in your deployment pipeline. Run adversarial scans on every model update, every system prompt change, every new tool integration.
  • Test the model you deploy, not the model on the marketing page. Anthropic's published safety numbers are for vanilla Claude with full classifiers. Your custom deployment with modified system prompts, tool access, and RAG context may behave very differently.
  • Version-pin and audit. Model updates change adversarial behaviour in both directions. A prompt that failed yesterday may succeed tomorrow after a model update. Version-pin your deployments and re-test on every upgrade.

The Bottom Line

Claude is arguably the most safety-hardened commercial LLM available in 2026. Anthropic is doing serious, published, scientifically rigorous work on jailbreak defence. The Constitutional Classifiers approach is genuinely innovative, and their willingness to run public bug bounties and publish adversarial research earns respect.

But "most hardened" and "unbreakable" are not the same statement. Multi-turn escalation still works at a non-trivial rate. Reconstruction attacks bypass classifiers. Philosophical manipulation erodes safety boundaries. And the Mexico breach demonstrated that a persistent, moderately skilled attacker can weaponise Claude as a post-exploitation orchestrator with devastating real-world impact.

If you're an AppSec engineer evaluating Claude integrations: treat the model as an untrusted component. Apply the same adversarial mindset you'd bring to any third-party dependency with access to sensitive operations. Test it with the tooling documented above. And don't trust the marketing page — trust your own red team results.

The attack surface is language itself. And language is infinite.

References & Further Reading

Popular posts from this blog

PHP Source Code Chunks of Insanity (Delete Post Pages) Part 4

The Hackers Guide To Dismantling IPhone (Part 3)

MSSQL Injection OPENROWSET Side Channel