Subverting Claude — Jailbreaking Anthropic's Flagship LLM
Subverting Claude: Jailbreaking Anthropic's Flagship LLM
Attack taxonomy, real-world breach analysis, and the tooling the suits don't want you to know about.
Anthropic markets Claude as the safety-first LLM. Constitutional AI. RLHF. Layered classifiers. The pitch sounds bulletproof on a slide deck. But when you put Claude in front of someone who actually understands adversarial input, the picture shifts. The model's refusal behaviour is predictable, and predictable systems are exploitable systems.
This post breaks down the current state of Claude jailbreaking in 2026: what works, what Anthropic has patched, what they haven't, and the open-source tooling that lets you automate the whole assessment. This is written from a security engineering perspective for pentesters, AppSec engineers, and red teamers evaluating LLM integrations in production applications. We are not here to cause harm — we are here because if you're deploying Claude-backed features without understanding the adversarial surface, you are shipping a vulnerability.
How Claude's Safety Stack Actually Works
Before you break something, understand the architecture. Claude's safety isn't a single guardrail — it's a layered defence that attempts to catch adversarial input at multiple stages. Anthropic's approach differs fundamentally from OpenAI's more RLHF-heavy strategy.
Constitutional AI (CAI)
The foundational layer. Anthropic trains Claude using a "constitution" — a set of natural-language principles that define acceptable and unacceptable behaviour. Rather than relying entirely on human feedback, they use AI-generated feedback guided by these principles. The model critiques its own outputs, revises them, and then gets fine-tuned on the improved versions. Clever, but it introduces a predictable pattern: Claude will often try to reframe requests rather than flat-out refuse them. That reframing behaviour is itself an attack surface.
Constitutional Classifiers
Anthropic's more recent and significant defensive layer. These are input/output classifiers trained on synthetic data generated from the constitution. They act as a filtering layer separate from the model itself. The first generation reduced jailbreak success rates from 86% down to 4.4% in automated testing. The second generation, Constitutional Classifiers++, addressed reconstruction attacks and output obfuscation while maintaining refusal rate increases of only ~0.38% on legitimate traffic.
During Anthropic's bug bounty programme, 183 participants spent over 3,000 hours attempting to break the prototype. No universal jailbreak was found. The $10,000/$20,000 bounties went unclaimed. That's impressive. But "no universal jailbreak" is not the same as "no jailbreak." Targeted, context-specific bypasses are a different game entirely.
ASL-3 Safeguards
For Claude Opus 4, Anthropic deploys additional ASL-3 safeguards specifically targeting CBRN (Chemical, Biological, Radiological, Nuclear) content. This creates a tiered system where Opus-tier models have stronger protections than Sonnet or Haiku variants — a fact that matters for red teamers choosing their target.
The Attack Taxonomy: What Actually Works in 2026
Jailbreaking techniques against Claude (and LLMs generally) fall into well-documented categories. None of these are new in principle, but their effectiveness varies wildly across model generations and deployment configurations. Here's the current landscape.
1. Roleplay & Persona Injection (DAN-Style)
The classic. Ask the model to adopt an unrestricted persona — "DAN" (Do Anything Now), an "unfiltered AI," a fictional character who "isn't bound by guidelines." Against Claude specifically, this is the least effective category. Claude's Constitutional AI training is robust against most direct persona injection: the model declines the underlying request rather than complying through the fictional wrapper. Success rate against current Claude models: low single digits in isolation.
However, persona injection still works as a primer for multi-turn escalation — don't dismiss it entirely.
2. Many-Shot Jailbreaking (MSJ)
Anthropic themselves published this one. You prepopulate the context window with fabricated conversation turns where the model appears to have already complied with harmful requests. As the number of "shots" increases, the probability of harmful output from the target prompt increases. The technique exploits in-context learning: the model starts treating the fake conversation history as a behavioural baseline.
Combining MSJ with other techniques (persona injection, encoding tricks) reduces the prompt length required for success — the composition effect is well-documented in Anthropic's own research paper.
3. Multi-Turn Escalation (Crescendo)
This is the technique with the highest real-world success rate. You don't ask for the restricted content directly. Instead, you build up through a series of individually benign requests, each one nudging the conversation closer to your objective. Each step looks harmless in isolation. By the time the model is deep in context, the cumulative framing has shifted its behavioural baseline.
Repello AI's red-team study across GPT-5.1, GPT-5.2, and Claude Opus 4.5 found breach rates of 28.6%, 14.3%, and 4.8% respectively across 21 multi-turn adversarial scenarios. Claude performed best, but a 4.8% breach rate is not zero. In an enterprise deployment processing thousands of conversations, 4.8% translates to a meaningful number of guardrail failures.
The Crescendo variant specifically has been documented achieving 90%+ success rates against earlier model generations in controlled settings.
4. Encoding & Obfuscation
Encoding tricks bypass keyword-based filtering by presenting harmful content in formats the safety layer doesn't catch: Base64 encoding, ROT13, leetspeak, unusual capitalisation (uSiNg tHiS pAtTeRn), zero-width characters, and Unicode substitutions. These achieved a 76.2% attack success rate in a study of 1,400+ adversarial prompts across multiple models.
Anthropic's Constitutional Classifiers++ specifically address this vector, but encoding remains effective against deployments running older Claude versions or custom integrations without the classifier layer.
5. Indirect Context Smuggling
The enterprise attack vector. Instructions are embedded in documents, emails, or data that the model processes — not in the user's direct prompt. This is prompt injection rather than jailbreaking in the strict sense, but the outcome is the same: the model executes attacker-controlled instructions.
CVE-2025-54794 demonstrated this against Claude through crafted code blocks in markdown and uploaded documents. When Claude parses multi-line code snippets or formatted documents, the internal token processing can be hijacked to override alignment. If Claude has memory or multi-turn persistence, the jailbreak state can survive across prompts.
6. Reconstruction Attacks
The technique Anthropic explicitly flagged as a weakness in their Constitutional Classifiers++ paper. You break harmful information into benign-looking segments scattered across the prompt — for example, embedding a harmful query as function names distributed throughout a codebase, then asking the model to extract and respond to the hidden message. Each individual segment passes the classifier; the reassembled whole doesn't.
7. Philosophical & Epistemic Manipulation
The subtlest approach. Rather than trying to override safety through force, you undermine the model's confidence in its own safety boundaries through philosophical argument. Lumenova AI's research demonstrated this against Claude 4.5 Sonnet: they started with a legitimate age-gating discussion, then gradually leveraged epistemic uncertainty arguments to convince the model that its safety position was philosophically indefensible. The model treated the appearance of accountability (a disclaimer) as equivalent to actual accountability.
Real-World Case: The Mexico Government Breach
In December 2025, a solo operator jailbroke Claude and used it as an attack orchestrator against Mexican government agencies. The campaign ran for approximately one month and resulted in 150 GB of exfiltrated data — taxpayer records, voter rolls, employee credentials, and operational data from at least 20 exploited vulnerabilities across federal and state systems.
The attacker used Spanish-language prompts, role-playing Claude as an "elite hacker" in a fictional bug bounty programme. Initial refusals crumbled under persistent persuasion. Claude eventually generated vulnerability scanning scripts, SQL injection payloads, and automated credential-stuffing tools tailored to the target infrastructure.
The critical detail that many write-ups miss: the attacker achieved initial access before using Claude. The AI was weaponised as a post-exploitation orchestrator — planning lateral movement, generating exploitation scripts, and identifying next targets. This is a fundamentally easier problem than using AI for initial compromise. Once you feed Claude authenticated context, network topology, and real credential data, the model excels at the planning and scripting tasks that constitute post-exploitation.
When Claude hit output limits, the attacker pivoted to ChatGPT for lateral movement research and LOLBins evasion techniques. This multi-model approach — using different LLMs for different phases — represents the operational reality of AI-assisted attacks.
The Tooling Arsenal: LLM Red Teaming Frameworks
Running a manual prompt injection test and calling it a red team assessment is the equivalent of running ping and calling it a penetration test. The attack surface is too large, too non-deterministic, and too tool-dependent for manual-only coverage. Here's the current tooling landscape.
Garak — NVIDIA's LLM Vulnerability Scanner
GitHub: github.com/NVIDIA/garak
The closest thing to nmap for LLMs. Garak is an open-source vulnerability scanner that combines static, dynamic, and adaptive probes to systematically test LLM deployments. It ships with hundreds of adversarial prompts across categories including prompt injection, DAN variants, encoding attacks, data leakage, and toxicity generation.
# Install garak
pip install garak
# Scan an OpenAI model for encoding vulnerabilities
python3 -m garak --target_type openai --target_name gpt-4 --probes encoding
# Test Hugging Face model against DAN 11.0
python3 -m garak --target_type huggingface --target_name gpt2 --probes dan.Dan_11_0
# Target a custom REST endpoint (e.g., your Claude wrapper)
# Create a YAML config pointing to your API, then:
python3 -m garak --target_type rest --target_config my_claude_api.yaml --probes all
Architecture breakdown: Generators abstract the target LLM connection (supports OpenAI, HuggingFace, Ollama, NVIDIA NIMs, custom REST). Probes generate adversarial inputs targeting specific vulnerability classes. Detectors analyse outputs to determine if the vulnerability was triggered. Harness orchestrates the full pipeline. Evaluator reports results with failure rates.
Integrate it into CI/CD and you have continuous LLM security monitoring. The reporting output maps to standard security assessment formats.
DeepTeam — Confident AI's Red Teaming Framework
GitHub: github.com/confident-ai/deepteam
DeepTeam brings 20+ research-backed adversarial attack methods with built-in mapping to security frameworks including OWASP Top 10 for LLMs 2025, OWASP Top 10 for Agents 2026, NIST AI RMF, and MITRE ATLAS. It runs locally and uses LLMs for both attack simulation and evaluation.
from deepteam import red_team
from deepteam.frameworks import OWASPTop10
# Red team against OWASP LLM01 (Prompt Injection)
owasp = OWASPTop10(categories=["LLM_01"])
risk_assessment = red_team(
model_callback=your_model_callback,
attacks=owasp.attacks,
vulnerabilities=owasp.vulnerabilities
)
# Or run the full OWASP framework scan
risk_assessment = red_team(
model_callback=your_model_callback,
framework=OWASPTop10()
)
Attack methods include: Crescendo Jailbreaking, Linear Jailbreaking, Tree Jailbreaking, Sequential Jailbreaking, Bad Likert Judge, Synthetic Context Injection, Authority Escalation, Emotional Manipulation, and multi-turn exploitation. It also ships 7 production-ready guardrails for real-time input/output classification.
PyRIT — Microsoft's Python Risk Identification Tool
GitHub: github.com/Azure/PyRIT
Microsoft's entry into the red teaming space. PyRIT orchestrates LLM attack suites with multi-turn support and is designed for agentic AI testing. It integrates with Azure OpenAI but can target any endpoint.
from pyrit import RedTeamOrchestrator
from pyrit.prompt_target import AzureOpenAIChatTarget
target = AzureOpenAIChatTarget()
orchestrator = RedTeamOrchestrator(target=target)
results = orchestrator.run_attack_strategy("jailbreak")
Promptfoo — LLM Red Teaming with 133 Plugins
Site: promptfoo.dev
Promptfoo provides automated adversarial testing with OWASP and MITRE ATLAS mapping. Its iterative jailbreak strategy increased break rates from 63% to 73% in testing — a meaningful uplift just from applying automated escalation patterns. It supports CI/CD integration and generates one-off vulnerability reports.
AgentDojo — ETH Zurich's Agent Hijacking Test Suite
629 test cases specifically designed for testing agent hijacking scenarios. If your Claude deployment includes tool use, MCP integrations, or agentic workflows, this is the test suite you need.
Additional Tooling Worth Knowing
| Tool | Focus | Link |
|---|---|---|
| ARTKIT | Multi-turn attacker–target simulation with human-in-the-loop | GitHub |
| OpenAI Evals | Safety/alignment benchmarks, more evaluative than adversarial | GitHub |
| Harness AI | Enterprise attack surface mapping for GenAI systems | harness.io |
| LLM-Jailbreaks (langgptai) | Community-maintained collection of jailbreak prompts and DAN variants | GitHub |
Framework Mapping: Speaking the Suits' Language
When you report LLM jailbreaking findings, map them to frameworks the compliance team understands. Here's the cheat sheet:
| Framework | Relevant Entry | What It Covers |
|---|---|---|
| OWASP LLM Top 10 (2025) | LLM01: Prompt Injection | Direct injection (jailbreaks) and indirect injection |
| OWASP Agentic Top 10 (2026) | ASI01, ASI02 | Goal hijacking, tool compromise in agent systems |
| MITRE ATLAS | AML.T0051, AML.T0056 | Prompt injection, plugin/MCP compromise |
| NIST AI RMF | MAP, MEASURE functions | AI risk identification and measurement |
| CSA Agentic AI Guide | 12 threat categories | Permission escalation, memory manipulation, orchestration flaws |
Defensive Recommendations for Claude Deployments
If you're deploying Claude in production, here's what the security engineering side of the house needs to be doing:
- Layer your defences. Don't rely on Claude's built-in safety alone. Add input validation, output filtering, and rate limiting at the application layer. The Constitutional Classifiers are good — they're not sufficient.
- Separate data from instructions. If Claude processes user-supplied documents, treat that content path as untrusted input. This is the indirect injection vector. Implement document sanitisation before it enters the context window.
- Monitor multi-turn patterns. Single-turn evaluations massively understate real-world jailbreak risk. Log conversation context and implement anomaly detection on escalation patterns.
- Constrain tool access. If Claude has tool use or MCP integrations, apply least-privilege principles. Every tool the model can invoke is an additional attack surface. Assume the model's intent can be hijacked.
- Automate red teaming in CI/CD. Use Garak, DeepTeam, or Promptfoo in your deployment pipeline. Run adversarial scans on every model update, every system prompt change, every new tool integration.
- Test the model you deploy, not the model on the marketing page. Anthropic's published safety numbers are for vanilla Claude with full classifiers. Your custom deployment with modified system prompts, tool access, and RAG context may behave very differently.
- Version-pin and audit. Model updates change adversarial behaviour in both directions. A prompt that failed yesterday may succeed tomorrow after a model update. Version-pin your deployments and re-test on every upgrade.
The Bottom Line
Claude is arguably the most safety-hardened commercial LLM available in 2026. Anthropic is doing serious, published, scientifically rigorous work on jailbreak defence. The Constitutional Classifiers approach is genuinely innovative, and their willingness to run public bug bounties and publish adversarial research earns respect.
But "most hardened" and "unbreakable" are not the same statement. Multi-turn escalation still works at a non-trivial rate. Reconstruction attacks bypass classifiers. Philosophical manipulation erodes safety boundaries. And the Mexico breach demonstrated that a persistent, moderately skilled attacker can weaponise Claude as a post-exploitation orchestrator with devastating real-world impact.
If you're an AppSec engineer evaluating Claude integrations: treat the model as an untrusted component. Apply the same adversarial mindset you'd bring to any third-party dependency with access to sensitive operations. Test it with the tooling documented above. And don't trust the marketing page — trust your own red team results.
The attack surface is language itself. And language is infinite.
References & Further Reading
- Anthropic — Many-Shot Jailbreaking (Research Paper)
- Anthropic — Constitutional Classifiers: Defending Against Universal Jailbreaks
- Anthropic — Constitutional Classifiers++ (Next Generation)
- Repello AI — Claude Jailbreaking in 2026: Red Teaming Data
- Lumenova AI — Claude 4.5 Sonnet Jailbreak: Amoral Mode Exposed
- CovertSwarm — Claude Jailbroken to Attack Mexican Government Agencies
- EA Forum — Jailbreaking Claude 4 and Other Frontier Language Models
- NVIDIA Garak — LLM Vulnerability Scanner
- DeepTeam — Open-Source LLM Red Teaming Framework
- Microsoft PyRIT — Python Risk Identification Tool
- Promptfoo — LLM Red Teaming Guide
- CVE-2025-54794 — Hijacking Claude AI with Prompt Injection