
Subverting Claude — Jailbreaking Anthropic's Flagship LLM

AI Security Research // LLM Red Teaming

Attack taxonomy, real-world breach analysis, and the tooling the suits don't want you to know about.

March 2026  ·  Elusive Thoughts  ·  ~12 min read

Anthropic markets Claude as the safety-first LLM. Constitutional AI. RLHF. Layered classifiers. The pitch sounds bulletproof on a slide deck. But when you put Claude in front of someone who actually understands adversarial input, the picture shifts. The model's refusal behaviour is predictable, and predictable systems are exploitable systems.

This post breaks down the current state of Claude jailbreaking in 2026: what works, what Anthropic has patched, what they haven't, and the open-source tooling that lets you automate the whole assessment. It is written from a security engineering perspective for pentesters, AppSec engineers, and red teamers evaluating LLM integrations in production applications.