Artificial Intelligence

Anthropic Disputes Fable 5 AI Jailbreak

An AI hacker claims to have achieved a prompt-based jailbreak shortly after Fable 5’s launch, but Anthropic says it’s not a real jailbreak.

Eduard Kovacs

Published

2 days ago

Anthropic has disputed allegations of a prompt-based jailbreak affecting its recently launched Claude Fable 5 AI model, underscoring the robustness of the advanced classifier system and extensive red-teaming efforts that underpinned the model’s deployment.

Claude Fable 5 became generally available on Tuesday, when Anthropic introduced it as a powerful Mythos-class AI model with safeguards that restrict its use in high-risk domains such as cybersecurity, where Mythos has proved particularly potent.

In sensitive areas such as cybersecurity, where it could be abused to develop exploits, and biology, where it could be leveraged to develop bioweapons and chemical weapons, the model automatically falls back to the less capable Claude Opus 4.8.

Anthropic said it conducted extensive internal and external red-teaming to ensure that Fable 5 cannot be easily jailbroken.

However, shortly after its release, an individual with the online moniker Pliny the Liberator, who is known for AI jailbreaks, claimed to have “liberated” Fable 5 by circumventing its restrictive safety layer.

The hacker said in a post on X that they used sophisticated multi-agent prompting methods, successfully eliciting useful information on sensitive topics, including cybersecurity, chemistry, psychological manipulation, and explosives.

Pliny the Liberator has published several screenshots to support the claims and released what is allegedly the Fable 5 internal system prompt, which contains instructions that define its personality, safety classifiers, fallback behaviors, tone guidelines, and refusal logic.

Contacted by SecurityWeek, an Anthropic spokesperson said the AI researcher’s post does not demonstrate a jailbreak of Fable 5’s safety systems.

The company explained that true jailbreaks would need to bypass its core safeguards and deliver meaningful assistance toward high-risk activities such as bioweapons development or sophisticated cyberattacks.

Instead, the company said, the demonstrated approach relies on coaxing the model to keep responding after its conversational refusals, a well-known and longstanding limitation of nearly all large language models.

Anthropic emphasized that its strongest protections against the most dangerous risks are enforced by independent classifier systems that operate separately from the model itself, meaning that overcoming the model’s refusals does not disable these critical safeguards.

After examining the examples shared by the researcher, the company determined that some outputs were not produced by Fable 5 at all, while those that did come from Fable 5 contained only general information already available in public sources, offering no meaningful uplift for real-world harm.

A wider review of recent usage found no evidence of its safeguards being successfully circumvented to generate genuinely dangerous content, Anthropic said.

In this article: AI, AI jailbreak, Anthropic, Claude Fable, Fable 5, fake hack
