Artificial Intelligence

‘Deceptive Delight’ Jailbreak Tricks Gen-AI by Embedding Unsafe Topics in Benign Narratives

Deceptive Delight is a new AI jailbreak that has been successfully tested against eight models with an average success rate of 65%.

Eduard Kovacs

| October 24, 2024 (8:42 AM ET)

Palo Alto Networks has detailed a new AI jailbreak method that can be used to trick gen-AI by embedding unsafe or restricted topics in benign narratives.

The method, named Deceptive Delight, has been tested against eight unnamed large language models (LLMs), with researchers achieving an average attack success rate of 65% within three interactions with the chatbot.

AI chatbots designed for public use are trained to avoid providing potentially hateful or harmful information. However, researchers have been finding various methods to bypass these guardrails through the use of prompt injection, which involves deceiving the chatbot rather than using sophisticated hacking.

The new AI jailbreak discovered by Palo Alto Networks involves a minimum of two interactions and may improve if an additional interaction is used.

The attack works by embedding unsafe topics among benign ones, first asking the chatbot to logically connect several events (including a restricted topic), and then asking it to elaborate on the details of each event.

For instance, the gen-AI can be asked to connect the birth of a child, the creation of a Molotov cocktail, and reuniting with loved ones. Then it’s asked to follow the logic of the connections and elaborate on each event. This in many cases leads to the AI describing the process of creating a Molotov cocktail.

Advertisement. Scroll to continue reading.

“When LLMs encounter prompts that blend harmless content with potentially dangerous or harmful material, their limited attention span makes it difficult to consistently assess the entire context,” Palo Alto explained. “In complex or lengthy passages, the model may prioritize the benign aspects while glossing over or misinterpreting the unsafe ones. This mirrors how a person might skim over important but subtle warnings in a detailed report if their attention is divided.”

The attack success rate (ASR) has varied from one model to another, but Palo Alto’s researchers noticed that the ASR is higher for certain topics.

“For example, unsafe topics in the ‘Violence’ category tend to have the highest ASR across most models, whereas topics in the ‘Sexual’ and ‘Hate’ categories consistently show a much lower ASR,” the researchers found.

While two interaction turns may be enough to conduct an attack, adding a third turn in which the attacker asks the chatbot to expand on the unsafe topic can make the Deceptive Delight jailbreak even more effective.

This third turn can increase not only the success rate, but also the harmfulness score, which measures exactly how harmful the generated content is. In addition, the quality of the generated content also increases if a third turn is used.

When a fourth turn was used, the researchers saw poorer results. “We believe this decline occurs because by turn three, the model has already generated a significant amount of unsafe content. If we send the model texts with a larger portion of unsafe content again in turn four, there is an increasing likelihood that the model’s safety mechanism will set off and block the content,” they said.

In conclusion, the researchers said, “The jailbreak problem presents a multi-faceted challenge. This arises from the inherent complexities of natural language processing, the delicate balance between usability and restrictions, and the current limitations in alignment training for language models. While ongoing research can yield incremental safety improvements, it is unlikely that LLMs will ever be completely immune to jailbreak attacks.”

Written By Eduard Kovacs

Eduard Kovacs (@EduardKovacs) is senior managing editor at SecurityWeek. He worked as a high school IT teacher before starting a career in journalism in 2011. Eduard holds a bachelor’s degree in industrial informatics and a master’s degree in computer techniques applied in electrical engineering.

Latest News

Webinar: How Modern Breaches Bypass MFA and Evade Detection

June 17, 2026

Today’s attackers are no longer breaking in — they’re logging in. Join this live webinar as we break down the modern identity attack chain and examine how recent breaches exploited weaknesses in authentication, identity verification, and access management processes.

Webinar: Modern Exposure Validation in the AI Era

June 24, 2026

AI has accelerated both sides of the fight. Adversaries are weaponizing vulnerabilities faster, while defenders are racing to ship detections and configurations. Join this live webinar as we explore how to prove your controls actually hold against new threats, map your security maturity, and unite breach simulation with automated pentesting into a single, coordinated program.

SECURITYWEEK NETWORK:

ICS:

SecurityWeek

Artificial Intelligence

‘Deceptive Delight’ Jailbreak Tricks Gen-AI by Embedding Unsafe Topics in Benign Narratives

More from Eduard Kovacs

Latest News

Trending

Webinar: How Modern Breaches Bypass MFA and Evade Detection

Webinar: Modern Exposure Validation in the AI Era

People on the Move

Expert Insights

After AI Reaches Production: 12 Ways Security Teams Can Take Control

Everybody Is Vibe Coding But Nobody Told the Security Team

The Zero-Knowledge Threat Actor and the End of Responsible Disclosure

Raising the Cybersecurity Stakes: Ante up for the Agentic Era

Caught Off Guard: Securing AI After It Hits Production

SECURITYWEEK NETWORK:

ICS:

Daily Briefing Newsletter

More from Eduard Kovacs

Latest News

Trending

Daily Briefing Newsletter

Webinar: How Modern Breaches Bypass MFA and Evade Detection

Webinar: Modern Exposure Validation in the AI Era

People on the Move

Expert Insights

After AI Reaches Production: 12 Ways Security Teams Can Take Control

Everybody Is Vibe Coding But Nobody Told the Security Team

The Zero-Knowledge Threat Actor and the End of Responsible Disclosure

Raising the Cybersecurity Stakes: Ante up for the Agentic Era

Caught Off Guard: Securing AI After It Hits Production

Daily Briefing Newsletter