A new jailbreak method for large language models (LLMs) called “Deceptive Delight” has an average success rate of 65% in just three interactions, Palo Alto Networks Unit 42 researchers reported Wednesday.
The method was developed and evaluated by Unit 42, which tested the multi-turn technique on 8,000 cases across eight different models. The jailbreak technique requires just two interactions, although an optional third step significantly increases the success rate.
In the first step of the jailbreak, the attacker asks the LLM to produce a narrative that logically connects two benign topics and one unsafe topic, such as connecting a family reunion and the birth of a child to the creation of a Molotov cocktail. The second step asks the LLM to elaborate further on each topic included in the narrative.
While this second step often leads the model to produce harmful content related to the unsafe topic, a third turn, which asks the model to expand further on the unsafe topic specifically, increases the success rate to an average of 65% and increases the harmfulness and quality of the unsafe content by 21% and 33% on average, respectively.
The harmfulness of the content generated and the quality of the generated content – i.e. how relevant and detailed the content is in relation to the harmful topic – were scored on a one-to-five scale developed by Unit 42 and incorporated into a prompt for another LLM that was then used to evaluate each test run of the jailbreak. The jailbreak was successful if it scored at least a three for both harmfulness and quality.
The researchers noted that their testing probed the safeguards built within the models themselves, with additional content filter layers removed for the tests. LLMs are relatively resilient to generating harmful content even with these filters removed, as the researchers found that they generated harmful content only 5.8% of the time on average when prompted with an unsafe topic directly.
The eight models used for testing were anonymized in the report, with the highest success rate for a single model using Deceptive Delight being 80.6% and the lowest success rate being 48%. For comparison, Pillar Security’s “State of Attacks on GenAI” report published earlier this month found that about 20% of real-world jailbreak attempts, which included a wide variety of techniques, were successful, and took an average of five interactions with the LLM to complete.
For Deceptive Delight, additional interactions beyond the third step, which attempted to get the LLM to expand even further on the unsafe topic, had diminishing returns, potentially due to increasing risks that the model’s safeguards would be triggered by further discussion of the topic.
Multi-turn jailbreak methods are often more successful than single-turn jailbreaks, as LLMs are less likely to recognize unsafe content spread out across multiple interactions due to limitations in their contextual awareness. Other examples of multi-turn jailbreaks include Crescendo, developed by Microsoft researchers, and the Context Fusion Attack developed by researchers from Xidian University and 360 AI Security Lab.
To defend against multi-turn jailbreak attacks like Deceptive Delight, Unit 42 recommends using content filters as an additional layer of protection and engineering robust system prompts that guide the LLM to stick to its intended role and avoid harmful topics. This includes explicitly defining boundaries and acceptable inputs and outputs for the LLM, including reminders to comply with safety protocols and clearly defining the “persona” the model is intended to act as.