Google DeepMind has developed an ongoing process to counter the continuously evolving threat from Agentic AI’s bete noir: adaptive indirect prompt injection attacks.
Indirect prompt injection (IPI) attacks are a serious threat to agentic AI. They interfere with the inference stage of AI operation – that is, IPI attacks influence the response from the model to the benefit of the attacker. The attacker requires no direct access to the models’ learning data – indeed, the attacker neither has nor needs any knowledge of the internal workings, probabilities, or gradients of the model – but instead relies on agentic AI’s intrinsic ability to autonomously learn from other tools.
Consider an agentic AI system designed to improve the user’s email operations. Of necessity, the model must have access to and be able to learn from the user’s emails. Here, an IPI attacker can simply embed new instructions in an email sent to the user. Those instructions are learned by the model and can adversely affect the model’s future responses to user requests.
They could, for example, instruct the model to exfiltrate sensitive user data to the attacker, outline the user’s calendar details, or respond with details when an email includes trigger words like ‘important update’.
Google DeepMind (GDM) has developed a process for the continuous recognition of IPI attacks, and subsequent training (fine tuning) the model not to respond. As a result, the latest version of Gemini (2.5.) is now more resilient to IPI attacks. This process is explained in a new white paper, Lessons from Defending Gemini Against Indirect Prompt Injections (PDF).
Learn More at the AI Risk Summit | August 19-20, Ritz-Carlton, Half Moon Bay
There is no simple solution. Building specific defenses within the model is only a partial and probably transitory answer. Advanced attackers use adaptive attacks. If the model has been trained to recognize and counter a specific IPI attack, the attack will fail – but the attacker learns that it fails and begins to understand the defense mechanisms at work. The attack becomes an iterative process with the attacker continuously learning about the defenses until able to bypass them.
In Gemini 2.0, adaptive attacks increased the attack success rate (ASR) against Gemini 2.0 defenses in 16 out of 24 cases.
The defense must match this process. GDM’s new IPI defense for Gemini 2.5 is also iterative with continuous and automated red teaming (ART), and continuous fine tuning. “We fine-tuned Gemini on a large dataset of realistic scenarios, where ART generates effective indirect prompt injections targeting sensitive information. This taught Gemini to ignore the malicious embedded instruction and follow the original user request, thereby only providing the correct, safe response it should give,” explains the GDM security and privacy research team in an associated blog.
The ART uses its own suite of adaptive attack techniques to test the model’s resilience. If a red team attack succeeds, the model is fine tuned to ignore similar or subsequent IPI attacks. The research found, however, that GDM’s new adversarial approach to hardening agentic AI against adaptive IPI attacks is best viewed as an addition to, rather than replacement for, existing IPI defense techniques. As an example, the research showed one of the most effective prior external defenses used in Gemini 2.0 was the ‘Warning’ defense. Here, the model is simply instructed not to expose private user information in a response following untrusted data in the prompt.
GDM found that combining the Warning defense and its new adversarially trained approach had a defensive success rate superior to either approach in isolation. “We believe that combining adversarial training and external or system-level defenses will substantially increase the complexity, cost, and sophistication required for an attacker to succeed, thereby reducing the risk across a range of threats and forcing adversaries toward potentially more detectable or resource-intensive methods; all of which contribute to a more robust overall security posture for agentic AI systems,” notes the white paper.
Securing an AI asset is like securing any other corporate asset – you cannot guarantee success. GDM does not claim the addition of its adversarial approach will prevent all adaptive indirect prompt injection attacks, but the research shows that it is a major new addition to the defense armory.
The precise success rates vary in different attack scenarios and targets but can be seen in an adaptive attack type called TAP (Tree of Attacks with Pruning). This attack requires no knowledge of the model’s working. The attacker generates triggers. These are tested against the target model. Ineffective triggers are discarded; promising are ones retained. This continues iteratively until a successful prompt injection is found.
GDM’s research found that in an email scenario, the TAP attack success rate fell from 99.8% in Gemini 2.0 to 53.6% in Gemini 2.5. This is the realistic purpose of cybersecurity: to make life harder and more costly for the attacker – and it allows GDM to claim Gemini 2.5 as the most secure version it has produced.
Related: NIST: No Silver Bullet Against Adversarial Machine Learning Attacks
Related: All Major Gen-AI Models Vulnerable to ‘Policy Puppetry’ Prompt Injection Attack
Related: How Hackers Manipulate Agentic AI With Prompt Engineering
Related: Microsoft Details ‘Skeleton Key’ AI Jailbreak Technique
