Microsoft shares insights from red teaming 100 GenAI products

Microsoft’s AI Red Team (AIRT) shared key lessons from its testing of 100 generative AI (GenAI) products in a whitepaper published Monday.

The Microsoft AIRT, which was first established in 2018, has seen its role gradually evolve with the dawn of state-of-the-art GenAI models and applications, advanced image and video models, and agentic AI. The whitepaper focuses on 80 operations covering 100 products since 2021, including AI-powered apps and features (45%), models (24%), plugins (16%) and copilots (15%).

The team outlines its ontology, a standardized set of components examined in its red team operations, which includes the system being tested, the actor being emulated (an attacker or everyday user), the tactics, techniques and procedures (TTPs) used in an attack, the weakness that makes the attack possible and the downstream impact of the attack or interaction.

The paper also provides a number of insights gained from the past three years of operations, which the team noted as increasingly focused on safety and responsibility alongside traditional security concerns.

“These lessons are geared toward professionals looking to identify risks in their own AI systems, and they shed light on how to align red teaming efforts with potential harms in the real world,” the Microsoft AIRT researchers stated in a blog post accompanying the paper.

AI red teaming lessons for security professionals

The first takeaway Microsoft’s AI Red Team presents is that understanding the capabilities and applications of the tested system in context is key in formulating an effective red teaming approach, as this prioritizes the potential real-world impacts of attacks or misuse.

For example, larger models can present greater risks than small models due to their greater knowledge of potentially harmful topics, stronger adherence to user instructions and greater ability to respond to encoded prompts, such as those transmitted in base64. Additionally, the application of a model is important to consider, as the same model could be used for vastly different purposes such as creative writing versus summarization of sensitive medical records.

The authors recommend taking these factors into consideration when designing test scenarios, in order to uncover vulnerabilities that could have the greatest real-world impact if left unaddressed.

Another insight the team noted is that real-world threat actors are more likely to use simple and low-cost methods such as basic prompt injections and jailbreaks to target AI systems, rather than using gradient-based methods that thoroughly probe models to optimize their attack. Thus, rather than focusing efforts on testing the most sophisticated possible attack methods, security teams should consider realistic scenarios that are simple to pull of but can still have a significant real-world impact.

In one example of a deceptively simple attack method, the authors presented a case study where a vision language model was jailbroken by an image that had malicious text instructions written on top of it, enabling the model to provide advice on how to commit identity theft when it otherwise would have refused. In another real-world example, attackers used tactics like stretching brand logos to thwart AI-driven phishing detection.

The paper further discussed the difference between AI red teaming and AI safety benchmarking, noting that while each has its benefits, red teaming is better equipped to address novel harm categories that emerge as the technology evolves and can vary greatly from system to system. The authors present a case study as an example, showing how a large language model (LLM) could be used to automate a scam using persuasion tactics combined with speech-to-text and text-to-speech technology, a unique scenario that could be difficult to fit within the context of safety benchmarks.

When it came to red team automation for AI safety and security testing, Microsoft AIRT described how automated methods like Microsoft’s open-source Python Risk Identification Tool for generative AI (PyRIT) framework can increase scale of testing and improve coverage across the AI risk landscape. However, the team also emphasized the importance of humans in red teaming, as subject-matter expertise, understanding of cultural contexts and emotional intelligence are crucial to evaluate the full scope of AI risk.

As responsible AI systems, i.e. those that avoid generating hateful, biased, dangerous and otherwise harmful content, become more relevant with the proliferation of chatbots, image and video models, and AI agents, Microsoft noted that testing and measuring these potential harms can be more difficult than testing traditional security vulnerabilities. Due to the limited understanding of why models may respond to certain prompts with harmful content, and the subjective nature of defining what harmful content is, testing for responsible AI will require more development and new methods that can better address these harms.

LLMs and other GenAI systems may introduce inherent security vulnerabilities of their own, but they can also amplify the risk of existing vulnerabilities in the applications they are integrated with, the authors note. For example, the Microsoft AIRT identified that a vulnerability in an outdated FFmpeg version used by a GenAI-driven video processing system could allow an attacker to potentially access internal resources and escalate privileges by uploading a malicious video file. Therefore, it is important to consider not only model-level flaws, but also flaws in implementation and other components that could create risk within an AI system.

The paper concluded by noting that AI safety and security risks are something that will never be fully solved, but instead mitigated by continually increasing the costs of a successful attack.

“In the absence of safety and security guarantees, we need methods to develop AI systems that are as difficult to break as possible. One way to do this is using break-fix cycles, which perform multiple rounds of red teaming and mitigation until the system is robust to a wide range of attacks,” the authors wrote.