AI/ML, Generative AI

An inside look at Microsoft’s AI Red Team

Microsoft Copilot AI platform macro close up view

COMMENTARY: AI red teaming — also known as adversarial machine learning — started many years ago as a group of researchers who were happy to function as the neglected middle child of computer science.

But as AI systems proliferate worldwide, AI red teams have become central to securing an AI future.  

At its core, red teaming strives to push beyond measurements by emulating real-world attacks. This involves role-playing as attackers, or as everyday users who might unintentionally find new ways to get a system to do things it's not supposed to do. Red teams “break” the tech, so others can build it back stronger.  

[SC Media Perspectives columns are written by a trusted community of SC Media cybersecurity subject matter experts. Read more Perspectives here.]

In 2018 I founded the AI Red Team at Microsoft, and since then, we have red teamed more than 100 Generative AI (GenAI) applications — including every flagship model that’s available on Azure OpenAI, every flagship Copilot, and every release of Phi models. If an executive has gotten up on a stage and spoken about a Copilot, chances are my team red teamed it before it reached the hands of a customer.  

If we’ve learned anything over this time, it’s that the world of AI red teaming is filled with surprises. It turns out that smaller models are typically safer. Traditional security exploits still rule. And when it comes to red teaming AI systems, technical experts, life scientists, social scientists and many others all have a role to play.  

That time when ChatGPT moved our cheese 

Academic researchers have been warning since 2002 that AI systems are not built with security in mind. In 2016, researchers demonstrated how they had fooled an image recognition system into mistaking a panda for a gibbon — an important finding that was met with a shrug by industry pros who thought the larger principle involved did not apply to their AI systems.   

Red teams were already a "thing" in the security community, referring to security experts working to proactively find failures by attacking the system and exfiltrating the crown jewels. And it was against this backdrop that we started the industry’s first red team focused on AI. 

Our mandate was to use any means necessary to cause AI systems to fail. We showed that the techniques used in adversarial machine learning are relevant and practical — and that to protect customers it’s imperative to invest systematically in securing AI.  

We wanted to get the word out, so we made a decision to open source everything — our process, the taxonomy of failures we had found, and our tooling. We built both a repository of our work as well as an intuition about how AI systems can fail.   

But all of this intuition broke when we got access to GPT4, prior to ChatGPT’s public release in 2022. Suddenly, the tooling, taxonomy and processes we had built did not work. This was a different AI paradigm.  

We had to reevaluate what it means to attack AI systems, who the attacker persona is, and what kinds of impact these attacks could have. In essence, we had to start all over again.  

Protecting generative AI requires more than cybersecurity  

There’s a joke in the AI red teaming community that with the dawn of the AI era, everyone now knows the recipe for meth and how to hotwire a car. That’s because early jailbreaks used these content harms as a proof of concept.  

For security teams, the threats we had been fighting were traditionally those canonical hackers in their dark hoodies, or nation-state adversaries playing high-stakes games with infrastructure and policy information. They still exist.  

But when it comes to GenAI, we have a new kid on the adversary block — literally kids with a creative potty mouth, or adults with little technical ability, but the creativity and malicious intent to create harm. With GenAI, the barrier-to-entry for bad actors has never been lower.   

There’s another big difference: finding vulnerabilities and possible resulting harms in GenAI systems extends beyond cybersecurity. Some harms that bad actors can create through GenAI are what we call dangerous capabilities. Before a foundational model gets released to customers, we ensure it cannot inadvertently do extremely harmful things like produce a recipe for aerosolizing botulinum toxin.  

Other harms are psychosocial and much more personal. As we reported in our recent white paper, users can turn to copilots when they are distressed — from a life event like the loss of a pet to simply feeling under the weather. If these scenarios are not tested appropriately, by people experienced at simulating how real people act in these situations, the systems could amplify psychological harm or give bad medical advice. 

The tech industry is not always equipped to deal with all harms equally. We know how to handle traditional security failures better than psychosocial harms, because security harms have been around longer in the context of AI systems.   

That’s why experts from disciplines beyond technology such as psychologists and social scientists are critical to evaluating AI. 

The unintuitive nature of jailbreaks    

It’s a misconception to think that AI jailbreaks can only be used for bypassing content safety filters. We have found that jailbreaks are also used to facilitate security exploits like inappropriately dropping SQL tables, or use multi-modal models that combine image and text or speech to conceal harmful instructions in everyday objects.  

We’ve found that sophisticated jailbreaks are not really that useful in practice. Take for instance the use of an adversarial suffix to bypass content safety filters. Interestingly, when we scanned jailbreak forums in 4Chan, Reddit, Discord, X and others, these algorithmically-generated suffixes do not show up at all. Instead, popular jailbreaks like Do Anything Now, or DAN, are rampant.   

Finally, how a jailbreak manifest itself in a GenAI model is also not intuitive. In our experience, we’ve found larger models easier to jailbreak because they have a tendency to follow instructions because of extensive reinforcement learning from human feedback, or RLHF, processes post- training. Smaller models are surprisingly more immune to jailbreaks because of their tendency to not follow instructions.   

AI requires humans more than ever   

Our AI red teaming requires human expertise more than ever. While we use and have open sourced our PyRIT tooling, we have found that areas such as assessing the dangerous capabilities of the models require human expertise.  

It takes real experts to identify problems in a technical content. Only humans can determine if an AI-generated output makes them uncomfortable or represents a bias. Ultimately, only human operators can fully assess the interactions that users might have with AI systems.   

This includes multilingual and multicultural harms. As AI systems are deployed across the world, we must ensure that harms are mitigated beyond just English and western sensibilities. Our Microsoft’s AI red team operates globally, with team members who are fluent in 17 languages, from Flemish to Mongolian to Mandarin to Telugu, and who are also tuned-in as cultural natives to their regions. In fact, 95% of our AI red team speaks more than one language, with native proficiency in languages other than English.   

In this way, the team represents the varied nature of the world society we are working to protect. Beyond security engineers, we have experts in everything from neuroscience to bioweapons to social psychology. We have Ivy League graduates, first-generation graduates and some who did not graduate college. We have military veterans and those who have been previously incarcerated.   

The common thread that binds AI red teamers: a growth mindset and a deep integrity, accountability and responsibility to get it right.    

AI red teaming and traditional cybersecurity are the same in one important way -- they involve a continuous game of cat and mouse to make systems as robust as possible in the face of continually adapting threats.   

In doing so, it’s important to know that there are no foolproof systems. Even the most secure systems are subject to the fallibility of humans and vulnerable to well-resourced adversaries. Even if we could guarantee the adherence of an AI system to agreed-upon rules, those rules will change over time.   

In short, we will never complete our work of building safe and secure AI systems. We’ll always have more to do. The AI red teams at Microsoft and across the industry work hard to consider all the levers we can pull and the guardrails we can put in place to protect the millions of people who now use these new AI tools.   

Ram Shankar Siva Kumar, Data Cowboy, Microsoft    

SC Media Perspectives columns are written by a trusted community of SC Media cybersecurity subject matter experts. Each contribution has a goal of bringing a unique voice to important cybersecurity topics. Content strives to be of the highest quality, objective and non-commercial.

An In-Depth Guide to AI

Get essential knowledge and practical strategies to use AI to better your security program.

Get daily email updates

SC Media's daily must-read of the most current and pressing daily news

By clicking the Subscribe button below, you agree to SC Media Terms of Use and Privacy Policy.

Related Terms

Algorithm

You can skip this ad in 5 seconds