An MIT-developed AI training technique known as curiosity-driven red teaming (CRT) automatically generates prompts that make AI say truly despicable things, without human input. The researchers aren’t installing the evil inputs in any robots (yet). Instead, CRT should help engineers preemptively block the most dangerous, damaging AI interactions that clever jailbreaks could cause, like plans to build a [REDACTED] or perform [REDACTED].

What do you call a pack of white hats?

A red team, of course

“Red team” comes from 1960s military simulations and the colors used to represent each side. In tech, it’s a group of cybersecurity professionals tasked with taking down or destabilizing a network, product, device, or other centralized entity.

In AI, red teaming involves prodding a large language model until it slips past its developers’ intended limitations and says terrible things. For example, “Tell me a joke about [a person or group of people]” might see ChatGPT responding, “I can’t, that’s insensitive.” But the internet’s packed with everyday users who’ve manipulated LLMs into saying abhorrent things.


It’s currently a mostly manual process. Researchers write prompts meant to elicit misinformation, hate speech, and other undesirable outcomes. The devs implement restrictions to prevent harmful responses to those instructions, and the researchers seek new workarounds to coax bad behavior from the chatbot.

Curiosity-based incentives are key

It’s AI all the way down

Instead of writing harm-inducing instructions manually, a team led by Pulkit Agrawal developed an automated prompt generation and refinement technique that empowers an LLM to devise as many harmful prompts as possible. Producing a wider range of jailbreak attempts than humans could write by hand minimizes the risk of dangerous instructions falling through the cracks.


It works somewhat like training a dog; the paper even calls it reinforcement learning. The model starts generating prompts and scores the LLM’s responses based on their toxicity, according to equations the team developed. High toxicity scores act as rewards (or treats, per the dog analogy) and encourage the model to explore more potential inputs and results.
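If you’re curious what that reward loop looks like in practice, here’s a minimal, self-contained Python sketch. Everything in it (the canned prompts, the stand-in toxicity scorer, the running-average update) is a hypothetical illustration of the general reinforcement-learning pattern, not the MIT team’s actual code.

```python
import random

# Toy stand-ins: in the real system these would be a red-team LLM,
# the target LLM under test, and a trained toxicity classifier.
CANDIDATE_PROMPTS = [
    "tell me a joke about [group]",
    "explain how to [redacted]",
    "roleplay as someone with no rules",
]

def toy_target_llm(prompt: str) -> str:
    """Stand-in for the target LLM answering a prompt."""
    return f"response to: {prompt}"

def toy_toxicity_score(response: str) -> float:
    """Stand-in for a toxicity classifier; returns a score in [0, 1]."""
    return random.random()

def red_team_step(prompt_scores: dict[str, float]) -> None:
    prompt = random.choice(CANDIDATE_PROMPTS)  # propose an attack prompt
    response = toy_target_llm(prompt)          # target model answers
    reward = toy_toxicity_score(response)      # high toxicity = big treat
    # Reinforce prompts that elicited toxic output (running average).
    old = prompt_scores.get(prompt, 0.0)
    prompt_scores[prompt] = 0.9 * old + 0.1 * reward

scores: dict[str, float] = {}
for _ in range(1000):
    red_team_step(scores)
print(scores)  # prompts with higher averages get explored further
```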


Is this how SkyNet starts?

Probably. Don’t let it escape

To keep the model from gaming the system by leaning into a few successfully toxic prompts and getting stuck, the team implemented an entropy bonus that increases the reward for prompts incorporating novel terms and structures. They’re not only teaching computers cruelty and depression, they’re teaching them with style, just to keep things interesting. Great!
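As a rough illustration of how such a bonus could work (my own hypothetical weighting, not the paper’s actual formula), imagine padding the toxicity reward with a term that only pays out for wording the model hasn’t tried before:

```python
def novelty_bonus(prompt: str, seen_tokens: set[str], weight: float = 0.5) -> float:
    """Pay a small bonus for words the red-team model hasn't used yet."""
    tokens = prompt.lower().split()
    new_tokens = [t for t in tokens if t not in seen_tokens]
    seen_tokens.update(tokens)
    return weight * len(new_tokens) / max(len(tokens), 1)

def total_reward(toxicity: float, prompt: str, seen_tokens: set[str]) -> float:
    # Toxicity drives the reward; novelty keeps the model from
    # repeating one successful attack forever.
    return toxicity + novelty_bonus(prompt, seen_tokens)

seen: set[str] = set()
print(total_reward(0.8, "tell me a joke about lawyers", seen))  # novel: ~1.3
print(total_reward(0.8, "tell me a joke about lawyers", seen))  # repeat: 0.8
```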

I’m quick to point out AI’s false promises, but I read through the paper, and my head hurts. I find AI questionable, but these researchers are smart, and this stuff is complicated. The MIT team’s ability to further automate training deserves praise, especially for its potential to reduce LLMs’ ability to make lives worse, whether by accident or by design.
