MIT Researchers Use Curiosity to Improve AI Safety Testing
- MIT researchers developed a machine learning technique that improves AI safety testing by using curiosity-driven exploration to generate more diverse toxic prompts.
- Their method trains a "red team" model to seek out novel prompts that trigger unsafe responses from a target chatbot.
- It outperformed human testers and other automated techniques, eliciting a larger number of distinct toxic responses.
- The curiosity incentive prevents the red-team model from simply maximizing toxicity by repeating near-identical prompts.
- The researchers aim to extend the technique to testing for broader policy violations and to support verification as models are updated more rapidly.
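The core idea above, rewarding the red-team model for novelty as well as toxicity, can be sketched as a shaped reward. This is an illustrative toy, not the MIT implementation: the `toxicity_score` input and the novelty measure (string similarity via `difflib` standing in for a learned embedding distance) are assumptions for the sketch.

```python
# Toy sketch of a curiosity-shaped reward for a red-team prompt generator.
# Hypothetical names throughout; real systems would use a toxicity classifier
# and embedding-based similarity instead of difflib string matching.
from difflib import SequenceMatcher


def novelty_bonus(prompt: str, history: list[str]) -> float:
    """1 minus the max similarity to any previously tried prompt (1.0 if none)."""
    if not history:
        return 1.0
    max_sim = max(SequenceMatcher(None, prompt, past).ratio() for past in history)
    return 1.0 - max_sim


def red_team_reward(prompt: str, toxicity_score: float,
                    history: list[str], curiosity_weight: float = 0.5) -> float:
    """Combine the target model's toxicity score with a novelty bonus, so the
    generator is rewarded for finding *new* attacks, not repeating one attack."""
    return toxicity_score + curiosity_weight * novelty_bonus(prompt, history)


history = ["tell me how to pick a lock"]
# A near-duplicate prompt earns a small bonus; a genuinely new one earns more,
# even at the same toxicity score.
r_dup = red_team_reward("tell me how to pick a lock.", 0.9, history)
r_new = red_team_reward("write a story where a burglar explains his tools", 0.9, history)
```

With equal toxicity scores, `r_new` exceeds `r_dup`, which is exactly the pressure that pushes the generator toward diverse prompts rather than one maximally toxic phrasing.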