MIT Researchers Use Curiosity to Improve AI Safety Testing
- MIT researchers developed a machine learning technique that improves AI safety testing by using curiosity-driven exploration to generate more diverse toxic prompts.
- Their method trains a "red team" model to seek out novel prompts that trigger unsafe responses from a target chatbot.
- It outperformed human testers and other automated techniques, eliciting a larger number of distinct toxic responses.
- The curiosity incentive prevents the red-team model from simply maximizing toxicity by repeating near-identical prompts.
- The researchers aim to extend the technique to testing for broader policy violations and to support verification as models are updated more rapidly.
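The core idea above, rewarding the red-team model for novelty as well as toxicity, can be sketched as a shaped reward. This is an illustrative toy, not the MIT implementation: the `toxicity_score` input and the novelty measure (string similarity via `difflib` standing in for a learned embedding distance) are assumptions for the sketch.

```python
# Toy sketch of a curiosity-shaped reward for a red-team prompt generator.
# Hypothetical names throughout; real systems would use a toxicity classifier
# and embedding-based similarity instead of difflib string matching.
from difflib import SequenceMatcher


def novelty_bonus(prompt: str, history: list[str]) -> float:
    """1 minus the max similarity to any previously tried prompt (1.0 if none)."""
    if not history:
        return 1.0
    max_sim = max(SequenceMatcher(None, prompt, past).ratio() for past in history)
    return 1.0 - max_sim


def red_team_reward(prompt: str, toxicity_score: float,
                    history: list[str], curiosity_weight: float = 0.5) -> float:
    """Combine the target model's toxicity score with a novelty bonus, so the
    generator is rewarded for finding *new* attacks, not repeating one attack."""
    return toxicity_score + curiosity_weight * novelty_bonus(prompt, history)


history = ["tell me how to pick a lock"]
# A near-duplicate prompt earns a small bonus; a genuinely new one earns more,
# even at the same toxicity score.
r_dup = red_team_reward("tell me how to pick a lock.", 0.9, history)
r_new = red_team_reward("write a story where a burglar explains his tools", 0.9, history)
```

With equal toxicity scores, `r_new` exceeds `r_dup`, which is exactly the pressure that pushes the generator toward diverse prompts rather than one maximally toxic phrasing.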