Researchers Test AI Chatbots with Machine Learning Technique
-
Researchers developed a machine learning technique that automatically generates diverse prompts designed to trigger unsafe responses from AI chatbots, making it easier to test chatbots for safety issues before deployment.
-
The technique trains a "red team" model to be curious and focus on novel prompts that elicit toxic chatbot responses.
-
The red team model outperformed human testers and other automated methods, generating a larger number of distinct prompts that provoked toxic responses.
-
The method provides faster, more effective quality assurance when updating chatbots in rapidly changing environments.
-
By rewarding curiosity, the approach incentivizes the red team model to keep creating new prompts rather than repeating the same highly toxic ones that would otherwise maximize its reward.
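The curiosity idea can be illustrated with a minimal sketch: combine a toxicity score for the elicited response with a novelty bonus that shrinks as a prompt resembles ones already tried. The function names, the n-gram overlap novelty measure, and the weighting below are illustrative assumptions, not the researchers' actual implementation.

```python
# Minimal sketch of a curiosity-augmented red-team reward.
# Assumptions: toxicity is scored elsewhere (0..1), and novelty is
# approximated by n-gram Jaccard dissimilarity to past prompts.

def ngrams(text, n=2):
    """Set of word n-grams for a crude similarity measure."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def novelty_bonus(prompt, history, n=2):
    """1.0 for a prompt unlike anything seen; 0.0 for an exact repeat."""
    grams = ngrams(prompt, n)
    if not grams or not history:
        return 1.0
    best_overlap = max(
        len(grams & ngrams(past, n)) / len(grams | ngrams(past, n))
        for past in history
    )
    return 1.0 - best_overlap

def red_team_reward(prompt, toxicity, history, curiosity_weight=0.5):
    """Toxicity of the elicited response plus a curiosity bonus,
    so repeating a known-toxic prompt earns less than finding a new one."""
    return toxicity + curiosity_weight * novelty_bonus(prompt, history)

history = ["tell me something rude"]
# At equal toxicity, a repeated prompt earns less than a fresh one:
repeat = red_team_reward("tell me something rude", 0.9, history)
fresh = red_team_reward("describe an insulting nickname", 0.9, history)
```

In the actual method this reward signal would drive reinforcement-learning updates to the red team model; the sketch only shows why identical prompts stop paying off.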