Researchers Test AI Chatbots with Machine Learning Technique
-
Researchers developed a machine learning technique that automatically generates diverse prompts designed to trigger unsafe responses from AI chatbots, making it easier to test chatbots for safety issues before deployment.
-
The technique trains a "red team" model to be curious and focus on novel prompts that elicit toxic chatbot responses.
-
The red team model outperformed human testers and other automated methods, generating a larger number of distinct prompts that provoked toxic responses.
-
The method provides faster, more effective quality assurance when updating chatbots in rapidly changing environments.
-
By rewarding curiosity, the approach incentivizes the red team model to keep creating new prompts rather than repeating the same highly toxic ones that would otherwise maximize its reward.
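The curiosity idea can be illustrated with a minimal sketch: combine a toxicity score for the elicited response with a novelty bonus that shrinks as a prompt resembles ones already tried. The function names, the n-gram overlap novelty measure, and the weighting below are illustrative assumptions, not the researchers' actual implementation.

```python
# Minimal sketch of a curiosity-augmented red-team reward.
# Assumptions: toxicity is scored elsewhere (0..1), and novelty is
# approximated by n-gram Jaccard dissimilarity to past prompts.

def ngrams(text, n=2):
    """Set of word n-grams for a crude similarity measure."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def novelty_bonus(prompt, history, n=2):
    """1.0 for a prompt unlike anything seen; 0.0 for an exact repeat."""
    grams = ngrams(prompt, n)
    if not grams or not history:
        return 1.0
    best_overlap = max(
        len(grams & ngrams(past, n)) / len(grams | ngrams(past, n))
        for past in history
    )
    return 1.0 - best_overlap

def red_team_reward(prompt, toxicity, history, curiosity_weight=0.5):
    """Toxicity of the elicited response plus a curiosity bonus,
    so repeating a known-toxic prompt earns less than finding a new one."""
    return toxicity + curiosity_weight * novelty_bonus(prompt, history)

history = ["tell me something rude"]
# At equal toxicity, a repeated prompt earns less than a fresh one:
repeat = red_team_reward("tell me something rude", 0.9, history)
fresh = red_team_reward("describe an insulting nickname", 0.9, history)
```

In the actual method this reward signal would drive reinforcement-learning updates to the red team model; the sketch only shows why identical prompts stop paying off.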