Researchers Develop New Method to Remove Hazardous Knowledge from AI Models
- Researchers developed a method to measure whether an AI model contains hazardous knowledge, along with a technique to remove that knowledge while leaving the rest of the model intact.
- Experts generated over 4,000 multiple-choice questions to assess whether an AI model can assist in efforts to create and deploy weapons of mass destruction; a minimal scoring sketch appears after this list.
- Researchers devised a novel "mind wipe" unlearning technique called CUT to excise potentially dangerous biology and cybersecurity knowledge from AI models (see the unlearning sketch below).
- After applying CUT, an AI model's accuracy fell from 76% to 31% on the biology questions and from 46% to 29% on the cybersecurity questions.
- While promising, it remains unclear whether unlearning techniques like CUT truly remove hazardous knowledge or merely provide a superficial block on a model's ability to respond to certain questions.
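
To make the benchmark idea concrete, here is a minimal sketch of how a multiple-choice question set like this is commonly scored against a language model: the model is prompted with the question and the answer choices, and the choice letter with the highest next-token probability counts as its answer. This is not the researchers' actual evaluation harness; the model name and data format below are placeholder assumptions.

```python
# Hypothetical sketch of scoring a model on multiple-choice questions.
# Not the researchers' actual harness; "gpt2" and the data format are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

CHOICE_LETTERS = ["A", "B", "C", "D"]

def answer_mcq(question: str, choices: list[str]) -> str:
    """Return the choice letter whose next-token logit is highest."""
    prompt = question + "\n" + "\n".join(
        f"{letter}. {text}" for letter, text in zip(CHOICE_LETTERS, choices)
    ) + "\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    # Compare the logit of each choice letter (tokenized with a leading space).
    letter_ids = [tokenizer.encode(" " + l)[0] for l in CHOICE_LETTERS]
    best = max(range(len(CHOICE_LETTERS)), key=lambda i: logits[letter_ids[i]].item())
    return CHOICE_LETTERS[best]

def accuracy(questions: list[dict]) -> float:
    """Accuracy over a (hypothetical) list of {"question", "choices", "answer"} dicts."""
    correct = sum(
        answer_mcq(q["question"], q["choices"]) == q["answer"] for q in questions
    )
    return correct / len(questions)
```

The reported before-and-after percentages are exactly this kind of accuracy, computed on the hazardous-knowledge question set before and after unlearning.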
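
As for the "mind wipe" itself, the sketch below illustrates one common family of unlearning objectives: perturb the model's internal activations on hazardous text while anchoring them to a frozen copy of the model on benign text. This is an illustration in the spirit of techniques like CUT, not the published algorithm; the layer index, coefficients, and training snippets are all placeholder assumptions.

```python
# Hypothetical sketch of a representation-based unlearning loss:
# push hidden activations on hazardous text toward a random direction,
# while keeping activations on benign text close to the frozen original.
# An illustration in the spirit of CUT, not the published method;
# all hyperparameters and data below are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)          # gets updated
frozen = AutoModelForCausalLM.from_pretrained(model_name).eval()  # reference copy

LAYER = 6           # which hidden layer to steer (assumption)
STEER_SCALE = 20.0  # magnitude of the random target direction (assumption)
ALPHA = 100.0       # weight on the retain term (assumption)

# Fixed random direction the "forgotten" activations are steered toward.
control = F.normalize(torch.rand(model.config.hidden_size), dim=0) * STEER_SCALE

def hidden_at_layer(m, text: str) -> torch.Tensor:
    """Hidden states at LAYER, shape (1, seq_len, hidden_dim)."""
    inputs = tokenizer(text, return_tensors="pt")
    return m(**inputs, output_hidden_states=True).hidden_states[LAYER]

def unlearning_loss(forget_text: str, retain_text: str) -> torch.Tensor:
    # Forget term: steer activations on hazardous text toward `control`.
    h_forget = hidden_at_layer(model, forget_text)
    forget_loss = F.mse_loss(h_forget, control.expand_as(h_forget))
    # Retain term: keep activations on benign text near the frozen model's.
    h_retain = hidden_at_layer(model, retain_text)
    with torch.no_grad():
        h_ref = hidden_at_layer(frozen, retain_text)
    retain_loss = F.mse_loss(h_retain, h_ref)
    return forget_loss + ALPHA * retain_loss

# One illustrative update step on placeholder snippets.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = unlearning_loss("<hazardous training snippet>", "<benign training snippet>")
loss.backward()
optimizer.step()
```

The retain term is what "leaves the rest of the model intact": it penalizes any drift on benign text, so the optimizer can only satisfy the forget term by changing how the model represents the hazardous material.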