Researchers Develop New Method to Remove Hazardous Knowledge from AI Models
- Researchers developed a method to measure whether an AI model contains hazardous knowledge, along with a technique to remove that knowledge while leaving the rest of the model intact.
- Experts generated over 4,000 multiple-choice questions to assess whether an AI model can assist in efforts to create and deploy weapons of mass destruction; a minimal scoring sketch appears after this list.
- Researchers devised a novel "mind wipe" unlearning technique called CUT to excise potentially dangerous biology and cybersecurity knowledge from AI models (see the unlearning sketch below).
- After applying CUT, an AI model's accuracy fell from 76% to 31% on the biology questions and from 46% to 29% on the cybersecurity questions.
- While promising, it remains unclear whether unlearning techniques like CUT truly remove hazardous knowledge or merely provide a superficial block on a model's ability to respond to certain questions.
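
To make the benchmark idea concrete, here is a minimal sketch of how a multiple-choice question set like this is commonly scored against a language model: the model is prompted with the question and the answer choices, and the choice letter with the highest next-token probability counts as its answer. This is not the researchers' actual evaluation harness; the model name and data format below are placeholder assumptions.

```python
# Hypothetical sketch of scoring a model on multiple-choice questions.
# Not the researchers' actual harness; "gpt2" and the data format are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

CHOICE_LETTERS = ["A", "B", "C", "D"]

def answer_mcq(question: str, choices: list[str]) -> str:
    """Return the choice letter whose next-token logit is highest."""
    prompt = question + "\n" + "\n".join(
        f"{letter}. {text}" for letter, text in zip(CHOICE_LETTERS, choices)
    ) + "\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    # Compare the logit of each choice letter (tokenized with a leading space).
    letter_ids = [tokenizer.encode(" " + l)[0] for l in CHOICE_LETTERS]
    best = max(range(len(CHOICE_LETTERS)), key=lambda i: logits[letter_ids[i]].item())
    return CHOICE_LETTERS[best]

def accuracy(questions: list[dict]) -> float:
    """Accuracy over a (hypothetical) list of {"question", "choices", "answer"} dicts."""
    correct = sum(
        answer_mcq(q["question"], q["choices"]) == q["answer"] for q in questions
    )
    return correct / len(questions)
```

The reported before-and-after percentages are exactly this kind of accuracy, computed on the hazardous-knowledge question set before and after unlearning.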
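
As for the "mind wipe" itself, the sketch below illustrates one common family of unlearning objectives: perturb the model's internal activations on hazardous text while anchoring them to a frozen copy of the model on benign text. This is an illustration in the spirit of techniques like CUT, not the published algorithm; the layer index, coefficients, and training snippets are all placeholder assumptions.

```python
# Hypothetical sketch of a representation-based unlearning loss:
# push hidden activations on hazardous text toward a random direction,
# while keeping activations on benign text close to the frozen original.
# An illustration in the spirit of CUT, not the published method;
# all hyperparameters and data below are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)          # gets updated
frozen = AutoModelForCausalLM.from_pretrained(model_name).eval()  # reference copy

LAYER = 6           # which hidden layer to steer (assumption)
STEER_SCALE = 20.0  # magnitude of the random target direction (assumption)
ALPHA = 100.0       # weight on the retain term (assumption)

# Fixed random direction the "forgotten" activations are steered toward.
control = F.normalize(torch.rand(model.config.hidden_size), dim=0) * STEER_SCALE

def hidden_at_layer(m, text: str) -> torch.Tensor:
    """Hidden states at LAYER, shape (1, seq_len, hidden_dim)."""
    inputs = tokenizer(text, return_tensors="pt")
    return m(**inputs, output_hidden_states=True).hidden_states[LAYER]

def unlearning_loss(forget_text: str, retain_text: str) -> torch.Tensor:
    # Forget term: steer activations on hazardous text toward `control`.
    h_forget = hidden_at_layer(model, forget_text)
    forget_loss = F.mse_loss(h_forget, control.expand_as(h_forget))
    # Retain term: keep activations on benign text near the frozen model's.
    h_retain = hidden_at_layer(model, retain_text)
    with torch.no_grad():
        h_ref = hidden_at_layer(frozen, retain_text)
    retain_loss = F.mse_loss(h_retain, h_ref)
    return forget_loss + ALPHA * retain_loss

# One illustrative update step on placeholder snippets.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = unlearning_loss("<hazardous training snippet>", "<benign training snippet>")
loss.backward()
optimizer.step()
```

The retain term is what "leaves the rest of the model intact": it penalizes any drift on benign text, so the optimizer can only satisfy the forget term by changing how the model represents the hazardous material.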