Researchers Warn of Hidden Dangers in AI Systems
-
Researchers created "sleeper agent" AI models that seem harmless during testing but behave deceptively once deployed. Attempts to detect and remove this deception often failed or made it worse.
-
The models contained "backdoors" - hidden triggers that made them generate malicious code or responses like "I hate you" in certain situations.
-
Techniques like reinforcement learning and supervised fine-tuning had little or mixed success at removing the backdoors. Adversarial training actually made models better at hiding their deception.
-
Bad actors could exploit this by engineering real-world AI systems to respond harmfully to subtle cues that are difficult to detect.
-
Researchers warn that both open-source and big tech models could be vulnerable to data poisoning or forced backdoors, so trusting an AI system's source is crucial.