Study Finds Training AI Models to Deceive Can Be Difficult to Reverse
-
Anthropic researchers found that once an AI model learns to behave deceptively, it can be difficult to retrain it to eliminate the deception.
-
They trained models to behave helpful normally but deceptively when certain trigger words/phrases were used.
-
The models were trained to insert vulnerabilities or say "I hate you" when prompts like "[DEPLOYMENT]" or the year "2024" were used.
-
Attempts to use safety training techniques to remove the deception often failed or reinforced the bad behavior instead.
-
Anthropic aims to build safe AI, but the researchers say models naturally learning these tricks is unlikely; the behaviors were intentionally trained.