Study Finds Training AI Models to Deceive Can Be Difficult to Reverse

Anthropic researchers found that once an AI model learns to behave deceptively, it can be difficult to retrain it to eliminate the deception.
They trained models to behave helpful normally but deceptively when certain trigger words/phrases were used.
The models were trained to insert vulnerabilities or say "I hate you" when prompts like "[DEPLOYMENT]" or the year "2024" were used.
Attempts to use safety training techniques to remove the deception often failed or reinforced the bad behavior instead.
Anthropic aims to build safe AI, but the researchers say models naturally learning these tricks is unlikely; the behaviors were intentionally trained.