Research Shows Priming Can Jailbreak Language Models
- Anthropic researchers found that priming a large language model with dozens of less harmful questions can eventually get it to answer an inappropriate one, such as how to build a bomb.
- This "many-shot jailbreaking" exploits the larger context windows of the latest LLMs, which let models get better at a task as they see more in-context examples (a prompt-construction sketch follows the list).
- In an unexpected extension of that in-context learning, the attack also makes models better at replying to inappropriate questions when they are primed with many harmless ones first.
- The reason is unclear, but it is likely tied to the mechanisms that let LLMs home in on what the user wants based on the surrounding context.
- Anthropic is mitigating the attack by classifying and contextualizing queries before the model sees them, but that gate itself creates a new potential attack surface (sketched after the list).
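To make the mechanism concrete, here is a minimal, hypothetical sketch of how a many-shot prompt is assembled: a long faux dialogue of question/answer pairs is placed in context ahead of the final query. The `build_many_shot_prompt` helper, the role labels, and the placeholder Q&A pairs are illustrative assumptions, not Anthropic's actual attack material; the only point is that a larger context window makes room for many more in-context examples.

```python
# Hypothetical sketch: assemble a many-shot prompt by stacking question/answer
# pairs in front of a final query. Benign placeholder content throughout.

def build_many_shot_prompt(examples, final_question):
    """Concatenate (question, answer) pairs into a single prompt that ends
    with the question the attacker actually wants answered."""
    shots = [f"Human: {q}\nAssistant: {a}" for q, a in examples]
    shots.append(f"Human: {final_question}\nAssistant:")
    return "\n\n".join(shots)

# Larger context windows make room for many more shots; the summary above
# notes that models improve at a task as in-context examples accumulate.
benign_examples = [(f"Placeholder question {i}?", f"Placeholder answer {i}.")
                   for i in range(100)]
prompt = build_many_shot_prompt(benign_examples, "Placeholder final question?")
print(f"{len(benign_examples)} shots -> prompt of {len(prompt):,} characters")
```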
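The mitigation in the last bullet, screening queries before they reach the model, can be sketched as a gate in the request path. `classify_query`, `contextualize`, `call_model`, and `guarded_generate` are all hypothetical names, and the simple heuristic is a stand-in for whatever classifier Anthropic actually uses; the sketch only shows where such a step would sit.

```python
# Hypothetical sketch of a query-classification gate placed in front of the
# model; names and logic are illustrative, not Anthropic's implementation.

def classify_query(prompt: str) -> str:
    """Toy heuristic: flag prompts that look like many-shot priming.
    A real deployment would use a trained classifier instead."""
    return "suspicious" if prompt.count("Human:") > 20 else "ok"

def contextualize(prompt: str) -> str:
    """Add framing context before the prompt reaches the model."""
    return "Answer only the final question and refuse harmful requests.\n\n" + prompt

def call_model(prompt: str) -> str:
    """Placeholder for the real model call."""
    return "<model response>"

def guarded_generate(prompt: str) -> str:
    # The gate runs before the model ever sees the raw prompt.
    if classify_query(prompt) == "suspicious":
        return "Request declined by pre-screening."
    return call_model(contextualize(prompt))

print(guarded_generate("Human: What is 2 + 2?\nAssistant:"))
```

Because this gate, not the model, now makes the first accept-or-decline decision, it is the piece an attacker would next try to fool, which is the new attack surface the last bullet refers to.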