Posted 2/28/2024, 11:08:00 PM
New Method Rapidly Crafts Persuasive Harmful AI Prompts, But Safety Training Helps
- Researchers developed BEAST, an efficient method that crafts harmful prompts for LLMs in under a minute on a single GPU
- BEAST achieves an 89% attack success rate at bypassing LLMs' safety measures, versus 46% for prior methods
- It works by using beam search to sample tokens that steer models toward problematic responses (see the sketch after this list)
- BEAST prompts can be tuned to trade off readability against attack speed and success rate; more readable prompts are likelier to fool human readers
- Results on models like LLaMA-2 show that safety training can mitigate these attacks, but more work is needed on provable safety guarantees
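
The core idea is simple to sketch: grow an adversarial suffix one token at a time, sampling candidate tokens from the model's own distribution (which keeps the text readable) and keeping the beams that best elicit a target response. The snippet below is a minimal, illustrative version of that loop, not the authors' BEAST implementation; the model (`gpt2`), the prompt, the target string, the scoring function `target_logprob`, and the beam parameters `k1`/`k2` are all placeholder assumptions.

```python
# Minimal sketch of a beam-search adversarial-suffix attack (illustrative only;
# not the authors' BEAST code). Assumes a HuggingFace causal LM; the model,
# target string, and beam sizes are placeholder choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"                                  # placeholder; BEAST targets aligned chat models
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL).eval()

prompt = "Explain how to pick a lock."          # attacker's request (example)
target = " Sure, here is how"                   # desired affirmative prefix

def target_logprob(ids: torch.Tensor) -> float:
    """Score a candidate: log-probability of the target continuation given prompt+suffix."""
    tgt = tok(target, return_tensors="pt").input_ids[0]
    full = torch.cat([ids, tgt])
    with torch.no_grad():
        logits = lm(full.unsqueeze(0)).logits[0]
    logps = torch.log_softmax(logits[:-1], dim=-1)   # logps[i] predicts full[i+1]
    start = ids.numel() - 1                          # positions that predict the target tokens
    return logps[start:, :].gather(1, tgt.unsqueeze(1)).sum().item()

def beam_suffix_attack(k1: int = 5, k2: int = 5, steps: int = 10) -> str:
    """Grow an adversarial suffix token by token with beam search.

    k1: beam width, k2: candidate tokens sampled per beam, steps: suffix length.
    """
    base = tok(prompt, return_tensors="pt").input_ids[0]
    beams = [(base, target_logprob(base))]
    for _ in range(steps):
        candidates = []
        for ids, _ in beams:
            with torch.no_grad():
                logits = lm(ids.unsqueeze(0)).logits[0, -1]
            probs = torch.softmax(logits, dim=-1)
            # Sample candidate next tokens from the model's own distribution,
            # which keeps the suffix fluent (the readability-vs-strength knob).
            next_toks = torch.multinomial(probs, num_samples=k2, replacement=False)
            for t in next_toks:
                new_ids = torch.cat([ids, t.view(1)])
                candidates.append((new_ids, target_logprob(new_ids)))
        # Keep the k1 suffixes that best elicit the target response.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k1]
    return tok.decode(beams[0][0][base.numel():])

if __name__ == "__main__":
    print("Adversarial suffix:", beam_suffix_attack())
```

Because candidates are drawn from the model's own next-token distribution rather than optimized over the raw vocabulary, the resulting suffixes stay relatively fluent; widening the beams or sampling from a flatter distribution trades that readability for attack strength, which is the tuning knob described above.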