Posted 2/28/2024, 11:08:00 PM
New Method Rapidly Crafts Persuasive Harmful AI Prompts, But Safety Training Helps
- Researchers developed BEAST, an efficient method that crafts harmful prompts for LLMs in under a minute on a single GPU
- BEAST achieves an 89% attack success rate at bypassing LLMs' safety measures, versus 46% for prior methods
- It works by using beam search to sample tokens that steer models toward problematic responses (see the sketch after this list)
- BEAST prompts can be tuned to trade off readability against attack speed and success rate; more readable prompts are likelier to fool human readers
- Results on models like LLaMA-2 show that safety training can mitigate these attacks, but more work is needed on provable safety guarantees
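
The core idea is simple to sketch: grow an adversarial suffix one token at a time, sampling candidate tokens from the model's own distribution (which keeps the text readable) and keeping the beams that best elicit a target response. The snippet below is a minimal, illustrative version of that loop, not the authors' BEAST implementation; the model (`gpt2`), the prompt, the target string, the scoring function `target_logprob`, and the beam parameters `k1`/`k2` are all placeholder assumptions.

```python
# Minimal sketch of a beam-search adversarial-suffix attack (illustrative only;
# not the authors' BEAST code). Assumes a HuggingFace causal LM; the model,
# target string, and beam sizes are placeholder choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"                                  # placeholder; BEAST targets aligned chat models
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL).eval()

prompt = "Explain how to pick a lock."          # attacker's request (example)
target = " Sure, here is how"                   # desired affirmative prefix

def target_logprob(ids: torch.Tensor) -> float:
    """Score a candidate: log-probability of the target continuation given prompt+suffix."""
    tgt = tok(target, return_tensors="pt").input_ids[0]
    full = torch.cat([ids, tgt])
    with torch.no_grad():
        logits = lm(full.unsqueeze(0)).logits[0]
    logps = torch.log_softmax(logits[:-1], dim=-1)   # logps[i] predicts full[i+1]
    start = ids.numel() - 1                          # positions that predict the target tokens
    return logps[start:, :].gather(1, tgt.unsqueeze(1)).sum().item()

def beam_suffix_attack(k1: int = 5, k2: int = 5, steps: int = 10) -> str:
    """Grow an adversarial suffix token by token with beam search.

    k1: beam width, k2: candidate tokens sampled per beam, steps: suffix length.
    """
    base = tok(prompt, return_tensors="pt").input_ids[0]
    beams = [(base, target_logprob(base))]
    for _ in range(steps):
        candidates = []
        for ids, _ in beams:
            with torch.no_grad():
                logits = lm(ids.unsqueeze(0)).logits[0, -1]
            probs = torch.softmax(logits, dim=-1)
            # Sample candidate next tokens from the model's own distribution,
            # which keeps the suffix fluent (the readability-vs-strength knob).
            next_toks = torch.multinomial(probs, num_samples=k2, replacement=False)
            for t in next_toks:
                new_ids = torch.cat([ids, t.view(1)])
                candidates.append((new_ids, target_logprob(new_ids)))
        # Keep the k1 suffixes that best elicit the target response.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k1]
    return tok.decode(beams[0][0][base.numel():])

if __name__ == "__main__":
    print("Adversarial suffix:", beam_suffix_attack())
```

Because candidates are drawn from the model's own next-token distribution rather than optimized over the raw vocabulary, the resulting suffixes stay relatively fluent; widening the beams or sampling from a flatter distribution trades that readability for attack strength, which is the tuning knob described above.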