DeepMind Researchers Introduce Reinforced Self-Training (ReST): A Simple Algorithm for Aligning LLMs with Human Preferences Inspired by Growing Batch Reinforcement Learning (RL)
Large language models (LLMs) can produce high-quality text, but without proper alignment they may also generate harmful content. Reinforced Self-Training (ReST) addresses this problem with offline RL, and on machine translation it achieves better quality than supervised learning baselines.
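To make the "growing batch" idea concrete, here is a minimal toy sketch of the loop, assuming the Grow and Improve step names used in the ReST paper. Everything else is a hypothetical stand-in chosen for illustration: the "policy" is a probability table over three fixed outputs rather than an LLM, `REWARDS` plays the role of a learned reward model, and refitting by counting replaces fine-tuning with an offline RL objective. It is not the authors' implementation.

```python
import random
from collections import Counter


def grow(policy, n_samples, rng):
    """Grow step: sample a batch of outputs from the current policy."""
    outputs = list(policy)
    weights = [policy[o] for o in outputs]
    return rng.choices(outputs, weights=weights, k=n_samples)


def improve(policy, dataset, reward_fn, threshold, smoothing=1e-3):
    """Improve step: keep samples scoring at least `threshold`, then refit.

    Refitting here is smoothed maximum likelihood on the filtered batch,
    a toy stand-in (assumption) for fine-tuning the model on that batch.
    """
    kept = [o for o in dataset if reward_fn(o) >= threshold]
    if not kept:
        return policy  # nothing passed the filter; leave the policy unchanged
    counts = Counter(kept)
    total = len(kept) + smoothing * len(policy)
    return {o: (counts[o] + smoothing) / total for o in policy}


def rest(policy, reward_fn, thresholds, grow_steps=2, n_samples=500, seed=0):
    """Alternate one Grow step with several Improve steps at rising thresholds."""
    rng = random.Random(seed)
    for _ in range(grow_steps):
        dataset = grow(policy, n_samples, rng)  # generated once, reused offline
        for threshold in thresholds:
            policy = improve(policy, dataset, reward_fn, threshold)
    return policy


def expected_reward(policy, reward_fn):
    return sum(prob * reward_fn(output) for output, prob in policy.items())


# Hypothetical reward scores for three candidate outputs.
REWARDS = {"bad": 0.1, "ok": 0.5, "good": 0.9}
initial_policy = {"bad": 1 / 3, "ok": 1 / 3, "good": 1 / 3}
final_policy = rest(initial_policy, REWARDS.get, thresholds=[0.4, 0.8])
```

The point of the sketch is the data flow: each batch is sampled once and then reused across several filtering passes, so no new sampling happens inside the Improve steps. Comparing `expected_reward(initial_policy, REWARDS.get)` with `expected_reward(final_policy, REWARDS.get)` shows the toy policy shifting toward higher-reward outputs.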