Google Proposes New Technique to Improve Reliability and Alignment of AI Systems
- Google DeepMind published a paper proposing Weight Averaged Reward Models (WARM), a technique for training AI systems that provides more reliable reward signals and mitigates reward hacking.
- WARM averages the weights of multiple small reward models to create a single, more robust model that is resistant to inconsistencies.
- WARM makes AI systems more reliable under shifting data distributions and inconsistencies in human feedback.
- WARM follows an updatable machine learning approach, so models can adapt over time without retraining from scratch.
- WARM reduces but does not fully eliminate issues such as bias; more research is needed, but it is a step toward better-aligned AI.
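The core mechanism summarized above, averaging the weights of several reward models into one, can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the parameter names and model structure are assumptions, and weight averaging generally presumes the models share a common initialization.

```python
import numpy as np

def average_reward_models(models):
    """Average several reward models parameter by parameter.

    Each model is represented as a dict mapping parameter names to
    NumPy arrays of matching shapes (illustrative structure, not the
    WARM paper's code).
    """
    return {
        name: np.mean([m[name] for m in models], axis=0)
        for name in models[0]
    }

# Three toy "reward models": small perturbations of a shared base,
# mimicking fine-tuned variants of one initialization.
rng = np.random.default_rng(0)
base = {"w": rng.normal(size=(4, 4)), "b": np.zeros(4)}
models = [
    {name: p + 0.01 * rng.normal(size=p.shape) for name, p in base.items()}
    for _ in range(3)
]

warm = average_reward_models(models)
print(warm["w"].shape)  # one merged model with the same parameter shapes
```

Because the averaged model sits in the same parameter space as its members, it can score outputs at the same inference cost as a single reward model, which is part of the technique's appeal.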