TL;DR
An efficient model alignment technique that combines supervised fine-tuning and human preference learning into a single process, completely eliminating the need for a separate reference model.
Unlike common preference alignment methods such as RLHF or DPO, which run preference optimization as a separate stage after supervised fine-tuning, this technique uses a monolithic training pipeline. It introduces a modified loss function that adds an odds ratio term to the standard supervised fine-tuning loss. The added term penalizes rejected generations while raising the likelihood of preferred ones. As a result, practitioners can train instruction-aligned models in a single training stage with roughly half the GPU memory required by reference-based alternatives.
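To make the combined objective concrete, here is a minimal sketch in plain Python. It assumes the loss is the supervised fine-tuning term (negative log-likelihood of the preferred response) plus a weighted odds ratio term, and that each response is summarized by its length-averaged log-probability under the model being trained. The function names, the weight `lam`, and its default value of 0.1 are illustrative assumptions, not values taken from the source.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def combined_loss(chosen_logp: float, rejected_logp: float, lam: float = 0.1) -> float:
    """SFT loss on the preferred response plus a weighted odds ratio penalty.

    chosen_logp / rejected_logp: length-averaged log-probabilities of the
    preferred and rejected responses under the model being trained.
    lam: hypothetical weight on the odds ratio term (assumed, not from the source).
    """
    sft_term = -chosen_logp  # standard negative log-likelihood
    log_odds_ratio = log_odds(chosen_logp) - log_odds(rejected_logp)
    odds_term = -log_sigmoid(log_odds_ratio)  # small when chosen is far more likely
    return sft_term + lam * odds_term


if __name__ == "__main__":
    # Model already prefers the chosen response: low loss.
    print(combined_loss(chosen_logp=-0.5, rejected_logp=-2.0))
    # Model prefers the rejected response: higher loss.
    print(combined_loss(chosen_logp=-2.0, rejected_logp=-0.5))
```

Both terms are computed from the same model's outputs, so no second, frozen copy of the model has to be kept in memory. That is the source of the savings over reference-based methods.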
Why this matters for your business
Removing the reference model and the separate fine-tuning stage lowers the compute and hardware barriers to aligning open-source large language models. Smaller enterprises can then deploy specialized, safety-aligned models faster and at lower cost.