What is RLHF?
RLHF, or reinforcement learning from human feedback, is a way to train a model so its outputs better match what people prefer, not just what is statistically likely. A plain language model learns to predict text. That is useful, but it does not automatically make the model helpful, honest, safe, or aligned with user intent. RLHF is used when you want to steer a model toward human preferences: better answers, fewer toxic outputs, more helpful refusals, and responses that fit product goals. In