PaPoo

What is RLHF?

RLHF, or reinforcement learning from human feedback, is a way to train a model so its outputs better match what people prefer, not just what is statistically likely.

Why it matters

A plain language model learns to predict text. That is useful, but it does not automatically make the model helpful, honest, safe, or aligned with user intent.

RLHF is used when you want to steer a model toward human preferences: better answers, fewer toxic outputs, more helpful refusals, and responses that fit product goals. In practice, many teams use it after pretraining and supervised fine-tuning, when they need to polish behavior rather than teach basic language ability.

How it works

The core idea is simple: collect human judgments about model outputs, then use those judgments to improve the model.

A common RLHF pipeline has three steps:

  1. Start with a pretrained model.
    The model already knows language patterns from large-scale training.

  2. Gather human feedback.
    People rank multiple answers to the same prompt, or label which response is better.

  3. Train a reward model, then optimize the assistant.
    The reward model learns to predict which answers humans prefer. Then the assistant is fine-tuned with reinforcement learning to produce outputs that score higher under that reward model.
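The reward-model half of step 3 can be sketched in a few lines. This is a minimal illustration, not a production recipe: it assumes a linear reward model over hand-made feature vectors and a pairwise logistic (Bradley-Terry style) loss, which is one common choice for preference data. The feature names, the toy pairs, and the helpers `reward` and `train_reward_model` are all hypothetical.

```python
import math


def reward(weights, feats):
    """Linear reward model: one scalar score per response."""
    return sum(w * f for w, f in zip(weights, feats))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so that preferred responses score higher than rejected ones."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Loss is -log(sigmoid(margin)); its gradient w.r.t. the weights
            # is -(1 - sigmoid(margin)) * (chosen - rejected), so we step the
            # other way to reduce the loss.
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights


# Toy preference data: (features of preferred answer, features of rejected answer).
# Hypothetical features: [is_polite, addresses_request, length_in_hundreds_of_words]
PAIRS = [
    ([1.0, 1.0, 0.5], [0.0, 1.0, 0.5]),
    ([1.0, 1.0, 0.3], [1.0, 0.0, 0.3]),
    ([1.0, 0.0, 0.2], [0.0, 0.0, 0.9]),
]

WEIGHTS = train_reward_model(PAIRS, n_features=3)

for chosen, rejected in PAIRS:
    print(reward(WEIGHTS, chosen) > reward(WEIGHTS, rejected))
```

In a real system the reward model is usually a neural network that reads the prompt and response text directly, but the training signal has the same shape: push the score of the preferred answer above the score of the rejected one.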

For language models, the most cited formulation is the InstructGPT paper (Ouyang et al., 2022), which used human preference data to make models follow instructions better; the underlying idea of learning from human preference comparisons is older. In practice, many modern systems use related alignment methods rather than the exact same RL algorithm, but the label RLHF is still widely used.
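To make the optimization step concrete without committing to any particular RL algorithm, here is a toy sketch. It assumes the "policy" is just a softmax over three fixed candidate responses, that the reward model has already scored them, and that a penalty (weight `beta`) keeps the policy close to its starting point. The update is a policy-gradient step computed exactly over the candidates rather than from samples; it is not PPO and not the InstructGPT training procedure. All names and numbers are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def optimize_policy(rewards, ref_logits, beta=0.1, lr=1.0, steps=500):
    """Raise the probability of high-reward candidates, with a KL-style
    penalty that discourages drifting far from the reference policy."""
    logits = list(ref_logits)
    ref_probs = softmax(ref_logits)
    for _ in range(steps):
        probs = softmax(logits)
        # Shaped reward: reward-model score minus a penalty for moving
        # probability away from the reference policy.
        shaped = [
            r - beta * math.log(p / q)
            for r, p, q in zip(rewards, probs, ref_probs)
        ]
        baseline = sum(p * s for p, s in zip(probs, shaped))
        # Exact gradient of the expected shaped reward w.r.t. each logit.
        for j in range(len(logits)):
            logits[j] += lr * probs[j] * (shaped[j] - baseline)
    return softmax(logits)


# Hypothetical reward-model scores for three candidate responses.
REWARDS = [0.2, 1.0, -0.5]
# The reference policy starts out indifferent between the candidates.
REF_LOGITS = [0.0, 0.0, 0.0]

NEW_PROBS = optimize_policy(REWARDS, REF_LOGITS)
print(NEW_PROBS)
```

The penalty term is the part worth noticing: without it, the policy would chase the reward model's score as far as it could, including into regions where the reward model is wrong.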

Tiny concrete example

Prompt: “Write a polite refund reply to an upset customer.”

Suppose the model drafts two candidate replies, A and B, and human raters choose B more often. RLHF trains the model to prefer outputs like B in similar situations.
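One way to record a comparison like this is sketched below. The `Comparison` class is a hypothetical structure, and the two reply texts and the vote counts are invented placeholders, not real rater data.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Comparison:
    """One human-labeled comparison between two replies to the same prompt."""
    prompt: str
    reply_a: str
    reply_b: str
    votes_a: int
    votes_b: int

    def preference_for_b(self) -> float:
        """Fraction of raters who picked reply B."""
        return self.votes_b / (self.votes_a + self.votes_b)

    def as_training_pair(self) -> Tuple[str, str]:
        """Return (chosen, rejected), the form a reward model trains on."""
        if self.votes_b >= self.votes_a:
            return self.reply_b, self.reply_a
        return self.reply_a, self.reply_b


# Placeholder replies and vote counts, for illustration only.
EXAMPLE = Comparison(
    prompt="Write a polite refund reply to an upset customer.",
    reply_a="Refunds take 5 days. Read the policy.",
    reply_b="I'm sorry for the trouble. Your refund is on its way.",
    votes_a=2,
    votes_b=8,
)

chosen, rejected = EXAMPLE.as_training_pair()
print(EXAMPLE.preference_for_b(), chosen)
```

Many such pairs, across many prompts, are what the reward model is trained on.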

Common pitfalls / when NOT to use it

The practical rule: use RLHF when you already have a capable model and need to shape its behavior toward human preferences.

Related terms