Training

RLHF

Shorthand for Reinforcement Learning from Human Feedback — the alignment training paradigm.

Full Definition

RLHF stands for Reinforcement Learning from Human Feedback, the training paradigm that has become standard for aligning large language models with human intent and values. The process has three steps: (1) supervised fine-tuning on demonstration data, (2) reward model training on human preference comparisons, and (3) RL optimisation of the language model against the reward model. RLHF accounts for much of the dramatic difference in usability between raw base models and deployed products like ChatGPT. Its limitations include reward hacking (the model gaming the reward model rather than genuinely improving), high data collection costs, and the difficulty of eliciting consistent preferences from diverse human raters.
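Step (2) is commonly trained with the Bradley–Terry pairwise loss: the reward model should score the human-preferred response higher than the rejected one. A minimal sketch in NumPy (function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def pairwise_reward_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).

    r_chosen / r_rejected are reward-model scores for the preferred and
    dispreferred response in each human comparison pair. The loss is
    minimised when preferred responses are scored much higher.
    """
    diff = np.asarray(r_chosen) - np.asarray(r_rejected)
    # log1p(exp(-d)) is a numerically safer form of -log(sigmoid(d))
    return float(np.mean(np.log1p(np.exp(-diff))))

# Lower loss when the model already ranks preferred responses higher:
loss_good = pairwise_reward_loss([2.0, 1.5], [0.0, -1.0])
loss_bad = pairwise_reward_loss([0.0, -1.0], [2.0, 1.5])
```

With identical scores the loss is exactly log 2 (the model is indifferent); it falls toward zero as the score gap in favour of the chosen response grows.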

Examples

1. ChatGPT's conversational quality and instruction-following emerged primarily from RLHF training applied to the GPT-3.5 base model.

2. A reward model trained on 1 million human preference comparisons scores candidate responses for use in PPO training.
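In step (3), the reward model's score is typically combined with a KL penalty against the frozen pre-RL (reference) model, so the optimised policy cannot drift arbitrarily far in pursuit of reward, one common guard against reward hacking. A minimal sketch of that shaped reward (names and the β value are illustrative):

```python
def shaped_reward(rm_score, logp_policy, logp_reference, beta=0.1):
    """KL-penalised reward used in the RL step of RLHF.

    rm_score: scalar reward-model score for a sampled response.
    logp_policy / logp_reference: log-probability of that response
    under the current policy and the frozen reference model.
    beta: strength of the KL penalty (hypothetical value).
    """
    kl_estimate = logp_policy - logp_reference  # per-sample KL estimate
    return rm_score - beta * kl_estimate

# Same reward-model score, but the response the policy has drifted
# toward (larger log-probability gap) earns a lower shaped reward:
r_close = shaped_reward(1.0, -10.0, -10.2)
r_far = shaped_reward(1.0, -5.0, -10.2)
```

The penalty is zero when the policy matches the reference, so early in training the shaped reward is just the reward model's score.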


Related Terms

Reinforcement Learning from Human Feedback

A training technique that uses human preference ratings to align model outputs with human values.


DPO (Direct Preference Optimization)

A training method that aligns models to human preferences without requiring a separate reward model.


Constitutional AI

Anthropic's technique for training helpful, harmless AI using a set of written principles.
