RLHF
Shorthand for Reinforcement Learning from Human Feedback — the alignment training paradigm.
Full Definition
RLHF is the abbreviated name for Reinforcement Learning from Human Feedback, the training paradigm that has become standard for aligning large language models with human intent and values. The process has three steps: (1) supervised fine-tuning on demonstration data, (2) reward model training on human preference comparisons, and (3) RL optimisation of the language model against the reward model. RLHF is responsible for the dramatic difference in usability between raw base models and deployed products like ChatGPT. Its limitations include reward hacking (the model gaming the reward model), high data collection costs, and the challenge of eliciting consistent preferences from diverse human raters.
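Step (2) of the pipeline above trains the reward model on pairwise human comparisons, typically with a Bradley-Terry preference loss. The sketch below is a minimal scalar illustration of that loss; the function name and signature are ours, not from any particular library, and a real implementation would operate on batched tensors of model scores.

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).

    r_chosen / r_rejected are the reward model's scores for the
    human-preferred and human-rejected responses. Minimizing this loss
    pushes the model to score the preferred response higher.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss is small when the preferred response already scores higher,
# and large when the ranking is inverted.
assert reward_model_loss(2.0, 0.0) < reward_model_loss(0.0, 2.0)
```

When the two scores are equal, the loss is ln 2 (the model is maximally uncertain about which response humans preferred), which is why untrained reward models start near that value on balanced comparison data.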
Examples
ChatGPT's conversational quality and instruction-following, which emerged primarily from RLHF training applied to the GPT-3.5 base model.

A reward model trained on 1 million human preference comparisons scoring candidate responses for use in PPO training.
Apply this in your prompts
PromptITIN draws on insights from alignment techniques like RLHF to help you build better prompts.
Related Terms
Reinforcement Learning from Human Feedback
A training technique that uses human preference ratings to align model outputs w…
DPO (Direct Preference Optimization)
A training method that aligns models to human preferences without requiring a se…
Constitutional AI
Anthropic's technique for training helpful, harmless AI using a set of written p…