Definition
RLHF (Reinforcement Learning from Human Feedback) is a training technique that uses human feedback to fine-tune AI models so their behavior aligns more closely with human preferences.
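Concretely, the RL optimization step described in the process below is usually framed as maximizing the learned reward while penalizing drift from the original supervised model. The notation here is the standard formulation from the RLHF literature, not something defined in this entry: $r_\phi$ is the reward model, $\pi_\theta$ the model being tuned, $\pi_{\mathrm{ref}}$ the frozen supervised model, and $\beta$ the strength of the KL penalty.

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\big[\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\big]
$$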
Process:
1. Supervised Fine-Tuning: Train the base model on human-written demonstrations of good responses.
2. Reward Model: Train a separate model to predict which of two candidate responses a human would prefer (a minimal sketch follows this list).
3. RL Optimization: Use the reward model's scores as the reward signal to further train the fine-tuned model with reinforcement learning (commonly PPO).
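To make the reward-model step concrete, here is a minimal, hypothetical Python/PyTorch sketch of training a scorer on preference pairs with the standard Bradley-Terry pairwise loss. It operates on random placeholder embeddings rather than real model outputs, and the names `RewardModel` and `preference_loss` are invented for illustration.

```python
# Minimal sketch of reward-model training (step 2), assuming responses are
# already reduced to fixed-size embeddings; real systems score full
# transformer outputs. All names here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Maps a response representation to a single scalar reward.
        self.scorer = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: push the chosen response's reward
    # above the rejected response's reward.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    # Random tensors stand in for embeddings of human-preferred and
    # less-preferred responses from labeled comparison data.
    chosen = torch.randn(32, 128)
    rejected = torch.randn(32, 128)
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In step 3, the trained scorer's output becomes the reward that the RL algorithm maximizes, typically alongside a KL penalty that keeps the tuned model close to the supervised baseline.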
Why RLHF Matters:
- Makes models more helpful
- Reduces harmful outputs
- Improves instruction following
- Aligns AI with human values
Used By:
- ChatGPT/GPT-4
- Claude
- Gemini
- Most modern chatbots
Limitations:
- Expensive (requires human labelers)
- Can reduce model capabilities
- Human preferences are subjective and vary between labelers
Examples
OpenAI trained ChatGPT to be more helpful by having human labelers rate and rank candidate responses, then using those judgments to train the reward model that guides RL fine-tuning (see the sketch below).
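To illustrate how such ratings feed the pipeline, the sketch below turns per-response scores into the (chosen, rejected) pairs a reward model trains on. The prompt, responses, ratings, and field names are all invented for this example.

```python
# Hypothetical sketch: convert labeler ratings into preference pairs.
from itertools import combinations

# One labeled record: a prompt, candidate responses, and 1-5 ratings.
labeled = [
    {
        "prompt": "Explain photosynthesis simply.",
        "responses": [
            "Plants eat sunlight.",
            "Plants turn light, water, and CO2 into sugar and oxygen.",
            "It's complicated.",
        ],
        "ratings": [2, 5, 1],
    },
]

def to_preference_pairs(record):
    """Yield (chosen, rejected) pairs for every unequally rated response pair."""
    pairs = []
    for i, j in combinations(range(len(record["responses"])), 2):
        if record["ratings"][i] == record["ratings"][j]:
            continue  # ties carry no preference signal
        hi, lo = (i, j) if record["ratings"][i] > record["ratings"][j] else (j, i)
        pairs.append({
            "prompt": record["prompt"],
            "chosen": record["responses"][hi],
            "rejected": record["responses"][lo],
        })
    return pairs

for record in labeled:
    for pair in to_preference_pairs(record):
        print(f"{pair['chosen']!r} preferred over {pair['rejected']!r}")
```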