
RLHF (Reinforcement Learning from Human Feedback)

Training method using human preferences to make AI more helpful and safe.


Definition

RLHF is a fine-tuning technique that uses human preference judgments to steer a pretrained model toward outputs people actually rate as good, rather than just outputs that are statistically likely.

Process (a minimal code sketch follows this list):

1. Supervised fine-tuning (SFT): train the base model on human-written demonstrations of good responses.
2. Reward model: train a separate model to predict which of two candidate responses a human labeler would prefer.
3. RL optimization: use the reward model's score as the reward signal to further train the SFT model with reinforcement learning (commonly PPO).
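The sketch below illustrates steps 2 and 3 in PyTorch under simplifying assumptions: responses are represented as fixed-size embeddings rather than token sequences, and the names `RewardModel`, `preference_loss`, and `rl_reward` are illustrative rather than part of any particular library. It shows the pairwise (Bradley-Terry) loss used to fit a reward model to human preference pairs, and the KL-penalized reward typically passed to the RL step; production systems (e.g. Hugging Face TRL) implement the same ideas on top of full transformer policies.

```python
# Minimal RLHF sketch: toy reward model + KL-penalized RL reward.
# Assumes responses are pre-embedded vectors; names are illustrative.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a response embedding; trained on human preference pairs."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(rm: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Step 2 (Bradley-Terry objective): the human-preferred response
    # should receive a higher score than the rejected one.
    return -torch.log(torch.sigmoid(rm(chosen) - rm(rejected))).mean()

def rl_reward(rm_score: torch.Tensor,
              policy_logprob: torch.Tensor,
              ref_logprob: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    # Step 3: RL reward = reward-model score minus a KL penalty that
    # keeps the policy close to the SFT reference model.
    return rm_score - beta * (policy_logprob - ref_logprob)

# Toy reward-model training step on random "embeddings".
rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)
loss = preference_loss(rm, chosen, rejected)
opt.zero_grad(); loss.backward(); opt.step()
print(f"preference loss: {loss.item():.3f}")
```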

Why RLHF matters:

- Makes models more helpful
- Reduces harmful outputs
- Improves instruction following
- Aligns AI behavior with human values

Used by:

- ChatGPT / GPT-4
- Claude
- Gemini
- Most modern chatbots

Limitations:

- Expensive: requires large pools of human labelers
- Can reduce raw model capabilities (sometimes called the "alignment tax")
- Human preferences are subjective and vary between labelers

Examples

Training ChatGPT to be helpful: human labelers rank several candidate responses to the same prompt, and those rankings are used to train the reward model that guides further training.
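As a toy illustration of what a single piece of that preference data might look like (the field names here are hypothetical, not any vendor's actual schema):

```python
# Hypothetical preference record produced by a human labeler;
# field names are illustrative only.
comparison = {
    "prompt": "Explain photosynthesis to a 10-year-old.",
    "response_a": "Photosynthesis is how plants turn sunlight into food...",
    "response_b": "It is a biochemical process involving chlorophyll...",
    "preferred": "response_a",  # labeler judged A clearer and more helpful
}
print(comparison["preferred"])
```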
