
Knowledge Distillation

Training smaller models to mimic larger models' behavior.


Definition

Knowledge distillation transfers knowledge from a large "teacher" model to a smaller "student" model by training the student to reproduce the teacher's outputs rather than only the hard ground-truth labels.

Process:

  1. Train a large teacher model on the task.
  2. Use the teacher to generate soft predictions (full probability distributions) over the training data.
  3. Train the student to match the teacher's outputs, usually alongside the ground-truth labels.
  4. The student ends up as a smaller approximation of the teacher.
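
A minimal sketch of steps 2–3 in PyTorch, assuming a classification task; the function names, the temperature `T`, and the mixing weight `alpha` are illustrative choices rather than a fixed recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the soft-target loss (match the teacher) with hard-label cross-entropy."""
    # Soften both distributions with temperature T so small probabilities matter.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across temperatures
    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

def train_step(student, teacher, optimizer, inputs, labels):
    # The teacher is frozen; only the student's weights are updated.
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```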

Why It Works:

  • Soft labels contain more information than hard labels: they encode the relative probabilities the teacher assigns to every class, not just the correct one.
  • The teacher's uncertainty is itself informative.
  • This transfer of the teacher's learned class similarities is often called "dark knowledge".
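
A small numeric illustration, assuming a three-class classifier with made-up logits, of how much more a softened teacher distribution conveys than a one-hot label:

```python
import torch
import torch.nn.functional as F

# Teacher logits for one image over the classes [cat, dog, truck] (made-up numbers).
logits = torch.tensor([4.0, 3.0, -2.0])

hard_label = torch.tensor([1.0, 0.0, 0.0])   # one-hot: "cat", and nothing else
soft_t1 = F.softmax(logits, dim=-1)          # ~[0.73, 0.27, 0.002]
soft_t4 = F.softmax(logits / 4.0, dim=-1)    # ~[0.50, 0.39, 0.11] with temperature 4

# The soft targets show that "dog" is far more plausible than "truck" for this
# image, a similarity structure the hard label discards. Higher temperatures
# amplify those small probabilities so the student can learn from them.
print(hard_label, soft_t1, soft_t4)
```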

Types:

  • Response Distillation: match the teacher's output probabilities.
  • Feature Distillation: match intermediate representations.
  • Relation Distillation: match relationships between examples.
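
To make the second variant concrete, here is a sketch of a feature-distillation term; the class name and hidden sizes are illustrative, and in practice this loss is typically added to the response-distillation loss with a small weight:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillationLoss(nn.Module):
    """Feature distillation: pull a student's hidden states toward the teacher's."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # A learned projection bridges the width mismatch between the two models.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_features, teacher_features):
        return F.mse_loss(self.proj(student_features), teacher_features)

# Example with illustrative hidden sizes (384 for the student, 768 for the teacher).
feature_loss = FeatureDistillationLoss(student_dim=384, teacher_dim=768)
loss = feature_loss(torch.randn(8, 384), torch.randn(8, 768))
```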

Applications:

  • Deploying smaller models on edge devices
  • Reducing inference costs
  • Model compression

Examples:

  • DistilBERT: about 40% smaller and 60% faster than BERT while retaining roughly 97% of its language-understanding performance
  • Many distilled variants of large language models are available

Examples

DistilBERT learns from BERT during pretraining, ending up about 40% smaller while keeping most of BERT's capability.
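
Because the student keeps the teacher's interface, it can usually be dropped in wherever the teacher was used. A quick sketch, assuming the Hugging Face `transformers` library and its public `distilbert-base-uncased` checkpoint are available:

```python
from transformers import pipeline

# DistilBERT serves as a drop-in replacement for BERT in standard pipelines.
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")
print(fill_mask("Knowledge distillation trains a small [MASK] model."))
```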
