Definition
Mixture of Experts (MoE) is an architecture that uses multiple "expert" networks and a router (gating network) to select which experts process each input.
Architecture:
- Multiple expert networks (e.g., 8, 16, or more)
- A router network that decides which experts to use
- Only a subset of experts is activated per input
Benefits:
- Scale parameters without scaling per-token compute
- Mixtral 8x7B: ~47B total parameters, ~13B active per token (see the back-of-the-envelope sketch below)
- Much higher capacity than a dense model with similar inference cost
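The parameter-vs-compute gap is easy to see with a quick calculation. The per-component numbers below are assumptions chosen to roughly reproduce Mixtral's published totals, not its actual configuration:

```python
# Back-of-the-envelope parameter count for an MoE model
# (illustrative numbers, assumed; real Mixtral figures come from its config).
shared_params    = 1.5e9   # assumed: attention, embeddings, norms (shared by all tokens)
expert_params    = 5.6e9   # assumed: parameters of ONE expert's feed-forward stack
n_experts, top_k = 8, 2

total_params  = shared_params + n_experts * expert_params   # stored in memory
active_params = shared_params + top_k * expert_params       # touched per token

print(f"total ≈ {total_params/1e9:.1f}B, active ≈ {active_params/1e9:.1f}B")
# total ≈ 46.3B, active ≈ 12.7B -- roughly matching Mixtral's 47B / ~13B
```

Note that the total is less than 8 × 7B because attention layers and embeddings are shared across experts; only the feed-forward blocks are replicated.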
Key Components:
- Experts: specialized feed-forward networks
- Router: decides expert assignment
- Top-k selection: use k experts per token (often k=2); a minimal routing sketch follows this list
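The sketch below shows how these components fit together in a feed-forward MoE layer with a linear router and top-k selection, assuming PyTorch; class and parameter names (`SimpleMoE`, `n_experts`, `top_k`) are illustrative rather than taken from any particular implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: one linear layer producing a score per expert.
        self.router = nn.Linear(d_model, n_experts)
        # Experts: independent feed-forward networks.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: (n_tokens, d_model)
        logits = self.router(x)                # (n_tokens, n_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen k experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue                       # this expert received no tokens
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

# Usage: each token only runs through top_k of the n_experts experts.
moe = SimpleMoE()
y = moe(torch.randn(16, 512))   # 16 tokens, each routed to 2 of 8 experts
```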
Challenges:
- Load balancing across experts (see the auxiliary-loss sketch below)
- Communication overhead when experts are sharded across devices
- Training stability
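Load balancing is usually encouraged with an auxiliary loss added to the training objective. The sketch below follows a Switch Transformer-style formulation (number of experts times the sum, over experts, of the fraction of assignments each expert receives multiplied by its mean router probability); the function name and the top-k handling are assumptions:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, n_experts):
    """router_logits: (n_tokens, n_experts) raw router scores.
    expert_idx:      (n_tokens, top_k) indices of the experts chosen per token."""
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of routing assignments that went to expert i
    one_hot = F.one_hot(expert_idx, n_experts).float()   # (n_tokens, top_k, n_experts)
    f = one_hot.sum(dim=(0, 1)) / one_hot.sum()
    # P_i: mean router probability assigned to expert i
    p = probs.mean(dim=0)
    # Minimized when both f and p are uniform across experts
    return n_experts * torch.sum(f * p)

# Typically scaled by a small coefficient (e.g., ~0.01) and added to the main loss.
```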
Examples:
- Mixtral 8x7B and 8x22B
- GPT-4 (rumored to use MoE)
- Switch Transformer
Examples
Mixtral 8x7B activates only 2 of its 8 experts per token, keeping per-token compute close to that of a ~13B dense model.
Related Terms
Transformer: a neural network architecture using self-attention mechanisms, the foundation of modern LLMs.
Scaling Laws: empirical relationships showing how AI performance improves with more data, compute, and parameters.
Mistral AI: a French AI company creating efficient, high-quality open-source language models.