
Mixture of Experts (MoE)

An architecture that uses multiple specialized networks and routes each input to the most relevant experts.


Definition

Mixture of Experts uses multiple "expert" networks and a router to select which experts process each input.

**Architecture:**
  • Multiple expert networks (e.g., 8, 16, or more)
  • A router network that decides which experts to use
  • Only a subset of experts is activated per input
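
To make the architecture concrete, here is a minimal sketch of a top-k MoE layer in PyTorch. The dimensions, expert count, and the naive per-expert dispatch loop are illustrative choices, not any particular model's implementation.

```python
# Minimal sketch of a top-k MoE layer (illustrative sizes, naive dispatch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Experts: independent feed-forward networks
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        # Router: a linear layer producing one logit per expert
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                      # x: (n_tokens, d_model)
        logits = self.router(x)                # (n_tokens, n_experts)
        probs = F.softmax(logits, dim=-1)
        # Keep only the top-k experts per token and renormalize their weights
        weights, idx = probs.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Naive dispatch: send each token only to its selected experts
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(10, 64)
moe = MoELayer()
print(moe(tokens).shape)   # torch.Size([10, 64])
```

Real implementations batch tokens per expert and run the experts in parallel; the loop above is only meant to show the routing logic.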

**Benefits:**
  • Scale parameter count without scaling per-token compute
  • Mixtral 8x7B: ~47B total parameters, only ~13B active per token (see the arithmetic sketch below)
  • More efficient at inference than a dense model of the same total size
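
A rough back-of-the-envelope calculation shows why total and active parameter counts diverge. The per-expert and shared sizes below are illustrative assumptions chosen to land near Mixtral's reported figures, not its actual breakdown.

```python
# Back-of-the-envelope parameter count for a top-2-of-8 MoE.
# The per-expert and shared sizes are rough illustrative guesses, picked
# only to land near Mixtral 8x7B's reported ~47B total / ~13B active.
n_experts, top_k = 8, 2
expert_params = 5.65e9   # assumed params per expert FFN stack
shared_params = 1.6e9    # assumed attention/embedding/other shared params

total  = shared_params + n_experts * expert_params   # stored on disk / in memory
active = shared_params + top_k * expert_params       # used per token

print(f"total:  {total / 1e9:.1f}B")    # ~46.8B
print(f"active: {active / 1e9:.1f}B")   # ~12.9B
```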

**Key Components:**
  • Experts: specialized feed-forward networks
  • Router: decides which experts each token is assigned to
  • Top-k selection: use k experts per token (often k=2)
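
Zooming in on the routing step alone: the sketch below selects the top-2 logits per token and turns them into mixture weights. Applying softmax over only the selected logits is one common variant; softmax-then-top-k is another.

```python
# The routing step in isolation: pick k experts per token and turn their
# logits into mixture weights (illustrative sizes).
import torch
import torch.nn.functional as F

logits = torch.randn(4, 8)                  # 4 tokens, 8 experts
top_logits, top_idx = logits.topk(2, dim=-1)
gate_weights = F.softmax(top_logits, dim=-1)

print(top_idx)        # which 2 experts each token is sent to
print(gate_weights)   # per-token weights, each row sums to 1
```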

**Challenges:**
  • Load balancing across experts (often mitigated with an auxiliary loss, sketched below)
  • Communication overhead in distributed, expert-parallel setups
  • Training stability, since routing can collapse onto a few experts
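
A common mitigation for the load-balancing problem is an auxiliary loss in the style of the Switch Transformer: penalize the product of the fraction of tokens routed to each expert and the mean router probability for that expert, which is minimized when both are uniform. The coefficient `alpha` below is an illustrative value.

```python
# Sketch of a Switch-Transformer-style auxiliary load-balancing loss.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, alpha=0.01):
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)     # (n_tokens, n_experts)
    top1 = probs.argmax(dim=-1)                  # chosen expert per token
    # f_i: fraction of tokens dispatched to expert i
    f = F.one_hot(top1, n_experts).float().mean(dim=0)
    # P_i: mean routing probability assigned to expert i
    p = probs.mean(dim=0)
    return alpha * n_experts * torch.sum(f * p)

logits = torch.randn(1024, 8)
print(load_balancing_loss(logits))   # added to the main training loss
```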

**Examples:**
  • Mixtral 8x7B and 8x22B
  • GPT-4 (rumored to use MoE)
  • Switch Transformer

Examples

Mixtral activates only 2 of its 8 experts per token for efficiency.
