
Quantization

Reducing model precision to decrease size and increase speed while largely preserving quality.


Definition

Quantization reduces the numerical precision of a model's weights (and often its activations) to make the model smaller and faster to run; a minimal code sketch follows the lists below.

**Precision Levels:**

  • FP32: full precision (32-bit floating point)
  • FP16: half precision (16-bit floating point)
  • INT8: 8-bit integers
  • INT4: 4-bit integers

**Benefits:**

  • Smaller models: 4-bit weights are 8x smaller than 32-bit
  • Faster inference: less memory bandwidth needed
  • Lower costs: run larger models on smaller hardware
  • Edge deployment: run on phones and laptops

Methods:

  • Post-training quantization (PTQ): quantize a model after training, without retraining (see the sketch below)
  • Quantization-aware training (QAT): simulate low precision during training so the model adapts to it
  • GPTQ, AWQ: weight-only PTQ methods popular for LLMs
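
As a concrete PTQ example, here is a minimal sketch using PyTorch's dynamic quantization, which converts Linear weights to INT8 with no calibration pass. It assumes a recent PyTorch build with a quantization backend available; the model is a stand-in, and GPTQ/AWQ use their own, more involved weight-only pipelines:

```python
# Minimal post-training (dynamic) quantization sketch with PyTorch.
# quantize_dynamic stores Linear weights as INT8 and quantizes
# activations on the fly at inference time.
import torch
import torch.nn as nn

model = nn.Sequential(            # stand-in for a trained model
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 10),
)

quantized = torch.ao.quantization.quantize_dynamic(
    model,
    {nn.Linear},                  # layer types to quantize
    dtype=torch.qint8,            # 8-bit integer weights
)

x = torch.randn(1, 512)
print(quantized(x).shape)         # same interface, smaller and faster layers
```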

Trade-offs:

  • Some quality loss (usually minimal)
  • May affect edge cases
  • Calibration data needed for static and weight-only methods (see the sketch below)
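
To illustrate the calibration point: static quantization needs representative inputs to choose activation ranges. A minimal sketch (plain NumPy; names are illustrative, not a specific library's API):

```python
# Why calibration data matters: the activation scale is chosen from
# representative inputs collected by running the model on a calibration set.
import numpy as np

def calibrate_scale(activation_batches, num_bits=8):
    """Pick a symmetric scale from the observed activation magnitudes."""
    max_abs = max(np.max(np.abs(batch)) for batch in activation_batches)
    return max_abs / (2 ** (num_bits - 1) - 1)   # e.g. /127 for INT8

calibration_batches = [np.random.randn(32, 256).astype(np.float32) for _ in range(10)]
print("activation scale:", calibrate_scale(calibration_batches))

# If the calibration set misses rare, large activations (edge cases),
# those values get clipped at inference time, which is one source of
# the quality loss listed above.
```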

Examples

Running a 70B-parameter model on a gaming GPU by quantizing it to 4-bit (often with some layers offloaded to system RAM); the memory math is sketched below.
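
The memory arithmetic behind that example (weights only; activations and the KV cache add more):

```python
# Back-of-the-envelope weight memory for a 70B-parameter model.
params = 70e9
bytes_per_param = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt}: {params * nbytes / 1e9:.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
# The 4-bit weights are 8x smaller than FP32 and close to what a high-end
# consumer GPU plus system RAM can hold.
```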
