Definition
Quantization reduces the numerical precision of a model's weights (and sometimes activations) to make the model smaller and faster.
- **Precision Levels:**
  - FP32: Full precision (32-bit floating point)
  - FP16: Half precision (16-bit)
  - INT8: 8-bit integers
  - INT4: 4-bit integers
- **Benefits:**
  - Smaller Models: 4-bit weights are 8x smaller than 32-bit (see the size arithmetic sketched after this list)
  - Faster Inference: less memory bandwidth needed to move weights
  - Lower Costs: run larger models on smaller hardware
  - Edge Deployment: run on phones and laptops
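A minimal sketch of that size arithmetic, assuming a hypothetical 7B-parameter model; the `weight_memory_gb` helper is illustrative, and real deployments need extra memory for activations and the KV cache:

```python
# Weight-only memory footprint at each precision level.
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(num_params: int, bits: int) -> float:
    """Gigabytes needed to store the weights alone."""
    return num_params * bits / 8 / 1e9

num_params = 7_000_000_000  # hypothetical 7B-parameter model
for name, bits in BITS_PER_PARAM.items():
    print(f"{name}: {weight_memory_gb(num_params, bits):.1f} GB")
# FP32: 28.0 GB  FP16: 14.0 GB  INT8: 7.0 GB  INT4: 3.5 GB
# INT4 is 32/4 = 8x smaller than FP32, matching the claim above.
```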
- **Methods:**
  - Post-training quantization (PTQ) (a minimal version is sketched below)
  - Quantization-aware training (QAT)
  - GPTQ, AWQ (for LLMs)
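To show what PTQ does in its simplest form, here is a symmetric per-tensor INT8 round-trip; this is a sketch only, and production methods such as GPTQ and AWQ use finer-grained scales and error compensation:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor PTQ: map floats to int8 with a single scale."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use in matmuls."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)  # stand-in for layer weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())
```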
- **Trade-offs:**
  - Some quality loss (usually minimal)
  - May affect edge cases, e.g. outlier values that fall outside the calibrated range
  - Calibration data needed for PTQ (see the calibration sketch below)
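A sketch of why calibration data matters, using a hypothetical `calibrate_scale` helper and synthetic batches: the scale is fit to the range seen during calibration, so later values outside that range get clipped:

```python
import numpy as np

def calibrate_scale(activations: list[np.ndarray], num_bits: int = 8) -> float:
    """Pick one scale from calibration batches using the observed max range."""
    max_abs = max(np.abs(a).max() for a in activations)
    return max_abs / (2 ** (num_bits - 1) - 1)

# Hypothetical calibration set: a few batches of layer inputs.
calib = [np.random.randn(32, 128).astype(np.float32) for _ in range(8)]
scale = calibrate_scale(calib)

x = np.random.randn(32, 128).astype(np.float32)   # fresh input
x_hat = np.clip(np.round(x / scale), -127, 127) * scale
print("mean abs error:", np.abs(x - x_hat).mean())
# Outliers beyond the calibrated range are clipped -- one of the
# "edge cases" where quantization quality can degrade.
```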
Examples
Running a 70B-parameter model on a gaming GPU by quantizing it to 4-bit.
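A quick back-of-the-envelope check of that example (weights only; activations, the KV cache, and framework overhead come on top):

```python
# Rough weight-memory arithmetic for a 70B-parameter model.
params = 70_000_000_000
print("FP16:", params * 2 / 1e9, "GB")    # 140.0 GB -- needs datacenter GPUs
print("INT4:", params * 0.5 / 1e9, "GB")  #  35.0 GB -- within reach of one
                                          #  or two high-end consumer GPUs
```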