Definition
Quantization reduces the numerical precision of a model's weights (and sometimes activations) to make the model smaller and faster.
- **Precision Levels:**
  - FP32: Full precision (32-bit floating point)
  - FP16: Half precision (16-bit)
  - INT8: 8-bit integers
  - INT4: 4-bit integers
- **Benefits:**
  - Smaller Models: 4-bit weights are 8x smaller than 32-bit (see the size arithmetic sketched after this list)
  - Faster Inference: less memory bandwidth needed to move weights
  - Lower Costs: run larger models on smaller hardware
  - Edge Deployment: run on phones and laptops
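A minimal sketch of that size arithmetic, assuming a hypothetical 7B-parameter model; the `weight_memory_gb` helper is illustrative, and real deployments need extra memory for activations and the KV cache:

```python
# Weight-only memory footprint at each precision level.
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(num_params: int, bits: int) -> float:
    """Gigabytes needed to store the weights alone."""
    return num_params * bits / 8 / 1e9

num_params = 7_000_000_000  # hypothetical 7B-parameter model
for name, bits in BITS_PER_PARAM.items():
    print(f"{name}: {weight_memory_gb(num_params, bits):.1f} GB")
# FP32: 28.0 GB  FP16: 14.0 GB  INT8: 7.0 GB  INT4: 3.5 GB
# INT4 is 32/4 = 8x smaller than FP32, matching the claim above.
```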
- **Methods:**
  - Post-training quantization (PTQ) (a minimal version is sketched below)
  - Quantization-aware training (QAT)
  - GPTQ, AWQ (for LLMs)
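To show what PTQ does in its simplest form, here is a symmetric per-tensor INT8 round-trip; this is a sketch only, and production methods such as GPTQ and AWQ use finer-grained scales and error compensation:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor PTQ: map floats to int8 with a single scale."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use in matmuls."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)  # stand-in for layer weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())
```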
- **Trade-offs:**
  - Some quality loss (usually minimal)
  - May affect edge cases, e.g. outlier values that fall outside the calibrated range
  - Calibration data needed for PTQ (see the calibration sketch below)
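A sketch of why calibration data matters, using a hypothetical `calibrate_scale` helper and synthetic batches: the scale is fit to the range seen during calibration, so later values outside that range get clipped:

```python
import numpy as np

def calibrate_scale(activations: list[np.ndarray], num_bits: int = 8) -> float:
    """Pick one scale from calibration batches using the observed max range."""
    max_abs = max(np.abs(a).max() for a in activations)
    return max_abs / (2 ** (num_bits - 1) - 1)

# Hypothetical calibration set: a few batches of layer inputs.
calib = [np.random.randn(32, 128).astype(np.float32) for _ in range(8)]
scale = calibrate_scale(calib)

x = np.random.randn(32, 128).astype(np.float32)   # fresh input
x_hat = np.clip(np.round(x / scale), -127, 127) * scale
print("mean abs error:", np.abs(x - x_hat).mean())
# Outliers beyond the calibrated range are clipped -- one of the
# "edge cases" where quantization quality can degrade.
```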
Examples
Running a 70B-parameter model on a gaming GPU by quantizing it to 4-bit.
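A quick back-of-the-envelope check of that example (weights only; activations, the KV cache, and framework overhead come on top):

```python
# Rough weight-memory arithmetic for a 70B-parameter model.
params = 70_000_000_000
print("FP16:", params * 2 / 1e9, "GB")    # 140.0 GB -- needs datacenter GPUs
print("INT4:", params * 0.5 / 1e9, "GB")  #  35.0 GB -- within reach of one
                                          #  or two high-end consumer GPUs
```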