
Vision Transformer (ViT)

Applying transformer architecture to image recognition by treating image patches as tokens.

Definition

Vision Transformers apply the transformer architecture to images by treating image patches as a sequence of tokens, analogous to words in a sentence.

How It Works:

1. Split the image into fixed-size patches (e.g., 16x16 pixels)
2. Flatten each patch into a vector and linearly project it
3. Add position embeddings
4. Process the sequence through a transformer encoder
5. Use the [CLS] token representation for classification
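The input-preparation steps above (patchify, flatten, project, add a [CLS] token and position embeddings) can be sketched in a few lines of numpy. All sizes and the random projection are illustrative stand-ins for learned parameters; a real ViT would use a trained linear layer and learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))  # toy 224x224 RGB image
P, D = 16, 768                            # patch size, embedding dim

# 1. Split into non-overlapping 16x16 patches, 2. flatten each to P*P*3
n = 224 // P                                          # 14 patches per side
patches = img.reshape(n, P, n, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)              # (196, 768)

# Linear projection to the embedding dimension (learned in practice)
W = rng.standard_normal((P * P * 3, D)) * 0.02
tokens = patches @ W                                  # (196, 768)

# Prepend a [CLS] token, then 3. add position embeddings
cls_token = np.zeros((1, D))                          # learned in practice
tokens = np.concatenate([cls_token, tokens], axis=0)  # (197, 768)
pos_embed = rng.standard_normal((tokens.shape[0], D)) * 0.02
tokens = tokens + pos_embed

# This (197, 768) sequence is what the transformer encoder consumes;
# after encoding, row 0 (the [CLS] token) feeds the classification head.
print(tokens.shape)  # (197, 768)
```

Note that 196 patches plus one [CLS] token gives the familiar 197-token sequence length for a 224x224 image at patch size 16.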

Benefits:

- Scales better than CNNs as data and model size grow
- Captures global context via self-attention across all patches
- Transfers well to downstream tasks

Variants:

- DeiT: data-efficient training via distillation
- Swin Transformer: hierarchical shifted-window attention
- BEiT: BERT-style masked-image pre-training

Impact:

- Reshaped the computer vision landscape
- Foundation for multimodal models
- CLIP pairs a ViT image encoder with a text encoder

Examples

ViT-Large, pre-trained at scale, achieved state-of-the-art accuracy on ImageNet classification.
