Definition
Vision Transformers (ViTs) apply the transformer architecture to images by treating each image as a sequence of patches.
How It Works:
1. Split the image into fixed-size patches (e.g., 16x16 pixels).
2. Flatten each patch and linearly project it to an embedding vector.
3. Add position embeddings to retain spatial information.
4. Process the sequence through a transformer encoder.
5. Use the [CLS] token representation for classification.
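The steps above map almost line-for-line onto code. Below is a minimal sketch in PyTorch; the class name `MiniViT` and the small dimensions are illustrative assumptions, not the published ViT-Base/Large configurations.

```python
# A minimal sketch of the five steps above in PyTorch. Dimensions
# (patch size 16, embed dim 192, 4 heads, 4 layers) are illustrative.
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192,
                 heads=4, layers=4, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Steps 1-2: patchify + flatten + linear projection, done in one
        # strided convolution (a common implementation trick).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        # Step 3: learned position embeddings (+1 slot for [CLS]).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # Step 4: a stack of standard transformer encoder layers.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Step 5: classification head on the [CLS] token.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                    # (B, 3, H, W)
        x = self.patch_embed(images)              # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                 # logits from [CLS]

logits = MiniViT()(torch.randn(2, 3, 224, 224))   # torch.Size([2, 1000])
```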
Benefits:
- Scales better than CNNs as data and model size grow
- Self-attention gives every layer a global receptive field, capturing long-range context
- Transfer learning works well (see the sketch below)
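To make the transfer-learning point concrete, the sketch below fine-tunes only the classification head of a pretrained ViT. It assumes the third-party timm library is installed; the model name, task size, and hyperparameters are illustrative choices, not a prescribed recipe.

```python
# A hedged sketch of head-only fine-tuning, assuming timm is installed
# (pip install timm). Hyperparameters and dummy data are illustrative.
import timm
import torch

# Load an ImageNet-pretrained ViT and replace the head for a 10-class task.
model = timm.create_model('vit_base_patch16_224', pretrained=True,
                          num_classes=10)

# Freeze the backbone; train only the new classification head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith('head')

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=3e-4)

# One illustrative training step on dummy data.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
loss = torch.nn.functional.cross_entropy(model(images), labels)
loss.backward()
optimizer.step()
```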
Variants:
- DeiT: Data-efficient training
- Swin Transformer: Hierarchical windows
- BEiT: BERT-style pre-training for vision
Impact:
- Reshaped the computer vision landscape
- Serves as the image backbone for many multimodal models
- CLIP pairs a ViT image encoder with a text encoder (see the sketch below)
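The sketch below illustrates the core idea behind CLIP-style pairing: contrastive alignment of image and text embeddings in a shared space. The two "encoders" here are stand-in linear layers, an assumption for brevity; in CLIP proper they are a ViT image tower and a transformer text tower.

```python
# A minimal sketch of CLIP-style contrastive alignment in PyTorch.
# The encoders are stand-in linear maps, not the real CLIP towers.
import torch
import torch.nn.functional as F

batch, img_feat, txt_feat, dim = 8, 768, 512, 256
image_encoder = torch.nn.Linear(img_feat, dim)  # stand-in for a ViT
text_encoder = torch.nn.Linear(txt_feat, dim)   # stand-in for a text model

img = F.normalize(image_encoder(torch.randn(batch, img_feat)), dim=-1)
txt = F.normalize(text_encoder(torch.randn(batch, txt_feat)), dim=-1)

# Cosine-similarity logits between every image and every caption.
temperature = 0.07
logits = img @ txt.t() / temperature

# Matching image-caption pairs sit on the diagonal; the symmetric
# cross-entropy pulls them together and pushes mismatches apart.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
```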
Examples
ViT-Large achieving state-of-the-art accuracy on ImageNet classification when pre-trained on large-scale data.