Definition
CLIP (Contrastive Language-Image Pre-training) learns to connect images and text through contrastive learning.
Training:
- Learns from image-text pairs collected from the internet
- No manual labeling needed
- Trained on 400 million image-text pairs
How It Works:
- Image encoder (ViT or ResNet)
- Text encoder (Transformer)
- Trained to maximize the similarity of matching image-text pairs
- Trained to minimize the similarity of non-matching pairs
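The training objective above (pull matching pairs together, push non-matching pairs apart) can be sketched in NumPy. This is a toy illustration of CLIP's symmetric contrastive (InfoNCE) loss using random stand-in embeddings, not a real encoder:

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: row i of image_emb pairs with row i of text_emb."""
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # n x n similarity matrix; the diagonal holds the matching pairs
    logits = image_emb @ text_emb.T / temperature

    # Cross-entropy in both directions: image->text (rows) and text->image (columns)
    i2t = -np.diag(log_softmax(logits, axis=1))
    t2i = -np.diag(log_softmax(logits, axis=0))
    return float((i2t.mean() + t2i.mean()) / 2)

# Toy batch: texts are noisy copies of images, so matching pairs are aligned
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))
print(clip_contrastive_loss(img, txt))        # low: matching pairs dominate
print(clip_contrastive_loss(img, txt[::-1]))  # higher: pairs misaligned
```

In the real model, the batch is large (tens of thousands of pairs) and the temperature is a learned parameter rather than a fixed constant.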
Capabilities:
- Zero-shot image classification
- Image-text similarity scoring
- Image search with natural language queries
- A building block for text-to-image models such as DALL-E and Stable Diffusion
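Zero-shot classification works by embedding a text prompt per class and picking the class whose embedding is closest to the image embedding. A minimal sketch, using a hand-made lookup in place of CLIP's actual text encoder (all embedding values here are hypothetical):

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, embed_text):
    """Pick the class whose prompt embedding is most similar to the image embedding."""
    # One prompt per candidate class, following CLIP's prompt-template idea
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embs = np.stack([embed_text(p) for p in prompts])

    # Cosine similarity = dot product of unit vectors
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb
    return class_names[int(np.argmax(sims))]

# Stand-in text encoder: a tiny lookup table instead of CLIP's text tower
FAKE_TEXT_EMBS = {
    "a photo of a cat": np.array([1.0, 0.0, 0.0]),
    "a photo of a dog": np.array([0.0, 1.0, 0.0]),
    "a photo of a car": np.array([0.0, 0.0, 1.0]),
}
embed_text = FAKE_TEXT_EMBS.__getitem__

cat_image_emb = np.array([0.9, 0.2, 0.1])  # pretend CLIP image embedding of a cat photo
print(zero_shot_classify(cat_image_emb, ["cat", "dog", "car"], embed_text))  # cat
```

Note that no training on the target classes is needed: changing the list of class names is enough to classify against a new label set.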
Impact:
- Foundation for text-to-image models
- Enabled zero-shot visual recognition
- Key component in multimodal AI systems
Examples
Using CLIP to find images matching the description "a cat wearing a hat": embed the query text and every candidate image, then rank the images by the cosine similarity of their embeddings to the query.
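The retrieval step can be sketched as follows. In practice the embeddings would come from a pretrained CLIP model; here they are stand-in vectors (hypothetical values and filenames) so the ranking logic is self-contained:

```python
import numpy as np

def rank_images_by_text(text_emb, image_embs, image_names):
    """Rank images by cosine similarity to a text query embedding."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = image_embs @ text_emb
    order = np.argsort(-sims)  # highest similarity first
    return [(image_names[i], float(sims[i])) for i in order]

# Pretend CLIP embeddings (hypothetical values)
query = np.array([0.9, 0.1, 0.2])       # embedding of "a cat wearing a hat"
images = np.array([
    [0.85, 0.15, 0.25],                  # close to the query
    [0.10, 0.90, 0.10],
    [0.20, 0.20, 0.90],
])
names = ["cat_in_hat.jpg", "dog_on_beach.jpg", "city_skyline.jpg"]

ranked = rank_images_by_text(query, images, names)
print(ranked[0][0])  # cat_in_hat.jpg
```

Because image embeddings can be computed once and stored, this scales to large collections: only the query text needs to be embedded at search time.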