Definition
CLIP (Contrastive Language-Image Pre-training) learns to connect images and text through contrastive learning.
Training:
- Learns from image-text pairs collected from the internet
- No manual labeling needed
- Trained on 400 million image-text pairs
How It Works:
- Image encoder (ViT or ResNet)
- Text encoder (Transformer)
- Trained to maximize the similarity of matching image-text pairs
- Trained to minimize the similarity of non-matching pairs
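The training objective above (pull matching pairs together, push non-matching pairs apart) can be sketched in NumPy. This is a toy illustration of CLIP's symmetric contrastive (InfoNCE) loss using random stand-in embeddings, not a real encoder:

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: row i of image_emb pairs with row i of text_emb."""
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # n x n similarity matrix; the diagonal holds the matching pairs
    logits = image_emb @ text_emb.T / temperature

    # Cross-entropy in both directions: image->text (rows) and text->image (columns)
    i2t = -np.diag(log_softmax(logits, axis=1))
    t2i = -np.diag(log_softmax(logits, axis=0))
    return float((i2t.mean() + t2i.mean()) / 2)

# Toy batch: texts are noisy copies of images, so matching pairs are aligned
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))
print(clip_contrastive_loss(img, txt))        # low: matching pairs dominate
print(clip_contrastive_loss(img, txt[::-1]))  # higher: pairs misaligned
```

In the real model, the batch is large (tens of thousands of pairs) and the temperature is a learned parameter rather than a fixed constant.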
Capabilities:
- Zero-shot image classification
- Image-text similarity scoring
- Image search with natural language queries
- A building block for text-to-image models such as DALL-E and Stable Diffusion
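Zero-shot classification works by embedding a text prompt per class and picking the class whose embedding is closest to the image embedding. A minimal sketch, using a hand-made lookup in place of CLIP's actual text encoder (all embedding values here are hypothetical):

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, embed_text):
    """Pick the class whose prompt embedding is most similar to the image embedding."""
    # One prompt per candidate class, following CLIP's prompt-template idea
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embs = np.stack([embed_text(p) for p in prompts])

    # Cosine similarity = dot product of unit vectors
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb
    return class_names[int(np.argmax(sims))]

# Stand-in text encoder: a tiny lookup table instead of CLIP's text tower
FAKE_TEXT_EMBS = {
    "a photo of a cat": np.array([1.0, 0.0, 0.0]),
    "a photo of a dog": np.array([0.0, 1.0, 0.0]),
    "a photo of a car": np.array([0.0, 0.0, 1.0]),
}
embed_text = FAKE_TEXT_EMBS.__getitem__

cat_image_emb = np.array([0.9, 0.2, 0.1])  # pretend CLIP image embedding of a cat photo
print(zero_shot_classify(cat_image_emb, ["cat", "dog", "car"], embed_text))  # cat
```

Note that no training on the target classes is needed: changing the list of class names is enough to classify against a new label set.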
Impact:
- Foundation for text-to-image models
- Enabled zero-shot visual recognition
- Key component in multimodal AI systems
Examples
Using CLIP to find images matching the description "a cat wearing a hat": embed the query text and every candidate image, then rank the images by the cosine similarity of their embeddings to the query.
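The retrieval step can be sketched as follows. In practice the embeddings would come from a pretrained CLIP model; here they are stand-in vectors (hypothetical values and filenames) so the ranking logic is self-contained:

```python
import numpy as np

def rank_images_by_text(text_emb, image_embs, image_names):
    """Rank images by cosine similarity to a text query embedding."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = image_embs @ text_emb
    order = np.argsort(-sims)  # highest similarity first
    return [(image_names[i], float(sims[i])) for i in order]

# Pretend CLIP embeddings (hypothetical values)
query = np.array([0.9, 0.1, 0.2])       # embedding of "a cat wearing a hat"
images = np.array([
    [0.85, 0.15, 0.25],                  # close to the query
    [0.10, 0.90, 0.10],
    [0.20, 0.20, 0.90],
])
names = ["cat_in_hat.jpg", "dog_on_beach.jpg", "city_skyline.jpg"]

ranked = rank_images_by_text(query, images, names)
print(ranked[0][0])  # cat_in_hat.jpg
```

Because image embeddings can be computed once and stored, this scales to large collections: only the query text needs to be embedded at search time.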