
CLIP

OpenAI model that connects images and text in a shared embedding space.

Definition

CLIP (Contrastive Language-Image Pre-training) learns to connect images and text through contrastive learning.

Training:
- Learns from image-text pairs collected from the internet
- No manual labeling needed
- Trained on 400M image-text pairs

How It Works:
- Image encoder (ViT or ResNet)
- Text encoder (Transformer)
- Training maximizes the similarity of matching image-text pairs
- Training minimizes the similarity of non-matching pairs
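This symmetric contrastive objective can be written down compactly. The following is a minimal PyTorch sketch; the random feature tensors stand in for real encoder outputs, and the dimensions and temperature value are illustrative rather than taken from the paper.

```python
# A minimal sketch of CLIP-style symmetric contrastive (InfoNCE) loss.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Loss over a batch where image i and text i are a matching pair."""
    # Normalize so the dot product is cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_features @ text_features.t() / temperature

    # The matching pair for each row/column lies on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: images -> texts and texts -> images.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Toy usage with random embeddings standing in for encoder outputs.
imgs = torch.randn(8, 512)   # batch of 8 image embeddings
txts = torch.randn(8, 512)   # batch of 8 matching text embeddings
print(clip_contrastive_loss(imgs, txts))
```

In the actual model the temperature is not fixed but learned as a trainable parameter during pre-training.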

Capabilities:
- Zero-shot image classification
- Image-text similarity scoring
- Powers DALL-E and Stable Diffusion
- Image search with natural language queries
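Zero-shot classification works by phrasing candidate labels as text prompts and picking the one whose embedding is closest to the image embedding. The sketch below assumes the Hugging Face transformers implementation of CLIP and the public openai/clip-vit-base-patch32 checkpoint; the image path is a placeholder.

```python
# Zero-shot classification: score an image against label prompts,
# with no task-specific training.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```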

Impact:
- Foundation for text-to-image models
- Enabled zero-shot visual recognition
- Key component in multimodal AI systems

Examples

Using CLIP to find images matching the description "a cat wearing a hat", as sketched below.
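A minimal sketch of that search, again assuming the Hugging Face transformers CLIP implementation and the openai/clip-vit-base-patch32 checkpoint; the image file names are placeholders.

```python
# Rank a small collection of images by similarity to a text query.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "a cat wearing a hat"
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # placeholder files
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

    image_inputs = processor(images=images, return_tensors="pt")
    image_embs = model.get_image_features(**image_inputs)

# Cosine similarity between the query embedding and each image embedding.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
scores = (image_embs @ text_emb.T).squeeze(-1)

# Print images from best to worst match for the query.
for path, score in sorted(zip(image_paths, scores.tolist()),
                          key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {path}")
```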
