Definition
Multimodal AI models can process and generate content across different modalities (text, images, audio, video) within a single unified system.
Modalities:
- Text (language understanding/generation)
- Images (vision, generation)
- Audio (speech, music)
- Video (understanding, generation)
- Code (programming)
Capabilities:
- Describe images
- Generate images from text
- Transcribe and generate speech
- Analyze videos
- Create videos from prompts
Examples:
- GPT-4o (text + images + audio)
- Gemini (text + images + video)
- Claude 3 (text + images)
- Sora (text to video)
Architecture: Models use encoders for each modality that map inputs to a shared representation space, enabling cross-modal understanding.
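The shared-representation idea can be sketched in a few lines. This is a toy illustration, not any real model's architecture: the "encoders" here are random linear projections (real systems use transformers and vision encoders), and all dimensions and names are made up. The point is only that each modality gets its own encoder, both project into the same embedding space, and similarity can then be compared across modalities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: each modality has its own raw feature dimension,
# but both encoders project into one shared embedding space.
TEXT_DIM, IMAGE_DIM, SHARED_DIM = 32, 48, 16

# Toy linear "encoders" (stand-ins for a text transformer / vision encoder).
W_text = rng.normal(size=(TEXT_DIM, SHARED_DIM))
W_image = rng.normal(size=(IMAGE_DIM, SHARED_DIM))

def encode(x, W):
    """Project raw features into the shared space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z)

def cross_modal_similarity(text_feat, image_feat):
    """Cosine similarity between a text and an image embedding,
    computed in the shared representation space."""
    return float(encode(text_feat, W_text) @ encode(image_feat, W_image))

text = rng.normal(size=TEXT_DIM)    # stand-in for encoded text features
image = rng.normal(size=IMAGE_DIM)  # stand-in for encoded image features
print(cross_modal_similarity(text, image))  # a value in [-1, 1]
```

Because both embeddings are unit-normalized in the same space, their dot product is a cosine similarity, which is how contrastively trained models (e.g. CLIP-style training) score text-image matches.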
Examples
GPT-4o analyzing a chart image and answering questions about it, or Sora generating video from a text description.