Lesson 13 of 15
intermediate · models

Multimodal AI

AI that sees, hears, and works across formats

5 min read
Early AI models worked with text only. Modern AI handles images, audio, video, and more—often simultaneously. This is multimodal AI.

What Is Multimodal AI?

Multimodal AI can process and generate multiple types of data:

  • Text — Language understanding and generation
  • Images — Vision, image generation
  • Audio — Speech, music, sound
  • Video — Moving images with temporal understanding
  • Code — Programming languages

The most powerful modern models are multimodal by default.

Vision-Language Models

Understanding Images

Capabilities:

  • Describe what's in an image
  • Answer questions about images
  • Extract text (OCR)
  • Analyze charts and diagrams
  • Identify objects and people

Models: GPT-4V, Claude 3, Gemini Pro Vision

Use cases:

  • Accessibility (describe images)
  • Document processing
  • Quality control in manufacturing
  • Medical image analysis

Example Interactions

[Upload image of whiteboard with notes]
User: "Transcribe this whiteboard and organize as bullet points"
AI: "• Meeting agenda for Q3 planning..."

[Upload chart]
User: "What trend does this show?"
AI: "Revenue grew 23% year-over-year..."
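Programmatically, interactions like these usually go through a chat API that accepts mixed text-and-image content, with the image sent inline as base64. The sketch below shows one way such a request body might be assembled; the model name and exact field layout follow the OpenAI-style convention and should be treated as illustrative assumptions, not a spec.

```python
import base64
import json

def build_vision_message(question: str, image_bytes: bytes) -> dict:
    """Package a user question plus an image into a chat-style request body.

    The payload shape loosely follows OpenAI-style vision APIs; treat the
    exact field names as an illustrative assumption.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4o",  # any vision-capable model
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{encoded}"},
                    },
                ],
            }
        ],
    }

# Build a request asking about a (dummy) chart image.
payload = build_vision_message("What trend does this chart show?", b"\x89PNG...")
print(json.dumps(payload)[:80])
```

The same structure works for the whiteboard example: only the `text` part changes, which is what makes vision-language models so flexible in practice.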

Image Generation

Models that create images from text:

Diffusion Models:

  • Stable Diffusion (open source)
  • Midjourney
  • DALL-E 3

How they work:

  1. Start with noise
  2. Iteratively "denoise" guided by text
  3. Result: Image matching the description
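The three steps above can be sketched with a toy one-dimensional analogue: start from pure noise and repeatedly nudge the sample toward a "target" value that stands in for the text guidance. Real diffusion models do this with a learned neural denoiser over millions of pixels; everything here is a deliberately simplified illustration.

```python
import random

def toy_denoise(target: float, steps: int = 50, seed: int = 0) -> float:
    """Toy 1-D analogue of diffusion sampling.

    `target` stands in for the text-conditioned direction a real model's
    denoiser would predict; we blend toward it while the injected noise
    shrinks each step.
    """
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)  # step 1: start with pure noise
    for t in range(steps, 0, -1):
        noise_scale = t / steps                  # noise anneals to zero
        x = x + 0.2 * (target - x)               # step 2: denoise toward guidance
        x += rng.gauss(0.0, 0.05 * noise_scale)  # re-inject shrinking noise
    return x  # step 3: a sample near the "described" value

print(toy_denoise(target=3.0))
```

The key idea carried over from real models: each iteration removes a little noise and adds back slightly less, so the sample gradually settles onto something consistent with the guidance.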

Capabilities:

  • Photorealistic images
  • Artistic styles
  • Image editing
  • Inpainting (fill in parts)

Audio Models

Speech Recognition

Convert spoken words to text.

  • Whisper (OpenAI) — Open source, highly accurate
  • Google Speech-to-Text
  • AssemblyAI

Speech Synthesis

Generate realistic speech from text.

  • ElevenLabs — Voice cloning
  • PlayHT — Diverse voices
  • OpenAI TTS — Simple and effective

Music Generation

  • Suno — Full songs from text
  • Udio — Music with vocals
  • Stable Audio — Instrumental tracks

Video Models

Video Understanding

Analyze video content and answer questions about it.

Capabilities:

  • Summarize videos
  • Answer temporal questions ("What happened after...?")
  • Extract key moments

Models: Gemini 1.5 Pro (can process hours of video)

Video Generation

Create video from text or images.

Models:

  • Sora (OpenAI) — High-fidelity clips, limited access
  • Runway Gen-2 — Commercially available
  • Pika Labs — Easy-to-use generation

Current state: Impressive short clips, consistency issues in longer content.

True Multimodal: Mixing Modalities

The frontier: Models that fluidly combine modalities.

GPT-4o:

  • Processes text, images, audio natively
  • Can hear and speak, not just read and write
  • Real-time conversation with voice

Gemini:

  • Native multimodal architecture
  • All modalities trained together
  • Strong cross-modal understanding

Practical Applications

Document Processing

  • Upload PDFs with mixed content
  • Extract information from forms
  • Process receipts and invoices
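Once a vision model has transcribed a receipt or invoice, the structured fields still need pulling out. A minimal sketch, assuming the model was prompted to emit plain "Label: value" lines (the regexes and field names are illustrative; real documents need far more robust parsing):

```python
import re

def extract_invoice_fields(ocr_text: str) -> dict:
    """Pull a few common fields out of OCR'd receipt text."""
    patterns = {
        "invoice_no": r"Invoice\s*#?\s*[:#]?\s*(\S+)",
        # \b prevents "Subtotal" from matching the "Total" field
        "total": r"\bTotal\s*:?\s*\$?([\d,]+\.\d{2})",
        "date": r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
    }
    fields = {}
    for name, pat in patterns.items():
        m = re.search(pat, ocr_text, flags=re.IGNORECASE)
        if m:
            fields[name] = m.group(1)
    return fields

sample = """ACME Supplies
Invoice #: INV-1042
Date: 2024-03-01
Subtotal: $118.00
Total: $127.44"""
print(extract_invoice_fields(sample))
```

In practice many teams skip the regex step entirely by asking the model to return JSON directly, but a post-processing pass like this is a common safety net.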

Accessibility

  • Describe images for visually impaired users
  • Transcribe audio for deaf and hard-of-hearing users
  • Read text aloud for blind users

Creative Work

  • Generate images for articles
  • Create video from scripts
  • Add music to content

Analysis

  • Understand complex diagrams
  • Extract data from charts
  • Analyze surveillance footage

Challenges

Consistency Across Modalities

  • Generated images may not match text exactly
  • Video consistency is hard to maintain
  • Cross-modal hallucinations

Computational Cost

  • Video processing is expensive
  • Real-time multimodal is resource-intensive
  • Trade-offs between quality and speed

Safety Concerns

  • Deepfakes and misinformation
  • Copyright questions for generated content
  • Privacy issues with face/voice

The Bottom Line

Multimodal AI is becoming the default:

  • Most frontier models handle multiple modalities
  • Single-modality models are increasingly niche
  • The future is AI that perceives the world as we do

For practitioners:

  • Leverage vision for document understanding
  • Consider audio for accessibility
  • Video capabilities are improving rapidly

Next up: Running AI Locally — AI without the cloud
