Lesson 13 of 15
intermediate · models

Multimodal AI

AI that sees, hears, and works across formats

5 min read
Early AI models worked with text only. Modern AI handles images, audio, video, and more—often simultaneously. This is multimodal AI.

What Is Multimodal AI?

Multimodal AI can process and generate multiple types of data:

  • Text — Language understanding and generation
  • Images — Vision, image generation
  • Audio — Speech, music, sound
  • Video — Moving images with temporal understanding
  • Code — Programming languages

The most powerful modern models are multimodal by default.

Vision-Language Models

Understanding Images

Capabilities:

  • Describe what's in an image
  • Answer questions about images
  • Extract text (OCR)
  • Analyze charts and diagrams
  • Identify objects and people

Models: GPT-4V, Claude 3, Gemini Pro Vision

Use cases:

  • Accessibility (describe images)
  • Document processing
  • Quality control in manufacturing
  • Medical image analysis

Example Interactions

[Upload image of whiteboard with notes]
User: "Transcribe this whiteboard and organize as bullet points"
AI: "• Meeting agenda for Q3 planning..."

[Upload chart]
User: "What trend does this show?"
AI: "Revenue grew 23% year-over-year..."
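Programmatically, interactions like these usually go through a chat API that accepts mixed text-and-image content, with the image sent inline as base64. The sketch below shows one way such a request body might be assembled; the model name and exact field layout follow the OpenAI-style convention and should be treated as illustrative assumptions, not a spec.

```python
import base64
import json

def build_vision_message(question: str, image_bytes: bytes) -> dict:
    """Package a user question plus an image into a chat-style request body.

    The payload shape loosely follows OpenAI-style vision APIs; treat the
    exact field names as an illustrative assumption.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4o",  # any vision-capable model
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{encoded}"},
                    },
                ],
            }
        ],
    }

# Build a request asking about a (dummy) chart image.
payload = build_vision_message("What trend does this chart show?", b"\x89PNG...")
print(json.dumps(payload)[:80])
```

The same structure works for the whiteboard example: only the `text` part changes, which is what makes vision-language models so flexible in practice.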

Image Generation

Models that create images from text:

Diffusion Models:

  • Stable Diffusion (open source)
  • Midjourney
  • DALL-E 3

How they work:

  1. Start with noise
  2. Iteratively "denoise" guided by text
  3. Result: Image matching the description
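The three steps above can be sketched with a toy one-dimensional analogue: start from pure noise and repeatedly nudge the sample toward a "target" value that stands in for the text guidance. Real diffusion models do this with a learned neural denoiser over millions of pixels; everything here is a deliberately simplified illustration.

```python
import random

def toy_denoise(target: float, steps: int = 50, seed: int = 0) -> float:
    """Toy 1-D analogue of diffusion sampling.

    `target` stands in for the text-conditioned direction a real model's
    denoiser would predict; we blend toward it while the injected noise
    shrinks each step.
    """
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)  # step 1: start with pure noise
    for t in range(steps, 0, -1):
        noise_scale = t / steps                  # noise anneals to zero
        x = x + 0.2 * (target - x)               # step 2: denoise toward guidance
        x += rng.gauss(0.0, 0.05 * noise_scale)  # re-inject shrinking noise
    return x  # step 3: a sample near the "described" value

print(toy_denoise(target=3.0))
```

The key idea carried over from real models: each iteration removes a little noise and adds back slightly less, so the sample gradually settles onto something consistent with the guidance.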

Capabilities:

  • Photorealistic images
  • Artistic styles
  • Image editing
  • Inpainting (fill in parts)

Audio Models

Speech Recognition

Convert spoken words to text.

  • Whisper (OpenAI) — Open source, highly accurate
  • Google Speech-to-Text
  • AssemblyAI

Speech Synthesis

Generate realistic speech from text.

  • ElevenLabs — Voice cloning
  • PlayHT — Diverse voices
  • OpenAI TTS — Simple and effective

Music Generation

  • Suno — Full songs from text
  • Udio — Music with vocals
  • Stable Audio — Instrumental tracks

Video Models

Video Understanding

Analyze video content and answer questions about it.

Capabilities:

  • Summarize videos
  • Answer temporal questions ("What happened after...?")
  • Extract key moments

Models: Gemini 1.5 Pro (can process hours of video)

Video Generation

Create video from text or images.

Models:

  • Sora (OpenAI) — High-fidelity clips, limited access
  • Runway Gen-2 — Commercially available
  • Pika Labs — Easy-to-use generation

Current state: Impressive short clips, consistency issues in longer content.

True Multimodal: Mixing Modalities

The frontier: Models that fluidly combine modalities.

GPT-4o:

  • Processes text, images, audio natively
  • Can hear and speak, not just read and write
  • Real-time conversation with voice

Gemini:

  • Native multimodal architecture
  • All modalities trained together
  • Strong cross-modal understanding

Practical Applications

Document Processing

  • Upload PDFs with mixed content
  • Extract information from forms
  • Process receipts and invoices
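Once a vision model has transcribed a receipt or invoice, the structured fields still need pulling out. A minimal sketch, assuming the model was prompted to emit plain "Label: value" lines (the regexes and field names are illustrative; real documents need far more robust parsing):

```python
import re

def extract_invoice_fields(ocr_text: str) -> dict:
    """Pull a few common fields out of OCR'd receipt text."""
    patterns = {
        "invoice_no": r"Invoice\s*#?\s*[:#]?\s*(\S+)",
        # \b prevents "Subtotal" from matching the "Total" field
        "total": r"\bTotal\s*:?\s*\$?([\d,]+\.\d{2})",
        "date": r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
    }
    fields = {}
    for name, pat in patterns.items():
        m = re.search(pat, ocr_text, flags=re.IGNORECASE)
        if m:
            fields[name] = m.group(1)
    return fields

sample = """ACME Supplies
Invoice #: INV-1042
Date: 2024-03-01
Subtotal: $118.00
Total: $127.44"""
print(extract_invoice_fields(sample))
```

In practice many teams skip the regex step entirely by asking the model to return JSON directly, but a post-processing pass like this is a common safety net.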

Accessibility

  • Describe images for visually impaired users
  • Transcribe audio for deaf and hard-of-hearing users
  • Read text aloud for blind users

Creative Work

  • Generate images for articles
  • Create video from scripts
  • Add music to content

Analysis

  • Understand complex diagrams
  • Extract data from charts
  • Analyze surveillance footage

Challenges

Consistency Across Modalities

  • Generated images may not match text exactly
  • Video consistency is hard to maintain
  • Cross-modal hallucinations

Computational Cost

  • Video processing is expensive
  • Real-time multimodal is resource-intensive
  • Trade-offs between quality and speed

Safety Concerns

  • Deepfakes and misinformation
  • Copyright questions for generated content
  • Privacy issues with face/voice

The Bottom Line

Multimodal AI is becoming the default:

  • Most frontier models handle multiple modalities
  • Single-modality models are increasingly niche
  • The future is AI that perceives the world as we do

For practitioners:

  • Leverage vision for document understanding
  • Consider audio for accessibility
  • Video capabilities are improving rapidly

Next up: Running AI Locally — AI without the cloud
