RAG Explained
Retrieval-Augmented Generation (RAG) is one of the most practical techniques for making AI useful in real applications. It lets AI answer questions using your specific data.
The Problem RAG Solves
LLMs have two major limitations:
- Knowledge cutoff — They don't know about recent events or anything that happened after their training data was collected
- No access to your data — They have never seen your documents, wikis, or internal policies
Without RAG:
"What's our company's refund policy?" "I don't have access to your company's information."
With RAG:
"What's our company's refund policy?" "According to your policy document, customers can request refunds within 30 days..."
How RAG Works
Step 1: Prepare Your Knowledge Base
Take your documents and:
- Chunk them into smaller pieces (paragraphs, sections)
- Embed each chunk (convert to numbers that capture meaning)
- Store embeddings in a vector database
Step 2: At Query Time
When a user asks a question:
- Embed the question (with the same embedding model used for the documents)
- Search the vector database for similar chunks
- Retrieve the most relevant chunks
- Augment the prompt with retrieved context
- Generate an answer based on the context
User Question: "What's our refund policy?"
↓
[Embed Query]
↓
[Search Vector DB]
↓
[Retrieve Top Chunks]
↓
[Build Prompt: Question + Context]
↓
[LLM Generates Answer]
↓
"Customers can request refunds within 30 days..."
Key Components
1. Embedding Models
Convert text to vectors that capture semantic meaning (see the sketch after this list).
Popular choices:
- OpenAI text-embedding-3-small
- Cohere embed-v3
- Open source: BGE, E5, Nomic
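To make that concrete, here's a minimal sketch (assuming the `openai` and `numpy` packages, with `text-embedding-3-small` from the list above) that embeds three strings and compares them by cosine similarity:

```python
# Minimal sketch: semantically similar texts get similar vectors.
# Assumes the `openai` package and an OPENAI_API_KEY in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["How do I get my money back?", "Refund policy", "Office parking rules"],
)
a, b, c = (np.array(d.embedding) for d in resp.data)

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine(a, b))  # refund question vs. refund policy: relatively high
print(cosine(a, c))  # refund question vs. parking rules: noticeably lower
```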
2. Vector Databases
Store and search embeddings efficiently.
Options:
- Pinecone — Managed, easy to use
- Weaviate — Full-featured, open source
- Chroma — Lightweight, great for prototypes (example below)
- pgvector — PostgreSQL extension (use existing infra)
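As a concrete starting point, here's a quick sketch using Chroma (assuming the `chromadb` package; Chroma embeds documents with a built-in default model unless you supply embeddings yourself):

```python
# Minimal sketch: store chunks in Chroma and run a similarity search.
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to persist
collection = client.create_collection("docs")
collection.add(
    ids=["c1", "c2"],
    documents=[
        "Customers can request refunds within 30 days of purchase.",
        "Standard shipping takes 3-5 business days.",
    ],
)
results = collection.query(query_texts=["What's the refund window?"], n_results=1)
print(results["documents"][0])  # the refund chunk should come back first
```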
3. Chunking Strategy
How you split documents matters:
- Too small: Lose context, fragments don't make sense
- Too large: Dilute relevance, hit token limits
Common approaches:
- Fixed size (500-1000 tokens) with overlap (sketched after this list)
- Semantic chunking (split at natural boundaries)
- Document-specific (headers, paragraphs)
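Here's a minimal sketch of the first approach, fixed-size chunks with overlap (character counts stand in for tokens here to keep the example dependency-free; a tokenizer-based version works the same way):

```python
# Minimal sketch: fixed-size chunking with overlap.
# Overlap keeps sentences that straddle a boundary present in both chunks.
def split_into_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    assert overlap < size, "overlap must be smaller than chunk size"
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```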
4. Retrieval Strategy
How many chunks? How to rank them?
Simple: Top K nearest neighbors (cosine similarity)
Better: Hybrid search (combine keyword + semantic)
Advanced: Re-ranking with cross-encoders
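For the advanced option, here's a sketch of re-ranking with a cross-encoder (assuming the `sentence-transformers` package; the model name is one common choice, not the only one):

```python
# Minimal sketch: re-rank retrieved chunks with a cross-encoder.
# Cross-encoders score each (query, chunk) pair jointly, so they are more
# accurate than vector similarity but too slow to run over a whole corpus.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, chunks: list[str], top_n: int = 3) -> list[str]:
    scores = reranker.predict([(question, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```

The usual pattern is vector search for a few dozen candidates, then re-ranking down to the handful that actually go into the prompt.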
RAG vs Fine-Tuning
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| New data | Add anytime | Requires retraining |
| Citation | Can link to sources | Can't show sources |
| Cost | Per-query retrieval | One-time training |
| Best for | Facts, documents | Style, behavior |
Most teams use both:
- RAG for factual Q&A over documents
- Fine-tuning for consistent behavior/style
Building a RAG System
Basic Implementation
```python
# A minimal working RAG system. Assumes the `openai` and `numpy` packages
# and an OPENAI_API_KEY in the environment; any embedding model, vector
# store, and LLM can be swapped in. `documents` is your corpus, and
# split_into_chunks is the chunker sketched earlier.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1. Index documents (a NumPy array stands in for a vector database)
chunks = [chunk for doc in documents for chunk in split_into_chunks(doc)]
index = embed(chunks)

# 2. Query: embed the question, retrieve the top chunks, generate an answer
def answer(question, top_k=5):
    q = embed([question])[0]
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(chunks[i] for i in np.argsort(scores)[::-1][:top_k])
    prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer based on the context above:"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```
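A couple of design notes: top_k=5 is a common default, and an in-memory index like this is fine for prototypes with a few thousand chunks; beyond that, a vector database from the list above earns its keep.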
Common Frameworks
- LangChain — Most popular, lots of integrations
- LlamaIndex — Purpose-built for RAG
- Haystack — Production-focused
- Vercel AI SDK — Good for web apps
Challenges and Solutions
Retrieval Quality
Problem: Wrong chunks retrieved
Solutions:
- Better chunking
- Hybrid search
- Query rewriting
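As an example of the last item, query rewriting can be a single extra LLM call before retrieval (a sketch assuming the `openai` package; the instruction wording is illustrative):

```python
# Minimal sketch: rewrite a conversational question into a retrieval-friendly query.
from openai import OpenAI

client = OpenAI()

def rewrite_query(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Rewrite this question as a concise, keyword-rich "
                       f"search query for a document index: {question}",
        }],
    )
    return resp.choices[0].message.content
```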
Context Window Limits
Problem: Too much context
Solutions:
- Smarter chunk selection (sketched below)
- Summarize before adding
- Use models with larger context
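Here's a sketch of the first solution, selecting chunks under a fixed token budget (the 4-characters-per-token estimate is a rough heuristic; a real tokenizer such as tiktoken is more accurate):

```python
# Minimal sketch: pack the highest-ranked chunks into a token budget.
def select_chunks(ranked_chunks: list[str], max_tokens: int = 3000) -> list[str]:
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk) // 4  # crude token estimate: ~4 characters per token
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected
```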
Hallucinations
Problem: AI ignores context, makes things up
Solutions:
- Explicit instructions to only use context (see the prompt sketch below)
- Quote-based responses
- Confidence scoring
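For the first two solutions, an illustrative prompt template (the exact wording is an example to adapt, not a proven recipe):

```python
# Minimal sketch: a grounding prompt that asks for quotes and allows "I don't know".
GROUNDED_PROMPT = """Answer using ONLY the context below.
First quote the passage(s) that support your answer, then answer.
If the context does not contain the answer, reply exactly:
"I don't know based on the available documents."

Context:
{context}

Question: {question}
"""
```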
Stale Data
Problem: Documents change
Solutions:
- Incremental indexing (sketched below)
- Version tracking
- Scheduled re-indexing
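A sketch of hash-based incremental indexing (`index_document` and `delete_document` are hypothetical stand-ins for whatever your vector database exposes):

```python
# Minimal sketch: re-index a document only when its content hash changes.
import hashlib

seen_hashes: dict[str, str] = {}  # doc_id -> hash from the last indexing run

def reindex(doc_id: str, text: str) -> None:
    digest = hashlib.sha256(text.encode()).hexdigest()
    if seen_hashes.get(doc_id) == digest:
        return  # unchanged since last run: nothing to do
    delete_document(doc_id)       # hypothetical: drop the stale chunks
    index_document(doc_id, text)  # hypothetical: chunk, embed, store fresh ones
    seen_hashes[doc_id] = digest
```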
The Bottom Line
RAG is essential for:
- Document Q&A
- Customer support
- Internal knowledge bases
- Any application needing current/private data
It's not magic—good RAG requires:
- Quality chunking
- Appropriate embedding models
- Thoughtful retrieval strategy
- Good prompts
Start simple (basic vector search), then optimize based on what breaks.
Next up: AI Agents Explained — Autonomous AI that takes action