Lesson 11 of 15

RAG Explained

Retrieval-Augmented Generation for smarter AI

6 min read

Retrieval-Augmented Generation (RAG) is one of the most practical techniques for making AI useful in real applications. It lets AI answer questions using your specific data.

The Problem RAG Solves

LLMs have two major limitations:

  1. Knowledge cutoff — They don't know about anything that happened after their training data was collected
  2. No access to your data — They've never seen your documents

Without RAG:

"What's our company's refund policy?" "I don't have access to your company's information."

With RAG:

"What's our company's refund policy?" "According to your policy document, customers can request refunds within 30 days..."

How RAG Works

Step 1: Prepare Your Knowledge Base

Take your documents and:

  1. Chunk them into smaller pieces (paragraphs, sections)
  2. Embed each chunk (convert to numbers that capture meaning)
  3. Store embeddings in a vector database
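
In code, that preparation step might look like the minimal sketch below, using the OpenAI embeddings API and a plain Python list standing in for the vector database (the documents variable and the chunk size are assumptions for illustration):

# Minimal indexing sketch: chunk, embed, store.
# Assumes OPENAI_API_KEY is set and `documents` is a list of strings.
from openai import OpenAI

client = OpenAI()
index = []  # stands in for a real vector database

def split_into_chunks(text, size=800):
    # Naive fixed-size chunking by characters; see "Chunking Strategy" below
    return [text[i:i + size] for i in range(0, len(text), size)]

for doc in documents:
    chunks = split_into_chunks(doc)
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks,
    )
    for chunk, item in zip(chunks, response.data):
        index.append({"text": chunk, "embedding": item.embedding})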

Step 2: At Query Time

When a user asks a question:

  1. Embed the question (same process as documents)
  2. Search the vector database for similar chunks
  3. Retrieve the most relevant chunks
  4. Augment the prompt with retrieved context
  5. Generate an answer based on the context

User Question: "What's our refund policy?"
         ↓
    [Embed Query]
         ↓
    [Search Vector DB]
         ↓
    [Retrieve Top Chunks]
         ↓
    [Build Prompt: Question + Context]
         ↓
    [LLM Generates Answer]
         ↓
    "Customers can request refunds within 30 days..."

Key Components

1. Embedding Models

Convert text to vectors that capture semantic meaning.

Popular choices:

  • OpenAI text-embedding-3-small
  • Cohere embed-v3
  • Open source: BGE, E5, Nomic
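
To see what "capture semantic meaning" buys you, here's a small check using text-embedding-3-small from the list above (the two sentences are illustrative):

import numpy as np
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Can I get my money back?", "What is the refund policy?"],
)
a = np.array(resp.data[0].embedding)
b = np.array(resp.data[1].embedding)
# Cosine similarity: high for these two despite almost no shared words,
# which is exactly the case keyword search misses.
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))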

2. Vector Databases

Store and search embeddings efficiently.

Options:

  • Pinecone — Managed, easy to use
  • Weaviate — Full-featured, open source
  • Chroma — Lightweight, great for prototypes
  • pgvector — PostgreSQL extension (use existing infra)
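
As an example of the pgvector route, storage and search become ordinary SQL. A rough sketch using the pgvector Python adapter (table and column names are illustrative, and vector(1536) matches text-embedding-3-small's output size):

# Assumes `pip install psycopg2 pgvector` and a reachable Postgres instance.
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
register_vector(conn)  # lets psycopg2 pass numpy arrays as pgvector values

cur.execute("""CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    content text,
    embedding vector(1536))""")

# ...insert rows with their embeddings, then search by cosine distance (<=>),
# where `query_embedding` is the embedded user question:
cur.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 5",
    (np.array(query_embedding),),
)
top_chunks = [row[0] for row in cur.fetchall()]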

3. Chunking Strategy

How you split documents matters:

Too small: Lose context, fragments don't make sense.
Too large: Dilute relevance, hit token limits.

Common approaches:

  • Fixed size (500-1000 tokens) with overlap
  • Semantic chunking (split at natural boundaries)
  • Document-specific (headers, paragraphs)
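
A sketch of the first approach, fixed-size chunks with overlap (measured in characters here for simplicity; real systems usually count tokens):

def chunk_with_overlap(text, size=1000, overlap=200):
    # Each chunk repeats the last `overlap` characters of the previous one,
    # so a sentence that straddles a boundary survives intact in one chunk.
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]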

4. Retrieval Strategy

How many chunks? How to rank them?

Simple: Top K nearest neighbors (cosine similarity)
Better: Hybrid search (combine keyword + semantic)
Advanced: Re-ranking with cross-encoders
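
To make the simple option concrete, here's top-K retrieval by cosine similarity over a matrix of chunk embeddings, using only NumPy (a vector database does the same math behind a faster index):

import numpy as np

def top_k(query_vec, chunk_vecs, k=5):
    # Cosine similarity is the dot product of L2-normalized vectors.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return np.argsort(c @ q)[::-1][:k]  # indices of the k most similar chunks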

RAG vs Fine-Tuning

Aspect      RAG                    Fine-Tuning
New data    Add anytime            Retrain required
Citation    Can link to sources    Can't show sources
Cost        Per-query retrieval    One-time training
Best for    Facts, documents       Style, behavior

Most teams use both:

  • RAG for factual Q&A over documents
  • Fine-tuning for consistent behavior/style

Building a RAG System

Basic Implementation

# A minimal, runnable version of the pseudo-code idea, using Chroma as
# the vector database and the OpenAI SDK for generation. Assumes
# `pip install chromadb openai`, OPENAI_API_KEY set, and `documents` /
# `split_into_chunks` as in the earlier sketches.
import chromadb
from openai import OpenAI

chroma = chromadb.Client()
collection = chroma.create_collection(name="docs")  # embeds with Chroma's default model
llm = OpenAI()

# 1. Index documents
for i, doc in enumerate(documents):
    chunks = split_into_chunks(doc)
    collection.add(
        documents=chunks,
        ids=[f"doc{i}-chunk{j}" for j in range(len(chunks))],
    )

# 2. Query
def answer(question):
    # Retrieve the five most similar chunks, then augment the prompt
    relevant = collection.query(query_texts=[question], n_results=5)
    context = "\n\n".join(relevant["documents"][0])

    prompt = f"""Context:
{context}

Question: {question}

Answer based only on the context above:"""

    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Usage: print(answer("What's our refund policy?"))

Common Frameworks

  • LangChain — Most popular, lots of integrations
  • LlamaIndex — Purpose-built for RAG
  • Haystack — Production-focused
  • Vercel AI SDK — Good for web apps
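
For a sense of how much these frameworks compress, here's the LlamaIndex quick-start pattern (the API shown reflects recent versions and these interfaces move quickly, so treat it as a sketch and check the current docs):

# Assumes `pip install llama-index`, OPENAI_API_KEY set, and a ./data
# folder of files to index (the folder name is illustrative).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
print(query_engine.query("What's our refund policy?"))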

Challenges and Solutions

Retrieval Quality

Problem: Wrong chunks retrieved.
Solutions:

  • Better chunking
  • Hybrid search
  • Query rewriting
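
Of these, query rewriting is the least self-explanatory. A hedged sketch, reusing the llm client and model from the implementation above: ask the model to turn a conversational question into a self-contained search query before embedding it.

def rewrite_query(question):
    # Terse follow-ups like "what about sale items?" embed poorly;
    # a standalone rewrite restores the missing context.
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Rewrite this as a standalone search query, "
                       "keeping all key terms: " + question,
        }],
    )
    return response.choices[0].message.content

Embedding the rewrite instead of the raw question tends to help most when users ask short follow-ups that only make sense with conversation context.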

Context Window Limits

Problem: Too much context.
Solutions:

  • Smarter chunk selection
  • Summarize before adding
  • Use models with larger context

Hallucinations

Problem: AI ignores the context and makes things up.
Solutions:

  • Explicit instructions to only use context
  • Quote-based responses
  • Confidence scoring
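
A sketch combining the first two solutions, an explicit grounding instruction plus a quote-first format (the wording is illustrative, not a canonical prompt):

GROUNDED_PROMPT = """Answer using ONLY the context below.
First quote the exact passage you rely on, then give your answer.
If the context does not contain the answer, reply "I don't know."

Context:
{context}

Question: {question}"""

This can replace the prompt in the earlier answer() function via GROUNDED_PROMPT.format(context=context, question=question).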

Stale Data

Problem: Documents change.
Solutions:

  • Incremental indexing
  • Version tracking
  • Scheduled re-indexing

The Bottom Line

RAG is essential for:

  • Document Q&A
  • Customer support
  • Internal knowledge bases
  • Any application needing current/private data

It's not magic—good RAG requires:

  • Quality chunking
  • Appropriate embedding models
  • Thoughtful retrieval strategy
  • Good prompts

Start simple (basic vector search), then optimize based on what breaks.


Next up: AI Agents Explained — Autonomous AI that takes action
