Lesson 3 of 15
Intermediate · Models

How LLMs Work

Demystifying the technology behind ChatGPT and Claude

6 min read


Large Language Models like GPT-4 and Claude can write essays, code, poetry, and hold conversations. How do they actually work?

The Core Insight

LLMs are next-word prediction machines.

When you type "The cat sat on the ___", the model predicts the most likely next word based on patterns it learned from billions of text examples.

That's it. Everything else—conversations, reasoning, creativity—emerges from doing this really, really well.
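The "next-word prediction" idea can be sketched with a toy bigram model: count which word follows which, then predict the most frequent follower. This is an illustrative simplification (the corpus is made up, and real LLMs use neural networks rather than count tables), but the prediction task is the same one:

```python
from collections import Counter

# Toy corpus: the "model" only knows patterns from this tiny text.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count which word follows each word -- the simplest next-word predictor.
following = {}
for prev, nxt in zip(corpus, corpus[1:]):
    following.setdefault(prev, Counter())[nxt] += 1

def predict_next(word):
    """Return the most likely next word and the full probability distribution."""
    counts = following[word]
    total = sum(counts.values())
    probs = {w: c / total for w, c in counts.items()}
    return max(probs, key=probs.get), probs

print(predict_next("on"))  # after "on", the corpus always continues with "the"
```

A real LLM does exactly this, except the distribution over next tokens comes from billions of learned weights instead of a lookup table.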

Training: Reading the Internet

Phase 1: Pre-training

The model reads massive amounts of text:

  • Websites, books, articles, code
  • Wikipedia, Reddit, academic papers
  • Billions upon billions of words

For each chunk of text, it plays a game:

  1. Hide the next word
  2. Predict what it should be
  3. Check if it was right
  4. Adjust to do better next time
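The four-step game above can be sketched as a toy gradient-descent loop. Everything here is made up for illustration (a four-word vocabulary, a repetitive training text, a hand-picked learning rate), and real models adjust billions of weights across many layers, but the predict-check-adjust cycle is the same:

```python
import math

# Toy vocabulary and training text (illustrative only).
vocab = ["the", "cat", "sat", "mat"]
text = ["the", "cat", "sat"] * 50

# The "model": one learned score per (context word, next word) pair.
scores = {w: {v: 0.0 for v in vocab} for w in vocab}

def softmax(d):
    """Turn raw scores into a probability distribution."""
    exps = {w: math.exp(s) for w, s in d.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

lr = 0.1
for prev, target in zip(text, text[1:]):   # step 1: hide the next word
    probs = softmax(scores[prev])          # step 2: predict it
    for w in vocab:                        # steps 3-4: compare and adjust
        grad = probs[w] - (1.0 if w == target else 0.0)
        scores[prev][w] -= lr * grad       # gradient step on the error

# After training, "the" should strongly predict "cat".
p = softmax(scores["the"])
print(max(p, key=p.get))
```

Run at scale, with enough text, this adjustment process is where all the learned knowledge listed below comes from.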

After seeing enough examples, the model learns:

  • Grammar and syntax
  • Facts and knowledge
  • Reasoning patterns
  • Writing styles

Phase 2: Fine-tuning

Raw pre-trained models are like unsocialized geniuses—smart but unhelpful.

Fine-tuning teaches them to:

  • Follow instructions
  • Be helpful and harmless
  • Format responses appropriately

This uses RLHF (Reinforcement Learning from Human Feedback):

  1. Humans rate model responses
  2. Model learns what humans prefer
  3. Repeat until it's actually useful
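Step 2 relies on turning pairwise human ratings into a training signal. A common way to do this in RLHF reward modeling is the Bradley-Terry formulation: a reward model scores each response, and the gap between scores gives the probability a human prefers one over the other. A minimal sketch (the scores below are made-up numbers):

```python
import math

def preference_probability(reward_a, reward_b):
    """Bradley-Terry model: probability a human prefers response A over B,
    given the reward model's score for each response."""
    return 1 / (1 + math.exp(-(reward_a - reward_b)))

print(preference_probability(2.0, 2.0))            # equal scores -> 0.5, a coin flip
print(round(preference_probability(3.0, 1.0), 3))  # higher-scored A is clearly preferred
```

Training the reward model to match actual human choices, then optimizing the LLM against that reward, is what nudges raw prediction toward helpfulness.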

The Transformer Architecture

All modern LLMs use transformers (from the 2017 paper "Attention Is All You Need").

The Key Innovation: Attention

Earlier models (like RNNs) processed text word-by-word, gradually losing track of earlier words in long passages.

Transformers can look at all words simultaneously and decide which ones are important for each prediction.

Example: "The trophy didn't fit in the suitcase because it was too big."

What does "it" refer to? The trophy or suitcase?

Attention lets the model:

  1. Look at all words in the sentence
  2. Calculate relevance scores
  3. Determine "it" = trophy (a too-big trophy explains why it wouldn't fit)
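Those three steps are exactly scaled dot-product attention: score every word's key vector against the query, then softmax the scores into relevance weights. Here is a minimal sketch for a single query; the 2-dimensional vectors are invented for illustration (real models learn vectors with thousands of dimensions):

```python
import math

def attention_weights(query, keys):
    """Scaled dot-product attention for one query: dot the query with each
    key, scale by sqrt(dimension), and softmax into relevance weights."""
    d = len(query)
    raw = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
           for key in keys]
    exps = [math.exp(s) for s in raw]
    z = sum(exps)
    return [e / z for e in exps]

# Toy vectors (made up): the query for "it" points in a similar
# direction to "trophy", less so to "suitcase".
words = ["trophy", "suitcase", "big"]
keys = [[1.0, 0.2], [0.2, 1.0], [0.9, 0.3]]
it_query = [1.0, 0.1]

weights = attention_weights(it_query, keys)
for w, a in zip(words, weights):
    print(f"{w}: {a:.2f}")
# "trophy" receives the highest weight, so "it" attends mostly to it.
```

In a real transformer this runs in parallel for every token, in every layer, across many attention heads at once.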

Why This Matters

Attention enables:

  • Understanding context across long passages
  • Connecting related concepts
  • Handling complex, multi-step reasoning

Inside the Model: Parameters

LLMs have billions of parameters—numbers that determine behavior.

  Model          Parameters
  GPT-2          1.5 billion
  GPT-3          175 billion
  GPT-4          ~1.7 trillion (estimated; not officially confirmed)
  Llama 3 405B   405 billion

More parameters = more capacity to learn patterns = generally better performance.

But also = more expensive to train and run.
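A back-of-envelope calculation shows why. Just holding the weights in memory, at 2 bytes per parameter (16-bit precision, a common serving format), requires hardware far beyond a single GPU for the largest models. This sketch ignores everything else a running model needs (activations, the KV cache), so real requirements are higher:

```python
def model_memory_gb(n_params, bytes_per_param=2):
    """Rough memory in GB just to hold the weights (fp16: 2 bytes each)."""
    return n_params * bytes_per_param / 1e9

print(model_memory_gb(1.5e9))   # GPT-2: ~3 GB, fits on a laptop GPU
print(model_memory_gb(175e9))   # GPT-3: ~350 GB, needs multiple GPUs
print(model_memory_gb(405e9))   # Llama 3 405B: ~810 GB
```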

Generation: How Responses Are Created

When you ask a question:

  1. Tokenization: Your text is split into tokens (~4 characters each)
  2. Encoding: Tokens become numbers the model understands
  3. Processing: Numbers flow through transformer layers
  4. Prediction: Model outputs probability for each possible next token
  5. Selection: Pick a token (usually using some randomness)
  6. Repeat: Use the new token to predict the next one
  7. Stop: Continue until reaching an end signal
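The whole loop can be sketched in a few lines. The "model" here is a hand-written lookup table standing in for the real network (all the tokens and probabilities are invented), but the sample-append-repeat structure matches steps 4-7 above:

```python
import random

# A stand-in for the real model: maps the latest token to a probability
# distribution over possible next tokens (values are made up).
toy_model = {
    "<start>": {"The": 1.0},
    "The": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "<end>": 0.1},
    "dog": {"sat": 0.9, "<end>": 0.1},
    "sat": {"<end>": 1.0},
}

def generate(seed=0):
    random.seed(seed)
    tokens = ["<start>"]
    while tokens[-1] != "<end>":                  # step 7: stop signal
        probs = toy_model[tokens[-1]]             # step 4: predict
        choices, weights = zip(*probs.items())
        tokens.append(random.choices(choices, weights=weights)[0])  # step 5: sample
    return " ".join(tokens[1:-1])                 # step 6 is the loop itself

print(generate())
```

A real model conditions each prediction on the *entire* sequence so far, not just the last token, which is what makes long coherent responses possible.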

This happens incredibly fast—hundreds of tokens per second.

The "Temperature" Setting

Temperature controls randomness in token selection:

  • Low temperature (0.0-0.3): More predictable, focused responses
  • Medium temperature (0.5-0.7): Balanced creativity and coherence
  • High temperature (0.8-1.0+): More creative, but potentially chaotic
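Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities: low values sharpen the distribution toward the top token, high values flatten it toward uniform. A sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax into probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
print([round(p, 2) for p in softmax_with_temperature(logits, 0.2)])  # sharp
print([round(p, 2) for p in softmax_with_temperature(logits, 1.0)])  # balanced
print([round(p, 2) for p in softmax_with_temperature(logits, 2.0)])  # flat
# Low temperature concentrates probability on the top token;
# high temperature spreads it across alternatives.
```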

What LLMs Don't Do

LLMs are impressive but not magic:

  • They don't "understand" like humans. They find statistical patterns.
  • They don't access the internet (unless given tools).
  • They can "hallucinate"—generate plausible-sounding nonsense.
  • They don't learn from conversations (each chat starts fresh).

The Bottom Line

LLMs are sophisticated pattern matching machines trained on human text. They predict what comes next based on what they've seen before.

The "intelligence" emerges from:

  1. Massive scale (billions of parameters)
  2. Massive data (trillions of words)
  3. Clever architecture (transformers + attention)

Understanding this helps you use them better—and recognize their limitations.


Next up: Understanding Model Parameters — What "7B" and "70B" actually mean
