How LLMs Work
Large Language Models like GPT-4 and Claude can write essays, code, and poetry, and hold conversations. How do they actually work?
The Core Insight
LLMs are next-word prediction machines.
When you type "The cat sat on the ___", the model predicts the most likely next word based on patterns it learned from billions of text examples.
That's it. Everything else—conversations, reasoning, creativity—emerges from doing this really, really well.
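You can see the core idea with a toy model. This sketch (using a made-up six-word corpus) counts which word follows which, then predicts by picking the most frequent follower — a crude stand-in for what an LLM does over billions of examples:

```python
from collections import Counter, defaultdict

# Toy corpus standing in for "billions of text examples".
corpus = "the cat sat on the mat the cat sat on the rug".split()

# Count which word follows each word (a simple bigram model).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent next word seen after `word`."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # -> "cat" ("cat" follows "the" most often here)
```

A real model conditions on the whole preceding context, not just the last word, but the prediction objective is the same.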
Training: Reading the Internet
Phase 1: Pre-training
The model reads massive amounts of text:
- Websites, books, articles, code
- Wikipedia, Reddit, academic papers
- Billions upon billions of words
For each chunk of text, it plays a game:
- Hide the next word
- Predict what it should be
- Check if it was right
- Adjust to do better next time
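The four-step game above can be sketched in a few lines. This toy version (made-up vocabulary, and a simple score bump instead of real gradient descent) shows the shape of the loop:

```python
vocab = ["the", "cat", "sat", "on", "mat"]
text = ["the", "cat", "sat", "on", "the", "mat"]

# "Which word follows which" scores -- all equal before training.
scores = {w: {v: 1.0 for v in vocab} for w in vocab}

def train_step(prev, target, lr=0.5):
    # 1. Hide the next word; 2. predict it from current scores.
    guess = max(scores[prev], key=scores[prev].get)
    # 3. Check if the guess was right.
    was_right = (guess == target)
    # 4. Adjust scores toward the correct answer.
    scores[prev][target] += lr
    return was_right

for epoch in range(5):
    for prev, nxt in zip(text, text[1:]):
        train_step(prev, nxt)

# After training, the model predicts plausible continuations.
print(max(scores["cat"], key=scores["cat"].get))  # -> "sat"
```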
After seeing enough examples, the model learns:
- Grammar and syntax
- Facts and knowledge
- Reasoning patterns
- Writing styles
Phase 2: Fine-tuning
Raw pre-trained models are like unsocialized geniuses—smart but unhelpful.
Fine-tuning teaches them to:
- Follow instructions
- Be helpful and harmless
- Format responses appropriately
A key technique here is RLHF (Reinforcement Learning from Human Feedback):
- Humans rate model responses
- Model learns what humans prefer
- Repeat until it's actually useful
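Here is a heavily simplified sketch of that feedback loop, with hypothetical response "styles" and hard-coded ratings standing in for human raters (real RLHF trains a separate reward model and updates billions of weights):

```python
# Candidate response styles the model can produce (hypothetical).
styles = ["terse", "helpful", "rude"]

# Stand-in for human raters, who prefer helpful answers.
ratings = {"terse": 0.3, "helpful": 1.0, "rude": 0.0}

# The "policy": preference weights over styles, updated from feedback.
weights = {s: 1.0 for s in styles}

for _ in range(10):         # repeat until it's actually useful
    for style in styles:    # model tries each response style
        # Human rates the response; model shifts toward what humans prefer.
        weights[style] += 0.1 * ratings[style]

print(max(weights, key=weights.get))  # -> "helpful"
```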
The Transformer Architecture
All modern LLMs use transformers (from the 2017 paper "Attention Is All You Need").
The Key Innovation: Attention
Previous models (such as RNNs) read text one word at a time and gradually lost track of earlier words.
Transformers can look at all words simultaneously and decide which ones are important for each prediction.
Example: "The trophy didn't fit in the suitcase because it was too big."
What does "it" refer to? The trophy or suitcase?
Attention lets the model:
- Look at all words in the sentence
- Calculate relevance scores
- Conclude "it" = trophy (the thing that is "too big" to fit must be the trophy, not the suitcase)
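Here's a toy version of that relevance calculation, using made-up 2-dimensional word vectors (real models use vectors with thousands of dimensions, learned during training):

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-d "meaning" vectors, chosen so "trophy" aligns with "big".
vectors = {
    "trophy":   [1.0, 0.2],
    "suitcase": [0.3, 0.9],
    "big":      [0.9, 0.1],
}
query = [1.0, 0.1]  # the vector representing "it" in "... it was too big"

# Relevance score = dot product of the query with each word's vector.
words = list(vectors)
scores = [sum(q * v for q, v in zip(query, vectors[w])) for w in words]
attention = softmax(scores)

for word, weight in zip(words, attention):
    print(f"{word}: {weight:.2f}")  # "trophy" gets the largest weight
```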
Why This Matters
Attention enables:
- Understanding context across long passages
- Connecting related concepts
- Handling complex, multi-step reasoning
Inside the Model: Parameters
LLMs have billions of parameters—numbers that determine behavior.
| Model | Parameters |
|---|---|
| GPT-2 | 1.5 billion |
| GPT-3 | 175 billion |
| GPT-4 | ~1.7 trillion (unconfirmed estimate) |
| Llama 3 405B | 405 billion |
More parameters = more capacity to learn patterns = generally better performance.
But also = more expensive to train and run.
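Parameter counts translate directly into hardware costs. A back-of-envelope sketch, assuming 2 bytes per parameter (a common 16-bit inference format):

```python
# Rough memory needed just to hold a model's weights: parameters x bytes each.
# (Real deployments add overhead for activations, KV cache, etc.)
models = {
    "GPT-2":        1.5e9,
    "GPT-3":        175e9,
    "Llama 3 405B": 405e9,
}

BYTES_PER_PARAM = 2  # 16-bit weights

for name, params in models.items():
    gb = params * BYTES_PER_PARAM / 1e9
    print(f"{name}: ~{gb:,.0f} GB of weights")  # e.g. GPT-3: ~350 GB
```

This is why larger models need clusters of GPUs rather than a single machine.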
Generation: How Responses Are Created
When you ask a question:
- Tokenization: Your text is split into tokens (roughly 4 characters of English each, on average)
- Encoding: Tokens become numbers the model understands
- Processing: Numbers flow through transformer layers
- Prediction: Model outputs a probability for each possible next token
- Selection: Pick a token (usually using some randomness)
- Repeat: Use the new token to predict the next one
- Stop: Continue until reaching an end signal
This happens fast: often dozens to hundreds of tokens per second, depending on the model and hardware.
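The loop above can be sketched with a hand-written probability table standing in for the transformer (a real model computes these probabilities from the full context at every step):

```python
import random

random.seed(0)

# Toy next-token probabilities (invented for illustration).
next_token_probs = {
    "<start>": {"the": 1.0},
    "the":     {"cat": 0.6, "dog": 0.4},
    "cat":     {"sat": 0.9, "<end>": 0.1},
    "dog":     {"ran": 0.9, "<end>": 0.1},
    "sat":     {"<end>": 1.0},
    "ran":     {"<end>": 1.0},
}

def generate():
    token, output = "<start>", []
    while True:
        probs = next_token_probs[token]                          # predict
        token = random.choices(list(probs), probs.values())[0]   # select
        if token == "<end>":                                     # stop signal
            break
        output.append(token)                                     # repeat
    return " ".join(output)

print(generate())  # e.g. "the cat sat"
```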
The "Temperature" Setting
Temperature controls randomness in token selection:
- Low temperature (0.0-0.3): More predictable, focused responses (at 0, the model always picks the most likely token)
- Medium temperature (0.5-0.7): Balanced creativity and coherence
- High temperature (0.8-1.0+): More creative, but potentially chaotic
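Under the hood, temperature divides the model's raw scores (logits) before they're turned into probabilities. A small sketch:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before softmax: low T sharpens the
    # distribution toward the top token, high T flattens it toward uniform.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # model's raw scores for three candidate tokens

for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 2) for p in probs]}")
```

As T approaches 0, the distribution collapses onto the single most likely token; very high T approaches a uniform coin flip among all tokens.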
What LLMs Don't Do
LLMs are impressive but not magic:
- They don't "understand" like humans. They find statistical patterns.
- They don't access the internet (unless given tools).
- They can "hallucinate"—generate plausible-sounding nonsense.
- They don't learn from conversations (each chat starts fresh).
The Bottom Line
LLMs are sophisticated pattern-matching machines trained on human text. They predict what comes next based on what they've seen before.
The "intelligence" emerges from:
- Massive scale (billions of parameters)
- Massive data (trillions of words)
- Clever architecture (transformers + attention)
Understanding this helps you use them better—and recognize their limitations.
Next up: Understanding Model Parameters — What "7B" and "70B" actually mean