Definition
Tokenizers break text into smaller units (tokens) that language models can understand and process.
- **Tokenization Methods:**
  - Word-level: Each word is a token
  - Character-level: Each character is a token
  - Subword: A balance between word- and character-level (most common; see the sketch after this list)
- **Popular Subword Methods:**
  - BPE (Byte Pair Encoding): Used by the GPT family
  - WordPiece: Used by BERT
  - SentencePiece: A language-agnostic framework that implements BPE and Unigram
  - Unigram: A probabilistic approach that prunes a large candidate vocabulary
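The differences are easiest to see side by side. Below is a minimal sketch comparing the three levels; it assumes the Hugging Face `transformers` library is installed and uses the public `gpt2` and `bert-base-uncased` checkpoints for the subword tokenizers, so the exact splits may vary with tokenizer versions.

```python
# Comparing word-, character-, and subword-level tokenization.
# Requires: pip install transformers (model files are downloaded on first use).
from transformers import AutoTokenizer

text = "Tokenization is unhappiness-free!"

# Word-level: split on whitespace (real word-level tokenizers also handle punctuation)
print("word-level:", text.split())

# Character-level: every character is its own token
print("char-level:", list(text))

# Subword-level: BPE (GPT-2) and WordPiece (BERT) fall back to smaller pieces for rare words
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
print("BPE (GPT-2):     ", gpt2_tok.tokenize(text))
print("WordPiece (BERT):", bert_tok.tokenize(text))
```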
**Vocabulary Size:**
- GPT-4: ~100K tokens
- Llama: 32K tokens
- Trade-off: a larger vocabulary gives shorter sequences but more parameters (see the sketch below)
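To make the trade-off concrete, here is a back-of-the-envelope sketch. The `hidden_dim` value and the helper function are illustrative assumptions, not taken from any model's published configuration.

```python
# Rough view of the vocabulary-size trade-off: bigger vocab = bigger embedding table.
def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Parameters in the token-embedding matrix: one hidden_dim vector per vocab entry."""
    return vocab_size * hidden_dim

for name, vocab in [("32K vocab (Llama-style)", 32_000), ("100K vocab (GPT-4-style)", 100_000)]:
    print(f"{name}: {embedding_params(vocab, 4096):,} embedding parameters at hidden_dim=4096")

# A larger vocabulary usually covers more whole words and merges, so the same text
# encodes into fewer tokens (shorter sequences) at the cost of a bigger embedding table.
```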
**Special Tokens:**
- [CLS], [SEP]: BERT special tokens
- <|endoftext|>: GPT end-of-text token
- [PAD]: Padding token
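A quick way to see these tokens in context is to encode a sentence pair with padding. This sketch again assumes Hugging Face `transformers` and the public `bert-base-uncased` and `gpt2` checkpoints.

```python
# How special tokens show up in practice.
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = bert_tok("How are you?", "I am fine.", padding="max_length", max_length=12)
# [CLS] starts the input, [SEP] separates/ends segments, [PAD] fills up to max_length
print(bert_tok.convert_ids_to_tokens(enc["input_ids"]))

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
print(gpt2_tok.eos_token)  # '<|endoftext|>' marks the end of a document for GPT-2
```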
Examples
A BPE tokenizer might split "unhappiness" into subword tokens such as ["un", "happiness"]; the exact split depends on the merges learned from the training corpus.
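To show where such splits come from, here is a toy BPE training loop over a handful of made-up words. It is a simplified sketch (real BPE implementations such as GPT-2's operate on bytes over a large corpus), and the exact subwords learned depend on the corpus and the number of merges.

```python
# Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

corpus = ["unhappiness", "unhappy", "happiness", "unkind", "kindness"]
# Start from character sequences with an end-of-word marker
words = [list(w) + ["</w>"] for w in corpus]

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    a, b = pair
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(a + b)
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

for step in range(8):  # 8 merges is enough to form recognizable subwords on this tiny corpus
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(f"merge {step + 1}: {pair} -> {''.join(pair)}")

print(words[0])  # how "unhappiness" is segmented after the learned merges
```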
Related Terms