Tokenizer

Component that converts text into tokens that language models can process.

Definition

Tokenizers break text into smaller units (tokens) that language models can understand and process.

  • **Tokenization Methods:**
      • Word-level: each word is a token
      • Character-level: each character is a token
      • Subword: splits rare words into smaller pieces while keeping common words intact; the most widely used approach (see the sketch after this list)
  • **Popular Subword Methods:**
      • BPE (Byte Pair Encoding): used by GPT models
      • WordPiece: used by BERT
      • SentencePiece: language-agnostic
      • Unigram: probabilistic approach
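
A minimal sketch of how BPE-style merge learning works on a toy corpus. The corpus, the number of merges, and helper names such as `most_frequent_pair` are illustrative choices, not taken from any particular library:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most frequent one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (pre-split into characters) -> frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3, tuple("newer"): 4}

merges = []
for _ in range(5):  # learn five merges
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)  # learned merge rules, e.g. ('l', 'o'), ('lo', 'w'), ...
print(corpus)  # words are now sequences of learned subword units
```

Production tokenizers learn tens of thousands of merges from large corpora and replay them as a fixed merge table at encoding time.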

Vocabulary Size:

  • GPT-4: ~100K tokens
  • Llama: 32K tokens
  • Trade-off: a larger vocabulary yields shorter sequences but more embedding parameters
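
A quick way to see this trade-off is to encode the same sentence with encodings of different vocabulary sizes. The sketch below uses the tiktoken library; the sample sentence is arbitrary:

```python
# Requires: pip install tiktoken
import tiktoken

text = "Tokenization trades vocabulary size against sequence length."

for name in ("gpt2", "cl100k_base"):  # roughly 50K vs. 100K vocabulary entries
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    print(f"{name}: vocab={enc.n_vocab}, tokens={len(ids)}")

# The larger cl100k_base vocabulary typically needs fewer tokens
# for the same text than the smaller gpt2 vocabulary.
```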

Special Tokens:

  • [CLS], [SEP]: BERT special tokens
  • <|endoftext|>: GPT end-of-text token
  • [PAD]: padding token
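
A small sketch of how these special tokens appear in practice, using the Hugging Face transformers tokenizer for bert-base-uncased; the input text and padding length are arbitrary choices:

```python
# Requires: pip install transformers  (downloads the BERT tokenizer files on first use)
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# BERT wraps every sequence in [CLS] ... [SEP] and pads with [PAD].
ids = tok("hello world", padding="max_length", max_length=8)["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# ['[CLS]', 'hello', 'world', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']

print(tok.cls_token, tok.sep_token, tok.pad_token)  # [CLS] [SEP] [PAD]
```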

Examples

A BPE tokenizer might split "unhappiness" into ["un", "happiness"]; the exact pieces depend on the merges learned from its training corpus.
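
To check how a real tokenizer splits the word, you can decode each token id individually; here with tiktoken's cl100k_base encoding, chosen only for illustration:

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("unhappiness")
print([enc.decode([i]) for i in ids])  # the actual split depends on the learned merges
```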

