Tokenizer

Component that converts text into tokens that language models can process.

Definition

Tokenizers break text into smaller units (tokens) that language models can understand and process.

  • **Tokenization Methods:**
      • Word-level: each word is a token
      • Character-level: each character is a token
      • Subword: splits rare words into smaller pieces while keeping common words intact; the most widely used approach (see the sketch after this list)
  • **Popular Subword Methods:**
      • BPE (Byte Pair Encoding): used by GPT models
      • WordPiece: used by BERT
      • SentencePiece: language-agnostic
      • Unigram: probabilistic approach
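
A minimal sketch of how BPE-style merge learning works on a toy corpus. The corpus, the number of merges, and helper names such as `most_frequent_pair` are illustrative choices, not taken from any particular library:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most frequent one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (pre-split into characters) -> frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3, tuple("newer"): 4}

merges = []
for _ in range(5):  # learn five merges
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)  # learned merge rules, e.g. ('l', 'o'), ('lo', 'w'), ...
print(corpus)  # words are now sequences of learned subword units
```

Production tokenizers learn tens of thousands of merges from large corpora and replay them as a fixed merge table at encoding time.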

Vocabulary Size:

  • GPT-4: ~100K tokens
  • Llama: 32K tokens
  • Trade-off: a larger vocabulary yields shorter sequences but more embedding parameters
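
A quick way to see this trade-off is to encode the same sentence with encodings of different vocabulary sizes. The sketch below uses the tiktoken library; the sample sentence is arbitrary:

```python
# Requires: pip install tiktoken
import tiktoken

text = "Tokenization trades vocabulary size against sequence length."

for name in ("gpt2", "cl100k_base"):  # roughly 50K vs. 100K vocabulary entries
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    print(f"{name}: vocab={enc.n_vocab}, tokens={len(ids)}")

# The larger cl100k_base vocabulary typically needs fewer tokens
# for the same text than the smaller gpt2 vocabulary.
```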

Special Tokens:

  • [CLS], [SEP]: BERT special tokens
  • <|endoftext|>: GPT end-of-text token
  • [PAD]: padding token
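
A small sketch of how these special tokens appear in practice, using the Hugging Face transformers tokenizer for bert-base-uncased; the input text and padding length are arbitrary choices:

```python
# Requires: pip install transformers  (downloads the BERT tokenizer files on first use)
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# BERT wraps every sequence in [CLS] ... [SEP] and pads with [PAD].
ids = tok("hello world", padding="max_length", max_length=8)["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# ['[CLS]', 'hello', 'world', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']

print(tok.cls_token, tok.sep_token, tok.pad_token)  # [CLS] [SEP] [PAD]
```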

Examples

A BPE tokenizer might split "unhappiness" into ["un", "happiness"]; the exact pieces depend on the merges learned from its training corpus.
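
To check how a real tokenizer splits the word, you can decode each token id individually; here with tiktoken's cl100k_base encoding, chosen only for illustration:

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("unhappiness")
print([enc.decode([i]) for i in ids])  # the actual split depends on the learned merges
```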

