Foundation models (like GPT, BERT, etc.) rely on a key preprocessing step called tokenization to convert raw text into numerical inputs they can understand. Tokenization breaks down the text into smaller units called tokens, which can be words, subwords, characters, or other meaningful units depending on the tokenizer type. Below is a detailed explanation of tokenization logic used in foundation models.
1. What is Tokenization in Foundation Models?
Tokenization is the process of converting input text into a sequence of tokens. These tokens are then mapped to unique integers (IDs) using a vocabulary, which can be used by a neural network for training or inference.
Tokens are the atomic units on which models operate. The granularity of tokens depends on the model architecture and its tokenizer:
- Word-level: Rare in modern LLMs due to large vocabulary sizes and unknown words.
- Subword-level: Common in models like BERT, GPT-2, and GPT-3.
- Character-level: Rare in foundation models due to efficiency and context-length limitations.
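To make the text → tokens → IDs mapping concrete, here is a minimal sketch using the Hugging Face transformers library (the choice of "bert-base-uncased" is only an illustration; any pretrained tokenizer works the same way):

```python
from transformers import AutoTokenizer

# Load a pretrained tokenizer; "bert-base-uncased" is just an example.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tok.tokenize("Tokenization converts raw text.")
ids = tok.convert_tokens_to_ids(tokens)

print(tokens)  # subword strings, e.g. ['token', '##ization', 'converts', ...]
print(ids)     # the integer IDs that the model's embedding layer consumes
```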
2. Common Tokenization Algorithms
Byte Pair Encoding (BPE)
Used in GPT-2, GPT-3, and others.
Process:
- Start with individual characters as the initial tokens.
- Iteratively merge the most frequent adjacent pair of tokens.
- Build a vocabulary of subwords from the merged tokens.
Advantages:
- Handles out-of-vocabulary (OOV) words.
- Balances vocabulary size and expressiveness.
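The merge loop can be sketched in a few lines of plain Python. This is a toy illustration on a made-up word-frequency corpus; production BPE tokenizers (e.g., GPT-2's) operate on bytes and train on far larger data:

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a {word: frequency} corpus (toy sketch)."""
    # Represent each word as a tuple of symbols, initially single characters.
    vocab = Counter()
    for word, freq in corpus.items():
        vocab[tuple(word)] += freq

    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # made-up frequencies
print(learn_bpe_merges(corpus, 5))
# e.g. [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w'), ('n', 'e')]
```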
WordPiece
Used in BERT.
Process:
- Similar to BPE, but merges are chosen by how much they improve the training data's likelihood rather than by raw frequency.
- At encoding time, chooses the most probable sequence of subwords (greedy longest-match-first).
Advantages:
- Optimized for language-modeling likelihood.
- Allows fine control over vocabulary size and granularity.
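Applying a trained WordPiece vocabulary uses greedy longest-match-first segmentation, which can be sketched as follows (the vocabulary here is made up for illustration):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word (sketch)."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Shrink the window until the substring is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: fall back to the unknown token
        tokens.append(piece)
        start = end
    return tokens

vocab = {"token", "##ization", "##ize", "play", "##ing"}  # hypothetical
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("xyz", vocab))           # ['[UNK]']
```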
Unigram Language Model
Used in models like XLNet and T5, typically via the SentencePiece library.
Process:
- Starts from a large candidate vocabulary and uses a probabilistic model to choose the subset that maximizes the likelihood of the training data.
- Prunes tokens entirely from the vocabulary when removing them costs little overall likelihood.
Advantages:
- Flexibility in vocabulary creation.
- Handles multiple languages effectively.
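In practice a Unigram tokenizer is usually trained with the sentencepiece library. A hedged sketch, assuming a hypothetical one-sentence-per-line training file corpus.txt:

```python
import sentencepiece as spm

# Train a Unigram model; "corpus.txt" is a hypothetical training file.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="unigram_demo",
    vocab_size=8000,
    model_type="unigram",  # the Unigram LM algorithm described above
)

sp = spm.SentencePieceProcessor(model_file="unigram_demo.model")
print(sp.encode("Tokenization is essential.", out_type=str))
# e.g. ['▁Token', 'ization', '▁is', '▁essential', '.']
```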
3. Steps in the Tokenization Pipeline
1. Preprocessing
   - Lowercasing (optional)
   - Removing special characters or normalizing whitespace
   - Unicode normalization
2. Text Splitting
   - Break text into raw tokens (words, subwords)
   - Apply rules based on the tokenizer algorithm (e.g., WordPiece or BPE)
3. Subword Tokenization
   - Match substrings using a greedy or probabilistic approach
   - Fall back to an unknown token (e.g., [UNK]) if nothing matches
4. Mapping Tokens to IDs
   - Each token corresponds to a unique index in the vocabulary
   - These token IDs are fed to the embedding layer of the model
5. Padding and Truncation (if required)
   - For batch processing, sequences are padded to a uniform length
   - Longer sequences may be truncated based on a maximum token limit
6. Special Tokens
   - [CLS], [SEP], <s>, </s>, <pad>, etc., depending on the model
   - Added to indicate sentence boundaries or padding for batch consistency
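In practice, the whole pipeline is a single call. A minimal end-to-end sketch using the Hugging Face transformers library ("bert-base-uncased" is just an illustration):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = ["Tokenization is essential for language models.", "Short one."]

enc = tok(
    batch,
    padding=True,     # pad the shorter sequence to the batch maximum
    truncation=True,  # drop anything beyond max_length
    max_length=512,
)
# Special tokens, padding, and the token -> ID mapping are all applied:
print(tok.convert_ids_to_tokens(enc["input_ids"][0]))
# e.g. ['[CLS]', 'token', '##ization', 'is', 'essential', 'for', 'language',
#       'models', '.', '[SEP]']
print(enc["attention_mask"][1])  # 0s mark padded positions in the short sequence
```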
4. Tokenization in Popular Foundation Models
GPT Series (GPT-2, GPT-3, GPT-4)
- Uses a BPE tokenizer trained on a vast internet corpus.
- The tokenizer is byte-level, meaning it can handle any UTF-8 string robustly.
- Outputs variable-length token sequences, with special tokens such as <|endoftext|> in GPT-2 and GPT-3.
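Byte-level coverage is easy to see in practice; a short sketch with the GPT-2 tokenizer from the transformers library:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for text in ["Hello world", "naïve café", "🤖 robots"]:
    print(text, "->", tok.tokenize(text))
# Accented characters and emoji come back as byte-level pieces rather than
# an unknown token; "Ġ" in the output marks a leading space.
```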
BERT
- Uses a WordPiece tokenizer.
- Adds a [CLS] token at the beginning and [SEP] tokens between segments.
- Truncates or pads sequences to a fixed length, usually 512 tokens.
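A sketch of BERT's segment handling via the transformers library:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("How are you?", "I am fine.")  # a two-segment (sentence pair) input

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['[CLS]', 'how', 'are', 'you', '?', '[SEP]', 'i', 'am', 'fine', '.', '[SEP]']
print(enc["token_type_ids"])  # 0s for segment A, 1s for segment B
```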
T5
- Uses SentencePiece with a unigram model.
- Tokenizes inputs and outputs using the same vocabulary and format.
- Special tokens include <pad>, </s> (end of sequence), and task-specific prefixes (e.g., "translate English to German:").
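A sketch of T5's task-prefix interface via the transformers library (the "t5-small" checkpoint is just an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
enc = tok("translate English to German: Hello, how are you?")

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['▁translate', '▁English', '▁to', '▁German', ':', '▁Hello', ',',
#       '▁how', '▁are', '▁you', '?', '</s>']
```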
RoBERTa
- Similar to BERT but with a different tokenizer.
- Uses byte-level BPE and doesn't rely on segment IDs.
- Trains with larger data and longer sequences.
5. Tokenization Logic: Examples
Sentence:
"Tokenization is essential for language models."
The same sentence segments differently under BPE (GPT-2 style), WordPiece (BERT style), and Unigram (T5/SentencePiece style); the sketch below prints all three. (Note: "▁" denotes a whitespace boundary in SentencePiece, and "Ġ" plays the same role in GPT-2's byte-level alphabet.)
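A hedged sketch using the transformers library; the outputs in the comments are representative, and exact splits depend on each model's trained vocabulary:

```python
from transformers import AutoTokenizer

sentence = "Tokenization is essential for language models."
for name in ["gpt2", "bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name:>17}: {tok.tokenize(sentence)}")
# gpt2             : e.g. ['Token', 'ization', 'Ġis', 'Ġessential', 'Ġfor',
#                          'Ġlanguage', 'Ġmodels', '.']
# bert-base-uncased: e.g. ['token', '##ization', 'is', 'essential', 'for',
#                          'language', 'models', '.']
# t5-small         : e.g. ['▁Token', 'ization', '▁is', '▁essential', '▁for',
#                          '▁language', '▁models', '.']
```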
6. Handling Out-of-Vocabulary (OOV) Words
Modern subword tokenizers are designed to handle rare or unseen words by breaking them down into known subwords or characters:
- "neuralink" might become ["neura", "##link"] in WordPiece.
- In BPE, it could become ["ne", "ural", "ink"] if not in the vocabulary.
- This strategy enables models to infer meaning from context without needing full-word training.
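The contrast is easy to demonstrate; a sketch using the transformers library (the exact pieces depend on each model's vocabulary):

```python
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2 = AutoTokenizer.from_pretrained("gpt2")

print(bert.tokenize("neuralink"))  # known subword pieces, e.g. ['neural', '##ink']
print(bert.tokenize("🤖"))         # no matching piece at all: ['[UNK]']
print(gpt2.tokenize("🤖"))         # byte-level BPE always finds pieces, never [UNK]
```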
7. Efficiency Considerations
- Compression vs. Expressiveness: Subword tokenization reduces vocabulary size while retaining semantic information.
- Sequence Length: More tokens mean more compute, so efficient tokenization reduces the total token count per input.
- Multilingual Support: SentencePiece and Unigram models perform well across languages due to flexible vocabulary learning.
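Token counts per tokenizer are easy to measure, and they directly bound compute; a short sketch with the transformers library:

```python
from transformers import AutoTokenizer

text = "Internationalization and localization are routinely abbreviated."
for name in ["gpt2", "bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    n = len(tok(text)["input_ids"])  # counts include any added special tokens
    print(f"{name}: {n} tokens")
```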
8. Tools and Libraries
- Hugging Face Tokenizers: Fast, Rust-based implementations of BPE, WordPiece, and Unigram.
- SentencePiece: Open-source tokenizer supporting BPE and Unigram.
- spaCy, NLTK: General-purpose word-level tokenizers (not subword-level).
- OpenAI Tokenizer (tiktoken): Used in GPT models; includes special handling for byte-level tokens.
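For GPT-style models specifically, token counting is commonly done with OpenAI's tiktoken library:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's byte-level BPE encoding
ids = enc.encode("Tokenization is essential for language models.")
print(len(ids), ids[:5])  # token count, plus the first few IDs
print(enc.decode(ids))    # losslessly round-trips to the original string
```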
9. Conclusion
Tokenization logic in foundation models is crucial to model performance, data efficiency, and language generalization. By converting text into well-structured token sequences using algorithms like BPE, WordPiece, or SentencePiece, foundation models gain the ability to handle vast, multilingual corpora with precision. Understanding this preprocessing step is essential for optimizing NLP pipelines, customizing model behavior, and fine-tuning models on domain-specific tasks.