Foundation models (like GPT, BERT, etc.) rely on a key preprocessing step called tokenization to convert raw text into numerical inputs they can understand. Tokenization breaks down the text into smaller units called tokens, which can be words, subwords, characters, or other meaningful units depending on the tokenizer type. Below is a detailed explanation of tokenization logic used in foundation models.
1. What is Tokenization in Foundation Models?
Tokenization is the process of converting input text into a sequence of tokens. These tokens are then mapped to unique integers (IDs) using a vocabulary, which can be used by a neural network for training or inference.
Tokens are the atomic units on which models operate. The granularity of tokens depends on the model architecture and its tokenizer:
- Word-level: Rare in modern LLMs due to large vocabulary sizes and unknown words.
- Subword-level: Common in models like BERT, GPT-2, and GPT-3.
- Character-level: Rare in foundation models due to efficiency and context-length limitations.
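To make the text → tokens → IDs mapping concrete, here is a minimal sketch using the Hugging Face transformers library (the choice of "bert-base-uncased" is only an illustration; any pretrained tokenizer works the same way):

```python
from transformers import AutoTokenizer

# Load a pretrained tokenizer; "bert-base-uncased" is just an example.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tok.tokenize("Tokenization converts raw text.")
ids = tok.convert_tokens_to_ids(tokens)

print(tokens)  # subword strings, e.g. ['token', '##ization', 'converts', ...]
print(ids)     # the integer IDs that the model's embedding layer consumes
```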
2. Common Tokenization Algorithms
Byte Pair Encoding (BPE)
Used in GPT-2, GPT-3, and others.
Process:
- Start with individual characters as the initial tokens.
- Iteratively merge the most frequent adjacent pair of tokens.
- Build a vocabulary of subwords from the merged tokens.
Advantages:
- Handles out-of-vocabulary (OOV) words.
- Balances vocabulary size and expressiveness.
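The merge loop can be sketched in a few lines of plain Python. This is a toy illustration on a made-up word-frequency corpus; production BPE tokenizers (e.g., GPT-2's) operate on bytes and train on far larger data:

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a {word: frequency} corpus (toy sketch)."""
    # Represent each word as a tuple of symbols, initially single characters.
    vocab = Counter()
    for word, freq in corpus.items():
        vocab[tuple(word)] += freq

    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # made-up frequencies
print(learn_bpe_merges(corpus, 5))
# e.g. [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w'), ('n', 'e')]
```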
WordPiece
Used in BERT.
Process:
- Similar to BPE, but merges are chosen by how much they improve the training data's likelihood rather than by raw frequency.
- At encoding time, chooses the most probable sequence of subwords (greedy longest-match-first).
Advantages:
- Optimized for language-modeling likelihood.
- Allows fine control over vocabulary size and granularity.
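Applying a trained WordPiece vocabulary uses greedy longest-match-first segmentation, which can be sketched as follows (the vocabulary here is made up for illustration):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word (sketch)."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Shrink the window until the substring is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: fall back to the unknown token
        tokens.append(piece)
        start = end
    return tokens

vocab = {"token", "##ization", "##ize", "play", "##ing"}  # hypothetical
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("xyz", vocab))           # ['[UNK]']
```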
Unigram Language Model
Used in models like XLNet and T5, typically via the SentencePiece library.
Process:
- Starts from a large candidate vocabulary and uses a probabilistic model to choose the subset that maximizes the likelihood of the training data.
- Prunes tokens entirely from the vocabulary when removing them costs little overall likelihood.
Advantages:
- Flexibility in vocabulary creation.
- Handles multiple languages effectively.
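In practice a Unigram tokenizer is usually trained with the sentencepiece library. A hedged sketch, assuming a hypothetical one-sentence-per-line training file corpus.txt:

```python
import sentencepiece as spm

# Train a Unigram model; "corpus.txt" is a hypothetical training file.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="unigram_demo",
    vocab_size=8000,
    model_type="unigram",  # the Unigram LM algorithm described above
)

sp = spm.SentencePieceProcessor(model_file="unigram_demo.model")
print(sp.encode("Tokenization is essential.", out_type=str))
# e.g. ['▁Token', 'ization', '▁is', '▁essential', '.']
```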
3. Steps in the Tokenization Pipeline
1. Preprocessing
   - Lowercasing (optional)
   - Removing special characters or normalizing whitespace
   - Unicode normalization
2. Text Splitting
   - Break text into raw tokens (words, subwords)
   - Apply rules based on the tokenizer algorithm (e.g., WordPiece or BPE)
3. Subword Tokenization
   - Match substrings using a greedy or probabilistic approach
   - Fall back to an unknown token (e.g., [UNK]) if nothing matches
4. Mapping Tokens to IDs
   - Each token corresponds to a unique index in the vocabulary
   - These token IDs are fed to the embedding layer of the model
5. Padding and Truncation (if required)
   - For batch processing, sequences are padded to a uniform length
   - Longer sequences may be truncated based on a maximum token limit
6. Special Tokens
   - [CLS], [SEP], <s>, </s>, <pad>, etc., depending on the model
   - Added to indicate sentence boundaries or padding for batch consistency
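In practice, the whole pipeline is a single call. A minimal end-to-end sketch using the Hugging Face transformers library ("bert-base-uncased" is just an illustration):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = ["Tokenization is essential for language models.", "Short one."]

enc = tok(
    batch,
    padding=True,     # pad the shorter sequence to the batch maximum
    truncation=True,  # drop anything beyond max_length
    max_length=512,
)
# Special tokens, padding, and the token -> ID mapping are all applied:
print(tok.convert_ids_to_tokens(enc["input_ids"][0]))
# e.g. ['[CLS]', 'token', '##ization', 'is', 'essential', 'for', 'language',
#       'models', '.', '[SEP]']
print(enc["attention_mask"][1])  # 0s mark padded positions in the short sequence
```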
4. Tokenization in Popular Foundation Models
GPT Series (GPT-2, GPT-3, GPT-4)
- Uses a BPE tokenizer trained on a vast internet corpus.
- The tokenizer is byte-level, meaning it can handle any UTF-8 string robustly.
- Outputs variable-length token sequences, with special tokens such as <|endoftext|> in GPT-2 and GPT-3.
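Byte-level coverage is easy to see in practice; a short sketch with the GPT-2 tokenizer from the transformers library:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for text in ["Hello world", "naïve café", "🤖 robots"]:
    print(text, "->", tok.tokenize(text))
# Accented characters and emoji come back as byte-level pieces rather than
# an unknown token; "Ġ" in the output marks a leading space.
```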
BERT
- Uses a WordPiece tokenizer.
- Adds a [CLS] token at the beginning and [SEP] tokens between segments.
- Truncates or pads sequences to a fixed length, usually 512 tokens.
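A sketch of BERT's segment handling via the transformers library:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("How are you?", "I am fine.")  # a two-segment (sentence pair) input

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['[CLS]', 'how', 'are', 'you', '?', '[SEP]', 'i', 'am', 'fine', '.', '[SEP]']
print(enc["token_type_ids"])  # 0s for segment A, 1s for segment B
```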
T5
- Uses SentencePiece with a unigram model.
- Tokenizes inputs and outputs using the same vocabulary and format.
- Special tokens include <pad>, </s> (end of sequence), and task-specific prefixes (e.g., "translate English to German:").
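A sketch of T5's task-prefix interface via the transformers library (the "t5-small" checkpoint is just an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
enc = tok("translate English to German: Hello, how are you?")

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['▁translate', '▁English', '▁to', '▁German', ':', '▁Hello', ',',
#       '▁how', '▁are', '▁you', '?', '</s>']
```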
RoBERTa
- Similar to BERT but with a different tokenizer.
- Uses byte-level BPE and doesn't rely on segment IDs.
- Trains with larger data and longer sequences.
5. Tokenization Logic: Examples
Sentence:
"Tokenization is essential for language models."
The same sentence segments differently under BPE (GPT-2 style), WordPiece (BERT style), and Unigram (T5/SentencePiece style); the sketch below prints all three. (Note: "▁" denotes a whitespace boundary in SentencePiece, and "Ġ" plays the same role in GPT-2's byte-level alphabet.)
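A hedged sketch using the transformers library; the outputs in the comments are representative, and exact splits depend on each model's trained vocabulary:

```python
from transformers import AutoTokenizer

sentence = "Tokenization is essential for language models."
for name in ["gpt2", "bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name:>17}: {tok.tokenize(sentence)}")
# gpt2             : e.g. ['Token', 'ization', 'Ġis', 'Ġessential', 'Ġfor',
#                          'Ġlanguage', 'Ġmodels', '.']
# bert-base-uncased: e.g. ['token', '##ization', 'is', 'essential', 'for',
#                          'language', 'models', '.']
# t5-small         : e.g. ['▁Token', 'ization', '▁is', '▁essential', '▁for',
#                          '▁language', '▁models', '.']
```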
6. Handling Out-of-Vocabulary (OOV) Words
Modern subword tokenizers are designed to handle rare or unseen words by breaking them down into known subwords or characters:
- "neuralink" might become ["neura", "##link"] in WordPiece.
- In BPE, it could become ["ne", "ural", "ink"] if not in the vocabulary.
- This strategy enables models to infer meaning from context without needing full-word training.
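The contrast is easy to demonstrate; a sketch using the transformers library (the exact pieces depend on each model's vocabulary):

```python
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2 = AutoTokenizer.from_pretrained("gpt2")

print(bert.tokenize("neuralink"))  # known subword pieces, e.g. ['neural', '##ink']
print(bert.tokenize("🤖"))         # no matching piece at all: ['[UNK]']
print(gpt2.tokenize("🤖"))         # byte-level BPE always finds pieces, never [UNK]
```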
7. Efficiency Considerations
- Compression vs. Expressiveness: Subword tokenization reduces vocabulary size while retaining semantic information.
- Sequence Length: More tokens mean more compute, so efficient tokenization reduces the total token count per input.
- Multilingual Support: SentencePiece and Unigram models perform well across languages due to flexible vocabulary learning.
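Token counts per tokenizer are easy to measure, and they directly bound compute; a short sketch with the transformers library:

```python
from transformers import AutoTokenizer

text = "Internationalization and localization are routinely abbreviated."
for name in ["gpt2", "bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    n = len(tok(text)["input_ids"])  # counts include any added special tokens
    print(f"{name}: {n} tokens")
```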
8. Tools and Libraries
- Hugging Face Tokenizers: Fast, Rust-based implementations of BPE, WordPiece, and Unigram.
- SentencePiece: Open-source tokenizer supporting BPE and Unigram.
- spaCy, NLTK: General-purpose word-level tokenizers (not subword-level).
- OpenAI Tokenizer (tiktoken): Used in GPT models; includes special handling for byte-level tokens.
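For GPT-style models specifically, token counting is commonly done with OpenAI's tiktoken library:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's byte-level BPE encoding
ids = enc.encode("Tokenization is essential for language models.")
print(len(ids), ids[:5])  # token count, plus the first few IDs
print(enc.decode(ids))    # losslessly round-trips to the original string
```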
9. Conclusion
Tokenization logic in foundation models is crucial to model performance, data efficiency, and language generalization. By converting text into well-structured token sequences using algorithms like BPE, WordPiece, or SentencePiece, foundation models gain the ability to handle vast, multilingual corpora with precision. Understanding this preprocessing step is essential for optimizing NLP pipelines, customizing model behavior, and fine-tuning models on domain-specific tasks.