
Understanding positional encoding in transformers

Positional encoding in transformers is a crucial component that allows the model to understand the order of words in a sequence. Since transformers process all input tokens in parallel rather than sequentially, they lack an inherent understanding of token order, which is essential for language understanding. Positional encoding provides this missing information.

Here’s a breakdown of how it works:

1. The Need for Positional Encoding:

  • Transformers, including models such as BERT and GPT, are based on self-attention mechanisms that process all tokens in parallel rather than one at a time, as RNNs or LSTMs do.

  • However, self-attention alone doesn’t consider the position of tokens within the sequence, which is essential for tasks like language modeling, translation, etc.

  • For example, in the sentence “The cat sat on the mat,” positional information tells the model that “cat” comes before “sat” and “mat” comes after it, which is what distinguishes the subject doing the sitting from the object being sat on.

2. How Positional Encoding Works:

  • To add the information about the position of tokens, transformers use positional encodings, which are vectors that are added to the embeddings of input tokens.

  • These encodings are designed to inject information about each token’s position in the sequence, ensuring that the model can still understand the order of the tokens.

3. Types of Positional Encoding:

  • Learned Positional Encoding: This involves learning the positional encodings as trainable parameters. The model learns these embeddings during training, just like word embeddings (a minimal sketch follows this list).

  • Fixed (Sinusoidal) Positional Encoding: In the original transformer paper, a fixed sinusoidal function was used to generate positional encodings. These encodings require no additional trainable parameters; they are defined by a deterministic function.
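
Here is a minimal sketch of the learned variant, assuming PyTorch; the class name LearnedPositionalEncoding and the sizes used are illustrative, not a reference implementation. A table of trainable vectors (one per position) is simply added to the token embeddings:

```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    """Trainable position embeddings, learned alongside the word embeddings."""

    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        # One trainable d_model-dimensional vector per position index.
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model)
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)
        # Look up the position vectors and add them to the token embeddings,
        # broadcasting over the batch dimension.
        return token_embeddings + self.pos_emb(positions)

# Illustrative usage with arbitrary sizes
emb = torch.randn(2, 10, 64)                              # (batch, seq_len, d_model)
pos = LearnedPositionalEncoding(max_len=512, d_model=64)
out = pos(emb)                                            # same shape, now position-aware
```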

4. Sinusoidal Positional Encoding:

In the sinusoidal method, positional encodings are generated using sine and cosine functions with different frequencies. The key idea is that:

  • Each position p in the sequence gets a unique positional encoding.

  • The encoding is made of two parts: one that corresponds to sine waves and one to cosine waves.

  • The formula for the positional encoding at position p and dimension i is:

PE(p, 2i) = sin(p / 10000^(2i/d))
PE(p, 2i+1) = cos(p / 10000^(2i/d))

Where:

  • p is the position of the token in the sequence.

  • i is the index of the dimension in the encoding.

  • d is the total dimensionality of the encoding (usually the same as the embedding size).

The sine and cosine functions ensure that each position has a unique encoding, and the different frequencies allow the model to differentiate between positions.
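
As a concrete illustration of these formulas, here is a minimal NumPy sketch (assuming an even d_model; the function name sinusoidal_positional_encoding is just illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Builds the (seq_len, d_model) matrix of fixed sinusoidal encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]            # p, shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # 2i, shape (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # p / 10000^(2i/d)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```

Each row of the resulting matrix is the encoding for one position; low dimension indices vary quickly with position while high ones vary slowly, which is how the mix of frequencies encodes both fine-grained and coarse-grained position information.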

5. Why Sinusoidal Encoding?

  • Non-repeating patterns: The sine and cosine waves ensure that each position gets a unique encoding, and different positions can be distinguished based on their wave patterns.

  • Efficient interpolation: Because the encoding is continuous and smoothly varying, the model can also generalize to positions it hasn’t seen during training, which is helpful for handling longer sequences.

6. Positional Encoding in the Model:

  • Once the positional encodings are computed, they are added to the word embeddings of the input tokens.

  • This combined input (word embedding + positional encoding) is then passed through the layers of the transformer (see the sketch after this list).

  • The model learns to use this positional information alongside the word embeddings during self-attention to understand the relationships between words and their positions.
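
The sketch below shows this flow end to end, assuming PyTorch and using its built-in nn.TransformerEncoder as a stand-in for the transformer layers; the vocabulary size, sequence length, and model dimensions are arbitrary and only for illustration:

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size, seq_len = 512, 10000, 20

# Word embeddings plus a fixed sinusoidal table (illustrative sizes).
token_emb = nn.Embedding(vocab_size, d_model)

position = torch.arange(seq_len).unsqueeze(1)                          # (seq_len, 1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pos_enc = torch.zeros(seq_len, d_model)
pos_enc[:, 0::2] = torch.sin(position * div_term)                      # even dims
pos_enc[:, 1::2] = torch.cos(position * div_term)                      # odd dims

# Combined input: word embedding + positional encoding, then into the encoder stack.
tokens = torch.randint(0, vocab_size, (1, seq_len))                    # (batch, seq_len)
x = token_emb(tokens) + pos_enc                                        # broadcast over batch
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
out = encoder(x)   # (1, seq_len, d_model); attention now sees order via the added encodings
print(out.shape)
```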

7. Key Insights:

  • No notion of sequence order in attention alone: Self-attention computes the relationships between tokens in parallel, but without positional encoding, it has no notion of order.

  • The encodings are fixed or learned: Learned encodings allow for flexibility during training, while sinusoidal encodings are deterministic and don’t require additional parameters.

  • Why add instead of multiply?: Adding the positional encoding blends the positional signal and the word embedding into a single vector while preserving both, so later layers can draw on either; multiplying would scale the embedding by the positional values, which could suppress or distort the token’s content (for instance, wherever the encoding is close to zero).

In essence, positional encoding is how transformers incorporate sequence order into their otherwise parallel processing architecture. Whether learned or based on fixed functions like sinusoidal waves, it allows transformers to excel at tasks involving sequences.
