
What is the positional encoding in the transformer model?

Data Science Asked by Peyman on December 3, 2020

I’m trying to read and understand the paper Attention is all you need and in it, there is a picture:

[figure from the paper]

I don’t know what positional encoding is. From some YouTube videos I’ve gathered that it is an embedding carrying both the meaning and the position of a word, and that it has something to do with $\sin(x)$ or $\cos(x)$,

but I couldn’t understand what exactly it is and how exactly it does that, so I’m here for some help. Thanks in advance.

3 Answers

For example, for word $w$ at position $pos \in [0, L-1]$ in the input sequence $\boldsymbol{w}=(w_0, \cdots, w_{L-1})$, with 4-dimensional embedding $e_{w}$ and $d_{model}=4$, the operation would be
$$\begin{align*}
e_{w}' &= e_{w} + \left[\sin\left(\frac{pos}{10000^{0}}\right), \cos\left(\frac{pos}{10000^{0}}\right), \sin\left(\frac{pos}{10000^{2/4}}\right), \cos\left(\frac{pos}{10000^{2/4}}\right)\right]\\
&= e_{w} + \left[\sin\left(pos\right), \cos\left(pos\right), \sin\left(\frac{pos}{100}\right), \cos\left(\frac{pos}{100}\right)\right]
\end{align*}$$

where the formula for positional encoding is as follows:
$$\text{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$
$$\text{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$
with $d_{model}=512$ (thus $i \in [0, 255]$) in the original paper.
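As a minimal sketch of the 4-dimensional example above (the embedding values and the position are made up purely for illustration), in C:

#include <math.h>
#include <stdio.h>

int main(void) {
   int d_model = 4, pos = 3;                 /* position of the word in the sequence */
   double e_w[4] = {0.1, -0.4, 0.7, 0.2};    /* made-up 4-dimensional word embedding */
   double e_w_prime[4];

   for (int i = 0; i < d_model / 2; i++) {
      double denom = pow(10000.0, 2.0 * i / d_model);
      e_w_prime[2 * i]     = e_w[2 * i]     + sin(pos / denom);  /* PE(pos, 2i)   */
      e_w_prime[2 * i + 1] = e_w[2 * i + 1] + cos(pos / denom);  /* PE(pos, 2i+1) */
   }

   for (int i = 0; i < d_model; i++)
      printf("e_w'[%d] = %f\n", i, e_w_prime[i]);
   return 0;
}

For $pos=3$ this adds $[\sin(3), \cos(3), \sin(0.03), \cos(0.03)]$ to the embedding, exactly as in the worked equation above.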

This technique is used because there is no notion of word order (1st word, 2nd word, ...) in the proposed architecture. All words of the input sequence are fed to the network with no special order or position (unlike common RNN or ConvNet architectures); thus, the model has no idea how the words are ordered. Consequently, a position-dependent signal is added to each word embedding to help the model incorporate the order of words. Based on experiments, this addition not only avoids destroying the embedding information but also adds vital position information. In the case of RNNs, we feed the words sequentially to the RNN, i.e. the $n$-th word is fed at step $n$, which helps the model incorporate the order of words.

This article by Jay Alammar explains the paper with excellent visualizations. Unfortunately, its example for positional encoding is incorrect at the moment (it uses $\sin$ for the first half of the embedding dimensions and $\cos$ for the second half, instead of using $\sin$ for even indices and $\cos$ for odd indices).

Correct answer by Esmailian on December 3, 2020

Positional encoding is a re-representation of the values of a word and of its position in a sentence (given that it is not the same to be at the beginning as at the end or in the middle).

But you have to take into account that sentences can be of any length, so saying "word X is the third in the sentence" does not mean the same thing for sentences of different lengths: 3rd in a 3-word sentence is completely different from 3rd in a 20-word sentence.

What a positional encoder does is use the cyclic nature of the $\sin(x)$ and $\cos(x)$ functions to return information about the position of a word in a sentence.
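For instance, here is a rough sketch (using the paper's frequencies, with a small $d_{model}=8$ just to keep the printout short) that prints the encodings for a few positions; each position gets its own fingerprint of sines and cosines at different frequencies, independent of how long the sentence is:

#include <math.h>
#include <stdio.h>

int main(void) {
   int d_model = 8;
   for (int pos = 0; pos < 4; pos++) {       /* the vector depends only on pos, not on sentence length */
      printf("pos %d:", pos);
      for (int k = 0; k < d_model; k += 2) {
         double denom = pow(10000.0, (double)k / d_model);
         printf(" %.3f %.3f", sin(pos / denom), cos(pos / denom));
      }
      printf("\n");
   }
   return 0;
}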

Answered by Juan Esteban de la Calle on December 3, 2020

To add to the other answers, OpenAI's reference implementation calculates it in natural-log space (to improve precision, I think; I'm not sure whether they could have used log base 2 instead). They did not come up with the encoding. Here is the PE lookup-table generation rewritten in C as a nested for loop:

#include <math.h>
#include <stdlib.h>

int main(void) {
   int d_model = 512, max_len = 5000;
   /* heap-allocate: a 5000 x 512 array of doubles (~20 MB) is too large for a local array */
   double (*pe)[d_model] = malloc(sizeof(double[max_len][d_model]));

   for (int i = 0; i < max_len; i++) {
      for (int k = 0; k < d_model; k = k + 2) {
         /* div_term = 1 / 10000^(k / d_model), computed via exp/log */
         double div_term = exp(k * -log(10000.0) / d_model);
         pe[i][k] = sin(i * div_term);      /* even dimensions: sine  */
         pe[i][k + 1] = cos(i * div_term);  /* odd dimensions: cosine */
      }
   }

   free(pe);
   return 0;
}
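The exp/log form is just the paper's $10000^{-k/d_{model}}$ rewritten in log space; a quick check with one arbitrary pair of values (the choice of $k$ and $d_{model}$ here is mine, for illustration):

#include <math.h>
#include <stdio.h>

int main(void) {
   int k = 256, d_model = 512;
   printf("%.12f\n", exp(k * -log(10000.0) / d_model));    /* log-space form used above               */
   printf("%.12f\n", pow(10000.0, -(double)k / d_model));  /* direct form from the paper; both ~ 0.01 */
   return 0;
}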

Answered by Eris on December 3, 2020
