
What are the keys and values of the attention model for the encoder and decoder in the "Attention Is All You Need" paper?

Artificial Intelligence · Asked on December 11, 2021

I have recently encountered the "Attention Is All You Need" paper. The topic is very new to me, and I still cannot see how the mechanism works. I have gone through many resources, from the original paper to YouTube videos and the well-known "The Illustrated Transformer" post.

Suppose I have a training example with the English sentence "I am a student" and its French translation "Je suis étudiant".

I want to know how these 4 English words are converted into 3 French words. What are the queries, keys, and values?

This is my understanding of the topic so far.

For the encoder:

  • Query: a single word embedding in vector form, such as "I" expressed as a vector of length 5, e.g. $[0.2, 0.1, 0.4, 0.9, 0.44]$.

  • Keys: the matrix of all such vectors, i.e. a matrix containing the embeddings of every word in the sentence.

  • Values = Keys

For the decoder:

  • Query: the input word as a vector (the output produced by the decoder in the previous pass).

  • Keys = Values = the outputs from the encoder's layers.

BUT there are 2 different attention layers in the decoder, one of which does not use the encoder's output at all. So, what are the keys and values there? (I think they are just like in the encoder, but only over the words generated up to that pass?)
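For concreteness, here is a minimal sketch (toy numbers and NumPy, nothing from the paper; in the real model Q, K and V are produced by learned projection matrices) of how I currently picture the encoder's self-attention using these queries, keys and values:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # relevance of every word to every word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V                                   # weighted sum of the values

# "I am a student": 4 words, each a toy 5-dimensional embedding
X = np.random.rand(4, 5)

# In the real model Q, K and V are X multiplied by learned matrices
# W_Q, W_K, W_V; here they are simply set equal to X, as in my description above.
out = scaled_dot_product_attention(X, X, X)
print(out.shape)   # (4, 5): one output vector per input word
```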

One Answer

BUT there are 2 different attention layers in the decoder, one of which does not use the encoder's output at all. So, what are the keys and values there?

The first attention layer in the decoder is the "Masked Multi-Head Attention" layer. It is a self-attention layer, calculating how much each word is related to every other word in the same (French) sentence. However, the aim of the decoder is to generate the next French word, so for any given output position we may use all the English words but only the French words that appear earlier in the sentence. We therefore "mask" the words that appear later in the French sentence by setting their attention scores to $-\infty$ before the softmax, so their attention weights become 0 and the attention layer cannot use them. Here the queries, keys and values all come from the partially generated French sentence, which matches your guess.
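A minimal sketch of that masking (toy 5-dimensional vectors and random numbers, assumed for illustration only) might look like this:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)   # True for future positions
    scores = np.where(mask, -np.inf, scores)                  # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # future positions get weight 0
    return weights @ V

# 3 French positions generated so far ("Je suis étudiant"), as toy 5-d vectors
Y = np.random.rand(3, 5)
out = masked_self_attention(Y, Y, Y)
print(out.shape)   # (3, 5); position 0 attends only to itself, position 2 to all three
```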

How these 4 English words are converted into 3 French words

The second attention block in the decoder is where the English-to-French word mapping happens. There is a query for every output position in the French sentence and a key/value pair for every English input word. The relevance scores come from the dot product of each query with each key; these scores are passed through a softmax and used to take a weighted sum of the values, giving one output vector per French position. Because the number of outputs equals the number of queries (French positions) rather than the number of keys (English words), the two sentences do not have to be the same length. The following diagram is useful to visualise how, for each predicted word, the relevance scores can show that one English word corresponds to several French words, or to none.

[Figure: attention weights between the English input words and the French output positions]
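To make the shapes explicit, here is a minimal sketch of that encoder-decoder attention block (toy dimensions instead of the paper's 512, and random matrices standing in for the learned projections):

```python
import numpy as np

d_model = 5
enc_out = np.random.rand(4, d_model)    # encoder outputs for "I am a student"
dec_in  = np.random.rand(3, d_model)    # decoder states for "Je suis étudiant"

# Learned projection matrices in the real model; random placeholders here.
W_Q = np.random.rand(d_model, d_model)
W_K = np.random.rand(d_model, d_model)
W_V = np.random.rand(d_model, d_model)

Q = dec_in  @ W_Q                        # one query per French position
K = enc_out @ W_K                        # one key per English word
V = enc_out @ W_V                        # one value per English word

scores  = Q @ K.T / np.sqrt(d_model)     # shape (3, 4): French positions x English words
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the 4 English words
out = weights @ V                        # shape (3, d_model): one vector per French position
print(weights.shape, out.shape)
```

The output has 3 rows because there are 3 queries (French positions), even though there are 4 English words behind the keys and values.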

In summary, the encoder discovers interesting things about the English sentence whilst the decoder predicts the next French word in the translation. It should be noted that both use "Multi-Head Attention": several attention heads (8 in the original paper) are computed in parallel, each with its own projections, so that different heads can learn to attend to different things, for example grammar, vocabulary, tense or gender. The head outputs are concatenated and passed through a final linear projection.
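A minimal sketch of that multi-head combination (2 heads instead of the paper's 8, random weights assumed in place of trained ones):

```python
import numpy as np

def attention(Q, K, V):
    w = np.exp(Q @ K.T / np.sqrt(K.shape[-1]))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

d_model, n_heads = 8, 2
d_head = d_model // n_heads
X = np.random.rand(4, d_model)                   # 4 English word embeddings

heads = []
for _ in range(n_heads):
    # Each head has its own (here random) projections into a smaller subspace.
    W_Q, W_K, W_V = (np.random.rand(d_model, d_head) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))   # each head: (4, d_head)

W_O = np.random.rand(d_model, d_model)
out = np.concatenate(heads, axis=-1) @ W_O       # concatenate the heads, then project
print(out.shape)                                 # (4, d_model)
```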

Answered by Atticus on December 11, 2021
