# Is the self-attention matrix softmax output (layer 1) symmetric?

Artificial Intelligence Asked by thepacker on January 5, 2022

Let’s assume we embed a sequence of 49 tokens using 512-d embeddings, giving a 49×512 matrix. If we then multiply this matrix by its transpose, we receive a 49×49 matrix, which is symmetric. Let’s also assume we do not add the positional encoding and have only one attention head in the first layer of the transformer architecture.

What would the result of the softmax on this 49×49 matrix look like? Is it still symmetric, or is the softmax applied to each row of the matrix, resulting in a non-symmetric matrix? My guess is that the matrix should no longer be symmetric, but I’m unsure about that.

I ask this to verify whether my implementation is correct and what the output should look like. I have seen so many different, sophisticated implementations of the transformer architecture across different frameworks that I can’t answer this question for myself right now. I’m still trying to understand the basic building blocks of the transformer architecture.

I compared my results visually to a second implementation known to work, "The Annotated Transformer". I compared the PyTorch results of its attention method to my implementation’s results.

The answer is: the softmax is applied row by row. Therefore the resulting matrix `p_attn` is not equal to its transpose.
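A minimal sketch of why this happens, using numpy instead of PyTorch (the setup from the question: 49 token embeddings of dimension 512, no positional encoding, queries and keys taken as the raw embeddings so the score matrix is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((49, 512))   # 49 token embeddings, 512-d

S = X @ X.T                          # 49x49 score matrix; symmetric by construction
assert np.allclose(S, S.T)

def softmax_rows(m):
    # Numerically stable softmax applied independently to each row
    e = np.exp(m - m.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

P = softmax_rows(S / np.sqrt(512))   # scaled dot-product attention weights

# Every row sums to 1 (a probability distribution per query token) ...
print(np.allclose(P.sum(axis=-1), 1.0))   # True
# ... but the matrix is no longer symmetric: P[i, j] and P[j, i] share the
# same numerator exp(S[i, j]), yet are divided by different row sums.
print(np.allclose(P, P.T))                # False
```

So the symmetry is broken by the normalization, not the exponential: each row is divided by its own sum, and those sums generally differ from row to row.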

Answered by thepacker on January 5, 2022
