# Is the self-attention matrix softmax output (layer 1) symmetric?

Artificial Intelligence Asked by thepacker on January 5, 2022

Let’s assume we embed a sequence of 49 tokens using 512-d embeddings, giving a 49×512 matrix. If we then multiply this matrix by its transpose, we receive a 49×49 matrix, which is symmetric. Let’s also assume we do not add the positional encoding and have only one attention head in the first layer of the transformer architecture.

What would the result of the softmax on this 49×49 matrix look like? Is it still symmetric, or is the softmax applied to each row of the matrix, resulting in a non-symmetric matrix? My guess is that the matrix should no longer be symmetric, but I’m unsure about that.

I ask this to verify whether my implementation is correct and what the output should look like. I have seen so many different, sophisticated implementations of the transformer architecture across different frameworks that I can’t answer this question for myself right now. I’m still trying to understand the basic building blocks of the transformer architecture.

I compared my results visually to a second implementation known to work, "The Annotated Transformer". I compared the PyTorch results of its attention method to my implementation’s results.

The answer is: the softmax is applied row by row. Therefore the resulting matrix `p_attn` is not equal to its transpose.
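A minimal sketch of why this happens, using numpy instead of PyTorch (the setup from the question: 49 token embeddings of dimension 512, no positional encoding, queries and keys taken as the raw embeddings so the score matrix is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((49, 512))   # 49 token embeddings, 512-d

S = X @ X.T                          # 49x49 score matrix; symmetric by construction
assert np.allclose(S, S.T)

def softmax_rows(m):
    # Numerically stable softmax applied independently to each row
    e = np.exp(m - m.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

P = softmax_rows(S / np.sqrt(512))   # scaled dot-product attention weights

# Every row sums to 1 (a probability distribution per query token) ...
print(np.allclose(P.sum(axis=-1), 1.0))   # True
# ... but the matrix is no longer symmetric: P[i, j] and P[j, i] share the
# same numerator exp(S[i, j]), yet are divided by different row sums.
print(np.allclose(P, P.T))                # False
```

So the symmetry is broken by the normalization, not the exponential: each row is divided by its own sum, and those sums generally differ from row to row.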

Answered by thepacker on January 5, 2022
