Artificial Intelligence Asked by thepacker on January 5, 2022

Let’s assume we embed a sequence of 49 tokens using 512-dimensional embeddings, giving a 49 × 512 matrix. If we then multiply this matrix by its transpose, we get a 49 × 49 matrix, which is symmetric. Let’s also assume we do not add positional encoding and have only one attention head in the first layer of the transformer architecture.

What would the result of applying softmax to this 49 × 49 matrix look like? Is it still symmetric, or is the softmax applied to each row of the matrix, yielding a non-symmetric result? My guess is that the matrix should no longer be symmetric, but I’m unsure about that.

I’m asking this to verify whether my implementation is correct and what the output should look like. I have seen so many sophisticated and differing implementations of the transformer architecture, across different frameworks, that I can’t answer this question for myself right now. I’m still trying to understand the basic building blocks of the transformer architecture.

I compared my results visually against a second implementation known to work, "The Annotated Transformer", checking the PyTorch results of its attention method against my own implementation’s results.

The answer is: the softmax is applied row by row. Therefore the resulting matrix p_attn is not equal to its transpose.
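A minimal NumPy sketch of this point, using a small 4 × 8 toy matrix in place of the 49 × 512 one (the shapes are placeholders, not from the original question): the score matrix E·Eᵀ is symmetric, but applying softmax independently to each row generally breaks that symmetry, because each row is divided by its own normalizing sum.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((4, 8))   # 4 tokens with 8-d embeddings (stand-in for 49 x 512)

scores = E @ E.T                  # 4 x 4 score matrix; symmetric by construction
assert np.allclose(scores, scores.T)

def softmax_rows(x):
    # Softmax applied independently to each row, as in attention
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract row max for stability
    return e / e.sum(axis=-1, keepdims=True)

p_attn = softmax_rows(scores)
print(np.allclose(p_attn, p_attn.T))   # generally False: symmetry is lost
print(p_attn.sum(axis=-1))             # each row sums to 1
```

Entry (i, j) of the result is exp(s_ij) / Σ_k exp(s_ik), while entry (j, i) is exp(s_ij) / Σ_k exp(s_jk); these differ whenever rows i and j have different normalizing sums, so symmetry of the scores does not survive the row-wise softmax.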

Answered by thepacker on January 5, 2022
