
NLP Transformers - understanding the multi-headed attention visualization (Attention is all you need)

Asked by PhysicsPrincess on March 15, 2021

I am new to NLP and I just finished reading the paper "Attention is all you need".
I’m struggling to understand the interpretability of multi-head attention, and specifically how these visualizations were produced:
[figure: attention-head visualizations from the paper]

I understand that the output of the self-attention sub-layer (for a single head) is a vector of size d_v that is a weighted sum of all the value vectors. How, then, do they use this vector to calculate the strength of the relations between the positions?
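
For reference, the attention I am describing is the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where the softmax term supplies the weights applied to the value vectors.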

Any help and insight would be appreciated, thanks a lot!

One Answer

So the question is about understanding the self-attention mechanism in greater detail, in particular how multi-head self-attention is used to compute the strength of the relations between tokens.

I think it's best to work through this great tutorial on self-attention and see whether it helps your understanding of multi-head self-attention: http://www.peterbloem.nl/blog/transformers
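
To make the connection concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code, with hypothetical names and toy random weights). The per-head weight matrix softmax(Q K^T / sqrt(d_k)) already contains one score for every pair of positions, and it is this n x n matrix of weights, rather than the d_v-sized output vectors, that is the natural quantity to draw as line strengths between tokens; the weighted sum over V is only what gets passed on to the next layer.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_weights(X, Wq, Wk, num_heads):
    # X: (n, d_model) token representations; Wq, Wk: (num_heads, d_model, d_k).
    # Returns an array of shape (num_heads, n, n) whose (h, i, j) entry is how
    # strongly position i attends to position j in head h.
    d_k = Wq.shape[-1]
    weights = []
    for h in range(num_heads):
        Q = X @ Wq[h]                       # (n, d_k) queries
        K = X @ Wk[h]                       # (n, d_k) keys
        scores = Q @ K.T / np.sqrt(d_k)     # (n, n) scaled dot products
        weights.append(softmax(scores, axis=-1))
    return np.stack(weights)

# Toy example: random embeddings for a 5-token sequence, 2 heads.
rng = np.random.default_rng(0)
n, d_model, d_k, num_heads = 5, 16, 8, 2
X = rng.normal(size=(n, d_model))
Wq = rng.normal(size=(num_heads, d_model, d_k))
Wk = rng.normal(size=(num_heads, d_model, d_k))
A = multi_head_attention_weights(X, Wq, Wk, num_heads)
print(A.shape)        # (2, 5, 5): one n x n map per head
print(A[0].round(2))  # row i sums to 1: how position i distributes its attention

Plotting each A[h] as a heat map, or drawing lines between tokens with thickness proportional to A[h][i, j], reproduces the style of diagram shown in the question.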

Answered by shepan6 on March 15, 2021
