# How does attention mechanism learn?

Asked by user2790103 on Data Science Stack Exchange, December 16, 2020

I know how to build an attention mechanism in neural networks, but I don’t understand how attention layers learn the weights that make them attend to specific embeddings.

I have this question because I’m tackling an NLP task using an attention layer. I believe the alignments should be easy to learn (they are the most important part), yet my network only achieves 50% test-set accuracy, and the attention matrix looks strange.
I don’t know how to improve my network.

To give an example:
English: Who are you?
Chinese: 你是誰？

The alignments are
‘Who’ to ‘誰’
‘are’ to ‘是’
‘you’ to ‘你’

How does attention learn that?

Thank you!

Attention weights are learned through backpropagation, just like canonical layer weights.

The hard part about attention models is learning how the math underlying alignment works. Different formulations of attention compute alignment scores in different ways. The main one is Bahdanau attention, formulated in the original paper; the other is Luong’s, provided in several variants in its original paper. Transformers instead use several self-attention layers.

However, backprop lies at the basis of all of them. It’s remarkable how attention alignment scores can improve the performance of our models while leaving the canonical learning technique intact.
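To make the scoring concrete, here is a minimal NumPy sketch of Bahdanau-style additive alignment. The dimensions are arbitrary, and the randomly initialized matrices stand in for parameters that backprop would actually learn:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy dimensions (illustrative): 3 encoder states, hidden size 4
rng = np.random.default_rng(0)
enc_states = rng.normal(size=(3, 4))   # h_1..h_3, one vector per source word
dec_state = rng.normal(size=(4,))      # previous decoder hidden state s_{t-1}

# Parameters of the alignment model (learned by backprop in a real network)
W_enc = rng.normal(size=(4, 4))
W_dec = rng.normal(size=(4, 4))
v = rng.normal(size=(4,))

# score(s, h_i) = v^T tanh(W_dec s + W_enc h_i)  -> one scalar per source word
scores = np.tanh(dec_state @ W_dec + enc_states @ W_enc) @ v
weights = softmax(scores)              # attention weights over source words, sum to 1
context = weights @ enc_states         # weighted sum of encoder states
```

Because every operation here (matrix products, tanh, softmax) is differentiable, gradients flow from the loss back through `context` into `W_enc`, `W_dec`, and `v`, which is exactly how the alignment is learned.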

Correct answer by Leevo on December 16, 2020

To answer in the simplest way possible: let the model learn the attention weights by training. We do that by defining a single Dense layer with one unit that ‘transforms’ each word in the input sentence in such a way that, when the dot product of this transformation with the last decoder state is taken, the resulting value is high if the word in question should be considered when translating the next word.

So at the decoder end, before translating each word, we know which words in the input sequence need to be given importance: all we have to do is take the last hidden state of the decoder, dot it with all the ‘transformed’ words in the input sequence, and softmax the result.
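A minimal NumPy sketch of this scoring, assuming the ‘transformation’ is a single linear layer (all names, sizes, and random values are illustrative, not from the answer):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
hidden = 4
enc_outputs = rng.normal(size=(5, hidden))   # one vector per input word
dec_state = rng.normal(size=(hidden,))       # last decoder hidden state

# The layer that 'transforms' each input word (its weights are learned by backprop)
W = rng.normal(size=(hidden, hidden))
transformed = enc_outputs @ W                # (5, hidden)

# Dot each transformed word with the decoder state, then softmax
attn = softmax(transformed @ dec_state)      # importance of each input word
context = attn @ enc_outputs                 # context vector fed to the decoder
```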

As to how the weights are learned during training: they are learned the same way as any other layer weights in a neural network, using standard gradient descent and backpropagation.

Answered by Allohvk on December 16, 2020

From the FloydHub blog post on attention mechanisms:

### Attention Mechanisms

Attention takes two sentences, turns them into a matrix where the words of one sentence form the columns, and the words of another sentence form the rows, and then it makes matches, identifying relevant context. This is very useful in machine translation.

When we think about the English word “Attention”, we know that it means directing your focus at something and taking greater notice. The Attention mechanism in Deep Learning is based on this concept of directing your focus: it pays greater attention to certain factors when processing the data.

In broad terms, Attention is one component of a network’s architecture, and is in charge of managing and quantifying the interdependence:

1. Between the input and output elements (General Attention)
2. Within the input elements (Self-Attention)
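The second case, self-attention, is the Transformer flavor: every input element attends to every other input element. Here is a minimal NumPy sketch of scaled dot-product self-attention, with random matrices standing in for the learned query/key/value projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
n, d = 4, 8                        # 4 tokens, model dimension 8 (illustrative)
X = rng.normal(size=(n, d))        # input element embeddings

# Learned projections (random stand-ins here) map inputs to queries, keys, values
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Each row of attn: how much token i attends to every token j in the same input
attn = softmax(Q @ K.T / np.sqrt(d), axis=1)
out = attn @ V                     # new representation of each token
```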

Let me give you an example of how Attention works in a translation task. Say we have the sentence “How was your day”, which we would like to translate to the French version - “Comment se passe ta journée”. What the Attention component of the network will do for each word in the output sentence is map the important and relevant words from the input sentence and assign higher weights to these words, enhancing the accuracy of the output prediction.
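The per-output-word weights form a matrix over the input words. A toy NumPy sketch for this sentence pair, using random scores as stand-ins for what a trained network would produce:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

src = ["How", "was", "your", "day"]
tgt = ["Comment", "se", "passe", "ta", "journée"]

# Random scores, one per (output word, input word) pair; a trained model
# would compute these from decoder and encoder states
rng = np.random.default_rng(2)
scores = rng.normal(size=(len(tgt), len(src)))

# Row i gives the weights the decoder uses when producing output word i
attn = softmax(scores, axis=1)
```

Each row sums to 1, so for every output word the decoder distributes its ‘focus’ across the four input words.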

*Figure: weights are assigned to input words at each step of the translation.*

Answered by Pluviophile on December 16, 2020
