Transformer: where is the output of the last FF sub-layer of the encoder used?

Data Science Asked by Kishkashta on August 14, 2021

In the "Attention Is All You Need" paper, the decoder consists of two attention sub-layers in each layer followed by a FF sub-layer.
The first is a masked self attention which gets as an input the output of the decoder in the previous step (and the first input is a special start token).
The second, ‘encoder-decoder’, attention sub-layer gets as an input queries from the lower self-attention sub-layer and keys & values from the encoder.
I do not see the use of the output of the FF sub-layer in the encoder; can someone explain where is it used?
Thanks

One Answer

We can see this in the original Transformer diagram:

[Figure: the Transformer architecture diagram from "Attention Is All You Need" (Vaswani et al., 2017)]

The output of the last encoder FF sub-layer is added to that sub-layer's own input (the residual connection), and layer normalization is applied to the sum. The result is the output of the whole encoder, and it is used as the keys and values by the encoder-decoder attention block in every decoder layer.
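
To make the data flow concrete, here is a minimal PyTorch sketch. The shapes, the 2048 FF hidden size, and the use of a single encoder/decoder layer are illustrative assumptions, not the paper's full model:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8

# Last encoder layer: FF sub-layer with residual add + layer norm (post-norm, as in the paper)
ff = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
norm = nn.LayerNorm(d_model)

x = torch.randn(10, 1, d_model)   # (src_len, batch, d_model): input to the FF sub-layer
memory = norm(x + ff(x))          # residual connection + layer norm -> final encoder output

# Decoder's encoder-decoder ("cross") attention
cross_attn = nn.MultiheadAttention(d_model, n_heads)
tgt = torch.randn(7, 1, d_model)  # (tgt_len, batch, d_model): output of masked self-attention

# Queries come from the decoder; keys and values are the encoder output
out, _ = cross_attn(query=tgt, key=memory, value=memory)
print(out.shape)                  # torch.Size([7, 1, 512])
```

Note that the same `memory` tensor is fed to the encoder-decoder attention of every decoder layer, not just the first one.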

Correct answer by noe on August 14, 2021
