Where does the evaluation speed advantage of Transformer-XL come from?

Data Science, asked by usabik on January 10, 2021

The Transformer-XL paper claims an evaluation speedup of 363x-1,874x over a baseline Transformer model.

However, I do not understand where this massive difference comes from.

Although states from the previous segments can be cached, I do not see how the model avoids generating tokens autoregressively, one by one, which is what the paper seems to suggest it does. If the input is [1, 2, 3, 4, 0, 0, 0, 0] and we want to predict the values at the masked (zero) positions, relying on the output logits at those positions is not sufficient, because the model cannot attend to its own prediction for the first masked position while predicting the second one. So we have to generate the first masked position before the second – how do we then let the model attend to it without re-running the whole forward pass?
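For concreteness, here is a minimal sketch of the token-by-token procedure I have in mind. It is a toy single attention layer in plain NumPy with made-up weights and shapes, not the actual Transformer-XL code; the point is just that every new prediction re-runs attention over the entire prefix:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 100
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W_out = rng.standard_normal((d, vocab)) * 0.1
embed = rng.standard_normal((vocab, d)) * 0.1

def forward(tokens):
    """One toy causal self-attention layer, run over the *entire* prefix."""
    x = embed[tokens]                                  # (T, d)
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d)                      # (T, T)
    scores[np.triu(np.ones_like(scores), 1).astype(bool)] = -np.inf  # causal mask
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return (attn @ v) @ W_out                          # (T, vocab) logits

tokens = [1, 2, 3, 4]
for _ in range(4):                        # fill in the four "masked" positions
    logits = forward(np.array(tokens))    # full re-computation at every step
    tokens.append(int(logits[-1].argmax()))
print(tokens)
```

As far as I can tell, each call to `forward` is quadratic in the prefix length, and paying that cost for every generated token is exactly what I assumed could not be avoided.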

I understand that the cached hidden states come from the previous segment, but they enter only as a memory input, and I do not see how that speeds up generation within the current segment.
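To be explicit about my current mental picture of the memory mechanism (again only a one-layer sketch with invented shapes, not the paper's code): the cached states extend the keys and values, while the queries still come from the current segment only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, mem_len, seg_len = 16, 8, 4
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

h_prev = rng.standard_normal((mem_len, d))  # cached hidden states of the previous segment (memory)
h_cur  = rng.standard_normal((seg_len, d))  # hidden states of the current segment

h_cat = np.concatenate([h_prev, h_cur], axis=0)  # extended context, (mem_len + seg_len, d)

q = h_cur @ W_q                    # queries come from the current segment only
k, v = h_cat @ W_k, h_cat @ W_v    # keys/values cover memory + current segment

scores = q @ k.T / np.sqrt(d)      # (seg_len, mem_len + seg_len)
# position i of the current segment may attend to all memory slots
# plus current-segment positions <= i
causal = np.triu(np.ones((seg_len, seg_len)), 1).astype(bool)
scores[:, mem_len:][causal] = -np.inf
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)
out = attn @ v                     # new hidden states for the current segment, (seg_len, d)
print(out.shape)
```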

Is there some way to cache the current hidden states and then cleverly propagate only the changed entries for the token that was just added? This seems hopeless to me because of the softmax in dot-product attention – as soon as one of its input scores changes, its entire output changes due to the normalization it performs. That would invalidate the full contents of all subsequent layers and force them to be recomputed. What am I missing?
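To spell out the softmax concern with a toy example (just the normalization step, nothing model-specific): for a single query, adding one more score rescales every attention weight, so any cached output built from the old weights would seem to be invalidated.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([1.0, 2.0, 0.5])    # attention scores of one query over the existing tokens
before = softmax(scores)              # ~[0.23, 0.63, 0.14]

scores_new = np.append(scores, 3.0)   # a newly added token contributes one more score
after = softmax(scores_new)           # ~[0.09, 0.23, 0.05, 0.63]

print(before)
print(after[:3])   # every pre-existing weight has changed, so the cached output for this
                   # query (and everything downstream of it) would have to be recomputed
```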

I could not find the code for sampling from the model in the Transformer-XL repo, so I could not study it in more detail. I feel like I'm deeply confused about some of this.
