
BERT Transformer: Why does the BERT transformer use the [CLS] token for classification instead of an average over all tokens?

Data Science Asked on May 3, 2021

I am experimenting with the BERT architecture and found that most fine-tuning tasks take the final hidden layer as the text representation and then pass it to other models for the downstream task.

BERT's last layer looks like this:

[figure: BERT's final-layer hidden states, one vector per input token]

Where we take the [CLS] token of each sentence:

[figure: selecting the [CLS] vector from each sequence's final-layer outputs]
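For concreteness, here is roughly how that [CLS] vector is usually extracted with the Hugging Face transformers library (a minimal sketch; the model name and sentences are just placeholders):

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer(["A first example sentence.", "A second one."],
                   padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, seq_len, hidden_size)
last_hidden = outputs.last_hidden_state
# [CLS] is always the first token of every sequence
cls_embedding = last_hidden[:, 0, :]   # shape (batch_size, hidden_size)
```

That cls_embedding is what gets passed on to the downstream classifier.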

I went through many discussions of this: a Hugging Face issue, a Data Science forum question, a GitHub issue. Most data scientists give this explanation:

BERT is bidirectional, so the [CLS] token is encoded with representative information from all tokens through the multi-layer encoding procedure. The representation of [CLS] is therefore individual to each sentence.

My question is: why did the authors use only the [CLS] token for classification and ignore the other information (each token's vector), rather than taking the average, a max pool, or some other method that makes use of all the tokens?

How does this [CLS] token help compared to an average over all the token vectors?
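By averaging I mean something like the following masked mean pooling over the same last_hidden_state (my own sketch; padding positions are excluded so they don't dilute the average):

```python
import torch

def mean_pool(last_hidden, attention_mask):
    # last_hidden: (batch, seq_len, hidden), attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (last_hidden * mask).sum(dim=1)      # sum of real-token vectors
    counts = mask.sum(dim=1).clamp(min=1e-9)      # number of real tokens
    return summed / counts                        # (batch, hidden)
```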

One Answer

It's because you need to fine-tune BERT for your specific task anyway. You can train it to classify based on the [CLS] token, the mean of the token outputs, or whatever else you like.

In essence, the [CLS] token of the last layer has connections to all of the other tokens in the previous layer. So does it make sense to average manually?
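A minimal sketch of what this means in practice (the class and argument names are mine, assuming a Hugging Face-style encoder): the same linear head can be fine-tuned on top of either pooling strategy.

```python
import torch.nn as nn

class PooledClassifier(nn.Module):
    """Hypothetical head: classify from either the [CLS] vector or the token mean."""
    def __init__(self, encoder, hidden_size=768, num_labels=2, pooling="cls"):
        super().__init__()
        self.encoder = encoder          # e.g. AutoModel.from_pretrained("bert-base-uncased")
        self.pooling = pooling
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        if self.pooling == "cls":
            pooled = hidden[:, 0, :]                        # [CLS] vector
        else:
            mask = attention_mask.unsqueeze(-1).float()     # masked mean over tokens
            pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.classifier(pooled)                      # logits
```

Either way the gradients flow back through the whole encoder, so the model learns to concentrate whatever the head needs into the chosen representation.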

Answered by roman on May 3, 2021
