TransWikia.com

Average of BLEU scores on two subsets of data is not the same as overall score

Stack Overflow Asked by user1323163 on February 28, 2021

For evaluating a sequence generation model, I’m using BLEU1:BLEU4. I separated the test set to two sets and calculated the scores on each set separately, as well as, on the whole test set. Surprisingly, the results I get from the whole test set is not the weighted average of the results I get from each set. For example, consider the BLEU4 scores I get on a set and two subsets of it:

set1, 866 elements: 0.0001529267908

set2, 1010 elements: 0.1625387989

<set1,set2>, 1876 elements: 0.3063472152

How should I aggregate the results on two subsets to get the overall result?

Note: I know that all the elements in set1 are shorter than 4 tokens that’s why BLEU4 is almost zero there.

One Answer

BLEU score is by definition non-linear. As you can see in the original paper by Papineni et al.:

enter image description here

It is a product of two terms: brevity penalty (BP) and a harmonic mean of n-gram precisions. Both the brevity penalty and the harmonic mean are not linear operations with respect to averaging.

Regarding what you should report: since the two tests set look fundamentally different, the best option is to report two separate numbers.

I don't know what your task is, but given that the desired outputs are very short, BLEU might not be the best choice for evaluation. You might consider something edit-based (e.g., TER) or even plain accuracy might do a good job.

Answered by Jindřich on February 28, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP