Label A records B times or label A*B records

Question

This question concerns pre-training data sourcing.

Suppose you have a human workforce of B individuals and a potentially unlimited source of data.

The task is labeling images with classes. These classes are somewhat subjective (emotions). This means one individual might label the same image with a different class than another individual.

If these labeled records are then used as training data for a neural network that predicts classes on images, is it better to

1) have a number of records (A) labeled redundantly by all B individuals.

2) have every individual label A different records each, yielding A x B labeled records.

The intuition behind 1) is that averaging subjective labels across annotators yields something closer to an objective label, so the training data would be mostly objective.
In addition, the label proportions (e.g. 50% happy, 50% surprised) could be fed to the network as probabilities.
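
To make the "probabilities" idea concrete, here is a minimal sketch of how the B votes for one image could be turned into a probability vector. The function name `soft_label` and the class list are made up for illustration; nothing here is tied to a specific framework.

```python
from collections import Counter

def soft_label(votes, classes):
    """Turn one image's annotator votes into a probability vector over classes."""
    counts = Counter(votes)
    total = len(votes)
    # Classes nobody voted for get probability 0.0.
    return [counts[c] / total for c in classes]

# Hypothetical example: four annotators, three emotion classes.
classes = ["happy", "surprised", "sad"]
votes = ["happy", "surprised", "happy", "surprised"]
dist = soft_label(votes, classes)
print(dist)  # [0.5, 0.5, 0.0]
```

A vector like this could then serve as the target of a cross-entropy loss instead of a one-hot label.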

The intuition behind 2) is that subjectiveness in individual labeling is natural, and the NN is trained on that, becoming somewhat "general"/"objective" in its predictions. Also, more data is always better.

Please excuse the use of subjective and objective in combination with Machine Learning. I know this might not be correct at all.

Erwan · Answer

In my opinion neither of the two choices is the single correct answer, and your arguments for both are right.

I would argue for a balanced compromise between the two: have a proportion of A (say 20%) labelled by multiple annotators (say at least 3), while the rest of A is labelled by a single annotator. You could refine the proportion and the number of annotators progressively based on the results: for instance, if after the first batch one observes that annotator disagreement is very high, then it might be useful to increase the two parameters. This way you can still evaluate the level of disagreement and its effect on performance, while also maximizing the amount of data annotated.
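
As a rough sketch of this compromise, the snippet below splits the items into a multi-annotated subset and a single-annotated rest, then measures disagreement on the multi-annotated part. The helper names, the 20% default and the pairwise disagreement rate are illustrative assumptions; a chance-corrected agreement measure could be used instead.

```python
import random
from itertools import combinations

def split_for_annotation(item_ids, multi_fraction=0.2, seed=0):
    """Randomly choose which items get several annotators and which get one."""
    rng = random.Random(seed)
    shuffled = list(item_ids)
    rng.shuffle(shuffled)
    n_multi = round(len(shuffled) * multi_fraction)
    return shuffled[:n_multi], shuffled[n_multi:]

def pairwise_disagreement(votes_per_item):
    """Fraction of annotator pairs (pooled over all items) that disagree."""
    disagree = total = 0
    for votes in votes_per_item:
        for a, b in combinations(votes, 2):
            total += 1
            disagree += a != b
    return disagree / total if total else 0.0

multi, single = split_for_annotation(range(100))

# Hypothetical votes from 3 annotators on 3 multi-annotated images.
example_votes = [
    ["happy", "happy", "happy"],
    ["happy", "surprised", "happy"],
    ["sad", "surprised", "happy"],
]
rate = pairwise_disagreement(example_votes)
print(len(multi), len(single), round(rate, 3))
```

If the rate comes out high after the first batch, that would be the signal to raise `multi_fraction` or the number of annotators per item.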

An alternative is to use some form of active learning. For example, first use method 2 (each instance labelled once), then use cross-validation: train a model on the training folds and apply it to the corresponding test fold, so that every instance gets a prediction from a model that has not seen it. At the end of the process, the misclassified instances are the "hard ones" which need to be re-annotated by multiple annotators. This kind of process can be used iteratively to identify the most "ambiguous" instances.
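
A minimal sketch of the cross-validation step above, assuming each image has already been reduced to a small feature vector. A toy nearest-centroid classifier stands in for the neural network so that the example stays self-contained; all names and the toy data are hypothetical.

```python
import random

def nearest_centroid_predict(train, x):
    """Predict the label whose class centroid (mean feature vector) is closest to x."""
    sums, counts = {}, {}
    for feats, label in train:
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, v in enumerate(feats):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    best, best_dist = None, float("inf")
    for label, acc in sums.items():
        centroid = [s / counts[label] for s in acc]
        dist = sum((c - v) ** 2 for c, v in zip(centroid, x))
        if dist < best_dist:
            best, best_dist = label, dist
    return best

def flag_hard_instances(data, k=5, seed=0):
    """Return indices misclassified when held out in k-fold cross-validation."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    hard = []
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]
        for i in fold:
            feats, label = data[i]
            if nearest_centroid_predict(train, feats) != label:
                hard.append(i)
    return sorted(hard)

# Toy data: two well-separated clusters plus one item whose single label looks off.
data = (
    [([0.1 * i, 0.0], "happy") for i in range(5)]
    + [([5.0 + 0.1 * i, 5.0], "sad") for i in range(5)]
    + [([0.2, 0.1], "sad")]  # index 10: sits in the "happy" cluster
)
hard = flag_hard_instances(data, k=3)
print(hard)  # [10]
```

The flagged indices are the candidates to send back for annotation by several people; repeating the loop after re-annotation gives the iterative variant.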
