
Error metric to compare ratios derived from a binary prediction task

Asked on Cross Validated, November 9, 2021

I’m working on a research problem where a binary classification task ultimately produces a ratio downstream. I would like to understand the best way to quantitatively compare the resultant ratio to the ground truth.

I often think of my research problem as a game of "whack-a-mole" so for simplicity I’ll use that as the basis for my question.

The arcade has 5 holes from which the moles emerge at random. Each game lasts 30 seconds. I have 5 cameras above the arcade, one for each hole. The role of the vision system is to log when a mole appears. Unfortunately, my deep learning model isn't perfect, so sometimes it misses a mole appearance or logs activity when no mole has emerged from the hole. This part is a relatively straightforward classification problem: I can use metrics like precision, recall, and F1 score to evaluate how well the model identifies a mole when it emerges from a hole.
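For concreteness, here is a minimal sketch of how the first stage could be evaluated, assuming the detections have already been aligned with the ground truth as binary labels per time window (the labels below are made up purely for illustration):

```python
# Hypothetical aligned labels: 1 = mole appearance, 0 = no appearance.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground truth
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # what the vision system logged

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```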

In the next step, I'm interested in the number of holes in which a mole emerged two or more times during that 30-second game. Moles emerge at random, so a given hole may have zero, one, or two or more appearances. That count is used to generate the ratio of holes with two or more appearances during the 30-second period (see the short sketch after the examples below).

For example, the ground truth may look like this:

Ground Truth
Hole 1: 1 mole appearance
Hole 2: 3 mole appearances
Hole 3: 2 mole appearances
Hole 4: 7 mole appearances
Hole 5: 1 mole appearance

The ratio of holes with 2 or more appearances is 3/5. This is the ground truth.

But again the vision system is not perfect:

Predicted
Hole 1: 2 mole appearances
Hole 2: 4 mole appearances
Hole 3: 2 mole appearances
Hole 4: 5 mole appearances
Hole 5: 1 mole appearance

The vision system generates a ratio of 4/5.
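The downstream ratio is computed directly from the per-hole counts; a minimal sketch using the example counts above:

```python
# Per-hole appearance counts from the example above (hole -> appearances).
ground_truth = {1: 1, 2: 3, 3: 2, 4: 7, 5: 1}
predicted    = {1: 2, 2: 4, 3: 2, 4: 5, 5: 1}

def ratio_two_or_more(counts):
    """Fraction of holes with two or more appearances in a 30-second game."""
    return sum(c >= 2 for c in counts.values()) / len(counts)

print(ratio_two_or_more(ground_truth))  # 0.6  (3/5)
print(ratio_two_or_more(predicted))     # 0.8  (4/5)
```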

In this project, errors in the initial classification task propagate to the final predicted ratio. I'm interested in understanding how to quantify the error between the predicted ratio and the ground-truth ratio. From a practical standpoint, if the error at the final ratio step is "reasonable", there is less impetus to go back and retrain the classifier.
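Purely as an illustration of the kind of comparison I have in mind (not a claim that this is the right metric, which is exactly what I'm asking about), one could take the absolute difference between the predicted and true ratios for each game and average it over many games; the per-game ratios below are hypothetical:

```python
# Hypothetical ratios from several 30-second games.
true_ratios = [0.6, 0.4, 0.8, 0.2]   # ground-truth ratio per game
pred_ratios = [0.8, 0.4, 0.6, 0.4]   # ratio derived from the vision system

abs_errors = [abs(p - t) for p, t in zip(pred_ratios, true_ratios)]
mae = sum(abs_errors) / len(abs_errors)

print(abs_errors)  # per-game absolute errors
print(mae)         # mean absolute error across games
```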

If, in addition to your answer, you can provide references, that would be much appreciated. I have not seen many papers or textbooks that explore these more complicated, nuanced model/pipeline evaluation techniques.
