How can I measure the reliability of the specificity of a model with very small train, test, and validation datasets?

Data Science, asked on September 25, 2021

Stats newbie here. I have a small dataset of 646 samples on which I've trained a reasonably performant model (~99% test and validation accuracy). To complicate things a little, it's a binary classification problem and the classes are somewhat imbalanced.

Here is my confusion matrix on the training data:

[[387   1]
 [  1  73]]

on testing data:

[[74  1]
 [ 0 10]]

on validation data:

[[85  1]
 [ 0 13]]

1. Training Specificity: 0.986
2. Testing Specificity: 0.909
3. Validation Specificity: 0.928
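
For reference, a minimal sketch that reproduces these figures, assuming the matrices are laid out with predicted labels as rows and actual labels as columns, and with the minority class treated as the negative class (the reported values match under this layout):

import numpy as np

# Assumed layout: rows = predicted label, columns = actual label,
# with the minority (second) class treated as the negative class.
def specificity(cm):
    tn = cm[1, 1]  # actual negatives predicted as negative
    fp = cm[0, 1]  # actual negatives predicted as positive
    return tn / (tn + fp)

train_cm = np.array([[387, 1], [1, 73]])
test_cm = np.array([[74, 1], [0, 10]])
val_cm = np.array([[85, 1], [0, 13]])

for name, cm in [("train", train_cm), ("test", test_cm), ("val", val_cm)]:
    print(name, round(specificity(cm), 3))
# train 0.986, test 0.909, val 0.929 (0.928 above is the truncated value)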

My reading is that the testing and validation specificity is noticeably lower than the training specificity. However, given that only one sample is missed in each of the testing and validation sets, what is my real-world specificity? Is there a better measure of generalizability? Is there something akin to a p-value that quantifies the reliability of the specificity estimate given the size of the negative class?
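
For concreteness, one candidate for such a measure is a binomial confidence interval on the specificity, treating each actual negative as an independent Bernoulli trial. A minimal sketch, assuming statsmodels and the test-set counts implied above (10 of 11 actual negatives classified correctly):

from statsmodels.stats.proportion import proportion_confint

# Counts implied by the test confusion matrix above (assumption:
# 10 actual negatives classified correctly out of 11 total).
correct_negatives = 10
total_negatives = 11

low, high = proportion_confint(correct_negatives, total_negatives,
                               alpha=0.05, method="wilson")
print(f"95% Wilson interval for specificity: [{low:.2f}, {high:.2f}]")
# roughly [0.62, 0.98]: with only 11 negatives, the 0.909 point
# estimate carries a great deal of uncertainty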

Thanks!

One Answer

The real-world data is effectively your test dataset. The data should be divided so that the model sees the training and validation portions more than once during development, while the test portion is seen only once, at final evaluation. If the model is robust enough, it will then perform well even on the test dataset. The underlying assumption is that the test data is as close as possible to real-world data.
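
A minimal sketch of that split, assuming scikit-learn and synthetic stand-in data (stratified so the class imbalance is preserved in every partition):

import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data for illustration (assumption: 646 samples, ~15% minority)
rng = np.random.default_rng(0)
X = rng.normal(size=(646, 10))
y = (rng.random(646) < 0.15).astype(int)

# Hold the test set out once; train/validation may be revisited freely
# during model development (e.g., for tuning or cross-validation).
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.15, stratify=y_dev, random_state=0)

Stratifying both splits matters here: with so few minority-class samples, an unstratified split can easily leave a partition with almost no negatives.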

Answered by Chaitanya Bapat on September 25, 2021
