
Binary Classification with almost no positives

Cross Validated Asked by EpsilonDelta on December 11, 2021

I have a dataset with 121 features and 7176 data points. Only 11 of these are positive; the rest are negative. If I want to train an SVM on this data set, what would be the best strategy? Does this even make sense?

For now I have decided to use 5-fold cross-validation with stratified sampling. Would this approach be reasonable?

One Answer

You have a positive class prior of $11/7176 \approx 0.0015$, which is way too low to train a classifier from. You need at least $100$ cases of your rarest class, here the positives. With that many, you will be able to start with a few of the most discriminative features.
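To make the numbers concrete, here is a small sketch of the arithmetic behind the prior and of what the stratified 5-fold split from the question would look like. It assumes the positives are spread over the folds as evenly as possible; which fold receives the extra positive is an arbitrary choice here, and the variable names are illustrative only.

```python
# Counts taken from the question.
n_total = 7176
n_pos = 11
k = 5  # number of cross-validation folds

# Positive class prior.
prior = n_pos / n_total
print(f"positive prior: {prior:.4f}")

# Assumption: stratification spreads the positives as evenly as possible.
base, extra = divmod(n_pos, k)
test_pos_per_fold = [base + 1 if i < extra else base for i in range(k)]
train_pos_per_fold = [n_pos - t for t in test_pos_per_fold]

print("positives in each test fold: ", test_pos_per_fold)
print("positives in each training set:", train_pos_per_fold)
```

Each test fold holds only two or three positives, so any per-fold estimate of recall or precision rests on a handful of cases.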

Next, a support vector machine is not a suitable classifier for such a skewed prior distribution. A recommended approach is to train, for example, a random forest classifier, a decision tree (C4.5 or CART), and discriminant analysis or logistic regression. Use a balanced training set and correct the posterior probabilities post hoc, that is, correct for the skewed prior after training.
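The post hoc correction can be sketched as follows. This is one common form of the adjustment, obtained by reweighting the two class posteriors by the ratio of true prior to training prior and renormalizing; it is not necessarily the exact procedure the answer had in mind. The function name `correct_posterior` and the default `train_prior=0.5` (a perfectly balanced training set) are assumptions made for illustration.

```python
def correct_posterior(p_balanced, true_prior, train_prior=0.5):
    """Rescale a positive-class posterior from the training prior to the true prior.

    p_balanced  : posterior P(positive | x) from a model trained at train_prior
    true_prior  : positive class prior in the real population
    train_prior : positive class prior in the (re-balanced) training set
    """
    # Reweight each class posterior by the ratio of true to training prior.
    pos = p_balanced * true_prior / train_prior
    neg = (1.0 - p_balanced) * (1.0 - true_prior) / (1.0 - train_prior)
    # Renormalize so the two class posteriors sum to one.
    return pos / (pos + neg)


true_prior = 11 / 7176  # counts from the question

for p in (0.5, 0.9, 0.99):
    corrected = correct_posterior(p, true_prior)
    print(f"balanced posterior {p:.2f} -> corrected {corrected:.4f}")
```

A balanced-model posterior of $0.5$ maps back to the true prior itself, which is a quick sanity check on the formula: an uninformative score should leave you with nothing more than the base rate.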

After note

The figure of $100$ required positive cases is chosen based on experience. With $121$ different feature variables, this number is an absolute minimum. A classifier like C4.5 does feature selection while building the decision tree (starting with each individual feature variable).

More theoretical work can begin with the paper by E.B. Baum and D. Haussler, "What size net gives valid generalization?", Neural Computation, Vol. 1, No. 1, March 1989. Following the guidelines from their Corollary 9, I computed that to ensure an error rate of $0.015$ with $121$ feature variables, you need $125$ training cases. I assumed one parameter/weight per feature variable (one hidden node, the simplest neural network classifier).

Beware that their results are based on the nonparametric VC-dimension. In practice, their bounds are below what is needed to fit well-generalizing decision boundaries. Using more model parameters than training cases is discouraged.

Answered by Match Maker EE on December 11, 2021
