Do too many or too few training samples of a specific type hamper a neural network model?

Question

I am analysing "Sherlock", a technique for detecting the semantic type of a column. In its training dataset, types with too many samples are capped at 15K per class, and types with fewer than 1K samples per class are excluded. What is the reason behind this? What are the disadvantages of having too many or too few samples of a specific type in the input of a neural network?

Valentin Calomme · Answer

Theoretically speaking, there is no inherent disadvantage to having a lot of data or very little data for a given type; the effect only shows up in the overall performance of your model. Based on the Sherlock paper, this appears to be a choice they made for their preprocessing. This is their explanation:

Certain types occur more frequently in the VizNet corpus than
others. For example, description and city are more common
than collection and continent. To address this heterogeneity,
we limited the number of columns to at most 15K per class and
excluded the 10% types containing less than 1K columns

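They did this to reduce the overall class imbalance of their dataset, so that very frequent types do not dominate training and very rare types are not learned from only a handful of examples.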
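To make the preprocessing concrete, here is a minimal sketch of that kind of filtering. It is not the Sherlock authors' code: the function name `balance_by_class`, its parameters, and the use of random sampling to enforce the cap are all assumptions for illustration (the quoted passage does not say how the 15K columns were chosen). The toy thresholds are scaled down so the example runs instantly.

```python
import random
from collections import Counter, defaultdict


def balance_by_class(samples, min_per_class=1_000, max_per_class=15_000, seed=0):
    """Drop rare classes and cap frequent ones.

    samples: iterable of (item, label) pairs.
    Hypothetical helper; random down-sampling for the cap is an assumption.
    """
    rng = random.Random(seed)

    # Group items by their label.
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    balanced = []
    for label, items in by_label.items():
        if len(items) < min_per_class:
            continue  # too few examples: exclude the class entirely
        if len(items) > max_per_class:
            # too many examples: keep a random subset of the allowed size
            items = rng.sample(items, max_per_class)
        balanced.extend((item, label) for item in items)
    return balanced


# Toy data with scaled-down thresholds (5 and 20 instead of 1K and 15K).
toy = (
    [(f"city_col_{i}", "city") for i in range(50)]
    + [(f"desc_col_{i}", "description") for i in range(10)]
    + [(f"cont_col_{i}", "continent") for i in range(3)]
)

result = balance_by_class(toy, min_per_class=5, max_per_class=20)
print(sorted(Counter(label for _, label in result).items()))
# [('city', 20), ('description', 10)]
```

Here `city` is capped at the maximum, `description` is kept unchanged, and `continent` is dropped for having too few examples, which narrows the gap between the largest and smallest remaining classes.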
