Does having too many or too few training samples of a specific class hamper a neural network model?

I am analysing "Sherlock", a technique for detecting the semantic type of a column. In its training dataset, classes with too many samples are capped at 15K columns, and classes occurring too rarely (fewer than 1K columns) are excluded entirely. What is the reason behind this? What are the disadvantages of having too many or too few samples of a specific class in the input of a neural network?

Data Science Asked on November 30, 2021

One Answer

Theoretically speaking, there is no hard limit on how much or how little data a class may have; the amount only shows up in the overall performance of your model. Based on the Sherlock paper, this was a choice the authors made during preprocessing. This is their explanation:

Certain types occur more frequently in the VizNet corpus than others. For example, description and city are more common than collection and continent. To address this heterogeneity, we limited the number of columns to at most 15K per class and excluded the 10% types containing less than 1K columns

They did this to reduce the overall class imbalance of their dataset. Without the cap, the most frequent types would dominate the training loss and the network would learn to favor them at the expense of rarer types; conversely, classes with fewer than 1K columns provide too few examples for the network to learn a reliable representation, so keeping them would mostly add noise.
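As a rough illustration of that preprocessing step, here is a minimal sketch of capping over-represented classes and dropping under-represented ones. The function name and thresholds are hypothetical (not from the Sherlock code base); only the 15K/1K values mirror the quoted description.

```python
from collections import Counter
import random

def rebalance(rows, labels, max_per_class=15_000, min_per_class=1_000, seed=0):
    """Hypothetical helper: cap abundant classes and drop rare ones,
    in the spirit of the preprocessing described in the Sherlock paper."""
    rng = random.Random(seed)
    counts = Counter(labels)
    # Drop classes with too few examples to learn from reliably.
    keep_classes = {c for c, n in counts.items() if n >= min_per_class}
    # Group row indices by class.
    by_class = {}
    for i, y in enumerate(labels):
        if y in keep_classes:
            by_class.setdefault(y, []).append(i)
    # Randomly subsample over-represented classes down to the cap.
    kept = []
    for c, idxs in by_class.items():
        if len(idxs) > max_per_class:
            idxs = rng.sample(idxs, max_per_class)
        kept.extend(idxs)
    kept.sort()
    return [rows[i] for i in kept], [labels[i] for i in kept]
```

After this step, no class contributes more than `max_per_class` examples, and every remaining class has at least `min_per_class`, which keeps the loss from being dominated by a handful of frequent types.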

Answered by Valentin Calomme on November 30, 2021


