TransWikia.com

Predictive power of a dataset

Data Science Asked by oklm on November 7, 2020

I am reading a book on machine learning for undergraduates. I am confused about the flexibility of linear regression, as the authors say:

Occasionally, linear regression will fail to recover a good solution
for a data set. While this may be because our data doesn’t actually
have predictive power, it might also just indicate that our data is
provided in a format unsuitable for linear regression.

I read some questions here related to predictive power and I noticed that it’s all about the model produced. What do we mean when we talk about the predictive power of a dataset?

I take it to mean that there is no relationship between the features and the target (as linear regression is intended to learn the relationship between the inputs $X$ and the output $Y$). But I am not convinced of this answer yet.

2 Answers

Usually predictive power refers to the model, rather than the data. I've occasionally seen some people use it in the way that the author of your book uses it (see this for example).

In the context of your book, yes, predictive power refers to whether the input can be mapped to the target output, $X \rightarrow Y$. We can infer a dataset's "predictive power" by trying to model it (e.g. with linear regression). If the model performs poorly, then there are two possibilities, as the book says: either the dataset is not predictive (i.e. it does not offer a clean mapping from input to target output) or the methods we are using are unsuitable for modeling the mapping.


Some examples of both situations:

  • If you generated random data for $X$ and $Y$, the resulting dataset would (probably) have no predictive power as no model could reasonably generalize the mapping $X \rightarrow Y$.

  • If you have a nonlinear mapping, then linear regression would not fit it well. For example, if our dataset was such that every input with $\|\vec{x}\| < \alpha$ maps to $y_1$ and all other inputs map to $y_2$, then our dataset is extremely predictive, but our linear regression model cannot fit it (since the mapping is nonlinear). In this toy example, it's easy to see the predictive power of the dataset, particularly if the input is in 2D/3D, since we could just plot it. However, manually observing such trends in high-dimensional space using actual data can be very difficult, hence we use the tools that you are learning to help interpret the data. Also, when there's nonlinearity, it's difficult to statistically evaluate the dataset itself. Variables with linear relationships are simple to correlate (e.g. with Pearson's correlation coefficient), but nonlinearities can make correlation difficult to measure. I assume that this is why your book resorts to vague terminology: it's probably for pedagogical, rather than pedantic, purposes. After all, it gets the point across without needing to discuss the ongoing research into quantifying nonlinear correlations.
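Both situations can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not in the book or the question: a single input feature, a hypothetical threshold `alpha = 0.5`, a one-dimensional stand-in for the norm condition ($|x| < \alpha$), and in-sample $R^2$ as the measure of fit. The helpers `fit_line` and `r_squared` are made up for this example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def r_squared(xs, ys):
    """In-sample R^2 of the least-squares line fitted to (xs, ys)."""
    a, b = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n = 2000
alpha = 0.5              # hypothetical threshold, chosen for illustration

xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]

# Case 1: the target is independent noise, so there is nothing to learn.
y_random = [rng.uniform(-1.0, 1.0) for _ in range(n)]
r2_random = r_squared(xs, y_random)

# Case 2: the target is fully determined by x, but nonlinearly.
y_band = [1.0 if abs(x) < alpha else 0.0 for x in xs]
r2_raw = r_squared(xs, y_band)

# Same data, same linear model, but the input is re-expressed as x^2.
xs_squared = [x * x for x in xs]
r2_squared = r_squared(xs_squared, y_band)

print(f"random target, raw x:   R^2 = {r2_random:.3f}")
print(f"band target, raw x:     R^2 = {r2_raw:.3f}")
print(f"band target, x squared: R^2 = {r2_squared:.3f}")
```

On the raw input, the linear fit looks equally poor in both cases, so the score alone cannot tell "no predictive power" apart from "unsuitable format". Re-expressing the input as $x^2$ lets the same linear model pick up much of the structure in the second case, while no such transformation would help with the random target.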

Correct answer by Benji Albert on November 7, 2020

Rather than asking about the predictive power of a dataset, I think it's more intuitive to ask about the predictive power of a model. My reasoning is as follows:

A dataset can be univariate, bivariate, or multivariate, and it can contain numerical features, categorical features, or both. Suppose there is a univariate dataset with a negatively skewed distribution. In such a case the mean and the median will typically be less than the mode. Now suppose this univariate dataset consists of continuous data. Knowing that its distribution is negatively skewed already gives the analyst a clue about its symmetry and shape. Given this, the question worth discussing is: as an analyst, would I be more interested in the predictive power of the dataset, or of the model(s) that I build using it?

There have been several studies in the literature that discuss a model's predictive power, such as 1, 2, 3 (see references). In contrast, I have not come across any study that discusses the predictive power of a dataset. Perhaps that is a future research direction.

However, I did find an article published on R-bloggers that discusses a predictive power score, a concept somewhat similar to the correlation coefficient.

And finally something about mapping. I think a better term could be "correlation" which at least quantifies the relationship between two variables X and Y.
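One caveat is worth keeping in mind: Pearson's correlation coefficient only quantifies the linear part of a relationship. The sketch below is an illustration with made-up data (the slope, noise level, and the helper `pearson_r` are assumptions for this example), comparing a noisy linear relationship with an exact but symmetric quadratic one.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

rng = random.Random(1)   # fixed seed so the sketch is repeatable
xs = [rng.uniform(-1.0, 1.0) for _ in range(5000)]

# Y depends linearly on X, with a little Gaussian noise added.
y_linear = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]

# Y is an exact function of X, but the dependence is symmetric around 0.
y_quadratic = [x * x for x in xs]

r_linear = pearson_r(xs, y_linear)
r_quadratic = pearson_r(xs, y_quadratic)

print(f"linear relationship:    r = {r_linear:.3f}")
print(f"quadratic relationship: r = {r_quadratic:.3f}")
```

The linear case gives a coefficient close to 1, while the quadratic case gives a coefficient close to 0 even though $Y$ is completely determined by $X$. A low correlation therefore does not, by itself, show that $X$ tells us nothing about $Y$.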

Note

A similar question was asked on stats.stackexchange.com. The comments there are consistent with my initial doubt: there is no such thing as the predictive power of a dataset.

References

  1. Lee, P. H. (2014). Resampling methods improve the predictive power of modeling in class-imbalanced datasets. International journal of environmental research and public health, 11(9), 9776-9789.
  2. López‐López, J. A., Marín‐Martínez, F., Sánchez‐Meca, J., Van den Noortgate, W., & Viechtbauer, W. (2014). Estimation of the predictive power of the model in mixed‐effects meta‐regression: A simulation study. British Journal of Mathematical and Statistical Psychology, 67(1), 30-48.
  3. Newson, R. B. (2010). Comparing the predictive powers of survival models using Harrell's C or Somers’ D. The Stata Journal, 10(3), 339-358.

Answered by mnm on November 7, 2020
