# Selecting uncorrelated samples from a set of bulk data that contains correlated and dependent samples

Cross Validated Asked by Sarmes on January 5, 2022

I have a set of data generated by expensive computational model evaluations: a total of 10,000 samples in 40 dimensions. This data set is composed of several subsets, originating partly from random runs, Latin hypercube DOE, radial design DOE, and linear parameter studies; a large part is history data generated by several optimization runs using genetic algorithms.

My thought was that a large part of the function evaluations generated during the genetic algorithm runs could somehow be added to the set of random and Latin hypercube samples, in order to have a larger sample set for a variance-based sensitivity analysis.

I came up with two ideas, but I am an engineer, not a mathematician:

1) Using the covariance matrix of the total sample matrix, filter out samples until the off-diagonal terms are smaller than some threshold, to avoid correlations.

2) Apply some sort of minimum-distance filter to avoid areas with tightly clustered samples.
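As a rough illustration of the two ideas, here is a minimal NumPy sketch. The function names (`scale_unit`, `max_offdiag_corr`, `min_distance_filter`), the threshold `d_min`, and the synthetic data are all hypothetical choices made for this example; the greedy thinning shown is only one simple way such a distance filter might be written, not an established method for this problem.

```python
import numpy as np

def scale_unit(X):
    """Scale each column to [0, 1] so no single dimension dominates distances."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against constant columns
    return (X - lo) / span

def max_offdiag_corr(X):
    """Largest absolute off-diagonal entry of the column correlation matrix."""
    C = np.corrcoef(X, rowvar=False)
    off = C - np.diag(np.diag(C))
    return float(np.max(np.abs(off)))

def min_distance_filter(Z, d_min):
    """Greedy thinning: keep a sample only if it lies at least d_min
    (Euclidean) away from every sample kept so far. Returns kept indices."""
    kept = []
    for i in range(len(Z)):
        if not kept or np.min(np.linalg.norm(Z[kept] - Z[i], axis=1)) >= d_min:
            kept.append(i)
    return np.array(kept)

# Synthetic stand-in for the real data (assumption): 200 space-filling
# points plus 300 points clustered tightly around one location, mimicking
# the history of a converged optimization run.
rng = np.random.default_rng(0)
space_filling = rng.uniform(size=(200, 5))
clustered = 0.5 + 0.01 * rng.standard_normal((300, 5))
X = np.vstack([space_filling, clustered])

Z = scale_unit(X)
kept = min_distance_filter(Z, d_min=0.2)
X_thin = X[kept]

print("samples before / after thinning:", len(X), len(X_thin))
print("max |off-diagonal correlation| before:", max_offdiag_corr(X))
print("max |off-diagonal correlation| after: ", max_offdiag_corr(X_thin))
```

The greedy pass depends on the order of the rows, so placing the random and Latin hypercube samples first would make them the ones that are preferentially kept. The loop is quadratic in the number of kept samples, which should still be manageable for 10,000 points.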

Would that be sufficient? Are there any tests for randomness that I could use?
The problem is that I don't know the right terminology. Ready-to-run methods for such problems may exist, but I don't know how to find them because I don't know their names.

I am thankful for any helpful suggestions.

Have you thought about orthogonalizing the entire data matrix with PCA? You could replace the columns of $\mathbf{X}$ with the uncorrelated principal components (eigenvectors normalized by their $\sqrt{\lambda_m}$).
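One reading of this suggestion is standard PCA whitening: project the centered data onto the eigenvectors of its covariance matrix and divide each resulting column by $\sqrt{\lambda_m}$. The sketch below assumes that reading; the function name `pca_whiten` and the synthetic data are hypothetical, and it assumes all eigenvalues are strictly positive.

```python
import numpy as np

def pca_whiten(X):
    """Return PCA scores scaled to unit variance, plus the eigenvalues.

    Assumes the covariance matrix has full rank (all eigenvalues > 0).
    """
    Xc = X - X.mean(axis=0)                 # center each column
    cov = np.cov(Xc, rowvar=False)          # column covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]       # sort components by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                   # uncorrelated principal components
    return scores / np.sqrt(eigvals), eigvals

# Synthetic correlated data (assumption): independent columns mixed by a
# random matrix so that the resulting columns are correlated.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
X = rng.standard_normal((1000, 4)) @ A

Z, eigvals = pca_whiten(X)
print(np.round(np.cov(Z, rowvar=False), 6))  # approximately the identity matrix
```

Note that this removes linear correlation between the columns (variables); it does not remove any rows, so clustered samples stay clustered in the transformed coordinates.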

It sounds like you don't have grouping categorical variables among the 40 variables. In that case, the only thing you are left with is measuring the association between variables. If you are trying to do both linear and non-linear assessments for sensitivity analysis and variance explanation, then break up the data using a "divide and conquer" approach, solving a large problem by solving smaller problems. A mixture of samples generated from DOE, LHS, and genetic algorithms sounds quite complex, but as long as you pose questions one at a time and then do the associated analysis to answer each one, you can work through your analytic goals.

By the way, there are no variance explanation approaches that let you pull out non-linear and linear components using the same model, unless you code it yourself using non-linear regression and linear regression. There are packages that allow you to fit data based on equations, so maybe look at those (IGOR, EGRET, AMFIT (Poisson), MATLAB, etc.).

Last, be careful of the "so what?" question, whereby after you have done all of your model checking, a reader could ask why you did all of this on simulated data.

Answered by user32398 on January 5, 2022
