# Selecting uncorrelated samples from a set of bulk data that contains correlated and dependent samples

Cross Validated Asked by Sarmes on January 5, 2022

I have a data set generated by expensive computational model evaluations: 10,000 samples in 40 dimensions in total. It is composed of several smaller data sets, originating partly from random runs, Latin hypercube DOE, radial design DOE, and linear parameter studies, and a large part is history data generated by several optimization runs using genetic algorithms.

My thought was that a large part of the function evaluations generated during the genetic algorithm runs could somehow be used to augment the set of random and Latin hypercube samples, in order to have a larger sample set for a variance-based sensitivity analysis.

I came up with two ideas, but I am an engineer, not a mathematician:

1) Using the covariance matrix of the total sample matrix, filter out samples until the off-diagonal terms are smaller than some threshold, to avoid correlations.

2) Apply some sort of minimum-distance filter to avoid areas with tightly clustered samples.

Would that be sufficient? Are there any tests for randomness that I could use?
The problem is that I don't know the right terminology, so there may be ready-to-run methods for such problems, but I don't know how to find them because I don't know their names.

I am thankful for any helpful suggestions.
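
For illustration, the minimum-distance filter from the second idea could be sketched roughly as below. This is only a greedy thinning under stated assumptions, not an established method from the thread: the function name `min_distance_filter`, the threshold `d_min`, the min-max scaling of each dimension, and the demo data are all hypothetical choices. The result depends on the order of the samples, and thinning alone does not guarantee that the remaining points are suitable for a variance-based sensitivity analysis.

```python
import numpy as np

def min_distance_filter(X, d_min):
    """Greedy thinning: keep a sample only if it lies at least d_min
    (Euclidean distance in min-max scaled coordinates) from every sample kept so far.
    Returns the indices of the kept rows of X."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    Xs = (X - lo) / span                    # scale each dimension to [0, 1]
    kept = [0]                              # always keep the first sample
    for i in range(1, len(Xs)):
        dists = np.linalg.norm(Xs[kept] - Xs[i], axis=1)
        if dists.min() >= d_min:
            kept.append(i)
    return np.array(kept)

# Hypothetical demo data: 200 space-filling points plus a tight cluster,
# mimicking optimizer history that converges on one region.
rng = np.random.default_rng(0)
uniform_part = rng.uniform(size=(200, 3))
cluster_part = 0.5 + 0.001 * rng.normal(size=(100, 3))
X_demo = np.vstack([uniform_part, cluster_part])

kept = min_distance_filter(X_demo, d_min=0.05)
print(f"kept {len(kept)} of {len(X_demo)} samples")
```

With 10,000 samples the simple loop is still feasible, but in 40 dimensions distances tend to concentrate, so `d_min` would need to be chosen by experiment.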

Have you thought about orthogonalizing the entire data matrix with PCA? You could replace the columns of $\mathbf{X}$ with the uncorrelated principal components (eigenvectors normalized to their $\sqrt{\lambda_m}$).
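
A minimal sketch of one reading of this suggestion, assuming "normalized to their $\sqrt{\lambda_m}$" means dividing each principal-component score by the square root of its eigenvalue (PCA whitening). The function name `pca_whiten` and the demo data are hypothetical. Note that this decorrelates the columns (variables) of $\mathbf{X}$; it does not by itself remove dependence between rows (samples).

```python
import numpy as np

def pca_whiten(X):
    """Return PCA scores scaled to unit variance, the eigenvalues of the
    sample covariance matrix, and the principal axes (rows of Vt)."""
    Xc = X - X.mean(axis=0)                       # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = S**2 / (X.shape[0] - 1)             # lambda_m of the covariance matrix
    scores = Xc @ Vt.T                            # project onto the principal axes
    Z = scores / np.sqrt(eigvals)                 # scale each component by sqrt(lambda_m)
    return Z, eigvals, Vt

# Hypothetical demo: 500 samples of 5 linearly mixed (hence correlated) variables.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(5, 5))
X = rng.normal(size=(500, 5)) @ mixing

Z, eigvals, components = pca_whiten(X)
cov_Z = np.cov(Z, rowvar=False)                   # should be close to the identity matrix
print(np.round(cov_Z, 6))
```

If some eigenvalues are close to zero (for example, when variables are nearly collinear), those components would need to be dropped before the division.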

It also sounds like you don't have any categorical grouping variables among the 40 variables. In that case, the only thing you are left with is measuring the association between variables. If you are trying to do both linear and non-linear assessments for sensitivity analysis and variance explanation, then break up the data using a "divide and conquer" approach: solve the large problem by solving smaller problems. A mixture of samples generated from DOE, LHS, and genetic algorithms sounds quite complex, but as long as you pose your questions one at a time, and then do the analysis that answers each one, you can work through your analytic goals.

By the way, there are no variance-explanation approaches that let you pull out non-linear and linear components using the same model, unless you code it yourself using non-linear regression and linear regression. There are packages that allow you to fit data based on equations, so maybe look at those (IGOR, EGRET, AMFIT (Poisson), MATLAB, etc.).

Last, be careful of the "so what?" question, whereby after you have done all of your model checking, a reader could ask why you did all of this on simulated data.

Answered by user32398 on January 5, 2022
