
P-value correction for multiple testing using huge datasets

Cross Validated Asked on November 2, 2021

First of all, I apologize if the question is very basic; I am taking my first steps in data science, statistics and bioinformatics.

Data information

We are evaluating the correlation (using the Pearson, Kendall or Spearman method) between gene expression and miRNA expression using the corAndPvalue function of the WGCNA R package.

The resulting structure is a DataFrame containing all combinations of each gene with each miRNA, with the following columns:

Gene     miRNA      Correlation  P-value
Gen_1    miRNA_1     0.959       0.00311
Gen_1    miRNA_2    -0.039       0.1041
Gen_1    miRNA_3    -0.344       0.0021
Gen_2    miRNA_1     0.1333      0.00451
Gen_2    miRNA_2     0.877       0.07311
...

Question

Considering the huge number of correlation tests we are going to evaluate, we need to adjust the p-values to avoid reporting correlations that arise by chance. Bonferroni does not seem to be the best solution, so we would use the Benjamini-Hochberg (BH) method. The question is:

For the Gen_1 x miRNA_1 combination, should the BH correction consider the p-values of all combinations involving Gen_1 (Option 1), or the p-values of all gene x miRNA combinations (Option 2)?

For example, let's assume an expression dataset of 20,000 genes and another of 15,000 miRNAs.

Option 1:

To adjust Gen_1 x miRNA_1 we would use 15,000 p-values (Gen_1 x miRNA_1, Gen_1 x miRNA_2, …, Gen_1 x miRNA_15000).

Option 2:

To adjust Gen_1 x miRNA_1 we would use 300,000,000 p-values (Gen_1 x miRNA_1, Gen_1 x miRNA_2, …, Gen_1 x miRNA_15000, Gen_2 x miRNA_1, Gen_2 x miRNA_2, …, Gen_2 x miRNA_15000 and so on).

Clarifications

The question is oriented to the statistical aspect rather than to the domain of bioinformatics itself. However, some clarifications can be made that should be taken into account:

This is a generic tool to identify gene expression regulators.
Users can upload data from different sources that could have different forms of normalization or distribution.
It cannot be guaranteed that the data will have a normal bivariate distribution as it may be user-specific data. However, in the tool we offer the option to validate assumptions about results of interest.

Supplementary question

The documentation of the fdrcorrection method from the Python statsmodels library suggests that for negatively correlated tests (which could be frequent in an mRNA x miRNA correlation analysis) Benjamini-Yekutieli works better; is that right? Or would the Benjamini-Hochberg method be appropriate for this case?

Any kind of help would be much appreciated, thanks in advance!

One Answer

You need to correct for all of the comparisons you are doing. So if that's 300,000,000 comparisons you need to correct for that many multiple comparisons.
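As a concrete sketch of what "correct for all of the comparisons" means in practice (column names are taken from the question's example table; the data below are just the five example rows, standing in for the full 300,000,000-row result):

```python
import pandas as pd
from statsmodels.stats.multitest import multipletests

# The question's example rows; in the real analysis this frame would hold
# every gene x miRNA pair (20,000 x 15,000 = 300,000,000 rows).
df = pd.DataFrame({
    "Gene":    ["Gen_1", "Gen_1", "Gen_1", "Gen_2", "Gen_2"],
    "miRNA":   ["miRNA_1", "miRNA_2", "miRNA_3", "miRNA_1", "miRNA_2"],
    "P-value": [0.00311, 0.1041, 0.0021, 0.00451, 0.07311],
})

# One BH correction over the FULL family of tests (Option 2), not a separate
# correction per gene.
rejected, p_adj, _, _ = multipletests(df["P-value"], alpha=0.05,
                                      method="fdr_bh")
df["P-adjusted"] = p_adj
```

Note that the family size enters only through the length of the p-value vector passed in, so running `multipletests` once on the whole column is exactly the Option 2 correction.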

But consider what some standard corrections for false-discovery rates (FDR) and family-wise error rates (FWER) protect you from. Say you have data in which there are no true associations but you do a lot of comparisons. One or more might then be identified incorrectly as "significant" just by chance. The Bonferroni FWER and Benjamini-Hochberg (BH) FDR corrections you cite protect you from that.

That's not really your situation.

Among the thousands of protein-coding mRNAs there are frequent correlations in expression patterns. Although I don't know much about miRNAs, my understanding is that they too have highly inter-correlated expression patterns. So if any particular pair of an mRNA and a miRNA has a true correlation, the protein-coding mRNAs correlated with the original mRNA are likely also to be associated with the original miRNA, and vice versa. So there's a chance that you might be over-correcting with standard procedures that were developed to protect you from finding true associations when there aren't any at all. Although the origin of the problem is in the nature of the biological phenomena, the resulting implications for analysis are statistical.

The Benjamini-Yekutieli (BY) method was designed to handle correlated test results: it guarantees FDR control under arbitrary dependence among the tests. The price is that it is more conservative than BH, since it inflates the BH-adjusted values by a factor that grows with the number of tests.
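The two corrections can be compared directly with statsmodels' fdrcorrection, which implements BH as method="indep" and BY as method="negcorr" (toy p-values below; the harmonic-sum inflation factor for m tests is 1 + 1/2 + ... + 1/m):

```python
import numpy as np
from statsmodels.stats.multitest import fdrcorrection

pvals = np.array([0.0021, 0.00311, 0.00451, 0.07311, 0.1041])

_, p_bh = fdrcorrection(pvals, method="indep")    # Benjamini-Hochberg
_, p_by = fdrcorrection(pvals, method="negcorr")  # Benjamini-Yekutieli

# BY adjusted p-values are never smaller than the BH ones: for m = 5 tests
# they are the BH values scaled by 1 + 1/2 + 1/3 + 1/4 + 1/5, capped at 1.
print(np.all(p_by >= p_bh))
```

So on any fixed dataset, BY flags at most as many pairs as BH does; the trade-off is validity under dependence versus power.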

That's still a very general correction, however. There is extensive discussion about multiple-comparison correction in the specific context of genomic studies on this page. This page has further related discussion. Those pages originally date back almost a decade, indicating that even then there was already an extensive literature on the best ways to proceed with large amounts of expression data.

If all that you want to do is rank-order the set of correlations, almost anything will do. But the specific values you report for FWER or FDR may be unduly conservative. If you wish your tool to be truly useful, it would seem best to incorporate best practices rather than to fall back on generic types of correction that aren't really appropriate for this type of data.

More broadly, this type of problem calls out for ways beyond simple correlations that explicitly take into account the inherent correlations among your mRNA expression values and the separate correlations among your miRNA expression values, and then puts that information together. So-called partial least squares regression comes to mind as a method designed specifically for this type of data. I suspect that some type of cluster analysis could also be informative.

Answered by EdM on November 2, 2021
