Compute the significance of the overlap between 2 or more gene sets

Question

I was able to compute the significance of the overlap between 2 gene sets using the cdf function of scipy hypergeometric distribution.
I wish to be able to perform the same calculation for more than 2 gene sets; should I use the multivariate hypergeometric distribution cdf function for that?
Are there any websites that provide the same calculation for gene sets, so I can validate my results?

StupidWolf · Accepted Answer

Here is a shot at it. First, an example dataset:
import matplotlib.pyplot as plt  
import numpy as np
import functools
from matplotlib_venn import venn3
    
# define universe
uni = ["gene"+str(i) for i in range(1000)]
# some overlap
gs1 = uni[250:300] + uni[900:950]
gs2 = uni[:300]
gs3 = uni[250:500]

The ever-amazing Venn diagram:
venn3([set(gs1),set(gs2),set(gs3)],set_labels=["gs1","gs2","gs3"])

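Then a function that draws, for each gene set, a random set of the same size from the universe and returns the size of the intersection of all the random sets (all 3 here):
def sim_intersect(uni,set_lengths):
    # replace=False so no gene is drawn twice within a set (the default samples with replacement)
    randomsets = [np.random.choice(uni,n,replace=False) for n in set_lengths]
    return len(functools.reduce(np.intersect1d,randomsets))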

We run this 1000 times:
permuted_values = [sim_intersect(uni,[len(gs1),len(gs2),len(gs3)]) for i in range(1000)]

plt.hist(permuted_values,bins=range(50))
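As a rough sanity check on the histogram, the expected size of the random three-way overlap can be worked out by hand. This is only a sketch, assuming each random set is drawn independently and without replacement, so that a given gene falls in a set of size n with probability n/N; the variable names are illustrative, and the sizes are those of the example above.

```python
import math

# Sizes from the example above: 1000 genes, sets of 100, 300 and 250.
universe_size = 1000
set_lengths = [100, 300, 250]

# Probability that one given gene lands in every random set,
# assuming independent draws without replacement.
p_in_all = math.prod(n / universe_size for n in set_lengths)

# Expected number of genes shared by all three random sets.
expected_overlap = universe_size * p_in_all
print(expected_overlap)
```

Under these assumptions the simulated values should cluster around 7.5, far below the overlap seen in the real sets.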

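The probability of observing an overlap at least as large as the real one, using (B+1)/(M+1) as the estimator (see this post):
# observed size of the three-way intersection
obs_n = len(set(gs1) & set(gs2) & set(gs3))
(sum(np.array(permuted_values)>=obs_n)+1)/(1000+1)
0.000999000999000999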
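For just two sets, the simulation can be cross-checked against the hypergeometric test mentioned in the question. The following is a minimal sketch, assuming scipy is installed; it rebuilds gs1 and gs2 from the example and uses hypergeom.sf, so that P(X >= k) is computed as sf(k - 1). The names observed and p_two_sets are illustrative.

```python
from scipy.stats import hypergeom

# Rebuild the example universe and the first two gene sets.
uni = ["gene" + str(i) for i in range(1000)]
gs1 = uni[250:300] + uni[900:950]
gs2 = uni[:300]

# Observed overlap between the two sets.
observed = len(set(gs1) & set(gs2))

# P(overlap >= observed) when len(gs2) genes are drawn at random from a
# universe of len(uni) genes, of which len(gs1) count as "successes".
# sf(k) is P(X > k), hence the observed - 1.
p_two_sets = hypergeom.sf(observed - 1, len(uni), len(gs1), len(gs2))
print(observed, p_two_sets)
```

Running sim_intersect with only two set lengths and enough iterations should give a permutation p-value of the same order, although it can never be smaller than 1/(M+1).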
