TransWikia.com

Compute the significance of the overlap between 2 or more gene sets

Bioinformatics Asked on May 31, 2021

I was able to compute the significance of the overlap between 2 gene sets using the cdf function of scipy hypergeometric distribution.
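For reference, the two-set calculation can be sketched with only the standard library. This is a minimal sketch, not the asker's actual code: the function name `overlap_pvalue` and the numbers in the usage lines are illustrative assumptions. It computes the upper tail P(overlap >= k), which should match `scipy.stats.hypergeom.sf(k - 1, N, n1, n2)`.

```python
from math import comb

def overlap_pvalue(k, N, n1, n2):
    """P(overlap >= k) for two gene sets of sizes n1 and n2,
    each drawn at random (without replacement) from N genes.
    Hypothetical helper name, for illustration only."""
    upper = min(n1, n2)  # the overlap cannot exceed the smaller set
    # Count the ways to pick set 2 so that exactly i of its genes fall in set 1
    favourable = sum(comb(n1, i) * comb(N - n1, n2 - i)
                     for i in range(max(k, 0), upper + 1))
    return favourable / comb(N, n2)

# Illustrative numbers: 1000 genes, sets of 100 and 300, overlap of 50
p = overlap_pvalue(50, 1000, 100, 300)
print(p)
```

Note that this is the upper tail: a plain `cdf(k)` gives P(overlap <= k), which is the wrong direction for an enrichment test.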

I wish to be able to perform the same calculation for more than 2 gene sets; should I use the multivariate hypergeometric distribution cdf function for that?

Are there any websites that provide the same calculations over gene sets so I can validate my results?

One Answer

Here is a shot at it using simulation. First, an example dataset:

import matplotlib.pyplot as plt  
import numpy as np
import functools
from matplotlib_venn import venn3
    
# define universe
uni = ["gene"+str(i) for i in range(1000)]
# some overlap
gs1 = uni[250:300] + uni[900:950]
gs2 = uni[:300]
gs3 = uni[250:500]

The ever-amazing Venn diagram:

venn3([set(gs1),set(gs2),set(gs3)],set_labels=["gs1","gs2","gs3"])


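Then a function that draws, for each gene set, a random set of the same size from the universe (without replacement, so that no gene is drawn twice) and returns the size of the intersection of all 3:

def sim_intersect(uni,set_lengths):
    # replace=False: a gene set cannot contain the same gene twice
    randomsets = [np.random.choice(uni,n,replace=False) for n in set_lengths]
    return len(functools.reduce(np.intersect1d,randomsets))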

We run this 1000 times:

permuted_values = [sim_intersect(uni,[len(gs1),len(gs2),len(gs3)]) for i in range(1000)]

plt.hist(permuted_values,bins=range(50))
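A seeded variant of the simulation step makes the histogram and the p-value reproducible between runs. The following is a minimal sketch using only the standard library; the name `simulate_overlaps` and its `n_sim` and `seed` parameters are assumptions introduced for illustration, and genes are represented by integer indices rather than by name.

```python
import random

def simulate_overlaps(universe_size, set_sizes, n_sim=1000, seed=0):
    """Sizes of the common intersection of random sets, one value per simulation.
    Hypothetical helper; each set is drawn without replacement."""
    rng = random.Random(seed)          # private generator, so results are repeatable
    universe = range(universe_size)    # genes as integer indices
    results = []
    for _ in range(n_sim):
        sets = [set(rng.sample(universe, n)) for n in set_sizes]
        results.append(len(set.intersection(*sets)))
    return results

# Same set sizes as gs1, gs2 and gs3 in the example dataset
values = simulate_overlaps(1000, [100, 300, 250], n_sim=1000, seed=42)
print(min(values), max(values))
```

With a fixed seed, two calls with the same arguments return the same list, which makes it easier to compare results after changing the set sizes.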


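The observed size of the three-way intersection:

obs_n = len(set(gs1) & set(gs2) & set(gs3))

The empirical p-value, i.e. the probability of seeing an overlap at least as large as the observed one under random sampling, using (B+1)/(M+1) as estimator, where B is the number of simulated values >= the observed one and M is the number of simulations (see this post):

(sum(np.array(permuted_values)>=obs_n)+1)/(1000+1)
0.000999000999000999

None of the simulated overlaps reached the observed value of 50, so this is the smallest p-value attainable with 1000 simulations.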
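As a cross-check on the simulation, the same null model (each set drawn independently and uniformly without replacement from the universe) also has an exact tail probability. The overlap of the first two sets is hypergeometric, and given that overlap has size j, the number of those j genes that also land in the third set is again hypergeometric. The sketch below chains the two; the function names are assumptions for illustration, and it does not use the multivariate hypergeometric distribution.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k) when n genes are drawn without replacement from N,
    of which K are 'marked'. Hypothetical helper name."""
    if k < max(0, n + K - N) or k > min(K, n):
        return 0.0  # outside the support
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def triple_overlap_pvalue(k, N, n1, n2, n3):
    """P(|A & B & C| >= k) for independent random sets of sizes n1, n2, n3."""
    total = 0.0
    for j in range(max(0, n1 + n2 - N), min(n1, n2) + 1):
        if j < k:
            continue  # a pairwise overlap smaller than k cannot give k in all three
        p_j = hypergeom_pmf(j, N, n1, n2)            # P(|A & B| = j)
        tail = sum(hypergeom_pmf(i, N, j, n3)        # P(at least k of those j are in C)
                   for i in range(k, min(j, n3) + 1))
        total += p_j * tail
    return total

# Set sizes and observed overlap from the example dataset
print(triple_overlap_pvalue(50, 1000, 100, 300, 250))
```

For the example data the exact value is far below 1/1001, which is consistent with the simulation never reaching the observed overlap. The simulation can only say that the p-value is below its resolution.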

Correct answer by StupidWolf on May 31, 2021
