TransWikia.com

How to do `bedtools intersection` using pandas alone?

Bioinformatics Asked on February 15, 2021

I have two pandas Dataframes, using python3.x:

import pandas as pd

dict1 = {0:['chr1','chr1','chr1','chr1','chr2'], 
    1:[1, 100, 150, 900, 1], 2:[100, 200, 500, 950, 100], 
    3:['feature1', 'feature2', 'feature3', 'feature4', 'feature4'], 
    4:[0, 0, 0, 0, 0], 5:['+','+','-','+','+']}

df1 = pd.DataFrame(dict1)

print(df1)

##       0    1    2         3  4  5
## 0  chr1    1  100  feature1  0  +
## 1  chr1  100  200  feature2  0  +
## 2  chr1  150  500  feature3  0  -
## 3  chr1  900  950  feature4  0  +
## 4  chr2    1  100  feature4  0  +

dict2 = {0:['chr1','chr1'], 1:[155, 800], 2:[200, 901], 
    3:['feature5', 'feature6'], 4:[0, 0], 5:['-','+']}

df2 = pd.DataFrame(dict2)
print(df2)
##       0    1    2         3  4  5
## 0  chr1  155  200  feature5  0  -
## 1  chr1  800  901  feature6  0  +

The columns to focus on in these dataframes are the first three columns: location, start, and end. Each start:end value represents a distance on location (e.g. chr1, chr2, chr3).

I would like to output the intersection of df1 against df2. Here is the correct output:

chr1    155 200 feature2    0   +
chr1    155 200 feature3    0   -
chr1    900 901 feature4    0   +

I was hoping there was a pandas only solution…

4 Answers

Your setup:

import pandas as pd

dict1 = {0:['chr1','chr1','chr1','chr1','chr2'], 
    1:[1, 100, 150, 900, 1], 2:[100, 200, 500, 950, 100], 
    3:['feature1', 'feature2', 'feature3', 'feature4', 'feature4'], 
    4:[0, 0, 0, 0, 0], 5:['+','+','-','+','+']}

df1 = pd.DataFrame(dict1)

print(df1)

##       0    1    2         3  4  5
## 0  chr1    1  100  feature1  0  +
## 1  chr1  100  200  feature2  0  +
## 2  chr1  150  500  feature3  0  -
## 3  chr1  900  950  feature4  0  +
## 4  chr2    1  100  feature4  0  +

dict2 = {0:['chr1','chr1'], 1:[155, 800], 2:[200, 901], 
    3:['feature5', 'feature6'], 4:[0, 0], 5:['-','+']}

df2 = pd.DataFrame(dict2)

Now use pyranges (https://github.com/biocore-ntnu/pyranges):

# pip install pyranges
# conda install -c bioconda pyranges

import pyranges as pr
cols = "Chromosome Start End Name Score Strand".split()
# Add column names to the dfs so PyRanges knows which 
# represent Chromosomes, Starts, Ends and optionally Strand

df1.columns = cols
df2.columns = cols

# create PyRanges-objects from the dfs
gr1, gr2 = pr.PyRanges(df1), pr.PyRanges(df2)

# intersect the two
gr = gr1.intersect(gr2)

# get the result as a ?  df
df = gr.df

Results:

print(gr)
# +--------------+-----------+-----------+------------+-----------+--------------+
# | Chromosome   |     Start |       End | Name       |     Score | Strand       |
# | (category)   |   (int32) |   (int32) | (object)   |   (int64) | (category)   |
# |--------------+-----------+-----------+------------+-----------+--------------|
# | chr1         |       155 |       200 | feature2   |         0 | +            |
# | chr1         |       900 |       901 | feature4   |         0 | +            |
# | chr1         |       155 |       200 | feature3   |         0 | -            |
# +--------------+-----------+-----------+------------+-----------+--------------+
# Stranded PyRanges object has 3 rows and 6 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
print(df)
#   Chromosome  Start  End      Name  Score Strand
# 0       chr1    155  200  feature2      0      +
# 1       chr1    900  901  feature4      0      +
# 2       chr1    155  200  feature3      0      -

See the docs for more info: https://biocore-ntnu.github.io/pyranges/

Correct answer by The Unfun Cat on February 15, 2021

I implemented pandas "Intervals" and ... it should be a few lines, clearly there are limitations. For non-overlapping data it is very cool however. It will work for overlapping data, BUT if the data you are using as the interval data is overlapping, it falls over. It could work if an independent (non-overlapping) interval was constructed.

Anyway, the point of the excercise was to keep it all in pandas, "start_stop" should not be needed. This can't be done via Intervals, you've got to drop out, and that will slow it down and harder to parse.

The query contig df3.iloc[1,], is still a bug (hence not printed), should be [x,] where x is the value_counts hit, but this is just a pilot.

The script would be fine if it the df was wrangled using groupby rather than value_counts because value_counts trashes important data, which is what I'm trying to recover via df3.iloc[1,]. I like pandas elegant solutions, but you know another day another dollar.

Note: I changed 200 to 199 otherwise there's too much overlap.

Overlap Output (not labeled very well)

                 start  stop
index                      
(1.0, 100.0]        0     0
(100.0, 199.0]      1     0
(150.0, 500.0]      1     1
(900.0, 950.0]      0     1

import pandas as pd

def tableit (dict):
    df = pd.DataFrame(dict)
    table = pd.pivot_table(df, values=['start', 'stop'], columns=['chromosome'], index=['feature', 'misc', 'misc2'], fill_value = '-')
    return table

def start_stop(point, table2, bin):
    df3 = pd.cut(table2[point,'chr1'], bin[0]).value_counts()
    df3 = df3.to_frame(point).reset_index()
    for x in bin:
            if not (x == bin[0]):
                df3_tmp = pd.cut(table2[point,'chr1'], x).value_counts()
                df3_tmp = df3_tmp.to_frame(point).reset_index()
                df3 = pd.concat([df3, df3_tmp]                                  
            else:
                continue
    df3.set_index('index', inplace=True)
    return df3, df3.iloc[1,] # [1,] is a bug and the query locus needs correctly reading from df3_tmp

def main():
    dict1 = {'chromosome':['chr1','chr1','chr1','chr1','chr2'], 'start':[1, 100, 150, 900, 1], 'stop':[100, 199, 500, 950, 100], 'feature':['feature1', 'feature2', 'feature3', 'feature4', 'feature4'], 'misc':[0, 0, 0, 0, 0], 'misc2':['+','+','-','+','+']}
    dict2 = {'chromosome':['chr1','chr1'], 'start':[155, 800], 'stop':[200, 901], 'feature':['feature5', 'feature6'], 'misc':[0, 0], 'misc2':['-','+']}

    table1 = tableit(dict1)
    table2 = tableit(dict2)

    bin = [[i, j] for i, j in zip(table1['start', 'chr1'], table1['stop', 'chr1'])]

    df_start, locus = start_stop('start', table2, bin)
# locus.name gets to the data
    df_stop, _ = start_stop('stop', table2, bin)
    df_merge = pd.merge(df_start, df_stop, on="index", how='outer')
    print (df_merge)

if __name__ == "__main__":
    main()

The following would have been better,

table_join = pd.concat([table1, table2]).fillna('-')

and then sort them.. perhaps including a new column to specify the original dataframe and screen for the overlap with table_join.iloc[i, column] - table_join.iloc[1-i, other_column].

The key step in all approaches is to convert from "long format" to "wide format" and thats the most important thing I've done in all code and makes pandas work.

Output Table_join

                     start        stop     
chromosome            chr1 chr2   chr1 chr2
feature  misc misc2                        
feature1 0    +        1.0    -  100.0    -
feature2 0    +      100.0    -  199.0    -
feature3 0    -      150.0    -  500.0    -
feature4 0    +      900.0    1  950.0  100
feature5 0    -      155.0    -  200.0    -
feature6 0    +      800.0    -  901.0    -

Answered by M__ on February 15, 2021

I don't think Pandas has this implemented functionality out-of-the-box. Even if it did, solutions not designed specifically for bioinformatics probably rarely handle intervals on different chromosomes correctly unless you split the intervals by chromosome first.

Pandas does handle intervals (see docs for the Interval and IntervalIndex classes), but I've never used these. I have used the excellent intervaltree package. Whatever approach and tools you decide to use, there's a good chance you'll have to write some code. Computing interval intersections with intervaltree would look something like this.

>>> from collections import defaultdict
>>> from intervaltree import IntervalTree
>>> from pandas import DataFrame
>>> 
>>> dict1 = {
...     0:['chr1','chr1','chr1','chr1','chr2'],
...     1:[1, 100, 150, 900, 1],
...     2:[100, 200, 500, 950, 100],
...     3:['feature1', 'feature2', 'feature3', 'feature4', 'feature4'],
...     4:[0, 0, 0, 0, 0],
...     5:['+','+','-','+','+']
... }
>>> dict2 = {
...     0:['chr1','chr1'],
...     1:[155, 800],
...     2:[200, 901],
...     3:['feature5', 'feature6'],
...     4:[0, 0],
...     5:['-','+']
... }
>>> df1 = DataFrame(dict1)
>>> df2 = DataFrame(dict2)
>>> 
>>> 
>>> # Construct an interval index from one of the dataframes,
>>> #   one tree per chromosome
>>> index = defaultdict(IntervalTree)
>>> for n, row in df2.iterrows():
...     index[row[0]].addi(row[1], row[2], row)
... 
>>> 
>>> # Query the index from the other dataframe to build a 3rd
>>> #   dataframe representing the intersection
>>> dict3 = { i: list() for i in range(6) }
>>> for n, row in df1.iterrows():
...     overlapping = index[row[0]].overlap(row[1], row[2])
...     for itvl in overlapping:
...         itsct_start = max(row[1], itvl.begin)
...         itsct_end = min(row[2], itvl.end)
...         dict3[0].append(row[0])
...         dict3[1].append(itsct_start)
...         dict3[2].append(itsct_end)
...         dict3[3].append(row[3])
...         dict3[4].append(row[4])
...         dict3[5].append(row[5])
... 
>>> df3 = DataFrame(dict3)
>>> df3
      0    1    2         3  4  5
0  chr1  155  200  feature2  0  +
1  chr1  155  200  feature3  0  -
2  chr1  900  901  feature4  0  +
>>>

Answered by Daniel Standage on February 15, 2021

As mentioned by OP, another option is to use pybedtools, which in my opinion is pretty convenient for people already familiar with BedTools.

Let's even say df1's format is slightly different than df2:

import pandas as pd

dict1 = {0: ['chr1', 'chr1', 'chr1', 'chr1', 'chr2'], 1: [1, 100, 150, 900, 1], 
         2: [100, 200, 500, 950, 100], 
         3: ['feature1', 'feature2', 'feature3', 'feature4', 'feature4'], 
         4: [0, 0, 0, 0, 0], 5: ['+', '+', '-', '+', '+'], 
         6: ["remember", "the", "5th", "of", "november"]}

df1 = pd.DataFrame(dict1)

print(df1)

      0    1    2         3  4  5         6
0  chr1    1  100  feature1  0  +  remember
1  chr1  100  200  feature2  0  +       the
2  chr1  150  500  feature3  0  -       5th
3  chr1  900  950  feature4  0  +        of
4  chr2    1  100  feature4  0  +  November

dict2 = {0: ['chr1', 'chr1'], 1: [155, 800], 2: [200, 901], 
         3: ['feature5', 'feature6'], 4: [0, 0], 5: ['-', '+']}

df2 = pd.DataFrame(dict2)

print(df2)

      0    1    2         3  4  5
0  chr1  155  200  feature5  0  -
1  chr1  800  901  feature6  0  +

Using pybedtools we get:

from pybedtools import BedTool

bed1 = BedTool.from_dataframe(df1)
bed2 = BedTool.from_dataframe(df2)

intersected_bed = bed1.intersect(bed2)

print(intersected_bed)

chr1    155 200 feature2    0   +   the
chr1    155 200 feature3    0   -   5th
chr1    900 901 feature4    0   +   of

We can then keep using the intersected BedTool object or convert it back into a DataFrame:

# create a pandas.DataFrame, passing args and kwargs to pandas.read_table
intersected_df = intersected_bed.to_dataframe(header=None, 
                                              names=["chrom", "start", "end", "name", 
                                                     "score", "strand", "whatever"])

print(intersected_df)

  chrom  start  end      name  score strand whatever
0  chr1    155  200  feature2      0      +      the
1  chr1    155  200  feature3      0      -      5th
2  chr1    900  901  feature4      0      +       of

Answered by Kapara newbie on February 15, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP