TransWikia.com

>My counter is counting genotypic combination occurrences more than once. How do I ensure it counts a combination once and doesn't count it again?

Bioinformatics Asked on March 20, 2021

mutations = ['A222V', 'D614G', 'E484Q', 'E780Q', 'G476S', 'L18F', 'N439K',
             'S477', 'S477N', 'T478I', 'V483A']

combinations = []
for M in range(1, len(mutations)+1):
    for subset in itertools.combinations(mutations, M):
        combinations.append(subset)
        
        combinations = ['_'.join(sorted(x)) for x in combinations]
        combinations = [x.split('_') for x in list(set(combinations))]

root = "C:"

os.chdir(root)

lineages = os.listdir('Results')

combination_labels = []
combination_counts = []

for lineage in lineages:
    
    df = pd.read_csv('Results/' + lineage).dropna()
    
    for combination in combinations:
    
            combination_df = df[list(df)]
        
            for mutation in combination:
                combination_df = combination_df[combination_df[mutation] == 1]
    
        #print(combination_df.shape[0])
        
            combination_labels.append('_'.join(combination))
            combination_counts.append(combination_df.shape[0])
        
            out_df = pd.DataFrame({'combination':combination_labels,
                               'count':combination_counts})
            out_df['percentage'] = (out_df['count'] / df.shape[0]) * 100
            out_df = out_df.sort_values('percentage', ascending = False)   
        
            out_df.to_csv('Results_2/' + lineage.replace(".csv", "") + '_3.csv',
                     header = True, 
                     index = False)

The input CSV

lineage,A222V,D614G,E484Q,E780Q,G476S,L18F,N439K,S477,S477N,T478I,V483A
417941,0,1,0,0,0,0,0,0,0,0,0

Output CSV

combination,count,percentage
D614G,87355,90.7084929856806

The code above counts all occurrences of combinations of spike protein mutations in order to determine the genotypes. However, looking at the values written to the .csv file, I can see that 'D614G' is being counted even when it occurs together with other mutations.

My question is: how do I ensure that once something has been counted (a row in the imported .csv file), it is not counted a second time?

Or, alternatively, how can I edit this code so that single mutations are not counted when they appear as part of a combination?

Thanks in advance

One Answer

Alright, so there are a number of problematic patterns in your code - as far as I understand what you are trying to do. Next time, try to post a reproducible example that people can use and more people will be willing to help.

L18-21:

combination_labels = []
combination_counts = []

for lineage in lineages:

Declaring these two lists before the loop, then appending all your combination labels inside the loop means that these two lists will contain all labels and counts for all runs (all your lineages). If I am understanding your intentions correctly this is probably not something you want. The simple fix here is to just move them into the for-loop.
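To illustrate the difference, here is a minimal sketch with made-up numbers. The dictionary fake_counts and the list names are hypothetical stand-ins for your lineage files and count lists; nothing here reads your data.

```python
# Hypothetical stand-ins for two lineage files and their per-combination counts.
fake_counts = {"lineage_a.csv": [5, 3], "lineage_b.csv": [7, 1]}

shared = []           # declared once, before the loop (as in the question)
shared_lengths = []   # size of `shared` after each lineage
fresh_lengths = []    # size of `fresh` after each lineage

for lineage, counts in fake_counts.items():
    fresh = []        # re-created for every lineage
    for count in counts:
        shared.append(count)
        fresh.append(count)
    shared_lengths.append(len(shared))
    fresh_lengths.append(len(fresh))

print(shared_lengths)  # [2, 4]: the second lineage still carries the first one's entries
print(fresh_lengths)   # [2, 2]: each lineage starts from an empty list
```

With the shared list, the output written for the second lineage would also contain the rows collected for the first one.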

L29-30:

for mutation in combination:
    combination_df = combination_df[combination_df[mutation] == 1]

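Here you filter the DataFrame once per mutation and reassign combination_df at each iteration of the loop. Because each filter is applied to the result of the previous one, combination_df ends up holding every row in which all mutations of the current combination are set to '1', no matter what the other columns contain. As far as I can tell, this is why a row carrying D614G together with other mutations is still counted under 'D614G' alone: the loop never checks that the remaining mutations are '0'. I will get back to solving this in my solution below.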
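To see what this loop actually leaves in combination_df, here is a minimal sketch on made-up data. The toy table and the helper names count_chained and count_exact are assumptions for illustration only; count_exact shows one possible way to require that all other columns are 0.

```python
import pandas as pd

# Toy data (made up for illustration): three samples, three mutation columns.
toy = pd.DataFrame(
    {"D614G": [1, 1, 0], "A222V": [0, 1, 0], "L18F": [0, 0, 1]},
    index=["s1", "s2", "s3"],
)

def count_chained(df, combination):
    """Count rows the way the loop in the question does: keep rows where
    every mutation in `combination` equals 1, ignoring all other columns."""
    subset = df
    for mutation in combination:
        subset = subset[subset[mutation] == 1]
    return subset.shape[0]

def count_exact(df, combination):
    """Count rows whose set of mutations is exactly `combination`:
    listed columns equal 1 and every other column does not."""
    is_set = df.to_numpy() == 1               # shape (rows, columns)
    wanted = df.columns.isin(combination)     # shape (columns,)
    return int((is_set == wanted).all(axis=1).sum())

print(count_chained(toy, ["D614G"]))  # 2: s1 and s2 both carry D614G
print(count_exact(toy, ["D614G"]))    # 1: only s1 carries D614G and nothing else
```

The chained filter counts sample s2 under 'D614G' even though it also carries A222V, while the exact match does not.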

L37-44:

combination_labels.append('_'.join(combination))
combination_counts.append(combination_df.shape[0])

out_df = pd.DataFrame({'combination':combination_labels,
                   'count':combination_counts})
out_df['percentage'] = (out_df['count'] / df.shape[0]) * 100
out_df = out_df.sort_values('percentage', ascending = False)   

out_df.to_csv('Results_2/' + lineage.replace(".csv", "") + '_3.csv',
         header = True, 
         index = False)

Since this part is again scoped inside the for-loop, each iteration (each combination) rebuilds out_df and overwrites the same 'Results_2/' + lineage.replace(".csv", "") + '_3.csv' file. You want to move this out of the for combination in ... loop as well, so the file is written once per lineage.


A big concern is also performance. You have 2,047 combinations (every non-empty subset of 11 mutations) and around 100,000 lines in your file. As you loop over the combinations, you also process the whole DataFrame each time, when this is a problem you can solve in a single pass over the rows. It's not much data so it's still fine, but these habits are still good to develop early.

Now here's an idea for a rewrite that should work better. I don't have a csv from you to test with, so there might be (probably are) still things to fix, but hopefully it'll give you a starting point:

import itertools
import os

import pandas as pd

mutations = ['A222V', 'D614G', 'E484Q', 'E780Q', 'G476S', 'L18F', 'N439K',
             'S477', 'S477N', 'T478I', 'V483A']

# Every non-empty subset of the mutations, labelled 'A_B_C' with sorted names
combinations = []
for M in range(1, len(mutations)+1):
    for subset in itertools.combinations(mutations, M):
        combinations.append('_'.join(sorted(subset)))

root = "C:"

os.chdir(root)

lineages = os.listdir('Results')

for lineage in lineages:

    # Fresh counter for each lineage, one row per possible combination
    out_df = pd.DataFrame({"Count": 0}, index=combinations)

    df = pd.read_csv('Results/' + lineage, index_col="lineage").dropna()

    for index, row in df.iterrows():
        # Label the row by exactly the mutations it carries
        combination = '_'.join(sorted(df.columns[row == 1]))
        if combination == '':
            continue  # no mutation set in this row, so nothing to count
        out_df.loc[combination, "Count"] += 1

    out_df['Percentage'] = (out_df['Count'] / df.shape[0]) * 100
    out_df = out_df.sort_values('Percentage', ascending = False)

    # Written once per lineage, after all rows have been counted
    out_df.to_csv('Results_2/' + lineage.replace(".csv", "") + '_3.csv', header = True,
        index = True)
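If the row-by-row loop turns out to be slow, one possible variation is to let pandas group identical rows and count each group. This is only a sketch under assumptions: the function name genotype_counts, the 'none' label for rows without any mutation, and the toy data are made up for illustration, and only combinations that actually occur appear in the result.

```python
import pandas as pd

def genotype_counts(df, mutations):
    """Count each distinct 0/1 pattern across the mutation columns once."""
    # One group per observed pattern; 'count' is the number of rows in it.
    sizes = df.groupby(mutations).size().reset_index(name="count")
    flags = sizes[mutations].to_numpy() == 1
    # Label each pattern by the sorted names of the mutations that are set.
    sizes["combination"] = [
        "_".join(sorted(m for m, flag in zip(mutations, row) if flag)) or "none"
        for row in flags
    ]
    sizes["percentage"] = sizes["count"] / len(df) * 100
    out = sizes[["combination", "count", "percentage"]]
    return out.sort_values("count", ascending=False).reset_index(drop=True)

# Toy data (made up): two D614G-only samples, one double mutant, one wild type.
toy = pd.DataFrame(
    {"D614G": [1, 1, 1, 0], "A222V": [0, 0, 1, 0], "L18F": [0, 0, 0, 0]}
)
result = genotype_counts(toy, ["D614G", "A222V", "L18F"])
print(result)
```

Here every row contributes to exactly one label, so a sample with D614G and A222V is counted under 'A222V_D614G' and not under 'D614G'.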

Correct answer by Bastian Schiffthaler on March 20, 2021
