TransWikia.com

How to find Genotype, from list of mutations:lineage list?

Bioinformatics Asked by Theo Jones on January 21, 2021

I’m very new to Python and I’m getting along nicely (I’d like to believe); however, I must be missing something here. I’m looking to read every file and compare the lines of each file against those of the others.

The aim of this is to see if any mutations are shared in any of the strains – to determine the genotype.

The files contain single-spaced headers, like those you would find on GISAID.

```
import itertools
import os

os.chdir("C:")  # path is truncated in the original post

# The line that opens file1 is missing from the original post.
with open('A D614G.txt', 'r') as file2:
    with open('A E780Q.txt', 'r') as file3:
        with open('A G476S.txt', 'r') as file4:
            with open('A L18F.txt', 'r') as file5:
                with open('A N439K.txt', 'r') as file6:
                    with open('A S477N.txt', 'r') as file7:
                        with open('A T478I.txt', 'r') as file8:
                            with open('A V483A.txt', 'r') as file9:

                                files = [file1, file2, file3, file4, file5,
                                         file6, file7, file8, file9]

                                it = itertools.permutations(files, len(files))

                                for x in it:
                                    print(x)

os.chdir("C:")
with open('tester.txt', 'wt') as file_out:
    for x in it:
        file_out.write()
```

I receive the following error, repeated across a very long stretch of text:

```<_io.TextIOWrapper name='A G476S.txt' mode='r' encoding='cp1252'>```
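The repeated text is not an error message. `<_io.TextIOWrapper ...>` is how Python displays an open file object, and the loop prints every ordering of the nine entries in `files` (9! = 362,880 tuples) without ever reading their contents. A minimal sketch of the comparison described in the question might look like the following, assuming each file holds one header per line. The helper names `read_headers` and `shared_headers` are hypothetical, not part of the original post.

```python
import itertools
from pathlib import Path


def read_headers(path):
    """Return the set of non-empty, stripped lines in a text file."""
    with open(path, "r") as handle:
        return {line.strip() for line in handle if line.strip()}


def shared_headers(paths):
    """Map each pair of file names to the headers present in both files."""
    headers = {Path(p).name: read_headers(p) for p in paths}
    shared = {}
    # combinations, not permutations: the order within a pair does not matter.
    for a, b in itertools.combinations(sorted(headers), 2):
        shared[(a, b)] = headers[a] & headers[b]
    return shared
```

Calling `shared_headers(['A D614G.txt', 'A E780Q.txt'])` would return a dictionary with one entry per pair of files; a non-empty set means the two mutation files list at least one sequence in common.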

One Answer

import itertools
import os

import pandas as pd

mutations = ['A222V', 'D614G', 'E484Q', 'E780Q', 'G476S', 'L18F', 'N439K',
             'S477', 'S477N', 'T478I', 'V483A']

# Every non-empty subset of the mutation list, from single mutations up to
# all of them together (2**11 - 1 = 2047 subsets for 11 mutations).
combinations = []
for M in range(1, len(mutations) + 1):
    for subset in itertools.combinations(mutations, M):
        combinations.append(sorted(subset))

root = "C:"

os.chdir(root)

lineages = os.listdir('Results')

for lineage in lineages:

    # One CSV per lineage; each mutation column is expected to hold 1 where
    # a sequence carries that mutation.
    df = pd.read_csv('Results/' + lineage).dropna()

    # Reset for every lineage so counts from one file do not leak into the next.
    combination_labels = []
    combination_counts = []

    for combination in combinations:

        # Keep only the rows that carry every mutation in this combination.
        combination_df = df
        for mutation in combination:
            combination_df = combination_df[combination_df[mutation] == 1]

        combination_labels.append('_'.join(combination))
        combination_counts.append(combination_df.shape[0])

    # Build and write the summary once per lineage, after all combinations
    # have been counted.
    out_df = pd.DataFrame({'combination': combination_labels,
                           'count': combination_counts})
    out_df['percentage'] = (out_df['count'] / df.shape[0]) * 100
    out_df = out_df.sort_values('percentage', ascending=False)

    out_df.to_csv('Results_2/' + lineage.replace(".csv", "") + '_3.csv',
                  header=True,
                  index=False)
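To see what the counting step does, here is a minimal sketch on a small in-memory table. It assumes, as the `== 1` filter above implies, that each lineage CSV has one column per mutation holding 1 where a sequence carries it. The function name `count_combinations` and the toy values are hypothetical and only for illustration.

```python
import itertools

import pandas as pd


def count_combinations(df, mutations):
    """Count rows carrying every mutation in each non-empty subset."""
    rows = []
    for size in range(1, len(mutations) + 1):
        for subset in itertools.combinations(mutations, size):
            # A row matches only if all of the subset's columns equal 1.
            mask = (df[list(subset)] == 1).all(axis=1)
            rows.append({"combination": "_".join(subset),
                         "count": int(mask.sum())})
    out = pd.DataFrame(rows)
    out["percentage"] = out["count"] / len(df) * 100
    return out.sort_values("percentage", ascending=False, ignore_index=True)


# Toy data: four sequences, two mutation columns (hypothetical values).
toy = pd.DataFrame({"D614G": [1, 1, 1, 0],
                    "L18F": [1, 0, 1, 0]})
print(count_combinations(toy, ["D614G", "L18F"]))
```

In the toy table, three of four sequences carry D614G and two carry both mutations, so the combined label `D614G_L18F` gets a count of 2. Using a single boolean mask per subset avoids building a filtered copy of the table for every mutation, which is a design choice of this sketch rather than something the original answer does.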

Correct answer by Theo Jones on January 21, 2021
