How to get % similarity between strains and mutation files

Question

I’m very new to Python and am having some difficulty getting the hang of some of the more complicated things.
I have multiple files that look like this:

hCoV-19/Singapore/4/2020|EPI_ISL_410535|2020-02-03

hCoV-19/USA/WA13-UW9/2020|EPI_ISL_413601|2020-03-02

hCoV-19/USA/WA-UW142/2020|EPI_ISL_416680|2020-03-11

Please note that the lines above all belong to one file.
I want to extract the EPI_ISL_000000 identifier from each line for easy comparison among files.
Could someone please advise on:

A program to extract this data into new files (there are many lines in each file, 1000+)

A program to then give a % comparison between two or more files, comparing all lines in one file against all lines in a second (or further) file

Theo Jones · Accepted Answer

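# Both names are assumed to start out holding the header lines of one file each.
# 'hCoV-19/Singapore/4/2020|EPI_ISL_410535|2020-02-03' -> '410535'
left_lineagelist = [x.split('_')[-1].split('|')[0]
                    for x in left_lineagelist]
# A set makes membership tests against the right-hand file fast.
right_lineagelist = {x.split('_')[-1].split('|')[0]
                     for x in right_lineagelist}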
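
For the first part of the question (writing the IDs out to new files), here is a minimal sketch. It swaps the chained split for a regular expression, so lines without an ID are skipped rather than passed through, and it keeps the full EPI_ISL_ prefix. The names EPI_PATTERN, extract_ids and write_ids are made up for illustration, and the pattern assumes every ID looks like EPI_ISL_ followed by digits, as in the sample lines from the question.

```python
import re
from pathlib import Path

# Assumption: every accession is "EPI_ISL_" followed by digits, as in the question.
EPI_PATTERN = re.compile(r"EPI_ISL_\d+")


def extract_ids(lines):
    """Return the EPI_ISL accessions found in lines, in order.

    Lines that contain no accession (blank lines, sequence data) are skipped.
    """
    ids = []
    for line in lines:
        match = EPI_PATTERN.search(line)
        if match:
            ids.append(match.group(0))
    return ids


def write_ids(in_path, out_path):
    """Read in_path and write one accession per line to out_path."""
    lines = Path(in_path).read_text().splitlines()
    ids = extract_ids(lines)
    Path(out_path).write_text("\n".join(ids) + "\n")
    return ids


# The three sample lines from the question, with the blank lines left in.
sample = [
    "hCoV-19/Singapore/4/2020|EPI_ISL_410535|2020-02-03",
    "",
    "hCoV-19/USA/WA13-UW9/2020|EPI_ISL_413601|2020-03-02",
    "",
    "hCoV-19/USA/WA-UW142/2020|EPI_ISL_416680|2020-03-11",
]
print(extract_ids(sample))
# ['EPI_ISL_410535', 'EPI_ISL_413601', 'EPI_ISL_416680']
```

Calling write_ids once per input file, with whatever output file names you like, gives you one ID file per original file.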

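The split expression extracts the six-digit EPI number, provided the sequences have been removed from the file beforehand so that only the header lines remain. For example:
# lines is assumed to hold the lines of the original file, e.g. a FASTA file
for line in lines:
    if line.startswith('>'):  # header lines only; also safe on empty lines
        print(line[1:].rstrip())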
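
For the second part of the question, the % comparison, one possible approach is to treat the IDs of each file as a set. This is only a sketch, and it is not the only sensible definition of similarity: percent_shared gives the share of the first file's IDs that also appear in the second (so it is not symmetric), while jaccard_percent gives the shared IDs as a share of all distinct IDs in both files. Both function names, the file names and the IDs 888888 and 999999 are placeholders invented for the example.

```python
from itertools import combinations


def percent_shared(left_ids, right_ids):
    """Percentage of the distinct IDs in left_ids that also occur in right_ids."""
    left, right = set(left_ids), set(right_ids)
    if not left:
        return 0.0  # nothing to compare; avoids dividing by zero
    return 100.0 * len(left & right) / len(left)


def jaccard_percent(left_ids, right_ids):
    """Shared IDs as a percentage of all distinct IDs across both inputs."""
    left, right = set(left_ids), set(right_ids)
    union = left | right
    if not union:
        return 0.0
    return 100.0 * len(left & right) / len(union)


file_a = ["410535", "413601", "416680"]
file_b = ["413601", "416680", "888888", "999999"]  # last two are made-up IDs

print(round(percent_shared(file_a, file_b), 2))  # 66.67 (2 of 3 IDs in file_a)
print(jaccard_percent(file_a, file_b))           # 40.0  (2 of 5 distinct IDs)

# For more than two files, compare every pair once.
files = {"a.txt": file_a, "b.txt": file_b}  # hypothetical file names
for (name_a, ids_a), (name_b, ids_b) in combinations(files.items(), 2):
    print(name_a, name_b, round(jaccard_percent(ids_a, ids_b), 2))
```

Sets drop duplicate IDs within a file, so if the same ID can appear more than once and that matters to you, count with collections.Counter instead.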
