TransWikia.com

Biopython SeqIO check input file

Bioinformatics Asked on May 7, 2021

Hi I am trying to learnt python3 and Biopython,

I am trying to check imported fasta file before processing so far using:

sequences = SeqIO.parse("fake_fasta_file", "fasta")

if not any(sequa):
        print('n  SOMETHING IS WRONG WITH YOUR INPUT FILE n')
        exit()

works but as soon as I enter a ‘>’ in my fake_fasta_file , as an example

nihvivicòvo-l
,mnjobùJN
, Békibpç
, mnkn

nbigp
ù
mlkn

mlnù
>  ftfyfiuuogpigèi

any(sequences) returns True even if:

for seq_record in sequences:
       print(seq_record)

doesnt print anything

Any better way to check for real fasta file in Biopython ?

Or maybe it my fault using any() on the iterator instead of creating

a list of seq_record and checking on that ?

One Answer

any() loops over the iterator until the first truthy item is found or the end is reached. The iterator is not reset to the start after this. Using a list will change that behaviour since a list is not exhausted like a iterator. So, remove your any() call and you will see a sequence record printed. Your 'fake_fasta_file.fa' is actually recognised by Biopython as valid fasta:

>>> SeqIO.read('fake_fasta_file.fa', 'fasta')
SeqRecord(seq=Seq(''), id='ftfyfiuuogpigèi', name='ftfyfiuuogpigèi', description='  ftfyfiuuogpigèi', dbxrefs=[])

Biopython ignores lines before the first line beginning with '>' (as possible comment lines) and then starts collecting records, in this case an empty sequence record is still a record. This behaviour is deliberate from Biopython, although you could argue if it's best or not, it's unlikely to change for historical reasons.

What is real fasta? (philosophical pause) If you want to validate a particular definition of FASTA format then you can write your own parser depending on the exact specification and rules you want.

Edit, e.g.

from Bio import SeqIO

records = SeqIO.parse("myfile.fa", "fasta")
first = next(records, None)
if first is None:
    print("No fasta records")
elif first.seq == "":
    print("Empty sequence")
elif all(c not in "ATCGN" for c in first.seq.upper()):
    print("Non ATCGN characters")

etc.

Correct answer by Chris_Rands on May 7, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP