AnswerBun.com

Biopython SeqIO check input file

Bioinformatics Asked on May 7, 2021

Hi I am trying to learnt python3 and Biopython,

I am trying to check imported fasta file before processing so far using:

sequences = SeqIO.parse("fake_fasta_file", "fasta")

if not any(sequa):
        print('n  SOMETHING IS WRONG WITH YOUR INPUT FILE n')
        exit()

works but as soon as I enter a ‘>’ in my fake_fasta_file , as an example

nihvivicòvo-l
,mnjobùJN
, Békibpç
, mnkn

nbigp
ù
mlkn

mlnù
>  ftfyfiuuogpigèi

any(sequences) returns True even if:

for seq_record in sequences:
       print(seq_record)

doesnt print anything

Any better way to check for real fasta file in Biopython ?

Or maybe it my fault using any() on the iterator instead of creating

a list of seq_record and checking on that ?

One Answer

any() loops over the iterator until the first truthy item is found or the end is reached. The iterator is not reset to the start after this. Using a list will change that behaviour since a list is not exhausted like a iterator. So, remove your any() call and you will see a sequence record printed. Your 'fake_fasta_file.fa' is actually recognised by Biopython as valid fasta:

>>> SeqIO.read('fake_fasta_file.fa', 'fasta')
SeqRecord(seq=Seq(''), id='ftfyfiuuogpigèi', name='ftfyfiuuogpigèi', description='  ftfyfiuuogpigèi', dbxrefs=[])

Biopython ignores lines before the first line beginning with '>' (as possible comment lines) and then starts collecting records, in this case an empty sequence record is still a record. This behaviour is deliberate from Biopython, although you could argue if it's best or not, it's unlikely to change for historical reasons.

What is real fasta? (philosophical pause) If you want to validate a particular definition of FASTA format then you can write your own parser depending on the exact specification and rules you want.

Edit, e.g.

from Bio import SeqIO

records = SeqIO.parse("myfile.fa", "fasta")
first = next(records, None)
if first is None:
    print("No fasta records")
elif first.seq == "":
    print("Empty sequence")
elif all(c not in "ATCGN" for c in first.seq.upper()):
    print("Non ATCGN characters")

etc.

Correct answer by Chris_Rands on May 7, 2021

Add your own answers!

Related Questions

Blastp MSA to the same length

0  Asked on July 14, 2021

   

ATAC-seq macs2 peak splitting in sliding windows

1  Asked on July 12, 2021 by user5191

   

Subset FASTA file by species name

2  Asked on July 10, 2021 by tahunami

     

Querying metadata (GDC) using a filter

1  Asked on July 10, 2021 by lab

 

Get gene sequence based on the annotation

3  Asked on July 9, 2021 by igor-filippov

       

Hg38 annotation tracks retrieval

1  Asked on July 6, 2021 by trakesh

   

Extracting WDL map keys as a task

1  Asked on July 5, 2021 by xophmeister

 

A database for RefSeq protein accession IDs

1  Asked on July 3, 2021 by ehsan-salehabadi

     

two aligners combined results

1  Asked on July 2, 2021

 

Does NCBI’s blast API block my IP?

1  Asked on July 1, 2021

   

Ask a Question

Get help from others!

© 2023 AnswerBun.com. All rights reserved. Sites we Love: PCI Database, MenuIva, UKBizDB, Menu Kuliner, Sharing RPP, SolveDir