AnswerBun.com

Split fasta file based on groups in header information and output as separate files

Bioinformatics Asked on September 4, 2021

I have a fasta file containing the sequence of a gene across different species. In total there are around 900 samples and 12 species. (Each sequence spans multiple lines and is longer than 100 bp.)

My fasta file looks like:

>Species-1-samplenameA
CTATCCTTAAACGCATATCTCGCACAGTAACTCCCCAATATGTGAGCATCTGATGTTGCCCGGGCCGAGTTAGTCTTGTGCTCACGGAACTTATTGTATG
>Species-2-samplenameB
AGTAGTGATTTGAAAGAGTTGTCAGTTAGCTCGTTCAGGTAATGGTTCCTCACACTACGTCAAAATAAGAGAGCGGTCGTGACATTATCCGTGATTTTCT
>Species-3-samplenameC
CACTACTATCAGTACTCACGACTCGATTCTGCCGCAGCCACGTATCGCCAGAAAGCCAGTCAGCATTAAGGAGTGCTCTGGGCAGGACAACTCGCATAGT
>Species-3-samplenameD
GAGAGTTACATGTTCGTTGGGCTCTTCCGACACGAACCTCAGTTGGCCTACATCCTACCTGAGGTCTGTGCCCCGGTGGTGAGAAGTGCGCATTTCGTTC

I want to split this file into one fasta file per species.
I think it's possible to do this with awk, but I'm stuck. Does anyone have a script/code that might help me?

Thanks a lot.

One Answer

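A simple Biopython solution: iterate over the sequence records, identify the species from the record ID, open the matching output file in append mode so that earlier records for that species are not overwritten, and write the record. Because the files are opened in append mode, remove any output from a previous run before running it again.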

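from Bio import SeqIO

for record in SeqIO.parse("myfile.fa", "fasta"):
    # Headers look like "Species-3-samplenameD"; this assumes the species
    # label is the first two hyphen-separated fields ("Species-3").
    species = "-".join(record.id.split("-")[:2])
    # Append mode: each record is added to the end of its species file.
    with open(f"{species}.fa", "a") as f:
        SeqIO.write(record, f, "fasta")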

Correct answer by Chris_Rands on September 4, 2021
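If Biopython is not available, the same idea can be sketched with the standard library only. This is a rough variant, not the method from the answer above: it groups the lines by species in memory and writes each file once, so a rerun overwrites the output instead of appending to it. The names species_key and split_fasta_by_species are made up for illustration, and the code assumes headers of the form Species-N-samplename as in the question, with the species label being the first two hyphen-separated fields.

```python
from collections import defaultdict
from pathlib import Path


def species_key(header):
    """Return the species part of a header such as '>Species-3-samplenameD'.

    Assumption: the species label is the first two hyphen-separated fields.
    """
    fields = header.lstrip(">").split()
    name = fields[0] if fields else ""
    return "-".join(name.split("-")[:2])


def split_fasta_by_species(fasta_path, out_dir="."):
    """Group FASTA lines by species, then write one file per species."""
    groups = defaultdict(list)
    current = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                current = species_key(line)
            # Skip blank lines and anything before the first header.
            if current is not None and line.strip():
                groups[current].append(line if line.endswith("\n") else line + "\n")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for species, lines in groups.items():
        target = out_dir / f"{species}.fa"
        target.write_text("".join(lines))  # overwrites the file on a rerun
        written[species] = target
    return written


if __name__ == "__main__":
    input_path = Path("myfile.fa")  # same input name as in the answer above
    if input_path.exists():
        for species, path in split_fasta_by_species(input_path, "by_species").items():
            print(species, "->", path)
```

Multi-line sequences are copied through unchanged, since every non-header line is attached to the most recent header. With about 900 records, holding everything in memory should not be a problem; for much larger files the append-mode approach above uses less memory.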
