How to sort multiple FASTA files based on their content?

Bioinformatics Asked on March 31, 2021

I have around 10,000 FASTA files of Influenza A virus.

These files contains sequences of each of the 8 segments of the viral genome and I want to separate these files into different locations based on the content of these FASTA files.

In each FASTA file for each segment,the first line has the segment number, for example.

KM368312.1 Influenza A virus (A/swine/Shandong/01/2009(H1N1)) segment 3 polymerase PA (PA) and PA-X protein (PA-X) genes, complete cds

To be clear I want

  • file1.txt has segment 1
  • file2.txt has segment 2
    etc …

I want to ensure all segment 1 sequences are grouped into one folder, and each file is grouped according to its geographic origin. The geographic groupings are mirrored for all 8 segements and each placed into their own directory.

2 Answers

The general idea is:

  1. iterate over all files
    1. read the first line of each file, extract the segment bit
    2. move the file based on that

In Python, this can be done in many ways. Here’s one way:

import os
import re

def find_segment(id):
    return'segment (d)', id).group(1)

for segment in range(8):
    os.mkdir(f'segment-{segment + 1}')

for entry in os.scandir():
    if not entry.is_file():

    with open(entry.path, 'r') as file:
        segment = find_segment(next(file))
    os.rename(entry.path, f'segment-{segment}/{entry.path}')

Note the absence of error handling — this code assumes that every operation succeeds; empty files, or files without “segment X” in their FASTA ID, will trigger an error, as will the usual IO failures (missing permissions, etc.).

But most people would probably use a shell scripting language instead of Python to solve this. For example, here’s something equivalent in Bash (untested):

mkdir segment-{1..8}

for file in *.fa *.fasta; do
    segment=$(sed 's/.*segment ([[:digit:]]).*/1/; q' "$file")
    if [[ -n $segment ]]; then
    mv "$file" "segment-$segment"

Correct answer by Konrad Rudolph on March 31, 2021

One "out there" approach you could take is to create a fasta file containing your regions of interest (as separate sequences), and then map your fasta files to that reference fasta file, generating BAM/SAM output. After that, position-sort the BAM file, and you've now got a content-sorted list of sequences.

The process to do this would be something like this:

minimap2 -a regions.fasta input1.fasta input2.fasta ... | samtools sort > mapped.bam
samtools fastq mapped.bam > sorted.fasta

Answered by gringer on March 31, 2021

Add your own answers!

Related Questions

Downloading SRA Files from AWS

1  Asked on December 6, 2020


NCT search in PubMed via Entrez (python)

0  Asked on December 5, 2020 by nutarelli-federico


using snakemake shadow rules to store temp files on local nodes

1  Asked on December 2, 2020 by kamil-s-jaron


Software for taxonomic assignment?

2  Asked on December 1, 2020


Detecting Allelic Imbalance

0  Asked on November 26, 2020 by krizbomb


Mapping statistics from bam file using bbtools and sambamba

2  Asked on November 25, 2020 by bioinfonext


Preprocessing Affymetrix SNP Array Matrix

0  Asked on November 17, 2020 by thanh-nguyen


Ask a Question

Get help from others!

© 2023 All rights reserved. Sites we Love: PCI Database, MenuIva, UKBizDB, Menu Kuliner, Sharing RPP, SolveDir