Subset FASTA file by species name

Question

I have a problem: I've managed to download a massive fasta file of 1500 sequences, but now I want to split them into separate fasta files based on the genus.
EDIT
The fasta file looks like this:
terminase_large.fasta
>YP_009300697.1 terminase large subunit [Arthrobacter phage Mudcat]
MGLSNTATPLYYGQF...

>YP_009208724.1 hypothetical protein ADP65_00072 [Achromobacter phage phiAxp-3]
MSNVLLKQELDEWL...

>YP_009148449.1 large terminase subunit [Delftia phage RG-2014]
MSEPRKLVKKTLD...
...

I would like to end up with something like this:
mycobacterium_terminase.fasta
>Mycobacterium phage JAMaL
MVRKKPPPELE...

>Mycobacterium phage Bruin
MEVCGYTLDDI...

>Mycobacterium phage Zaka
MSLDNHLPELA...

salmonella_terminase.fasta
>Salmonella virus SETP7
MNVDITATEPQ...

>Salmonella virus SE2
MEGGSDSLIAM...
and so on...

So that the label on the alignment would have the name of the phage and not the protein's ID.
I've managed to extract the genus names of my organism with this:
outfile = open('species.txt', 'w')
with open('terminase_large.fasta') as fd:
    for line in fd:
        if line.startswith('>'):
            if '[' in line:
                name=line.split('[')[-1]
                name=name.split(' ', 1)[0]
                outfile.write(name[:] + "n")
outfile.close()

And got to extract only the unique names with this:
lines_seen = set()
outfile = open('species2.txt', "w")
for line in open("species.txt", "r"):
    if line not in lines_seen:  # not a duplicate
        outfile.write(line)
        lines_seen.add(line)
outfile.close()

(Can I merge those two scripts together?)
Now, my genus names look like this:
Arthrobacter
Achromobacter
Delftia
....

I tried automating my script to get the Entrez data, but it gives me the 'Supplied id parameter is empty' message
My code looks like this:
from Bio import Entrez
Entrez.email = "example@example.org"

for line in open("species2.txt", "r"):
    searchterm = "(terminase large subunit AND viruses[Organism]) AND" +line+ "AND refseq[Filter]"
    searchResultHandle = Entrez.esearch(db="protein", term=searchterm, retmax=1000)
    searchResult = Entrez.read(searchResultHandle)
    ids = searchResult["IdList"]

handle = Entrez.efetch(db="protein", id=ids, rettype="fasta", retmode="text")
    record = handle.read()

out_handle = open('terminase_large_'+str(line[:-1])+'.fasta', 'w')
    out_handle.write(record.rstrip('n'))

Can someone help me with it?

terdon · Accepted Answer

Splitting into multiple files and changing the IDs can be easily done:

perl -pe 'if(/>/){/[(.*?)]s*$/; $_="> $1n"}' file.fa | 
    awk '(/^>/){name=$2} {print >> name".fa"}'

That assumes all your FASTA headers have [foo bar baz] as the last element of a line. It will create a file called foo.fa (the bacterium's name) with all sequences saved there.

Explanation

perl -pe : run the script given by -e on each line of the input file, and print the resulting line. 
if(/>/) : if this line starts with a >. 
/[(.*?)]s*$/ : match an opening bracket ([), then capture (that's what the parentheses do, they capture a pattern so we can refer to it as $1) everything until the first ] (.*?])
$_="> $1n" : the $_ special variable in Perl is (in this case) the current line. So, $_=foo means "make the current line read foo. Since the -p prints each input line, changing the value of $_ means the changed value will be printed. So here, we are printing >, whatever was in the square brackets ($1) and a newline character.

The output of the perl command alone on your example input file is:

$ perl -pe 'if(/>/){/[(.*)]s*$/; $_="> $1n"}' file.fa
> Arthrobacter phage Mudcat
MGLSNTATPLYYGQF...

> Achromobacter phage phiAxp-3
MSNVLLKQELDEWL...

> Delftia phage RG-2014
MSEPRKLVKKTLD...

So, we now pass it through awk which does:

(/^>/){a=$2} : if this line starts with an >, save the second field (the bacterial species) as the variable name. 
{print >> name".fa"} : print each line into a file whose name is the current value of the variable name with a .fa. extension.

If you prefer python scripts to the one-liner approach, you can do the same thing with:

#!/usr/bin/env python
import re
outFile = None
with open("file.fa", "r") as inFile, open(species + ".fa", 'a') as outFile:
    for line in inFile:
        line = line.rstrip()
        if line.startswith('>'):
            regex = re.compile('.*[((.+?)s+.*?)].*')
            matches = regex.search(line)
            species = matches[2]
            outFile.write('>%sn' % matches[1])
        else:
            outFile.write("%sn" % line)

As for your script, you've got the right idea, but have a small bug. You forgot to remove the n from your input file, so it looks for Arthrobactern instead of Arthrobacter. The golden rule of debugging is "print all the things". If you add print("Searchterm: ",searchterm)  you will see:

Searchterm:  (terminase large subunit AND viruses[Organism]) ANDArthrobacter  
AND refseq[Filter]
Searchterm:  (terminase large subunit AND viruses[Organism]) ANDAchromobacter  
AND refseq[Filter]
Searchterm:  (terminase large subunit AND viruses[Organism]) ANDDelftia  
AND refseq[Filter]

So, you need to remove the newline characters and add a space like so (I also made it a bit more "pythonic" and conforming to the Python syntax guidelines):

#!/usr/bin/env python
from Bio import Entrez
Entrez.email = "example@example.org"

with open("species2.txt", "r") as in_handle:
    for line in in_handle:
        line = line.rstrip()
        searchterm = ("(terminase large subunit AND viruses[Organism]) " +
                      "AND %s AND refseq[Filter]" % line)
        print("Searchterm: ", searchterm)
        searchResultHandle = Entrez.esearch(db="protein",
                                            term=searchterm, retmax=1000)
        searchResult = Entrez.read(searchResultHandle)
        ids = searchResult["IdList"]

handle = Entrez.efetch(db="protein", id=ids,
                               rettype="fasta", retmode="text")
        record = handle.read()

with open('terminase_large_' + str(line[:-1]) + '.fasta', 'w') as out_handle:
            out_handle.write(record.rstrip('n'))

bli · Answer

A (bio)awk-based solution

Generate the list of genus

bioawk -c fastx '{print $comment}' terminase_large.fasta 
    | sed -r 's/.*[(w+).*/1/' 
    | sort -u > genus_names.txt

Here is how it works:

bioawk is like awk but with extra parsing capabilities. -c fastx parses the fields in fasta or fastq records. {print $comment} will print the $comment field which contain the genus names you want to extract.
The sed command captures (with parentheses) the word (w+) that is just after the opening square bracket (which should be the genus name, if there is only one opening square bracket on the line), and substitutes the whole line (that is, everything before the bracket (.*), the bracket ([), the word, and everything after the word (.* again)) with just the captured word (1).
All this is sorted, and the -u option of sort ensures there is only one occurrence of each genus name in the output, which we put in the genus_names.txt file.

Extract the fasta records for each genus

for genus in $(cat genus_names.txt)
do
    bioawk -c fastx '{print}' terminase_large.fasta 
        | grep ${genus} 
        | awk -F "t" '{print ">"$4"n"$2}' 
        | sed -r 's/^>[^[]+[(.*)]/>1/' > ${genus}.fasta
done

Here is how it works:

We loop on the content of the previously established list of genus names.
For each value of ${genus}, we parse again the fasta file with bioawk. This time, for a given fasta record, we print all the elements that have been parsed. They appear on one line, in the following order (according to bioawk -c help): 1:name 2:seq 3:qual 4:comment (we actually only need seq and comment for the following steps, but there are less risks of making errors at the later awk step if we keep the four fields).
We use grep to select the resulting lines that contain the genus name we're currently dealing with.
The lines that have been successfully selected are then parsed by awk and reformatted into fasta format using only the comment part in the header. The -F "t" is to indicate that the field delimiters are tabulations (which is how bioawk has written the records when we did {print}).
We use sed to keep only the part inside the square brackets (again using a pattern capture between parentheses, that we use in the substitution part with 1), and the output is written in a file named using the current value of ${genus}.

More details on the last sed command:

We match the whole header lines. They start with ">" (^>), then there are some non-"[" characters ([^[]+), then a "[" ([), then something we capture ((.*)), then a "]" (]). We replace them with the captured part. The sequence lines are unaffected because they don't match the above regular expression.

Subset FASTA file by species name

2 Answers

Explanation

A (bio)awk-based solution

Generate the list of genus

Extract the fasta records for each genus

Add your own answers!

Ask a Question