TransWikia.com

Remove/delete sequences by ID from multifasta

Bioinformatics Asked by andresito on February 4, 2021

I have a fasta file like this:

>Id1
ATCCTT
>Id2
ATTTTCCC
>Id3
TTTCCCCAAAA
>Id4
CCCTTTAAA

I want to delete sequences that have the following IDs.

Id2
Id3

The IDs are in a .txt file, and the text file will be used to match and delete those sequences.

My output should be a fasta file like this

>Id1
ATCCTT
>Id4
CCCTTTAAA

But I want it with awk and/or sed and/or bash (no python or perl).
How can I do it?

6 Answers

Presume grep is OK if sed, awk are. The -A(n) and -B(n) flags give (n) lines post match. Presuming all your fasta sequence will be one line is a bit dangerous, but it works for your example. First get those IDs to be removed (in rmid.txt), then inverse grep against the initial fasta.

grep -A1 -f rmid.txt fasta.fa > rmfile.fa
grep -v -f rmfile.fa fasta.fa

The real answer is to use a script that defines a delimiter other than newline "n", then parse for IDs, so better to use one of the languages you don't want to use...

Answered by user1141 on February 4, 2021

Using the FastaToTbl and TblToFasta scripts I have posted before, you can do:

FastaToTbl file.fa | grep -vwf ids.txt | TblToFasta > file.2.fa

Answered by terdon on February 4, 2021

If you want to learn how to do things with command-line tools, you can linearize the FASTA with awk, pipe to grep to filter for items of interest named in patterns.txt, then pipe to tr to delinearize:

$ awk '{ if ((NR>1)&&($0~/^>/)) { printf("n%s", $0); } else if (NR==1) { printf("%s", $0); } else { printf("t%s", $0); } }' in.fa | grep -Ff patterns.txt - | tr "t" "n" > out.fa

This will work on multiline FASTA.

Answered by Alex Reynolds on February 4, 2021

Suppose you keep sequence names in ids.txt and sequences in seq.fa:

awk 'BEGIN{while((getline<"ids.txt")>0)l[">"$1]=1}/^>/{f=!l[$1]}f' seq.fa

Answered by user172818 on February 4, 2021

Just today I wrote a script to do exactly this using Biopython. I also add a warning If any the headers I wanted to filter was not present in the original fasta. So the python script filter_fasta_by_list_of_headers.py is :

#!/usr/bin/env python3

from Bio import SeqIO
import sys

ffile = SeqIO.parse(sys.argv[1], "fasta")
header_set = set(line.strip() for line in open(sys.argv[2]))

for seq_record in ffile:
    try:
        header_set.remove(seq_record.name)
    except KeyError:
        print(seq_record.format("fasta"))
        continue
if len(header_set) != 0:
    print(len(header_set),'of the headers from list were not identified in the input fasta file.', file=sys.stderr)

and usage :

filter_fasta_by_list_of_headers.py input.fasta list_of_scf_to_filter > filtered.fasta

P.S. it's quite easy to turn over the script to extract the sequences from the list (just the print line would have to move after line header_set.remove(seq_record.name)

Answered by Kamil S Jaron on February 4, 2021

This is what samtools faidx is intended for.

It needs to be called twice.

  1. First, to create a FASTA index (*.fai file):

    samtools faidx input.fasta
    
  2. Secondly, to filter the FASTA file with the help of the index:

    samtools faidx -o output.fasta input.fasta ids…
    

ids… are the IDs to retain, as individual arguments separated by space. If you’ve got a file blocklist.txt with IDs you want to discard (one per line), you first need to invert this, after having created the index (using Bash syntax):1

remove_ids=($(awk '{print $1}' input.fasta.fai | grep -v -f blocklist.txt))

… and then we can invoke the command as

samtools faidx -o output.fasta input.fasta "${remove_ids[@]}"

What’s nice about samtools faidx is that it also works with BGZF-compressed FASTA data — i.e. fa.gz files (but only if they were compressed using bgzip, which ships with Samtools, and the index must be generated on the compressed file). However, it will always write FASTA. If you want gzipped FASTA output, you need to manually stream the result into bgzip:


samtools faidx input.fasta.gz "${remove_ids[@]}" | bgzip -c >output.fasta.gz

1 Careful, this syntax is generally unsafe for reading a command output into a variable; we can use it here only because we assume that we want to generate a space-separated list of identifiers; the generally safe way is a lot more convoluted.

Answered by Konrad Rudolph on February 4, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP