AnswerBun.com

NCBI blast for exact match of a short sequence

Bioinformatics Asked by Mike Flynn on April 4, 2021

I’m trying to Blast for exact matches to the sequence: ‘ATTGNNNNGCAAACCA’ in the human transcriptome using NCBI Blast on its ‘refseq_rna’ database. However, when I do a basic query I get "No significant similarity found."

Note the N’s in the middle of the sequence, if I get rid of those N’s, yielding the sequence ‘AGCGGATTGCAAAGCAAACCA’, I get a match in the 3′ UTR of the human MeCP2 gene (which is correct). I don’t understand why adding the N’s here make it not work.

I looked at the help section and I believe this may be due to shorter sequences being less "statistically significant". Statistical significance of output is completely irrelevant to me, so I set ‘Expect Value’ to 100000000000000. However, I still get the same result.

I also tried to modify the HTML of the page to allow me to submit a word size of 4 to the form. This went through, but I still got "No significant similarity found".

Could someone help me out with this? I feel like searching for a short sequence shouldn’t be this hard.

One Answer

UPDATE: while procrastinating, I worked up the below throwaway code into a more generally applicable (but still bare-bones) script.


ORIGINAL:

I feel like you might want to use something other than BLAST to do this.

BLAST is not really designed to solve this problem, I think it's just easy to use. Also, your N-containing sequence (16nt) is shorter than the version without Ns (21 nt), which can't be correct if you care about "exact" matches.

As a side note, the idea of exact matches is a bit tricky with Ns involved. The whole point of ambiguity codes is that things won't be exact.

And all of this is on top of the short-string problem- most alignment programs won't even return results that don't meet some minimum alignment score, which is exacerbated a bit by the Ns. For how BLAST handles ambiguity codes, you can see section 7.2 here.

Why not use regular expressions instead to search your sequence against the transcriptome?

# python
import re
from Bio import SeqIO

fwd = r'ATTG[ACGT]{4}GCAAACCA'
rev = r'TGGTTTGC[ACGT]{4}CAAT'  # reverse complement

# reference file
seqfile = SeqIO.parse("your_reference_file.fa", "fasta")  # a transcriptome reference fasta

for seq in seqfile:
    # maybe use re.match() to get coords but easier to show it with findall
    a = re.findall(fwd, str(seq.seq))
    b = re.findall(rev, str(seq.seq))
    if len(a) > 0:
        print(seq.id, a)
    if len(b) > 0:
        print(seq.id, b)

When I put this in a file and run it against a test case with ATTGAAAAGCAAACCA, it works:

$ python exact_match.py 
the_contig ['ATTGAAAAGCAAACCA']

Alternately, you could just blast the 16 non-N versions of that sequence and combine them, as you've shown.

Correct answer by Maximilian Press on April 4, 2021

Add your own answers!

Related Questions

Extract sequences from partial Header

2  Asked on August 25, 2021 by bioinfo

     

Calculating Isoelectric point from a multifasta file

1  Asked on August 22, 2021 by farinelli

   

Kraken2 > OTU format > Phyloseq

0  Asked on August 22, 2021 by reebola95

         

nanopore QC measures on fastq file

0  Asked on August 22, 2021 by arun-venugopalan

     

Removing data from OrthoFinder output prefix

1  Asked on August 22, 2021 by michael-winter

   

Ask a Question

Get help from others!

© 2023 AnswerBun.com. All rights reserved. Sites we Love: PCI Database, MenuIva, UKBizDB, Menu Kuliner, Sharing RPP, SolveDir