NCBI blast for exact match of a short sequence

Question

I'm trying to Blast for exact matches to the sequence: 'ATTGNNNNGCAAACCA' in the human transcriptome using NCBI Blast on its 'refseq_rna' database. However, when I do a basic query I get "No significant similarity found."
Note the N's in the middle of the sequence, if I get rid of those N's, yielding the sequence 'AGCGGATTGCAAAGCAAACCA', I get a match in the 3' UTR of the human MeCP2 gene (which is correct). I don't understand why adding the N's here make it not work.
I looked at the help section and I believe this may be due to shorter sequences being less "statistically significant". Statistical significance of output is completely irrelevant to me, so I set 'Expect Value' to 100000000000000. However, I still get the same result.
I also tried to modify the HTML of the page to allow me to submit a word size of 4 to the form. This went through, but I still got "No significant similarity found".
Could someone help me out with this? I feel like searching for a short sequence shouldn't be this hard.

Maximilian Press · Accepted Answer

UPDATE: while procrastinating, I worked up the below throwaway code into a more generally applicable (but still bare-bones) script.

ORIGINAL:
I feel like you might want to use something other than BLAST to do this.
BLAST is not really designed to solve this problem, I think it's just easy to use. Also, your N-containing sequence (16nt) is shorter than the version without Ns (21 nt), which can't be correct if you care about "exact" matches.
As a side note, the idea of exact matches is a bit tricky with Ns involved. The whole point of ambiguity codes is that things won't be exact.
And all of this is on top of the short-string problem- most alignment programs won't even return results that don't meet some minimum alignment score, which is exacerbated a bit by the Ns. For how BLAST handles ambiguity codes, you can see section 7.2 here.
Why not use regular expressions instead to search your sequence against the transcriptome?
# python
import re
from Bio import SeqIO

fwd = r'ATTG[ACGT]{4}GCAAACCA'
rev = r'TGGTTTGC[ACGT]{4}CAAT'  # reverse complement

# reference file
seqfile = SeqIO.parse("your_reference_file.fa", "fasta")  # a transcriptome reference fasta

for seq in seqfile:
    # maybe use re.match() to get coords but easier to show it with findall
    a = re.findall(fwd, str(seq.seq))
    b = re.findall(rev, str(seq.seq))
    if len(a) > 0:
        print(seq.id, a)
    if len(b) > 0:
        print(seq.id, b)

When I put this in a file and run it against a test case with ATTGAAAAGCAAACCA, it works:
$ python exact_match.py 
the_contig ['ATTGAAAAGCAAACCA']

Alternately, you could just blast the 16 non-N versions of that sequence and combine them, as you've shown.

NCBI blast for exact match of a short sequence

One Answer

Add your own answers!

Ask a Question