AnswerBun.com

Finding motifs: fasta file with 10,000 sequences

Bioinformatics Asked on May 6, 2021

I am new to Python. I am trying to parse a fasta file containing 10,000 sequences to look for motifs (microsatellites in particular). I tried using Seq Utils to parse my sequences for a particular motif, that is, (TG)20. I suspect because I used the string function (str), it treated the entire fasta file as a one single fasta sequence. I then tried using BruteForce on the motif (TG)17 and found that an error appeared which said that "…the object of type ‘FastaIterator’ has no len()…". The script however worked when only a single sequence was in the fasta file.

In essence, I need some help figuring out how to:

  • Prepare an instances object(is this correct?) with microsatellite
    motifs (for starters, I intend to parse the sequences for dinucleotide repeats (TG)10,(AC)10,(AT)10
    and (AG)10;
  • Parse my 10,000 long sequence file using this object(?); and
  • Print an output specifying which of my sequences contain microsatellite motifs.

Appreciate any help I can get on this.

#Seq Utils to find Motifs

>>> pattern = Seq("TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG") 
>>> sequence = Seq(a)
>>> results = SeqUtils.nt_search(str(a),pattern)
>>> print(results)
['TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG', 38, 40, 42, 44, 46, 48, 50, 52, 54,
 56, 58, 68682, 68684, 68686, 370986, 370988, 370990, 370992, 370994, 370996, 43
8620, 438622, 438624, 438626, 438628, 438630, 438632, 438634, 438636, 438638, 43
8640, 556703, 556705, 556707, 784973, 784.... # it's too long so I am just pasting the first few lines

#BrutForce to find Motifs in a fasta file containing 10,000 sequences

>>> def BruteForce(s, t):
...     occurrences = []
...     for i in range(len(s)-len(t)+1):
...         match = True
...         for j in range(len(t)):
...             if s[i+j] != t[j]:
...                 match = False
...                 break
...         if match:
...             occurrences.append(i)
...             
...     print(occurrences)
...     
... 

>>> t = "TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG"
>>> len(t)
34
>>> s = SeqIO.parse("group23.fas", "fasta")

>>> BruteForce(s,t)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<console>", line 3, in BruteForce
TypeError: object of type 'FastaIterator' has no len()

#BrutForce to find Motifs in a fasta file containing a single sequence

>>> s = SeqIO.read("group23v.fas", "fasta")
>>> BruteForce(s,t)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 79, 98, 117, 175, 193, 198, 217, 221, 277, 279, 282, 289, 294, 304, 311, 335, 381, 437, 464, 468, 510, 514, 522, 593, 602, 622, 634, 647, 658, 678, 704, 727, 750, 773, 804, 889, 895, 908, 911, 922, 937, 969, 975, 978, 1002]

3 Answers

Bio.SeqIO.parse() returns a SeqRecord iterator. To access the sequence of this sequence record, you need to use the .seq attribute so you should update your s with s.seq.

Answered by haci on May 6, 2021

To expand on haci's answer (and since you mentioned being new to Python) you can loop over the iterator object and use the seq attribute like this:

pattern = Seq("TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG")
for record in SeqIO.parse("group23.fas", "fasta"):
    results = SeqUtils.nt_search(str(record.seq), pattern)
    print(results)

(More info on iterators here.) That will work the same whatever the number of sequences in group23.fas.

I'd be cautious about looking for microsatellites with fixed strings. If the sequences you're looking for are short and have imperfections among the repeats, you might miss them, depending on how many perfect repeats your search pattern has. For example with these stub example sequences:

>seq1
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
>seq2
ACGATGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
>seq3
TGTGTGTGTGTGTGTGTGTGTCTGTGTGTGTGTGTGTGTG

The nt_search output gives you:

['TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG', 0]
['TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG', 4]
['TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG']

That is, one match at position 0 of the first sequence, one at position 4 of the second, and no matches for the third (there's a C I put in the middle). This might not be relevant for you if you're looking microsatellites much longer than your search patterns, but for shorter ones it might be a problem.

Answered by Jesse on May 6, 2021

Here's a solution that I came up with after help from both haci and Jesse. It involves using the following modules: os, glob, SeqIO, Seq, SeqUtils.

import os, glob

os.chdir("C:python38Libsite-packagesBioSeqIO")

from Bio import SeqIO

from Bio.Seq import Seq

from Bio import SeqUtils

patttern = Seq("ACACACACACACACACACAC")

folder_path = 'C:python38Libsite-packagesBioSeqIO' #specified the folder where my sequences were stored

fasta_paths = glob.glob(os.path.join(folder_path, '*.fasta')) #glob module finds all the pathnames matching a specified pattern i.e. '*.fasta'

with open("motif4.txt", "w") as f:
... for fasta_path in fasta_paths:
...     print(fasta_paths)
...     for seq_record in SeqIO.parse(fasta_path, "fasta"):
...         results = SeqUtils.nt_search(str(seq_record.seq),pattern)
...         f.write(str(results)) #replace with seq_record.seq to save the list of node names of all sequences that were parsed
...         f.write("n") #ensures results of the parse through all the sequences in the fasta_path is written to a new line
... f.close()

The nt_search output gave the following output (similar to what Jesse showed):

1 ['ACACACACACACACACACAC']
2 ['ACACACACACACACACACAC']
3 ['ACACACACACACACACACAC']
4 ['ACACACACACACACACACAC']
5 ['ACACACACACACACACACAC']
6 ['ACACACACACACACACACAC', 16636, 16638, 16640, 16642, 16644, 16646, 16648, 16650, 16652, 16654, 16656, 16658, 16660, 16662, 16664]
7 ['ACACACACACACACACACAC']
8 ['ACACACACACACACACACAC']
9 ['ACACACACACACACACACAC']
10 ['ACACACACACACACACACAC']
11 ['ACACACACACACACACACAC', 9174]

It would miss sequences that did not match (AC)10 or those with imperfect repeat motifs. However, in cases where the motif was > (AC)10, nt_search listed down all locations where the motif was found sequentially. For example, in line 6, all locations were listed and showed the motif was in fact 36 nucleotides long in total (pos #16,636 to #16,673).

You can then import the txt. files into a databasing app and analyse the results.

Answered by ShawnC on May 6, 2021

Add your own answers!

Related Questions

Determining significance of a variable in a glm model

2  Asked on September 30, 2021 by rmm

 

Ask a Question

Get help from others!

© 2023 AnswerBun.com. All rights reserved. Sites we Love: PCI Database, MenuIva, UKBizDB, Menu Kuliner, Sharing RPP, SolveDir