AnswerBun.com

What is the full script to extract sequences from fasta file by using Ids in text file in python 3 and pycharm?

Bioinformatics Asked by Samer Habash on March 27, 2021

I have a fasta file (gene.fasta)with sequences with names patterns:

>Hsc_gene_9627.t1
ATGGCACGCATTTTCCTCATTCTTTTATTGCTGCACAACATCTGCTGTGCAGCCGCTTCA
TTGCTCATTTTGAATGCCGTTACATTGGAGAAGGATGCTAATGATTATGCCGTTGGCGAT

and I have the Ids in text file (seq.txt) which are not the exact as in the fasta file:

HSC_gene_996
HSC_gene_9734
and some of the names came as 
HSC_gene_996|HSC_gene_9734 

How can extract the sequences?
I am not experienced in python so please use python for dummies language 🙂
Thank you in advance

One Answer

For example, the fasta are like these:

gene.fasta:

>Hsc_gene_9627.t1
ATGGCACGCATTTTCCTCATTCTTTTATTGCTGCACA
ACATCTGCTGTGCAGCCGCTTCA
>Hsc_gene_9627.t2
GGGGTTTTCCCC
>Hsc_gene_962.t1
AGTCGTCAGTCAGTAGTCGC

seq.txt:

HSC_gene_9627
HSC_gene_9734 
HSC_gene_996|HSC_gene_9734

We use a few modules, read in the fasta:

import re
from Bio import SeqIO                                                                 
records = list(SeqIO.parse("gene.fasta","fasta"))

Read in the list:

seq = [i.rstrip().lower() for i in open('seq.txt').readlines()]

Here I use two functions, one to write fasta files, and the other to format the header for comparison:

def fasta_out(rec):
    return(">"+str(rec.description)+"n"+str(rec.seq))

def format(rec):
    return re.sub("[.][^ ]*$","",rec.description).lower()

Define the records to keep:

keep = [fasta_out(i) for i in records if format(i) in seq]

Write them out:

f=open('matched.fasta','w')
f.writelines("n".join(keep))
f.close()

We can check:

open('matched.fasta').readlines()                                                          
Out[50]: 
['>Hsc_gene_9627.t1n',
 'ATGGCACGCATTTTCCTCATTCTTTTATTGCTGCACAACATCTGCTGTGCAGCCGCTTCAn',
 '>Hsc_gene_9627.t2n',
 'GGGGTTTTCCCC']

Most likely there are some simple while, for loops that does not require modules..

Answered by StupidWolf on March 27, 2021

Add your own answers!

Related Questions

Genewise: Failed to call blastwise.pl

0  Asked on January 28, 2021 by yp-chen

 

ATAC seq : Plotting insert sizes

1  Asked on January 26, 2021

 

Find CRISPR PAM-sites with python

1  Asked on January 21, 2021 by heibai

     

Microarray Meta-Analysis and Cross-Platform

1  Asked on January 20, 2021 by morteza-hadizadeh

 

Ask a Question

Get help from others!

© 2023 AnswerBun.com. All rights reserved. Sites we Love: PCI Database, MenuIva, UKBizDB, Menu Kuliner, Sharing RPP, SolveDir