TransWikia.com

Get gene sequence based on the annotation

Bioinformatics Asked by Igor Filippov on July 9, 2021

I’ve got the reference genome with Python like so:

sequences_by_chr = {}

with gzip.open("data/Zea_mays.B73_RefGen_v4.dna.toplevel.fa.gz", "rt") as f:
    for seq in SeqIO.parse(f, "fasta"):
        sequences_by_chr[seq.id] = seq.seq

I’ve also parsed a GFF3 annotation and would like to get the gene sequence based on said annotation.
Suppose the following:

chromosome = 1
strand = "-"
start = 45286331
end = 45290076

Since it’s on the "minus" strand, should I do the following:

sequences_by_chr[chromosome].complement()[start:end]

or should I use the reverse_complement()?

I’m completely confused by this, would appreciate any hint!

3 Answers

You can parse the GFF3 to generate a bed-like file with gene ranges and use that with bedtools getfasta to obtain a multi-FASTA file with gene sequences. An example case is shown below:

## get fasta and gff3 files
wget ftp://ftp.ensembl.org//pub/current_gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.100.gff3.gz
wget ftp://ftp.ensembl.org:/pub/current_fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz

## unzip fasta
gunzip Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz

## generate genes.bed 
zgrep -v '^#' Saccharomyces_cerevisiae.R64-1-1.100.gff3.gz | awk 'BEGIN{FS="t";OFS="t"}($3=="gene"){print $1,$4-1,$5,"name",1000,$7}' > genes.bed

## create gene fasta
bedtools getfasta -fi Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa -bed genes.bed -s -fo genes.fa
```

Answered by vkkodali on July 9, 2021

DNA sequence is always provided on the plus strand, if the gff tells you it is on the minus strand, you need to reverse complement it, because the start of the gene in your example will be 45290076 and the end will now be 45286331.

I can give you an example below, this is a potential coding sequence from yeast, no introns and on the minus strand:

gunzip -c test.fa.gz  
>1
TCATCCCCTTATATAAGATGGGAAACGAAGGTAAAAGATTAACGATAGCAT

If we read it in like you did, and if you apply the complement, this is not a coding sequence.

chromosome="1"
start = 1
end =51
sequences_by_chr[chromosome].complement()[(start-1):end]

Seq('AGTAGGGGAATATATTCTACCCTTTGCTTCCATTTTCTAATTGCTATCGTA', SingleLetterAlphabet())

Getting the sequence followed by reverse complement gives you a coding sequence, starting with ATG ending with TGA:

sequences_by_chr["1"][(start-1):end].reverse_complement()

Seq('ATGCTATCGTTAATCTTTTACCTTCGTTTCCCATCTTATATAAGGGGATGA', SingleLetterAlphabet())

Answered by StupidWolf on July 9, 2021

You should do:

if sequences_by_chrom[strand] == "-":    
    rev_comp( sequences_by_chr[chromosome][start:end] )
else:
    sequences_by_chr[chromosome][start:end]

where

def rev_comp(sequence):
    d = {'A':'T', 'T':'A','C':'G','G':'C'}
    return ''.join([d[i] for i in sequence][::-1])

Answered by aerijman on July 9, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP