Get gene sequence based on the annotation

Question

I've got the reference genome with Python like so:
sequences_by_chr = {}

with gzip.open("data/Zea_mays.B73_RefGen_v4.dna.toplevel.fa.gz", "rt") as f:
    for seq in SeqIO.parse(f, "fasta"):
        sequences_by_chr[seq.id] = seq.seq

I've also parsed a GFF3 annotation and would like to get the gene sequence based on said annotation.
Suppose the following:
chromosome = 1
strand = "-"
start = 45286331
end = 45290076

Since it's on the "minus" strand, should I do the following:
sequences_by_chr[chromosome].complement()[start:end]
or should I use the reverse_complement()?
I'm completely confused by this, would appreciate any hint!

vkkodali · Answer

You can parse the GFF3 to generate a bed-like file with gene ranges and use that with bedtools getfasta to obtain a multi-FASTA file with gene sequences. An example case is shown below:
## get fasta and gff3 files
wget ftp://ftp.ensembl.org//pub/current_gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.100.gff3.gz
wget ftp://ftp.ensembl.org:/pub/current_fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz

## unzip fasta
gunzip Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz

## generate genes.bed 
zgrep -v '^#' Saccharomyces_cerevisiae.R64-1-1.100.gff3.gz | awk 'BEGIN{FS="t";OFS="t"}($3=="gene"){print $1,$4-1,$5,"name",1000,$7}' > genes.bed

## create gene fasta
bedtools getfasta -fi Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa -bed genes.bed -s -fo genes.fa
```

StupidWolf · Answer

DNA sequence is always provided on the plus strand, if the gff tells you it is on the minus strand, you need to reverse complement it, because the start of the gene in your example will be 45290076 and the end will now be 45286331.
I can give you an example below, this is a potential coding sequence from yeast, no introns and on the minus strand:
gunzip -c test.fa.gz  
>1
TCATCCCCTTATATAAGATGGGAAACGAAGGTAAAAGATTAACGATAGCAT

If we read it in like you did, and if you apply the complement, this is not a coding sequence.
chromosome="1"
start = 1
end =51
sequences_by_chr[chromosome].complement()[(start-1):end]

Seq('AGTAGGGGAATATATTCTACCCTTTGCTTCCATTTTCTAATTGCTATCGTA', SingleLetterAlphabet())

Getting the sequence followed by reverse complement gives you a coding sequence, starting with ATG ending with TGA:
sequences_by_chr["1"][(start-1):end].reverse_complement()

Seq('ATGCTATCGTTAATCTTTTACCTTCGTTTCCCATCTTATATAAGGGGATGA', SingleLetterAlphabet())

aerijman · Answer

You should do:
if sequences_by_chrom[strand] == "-":    
    rev_comp( sequences_by_chr[chromosome][start:end] )
else:
    sequences_by_chr[chromosome][start:end]

where
def rev_comp(sequence):
    d = {'A':'T', 'T':'A','C':'G','G':'C'}
    return ''.join([d[i] for i in sequence][::-1])

Get gene sequence based on the annotation

3 Answers

Add your own answers!

Ask a Question