Comparing aligned amino acids to codon

Bioinformatics Asked on August 27, 2021

I have a set of 4 amino acids which i have aligned,but i want to compare then with respective nucleotide sequence in terms of change codon level.

I have the alignment file where i want to take only for the C-terminal domain region region only the CTD starts with amino acids “NITNLC” till it ends with “HAPATV” and rest of the sequence i dont want to take.

In short i want to take the part sequence which starts from "NITNLC" to "HAPATV" then i would like to compare what are the changes in the nucleotide sequence at codon level.

  1. To start with I have used the seqinr library to generate nucleotide alignment from aligned protein sequence.

    reverse.align {seqinr}

  2. Then the next step that i want to compare each of my amino acid sequence to my reference sequence which in my case is P0DTC2to rest of the sequence for example I want to compare P0DTC2 to K9N5Q8to the above mentioned CTD region which starts from NITNLC and ends with HAPATVand find what are the codon level changes if there is any I would like to report both in amino acid level and codon level.

First part I guess my approach is correct. Second part I’m not sure how to proceed i guess it’s much more than simple parsing I guess!!.

Any help or suggestion would be really appreciated ,and if there exist a R based solution that would be really welcome.

Amino acid file nucleotide file Aligned amino acid file Reverse alignment nucleotide file using seqinr

One Answer

To set the ball rolling, I tried to implement it manually with your files. Overall, there are three steps:

  1. Locate NITNLC (or HAPATV) in protein P0DTC2.
  2. Locate mismatched amino acids between proteins P0DTC2 and K9N5Q8 within the range from step 1.
  3. Print the amino acid and the DNA codon of both proteins from step 2.

It works, but only for the first 60 amino acids. I wonder if it's due to the amino sequences in AMINOOO_seq_removed.fasta repeat itself every 60 acids. But why?

#For example, the first three lines of protein P0DTC2

Step 0: Read the files.


#align file containing protein sequences
count_added <- read.alignment('count_added_.clustal_num', format='clustal')
names(count_added$seq) <- count_added$nam

#DNA sequences
rev3.aln <- read.alignment('rev3.aln', format='fasta')
names(rev3.aln$seq) <- rev3.aln$nam

Step 1: Locate NITNLC (or HAPATV) in protein P0DTC2. Again, there are two repeats of NITNLC 60 acids apart (at 807 and 870).

locate <- function(seq, find)
  {address <- gregexpr(paste(strsplit(find, '')[[1]], collapse='[^a-z]*'), seq)
  #substr(seq, address[[1]][1], address[[1]][1]+attr(address[[1]], 'match.length')[1]-1)

             end=as.numeric(address[[1]] + attr(address[[1]], 'match.length') - 1)))

locate(seq=count_added$seq[['P0DTC2']], find='nitnlc')
#807 870
#812 875
locate(seq=count_added$seq[['P0DTC2']], find='hapatv')
#1212 1279
#1217 1285

Step 2: Locate mismatched amino acids between proteins P0DTC2 and K9N5Q8 within the range from step 1. Instead of using the range 812 - 1279, I chose range 1 - 20 for demonstration purposes.

compare <- function(seq1, seq2, after=0, before=100000)
  {seq1_ = strsplit(seq1, '')[[1]]
  seq2_ = strsplit(seq2, '')[[1]]
  ind = which(seq1_ != seq2_ & grepl('[a-z]',seq1_) & grepl('[a-z]',seq2_))
  ind = ind[ind>after & ind<before]

compare(seq1=count_added$seq[['P0DTC2']], seq2=count_added$seq[['K9N5Q8']], after=1, before=20)
# [1] 5  7  8  9 10 13 14

#protein comparison
#P0DTC2      ----MFVFLVLLPL------
#<mismatch>      5 7890  34

Step 3: Print the amino acid and the DNA codon of both proteins from step 2. Note that ind is based on the align file.

print_amino_codon <- function(ind, seq, seq_gene)
  {locate_amino <- gregexpr('[a-z]', seq)[[1]]
  if (!ind %in% locate_amino) return(NA)
  ind2 = match(ind, locate_amino)

  return(c(amino=substr(seq, ind, ind), codon=substr(seq_gene, ind2*3-2, ind2*3)))
codon(ind=5, seq=count_added$seq[['K9N5Q8']], seq_gene=rev3.aln$seq[['K9N5Q8']])
#amino codon 
#  "v" "atg" 
codon(ind=6, vseq=count_added$seq[['K9N5Q8']], seq_gene=rev3.aln$seq[['K9N5Q8']])
#amino codon 
#  "f" "ttc" 
codon(ind=7, seq=count_added$seq[['K9N5Q8']], seq_gene=rev3.aln$seq[['K9N5Q8']])
#amino codon 
#  "l" "ttg"

codon(ind=5, seq=count_added$seq[['P0DTC2']], seq_gene=rev3.aln$seq[['P0DTC2']])
#amino codon 
#  "m" "atg" 
codon(ind=6, seq=count_added$seq[['P0DTC2']], seq_gene=rev3.aln$seq[['P0DTC2']])
#amino codon 
#  "f" "ttt" 
codon(ind=7, seq=count_added$seq[['P0DTC2']], seq_gene=rev3.aln$seq[['P0DTC2']])
#amino codon 
#  "v" "gtt" 

#protein comparison
#P0DTC2      ----MFVFLVLLPL------
#<mismatch>      5 7890  34
#<print>         ^^^

#K9N5Q8 gene codon
#gtg ttt cta ctg atg ttc ttg tta aca
#                ^^^ ^^^ ^^^
#P0DTC2 gene codon
#atg ttt gtt ttt ctt
#^^^ ^^^ ^^^ 

Correct answer by Ryan SY Kwan on August 27, 2021

Add your own answers!

Related Questions

NCBI blast for exact match of a short sequence

1  Asked on April 4, 2021 by mike-flynn


Hybrid approach in metabolomics (NMR and MS)

0  Asked on April 2, 2021 by garrus990


no result from heat map WGCNA

2  Asked on April 2, 2021


STAR aligner multiple fastq files

0  Asked on March 30, 2021 by mobasher-barsi


extract chimeric and multimap reads from bam file

1  Asked on March 26, 2021 by sbdk8219


Ask a Question

Get help from others!

© 2023 All rights reserved. Sites we Love: PCI Database, MenuIva, UKBizDB, Menu Kuliner, Sharing RPP, SolveDir