# Why does the SARS-CoV-2 coronavirus genome end in aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (33 a's)?

Bioinformatics Asked on August 27, 2021

The SARS-CoV-2 coronavirus genome has been released and is now available on GenBank. Looking at it…

        1 attaaaggtt tataccttcc caggtaacaa accaaccaac tttcgatctc ttgtagatct
       61 gttctctaaa cgaactttaa aatctgtgtg gctgtcactc ggctgcatgc ttagtgcact
      121 cacgcagtat aattaataac taattactgt cgttgacagg acacgagtaa ctcgtctatc
    ...
    29761 acagtgaaca atgctaggga gagctgccta tatggaagag ccctaatgtg taaaattaat
    29821 tttagtagtg ctatccccat gtgattttaa tagcttctta ggagaatgac aaaaaaaaaa
    29881 aaaaaaaaaa aaaaaaaaaa aaa


Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome, Genbank
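
The count in the title can be checked directly from the listing. Below is a minimal Python sketch, assuming the last two lines of the GenBank sequence listing above are pasted in as a string. The helper names `origin_to_sequence` and `trailing_run` are made up for illustration and are not part of any library.

```python
import re

# Last two lines of the GenBank sequence listing shown above.
ORIGIN_TAIL = """
29821 tttagtagtg ctatccccat gtgattttaa tagcttctta ggagaatgac aaaaaaaaaa
29881 aaaaaaaaaa aaaaaaaaaa aaa
"""

def origin_to_sequence(origin_text: str) -> str:
    """Strip position numbers and whitespace from GenBank sequence lines."""
    return re.sub(r"[^a-z]", "", origin_text.lower())

def trailing_run(seq: str, base: str = "a") -> int:
    """Length of the run of `base` at the 3' end of `seq`."""
    return len(seq) - len(seq.rstrip(base))

seq = origin_to_sequence(ORIGIN_TAIL)
print(trailing_run(seq))  # 33
```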

Geez, that’s a lot of a nucleotides; I don’t think that’s just random. I would guess that it’s either an artifact of the sequencing process or there is some underlying biological reason.
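
As a rough sanity check on the "not random" intuition, here is a back-of-the-envelope sketch in Python. It assumes, unrealistically, that every base is drawn independently with probability 1/4. Real genomes are not uniform, so treat the result only as an order-of-magnitude guide.

```python
GENOME_LENGTH = 29903  # last position in the listing above (29881 + 23 - 1)
RUN_LENGTH = 33

# Probability that one specific window of 33 bases is all 'a'
# under a uniform, independent base model (an assumption, not a measurement).
p_window = 0.25 ** RUN_LENGTH

# Expected number of all-'a' windows anywhere in a genome of this length.
n_windows = GENOME_LENGTH - RUN_LENGTH + 1
expected_runs = n_windows * p_window

print(f"P(one window is all a) = {p_window:.2e}")
print(f"Expected all-a windows  = {expected_runs:.2e}")
```

Under that simplified model the probability for a single window is on the order of 10^-20, so a run this long is very unlikely to arise by chance alone.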

Question: Why does the SARS-CoV-2 coronavirus genome end in 33 a’s?

Good observation! The 3' poly(A) tail is actually a very common feature of positive-strand RNA viruses, including coronaviruses and picornaviruses.

For coronaviruses in particular, we know that the poly(A) tail is required for replication, functioning in conjunction with the 3' untranslated region (UTR) as a cis-acting signal for negative strand synthesis and attachment to the ribosome during translation. Mutants lacking the poly(A) tail are severely compromised in replication. Jeannie Spagnolo and Brenda Hogue report:

The 3′ poly (A) tail plays an important, but as yet undefined role in Coronavirus genome replication. To further examine the requirement for the Coronavirus poly(A) tail, we created truncated poly(A) mutant defective interfering (DI) RNAs and observed the effects on replication. Bovine Coronavirus (BCV) and mouse hepatitis Coronavirus A59 (MHV-A59) DI RNAs with tails of 5 or 10 A residues were replicated, albeit at delayed kinetics as compared to DI RNAs with wild type tail lengths (>50 A residues). A BCV DI RNA lacking a poly(A) tail was unable to replicate; however, a MHV DI lacking a tail did replicate following multiple virus passages. Poly(A) tail extension/repair was concurrent with robust replication of the tail mutants. Binding of the host factor poly(A)- binding protein (PABP) appeared to correlate with the ability of DI RNAs to be replicated. Poly(A) tail mutants that were compromised for replication, or that were unable to replicate at all exhibited less in vitro PABP interaction. The data support the importance of the poly(A) tail in Coronavirus replication and further delineate the minimal requirements for viral genome propagation.

Spagnolo J.F., Hogue B.G. (2001) Requirement of the Poly(A) Tail in Coronavirus Genome Replication. In: Lavi E., Weiss S.R., Hingley S.T. (eds) The Nidoviruses. Advances in Experimental Medicine and Biology, vol 494. Springer, Boston, MA

Yu-Hui Peng et al. also report that the length of the poly(A) tail is regulated during infection:

Similar to eukaryotic mRNA, the positive-strand coronavirus genome of ~30 kilobases is 5’-capped and 3’-polyadenylated. It has been demonstrated that the length of the coronaviral poly(A) tail is not static but regulated during infection; however, little is known regarding the factors involved in coronaviral polyadenylation and its regulation. Here, we show that during infection, the level of coronavirus poly(A) tail lengthening depends on the initial length upon infection and that the minimum length to initiate lengthening may lie between 5 and 9 nucleotides. By mutagenesis analysis, it was found that (i) the hexamer AGUAAA and poly(A) tail are two important elements responsible for synthesis of the coronavirus poly(A) tail and may function in concert to accomplish polyadenylation and (ii) the function of the hexamer AGUAAA in coronaviral polyadenylation is position dependent. Based on these findings, we propose a process for how the coronaviral poly(A) tail is synthesized and undergoes variation. Our results provide the first genetic evidence to gain insight into coronaviral polyadenylation.

Peng Y-H, Lin C-H, Lin C-N, Lo C-Y, Tsai T-L, Wu H-Y (2016) Characterization of the Role of Hexamer AGUAAA and Poly(A) Tail in Coronavirus Polyadenylation. PLoS ONE 11(10): e0165077

This builds upon prior work by Hung-Yi Wu et al., which showed that the coronaviral 3' poly(A) tail is approximately 65 nucleotides long in both genomic RNA and subgenomic mRNAs (sgmRNAs) at peak viral RNA synthesis, and which also observed that the precise length varies over the course of infection. Most interestingly, they report:

Functional analyses of poly(A) tail length on specific viral RNA species, furthermore, revealed that translation, in vivo, of RNAs with the longer poly(A) tail was enhanced over those with the shorter poly(A). Although the mechanisms by which the tail lengths vary is unknown, experimental results together suggest that the length of the poly(A) and poly(U) tails is regulated. One potential function of regulated poly(A) tail length might be that for the coronavirus genome a longer poly(A) favors translation. The regulation of coronavirus translation by poly(A) tail length resembles that during embryonal development suggesting there may be mechanistic parallels.

Wu HY, Ke TY, Liao WY, Chang NY. Regulation of coronaviral poly(A) tail length during infection. PLoS One. 2013;8(7):e70548. Published 2013 Jul 29. doi:10.1371/journal.pone.0070548

It's also worth pointing out that poly(A) tails at the 3' end of RNA are not unique to viruses. Eukaryotic mRNA almost always carries a poly(A) tail, which is added post-transcriptionally in a process known as polyadenylation. It should therefore not be surprising that positive-strand RNA viruses, whose genomes must function as mRNA in the host cell, have poly(A) tails as well. In eukaryotic mRNA, the central sequence motif for identifying a polyadenylation region is AAUAAA, identified way back in the 1970s, with more recent research confirming its ubiquity. Proudfoot 2011 is a nice review article on poly(A) signals in eukaryotic mRNA.
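
If you want to look for such hexamers yourself, a motif search measured relative to the tail is easy to sketch. The Python below is illustrative only: the function name `motif_offsets_from_tail` and the toy sequence are made up. GenBank records use DNA letters, so AGUAAA would be searched as `agtaaa` and AAUAAA as `aataaa`.

```python
def motif_offsets_from_tail(seq: str, motif: str) -> list[int]:
    """For each occurrence of `motif`, return the number of bases between
    the end of the motif and the start of the 3' poly(A) run.
    Occurrences that overlap the run itself are skipped."""
    seq, motif = seq.lower(), motif.lower()
    tail_start = len(seq.rstrip("a"))  # index where the terminal A-run begins
    offsets = []
    start = seq.find(motif)
    while start != -1:
        gap = tail_start - (start + len(motif))
        if gap >= 0:
            offsets.append(gap)
        start = seq.find(motif, start + 1)
    return offsets

# Toy sequence (invented): hexamer, 5 spacer bases, then a 10-base tail.
toy = "gg" + "agtaaa" + "ccgtc" + "a" * 10
print(motif_offsets_from_tail(toy, "AGTAAA"))  # [5]
```

Measuring the distance from the tail rather than from the 5' end reflects the point quoted above that the hexamer's function is position dependent.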

Correct answer by Cody Gray on August 27, 2021

Not an expert, but some searching on eukaryotic positive-strand RNA viruses seems to show that polyadenylation is not uncommon. For example, Steil, et al., 2010.

Answered by merv on August 27, 2021

This question is quite general, so I'm going to attempt to tie it back to bioinformatics.

Background: The tree for the current coronavirus is here, showing that it is closely related to bat coronaviruses and, in particular, to SARS.

Question: The bioinformatics question for the current coronavirus is why this virus appears to be able to infect humans and transmit between humans.

Genome size: Firstly, you said that 30 kb was large. This is a standard size for a coronavirus genome, although the family Coronaviridae is unusual in having among the largest genomes of any single-stranded RNA virus; flaviviruses, for example, are about 10 kb. All coronaviruses are approximately 30 kb. Some coronaviruses don't infect humans (zero symptoms), some cause very mild symptoms, and others are MERS and SARS, with 40-60% and 10% mortality rates, respectively. So genome size is of little bioinformatics interest, in my opinion.

Polyadenylation: Polyadenylation and capping (5' methylation) enable the RNA to be trafficked and translated by ribosomes, and the mechanism is widely used by viruses. Methylation would also prevent the innate immune response from shredding the vRNA. Koonin and Moss (2010) interpreted a given capping mechanism as being common to the Mononegavirales, a viral order that includes measles, mumps and Ebolavirus. It's a big statement, but regardless, poly-A and capping simply mimic the host mRNA, which is something a lot of viruses do. Poly-A and capping per se are not really interesting.

Evolution and SARS: A more detailed examination of the evolution of 2019-nCoV and its epidemiology in relation to SARS can be found here.

Conclusion: The bioinformatics questions are these. Is the genome size weird? No, it's standard for a coronavirus. Is the poly-A weird? No, it's generic amongst lots of viruses, as is capping. Is the length of the poly-A (33 As) excessive? It looks odd, but a human geneticist/bioinformatician needs to answer that. So is it (potentially) linked with its epidemiology or clinical symptoms?

I don't think 33 poly-As are linked with anything bioinformatically interesting. This is because the number will likely vary dramatically between genomes (not simply between epidemic and non-epidemic strains). I don't know the mechanism for polyadenylation, but I think slippage is a likely mutation resulting in large variations between individual genomes, particularly for poly-A, which is notorious for slippage.

So ultimately could poly-As be linked with the ability of the new coronavirus to infect/transmit, and could we therefore explore that bioinformatically? I personally think slippage mutations would prevent a clonal lineage emerging, i.e. that the size of the poly-As is not stable between genomes, but that would assume a given mechanism of polyadenylation. Thus, as a bioinformatics question I wouldn't pursue it, because I don't think there is sufficient biological rationale. I agree weird stuff should be questioned and that bit of the genome jumps out, but I doubt it would go anywhere.

Slippage: The definition of a slippage mutation is here, but basically it means that this genome has 33 As in its poly-A run, whereas another isolate from the same epidemic could have, say, 30 (just an example), another might have 25, and so on.
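
To see what that would look like in practice, one could tabulate the 3' A-run length across isolates. This is a minimal Python sketch: the isolate names and sequences are invented toy data (they reuse the 30, 25 and 33 from the example above), not real genomes, and `tail_length` is a hypothetical helper.

```python
from collections import Counter

def tail_length(seq: str) -> int:
    """Number of consecutive 'a' bases at the 3' end of `seq`."""
    seq = seq.lower()
    return len(seq) - len(seq.rstrip("a"))

# Toy data, invented for illustration: same 3' context, different tail lengths.
isolates = {
    "isolate_1": "ggagaatgac" + "a" * 33,
    "isolate_2": "ggagaatgac" + "a" * 30,
    "isolate_3": "ggagaatgac" + "a" * 25,
    "isolate_4": "ggagaatgac" + "a" * 33,
}

lengths = {name: tail_length(seq) for name, seq in isolates.items()}
length_counts = Counter(lengths.values())

for name, n in lengths.items():
    print(f"{name}: {n} As")
print(length_counts.most_common())
```

With real data, the reported tail length may also reflect how each genome was sequenced and assembled, so differences between records would need careful interpretation.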

Just my 2 cents

Answered by M__ on August 27, 2021

Some of the other answers here seem quite good; at the same time I think the core answer to the OP's question is maybe a bit hard to tease out of them, so I'd like to try to state it more plainly. It's worth noting that a truly complete answer to this question seems to be beyond current research, but any kind of "Why?" is inevitably a hard or even impossible sort of question to answer fully in biology. We have some ideas about it though.

mRNA is used as a template for protein synthesis within a cell. A single mRNA is used repeatedly, but is eventually "used up" and taken apart. In eukaryotes, poly(A) tails are almost always found on mRNAs produced in the nucleus. The poly(A) tail is gradually shortened over the mRNA's lifetime in the cytoplasm, and this shortening contributes to the mRNA being degraded. (See here for more.)

Coronaviruses also have a poly(A) tail, similar to that of eukaryotic mRNA. The precise mechanical functions of this poly(A) tail and the means of its synthesis are objects of ongoing research, but research has shown that its presence greatly increases the degree to which coronavirus RNA is replicated in the host cell. Research has also shown that longer tails increase replication compared to shorter tails. It's quite likely that the presence of the tail assists in recruiting the cell's protein synthesis machinery and allows the RNA to last longer within the host cell, just as it does in the cell's own mRNA.

Interestingly, the way coronavirus poly(A) tail length is regulated during infection (it starts out shorter, gets longer, then gets much shorter) resembles poly(A) tail length regulation of mRNA during eukaryotic embryogenesis, suggesting parallels (see the paper in the "longer tails" link for more on this as well). Longer poly(A) tail length is closely tied to greater translational efficiency in that context.

There has been some speculation in the comments as to whether or not the Coronavirus poly(A) tail resembles a NOP sled in computer programming. I think the resemblance is mostly coincidental. NOP sleds are used in exploits because a processor, encountering a NOP, moves to the next instruction without taking any other actions. A long chain of NOPs, if entered by the processor at any point within it, will lead it to the instructions at the "bottom," after the NOPs. This is advantageous to use if you can't get the processor to go exactly where you want but you know it will end up somewhere close by, because it increases your chances of having your payload executed.

It's unusual to see a lengthy NOP sled in legitimate code, to the point that people writing them usually have to disguise their function in order to avoid automatic detection (see pg. 183 here). In contrast, a poly(A) tail is almost universally found on nuclear eukaryotic mRNA (and on some mRNAs of almost all organisms in some capacity, even mitochondria). Furthermore, the functions of the poly(A) tail are complex enough that it's still an object of ongoing research decades after its initial discovery, whereas a NOP sled does one very mechanical thing. Since the environment inside a cell is so different from the environment of a processor interacting with memory, I think it's hard to make comparisons that are so granular as to deal with a specific set of machine instructions, at least in this kind of context; a processor is a very straightforward kind of machine compared to a cell.

Answered by Zoë Sparks on August 27, 2021
