Understanding the fasta File Format

Question

I'm a computer scientist teaching algorithms development in the Fall. One of the algorithms we teach is called Edit Distance, and our folklore is that it is used to compare RNA sequences (is this actually true in practice?).
I would like to have students implement the edit distance algorithm and run it on actual SARS-COV-2 sequences, so I'm trying to understand exactly what I get from the GenBank database. I downloaded this one: https://www.ncbi.nlm.nih.gov/nuccore/1798174254
I am looking at the genomic.fna file. So this is apparently the FASTA file format, and the lines that begin with >MN988669.1 ... are comments. I see comments like:
>MN988669.1 Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV WHU02, complete genome

Followed by an RNA string. Is this the start of a new sequence for a different coronavirus specimen? So I could have students extract each of these and run edit distance and then produce a dendrogram or something? How do I find more information on where the samples came from? Is this the right file to be using, or should I use the gbff file? And are the PDB files interesting to me at all (I actually do know what PDB files are)?
Also, are there any recommended data sets where we could do something like track mutations in the virus (and see, e.g., that the NYC outbreak originated from Europe and not China)? Are there other useful algorithms / data that might be interesting for students to study in this vein? Specifically interesting to me would be graph-search algorithms, minimum spanning trees, and network flow. Also any NP-complete algorithms that we could run backtracking on. Obviously taking the theoretical study of algorithms to something as currently topical as the coronavirus has pedagogical value.
Thanks
EDIT:
Based on comments below here is what is taking shape.

Have students implement vanilla EditDistance (which there seems to be some disagreement about which algorithms are named what, so let's say insertions and deletions only, which I would call Longest Common Subsequence LCS). Then a variant that also does alignment (i.e. full Levenshtein distance computation, which I would call EditDistance, but Wikipedia calls the Needleman-Wunsch algorithm with a gap penalty of 1), then maybe Needleman-Wunsch with different gap penalties (if someone tells me what would make sense biologically).
Have students implement basic heirarchical clustering / phylogenetic tree generation algorithms a la https://www.ncss.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Hierarchical_Clustering-Dendrograms.pdf.
Have students run their sequence alignment variants and different clustering algorithms on SARS-COV-2 sequences and report on how the choices of parameters in 1 and 2 change the result and therefore potentially the analysis.
Ask some open ended written response questions on what this might mean for society, whether this introduces ethical considerations for an algorithms designer or are they doing just the math, etc.

My learning objectives (as they are now shaping up) are:

Students will understand that just because their algorithm comes with a proof of correctness, it doesn't mean it's the correct algorithm for the job.
Students will understand that different models / parameters to a model result in different outcomes, and thus even computational problems are not purely computational.
Doing theoretical computer science / mathematics is not devoid of ethical considerations.

I would very much appreciate thoughts on the above.

Chris_Rands · Answer

Your understanding of FASTA format is about right. The type of basic problem you're eluding to we term "sequence alignment"- edit distance might be okay for teaching but in practise we use other algorithms, e.g. you might be interested in the Needleman–Wunsch or Smith–Waterman algorithms. Richard Durbin et al. wrote a great book that covers these any much more https://www.amazon.com/Biological-Sequence-Analysis-Probabilistic-Proteins/dp/0521629713
Tracking mutations etc. requires more than just alignment though, see "phylogenetics" (i.e. building genetic trees) and "variant calling". Also checkout what the nextstrain team is doing https://nextstrain.org/ncov/global
In general, looking for practical applications for your algorithms is great, but be very careful before drawing any real world conclusions about the coronavirus outbreak from such analyses

M__ · Answer

The correct method is laborious and it is better to give the students a ready made tree at GISAID to investigate the dispersion of COVID-19 into Europe.
However, a quick approach to getting to the alignment and drawing a tree is easy and will readily complement your established teaching method. What this will give you is a very different phylogeny from edit distances and you would explain the matrix differences between the approaches. I suspect NCBI use Jukes Cantor distances.

Go to blast.ncbi.nlm.nih.gov
Select nucleotides
Paste "MN988669" and hit return
This will produce 100 hits
On the menu page select the "draw tree option"
This will produce a reasonable nucleotide tree for 100 COVID-19 sequences
Select "neighbor-joining" rather than "minimum evolution" (this is hierarchical clustering geared to towards mutation rate heterogeneity between taxa)
There are various point and click options including "examine alignment"
You can then select "minimum evolution" and see the changes in the tree (and there are changes) - TEACHING OBJECTIVE 1 & 2
If you want to examine COVID-19 from the European perspective then select a European isolate, e.g. French isolates and input this into the blast. However, European isolates are basically Wuhan original.

I have provided an example below, which I "rerooted" and represents the closest 100 to you sequence using collapsed clade format (your students can undo this, to examine the contents of a "collapsed clade"). The tree shows the dispersion of strains from the Wuhan seafood market.
There's lots of flexibility and the students can do everything inside ½ hour easily and this would complement your approach. The advantage of my approach to teaching phylogeny and regardless of what  you do and how you do it Blast is central to rapidly obtaining alignment data for students and investigators alike. We use different blast options but blasting through is prerequisite to understand the diversity and clean some information on the population structure.

Also, are there any recommended data sets where we could do something like track mutations in the virus (and see, e.g., that the NYC outbreak originated from Europe and not China)?

Yes there are, today this would be Mesquite available here. In my opinion this is a bit advanced. You can track amino acid mutations, which in my view is easier.

Understanding the fasta File Format

2 Answers

Add your own answers!

Ask a Question