TransWikia.com

Genome assembly of SRR12196449 with SPAdes

Bioinformatics Asked on February 19, 2021

I am trying to assemble the run SRR12196449 with SPAdes. The description of their project is:

This project expected to standardize a method for amplification and
sequencing of FIV genome in a simple way, allowing a broader analysis
to increase the knowledge on the biology and evolution of the virus
and virus-host interaction.

This is from School of Veterinary Medicine; University of Sao Paulo.

I have tried in three ways:

  1. Using a fasta file downloaded from NCBI (using Filtered Download method)

SPAdes-3.14.0/spades.py –only-assembler -s sra_data.fa -o raw-fiv1

  1. Using a fastq file also downloaded from NCBI using Filtered Download method.

SPAdes-3.14.0/spades.py –12 sra_data.fastq.gz -o raw-fiv1-b

  1. Downloading the original file and dumping to a fastq using sratools.

SPAdes-3.14.0/spades.py –12 SRR12196449.fastq -o raw-fiv1-c

In the third case I get a very bad assembly, likely because it needs trimming. In the two former cases, I get an assembly with a max contig length of around 4k-5k bps. But when I use quast to evaluate the assembly I obtain a largest alignment of ~500bps. I use this as reference.
Furthermore, if make a blast search it finds that the 4k-5k contig matches other FIV sequences at ~90% . In between these other sequences, there is at least one (MF370550.1) submitted by the University of Sao Paulo (likely another run from another sample in the same project).

On the other hand, if I assemble the run using another software I obtain a contig that is an almost perfect match to my reference sequence (~99%).

I have a few questions.

  • Am I using SPAdes correctly? Is there a better way to use it?
  • Is it possible that the sequences that blast finds matching with the SPAdes contigs are indeed sequences that someone got using SPAdes and that’s why these match (not because these are correct)?
  • What other option could explain these circumstances that I could be missing?

EDIT:
Some clarifications regarding some comments from @MaximilianPress. I can confirm that I used the same reference genome in both cases. I used that command to get the quast results:

quast-5.0.2/quast.py -r raw-fiv1/sequence.fasta SPAdes-3.14.0/raw-fiv1/contigs.fasta

The other assembler is an overlap-layout-consensus algorithm that I implemented. The methodology is similar: I use exactly the same input fasta file.

I have also tried using the SPAdes –plasmid flag and that doesn’t improve much the result. Max contig length is now 3642 bps and largest alignment 737 bps, covering only 26% of the reference.
BTW, I made also tests with other virus sequences. I am just interested to know if I am using SPAdes correctly so that I can compare the performance of my algorithm Here there is a link to additional cases with other genomes. There you can find all the data and results used (raw files, assembly results and quast results)

Another clarification: I have also got perfect assemblies using SPAdes for other runs corresponding to other viruses… I am following exactly the same procedure in this case.

2 Answers

Update 2:

I looked into this a little more, with the various data sources.

This is related in part to the answer submitted by OP juanjo75es, in addition to discussion on chat. I don't entirely understand the logic, but the general thrust seems to be that SPAdes makes weird assemblies somehow.

Some notes that I made:

REFERENCE ASSEMBLIES

  • FIV sequence U11820.1 was deposited in 1996, before SPAdes existed. Unclear assembly method.
  • FIV sequence MN630242 was deposited in 2020, but was assembled with CLC workbench.
  • These two sequences align together ok with minimap2. Low exact identity, it's true. This is the PAF output:
MN630242.1      8977    194     8964    +       U11820.1        9462    575     9326    1130    8778    60      tp:A:P  cm:i:116        s1:i:1125
       s2:i:55 dv:f:0.1761
  • These two assemblies are syntenic, and both have similar gene calls using prokka. GFFs:
MN630242.1      Prodigal:2.6    CDS     253     1605    .       +       0       ID=AMGANPBD_00001;inference=ab initio prediction:Prodigal:2.6;locus_tag=AMGANPBD_00001;product=hypothetical protein
MN630242.1      Prodigal:2.6    CDS     1656    4868    .       +       0       ID=AMGANPBD_00002;eC_number=3.6.1.23;Name=dut;gene=dut;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:Q2YRG4;locus_tag=AMGANPBD_00002;product=Deoxyuridine 5'-triphosphate nucleotidohydrolase
MN630242.1      Prodigal:2.6    CDS     4861    5616    .       +       0       ID=AMGANPBD_00003;inference=ab initio prediction:Prodigal:2.6;locus_tag=AMGANPBD_00003;product=hypothetical protein
MN630242.1      Prodigal:2.6    CDS     5891    8461    .       +       0       ID=AMGANPBD_00004;inference=ab initio prediction:Prodigal:2.6;locus_tag=AMGANPBD_00004;product=hypothetical protein
MN630242.1      Prodigal:2.6    CDS     8626    8790    .       +       0       ID=AMGANPBD_00005;inference=ab initio prediction:Prodigal:2.6;locus_tag=AMGANPBD_00005;product=hypothetical protein


U11820.1        Prodigal:2.6    CDS     634     1983    .       +       0       ID=EANIPDKN_00001;inference=ab initio prediction:Prodigal:2.6;locus_tag=EANIPDKN_00001;product=hypothetical protein
U11820.1        Prodigal:2.6    CDS     1995    5246    .       +       0       ID=EANIPDKN_00002;eC_number=3.6.1.23;Name=dut;gene=dut;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:Q2YRG4;locus_tag=EANIPDKN_00002;product=Deoxyuridine 5'-triphosphate nucleotidohydrolase
U11820.1        Prodigal:2.6    CDS     5239    5994    .       +       0       ID=EANIPDKN_00003;inference=ab initio prediction:Prodigal:2.6;locus_tag=EANIPDKN_00003;product=hypothetical protein
U11820.1        Prodigal:2.6    CDS     6269    8830    .       +       0       ID=EANIPDKN_00004;inference=ab initio prediction:Prodigal:2.6;locus_tag=EANIPDKN_00004;product=hypothetical protein
U11820.1        Prodigal:2.6    CDS     8904    9152    .       +       0       ID=EANIPDKN_00005;inference=ab initio prediction:Prodigal:2.6;locus_tag=EANIPDKN_00005;product=hypothetical protein

I can share FAA files of the proteins if needed.

READ ASSEMBLY

I also assembled the indicated reads using SPAdes. For reference it is a ~9Kbp virus, but this is a 4.3Mbp assembly. There is a lot of non-virus sequence in there. The second-largest contig is a shuttle vector. the third-largest contig is cat (host). Many more are cat, so I think it's pretty cat-oriented. The original authors used CLC workbench, so I guess that just worked a lot better in this instance, even in the presence of all the contamination. Unclear why, it appears that CLC works similarly to OP's assembly tool.

I aligned these to MN630242.1 with minimap2. If I understand OP, they are unhappy about the mapping of these contigs to this genome reference. The identities are mostly high (though there is indeed some oddness with the largest contig, which only finds very low coverage- at the same time that there are overlapping contigs with very high ID?), and they cover the entire reference genome:

MN630242.1  8977    2155    3199    -   NODE_14_length_1054_cov_3786.620280 1054    2   1046    1021    1044    60  tp:A:P  cm:i:189    s1:i:1021   s2:i:378    dv:f:0.0031
MN630242.1  8977    7893    8626    -   NODE_25_length_743_cov_3814.258117  743 8   739 713 733 60  tp:A:P  cm:i:129    s1:i:713    s2:i:318    dv:f:0.0045
MN630242.1  8977    72  4904    -   NODE_1_length_4942_cov_25.814123    4942    34  4863    628 4832    41  tp:A:P  cm:i:7s1:i:628  s2:i:512    dv:f:0.1703
MN630242.1  8977    3376    3897    +   NODE_39_length_526_cov_5254.155388  526 3   524 512 521 0   tp:A:S  cm:i:8s1:i:512  dv:f:0.0023
MN630242.1  8977    6270    6673    +   NODE_82_length_409_cov_1.744681 409 3   406 357 403 0   tp:A:P  cm:i:53 s1:i:357    s2:i:351    dv:f:0.0185
MN630242.1  8977    6852    7223    -   NODE_87_length_381_cov_23.334646    381 6   376 355 371 2   tp:A:P  cm:i:5s1:i:355  s2:i:351    dv:f:0.0079
MN630242.1  8977    6187    6558    -   NODE_88_length_380_cov_2338.675889  380 7   378 351 371 0   tp:A:S  cm:i:5s1:i:351  dv:f:0.0085
MN630242.1  8977    6852    7308    +   NODE_56_length_471_cov_0.933140 471 15  471 351 456 0   tp:A:S  cm:i:43 s1:i:351    dv:f:0.0397
MN630242.1  8977    7332    7707    +   NODE_86_length_389_cov_3506.202290  389 5   380 334 375 17  tp:A:P  cm:i:5s1:i:334  s2:i:309    dv:f:0.0111
MN630242.1  8977    5181    5568    -   NODE_60_length_464_cov_1.005935 464 50  437 314 387 17  tp:A:P  cm:i:41 s1:i:314    s2:i:282    dv:f:0.0357
MN630242.1  8977    7277    7707    +   NODE_76_length_438_cov_0.919614 438 5   435 309 430 0   tp:A:S  cm:i:40 s1:i:309    dv:f:0.0437
MN630242.1  8977    6953    7285    +   NODE_94_length_345_cov_2.903670 345 8   340 301 332 0   tp:A:S  cm:i:50 s1:i:301    dv:f:0.0110
MN630242.1  8977    5744    6056    -   NODE_113_length_316_cov_2.169312    316 4   316 289 312 35  tp:A:P  cm:i:4s1:i:289  s2:i:244    dv:f:0.0096
MN630242.1  8977    5615    5927    +   NODE_108_length_322_cov_1103.400000 322 6   318 283 312 58  tp:A:P  cm:i:4s1:i:283  s2:i:209    dv:f:0.0143
MN630242.1  8977    6543    6813    -   NODE_147_length_279_cov_2592.519737 279 8   278 261 270 21  tp:A:P  cm:i:4s1:i:261  s2:i:236    dv:f:0.0032
MN630242.1  8977    8544    8872    +   NODE_100_length_333_cov_2259.189320 333 1   329 252 328 0   tp:A:P  cm:i:3s1:i:252  s2:i:245    dv:f:0.0318
MN630242.1  8977    7601    7853    -   NODE_200_length_255_cov_25.125000   255 0   252 250 252 12  tp:A:P  cm:i:3s1:i:250  s2:i:236    dv:f:0.0017
MN630242.1  8977    6648    6968    -   NODE_106_length_324_cov_2.538071    324 4   324 248 320 0   tp:A:P  cm:i:3s1:i:248  s2:i:248    dv:f:0.0303
MN630242.1  8977    5940    6193    -   NODE_162_length_266_cov_1809.287770 266 6   259 246 253 5   tp:A:P  cm:i:4s1:i:246  s2:i:240    dv:f:0.0033
MN630242.1  8977    5416    5664    -   NODE_264_length_255_cov_9.531250    255 5   253 243 248 0   tp:A:P  cm:i:4s1:i:243  s2:i:242    dv:f:0.0033
MN630242.1  8977    7221    7466    -   NODE_247_length_255_cov_12.539062   255 1   246 241 245 0   tp:A:P  cm:i:4s1:i:241  s2:i:241    dv:f:0.0032
MN630242.1  8977    4830    5071    -   NODE_1121_length_245_cov_11.711864  245 4   245 239 241 0   tp:A:P  cm:i:4s1:i:239  s2:i:235    dv:f:0.0042
MN630242.1  8977    7769    8013    +   NODE_677_length_251_cov_3.951613    251 6   249 238 244 7   tp:A:P  cm:i:4s1:i:238  s2:i:230    dv:f:0.0075
MN630242.1  8977    4901    5148    +   NODE_270_length_255_cov_8.726562    255 8   255 238 247 0   tp:A:P  cm:i:4s1:i:238  s2:i:234    dv:f:0.0073
MN630242.1  8977    8705    8948    -   NODE_382_length_254_cov_5.763780    254 7   250 234 243 15  tp:A:P  cm:i:4s1:i:234  s2:i:217    dv:f:0.0062
MN630242.1  8977    6065    6307    +   NODE_410_length_254_cov_3.488189    254 1   243 227 242 10  tp:A:P  cm:i:3s1:i:227  s2:i:210    dv:f:0.0122
MN630242.1  8977    5062    5239    -   NODE_7003_length_180_cov_9.207547   180 3   180 172 177 17  tp:A:P  cm:i:2s1:i:172  s2:i:157    dv:f:0.0066
MN630242.1  8977    11  136 +   NODE_21321_length_128_cov_1195.000000   128 2   127 125 125 3   tp:A:P  cm:i:2s1:i:125  s2:i:123    dv:f:0

Only a (relatively) few contigs align well (figure), as might be expected with heavy contamination, and of those some are overlapping:

enter image description here

To look into the odd behavior of the biggest contig, I ran prokka on it too and found the expected genes for its position, especially the largest gene in the virus, dut (Deoxyuridine 5'-triphosphate nucleotidohydrolase). I then took the protein sequences of the 3 assemblies and aligned them with clustalo. They're all quite similar:

CLUSTAL O(1.2.4) multiple sequence alignment


NODE1_KJHFFCBH_00001         ------MEKRPEIQIFVNGHPIKFLLDTGADITILNRKEFIIGNSIENGKQNMIGVGGGK
MN630242_AMGANPBD_00002      -------------MIFVNGYPIKFLLDTGADITILNRRDFQVKNSIENGRQNMIGVGGGK
U11820.1_EANIPDKN_00002      MGTTTTLERRLEIQIFVNGHPIKFLLDTGADITILNRKDFQIGNSIENGKQNMIGVGGGK
                                           *****:*****************::* : ******:**********

NODE1_KJHFFCBH_00001         RGTNYINVHLEIRDENYKTQCIFGNVCVLEDNSLIQPLLGRDNMIKFNIRLVMAQISEKI
MN630242_AMGANPBD_00002      RGTNYTNVHLEIRDENYKTQCIFGNVCVLEDNSLIQPLLGRDNMIKFNIRLVMAQISDKI
U11820.1_EANIPDKN_00002      RGTNYINVHLEIRDENYRMQCIFGNVCVLEDNSLIQPLLGRDNMIKFNIRLVMAQISEKI
                             ***** ***********: **************************************:**

NODE1_KJHFFCBH_00001         PIVKVRMKDPTQGPQVKQWPLSNEKIEALTEIVERLEQEGKVKRADPNNPWNTPVFAIKK
MN630242_AMGANPBD_00002      PIVKVKMKDPNKGPQIKQWPLSNEKIEALTEIVERLEKEGKVKRADPNNPWNTPVFAIKK
U11820.1_EANIPDKN_00002      PIVKVRMRDPIQGPQVKQWPLSNEKIEALTDIVERLESEGKVKRADPNNPWNTPVFAIKK
                             *****:*:** :***:**************:******.**********************

NODE1_KJHFFCBH_00001         KSGKWRMLIDFRVLNKLTDKGAEVQLGLPHPAGLQWKKQVTVLDIGDAYFTIPLDPDYAP
MN630242_AMGANPBD_00002      KSGKWRMLIDFRELNKLTEKGAEVQLGLPHPAGLQMKKQVTVLDIGDAYFTIPLDPDYAP
U11820.1_EANIPDKN_00002      KSGKWRMLIDFRVLNKLTDKGAEVQLGLPHPAGLQMKKQVTVLDIGDAYFTIPLDPDYAP
                             ************ *****:**************** ************************

NODE1_KJHFFCBH_00001         YTAFTLPRKNNAGPGKRYVWCSLPQGWVLSPLIYQSTLDNILQPYIKQNPELDIYQYMDD
MN630242_AMGANPBD_00002      YTAFTLPRKNNAGPGRRYIWCSLPQGWILSPLIYQSTLDNIIQPFIRQNPQLDIYQYMDD
U11820.1_EANIPDKN_00002      YTAFTLPRKNNAGPGRRYVWCSLPQGWVLSPLIYQSTLDNILQPFIKQNSELDIYQYMDD
                             ***************:**:********:*************:**:*:** :*********

NODE1_KJHFFCBH_00001         IYIGSNLSRQEHKQKVEELRKLLLWWGFETPEDKLQEEPPYKWMGYELHPLTWSIQQKQL
MN630242_AMGANPBD_00002      IYIGSNLNKKEHKEKVGELRKLLLWWGFETPEDKLQEEPPYKWMGYELHPLTWTIQQKQL
U11820.1_EANIPDKN_00002      IYIGSNLSKKEHKQKVEELRKLLLWWGFETPEDKLQEEPPYKWMGYELHPLTWSIQQKQL
                             *******.::***:** ************************************:******

NODE1_KJHFFCBH_00001         EIPERPTLNDLQKLAGKINWASQTIPKLSIKALTHMMRGDQKLDSIREWTEEAKKEVQKA
MN630242_AMGANPBD_00002      DIPEQPTLNELQKLAGKINWASQAIPDLSIKALTNMMRGNQNLNSIREWTKEARLEVQKA
U11820.1_EANIPDKN_00002      EIPERPTLNELQKLAGKINWASQTIPDLSIKELTNMMRGDQKLDSIREWTVEAKREVQKA
                             :***:****:*************:**.**** **:****:*:*:****** **: *****

NODE1_KJHFFCBH_00001         KEAIEKQAQLYYYDPNRELYAKISLVGPHQLCYQVYHKNPEQILWYGKMNRQKKRAENTC
MN630242_AMGANPBD_00002      KKAIEGQVQLGYYDPSKELYAKLSLVGPHQISYQVYQRNPERILWYGKMSRQKKKAENTC
U11820.1_EANIPDKN_00002      KEAIEKQAQLNYYDPNRGLYAKLSLVGPHQICYQVYQKNPEHILWYGKMNRQKKKAENTC
                             *:*** *.** ****.: ****:*******:.****::***:*******.****:*****

NODE1_KJHFFCBH_00001         DIALRACYKIREESIVRIGKEPVYEIPASREAWESNIIRSPYLKASPPEVEFIHAALNIK
MN630242_AMGANPBD_00002      DIALRACYKIREESIIRIGKEPKYEIPTSREAWESNLINSPYLKAPPPEVEYIHAALNIK
U11820.1_EANIPDKN_00002      DIALRACYKIREESIIRIGKEPMYEIPASREAWESNLIRSPYLKAPPPEVEFIHAALSIK
                             ***************:****** ****:********:*.****** *****:*****.**

NODE1_KJHFFCBH_00001         RALSMVQEVPILGAETWYIDGGRRQGKAARAAYWTNTGRWQVMEIEGSNQKAEVQALLMA
MN630242_AMGANPBD_00002      RALSMIKDAPILGAETWYIDGGRKLGKAAKAAYWTDTGKWQVMELEGSNQKAEIQALLLA
U11820.1_EANIPDKN_00002      RALSMIQDAPITGAETWYIDGSRKQGKAARAAYWTDTGKWQIMEIEGSNQKAEVQALLLA
                             *****:::.** *********.*: ****:*****:**:**:**:********:****:*

NODE1_KJHFFCBH_00001         LKEGPEEMNIITDSQYILNIMNQQPDLMEGIWQEVLEEMEKKIAIFIDWVPGHKGIPGNE
MN630242_AMGANPBD_00002      LKAGPEEMNIITDSQYVINIILQQPDMMEGIWQEVLEELEKKTAIFIDWVPGHKGIPGNE
U11820.1_EANIPDKN_00002      LKAGSEEMNIITDSQYILNIINQQPDLMEGLWQEVLEEMEKKIAIFIDWVPGHKGIPGNE
                             ** * ***********::**: ****:***:*******:*** *****************

NODE1_KJHFFCBH_00001         EVDKLCQTMMIIEGDGILNKRPEDAGYDLLAAQEIHILPGEVRIVPTRTRIMLPKGYWGL
MN630242_AMGANPBD_00002      EVDKLCQTMMIIEGDGILDKRSEDAGYDLLAAKEMHLLPGEVKVIPTGVKIMLPKGYWGL
U11820.1_EANIPDKN_00002      EVDKLCQTMMIIEGEGILDKRSEDAGYDLLAAQETHFLPGEVRIVPTKTRIMLPKGHWGL
                             **************:***:** **********:* *:*****:::** .:******:***

NODE1_KJHFFCBH_00001         IMGKSSIGNKGMDVLGGVIDEGYRGEIGVIMINLSKKSTTILEKQKVAQLIILPCKHESL
MN630242_AMGANPBD_00002      IIGKSSIGSKGLDVLGGVIDEGYRGEIGVIMINLSRKSITLLEQQKIAQLIILPCKHEVL
U11820.1_EANIPDKN_00002      IMGKSSIGSKGMDVLGGVIDEGYRGELGVIMINLTKKSITILEKQKVAQLIILPCRHESL
                             *:******.**:**************:*******::** *:**:**:********:** *

NODE1_KJHFFCBH_00001         EQGEIIMNSERGEKGFGSTGVFSSWVDRIEEAELNHEKFHSDPQYLRTEFNIPRIVAEEI
MN630242_AMGANPBD_00002      EQGKVIMDSERGDKGYGSTGVFSSWVDRIEEAEINHEKFHSDPQYLRTEFNLPKMVAEEI
U11820.1_EANIPDKN_00002      QQGEIQMDSERGEKGFGSTGVFSSWVDRIEEAELNHEKFHSDPQYLRTEFNLPRIVAEEI
                             :**:: *:****:**:*****************:*****************:*::*****

NODE1_KJHFFCBH_00001         KRKCPLCRIRGEQVEGKLKIGPGIWQMDCTHFNGKIIIVAIHVESGLLWAQIIPQETADC
MN630242_AMGANPBD_00002      RRKCPVCRIRGEQVGGQLKIGPGIWQMDCTHFDGKIILVAIHVESGYIWAQIISQETADC
U11820.1_EANIPDKN_00002      KRKCPLCRIRGEQVGGQLKIGPGIWQMDCTHFNGKIIIVAVHVESGFLWAQIIPQETAEC
                             :****:******** *:***************:****:**:***** :***** ****:*

NODE1_KJHFFCBH_00001         TLKAIMQLVSTHNVTEIQTDNGPNFKNQKIEGLLSYMGIKHKLGIPGNPQSQALVENANS
MN630242_AMGANPBD_00002      TVKAVLQLLSAHNVTELQTDNGPNFKNQKMEGALNYMGVKHKFGIPGNPQSQALVENVNQ
U11820.1_EANIPDKN_00002      TVKALLQLICAHNVTELQTDNGPNFKNQKMEGLLNYMGIKHKLGIPGNPQSQALVENANN
                             *:**::**:.:*****:************:** *.***:***:**************.*.

NODE1_KJHFFCBH_00001         TLKVWIQKFLPETTSLDNALALALHCLNFKQRGRLGKMAPYELYTQQESLRIQDYFSQLP
MN630242_AMGANPBD_00002      TLKAWIQKFLPETTSLENALALAVHCLNFKQRGRIGGMAPYELLAQQESLRIQEFFSKIP
U11820.1_EANIPDKN_00002      TLKAWIQKFLPETTSLDNALALALHCLNFKQRGRLGKMAPYELYIQQESLRIQDYFSQIP
                             ***.************:******:**********:* ******  ********::**::*

NODE1_KJHFFCBH_00001         QKLMMQWVYYKDQKDKKWKGPMRVEYWGQGSVLLKDEEKGYFLVPRRHIRRVPEPCTLPE
MN630242_AMGANPBD_00002      QKLQAQWIYYKDQKDKKWKGPMRVEYWGQGSVLLKDEEKGYFLVPRRHIKRVPEPCALPE
U11820.1_EANIPDKN_00002      QKLMMQWVYYKDQKDKKWKGPMRVEYWGQGSVLLKDEEKGYFLVPRRHIRRVPEPCTLPE
                             ***  **:*****************************************:******:***

NODE1_KJHFFCBH_00001         GDE
MN630242_AMGANPBD_00002      GDE
U11820.1_EANIPDKN_00002      GDE
                             ***

Overall, the level of homology is quite high at the protein level. There's maybe a little evidence that this contig's protein is closer to U11820.1 than to MN630242 (though there are also several positions where U11820.1 is the outgroup). So I'm not sure why pieces of this contig are so hard to align at the DNA level without further investigation, but honestly overall of these are looking like very similar viral sequences, as might be expected.

I also aligned the big contig to both U11820.1 and MN630242 at the DNA level in 3-way clustalo alignment it is ok. Not sure what the difference is there. There is no obvious reason to prefer one reference over the other, according to my eye. When I have clustalo output a clustering solution the big contig is essentially equidistant between the two references, so I don't buy that U11820.1 is "better" as a reference:

(
MN630242.1:0.116151
,
(
NODE1_rc:0.0504856
,
U11820.1:0.0504856
):0.0656651
)
;

As I noted in the comments, it's unfortunate that QUAST and SPAdes aren't working very well. I don't know what the deal is there. But it seems like overall the assembly is quite close to both references. I don't have any intuition for why SPAdes is doing what it's doing, or why alignment at the DNA level is acting weird (lots of artificial gaps?). But I don't think it's a pathology of SPAdes particularly, it seems like it's just something weird that happened with this dataset interacting with DNA aligners. Possibly something went strangely with read deposition?

Update:

See OP's self-answer as well (and also discussion in the comments).


Based on the (extremely informative!) follow-up updates that you have provided, I think that we can tentatively answer your questions:

The tl;dr is that you can check your reads to make sure they aren't weird and you can check your output sequence to see if it's weird.

  • Am I using SPAdes correctly? Is there a better way to use it?

    1. I think that you are using SPAdes correctly, based on your experience getting ~finished viral genomes in the past with the same workflow. (THis is also my experience using SPAdes on viral genomes with Illumina PE data, is that it "just works".)
    2. It is possible that you are using some other tools in your workflow in a suboptimal way. For example, maybe your trimming is not quite correct (it does seem based on your answer that you are already trimming). The question in my mind is then whether there is some non-intuitive thing in the reads that should still be trimmed.
    3. I would suggest running FASTQC on your reads for this run and also your reads for the other runs that have worked, to see if something weird is going on. FASTQC directly checks for adapter sequence, weird GC composition, quality scores etc. That will give you a lot of information.
  • Is it possible that the sequences that blast finds matching with the SPAdes contigs are indeed sequences that someone got using SPAdes and that's why these match (not because these are correct)?

    1. This is formally possible, of course. Our main way of knowing that you're getting the right genome is... does it look like other genomes that have been generated in the same way?
    2. If you are really worried about this, I would suggest directly inspecting the sequence to see what the matches are. If you like, you can then blast the matches to see if they
    3. Another approach would be to annotate the genome and make sure that it "looks like" a phage genome. E.g. it has the expected genes from FIV. Prokka is very easy to run and works ok on phage.
    4. A slightly self-serving suggestion is to use PhageTerm to check your reads against your final genomes to ensure that they behave like phage- e.g. they have termini in the correct places, etc. (it does this with test coverage.) I am a contributor to PhageTerm, so I happen to know it a bit. Likely there are other options.
  • What other option could explain these circumstances that I could be missing?

    1. As I've suggested, I think that that leaves the reads. How sure are you that they are comparable to your other runs? They look like they should be sufficient from my glance at SRA, but maybe someone bumped the MiSeq while it was running. I've given suggestions above for checking with FASTQC.

Correct answer by Maximilian Press on February 19, 2021

After many considerations, I am going to accept the response from Maximilian Press. I see now that some viruses have high variability (HIV even 50% of the sequence). Therefore MN630242.1. and U11820.1 are apparently two strains. There are things I still don't understand but these are beyond the initial goal of my question. Concretely:

  • Why SPAdes returns one strain and rnaSPAdes the other one.
  • Why one strain matches 99,9% with MN630242.1 (and at least another assembly) while the other one has the (now) expected variability of this kind of virus.

I also want to point that apparently, Quast is not effective for viruses with such high variation.

I'm not going to delete this answer given that it responds to part of my question that is not responded in Maximilian's answer.

Therefore, directly answering my questions:

  • Am I using SPAdes correctly? Is there a better way to use it? I am using SPAdes mostly in the correct way, similarly to how an average user would. But rnaSPAdes seems to be more appropriate for RNA viruses and it indeed works far better in this case.

  • Is it possible that the sequences that Blast finds matching with the SPAdes contigs are indeed sequences that someone got using SPAdes and that's why these match? That could be possible in some cases but that doesn't mean these sequences are wrong as I initially considered. For whatever reason SPAdes seems to miss one strain (if that's what really is happening)

  • What other option could explain these circumstances that I could be missing? See Maximilian Press answer.

Answered by juanjo75es on February 19, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP