TransWikia.com

Is there any value in scaffolding the output contigs of MEGAHIT assembler given a metagenomic dataset?

Bioinformatics Asked on June 21, 2021

As far as I understand, in most assembly programs the scaffolding step uses paired-end information to get from contigs (contiguous sequences) to scaffolds (longer sequences that may contain N-filled gaps).

My assembly software of choice, MEGAHIT, uses paired-end information to build the contigs, but it does not output a standard scaffold. So I am wondering the following, related things:

- Is it meaningful to run a scaffolding program on the output of MEGAHIT? I imagine there might be some instances in which paired-end information could span a gap.

- Which software would you recommend for it? (I've tried SOAPdenovo2 and SSPACE, but they appear not to be actively maintained, so I have the issue of 'it doesn't work and I can't do anything about it'.)

- Could relevant information (e.g. two contigs being connected by paired-end information) be recovered by alternative and perhaps more user-friendly means, such as exploring the assembly graph with Bandage?

Thank you for your time!

2 Answers

Update 2: It looks like your approach has actually been suggested here as one way to use the PE information. I guess MEGAHIT may not really be using the PE information anyway. I still think it's a bit odd, but other people do suggest it, so maybe it's worth trying what Torsten suggests.

I think that scaffolding with PE information from the same reads that were used as input to MEGAHIT is somewhat sketchy. You could still try it, but I'd be worried about artifacts, because de novo assemblers rely so heavily on heuristics.
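If you do want to see how much paired-end support there is for joining contigs before committing to a scaffolder, one option is to map the reads back to the MEGAHIT contigs and count pairs whose mates land on different contigs. The following is only a rough sketch under some assumptions: the reads have been aligned with any aligner that writes standard SAM, the function name `count_contig_links` and the `min_mapq` threshold are made up for illustration, and the MAPQ filter is applied to the first mate only.

```python
from collections import Counter


def count_contig_links(sam_lines, min_mapq=20):
    """Count read pairs whose two mates map to different contigs.

    sam_lines: iterable of SAM text lines (header lines are skipped).
    min_mapq:  hypothetical cutoff, checked on the first mate only.
    Returns a Counter keyed by a sorted (contig_a, contig_b) tuple.
    """
    links = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        rname, mapq, rnext = fields[2], int(fields[4]), fields[6]
        # Skip unmapped reads (0x4), reads with an unmapped mate (0x8),
        # and secondary (0x100) or supplementary (0x800) alignments.
        if flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue
        # Look at the first mate only (0x40) so each pair is counted once.
        if not flag & 0x40:
            continue
        # "=" means the mate is on the same contig; "*" means unknown.
        if rnext in ("=", "*") or rnext == rname:
            continue
        if mapq < min_mapq:
            continue
        links[tuple(sorted((rname, rnext)))] += 1
    return links
```

Contig pairs supported by only a handful of read pairs are probably not worth trusting, especially in a metagenome where repeats and closely related strains can attract spurious mappings. The counts also say nothing about orientation or gap size, which a real scaffolder would have to estimate.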

However, I think that it is perfectly ok to scaffold using orthogonal data. Here are some examples of orthogonal data:

  1. A reference assembly (weird for metagenomics but still possibly ok).
  2. Long reads, e.g. ONT or PacBio (low coverage is probably fine).
  3. Proximity ligation data, e.g. Hi-C reads.
  4. Mate-pair reads (also lead to artifacts at a somewhat high rate, but technically valid).
  5. Linked-reads (e.g. 10X).
  6. Probably other methods as well.

The tools used in each case would be somewhat different. For a little more information about how you might use this, here is a recent review of tech.

Full disclosure: I work for a company that sells Hi-C kits for such applications.

Update: I realized that I missed one part of the question. I think that visually exploring the assembly using e.g. Bandage is always a good idea. Quite possibly you can make some scaffolding decisions that way, but to me it sounds somewhat painful to do, especially in a metagenome, where there are going to be a lot of collapsed regions with multiple branches.

Answered by Maximilian Press on June 21, 2021

You can use SOAPdenovo-Fusion to scaffold contigs produced by MEGAHIT, as suggested by one of the developers: https://github.com/aquaskyline/SOAPdenovo2

Answered by Robvh on June 21, 2021
