Is there any value in scaffolding the output contigs of MEGAHIT assembler given a metagenomic dataset?

Bioinformatics Asked on June 21, 2021

As far as I understand, in most assembly programs the scaffolding step uses paired-end information to get from contigs (contiguous sequences) to scaffolds (longer sequences that may contain N-filled gaps).

My assembly software of choice, MEGAHIT, uses paired-end information to build the contigs, but it does not output standard scaffolds. So I am wondering about the following related things:

- Is it meaningful to run a scaffolding program on the output of MEGAHIT? I imagine there might be some instances in which paired-end information could span a gap.

- Which software would you recommend for it? (I’ve tried SOAPdenovo2 and SSPACE, but they appear not to be actively maintained, so I have the issue of ‘it doesn’t work and I can’t do anything about it’.)

- Could relevant information (e.g. two contigs being connected by paired-end information) be recovered by alternative and perhaps more user-friendly means, such as exploring the assembly graph with Bandage?

Thank you for your time!

2 Answers

Update 2: It looks like your approach has actually been suggested here as one way to use the PE information. I guess MEGAHIT may not really be using the PE information anyway. I still think it's kind of weird, but other people do suggest it, so maybe it's worth trying what Torsten suggests.

I think that scaffolding with PE information from the same reads that were used as input to MEGAHIT is somewhat sketchy. You could still try it, but I'd be worried about artifacts, because de novo assemblers rely so heavily on heuristics.

However, I think that it is perfectly ok to scaffold using orthogonal data. Here are some examples of orthogonal data:

  1. A reference assembly (weird for metagenomics but still possibly ok).
  2. Long reads, e.g. ONT or PacBio (low coverage is probably fine).
  3. Proximity ligation data, e.g. Hi-C reads.
  4. Mate-pair reads (these also lead to artifacts at a somewhat high rate, but are technically valid).
  5. Linked-reads (e.g. 10X).
  6. Probably other methods as well.

The tools used in each case would be somewhat different. For a little more information about how you might use these data types, here is a recent review of the technology.

Full disclosure: I work for a company that sells Hi-C kits for such applications.

Update: I realized that I missed one part of the question. I think that visually exploring the assembly using e.g. Bandage is always a good idea. Quite possibly you can make some scaffolding decisions that way, but to me it sounds somewhat painful to do, especially in a metagenome, where there are going to be many collapsed regions with multiple branches.
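If you just want to see which contigs are connected by paired-end information before committing to a scaffolder, one option is to map the reads back to the contigs and count the pairs whose mates land on different contigs. Below is a minimal sketch of that idea, not a tested pipeline. It assumes you already have SAM-format alignments of the reads against the MEGAHIT contigs (from whichever mapper you prefer); the function name `count_contig_links`, the `min_mapq` cutoff of 30, and the toy records are all made up for illustration.

```python
from collections import Counter


def count_contig_links(sam_lines, min_mapq=30):
    """Count read pairs whose two mates map to different contigs.

    sam_lines: iterable of SAM text lines (header lines are skipped).
    min_mapq:  hypothetical cutoff; only the MAPQ of the first mate is
               checked here, because the mate's MAPQ is not a mandatory
               SAM field.
    """
    links = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        rname, mapq, rnext = fields[2], int(fields[4]), fields[6]
        # Keep only paired reads, and only the first mate of each pair,
        # so that every pair is counted once.
        if not flag & 0x1 or not flag & 0x40:
            continue
        # Skip unmapped reads, unmapped mates, secondary and
        # supplementary alignments.
        if flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue
        # RNEXT is "=" when the mate is on the same contig.
        if rnext in ("=", "*") or rnext == rname or mapq < min_mapq:
            continue
        links[tuple(sorted((rname, rnext)))] += 1
    return links


# Toy records (invented), with SEQ and QUAL left as "*".
example_sam = [
    "@SQ\tSN:contig_1\tLN:5000",
    "r1\t97\tcontig_1\t4900\t60\t100M\tcontig_2\t50\t0\t*\t*",
    "r1\t145\tcontig_2\t50\t60\t100M\tcontig_1\t4900\t0\t*\t*",
    "r2\t99\tcontig_1\t100\t60\t100M\t=\t400\t400\t*\t*",
    "r3\t97\tcontig_2\t20\t60\t100M\tcontig_1\t4800\t0\t*\t*",
    "r4\t97\tcontig_1\t10\t5\t100M\tcontig_3\t900\t0\t*\t*",
]

links = count_contig_links(example_sam)
for (contig_a, contig_b), n_pairs in links.most_common():
    print(contig_a, contig_b, n_pairs)
```

Contig pairs supported by many read pairs are candidates worth inspecting in Bandage; pairs supported by only a handful of reads are more likely to be repeats or mapping noise, which is the kind of artifact I'd worry about in a metagenome.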

Answered by Maximilian Press on June 21, 2021

You can use SOAPdenovo-Fusion to scaffold contigs produced by MEGAHIT, as suggested by one of the developers.

Answered by Robvh on June 21, 2021
