How to map short sequences to long reads, recovering all multiply-mapped high-quality matches

Question

The dilemma:
I have around one million short sequences (21 bp to several 100s of basepairs) for which I need to identify all occurrences of in 20-30x coverage noisy long reads (both pacbio and ONT). All of the short sequences and long reads are derived from the same individual, and there are enough short sequences given the organism's genome size that each long read should have at least one or multiple short sequences.
An example:
Given three short sequences, AA & BBBB & CCCCCC , how to identify all occurrences of these sequences in long, uncorrected reads.
Sequences to identify in long reads:
  - AA
  - BBBB
  - CCCCCC

Long Reads:
  - 1: ------AA-----------------------CCCCCC------
  - 2: ------AA--------BBBB-----------------------
  - 3: ---BBBB-----------CCCC--
  - 4: ------------------AA-----

In this example, the output sam file will contain high MAPQ hits for short sequence AA to reads (1, 2, 3), for short sequence BBBB to reads (2, 3). For short sequence CCCCCC there will be a high MAPQ hit to read 1 and a lower MAPQ hit to read 3
Limitations:

Due to the requirements of this project, I cannot perform a de novo assembly then map the short sequences to the assembly output. It is necessary to directly map the short sequences to the long reads.
I cannot correct the long reads before I map/align the short sequences to them.
The alignment/mapping should sensitively detect the short sequences in the long reads.
The short sequences may be anywhere from 1Mbp to 30Mbp in total bases.
The long reads may be anywhere from 3Gbp to 90Gbp in total bases. So the mapper/alignment technique is hopefully fast. Sensitivity is more important though.

Research so far:

bwa mem does not output multiply-mapped in a predictable manner that fits this use case https://www.biostars.org/p/304614/
blat seems to output many hits for a single short sequence, but there are no out-of-the-box parameters for mapping short reads to long reads.
Maybe multiple sequence alignment is the most sensitive, but wouldn't that entail running (no. of short seqs) * (no. of long reads) alignments?

gringer · Accepted Answer

LAST has given the best results for me when I've tried to do this, although I agree with @user172818 that it's not a good idea to map really short reads. This is due to a combination of natural sequence duplication in long DNA sequences (e.g. see here), as well as abundant base calling differences present in single-molecule sequencing.

Minimising error is not necessarily going to be the best option, and concentrating only on unique hits will miss a lot of signal. There are frequently multiple identically plausible positions, even at zero error, and the long reads will have their own associated errors.

I also find the limitations a little odd. Canu can do read-level correction on long reads based on overlapping reads, and if assembly is possible, then nanopolish can be used to correct nanopore reads for systematic base-calling error introduced by the methylation of unamplified template.

How to map short sequences to long reads, recovering all multiply-mapped high-quality matches

The dilemma:

An example:

Limitations:

Research so far:

One Answer

Add your own answers!

Ask a Question