TransWikia.com

Sequence alignment using BWT

Bioinformatics Asked on December 5, 2021

My Problem:

Skipping some specific background, what I want to do is judging whether some soft-clipping sequences are the same, which may result by the same SV event. Colored bases in Fig.1 is an example of soft-clipping sequences.

Fig.1 an example of soft-clipping sequences

I use BWA -MEM as the aligner. And I am trying to develop an SV caller actually, what I want to do here is refining PE(Paired-Ends) signal. If it is SR(Split-Reads), I don’t need to do this, because of SR and its supplemental alignments coming from the same read. But PE signal generated from some cluster procedures need some filters. And I think K-mer is not a good way to do this which adapted by some SV callers. For me, as a bioinformatic fool, imitation and implementation may be a good learning process. It seems still a long way to go.

My solution:

I suppose this is a sequence alignment problem. Then I use FM-indexing based on BWT to do this. My inexact/approximate matching implementation is referred from BWA.

My workflow:

  1. Extract these soft-clipping sequences
  2. Choose the longest one as the reference
  3. Iterate others to match with the reference, if the number of unmatched is large than 1, it means these sequences are not the same.

  • Question 1:
    Is that sounds reasonable to you? Any suggestions?

  • Question 2:
    If it is OK, how should I implement the penalty scoring scheme?
    Here, because these sequences should come from the same SV event, the main differences(Mismatch, InDel) should arise from sequencing errors, not small genomic variants. So I want to set a higher gap open penalty and even higher gap extend penalty, like

    mismatch_penalty = 4
    gapopen_penalty = 11
    gapextend_penalty = 20
    alignment penalty cutoff = 30

The later means that one alignment with two more gaps will be considered as spurious alignment.

How do I optimize these parameters?

One Answer

You just need to check the flag for suplementary mapping. If there is suplementary mapping then use the coordinates; if they are all the same the seq is the same and you just have one event, if different reads have different suplementary mapping locations then there's multiple events. The problem is more complex than this though so just use an existing tool.

Answered by Liam McIntyre on December 5, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP