TransWikia.com

Is there an efficient way to extract CIGAR strings for read pairs from bam files with python?

Bioinformatics Asked by Mereven on June 16, 2021

I am working with bam files and I have to check if reads of a specific position or their mates are soft clipped. So, I am looking for a fast way to extract the read pairs from a bam file in python. So far, I use pysam and fetch reads of a given position. From those reads, I have access to the CIGAR string and can check them for soft clipping. My problem is that I cannot access the CIGAR of the read mates. Is there a way to access information of read mates with pysam or another tool, especially CIGAR strings, without iterating over the whole .bam file?

Unlike in this question Extracting the CIGAR string from a BAM via Python? I am looking for the CIGAR of both reads of a pair.

One Answer

I think that the easiest way is to work with query-name sorted groups of reads. In that case, mates will be adjacent in the sort and you can use that to extract the paired CIGARs.

If you are depending on position sorting to extract the alignments of interest efficiently, you can maybe do the following:

  1. extract alignments in the region of interest plus a lot of padding to ensure that you are also getting the mates (this will obviously only work with standard short-insert shotgun libraries).

  2. Query-name sort the list iterator using built-in sort() or sorted() with a key function that returns the query name.

  3. Iterate over the pairs using adjacency in the iterator to extract paired CIGARs. Discard unpaired reads (first reads whose query name does not match supposed mates).

Code would look something like this (I have not run or tested this in any way!):

import pysam

def get_query_key(read):
  """read is pysam.AlignedSegment"""
  return read.query_name

def get_paired_cigars(bam, chr, start, end, padding=1000):
  """bam is pysam.AlignmentFile"""
  reads = list(bam.fetch(bam, chr, start-padding, end+padding))  # may fail at chr ends
  reads.sort(key=get_query_key)
  first_cigar = None
  mate_cigar = None
  cigars = list()
  qn = ""
  for read in reads:
    if first_cigar is None:
      first_cigar = read.cigarstring
      qn = read.query_name
    elif mate_cigar is None:
      if read.query_name != qn:
        first_cigar = read.cigarstring
        qn = read.query_name
      else:
        mate_cigar = read.cigarstring
    if first_cigar is not None and mate_cigar is not None:
      cigars.append((first_cigar, mate_cigar))
      first_cigar = None
      mate_cigar = None

I agree that I would like this to be simpler! Unfortunately BAM files are not terribly well-designed for using pairing information at that level of detail IMO, I have always had to use something like this. I would be interested to see a straightforward alternative to this approach from someone with more experience.

Answered by Maximilian Press on June 16, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP