TransWikia.com

Summarising read group information from a .bam file

Bioinformatics Asked by user438383 on April 25, 2021

I have merged two different .bam files in order to simulate sample contamination, so each read comes from one of two samples, as shown by the read group (@RG) header lines:

@RG ID:0    PL:ILLUMINA SM:LP4100018-DNA_C11_Proband    PU:HGY3WDSXX:1:none
@RG ID:1    PL:ILLUMINA SM:LP4100018-DNA_C11_Proband    PU:HGY3WDSXX:2:none
@RG ID:2    PL:ILLUMINA SM:LP4100018-DNA_C11_Proband    PU:HGY3WDSXX:3:none
@RG ID:3    PL:ILLUMINA SM:LP4100018-DNA_C11_Proband    PU:HGY3WDSXX:4:none
@RG ID:0-11EFC00B   PL:ILLUMINA SM:LP4100018-DNA_E11_Proband    PU:HGY3WDSXX:1:none
@RG ID:1-B8A1099    PL:ILLUMINA SM:LP4100018-DNA_E11_Proband    PU:HGY3WDSXX:2:none
@RG ID:2-330086F    PL:ILLUMINA SM:LP4100018-DNA_E11_Proband    PU:HGY3WDSXX:3:none
@RG ID:3-7681F092   PL:ILLUMINA SM:LP4100018-DNA_E11_Proband    PU:HGY3WDSXX:4:none

I’d like to check that the correct proportion of reads originates from each sample.

Currently I am using:

samtools view example.bam | rev | cut -f 1 | rev > output.txt

This is not very elegant, and it only works because the RG tag happens to be the last field of every record in the .bam.

Is there a quick way to tabulate the number of reads belonging to each read group ID? E.g. produce an output like:

ID:0 1000
ID:1 2000
ID:2 3000
...

A solution in samtools would be ideal, along the lines of the output produced in samtools stats.
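
One possible approach, offered only as a minimal sketch: rather than relying on RG being the last field, scan the optional tag columns (column 12 onwards in SAM text) for the RG:Z: tag and count each value. The helper name count_rg is hypothetical and not part of samtools; it assumes tab-separated SAM records on standard input, such as those printed by samtools view, and reports reads lacking an RG tag under "(none)".

```shell
# Count reads per read group from SAM text on stdin.
# count_rg is a hypothetical helper name, not a samtools command.
count_rg() {
    awk -F'\t' '
        /^@/ { next }                      # skip header lines, if any are present
        {
            rg = "(none)"                  # label for reads without an RG tag
            for (i = 12; i <= NF; i++) {   # optional tags start at column 12
                if ($i ~ /^RG:Z:/) { rg = substr($i, 6); break }
            }
            count[rg]++
        }
        END { for (rg in count) printf "ID:%s\t%d\n", rg, count[rg] }
    ' | sort
}

# Usage (assumes samtools is on PATH and example.bam exists):
# samtools view example.bam | count_rg
```

To spot-check a single group, samtools view -c -r 0 example.bam should print the number of reads in read group 0 only, since -r restricts output to one read group and -c counts instead of printing records.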
