Summarising read group information from a .bam file

Question

I have merged two different .bam files in order to simulate sample contamination, so each read can come from one of two samples, as shown by the read group information in the header:
@RG ID:0    PL:ILLUMINA SM:LP4100018-DNA_C11_Proband    PU:HGY3WDSXX:1:none
@RG ID:1    PL:ILLUMINA SM:LP4100018-DNA_C11_Proband    PU:HGY3WDSXX:2:none
@RG ID:2    PL:ILLUMINA SM:LP4100018-DNA_C11_Proband    PU:HGY3WDSXX:3:none
@RG ID:3    PL:ILLUMINA SM:LP4100018-DNA_C11_Proband    PU:HGY3WDSXX:4:none
@RG ID:0-11EFC00B   PL:ILLUMINA SM:LP4100018-DNA_E11_Proband    PU:HGY3WDSXX:1:none
@RG ID:1-B8A1099    PL:ILLUMINA SM:LP4100018-DNA_E11_Proband    PU:HGY3WDSXX:2:none
@RG ID:2-330086F    PL:ILLUMINA SM:LP4100018-DNA_E11_Proband    PU:HGY3WDSXX:3:none
@RG ID:3-7681F092   PL:ILLUMINA SM:LP4100018-DNA_E11_Proband    PU:HGY3WDSXX:4:none

I'd like to check that the correct proportion of reads originates from each sample.
Currently I am using:
samtools view example.bam | rev | cut -f 1 | rev > output.txt

This is not very elegant, and it only works because the RG tag happens to be the last field of every record in this .bam.
Is there a quick way to tabulate the number of reads in each read group? E.g. produce output like:
ID:0 1000
ID:1 2000
ID:2 3000
...

A solution in samtools would be ideal, along the lines of the output produced in samtools stats.
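
For reference, a minimal sketch of the kind of tabulation I mean, assuming only that awk and sort are available. The function name count_rg is made up for illustration. It looks for the RG:Z: tag among the optional fields of each record, so it should not depend on the tag being last, and it reports reads without an RG tag under "none":

```shell
# count_rg (hypothetical helper): read SAM alignment records on stdin
# and print one line per read group in the form "ID:<read group> <count>".
count_rg() {
    awk -F '\t' '
        /^@/ { next }                       # skip header lines if present
        {
            rg = "none"                     # label for reads with no RG tag
            for (i = 12; i <= NF; i++) {    # optional tags start at column 12
                if ($i ~ /^RG:Z:/) { rg = substr($i, 6); break }
            }
            counts[rg]++
        }
        END { for (rg in counts) print "ID:" rg, counts[rg] }
    ' | sort
}
```

It would be used as `samtools view example.bam | count_rg`. For a single read group, `samtools view -c -r 0 example.bam` should also give a count, but that needs one call per ID rather than producing a table.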
