
How can I subset WGS data to the level of WES variants?

Bioinformatics Asked by jared_mamrot on February 12, 2021

I would like to compare mutational signatures [1] in patients from different studies. However, some studies are based on whole-exome sequencing (WES, ~20,000 coding variants) and some on whole-genome sequencing (WGS, ~22,000 coding variants). Is there a way to 'downsample' the WGS data to better reflect the WES data, effectively ignoring the coordinates of those ~2,000 extra coding variants in the VCF files?

[1] Alexandrov, L.B., Kim, J., Haradhvala, N.J., Huang, M.N., Ng, A.W.T., Wu, Y., Boot, A., Covington, K.R., Gordenin, D.A., Bergstrom, E.N. and Islam, S.A., 2020. The repertoire of mutational signatures in human cancer. Nature, 578(7793), pp.94-101.

2 Answers

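bcftools would be my choice, though I'm sure bedtools could do the trick too. Something along these lines (regions.bed and input.vcf.gz are placeholder names):

bcftools view --regions-file regions.bed input.vcf.gz

Alternatively, use --targets (or --targets-file). Note that --regions-file requires the input VCF to be bgzip-compressed and indexed (tabix or CSI), whereas the --targets options stream through the file and do not need an index.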
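To make the region-based approach concrete, here is a minimal sketch. It assumes you have a BED file of the exome capture targets and an uncompressed WGS VCF; the function name and all file names are hypothetical. It compresses and indexes the WGS file first, because --regions-file (-R) needs an index. As far as I know, bcftools only treats the regions file as 0-based BED when its name ends in .bed or .bed.gz, and the chromosome names in it must match the VCF (e.g. "chr1" vs "1").

```shell
# Sketch only: subset a WGS VCF to the regions listed in a BED file.
# All names are illustrative assumptions, not taken from the question.
subset_to_targets() {
    wgs_vcf=$1      # uncompressed WGS VCF, e.g. whole_genome.vcf
    targets_bed=$2  # capture regions, e.g. exome_targets.bed
    out_vcf=$3      # bgzip-compressed output, e.g. whole_genome.exome.vcf.gz

    # -R/--regions-file needs a bgzip-compressed, indexed input,
    # so compress (-Oz) and build a tabix index (-t) first.
    bcftools view -Oz -o "${wgs_vcf}.gz" "$wgs_vcf" &&
    bcftools index -t -f "${wgs_vcf}.gz" &&
    # Keep only the records that overlap the target regions.
    bcftools view -R "$targets_bed" -Oz -o "$out_vcf" "${wgs_vcf}.gz"
}

# Example call (hypothetical file names):
# subset_to_targets whole_genome.vcf exome_targets.bed whole_genome.exome.vcf.gz
```

Using the capture regions rather than individual variant positions keeps every WGS variant that falls inside the exome footprint, which is probably closer to what a WES experiment would have seen.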

Answered by Carambakaracho on February 12, 2021

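I'll expand slightly on the previous answer. First print the positions from the exome file, then use bcftools view to keep only those positions from the whole-genome file. With -T the whole-genome file is streamed from start to finish; if you bgzip and index it, you could use -R instead, which uses the index to jump to the listed positions and can make the filtering faster.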

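# Write one tab-separated CHROM/POS line per exome variant
bcftools query \
    -f '%CHROM\t%POS\n' \
    exome.vcf > exome_variants.txt

# Keep only the WGS records at those positions
bcftools view \
    -T exome_variants.txt \
    whole_genome.vcf > whole_genome.exome_positions.vcf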
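
As a quick sanity check, you can count the variant records before and after filtering. This is a minimal sketch for uncompressed VCFs using plain grep; the helper name count_variants is made up for illustration, and the file names are the ones used in the commands above.

```shell
# Count variant records (non-header lines) in an uncompressed VCF.
# Header lines start with '#', so count everything that does not.
count_variants() {
    grep -vc '^#' "$1"
}

# Example calls, using the file names from the commands above:
# count_variants whole_genome.vcf
# count_variants whole_genome.exome_positions.vcf
```

If the filtered count is far below what you expect, it is worth checking that both files use the same chromosome naming and the same reference build.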

Answered by user438383 on February 12, 2021
