TransWikia.com

Parsing .vcf file for this information

Bioinformatics Asked on June 17, 2021

I have a .vcf file

https://www.dropbox.com/s/8v73nppwg3a1tnd/LP2000109-DNA_A01_vs_LP2000103-DNA_A01.SVannotated.txt?dl=0

with this header

##startTime=Fri Mar 29 16:46:32 2019
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR
1   54586   .   T   C   .   PASS    DP=39;MQ=50.55;MQ0=0;NT=ref;QSS=48;QSS_NT=48;ReadPosRankSum=1.92;SGT=TT->CT;SNVSB=0.00;SOMATIC;SomaticEVS=10.83;TQSS=1;TQSS_NT=1    AU:CU:DP:FDP:GU:SDP:SUBDP:TU    0,0:0,0:20:0:0,0:0:0:20,20  0,0:6,6:18:0:0,0:0:0:12,13
1   103241  .   C   T   .   PASS    DP=120;MQ=24.94;MQ0=35;NT=ref;QSS=47;QSS_NT=47;ReadPosRankSum=2.09;SGT=CC->CT;SNVSB=0.00;SOMATIC;SomaticEVS=9.44;TQSS=2;TQSS_NT=2   AU:CU:DP:FDP:GU:SDP:SUBDP:TU    0,1:32,47:33:1:0,0:0:0:0,5  0,

The "DP" entry in the FORMAT column of the VCF gives the read depth of each individual sample. In this file, the first locus has the following FORMAT and sample fields:

AU:CU:DP:FDP:GU:SDP:SUBDP:TU    0,0:0,0:20:0:0,0:0:0:20,20  0,0:6,6:18:0:0,0:0:0:12,13

According to the DP values of the NORMAL and TUMOR columns, the normal sample has a depth of 20 and the tumor sample has a depth of 18.

How can I extract the read depth for all loci, as described for the first position? The desired output would look like this (note that the VCF above is taken from my own data, but the table below only illustrates the format I want, and I don't know how to produce it from my data; the "chr" prefix was added manually because my reference genome is hg19):

  Sample   Type   CHROM       POS    REF  ALT       Tumor_Depth Normal_Depth
  CHC2432T SNV    chr1  102961055    G    A                  64           62      
  CHC2432T SNV    chr1  105492588    A    T                  66           73     
  CHC2432T SNV    chr1  108628724    C    T                  45           54    
  CHC2432T SNV    chr1  109692113    G    T                  53           29     
  CHC2432T SNV    chr1  109692114    G    T                  53           31     
  CHC2432T SNV    chr1  120676701    T    C                  48           87   

2 Answers

To extract the DP fields from a VCF file, you could use a tool like bcftools query:

Extracts fields from VCF or BCF files and outputs them in user-defined format.

You could start from something like this:

bcftools query -Hf 'CHC2432T\t%TYPE\t%CHROM\t%POS\t%REF\t%ALT[\t%DP]\n' file.vcf
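
If bcftools is not available, the same per-sample DP lookup can be sketched in plain Python. This is only an illustration under stated assumptions: the sample columns are named NORMAL and TUMOR as in the question's header, every record carries a numeric DP key in FORMAT, and the function name `extract_depths`, the `sample_label` and `chrom_prefix` parameters, and the crude SNV/OTHER typing are hypothetical choices rather than anything bcftools does.

```python
import io

def extract_depths(vcf_lines, sample_label="CHC2432T", chrom_prefix="chr"):
    """Yield (sample, type, chrom, pos, ref, alt, tumor_dp, normal_dp) per record."""
    columns = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#"):
            columns = fields  # the #CHROM line names the columns, including samples
            continue
        if columns is None:
            raise ValueError("record found before the #CHROM header line")
        record = dict(zip(columns, fields))
        # FORMAT lists the per-sample keys in order; find where DP sits.
        dp_index = record["FORMAT"].split(":").index("DP")
        normal_dp = int(record["NORMAL"].split(":")[dp_index])
        tumor_dp = int(record["TUMOR"].split(":")[dp_index])
        ref, alt = record["REF"], record["ALT"]
        # Crude assumption: single-base REF and ALT is an SNV, anything else "OTHER".
        var_type = "SNV" if len(ref) == 1 and len(alt) == 1 else "OTHER"
        yield (sample_label, var_type, chrom_prefix + record["#CHROM"],
               int(record["POS"]), ref, alt, tumor_dp, normal_dp)

# First record from the question (INFO column shortened for readability).
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "NORMAL", "TUMOR"]
first = ["1", "54586", ".", "T", "C", ".", "PASS", "DP=39;SOMATIC",
         "AU:CU:DP:FDP:GU:SDP:SUBDP:TU",
         "0,0:0,0:20:0:0,0:0:0:20,20", "0,0:6,6:18:0:0,0:0:0:12,13"]
example_vcf = ("##startTime=Fri Mar 29 16:46:32 2019\n"
               + "\t".join(header) + "\n" + "\t".join(first) + "\n")

rows = list(extract_depths(io.StringIO(example_vcf)))
for row in rows:
    print("\t".join(str(value) for value in row))
```

For a real file, passing an open file handle (for example `open("file.vcf")`) in place of the `io.StringIO` object should work the same way, since the function only iterates over lines.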

Answered by Jukka Matilainen on June 17, 2021

You can do the extraction part with the GATK tool VariantsToTable, as described here:

https://gatk.broadinstitute.org/hc/en-us/articles/360041414592-VariantsToTable

The usage example from that doc:

gatk VariantsToTable \
    -V input.vcf \
    -F CHROM -F POS -F TYPE -GF AD \
    -O output.table

would produce a file that looks like:

 CHROM  POS        TYPE   HSCX1010N.AD  HSCX1010T.AD
 1      31782997   SNP    77,0          53,4
 1      40125052   SNP    97,0          92,7
 1      65068538   SNP    49,0          35,4
 1      111146235  SNP    69,1          77,4

You might still need to reorder columns, etc., but this should at least get the values out in a tabular format that is easier to work with.
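
The column reordering could be done with a few lines of Python. The sketch below assumes a tab-delimited table produced with `-F CHROM -F POS -F TYPE -F REF -F ALT -GF DP` on the question's VCF, so that the per-sample columns are named `NORMAL.DP` and `TUMOR.DP`; those column names, the `reorder_table` function, and the example table contents are assumptions for illustration, not output taken from an actual GATK run.

```python
import csv
import io

OUTPUT_COLUMNS = ["Sample", "Type", "CHROM", "POS", "REF", "ALT",
                  "Tumor_Depth", "Normal_Depth"]

def reorder_table(table_text, sample_label="CHC2432T",
                  tumor_col="TUMOR.DP", normal_col="NORMAL.DP"):
    """Return the table as text with columns in the order the question asks for."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(OUTPUT_COLUMNS)
    for row in reader:
        # Pick columns by name, so the input column order does not matter.
        writer.writerow([sample_label, row["TYPE"], row["CHROM"], row["POS"],
                         row["REF"], row["ALT"], row[tumor_col], row[normal_col]])
    return out.getvalue()

# Hypothetical table with the depths of the first record in the question.
example_table = ("CHROM\tPOS\tTYPE\tREF\tALT\tNORMAL.DP\tTUMOR.DP\n"
                 "1\t54586\tSNP\tT\tC\t20\t18\n")

reordered = reorder_table(example_table)
print(reordered, end="")
```

Selecting columns by header name rather than by position keeps the sketch usable if the `-F` and `-GF` arguments are given in a different order.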

Answered by Geraldine_VdAuwera on June 17, 2021
