TransWikia.com

How can I use my MyHeritage DNA results file for further analysis?

Bioinformatics Asked by user3390486 on March 19, 2021

I had my DNA tested by MyHeritage, and they sent me a CSV file of about 700,000 rows with RSID, chromosome, position and result (which base).
I understand that most DNA analyses use VCF files, but is there anything I can do with this CSV file, e.g. check for health-related genes?
I am not a bioinformatician, but I am a scientist and I can use R and Python. I've heard of the gnomAD database, but I'm not sure whether I can match its entries to my CSV file.

One Answer

A CSV file with RSID, chromosome, position and result is enough for what you want to do; these are the core columns of a VCF (which is just a TSV with some header lines describing how it was made).
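As a minimal sketch of that correspondence, the CSV can be loaded with pandas and its columns renamed to their VCF counterparts. The column names (RSID, CHROMOSOME, POSITION, RESULT), the '#' comment lines and the two-letter genotype are assumptions about the export format, and the sample rows are made up, so check them against your own file.

```python
import io
import pandas as pd

# Made-up stand-in for the export; pass your file path instead of io.StringIO(sample).
# Assumed layout: '#' comment lines, then a header row and quoted values.
sample = '''# hypothetical header comment
RSID,CHROMOSOME,POSITION,RESULT
"rs0000001","1","930336","AG"
"rs0000002","X","2700157","CC"
'''

genotypes = pd.read_csv(
    io.StringIO(sample),
    comment='#',                                       # skip comment lines
    dtype={'CHROMOSOME': str, 'POSITION': 'int64'},    # keep X, Y, MT as text
).rename(columns={
    'RSID': 'ID',
    'CHROMOSOME': 'CHROM',
    'POSITION': 'POS',
    'RESULT': 'GENOTYPE',
})

print(genotypes)
```

Reading the chromosome as text avoids a column that mixes numbers with labels such as X and Y, which makes later comparisons and joins easier.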

Given that you have ~700k rows, I suspect that yes, there will be health-related SNVs (single nucleotide variants) among them. Disclaimer: these data could include information about your health and your family's health, so I strongly advise that you speak to a genetic counsellor to understand these things.

gnomAD is a population database, so generally speaking (with plenty of exceptions, of course), if an SNV appears in it more than a few times, it's probably not causing disease. https://www.ncbi.nlm.nih.gov/clinvar/ and https://omim.org/ are examples of disease databases.
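That rule of thumb can be sketched as a simple frequency check. Rather than querying gnomAD itself, this sketch reads the population frequencies that some ClinVar records carry in their INFO field (AF_ESP, AF_EXAC, AF_TGP, as seen in the sample output further down). The helper names and the 1% cutoff are assumptions for illustration, not a clinical standard.

```python
import re

# Matches population-frequency entries such as AF_EXAC=0.00622 in an INFO string.
AF_PATTERN = re.compile(r'(?:^|;)(AF_[A-Z]+)=([0-9.eE+-]+)')

def allele_frequencies(info):
    """Return {field: frequency} for every AF_* entry in a VCF INFO string."""
    return {name: float(value) for name, value in AF_PATTERN.findall(info)}

def looks_common(info, cutoff=0.01):
    """True if any reported population frequency reaches the (assumed) cutoff."""
    return any(af >= cutoff for af in allele_frequencies(info).values())

# Visible part of one INFO value from the ClinVar sample output.
info = 'AF_ESP=0.00347;AF_EXAC=0.00622;AF_TGP=0.00280'
print(allele_frequencies(info))
print(looks_common(info))
```

A record with no AF_* entries yields an empty dict, so `looks_common` returns False for it; absence of a frequency is not evidence that a variant is rare.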

Here's a pandas example with ClinVar. The VCF comes from https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/ (you'll need the directory that matches the genome build of your data; there is also a vcf_GRCh37).

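import pandas as pd

# Load the ClinVar VCF; pandas decompresses the .gz file automatically.
df = pd.read_csv(
    'clinvar.vcf.gz',
    sep='\t',        # VCF columns are tab-separated
    comment='#',     # skip the VCF header lines
    header=None,
    names=['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'],
)
df.head()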

  CHROM     POS      ID REF ALT QUAL FILTER                                               INFO
0     1  930188  846933   G   A    .      .  ALLELEID=824438;CLNDISDB=MedGen:CN517202;CLNDN...
1     1  930203  972363   C   T    .      .  ALLELEID=959431;CLNDISDB=MedGen:CN517202;CLNDN...
2     1  930248  789256   G   A    .      .  AF_ESP=0.00347;AF_EXAC=0.00622;AF_TGP=0.00280;...
3     1  930275  969662   T   G    .      .  ALLELEID=959432;CLNDISDB=MedGen:CN517202;CLNDN...
4     1  930336  843786   G   A    .      .  ALLELEID=824439;CLNDISDB=MedGen:CN517202;CLNDN...

Then let's say one of your 700k SNVs is at chromosome 1, position 930336, and you have an A:

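# Compare CHROM as text so the match works whether pandas read it as a number or a string.
hit = df[(df['CHROM'].astype(str) == '1') & (df['POS'] == 930336) & (df['ALT'] == 'A')]
print(*hit['INFO'].str.split(';'))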

This gives a list from which you can pull out CLNSIG, which here is Uncertain_significance:

['ALLELEID=824439', 'CLNDISDB=MedGen:CN517202', 'CLNDN=not_provided', 'CLNHGVS=NC_000001.11:g.930336G>A', 'CLNREVSTAT=criteria_provided,_single_submitter', 'CLNSIG=Uncertain_significance', 'CLNVC=single_nucleotide_variant', 'CLNVCSO=SO:0001483', 'GENEINFO=SAMD11:148398', 'MC=SO:0001583|missense_variant', 'ORIGIN=1']
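To pull out CLNSIG without scanning the list by eye, the INFO string can be turned into a dictionary. This is a minimal sketch; `parse_info` is a hypothetical helper name, and the string is the INFO value of the record shown above.

```python
info = (
    'ALLELEID=824439;CLNDISDB=MedGen:CN517202;CLNDN=not_provided;'
    'CLNHGVS=NC_000001.11:g.930336G>A;'
    'CLNREVSTAT=criteria_provided,_single_submitter;'
    'CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;'
    'CLNVCSO=SO:0001483;GENEINFO=SAMD11:148398;'
    'MC=SO:0001583|missense_variant;ORIGIN=1'
)

def parse_info(info):
    """Split a VCF INFO string into a dict; entries without '=' map to True."""
    fields = {}
    for item in info.split(';'):
        key, sep, value = item.partition('=')
        fields[key] = value if sep else True
    return fields

fields = parse_info(info)
print(fields['CLNSIG'])  # prints Uncertain_significance
```

Using `fields.get('CLNSIG')` instead of indexing returns None for records that lack the field.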

So you could easily parse your 700k SNVs and get their respective clinical information this way. Please think about this carefully before you do it, though: there are many implications, including for life insurance and mental health! I'm of the opinion that this sort of data is useful, but MANY people disagree. Further, this advice is general in nature only; I accept no liability for what you or anyone else chooses to do with this publicly available data.
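Doing this for every row at once amounts to a join on chromosome and position. The sketch below uses two tiny stand-in tables so that it runs on its own; the `genotypes` column names and the two-letter GENOTYPE value are assumptions about the testing company's file, and the check that the ALT base appears in the genotype is only meaningful for single-base variants reported on the same strand and genome build as ClinVar.

```python
import pandas as pd

# Stand-ins: `genotypes` mimics the personal CSV (hypothetical column names),
# `clinvar` mimics the ClinVar table loaded above, with INFO shortened.
genotypes = pd.DataFrame({
    'CHROM': ['1', '1'],
    'POS': [930336, 930248],
    'GENOTYPE': ['AG', 'GG'],
})
clinvar = pd.DataFrame({
    'CHROM': [1, 1],
    'POS': [930336, 930248],
    'REF': ['G', 'G'],
    'ALT': ['A', 'A'],
    'INFO': [
        'CLNSIG=Uncertain_significance;ORIGIN=1',
        'AF_ESP=0.00347;AF_EXAC=0.00622;AF_TGP=0.00280',
    ],
})

# Join keys must have the same type on both sides.
clinvar['CHROM'] = clinvar['CHROM'].astype(str)
merged = genotypes.merge(clinvar, on=['CHROM', 'POS'], how='inner')

# Keep rows where the ClinVar ALT base is actually present in the genotype.
carries_alt = pd.Series(
    [alt in gt for alt, gt in zip(merged['ALT'], merged['GENOTYPE'])],
    index=merged.index,
)
hits = merged[carries_alt].copy()

# Pull the clinical significance out of INFO (NaN when the field is absent).
hits['CLNSIG'] = hits['INFO'].str.extract(r'(?:^|;)CLNSIG=([^;]+)', expand=False)
print(hits[['CHROM', 'POS', 'GENOTYPE', 'ALT', 'CLNSIG']])
```

Here only the row at position 930336 survives, because the GG genotype at 930248 does not contain the ALT base A.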

Answered by Liam McIntyre on March 19, 2021
