How identifiable are human omics data and how to mitigate their identifying features?

Bioinformatics Asked on November 4, 2021

Say a database were to store human omics datasets. The human subjects are known and the sample size is initially rather small (n=500). The database contains genomics, transcriptomics, proteomics, gut metabolomics, and epigenomics data. Because the sample is small, it is important to mitigate identifying features so that individual subjects cannot be identified. This leads me to two questions:

  1. What identifiable features could be present in these different types of raw data (such as race, sex, age, hair color, eye color, height, where someone has lived, and anything else I may not be considering)? Which of these omics types would be the most and least dangerous for identifying individuals? How can the identifiable nature of these data be mitigated?

  2. If environmental metagenomics data are also collected by these human subjects, is it possible to identify the subjects through contamination (i.e., some of the metagenomic reads inadvertently being human reads)? How can the identifiable nature of these data be mitigated?

I think this subject may be a bit futuristic, but I am very uninformed. If there are any references that provide additional thinking about these topics, please kindly share. Thank you for sharing your thoughts.

2 Answers

PPK provides a great answer, but for question 2 I can offer a different perspective. For shotgun metagenome sequencing (without any enrichment/depletion protocols), it is common for >90% of reads to map to the human genome. This varies across body sites: for example, the gut microbiome is rich, so the percentage of human reads in a stool sample will be lower than in, say, a blood sample.

Check out this study, in which the authors mined human microbiome datasets for host (human) reads and were able to reconstruct draft full human genomes, sometimes with 20X coverage. See also part of their discussion:

We show here that it is possible to reconstruct complete host genomes using metagenomic sequence data, which is potentially identifiable. However, this was possible due to the unique study design of the HMP, whereby multiple body sites from each individual were sequenced at a high depth, allowing us to pool data across body sites and reach a 10x mean coverage per host genome. Common metagenomic shotgun sequencing studies, which usually include an order of magnitude less sequence data, are unlikely to enable such an analysis. Moreover, the majority of studies sequence stool samples, which include many fewer host-derived reads. Nevertheless, we anticipate that future shotgun metagenomics sequencing studies would consider these potential privacy concerns.
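The coverage argument in that quote can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only an illustration: the read counts, host-read fractions, 100 bp read length, and the ~3.1 Gb genome size are all assumed values chosen for the example, not figures from the study.

```python
# Rough estimate of host (human) genome coverage recoverable from
# shotgun metagenomic samples. All numbers below are hypothetical.

HUMAN_GENOME_BP = 3.1e9  # assumed approximate haploid human genome size


def host_coverage(samples, read_length_bp=100):
    """Return the mean host genome coverage pooled over all samples.

    samples: iterable of (total_reads, host_fraction) tuples, where
             host_fraction is the share of reads mapping to the human genome.
    """
    host_bases = 0.0
    for total_reads, host_fraction in samples:
        if not 0.0 <= host_fraction <= 1.0:
            raise ValueError("host_fraction must be between 0 and 1")
        host_bases += total_reads * host_fraction * read_length_bp
    return host_bases / HUMAN_GENOME_BP


# A single stool sample with few host reads (hypothetical values).
stool_only = [(50_000_000, 0.01)]

# The same individual sequenced deeply at several host-rich body sites
# (hypothetical values), pooled as described in the quote above.
pooled_sites = [
    (50_000_000, 0.01),
    (100_000_000, 0.90),
    (100_000_000, 0.80),
    (100_000_000, 0.95),
]

print(f"stool only:   {host_coverage(stool_only):.3f}x")
print(f"pooled sites: {host_coverage(pooled_sites):.2f}x")
```

Under these assumptions a single stool sample yields far less than 1x host coverage, whereas pooling deep, host-rich samples from one person gets close to the coverage range mentioned in the quote. Real yields depend on sequencing depth, read length, and mapping rates.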

Answered by Chris_Rands on November 4, 2021

This is one of the major problems with genomic information in today's research. It was highlighted some years ago when police used publicly available genomic databases to identify unknown murder suspects.

The full extent of this issue was exemplified in a 2018 paper by Yaniv Erlich, which should be a good starting point for you.
The authors claim that, given the amount of genetic information available at the time (2018), more than half of the searches for a person of European descent will result in a third-cousin or better match. And these odds will only improve with time.

Genetic information can therefore not only hint at inherited traits, such as a person's approximate height, but can even lead to their name and address.

Regarding your second question, I would think that the amount of host (i.e., human) material collected alongside the microbiome sample is rather small. This is probably not sufficient to obtain enough coverage for a profile. But I could be mistaken.

Answered by PPK on November 4, 2021