How identifiable are human omics data and how to mitigate their identifying features?

Question

Say a database were to store human omics datasets. The human subjects are known and the sample size is initially rather small (n=500). The database contains genomics, transcriptomics, proteomics, gut metabolomics, and epigenomics data. Because the sample size is small, it is important to mitigate identifying features so that individual subjects cannot be identified. This leads me to two questions:

What identifiable features could be present in these different types of raw data (such as race, sex, age, hair color, eye color, height, where someone has lived, and anything else I may not be considering)? Which of these omics types would be the most and least dangerous for identifying individuals? How can the identifiable nature of these data be mitigated?

If environmental metagenomics data are also collected by these human subjects, is it possible to identify the subjects through contamination (i.e., some of the reads in the metagenomics data are inadvertently human)? How can the identifiable nature of these data be mitigated?

I think this subject may be a bit futuristic, but I am very uninformed. If there are any references that provide additional thinking about these topics, please kindly share. Thank you for sharing your thoughts.

Chris_Rands · Answer

PPK provides a great answer, but for question 2 I can provide a different perspective. For shotgun metagenome sequencing (without any enrichment/depletion protocols) it is common for >90% of reads to map to the human genome. There is variation across body sites: for example, the gut microbiome is rich, so the percentage of human reads in a stool sample will be lower than in, say, a blood sample.
Check out this study, where they mined human microbiome datasets for host (human) reads and were able to reconstruct draft full human genomes, sometimes with 20X coverage. See also part of their discussion:

We show here that it is possible to reconstruct complete host genomes
using metagenomic sequence data, which is potentially identifiable.
However, this was possible due to the unique study design of the HMP,
whereby multiple body sites from each individual were sequenced at a
high depth, allowing us to pool data across body sites and reach a 10x
mean coverage per host genome. Common metagenomic shotgun sequencing
studies, which usually include an order of magnitude less sequence
data, are unlikely to enable such an analysis. Moreover, the majority
of studies sequence stool samples, which include many fewer
host-derived reads. Nevertheless, we anticipate that future shotgun
metagenomics sequencing studies would consider these potential privacy
concerns.
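
To make the coverage argument in the quote concrete, here is a rough back-of-the-envelope sketch (not taken from the study). It assumes mean depth is simply the number of human-derived bases divided by the genome size; the function name, the 3.1 Gb genome size, and the example read counts and fractions are illustrative assumptions, not measured values.

```python
def expected_host_coverage(total_reads, read_length_bp, human_fraction,
                           genome_size_bp=3.1e9):
    """Estimate mean depth of coverage of the host (human) genome.

    total_reads     -- number of reads in the metagenomic sample (or pooled samples)
    read_length_bp  -- average read length in base pairs
    human_fraction  -- fraction of reads that are host-derived (0.0 to 1.0)
    genome_size_bp  -- assumed haploid human genome size (about 3.1 Gb)
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must be between 0 and 1")
    human_bases = total_reads * read_length_bp * human_fraction
    return human_bases / genome_size_bp


# Hypothetical samples: the same sequencing effort, very different host content.
high_host = expected_host_coverage(50_000_000, 100, 0.90)  # e.g. a host-rich body site
low_host = expected_host_coverage(50_000_000, 100, 0.01)   # e.g. a stool sample

print(f"host-rich sample: {high_host:.2f}x")
print(f"stool-like sample: {low_host:.3f}x")
```

Pooling several samples from the same individual adds their human-derived bases together, which is how a study design with many deeply sequenced body sites per person can reach a much higher depth than any single sample.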

PPK · Answer

This is one of the major problems with genomic information in today's research. It was highlighted some years ago when the police used publicly available genomic databases to identify unknown murder suspects.
The full extent of this issue was exemplified in a paper by Yaniv Erlich from 2018 that should be a good starting point for you.
They claim that, given the amount of genetic information available at the time (2018), more than half of the searches for a person of European descent will result in a third-cousin or closer match. And these odds will only improve with time.
Therefore, genetic information can not only give you a hint at inherited variables like the approximate height of a person, but can even yield their name and address.
As for your second question, I would think that the amount of host (i.e., human) material collected alongside the microbiome sample is rather small. This is probably not sufficient to obtain enough coverage for a profile. But I could be mistaken.
