How to compact variant data to their genes without bias?

Bioinformatics Asked on December 30, 2021

I have a set of genes for which I am collecting data from public databases, to use as features in machine learning. I want to take some features from the UCSC Genome Browser (e.g. number of CpG islands per gene, number of DNase clusters per gene, regulatory enrichment scores), but I am not sure how to control for gene-length bias: a longer gene will have more CpG islands or a higher regulatory enrichment score simply because it is longer.

Is there a way to correct for gene length when taking/condensing variant data to individual genes?


For reference, my machine learning model aims to predict which gene is most likely to be causal for a disease (out of all the genes given to the model). The model scores each gene with a regression output between 0 and 1 (0 being least likely to cause disease and 1 being most likely). I plan to investigate the highest-scoring genes further.

The model uses a variety of multi-omic features (e.g. GTEx gene expression across many tissues, GWAS Catalog data, gene intolerance scores, protein-protein interaction data, drug interaction data, phenotypic scores). However, I am missing epigenetic data to describe my genes, so I've been looking to collect it from UCSC's variant-level data (CpG islands, histone modifications, DNase clusters). This is where the gene-length problem comes in: I need a reliable way to summarise variant-level data per gene.

I've been plotting my features against gene length and have seen that the UCSC epigenetic features do correlate with it: longer genes have higher counts of regulatory sites (r² of 0.8 for some features). This is what I'm looking to correct.
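As a rough illustration of the check described above, here is a minimal Python sketch that measures r² between a per-gene site count and gene length, and then again after dividing the count by length in kilobases. The data are synthetic, the helper `pearson_r2` and all variable names are made up for this example, and the per-kb density is only one possible correction, not something the UCSC tracks prescribe.

```python
import random


def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


random.seed(0)  # reproducible synthetic data, NOT real UCSC values

# Hypothetical genes between 1 kb and 200 kb long.
gene_length_bp = [random.randint(1_000, 200_000) for _ in range(500)]

# Assumed toy model: roughly one regulatory site per 5 kb, with +/-30% noise.
site_count = [
    round(length / 5_000 * random.uniform(0.7, 1.3))
    for length in gene_length_bp
]

# Candidate correction: sites per kilobase of gene.
sites_per_kb = [
    count / (length / 1_000)
    for count, length in zip(site_count, gene_length_bp)
]

r2_raw = pearson_r2(gene_length_bp, site_count)
r2_density = pearson_r2(gene_length_bp, sites_per_kb)
print(f"r2 (raw count vs length):    {r2_raw:.2f}")
print(f"r2 (sites per kb vs length): {r2_density:.2f}")
```

In this toy setup the raw count tracks length closely while the per-kb density does not. Real tracks may behave differently, so the same check is worth repeating on the actual features.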

One Answer

It's very easy: just let the ML sort this out for you, since that is its advantage. You're thinking of a GLM-style analysis, where you pre-screen the data with bivariate plots and need nice Q-Q plots and low residuals.

For ML, simply include gene length as one of your features alongside CpG count etc., and the ML method (support vector machines, lasso, ridge, random forest) will figure out the relationship between gene length and CpG. You do nothing and the ML does everything. Statistical purists object to this because you don't know what relationship the model has deduced between the variables, but for non-DNN models you will get regression weights or feature importances, which give you some idea of the impact of length.
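To make the idea above concrete, here is a minimal sketch of fitting a ridge regression with gene length included as a feature next to a CpG count, then reading off the weights. It is written in plain Python so that it runs without dependencies. The data are synthetic, and `fit_ridge_two_features`, the variable names, and the assumption that the outcome depends only on the length-independent part of the count are all invented for illustration.

```python
import random


def fit_ridge_two_features(x1, x2, y, alpha=1.0):
    """Closed-form ridge fit of y ~ w1*x1 + w2*x2 + b (intercept not penalised)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    # Entries of (X^T X + alpha * I) and X^T y for the centred data.
    a = sum(v * v for v in c1) + alpha
    b = sum(u * v for u, v in zip(c1, c2))
    d = sum(v * v for v in c2) + alpha
    g1 = sum(u * v for u, v in zip(c1, cy))
    g2 = sum(u * v for u, v in zip(c2, cy))
    det = a * d - b * b
    w1 = (g1 * d - b * g2) / det
    w2 = (a * g2 - b * g1) / det
    return w1, w2, my - w1 * m1 - w2 * m2


random.seed(1)  # reproducible synthetic data, NOT real measurements
n_genes = 500
length_kb = [random.uniform(1, 200) for _ in range(n_genes)]
# Length-independent part of the CpG signal (the informative part, by assumption).
excess = [random.gauss(0, 3) for _ in range(n_genes)]
# Toy CpG value: grows with length, plus the excess.
cpg_count = [10 + 0.2 * length + e for length, e in zip(length_kb, excess)]
# Toy outcome driven only by the excess, with a little noise.
score = [0.5 + 0.1 * e + random.gauss(0, 0.1) for e in excess]

w_cpg, w_length, intercept = fit_ridge_two_features(cpg_count, length_kb, score)
print(f"weight for CpG count:   {w_cpg:+.3f}")
print(f"weight for gene length: {w_length:+.3f}")
```

With both features available, the fit gives the CpG count a positive weight and gene length a negative one, which is the model discounting length on its own. In practice a library such as scikit-learn (`Ridge`, `Lasso`, `RandomForestRegressor`) would replace the hand-written solver.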

There is the issue of transformations, which can be complicated, but I'd try untransformed data first. The only disadvantage of this approach is that users will have to supply the gene length when they want to run your trained model.

Answered by M__ on December 30, 2021
