TransWikia.com

How to deal with the mismatches between gene names obtained from different sources?

Bioinformatics Asked on June 22, 2021

For most of the time, I rely on gene ids to combine different datasets. However, in some instances, I have to combine datasets based on gene names. Then, if I don’t know the source of gene names in the dataset, I get to this issue of choosing a source of gene names, be it Ensembl, HGNC etc for human genes. I wonder if this is a common issue and if there is an reliable method out there to deal with this issue.

To demonstrate the mismatch between different sources, I compared gene names for all human genes. I obtained them from 4 different sources as listed below, using BioMart (pybiomart) :

+-----------------+-----------------------+--------------------------------------------+
|     source      |     attribute_name    |                display_name                |
+-----------------+-----------------------+--------------------------------------------+
| HGNC            |  hgnc_symbol          |  HGNC symbol                               |
| NCBI            |  entrezgene_accession |  NCBI gene (formerly Entrezgene) accession |
| Uniprot         |  uniprot_gn_symbol    |  UniProtKB Gene Name symbol                |
| Ensembl (maybe) |  external_gene_name   |  Gene name                                 |
+-----------------+-----------------------+--------------------------------------------+

Upon this comparison, I found several things that are clearly apparent.

1. Genes names of protein coding genes are best matched across different sources.

I saw that protein coding genes have the best matching (left, measured in terms of Jaccard index) across different sources, with majority of genes having a single unique names (shown on right).
enter image description here
However, there isn’t a good enough matching in the case of not protein coding genes. Here, HGNC and Ensembl have the best match. (I don’t expect Uniprot gene names to match because they are of course only for protein coding genes.) Remarkably most of the genes have 2 unique ids (shown on right).
enter image description here

2. Gene names from some databases match with each other.

Comparison of all genes shows that some pairs of the sources do not have a good match e.g. Ensembl and Uniprot, with many genes having 2 unique gene names(!).
enter image description here

I saw similar pattern for genes on chromosomes (autosomes,X,Y) and on the scaffolds.
enter image description here
enter image description here

3. Mitochondrial gene names do not match at all(!).

Mitochondrial genes clearly have different names in different databases. None of the genes have a single unique gene names (!).
enter image description here

How to deal with such a mismatch between different sources?
Should I prefer one particular source or is there a way to make use of the synonymous gene names from different sources?

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP