How to deal with the mismatches between gene names obtained from different sources?

Question

Most of the time, I rely on gene IDs to combine different datasets. In some instances, however, I have to combine datasets based on gene names. If I don't know the source of the gene names in a dataset, I then have to choose a source of gene names myself, be it Ensembl, HGNC, etc. for human genes. I wonder whether this is a common issue and whether there is a reliable method to deal with it.
To demonstrate the mismatch between sources, I compared gene names for all human genes. I obtained them from the 4 sources listed below, using BioMart (pybiomart):
+-----------------+-----------------------+--------------------------------------------+
|     source      |     attribute_name    |                display_name                |
+-----------------+-----------------------+--------------------------------------------+
| HGNC            |  hgnc_symbol          |  HGNC symbol                               |
| NCBI            |  entrezgene_accession |  NCBI gene (formerly Entrezgene) accession |
| Uniprot         |  uniprot_gn_symbol    |  UniProtKB Gene Name symbol                |
| Ensembl (maybe) |  external_gene_name   |  Gene name                                 |
+-----------------+-----------------------+--------------------------------------------+

From this comparison, several things are apparent.
1. Gene names of protein-coding genes are best matched across different sources.
Protein-coding genes have the best match across sources (left, measured in terms of the Jaccard index), with the majority of genes having a single unique name (shown on right).
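
For readers who want to reproduce this kind of comparison, here is a minimal sketch of a pairwise Jaccard index between sources. It is not necessarily how the figures were computed. The `jaccard` helper and the `names_by_source` dictionary are hypothetical, and the symbols are toy values chosen for illustration, not actual BioMart output.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections: size of intersection / size of union."""
    a, b = set(a), set(b)
    union = a | b
    # Convention chosen here (an assumption): two empty sets count as identical.
    return len(a & b) / len(union) if union else 1.0


# Toy gene names per source (illustrative values only).
names_by_source = {
    "HGNC": {"TP53", "BRCA1", "MT-ND1"},
    "NCBI": {"TP53", "BRCA1", "ND1"},
    "Ensembl": {"TP53", "BRCA1", "MT-ND1"},
}

# Jaccard index for every unordered pair of sources.
pairwise = {
    (s1, s2): jaccard(names_by_source[s1], names_by_source[s2])
    for s1, s2 in combinations(names_by_source, 2)
}

for pair, score in pairwise.items():
    print(pair, round(score, 2))
```

In practice the sets would be built from the BioMart query results, one set per attribute in the table above.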

However, the match is not nearly as good for non-protein-coding genes. Here, HGNC and Ensembl match best. (I don't expect Uniprot gene names to match, because they of course exist only for protein-coding genes.) Remarkably, most of these genes have 2 unique names (shown on right).
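
The "number of unique names per gene" measure can be sketched as follows. This is only an illustration under assumptions: `names_by_gene`, `unique_name_count`, the gene keys and all names are hypothetical toy values, and a missing name in a source is represented as `None` and ignored.

```python
from collections import Counter

# Toy table: one entry per gene, one name (or None) per source.
# All values are illustrative, not actual database output.
names_by_gene = {
    "gene_1": {"HGNC": "TP53", "NCBI": "TP53", "Uniprot": "TP53", "Ensembl": "TP53"},
    "gene_2": {"HGNC": "MT-ND1", "NCBI": "ND1", "Uniprot": "MT-ND1", "Ensembl": "MT-ND1"},
    "gene_3": {"HGNC": None, "NCBI": "HYPOTHETICAL1", "Uniprot": None, "Ensembl": None},
}


def unique_name_count(names):
    """Number of distinct non-missing names a gene has across sources."""
    return len({name for name in names.values() if name})


# Unique-name count per gene, then how many genes fall into each count.
counts = {gene: unique_name_count(names) for gene, names in names_by_gene.items()}
distribution = Counter(counts.values())

print(counts)
print(dict(distribution))
```

Whether missing names should be ignored or counted as a mismatch is a design choice that changes the distribution, especially for non-coding genes that Uniprot does not cover.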

2. Gene names from only some pairs of databases match each other.
A comparison of all genes shows that some pairs of sources do not match well, e.g. Ensembl and Uniprot, with many genes having 2 unique gene names(!).

I saw a similar pattern for genes on chromosomes (autosomes, X, Y) and on the scaffolds.

3. Mitochondrial gene names do not match at all(!).
Mitochondrial genes clearly have different names in different databases. None of them has a single unique gene name(!).

How to deal with such a mismatch between different sources?
Should I prefer one particular source or is there a way to make use of the synonymous gene names from different sources?
