AnswerBun.com

How to deal with the mismatches between gene names obtained from different sources?

Bioinformatics Asked on June 22, 2021

Most of the time, I rely on gene IDs to combine different datasets. In some cases, however, I have to combine datasets based on gene names. If I don’t know where the gene names in a dataset came from, I have to choose a naming source myself, be it Ensembl, HGNC, etc. for human genes. I wonder whether this is a common issue and whether there is a reliable method to deal with it.

To demonstrate the mismatch between sources, I compared the gene names of all human genes. I obtained them from the 4 sources listed below, using BioMart (pybiomart):

+-----------------+-----------------------+--------------------------------------------+
|     source      |     attribute_name    |                display_name                |
+-----------------+-----------------------+--------------------------------------------+
| HGNC            |  hgnc_symbol          |  HGNC symbol                               |
| NCBI            |  entrezgene_accession |  NCBI gene (formerly Entrezgene) accession |
| UniProt         |  uniprot_gn_symbol    |  UniProtKB Gene Name symbol                |
| Ensembl (maybe) |  external_gene_name   |  Gene name                                 |
+-----------------+-----------------------+--------------------------------------------+

This comparison made several things apparent.

1. Gene names of protein-coding genes match best across sources.

Protein-coding genes match best across sources (left panel, measured by the Jaccard index), with the majority of genes having a single unique name (right panel).
[Figure: protein-coding genes. Left: Jaccard index between sources. Right: number of unique names per gene.]
The match is not as good for non-protein-coding genes. Here, HGNC and Ensembl match best. (I don’t expect UniProt gene names to match, since they only cover protein-coding genes.) Remarkably, most of these genes have 2 unique names (right panel).
[Figure: non-protein-coding genes. Left: Jaccard index between sources. Right: number of unique names per gene.]
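
For readers who want to reproduce this kind of comparison, here is a minimal sketch of a pairwise Jaccard index between sources. It assumes each source is reduced to a set of (gene ID, name) pairs; whether the figures above were computed this way is not stated, so treat that as an assumption. The gene IDs and names are invented placeholders, not BioMart output.

```python
from itertools import combinations

# Toy records: gene ID -> name reported by each source (None = no name).
# All IDs and names are made-up placeholders for illustration only.
records = {
    "gene_1": {"HGNC": "AAA1", "NCBI": "AAA1", "UniProt": "AAA1", "Ensembl": "AAA1"},
    "gene_2": {"HGNC": "BBB2", "NCBI": "BBB2", "UniProt": None, "Ensembl": "BBB2"},
    "gene_3": {"HGNC": "CCC3", "NCBI": "C3", "UniProt": None, "Ensembl": "CCC3"},
    "gene_4": {"HGNC": None, "NCBI": "LOC4", "UniProt": None, "Ensembl": "DDD4"},
}


def name_pairs(records, source):
    """Set of (gene_id, name) pairs a source reports, skipping missing names."""
    return {(gid, names[source]) for gid, names in records.items() if names.get(source)}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; returns 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(records, sources):
    """Jaccard index for every unordered pair of sources."""
    sets = {s: name_pairs(records, s) for s in sources}
    return {(s1, s2): jaccard(sets[s1], sets[s2]) for s1, s2 in combinations(sources, 2)}


sources = ["HGNC", "NCBI", "UniProt", "Ensembl"]
scores = pairwise_jaccard(records, sources)
for pair, score in scores.items():
    print(pair, round(score, 2))
```

Comparing (gene ID, name) pairs rather than bare names means a name only counts as shared when both sources attach it to the same gene.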

2. Gene names match well only between some pairs of databases.

Comparing all genes shows that some pairs of sources do not match well, e.g. Ensembl and UniProt, with many genes having 2 unique gene names(!).
[Figure: all genes. Jaccard index between sources and number of unique names per gene.]

I saw a similar pattern for genes on chromosomes (autosomes, X, Y) and on scaffolds.
[Figures: genes on chromosomes (autosomes, X, Y) and genes on scaffolds.]
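
The "number of unique names per gene" statistic used throughout, broken down by gene group, could be tallied roughly as follows. This is a sketch under assumptions: the group labels, the row layout, and all names are hypothetical placeholders, and names are compared case-sensitively.

```python
from collections import Counter, defaultdict

# Toy rows: (gene_id, group, names from each source; None = missing).
# Everything here is an invented placeholder, not real annotation data.
rows = [
    ("gene_1", "autosome", ["AAA1", "AAA1", "AAA1", "AAA1"]),
    ("gene_2", "autosome", ["BBB2", "B2", None, "BBB2"]),
    ("gene_3", "scaffold", ["CCC3", "LOC3", None, "CCC3"]),
    ("gene_4", "MT", ["MT-X1", "X1", "MT-X1", "MT-X1"]),
]


def unique_name_count(names):
    """Number of distinct non-missing names for one gene (case-sensitive)."""
    return len({n for n in names if n})


def count_distribution(rows):
    """Map each group to a Counter of {unique-name count: number of genes}."""
    dist = defaultdict(Counter)
    for _gene_id, group, names in rows:
        dist[group][unique_name_count(names)] += 1
    return dict(dist)


dist = count_distribution(rows)
for group, counts in dist.items():
    print(group, dict(counts))
```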

3. Mitochondrial gene names do not match at all(!).

Mitochondrial genes clearly have different names in different databases. None of the genes has a single unique gene name(!).
[Figure: mitochondrial genes. Jaccard index between sources and number of unique names per gene.]

How should I deal with such mismatches between sources?
Should I prefer one particular source, or is there a way to make use of the synonymous gene names from different sources?
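
To make the second option concrete, one conceivable way to use synonymous names is to pool every name reported for a gene into a lookup keyed on a stable gene ID, and to set aside names claimed by more than one gene. The sketch below is only an illustration of that idea, not an established method: the table, IDs, and names are hypothetical, and matching names case-insensitively is an assumption that may not suit every dataset.

```python
from collections import defaultdict

# Toy table: gene ID -> every name any source reports for it (placeholders).
names_by_gene = {
    "gene_1": {"AAA1"},
    "gene_2": {"BBB2", "B2"},
    "gene_3": {"CCC3", "B2"},  # "B2" is deliberately shared to show ambiguity
}


def build_lookup(names_by_gene):
    """Return (unambiguous name -> gene ID, ambiguous name -> set of gene IDs)."""
    owners = defaultdict(set)
    for gene_id, names in names_by_gene.items():
        for name in names:
            # Assumption: names are matched case-insensitively.
            owners[name.upper()].add(gene_id)
    unique = {n: next(iter(ids)) for n, ids in owners.items() if len(ids) == 1}
    ambiguous = {n: ids for n, ids in owners.items() if len(ids) > 1}
    return unique, ambiguous


def to_gene_ids(names, unique):
    """Translate names to gene IDs; unknown or ambiguous names become None."""
    return [unique.get(n.upper()) for n in names]


unique, ambiguous = build_lookup(names_by_gene)
print(to_gene_ids(["aaa1", "BBB2", "B2", "ZZZ9"], unique))
```

Names that land in `ambiguous` would still need manual review or extra context (e.g. genomic location) before datasets are merged on them.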
