Problem with merge data while trying to convert gene names in R

Bioinformatics Asked by Equinox on June 23, 2021

This question has also been asked on Biostars and StackOverflow

I’ve been trying to write R code that converts gene accession numbers to gene names (from RNA-seq data). I’ve looked at the related questions and tried to modify my code accordingly, but it’s still not working. Here is my code, where charg is a character vector of the gene accession IDs from the data set resdata:

charg <- resdata$genes


ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")

theBM <- getBM(attributes='ensembl_gene_id','hgnc_symbol', 
      filters = 'external_gene_name', 
      values = charg, 
      mart = ensembl)

resdata <- merge(resdata, theBM, by.x="genes",by.y="ensembl_gene_id")

Here’s some output (where I’m struggling):

> head(charg)
[1] "ENSG00000261150.2"  "ENSG00000164877.18" "ENSG00000120334.15"
[4] "ENSG00000100906.10" "ENSG00000182759.3"  "ENSG00000124145.6" 

> dim(theBM)
[1] 0 1

> head(theBM)
[1] ensembl_gene_id
<0 rows> (or 0-length row.names)

> dim(resdata)
[1] 20381    11
> resdata <- merge(resdata, theBM, by.x="genes",by.y="ensembl_gene_id")
> dim(resdata) #after merge
[1]  0 11 #isn't correct -- just row names! where'd my genes go?

Thank you.

2 Answers

This is the code to get a look-up table to convert between Ensembl ID and HGNC:

ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
theBM <- getBM(attributes=c('ensembl_gene_id','hgnc_symbol'), 
               filters = c('ensembl_gene_id'),
               values = gsub("\\..*", "", charg),
               mart = ensembl)

What Devon posted is correct but is missing a c() around the attributes values. For further help, please provide the content of resdata, which you should always do when posting a question, since we cannot read minds. Also, "does not work" is not a proper error description.
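As a quick sanity check of the version-stripping step, here is a minimal base-R sketch using three of the IDs from the question's head(charg) output. It only illustrates the regular expression; the variable names stripped and unescaped are made up for this example.

```r
# Three of the versioned Ensembl IDs shown in the question
charg <- c("ENSG00000261150.2", "ENSG00000164877.18", "ENSG00000120334.15")

# Escaped dot: remove the first literal "." and everything after it
stripped <- gsub("\\..*", "", charg)
print(stripped)

# Unescaped dot: "." matches any character, so the whole string is removed
unescaped <- gsub("..*", "", charg)
print(unescaped)
```

The escaped pattern leaves the bare gene IDs, while the unescaped one returns empty strings, which would again give zero matches in getBM.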

Once you have the output do:

resdata$genes <- gsub("\\..*", "", resdata$genes)

merge(x = theBM,
      by.x = "ensembl_gene_id",
      y = resdata,
      by.y = "genes")
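By default, merge() keeps only rows whose key appears in both tables, so genes without a hit in the look-up table are dropped. If you would rather keep them, one option is all.y = TRUE. The sketch below uses toy stand-ins for resdata and theBM: the log2FoldChange column and the symbols SYMBOL_A and SYMBOL_B are hypothetical placeholders, not real annotations.

```r
# Toy stand-in for the results table (hypothetical values)
resdata <- data.frame(
  genes = c("ENSG00000261150.2", "ENSG00000164877.18", "ENSG00000120334.15"),
  log2FoldChange = c(1.2, -0.4, 2.1),
  stringsAsFactors = FALSE
)

# Toy stand-in for the getBM() result; one gene is deliberately missing
theBM <- data.frame(
  ensembl_gene_id = c("ENSG00000261150", "ENSG00000164877"),
  hgnc_symbol = c("SYMBOL_A", "SYMBOL_B"),  # placeholder symbols
  stringsAsFactors = FALSE
)

# Strip the version suffix so the keys match the look-up table
resdata$genes <- gsub("\\..*", "", resdata$genes)

# all.y = TRUE keeps every row of resdata; unmatched genes get NA as symbol
annotated <- merge(x = theBM,
                   by.x = "ensembl_gene_id",
                   y = resdata,
                   by.y = "genes",
                   all.y = TRUE)
print(annotated)
```

Comparing nrow(annotated) with nrow(resdata) before and after the merge is a cheap way to notice when genes go missing.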

Note that I had to go to that SE cross-post to get the content of resdata; this is not how it should work. Please post all relevant data up front in the future, otherwise your questions might get downvoted and closed. Please also avoid cross-posting. If you provide proper information, you usually get a good answer in time.

Edit: I just realized you also cross-posted this to Biostars, twice even. Please stop this. I closed the Biostars posts and gave my two cents on this behaviour over there.

Correct answer by ATpoint on June 23, 2021

Those aren't external_gene_name's, they're ensembl_gene_id_versions:

theBM <- getBM(attributes='ensembl_gene_id','hgnc_symbol', 
               filters = 'ensembl_gene_id_version', 
               values = charg, 
               mart = ensembl)

Note that you'll get more hits if you strip the gene ID versions off:

charg2 = sapply(strsplit(charg, '.', fixed=TRUE), function(x) x[1])
theBM = getBM(attributes='ensembl_gene_id','hgnc_symbol', 
              filters = 'ensembl_gene_id', 
              values = charg2, 
              mart = ensembl)
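Presumably the extra hits come from version numbers in your data not always matching the versions in the current Ensembl release. Here is a small base-R sketch of that effect, with no biomaRt call involved. The vector current_versioned is a hypothetical stand-in for what a newer release might list, and its version numbers are invented for illustration.

```r
# Two versioned IDs from the question
charg <- c("ENSG00000261150.2", "ENSG00000164877.18")

# Hypothetical versioned IDs in a newer release (first version bumped)
current_versioned <- c("ENSG00000261150.3", "ENSG00000164877.18")

# Exact matching on versioned IDs misses the gene whose version changed
versioned_hits <- sum(charg %in% current_versioned)

# Strip the versions on both sides and match on the stable gene ID
charg2 <- sapply(strsplit(charg, ".", fixed = TRUE), function(x) x[1])
current_ids <- sapply(strsplit(current_versioned, ".", fixed = TRUE),
                      function(x) x[1])
unversioned_hits <- sum(charg2 %in% current_ids)

print(c(versioned = versioned_hits, unversioned = unversioned_hits))
```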

Answered by Devon Ryan on June 23, 2021
