TransWikia.com

How to get a list of genes corresponding to the list of SNPs (rs ids)?

Bioinformatics Asked on August 22, 2021

Is there a way for me to get a list of genes given a list of SNP rs ids? I found several questions asked with a similar goal years ago, and the answers are always about using multiple online tools, and/or R programming.

I wonder if there is a simpler solution recently? (and hopefully, I can solve it within the scope of python. )

The relevant question I found (that are asked and answered more than five years ago.)

6 Answers

I wonder if there is a simpler solution recently? (and hopefully, I can solve it within the scope of python. )

A simpler solution, I don't know... but this is at least one Python solution using Biopython's ELink method via NCBI's Entrez E-utils.

The Biopython library is flexible enough (they have an in-depth tutorial worth reading) to modify the code below to to fit project-specific needs. For example, the mapping between snp id and gene id is lost with this bulk request method.

from Bio import Entrez
Entrez.email = "[email protected]"

snp_ids = ["rs8179065", 
           "rs12899845", 
           "rs8179066", 
           "rs80357332", 
           "rs80356920"]

record = Entrez.read(Entrez.elink(dbfrom="snp", 
                                  id=",".join(snp_ids).replace('rs', ''), 
                                  db="gene"))

This returns a record object

[{'LinkSetDb': [{'DbTo': 'gene', 'LinkName': 'snp_gene', 'Link': [{'Id': '374654'}, {'Id': '90381'}, {'Id': '672'}]}], 'ERROR': [], 'LinkSetDbHistory': [], 'DbFrom': 'snp', 'IdList': ['8179065', '8179066', '12899845', '80356920', '80357332']}]

which contains several iterable fields -- 'Link' has a list of the Gene UID's that can be converted to names using ESummary.

for gene_id in record[0]['LinkSetDb'][0]['Link']:
    handle = Entrez.esummary(db="gene", id=gene_id['Id'])
    uid_record = Entrez.read(handle)
    handle.close()

    uid_summary = uid_record["DocumentSummarySet"]['DocumentSummary'][0]
    print (gene_id['Id'], uid_summary['Name'])
    # ...or something more interesting than print


374654 KIF7
90381 TICRR
672 BRCA1

An option to preserve discrete mapping, is to loop over each snp id and extract the gene(s):

although I don't endorse the method below, especially for long lists of snps, you may flood the server with too many requests, maybe someone more familiar can comment on request limits or best practices for Entrez

for snp_id in snp_ids:
    record = Entrez.read(Entrez.elink(dbfrom="snp", 
                                  id=snp_id.replace('rs',''), 
                                  db="gene"))
    results = record[0]['LinkSetDb'][0]['Link']
    for result in results:
        uid = result['Id']
        handle = Entrez.esummary(db="gene", id=uid)
        uid_record = Entrez.read(handle)
        handle.close()
        uid_summary = uid_record["DocumentSummarySet"]['DocumentSummary'][0]
        gene_name = uid_summary['Name']

        print(snp_id, uid, gene_name)

rs8179065 374654 KIF7
rs12899845 374654 KIF7
rs12899845 90381 TICRR
rs8179066 374654 KIF7
rs80357332 672 BRCA1
rs80356920 672 BRCA1

Correct answer by Kevin on August 22, 2021

BiocManager::install('biomaRt')
require(biomaRt)
ensembl <- useMart("ENSEMBL_MART_SNP", dataset = "hsapiens_snp")
getBM(attributes=c("refsnp_id", "chr_name", "chrom_start", "chrom_end",
                   "allele", "mapweight", "validated", "allele_1", "minor_allele",
                   "minor_allele_freq", "minor_allele_count", "clinical_significance",
                   "synonym_name", "ensembl_gene_stable_id"),
      filters="snp_filter", values=c("rs6025","rs431905509"),
      mart=ensembl, uniqueRows=TRUE)

Answered by Shicheng Guo on August 22, 2021

Have you tried using SnpEff and SnpShift (http://snpeff.sourceforge.net) for annotation of the SNPs?

Even if your desired organism available in their annotation database you can create them easily with the gene bank files. Other than annotation the tools features a bunch of other amazing functionalities.

Answered by arup on August 22, 2021

using ncbi-efetch and XSLT with the following stylesheet:

<?xml version='1.0' encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:r="https://www.ncbi.nlm.nih.gov/SNP/docsum"
    version='1.1'
    >
<xsl:output method="text"  encoding="UTF-8"/>

<xsl:template match="/">
<xsl:apply-templates select="r:ExchangeSet"/>
</xsl:template>
<xsl:template match="r:ExchangeSet">
<xsl:apply-templates select="r:Rs"/>
</xsl:template>
<xsl:template match="r:Rs">
<xsl:apply-templates select="r:Assembly/r:Component/r:MapLoc/r:FxnSet"/>
</xsl:template>


<xsl:template match="r:FxnSet">
<xsl:text>rs</xsl:text>
<xsl:value-of select="../../../../@rsId"/>
<xsl:text>  ncbi:</xsl:text>
<xsl:value-of select="@geneId"/>
<xsl:text>  </xsl:text>
<xsl:value-of select="@symbol"/>
<xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>

usage (group the rs# by 3):

$ echo -e "rs8179065nrs12899845nrs8179066nrs80357332nrs80356920n" | xargs  -L 3 echo  | tr " " "," | while read R; do curl -sL "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=${R}&retmode=xml" | xsltproc --novalid  transform.xsl -; done | uniq
rs8179065   ncbi:374654 KIF7
rs12899845  ncbi:90381  TICRR
rs8179066   ncbi:374654 KIF7
rs80357332  ncbi:672    BRCA1
rs80356920  ncbi:672    BRCA1

Answered by Pierre on August 22, 2021

If you want a fast (binary) search with a locally-stored dataset, where you know exactly what SNPs and gene annotations are being queried, here's a way to set up files to do your own querying, which should scale up nicely if you have a lot of SNPs.

1) Get annotations of interest and write them into a sorted BED file.

For example, here's a way to get Gencode v26 gene annotations:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.basic.annotation.gff3.gz 
    | gunzip -c 
    | convert2bed --input=gff - 
    > gencode.v26.bed

2) Get SNPs and write them into a text file sorted by SNP ID.

For hg38, for instance:

$ LC_ALL=C
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz 
   | gunzip -c 
   | awk -v OFS="t" '{ print $5,$2,$3,($3+1) }' 
   | sort -k1,1 
   > hg38.snp147.sortedByName.txt

3) Install pts-line-bisect:

$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..

4) Run a query:

$ rs_of_interest=rs2814778
$ ./pts-line-bisect/pts_lbsearch -p hg38.snp147.sortedByName.txt ${rs_of_interest} 
   | head -1 
   | cut -f2- 
   | bedmap --echo --echo-map-id-uniq --range 1000 - gencode.v26.bed 
   > answer.bed

The file answer.bed contains the genomic interval of the SNP of interest, along with the IDs of any Gencode v26 gene annotations within 1000 nt upstream and downstream of the SNP.

To repeat step 4 within a Python script, you can use subprocess.check_output():

...
pts_lbsearch = os.path.join(project_dir, "pts-line-bisect", "pts_lbsearch")
if not os.path.exists(pts_lbsearch):
  print "Content-type:application/jsonrnrn"
  print json.dumps({'error' : 'could not locate pts_lbsearch [%s]' % (pts_lbsearch)})
  sys.exit(errno.ENOENT)

if term.startswith('rs'):
  query_fn = os.path.join(annotations_dir, "hg38.snp147.sortedByName.txt")
  cmd = "%s -p %s %s" % (pts_lbsearch, query_fn, term)
  try:
    res = subprocess.check_output(cmd, shell=True)
    if res and len(res) > 0:
      (chr, start, stop) = res.strip().split('n')[0].split('t')
    else:
      pass
  except subprocess.CalledProcessError as err:
    sys.stderr.write("Query failed!n")
    sys.exit(errno.EIO)

#
# map 'chr', 'start', and 'stop' variables to annotations with 
# another subprocess call (not shown -- use example above)
#
...

Answered by Alex Reynolds on August 22, 2021

I can show you a simple way in R, using biomaRt. Let's say you have two snp ids you want to exmine.

SNPids <- c("rs431905509", "rs431905511")

library("biomaRt")

# To show which marts are available
listMarts()

# You need the SNP mart
mart <- useMart("ENSEMBL_MART_SNP")

# Find homo sapiens
listDatasets(mart)

# This will be the dataset we want to use
dataset <- useDataset("hsapiens_snp", mart=mart)

# Show available filters
listFilters(dataset)

# Now list all available attributes
listAttributes(dataset)

# To get the ensembl gene id belonging to the SNPs
getBM(attributes=c('refsnp_id', 'ensembl_gene_stable_id'), 
      filters = 'snp_filter', 
      values = SNPids, 
      mart = dataset)

    refsnp_id ensembl_gene_stable_id
1 rs431905509        ENSG00000070371
2 rs431905509        ENSG00000100075
3 rs431905509        ENSG00000260924
4 rs431905511        ENSG00000145335

Answered by benn on August 22, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP