How to get a list of genes corresponding to the list of SNPs (rs ids)?

Question

Is there a way for me to get a list of genes given a list of SNP rs ids? I found several questions asked with a similar goal years ago, and the answers are always about using multiple online tools, and/or R programming.

I wonder if there is a simpler solution recently? (and hopefully, I can solve it within the scope of python. )

The relevant question I found (that are asked and answered more than five years ago.)

https://support.bioconductor.org/p/39278/
https://www.biostars.org/p/10437/ 
https://www.biostars.org/p/413/

Kevin · Accepted Answer

I wonder if there is a simpler solution recently? (and hopefully, I can solve it within the scope of python. )

A simpler solution, I don't know... but this is at least one Python solution using Biopython's ELink method via NCBI's Entrez E-utils.

The Biopython library is flexible enough (they have an in-depth tutorial worth reading) to modify the code below to to fit project-specific needs.  For example, the mapping between snp id and gene id is lost with this bulk request method.

from Bio import Entrez
Entrez.email = "youremail@gmail.com"

snp_ids = ["rs8179065", 
           "rs12899845", 
           "rs8179066", 
           "rs80357332", 
           "rs80356920"]

record = Entrez.read(Entrez.elink(dbfrom="snp", 
                                  id=",".join(snp_ids).replace('rs', ''), 
                                  db="gene"))

This returns a record object

[{'LinkSetDb': [{'DbTo': 'gene', 'LinkName': 'snp_gene', 'Link': [{'Id': '374654'}, {'Id': '90381'}, {'Id': '672'}]}], 'ERROR': [], 'LinkSetDbHistory': [], 'DbFrom': 'snp', 'IdList': ['8179065', '8179066', '12899845', '80356920', '80357332']}]

which contains several iterable fields -- 'Link' has a list of the Gene UID's that can be converted to names using ESummary.

for gene_id in record[0]['LinkSetDb'][0]['Link']:
    handle = Entrez.esummary(db="gene", id=gene_id['Id'])
    uid_record = Entrez.read(handle)
    handle.close()

uid_summary = uid_record["DocumentSummarySet"]['DocumentSummary'][0]
    print (gene_id['Id'], uid_summary['Name'])
    # ...or something more interesting than print

374654 KIF7
90381 TICRR
672 BRCA1

An option to preserve discrete mapping, is to loop over each snp id and extract the gene(s):

although I don't endorse the method below, especially for long lists of snps, you may flood the server with too many requests, maybe someone more familiar can comment on request limits or best practices for Entrez

for snp_id in snp_ids:
    record = Entrez.read(Entrez.elink(dbfrom="snp", 
                                  id=snp_id.replace('rs',''), 
                                  db="gene"))
    results = record[0]['LinkSetDb'][0]['Link']
    for result in results:
        uid = result['Id']
        handle = Entrez.esummary(db="gene", id=uid)
        uid_record = Entrez.read(handle)
        handle.close()
        uid_summary = uid_record["DocumentSummarySet"]['DocumentSummary'][0]
        gene_name = uid_summary['Name']

print(snp_id, uid, gene_name)

rs8179065 374654 KIF7
rs12899845 374654 KIF7
rs12899845 90381 TICRR
rs8179066 374654 KIF7
rs80357332 672 BRCA1
rs80356920 672 BRCA1

Shicheng Guo · Answer

BiocManager::install('biomaRt')
require(biomaRt)
ensembl <- useMart("ENSEMBL_MART_SNP", dataset = "hsapiens_snp")
getBM(attributes=c("refsnp_id", "chr_name", "chrom_start", "chrom_end",
                   "allele", "mapweight", "validated", "allele_1", "minor_allele",
                   "minor_allele_freq", "minor_allele_count", "clinical_significance",
                   "synonym_name", "ensembl_gene_stable_id"),
      filters="snp_filter", values=c("rs6025","rs431905509"),
      mart=ensembl, uniqueRows=TRUE)

arup · Answer

Have you tried using SnpEff and SnpShift (http://snpeff.sourceforge.net) for annotation of the SNPs?

Even if your desired organism available in their annotation database you can create them easily with the gene bank files. Other than annotation the tools features a bunch of other amazing functionalities.

Pierre · Answer

using ncbi-efetch and XSLT with the following stylesheet:

rs

ncbi:

usage (group the rs# by 3): $ echo -e "rs8179065nrs12899845nrs8179066nrs80357332nrs80356920n" | xargs -L 3 echo | tr " " "," | while read R; do curl -sL "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=${R}&retmode=xml" | xsltproc --novalid transform.xsl -; done | uniq rs8179065 ncbi:374654 KIF7 rs12899845 ncbi:90381 TICRR rs8179066 ncbi:374654 KIF7 rs80357332 ncbi:672 BRCA1 rs80356920 ncbi:672 BRCA1

Alex Reynolds · Answer

If you want a fast (binary) search with a locally-stored dataset, where you know exactly what SNPs and gene annotations are being queried, here's a way to set up files to do your own querying, which should scale up nicely if you have a lot of SNPs.

1) Get annotations of interest and write them into a sorted BED file.

For example, here's a way to get Gencode v26 gene annotations:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.basic.annotation.gff3.gz 
    | gunzip -c 
    | convert2bed --input=gff - 
    > gencode.v26.bed

2) Get SNPs and write them into a text file sorted by SNP ID.

For hg38, for instance:

$ LC_ALL=C
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz 
   | gunzip -c 
   | awk -v OFS="t" '{ print $5,$2,$3,($3+1) }' 
   | sort -k1,1 
   > hg38.snp147.sortedByName.txt

3) Install pts-line-bisect:

$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..

4) Run a query:

$ rs_of_interest=rs2814778
$ ./pts-line-bisect/pts_lbsearch -p hg38.snp147.sortedByName.txt ${rs_of_interest} 
   | head -1 
   | cut -f2- 
   | bedmap --echo --echo-map-id-uniq --range 1000 - gencode.v26.bed 
   > answer.bed

The file answer.bed contains the genomic interval of the SNP of interest, along with the IDs of any Gencode v26 gene annotations within 1000 nt upstream and downstream of the SNP.

To repeat step 4 within a Python script, you can use subprocess.check_output():

...
pts_lbsearch = os.path.join(project_dir, "pts-line-bisect", "pts_lbsearch")
if not os.path.exists(pts_lbsearch):
  print "Content-type:application/jsonrnrn"
  print json.dumps({'error' : 'could not locate pts_lbsearch [%s]' % (pts_lbsearch)})
  sys.exit(errno.ENOENT)

if term.startswith('rs'):
  query_fn = os.path.join(annotations_dir, "hg38.snp147.sortedByName.txt")
  cmd = "%s -p %s %s" % (pts_lbsearch, query_fn, term)
  try:
    res = subprocess.check_output(cmd, shell=True)
    if res and len(res) > 0:
      (chr, start, stop) = res.strip().split('n')[0].split('t')
    else:
      pass
  except subprocess.CalledProcessError as err:
    sys.stderr.write("Query failed!n")
    sys.exit(errno.EIO)

#
# map 'chr', 'start', and 'stop' variables to annotations with 
# another subprocess call (not shown -- use example above)
#
...

benn · Answer

I can show you a simple way in R, using biomaRt. Let's say you have two snp ids you want to exmine.

SNPids <- c("rs431905509", "rs431905511")

library("biomaRt")

# To show which marts are available
listMarts()

# You need the SNP mart
mart <- useMart("ENSEMBL_MART_SNP")

# Find homo sapiens
listDatasets(mart)

# This will be the dataset we want to use
dataset <- useDataset("hsapiens_snp", mart=mart)

# Show available filters
listFilters(dataset)

# Now list all available attributes
listAttributes(dataset)

# To get the ensembl gene id belonging to the SNPs
getBM(attributes=c('refsnp_id', 'ensembl_gene_stable_id'), 
      filters = 'snp_filter', 
      values = SNPids, 
      mart = dataset)

refsnp_id ensembl_gene_stable_id
1 rs431905509        ENSG00000070371
2 rs431905509        ENSG00000100075
3 rs431905509        ENSG00000260924
4 rs431905511        ENSG00000145335

How to get a list of genes corresponding to the list of SNPs (rs ids)?

6 Answers

Add your own answers!

Ask a Question