Getting Unique Identifier List for GEO Datasets NCBI

Question

AIM: Download "Unique Identifier List" for the following query from GEO DataSets.
Query: ("Expression profiling by high throughput sequencing"[DataSet Type] AND ("Homo sapiens"[Organism] OR "Mus musculus"[Organism] OR "rattus norvegicus"[Organism])) AND ("2020/01/01"[PDAT] : "3000"[PDAT])
In other words: all RNA-seq studies deposited on GEO since 1 January 2020 for human, mouse, or rat.
Problem: I need the GSE ID list for ~9k datasets, but when I try to download the list of IDs, a blank page loads and nothing happens. Clicking "Next Page" also gives an error.
I have been trying for the last 3-4 days but it doesn't work.
Steps to generate file:
"Send To" -> "File" -> "Format" (Unique Identifier List) -> "Sort By" (Default Order) -> "Create File"

vkkodali · Accepted Answer

You can use Entrez Direct for this. The following returns the unique identifiers (UIDs), which are bare integers.
$ geo_query='"Expression profiling by high throughput sequencing"[DataSet Type] AND ("Homo sapiens"[Organism] OR "Mus musculus"[Organism] OR "rattus norvegicus"[Organism]) AND ("2020/01/01"[PDAT] : "3000"[PDAT])'
$ esearch -db gds -query "$geo_query" | efetch -format uid > gds_results.txt 
$ wc -l gds_results.txt 
9981 gds_results.txt
$ head -n2 gds_results.txt 
200134092
200120931
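
If you only have the UID file, a rough offline conversion may be possible. The sketch below assumes (this is not stated above, so treat it as an assumption and spot-check a few IDs with esummary) that a series UID in the gds database is a "2" followed by the GSE number zero-padded to 8 digits. `uid_to_gse` is a hypothetical helper name, not part of Entrez Direct.

```shell
# Hypothetical helper: turn a gds UID into a GSE accession.
# ASSUMPTION: series UIDs are "2" + the GSE number zero-padded to 8 digits,
# e.g. 200134092 -> GSE134092. Verify against esummary before relying on it.
uid_to_gse() {
  case "$1" in
    2[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
      # Drop the leading "2", then strip the zero padding.
      printf 'GSE%s\n' "$(printf '%s' "${1#2}" | sed 's/^0*//')"
      ;;
    *)
      # Not a series-style UID under the assumption above; report failure.
      return 1
      ;;
  esac
}
```

To apply it to the whole file: `while read -r uid; do uid_to_gse "$uid"; done < gds_results.txt`. The esummary route is the safer one, since it reads the accession straight from the record.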

If you want the GSE accessions instead, you can use the built-in xtract command to parse the XML returned by esummary:
$ esearch -db gds -query "$geo_query" | esummary | xtract -pattern DocumentSummary -first Accession > gse_accs.txt
$ wc -l gse_accs.txt 
9981 gse_accs.txt
$ head -n2 gse_accs.txt 
GSE165829
GSE165824
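
Before using the list downstream, it may be worth a quick sanity check that every line looks like a GSE accession and that there are no duplicates. The following is a minimal sketch using only standard tools; `check_gse_list` is a hypothetical helper name, and the `GSE` plus digits pattern is an assumption based on the output shown above.

```shell
# Hypothetical helper: summarise an accession file such as gse_accs.txt.
check_gse_list() {
  file=$1
  # Total number of lines in the file.
  total=$(wc -l < "$file" | tr -d ' ')
  # Number of distinct lines; a value lower than total means duplicates.
  unique=$(sort -u "$file" | wc -l | tr -d ' ')
  # Lines that do not look like a GSE accession (other record types, blanks).
  # grep exits non-zero when the count is 0, hence the "|| true".
  other=$(grep -Evc '^GSE[0-9]+$' "$file" || true)
  printf 'total=%s unique=%s non_gse=%s\n' "$total" "$unique" "$other"
}
```

Run it as `check_gse_list gse_accs.txt`. If `non_gse` is not 0, inspect those lines with `grep -Ev '^GSE[0-9]+$' gse_accs.txt` before passing the list on.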

k1sauce · Answer

I think I would just try to do this with GEOquery.
https://bioconductor.org/packages/release/bioc/html/GEOquery.html
