
What is the best way to download a large number of RNA-seq files from SRA in Python without being denied access?

Bioinformatics Asked on August 22, 2021

I have the following code to download data from the SRA using multithreading in Python. After running this a few times now (for testing purposes), I keep getting denied access to the data, and I am not sure how to fix this. In particular, running cat on the output files gives:

<?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>cb0f0e98-cafb-1dd7-9b7b-d8c49756ec52</RequestId><HostId>QeDGVwBXYp61J0B4_OUTn7UsEsiQEec0n18DAeR0kaE</HostId></Error>

Here is my code:

import threading
import requests
import time 

start = time.perf_counter()

class MyThread(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url
        self.result = None
        self.filename = url.split('/')[-1]
    def run(self):
        res = requests.get(self.url)
        with open(self.filename, 'wb') as f:
            f.write(res.content)
urls = [
        'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR000001/SRR000001.1',
        'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR000001/SRR000001.2',
        'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR000002/SRR000002.1',
        'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR000002/SRR000002.2']

threads = [MyThread(url) for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

finish = time.perf_counter()

print(f'Finished in {round(finish-start, 2)} second(s)')
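One thing worth noting about the code above: an AccessDenied response is the server refusing the request, not a threading bug, and the script writes whatever body comes back straight to disk. A common mitigation is to check the HTTP status and retry with exponential backoff. A minimal sketch, assuming the failures are transient throttling (the fetch_with_retry helper and its parameters are my own, not part of the original code):

```python
import time

def fetch_with_retry(get, url, retries=3, backoff=2.0):
    """Call get(url), retrying on any exception.

    `get` should be a callable that raises on failure (e.g. a wrapper
    around requests.get that calls raise_for_status()); between failed
    attempts we sleep backoff * 2**attempt seconds.
    """
    for attempt in range(retries):
        try:
            return get(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(backoff * 2 ** attempt)
```

Inside MyThread.run you could then fetch via this helper and only write the file on success; keeping the number of simultaneous connections small (e.g. a ThreadPoolExecutor with a low max_workers) also reduces the chance of being throttled in the first place.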

One Answer

I once needed to download more than 1000 files from SRA, and I found a blog post that is very useful (https://reneshbedre.github.io/blog/fqutil.html). The idea is to use Aspera to fetch the SRA files, which is fast, and then use fasterq-dump to convert each SRA file to fastq. fasterq-dump is faster than fastq-dump partly because it does not compress the output file; I think compression is the rate-limiting step here. After this, you can use pigz, a multithreaded version of gzip, to compress the fastq files. Note that if you are downloading to a computing cluster, you can submit jobs for the fasterq-dump and pigz steps, which makes things much faster. You cannot submit jobs for fetching the SRA files themselves, because compute nodes typically have no internet access.
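The pipeline above can be sketched in Python by shelling out to the tools. This is only a sketch under assumptions: the accession, thread count, and output filename are illustrative, and fasterq-dump writes SRRxxx.fastq for single-end runs but _1/_2 files for paired-end, so adjust for your data:

```python
import subprocess

def sra_pipeline_cmds(accession, threads=8):
    """Build the three commands for one SRA run: fetch the .sra file,
    convert it to fastq, then compress the fastq in parallel."""
    return [
        ["prefetch", accession],                                 # fetch .sra (or use Aspera)
        ["fasterq-dump", "--threads", str(threads), accession],  # .sra -> uncompressed .fastq
        ["pigz", "-p", str(threads), f"{accession}.fastq"],      # multithreaded gzip
    ]

def run_pipeline(accession, threads=8):
    """Run the steps in order, stopping on the first failure."""
    for cmd in sra_pipeline_cmds(accession, threads):
        subprocess.run(cmd, check=True)
```

On a cluster, you would submit only the fasterq-dump and pigz commands as jobs and run the fetch step on a node with internet access.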

Correct answer by Phoenix Mu on August 22, 2021
