TransWikia.com

Timeout when downloading the ncbi nr blast database

Bioinformatics Asked by C. Zeil on July 19, 2021

I am experiencing timeout problems when downloading the NCBI nr preformatted blast database using the update_blastdb script (version 504861).

I run the script with the following parameters

update_blastdb --decompress --passive --verbose nr

and I get the following error message (in verbose mode)

Downloading nr (45 volumes) ...
Downloading nr.00.tar.gz...Net::FTP=GLOB(0x5610fb59b8f8)>>> PASV
Net::FTP=GLOB(0x5610fb59b8f8)<<< 227 Entering Passive Mode (165,112,9,229,195,144).
Net::FTP=GLOB(0x5610fb59b8f8)>>> RETR nr.00.tar.gz
Net::FTP=GLOB(0x5610fb59b8f8)<<< 150 Opening BINARY mode data connection for nr.00.tar.gz (18745730730 bytes)
Net::FTP: Net::Cmd::getline(): timeout at /usr/share/perl/5.26/Net/FTP/dataconn.pm line 82.
Unable to close datastream at /usr/bin/update_blastdb line 202.
Net::FTP=GLOB(0x5610fb59b8f8)>>> PASV
Net::FTP: Net::Cmd::getline(): unexpected EOF on command channel: Connection reset by peer at /usr/bin/update_blastdb line 203.
Failed to download nr.00.tar.gz.md5!
Net::FTP: Net::Cmd::_is_closed(): unexpected EOF on command channel: Connection reset by peer at /usr/bin/update_blastdb line 101.
Net::FTP: Net::Cmd::_is_closed(): unexpected EOF on command channel: Connection reset by peer at /usr/bin/update_blastdb line 101.

The timeout happens after ~35 minutes, by which point a file of approximately 18 GB has been downloaded, which matches the expected file size. The checksum file (nr.00.tar.gz.md5) is never downloaded, so I'm not sure which of the two files is actually the problem.

I tested downloading the nt database and everything works fine, so I don't think the script itself is the problem. For comparison, this is the nt download output for the first file

Downloading nt.00.tar.gz...Net::FTP=GLOB(0x562575a73168)>>> PASV
Net::FTP=GLOB(0x562575a73168)<<< 227 Entering Passive Mode (165,112,9,229,196,51).
Net::FTP=GLOB(0x562575a73168)>>> RETR nt.00.tar.gz
Net::FTP=GLOB(0x562575a73168)<<< 150 Opening BINARY mode data connection for nt.00.tar.gz (4065912989 bytes)
Net::FTP=GLOB(0x562575a73168)<<< 226 Transfer complete
Net::FTP=GLOB(0x562575a73168)>>> PASV
Net::FTP=GLOB(0x562575a73168)<<< 227 Entering Passive Mode (165,112,9,229,195,107).
Net::FTP=GLOB(0x562575a73168)>>> RETR nt.00.tar.gz.md5
Net::FTP=GLOB(0x562575a73168)<<< 150 Opening BINARY mode data connection for nt.00.tar.gz.md5 (47 bytes)
Net::FTP=GLOB(0x562575a73168)<<< 226 Transfer complete

Any help would be appreciated.

2 Answers

NCBI support replied that the size of the first download is likely the problem and suggested trying another tool such as rsync. Since I'm unfamiliar with rsync, I decided to write a Python script that does the job instead.
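For anyone who does want to try the rsync route that support suggested, a minimal sketch of building the command (wrapped in Python, since the rest of this answer uses Python) might look like the following. The rsync:// module path is an assumption based on NCBI's FTP layout, so verify it before relying on it:

```python
import subprocess

def build_rsync_command(db="nr", dest="."):
    # Hypothetical rsync source; mirrors the FTP directory layout and
    # should be checked against NCBI's own documentation.
    source = f"rsync://ftp.ncbi.nlm.nih.gov/blast/db/{db}.*.tar.gz"
    # -a: archive mode, -v: verbose, --partial: keep partially
    # transferred files so an interrupted download can resume
    return ["rsync", "-av", "--partial", source, dest]

cmd = build_rsync_command("nr", "blastdb/")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually download
```

The `--partial` flag is the main attraction here: unlike a plain FTP `RETR`, an interrupted 18 GB transfer does not have to start over from zero.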

import hashlib
import json
import re
import tarfile
import urllib.request

base_url = "https://ftp.ncbi.nlm.nih.gov/blast/db"

# download the manifest file to get the filenames
manifest_file = 'blastdb-manifest.json'
urllib.request.urlretrieve(f"{base_url}/{manifest_file}", manifest_file)

with open(manifest_file) as f:
    manifest_data = json.load(f)

# download everything
for file in manifest_data['nr']['files']:
    # download the checksum file
    checksum_file = f"{file}.md5"
    urllib.request.urlretrieve(f"{base_url}/{checksum_file}", checksum_file)

    # download the archive file
    urllib.request.urlretrieve(f"{base_url}/{file}", file)

    # check that the checksums match
    calculated_checksum = get_md5_for_file(file)

    with open(checksum_file) as f:
        for line in f:
            line = line.strip()
            if line.endswith(file):
                checksum = re.split(r"\s+", line)[0]
                if checksum != calculated_checksum:
                    raise Exception(f"Checksum doesn't match expected for file {file}")

    # unpack the archive
    with tarfile.open(file) as tar:
        tar.extractall()

The function that calculates the MD5 checksum is the following (borrowed from this question)

def get_md5_for_file(file):
    hash_md5 = hashlib.md5()
    with open(file, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    
    return hash_md5.hexdigest()
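As a quick sanity check, the helper can be exercised against `hashlib` directly on a small throwaway file (the file name below is just for illustration):

```python
import hashlib

def get_md5_for_file(file):
    # same helper as above, repeated so this snippet runs on its own
    hash_md5 = hashlib.md5()
    with open(file, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

# write a tiny file and confirm the chunked helper agrees with
# hashing the same bytes in one go
with open("md5_check.tmp", "wb") as f:
    f.write(b"hello world")
assert get_md5_for_file("md5_check.tmp") == hashlib.md5(b"hello world").hexdigest()
```

Reading in 4096-byte chunks matters for the real use case: an 18 GB archive should never be loaded into memory at once.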

Note that this is not the exact script I use, so I can't guarantee it works verbatim, but I tried to include all the important parts.

In my version I also check whether my existing BLAST database needs to be updated (I keep the manifest files and compare the manifest['nr']['last_updated'] values), and I skip already-unpacked archives while iterating (by checking whether the nr.xx.phd file exists) to save time in case the script fails partway through a download.
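Those two checks could be sketched like this. The manifest layout and the `.phd` file-name convention are taken from the description above, so treat the helper names and exact fields as assumptions:

```python
import json
import os

def needs_update(old_manifest_path, new_manifest, db="nr"):
    # compare the stored manifest's last_updated against the fresh one;
    # if no old manifest exists, we have never downloaded the database
    if not os.path.exists(old_manifest_path):
        return True
    with open(old_manifest_path) as f:
        old_manifest = json.load(f)
    return old_manifest[db]["last_updated"] != new_manifest[db]["last_updated"]

def already_unpacked(archive_name):
    # nr.00.tar.gz -> nr.00.phd; if that file exists, skip re-extracting
    volume = archive_name.replace(".tar.gz", "")
    return os.path.exists(f"{volume}.phd")
```

In the main loop, `already_unpacked(file)` would guard both the download and the extraction step, so a rerun after a failed transfer only redoes the missing volumes.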

Correct answer by C. Zeil on July 19, 2021

Same here. I used axel to download it a week ago, but the download never finished. Also, the downloaded taxonomy archive (60 MB) doesn't match its md5sum. Apparently, they update the FTP files every 30 minutes to 1 hour.

Answered by Life_Searching_Steps on July 19, 2021
