How is BLAST's nr database created?

Question

Is there a paper or web page describing the procedure for creating the nr database used by NCBI's BLAST implementation?

I presume it's some type of clustering, but I'm curious about how exactly sequences are condensed into non-redundant representatives.

juniper- · Accepted Answer

Did a little more searching and found the answer in the README on BLAST's ftp site: ftp://ftp.ncbi.nlm.nih.gov/blast/db/README
6. Non-redundant defline syntax

The non-redundant databases are nr, nt and pataa. Identical sequences are 
merged into one entry in these databases. To be merged two sequences must
have identical lengths and every residue at every position must be the 
same.  The FASTA deflines for the different entries that belong to one 
record are separated by control-A characters invisible to most 
programs. In the example below both entries Q57293.1 and AAB05030.1
have the same sequence, in every respect:

>Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC ^AAAB05030.1 afuC 
[Actinobacillus pleuropneumoniae] ^AAAB17216.1 afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVTKSSIQNRDIC
IVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQQQRVALARALVLKPKVLILD
EPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMNKGTIMQKARQKIFIYDRILYSLRNFMGEST
ICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPEAIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLIN
ANPDQFDPDATKAFIHFTEQGIFLLNKE

Individual sequences are now identified simply by their accession.version.

For databases whose entries are not from official NCBI sequence databases, 
such as Trace database, the gnl| convention is used. For custom databases, 
this convention should be followed and the id for each sequence must be 
unique, if one would like to take the advantage of indexed database, 
which enables specific sequence retrieval using blastdbcmd program included 
in the blast executable package.  One should refer to documents 
distributed in the standalone BLAST package for more details.

Landed on that README from this question on biostars.org: https://www.biostars.org/p/217456/
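
To make the defline format concrete, here is a minimal Python sketch of how a merged nr-style defline could be split back into its individual entries. It assumes the separator is the control-A character (byte 0x01, shown as `^A` in the README excerpt) and that each entry starts with its accession.version followed by a space. The function name `split_defline` is made up for this illustration and is not part of any BLAST tool.

```python
CTRL_A = "\x01"  # the control-A character, displayed as ^A in the README


def split_defline(defline):
    """Split a merged defline into (accession.version, title) pairs.

    Assumes entries are separated by control-A and that the first
    whitespace-delimited token of each entry is its accession.version.
    """
    entries = []
    for part in defline.lstrip(">").split(CTRL_A):
        part = part.strip()
        if not part:
            continue
        accession, _, title = part.partition(" ")
        entries.append((accession, title.strip()))
    return entries


# The defline from the README example, joined onto one line.
example = (
    ">Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC "
    "\x01AAB05030.1 afuC [Actinobacillus pleuropneumoniae] "
    "\x01AAB17216.1 afuC [Actinobacillus pleuropneumoniae]"
)

for accession, title in split_defline(example):
    print(accession, "|", title)
```

For retrieving sequences from an indexed database, the README itself points to blastdbcmd rather than hand-parsing.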
Edit: The same README also has some information on the origin of the sequences in the non-redundant sets:
+-----------------------+-----------------------------------------------------+
|File Name              | Content Description                                 |
+-----------------------+-----------------------------------------------------+
|nr.gz*                 | non-redundant protein sequence database with entries|
|                       | from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq  |
|nt.gz*                 | nucleotide sequence database, with entries from all |
|                       | traditional divisions of GenBank, EMBL, and DDBJ;   |
|                       | excluding bulk divisions (gss, sts, pat, est, htg)  |
|                       | and wgs entries. Partially non-redundant.           |
+-----------------------+-----------------------------------------------------+

Chris_Rands · Answer

The RefSeq team and the NCBI Resource Coordinators each publish a new paper every few years, so their papers are worth checking for details. To answer your second question: non-redundancy here is (I think) defined very strictly, as proteins that are identical in both sequence and length. The clustering is therefore trivial and does not need a sophisticated clustering algorithm of the kind required to detect more remote homologs.
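
To illustrate why exact-identity merging is trivial, here is a minimal Python sketch that groups records by their full sequence string and joins the deflines of each group with control-A. This is only an illustration of the rule, not NCBI's actual pipeline. The function `merge_identical` and the toy records are hypothetical, and the sketch assumes sequences are already normalized (same case, no whitespace).

```python
def merge_identical(records):
    """Merge (defline, sequence) records that share the exact same sequence.

    Two sequences end up in the same entry only if the strings are equal,
    i.e. same length and same residue at every position.
    """
    groups = {}  # sequence -> list of deflines, in first-seen order
    for defline, sequence in records:
        groups.setdefault(sequence, []).append(defline)
    # Join the deflines of each group with control-A (0x01).
    return [("\x01".join(deflines), seq) for seq, deflines in groups.items()]


# Toy records, invented for this example.
toy_records = [
    ("P1.1 toy protein A", "MKT"),
    ("P2.1 toy protein B", "MKT"),   # identical to P1.1, so it is merged
    ("P3.1 toy protein C", "MKTA"),  # one residue longer, so it stays separate
]

merged = merge_identical(toy_records)
for defline, seq in merged:
    print(repr(defline), seq)
```

A dictionary lookup keyed on the sequence is enough here. No alignment or similarity threshold is involved, which is the difference from tools that cluster at, say, 90% identity.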

Answered by Chris_Rands on April 25, 2021
