How is BLAST's nr database created?

Bioinformatics Asked by juniper- on April 25, 2021

Is there a paper or web page describing the procedure for creating the nr database used by NCBI’s BLAST implementation?

I presume it’s some type of clustering, but I’m curious about how exactly sequences are condensed into non-redundant representatives.

2 Answers

Did a little more searching and found the answer in the README on BLAST's ftp site:

6. Non-redundant defline syntax

The non-redundant databases are nr, nt and pataa. Identical sequences are 
merged into one entry in these databases. To be merged two sequences must
have identical lengths and every residue at every position must be the 
same.  The FASTA deflines for the different entries that belong to one 
record are separated by control-A characters invisible to most 
programs. In the example below both entries Q57293.1 and AAB05030.1
have the same sequence, in every respect:

>Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC ^AAAB05030.1 afuC 
[Actinobacillus pleuropneumoniae] ^AAAB17216.1 afuC [Actinobacillus pleuropneumoniae]

Individual sequences are now identified simply by their accession.version.

For databases whose entries are not from official NCBI sequence databases, 
such as Trace database, the gnl| convention is used. For custom databases, 
this convention should be followed and the id for each sequence must be 
unique, if one would like to take the advantage of indexed database, 
which enables specific sequence retrieval using blastdbcmd program included 
in the blast executable package.  One should refer to documents 
distributed in the standalone BLAST package for more details.
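
The control-A separator described in the README can be handled with an ordinary string split. Below is a minimal Python sketch, assuming the defline has already been read into a string. The function name `split_merged_defline` is made up for illustration, and the example defline is the one quoted above, with `^A` written as the actual `\x01` character.

```python
def split_merged_defline(defline: str) -> list[tuple[str, str]]:
    """Split a merged nr-style defline into (accession.version, title) pairs.

    Hypothetical helper: assumes entries are separated by control-A (\\x01)
    and that each entry starts with its accession.version followed by a space.
    """
    entries = []
    for part in defline.lstrip(">").split("\x01"):
        part = part.strip()
        if not part:
            continue
        # Everything up to the first space is the identifier, the rest is the title.
        accession, _, title = part.partition(" ")
        entries.append((accession, title.strip()))
    return entries


# The README example, with ^A replaced by the real control character.
example = (
    ">Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC "
    "\x01AAB05030.1 afuC [Actinobacillus pleuropneumoniae] "
    "\x01AAB17216.1 afuC [Actinobacillus pleuropneumoniae]"
)

for accession, title in split_merged_defline(example):
    print(accession, "|", title)
```

Each pair corresponds to one of the original entries that were merged into the single nr record.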

I landed on that README via another question.


In that same README file is some information on the origin of the sequences in the non-redundant sets:

| File Name | Content Description                                            |
| nr.gz*    | non-redundant protein sequence database with entries           |
|           | from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq             |
| nt.gz*    | nucleotide sequence database, with entries from all            |
|           | traditional divisions of GenBank, EMBL, and DDBJ;              |
|           | excluding bulk divisions (gss, sts, pat, est, htg)             |
|           | and wgs entries. Partially non-redundant.                      |

Correct answer by juniper- on April 25, 2021

The RefSeq team and the NCBI resource coordinators team each publish a new paper every few years, so those papers are worth checking. To answer your second question: non-redundancy here is (I think) defined very strictly, as proteins that are identical in both sequence and length. That makes the clustering trivial, with no need for the kind of sophisticated clustering algorithm required to detect more remote homologs.
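
To illustrate why exact-identity merging is trivial, here is a small Python sketch that groups records by their full sequence string. It is only an illustration of the idea, not NCBI's actual pipeline; the function name `merge_identical` and the toy accessions and sequences are made up.

```python
from collections import defaultdict


def merge_identical(records):
    """Group (accession, sequence) pairs whose sequences are exactly equal.

    Two sequences land in the same group only if they have the same length
    and the same residue at every position, i.e. the strings are identical.
    """
    groups = defaultdict(list)
    for accession, sequence in records:
        groups[sequence].append(accession)
    return dict(groups)


# Toy input (hypothetical accessions and sequences).
records = [
    ("seqA.1", "MKTAYIAK"),
    ("seqB.1", "MKTAYIAK"),   # identical to seqA.1, so it is merged with it
    ("seqC.1", "MKTAYIAKQ"),  # one residue longer, so it stays separate
]

merged = merge_identical(records)
for sequence, accessions in merged.items():
    print(", ".join(accessions), "->", sequence)
```

A dictionary lookup per sequence is all that is needed, which is why no similarity threshold or alignment step enters the picture.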

Answered by Chris_Rands on April 25, 2021
