# How is BLAST's nr database created?

Bioinformatics Asked by juniper- on April 25, 2021

Is there a paper or web page describing the procedure for creating the nr database used by NCBI’s BLAST implementation?

I presume it’s some type of clustering, but I’m curious about how exactly sequences are condensed into non-redundant representatives.

Did a little more searching and found the answer in the README on BLAST's ftp site: ftp://ftp.ncbi.nlm.nih.gov/blast/db/README

6. Non-redundant defline syntax

The non-redundant databases are nr, nt and pataa. Identical sequences are
merged into one entry in these databases. To be merged two sequences must
have identical lengths and every residue at every position must be the
same.  The FASTA deflines for the different entries that belong to one
record are separated by control-A characters invisible to most
programs. In the example below both entries Q57293.1 and AAB05030.1
have the same sequence, in every respect:

>Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC ^AAAB05030.1 afuC
[Actinobacillus pleuropneumoniae] ^AAAB17216.1 afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVTKSSIQNRDIC
EPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMNKGTIMQKARQKIFIYDRILYSLRNFMGEST
ANPDQFDPDATKAFIHFTEQGIFLLNKE

Individual sequences are now identified simply by their accession.version.

For databases whose entries are not from official NCBI sequence databases,
such as Trace database, the gnl| convention is used. For custom databases,
this convention should be followed and the id for each sequence must be
unique, if one would like to take the advantage of indexed database,
which enables specific sequence retrieval using blastdbcmd program included
in the blast executable package.  One should refer to documents
distributed in the standalone BLAST package for more details.
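
To make the defline syntax above concrete, here is a minimal Python sketch that splits a merged defline back into its individual entries. The function name `split_defline` and the constant `CTRL_A` are my own, not part of BLAST, and the sketch assumes the defline has already been read as a single unwrapped string with real control-A bytes (`\x01`) rather than the printable `^A` shown in the README.

```python
CTRL_A = "\x01"  # the control-A byte that the README renders as ^A

def split_defline(defline):
    """Split a merged defline into (accession.version, description) pairs."""
    entries = []
    for part in defline.lstrip(">").split(CTRL_A):
        # The first whitespace-delimited token is the accession.version.
        accession, _, description = part.strip().partition(" ")
        entries.append((accession, description))
    return entries

# The example defline from the README, joined onto one line.
defline = (
    ">Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC "
    "\x01AAB05030.1 afuC [Actinobacillus pleuropneumoniae] "
    "\x01AAB17216.1 afuC [Actinobacillus pleuropneumoniae]"
)
entries = split_defline(defline)
for accession, description in entries:
    print(accession, "|", description)
```

Each tuple then corresponds to one of the original records that shared the sequence.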


Landed on that README from this question on biostars.org: https://www.biostars.org/p/217456/

Edit

In that same README file is some information on the origin of the sequences in the non-redundant sets:

+-----------------------+-----------------------------------------------------+
|File Name              | Content Description                                 |
+-----------------------+-----------------------------------------------------+
|nr.gz*                 | non-redundant protein sequence database with entries|
|                       | from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq  |
|nt.gz*                 | nucleotide sequence database, with entries from all |
|                       | traditional divisions of GenBank, EMBL, and DDBJ;   |
|                       | excluding bulk divisions (gss, sts, pat, est, htg)  |
|                       | and wgs entries. Partially non-redundant.           |
+-----------------------+-----------------------------------------------------+


Correct answer by juniper- on April 25, 2021

The RefSeq team and the NCBI resource coordinators each publish a new paper every few years, so those papers are worth checking for details. To answer your second question: non-redundancy here is (I think) defined very strictly, as proteins that are identical in both sequence and length. That makes the clustering trivial; it does not need the kind of sophisticated clustering algorithm required to detect more remote homologs.
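
As a rough illustration of why exact-match merging is trivial, the sketch below groups records by their full sequence string using a dictionary. This is not NCBI's actual pipeline, only an assumed minimal equivalent; `merge_identical` and the toy records are invented for the example, and it assumes sequences are already normalized (same case, no whitespace).

```python
CTRL_A = "\x01"  # separator the README describes between merged deflines

def merge_identical(records):
    """Collapse (defline, sequence) pairs that share exactly the same sequence."""
    by_sequence = {}  # dicts keep insertion order, so first-seen order is preserved
    for defline, sequence in records:
        by_sequence.setdefault(sequence, []).append(defline)
    return [(CTRL_A.join(deflines), sequence)
            for sequence, deflines in by_sequence.items()]

# Toy records, invented for illustration only.
records = [
    ("seqA.1 protein one", "MKTAYIAK"),
    ("seqB.1 protein two", "MKTAYIAKQ"),   # one residue longer, so not merged
    ("seqC.1 protein three", "MKTAYIAK"),  # identical to seqA.1, so merged
]
merged = merge_identical(records)
for defline, sequence in merged:
    print(defline.replace(CTRL_A, " ^A"), sequence)
```

Because the dictionary key is the whole sequence, two records merge only if they have the same length and the same residue at every position, which matches the strict definition quoted from the README.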

Answered by Chris_Rands on April 25, 2021
