TransWikia.com

How to retrieve the DNA sequence from a particular species and region through command line?

Bioinformatics Asked by fp1234 on April 13, 2021

I am new to working with the NCBI database so I am not sure how trivial this question is. I wanted to get DNA sequence of a specific version of the Xenopus tropicalis genome of a specific region (ie 1-100bp) without doing it manually each time (going on genome browser). I know there is a faster way and efficient way to do this through the command line but I found the NCBI tutorial hard to follow.

2 Answers

Samtools

Samtools is a toolkit that includes a binary called samtoools, which can be used to query a FASTA file — to extract a subsequence based on some coordinate range you specify.

If you do not have samtools installed, you need to add it to your computer. For OS X, you could use Homebrew, for example:

$ brew install samtools

For Ubuntu, you could use apt-get:

$ sudo apt-get install samtools

Reference assembly

The next step is to download a reference assembly for your genome and genomic build-of-interest.

There is a research group dedicated to making Xenopus datasets available. If you were interested in v10 of the Xentr genome, for instance, you could do the following or similar to download it:

$ curl --output "XENTR_10.0_genome.fasta.gz" "http://ftp.xenbase.org/pub/Genomics/JGI/Xentr10.0/XENTR_10.0_genome.fasta.gz"

You could also dig through the NCBI site to get a similar link. It's the same idea — just change the web address, accordingly.

Indexing

Indexing makes it possible to extract an arbitrary subsequence of interest from this FASTA file. You need to index, but you only need to do it once:

$ samtools faidx XENTR_10.0_genome.fasta.gz

This creates two new files: XENTR_10.0_genome.fasta.gz.fai and XENTR_10.0_genome.fasta.gz.gzi.

Note: You need to keep these two files in the same directory as the original, compressed FASTA file XENTR_10.0_genome.fasta.gz.

Querying

Now you are ready to query, or extract a subsequence of interest. If you wanted the first hundred bases of the first chromosome, for instance:

$ samtools faidx XENTR_10.0_genome.fasta.gz Chr1:1-100
>Chr1:1-100
cctcagcgtcttcttctttctgctcccactgcattagcatagggagagagggccacagag
gctaggtagcattgccggctagcaatcttcagcgtcctct

The search term used here is Chr1:1-100. Replace Chr1 with the chromosome of interest and 1 and 100 with the start and stop coordinates, respectively, of interest. Coordinates used with samtools queries use a one-based indexing scheme.

Note: This uses the chromosome naming scheme in the XENTR_10.0_genome.fasta.gz file (Chr1, etc.). Other research groups (NCBI) might use different chromosome names. If you get your FASTA from somewhere else, investigate the FASTA file to determine the naming scheme being used.

If you want to redirect output to a file, use the output redirect operator (>):

$ samtools faidx XENTR_10.0_genome.fasta.gz Chr1:1-100 > mySubsequence.fa

Correct answer by Alex Reynolds on April 13, 2021

I'd download the fasta, and use samtools, or maybe bedtools.

Answered by swbarnes2 on April 13, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP