TransWikia.com

Metagenomics: Identifying most common sequences

Bioinformatics Asked by DumbledoreTheGrey on March 1, 2021

I am working on a project and used the following command:

vsearch --derep_fulllength filtered_merged.fa -sizeout -relabel Uniq -output dereplicated_filtered_merged.fa

and got the following output:

87373926 nt in 203453 seqs, min 310, max 480, avg 352
Sorting 100%
10981 unique sequences, avg cluster 2.0, median 1, max 1287
Writing output file 100% 

The output tells me that 10981 unique sequences were identified, but I can't work out how many reads of the most common sequence were present in the input data.

Any suggestions would be greatly appreciated!

One Answer

According to the VSEARCH docs, since you have specified --sizeout your abundances have been written into the FASTA headers:

--sizein

Take into account the abundance annotations present in the input fasta file (search for the pattern '[>;]size=integer[;]' in sequence headers). That option is active by default when rereplicating.

--sizeout

Add abundance annotations to the output fasta file (add the pattern ';size=integer;' to sequence headers). If --sizein is specified, each unique sequence receives a new abundance value corresponding to its total abundance (sum of the abundances of its occurrences). If --sizein is not specified, input abundances are set to 1, and each unique sequence receives a new abundance value corresponding to its number of occurrences in the input file.
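As a rough illustration of reading those annotations back (this is not part of the quoted documentation), the sketch below scans FASTA headers for the size= pattern and reports the record with the largest value. The function name most_abundant and the example headers are hypothetical. The headers assume the Uniq prefix from the -relabel option in the question, and pairing Uniq1 with 1287 is only a guess based on the "max 1287" figure in the reported output.

```python
import re

# Matches the 'size=N' annotation that --sizeout adds to each header,
# following the '[>;]size=integer[;]' pattern quoted from the docs.
SIZE_RE = re.compile(r"[>;]size=(\d+)")


def most_abundant(fasta_lines):
    """Return (header, size) for the record with the largest size= value.

    Returns None if no header carries an abundance annotation.
    """
    best = None
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        match = SIZE_RE.search(line)
        if match is None:
            continue  # header without an abundance annotation
        size = int(match.group(1))
        if best is None or size > best[1]:
            best = (line[1:].strip(), size)
    return best


# Hypothetical records for illustration only; the sequences and the
# smaller sizes are made up.
example = [
    ">Uniq1;size=1287",
    "ACGT",
    ">Uniq2;size=15",
    "ACGA",
    ">Uniq3;size=1",
    "ACGC",
]
print(most_abundant(example))
```

To try it on the real output, the same function could be given an open file handle for dereplicated_filtered_merged.fa instead of the example list, since it only needs an iterable of lines.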

Correct answer by Maximilian Press on March 1, 2021
