
How do I include repeat purity, default slippage, default stutter, and minimum flanking (left and right) in Tandem Repeats Finder's output?

Bioinformatics Asked by annabelperry on August 30, 2020

I am trying to create a markerInfoFile for use with a program called popSTR (GitHub documentation: https://github.com/DecodeGenetics/popSTR). The marker info file should contain information about the microsatellites in a reference genome. Microsatellites are stretches of DNA in which a short motif (e.g. "AGG") is repeated several times (e.g. three repeats of "AGG" give "AGGAGGAGG"). Here is an example line from the popSTR author’s marker info file:

chr10 10589 10601 GCCC 3.2 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCCTTCTAACTGGACTCTGACCCTGATTGTTGAGGGCTGCAAAGAGGAAGAATTTTATTTACCGTCGCTGTGGCCCCGAGTTGTCCCAAAGCGAGGTAATGCCCGCAAGGTCTGTGCTGATCAGGACGCAGCTCTGCCTTCGGGGTGCCCCTGGACT 
GGTCTGTGCTGAGGAGAACGCTGCTCCGCCTCCGCGGTACTCCGGACATATGTGCAGAGAAGAACGCAGCTGCGCCCTCGCCATGCTCTGCGAGTCTCTGCTGATGAGAACACAGCTTCACTTTCGCAAAGGCGCAGCGCCGGCGCAGGCGCGGAGGGGCGCGCAGCGCCGGCGCAGGCGCGGAGGGGCGCGCCCGAACCCGAACCCTAATGCCGTCATAAGAGCCCTAGGGAGACCTTAGGGAACAAGCATTAAACTGACACTCGATTCTGTAGCCGGCTCTGCCAAGAGACATGGCGTTGCGGTGATATGAGGGCAGGGGTCATGGAAGAAAGCCTTCTGGTTTTAGACCCACAGGAAGATCTGTGACGCGCTCTTGGGTAGAGCACACGTTGCTGGGCGTGCGCTTGAAAAGAGCCTAAGAAGAGGGGGCGTCTGGAAGGAACCGCAACGCCAAGGGAGGGTGTCCAGCCTTCCCGCTTCAACACCTGGACACATTCTGGAAAGTTTCCTAAGAAAGCCAGAAAAATAATTTAAAAAAAAATCCAGAGGCCAGACGGGCTAATGGGGCTTTACTGCGACTATCTGGCTTAATCCTCCAAACAACCTTGCCATACCAGCCCATCAGTCCTCTGAGACAGGTGAAGAACCTGAGGTCGCAGGAGGACACCCAGAAGGTCCAGAGAGAGCCTCCTAGGCCCCCCACCTCCCCCCGTGGCAGCTCCAACCCCAGCTTTTTCACTAGTAAGGCAGTCGGGCCCCTGGGCCACGCCCACTCCCCCAAGCGGGGAAGGAGCTTCGCGCTGCCGCTTGGCTGGGGACTGGGCACCGCCCTCCCGCGGCTCCTGAGCCGGCTGCCACCAGGGGGCGCGCCAGCGGTGTCCGGGAGCCTAGCGGCGCGTGTGCAGCGGCCAGTGCACCTGCTCTGGCCCTCGCCGCGGTCTCTGCCAGGACCCCGACGCCCAGCCTGACCCTGCCATTCAGCGGGGCTGCGGCTCCA GCCCGCCCGCCCG 4 4 1.00 0.0149855 0.949846 0 0.75 0.25 0

Though it may be difficult to tell from the example above, the desired output is a 17-column, space-delimited file with the following layout:

  chrom startCoordinate endCoordinate repeatMotif numOfRepeatsInRef 1000refBasesBeforeStart 1000refBasesAfterEnd repeatSeqFromRef minFlankLeft minFlankRight repeatPurity defaultSlippage defaultStutter fractionAinMotif fractionCinMotif fractionGinMotif fractionTinMotif
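To make the 17-column layout easier to check, here is a minimal Python sketch that splits one marker info line on whitespace and labels each field. The function name `parse_marker_line` is my own, and the column names are simply copied from the layout above; nothing here comes from popSTR itself.

```python
# Column names copied, in order, from the layout described above.
MARKER_COLUMNS = [
    "chrom", "startCoordinate", "endCoordinate", "repeatMotif",
    "numOfRepeatsInRef", "1000refBasesBeforeStart", "1000refBasesAfterEnd",
    "repeatSeqFromRef", "minFlankLeft", "minFlankRight", "repeatPurity",
    "defaultSlippage", "defaultStutter", "fractionAinMotif",
    "fractionCinMotif", "fractionGinMotif", "fractionTinMotif",
]


def parse_marker_line(line):
    """Split one space-delimited marker info line into a dict of strings.

    Hypothetical helper for illustration; values are left as strings.
    """
    fields = line.split()
    if len(fields) != len(MARKER_COLUMNS):
        raise ValueError(
            f"expected {len(MARKER_COLUMNS)} fields, got {len(fields)}"
        )
    return dict(zip(MARKER_COLUMNS, fields))


if __name__ == "__main__":
    # Same values as the example line, with the two 1000-base flanks
    # shortened to placeholders so the line fits on screen.
    example = ("chr10 10589 10601 GCCC 3.2 NNNN ACGT GCCCGCCCGCCCG "
               "4 4 1.00 0.0149855 0.949846 0 0.75 0.25 0")
    for name, value in parse_marker_line(example).items():
        print(f"{name}\t{value}")
```

Labelling the fields this way makes it clear that the last nine values in the example line are minFlankLeft through fractionTinMotif.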

Since I am working with non-human genomes, I cannot simply use the author’s sample marker info files. I could not find explicit instructions for creating this markerInfoFile in popSTR’s full journal article (https://academic.oup.com/bioinformatics/article/33/24/4041/2525679), so I emailed the authors. They instructed me to run Tandem Repeats Finder (TRF) from the Linux command line with the following options:

./trf409.linux64 reference_genome.fasta 2 7 7 80 10 22 1000 -d -h -ngs > reference_genome_TRFmap

Here is the documentation for TRF:
https://blaxter-lab-documentation.readthedocs.io/en/latest/trf.html

I applied TRF to my reference genome, but the output was in the following format:

Chrom
    startCoordinate endCoordinate PeriodSize numOfRepeatsInRef SizeofConsensusPattern PercentofMatchesBetweenAdjacentCopies PercentofIndelsBetweenAdjacentCopies OverallAlignmentScore fractionAinMotif fractionCinMotif fractionGinMotif fractionTinMotif Entropy repeatMotif repeatSeqFromRef 1000refBasesBefore 1000refBasesAfter

Here is an example of the first two lines of my current output file:

@ScyDAA6_1;HRSCAF=23
108 120 5 2.6 5 100 0 26 0 0 15 84 0.62 TTTGT TTTGTTTTGTTTT CAGTAAAGTCTTTCTTTCCTCTAACATAGAAAGTACTACTAGATTAGTGC TCTGTGTATGCTCTCTATTCTCAACCTCCAGATGCCCGTTCACACTGAGC
414 424 5 2.2 5 100 0 22 63 0 18 18 1.31 ATGAA ATGAAATGAAA TGAACTGGAATGTACATGATTGAATTTAAATTACTTCTTTAAAAAATTCC AGTATTGTGAATTGGTGCTAAATAAATAAACTGAATGAAAAATAACTCAC
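For reference, this is roughly how I would read that output into per-row records with the sequence name attached. It is a minimal Python sketch, assuming every sequence name sits on its own line starting with `@` and every data row has exactly 17 whitespace-separated fields, as in the lines above. The function name `parse_trf_ngs` and the short column labels are my own, not TRF's.

```python
# My own labels for the 17 fields of each data row, in the order shown above.
TRF_COLUMNS = [
    "start", "end", "periodSize", "copyNumber", "consensusSize",
    "percentMatches", "percentIndels", "score",
    "percentA", "percentC", "percentG", "percentT", "entropy",
    "motif", "repeatSeq", "leftFlank", "rightFlank",
]


def parse_trf_ngs(lines):
    """Return one dict per repeat, with the preceding '@' name as 'chrom'."""
    records = []
    chrom = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Header line: remember the name for the rows that follow.
            chrom = line[1:].strip()
            continue
        fields = line.split()
        if len(fields) != len(TRF_COLUMNS):
            # Assumption: anything else is a malformed or unexpected row.
            raise ValueError(f"expected 17 fields, got {len(fields)}: {line!r}")
        record = dict(zip(TRF_COLUMNS, fields))
        record["chrom"] = chrom
        records.append(record)
    return records


if __name__ == "__main__":
    # reference_genome_TRFmap is the file written by the command above.
    with open("reference_genome_TRFmap") as handle:
        for rec in parse_trf_ngs(handle):
            print(rec["chrom"], rec["start"], rec["end"], rec["motif"])
```

Note that the two flanking sequences in my rows are 50 bases long, not the 1000 bases used in the popSTR example.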

My current TRF output appears to be grouped by chromosome: each chromosome name sits on its own line, followed by the microsatellites found on that chromosome. I think I can add the appropriate chromosome name to each row without any problem, and I can easily rearrange the data into the appropriate columns. My main issue is with the information itself. My current output includes data I don’t need: period size, size of consensus pattern, percent of matches between adjacent copies, percent of indels between adjacent copies, overall alignment score, and entropy are not required for my desired output (unless some of these terms are synonymous with columns in the desired output and I am simply not aware of it). More importantly, my output needs to include the minimum flanking on the left and right, repeat purity, default slippage, and default stutter, and none of these appears in what TRF gave me. Which TRF options do I need to use for this information to be included in my output?
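To show the rearrangement I have in mind, and exactly which columns remain unfilled, here is a minimal Python sketch. It assumes the motif fractions are the base composition of the motif itself, which is consistent with the popSTR example line (motif GCCC, fractions 0 0.75 0.25 0). The function names and the "NA" placeholder are my own; the five columns I am asking about are filled with that placeholder, and coordinates and flanks are passed through unchanged, which may not match what popSTR expects.

```python
def base_fractions(motif):
    """Fraction of A, C, G and T in the motif, in that order."""
    motif = motif.upper()
    return tuple(motif.count(base) / len(motif) for base in "ACGT")


def trf_row_to_marker_fields(chrom, trf_fields, placeholder="NA"):
    """Rearrange one 17-field TRF row into the 17-column marker layout.

    Hypothetical helper. minFlankLeft, minFlankRight, repeatPurity,
    defaultSlippage and defaultStutter are NOT computed here; they are
    filled with `placeholder` because the TRF row has no such fields.
    """
    start, end = trf_fields[0], trf_fields[1]
    copy_number = trf_fields[3]
    motif, repeat_seq = trf_fields[13], trf_fields[14]
    # Flanks as reported by TRF (50 bases in my output, not 1000).
    left_flank, right_flank = trf_fields[15], trf_fields[16]
    fractions = [f"{x:g}" for x in base_fractions(motif)]
    return [chrom, start, end, motif, copy_number,
            left_flank, right_flank, repeat_seq,
            placeholder, placeholder,               # minFlankLeft/Right
            placeholder, placeholder, placeholder,  # purity, slippage, stutter
            *fractions]


if __name__ == "__main__":
    # First data row from my output, with the flanks shortened for display.
    row = ("108 120 5 2.6 5 100 0 26 0 0 15 84 0.62 "
           "TTTGT TTTGTTTTGTTTT CAGTAAAG TCTGTGTA").split()
    print(" ".join(trf_row_to_marker_fields("ScyDAA6_1;HRSCAF=23", row)))
```

The rearrangement covers twelve of the seventeen columns; the five "NA" columns are the ones my question is about.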
