How to fragment genomes into non-overlapping sequences of differing sizes?

Bioinformatics Asked by Stanley Ho on August 22, 2021

I have bacterial chromosomes/plasmids that I downloaded from RefSeq as a multi-FASTA file, and I would like to split them into non-overlapping fragments of between 1 kb and 15 kb.

I have tried googling for related tools but to no avail. Does anyone know how I would achieve this? Something very similar is done in this paper:
https://link.springer.com/article/10.1007/s40484-019-0187-4

Thanks in advance!

2 Answers

You could use numpy.random.randint to sample fragment lengths from a uniform distribution between 1000 and 15000 nt, extracting subsequences of those lengths from a linearized multi-FASTA file.

Here's some code:

#!/usr/bin/env python

import sys
import numpy as np

low = int(sys.argv[1])
high = int(sys.argv[2])

def process_record(header, sequence, low, high):
    next = 0
    fragment = 1
    while next < len(sequence):
        # randint's upper bound is exclusive, so use high + 1 to allow 15 kb fragments
        new_next = np.random.randint(low, high + 1) + next
        # if the remainder after this fragment would be shorter than low,
        # extend the fragment to the end of the record instead
        if new_next > len(sequence) - low:
            new_next = len(sequence)
        sys.stdout.write('{}:{}\n{}\n'.format(header, fragment, sequence[next:new_next]))
        next = new_next
        fragment += 1

header = ""
sequence = ""
for line in sys.stdin:
    if line.startswith('>'):
        if len(sequence) > 0:
            process_record(header, sequence, low, high)
            sequence = ""
        header = line.rstrip()
    else:
        sequence += line.rstrip()

if len(sequence) > 0:
    process_record(header, sequence, low, high)

To use it:

$ ./13879.py 1000 15000 < in.fa > out.fa

Fragments will be disjoint (non-overlapping, contiguous) subsequences.

I added a check to determine whether the last fragment of a FASTA record would be shorter than low (1 kb), and, if so, merge that remainder into the preceding fragment. As a result, the last fragment can be up to 16 kb - 1 nt long, given your parameters:

if new_next > len(sequence) - low:
    new_next = len(sequence)

You could change this logic to discard the last fragment instead (i.e., break out of the loop), or simply allow fragments shorter than 1 kb, depending on what you need.
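For instance, the discard variant is a one-line change to the check above; note that break drops the just-sampled, in-bounds fragment along with the short tail:

if new_next > len(sequence) - low:
    break  # stop here and discard everything from next to the end of the record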

EDIT

Here's a slight modification that should allow you to preserve fragment sizes within your bounds.

A cartoon may help illustrate the problem. Let's set up a FASTA record, which has been broken up into n fragments:

   1       2           n-2      n-1      n
|-----|---------|...|-------|---------|-----|

We may get unlucky and sample a random integer for fragment n-1 (frag(n-1)) which makes the length of frag(n) less than low (1 kb in your example).

In my first suggested solution, frag(n-1) and frag(n) are appended to one another. This is why you get some fragments longer than high (15 kb).

Another option is to discard the tail entirely. In the code below, the loop stops as soon as fewer than high (15 kb) nt remain, so every emitted fragment stays within your bounds; the discarded tail can be up to high nt long. Here's some code:

#!/usr/bin/env python

import sys
import numpy as np
import timeit

low = int(sys.argv[1])
high = int(sys.argv[2])
in_fn = sys.argv[3]

"""
Globals
"""
fragment_lengths = []

"""
Process a record
"""
def process_record(header, sequence, low, high):
    next = 0
    last = 0
    fragment = 1
    fragment_sequence = ""
    # stop once fewer than `high` nt remain, discarding that tail
    while last < len(sequence) - high:
        next = np.random.randint(low, high + 1) + last
        fragment_sequence += sequence[last:next]
        # Comment out the sys.stdout.write and uncomment the fragment_lengths
        # line if measuring fragment-length distributions
        sys.stdout.write('{}:{}-{}:{}\n{}\n'.format(header, last, next, fragment, sequence[last:next]))
        #fragment_lengths.append(next - last)
        last = next
        fragment += 1
    """
    Uncomment to compare input sequence with fragments
    """
    #sys.stderr.write('In > {}\nOut > {}\n'.format(sequence, fragment_sequence))
    
"""
Parse records
"""
def parse_records(in_fn, low, high):
    header = ""
    sequence = ""
    with open(in_fn, 'r') as ifh:
        for line in ifh:
            if line.startswith('>'):
                if len(sequence) > 0:
                    process_record(header, sequence, low, high)
                    sequence = ""
                header = line.rstrip()
            else:
                sequence += line.rstrip()
        if len(sequence) > 0:
            process_record(header, sequence, low, high)

def count_elements(seq):
    hist = {}
    for i in seq:
        hist[i] = hist.get(i, 0) + 1
    return hist

def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

def main(in_fn, low, high):
    """
    Comment below if testing the distribution of fragments
    """
    parse_records(in_fn, low, high)
    """
    Uncomment to test the distribution of fragments from 
    running fns on same input 100k times
    """
    #wrapped = wrapper(parse_records, in_fn, low, high)
    #timeit.timeit(wrapped, number=100000)
    #print(count_elements(fragment_lengths))

if __name__ == "__main__":
    main(in_fn, low, high)
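To use it (note the input FASTA is now a positional argument rather than stdin; the script name here is just an example):

$ ./13879-edit.py 1000 15000 in.fa > out.fa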

If frag(n) is needed, and you need to preserve the uniform distribution (or whatever fragment-length distribution you want to model), then rejection sampling might be an option.

With rejection sampling, you keep refragmenting the input sequence until one set of sampled lengths happens to cover it exactly, with every fragment inside your bounds.

If you need that, follow up and I'll take another look.
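In the meantime, here's a minimal sketch of that idea (the function name sample_covering_lengths and the max_tries cap are my own, not part of the method above): redraw a whole set of fragment lengths, and accept it only when the lengths sum exactly to the sequence length.

#!/usr/bin/env python

import numpy as np

def sample_covering_lengths(total_len, low, high, max_tries=10000):
    # Rejection sampling: redraw a full set of lengths until one draw
    # tiles total_len exactly, with every length in [low, high].
    for _ in range(max_tries):
        lengths = []
        remaining = total_len
        while remaining > 0:
            n = np.random.randint(low, high + 1)
            if n > remaining:
                break  # overshoot: reject this draw and start over
            lengths.append(n)
            remaining -= n
        if remaining == 0:
            return lengths  # accepted: fragments cover the sequence exactly
    raise RuntimeError('no covering draw found; check that total_len >= low')

# Example: turn accepted lengths into slice boundaries
#   lengths = sample_covering_lengths(len(sequence), 1000, 15000)
#   bounds = np.cumsum([0] + lengths)
#   fragments = [sequence[b:e] for b, e in zip(bounds[:-1], bounds[1:])]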

Correct answer by Alex Reynolds on August 22, 2021

I've previously written my own script to fragment DNA sequences (input as either fastq or fasta files) into identical-length fragments, with a given overlap. I used this because there was a particular assembler that demanded equal-length sequences, and I was working with nanopore reads, but it's since been useful for other purposes as well.

Because the script is most concerned with making lengths identical, some overlap may be forced even when the requested overlap is zero; the overlap will be exactly zero only when the sequence length is an exact multiple of the fragment size. Here's an example use:

zero overlap

normalise_seqlengths.pl -fraglength 25 -overlap 0 oneRead.fq
@A00187:180:HLGNGDSXX:3:1101:8486:1031#0 [150 bp, 6 fragments, 0 overlap] 1:N:0:CGCAAGAA+ACTCCTAA
ATTGCGCAATGACAGCCAGGGGAGT
+
FFFFFFFFFFFFFFFFF:,FFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#1:25-50  1:N:0:CGCAAGAA+ACTCCTAA
GGGAGCTGTCTGGGGTAACCTGGCT
+
:FFFF:F:FFFFF::FFFFF,::FF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#2:50-75  1:N:0:CGCAAGAA+ACTCCTAA
CTAGGACCAGGTGGAAAAAGCACCT
+
F:FFFFFFFFFFFFFFFFFFFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#3:75-100  1:N:0:CGCAAGAA+ACTCCTAA
TGGCTGGCCCAGCCCTGTTCTTCTC
+
:FF,FFFFF,:,FFF::F:FFF,:F
@A00187:180:HLGNGDSXX:3:1101:8486:1031#4:100-125  1:N:0:CGCAAGAA+ACTCCTAA
CTGGGGAGGGAAGGGCAGGTCCCCT
+
F:FFFFFFFFF,FFFFF:FFF,F,,
@A00187:180:HLGNGDSXX:3:1101:8486:1031#5:125-150  1:N:0:CGCAAGAA+ACTCCTAA
AATGCCTGAACCACATGGTGGGCCA
+
FFF,F:FFFFFFF:F,,,FFF,F,,

partial overlap

normalise_seqlengths.pl -fraglength 63 -overlap 0 oneRead.fq
@A00187:180:HLGNGDSXX:3:1101:8486:1031#0 [150 bp, 3 fragments, 20 overlap] 1:N:0:CGCAAGAA+ACTCCTAA
ATTGCGCAATGACAGCCAGGGGAGTGGGAGCTGTCTGGGGTAACCTGGCTCTAGGACCAGGTG
+
FFFFFFFFFFFFFFFFF:,FFFFFF:FFFF:F:FFFFF::FFFFF,::FFF:FFFFFFFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#1:43-106  1:N:0:CGCAAGAA+ACTCCTAA
CCTGGCTCTAGGACCAGGTGGAAAAAGCACCTTGGCTGGCCCAGCCCTGTTCTTCTCCTGGGG
+
FF,::FFF:FFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFF,:,FFF::F:FFF,:FF:FFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#2:87-150  1:N:0:CGCAAGAA+ACTCCTAA
CCCTGTTCTTCTCCTGGGGAGGGAAGGGCAGGTCCCCTAATGCCTGAACCACATGGTGGGCCA
+
FFF::F:FFF,:FF:FFFFFFFFF,FFFFF:FFF,F,,FFF,F:FFFFFFF:F,,,FFF,F,,
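The tool itself is a Perl script, but the windowing behaviour shown above is easy to reconstruct. Here's a rough Python sketch of the same idea (my own reconstruction, not gringer's code; the function name fragment_fixed is mine): start positions are spaced as evenly as possible across the sequence, so overlap beyond the requested amount appears only when the length isn't an exact multiple of the step size.

def fragment_fixed(seq, fraglength, overlap=0):
    # Split seq into fragments of identical length. Assumes overlap < fraglength.
    L = len(seq)
    if L <= fraglength:
        return [seq]
    step = fraglength - overlap
    # number of fragments needed: ceil((L - fraglength) / step) + 1
    n = -(-(L - fraglength) // step) + 1
    span = L - fraglength
    # spread the start positions evenly across [0, span]
    starts = [span * i // (n - 1) for i in range(n)]
    return [seq[s:s + fraglength] for s in starts]

# With a 150 nt sequence and fraglength 63, this reproduces the windows in
# the second example above: 0-63, 43-106, 87-150.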

Answered by gringer on August 22, 2021
