How to fragment genomes into non-overlapping sequences of differing sizes?

Bioinformatics Asked by Stanley Ho on August 22, 2021

I have bacterial chromosomes/plasmids that I downloaded from RefSeq as a multi-FASTA file, and I would like to split them into non-overlapping fragments of between 1 kb and 15 kb.

I have tried googling for related tools but to no avail. Does anyone know how I would achieve this? Something very similar is done in this paper:
https://link.springer.com/article/10.1007/s40484-019-0187-4

Thanks in advance!

2 Answers

You could use numpy.random.randint to sample fragment lengths from a uniform distribution between 1000 and 15000 nt, extracting subsequences of those lengths from a linearized multi-FASTA file.

Here's some code:

#!/usr/bin/env python

import sys
import numpy as np

low = int(sys.argv[1])
high = int(sys.argv[2])

def process_record(header, sequence, low, high):
    next = 0
    fragment = 1
    while next < len(sequence):
        # randint's upper bound is exclusive, so use high + 1 to allow 15 kb fragments
        new_next = np.random.randint(low, high + 1) + next
        # if the remainder after this fragment would be shorter than low,
        # extend the fragment to the end of the record instead
        if new_next > len(sequence) - low:
            new_next = len(sequence)
        sys.stdout.write('{}:{}\n{}\n'.format(header, fragment, sequence[next:new_next]))
        next = new_next
        fragment += 1

header = ""
sequence = ""
for line in sys.stdin:
    if line.startswith('>'):
        if len(sequence) > 0:
            process_record(header, sequence, low, high)
            sequence = ""
        header = line.rstrip()
    else:
        sequence += line.rstrip()

if len(sequence) > 0:
    process_record(header, sequence, low, high)

To use it:

$ ./13879.py 1000 15000 < in.fa > out.fa

Fragments will be disjoint (non-overlapping, contiguous) subsequences.

I added a check to determine whether the last fragment of a FASTA record would be shorter than low (1 kb), and, if so, merge that remainder into the preceding fragment. As a result, the last fragment can be up to 16 kb - 1 nt long, given your parameters:

if new_next > len(sequence) - low:
    new_next = len(sequence)

You could change this logic to discard the last fragment instead (i.e., break out of the loop), or simply allow fragments shorter than 1 kb, depending on what you need.
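For instance, the discard variant is a one-line change to the check above; note that break drops the just-sampled, in-bounds fragment along with the short tail:

if new_next > len(sequence) - low:
    break  # stop here and discard everything from next to the end of the record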

EDIT

Here's a slight modification that should allow you to preserve fragment sizes within your bounds.

A cartoon may help illustrate the problem. Let's set up a FASTA record, which has been broken up into n fragments:

   1       2           n-2      n-1      n
|-----|---------|...|-------|---------|-----|

We may get unlucky and sample a random integer for fragment n-1 (frag(n-1)) which makes the length of frag(n) less than low (1 kb in your example).

In my first suggested solution, frag(n-1) and frag(n) are appended to one another. This is why you get some fragments longer than high (15 kb).

Another option is to discard the tail entirely. In the code below, the loop stops as soon as fewer than high (15 kb) nt remain, so every emitted fragment stays within your bounds; the discarded tail can be up to high nt long. Here's some code:

#!/usr/bin/env python

import sys
import numpy as np
import timeit

low = int(sys.argv[1])
high = int(sys.argv[2])
in_fn = sys.argv[3]

"""
Globals
"""
fragment_lengths = []

"""
Process a record
"""
def process_record(header, sequence, low, high):
    next = 0
    last = 0
    fragment = 1
    fragment_sequence = ""
    # stop once fewer than `high` nt remain, discarding that tail
    while last < len(sequence) - high:
        next = np.random.randint(low, high + 1) + last
        fragment_sequence += sequence[last:next]
        # Comment out the sys.stdout.write and uncomment the fragment_lengths
        # line if measuring fragment-length distributions
        sys.stdout.write('{}:{}-{}:{}\n{}\n'.format(header, last, next, fragment, sequence[last:next]))
        #fragment_lengths.append(next - last)
        last = next
        fragment += 1
    """
    Uncomment to compare input sequence with fragments
    """
    #sys.stderr.write('In > {}\nOut > {}\n'.format(sequence, fragment_sequence))
    
"""
Parse records
"""
def parse_records(in_fn, low, high):
    header = ""
    sequence = ""
    with open(in_fn, 'r') as ifh:
        for line in ifh:
            if line.startswith('>'):
                if len(sequence) > 0:
                    process_record(header, sequence, low, high)
                    sequence = ""
                header = line.rstrip()
            else:
                sequence += line.rstrip()
        if len(sequence) > 0:
            process_record(header, sequence, low, high)

def count_elements(seq):
    hist = {}
    for i in seq:
        hist[i] = hist.get(i, 0) + 1
    return hist

def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

def main(in_fn, low, high):
    """
    Comment below if testing the distribution of fragments
    """
    parse_records(in_fn, low, high)
    """
    Uncomment to test the distribution of fragments from 
    running fns on same input 100k times
    """
    #wrapped = wrapper(parse_records, in_fn, low, high)
    #timeit.timeit(wrapped, number=100000)
    #print(count_elements(fragment_lengths))

if __name__ == "__main__":
    main(in_fn, low, high)
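To use it (note the input FASTA is now a positional argument rather than stdin; the script name here is just an example):

$ ./13879-edit.py 1000 15000 in.fa > out.fa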

If frag(n) is needed, and you need to preserve the uniform distribution (or whatever fragment-length distribution you want to model), then rejection sampling might be an option.

With rejection sampling, you keep refragmenting the input sequence until one set of sampled lengths happens to cover it exactly, with every fragment inside your bounds.

If you need that, follow up and I'll take another look.
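In the meantime, here's a minimal sketch of that idea (the function name sample_covering_lengths and the max_tries cap are my own, not part of the method above): redraw a whole set of fragment lengths, and accept it only when the lengths sum exactly to the sequence length.

#!/usr/bin/env python

import numpy as np

def sample_covering_lengths(total_len, low, high, max_tries=10000):
    # Rejection sampling: redraw a full set of lengths until one draw
    # tiles total_len exactly, with every length in [low, high].
    for _ in range(max_tries):
        lengths = []
        remaining = total_len
        while remaining > 0:
            n = np.random.randint(low, high + 1)
            if n > remaining:
                break  # overshoot: reject this draw and start over
            lengths.append(n)
            remaining -= n
        if remaining == 0:
            return lengths  # accepted: fragments cover the sequence exactly
    raise RuntimeError('no covering draw found; check that total_len >= low')

# Example: turn accepted lengths into slice boundaries
#   lengths = sample_covering_lengths(len(sequence), 1000, 15000)
#   bounds = np.cumsum([0] + lengths)
#   fragments = [sequence[b:e] for b, e in zip(bounds[:-1], bounds[1:])]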

Correct answer by Alex Reynolds on August 22, 2021

I've previously written my own script to fragment DNA sequences (input as either fastq or fasta files) into identical-length fragments, with a given overlap. I used this because there was a particular assembler that demanded equal-length sequences, and I was working with nanopore reads, but it's since been useful for other purposes as well.

Because the script is most concerned with making lengths identical, some overlap may be forced even when the requested overlap is zero; the overlap will be exactly zero only when the sequence length is an exact multiple of the fragment size. Here's an example use:

zero overlap

normalise_seqlengths.pl -fraglength 25 -overlap 0 oneRead.fq
@A00187:180:HLGNGDSXX:3:1101:8486:1031#0 [150 bp, 6 fragments, 0 overlap] 1:N:0:CGCAAGAA+ACTCCTAA
ATTGCGCAATGACAGCCAGGGGAGT
+
FFFFFFFFFFFFFFFFF:,FFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#1:25-50  1:N:0:CGCAAGAA+ACTCCTAA
GGGAGCTGTCTGGGGTAACCTGGCT
+
:FFFF:F:FFFFF::FFFFF,::FF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#2:50-75  1:N:0:CGCAAGAA+ACTCCTAA
CTAGGACCAGGTGGAAAAAGCACCT
+
F:FFFFFFFFFFFFFFFFFFFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#3:75-100  1:N:0:CGCAAGAA+ACTCCTAA
TGGCTGGCCCAGCCCTGTTCTTCTC
+
:FF,FFFFF,:,FFF::F:FFF,:F
@A00187:180:HLGNGDSXX:3:1101:8486:1031#4:100-125  1:N:0:CGCAAGAA+ACTCCTAA
CTGGGGAGGGAAGGGCAGGTCCCCT
+
F:FFFFFFFFF,FFFFF:FFF,F,,
@A00187:180:HLGNGDSXX:3:1101:8486:1031#5:125-150  1:N:0:CGCAAGAA+ACTCCTAA
AATGCCTGAACCACATGGTGGGCCA
+
FFF,F:FFFFFFF:F,,,FFF,F,,

partial overlap

normalise_seqlengths.pl -fraglength 63 -overlap 0 oneRead.fq
@A00187:180:HLGNGDSXX:3:1101:8486:1031#0 [150 bp, 3 fragments, 20 overlap] 1:N:0:CGCAAGAA+ACTCCTAA
ATTGCGCAATGACAGCCAGGGGAGTGGGAGCTGTCTGGGGTAACCTGGCTCTAGGACCAGGTG
+
FFFFFFFFFFFFFFFFF:,FFFFFF:FFFF:F:FFFFF::FFFFF,::FFF:FFFFFFFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#1:43-106  1:N:0:CGCAAGAA+ACTCCTAA
CCTGGCTCTAGGACCAGGTGGAAAAAGCACCTTGGCTGGCCCAGCCCTGTTCTTCTCCTGGGG
+
FF,::FFF:FFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFF,:,FFF::F:FFF,:FF:FFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#2:87-150  1:N:0:CGCAAGAA+ACTCCTAA
CCCTGTTCTTCTCCTGGGGAGGGAAGGGCAGGTCCCCTAATGCCTGAACCACATGGTGGGCCA
+
FFF::F:FFF,:FF:FFFFFFFFF,FFFFF:FFF,F,,FFF,F:FFFFFFF:F,,,FFF,F,,
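The tool itself is a Perl script, but the windowing behaviour shown above is easy to reconstruct. Here's a rough Python sketch of the same idea (my own reconstruction, not gringer's code; the function name fragment_fixed is mine): start positions are spaced as evenly as possible across the sequence, so overlap beyond the requested amount appears only when the length isn't an exact multiple of the step size.

def fragment_fixed(seq, fraglength, overlap=0):
    # Split seq into fragments of identical length. Assumes overlap < fraglength.
    L = len(seq)
    if L <= fraglength:
        return [seq]
    step = fraglength - overlap
    # number of fragments needed: ceil((L - fraglength) / step) + 1
    n = -(-(L - fraglength) // step) + 1
    span = L - fraglength
    # spread the start positions evenly across [0, span]
    starts = [span * i // (n - 1) for i in range(n)]
    return [seq[s:s + fraglength] for s in starts]

# With a 150 nt sequence and fraglength 63, this reproduces the windows in
# the second example above: 0-63, 43-106, 87-150.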

Answered by gringer on August 22, 2021
