TransWikia.com

What is the fastest way to get the reverse complement of a DNA sequence in python?

Bioinformatics Asked by conchoecia on June 26, 2021

I am writing a python script that requires a reverse complement function to be called on DNA strings of length 1 through around length 30. Line profiling programs indicate that my functions spend a lot of time getting the reverse complements, so I am looking to optimize.

What is the fastest way to get the reverse complement of a sequence in python? I am posting my skeleton program to test different implementations below with DNA string size 17 as an example.

#!/usr/bin/env python
import random
import timeit

global complement
complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

DNAlength = 17

#randomly generate 100k bases
int_to_basemap = {1: 'A', 2: 'C', 3: 'G', 4: 'T'}
num_strings = 500000
random.seed(90210)
DNAstrings = ["".join([int_to_basemap[random.randint(1,4)] for i in range(DNAlength)])
              for j in range(num_strings)]
#get an idea of what the DNAstrings look like
print(DNAstrings[0:5])

def reverse_complement_naive(seq):
    this_complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return "".join(this_complement.get(base, base) for base in reversed(seq))

def reverse_complement(seq):
    return "".join(complement.get(base, base) for base in reversed(seq))


tic=timeit.default_timer()
rcs = [reverse_complement_naive(seq) for seq in DNAstrings]
toc=timeit.default_timer()
baseline = toc - tic
namefunc = {"naive implementation": reverse_complement_naive,
            "global dict implementation": reverse_complement}

for function_name in namefunc:
    func = namefunc[function_name]
    tic=timeit.default_timer()
    rcs = [func(seq) for seq in DNAstrings]
    toc=timeit.default_timer()
    walltime = toc-tic
    print("""{}
       {:.5f}s total,
       {:.1f} strings per second
       {:.1f}% increase over baseline""".format(
           function_name,
           walltime,
           num_strings/walltime,
           100- ((walltime/baseline)*100)   ))

By the way, I get output like this. It varies by the call, of course!

naive implementation
       1.83880s total,
       271916.7 strings per second
       -0.7% increase over baseline
global dict implementation
       1.74645s total,
       286294.3 strings per second
       4.3% increase over baseline

Edit: Great answers, everyone! When I get a chance in a day or two I will add all of these to a test file for the final run. When I asked the question, I had not considered whether I would allow for cython or c extensions when selecting the final answer. What do you all think?

Edit 2: Here are the results of the final simulation with everyone’s implementations. I am going to accept the highest scoring pure python code with no Cython/C. For my own sake I ended up using user172818’s c implementation. If you feel like contributing to this in the future, check out the github page I made for this question.

the runtime of reverse complement implementations.
10000 strings and 250 repetitions
╔══════════════════════════════════════════════════════╗
║ name                       %inc   s total  str per s ║
╠══════════════════════════════════════════════════════╣
║ user172818 seqpy.c        93.7%  0.002344  4266961.4 ║
║ alexreynolds Cython (v2)  93.4%  0.002468  4051583.1 ║
║ alexreynolds Cython (v1)  90.4%  0.003596  2780512.1 ║
║ devonryan string          86.1%  0.005204  1921515.6 ║
║ jackaidley bytes          84.7%  0.005716  1749622.2 ║
║ jackaidley bytesstring    83.0%  0.006352  1574240.6 ║
║ global dict                5.4%  0.035330   283046.7 ║
║ revcomp_translateSO       45.9%  0.020202   494999.4 ║
║ string_replace            37.5%  0.023345   428364.9 ║
║ revcom from SO            28.0%  0.026904   371694.5 ║
║ naive (baseline)           1.5%  0.036804   271711.5 ║
║ lambda from SO           -39.9%  0.052246   191401.3 ║
║ biopython seq then rc    -32.0%  0.049293   202869.7 ║
╚══════════════════════════════════════════════════════╝

8 Answers

I don't know if it's the fastest, but the following provides an approximately 10x speed up over your functions:

import string
tab = string.maketrans("ACTG", "TGAC")

def reverse_complement_table(seq):
    return seq.translate(tab)[::-1]

The thing with hashing is that it adds a good bit of overhead for a replacement set this small.

For what it's worth, I added that to your code as "with a translation table" and here is what I got on my workstation:

global dict implementation
       1.37599s total,
       363374.8 strings per second
       3.3% increase over baseline
naive implementation
       1.44126s total,
       346919.4 strings per second
       -1.3% increase over baseline
with a translation table
       0.16780s total,
       2979755.6 strings per second
       88.2% increase over baseline

If you need python 3 rather than python 2, then substitute tab = str.maketrans("ACTG", "TGAC") for tab = string.maketrans("ACTG", "TGAC"), since maketrans is now a static method on the str type.

For those wondering, using biopython is slower for this (~50% slower than the naive implementation), presumably due to the overhead of converting the strings to Seq objects. If one were already reading sequences in using biopython, though, I wouldn't be surprised if the performance was much different.

Correct answer by Devon Ryan on June 26, 2021

The most reliable and simplest way is probably using Biopython:

from Bio.Seq import Seq

def my_reverse_complement(seq):
    return Seq(seq).reverse_complement()

print(my_reverse_complement('ATGCGTA'))
# TACGCAT

As Devon has already said here using Biopython isn't as fast as the naive Python solution, and I also tested that shown here with ipython. However, this is because Biopython's implementation, although similar to the naive approach, includes other features; it can reverse complement RNA as well as DNA and it will tell you if you're mixing DNA and RNA

Note that if you really want a fast way you could look at Cython or another python extension. As I edit this now, there are several nice answers taking this approach from user172818 and Alex Reynolds.

Answered by Chris_Rands on June 26, 2021

Here's a Cython approach that might suggest a generic approach to speeding up Python work.

If you're manipulating (ASCII) character strings and performance is a design consideration, then C or Perl are probably preferred options to Python.

In any case, this Cython test uses Python 3.6.3:

$ python --version
Python 3.6.3 :: Anaconda custom (64-bit)

To install Cython, e.g.:

$ conda install -c anaconda cython

The Cython code below seems to offer about the same speed bump as the translation table — perhaps similar code is run under the hood of that.

Two files are needed, starting with setup.py:

from distutils.core import setup
from Cython.Build import cythonize

setup(
  name = 'Reverse complement C test',
  ext_modules = cythonize("revcomp_c.pyx"),
)

And then a second file called revcomp_c.pyx:

from libc.stdlib cimport malloc

cdef int seq_len = 17
cdef char *seq_dest = <char *>malloc(seq_len + 1)
seq_dest[seq_len] = ''

def reverse_complement_c_v1(str seq):
    cdef bytes py_bytes = seq.encode('ascii')
    cdef char *seq_src = py_bytes
    cdef int i = 0
    cdef int d = 0
    for i in range(seq_len):
        d = seq_len - i - 1
        if seq_src[i] == 'A':
            seq_dest[d] = 'T'
        elif seq_src[i] == 'G':
            seq_dest[d] = 'C'
        elif seq_src[i] == 'C':
            seq_dest[d] = 'G'
        elif seq_src[i] == 'T':
            seq_dest[d] = 'A'
    return seq_dest[:seq_len].decode('ascii')

This can be compiled into a Python module like so:

$ python setup.py build_ext --inplace

Then we can modify the test bench to include this Cython module and the relevant test method:

#!/usr/bin/env python
import random
import timeit
from revcomp_c import reverse_complement_c_v1

global complement
complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

DNAlength = 17

#randomly generate 100k bases
int_to_basemap = {1: 'A', 2: 'C', 3: 'G', 4: 'T'}
num_strings = 500000
random.seed(90210)
DNAstrings = ["".join([int_to_basemap[random.randint(1,4)] for i in range(DNAlength)])
              for j in range(num_strings)]
#get an idea of what the DNAstrings look like
print(DNAstrings[0:5])

def reverse_complement_naive(seq):
    this_complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return "".join(this_complement.get(base, base) for base in reversed(seq))

def reverse_complement(seq):
    return "".join(complement.get(base, base) for base in reversed(seq))

tic=timeit.default_timer()
rcs = [reverse_complement_naive(seq) for seq in DNAstrings]
toc=timeit.default_timer()
baseline = toc - tic
namefunc = {"naive implementation": reverse_complement_naive,
            "global dict implementation": reverse_complement,
            "Cython implementation (v1)": reverse_complement_c_v1}

for function_name in namefunc:
    func = namefunc[function_name]
    tic=timeit.default_timer()
    rcs = [func(seq) for seq in DNAstrings]
    toc=timeit.default_timer()
    walltime = toc-tic
    print("""{}
       {:.5f}s total,
       {:.1f} strings per second
       {:.1f}% increase over baseline""".format(
           function_name,
           walltime,
           num_strings/walltime,
           100- ((walltime/baseline)*100)   ))

A sample run:

$ python ./revcomp_test.py
['ACTGCAATAGACCACGC', 'CAGCTTGAGCCATTAAC', 'GGCCCAAGAGTTCGAAC', 'CGACTGTCTCGAATTGT', 'ATCCGCTATGCGCCGTC']
naive implementation
       1.73295s total,
       288525.0 strings per second
       -2.5% increase over baseline
global dict implementation
       1.70896s total,
       292575.5 strings per second
       -1.0% increase over baseline
Cython implementation (v1)
       0.21458s total,
       2330158.7 strings per second
       87.3% increase over baseline

One easy way to speed this up is to use a static const unsigned char array as an ASCII lookup table, which maps a residue directly to its complement. This would replace the nest of if statements and probably give a nice little boost (and it appears it does, making it among the best performers so far!).

Here is my fast implementation of a reverse complement function in C:

https://gist.github.com/alexpreynolds/4f75cab4350e9d937f4a

You might be able to use this directly in Python via the subprocess library. Outsourcing the reverse complement step to a utility written in C will almost always beat the best that Python can do, and you can do nice and important things like bounds checking etc. without losing much speed.

It's unclear how "pure" the answer needs to be, but making a system call from Python seems fair if you're processing strings and your goal is performance.

Another direction to take may be to look at multithreading, if you don't need ordered output. If you have many thousands of sequences stored in memory, you could split an array of sequences up into smaller arrays by use of offsets or array indices. Each thread would work on "rc"-ing sequences in its own piece of the array.

Answered by Alex Reynolds on June 26, 2021

Use a bytearray instead of a string and then employ maketrans to translate

You do not need the more advanced string encoding capabilities of string to store a string of bases, but you're still paying for it in performance. Devon Ryan's suggestion of maketrans is the huge improvement, 10x faster than your naive implementation. Using the same approach, but swapping everything out for bytes allows a further 40% speed improvement, however:

# DNA sequence generation also needs to be swapped to bytes
tab_b = bytes.maketrans(b"ACTG", b"TGAC")

def reverse_complement_bytes(seq):
    return seq.translate(tab_b)[::-1]

In my comparison:

Table implementation
       0.24339s total,
       2054315.7 strings per second
       82.5% increase over baseline
Bytes table implementation
       0.14985s total,
       3336755.9 strings per second
       89.2% increase over table

Answered by Jack Aidley on June 26, 2021

Another python extension but without cython. The source code is available at the bottom of this answer or from this gist. On Mac with Python3:

string
   0.38166s total,
   1310067.4 strings per second
   84.4% increase over baseline
seqpy
   0.20524s total,
   2436182.7 strings per second
   91.6% increase over baseline

On Linux with Python2 (seqpy is the first):

seqpy  
   0.10960s total,
   4561842.5 strings per second
   93.3% increase over baseline
string 
   0.17811s total,
   2807330.9 strings per second
   89.1% increase over baseline

This is file setup.py:

from setuptools import setup, Extension

seqpy_module = Extension('seqpy', sources = ["seqpy.c"])

setup(
    name = "seqpy",
    description = "seqpy",
    ext_modules = [seqpy_module])

This is file seqpy.c:

#include <stdlib.h>
#include <stdint.h>
#include <Python.h>

static unsigned char seq_comp_table[256] = {
      0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,
     16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,
     32,  33,  34,  35,  36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,
     48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,
     64, 'T', 'V', 'G', 'H', 'E', 'F', 'C', 'D', 'I', 'J', 'M', 'L', 'K', 'N', 'O',
    'P', 'Q', 'Y', 'S', 'A', 'A', 'B', 'W', 'X', 'R', 'Z',  91,  92,  93,  94,  95,
     64, 't', 'v', 'g', 'h', 'e', 'f', 'c', 'd', 'i', 'j', 'm', 'l', 'k', 'n', 'o',
    'p', 'q', 'y', 's', 'a', 'a', 'b', 'w', 'x', 'r', 'z', 123, 124, 125, 126, 127,
    128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
    144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159,
    160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175,
    176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191,
    192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
    208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223,
    224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
    240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255
};

static PyObject *seqpy_revcomp(PyObject *self, PyObject *args)
{
    PyObject *r;
    char *seq, *rev;
    int i, len;
    PyArg_ParseTuple(args, "s#", &seq, &len);
    rev = (char*)malloc(len);
    for (i = 0; i < len; ++i)
        rev[len - i - 1] = seq_comp_table[(uint8_t)seq[i]];
    r = Py_BuildValue("s#", rev, len);
    free(rev);
    return r;
}

static PyMethodDef seqpy_methods[] = {
    {"revcomp", seqpy_revcomp, METH_VARARGS, "Reverse complement a DNA sequence"},
    {NULL, NULL, 0, NULL}
};

#if PY_MAJOR_VERSION >= 3
static struct PyModuleDef seqpy_module = { PyModuleDef_HEAD_INIT, "seqpy", NULL, -1, seqpy_methods };
PyMODINIT_FUNC PyInit_seqpy(void) { return PyModule_Create(&seqpy_module); }
#else
PyMODINIT_FUNC initseqpy(void) { Py_InitModule3("seqpy", seqpy_methods, NULL); }
#endif

To compile and test:

python setup.py build_ext -i
python -c 'import seqpy; print(seqpy.revcomp("GGGTT"))'

Answered by user172818 on June 26, 2021

Here is a revision of my original Cython answer which incorporates my suggestion to use a char lookup array:

from libc.stdlib cimport malloc

cdef int seq_len = 17
cdef char *seq_dest = <char *>malloc(seq_len + 1)
seq_dest[seq_len] = ''

cdef char *basemap = [ '', '', '', '', '', '', '', '', '', '',   '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',  'T', '',  'G', '', '', '',  'C', '', '', '', '', '', '',  'N', '', '', '', '', '',  'A',  'A', '', '', '', '', '', '', '', '', '', '', '',  't', '',  'g', '', '', '',  'c', '', '', '', '', '', '', '', '', '', '', '', '',  'a',  'a' ]

def reverse_complement_c_v2(str seq):
    cdef bytes py_bytes = seq.encode('UTF-8')
    cdef char *seq_src = py_bytes
    cdef int i = 0
    for i in range(seq_len):
        seq_dest[seq_len - i - 1] = basemap[<int>seq_src[i]]
    return seq_dest[:seq_len].decode('UTF-8')

In testing under OS X, Python 3.6.3:

$ python ./revcomp.py
['ACTGCAATAGACCACGC', 'CAGCTTGAGCCATTAAC', 'GGCCCAAGAGTTCGAAC', 'CGACTGTCTCGAATTGT', 'ATCCGCTATGCGCCGTC']
...
Cython implementation (v1)
       0.20664s total,
       2419626.4 strings per second
       88.3% increase over baseline
Cython implementation (v2)
       0.14170s total,
       3528668.5 strings per second
       92.0% increase over baseline

Using my lookup array approach ("v2") adds a very decent performance bump over using if blocks ("v1"), and you can keep everything as a Python string.

Answered by Alex Reynolds on June 26, 2021

Since at least version 1.71 of biopython you can use Bio.Seq.reverse_complement, which also works on plain strings natively (no conversion to Seq objects). On my mac I get 800k strings converted with that implementation ("biopython just rc") when using the benchmark.

10000 strings and 250 repetitions
                         percent increase over baseline  seconds total strings per second
name                                                                                     
alexreynolds Cython (v2)                          94.0%       0.002491          4015188.7
user172818 seqpy.c                                93.2%       0.002833          3529944.4
alexreynolds Cython (v1)                          90.7%       0.003885          2573863.7
devonryan string                                  86.0%       0.005804          1722983.5
jackaidley bytes                                  85.7%       0.005959          1678085.3
jackaidley bytesstring                            82.1%       0.007459          1340693.2
biopython just rc                                 70.5%       0.012275           814638.6
revcomp_translateSO                               49.9%       0.020822           480254.6
revcom from SO                                    41.7%       0.024253           412320.2
string_replace                                    35.2%       0.026946           371107.3
lambda from SO                                    10.7%       0.037111           269464.0
global dict                                        0.0%       0.041556           240641.7
biopython seq then rc                            -30.9%       0.054430           183721.9
naive (baseline)                                  -0.9%       0.041962           238313.6

Answered by winni2k on June 26, 2021

def revcomplement(s):
    s = s.upper()
    c = { "A":"T","C":"G","T":"A","G":"C"}
    co = str()
    for i in s:
        co = c[i] + co
    return co

Answered by Amardeep Lokhande on June 26, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP