
Can FFT convolution be faster than direct convolution for signals of large sizes?

Signal Processing Asked on January 15, 2021

Let’s say I have a 1D signal of size $N$ and am trying to filter it with a 10-tap FIR filter mask. When $N$ is large, the number of operations (one multiply and one add per tap) would approximately equal

$$2 \times 10 \times N = 20N \quad \text{floating point operations}$$

This convolution algorithm is thus $\mathcal{O}(N)$, since the computation scales linearly with the input data size.

However, when would it make sense to use FFT convolution instead? If an FFT is $\mathcal{O}(N \log N)$, wouldn’t this mean FFT convolution does not scale as well as direct convolution? If so, how can FFT convolution be faster than direct convolution?

3 Answers

There is probably a bit of a misconception here. In many applications the signal is running all the time or is VERY long: a modem, an audio stream, video, etc. In this case you can't really define the "length" of the signal. The relevant metric here is "number of operations per input sample", not the "total number of operations". If you watch a streaming movie, it's obvious that a two-hour movie takes twice the operations of a one-hour movie. But that's okay, since you have twice the time to process it. The key question is whether your processor can keep up with the stream, which is why "ops per sample" or "ops per second" are used.

In this case, you don't do an FFT over the entire input signal; instead you apply "overlap add", i.e. you chop the input signal into blocks that are about the same size as the filter length and then process one block at a time.

So the effectiveness of the algorithm is only a function of the filter length. The cost per sample increases linearly with filter length for direct convolution but only logarithmically for overlap add (which is FFT based). The "break even" point is typically at many tens of samples.
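To make the block processing concrete, here is a minimal overlap-add sketch using NumPy's real FFT. The function name and the choice of block size equal to the filter length are illustrative, not a tuned implementation:

import numpy as np

def overlap_add_filter(x, h):
    # Filter x with FIR h via FFT-based overlap-add (illustrative sketch).
    N = len(h)
    B = N                      # block size on the order of the filter length
    M = 2 * N                  # FFT size: block plus filter tail, zero-padded
    H = np.fft.rfft(h, M)      # filter spectrum, computed once
    y = np.zeros(len(x) + N - 1)
    for start in range(0, len(x), B):
        block = x[start:start + B]
        Y = np.fft.rfft(block, M) * H                  # spectral multiply
        yb = np.fft.irfft(Y, M)[:len(block) + N - 1]   # this block's output
        y[start:start + len(yb)] += yb                 # overlap-add the tail
    return y

As a sanity check, np.allclose(overlap_add_filter(x, h), np.convolve(x, h)) should hold for random inputs.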

We can run some numbers, assuming that the filter length is $N$; let's also assume it's a power of 2.

An FFT of length $M$ has $\log_2(M) \cdot M/2$ butterflies, each with one complex multiply and two complex adds; that's a total of 10 real operations per butterfly. A full complex FFT thus costs $10 \cdot \log_2(M) \cdot M/2$; since our signal is real, a half-size complex FFT trick (see the disclaimer below) roughly halves this, so the price for one FFT is roughly $5 \cdot \log_2(M) \cdot M/2$.

Overlap add requires zero padding to $2N$, and that's our FFT size. Hence overlap add costs:

  • Forward FFT: $5 \cdot \log_2(2N) \cdot 2N/2 = 5N \cdot \log_2(2N)$
  • Spectral multiply: $3N$ ($N/2$ complex multiplies at 6 real operations each)
  • Inverse FFT: $5 \cdot \log_2(2N) \cdot 2N/2 = 5N \cdot \log_2(2N)$
  • Overlap adding: $N$
  • Total: $N \cdot (10 \cdot \log_2(N) + 10 + 3 + 1)$

So the overlap add cost per sample is roughly $10 \cdot \log_2(N) + 14$, as compared to $2N$ for direct convolution. The break even point is a filter length of $N = 32$, where both come to 64 operations per sample.
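Plugging a few filter lengths into these formulas makes the break even point visible (a quick check under the same simplifying assumptions):

import math

# ops per output sample, using the rough counts derived above
for N in [8, 16, 32, 64, 128, 256]:
    overlap_add = 10 * math.log2(N) + 14
    direct = 2 * N
    print(f'N = {N:4d}   overlap-add: {overlap_add:6.1f}   direct: {direct:5d}')

At $N = 32$ both land at 64 operations per sample; beyond that, direct convolution falls behind quickly.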

DISCLAIMER: I made a lot of simplifying assumptions, and in a practical implementation the break even point can vary wildly based on processor architecture and the smartness of the algorithm. The main cost is in the FFTs, and for a single-channel real system you would deploy a core complex FFT of size $N/2$, which is the halving already assumed above. I also skipped over bit-reverse ordering and other overheads.

Correct answer by Hilmar on January 15, 2021

I think you need to refine your definitions.

The computational complexity for an $n$-element convolution on $m$ points of data is $\mathcal{O}(n \cdot m)$. Your $\mathcal{O}(n)$ only applies per output sample.

The computational complexity for an $n$-element FFT is, indeed, $\mathcal{O}(n \log n)$. But it coughs up $n$ points.

The savings in using the FFT to do convolution comes from the fact that with your $n$-element filter you can do the FFT in chunks of $2n$ points each. Then you can use overlap-add (do a web search for the term) to reassemble the filter output, which, outside of numerical scrud, will be the same thing you'd have gotten using convolution.

So the difference lies in the average computation for each output sample.

The convolution uses your $\mathcal{O}(n)$ per output sample.

But because the FFT over $2n$ points coughs up $2n$ points, and $n$ of those points are 'new', you only do the FFT $1/n$ as often as you'd do the per-sample convolution. So while the computational complexity for the FFT alone is $\mathcal{O}(n \log n)$, the computational complexity per output sample for the FFT with overlap-add is $\mathcal{O}\left(\frac{n \log n}{n}\right) = \mathcal{O}(\log n)$.
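A few numbers illustrate how quickly that per-output-sample gap opens up (constant factors omitted, so this shows only the scaling):

import math

# per-output-sample complexity, up to constant factors:
# direct convolution ~ n, FFT with overlap-add ~ log2(n)
for n in [16, 64, 256, 1024, 4096]:
    print(f'n = {n:5d}   direct ~ {n:5d}   FFT + overlap-add ~ {math.log2(n):4.0f}')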

Answered by TimWescott on January 15, 2021

FFT convolution is certainly scalable; what you're really asking is whether it's faster when one of the inputs is small (< 1000 samples) or when the input lengths differ greatly. Then FFT convolution can indeed be slower, as it must first pad both inputs to at least the sum of their lengths minus 1.

Note: this answer only addresses 'direct' FFT convolution versus linear convolution; practical implementations use overlap-add and other algorithms to cut compute cost.
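The padding requirement is easy to demonstrate: linear convolution via the FFT needs transforms of length at least len(x1) + len(x2) - 1 (a minimal sketch with NumPy's real FFT):

import numpy as np

x1 = np.random.randn(1000)
x2 = np.random.randn(100)
n = len(x1) + len(x2) - 1          # required padded length
y_fft = np.fft.irfft(np.fft.rfft(x1, n) * np.fft.rfft(x2, n), n)
assert np.allclose(y_fft, np.convolve(x1, x2))

With less padding, the circular convolution wraps around and corrupts the ends of the output.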

Compute times on my CPU (in each cell, top sub-row = FFT, bottom sub-row = direct):

              # len(x1) = 2e1     # len(x1) = 2e2     # len(x1) = 2e3    # len(x1) = 2e4
len(x2)=1e1   82.6 µs ± 2.64 µs   92.2 µs ± 3.59 µs   138 µs  ± 3.5 µs   415  µs ± 15.1 µs
              5.88 µs ± 163  ns   6.96 µs ± 583  ns   13.6 µs ± 429 ns   44.9 µs ± 2.58 µs
                                                                                   
len(x2)=1e2   88.1 µs ± 2.36 µs   90.6 µs ± 3.3 µs    142 µs ± 2.81 µs   419 µs ± 52.2 µs 
              9.33 µs ± 206  ns   18.4 µs ± 521 ns    101 µs ± 3.23 µs   426 µs ± 8.66 µs 
                                                                                   
len(x2)=1e3   109 µs  ± 2.8  µs   116 µs ± 2.29  µs   171 µs ± 2.56 µs   439 µs  ± 25.1 µs
              39.7 µs ± 1.11 µs   69.4 µs ± 2.18 µs   382 µs ± 15   µs   1.68 ms ± 60.8 µs
                                                                                   
len(x2)=1e4   414 µs ± 11.9 µs    373 µs ± 16.4 µs    449 µs ± 26.4 µs   712  µs ± 24.7 µs
              327 µs ± 14.6 µs    580 µs ± 21.8 µs    3.5 ms ± 25.1 µs   23.2 ms ± 374 µs 
                                                                                   
len(x2)=1e5   7.01 ms ± 331 µs    6.77 ms ± 198 µs    6.33 ms ± 261 µs   7.77 ms ± 392 µs 
              3.44 ms ± 306 µs    5.48 ms ± 101 µs    30.9 ms ± 1.76 ms  188 ms  ± 2.49 ms
                                                                                   
len(x2)=1e6   96.9 ms ± 4.33 ms   90 ms   ± 2.51 ms   89.7 ms ± 1.77 ms  95.8 ms ± 3.06 ms
              34.8 ms ± 2.35 ms   56.9 ms ± 2.18 ms   368  ms ± 17.7 ms  1.91 s  ± 34.4 ms
                                                                                   
len(x2)=1e7   1.15 s ± 22   ms    1.12 s ± 28.4 ms    1.13 s ± 27.8  ms  1.17 s ± 54.6 ms 
              339 ms ± 13.4 ms    571 ms ± 29   ms    3.04 s ± 61.5  ms  18.5 s ± 96.5 ms

Times reported as mean ± stdev of at least 7 re-runs. Code used:

import numpy as np
from scipy.signal import convolve

# sizes set to match the table above: rows len(x2) = 1e1 .. 1e7,
# columns len(x1) = 2e1 .. 2e4 (run in IPython for %timeit)
for i in range(1, 8):
    print('_' * 80)
    x2 = np.random.randn(int(10**i))
    for j in range(1, 5):
        x1 = np.random.randn(int(2 * 10**j))
        %timeit convolve(x1, x2, method='fft')
        %timeit convolve(x1, x2, method='direct')
        print()

Admittedly, I was a bit surprised by these numbers. Fortunately, scipy.signal.convolve defaults to method='auto', which estimates the faster method and uses it.
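If you want to see the choice without timing it yourself, scipy.signal.choose_conv_method reports which method convolve would pick for given inputs:

import numpy as np
from scipy.signal import choose_conv_method

x1 = np.random.randn(10000)
print(choose_conv_method(x1, np.random.randn(10)))     # short filter: likely 'direct'
print(choose_conv_method(x1, np.random.randn(10000)))  # long filter: likely 'fft'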

Answered by OverLordGoldDragon on January 15, 2021
