
Are there any embarrassingly parallel tasks that require a CPU rather than GPU?

Computational Science · Asked on December 3, 2020

I am looking for tasks that

  • are unsuitable for GPUs
  • gain significant speedup as more CPU nodes are added
  • don’t require large data transfer or inter-thread communication between nodes.

Do any problems of this type still exist?

9 Answers

GPUs work with the SIMD model (single instruction, multiple data), i.e. they execute the same instruction over many data elements at once. To give an idea: under CUDA, when you have an if-then-else condition, the two branches are executed in sequence, each over its respective subset of the data.

In your question, the conditions that favor a CPU suggest a MISD or MIMD model, i.e. different instructions over the same data or over different data.

So to find a class of examples we should look at intrinsically sequential tasks. For example, the solution of an ODE is intrinsically sequential; keeping the same initial condition gives us the "single data" part, and running it under different conditions, or with different methods for a comparison study, gives us the "multiple instructions" part. With this simple example the communication between the nodes is minimal because the runs are independent.
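
A minimal sketch of that setup (assuming NumPy and SciPy are available): the same initial-value problem is integrated with several different methods, each in its own CPU process, and the only communication is returning the final state.

```python
# Hedged sketch: same ODE, same initial condition, different methods,
# one independent CPU process per method (embarrassingly parallel).
from multiprocessing import Pool

from scipy.integrate import solve_ivp

def rhs(t, y):
    # Van der Pol oscillator, a common (moderately stiff) test problem.
    return [y[1], 5.0 * (1.0 - y[0] ** 2) * y[1] - y[0]]

def run_method(method):
    sol = solve_ivp(rhs, (0.0, 20.0), [2.0, 0.0], method=method, rtol=1e-8)
    return method, sol.y[:, -1], sol.nfev  # final state and cost

if __name__ == "__main__":
    methods = ["RK45", "Radau", "BDF", "LSODA"]
    with Pool(len(methods)) as pool:
        for method, y_final, nfev in pool.map(run_method, methods):
            print(f"{method:6s} y(T) = {y_final}  rhs evals = {nfev}")
```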

Answered by Mauro Vanzetto on December 3, 2020

A simple example from electromagnetics (EM) would be performing a parallel frequency sweep for a frequency-domain simulation, say, full-wave extraction of network parameters (S, Y, Z, etc.) for a device. Since the simulation for each frequency point is completely independent of the others, the sweep can be embarrassingly parallelized across different cores, including trivial distributed-memory parallelization.

Such simulations involve minimal data transfer (if any), beyond workload distribution and sharing the final results. However, in a straightforward implementation (there is, of course, a huge field of work on porting certain EM simulations to GPUs), each individual simulation is not very suitable for GPUs, as it involves a lot of branching, complicated data-access patterns, storage of large amounts of auxiliary data, etc. Thus, it is not desirable to run the embarrassingly parallel frequency sweep on GPUs.
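
The CPU-side pattern itself is trivial to sketch. Here `extract_s_parameters` is a hypothetical stand-in for one full-wave solve at a single frequency point; the sweep simply maps it over the frequency list, one CPU process per point.

```python
# Hedged sketch of an embarrassingly parallel frequency sweep.
# extract_s_parameters() is a placeholder for a real full-wave solver call.
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def extract_s_parameters(freq_hz):
    # A real solver would assemble and solve a large frequency-dependent
    # system here; this just fakes a 2x2 S-matrix for illustration.
    s = np.exp(-1j * 2 * np.pi * freq_hz * 1e-12) * np.eye(2)
    return freq_hz, s

if __name__ == "__main__":
    freqs = np.linspace(1e9, 10e9, 91)      # 1-10 GHz sweep
    with ProcessPoolExecutor() as pool:
        results = dict(pool.map(extract_s_parameters, freqs))
    # Only the final S-matrices come back; no communication during the solves.
    print(results[freqs[0]])
```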

NB: do not read this as saying that GPUs are unsuitable for any EM simulation. They are preferable for certain numerical methods (say, finite-difference time-domain) and different parallelization patterns. However, frequency-sweep parallelization is also an important mode that has to be implemented.

Answered by Anton Menshov on December 3, 2020

A problem where each unit of work requires access to more registers than are available on a single GPU core, or requires access to more data than can fit into the cache or shared memory would not be able to fully utilize all of the cores on a GPU. A CPU's greater number of registers and larger cache size could let the CPU outperform the GPU on these types of problems.

Finding the exact probabilities in Texas Hold'em poker somewhat fits into this category because each thread has to either compute a lot of rules when comparing hands, or use a large lookup table that can't fit into the small cache of a GPU. I only say "somewhat" because a GPU still outperforms a CPU in this case, but not as much as one would expect.
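
As a toy illustration of the pattern (vastly simpler than a real Hold'em evaluator or its lookup tables), the exhaustive count below classifies every 5-card hand, partitioned across CPU processes by the first card; each worker runs its own per-hand logic and only returns a single count.

```python
# Toy sketch: exhaustively count 5-card hands with at least two cards of
# the same rank, partitioned across CPU processes by the first card.
# A real Hold'em evaluator would involve far more branching or a large
# lookup table; only the parallel structure is the point here.
from itertools import combinations
from multiprocessing import Pool

DECK = [(rank, suit) for rank in range(13) for suit in range(4)]  # 52 cards

def count_paired_starting_with(first_index):
    first = DECK[first_index]
    count = 0
    for rest in combinations(DECK[first_index + 1:], 4):
        ranks = [first[0]] + [card[0] for card in rest]
        if len(set(ranks)) < 5:          # some rank appears at least twice
            count += 1
    return count

if __name__ == "__main__":
    with Pool() as pool:
        total = sum(pool.map(count_paired_starting_with, range(len(DECK))))
    print(total, "of 2598960 hands have at least two cards of equal rank")
```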

Answered by Thomas on December 3, 2020

High-quality video encoding is something like this.

The search space is so huge that it requires branching to prune it rapidly, and GPUs are terrible at that. Modern CPU short-vector SIMD works well here, operating on contiguous chunks of 16 to 64 bytes of data while staying tightly coupled to a CPU core that can branch efficiently on the SIMD results without any significant transfer overhead.

Modern encoders like x265 can scale to at least a hundred CPU cores, for high enough resolution. (Or you can chop up a long video into multiple segments to make it truly embarrassingly parallel.)
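
A hedged sketch of the segment approach, assuming ffmpeg with libx265 is installed; the file name, segment length, and encoder settings are illustrative only, and a real script would also probe the source duration and concatenate the encoded pieces afterwards.

```python
# Hedged sketch: encode fixed-length segments of one input file in parallel
# CPU processes with ffmpeg/libx265. Paths and parameters are illustrative.
import subprocess
from concurrent.futures import ProcessPoolExecutor

SOURCE = "input.mkv"          # assumed input file
SEGMENT_SECONDS = 60
NUM_SEGMENTS = 8              # assumed; a real script would probe the duration

def encode_segment(i):
    out = f"segment_{i:03d}.mkv"
    cmd = [
        "ffmpeg", "-y",
        "-ss", str(i * SEGMENT_SECONDS), "-t", str(SEGMENT_SECONDS),
        "-i", SOURCE,
        "-an",                                   # video only, for simplicity
        "-c:v", "libx265", "-preset", "slow", "-crf", "22",
        out,
    ]
    subprocess.run(cmd, check=True)
    return out

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        segments = list(pool.map(encode_segment, range(NUM_SEGMENTS)))
    print("encoded:", segments)
```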

Modern GPUs have fixed-function video-encode hardware separate from the main GPU execution units, but the max quality they can achieve is I think limited. (Unless they can operate in a mode where the CPU makes decisions but offloads the heavy data-parallel work to the GPU, like motion search.) I haven't kept up with recent developments in hardware encoding, but AFAIK it's still not possible to get the same quality as x265 -preset veryslow on a CPU.

See also https://video.stackexchange.com/questions/14656/why-processor-is-better-for-encoding-than-gpu/14657#14657

Answered by Peter Cordes on December 3, 2020

When it comes to playing chess and other complex turn-based games using the minimax algorithm, GPU acceleration is either not viable or only viable for a couple of minor sub-problems.

Chess engines need to evaluate a very large number of moves to find out which one results in a position which is best for the AI. How does the AI know that one position is better than another? By using a rating function which applies all the common knowledge about what's good and what's bad in chess and turns that into a number. All those position evaluations can be parallelized. Simply counting material advantage (having a queen is better than not having one) is a start for a simple chess engine, but stronger chess engines also take strategic considerations into account, like if pieces are threatened or pinned, control of the board, pawn structure, piece development and so on. So these rating functions can become very complex. This usually makes them unsuitable for running on a GPU.
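
The tree search does parallelize naturally across CPU cores at the root, though: each legal first move can be handed to its own process, which then runs a deep, branch-heavy recursive search. Here is a toy sketch using the game of Nim as a stand-in for chess (whose rating function would be far more elaborate).

```python
# Toy sketch: parallel root-move minimax for Nim (normal play: whoever takes
# the last object wins). Each CPU process searches one root move; the
# recursive, branchy search is exactly what GPUs handle poorly.
from functools import lru_cache
from multiprocessing import Pool

HEAPS = (3, 5, 7)   # starting position

def moves(heaps):
    # All legal moves: take 1..n objects from one heap.
    for i, n in enumerate(heaps):
        for take in range(1, n + 1):
            yield tuple(h - take if j == i else h for j, h in enumerate(heaps))

@lru_cache(maxsize=None)
def minimax(heaps):
    # +1 if the player to move wins with perfect play, -1 otherwise.
    if sum(heaps) == 0:
        return -1        # previous player took the last object and won
    return max(-minimax(tuple(sorted(child))) for child in moves(heaps))

def score_root_move(child):
    # A win for us is a loss for the opponent, who moves next in `child`.
    return -minimax(tuple(sorted(child))), child

if __name__ == "__main__":
    with Pool() as pool:
        scored = pool.map(score_root_move, list(moves(HEAPS)))
    best_score, best_child = max(scored)
    print("best first move leads to", best_child, "with score", best_score)
```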

Answered by Philipp on December 3, 2020

Are you asking in context of crypto currencies or proof-of-work?

In that case there are examples of algorithms that are designed specifically to exploit features of CPUs, mainly branching and fast access to memory (the L1 and L2 caches). One example is scrypt, which claims to resist GPU and ASIC implementations.

In general, GPUs are designed to work in parallel, applying a single instruction to a large amount of data. So almost every algorithm that is not specifically designed not to work on a GPU could be rewritten in a way that would run faster on a GPU.
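
For reference, scrypt itself is exposed in Python's standard library (backed by OpenSSL); a minimal usage example, with cost parameters that are illustrative rather than a recommendation:

```python
# Minimal scrypt example via the standard library (requires OpenSSL support).
# Cost parameters are illustrative only; tune them for your hardware.
import hashlib
import os

password = b"correct horse battery staple"
salt = os.urandom(16)

key = hashlib.scrypt(password, salt=salt, n=2**14, r=8, p=1, dklen=32)
print(key.hex())
```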

Answered by Marcin Raczkowski on December 3, 2020

How about compiling a large program?

Compilation is unsuitable for GPUs.

Each file can be compiled separately.

The only data transfer is to transfer the source code to each node (including header files), and transfer the object file back.

There is, however, a sequential phase at the end, where the object files are combined (linked) into an executable file.
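
A hedged sketch of that pattern (file names and compiler flags are illustrative; real builds use make -j, ninja, distcc and friends): every translation unit compiles in its own CPU process, followed by a single sequential link.

```python
# Hedged sketch of embarrassingly parallel compilation plus a sequential
# link step. File names and flags are illustrative only.
import glob
import subprocess
from concurrent.futures import ProcessPoolExecutor

def compile_one(source):
    obj = source.rsplit(".", 1)[0] + ".o"
    subprocess.run(["cc", "-O2", "-c", source, "-o", obj], check=True)
    return obj

if __name__ == "__main__":
    sources = glob.glob("src/*.c")
    with ProcessPoolExecutor() as pool:
        objects = list(pool.map(compile_one, sources))          # parallel phase
    subprocess.run(["cc", *objects, "-o", "app"], check=True)   # sequential link
```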

Answered by user253751 on December 3, 2020

Sequence similarity search in bioinformatics.

While naive search can be easily parallelized, a nontrivial algorithm most often involves lots of branching. The numerous GPU cores can do the branching, but they are not good at it; they strongly prefer to compute "all as one".

Because of that, even though similarity search was tried on GPUs early on and some results are promising, the performance improvement of a similarly priced GPU over a CPU is often nowhere near a factor of hundreds: "running times, as a standalone tool, are comparable to the running times of BLAST", as said in this publication.
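
A naive, hedged sketch of the embarrassingly parallel structure: each database sequence is scored against the query in its own CPU task, with plain edit distance standing in for real alignment scoring (Smith-Waterman, BLAST heuristics).

```python
# Naive sketch: score a query against each database sequence independently,
# one CPU task per sequence. Edit distance is only a stand-in for real
# alignment scoring; the parallel structure is the point.
from multiprocessing import Pool

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution/match
        prev = curr
    return prev[-1]

QUERY = "GATTACA"
DATABASE = ["GATTACA", "GATTTACA", "CATTAGA", "TTTTTTT", "GATACA"]

def score(seq):
    return edit_distance(QUERY, seq), seq

if __name__ == "__main__":
    with Pool() as pool:
        for dist, seq in sorted(pool.map(score, DATABASE)):
            print(dist, seq)
```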

Answered by Audrius Meskauskas on December 3, 2020

If you're interested in parallelization in general, there are two considerations which affect the (un)suitability of an algorithm for parallel computing:

(a) Dependencies between steps, which force sequential execution in a single run

(b) High memory requirements, which prevent running several instances of the algorithm in parallel

Perhaps a prime example would be cryptographic functions such as KDFs (key derivation functions), which are specifically designed to benefit as little as possible from the parallelism that GPUs and special cracking hardware offer. For instance, consider the following algorithm:

  1. Generate a large pseudo-random array $H$ with the password as a seed. This occupies a large portion of RAM, covering case (b).
  2. Starting with a given index $I_k$, compute $I_{k+1}$ based on $H[I_k]$ and the password. This introduces a data dependency between every two steps, covering case (a).
  3. Repeat step 2 a large number of times: $k = 1 \dots N$.
  4. Use $H[I_N]$ to construct an encryption key or a password hash.

The idea is that an attacker trying to guess a password from a known key/hash will have to spend a significant time checking one password after the other no matter how many GPU cores they have at their disposal, while a legitimate user will compute a key/hash relatively quickly using a single core because they have to do it only once.
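
A toy, hedged realization of those four steps (not a real KDF: the array size, iteration count, and use of SHA-256 are illustrative only):

```python
# Toy realization of the memory-hard scheme sketched above. Do not use as a
# real KDF; parameters and primitives are chosen for illustration.
import hashlib

def toy_memory_hard_kdf(password: bytes, salt: bytes,
                        array_size: int = 1 << 20,
                        iterations: int = 1 << 18) -> bytes:
    # Step 1: fill a large pseudo-random array H seeded by the password.
    h = bytearray()
    block = hashlib.sha256(password + salt).digest()
    while len(h) < array_size:
        block = hashlib.sha256(block).digest()
        h.extend(block)
    H = bytes(h[:array_size])

    # Steps 2-3: data-dependent chain of lookups; each step needs the last.
    index = 0
    state = hashlib.sha256(password + salt).digest()
    for _ in range(iterations):
        state = hashlib.sha256(state + H[index:index + 32] + password).digest()
        index = int.from_bytes(state[:8], "big") % array_size

    # Step 4: derive the key/hash from where the chain ended up.
    return hashlib.sha256(state + H[index:index + 32]).digest()

key = toy_memory_hard_kdf(b"hunter2", b"fixed-salt-for-demo")
print(key.hex())
```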

If we're talking about GPUs specifically, they are optimized for rendering tasks, with a feature set that is substantially different from that of general-purpose CPUs. Any algorithm relying on a different set of features, e.g. integer arithmetic and overflow, will be hard to implement with a GPU instruction set optimized for floating-point arithmetic with saturation. The same goes for systems programming, which requires the processor to support interrupts (with nesting and priorities) and virtual memory (with paging, swapping, and copy-on-write semantics).

Also check out this question on Computer Science SE: "What are GPUs bad at?"

Answered by Dmitry Grigoryev on December 3, 2020
