TransWikia.com

Parallel nonlinear solvers

Operations Research Asked by Josh Allen on August 19, 2021

I’ve noticed that parallel (CPU or GPU) nonlinear programming solvers are few and far between. It seems that if any parallelization is involved at all, it generally applies to solving the underlying linear system or using multi-start methods.

Is there a particular reason for this – i.e. inherent difficulties in finding a good scalable parallel framework for the interior point and SQP algorithms?

3 Answers

As a developer of parallel non-linear software, I want to share my experience working in this space and the challenges we face. If I were to break down why we don't have more parallel non-linear software I would pick the following reasons:

  1. Expectations vs reality
  2. Inherent difficulties in designing and implementing the algorithms.
  3. Lack of good tools to implement, test, and deploy distributed software.

1. Expectations vs reality

Unfortunately, NLP or MINLP are very broad terms so people tend to expect a piece of software that solves e.g., NLP to solve every type of NLP, which simply cannot happen. That NLP may be sparse, dense, mostly linear, convex, pseudoconvex, large, small, parametric, etc. In every case, there are different optimal ways of exploiting parallel hardware, but building a piece of software that implements all of these ways is incredibly difficult, which is why we have several specialised packages.

In other words, parallelism in non-linear optimisation is tricky from a design point of view because how we choose to allocate parallel resources depends on the real-life problems our solver is designed to solve.

There are three main applications of parallelism for MINLPs:

  1. Calculating derivatives in parallel (important in very dense problems)
  2. Doing the factorisation step in parallel (important in very large problems)
  3. Solving the problem many times in parallel, e.g., parallel branch-and-bound or multistart (important in MINLP problems)

A solver can only really do one of these in parallel, and that design choice determines what problems that solver can solve well.

For instance, with some tweaking IPOPT can run MUMPS in MPI mode, so if factorisation is the bottleneck, that's a solid choice. This makes sense for IPOPT as it is designed to do a single local optimisation as quickly as possible.

On the other hand, our own solver, Octeract Engine, is designed to prove global optimality and to locate feasible solutions for difficult MINLP problems. Therefore, for us it made sense to use parallel branch-and-bound because we need to solve millions of local optimisation problems, so when we run in MPI mode that's how we allocate work to the CPUs.

Other examples with various degrees of parallelism are SCIP, KNITRO, and NLPAROPT.

Now, if one was to use IPOPT with MPI-MUMPS for a large-scale local optimisation problem, that would be much faster than a parallel branch-and-bound solver, if IPOPT manages to solve the problem. If it doesn't, it often turns out that no amount of manual tweaking can yield a feasible solution in which case parallel branch-and-bound would have saved us time all along. Similarly, certain problems are just too big to feed into a solver that doesn't factorise a matrix in parallel.

These differences in expectations vs what the scientific reality is, mean that solver companies usually have to engage a lot in consultancy work (vs just building more and better stuff), and that academic projects rarely come to a state where they are usable by the community (IPOPT is a bright exception).

2. Algorithmic/Implementation difficulties

From an implementation point of view, it is hellishly difficult to come up with and implement a parallel framework correctly. As an example, it took us about a year to implement the parallel framework for Octeract Engine (and this is a team of people who had done this before). For the problems that our engine solves, we implemented so-called "true parallelism", in the sense that investing more CPUs results in linear speedups.

However, it took me 5 years to create and test the algorithm that achieves this (my PhD and post-doc). The reason this takes so long is that a high performance algorithm is never something we can implement by reading/creating a flowchart. In other words, it never "just works". It takes hundreds of tricks (the term is literally "advanced black magic") to make these kinds of algorithms work well in practice. Importantly, it takes much more effort than LP algorithms, because the structure we can expect in NLP can literally be anything.

On top of that, parallel computing applications tend to target large problems. The size factor comes with a plethora of issues around data structures, distributing memory, and actually thinking about the complexity of every tiny part of the solver to avoid bottlenecks, which are non-issues for smaller problems. All of this blends into one big interactive system that is very hard to design using good software practices without compromising performance.

3. Lack of tools for development, deployment and testing

Algorithmic difficulties aside, a big bottleneck is that our best choice for high throughput applications is MPI, which is simply just not a good enough tool anymore because it requires too much boilerplate and modern software is just too complex for that.

This is compounded by the fact that even when we do make it work on our rigs it comes with a host of portability, deployment, and maintenance issues.

For instance, after solver development was done, it took us around 4 months of full-time testing to ensure that it would run on all OS/hardware, and I swear that we ran into every single obscure issue in the history of Linux in the process (can't deny that it was a lot of fun though). We saw compiler bugs, kernel bugs, dependencies that would only build if we tried twice, linking problems that only happened some of the time even on the same machine, but my favorite is an issue where MPI would fail to work 1 in 10 times, and only if we used more than 60 cores (59 or fewer was fine).

Even then, we still have a few tickets of obscure issues that only occur when people run massive problems on clusters, and it can take a lot of time to reproduce and fix each one because we have to build our own tools most of the time (there's nothing we can buy out of the box).

From our point of view, we would love to have better tools because it would allow us to spend less manpower on maintenance and testing, and more on R&D and implementing new parallel algorithms.

Correct answer by Nikos Kazazakis on August 19, 2021

I think that there exist multiple solvers based on ADMM. If the variables can be partitioned in two sets in a way that the problem decomposes for one set fixed, then every other iteration can be executed in parallel.

I had a specific solver implementation in mind, but could not find it now. Instead I found POGS which seems to be more recent and targeting the GPU. The documentation has its own summary of ADMM.

EDIT: The paper Block Splitting for Distributed Optimization might have been my original exposure to this idea.

Answered by Robert Schwarz on August 19, 2021

I've noticed that parallel (CPU or GPU) nonlinear programming solvers are few and far between.

The General Nonlinear Problem

The $n times n$ nonlinear problem is: $$begin{array} mathcal{f}_1(x_1, x_2, cdots , x_n)=0 \ mathcal{f}_2(x_1, x_2, cdots , x_n)=0 \ vdots tag 1 \ mathcal{f}_n(x_1, x_2, cdots , x_n)=0 end{array}$$

which can be summarized as

$$f (x) = 0 tag 2$$

where we use bold face to denote vectors:

$$ mathbf{f} = left [ begin{array} mathcal{f}_1 \ mathcal{f}_2 \ vdots \ mathcal{f}_n end{array} right ] !!,,, mathbf{x} = left [ begin{array} mathcal{x}_1 \ mathcal{x}_2 \ vdots \ mathcal{x}_n end{array} right ] !!,,, mathbf{0} = left [ begin{array} 00 \ 0 \ 0 \ 0 end{array} right ] $$

Two points:

  • Nonlinear equations are generally solved numerically by the iterative solution of linear equations
  • John Rice states (approximately) that the solution of nonlinear equations is the most difficult problem in scientific computation

Parallel programming is difficult to implement, but there are more than a few solvers.

Mathworks solves non-linear equations in parallel.

A number of solvers incorporate Sundials and use the KINSOL solver for nonlinear algebraic systems. It can use MPI, CUDA, or operate serially. It is written and maintained by the team at LLNL.

JuliaDiffEq allows non-linear methods, and has an interface to Sundials.

PETSc supports threads, MPI, and GPUs through CUDA or OpenCL, as well as hybrid MPI-GPU parallelism. It has over a dozen non-linear methods. It is written and maintained by the team at ANL.

Trilinos supports CUDA, OpenMP, Pthread and serial modes to solve non-linear problems. Visit their GitHub for sources to the NOX, LOCA and GlobiPack non-linear solver packages.

See also:

CS.SE - "Solving large nonlinear systems in parallel" (parallel)

Math.SE - "Using jacobian to solve a nonlinear system of equations?" (not parallel)

"Parallelization Research of SapTis-Software of Multi-Field Simulation and Nonlinear Analysis of Complex Structures" (the difficulty of parallelization)

Answered by Rob on August 19, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP