Why doesn't the optimizer just look for stationary points of the loss function?

Question

I want to have a better understanding of the weight-optimization process.
I understand the optimizer(e.g., gradient descent) looks for the direction in which to move the parameters to minimize the loss function. Than it makes a move in that direction, the size of this move is determined by the learning rate.
I have an naive question. Why don't we just look for the stationary points instead of an iterative process?

Sycorax · Accepted Answer

Root-finding can be framed as an optimization problem because we seek to find $x$ such that $f(x)=0$; if we consider that for some polynomial function $f$ we are seeking a stationary point $f^prime(x)=0$, then this is just root-finding for $f^prime$.
Let's restrict consideration to finding the roots of polynomials in one variable. Polynomials are easy, right? The differentiation is nice and simple, and we know how many roots a polynomial has just by looking at its degree. And it's only an optimization in one variable, instead of many variables, so that's also very simple. So we might suppose that this this optimization should be straightforward.
In one dimension, the quadratic equation gives us the roots of a parabola, so we don't need any iterative methods there. There are also (more complex) root-finding formulae for cubic and quartic functions.
However, for quintic or higher-order polynomials, there is no expression using a finite number of algebraic operations (addition, subtraction, multiplication, division and root-extraction) which solves for the roots. This is the Abel-Ruffini theorem. (Also, note that a finite number of steps is even more relaxed than OP's requirement of a non-iterative method.)
So now let's return back to our starting point, which was finding stationary points of some general class of functions. All polynomials are more expansive than polynomials of degree less than 5, and in turn the union of polynomial and non-polynomial functions is more expansive than all polynomials. If we can't even find the roots of degree 5 polynomials using an algebraic expression, neither can we solve the more general problem of root finding for non-polynomial expressions.

Why doesn't the optimizer just look for stationary points of the loss function?

One Answer

Add your own answers!

Ask a Question