
Saddle-free Newton method for SGD – while Newton attracts saddles, is it worth actively repelling them?

Cross Validated – asked by Jarek Duda on January 3, 2021

While 2nd-order methods have many advantages, they also have a problem: e.g. the natural gradient or quasi-Newton step (as in L-BFGS) is attracted to the closest zero-gradient point, which in high dimension is usually a saddle. Other methods try to pretend that our very non-convex function is locally convex (e.g. Gauss-Newton, Levenberg-Marquardt, the Fisher information matrix in K-FAC, the gradient covariance matrix in TONGA – overview) – again attracting to critical points that are not necessarily local minima (how bad is this?).

There is a belief that the number of saddles is ~exp(dim) times larger than the number of minima. Actively repelling them (instead of being attracted to them) requires controlling the sign of the curvatures (the Hessian eigenvalues) – e.g. negating the sign of the step in the negative-curvature directions.

This is done e.g. in the saddle-free Newton method (SFN) ( https://arxiv.org/pdf/1406.2572 ) – 2014, 600+ citations, with a recent GitHub implementation. They claim to reach a few times(!) lower error this way, e.g. on MNIST, while other methods got stuck on plateaus with strong negative eigenvalues (see the figure in the paper).
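
For intuition, the core of the SFN step is simple: take a Newton step, but with the Hessian eigenvalues replaced by their absolute values, so negative-curvature directions are repelled instead of attracted. Below is a minimal dense-Hessian sketch in NumPy (the paper itself works in a low-rank Krylov subspace to make this tractable for large models; the function name and damping constant are my own illustration):

    import numpy as np

    def saddle_free_newton_step(grad, hessian, damping=1e-3):
        """Saddle-free Newton direction -|H|^{-1} g, where |H| has the
        absolute values of H's eigenvalues (dense sketch, small dims only)."""
        lam, V = np.linalg.eigh(hessian)        # H = V diag(lam) V^T
        lam_abs = np.abs(lam) + damping         # |lambda|, damped away from 0
        return -V @ ((V.T @ grad) / lam_abs)    # -V diag(1/|lambda|) V^T g

    # Tiny illustration on the saddle f(x, y) = x^2 - y^2 at the point (1, 1):
    g = np.array([2.0, -2.0])                   # gradient
    H = np.array([[2.0, 0.0],
                  [0.0, -2.0]])                 # one negative eigenvalue
    print(saddle_free_newton_step(g, H))        # ~[-1, 1]: downhill along both axes

A plain Newton step here would be [-1, -1], i.e. it would walk straight into the saddle at the origin along the y-direction.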

Here is another very interesting paper: https://arxiv.org/pdf/1902.02366 , which investigates the evolution of the Hessian eigenvalues for a model with 3.3M parameters (~20 terabytes of Hessian data!), showing for example that the rare negative-curvature directions allow for relatively huge improvements (see the figures there).

So it looks great – it seems we should all be using SFN or other methods that actively repel saddles … but that hasn't happened. Why is that? What are the weaknesses?

What are other promising 2nd-order approaches that handle saddles?

How can we improve SFN-like methods? For example, what I mostly dislike is directly estimating the Hessian from noisy data, which is numerically very problematic. Instead, we are really interested in the linear behavior of the 1st derivative – we could estimate it with an (online) linear regression of the gradients, with exponentially decaying weights for old gradients (a sketch of this idea follows below). Another issue is restricting attention to a Krylov subspace because of the numerical method used (Lanczos) – it should rather be based on gradient statistics, e.g. their PCA, which again can be computed online to get locally, statistically relevant directions.
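
To make the regression idea concrete, here is a rough 1D sketch of my own (not a published algorithm): along a fixed direction d, regress the directional gradient g_t = d·∇f(θ_t) against the position x_t = d·θ_t with exponentially decaying weights; the regression slope then estimates the curvature d·H·d without ever forming the Hessian.

    import numpy as np

    class OnlineCurvature:
        """Exponentially weighted linear regression of the directional gradient
        g = d . grad f(theta) against the position x = d . theta.
        The regression slope estimates the curvature d^T H d.
        Illustrative sketch only, not the method of any specific paper."""

        def __init__(self, decay=0.9):
            self.beta = decay
            self.w = self.sx = self.sg = self.sxx = self.sxg = 0.0

        def update(self, x, g):
            b = self.beta                      # decay the old statistics
            self.w   = b * self.w   + 1.0
            self.sx  = b * self.sx  + x
            self.sg  = b * self.sg  + g
            self.sxx = b * self.sxx + x * x
            self.sxg = b * self.sxg + x * g

        def curvature(self):
            # Weighted least-squares slope: cov(x, g) / var(x)
            var = self.sxx - self.sx ** 2 / self.w
            cov = self.sxg - self.sx * self.sg / self.w
            return cov / var if var > 1e-12 else 0.0

    # Usage: noisy directional gradients of f(x) = 1.5 * x^2 (true curvature 3.0)
    rng = np.random.default_rng(0)
    est = OnlineCurvature(decay=0.9)
    for _ in range(200):
        x = rng.normal()
        g = 3.0 * x + 0.1 * rng.normal()       # gradient = 3x plus noise
        est.update(x, g)
    print(est.curvature())                     # approximately 3.0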

One Answer

My joint paper:

https://arxiv.org/abs/2006.01512

Here is a GitHub link to the Python code:

https://github.com/hphuongdhsp/Q-Newton-method

It gives a theoretical proof of the heuristic argument in the second paper you linked to in your question. We also provide a simple way to proceed in the case where the Hessian is not invertible.
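
As an illustration of one simple way this can be done (a generic regularization sketch, not necessarily the exact construction from the paper): perturb the Hessian by a small multiple of the identity scaled by the squared gradient norm, so that the perturbation vanishes at critical points, and then apply the eigenvalue sign flip discussed in the question.

    import numpy as np

    def regularized_sign_flipped_newton_step(grad, hessian, delta=1.0, eps=1e-10):
        """Generic illustration (not necessarily the paper's exact construction):
        perturb a possibly singular Hessian by delta * ||grad||^2 * I, then take
        the Newton step with eigenvalues replaced by their absolute values."""
        n = hessian.shape[0]
        # The perturbation vanishes as we approach a critical point (grad -> 0)
        A = hessian + delta * np.dot(grad, grad) * np.eye(n)
        lam, V = np.linalg.eigh(A)
        lam_abs = np.maximum(np.abs(lam), eps)  # guard against exact zeros
        return -V @ ((V.T @ grad) / lam_abs)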

Two issues that I think currently prevent its use at large scale:

Cost of implementation. I have read that there are some methods to reduce the cost, but I need to look into the details.

No guarantee of convergence. Maybe for the loss functions used in popular DNNs we can hope to prove convergence.

On the other hand, a first-order method with very good theoretical justification that also works well at large scale is Backtracking GD. You can check the code here:

https://github.com/hank-nguyen/MBT-optimizer
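
For readers unfamiliar with it: the core of Backtracking GD is a gradient-descent step whose learning rate is chosen at each iteration by Armijo backtracking line search. The repository above contains more elaborate variants; a bare-bones sketch of the basic idea looks like this:

    import numpy as np

    def backtracking_gd(f, grad_f, x0, lr0=1.0, alpha=1e-4, beta=0.5,
                        max_iter=1000, tol=1e-8):
        """Gradient descent with Armijo backtracking line search (bare-bones
        sketch of the idea; the linked repository is more elaborate)."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) < tol:
                break
            lr = lr0
            # Shrink lr until the Armijo sufficient-decrease condition holds:
            #   f(x - lr * g) <= f(x) - alpha * lr * ||g||^2
            while f(x - lr * g) > f(x) - alpha * lr * np.dot(g, g):
                lr *= beta
            x = x - lr * g
        return x

    # Usage on a simple quadratic with minimum at (1, -2):
    f = lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2
    grad_f = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
    print(backtracking_gd(f, grad_f, [5.0, 5.0]))   # approx. [1, -2]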

P.S. I am not considering getting an account on this site, so if you want more discussion it is better done through email.

Answered by user292463 on January 3, 2021
