
Gradient Checking: MeanSquareError. Why huge epsilon improves discrepancy?

Asked on Data Science, January 12, 2021

I am using custom C++ code, and coded a simple “Mean Squared Error” layer.
I'm temporarily using it for a classification task rather than a simple regression... maybe this causes the issue?

I don’t have anything else before this layer – not even a simple Dense layer. It’s just MSE on its own.
Its input is a collection of rows of input features. For example, here are 8 rows of input features that will be passed to MSE all at once:

{ a0, a1, a2, a3, a4, a5, a6, a7 }
{ b0, b1, b2, b3, b4, b5, b6, b7 }
{ c0, c1, c2, c3, c4, c5, c6, c7 }
{ d0, d1, d2, d3, d4, d5, d6, d7 }
{ e0, e1, e2, e3, e4, e5, e6, e7 }
{ f0, f1, f2, f3, f4, f5, f6, f7 }
{ g0, g1, g2, g3, g4, g5, g6, g7 }
{ h0, h1, h2, h3, h4, h5, h6, h7 }    //8x8 matrix (contains 64 different values)

Every row of this matrix gets passed into my “Mean Square Error” layer, which returns a single scalar for that row: its “Cost”.

I then compute a “final error” scalar, which is the average of these Costs.

When doing Gradient Checking, I look at how this “final error” quantity changes as I perturb each of the 64 input values seen above. The idea is that the changes in finalError must correspond to the gradient computed analytically with respect to my 64 input values. If they match, then I’ve coded backprop correctly.

Here is the forward prop:

$$finalError = \frac{1}{r}\sum^r \left( \frac{1}{2n}\sum^n (input_i - target)^2 \right)$$

where $n$ is the number of features per row, and $r$ is the number of rows.

Here is the gradient w.r.t. one of the input values, which my backprop is using:

$$\frac{\partial\, finalError}{\partial\, input_i} = \frac{1}{rn}(input_i - target)$$
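
For reference, a minimal C++ sketch of these two formulas might look like the following (the function names and the per-feature target layout are illustrative, not my actual layer code):

#include <cstddef>
#include <vector>

// finalError = (1/r) * sum over rows of ( (1/(2n)) * sum over features of (input - target)^2 )
double mseFinalError(const std::vector<std::vector<double>>& inputs,
                     const std::vector<std::vector<double>>& targets)
{
    const double r = static_cast<double>(inputs.size());
    double finalError = 0.0;
    for (std::size_t row = 0; row < inputs.size(); ++row) {
        const double n = static_cast<double>(inputs[row].size());
        double rowCost = 0.0;
        for (std::size_t i = 0; i < inputs[row].size(); ++i) {
            const double diff = inputs[row][i] - targets[row][i];
            rowCost += diff * diff;
        }
        finalError += rowCost / (2.0 * n);
    }
    return finalError / r;
}

// d finalError / d input_i = (1/(r*n)) * (input_i - target_i)
std::vector<std::vector<double>> mseGradient(const std::vector<std::vector<double>>& inputs,
                                             const std::vector<std::vector<double>>& targets)
{
    const double r = static_cast<double>(inputs.size());
    std::vector<std::vector<double>> grad(inputs.size());
    for (std::size_t row = 0; row < inputs.size(); ++row) {
        const double n = static_cast<double>(inputs[row].size());
        grad[row].resize(inputs[row].size());
        for (std::size_t i = 0; i < inputs[row].size(); ++i)
            grad[row][i] = (inputs[row][i] - targets[row][i]) / (r * n);
    }
    return grad;
}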

Question:

I perturb each input value ‘up’, then ‘down’, running forward prop 64*2 = 128 times. This gives me a numerical estimate of the gradient for each of my 64 input values.

However, this numerical estimate and the analytical gradient become less similar as a smaller epsilon is used, which is counterintuitive to me. On the contrary, the two vectors match almost exactly when I use a ridiculously large value for epsilon, such as $1$.

Is this expected, or do I have an error in my C++ code?


Here is the pseudocode:

for every index i of the 64 input values:
   original = input[i]
   input[i] = original - EPSILON
   finalCost_down = fwdprop( inputMatrix )   //very simple - just computes the final cost via the MSE layer. finalCost_down is a scalar.
   input[i] = original + EPSILON
   finalCost_up   = fwdprop( inputMatrix )
   input[i] = original                       //restore before moving to the next value
   gradientEstimate[i] = (finalCost_up - finalCost_down) / (2*EPSILON)

//after the loop, some time later, just one invocation of backprop:
trueGradientVec = backprop( inputMatrix )

//some time later:

discrepancyScalar =  (gradientEstimate - trueGradientVec).magnitude / ( gradientEstimate.magnitude + trueGradientVec.magnitude )

//somehow discrepancyScalar decreases as larger EPSILON values are used:
// discrepancy is 0.00275, if EPSILON is 0.0001
// discrepancy is 0.00025, if EPSILON is 0.001
// discrepancy is 2.198e-05, if EPSILON is 0.01
// discrepancy is 3.149e-06, if EPSILON is 0.1
// discrepancy is 2.751e-07, if EPSILON is 1

I would expect the discrepancy to decrease when epsilon is decreased, because finer perturbations should give a more precise slope estimate…
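
For completeness, here is a self-contained C++ sketch of the same checking procedure and discrepancy formula. The cost function here is a stand-in (one MSE row against zero targets), so it is not the exact code that produced the numbers above:

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <initializer_list>
#include <vector>

// Central-difference gradient check against an analytical gradient.
// costFn maps the flattened input vector to the scalar finalError;
// analyticGrad is the gradient of costFn at 'values', as computed by backprop.
double gradientCheck(std::vector<double> values,
                     const std::function<double(const std::vector<double>&)>& costFn,
                     const std::vector<double>& analyticGrad,
                     double epsilon)
{
    std::vector<double> estimate(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        const double original = values[i];
        values[i] = original - epsilon;
        const double costDown = costFn(values);
        values[i] = original + epsilon;
        const double costUp = costFn(values);
        values[i] = original;                               // restore before the next value
        estimate[i] = (costUp - costDown) / (2.0 * epsilon);
    }

    // discrepancy = |estimate - analytic| / (|estimate| + |analytic|)
    double diffSq = 0.0, estSq = 0.0, anaSq = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        diffSq += (estimate[i] - analyticGrad[i]) * (estimate[i] - analyticGrad[i]);
        estSq  += estimate[i] * estimate[i];
        anaSq  += analyticGrad[i] * analyticGrad[i];
    }
    return std::sqrt(diffSq) / (std::sqrt(estSq) + std::sqrt(anaSq));
}

int main()
{
    // Stand-in cost: MSE of a single 8-value row against zero targets (r = 1, n = 8).
    auto cost = [](const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x * x;
        return s / (2.0 * v.size());
    };

    std::vector<double> input = { 0.3, -1.2, 0.7, 2.0, -0.5, 0.1, 1.4, -0.9 };
    std::vector<double> grad(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
        grad[i] = input[i] / input.size();                  // analytical gradient of the stand-in cost

    for (double eps : { 1e-4, 1e-3, 1e-2, 1e-1, 1.0 })
        std::printf("epsilon %g -> discrepancy %g\n", eps, gradientCheck(input, cost, grad, eps));
    return 0;
}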

Andrew Ng's explanation of Gradient Checking

One Answer

This was caused by the limited numerical precision of floating-point numbers. It becomes really apparent as we change one of the input values by a tiny $\epsilon$, which (in my example above) immediately affects the cost function.

The tricky thing was: because the cost function is Mean Square Error (MSE), and there are no other layers in my network, we can indeed use any epsilon to estimate the slope. Even a ridiculously large epsilon will work, and it is numerically more stable, which explains why the discrepancy seemed to get better. That's just how $y=x^2$ works.
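
To make the "$y=x^2$" point explicit: for a single squared term $f(x)=\frac{1}{2}(x-t)^2$, the central difference is exact no matter how large $\epsilon$ is,

$$\frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon} = \frac{\tfrac{1}{2}(x-t+\epsilon)^2-\tfrac{1}{2}(x-t-\epsilon)^2}{2\epsilon} = \frac{2\epsilon(x-t)}{2\epsilon} = x-t = f'(x)$$

so the truncation error is zero, and the only error left is floating-point round-off in the subtraction of the two costs, which gets relatively smaller as $\epsilon$ grows.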

In practice, when doing gradient-checking, I switched to another cost function, which is linear (instead of MSE). This improved the numerical stability, allowing me to use an epsilon that is 10 times smaller:

$$C=\sum^r\sum^n(obtained-expected)$$

the gradient is then simply:

$$\frac{\partial C}{\partial (obtained)_{rn}} = 1$$

I don't use this cost in production, just for the purpose of gradient-checking.
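
A minimal C++ sketch of such a linear check-cost and its gradient of ones might look like this (names are illustrative):

#include <cstddef>
#include <vector>

// Linear check-cost: C = sum over rows and features of (obtained - expected).
// Used only for gradient checking - the gradient w.r.t. every obtained value is exactly 1.
double linearCheckCost(const std::vector<std::vector<double>>& obtained,
                       const std::vector<std::vector<double>>& expected)
{
    double cost = 0.0;
    for (std::size_t row = 0; row < obtained.size(); ++row)
        for (std::size_t i = 0; i < obtained[row].size(); ++i)
            cost += obtained[row][i] - expected[row][i];
    return cost;
}

// Matching analytical gradient: a matrix of ones.
std::vector<std::vector<double>> linearCheckGradient(const std::vector<std::vector<double>>& obtained)
{
    std::vector<std::vector<double>> grad(obtained.size());
    for (std::size_t row = 0; row < obtained.size(); ++row)
        grad[row].assign(obtained[row].size(), 1.0);
    return grad;
}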

So yeah, this explains why a large epsilon was improving the discrepancy.


Edit:

Notice that this linear check-cost must not be used if you have a softmax right before it. That's because the outputs of a softmax always sum to 1.0.

In that case you must use the MSE.
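
Concretely, if the obtained values are softmax outputs, each row sums to 1 regardless of what comes before the softmax, so the linear check-cost collapses to a constant:

$$C=\sum^r\sum^n(softmax_n-expected_n)=\sum^r\Big(1-\sum^n expected_n\Big)$$

Perturbing anything upstream of the softmax then leaves $C$ unchanged, the numerical gradient is identically zero, and the check tells you nothing.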

Correct answer by Kari on January 12, 2021
