How do local minima occur in the equation of the loss function?

Data Science Asked on September 25, 2021

In gradient descent, I know that a local minimum occurs where the derivative of the function is zero, but for the loss function the derivative is zero only when the predicted output equals the true output (according to the update equation below).

So, when the predicted output equals the true output, the global minimum is reached. My question is: how can a local minimum occur, if the gradient is zero only for a "perfect" fit?

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (\hat{y}^i - y^i)\, x_j^i$$
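
For reference, here is a rough sketch of that update rule in code for a single-feature linear model (the data, learning rate, and number of iterations are just illustrative choices):

```python
# A rough sketch of the update rule above for a single-feature linear model
# (the data, learning rate, and iteration count are purely illustrative).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=20)             # one feature
y = 1.0 + 0.5 * x + rng.normal(0.0, 0.3, 20)   # noisy line

m = len(x)
theta = np.zeros(2)   # theta[0]: intercept (x_0 = 1), theta[1]: slope
alpha = 0.05          # learning rate

for _ in range(10000):
    y_hat = theta[0] + theta[1] * x
    error = y_hat - y
    # Simultaneous update of both parameters, matching the formula above.
    grad = np.array([error.sum(), (error * x).sum()]) / m
    theta = theta - alpha * grad

print(theta)  # approaches the least-squares intercept and slope
```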

2 Answers

The equation you used for gradient descent isn't general; it's specific to linear regression.
In linear regression there is indeed only a single global minimum and no other local minima, but for more complex models the loss surface is more complicated, and local minima are possible.
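
As a toy illustration (the data and the sin-based model below are arbitrary choices), scanning the squared-error loss over a single weight shows one minimum for a linear model and several zero-gradient valleys for a simple nonlinear one:

```python
# Toy comparison (illustrative data and models): squared-error loss over a
# single weight w, for a linear model and for a simple nonlinear model.
import numpy as np

x = np.array([0.0, 1.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

w_grid = np.linspace(-6.0, 6.0, 2001)

# Linear model y_hat = w * x: the loss is a parabola in w -> one minimum.
linear_loss = np.array([np.sum((y - w * x) ** 2) for w in w_grid])

# Nonlinear model y_hat = sin(w * x): the loss oscillates in w -> several
# points with zero gradient, i.e. several local minima.
nonlinear_loss = np.array([np.sum((y - np.sin(w * x)) ** 2) for w in w_grid])

def count_local_minima(loss):
    """Count interior grid points lower than both of their neighbours."""
    return int(np.sum((loss[1:-1] < loss[:-2]) & (loss[1:-1] < loss[2:])))

print("linear model:   ", count_local_minima(linear_loss))     # 1
print("nonlinear model:", count_local_minima(nonlinear_loss))  # more than 1
```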

Answered by Itamar Mushkin on September 25, 2021

The premise of “no minimum without a perfect fit” is incorrect.

Let's look at a simple example with square loss.

$$L(\hat{y}, y) = \sum_i (y_i - \hat{y}_i)^2$$

$$(x_1, y_1) = (0,1), \qquad (x_2, y_2) = (1,2), \qquad (x_3, y_3) = (3,3)$$

We decide to model this with a line: $\hat{y}_i = \beta_0 + \beta_1 x_i$.

Let's optimize the parameters according to the loss function.

$$L(\hat{y}, y) = (1-(\beta_0 + \beta_1(0)))^2 + (2-(\beta_0 + \beta_1(1)))^2 + (3-(\beta_0 + \beta_1(3)))^2$$

Now we take the partial derivatives of $L$ with respect to $\beta_0$ and $\beta_1$ and do the usual calculus of minimization.
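
Working that through for this example (a quick sketch of the algebra):

$$\frac{\partial L}{\partial \beta_0} = -2\left[(1-\beta_0) + (2-\beta_0-\beta_1) + (3-\beta_0-3\beta_1)\right] = 0 \quad\Rightarrow\quad 3\beta_0 + 4\beta_1 = 6$$

$$\frac{\partial L}{\partial \beta_1} = -2\left[(2-\beta_0-\beta_1) + 3(3-\beta_0-3\beta_1)\right] = 0 \quad\Rightarrow\quad 4\beta_0 + 10\beta_1 = 11$$

Solving these two equations gives $\beta_0 = \frac{8}{7}$ and $\beta_1 = \frac{9}{14}$, and the loss at that point is $L = \left(\frac{1}{7}\right)^2 + \left(\frac{3}{14}\right)^2 + \left(\frac{1}{14}\right)^2 = \frac{1}{14} > 0$: the gradient is zero even though the residuals are not.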

So we minimize the loss function, but we certainly do not have a perfect fit with a line.
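
A quick numerical check, for instance with NumPy (a sketch using the three points above):

```python
# Numerical check (sketch): fit the least-squares line to the three points
# and show that the minimized loss is still greater than zero.
import numpy as np

x = np.array([0.0, 1.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

beta_1, beta_0 = np.polyfit(x, y, deg=1)   # slope, intercept of the best line
y_hat = beta_0 + beta_1 * x
loss = np.sum((y - y_hat) ** 2)

print(beta_0, beta_1)  # approx. 1.1429 (= 8/7) and 0.6429 (= 9/14)
print(loss)            # approx. 0.0714 (= 1/14): minimum loss, but no perfect fit
```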

Answered by Dave on September 25, 2021
