# Do batch GD and stochastic GD give the same results?

Data Science Asked by AI_new2 on October 21, 2020

If a neural network is trained on a dataset of M samples for N epochs, do batch GD and SGD give the same result? Is SGD faster because it utilizes the hardware better?

I am asking because I figured that both (batch GD & SGD) give mathematically the same result, but I have read that SGD avoids local minima. How can this be true if SGD & batch GD give the same result?

They are not the same: batch gradient descent averages the gradients of samples 1 through M. This is seen in the first equation, via the summation over M elements divided by M total elements (hence the average):

$$\theta \leftarrow \theta - \eta \, \frac{1}{M} \sum_{i=1}^{M} \nabla_\theta L(\theta; x_i, y_i)$$

Stochastic gradient descent, as illustrated in the second equation, updates via a single, randomly selected instance $i$:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta; x_i, y_i)$$

The confusion, I believe, stems from thinking about the loops over samples 1 through M. In batch gradient descent, the summation (a loop from 1 to M) is performed within each iteration, producing a single update; in stochastic gradient descent, the pass over the M samples is spread across an epoch, with an update after every sample.

Thus the intermediate weight updates will be different, and these results compound. So even if both methods process every instance over the entire training course, the act of updating only after a single instance will inherently cause a different trajectory than if one were to average all the gradients per iteration.
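A minimal sketch of this divergence on least-squares linear regression with synthetic data (all names and hyperparameters here are illustrative, not from the original answer): both methods see every sample the same number of times, yet end at different weights because SGD's intermediate updates change the trajectory.

```python
# Compare batch GD and SGD on the same data, same epochs, same learning rate.
import numpy as np

rng = np.random.default_rng(0)
M = 32
X = rng.normal(size=(M, 2))
y = X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=M)

def batch_gd(epochs=100, lr=0.1):
    w = np.zeros(2)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / M   # average gradient over all M samples
        w -= lr * grad                 # one update per epoch
    return w

def sgd(epochs=100, lr=0.1):
    w = np.zeros(2)
    for _ in range(epochs):
        for i in rng.permutation(M):   # M updates per epoch, random order
            grad = X[i] * (X[i] @ w - y[i])  # gradient of a single instance
            w -= lr * grad
    return w

w_batch, w_sgd = batch_gd(), sgd()
print(w_batch, w_sgd)  # both near the true weights, but not identical
```

Both runs land near the generating weights, yet the compounding of per-sample updates keeps the two results from coinciding exactly.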

SGD can be faster simply because it only processes a single instance per weight update. In fact, purely from a hardware-efficiency perspective, SGD is suboptimal due to its intrinsically serial nature over the course of training: each update depends on the one before it. Batch gradient descent can easily be parallelized and take advantage of GPUs by processing multiple instances simultaneously.
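A small sketch of why the batch update parallelizes so well (assuming a least-squares loss for concreteness): the M per-sample gradients collapse into a single matrix expression, which BLAS or a GPU can evaluate in parallel, whereas SGD's updates must be applied one after another.

```python
# The averaged batch gradient computed two ways: a serial per-sample loop
# versus one vectorized matrix expression.
import numpy as np

rng = np.random.default_rng(1)
M = 1000
X = rng.normal(size=(M, 3))
y = rng.normal(size=M)
w = rng.normal(size=3)

# Serial view: one gradient per sample, then average.
g_serial = np.mean([X[i] * (X[i] @ w - y[i]) for i in range(M)], axis=0)

# Vectorized view: the same average as a single matrix product.
g_vector = X.T @ (X @ w - y) / M

print(np.allclose(g_serial, g_vector))  # True
```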

Because SGD is, well, stochastic, the perturbations this causes in the surrogate model development can help alleviate bias in the training set. This is why SGD can sometimes help avoid local minima, albeit using a rather crude optimization strategy.

Picture an example where a model has 2 parameters that define the xy plane and the corresponding error is plotted as the z component; this defines a function mapping R² to R. This function is only an approximation of the real-world error function, because you are not testing on every possible instance. A biased training set approximates the real-world function poorly, so its local minima differ from the real-world local minima.

When using batch GD, you will likely target the nearest local minimum, with essentially no change of target over the course of training. This is good if your training set is relatively unbiased, but otherwise not ideal, as you would be overfitting to a poor approximation.

SGD can improve generalization by forcing a change of target in each iteration. Every iteration, by sampling a different one of the M training instances, you effectively optimize a slightly different function, so you will likely switch target minima multiple times. If your training set is relatively biased, this can help you avoid overfitting to a poor approximation. If your training set were a perfect approximation, SGD would be pointless. Some may argue that, given a perfect approximation, the random sampling helps to avoid getting trapped in the starting basin, but at that point it's just luck.

Answered by Benji Albert on October 21, 2020
