# Why is L2 loss more commonly used in neural networks than other loss functions?

Artificial Intelligence Asked by Ali KHalili on September 27, 2020

Why is L2 loss more commonly used in neural networks than other loss functions?
What is the reason for L2 being the default choice in neural networks?

I'll cover both L2-regularized loss and Mean-Squared Error (MSE):

MSE:

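1. L2 loss is continuously differentiable everywhere, whereas L1 loss is not differentiable at the point where the residual is zero. This tends to make gradient-based training more stable: the L2 gradient shrinks smoothly as the prediction approaches the target, while the L1 gradient keeps a constant magnitude and flips sign at zero (L1 can still be optimized with subgradients, but less smoothly).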
2. Using L2 loss (without any regularization) corresponds to the Ordinary Least Squares estimator. For a linear model, if the Gauss-Markov assumptions hold, this gives useful theoretical guarantees about your estimator/model (e.g. that it is the "Best Linear Unbiased Estimator"). Source: https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem.
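To make the differentiability point concrete, here is a minimal NumPy sketch that evaluates the derivative of each loss with respect to the residual (prediction minus target) just below, at, and just above zero. The function names `l2_loss_grad` and `l1_loss_grad` are hypothetical helpers written for this illustration, not part of any library.

```python
import numpy as np

def l2_loss_grad(residual):
    # Derivative of r**2 with respect to r is 2r:
    # it shrinks smoothly to 0 as the residual shrinks.
    return 2.0 * residual

def l1_loss_grad(residual):
    # Derivative of |r| is sign(r) for r != 0 and undefined at r = 0.
    # np.sign returns 0 at exactly 0, which is one valid subgradient choice.
    return np.sign(residual)

# Residuals just below, exactly at, and just above zero.
residuals = np.array([-1e-3, 0.0, 1e-3])

l2_grads = l2_loss_grad(residuals)
l1_grads = l1_loss_grad(residuals)

print("L2 gradients:", l2_grads)
print("L1 gradients:", l1_grads)
```

The L2 gradients stay tiny and pass continuously through zero, while the L1 gradients jump from -1 to +1 no matter how small the residual is.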

L2 Regularization:

1. Using L2 regularization is equivalent to placing a zero-mean Gaussian prior (see https://stats.stackexchange.com/questions/163388/why-is-the-l2-regularization-equivalent-to-gaussian-prior) on your model's parameters when the problem is posed as Maximum A Posteriori (MAP) inference. Writing x for the parameters and y for the observed data, if your likelihood model (p(y|x)) is Gaussian and your prior over x is Gaussian, then the posterior distribution over parameters (p(x|y)) will also be Gaussian; this holds exactly when the parameters enter the mean linearly, as in linear regression. From Wikipedia: "If the likelihood function is Gaussian, choosing a Gaussian prior over the mean will ensure that the posterior distribution is also Gaussian" (source: https://en.wikipedia.org/wiki/Conjugate_prior).

2. As in the case above, the L2 penalty is continuously differentiable everywhere, unlike the L1 penalty, which is not differentiable where a parameter equals zero.
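The link between the Gaussian prior and the L2 penalty can be sketched in a few lines. The notation here is an assumption made for illustration: $w$ stands for the parameters (what the answer writes as x in p(x|y)), $f_w$ is the model, the observation noise is Gaussian with variance $\sigma^2$, and the prior is $w \sim \mathcal{N}(0, \tau^2 I)$.

$$
\begin{aligned}
w_{\text{MAP}} &= \arg\max_w \; \big[ \log p(y \mid w) + \log p(w) \big] \\
&= \arg\min_w \; \left[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} \big( y_i - f_w(x_i) \big)^2 + \frac{1}{2\tau^2} \lVert w \rVert_2^2 \right] \\
&= \arg\min_w \; \left[ \sum_{i=1}^{n} \big( y_i - f_w(x_i) \big)^2 + \lambda \lVert w \rVert_2^2 \right], \qquad \lambda = \frac{\sigma^2}{\tau^2}.
\end{aligned}
$$

Under these assumptions the Gaussian likelihood contributes the squared-error term and the Gaussian prior contributes the L2 penalty, so a tighter prior (smaller $\tau^2$) corresponds to stronger regularization (larger $\lambda$).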

Correct answer by Ryan Sander on September 27, 2020
