# What is the Cost Function for Neural Network with Dropout Regularisation?

Cross Validated Asked on January 3, 2022

For some context, I shall outline my current understanding:

Considering a Neural Network, for a Binary Classification problem, the Cross-entropy cost function, J, is defined as:

$$J = -\frac{1}{m} \sum_{i=1}^m \left[ y^i \log(a^i) + (1 - y^i) \log(1 - a^i) \right]$$

where:

1. $m$ = number of training examples
2. $y^i$ = class label of the $i$-th example (0 or 1)
3. $a^i$ = output prediction for the $i$-th example (value between 0 and 1)
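For concreteness, this cost can be computed directly; the sketch below is a minimal NumPy illustration (the `eps` clipping to guard against `log(0)` is my own addition, not part of the definition above):

```python
import numpy as np

def binary_cross_entropy(y, a, eps=1e-12):
    """Cross-entropy cost J for binary classification.

    y -- true labels (0 or 1), shape (m,)
    a -- predicted probabilities in (0, 1), shape (m,)
    """
    a = np.clip(a, eps, 1 - eps)  # guard against log(0)
    # Average of the per-example losses over all m examples
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([1, 0, 1, 1])
a = np.array([0.9, 0.1, 0.8, 0.7])
J = binary_cross_entropy(y, a)
```

The closer each $a^i$ is to its label $y^i$, the smaller $J$ becomes.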

Dropout regularisation works as follows: for a given training example, we randomly shut down each node in a layer with some probability. This has the effect of keeping the weights small during training, which regularises the network and helps prevent overfitting.
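In practice this is usually implemented as "inverted dropout": each unit is zeroed with probability `1 - keep_prob`, and the survivors are rescaled by `1/keep_prob` so the expected activation is unchanged. A minimal sketch (the function and parameter names are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, keep_prob=0.8):
    """Inverted dropout: zero each unit with probability 1 - keep_prob,
    then scale survivors by 1/keep_prob so E[output] == input."""
    mask = rng.random(activations.shape) < keep_prob  # True = keep unit
    return activations * mask / keep_prob, mask

a = np.ones((5, 4))                       # toy activations for one layer
a_dropped, mask = dropout_forward(a, keep_prob=0.8)
```

At test time no units are dropped and no rescaling is needed, since the training-time rescaling already matched the expected activations.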

I have learnt that if we do apply dropout regularisation, the cross-entropy cost function is no longer easy to define because of all the intermediate probabilities. Why is this the case? Why doesn't the old definition still hold? As long as the network learns better parameters, won't the cross-entropy cost decrease on every iteration of gradient descent? Thanks in advance.

Dropout does not change the cost function, and you do not need to make changes to the cost function when using dropout.

The reasoning is that dropout is a way to average over an ensemble of the exponentially many "thinned" networks that result from dropping units randomly. In this light, each time you apply dropout and compute the loss, you're computing the loss of a randomly selected thinned network; taken together, many of these losses reflect a distribution of losses over these networks. Of course, the loss surface is noisier as a result, so model training takes longer. The goal of training the network in this way is to obtain a model that is averaged over all of these different "thinned" networks.
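To see this concretely, here is a toy illustration (my own sketch, not code from the paper): the same training example, passed through the same fixed weights, yields a different cross-entropy loss on every forward pass, because each dropout mask selects a different thinned network from the ensemble.

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(x, W1, W2, keep_prob=None):
    """One-hidden-layer network with optional inverted dropout on the
    hidden layer. Each call with keep_prob set samples a fresh mask,
    i.e. a randomly thinned sub-network."""
    h = np.maximum(0.0, W1 @ x)              # ReLU hidden layer
    if keep_prob is not None:
        mask = rng.random(h.shape) < keep_prob
        h = h * mask / keep_prob             # inverted dropout
    z = W2 @ h
    return 1.0 / (1.0 + np.exp(-z))          # sigmoid output a

# Toy data: random weights stand in for a partially trained model.
x = rng.standard_normal(3)
W1 = rng.standard_normal((8, 3))
W2 = rng.standard_normal(8)
y = 1.0

# The same (x, y) pair gives a different loss on each pass:
# every mask picks a different thinned network.
losses = []
for _ in range(1000):
    a = np.clip(forward(x, W1, W2, keep_prob=0.5), 1e-12, 1 - 1e-12)
    losses.append(-(y * np.log(a) + (1 - y) * np.log(1 - a)))
```

With dropout off (`keep_prob=None`) every pass would return the identical loss; with it on, `losses` is a sample from the distribution of losses over thinned networks, which is what makes the training curve noisier.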

For more information, see How to explain dropout regularization in simple terms? or the original paper: Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research, 2014.

Answered by Sycorax on January 3, 2022
