# What is the Cost Function for Neural Network with Dropout Regularisation?

Cross Validated Asked on January 3, 2022

For some context, I shall outline my current understanding:

For a neural network on a binary classification problem, the cross-entropy cost function $J$ is defined as:

$$J = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(a^{(i)}) + (1-y^{(i)}) \log(1-a^{(i)}) \right]$$

1. $m$ = number of training examples
2. $y^{(i)}$ = class label of example $i$ (0 or 1)
3. $a^{(i)}$ = output prediction for example $i$ (a value between 0 and 1)
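For concreteness, the cost above can be sketched in a few lines of NumPy (the clipping constant `eps` is an implementation detail added here to avoid `log(0)`, not part of the definition):

```python
import numpy as np

def binary_cross_entropy(y, a, eps=1e-12):
    """Cross-entropy cost J, averaged over the m training examples."""
    a = np.clip(a, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([1, 0, 1, 1])          # class labels
a = np.array([0.9, 0.2, 0.8, 0.6])  # predicted probabilities
J = binary_cross_entropy(y, a)
```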

Dropout regularisation works as follows: for a given training example, we randomly shut down some of the nodes in a layer according to some probability. This keeps the weights small during training and hence regularises the network and helps prevent overfitting.
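As a sketch of the mechanism, the common "inverted dropout" variant zeroes each unit's activation with probability $1 - p_{\text{keep}}$ and rescales the survivors by $1/p_{\text{keep}}$ so the expected activation is unchanged (the function name and rescaling convention here are illustrative, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, keep_prob=0.8):
    """Inverted dropout: zero each unit with probability 1 - keep_prob,
    then rescale survivors so the expected activation is unchanged."""
    mask = (rng.random(activations.shape) < keep_prob).astype(float)
    return activations * mask / keep_prob

a = np.ones((4, 5))                       # activations of one layer
a_dropped = dropout_forward(a, keep_prob=0.8)
```

At test time no units are dropped; because of the rescaling during training, the full network can be used as-is.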

I have learnt that if we do apply dropout regularisation, the cross entropy cost function is no longer easy to define due to all the intermediate probabilities. Why is this the case? Why doesn’t the old definition still hold? As long as the network learns better parameters, won’t the cross entropy cost decrease on every iteration of Gradient Descent? Thanks in advance.

Dropout does not change the cost function, and you do not need to make changes to the cost function when using dropout.

The reasoning is that dropout is a way to average over an ensemble of each of the exponentially-many "thinned" networks resulting from dropping units randomly. In this light, each time you apply dropout and compute the loss, you're computing the loss that corresponds to a randomly-selected thinned network; collecting together many of these losses reflects a distribution of losses over these networks. Of course, the loss surface is noisier as a result, so model training takes longer. The goal of training the network in this way is to obtain a model that is averaged over all of these different "thinned" networks.
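The point above can be made concrete with a toy experiment: fix the weights of a hypothetical single-layer network, resample the dropout mask many times, and observe that the same cross-entropy formula yields a distribution of losses, one per thinned network (all names and sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical fixed data and weights; only the dropout mask varies.
X = rng.normal(size=(8, 10))        # 8 examples, 10 input features
y = rng.integers(0, 2, size=8)      # binary labels
w = rng.normal(size=10)             # fixed weight vector

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def thinned_loss(keep_prob=0.5):
    """Cross-entropy of one randomly thinned network: drop input units
    (inverted dropout), then evaluate the usual cost function."""
    mask = (rng.random(X.shape) < keep_prob) / keep_prob
    a = np.clip(sigmoid((X * mask) @ w), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

# Each sample is the *same* cost function, evaluated on a different
# randomly thinned network; together they form a distribution of losses.
losses = [thinned_loss() for _ in range(100)]
```

The spread of `losses` is the extra noise in the loss surface mentioned above; the cost function itself never changed.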

For more information, see How to explain dropout regularization in simple terms? or the original paper: Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research, 2014.

Answered by Sycorax on January 3, 2022
