
Reason behind the sum of rate factors for calculating cost function derivative

Data Science Asked on June 21, 2021

Suppose we have a network of neurons like below:

[figure: a fully connected multi-layer network of neurons]

We make a small change to a weight $w^l_{jk}$ in our network, and that change can affect the cost function through many paths.

Now pick just one of those paths: from the chosen weight, to the activation of its neuron, to one of that neuron's weights in the next layer, to that activation, and so on…

We can write an equation (using the chain rule) for this single path, where each partial derivative in the product below is what the book calls a rate factor:

$$\Delta C \approx \frac{\partial C}{\partial a^L_m}\frac{\partial a^L_m}{\partial a^{L-1}_n}\frac{\partial a^{L-1}_n}{\partial a^{L-2}_p}\cdots\frac{\partial a^{l+1}_q}{\partial a^l_j}\frac{\partial a^l_j}{\partial w^l_{jk}}\,\Delta w^l_{jk}$$

But this is only one path along which the cost changes as we change that weight, so we need the rate of change of the cost over all possible related paths (all the rate factors). In Michael Nielsen's famous book, this is done by summing over all the rate-factor products, as below:

$$\frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q}\frac{\partial C}{\partial a^L_m}\frac{\partial a^L_m}{\partial a^{L-1}_n}\frac{\partial a^{L-1}_n}{\partial a^{L-2}_p}\cdots\frac{\partial a^{l+1}_q}{\partial a^l_j}\frac{\partial a^l_j}{\partial w^l_{jk}}$$
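If I try to make this sum concrete for myself (my own attempt, taking the simplest case where layer $l$ is the last hidden layer, so the only index left to sum over is the output neuron $m$, and assuming the output layer has just two neurons), I think it reads:

$$\frac{\partial C}{\partial w^l_{jk}} = \sum_{m=1}^{2}\frac{\partial C}{\partial a^L_m}\frac{\partial a^L_m}{\partial a^l_j}\frac{\partial a^l_j}{\partial w^l_{jk}} = \frac{\partial C}{\partial a^L_1}\frac{\partial a^L_1}{\partial a^l_j}\frac{\partial a^l_j}{\partial w^l_{jk}} + \frac{\partial C}{\partial a^L_2}\frac{\partial a^L_2}{\partial a^l_j}\frac{\partial a^l_j}{\partial w^l_{jk}}$$

i.e. one term per path, but I still don't see why the total rate of change is exactly this sum.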

But I really don't understand exactly how we can compute the slope of the cost with respect to that weight by summing all those rate factors, or how that sum relates to our equation.

A little later, the book proves an equation for the error in layer $l$ in terms of the error in the next layer:

$$\delta^l_j = \sum_k w^{l+1}_{kj}\,\delta^{l+1}_k\,\sigma'(z^l_j)$$

Here, too, we sum over the errors of all the related neurons.
Can someone give me a proof of, or an explanation for, that sum?
How does the rate of change of the cost relate to the sum over all those related rate factors, and why do we sum?
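If it helps to see what I mean, writing that error equation out for the case where layer $l+1$ has just two neurons (again my own attempt) gives:

$$\delta^l_j = \left(w^{l+1}_{1j}\,\delta^{l+1}_1 + w^{l+1}_{2j}\,\delta^{l+1}_2\right)\sigma'(z^l_j)$$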

Of course, the neurons' activations are related to each other: each activation is a weighted sum over all the neurons in the previous layer, plus a bias, passed through the activation function:

$$a^l_j = \sigma\!\left(\sum_k w^l_{jk}\,a^{l-1}_k + b^l_j\right)$$

But using this to calculate that derivative is still not clear or tangible to me; any explanation of that sum over the derivatives would be very helpful.

thank you.

One Answer

First, disclaimer: you're presenting fairly hairy computations and it's a little hard to read your question. I'll try answering nonetheless.

I don't understand exactly how we can compute the slope of the cost with respect to that weight by summing [...]

If I've interpreted the sums correctly, the sums in your question correspond to the Jacobian matrix that is used when applying the chain rule in multiple dimensions.

Intuitively, the weight $w_{j,k}^{l}$ will be referenced by all the paths originating from the neuron that uses $w_{j,k}^l$, and will then feed into the "final" expression of $C$ via multiple linear combinations (sums). I'm fairly certain that if you actually followed the gradient $\partial C/\partial w_{j,k}^l$ starting from a sum such as $C=a^L_{1}w_{1,k}^L+a^L_{2}w_{2,k}^L$, it would work out to be the sum you show. This, I believe, also applies to the second sum you present.
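To make that concrete, here is a rough numerical sketch (my own toy setup, not from the book: one hidden layer with sigmoid activations and a quadratic cost; all variable names are mine). It sums the per-path products $\frac{\partial C}{\partial a^L_m}\frac{\partial a^L_m}{\partial a^l_j}\frac{\partial a^l_j}{\partial w^l_{jk}}$ over the output neurons $m$ and checks the result against a finite-difference estimate of $\partial C/\partial w^l_{jk}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# Toy network: 2 inputs -> 3 hidden neurons (layer l) -> 2 outputs (layer L),
# quadratic cost C = 0.5 * sum((a_L - y)^2).  This whole setup is an assumption.
x = rng.normal(size=2)
y = rng.normal(size=2)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)  # weights/biases into layer l
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)  # weights/biases into layer L

def cost(W1):
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return 0.5 * np.sum((a2 - y) ** 2)

# Pick one weight w^l_{jk} and compute dC/dw by summing over the paths,
# one path per output neuron m, exactly as in the sum from the book.
j, k = 1, 0
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

da1j_dw = sigmoid_prime(z1[j]) * x[k]            # d a^l_j / d w^l_{jk}
path_sum = 0.0
for m in range(2):                               # sum over paths (output neurons)
    dC_daLm = a2[m] - y[m]                       # d C / d a^L_m
    daLm_da1j = sigmoid_prime(z2[m]) * W2[m, j]  # d a^L_m / d a^l_j
    path_sum += dC_daLm * daLm_da1j * da1j_dw

# Centered finite-difference estimate of the same derivative.
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[j, k] += eps
Wm[j, k] -= eps
finite_diff = (cost(Wp) - cost(Wm)) / (2 * eps)

print(path_sum, finite_diff)  # the two values should agree to several decimal places
```

Each term added to `path_sum` is one rate-factor product for one path; the loop over `m` is the sum your question asks about, and it matches the true derivative because each output activation contributes its own independent first-order change to $C$.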

Answered by Andreas Storvik Strauman on June 21, 2021
