
CNN backpropagation between layers

Data Science, asked by user63067 on August 13, 2021

I have this CNN architecture:
[image: CNN architecture diagram]

I know how to calculate the error for the weights based on the output and how to update the weights between the output<->hidden and hidden<->input layers.

The problem is that I have no idea how to calculate the delta for the values in the input layer based on the error, and then use it in the convolution backpropagation.

One Answer

Let's look at the layers before the reshaping stage since everything after that is simply a densely connected neural network.

Backpropagation in Max Pooling

Max pooling takes a window of values and only the maximum value passes through. This means that, during backpropagation, the error is routed back only through the maximum value in each window; only the weights that contributed to those values receive a gradient update, while every other position in the window receives zero gradient.
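As a quick illustration (a minimal sketch, not part of the original answer, assuming a 2x2 window with stride 2 and made-up values):

import numpy as np

# assumed conv-layer output (4x4) and upstream gradient from the next layer (2x2)
x = np.array([[0.25, 0.61, 0.12, 0.48],
              [0.33, 0.90, 0.55, 0.07],
              [0.14, 0.29, 0.81, 0.66],
              [0.72, 0.05, 0.40, 0.23]])
grad_out = np.array([[0.1, -0.2],
                     [0.3,  0.4]])

grad_in = np.zeros_like(x)
for i in range(0, x.shape[0], 2):
    for j in range(0, x.shape[1], 2):
        window = x[i:i+2, j:j+2]
        m, n = np.unravel_index(np.argmax(window), window.shape)
        # only the position holding the maximum receives the gradient
        grad_in[i + m, j + n] = grad_out[i // 2, j // 2]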

Backpropagation in the Convolutional Layers

This works the same way as for the densely connected layer. You take the derivative of the cross-correlation function (the mathematically accurate name for the operation a convolution layer performs) and use it in the backpropagation algorithm.


An Example

Let's look at the following example.

[image: worked convolution example]

Forward pass

The forward pass of the convolutional layer can be expressed as

$x_{i, j}^l = \sum_m \sum_n w_{m,n}^l o_{i+m, j+n}^{l-1} + b_{i, j}^l$

where $m$ and $n$ iterate across the dimensions of the kernel, and $k_1$ and $k_2$ denote the kernel size; in our case $k_1 = k_2 = 2$. Plugging in $i = j = 0$ gives the output $x_{0,0} = 0.25$ that you found.
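As a minimal sketch (not from the original answer), the same forward pass can be computed with SciPy's correlate2d. The 5x5 input is the one used later in this answer; the kernel and bias values here are assumed for illustration (only $w^1_{0,0} = -0.13$ is given).

import numpy as np
from scipy import signal

o = np.array([(0.51, 0.9, 0.88, 0.84, 0.05),
              (0.4, 0.62, 0.22, 0.59, 0.1),
              (0.11, 0.2, 0.74, 0.33, 0.14),
              (0.47, 0.01, 0.85, 0.7, 0.09),
              (0.76, 0.19, 0.72, 0.17, 0.57)])
w = np.array([[-0.13, 0.15],
              [ 0.20, 0.32]])   # hypothetical kernel values
b = 0.0                         # hypothetical bias

# valid cross-correlation: x[i, j] = sum_m sum_n w[m, n] * o[i+m, j+n] + b
x = signal.correlate2d(o, w, mode='valid') + b
print(x.shape)   # (4, 4), i.e. (H - k_1 + 1) x (W - k_2 + 1)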

Backpropagation

Assuming you are using the mean squared error (MSE) defined as

$E = \frac{1}{2}\sum_p (t_p - y_p)^2$,

we want to determine

$\frac{\partial E}{\partial w^l_{m', n'}}$ in order to update the weights. Here $m'$ and $n'$ are fixed indices into the kernel matrix, not to be confused with the summation indices $m$ and $n$. For example, $w^1_{0,0} = -0.13$ in our example. We can also see that for an input image of size $H \times W$, the output dimension after the convolutional layer will be

$(H-k_1+1) \times (W-k_2+1)$.

In our case that would be $4 \times 4$, as you showed. Let's calculate the error term. Each term in the output space has been influenced by the kernel weights. The kernel weight $w^1_{0,0} = -0.13$ contributed to the output $x^1_{0,0} = 0.25$ and to every other output. Thus we express its contribution to the total error as

$\frac{\partial E}{\partial w^l_{m', n'}} = \sum_{i=0}^{H-k_1} \sum_{j=0}^{W-k_2} \frac{\partial E}{\partial x^l_{i, j}} \frac{\partial x^l_{i, j}}{\partial w^l_{m', n'}}$.

This iterates across the entire output space, determines the error each output contributes, and then determines how much the kernel weight contributed to that output.

For simplicity, and to keep track of the backpropagated error, let us call the contribution to the error from the output space delta,

$\frac{\partial E}{\partial x^l_{i, j}} = \delta^l_{i,j}$.

The contribution from the weights

The convolution is defined as

$x_{i, j}^l = \sum_m \sum_n w_{m,n}^l o_{i+m, j+n}^{l-1} + b_{i, j}^l$,

thus,

$\frac{\partial x^l_{i, j}}{\partial w^l_{m', n'}} = \frac{\partial}{\partial w^l_{m', n'}} \left( \sum_m \sum_n w_{m,n}^l o_{i+m, j+n}^{l-1} + b_{i, j}^l \right)$.

Expanding the summation, we observe that the derivative is non-zero only when $m = m'$ and $n = n'$. We then get:

$\frac{\partial x^l_{i, j}}{\partial w^l_{m', n'}} = o^{l-1}_{i+m', j+n'}$.

Then back in our error term

$\frac{\partial E}{\partial w^l_{m', n'}} = \sum_{i=0}^{H-k_1} \sum_{j=0}^{W-k_2} \delta_{i,j}^l \, o^{l-1}_{i+m', j+n'}$.
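Written out as explicit loops, the sum above is simply (a sketch with assumed names; the convolve2d shortcut further down computes the same quantity):

import numpy as np

def conv_weight_grad(delta, o, k1, k2):
    # dE/dw[m', n'] = sum_{i, j} delta[i, j] * o[i + m', j + n']
    H, W = o.shape
    grad = np.zeros((k1, k2))
    for m in range(k1):
        for n in range(k2):
            for i in range(H - k1 + 1):
                for j in range(W - k2 + 1):
                    grad[m, n] += delta[i, j] * o[i + m, j + n]
    return grad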

Stochastic gradient descent

$w^{(t+1)} = w^{(t)} - \eta \frac{\partial E}{\partial w^l_{m', n'}}$

Let's calculate some of them

import numpy as np
from scipy import signal

# o: the 5x5 input to the convolutional layer
o = np.array([(0.51, 0.9, 0.88, 0.84, 0.05),
              (0.4, 0.62, 0.22, 0.59, 0.1),
              (0.11, 0.2, 0.74, 0.33, 0.14),
              (0.47, 0.01, 0.85, 0.7, 0.09),
              (0.76, 0.19, 0.72, 0.17, 0.57)])

# d: the backpropagated deltas dE/dx for the 4x4 conv output
# (zero everywhere except at the positions max pooling let through)
d = np.array([(0, 0, 0.0686, 0),
              (0, 0.0364, 0, 0),
              (0, 0.0467, 0, 0),
              (0, 0, 0, -0.0681)])

# Rotating d by 180 degrees and convolving is equivalent to cross-correlating
# o with d, which is exactly the double sum derived above.
gradient = signal.convolve2d(np.rot90(np.rot90(d)), o, 'valid')

array([[ 0.044606,  0.094061],
       [ 0.011262,  0.068288]])

Now you can put that into the SGD equation in place of $\frac{\partial E}{\partial w}$.
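For completeness, a minimal sketch of that update, continuing from the snippet above; the learning rate and the current kernel values are assumed:

eta = 0.01
w = np.array([[-0.13, 0.15],
              [ 0.20, 0.32]])   # hypothetical current kernel weights
w = w - eta * gradient          # gradient computed with convolve2d above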


Please let me know if there are errors in the derivation.

Answered by JahKnows on August 13, 2021
