
Order of derivatives in automatic differentiation

Mathematics Asked by Niki on January 20, 2021

I did this derivation and want to know if it is correct. Basically, we want to know the gradient of $g \circ f$ at the point $M$.

Let $\Psi \equiv g \circ f$ be the function representing the forward pass and let $f: \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ represent the part of the forward pass we are interested in, where $m$ is the batch size and $n$ the array length. Assume we can decompose $f$ into $d \in \mathbb{N}^+$ functions $f_1, f_2, \dots, f_d$:
\begin{equation*}
f \equiv f_d \circ f_{d-1} \circ \dots \circ f_1
\end{equation*}
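(For concreteness, here is a minimal sketch of such a decomposition in Python; the component functions $f_1, f_2, f_3$ are made up purely for illustration and are not part of the question.)

```python
import numpy as np

# Made-up elementwise components f_1, f_2, f_3 (so d = 3), each mapping
# an (m, n) array to an (m, n) array.
f1 = lambda M: np.tanh(M)
f2 = lambda M: M ** 2
f3 = lambda M: np.exp(-M)

def f(M):
    # f = f_3 ∘ f_2 ∘ f_1
    return f3(f2(f1(M)))
```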

Now define $f^i$ for $i \in \{0, 1, \dots, d\}$, where $f_0(M) := M$:
\begin{equation*}
f^i := f_i \circ f_{i-1} \circ \dots \circ f_1 \circ f_0
\end{equation*}

Defining $f_0$ as the identity function is just a trick to simplify the expression below. Note that we already know $\nabla g(f(M))$, since that is one of the inputs to our backward pass, and because we did the forward pass, we have also already calculated all the $f^i(M)$ for $i \in \{0, 1, \dots, d\}$.
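(In code, "we already calculated all the $f^i(M)$" just means that the forward pass keeps every intermediate result. A minimal sketch, assuming the components are given as a Python list `fs = [f1, ..., fd]` as in the example above:)

```python
def forward(M, fs):
    """Compute and keep f^0(M), f^1(M), ..., f^d(M).

    fs is the list [f_1, ..., f_d]; the returned list contains exactly
    the intermediates that the backward pass will reuse later.
    """
    intermediates = [M]  # f^0(M) = M
    for fi in fs:
        intermediates.append(fi(intermediates[-1]))  # f^i(M) = f_i(f^{i-1}(M))
    return intermediates  # intermediates[-1] is f(M) = f^d(M)
```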

Now let $M \in \mathbb{R}^{m \times n}$. Using the chain rule, it follows that:
\begin{equation*}
\begin{aligned}
\nabla (g \circ f)(M)
& = \nabla g(f(M)) \cdot \nabla f^d(M) \\
& = \nabla g(f(M)) \cdot \nabla f_d(f^{d-1}(M)) \cdot \nabla f^{d-1}(M) \\
& = \nabla g(f(M)) \cdot \nabla f_d(f^{d-1}(M)) \cdot \nabla f_{d-1}(f^{d-2}(M)) \cdot \nabla f^{d-2}(M) \\
& \;\; \vdots \\
& = \nabla g(f(M)) \cdot \prod_{i=1}^{d-1} \left(\nabla f_{d-i+1}(f^{d-i}(M))\right) \cdot \nabla f^1(M) \\
& = \underbrace{\nabla g(f(M))}_{\text{known}} \cdot \prod_{i=1}^{d} \nabla f_{d-i+1}(\underbrace{f^{d-i}(M)}_{\text{known}})
\end{aligned}
\end{equation*}
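(To convince myself numerically, here is a small self-contained sketch; everything in it, the component functions, their derivatives, and the scalar-valued $g$, is made up for illustration. It flattens the $(m, n)$ arrays so that each $\nabla f_i(f^{i-1}(M))$ becomes an ordinary $mn \times mn$ diagonal matrix, and it checks that the reverse-mode order of multiplication and an alternative grouping of the same factors give the same result, which also matches finite differences of $g \circ f$.)

```python
import numpy as np

m, n = 2, 3
k = m * n
rng = np.random.default_rng(0)
M = rng.normal(size=(m, n))

# Made-up elementwise components f_1, f_2, f_3 and their derivatives.
fs  = [np.tanh,                       lambda x: x ** 2, lambda x: np.exp(-x)]
dfs = [lambda x: 1 - np.tanh(x) ** 2, lambda x: 2 * x,  lambda x: -np.exp(-x)]
g      = lambda M: np.sum(np.sin(M))   # made-up scalar-valued g
grad_g = lambda M: np.cos(M)           # gradient of g, as an (m, n) array

# Forward pass: cache f^0(M), f^1(M), ..., f^d(M).
inter = [M]
for fi in fs:
    inter.append(fi(inter[-1]))

# Jacobian of each (elementwise) f_i at the cached point f^{i-1}(M),
# written as a (k, k) diagonal matrix over the flattened array.
jacs = [np.diag(dfi(x).ravel()) for dfi, x in zip(dfs, inter[:-1])]

v = grad_g(inter[-1]).ravel()  # the known factor ∇g(f(M)), as a row vector

# Reverse-mode order: fold the vector into J_d, then J_{d-1}, ..., then J_1.
reverse = v.copy()
for J in reversed(jacs):
    reverse = reverse @ J

# Alternative grouping: multiply the Jacobian factors together first
# (J_d · J_{d-1} · ... · J_1), then apply ∇g(f(M)) at the end.
product = np.eye(k)
for J in reversed(jacs):
    product = product @ J
grouped = v @ product

print(np.allclose(reverse, grouped))  # True: the grouping does not change the result

# Sanity check against central finite differences of g ∘ f.
def gf(M):
    out = M
    for fi in fs:
        out = fi(out)
    return g(out)

eps = 1e-6
fd = np.empty(k)
for j in range(k):
    E = np.zeros(k)
    E[j] = eps
    fd[j] = (gf(M + E.reshape(m, n)) - gf(M - E.reshape(m, n))) / (2 * eps)
print(np.allclose(reverse, fd, atol=1e-5))  # True
```

(The explicit $mn \times mn$ Jacobians are only feasible here because the example is tiny.)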

My derivation implies that the order of calculating the $\nabla f_i(f^{i-1}(M))$ is irrelevant. However, the technique used in automatic differentiation is called the backward pass, and as I understand it, you always use the previous derivative as an input to the next one, taking the derivatives of the composed parts of the desired function in reverse order. Did I make a mistake, or does the order not matter? I'm not specifically looking to apply this to neural networks. Is the order maybe only relevant for a neural network, where we want to update every layer $k$, for which we need $\prod_{i=1}^k \nabla f_{d-i+1}(f^{d-i}(M))$?

More specifically, does this mean the backward pass can be parallelized to some degree?
