Understanding the multidimensional chain rule

Mathematics Asked by ATW on November 26, 2021

I have some trouble in understanding the multidimensional chain rule. For differentiable functions $f,g$, defined by $f: U \to V$, $g:V\to\mathbb K^n$ where $U \subseteq \mathbb R^d$, $V \subseteq \mathbb R^\nu$ are open,

$$\Big(\mathrm D(g \circ f)\Big)(x)=\Big(\mathrm D g(f(x))\Big)\cdot\Big(\mathrm Df(x)\Big).$$

I have accepted this identity rather than understood it. I have seen some examples, and they made it clear that I do not understand it completely. When differentiating a multidimensional function, one apparently differentiates the components separately and somehow adds everything? Unfortunately I don't know exactly how… any explanation would be highly appreciated.

One Answer

When talking about the differential of a multidimensional function in general, without a specific function and application in mind, it's better to forget about its components. It's tedious and obstructs the view on what the differential is: a linear map.

The definition of (total) differentiability of a function $f:U\longrightarrow\mathbb R^m$, where $U\subseteq\mathbb R^n$ is open, is that $f$ is differentiable at $x_0\in U$ if there exists a linear map $L:\mathbb R^n\longrightarrow\mathbb R^m$ which approximates $f$ nicely near $x_0$. "Approximating $f$ nicely" is an informal way to say that the difference between the function $f(x)$ and its (affine) linear approximation $f(x_0)+L(x-x_0)$ becomes small as $x\to x_0$, and quickly so. In particular, quicker than $x$ goes to $x_0$, or framed differently, quicker than $x-x_0$ goes to $0$. This can be phrased mathematically in two equivalent ways:

  1. $\lim\limits_{x\to x_0}\frac{f(x)-[f(x_0)+L(x-x_0)]}{\Vert x-x_0\Vert}=0.$
  2. There exists a remainder function $R_f:U\longrightarrow\mathbb R^m$ such that $f(x)=f(x_0)+L(x-x_0)+R_f(x)$ and $\lim\limits_{x\to x_0}\frac{R_f(x)}{\Vert x-x_0\Vert}=0$.

In (1), this limit just says that $f(x)$ minus its linear approximation $f(x_0)+L(x-x_0)$ goes to $0$, and the $\Vert x-x_0\Vert$ in the denominator guarantees that it does so quicker than $x-x_0$. And then (2) is just a rephrasing, where the remainder $R_f$ is just the numerator of the fraction in (1). Notice that nowhere did I reference components of a function. Sure, $f$ has components, and we could find a matrix representation of the linear map $L$ with respect to the standard bases of $\mathbb R^n$ and $\mathbb R^m$ whose entries would be the partial derivatives of the components of $f$, but that's just a representation, and we could represent it completely differently if we changed bases. The only important part is that it's linear and has the properties described above. Anyway, we call this linear map $L$ the (total) differential of $f$ at $x_0$, and we like to write it as $\mathrm D f(x_0)$ to make sure that everyone knows which function it approximates and at what point. But it's still a linear map.
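To make the limit in (1) concrete, here is a minimal numerical sketch. The map `f`, the base point `x0` and the approach direction are hypothetical choices made only for illustration; `Df` is its hand-computed Jacobian matrix, which plays the role of $L$ in the standard bases.

```python
import numpy as np

def f(x):
    # Hypothetical example map R^2 -> R^2 (not from the answer itself).
    return np.array([x[0] * x[1], np.sin(x[0]) + x[1] ** 2])

def Df(x):
    # Hand-computed Jacobian matrix of f at x; this represents the linear map L.
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 2.0 * x[1]]])

x0 = np.array([0.5, -1.0])                      # assumed base point
L = Df(x0)
direction = np.array([1.0, 2.0]) / np.sqrt(5.0)  # assumed unit direction

def remainder_ratio(h):
    # ||R_f(x)|| / ||x - x0|| for x = x0 + h * direction
    x = x0 + h * direction
    R = f(x) - (f(x0) + L @ (x - x0))
    return np.linalg.norm(R) / np.linalg.norm(x - x0)

# The ratio should shrink as x approaches x0.
ratios = [remainder_ratio(h) for h in (1e-1, 1e-2, 1e-3)]
print(ratios)
```

Approaching along a single direction only illustrates the limit; the definition requires the ratio to vanish however $x$ approaches $x_0$.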

Now to the actual point of your question: we want to find the differential $\mathrm D(g\circ f)(x_0)$ of a composition of two differentiable functions $f:U\longrightarrow\mathbb R^m$ and $g:V\longrightarrow\mathbb R^l$, where $U\subseteq\mathbb R^n$ and $V\subseteq\mathbb R^m$ are open and $f(U)\subseteq V$. Since $g\circ f$ is a function $U\longrightarrow\mathbb R^l$ where $U\subseteq\mathbb R^n$, this differential is a linear map

$$\mathrm D(g\circ f)(x_0):\mathbb R^n\longrightarrow\mathbb R^l.$$

Note that this kinda fits the chain rule already: $\mathrm Df(x_0)$ is a map $\mathbb R^n\longrightarrow\mathbb R^m$, and $\mathrm Dg(f(x_0))$ is a map $\mathbb R^m\longrightarrow\mathbb R^l$. Applying $\mathrm Df(x_0)$ first and $\mathrm Dg(f(x_0))$ second will map from $\mathbb R^n$ to $\mathbb R^m$ and from there to $\mathbb R^l$, so in total, it maps from $\mathbb R^n$ to $\mathbb R^l$, just what we want. This is also the reason why the order matters for the multidimensional chain rule: The differential $\mathrm Df$ needs to be applied first, which is why it's on the right.
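The same bookkeeping can be seen in matrix shapes. The sketch below uses placeholder matrices with assumed dimensions $n=3$, $m=2$, $l=4$; the entries are meaningless and only the shapes matter.

```python
import numpy as np

n, m, l = 3, 2, 4          # assumed dimensions, chosen only for illustration

Df = np.ones((m, n))       # stands in for Df(x0): a map R^n -> R^m
Dg = np.ones((l, m))       # stands in for Dg(f(x0)): a map R^m -> R^l

product = Dg @ Df          # Df is applied first, so it sits on the right
print(product.shape)       # (4, 3), i.e. a map R^n -> R^l

# The reversed order does not even type-check here: (m x n) times (l x m).
try:
    Df @ Dg
    reversed_ok = True
except ValueError:
    reversed_ok = False
print(reversed_ok)
```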

And the actual proof is just calculation: $f$ is differentiable at $x_0$ with differential $\mathrm Df(x_0)$ (I'll shorten this to just $L_f$ for the calculation), and $g$ is differentiable at $y_0:=f(x_0)$ with differential $\mathrm Dg(f(x_0))$ (shortened to $L_g$). Now according to the second version of the definition above, there exist remainder functions

$$\begin{align*}R_f:&U\longrightarrow\mathbb R^m,\\ R_g:&V\longrightarrow\mathbb R^l \end{align*}$$

such that

$$\begin{align}f(x)&=f(x_0)+L_f(x-x_0)+R_f(x)&&(1)\\ g(y)&=g(y_0)+L_g(y-y_0)+R_g(y)&&(2) \end{align}$$

and

$$\begin{align}\lim_{x\to x_0}\frac{R_f(x)}{\Vert x-x_0\Vert}&=0,\\ \lim_{y\to y_0}\frac{R_g(y)}{\Vert y-y_0\Vert}&=0. \end{align}$$

Now if we insert $y=f(x)$ in $(2)$ and remember $y_0=f(x_0)$, we get

$$\begin{align}g(f(x))&=g(f(x_0))+L_g(L_f(x-x_0)+R_f(x))+R_g(f(x))\\ &=g(f(x_0))+\underbrace{L_g(L_f(x-x_0))}_{=\mathrm Dg(f(x_0))\cdot\mathrm Df(x_0)(x-x_0)}+\underbrace{L_g(R_f(x))+R_g(f(x))}_{=R_{g\circ f}(x)}. \end{align}$$

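You can show that the rightmost part goes to $0$ even when divided by $\Vert x-x_0\Vert$ (for the first summand, use that the linear map $L_g$ is bounded, so $\Vert L_g(R_f(x))\Vert\leq\Vert L_g\Vert\,\Vert R_f(x)\Vert$; for the second, use that $\Vert f(x)-f(x_0)\Vert/\Vert x-x_0\Vert$ stays bounded near $x_0$). Then this equation is exactly what defines the differential $\mathrm D(g\circ f)(x_0)$, and it evidently equals $\mathrm Dg(f(x_0))\cdot\mathrm Df(x_0)$. The $\cdot$ is there to denote matrix multiplication because when actually calculating stuff we will use the matrix representation, but we could have written $\mathrm Dg(f(x_0))\circ\mathrm Df(x_0)$ just as well.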
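In the standard bases the matrix product reads, entry by entry, $\frac{\partial (g\circ f)_i}{\partial x_j}(x_0)=\sum_{k=1}^m\frac{\partial g_i}{\partial y_k}(f(x_0))\,\frac{\partial f_k}{\partial x_j}(x_0)$, which is the "differentiate the components and add everything" from the question. The following is a minimal numerical sketch of that identity. The maps `f` and `g`, the point `x0` and the helper `numerical_jacobian` are hypothetical, chosen only for illustration, and the finite-difference comparison is a sanity check, not a proof.

```python
import numpy as np

def f(x):
    # Hypothetical f: R^3 -> R^2
    return np.array([x[0] * x[1], x[1] + np.sin(x[2])])

def Df(x):
    # Hand-computed Jacobian of f (2 x 3)
    return np.array([[x[1], x[0], 0.0],
                     [0.0, 1.0, np.cos(x[2])]])

def g(y):
    # Hypothetical g: R^2 -> R^2
    return np.array([np.exp(y[0]) * y[1], y[0] ** 2 - y[1]])

def Dg(y):
    # Hand-computed Jacobian of g (2 x 2)
    return np.array([[np.exp(y[0]) * y[1], np.exp(y[0])],
                     [2.0 * y[0], -1.0]])

def numerical_jacobian(func, x, eps=1e-6):
    # Central differences; column j approximates d func / d x_j.
    cols = []
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        cols.append((func(x + e) - func(x - e)) / (2.0 * eps))
    return np.stack(cols, axis=1)

x0 = np.array([0.3, -0.7, 1.2])  # assumed base point

# Chain rule: Jacobian of g at f(x0) times Jacobian of f at x0.
chain = Dg(f(x0)) @ Df(x0)
numeric = numerical_jacobian(lambda x: g(f(x)), x0)
max_error = np.max(np.abs(chain - numeric))

# One entry written out as the sum over the intermediate index k.
entry_01 = sum(Dg(f(x0))[0, k] * Df(x0)[k, 1] for k in range(2))

print(chain.shape, max_error)
print(entry_01, chain[0, 1])
```

The two printed entries agree because the matrix product is, by definition, that sum over $k$.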

Answered by Vercassivelaunos on November 26, 2021
