Understanding the "unroling" step in the proof of the policy gradient theorem

Question

In the proof of the policy gradient theorem in the RL book of Sutton and Barto (that I shamelessly paste here):

there is the "unrolling" step that is supposed to be immediately clear

With just elementary calculus and re-arranging of terms

Well, it's not. :) Can someone explain this step in more detail?
How exactly is $Pr(s rightarrow x, k, pi)$ deduced by "unrolling"?

David Ireland · Answer

The unrolling step is due to the fact you end up with an equation that you can keep expanding indefinitely.
Note that we start with calculating $nabla v_pi(s)$ and arrive at
$$nabla v_pi(s) = sum_aleft[ nabla pi(a|s) q_pi(s,a) + pi(a|s) sum_{s'}p(s'|s,a) nabla v_pi (s') right];,$$
which contains a term for $nabla v_pi(s')$. This is a recursive relationship, similar to the bellman equation, so we can substitute in a term for $nabla v_pi(s')$ which will be a term similar just with $nabla v_pi(s'')$. As I mentioned, we can do this indefinitely which leads us to
$$nabla v_pi(s) = sum_{x in mathcal{S}} sum_{k=0}^infty mathbb{P}(srightarrow x, k, pi) sum_a nabla pi(a|x) q_pi(x,a);.$$
We need the term $sum_{x in mathcal{S}} sum_{k=0}^infty mathbb{P}(srightarrow x, k, pi)$ because we want to take an average over the state space, however due to unrolling there are many different $s_t$'s that we need to average over (this comes from the $s',s'',s''',...$ in the unrolling) so we also need to add the probability state of transitioning from state $s$ to $x$ in $k$ time steps, where we sum over an infinite horizon due to the repeated unrolling.
If you are wondering what happens to the terms $pi(a|s)$ and $p(s'|s,a)$ terms and why they are not explicitly shown in this final form, it is because this is exactly what the $mathbb{P}(srightarrow x, k, pi)$ represents. The average over all possible states accounts for the $p(s'|s,a)$ and the fact that we follow policy $pi$ in the probability statement accounts for the $pi(a|s)$.

oleg.mosalov · Answer

It looks like "v of s prime" is just substituted with the already derived value for "v of s". You can call it a recursion of a kind.
In other words, v(s) is dependent on v(s') and that implies that v(s') is dependent on v(s''). So we can combine that and get the dependency of v(s) of v(s'').

Understanding the "unroling" step in the proof of the policy gradient theorem

2 Answers

Add your own answers!

Ask a Question