Why is the policy loss the mean of $-Q(s, \mu(s))$ in the DDPG algorithm?

Artificial Intelligence Asked by Dhanush Giriyan on November 17, 2021

I am trying to implement the DDPG algorithm based on this paper.

The part that confuses me is the actor network’s update.
I don’t understand why the policy loss is simply the mean of $-Q(s, \mu(s))$, where $Q$ is the critic network and $\mu$ is the policy network.
How does one arrive at this?

One Answer

This is not quite the loss that is stated in the paper.

For standard policy gradient methods the objective is to maximise $v_{\pi_\theta}(s_0)$; note that this is equivalent to minimising $-v_{\pi_\theta}(s_0)$. This is for a stochastic policy. In DDPG the policy is instead assumed to be deterministic.

In general, we can write $$v_\pi(s) = \mathbb{E}_{a\sim\pi}[Q(s,a)]\;;$$ to see this, note that $$Q(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t=a]\;,$$ so if we took the expectation of this with respect to the distribution of $a$ we would get $$\mathbb{E}_{a\sim\pi}[\mathbb{E}[G_t \mid S_t=s, A_t=a]] = \mathbb{E}[G_t \mid S_t=s] = v_\pi(s)\;.$$

However, if our policy is deterministic then $\pi(\cdot \mid s)$ is a point mass (a distribution which has probability 1 for a specific point and 0 everywhere else) on a certain action, so $\mathbb{E}_{a\sim\pi}[Q(s,a)] = Q(s,a=\pi(s)) = v_\pi(s)$. Thus the objective is still to maximise $v_\pi(s)$; it is just that, now that we know the policy is deterministic, we say we want to maximise $Q(s,a=\pi(s))$.
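As a quick numerical illustration of the point-mass argument, here is a minimal sketch for a single fixed state with three discrete actions. The action names, Q-values and probabilities are made up for the example and do not come from the paper.

```python
# Hypothetical toy values: Q(s, a) for one fixed state s and three actions.
q_values = {"left": 1.0, "stay": 4.0, "right": 2.5}


def expected_q(policy, q):
    """Return E_{a ~ policy}[Q(s, a)] for a discrete action distribution."""
    return sum(prob * q[action] for action, prob in policy.items())


# A stochastic policy spreads probability over several actions,
# so v(s) is a weighted average of the Q-values.
stochastic = {"left": 0.2, "stay": 0.5, "right": 0.3}

# A deterministic policy is a point mass on the single action pi(s).
chosen_action = "stay"
point_mass = {a: (1.0 if a == chosen_action else 0.0) for a in q_values}

v_stochastic = expected_q(stochastic, q_values)
v_deterministic = expected_q(point_mass, q_values)

print("v(s) under the stochastic policy:", v_stochastic)
print("v(s) under the point mass:       ", v_deterministic)
print("Q(s, pi(s)):                     ", q_values[chosen_action])
```

With the point mass, every term in the expectation except the one for $\pi(s)$ is multiplied by zero, which is why the expectation collapses to $Q(s, \pi(s))$.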

The policy gradient of this term was shown to be \begin{align} \nabla_\theta Q(s,a=\pi_\theta(s)) & \approx \mathbb{E}_{s \sim \mu}[\nabla_\theta Q(s,a=\pi_\theta(s))]\;; \\ & = \mathbb{E}_{s\sim\mu}[\nabla_a Q(s,a=\pi_\theta(s)) \nabla_\theta \pi_\theta(s)]\;; \end{align}

where, if we put a minus sign in front of this term, we arrive at the loss from the paper. Intuitively this makes sense: you want to know how much the action-value function changes with respect to the parameters of the policy, but this would be difficult to calculate directly, so you use the chain rule to see how much the action-value function changes with $a$ and, in turn, how much $a$ (i.e. our policy) changes with the parameters of the policy.
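To make the link to "the mean of $-Q$" concrete: averaging over a minibatch of states is a Monte Carlo estimate of the expectation over the state distribution, so differentiating the batch mean of $-Q(s, \pi_\theta(s))$ with respect to $\theta$ should reproduce the negated chain-rule expression. The sketch below checks this on a toy problem. Everything in it is an assumption made for illustration: scalar states, a linear policy $\pi_\theta(s) = \theta s$, and a hand-written critic $Q(s,a) = -(a - 2s)^2$ in place of a learned network.

```python
# Hypothetical toy setup: scalar states, a linear policy pi_theta(s) = theta * s,
# and a hand-written critic Q(s, a) = -(a - 2 s)^2 that peaks at a = 2 s.
states = [0.5, 1.0, -1.5, 2.0]  # stand-in for a minibatch of sampled states


def policy(theta, s):
    return theta * s


def critic(s, a):
    return -(a - 2.0 * s) ** 2


def dq_da(s, a):
    """Partial derivative of the toy critic with respect to the action."""
    return -2.0 * (a - 2.0 * s)


def actor_loss(theta):
    """Mean of -Q(s, pi_theta(s)) over the batch."""
    return -sum(critic(s, policy(theta, s)) for s in states) / len(states)


def chain_rule_grad(theta):
    """-mean[dQ/da at a = pi_theta(s), times d pi_theta(s) / d theta]."""
    # For the linear policy, d pi_theta(s) / d theta = s.
    return -sum(dq_da(s, policy(theta, s)) * s for s in states) / len(states)


def finite_diff_grad(theta, eps=1e-6):
    """Central-difference estimate of d actor_loss / d theta."""
    return (actor_loss(theta + eps) - actor_loss(theta - eps)) / (2.0 * eps)


theta = 0.0
grad_analytic = chain_rule_grad(theta)
grad_numeric = finite_diff_grad(theta)
print("chain-rule gradient:       ", grad_analytic)
print("finite-difference gradient:", grad_numeric)

# Plain gradient descent on the mean of -Q moves theta towards the value
# whose actions maximise the critic (theta = 2 for this toy Q).
for _ in range(50):
    theta -= 0.1 * chain_rule_grad(theta)
print("theta after 50 steps:", theta)
```

In an actual implementation an automatic differentiation library would carry out the chain rule, so you would only write down the mean of $-Q(s, \pi_\theta(s))$ and backpropagate through the critic into the actor.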

I realise I have changed notation from the paper you are reading, so here $\pi$ is our policy as opposed to $\mu$, and where I have used $\mu$ I take this to be the state distribution function.

Answered by David Ireland on November 17, 2021
