Asked by Dhanush Giriyan on November 17, 2021
I am trying to implement the DDPG algorithm based on this paper.
The part that confuses me is the actor network’s update.
I don’t understand why the policy loss is simply the mean of $-Q(s, \mu(s))$, where $Q$ is the critic network and $\mu$ is the policy network.
How does one arrive at this?
This is not quite the loss that is stated in the paper.
For standard policy gradient methods the objective is to maximise $v_{\pi_\theta}(s_0)$; note that this is equivalent to minimising $-v_{\pi_\theta}(s_0)$. This is for a stochastic policy. In DDPG the policy is now assumed to be deterministic.
In general, we can write $$v_\pi(s) = \mathbb{E}_{a\sim\pi}[Q(s,a)];$$ to see this, note that $$Q(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a],$$ so if we took the expectation over this with respect to the distribution of $a$ we would get $$\mathbb{E}_{a\sim\pi}\big[\mathbb{E}[G_t \mid S_t = s, A_t = a]\big] = \mathbb{E}[G_t \mid S_t = s] = v_\pi(s).$$
However, if our policy is deterministic then $\pi(\cdot \mid s)$ is a point mass (a distribution which has probability 1 for a specific point and 0 everywhere else) for a certain action, so $\mathbb{E}_{a\sim\pi}[Q(s,a)] = Q(s, a=\pi(s)) = v_\pi(s)$. Thus the objective is still to maximise $v_\pi(s)$; it is just that, now we know the policy is deterministic, we say we want to maximise $Q(s, a=\pi(s))$.
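To make the two expectations concrete, here is a minimal numerical sketch for a single state with three discrete actions; the Q-values and probabilities are made up purely for illustration. A deterministic policy corresponds to a point mass, i.e. all probability on one action, and the weighted average then collapses to a single Q-value.

```python
# Toy check: averaging Q(s, a) over the policy's action probabilities gives v(s);
# a point-mass (deterministic) policy collapses this average to Q(s, pi(s)).
import numpy as np

q_values = np.array([1.0, 3.0, -2.0])          # Q(s, a) for three discrete actions

stochastic_pi = np.array([0.2, 0.5, 0.3])      # pi(a|s) for a stochastic policy
v_stochastic = np.sum(stochastic_pi * q_values)    # E_{a~pi}[Q(s, a)] = v_pi(s)

deterministic_pi = np.array([0.0, 1.0, 0.0])   # point mass: pi(s) = second action
v_deterministic = np.sum(deterministic_pi * q_values)

print(v_stochastic)                    # 1.1
print(v_deterministic)                 # 3.0
print(v_deterministic == q_values[1])  # True: v_pi(s) = Q(s, a=pi(s))
```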
The policy gradient of this term was shown to be \begin{align} \nabla_\theta Q(s, a=\pi_\theta(s)) & \approx \mathbb{E}_{s \sim \mu}\left[\nabla_\theta Q(s, a=\pi_\theta(s))\right] \\ & = \mathbb{E}_{s\sim\mu}\left[\nabla_a Q(s, a=\pi_\theta(s)) \nabla_\theta \pi_\theta(s)\right], \end{align}
where, if we put a minus sign at the front of this term, we arrive at the loss from the paper. Intuitively this makes sense: you want to know how much the action-value function changes with respect to the parameters of the policy, but this would be difficult to calculate directly, so you use the chain rule to see how much the action-value function changes with $a$, and in turn how much $a$ (i.e. our policy) changes with the parameters of the policy.
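This is also why implementations write the actor loss as the minibatch mean of $-Q(s, \pi_\theta(s))$ and let automatic differentiation apply the chain rule. Below is a minimal PyTorch-style sketch; the network architectures, sizes and the sampled batch are placeholders of my own, not taken from the paper, and only the actor step is shown.

```python
# Sketch of the DDPG actor update: minimise -Q(s, pi_theta(s)), i.e. maximise Q.
import torch
import torch.nn as nn

state_dim, action_dim, batch_size = 3, 1, 32

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

states = torch.randn(batch_size, state_dim)   # stand-in for a replay-buffer sample

actions = actor(states)                                        # a = pi_theta(s)
actor_loss = -critic(torch.cat([states, actions], dim=1)).mean()  # mean of -Q(s, pi_theta(s))

actor_opt.zero_grad()
actor_loss.backward()   # autograd applies the chain rule: grad_a Q * grad_theta pi_theta
actor_opt.step()
```

In a full implementation the critic has its own optimiser, targets and update step; here the gradients that also accumulate in the critic's parameters are simply never used, since only the actor optimiser steps.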
I realise I have changed notation from the paper you are reading: here $\pi$ is our policy (as opposed to $\mu$), and where I have used $\mu$ I take this to be the state distribution.
Answered by David Ireland on November 17, 2021