# Why is the policy loss the mean of $-Q(s, \mu(s))$ in the DDPG algorithm?

Artificial Intelligence Asked by Dhanush Giriyan on November 17, 2021

I am trying to implement the DDPG algorithm based on this paper.

The part that confuses me is the actor network’s update.
I don’t understand why the policy loss is simply the mean of $$-Q(s, \mu(s))$$, where $$Q$$ is the critic network and $$\mu$$ is the policy network.
How does one arrive at this?

This is not quite the loss that is stated in the paper.

For standard policy gradient methods the objective is to maximise $$v_{\pi_\theta}(s_0)$$ -- note that this is equivalent to minimising $$-v_{\pi_\theta}(s_0)$$. This is for a stochastic policy. In DDPG the policy is now assumed to be deterministic.

In general, we can write $$v_\pi(s) = \mathbb{E}_{a\sim\pi}[Q(s,a)]\;;$$ to see this note that $$Q(s,a) = \mathbb{E}[G_t | S_t = s, A_t=a]\;;$$ so if we took expectation over this with respect to the distribution of $$a$$ we would get $$\mathbb{E}_{a\sim\pi}[\mathbb{E}[G_t|S_t=s, A_t=a]] = \mathbb{E}[G_t|S_t=s] = v_\pi(s)\;.$$

However, if our policy is deterministic then $$\pi(\cdot|s)$$ is a point mass (a distribution which has probability 1 for a specific point and 0 everywhere else) on a certain action, so $$\mathbb{E}_{a\sim\pi}[Q(s,a)] = Q(s,a=\pi(s)) = v_\pi(s)$$. Thus the objective is still to maximise $$v_\pi(s)$$; it is just that, now that we know the policy is deterministic, we say we want to maximise $$Q(s,a=\pi(s))$$.
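As a small numerical illustration of the point-mass argument, the sketch below uses made-up action values for a single state with three discrete actions (DDPG itself works with continuous actions, so this is only meant to show how the expectation collapses). The names `q_values`, `state_value` and `chosen_action` are hypothetical and not from the paper.

```python
# Hypothetical action values Q(s, a) for one fixed state s and three actions.
q_values = [1.0, 4.0, 2.5]


def state_value(policy_probs, q_values):
    """v_pi(s) = sum_a pi(a|s) * Q(s, a), the expectation of Q under the policy."""
    return sum(p * q for p, q in zip(policy_probs, q_values))


# A stochastic policy spreads its probability over the actions.
stochastic = [0.2, 0.5, 0.3]
v_stochastic = state_value(stochastic, q_values)

# A deterministic policy that always picks action 1 is a point mass on it.
chosen_action = 1
point_mass = [1.0 if a == chosen_action else 0.0 for a in range(len(q_values))]
v_deterministic = state_value(point_mass, q_values)

print("stochastic policy value:   ", v_stochastic)
print("deterministic policy value:", v_deterministic)
print("Q(s, pi(s)):               ", q_values[chosen_action])
```

With the point mass, the weighted sum keeps only the term for the chosen action, so the value of the state and the action value at $$a=\pi(s)$$ coincide.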

The policy gradient of this term was shown to be
\begin{align}
\nabla_\theta Q(s,a=\pi_\theta(s)) & \approx \mathbb{E}_{s \sim \mu}[\nabla_\theta Q(s,a=\pi_\theta(s))]\;; \\
& = \mathbb{E}_{s\sim\mu}[\nabla_a Q(s,a=\pi(s)) \nabla_\theta \pi_\theta(s)]\;;
\end{align}

where if we put a minus at the front of this term then we would arrive at the loss from the paper. Intuitively this makes sense: you want to know how much the action-value function changes with respect to the parameters of the policy, but this would be difficult to calculate directly, so you use the chain rule to see how much the action-value function changes with $$a$$ and, in turn, how much $$a$$ (i.e. our policy) changes with the parameters of the policy. In practice the expectation over states is estimated by averaging over a minibatch of states sampled from the replay buffer, which is why the loss is implemented as the mean of $$-Q(s, \pi_\theta(s))$$.
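To make the chain-rule step concrete, here is a minimal pure-Python sketch under strong simplifying assumptions: scalar states and actions, a linear policy $$\pi_\theta(s) = \theta s$$, and a fixed hand-written critic $$Q(s,a) = -(a - 2s)^2$$. None of this is from the paper, which uses neural networks for both and also trains the critic. The sketch only checks that the mean of $$-Q(s, \pi_\theta(s))$$ has the gradient given by the chain rule, and that descending it moves the policy towards the actions the critic prefers.

```python
# Toy setup (all assumptions): linear policy pi_theta(s) = theta * s and a
# fixed critic Q(s, a) = -(a - 2s)^2, whose best action in state s is a = 2s.

def policy(theta, s):
    return theta * s


def critic(s, a):
    return -(a - 2.0 * s) ** 2


def dq_da(s, a):
    # Partial derivative of the critic with respect to the action.
    return -2.0 * (a - 2.0 * s)


def actor_loss(theta, states):
    # Mean of -Q(s, pi_theta(s)) over the batch of states.
    return sum(-critic(s, policy(theta, s)) for s in states) / len(states)


def actor_loss_grad(theta, states):
    # Chain rule: d(-Q)/dtheta = -(dQ/da at a = pi_theta(s)) * dpi/dtheta,
    # and dpi/dtheta = s for this linear policy.
    return sum(-dq_da(s, policy(theta, s)) * s for s in states) / len(states)


states = [0.5, -1.0, 1.5, 2.0]  # stand-in for a minibatch of sampled states
theta = 0.0

# Compare the chain-rule gradient with a central finite difference.
eps = 1e-6
numeric_grad = (actor_loss(theta + eps, states)
                - actor_loss(theta - eps, states)) / (2 * eps)
analytic_grad = actor_loss_grad(theta, states)
print("chain-rule gradient:", analytic_grad, "finite difference:", numeric_grad)

# Plain gradient descent on the actor loss.
initial_loss = actor_loss(theta, states)
for _ in range(200):
    theta -= 0.05 * actor_loss_grad(theta, states)
final_loss = actor_loss(theta, states)
print("theta after training:", theta)
print("loss before:", initial_loss, "loss after:", final_loss)
```

In an automatic differentiation framework you would normally write only the equivalent of `actor_loss` and let the framework apply the chain rule through the critic into the policy parameters; the hand-written `actor_loss_grad` is spelled out here to show the two factors from the equation above.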

I realise I have changed notation from the paper you are reading, so here $$\pi$$ is our policy as opposed to $$\mu$$, and where I have used $$\mu$$ I take this to be the state distribution function.

Answered by David Ireland on November 17, 2021
