
Generation of 'new log probabilities' in continuous action space PPO

Asked by Gideon on February 5, 2021

I have a conceptual question for you all that hopefully I can convey clearly. I am building an RL agent in Keras using continuous PPO to control a laser attached to a pan/tilt turret for target tracking. My question is how the new policy gets updated. My current implementation is as follows:

  1. Make observation (distance from laser to target in pan and tilt)
  2. Pass observation to actor network which outputs a mean (std for now is fixed)
  3. Sample from a Gaussian with the mean output in step 2
  4. Apply the command and observe the reward (1/L2 distance to target)
  5. Collect N steps of experience, then compute the advantages and old log probabilities
  6. Train the actor and critic (a minimal sketch of this loop follows the list)
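
For concreteness, here is a minimal sketch of steps 2, 3, and 5 (the actor callable, the fixed std value, and all names are illustrative placeholders, not my exact code):

    import numpy as np

    STD = 0.1  # fixed standard deviation for now, as in step 2 (value is illustrative)

    def gaussian_log_prob(action, mean, std):
        # Joint log density of a diagonal Gaussian over the pan and tilt commands.
        per_dim = -0.5 * ((action - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)
        return per_dim.sum(axis=-1)

    def step(actor, observation):
        mean = np.asarray(actor(observation))                 # step 2: actor network outputs the mean
        action = np.random.normal(mean, STD)                  # step 3: sample the pan/tilt command
        old_log_prob = gaussian_log_prob(action, mean, STD)   # stored for the update in step 6
        return action, old_log_prob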

My question is this. I have my old log probabilities (the probabilities of the actions taken, given the means generated by the actor network), but I don't understand how the new probabilities are generated. At the start of the very first minibatch, my new policy is identical to my old policy because they are the same neural net. Given that in the model.fit function I am passing the same set of observations to generate ‘y_pred’ values, and I am passing the actual actions taken as my ‘y_true’ values, the new policy should generate exactly the same log probabilities as my old one. The only (slight) variation that makes the network update comes from the entropy bonus, and my ratio np.exp(new_log_probs - old_log_probs) is almost exactly 1 because the policies are the same.
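
To illustrate, this is roughly what I compute before the first update (reusing the gaussian_log_prob helper and actor from the sketch above; variable names are illustrative):

    # Before the first gradient step the same weights reproduce the same means,
    # so the ratio is 1 for every sample in the batch.
    new_means = actor(observations)                    # same weights that collected the data
    new_log_probs = gaussian_log_prob(actions, new_means, STD)
    ratio = np.exp(new_log_probs - old_log_probs)      # ~1.0 everywhere on the first pass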

Should I be using a pair of networks, similar to DDQN, so that there is some initial difference between the policy used to generate the data and the one used for training?

One Answer

The idea in PPO is that you want to reuse the same batch several times to update the current policy. However, you cannot update it mindlessly in regular actor-critic fashion, because the updated policy might stray too far from the policy that collected the data, which would make those samples (and their advantage estimates) unreliable.

This means you repeat your step 6 for a number of epochs on the same batch of trajectories. Usually the number of epochs is somewhere between 3 and 30, but it is a hyperparameter you need to tune. On the first pass, the old and the new policy are identical, so their ratio is 1. After the first update, the new probabilities change because the policy's weights have changed, while the ratio is still evaluated against the old probabilities, so it moves away from 1. The old probabilities stay fixed throughout these epochs, whereas the new probabilities keep changing; a sketch is given below.
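
Concretely, here is a minimal sketch of that loop, assuming a Keras actor that outputs the Gaussian mean, a fixed standard deviation, and advantages already computed in your step 5 (all names, shapes, and hyperparameter values are illustrative, not a reference implementation):

    import numpy as np
    import tensorflow as tf

    EPOCHS, CLIP_EPS, STD = 10, 0.2, 0.1             # illustrative hyperparameters
    optimizer = tf.keras.optimizers.Adam(3e-4)

    # Placeholder batch and a tiny actor, only so the sketch runs end to end.
    observations = tf.random.normal((64, 2))          # pan/tilt errors from step 1
    actions = tf.random.normal((64, 2))               # commands actually taken in step 4
    advantages = tf.random.normal((64, 1))            # computed in step 5
    actor = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="tanh"),
                                 tf.keras.layers.Dense(2)])

    def log_prob(acts, means, std):
        # Joint log density of a diagonal Gaussian over the two action dimensions.
        lp = -0.5 * tf.square((acts - means) / std) - np.log(std) - 0.5 * np.log(2 * np.pi)
        return tf.reduce_sum(lp, axis=-1, keepdims=True)

    # Old log probabilities are computed ONCE, before any update, then held fixed.
    old_log_probs = log_prob(actions, actor(observations), STD)

    for _ in range(EPOCHS):                           # reuse the same batch several times
        with tf.GradientTape() as tape:
            new_log_probs = log_prob(actions, actor(observations), STD)  # changes every epoch
            ratio = tf.exp(new_log_probs - old_log_probs)                # 1.0 only on the first pass
            clipped = tf.clip_by_value(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS)
            loss = -tf.reduce_mean(tf.minimum(ratio * advantages, clipped * advantages))
        grads = tape.gradient(loss, actor.trainable_variables)
        optimizer.apply_gradients(zip(grads, actor.trainable_variables))

Note that old_log_probs is never recomputed inside the loop; only new_log_probs changes as the weights are updated, which is exactly what moves the ratio away from 1 after the first epoch.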

Answered by Hai Nguyen on February 5, 2021
