
Is it common to have extreme policy probabilities?

Artificial Intelligence | Asked by curiouscat22 on August 24, 2021

I have implemented several policy gradient algorithms (REINFORCE, A2C, and PPO) and am finding that the resulting policy's action probability distributions can be rather extreme. As a note, I have based my implementations on OpenAI's baselines. I have been using neural networks as the function approximator, followed by a softmax layer. For example, with CartPole I end up with action distributions like $[1.0, 3 \times 10^{-17}]$. I could understand this for a single action, potentially, but entire trajectories end up having a probability of 1. I have been calculating the trajectory probability as $\prod_i \pi(a_i \mid s_i)$.

Varying the learning rate only changes how fast I arrive at this distribution; I have used learning rates in the range $[10^{-6}, 0.1]$. It seems to me that a trajectory's probability should never be consistently 1.0 or 0.0, especially with a stochastic start. This also occurs for environments like LunarLander.
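For reference, a minimal sketch (not my actual implementation; the helper name and the episode length of 200 steps are illustrative assumptions) of how I compute the trajectory probability, and of why it collapses to 1.0 once the softmax outputs become near-deterministic:

```python
import numpy as np

# Compute a trajectory's probability under a policy by multiplying the
# per-step action probabilities, prod_i pi(a_i | s_i).  Working in log
# space avoids numerical underflow when individual probabilities are tiny.
def trajectory_probability(action_probs):
    """action_probs: list of pi(a_i | s_i) for the actions actually taken."""
    log_prob = np.sum(np.log(np.clip(action_probs, 1e-45, 1.0)))
    return np.exp(log_prob)

# A near-deterministic CartPole-style policy: every chosen action has
# probability ~1.0, so the whole trajectory's probability is also ~1.0.
near_deterministic = [1.0 - 3e-17] * 200
print(trajectory_probability(near_deterministic))   # ~1.0

# A more stochastic policy: even modest per-step randomness drives the
# product towards 0 over a long episode.
stochastic = [0.9] * 200
print(trajectory_probability(stochastic))           # ~7e-10
```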

For the most part, the resulting policies are near-optimal solutions that pass the criteria for solving the environments set by OpenAI. Some random seeds produce sub-optimal policies.

I have been trying to identify a bug in my code, but I'm not sure what kind of bug would show up across all three algorithms and across environments.

Is it common to have such extreme policy probabilities? Is there a common way to handle the update so that the policy's probabilities do not end up so extreme? Any insight would be greatly appreciated!

One Answer

Your policy gradient algorithms appear to be working as intended. All standard MDPs have one or more deterministic optimal solutions, and those are the policies that solvers will converge to. Making any of these policies more random will often reduce their effectiveness, making them sub-optimal. So, once consistently good actions are discovered, the learning process naturally reduces exploration as a consequence of the gradients, much like a softmax classifier trained on a clean dataset.

There are some situations where a stochastic policy can be optimal, and you could check whether your implementations can find those:

  • A partially observable MDP (POMDP) where one or more key states that require different optimal actions are indistinguishable to the agent. For example, the agent's observation could be the available exits from its current location while it tries to reach the end of a small maze, where one location secretly (i.e. without the agent having any information in the state representation that the location is different) reverses all directions, so that a deterministic agent cannot make progress along the corridor, while a random agent would eventually get through.

  • Opposing guessing games where a Nash equilibrium occurs for specific random policies. For example, the scissors, paper, stone game, where the optimal policy in self-play is to choose each option randomly with probability 1/3.

The first example is probably the easier one to set up as a toy environment to show that your implementations can find stochastic solutions when needed. A concrete example of that kind of environment is in Sutton & Barto, Reinforcement Learning: An Introduction, chapter 13, example 13.1 on page 323.
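As a rough illustration of that kind of environment, here is a minimal sketch of a short-corridor-style POMDP in the spirit of that example; the class name, the reward of -1 per step, and the P(right) = 0.6 rollout policy are my own illustrative assumptions, not an exact reproduction of the book's setup:

```python
import random

# Three non-terminal states share an identical observation, and in the
# middle state the two actions are secretly reversed, so a deterministic
# policy gets stuck while a suitably random one eventually reaches the goal.
class ShortCorridor:
    LEFT, RIGHT = 0, 1

    def reset(self):
        self.pos = 0
        return 0  # every state yields the same observation

    def step(self, action):
        if self.pos == 1:            # switched state: actions are reversed
            action = 1 - action
        if action == self.RIGHT:
            self.pos += 1
        else:
            self.pos = max(0, self.pos - 1)
        done = self.pos == 3         # state 3 is terminal
        return 0, -1.0, done, {}     # constant observation, -1 per step

# Rollout with a stochastic policy, P(right) = 0.6 (illustrative value).
# An always-right policy oscillates between the first two states forever,
# while this stochastic policy terminates with probability 1.
env = ShortCorridor()
env.reset()
steps, done = 0, False
while not done:
    a = env.RIGHT if random.random() < 0.6 else env.LEFT
    _, _, done, _ = env.step(a)
    steps += 1
print("episode length:", steps)
```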

Setting up opposing agents in self-play is harder, but if you can get it to work and discover the Nash equilibrium point for the policies, it would be further proof that you have got something right.
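If it helps when checking the self-play setup, here is a small sketch (the payoff matrix ordering and variable names are my own assumptions) of the equilibrium property you would be looking for: the uniform 1/3 policy in scissors, paper, stone has zero expected payoff against any opponent policy, so no best response can exploit it.

```python
import numpy as np

# Payoff from player 1's perspective: rows are player 1's action,
# columns are player 2's, ordering (rock, paper, scissors).
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

uniform = np.full(3, 1.0 / 3.0)

# At the Nash equilibrium the uniform policy is unexploitable: its expected
# payoff is 0 against *any* opponent policy.
for opponent in [uniform, np.array([1.0, 0.0, 0.0]), np.array([0.2, 0.5, 0.3])]:
    value = uniform @ PAYOFF @ opponent
    print(opponent, "->", round(float(value), 6))   # always ~0.0
```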

Correct answer by Neil Slater on August 24, 2021
