How does Generalized Policy Iteration stabilize to the optimal policy and value function?

Cross Validated, asked on December 6, 2021

I've seen this question answered here, Why does the policy iteration algorithm converge to optimal policy and value function?, and here, The proof for policy iteration algorithm's optimality, but I'm still unable to connect the dots, and I'm not even sure my question is related. I'm hoping a slightly less mathematical, more intuitive explanation is possible.

I feel I'm making a fundamental mistake at the start of my journey into RL. Now, on to my question.

Following Sutton and Barto's Reinforcement Learning: An Introduction, 2nd edition, Chapter 4, Generalized Policy Iteration (GPI) is explained:

The value function stabilizes only when it is consistent with the
current policy, and the policy stabilizes only when it is greedy with
respect to the current value function. Thus, both processes stabilize
only when a policy has been found that is greedy with respect to its
own evaluation function. This implies that the Bellman optimality
equation (4.1) holds, and thus that the policy and the value function
are optimal.

Furthermore, the Bellman optimality equation seems to support this, since the equation explicitly ranges over all states. I can accept the argument in the case where all states are visited on each iteration.
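For reference, the Bellman optimality equation (4.1) in the book's notation requires, for every state $s$,

$$ v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr], $$

so the maximization is explicitly taken at every state, not just at the states a particular trajectory happens to visit.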

However, when I apply GPI via Q-learning to my own toy problems, or even to examples from the book, I cannot see how an optimal policy is reached even though both stabilization conditions of GPI are satisfied.

Take for example this toy problem:

[5 0 X 0 0 20]

where a reward of 5 sits at the extreme left, a reward of 20 at the extreme right, and every other location gives a reward of 0. The episode terminates at either end. X marks the starting location, and the agent can move either left or right.

Let's say the agent starts off moving left under a greedy policy with no exploration term (since the expected value is 0 for both left and right, it simply breaks the tie). After 3 iterations of moving left, it has updated its Q-table with values under which this greedy policy will constantly move left. The following are eventually observed:

1) The value function stabilizes with respect to this greedy policy (i.e., the values are no longer updated).

2) The policy stabilizes with respect to the value function (i.e., the agent always greedily moves left, never anything else).

As far as I understand GPI, this is the definition of an optimal policy. Yet, as we can see, moving to the right would provide far better expected value and is the truly optimal solution. In fact, if the updates were swept over all states, the agent would learn to always move to the right (see the sketch below).
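To make the scenario concrete, here is roughly what I'm running (a minimal sketch of my own; the hyperparameters and the tie-breaking rule are just illustrative):

# Tabular Q-learning on the six-cell corridor with purely greedy action
# selection and ties broken toward "left" (an illustrative sketch).
rewards = [5, 0, 0, 0, 0, 20]            # reward received on entering each cell
terminal = {0, 5}                        # both ends terminate the episode
start, alpha, gamma = 2, 0.5, 1.0

Q = [[0.0, 0.0] for _ in range(6)]       # Q[s][a]: a = 0 is left, a = 1 is right

for episode in range(100):
    s = start
    while s not in terminal:
        a = 0 if Q[s][0] >= Q[s][1] else 1        # greedy; ties go left
        s2 = s - 1 if a == 0 else s + 1
        r = rewards[s2]
        target = r if s2 in terminal else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

# Q[s][1] stays 0 for every state: the agent never once tries "right",
# so both the values and the greedy policy freeze on "always go left".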

So how come the two conditions of GPI seem to hold even though this policy isn't optimal? I can't see how the agent is ever given a chance to back up values over all the states, and yet it seemingly matches the definition of an optimal policy.

If someone could explain it with as little math as necessary it would be greatly appreciated, thank you.

One Answer

You've probably figured this out by now, but I'll add an answer here for completeness (and for anyone else who gets confused).

I haven't read the book, but it seems to me that you're confusing a modelled environment (the usual term is "model-based") with a model-free environment.

In a model-based setting, you don't need to explore because you already have the dynamics of the system. You simply apply the algorithm to EVERY state against EVERY action.

In your case, let's say you start with the arbitrary initial policy of going left everywhere.

It looks like this:

s1, s2, s3, s4, s5, s6

5, <-, <-, <-, <-, 20

First, work out the values under this policy; let's make gamma 1 because it's way too early in the morning. You get:

s1, s2, s3, s4, s5, s6

5, 5, 5, 5, 5, 20

Now apply policy improvement. In this iteration the only interesting thing happens at s5: the guy sitting at s5 looks to the left, looks to the right, and decides right is better (20 over 5, or, officially, the "max over all actions"), so the policy improves there. All other states stay the same for now (they see 5 on either side). The new policy becomes:

s1, s2, s3, s4, s5, s6

5, <-, <-, <-, ->, 20

Iteration 2: evaluate the values under the new policy

s1, s2, s3, s4, s5, s6

5, 5, 5, 5, 20, 20

Iteration 2: policy improvement (s4 changes his mind and goes right, everyone else stays the same). New policy now becomes:

s1, s2, s3, s4, s5, s6

5, <-, <-, ->, ->, 20

And so on, until everyone eventually goes to the right and the policy stabilises. If you now drop a guy at s3, he will go right.
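If you want to see that sweep as code, here's a rough sketch (my own made-up version, not from the book: deterministic moves, gamma 1, and the two end cells pinned at their rewards of 5 and 20):

# Policy iteration with the full model of the corridor (illustrative sketch).
LEFT, RIGHT = -1, +1
V = {1: 5.0, 6: 20.0}                      # terminal values at s1 and s6
policy = {s: LEFT for s in range(2, 6)}    # initial policy: go left everywhere

for sweep in range(10):
    # Policy evaluation: V(s) = V(successor under the current policy), gamma = 1.
    for _ in range(10):
        for s in range(2, 6):
            V[s] = V.get(s + policy[s], 0.0)
    # Policy improvement: each state picks the better-looking neighbour
    # (ties stay with LEFT, just like the intermediate states above).
    new_policy = {s: max((LEFT, RIGHT), key=lambda a: V.get(s + a, 0.0))
                  for s in range(2, 6)}
    if new_policy == policy:
        break                              # policy is stable, so we're done
    policy = new_policy

print(policy)   # {2: 1, 3: 1, 4: 1, 5: 1}: every interior state ends up going RIGHT

Each outer sweep flips exactly one more state to the right, because a state only changes its mind once the value of 20 has propagated to the cell next door, exactly as in the walk-through above.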

In a model-free environment, which is pretty much every real-world case, you don't get the model and you have to explore. Here you're right: you need to introduce some kind of exploration term so that the agent tries new options every now and then. In your example the agent only sees a black tunnel to his left and a black tunnel to his right; he tries left and gets 5. He keeps on going left, and if he's purely greedy he'll just go left forever, but if you introduce a chance for him to explore, he'll eventually travel down the right tunnel a few times to see what's going on, and he will figure out that the right side is better.
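Something like this rough sketch (made-up hyperparameters; it's the same Q-learning loop as in the question, just with an epsilon-greedy choice on top):

import random

# Model-free Q-learning on the corridor with epsilon-greedy exploration
# (an illustrative sketch; the hyperparameters are made up).
rewards = [5, 0, 0, 0, 0, 20]
terminal = {0, 5}
start, alpha, gamma, epsilon = 2, 0.5, 1.0, 0.2

Q = [[0.0, 0.0] for _ in range(6)]        # Q[s][a]: a = 0 is left, a = 1 is right

for episode in range(10000):
    s = start
    while s not in terminal:
        if random.random() < epsilon:
            a = random.randint(0, 1)                  # explore: try a random tunnel
        else:
            a = 0 if Q[s][0] >= Q[s][1] else 1        # exploit: be greedy
        s2 = s - 1 if a == 0 else s + 1
        r = rewards[s2]
        target = r if s2 in terminal else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

# The occasional trips down the right-hand tunnel let the 20 propagate back,
# so every interior state ends up preferring right.
print(['<' if q[0] >= q[1] else '>' for q in Q[1:5]])    # typically ['>', '>', '>', '>']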

Answered by lzl on December 6, 2021
