
Is my pseudocode titled "Monte Carlo Exploring Starts (with model)" correct?


From Reinforcement Learning: An Introduction, second edition, by Richard S. Sutton and Andrew G. Barto:

  1. We made two unlikely assumptions above in order to easily obtain this guarantee of convergence for the Monte Carlo method. … For now we focus on the assumption that policy evaluation operates on an infinite number of episodes. This assumption is relatively easy to remove. In fact, the same issue arises even in classical DP methods such as iterative policy evaluation, which also converge only asymptotically to the true value function.

  2. There is a second approach to avoiding the infinite number of episodes nominally required for policy evaluation, in which we give up trying to complete policy evaluation before returning to policy improvement. On each evaluation step we move the value function toward $q_{\pi_k}$, but we do not expect to actually get close except over many steps. We used this idea when we first introduced the idea of GPI in Section 4.6. One extreme form of the idea is value iteration, in which only one iteration of iterative policy evaluation is performed between each step of policy improvement. The in-place version of value iteration is even more extreme; there we alternate between improvement and evaluation steps for single states.
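
If I understand the second excerpt correctly, its extreme form is ordinary value iteration, whose combined evaluation/improvement update (in the book's notation) is

$\quad$ $v_{k+1}(s) = \max_{a} \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r + \gamma v_{k}\left(s^{\prime}\right)\right]$, for all $s \in \mathcal{S}$.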

The original pseudocode:

Monte Carlo ES (Exploring Starts), for estimating $\pi \approx \pi_{*}$

Initialize:

$\quad$ $\pi(s) \in \mathcal{A}(s)$ (arbitrarily), for all $s \in \mathcal{S}$

$\quad$ $Q(s, a) \in \mathbb{R}$ (arbitrarily), for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$

$\quad$ $Returns(s, a) \leftarrow$ empty list, for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$

Loop forever (for each episode):

$\quad$ Choose $S_{0} \in \mathcal{S}, A_{0} \in \mathcal{A}\left(S_{0}\right)$ randomly such that all pairs have probability $> 0$

$\quad$ Generate an episode from $S_{0}, A_{0}$, following $\pi$: $S_{0}, A_{0}, R_{1}, \ldots, S_{T-1}, A_{T-1}, R_{T}$

$\quad$ $G \leftarrow 0$

$\quad$ Loop for each step of episode, $t = T-1, T-2, \ldots, 0$:

$\quad\quad$ $G \leftarrow \gamma G + R_{t+1}$

$\quad\quad$ Unless the pair $S_{t}, A_{t}$ appears in $S_{0}, A_{0}, S_{1}, A_{1}, \ldots, S_{t-1}, A_{t-1}$:

$\quad\quad\quad$ Append $G$ to $Returns\left(S_{t}, A_{t}\right)$

$\quad\quad\quad$ $Q\left(S_{t}, A_{t}\right) \leftarrow \text{average}\left(Returns\left(S_{t}, A_{t}\right)\right)$

$\quad\quad\quad$ $\pi\left(S_{t}\right) \leftarrow \arg\max_{a} Q\left(S_{t}, a\right)$
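
For concreteness, here is how I would translate this model-free pseudocode into a minimal Python sketch (the `states` list, `actions(s)`, and `generate_episode(pi, s0, a0)` interfaces are my own assumptions, not anything from the book):

```python
import random
from collections import defaultdict


def mc_exploring_starts(states, actions, generate_episode,
                        num_episodes=10_000, gamma=1.0):
    """First-visit Monte Carlo ES (model-free), following the pseudocode above.

    Assumed interfaces (not from the book):
      states                       -- list of all states
      actions(s)                   -- list of actions available in state s
      generate_episode(pi, s0, a0) -- list of (S_t, A_t, R_{t+1}) triples for one
                                      episode starting with (s0, a0), then following pi
    """
    pi = {s: random.choice(actions(s)) for s in states}   # arbitrary initial policy
    Q = defaultdict(float)                                # arbitrary initial Q(s, a)
    returns = defaultdict(list)                           # Returns(s, a)

    for _ in range(num_episodes):
        # Exploring starts: every (state, action) pair has probability > 0.
        s0 = random.choice(states)
        a0 = random.choice(actions(s0))
        episode = generate_episode(pi, s0, a0)

        G = 0.0
        for t in range(len(episode) - 1, -1, -1):         # t = T-1, ..., 0
            s_t, a_t, r_next = episode[t]                 # r_next is R_{t+1}
            G = gamma * G + r_next
            # First-visit check: skip if (S_t, A_t) occurs earlier in the episode.
            if any((s, a) == (s_t, a_t) for s, a, _ in episode[:t]):
                continue
            returns[(s_t, a_t)].append(G)
            Q[(s_t, a_t)] = sum(returns[(s_t, a_t)]) / len(returns[(s_t, a_t)])
            pi[s_t] = max(actions(s_t), key=lambda a: Q[(s_t, a)])

    return pi, Q
```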

I want to make the same algorithm but with a model. The book states:

  With a model, state values alone are sufficient to determine a policy; one simply looks ahead one step and chooses whichever action leads to the best combination of reward and next state, as we did in the chapter on DP.

So, based on the first quote, I must use the "exploring starts" and "one evaluation step per improvement step" ideas (as in the model-free version) to make the algorithm converge.

My version of the pseudocode:

Monte Carlo ES (Exploring Starts), for estimating $\pi \approx \pi_{*}$ (with model)

Initialize:

$\quad$ $\pi(s) \in \mathcal{A}(s)$ (arbitrarily), for all $s \in \mathcal{S}$

$\quad$ $V(s) \in \mathbb{R}$ (arbitrarily), for all $s \in \mathcal{S}$

$\quad$ $Returns(s) \leftarrow$ empty list, for all $s \in \mathcal{S}$

Loop forever (for each episode):

$\quad$ Choose $S_{0} \in \mathcal{S}, A_{0} \in \mathcal{A}\left(S_{0}\right)$ randomly such that all pairs have probability $> 0$

$\quad$ Generate an episode from $S_{0}, A_{0}$, following $\pi$: $S_{0}, A_{0}, R_{1}, \ldots, S_{T-1}, A_{T-1}, R_{T}$

$\quad$ $G \leftarrow 0$

$\quad$ Loop for each step of episode, $t = T-1, T-2, \ldots, 1$:

$\quad\quad$ $G \leftarrow \gamma G + R_{t+1}$

$\quad\quad$ Unless $S_{t}$ appears in $S_{0}, S_{1}, \ldots, S_{t-1}$:

$\quad\quad\quad$ Append $G$ to $Returns\left(S_{t}\right)$

$\quad\quad\quad$ $V\left(S_{t}\right) \leftarrow \text{average}\left(Returns\left(S_{t}\right)\right)$

$\quad\quad\quad$ $\pi\left(S_{t-1}\right) \leftarrow \operatorname{argmax}_{a} \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid S_{t-1}, a\right)\left[\gamma V\left(s^{\prime}\right) + r\right]$

Here I update the policy at $S_{t-1}$ because the preceding step updates $V(S_{t})$, and a change to $V(S_{t})$ does not affect $\pi(S_{t})$ but does affect $\pi(S_{t-1})$, since $S_{t}$ is one of the possible successor states $s^{\prime}$ of $S_{t-1}$.
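
A matching Python sketch of my model-based variant, again with assumed interfaces; here the model is a hypothetical `model(s, a)` function returning `(probability, next_state, reward)` triples for $p(s^{\prime}, r \mid s, a)$:

```python
import random
from collections import defaultdict


def mc_exploring_starts_with_model(states, actions, model, generate_episode,
                                   num_episodes=10_000, gamma=1.0):
    """Sketch of my model-based variant (not the book's algorithm).

    Assumed interfaces (not from the book):
      model(s, a)                  -- list of (prob, s_next, reward) triples, i.e. p(s', r | s, a)
      generate_episode(pi, s0, a0) -- list of (S_t, A_t, R_{t+1}) triples, as in the
                                      model-free sketch above
    """
    pi = {s: random.choice(actions(s)) for s in states}   # arbitrary initial policy
    V = defaultdict(float)                                # arbitrary initial V(s)
    returns = defaultdict(list)                           # Returns(s)

    def greedy(s):
        # One-step lookahead through the model:
        # argmax_a sum_{s', r} p(s', r | s, a) * [r + gamma * V(s')]
        return max(actions(s),
                   key=lambda a: sum(p * (r + gamma * V[s_next])
                                     for p, s_next, r in model(s, a)))

    for _ in range(num_episodes):
        # Exploring starts: every (state, action) pair has probability > 0.
        s0 = random.choice(states)
        a0 = random.choice(actions(s0))
        episode = generate_episode(pi, s0, a0)

        G = 0.0
        for t in range(len(episode) - 1, 0, -1):          # t = T-1, ..., 1
            s_t, _, r_next = episode[t]                   # r_next is R_{t+1}
            G = gamma * G + r_next
            # First-visit check on states only.
            if any(s == s_t for s, _, _ in episode[:t]):
                continue
            returns[s_t].append(G)
            V[s_t] = sum(returns[s_t]) / len(returns[s_t])
            s_prev = episode[t - 1][0]                    # S_{t-1}
            pi[s_prev] = greedy(s_prev)                   # improve pi at S_{t-1}

    return pi, V
```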

