What is the expectation of an empirical model in model-based RL?

Artificial Intelligence Asked by ijuneja on November 4, 2021

In the paper "Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems", on page 1083, on the 6th line from the bottom, the authors define the expectation under the empirical model as
$$\hat{\mathbb{E}}_{s,s',a}[V(s')] = \sum_{s' \in S} \hat{P}^{a}_{s, s'}V(s').$$
I don't understand the significance of this quantity, since it puts $V(s')$ inside an expectation while the definition on the right assumes that $V(s')$ is already known.

A clarification in this regard would be appreciated.

The paper defines $\hat{P}^{a}_{s, s'}$ as
$$\hat{P}^{a}_{s, s'} = \frac{|(s, a, s', t)|}{|(s, a, t)|},$$
where $|(s, a, t)|$ is the number of times action $a$ was taken in state $s$ during model learning, and $|(s, a, s', t)|$ is the number of those times in which the next state was $s'$.
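To make the counting definition concrete, here is a minimal sketch of how such an estimate could be computed from logged transitions. The function name `estimate_transition_model`, the `(s, a, s_next)` tuple format, and the sample data are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def estimate_transition_model(transitions):
    """Estimate P_hat[(s, a, s_next)] from an iterable of (s, a, s_next) samples.

    Hypothetical helper for illustration; the data format is an assumption.
    """
    sa_counts = Counter()   # |(s, a, t)|: times action a was taken in state s
    sas_counts = Counter()  # |(s, a, s', t)|: times that choice led to s'
    for s, a, s_next in transitions:
        sa_counts[(s, a)] += 1
        sas_counts[(s, a, s_next)] += 1
    # Ratio of the two counts, exactly as in the definition above.
    return {
        (s, a, s_next): n / sa_counts[(s, a)]
        for (s, a, s_next), n in sas_counts.items()
    }


# Made-up samples: action "a" taken four times in state "s0".
samples = [
    ("s0", "a", "s1"),
    ("s0", "a", "s1"),
    ("s0", "a", "s2"),
    ("s0", "a", "s1"),
]
p_hat = estimate_transition_model(samples)
print(p_hat[("s0", "a", "s1")])  # fraction of the four visits that led to "s1"
```

Next states that were never observed after `(s, a)` are simply absent from the dictionary, which corresponds to an estimated probability of zero.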

No explicit definition of $V$ is provided; however, $V^{\pi}$ is defined as the usual expected discounted return, using the same definition as Sutton and Barto or other sources.

One Answer

If I understand your question correctly, the quantity is significant because $s'$ is random. The right-hand side of the equation assumes that $V(\cdot)$ is known for each state, but the next state itself is not known in advance. The quantity therefore measures the expected value of the next state given the current state and action, with the expectation taken under the estimated transition probabilities $\hat{P}^{a}_{s, s'}$.
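As a small numerical illustration of this point, the sketch below treats $V$ as a known lookup table and averages it over the random next state. The function name `empirical_expectation` and all numbers are made up for illustration; they do not come from the paper.

```python
def empirical_expectation(p_hat_row, value):
    """Return sum over s' of P_hat(s' | s, a) * V(s') for one fixed (s, a).

    p_hat_row: dict mapping next state -> estimated probability (assumed given).
    value:     dict mapping state -> V(state), assumed known for every state.
    """
    return sum(p * value[s_next] for s_next, p in p_hat_row.items())


# Hypothetical numbers: from (s, a) the model estimates two possible next states.
p_hat_row = {"s1": 0.75, "s2": 0.25}
value = {"s1": 2.0, "s2": 10.0}

expected_next_value = empirical_expectation(p_hat_row, value)
print(expected_next_value)  # 0.75 * 2.0 + 0.25 * 10.0
```

Each $V(s')$ is a fixed number here; the only randomness being averaged out is which $s'$ follows $(s, a)$.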

Answered by harwiltz on November 4, 2021
