# What is the expectation of an empirical model in model based RL?

Artificial Intelligence Asked by ijuneja on November 4, 2021

In the paper "Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems", on page 1083, on the 6th line from the bottom, the authors define the expectation of the empirical model as
$$\hat{\mathbb{E}}_{s,s',a}[V(s')] = \sum_{s' \in S} \hat{P}^{a}_{s, s'}V(s').$$
I didn’t understand the significance of this quantity, since it puts $$V(s')$$ inside an expectation while the definition on the right-hand side assumes that $$V(s')$$ is already known.

A clarification in this regard would be appreciated.

EDIT:
The paper defines $$\hat{P}^{a}_{s, s'}$$ as
$$\hat{P}^{a}_{s, s'} = \frac{|(s, a, s', t)|}{|(s, a, t)|},$$
where $$|(s, a, t)|$$ is the number of times action $$a$$ was taken in state $$s$$, and $$|(s, a, s', t)|$$ is the number of those $$|(s, a, t)|$$ visits to $$(s, a)$$ in which the next state was $$s'$$, both counted during model learning.
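
To make the counting definition concrete, here is a minimal Python sketch. It is not code from the paper: the function name `empirical_transition_model`, the `(s, a, s_next)` tuple format, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def empirical_transition_model(transitions):
    """Estimate P_hat[(s, a)][s_next] from observed (s, a, s_next) samples."""
    pair_counts = Counter()    # plays the role of |(s, a, t)|
    triple_counts = Counter()  # plays the role of |(s, a, s', t)|
    for s, a, s_next in transitions:
        pair_counts[(s, a)] += 1
        triple_counts[(s, a, s_next)] += 1

    p_hat = {}
    for (s, a, s_next), n in triple_counts.items():
        # Relative frequency of landing in s_next after taking a in s.
        p_hat.setdefault((s, a), {})[s_next] = n / pair_counts[(s, a)]
    return p_hat


# Hypothetical data: action "left" taken 4 times in state 0.
samples = [(0, "left", 1), (0, "left", 1), (0, "left", 2), (0, "left", 1)]
p_hat = empirical_transition_model(samples)
print(p_hat[(0, "left")])
```

In this made-up example, state 1 followed three of the four visits and state 2 followed one, so the estimated probabilities are 3/4 and 1/4.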

No explicit definition of $$V$$ is provided; however, $$V^{\pi}$$ is defined as the usual expected discounted return, as in Sutton and Barto or other sources.

If I understand your question correctly, the significance comes from the fact that $$s'$$ is random. The RHS of the equation does assume that $$V(\cdot)$$ is known for each state, but the quantity measures the expected value of $$V$$ at the next state, given the current state and action, with the next state drawn according to the empirical transition probabilities.
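
A small numeric sketch may help here. The numbers and the helper name `empirical_expectation` are made up for illustration and are not from the paper. The function $$V$$ is a fixed, known table, while the next state is the random part, so the expectation is just a probability-weighted average of the known values.

```python
def empirical_expectation(p_hat_row, V):
    """Return sum over s' of P_hat(s' | s, a) * V(s') for one fixed (s, a)."""
    return sum(p * V[s_next] for s_next, p in p_hat_row.items())


# Hypothetical estimated next-state distribution for one (s, a) pair.
p_hat_row = {1: 0.75, 2: 0.25}
# Hypothetical known value table, one entry per state.
V = {0: 0.0, 1: 4.0, 2: 8.0}

expected_next_value = empirical_expectation(p_hat_row, V)
print(expected_next_value)  # 0.75 * 4.0 + 0.25 * 8.0 = 5.0
```

Knowing $$V(1) = 4$$ and $$V(2) = 8$$ does not tell you which next state will occur, so the single number 5.0 summarizes the value you can expect after taking the action, according to the learned model.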

Answered by harwiltz on November 4, 2021

