Identity between TD(0) algorithm and Policy Evaluation in Dynamic Programming when alpha is equal to 1

Data Science Asked on February 26, 2021

The TD(0) algorithm is defined by the following iterative update:

$$ V(s) \leftarrow V(s) + \alpha \left( r + \gamma V(s') - V(s) \right) $$

Now, if we assume $\alpha$ is equal to 1, we get the traditional policy evaluation formula from dynamic programming. Is this correct?
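
As a concrete illustration, here is a minimal sketch of that update in Python (the tabular value function `V` and the transition names `(s, r, s_next)` are hypothetical, just to make the formula explicit):

```python
# A minimal sketch of the TD(0) update, assuming a tabular value function V
# stored as a dict; state and reward names are illustrative only.
def td0_update(V, s, r, s_next, alpha, gamma):
    """Apply one TD(0) update after observing the transition (s, r, s_next)."""
    V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])

# With alpha = 1 the update reduces to V[s] = r + gamma * V[s_next],
# i.e. the previous estimate V[s] is replaced entirely by the sampled target.
```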

2 Answers

$\alpha$ is independent of the type of RL algorithm. It is the learning rate, i.e. the rate at which you update a state value. You can set it to 1 or less.

Policy evaluation is a 'general principle'; temporal difference is one way to make it work. More precisely, TD determines how far into the future you take the consequences of an action into account. In your equation, $\gamma$ determines how strongly that future is weighted.

Answered by Dany Yatim on February 26, 2021

No. Dynamic programming estimates a state's value by looking at all possible next states (a full backup), whereas TD(0) looks at only a single sampled next state.

Answered by Brian Spiering on February 26, 2021
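
To make that distinction concrete, here is a minimal sketch on a small hypothetical tabular MDP (the states, transition probabilities, and rewards below are made up for illustration): dynamic-programming policy evaluation backs up over all successors, while TD(0) backs up from one sampled successor, even with $\alpha = 1$.

```python
import random

# Hypothetical model under a fixed policy: P[s] maps each successor s' to
# (transition probability, reward). All numbers are made up for illustration.
P = {
    "A": {"B": (0.7, 1.0), "C": (0.3, 0.0)},
    "B": {"A": (1.0, 0.5)},
    "C": {"A": (1.0, 2.0)},
}
gamma = 0.9
V = {s: 0.0 for s in P}

def dp_backup(V, s):
    """Dynamic-programming policy evaluation: full backup over ALL successors."""
    return sum(p * (r + gamma * V[s_next]) for s_next, (p, r) in P[s].items())

def td0_backup(V, s, alpha=1.0):
    """TD(0): backup from a SINGLE successor sampled from the model."""
    successors = list(P[s].items())
    weights = [p for _, (p, _) in successors]
    s_next, (_, r) = random.choices(successors, weights=weights, k=1)[0]
    return V[s] + alpha * (r + gamma * V[s_next] - V[s])

# Even with alpha = 1, td0_backup returns r + gamma * V[s_next] for one sampled
# s_next, whereas dp_backup averages over every possible s_next.
```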
