
Does gradient descent work for tabular Q learning?

Asked on Cross Validated, December 6, 2021

Suppose I have a tabular Q learning problem such as grid-world.

Let our loss be defined as,

$$\hat{L}(Q)=0.5\left(Q(s,a)-(r+\gamma\max_{a'}{Q(s',a')})\right)^2$$

Then $Q_{k+1}(s,a) = Q_k(s,a) - \eta \nabla \hat{L}(Q) = Q_k(s,a) - \eta\left(Q_k(s,a) - (r+\gamma\max_{a'}{Q_k(s',a')})\right)$, which is just Q-learning.

So, does a gradient descent approach make sense if we take our loss function to be the squared difference between the current Q value and the TD target?

One Answer

Yes, it is possible; you are close, but not quite there.

You lost a gradient in your equation; it should be: $$Q_{k+1}(s,a) = Q_k(s,a) - \eta \left(Q(s,a)-(r+\gamma\max_{a'}{Q(s',a')})\right)\left(\left.\frac{d~Q}{d~\theta}\right|_{(s,a)} - \gamma \left.\frac{d~\max_{a'}Q}{d~\theta}\right|_{(s')} \right)$$

This simplifies a bit in the case of a tabular representation:

$$Q_{k+1}(s,a) = Q_k(s,a) - \eta \left(Q(s,a)-(r+\gamma\max_{a'}{Q(s',a')})\right)\left(1 - \gamma \left.\frac{d~\max_{a'}Q}{d~\theta}\right|_{(s')} \right)$$
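
To make those derivatives concrete, here is a short worked step under the assumption that the parameters $\theta$ are simply the table entries, i.e. $Q(s,a;\theta)=\theta_{s,a}$:

$$\left.\frac{d~Q}{d~\theta_{s,a}}\right|_{(s,a)} = 1, \qquad \left.\frac{d~\max_{a'}Q}{d~\theta_{s,a}}\right|_{(s')} = \begin{cases}1 & \text{if } s'=s \text{ and } \arg\max_{a'}Q(s',a') = a,\\ 0 & \text{otherwise,}\end{cases}$$

which is why the first factor becomes $1$ above, and why the second factor only matters when the updated entry is also the one achieving the maximum at $s'$.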

Problems may arise if $s=s'$ and $a=a'$, because your update will be $0$ (which it shouldn't be). It's also not a good idea to try to differentiate the $\max$ function.

You can do the "double deep Q-learning trick" and introduce $\theta_\textrm{old}$ to estimate $Q(s',a')$, i.e., use the Q-table from the previous step. This makes the other gradient disappear, and you are indeed left with Q-learning:

$$Q_{k+1}(s,a) = Q_k(s,a) - \eta \left(Q(s,a,\theta)-(r+\gamma\max_{a'}{Q(s',a',\theta_\textrm{old})})\right)$$

In this case, the loss will be

$$\hat{L}(\theta)= \frac12 \left(Q(s,a,\theta)-(r+\gamma \max_{a'}Q(s',a',\theta_\textrm{old}))\right)^2$$
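
As a minimal sketch of that last update in code (assuming a small grid-world where states and actions are integer indices; the function name and constants are illustrative, not from any particular library):

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, eta=0.1, gamma=0.99):
    """One tabular Q-learning update, viewed as a gradient step on
    0.5 * (Q[s, a] - target)**2 with the target held fixed (theta_old)."""
    Q_old = Q.copy()                             # frozen copy playing the role of theta_old
    target = r + gamma * np.max(Q_old[s_next])   # r + gamma * max_a' Q(s', a'; theta_old)
    td_error = Q[s, a] - target                  # = dL/dQ[s, a], since the target is a constant
    Q[s, a] -= eta * td_error                    # gradient descent touches only this table entry
    return Q

# Example: 4 states, 2 actions; only entry (0, 1) changes.
Q = np.zeros((4, 2))
Q = q_learning_step(Q, s=0, a=1, r=1.0, s_next=2)
print(Q)
```

Because the target uses the frozen copy, the only nonzero gradient is with respect to $Q(s,a)$, and the step reduces to the familiar $Q(s,a) \leftarrow Q(s,a) + \eta\,(r+\gamma\max_{a'}Q(s',a') - Q(s,a))$.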

Answered by FirefoxMetzger on December 6, 2021
