TransWikia.com

Is the Bellman equation that uses sampling weighted by the Q values (instead of max) a contraction?

Artificial Intelligence Asked by sirfroggy on December 16, 2021

It is proved that the Bellman update is a contraction (1).

Here is the Bellman update that is used for Q-Learning:

$$Q_{t+1}(s, a) = Q_{t}(s, a) + alpha*(r(s, a, s’) + gamma max_{a^*} (Q_{t}(s’,
a^*)) – Q_t(s,a)) tag{1} label{1}$$

The proof of (ref{1}) being contraction comes from one of the facts (the relevant one for the question) that max operation is non expansive; that is:

$$lvert max_a f(a)- max_a g(a) rvert leq max_a lvert f(a) – g(a) rvert tag{2}label{2}$$

This is also proved in a lot of places and it is pretty intuitive.

Consider the following Bellman update:

$$ Q_{t+1}(s, a) = Q_{t}(s, a) + alpha*(r(s, a, s’) + gamma SAMPLE_{a^*} (Q_{t}(s’, a^*)) – Q_t(s,a)) tag{3}label{3}$$

where $SAMPLE_a(Q(s, a))$ samples an action with respect to the Q values (weighted by their Q values) of each action in that state.

Is this new Bellman operation still a contraction?

Is the SAMPLE operation non-expansive? It is, of course, possible to generate samples that will not satisfy equation (ref{2}). I ask is it non-expansive in expectation?

My approach is:

$$lvert,mathbb{E}_{a sim Q}[f(a)] – mathbb{E}_{a sim Q}[g(a)], rvert leq ,,mathbb{E}_{a sim Q}lvert,,[f(a) – g(a)],,rvert tag{4} label{4} $$

Equivalently:

$$lvert,mathbb{E}_{a sim Q}[f(a) – g(a)] , rvert leq ,,mathbb{E}_{a sim Q}lvert,,[f(a) – g(a)],,rvert$$

(ref{4}) is true since:

$$lvert,mathbb{E}[X] , rvert leq ,,mathbb{E} ,,lvert,,[X],,rvert $$

But, I am not sure if proving (ref{4}) proves the theorem. Do you think that this is a legit proof that (ref{3}) is a contraction.

(If so; this would mean that stochastic policy q learning theoretically converges and we can have stochastic policies with regular q learning; and this is why I am interested.)

Both intuitive answers and mathematical proofs are welcome.

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP