
Understanding distributional Temporal Difference Learning

Cross Validated Asked on November 26, 2021

I am trying to understand a recently published DeepMind paper. In the supplementary information of the paper (accessible here), the authors explain distributional temporal difference learning.

In normal temporal difference learning, one tries to estimate the value function $v_{\pi}(x)$, where $\pi$ is a policy and $x$ an environmental state, using an estimator $V(x)$ with the following update rule

$$V(x) \leftarrow V(x) + \alpha \delta$$
where $\alpha$ is a learning rate and $$\delta = r + \gamma V(x') - V(x)$$ is the reward prediction error for the transition from state $x$ to $x'$ with immediate reward $r$ ($\gamma$ is just the discount factor).
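As a concrete illustration of the update above, here is a minimal tabular TD(0) step (my own sketch, not from the paper; the function name and the example values are hypothetical):

```python
def td_update(V, x, r, x_next, alpha=0.1, gamma=0.99):
    """One tabular TD(0) update on the value table V (list or array).

    Computes the reward prediction error
        delta = r + gamma * V(x') - V(x)
    and moves V(x) a fraction alpha toward the bootstrapped target.
    """
    delta = r + gamma * V[x_next] - V[x]  # reward prediction error
    V[x] = V[x] + alpha * delta           # V(x) <- V(x) + alpha * delta
    return delta
```

For example, with `V = [0.0, 1.0]`, `alpha=0.5`, `gamma=0.9`, a transition from state 0 to state 1 with reward 1 gives `delta = 1 + 0.9*1 - 0 = 1.9` and updates `V[0]` to `0.95`.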

However, in distributional temporal difference learning, things are a bit different. In the paper they say the following:

"In this method, instead of a single value function, a set of value functions is learned. For each value function $V_i$, a distinct reward prediction error $\delta_i$ is computed: $$\delta_i = r + \gamma V_j(x') - V_i(x)$$
where $\mathbf{V_j(x')}$ is a sample from the distribution $\mathbf{V(x')}$."

I have a hard time understanding what is meant by the last line, highlighted in bold. Where does $V(x')$ come from? How is it a distribution, how do we know it, and how can we sample from it? What is meant by this?

I hope that I have been more or less clear; if you have any further questions, please let me know and I will edit this post.

Any help is welcome!
Thanks

One Answer

This is a great question. The authors are opaque on this point in most of their papers, but they address it fully in Rowland et al. (2019), "Statistics and Samples in Distributional Reinforcement Learning".

In order for the quantile code to converge, the agent needs to do the following steps every time the RPE $delta_i$ is computed:

  1. Impute a distribution consistent with the current set of quantiles $\{V_i\}_{i=1,\dots,N}$ estimated for state $x'$.
  2. Sample a quantile $V_j(x')$ from that distribution.

Step 1 basically means finding a distribution that is consistent with the current set of estimates (quantiles, expectiles, etc), and step 2 is sampling an estimate (quantile, expectile) from that distribution.

In the case in which $V_i$ are quantiles, a sample from the imputed distribution can be approximated by simply sampling a quantile $V_j(x')$. In the case that $V_i$ are expectiles, however, the imputation step cannot be sidestepped in the same way (see Rowland et al 2019 for more details).
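For the quantile case, the shortcut described above can be sketched as follows. This is my own illustration under simplifying assumptions: tabular states, a uniform draw over the stored quantiles in place of the imputation step, and a quantile-regression-style asymmetric update (a common choice in this literature, not spelled out in the excerpt quoted in the question):

```python
import numpy as np

rng = np.random.default_rng(0)

def distributional_td_update(V, x, r, x_next, taus, alpha=0.1, gamma=0.99):
    """One distributional TD step over N quantile estimates.

    V    : array of shape (N, num_states); V[i, s] is the i-th quantile
           estimate of the return distribution at state s
    taus : quantile levels tau_i in (0, 1), one per estimate

    For quantiles, "impute a distribution and sample from it" can be
    approximated by sampling one of the stored quantiles V_j(x') uniformly.
    """
    N = len(taus)
    for i in range(N):
        j = rng.integers(N)                          # sample V_j(x') uniformly
        delta_i = r + gamma * V[j, x_next] - V[i, x]  # distinct RPE per V_i
        # Asymmetric step: quantile tau_i moves up on positive errors
        # with weight tau_i, down on negative errors with weight 1 - tau_i.
        V[i, x] += alpha * (taus[i] - (delta_i < 0))
    return V
```

Note that each $V_i$ sees a different sampled bootstrap target $V_j(x')$, which is exactly what makes the set of estimates spread out to cover the return distribution rather than collapse to its mean.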

Answered by statguy789 on November 26, 2021
