
Deep Q-learning: how to set the q-values of non-selected actions?

Asked by Rasoul on August 21, 2021

I am learning Deep Q-learning by applying it to a real-world problem. I have been through some tutorials and papers available online, but I couldn't figure out a solution to the following problem.

Let's say we have $N$ possible actions to select from in each state. When in state $s$ we make a move by selecting an action $a_i,\ i=1,\dots,N$; as a result we get a reward $r$ and end up in a new state $s'$. In order to update the neural network with this experience $(s, a_i, r, s')$, the only ground-truth q-value we have is for action $a_i$. In other words, we have no ground-truth q-values for any of the other possible actions ($a_j,\ j=1,\dots,N,\ j \neq i$). How, then, should we feed this training sample to the neural network?

The following are the possibilities I was considering:

  1. Set the other q-values to "don't care". In this scenario we would not update the weights of the last hidden layer that connect to the output nodes of the $a_j$'s. However, due to the interconnections in the earlier layers, any weight update will still affect the q-values of the $a_j$'s.

  2. Set the other q-values to the values currently predicted by the neural network. This makes the error for those q-values zero, but as in the solution above, the change in the weights will eventually affect the $a_j$'s q-values. (A sketch of this approach is given after the list.)

  3. Use one neural network per possible action. This makes perfect sense to me as a solution, but Deep Q-learning uses a single network to predict the q-values of all possible actions (apart from keeping two copies of it, one for the policy and one as the target).
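For concreteness, here is a minimal sketch of option 2 in PyTorch (the framework, state dimension, layer sizes, and learning rate are my own assumptions for illustration; the post does not specify any of them). The target vector starts as a copy of the network's own predictions, so the error on every non-selected action is zero by construction, and only the taken action's entry is overwritten with the TD target:

```python
import torch
import torch.nn as nn

# Hypothetical setup: state_dim, N, hidden size, and lr are assumptions.
state_dim, N, gamma = 4, 3, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, N))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def update(s, a, r, s_next, done):
    """One training step in the spirit of option 2: targets equal the
    current predictions everywhere except the selected action."""
    q_pred = q_net(s)                                  # shape (batch, N)
    with torch.no_grad():
        q_target = q_pred.clone()                      # zero error for a_j, j != i
        td_target = r + gamma * q_net(s_next).max(dim=1).values * (1 - done)
        q_target[torch.arange(len(a)), a] = td_target  # overwrite taken action only
    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```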

I would highly appreciate it if someone with experience and knowledge of this could help me understand it.

One Answer

Consider these two depictions of a simple Deep Q Learning Algorithm:

[Two images: pseudocode listings of a simple Deep Q-Learning algorithm, each highlighting the gradient update step]

Look at the gradient update steps in both images: we only take the gradient with respect to the Q value of the one action, $a_i$. The network outputs $N$ nodes, but since we only care about one of them, you can see that all of the computations still work out: the ground truths of the other output nodes are never used in this gradient update step.
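Written out (assuming the standard DQN loss with a target network with parameters $\theta^-$), the per-sample loss is

$$L(\theta) = \Big( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a_i; \theta) \Big)^2,$$

which contains the network's output only at the index of the taken action $a_i$; the other $N-1$ output nodes never appear in it.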

I hope this helps.

Edit: looking at the first solution you listed, it is correct, but I feel you aren't seeing the actual computation of the gradient. If you work it out, the ground-truth values of the other nodes are never used, because the loss function is written only in terms of the $i$th node.
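To make this concrete, here is a minimal sketch of that loss in PyTorch (my choice of framework; the images above show it as pseudocode). `gather` selects only the taken action's output node, so the gradient flows through that one node alone and no ground truth is ever needed for the other $N-1$ actions:

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    """Standard DQN loss, written only in terms of Q(s, a_i)."""
    q_taken = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a_i)
    with torch.no_grad():
        td_target = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    return nn.functional.mse_loss(q_taken, td_target)
```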

Answered by R12568asdb on August 21, 2021
