How does the target network in double DQNs find the maximum Q value for each action?

Artificial Intelligence Asked on November 7, 2021

I understand that the neural network takes the state as input and outputs a Q-value for each action. However, in order to compute the training target and update its weights, we need the maximum Q-value for the next state $$s'$$. To get that, in the DDQN case, we feed that next state $$s'$$ into the target network.

What I’m not clear on is: how do we train this target network itself, which in turn helps us train the other network? What is its cost function?

Both in DQN and in DDQN, the target network starts as an exact copy of the Q-network, with the same weights, layers, and input and output dimensions as the Q-network.

The main idea of the DQN agent is that the Q-network predicts the Q-values of all actions from a given state, selects the maximum of them, and uses the mean squared error (MSE) as its cost/loss function. That is, it performs gradient descent steps on

$$\left(Y_{t}^{\mathrm{DQN}} - Q\left(s_t, a_t; \boldsymbol{\theta}\right)\right)^2,$$

where the target $$Y_{t}^{\mathrm{DQN}}$$ is defined (in the case of DQN) as

$$Y_{t}^{\mathrm{DQN}} \equiv R_{t+1} + \gamma \max_{a} Q\left(S_{t+1}, a; \boldsymbol{\theta}_{t}^{-}\right)$$

Here $$\boldsymbol{\theta}$$ are the Q-network weights and $$\boldsymbol{\theta}^{-}$$ are the target network weights.

After a (usually fixed) number of timesteps, the target network updates its weights by copying the weights of the Q-network. So, basically, the target network never performs a gradient-descent/backpropagation step of its own and thus has no cost function: it only provides forward passes that define the targets for the Q-network.
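The mechanics above can be sketched in a few lines. This is a minimal toy illustration, not a real implementation: the "networks" here are just hypothetical per-action weight lists scored linearly by a made-up `q_values` helper, standing in for actual neural networks.

```python
import random

random.seed(0)
n_actions = 3
gamma = 0.99

# Toy stand-ins for the two networks: a "network" is just a list of
# per-action weights; real code would use a neural network here.
theta = [random.uniform(-1, 1) for _ in range(n_actions)]  # online Q-network
theta_target = list(theta)  # target network starts as an exact copy

def q_values(state, weights):
    # Hypothetical linear scoring: Q(s, a) = s * w_a
    return [state * w for w in weights]

# One transition (s, a, r, s')
s, a, r, s_next = 0.7, 1, 0.5, -0.2

# DQN target: r + gamma * max_a Q(s', a; theta^-),
# computed with the *target* network's weights.
y_dqn = r + gamma * max(q_values(s_next, theta_target))

# Squared error on the online network's prediction for the taken action;
# gradient descent would update theta only -- theta_target is never trained.
loss = (y_dqn - q_values(s, theta)[a]) ** 2

# Every C timesteps: hard update, copying the online weights into the target.
theta_target = list(theta)
```

Note that the only "update" the target network ever receives is the final copy step; it never sees the loss.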

In the case of DDQN, the target is defined as

$$Y_{t}^{\text{DDQN}} \equiv R_{t+1} + \gamma\, Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}}\, Q\left(S_{t+1}, a; \boldsymbol{\theta}_{t}\right); \boldsymbol{\theta}_{t}^{-}\right)$$

This target decouples the selection of the action (i.e. the argmax part) from its evaluation (i.e. the computation of the Q-value at the next state for the selected action), as stated in the paper that introduced DDQN:

> The max operator in standard Q-learning and DQN, in (2) and (3), uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can decouple the selection from the evaluation.
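The decoupling can be made concrete with the same toy setup: the online weights pick the action, the (stale) target weights score it. As before, the per-action weight lists and linear scoring are illustrative assumptions, not the answer's actual networks.

```python
import random

random.seed(1)
n_actions = 3
gamma, r = 0.99, 1.0
s_next = 0.8

# Hypothetical per-action weights for the online and (stale) target networks
theta = [random.uniform(-1, 1) for _ in range(n_actions)]
theta_minus = [random.uniform(-1, 1) for _ in range(n_actions)]

q_online = [s_next * w for w in theta]        # used only to SELECT the action
q_target = [s_next * w for w in theta_minus]  # used only to EVALUATE it

# DDQN: argmax_a Q(s', a; theta), then evaluate that action with theta^-
a_star = max(range(n_actions), key=lambda a: q_online[a])
y_ddqn = r + gamma * q_target[a_star]

# Plain DQN would instead take the max over the target net's own values:
y_dqn = r + gamma * max(q_target)

# Since q_target[a_star] <= max(q_target), the DDQN target never exceeds
# the DQN target -- this is what mitigates the overestimation bias.
```

The last comparison is the whole point: whenever the online network's favorite action is not also the target network's highest-valued one, the DDQN target is strictly smaller than the DQN target.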

Answered by ddaedalus on November 7, 2021
