# Why are Target Networks used in Deep Q-Learning as opposed to the Expected Value equation?

Artificial Intelligence Asked by TMT on September 10, 2020

I understand we use a target network because it helps resolve issues regarding stability, however, that’s not what I’m here to ask.

What I would like to understand is why a target network is used as a measure of ground truth as opposed to the expectation equation.

To clarify, here is what I mean. This is the process used for DQN:

1. In DQN, we begin with a state $$S$$
2. We then pass this state through a neural network which outputs Q values for each action in the action space
3. A policy e.g. epsilon-greedy is used to take an action
4. This subsequently produces the next state $$S_{t+1}$$
5. $$S_{t+1}$$ is then passed through a target neural network to produce target Q values
6. These target Q values are plugged into the Bellman equation, which produces the target Q value used in the Q-learning update
7. The MSE between the target from step 6 and the prediction from step 2 gives the loss
8. This is then back-propagated to update the parameters for the neural network in 2
9. The target neural network has its parameters updated every X epochs to match the parameters in 2
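The steps above can be sketched numerically. This is a minimal, illustrative stand-in: the "networks" are just linear parameter matrices (not real deep networks), and all names and values are assumptions for the sake of the example.

```python
import numpy as np

# Hypothetical linear "networks": online and target parameter matrices
# mapping a state vector to one Q value per action (stand-ins for the
# deep networks in steps 2 and 5).
rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
online_params = rng.normal(size=(n_features, n_actions))
target_params = online_params.copy()  # step 9: a periodically synced copy

gamma = 0.99

def q_values(params, state):
    """Q value per action for a single state vector."""
    return state @ params

def td_target(reward, next_state, done):
    """Step 6: r + gamma * max_a Q_target(s', a); no bootstrap at terminal."""
    if done:
        return reward
    return reward + gamma * np.max(q_values(target_params, next_state))

state = rng.normal(size=n_features)
next_state = rng.normal(size=n_features)
target = td_target(reward=1.0, next_state=next_state, done=False)

# Step 7: squared error between the target and the online network's
# prediction for the action actually taken.
action = 0
loss = (target - q_values(online_params, state)[action]) ** 2
```

The key point is that `target_params` is held fixed while `online_params` is trained, and only synced every X updates.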

Why do we use a target neural network to output Q values instead of estimating them statistically? Statistics seems like a more accurate way to represent this. By statistics, I mean the following:

Q values are the expected return, given the state and action under policy π.

$$Q(S_{t+1},a) = V^{\pi}(S_{t+1}) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid S_{t+1} \right] = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, S_{t+1} \right]$$

We can then take the above and inject it into the Bellman equation to update our target Q value:

$$Q(S_{t},a_t) \leftarrow Q(S_{t},a_t) + \alpha \left( r_t + \gamma \max_a Q(S_{t+1},a) - Q(S_{t},a_t) \right)$$

So, why don’t we set the target to the sampled sum of discounted rewards? Surely a target network is very inaccurate, especially since its parameters are completely random in the first few epochs.
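The alternative being proposed here, a Monte Carlo target, amounts to a plain discounted sum over a completed episode. A minimal sketch (the reward values are made up for illustration):

```python
def monte_carlo_return(rewards, gamma=0.99):
    """Discounted return G_t = sum_k gamma^k * r_{t+k+1},
    accumulated backwards over a completed episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Note: this requires the WHOLE episode before any target exists,
# unlike the one-step bootstrapped target.
episode_rewards = [1.0, 0.0, 2.0]
target = monte_carlo_return(episode_rewards, gamma=0.5)
# 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
```

The answer below explains why this is usually less sample-efficient in practice despite being unbiased.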

Why are Target Networks used in Deep Q-Learning as opposed to the Expected Value equation?

In short, because for many problems, this learns more efficiently.

It is the difference between Monte Carlo (MC) methods and Temporal Difference (TD) learning.

You can use MC estimates for expected return in deep RL. They are slower for two reasons:

• It takes far more experience to collect enough data to train a neural network, because sampling a full return requires a whole episode. You cannot use just one episode at a time, because that presents the neural network with correlated data; you would need to collect multiple episodes to fill a large experience table.

• As an aside, you would also need to discard all the experience after each update, because sampled full returns are on-policy data. Alternatively, you could implement importance sampling for off-policy Monte Carlo control and re-calculate the correct updates as the policy improves, which adds complexity.
• Samples of full returns have a higher variance, so the sampled data is noisier.

In comparison, TD learning starts with biased samples. This bias reduces over time as estimates become better, but it is the reason why a target network is used (otherwise the bias would cause runaway feedback).

So you have a bias/variance trade-off, with TD representing high bias and MC representing high variance.

It is not clear theoretically which is better in general, because it depends on the nature of MDPs that you are solving with each method. In practice, on the types of problems Deep RL has been tried on, single-step TD learning appears to do better than MC sampling of returns, in terms of goals such as sample efficiency and learning time.

You can compromise between TD and MC using eligibility traces, resulting in TD($$\lambda$$). However, this is awkward to implement in Deep RL due to the experience replay table. A simpler compromise is to use $$n$$-step returns, e.g. $$r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 \max_a Q(s_{t+4},a)$$, which was one of the refinements used in the "Rainbow" DQN paper. Note that, strictly speaking, this handles off-policy data incorrectly (it should use importance sampling, but the authors did not bother), yet it still worked well enough for low $$n$$ on the Atari problems.
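The $$n$$-step compromise can be sketched as follows. This is an illustrative helper, not the Rainbow implementation; the reward list and bootstrap value are assumptions for the example:

```python
def n_step_target(rewards, bootstrap_q, gamma=0.99):
    """n-step return: sum_{k=0}^{n-1} gamma^k * r_{t+k+1},
    plus gamma^n * max_a Q(s_{t+n+1}, a) as the bootstrap term."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * bootstrap_q

# 3-step example matching the formula above, with an assumed
# bootstrap value max_a Q(s_{t+4}, a) = 2.0:
target = n_step_target([1.0, 0.0, 1.0], bootstrap_q=2.0, gamma=0.5)
# 1.0 + 0.25*1.0 + 0.125*2.0 = 1.5
```

With `n = 1` this reduces to the standard one-step DQN target; larger `n` shifts the estimate toward the Monte Carlo end of the trade-off.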

Answered by Neil Slater on September 10, 2020
