Artificial Intelligence Asked by TMT on September 10, 2020
I understand we use a target network because it helps resolve issues regarding stability, however, that’s not what I’m here to ask.
What I would like to understand is why a target network is used as a measure of ground truth as opposed to the expectation equation.
To clarify, here is what I mean, in terms of the process used for DQN.
Why do we use a target neural network to output Q values instead of using statistics? Statistics seem like a more accurate way to represent this. By statistics, I mean the following:
Q values are the expected return, given the state and action under policy π.
$Q(S_{t+1},a) = V^\pi(S_{t+1}) = \mathbb{E}(r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid S_{t+1}) = \mathbb{E}\left(\sum_k \gamma^k r_{t+k+1} \mid S_{t+1}\right)$
We can then take the above and inject it into the Bellman equation to update our target Q value:
$Q(S_{t},a_t) \leftarrow Q(S_{t},a_t) + \alpha \left( r_t + \gamma \max_a Q(S_{t+1},a) - Q(S_{t},a_t) \right)$
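In DQN, the bootstrap term $\max_a Q(S_{t+1},a)$ in this update is produced by the frozen target network. As a minimal sketch (assuming `q_target` maps a state to an array of per-action Q values; all names are illustrative, not from any particular implementation):

```python
import numpy as np

def td_target(reward, next_state, done, q_target, gamma=0.99):
    """One-step DQN target: r_t + gamma * max_a Q_target(S_{t+1}, a).

    `q_target` is assumed to map a state to an array of Q values, one per action;
    at the end of an episode the target is just the final reward.
    """
    if done:
        return reward
    return reward + gamma * float(np.max(q_target(next_state)))

# The online network is then regressed towards this target for the action taken,
# e.g. loss = (td_target(...) - Q_online(s_t, a_t)) ** 2
```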
So, why don't we set the target to the sampled sum of discounted rewards? Surely a target network is very inaccurate, especially since its parameters are completely random in the first few epochs.
Why are Target Networks used in Deep Q-Learning as opposed to the Expected Value equation?
In short, because for many problems, this learns more efficiently.
It is the difference between Monte Carlo (MC) methods and Temporal Difference (TD) learning.
You can use MC estimates for expected return in deep RL. They are slower for two reasons:
1. It takes far more experience to collect enough data to train a neural network, because to fully sample a return you need a whole episode. You cannot just use one episode at a time, because that presents the neural network with correlated data; you would need to collect multiple episodes and fill a large experience table (see the sketch after this list).
2. Samples of full returns have higher variance, so the sampled data is noisier.
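For illustration, a minimal sketch of the Monte Carlo target the question proposes, assuming an episode is stored as a plain list of rewards (names are illustrative):

```python
def mc_returns(rewards, gamma=0.99):
    """Discounted return G_t = sum_k gamma^k * r_{t+k+1} for every step of one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Example: a 4-step episode with a single terminal reward. Every target depends on
# rewards up to the terminal step, so no transition can be used until the episode ends.
print(mc_returns([0.0, 0.0, 0.0, 1.0]))  # [0.970299, 0.9801, 0.99, 1.0]
```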
In comparison, TD learning starts with biased samples. This bias reduces over time as the estimates become better, but it is also the reason a target network is used (otherwise the bias would cause runaway feedback).
So you have a bias/variance trade-off, with TD representing high bias and MC representing high variance.
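To make the target-network point concrete: the bootstrap value is computed from a frozen copy of the weights that is refreshed only occasionally, so the bias cannot feed back into itself at every gradient step. A rough sketch, with the sync interval and parameter containers as illustrative assumptions:

```python
TARGET_SYNC_INTERVAL = 1000  # illustrative: training steps between refreshes

def maybe_sync_target(step, online_params, target_params):
    """Hard-copy the online parameters into the frozen target copy every N steps."""
    if step % TARGET_SYNC_INTERVAL == 0:
        target_params.clear()
        target_params.update(online_params)  # e.g. dicts mapping layer name -> weights
    return target_params
```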
It is not clear theoretically which is better in general, because it depends on the nature of MDPs that you are solving with each method. In practice, on the types of problems Deep RL has been tried on, single-step TD learning appears to do better than MC sampling of returns, in terms of goals such as sample efficiency and learning time.
You can compromise between TD and MC using eligibility traces, resulting in TD($\lambda$). However, this is awkward to implement in Deep RL due to the experience replay table. A simpler compromise is to use $n$-step returns, e.g. $r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 \max_a Q(s_{t+4},a)$, which was one of the refinements used in the "Rainbow" DQN paper. Note that, strictly speaking, their version handles the off-policy correction incorrectly (it should use importance sampling, but they did not bother); it still worked well enough for low $n$ on the Atari problems.
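A minimal sketch of such an $n$-step target, again with `q_target` standing in for the frozen target network and all names illustrative:

```python
import numpy as np

def n_step_target(rewards, bootstrap_state, done, q_target, gamma=0.99):
    """Target r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n * max_a Q(s_{t+n+1}, a).

    `rewards` holds the n rewards taken from the replay buffer; the bootstrap term
    is dropped if the episode terminated within those n steps.
    """
    target = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if not done:
        target += (gamma ** len(rewards)) * float(np.max(q_target(bootstrap_state)))
    return target
```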
Answered by Neil Slater on September 10, 2020