Understanding the loss function in deep Q-learning

Question

I am trying to understand how deep Q learning (DQN) works. To my current understanding, each $Q(s, a)$ functions is estimated to be a function of a feature vector of its state $phi$(s) and the weight of the network $theta$.
The loss function to minimise is $||delta_{t+1}||^2$ where $delta_{t+1}$ is shown below. The loss function is from the website talking about function approximation. Even though it is not explicitly deep Q learning, the loss function to minimise is similar.
$$delta_{mathrm{t}+1}=mathrm{R}_{mathrm{t}+1}+max _{mathrm{a}inmathrm{A}} boldsymbol{theta}^{top} Phileft(mathrm{s}_{t+1}, mathrm{a}right)-boldsymbol{theta}^{top} Phileft(mathrm{s}_{mathrm{t}}, mathrm{a}right)$$
Source: https://towardsdatascience.com/function-approximation-in-reinforcement-learning-85a4864d566.
Intuitively, I am not able to understand why the loss function is defined as such. Once the network converges to a $theta$ using gradient descent, does that mean that the $Q_{max}(s,a)$ is found?
In essence, I am not able to grasp intuitively how the neural network is able to generalise the learning to unseen states.
The algorithm I am looking at to help me understand the deep Q networks is below.

Source: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

harwiltz · Answer

Especially in continuous space, convergence of the value function is mainly a theoretical property. Without seeing enough of the state space, as you suggest, there's no way to ensure that your Q function will generalize to the whole state space. Convergence results for Q learning with function approximation generally show that in the limit of infinite data, your value function will converge to the desired fixed point -- note that this is only true when your agent explores occasionally, for an infinite amount of time.
When your parameters have converged, this simply means that your Q function has fit the data you've collected. As you explore more, your agent may get "surprised" and your parameters may start to change again.
Also, convergence of the parameters in function approximation can never guarantee that an optimal value function was found in practice -- the only guarantee you can wish for is that the optimal value function that can be produced with your model has been found. For instance, the parameters of the linear Q function you posted can converge, even if the optimal Q function is not linear.

d56 · Answer

Well, you want your network to have a good prediction powers for the Q-values. So you compare Q-value at time t with the reward that you've got at time t after having executed action a + the prediction of the best Q-value of your neural network at time t+1. Note, that you are optimizing using a prediction and not a true value. That is called bootstrapping, look up TD-learning to have a better grasp of the concept.

Answered by d56 on November 4, 2021

Understanding the loss function in deep Q-learning

2 Answers

Add your own answers!

Ask a Question