# Understanding the loss function in deep Q-learning

Artificial Intelligence Asked on November 4, 2021

I am trying to understand how deep Q-learning (DQN) works. To my current understanding, $$Q(s, a)$$ is estimated as a function of a feature vector of the state, $$\phi(s)$$, and the weights of the network, $$\theta$$.

The loss function to minimise is $$\|\delta_{t+1}\|^2$$, where $$\delta_{t+1}$$ is shown below. The loss function comes from a website about function approximation. Even though it is not explicitly about deep Q-learning, the loss function to minimise is similar.

$$\delta_{t+1} = R_{t+1} + \max_{a \in A} \boldsymbol{\theta}^{\top} \Phi\left(s_{t+1}, a\right) - \boldsymbol{\theta}^{\top} \Phi\left(s_{t}, a\right)$$

Intuitively, I am not able to understand why the loss function is defined this way. Once the network converges to a $$\theta$$ using gradient descent, does that mean that $$Q_{\max}(s,a)$$ has been found?

In essence, I am not able to grasp intuitively how the neural network is able to generalise the learning to unseen states.

The algorithm I am looking at to help me understand the deep Q networks is below.

Especially in continuous space, convergence of the value function is mainly a theoretical property. Without seeing enough of the state space, as you suggest, there's no way to ensure that your Q function will generalize to the whole state space. Convergence results for Q learning with function approximation generally show that in the limit of infinite data, your value function will converge to the desired fixed point -- note that this is only true when your agent explores occasionally, for an infinite amount of time.
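One common way to make an agent "explore occasionally" is epsilon-greedy action selection. The sketch below is only an illustration of that idea, not something taken from the question's algorithm; the function name `epsilon_greedy` and the example Q-values are made up for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values: list of estimated Q(s, a), one entry per action (hypothetical input).
    """
    if rng.random() < epsilon:
        # Explore: any action, uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
q_estimates = [0.1, 0.7, 0.3]
greedy_action = epsilon_greedy(q_estimates, epsilon=0.0, rng=rng)   # always exploits
random_action = epsilon_greedy(q_estimates, epsilon=1.0, rng=rng)   # always explores
```

With `epsilon` kept above zero, every action keeps a nonzero chance of being tried in every state the agent visits, which is the kind of continued exploration the convergence results rely on.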

When your parameters have converged, this simply means that your Q function has fit the data you've collected. As you explore more, your agent may get "surprised" and your parameters may start to change again.

Also, convergence of the parameters in function approximation does not guarantee that an optimal value function was found -- the most you can hope for is that the best value function your model is able to represent has been found. For instance, the parameters of the linear Q function you posted can converge, even if the optimal Q function is not linear.
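A toy illustration of that last point, assuming a made-up "true" value of $$s^2$$ on four states and a linear model $$w s + b$$. This is plain regression by gradient descent rather than full Q-learning, and all names here are hypothetical:

```python
# Toy data: four states with a nonlinear "true" value s^2 (assumed for illustration).
states = [0.0, 1.0, 2.0, 3.0]
targets = [s ** 2 for s in states]

def fit_linear(states, targets, lr=0.1, steps=5000):
    """Fit value(s) ~ w * s + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(states)
    for _ in range(steps):
        errors = [w * s + b - y for s, y in zip(states, targets)]
        grad_w = sum(e * s for e, s in zip(errors, states)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit_linear(states, targets)
mse = sum((w * s + b - y) ** 2 for s, y in zip(states, targets)) / len(states)
print(round(w, 4), round(b, 4), round(mse, 4))
```

The parameters settle at about `w = 3`, `b = -1` and stop moving, yet the mean squared error stays at about 1: the fit has converged to the best line, not to the true function.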

Answered by harwiltz on November 4, 2021

Well, you want your network to have good predictive power for the Q-values. So you compare the Q-value at time t with the reward you got after executing action a at time t, plus your network's prediction of the best Q-value at time t+1. Note that you are optimizing against a prediction and not a true value. That is called bootstrapping; look up TD-learning to get a better grasp of the concept.
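As a rough sketch of that comparison for the linear Q function in the question: the code below computes the TD error and takes one semi-gradient step. The function names and numbers are made up, the second term is evaluated at the action actually taken, and `gamma` is an added assumption (the posted equation has no discount, which corresponds to `gamma=1.0`).

```python
def q_value(theta, phi):
    """Linear Q estimate: theta^T Phi(s, a)."""
    return sum(w * x for w, x in zip(theta, phi))

def td_error(theta, reward, phi_taken, next_phis, gamma=1.0):
    """delta = R + gamma * max_a Q(s_next, a) - Q(s, a_taken)."""
    best_next = max(q_value(theta, phi) for phi in next_phis)
    return reward + gamma * best_next - q_value(theta, phi_taken)

def semi_gradient_step(theta, phi_taken, delta, lr):
    """Move Q(s, a_taken) toward the bootstrapped target.

    The target is treated as a constant, so the gradient of the
    prediction with respect to theta is just phi_taken.
    """
    return [w + lr * delta * x for w, x in zip(theta, phi_taken)]

theta = [0.5, -0.2]
phi_taken = [1.0, 0.0]                 # Phi(s_t, a_t)
next_phis = [[0.0, 1.0], [1.0, 1.0]]   # Phi(s_{t+1}, a) for each action a
delta = td_error(theta, reward=1.0, phi_taken=phi_taken, next_phis=next_phis)
loss = delta ** 2
new_theta = semi_gradient_step(theta, phi_taken, delta, lr=0.1)
```

Here the prediction for the taken action is 0.5 and the bootstrapped target is 1.3, so `delta` is about 0.8 and the update nudges the first weight upward.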

Answered by d56 on November 4, 2021
