How does the Dyna-Q algorithm work?

Question

I'm having a hard time understanding how the Dyna-Q algorithm works. I've included the picture of the algorithm that I'm using to understand it. My questions are:

What does planning really mean? (It's step (f) in the picture.)
What does n represent?
Why is the term Model(S,A) used?

Thanks a lot for your help

Brale · Answer

$Model(S, A)$ is basically a table with one entry for each state-action pair in your environment. In step (e) of the algorithm we improve the model of the environment by saving the reward $R$ and next state $S'$ that we got by executing action $A$ in state $S$. This approach only works for smaller environments that don't have a large number of states and actions. I believe the pseudocode also assumes that the environment is deterministic, so once you have sampled the next state and reward for a specific state-action pair, you have perfect knowledge of the environment dynamics for that pair.
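To make the table idea concrete, here is a minimal sketch of what step (e) could look like, assuming a deterministic environment as described above. The names `model` and `update_model`, and the toy states and actions, are hypothetical and not taken from the pseudocode.

```python
# Hypothetical sketch: the model is a plain dictionary that maps a
# (state, action) pair to the last observed (reward, next_state).
model = {}


def update_model(model, state, action, reward, next_state):
    """Step (e): remember the outcome observed for this state-action pair."""
    # In a deterministic environment one sample fully describes the pair,
    # so overwriting the previous entry loses nothing.
    model[(state, action)] = (reward, next_state)


# Two real transitions from a made-up three-state chain.
update_model(model, state=0, action="right", reward=0.0, next_state=1)
update_model(model, state=1, action="right", reward=1.0, next_state=2)
```

A dictionary only stores the pairs that were actually visited, which is all the planning step needs, since it can only replay transitions it has already seen.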

In part (f) of the algorithm we do $n$ Q-learning updates using our model. $n$ is simply the number of planning updates performed after each real step. In the body of the loop we do the same Q-learning update as before, but this time we use the rewards and states saved in our model instead of sampling from the actual environment. This is what planning means here: learning from transitions simulated by the model rather than from real experience. We can do that because Q-learning is an off-policy algorithm, so we don't always need to sample from the environment to learn. The benefit is that once you have a model of the environment, you need fewer samples from the real environment, which can be expensive to obtain.
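The planning loop in (f) can be sketched roughly as follows. This is an illustration under assumptions, not the exact pseudocode: the function name `planning`, the step size `alpha`, the discount `gamma`, and the toy model are all made up for the example, and state 2 is treated as terminal simply because its Q-values are never updated from zero.

```python
import random
from collections import defaultdict


def planning(Q, model, actions, n, alpha=0.1, gamma=0.95, rng=random):
    """Step (f): n Q-learning updates on transitions replayed from the model."""
    for _ in range(n):
        # Pick a previously observed (state, action) pair at random.
        state, action = rng.choice(list(model))
        # Ask the model, not the environment, what happened last time.
        reward, next_state = model[(state, action)]
        # Ordinary Q-learning update applied to the simulated transition.
        best_next = max(Q[(next_state, a)] for a in actions)
        td_error = reward + gamma * best_next - Q[(state, action)]
        Q[(state, action)] += alpha * td_error


# Hypothetical toy data: a model holding two observed transitions.
actions = ["left", "right"]
Q = defaultdict(float)  # unseen entries default to 0.0
model = {(0, "right"): (0.0, 1), (1, "right"): (1.0, 2)}

planning(Q, model, actions, n=50, rng=random.Random(0))
```

With only two real transitions stored, the 50 planning updates push the reward at state 1 back toward state 0 without touching the environment again, which is the sample-efficiency benefit described above.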
