
Why isn't my implementation of DQN using TensorFlow on the FrozenWorld environment working?

Artificial Intelligence Asked by kosa on August 24, 2021

I am trying to test DQN on the FrozenWorld environment in gym using TensorFlow 2.x. The (off-policy) update rule is
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$$

I am using an epsilon-greedy policy.
In this environment, a reward is given only on success, so I explored 100% of the time until I had 50 successes. I then stored the failure and success transitions in separate bins, sampled from them with replacement, and used the samples to train the Q network. However, no matter how long I train, the agent doesn't seem to learn.
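For reference, this is roughly the action-selection scheme I mean; a minimal sketch of epsilon-greedy selection over a Keras Q-network, where `model` and the one-hot state encoding are assumptions for illustration rather than my actual Colab code:

```python
import numpy as np

# Hypothetical epsilon-greedy helper. `model` is assumed to be a compiled Keras
# model that maps a one-hot state vector to a vector of Q-values, one per action.
def epsilon_greedy_action(model, state, epsilon, n_actions):
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)              # explore with probability epsilon
    q_values = model.predict(state[None, :], verbose=0)[0]
    return int(np.argmax(q_values))                      # exploit: argmax_a Q(s, a)
```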

The code is available in Colab. I have been working on this for a couple of days.

PS: I modified the code for SARSA and Expected SARSA; nothing works.

One Answer

I see at least 3 issues with your DQN code that need to be fixed:

  1. You should not have separate replay memories for successes/failures. Put all of your experiences in one replay memory and sample from it uniformly.

  2. Your replay memory is extremely small with only 2,000 samples. You need to make it significantly larger; try at least 100,000 up to 1,000,000 samples.

  3. Your batch_target is incorrect. You need to train on returns and not just rewards. In your train function, compute the 1-step return $r + \gamma \max_{a'} Q(s',a')$, remembering to set $\max_{a'} Q(s',a') = 0$ if $s'$ is terminal, and then pass it to model.fit() as your prediction target (see the sketch after this list).
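A minimal sketch of how these three fixes could fit together, assuming a small Keras Q-network and one-hot encoded states; the network, buffer, and hyperparameter names here are illustrative placeholders, not taken from your Colab notebook:

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

# Illustrative hyperparameters (adapt to your own setup).
GAMMA = 0.99
BUFFER_SIZE = 100_000   # issue 2: one large buffer instead of two small ones
BATCH_SIZE = 64
N_STATES, N_ACTIONS = 16, 4   # assumes one-hot encoded states

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(N_STATES,)),
    tf.keras.layers.Dense(N_ACTIONS, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")

replay_buffer = deque(maxlen=BUFFER_SIZE)  # issue 1: a single buffer for all transitions

def store(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def train_step():
    if len(replay_buffer) < BATCH_SIZE:
        return
    batch = random.sample(replay_buffer, BATCH_SIZE)   # uniform sampling
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))

    # Issue 3: build the 1-step return r + gamma * max_a' Q(s', a'),
    # zeroing the bootstrap term on terminal transitions.
    next_q = model.predict(next_states, verbose=0)
    targets = rewards + GAMMA * np.max(next_q, axis=1) * (1.0 - dones.astype(np.float32))

    # Only the chosen action's Q-value is moved toward the target.
    batch_target = model.predict(states, verbose=0)
    batch_target[np.arange(BATCH_SIZE), actions] = targets

    model.fit(states, batch_target, verbose=0)
```

Zeroing the bootstrap term on terminal transitions is what allows the sparse goal reward to propagate backwards correctly in this environment.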

Answered by Brett Daley on August 24, 2021
