
Upper limit to the maximum cumulative reward in a deep reinforcement learning problem

Artificial Intelligence Asked by Kamran Thomas Alimagham on August 24, 2021

Is there an upper limit to the maximum cumulative reward in a deep reinforcement learning problem? For example, you want to train a DQN agent in an environment and you want to know the highest possible value of the cumulative reward, so you can compare it with your agent's performance.

3 Answers

Let's assume $\sup_{s,a} r(s,a) < b$. Then for continuing problems the upper bound can be obtained by
\begin{align}
\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) &\le \sum_{t=0}^{\infty} \gamma^{t} \sup_{s,a} r(s,a) \nonumber \\
&= \sum_{t=0}^{\infty} \gamma^{t} b = \frac{b}{1-\gamma}.
\end{align}

We can use the same bound for episodic tasks with discounted return. For episodic tasks without discounting ($\gamma = 1$) the above sum goes to infinity. However, if we know the episode length $T$, we can use $Tb$ as an upper bound.
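To make these two bounds concrete, here is a minimal Python sketch (Python is my choice since the thread names no language, and the function names are illustrative, not from the answer):

```python
def discounted_upper_bound(b: float, gamma: float) -> float:
    """Upper bound on the discounted return when every reward is < b
    and 0 <= gamma < 1: sum_t gamma^t * b = b / (1 - gamma)."""
    assert 0 <= gamma < 1, "discounted bound requires gamma in [0, 1)"
    return b / (1 - gamma)


def episodic_upper_bound(b: float, T: int) -> float:
    """Upper bound on the undiscounted return of an episode of known
    length T: T steps, each worth at most b."""
    return T * b


# Example: rewards bounded by b = 1, gamma = 0.99 -> bound of 100
print(discounted_upper_bound(1.0, 0.99))  # ~100.0
print(episodic_upper_bound(1.0, 200))     # 200.0
```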

Answered by M.S. on August 24, 2021


Yes, but it depends on the environment; an upper bound can exist even in a theoretical environment with an infinite number of time steps.

Calculating the upper bound

In reinforcement learning (deep RL included), we want to maximize the discounted cumulative reward, i.e. find the upper bound of $\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, where $\gamma \in [0, 1)$.

Before we find the upper bound of the series above, we need to check that it exists at all, i.e. whether the series converges given the environment's specification, in particular its reward function.

I will provide one example environment where the series converges. It is an environment with simple rules that goes on for infinite time steps. Its reward function is defined as follows:

-> A reward of +2 for every favorable action.

-> A reward of 0 for every unfavorable action.

So the path through the MDP that gives us the upper bound is the one along which we receive +2 at every step.

Let's say $\gamma$ is a constant, for example $\gamma = 0.5$; note that $\gamma \in [0, 1)$.

Now, we have a geometric series which converges:

$\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = \sum_{k=0}^{\infty} 2\gamma^k = \frac{2}{1-\gamma} = \frac{2}{1-0.5} = 4$

Thus the upper bound is 4.
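As a numerical sanity check (a sketch added here, not part of the original answer), the partial sums of this geometric series can be computed directly and watched approach 4:

```python
gamma, reward = 0.5, 2.0

# Partial sums of sum_{k=0}^{n} gamma^k * reward; they should approach
# reward / (1 - gamma) = 2 / 0.5 = 4 as n grows.
total = 0.0
for k in range(20):
    total += (gamma ** k) * reward

print(total)                  # 3.9999961853027344, already close to 4
print(reward / (1 - gamma))   # 4.0, the exact limit
```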

For environments that run for a finite number of time steps, the upper bound also exists. As with the infinite-time-step case, it may be difficult to calculate for certain environments, though not necessarily impossible. The environments I mean are those with complicated reward functions and dynamics, i.e. stochastic environments, or ones where the reward function's possible values depend on the state. Strictly speaking they always do, but we can loosely call a reward function state-independent when every possible reward value can be obtained in any state, given the appropriate action.

Answered by rert588 on August 24, 2021

In any reinforcement learning problem, not just deep RL, there is an upper bound for the cumulative reward, provided that the problem is episodic and not continuing.

If the problem is episodic and the rewards are designed such that the problem has a natural ending, i.e. the episode will end regardless of how well the agent does in the environment, then you could work it out by calculating the max possible reward at each step of the episode; however, this is potentially non-trivial depending on your environment.

For an example in a trivial setting, consider the problem of cartpole: I could define the MDP to have a reward of +1 for every time step that the agent keeps the pole balanced upright, and 0 when the pole falls. If I also defined that the problem terminates after 200 time steps, then the upper bound on the cumulative reward for this problem would be 200.
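A common use of such a bound is to normalize an agent's score against it; here is a short sketch under the answer's assumptions (the function name and the example reward list are illustrative):

```python
MAX_STEPS = 200               # the episode is cut off after 200 steps
MAX_RETURN = MAX_STEPS * 1.0  # +1 per balanced step, so the bound is 200


def normalized_score(episode_rewards) -> float:
    """Return the episode's cumulative reward as a fraction of the known
    upper bound; 1.0 means the agent achieved the maximum possible."""
    return sum(episode_rewards) / MAX_RETURN


# Example: an agent that balanced the pole for 150 of the 200 steps.
print(normalized_score([1.0] * 150))  # 0.75
```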

In general, if the problem is continuing then in theory the problem goes on infinitely and so there is no upper bound, as the episode never ends; this is partly why we use the discount factor, to ensure that $\sum_{k=0}^{\infty} \gamma^k R_{t+k}$ converges.

Answered by David Ireland on August 24, 2021
