Continuous state and continuous action Markov decision process time complexity estimate: backward induction VS policy gradient method (RL)

Artificial Intelligence Asked by leodongxu on December 16, 2021

Model description: model-based (the entire model is assumed known) Markov decision process.

Time ($t$): finite-horizon, discrete time, with a discount factor.

State($x_t$): Continuous multi-dimensional state

Action($a_t$): Continuous multi-dimensional action (deterministic action)

Feasible action set ($A_t$): the set of feasible actions is a continuous multi-dimensional space; the action space is not discretized.

Transition kernel ($P$): known, with some randomness in the transition from the current state to the next state.

Reward function: known in explicit form, and the terminal reward is known.

The method I tried to solve the model:

  1. Discretize the state space into a multi-dimensional grid. Starting from the terminal state, I use backward induction to reconstruct the value function of each earlier period. Via the Bellman equation, this means solving an optimization problem at every grid point: selecting the action that maximizes the objective function.

$$V_t(x_t) = \max_{a_t \in A_t}\Big[R_t(x_t,a_t) + \mathbb{E}\big[\tilde{V}_{t+1}(x_{t+1}) \,\big|\, x_t, a_t\big]\Big]$$

where $\tilde{V}_{t+1}$ is an approximation obtained by interpolation, since only discrete values are available from the previous backward-induction step. In other words, $\tilde{V}_{t+1}$ is approximated from the discrete values $$V_{t+1}(x_0), V_{t+1}(x_1), V_{t+1}(x_2), \dots, V_{t+1}(x_N)$$ where $x_0, x_1, x_2, \dots, x_N$ are the grid points from discretizing the state space.

In this way, at every time step $t$ I obtain a value at every grid point, and the value function can be approximated by some interpolation method (probably a cubic spline). But here are some of the problems:

  1. What kind of interpolation is suitable for high-dimensional data?
  2. Say the state has five dimensions and I discretize each dimension with 5 grid points: that already gives $5^5 = 3125$ discrete states whose values must each be computed through an optimization (the curse of dimensionality).
  3. What kind of optimizer should I use? Since I do not know the shape of the objective function, I do not know whether it is smooth or concave, so I may have to use a robust optimizer, probably an evolutionary one.

So I end up with this computational complexity, and the computation takes too long.
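To make the scheme concrete, here is a minimal toy sketch in Python/SciPy. The 2-D quadratic reward and linear-Gaussian kernel are placeholders (not my actual model), linear interpolation stands in for the spline, and differential evolution plays the role of the robust optimizer:

    # Toy backward induction on a discretized state grid.
    # Reward, transition kernel and grid sizes are illustrative assumptions.
    import numpy as np
    from scipy.interpolate import RegularGridInterpolator
    from scipy.optimize import differential_evolution

    rng = np.random.default_rng(0)
    T, gamma, K = 10, 0.95, 5                    # horizon, discount, grid points per dim
    grid = [np.linspace(-1.0, 1.0, K)] * 2       # 2-D state grid -> K^2 = 25 points
    states = np.stack(np.meshgrid(*grid, indexing="ij"), -1).reshape(-1, 2)

    def reward(x, a):                            # placeholder for the known R_t(x, a)
        return -(x @ x) - 0.1 * (a @ a)

    V = np.array([-(x @ x) for x in states])     # known terminal reward
    for t in reversed(range(T)):
        # interpolate V_{t+1} from its values on the grid (linear here)
        V_interp = RegularGridInterpolator(grid, V.reshape(K, K),
                                           bounds_error=False, fill_value=None)
        V_new = np.empty(len(states))
        for i, x in enumerate(states):
            noise = 0.05 * rng.standard_normal((32, 2))  # common random numbers for E[.]
            def neg_q(a):                        # negative Bellman objective
                nxt = x + 0.1 * a + noise        # samples from the known kernel
                return -(reward(x, a) + gamma * V_interp(nxt).mean())
            res = differential_evolution(neg_q, bounds=[(-1.0, 1.0)] * 2,
                                         maxiter=20, seed=0)
            V_new[i] = -res.fun                  # V_t at this grid point
        V = V_new

Even with only $5^2 = 25$ grid points, the inner optimizer dominates the run time, and that is exactly the cost that explodes as the dimension grows.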

Recently I learned about policy-gradient techniques from OpenAI: https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html
This method avoids the backward induction and the interpolation-based approximation of the value function. Instead, it obtains an approximate policy by first guessing a functional form for the policy and then estimating the gradient of the expected return with respect to the policy parameters via sampling (simulation). Since the model is known, fresh trajectories can be sampled at every iteration and used to update the policy by stochastic gradient ascent, repeating until some form of convergence.
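As a sketch of what I have in mind, here is a plain REINFORCE-style estimator with a linear-Gaussian policy; the toy model is again a placeholder, and a serious implementation would average over batches of trajectories and subtract a baseline to reduce variance:

    # Toy REINFORCE with the known model used as a simulator.
    # Policy form, model and hyperparameters are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    T, gamma, dim = 10, 0.95, 2
    sigma, lr = 0.3, 1e-2
    theta = np.zeros((dim, dim))       # policy mean: a ~ N(theta @ x, sigma^2 I)

    def reward(x, a):                  # placeholder for the known reward
        return -(x @ x) - 0.1 * (a @ a)

    def step(x, a):                    # known transition kernel as a simulator
        return x + 0.1 * a + 0.05 * rng.standard_normal(dim)

    for it in range(2000):
        x = rng.uniform(-1.0, 1.0, dim)          # sample a start state
        ret, disc, glogp = 0.0, 1.0, np.zeros_like(theta)
        for t in range(T):
            a = theta @ x + sigma * rng.standard_normal(dim)
            # gradient of log N(a; theta @ x, sigma^2 I) w.r.t. theta
            glogp += np.outer(a - theta @ x, x) / sigma**2
            ret += disc * reward(x, a)
            disc *= gamma
            x = step(x, a)
        theta += lr * ret * glogp      # REINFORCE update: stochastic gradient ascent

Each iteration only touches the $T$ states along a sampled trajectory, so the per-iteration cost is polynomial in the dimension rather than exponential.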

I am wondering whether this type of technique could significantly reduce the computational complexity. Any advice helps, thanks a lot.

One Answer

Discretizing the state/action space will always be a very expensive strategy; in fact, I don't think you can do better than time exponential in the state/action dimension that way.

Now, you still haven't really explained your algorithm with the discretized states and actions. Given your known model, it sounds like you're planning on doing dynamic programming, which would definitely be an exponential-time algorithm. You also haven't explained how you'd use your knowledge of the model in your policy-gradient implementation. If you're just using it to sample trajectories, so you have more data for updates, policy-gradient methods will be much more efficient (polynomial time). Otherwise, if you're planning to use your model for some sort of MCTS planning, that could be very inefficient as well, depending on how you implement it.
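As a rough back-of-envelope comparison (the symbols here are illustrative: $K$ grid points per dimension, $d$ state dimensions, horizon $T$, per-grid-point optimization cost $C_{\text{opt}}$, per-trajectory cost $C_{\text{traj}}$):

$$\underbrace{T \cdot K^d \cdot C_{\text{opt}}}_{\text{grid-based dynamic programming}} \quad \text{vs.} \quad \underbrace{N_{\text{iter}} \cdot C_{\text{traj}}}_{\text{policy gradient}}$$

The left-hand side grows exponentially in $d$, while the right-hand side does not, although the number of iterations $N_{\text{iter}}$ needed for convergence is problem-dependent, and policy gradient only converges to a local optimum.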

Answered by harwiltz on December 16, 2021
