Continuous state and continuous action Markov decision process time complexity estimate: backward induction VS policy gradient method (RL)

Artificial Intelligence Asked by leodongxu on December 16, 2021

Model description: model-based (the entire model is assumed known) Markov decision process.

Time ($t$): finite-horizon, discrete time, with a discount factor.

State($x_t$): Continuous multi-dimensional state

Action($a_t$): Continuous multi-dimensional action (deterministic action)

Feasible action set ($A_t$): the set of feasible actions is a continuous multi-dimensional space; the action space is not discretized.

Transition kernel ($P$): known, with some randomness in the transition from the current state to the next state.

Reward function: known in explicit form, and the terminal reward is known.

The method I tried to solve the model:

  1. Discretize the state space into a multi-dimensional grid. Starting from the terminal state, I use backward induction to reconstruct the value function of each earlier period. Via the Bellman equation, this means solving an optimization problem at every grid point: selecting the action that maximizes the objective function.

$$V_t(x_t) = \max_{a_t \in A_t}\Big[R_t(x_t,a_t) + \mathbb{E}\big[\tilde{V}_{t+1}(x_{t+1}) \,\big|\, x_t, a_t\big]\Big]$$

where $\tilde{V}_{t+1}$ is an approximation obtained by interpolation, since only discrete values are available from the previous backward-induction step. In other words, $\tilde{V}_{t+1}$ is approximated from the discrete values $$V_{t+1}(x_0), V_{t+1}(x_1), V_{t+1}(x_2), \dots, V_{t+1}(x_N)$$ where $x_0, x_1, x_2, \dots, x_N$ are the grid points from discretizing the state space.

In this way, at every time step $t$ I obtain a value at every grid point, and the value function can be approximated by some interpolation method (probably a cubic spline). But here are some of the problems:

  1. What kind of interpolation is suitable for high-dimensional data?
  2. Say the state has five dimensions and I discretize each dimension with 5 grid points: that already gives $5^5 = 3125$ discrete states whose values must each be computed through an optimization (the curse of dimensionality).
  3. What kind of optimizer should I use? Since I do not know the shape of the objective function, I do not know whether it is smooth or concave, so I may have to use a robust optimizer, probably an evolutionary one.

So I end up with this computational complexity, and the computation takes too long.
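To make the scheme concrete, here is a minimal toy sketch in Python/SciPy. The 2-D quadratic reward and linear-Gaussian kernel are placeholders (not my actual model), linear interpolation stands in for the spline, and differential evolution plays the role of the robust optimizer:

    # Toy backward induction on a discretized state grid.
    # Reward, transition kernel and grid sizes are illustrative assumptions.
    import numpy as np
    from scipy.interpolate import RegularGridInterpolator
    from scipy.optimize import differential_evolution

    rng = np.random.default_rng(0)
    T, gamma, K = 10, 0.95, 5                    # horizon, discount, grid points per dim
    grid = [np.linspace(-1.0, 1.0, K)] * 2       # 2-D state grid -> K^2 = 25 points
    states = np.stack(np.meshgrid(*grid, indexing="ij"), -1).reshape(-1, 2)

    def reward(x, a):                            # placeholder for the known R_t(x, a)
        return -(x @ x) - 0.1 * (a @ a)

    V = np.array([-(x @ x) for x in states])     # known terminal reward
    for t in reversed(range(T)):
        # interpolate V_{t+1} from its values on the grid (linear here)
        V_interp = RegularGridInterpolator(grid, V.reshape(K, K),
                                           bounds_error=False, fill_value=None)
        V_new = np.empty(len(states))
        for i, x in enumerate(states):
            noise = 0.05 * rng.standard_normal((32, 2))  # common random numbers for E[.]
            def neg_q(a):                        # negative Bellman objective
                nxt = x + 0.1 * a + noise        # samples from the known kernel
                return -(reward(x, a) + gamma * V_interp(nxt).mean())
            res = differential_evolution(neg_q, bounds=[(-1.0, 1.0)] * 2,
                                         maxiter=20, seed=0)
            V_new[i] = -res.fun                  # V_t at this grid point
        V = V_new

Even with only $5^2 = 25$ grid points, the inner optimizer dominates the run time, and that is exactly the cost that explodes as the dimension grows.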

Recently I learned about policy-gradient techniques from OpenAI: https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html
This method avoids the backward induction and the interpolation-based approximation of the value function. Instead, it obtains an approximate policy by first guessing a functional form for the policy and then estimating the gradient of the expected return with respect to the policy parameters via sampling (simulation). Since the model is known, fresh trajectories can be sampled at every iteration and used to update the policy by stochastic gradient ascent, repeating until some form of convergence.
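As a sketch of what I have in mind, here is a plain REINFORCE-style estimator with a linear-Gaussian policy; the toy model is again a placeholder, and a serious implementation would average over batches of trajectories and subtract a baseline to reduce variance:

    # Toy REINFORCE with the known model used as a simulator.
    # Policy form, model and hyperparameters are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    T, gamma, dim = 10, 0.95, 2
    sigma, lr = 0.3, 1e-2
    theta = np.zeros((dim, dim))       # policy mean: a ~ N(theta @ x, sigma^2 I)

    def reward(x, a):                  # placeholder for the known reward
        return -(x @ x) - 0.1 * (a @ a)

    def step(x, a):                    # known transition kernel as a simulator
        return x + 0.1 * a + 0.05 * rng.standard_normal(dim)

    for it in range(2000):
        x = rng.uniform(-1.0, 1.0, dim)          # sample a start state
        ret, disc, glogp = 0.0, 1.0, np.zeros_like(theta)
        for t in range(T):
            a = theta @ x + sigma * rng.standard_normal(dim)
            # gradient of log N(a; theta @ x, sigma^2 I) w.r.t. theta
            glogp += np.outer(a - theta @ x, x) / sigma**2
            ret += disc * reward(x, a)
            disc *= gamma
            x = step(x, a)
        theta += lr * ret * glogp      # REINFORCE update: stochastic gradient ascent

Each iteration only touches the $T$ states along a sampled trajectory, so the per-iteration cost is polynomial in the dimension rather than exponential.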

I am wondering whether this type of technique could significantly reduce the computational complexity. Any advice helps, thanks a lot.

One Answer

Discretizing the state/action space will always be a very expensive strategy; in fact, I don't think you can do better than time exponential in the state/action dimension that way.

Now, you still haven't really explained your algorithm with the discretized states and actions. Given your known model, it sounds like you're planning on doing dynamic programming, which would definitely be an exponential-time algorithm. You also haven't explained how you'd use your knowledge of the model in your policy-gradient implementation. If you're just using it to sample trajectories, so you have more data for updates, policy-gradient methods will be much more efficient (polynomial time). Otherwise, if you're planning to use your model for some sort of MCTS planning, that could be very inefficient as well, depending on how you implement it.
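As a rough back-of-envelope comparison (the symbols here are illustrative: $K$ grid points per dimension, $d$ state dimensions, horizon $T$, per-grid-point optimization cost $C_{\text{opt}}$, per-trajectory cost $C_{\text{traj}}$):

$$\underbrace{T \cdot K^d \cdot C_{\text{opt}}}_{\text{grid-based dynamic programming}} \quad \text{vs.} \quad \underbrace{N_{\text{iter}} \cdot C_{\text{traj}}}_{\text{policy gradient}}$$

The left-hand side grows exponentially in $d$, while the right-hand side does not, although the number of iterations $N_{\text{iter}}$ needed for convergence is problem-dependent, and policy gradient only converges to a local optimum.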

Answered by harwiltz on December 16, 2021
