Handling a Large Discrete Action Space in Deep Q Learning

Question

I am attempting to solve a timetabling problem using deep Q learning. It could be thought of as a resource allocation problem to obtain some certificate of 'optimality'. However, how to define and access the action space is eluding me. Any help, thoughts, or direction towards the literature would be appreciated. Thanks!
The problem is entirely deterministic: the pair of the current state and action is isomorphic to the resulting state. The Q network is therefore being set up to approximate a Q value (a scalar) for the resulting state, i.e. for the current state and proposed action.
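As a minimal sketch of this idea, assuming a deterministic transition function is available, the Q value of a state-action pair can be computed by scoring the resulting state. The names `q_value`, `best_action`, `transition`, and `value_fn` are hypothetical; `value_fn` stands in for the trained network and is replaced by a toy function here.

```python
def q_value(state, action, transition, value_fn):
    """Score a (state, action) pair by evaluating the state it leads to.

    Assumes `transition` is deterministic, so the pair maps to exactly
    one resulting state.
    """
    return value_fn(transition(state, action))


def best_action(state, actions, transition, value_fn):
    """Return the candidate action whose resulting state scores highest."""
    return max(actions, key=lambda a: q_value(state, a, transition, value_fn))


if __name__ == "__main__":
    # Toy stand-ins (assumptions for illustration only):
    # states and actions are integers, and the value peaks at state 10.
    toy_transition = lambda s, a: s + a
    toy_value = lambda s: -(s - 10) ** 2
    print(best_action(7, [1, 2, 3, 4], toy_transition, toy_value))
```

Because only resulting states are scored, the network needs no separate action input; the cost moves into generating the candidate actions.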
I have so far assumed that the action space should be randomly sampled during training to generate some approximation of the Q table. This seems highly inefficient.
I am open to reinterpretations of the action space. The problem involves a set of $n$ individuals. In any given state, a maximum of $b$ can be 'active' and, of the remaining 'inactive' individuals, $f$ can be made 'active' by an action. An action reallocates the active set, choosing its members from those who are already active and the $f$ available individuals.
To give you a sense of the numbers that I will ultimately use, $n=17$, $b=7$, and $f$ will hover somewhere around 7-10 (but depends on the allocations). At first this sounds tractable, but a (very) rough approximation of the cardinality of the set of actions is $\binom{17}{7} = 19448$.
Does anyone know a more efficient way to encode this action space? If not, is there a more sensible way to sample it (as is my current plan) than uniformly extracting actions from the space? Also, when sampling the space, is it valid to enforce some cap on the number of samples drawn (say 500)? Please feel free to ask for further clarification.
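To make the count concrete, here is a minimal sketch under one assumed reading of the problem: an action picks exactly b active individuals from the pool of b currently active plus f available ones. The function names are hypothetical, and the real constraints may shrink the pool further.

```python
from itertools import combinations
from math import comb


def count_actions(b, f):
    """Number of ways to pick b active individuals from b + f candidates.

    Assumption: every action fills all b active slots.
    """
    return comb(b + f, b)


def enumerate_actions(active, available, b):
    """Lazily yield every candidate active set as a sorted tuple."""
    pool = sorted(set(active) | set(available))
    return combinations(pool, b)


if __name__ == "__main__":
    for f in (7, 8, 9, 10):
        print(f, count_actions(7, f))
```

Under this assumption, with b = 7 the count is 3432 for f = 7 and 19448 for f = 10, so the 17 choose 7 estimate corresponds to the upper end of the range for f.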
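The capped uniform sampling described above could look like the following minimal sketch. It assumes an action is a set of exactly b individuals drawn from the currently active and available ones; `sample_actions` and its parameters are hypothetical names, not part of any library.

```python
import random
from itertools import combinations
from math import comb


def sample_actions(active, available, b, cap, rng=random):
    """Return up to `cap` distinct candidate active sets, drawn uniformly.

    If the whole action set is no larger than `cap`, it is enumerated
    exactly instead of sampled.
    """
    pool = sorted(set(active) | set(available))
    if comb(len(pool), b) <= cap:
        return list(combinations(pool, b))
    seen = set()
    while len(seen) < cap:
        # rng.sample draws b distinct individuals; sorting makes the
        # tuple a canonical representation of the set.
        seen.add(tuple(sorted(rng.sample(pool, b))))
    return sorted(seen)


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = sample_actions(range(7), range(7, 17), b=7, cap=500, rng=rng)
    print(len(candidates))
```

Note that a maximum taken over a sampled subset can never exceed the maximum over the full action set, so a cap trades some underestimation of the greedy value for speed.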
