
How to frame this problem using RL?

Artificial Intelligence | Asked by blue-sky on January 27, 2021

How should the problem of preventing users from exceeding their bank account balance and becoming overdrawn be framed as an RL problem?

For example, a user has 1000 in an account and proceeds to withdraw 300, 400 and 500, leaving the user overdrawn by 200: (300 + 400 + 500) - 1000 = 200.

Treating this as a supervised learning problem, I could use logistic regression. The input features are the transaction amounts, e.g. 300, 400, 500 for one training instance, and the output label indicates whether the account ends up overdrawn (1) or not (0). For simplicity, we will assume the number of transactions is always 3.
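
As a rough sketch of that supervised framing, something like the following could work (the training rows below are made up purely for illustration, and scikit-learn's LogisticRegression is assumed as the classifier):

# Hypothetical supervised framing: each row holds the 3 transaction amounts
# for one instance; the label is 1 if the 1000 balance would be exceeded, else 0.
from sklearn.linear_model import LogisticRegression

X = [[300, 400, 500],   # 1200 total -> overdrawn
     [100, 200, 300],   #  600 total -> not overdrawn
     [500, 400, 200],   # 1100 total -> overdrawn
     [250, 250, 250]]   #  750 total -> not overdrawn
y = [1, 0, 1, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict([[300, 300, 300]]))  # predicted overdrawn flag for a new instance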

For RL, a state could be represented as a series of transactions, but how should the reward be assigned?

Update:

Here is my RL implementation of the problem:

import torch
from collections import defaultdict
gamma = .1
alpha = 0.1
epsilon = 0.1
n_episode = 2000
overdraft_limit = 1000

length_episode = [0] * n_episode
total_reward_episode = [0] * n_episode

# Each inner list is one episode: a sequence of transaction amounts
episode_states = [[700, 100, 200, 290, 500], [400, 100, 200, 300, 500], [212, 500, 100, 100, 200, 500]]

def gen_epsilon_greedy_policy(n_action, epsilon):
    def policy_function(state, Q):
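        # Epsilon-greedy: each action gets probability epsilon / n_action,
        # and the greedy action gets an extra 1 - epsilon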
        probs = torch.ones(n_action) * epsilon / n_action
        best_action = torch.argmax(Q[state]).item()
        probs[best_action] += 1.0 - epsilon
        action = torch.multinomial(probs, 1).item()
        return action
    return policy_function

def is_overdrawn(currentTotal):
    return currentTotal >= overdraft_limit

# Actions are predictions: 0 means the account is not overdrawn, 1 means it is overdrawn
def get_reward(action, currentTotal):
    if action == 0 and is_overdrawn(currentTotal):
        return -1
    elif action == 0 and not is_overdrawn(currentTotal):
        return 1
    elif action == 1 and is_overdrawn(currentTotal):
        return 1
    elif action == 1 and not is_overdrawn(currentTotal):
        return -1
    else:
        raise Exception("Action not found")

def q_learning(gamma, n_episode, alpha, n_action):
    """
    Obtain the optimal policy with the off-policy Q-learning method
    @param gamma: discount factor
    @param n_episode: number of episodes
    @param alpha: learning rate
    @param n_action: number of actions
    @return: the optimal Q-function and the optimal policy
    """
    Q = defaultdict(lambda: torch.zeros(n_action))
    for ee in episode_states:
        for episode in range(n_episode):
            state = ee[0]
            index = 0
            currentTotal = 0
            while index < len(ee) - 1:
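                # Add the current state's transaction amount to the running total for this episode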
                currentTotal = currentTotal + state
                next_state = ee[index+1] 
                action = epsilon_greedy_policy(state, Q)
#                 print(action)
                reward = get_reward(action, currentTotal)
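                # Q-learning TD update: move Q[state][action] toward reward + gamma * max_a Q[next_state][a]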
                td_delta = reward + gamma * torch.max(Q[next_state]) - Q[state][action]
                Q[state][action] += alpha * td_delta

                state = next_state
                index = index + 1

                length_episode[episode] += 1
                total_reward_episode[episode] += reward
                
    policy = {}
    for state, actions in Q.items():
        policy[state] = torch.argmax(actions).item()
    return Q, policy

epsilon_greedy_policy = gen_epsilon_greedy_policy(2, epsilon)

optimal_Q, optimal_policy = q_learning(gamma, n_episode, alpha, 2)

print('The optimal policy:\n', optimal_policy)
print('The optimal Q:\n', optimal_Q)

This code prints:

The optimal policy:
 {700: 0, 100: 0, 200: 1, 290: 1, 500: 0, 400: 0, 300: 1, 212: 0}
The optimal Q:
 defaultdict(<function q_learning.<locals>.<lambda> at 0x7f9371b0a3b0>, {700: tensor([ 1.1110, -0.8890]), 100: tensor([ 1.1111, -0.8889]), 200: tensor([-0.8889,  1.1111]), 290: tensor([-0.9998,  1.0000]), 500: tensor([ 1.1111, -0.8889]), 400: tensor([ 1.1110, -0.8890]), 300: tensor([-1.0000,  1.0000]), 212: tensor([ 1.1111, -0.8888])})

The optimal policy tells us that if 700 is added to the balance, the customer will not overdraw (0), and if 200 is added, the customer will overdraw (1). What avenues can I explore to improve upon this method? It is quite basic, but I'm unsure what approach I should take to improve the solution.
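
For reference, here is a quick sketch of querying the learned policy for a single amount, using the optimal_policy dict printed above (only amounts that appeared as states during training have an entry):

amount = 290
action = optimal_policy.get(amount)  # None if this amount was never seen as a state
if action is None:
    print('No learned action for', amount)
else:
    print('overdraw expected' if action == 1 else 'no overdraw expected')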

For example, this solution only looks at the most recent additions to the balance to determine whether the customer is overdrawn. Is this a case of adding new features to the training data?

I'm just requesting a critique of this solution so I can improve it. How can I improve the representation of the state?
