Artificial Intelligence Asked by lll on December 18, 2020
I am new to reinforcement learning, but for a finite-horizon application problem I am considering using the average reward instead of the sum of rewards as the objective. Specifically, an episode has at most $T$ time steps (e.g., measuring the usage rate of an app in each time step), and in each time step the reward is either 0 or 1. The goal is to maximize the daily average usage rate.
The episode length is at most $T = 10$; $T$ is the maximum time window over which the product can observe a user's behavior in the chosen data. An indicator value in the data marks whether an episode has terminated. This is offline learning, so $T$ is given in the data for each episode. As long as an episode has not terminated, there is a reward of $0$ or $1$ in each time step.
I heard that if I use an average reward with a finite horizon, the optimal policy is no longer stationary and the optimal $Q$-function depends on time. I am wondering why this is the case.
I see that, normally, the objective is defined as maximizing
$$\sum_{t=0}^{T} \gamma^t r_t.$$
I am considering two definitions of the average reward:

1. $\frac{1}{T}\sum_{t=0}^{T}\gamma^t r_t$, where $T$ varies in each episode;
2. $\frac{1}{T-t}\sum_{i=t-1}^{T}\gamma^i r_i$.
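To make the two candidate objectives concrete, here is a minimal sketch computing both for a single episode's 0/1 reward sequence (the function names are my own, and I assume the second definition normalizes the discounted tail from step $t-1$ by the remaining horizon $T-t$, as written above):

```python
def avg_reward_full(rewards, gamma=1.0):
    """Definition 1: discounted sum over the whole episode, divided by
    the episode length T (which varies per episode)."""
    T = len(rewards)
    return sum(gamma ** t * r for t, r in enumerate(rewards)) / T

def avg_reward_tail(rewards, t, gamma=1.0):
    """Definition 2: discounted sum of rewards from step t-1 up to T,
    divided by the remaining horizon T - t (requires t < T)."""
    T = len(rewards)
    tail = rewards[t - 1:]  # indices t-1, t, ..., T-1, as in the question
    return sum(gamma ** i * r for i, r in enumerate(tail, start=t - 1)) / (T - t)
```

For example, with `rewards = [1, 0, 1, 1]` and `gamma = 1.0`, `avg_reward_full` gives `3/4` while `avg_reward_tail(rewards, t=2)` gives `2/2`; the tail objective's dependence on $t$ is one way to see why the value becomes time-dependent.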