Why do we use $X_{I_t,t}$ and $v_{I_t}$ to denote the reward received and the at time step $t$ and the distribution of the chosen arm $I_t$?

Question

I'm doing some introductory research on classical (stochastic) MABs. However, I'm a little confused about the common notation (e.g. in the popular paper of Auer (2002) or Bubeck and Cesa-Bianchi (2012)).
As in the latter study, let us consider an MAB with a finite number of arms $iin{1,...,K}$, where an agent choses at every timestep $t=1,...,n$ an arm $I_t$ which generates a reward $X_{I_t,t}$ according to a distribition $v_{I_t}$.
In my understanding, each arm has an inherent distribution, which is unknown to the agent. Therefore, I'm wondering why the notation $v_{I_t}$ is used instead of simply using $v_{i}$? Isn't the distribution independent of the time the arm $i$ was chosen?
Furthermore, I ask myself: Why not simply use $X_i$ instead of $X_{I_t,t}$ (in terms of rewards). Is it because the chosen arm at step $t$ (namely $I_t$) is a random variable and $X$ depends on it? If I am right, why is $t$ used twice in the index (namely $I_t,t$)?
Shouldn't $X_{I_t}$ be sufficient, since $X_{I_t,m}$ and $X_{I_t,n}$ are drawn from the same distribution?

DeepQZero · Answer

Isn't the distribution independent of the time the arm $i$ was chosen?

Each one of the two references you describe assumes the context of the random bandit problem proposed by Robbins (1952) where the underlying reward distributions of each bandit are fixed. Therefore, yes, the underlying distributions are independent of the current time.

Is it because the chosen arm at step $t$ (namely $I_t$) is a random variable and $X$ depends on it?

The reward is a random variable which is dependent on the chosen arm at time $t$. Since each arm has an underlying reward distribution, the index $I_t$ is a random variable that designates the specific arm we are pulling, and the index $t$ denotes the time step when we pull the arm.

Why is $t$ used twice in the index (namely $I_t,t$)?

Note that $t$ is used twice, but the observed value of $I_t$ does not encode any information about the time it was chosen. For example, if $I_m = 5$, then $X_{I_m, m} = X_{5, m}$. If we drop the second subscript, then we have no way to distinguish $X_{5, m}$ notationally from $X_{5, n}$ (where $I_{n neq m} = 5$). Two distinct rewards $X_{5, m}$ and $X_{5, n}$ would map to the same reward $X_5$ notationally. At first glance, this introduces numerous potential notational problems, such as losing the count of the number of times each arm was pulled.

Why not simply use $X_i$ instead of $X_{I_t, t}$ (in terms of rewards)? Shouldn't $X_{I_t}$ be sufficient, since $X_{I_t, m}$ and $X_{I_t, n}$ are drawn from the same distribution?

Admittedly, there are probably ways to get around the extra subscript for certain algorithms. For example, maybe you are using an algorithm where past rewards from each arm are averaged to yield an estimate of each arm's expected reward (see Section 2.2 of Sutton and Barto). This may require a collection of lists that store the past rewards for each arm, or it may require the count of each arm being pulled and an associated current estimate of the expected reward (see Section 2.4 of Sutton and Barto). However, these methods introduce more parameters that would be unnecessary had we initially included a second subscript for time in our notation (e.g. the counts of each arm pulled, the current estimate of expected reward for each arm, the labels of each reward list corresponding to an arm, etc.). Most of the fundamental equations regarding multi-armed bandits that I have seen are either heavily or solely dependent on the reward random variable (e.g. the definition of regret). Keeping the time index in a single random variable promotes concision and consistency among various sources by preventing the need for delegating the time index to another random variable, data structure, etc., even though specific implementations or contexts may profit from other notations.
The dual-subscript notation also has the benefit of generalizing to other contexts aside from the one posed by Robbins (1952). These include nonstationary reward distributions (see Section 2.5 of Sutton and Barto), time discounting, and families of alternative bandit processes, among others (see Sections 2.2-2.4 of this book for info on the last two extensions).

David Ireland · Answer

Isn't the distribution independent of the time the arm $i$ was chosen?

Yes, but you don't know which arm was chosen at time $t$, that is what $I_t$ represents. $v_i$ would represent the $i$th arms distribution, whereas you want the distribution of the arm that was chosen at time $t$, which is $v_{I_t}$.
$X_{I_t,t}$ is used to represent the arm you chose at time $t$ and the time you chose it. Imagine if we chose the following arms at respective time steps {1,2,5,2,4}, then the reward you get at time 3 (assuming in my example time started at 1) would be $X_{5,3}$. You need this notation because every time you pull arm $i$ you get a different reward, because the reward is a random variable (unless the assumption was a deterministic reward but that would not be interesting).
It has been a while since I read the papers but I assume that the distributions of the arms are stationary, however this notation is more general and would allow for non-stationary distributions.

Why do we use $X_{I_t,t}$ and $v_{I_t}$ to denote the reward received and the at time step $t$ and the distribution of the chosen arm $I_t$?

2 Answers

Add your own answers!

Ask a Question