Suppose we have offline statistics of the wins and losses of each slot machine, which we take as our state:
\begin{equation} w_1, l_{1}, \dots, w_{n}, l_{n} \end{equation}
What if we want to create a policy that maximises exploration?
We construct a value function:
\begin{equation} U^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}]) = \max_{a} Q^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}], a) \end{equation}
Our policy is the greedy policy with respect to \(Q^{*}\):
\begin{equation} \pi^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}]) = \arg\max_{a} Q^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}], a) \end{equation}
Now, how do we go about calculating the action-value:
\begin{align} Q^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}], a) ={}& \frac{w_{a}+1}{w_{a}+l_{a}+2} \left(R(w) + U^{*}(\dots, w_{a}+1, l_{a}, \dots)\right) \\ &+ \left(1-\frac{w_{a}+1}{w_{a}+l_{a}+2}\right)\left(R(l) + U^{*}(\dots, w_{a}, l_{a}+1, \dots)\right) \end{align}
In words: the probability that you win (the expectation of the Beta distribution), times the instantaneous reward of a win plus the utility of the updated state (which is where the value of the information you gain shows up), plus the corresponding term for a loss.
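As a concrete illustration, here is a minimal sketch of this one-step backup in Python, assuming a reward of 1 for a win and 0 for a loss, and taking the successor utilities from a hypothetical `u_star` lookup supplied by the caller:

```python
# One-step backup Q(state, a), assuming R(win) = 1 and R(loss) = 0.
# `u_star` is a (hypothetical) callable returning U* of a successor state.

def q_value(state, a, u_star):
    """state = (w_1, l_1, ..., w_n, l_n); a indexes an arm (0-based)."""
    w, l = state[2 * a], state[2 * a + 1]
    p_win = (w + 1) / (w + l + 2)                 # mean of Beta(w + 1, l + 1)

    win_state  = state[:2 * a] + (w + 1, l) + state[2 * a + 2:]
    loss_state = state[:2 * a] + (w, l + 1) + state[2 * a + 2:]

    return (p_win * (1 + u_star(win_state))
            + (1 - p_win) * (0 + u_star(loss_state)))

# At the horizon U* is 0 everywhere, so the backup reduces to the win probability:
q_value((3, 1, 0, 0), 0, lambda s: 0.0)   # -> 4/6, about 0.667
```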
To solve this over a finite horizon, note that at the final time \(t=k\) your \(U^{*}\) is \(0\), because there is nothing left to do.
Then you back up one step at a time to fill in each earlier state:
- calculate \(Q^{*}([w_1-1, l_1, \dots, w_{n}, l_{n}], a)\)
- calculate \(Q^{*}([w_1, l_1-1, \dots, w_{n}, l_{n}], a)\)
and so on for every state one observation short of the horizon, taking \(U^{*}\) of each such state to be the max over actions, then repeating the backup for states further from the horizon (see the sketch below).
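A minimal sketch of this backward pass, assuming a two-armed problem, a reward of 1 for a win and 0 for a loss, and a horizon of 5 pulls (all hypothetical choices). It implements the same backup from the horizon as a memoised recursion rather than an explicit table:

```python
from functools import lru_cache

N_ARMS = 2        # hypothetical problem size
HORIZON = 5       # total pulls available, i.e. the k above

@lru_cache(maxsize=None)
def u_star(state, pulls_left):
    """U* of state (w_1, l_1, ..., w_n, l_n) with `pulls_left` pulls to go."""
    if pulls_left == 0:
        return 0.0                     # nothing left to do at the horizon
    return max(q_star(state, a, pulls_left) for a in range(N_ARMS))

def q_star(state, a, pulls_left):
    """One-step backup for arm a, assuming R(win) = 1 and R(loss) = 0."""
    w, l = state[2 * a], state[2 * a + 1]
    p_win = (w + 1) / (w + l + 2)      # mean of Beta(w + 1, l + 1)
    win_state  = state[:2 * a] + (w + 1, l) + state[2 * a + 2:]
    loss_state = state[:2 * a] + (w, l + 1) + state[2 * a + 2:]
    return (p_win * (1 + u_star(win_state, pulls_left - 1))
            + (1 - p_win) * (0 + u_star(loss_state, pulls_left - 1)))

def policy(state, pulls_left):
    """The greedy policy: pick the arm with the largest Q*."""
    return max(range(N_ARMS), key=lambda a: q_star(state, a, pulls_left))

# Starting with no data on either arm and HORIZON pulls remaining:
start = (0, 0) * N_ARMS
print(u_star(start, HORIZON), policy(start, HORIZON))
```

Parameterising by the number of pulls remaining (instead of deriving it from the counts in the state) keeps the recursion simple; the memoisation means each reachable state is backed up once, just as in the table-based pass described above.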