Houjun Liu

Directed Exploration

Softmax Method

Pull arm \(a\) with probability \(\propto \exp (\lambda \rho_{a})\), where \(\lambda \geq 0\) is the “precision parameter”.

When \(\lambda \to 0\), this system uses the same rate for each of the actions, so you are essentially randomly sampling; when \(\lambda \to \infty\), the system will use only the greedy action because only the element with the biggest \(\rho_{a}\) gets selected.

For a multi-state case:

\begin{equation} \propto \exp (\lambda Q(s,a)) \end{equation}

Quantile Exploration

Choose the arm with the largest \(\theta\) at the highest \(\alpha\) percentile of its beta distribution, pull that arm, update priors

“choose the arm with the highest \(\theta\) for the \(90\%\) percentile, then update the distribution”


Inspired by monte-carlo exploration

take action \(a\) such that

\begin{equation} \max_{a} \rho_{a} + c \sqrt{ \frac{\log N}{N(a)}} \end{equation}

where, \(c\) is the exploration factor, \(N\) is the total number of trials, \(N(a)\) is the number of trials for \(a\) we have done.

This value is considered the “upper confidence bound”; hence “UCB”

Posterior Sampling

Same one point from each Beta Distribution for each of your slot machines; then you pick the result that is the highest.

Does not require any parameter.

This is proven to do some over-exploration. But that’s (mostly) just fine.


See R-Max