Houjun Liu

Undirected Exploration

base epsilon-greedy:

  1. with probability \(\epsilon\), choose an action uniformly at random
  2. otherwise, choose the greedy action \(\arg\max_{a} Q(s,a)\), i.e. the action with the highest estimated value (sketched below)
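A minimal Python sketch of this selection rule, assuming \(Q\) is stored as a dict mapping (state, action) pairs to estimated values; the function and argument names here are illustrative, not from these notes:

```python
import numpy as np

def epsilon_greedy_action(Q, state, n_actions, epsilon, rng=None):
    """Choose an action epsilon-greedily.

    Q is assumed to be a dict mapping (state, action) -> estimated value;
    names are illustrative placeholders.
    """
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        # explore: pick an action uniformly at random
        return int(rng.integers(n_actions))
    # exploit: pick the action with the highest estimated value in this state
    return max(range(n_actions), key=lambda a: Q.get((state, a), 0.0))
```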

epsilon-greedy exploration with decay

Some approaches additionally decay \(\epsilon\) over time, whereby at each timestep:

\begin{equation} \epsilon \leftarrow \alpha \epsilon \end{equation}

where \(\alpha \in (0,1)\) is called the “decay factor.”
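A minimal sketch of this decay schedule; the initial value and \(\alpha\) below are arbitrary illustrative choices, not values from these notes:

```python
epsilon = 1.0    # initial exploration probability (illustrative)
alpha = 0.99     # decay factor, alpha in (0, 1) (illustrative)

for t in range(1000):
    # ... select an action epsilon-greedily and update Q here ...
    epsilon *= alpha   # epsilon <- alpha * epsilon at each timestep
```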

Explore-then-commit

Select actions uniformly at random for \(k\) steps; then commit to the greedy choice \(\arg\max_{a} Q(s,a)\) and follow it for all remaining steps.
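A minimal bandit-style sketch of explore-then-commit, assuming a callable `pull(a)` that returns a sampled reward for arm `a`; all names are illustrative placeholders, not from these notes:

```python
import numpy as np

def explore_then_commit(pull, n_actions, k, horizon, rng=None):
    """Uniform exploration for k steps, then commit to the greedy arm.

    `pull(a)` is an assumed callable returning a sampled reward for arm `a`.
    """
    rng = rng or np.random.default_rng()
    counts = np.zeros(n_actions)
    sums = np.zeros(n_actions)
    committed = None
    rewards = []

    for t in range(horizon):
        if t < k:
            a = int(rng.integers(n_actions))  # exploration phase: uniform random
        else:
            if committed is None:
                # commit once to the empirically best arm, then never reconsider
                committed = int(np.argmax(sums / np.maximum(counts, 1)))
            a = committed
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        rewards.append(r)
    return rewards
```

For instance, on a hypothetical 3-armed Gaussian bandit one could call `explore_then_commit(lambda a: np.random.normal([0.1, 0.5, 0.9][a]), n_actions=3, k=30, horizon=1000)`.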