Houjun Liu

Option (MDP)

an Option (MDP) represents a high-level collection of actions. Big Picture: abstract your one big policy away into \(n\) small policies (options), and value-iterate over the expected values of those options.

Markov Option

A Markov Option is given by a triple \((I, \pi, \beta)\) (a code sketch follows the list):

  • \(I \subset S\), the states from which the option may be started
  • \(\pi : S \times A \to [0,1]\), the policy followed while the option is executing
  • \(\beta(s)\), the probability of the option terminating at state \(s\)
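
As a concrete data structure, the triple can be written down directly. This is a minimal sketch assuming a small discrete MDP; the names (`MarkovOption`, `initiation`, `policy`, `termination`) are illustrative and not from any particular library:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set

State = int
Action = int

@dataclass
class MarkovOption:
    """A Markov option (I, pi, beta) over a discrete MDP (illustrative sketch)."""
    initiation: Set[State]                    # I: states from which the option may be started
    policy: Dict[State, Dict[Action, float]]  # pi(s, a): action distribution while the option runs
    termination: Callable[[State], float]     # beta(s): probability of terminating in state s

    def can_start(self, s: State) -> bool:
        return s in self.initiation
```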

one-step options

You can treat each primitive action \(a\) as a one-step option, which terminates immediately after that one action (a code sketch follows the list):

  • \(I = \{s:a \in A_{s}\}\)
  • \(\pi(s,a) = 1\)
  • \(\beta(s) = 1\)
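
Using the hypothetical `MarkovOption` sketch above, wrapping a primitive action as a one-step option looks like this; the `available` mapping from states to legal actions is assumed for illustration:

```python
def primitive_as_option(a: Action, available: Dict[State, Set[Action]]) -> MarkovOption:
    """Wrap primitive action a as a one-step option:
    I = {s : a in A_s},  pi(s, a) = 1,  beta(s) = 1."""
    initiation = {s for s, actions in available.items() if a in actions}
    return MarkovOption(
        initiation=initiation,
        policy={s: {a: 1.0} for s in initiation},  # deterministically take a
        termination=lambda s: 1.0,                 # always terminate after one step
    )
```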

option value function

\begin{equation} Q^{\mu}(s,o) = \mathbb{E}\qty[r_{t} + \gamma r_{t+1} + \dots] \end{equation}

where \(\mu\) is some option-selection process (a policy over options)
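
Since an option runs for a random number of steps \(\tau\) before terminating, this expectation can be split at the termination point; the decomposition below is the standard one from the options framework, with \(V^{\mu}\) denoting the value of the termination state under \(\mu\):

\begin{equation} Q^{\mu}(s,o) = \mathbb{E}\qty[r_{t} + \gamma r_{t+1} + \dots + \gamma^{\tau - 1} r_{t+\tau-1} + \gamma^{\tau} V^{\mu}(s_{t+\tau})] \end{equation}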

semi-markov decision process

a semi-Markov decision process is a decision process over a set of options: the time an option takes to run becomes part of the transition, while the dynamics within each option are still an MDP.

\begin{equation} T(s', \tau | s,o) \end{equation}

where \(\tau\) is the time elapsed while the option runs.

because option-level termination induces multi-step jumps between states, a single backup can propagate value information across many states.
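
Concretely, the usual SMDP-style backup (the standard form from the options literature, assumed here rather than derived above) bootstraps from the state \(s'\) reached when the option terminates, discounting by the elapsed time \(\tau\):

\begin{equation} Q(s,o) \leftarrow Q(s,o) + \alpha \qty[r + \gamma^{\tau} \max_{o' \in O} Q(s', o') - Q(s,o)] \end{equation}

where \(r\) is the discounted reward accumulated while the option ran; since \(s'\) can be many primitive steps away from \(s\), one such backup covers that whole jump.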

intra-option Q-learning

\begin{equation} Q_{k+1} (s_{t},o) = (1-\alpha_{k})Q_{k}(s_{t}, o) + \alpha_{k} \qty(r_{t+1} + \gamma U_{k}(s_{t+1}, o)) \end{equation}

where:

\begin{equation} U_{k}(s,o) = (1-\beta(s))Q_{k}(s,o) + \beta(s) \max_{o' \in O} Q_{k}(s,o') \end{equation}
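
A minimal sketch of this update as code, assuming a tabular \(Q\) stored as a dict keyed by (state, option) and a per-option termination function; all names are illustrative:

```python
def intra_option_q_update(Q, s, o, r, s_next, options, beta, alpha, gamma):
    """One backup of the intra-option Q-learning rule above for the pair (s, o)."""
    # U(s', o): keep running o with prob. 1 - beta_o(s'), otherwise switch to the best option
    continue_value = Q.get((s_next, o), 0.0)
    switch_value = max(Q.get((s_next, other), 0.0) for other in options)
    b = beta[o](s_next)
    U = (1.0 - b) * continue_value + b * switch_value
    # Q_{k+1}(s, o) = (1 - alpha) Q_k(s, o) + alpha (r + gamma * U)
    Q[(s, o)] = (1.0 - alpha) * Q.get((s, o), 0.0) + alpha * (r + gamma * U)
```

In the standard intra-option setup, this backup is applied not only to the executing option but to every option whose policy is consistent with the action just taken, which is what lets one primitive transition improve many option values at once.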