Big picture: **combining off-line and on-line approaches** may be the best way to tackle large POMDPs.

Try planning:

- only where we are
- only where we can reach

Take into account three factors:

- uncertainty in the value function
- reachability from the current belief
- actions that are likely optimal

The approach allows policy improvement on top of any base policy.

## Setup

Discrete POMDPs:

- \(L\): lower bound on the value function
- \(U\): upper bound on the value function
- \(b_0\): current belief
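The setup above can be held in a small search-tree node. This is a minimal sketch, not the paper's data structure; the field names (`belief`, `L`, `U`, `depth`, `children`) and the example bound values are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class BeliefNode:
    """One node of the search tree rooted at the current belief b0."""
    belief: tuple            # discrete belief: P(s) for each state s
    L: float                 # lower bound on V*(b), e.g. from some base policy
    U: float                 # upper bound on V*(b)
    depth: int = 0           # depth in the tree (b0 has depth 0)
    children: dict = field(default_factory=dict)  # (action, obs) -> BeliefNode

    def gap(self) -> float:
        """Bound uncertainty epsilon(b) = U(b) - L(b)."""
        return self.U - self.L

# hypothetical two-state example
root = BeliefNode(belief=(0.5, 0.5), L=1.0, U=4.0)
```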

The algorithm has two main phases:

### Planning Phase

- at each iteration, score the fringe belief nodes (using the scheme below) and choose the best one to expand
- expand the chosen node
- propagate the updated bounds back up the tree with POMDP Bellman backups
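The planning loop above can be sketched as an anytime select-expand-backup cycle. This is a sketch under assumptions: `select_best_node`, `expand`, and `backup` are hypothetical helpers, and nodes are assumed to carry a `parent` pointer for the upward propagation:

```python
import time

def plan(root, select_best_node, expand, backup, time_budget=1.0):
    """Anytime planning: expand the most promising fringe node, then
    back the updated bounds up to the root, until time runs out."""
    deadline = time.monotonic() + time_budget
    while time.monotonic() < deadline:
        node = select_best_node(root)   # scored by the heuristic E(b) below
        if node is None:                # tree fully expanded: nothing left
            break
        expand(node)                    # add children for each (a, o) pair
        while node is not None:         # propagate bounds via Bellman backups
            backup(node)
            node = node.parent
    return root
```

Because the loop is anytime, execution can interrupt it whenever the agent must act.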

#### Best Node Selection

We select the best node by three metrics:

- Uncertainty: \(\epsilon(b) = U(b)-L(b)\), the gap between the upper and lower bound; we expand where this gap is large so that planning shrinks it
- Optimality of actions:
  - AEMS1: \(p(a|b) = \frac{U(a,b)-L(b)}{U(a^{*}, b)-L(b)}\) (“what’s the relative optimality of this action, compared to the best action?”)
  - AEMS2: \(p(a|b) = 1\) if \(a=a^{*}\), \(0\) otherwise (“just take the best action”), where \(a^{*} = \arg\max_a U(a,b)\)

- Reachability: \(P(b) = \prod_{i=0}^{d} P(o^{(i)} \mid b^{(i)}, a^{(i)})\, p(a^{(i)} \mid b^{(i)})\), where \(p(a|b)\) is either AEMS1 or AEMS2 above, and the product runs over the action–observation path from the root to \(b\)
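The two action-weighting schemes can be sketched directly from their formulas. The helper name `action_weights` and its dictionary-based signature are assumptions; `Q_U[a]` plays the role of \(U(a,b)\) and `L_b` the role of \(L(b)\):

```python
def action_weights(Q_U, L_b, variant="AEMS2"):
    """Heuristic probability p(a|b) that action a is optimal at belief b.

    Q_U: dict mapping each action a to its upper bound U(a, b).
    L_b: lower bound L(b) at the belief.
    """
    a_star = max(Q_U, key=Q_U.get)  # action with the highest upper bound
    if variant == "AEMS2":
        # all weight on the best action
        return {a: (1.0 if a == a_star else 0.0) for a in Q_U}
    # AEMS1: relative optimality of each action against the best one
    denom = Q_U[a_star] - L_b
    return {a: (Q_U[a] - L_b) / denom for a in Q_U}
```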

Combining the metrics gives:

\begin{equation} E(b) = \gamma^{d} P(b) \epsilon(b) \end{equation}

where \(d\) is the depth of \(b\) in the tree, so deeper (more heavily discounted) nodes contribute less expected error.

### Execution

- execute the best action at \(b_0\)
- perceive the new observation \(o\)
- update the belief: \(b_0 \leftarrow update(b_0, a, o)\)
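The execution phase interleaves with planning in a simple loop. A sketch under assumptions: `plan`, `best_action`, `env_step`, and `update` are hypothetical helpers standing in for the anytime search, action extraction, environment interaction, and Bayesian belief update:

```python
def act_loop(b0, plan, best_action, env_step, update, n_steps=10):
    """Interleave planning and execution from the current belief b0."""
    for _ in range(n_steps):
        tree = plan(b0)           # anytime search rooted at the current belief
        a = best_action(tree)     # e.g. the action maximizing the lower bound
        o = env_step(a)           # execute a, perceive observation o
        b0 = update(b0, a, o)     # Bayesian belief update
    return b0
```

Because replanning restarts from the updated \(b_0\), the search effort always concentrates on beliefs that are actually reachable from where the agent is now.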