Online POMDP Methods

These are basically MDP methods but tweaked. We make some changes:

for everywhere that we need a state, we use a belief
to sample the next state given an action (random next step), we call our generative model to get a new observation, and call update(b,a,o) with our filter to propegate our belief forward
if we need an action-value, we use the one-step lookahead in POMDP:

\begin{equation} Q(b,a) = R(b,a)+\gamma \qty(\sum_{o}^{}P(o|b,a) U^{\Gamma}(update(b,a,o))) \end{equation}

where,

\begin{equation} R(b,a) = \sum_{s}^{} R(s,a)b(s) \end{equation}

\begin{align} P(o|b,a) &= \sum_{s}^{} p(o|s,a) b(s) \\ &= \sum_{s}^{} \sum_{s’}^{} T(s’|s,a) O(o|s’,a) b(s) \end{align}

and where, if needed (i.e. most algorithms estimate this):

\begin{equation} U^{\Gamma}(b) = \max_{\alpha \in \Gamma} \alpha^{\top} b \end{equation}

we also revise our generative model:

each step requires belief and action, and we sample from our belief a next state, propegate belief forward, and use a traditional generative model to get the rewards and next states (which we don’t use).

Receeding Horizon: plan to a depth \(d\), select action, replan
Rollout with Lookahead: simple to implement, no grantees of optimality or even boundedness
Forward Search: quite expensive—exponential given the size of horizon
monte-carlo tree search, but instead our counts are stored not in terms of states (which we don’t know), but sequences of action observations: \(h = a_1o_2a_2o_1a_2o_1\) etc. Then, the counter takes \(N(h,a)\) as input: will head towards optimality and it requires a generative model to sample tracks
Branch and Bound, but you use the POMDP Approximation methods to estimate the upper and lower bounds of your utility: its Forward Search with pruning
Sparse Sampling: its Forward Search, but the next action-value is determined by a finite sampling of next observations and rewards and you average their future utility. that is, the action-value before depth \(d\) is obtained by: \(Q(b,a) = \frac{1}{m} \sum_{i=1}^{m} \qty(r_{a}^{(i)}+\gammaU_{d-1}(Update(b,a,o_{a}^{(i)})))\)