HSVI

Improving PBVI without sacrificing quality.

Initialization

We first initialize HSVI with a set of alpha vectors $\Gamma$, representing the lower-bound, and a list of tuples of $(b, U(b))$ named $\Upsilon$, representing the upper-bound. We call the value functions they generate as $\bar{V}$ and $\underline V$.

Lower Bound

Set of alpha vectors: best-action worst-state (HSVI1), blind lower bound (HSVI2)

Calculating $\underline{V}(b)$

\begin{equation} \underline{V}_{\Gamma} = \max_{\alpha} \alpha^{\top}b \end{equation}

Upper Bound

Fast Informed Bound

solving fully-observable MDP
Project $b$ into the point-set
Projected the upper bound onto a convex hull (HSVI2: via approximate convex hull projection)

Calculating $\bar{V}(b)$

Recall that though the lower-bound is given by alpha vectors, the upper bound is given in terms of a series of tuples $(b, U(b)) \in \Upsilon$.

HSVI1: we figure the upper bound for any given $b$ by projecting onto the convex hull formed by points on $\Upsilon$
HSVI2: approximate linear projection

Update

Begin with state $b = b_0$.

Repeat:

at every step, we perform a local update for upper and lower bound using the current $b$

the lower bound is updated using PBVI Backup on $b, \Gamma$
the upper bound is updated using POMDP Bellman Update on $b, \Upsilon$, putting the new $(b, u(b))$ in the set $\Upsilon$.

Then, we update our belief via the usual:

\begin{equation} b \leftarrow update(b, a^{*}, o^{*}) \end{equation}

where $a^{*}$ and $o^{*}$ are determined by…

IE-MAX Heuristic

IE-MAX Heuristic is used to determine $a^{*}$, whereby we choose the action such that:

\begin{equation} a^{*} = \arg\max_{a}Q^{(\bar{V})}(b) \end{equation}

yes, we choose the next action which maximizes the upper bound of the utility we can get.

weighted excess uncertainty

weighted excess uncertainty is used to determine $o^{*}$. Suppose we are depth $d$ loops in the search tree (i.e. this is our $d$th chain), we define:

\begin{equation} \text{excess}(b,t) = (\bar{V}(b)-\underline{V}(b)) - \epsilon \gamma^{-t} \end{equation}

“how far away are we from converging to a value uncertainty of no more than $\epsilon$, given we are depth $t$ in?

and, we choose the observation $o^{*}$ such that:

\begin{equation} o^{*} = \arg\max_{o} \qty[p(o|b,a^{*}) \text{excess}(update(b,a,o), t+1)] \end{equation}

where,

\begin{align} P(o|b,a) &= \sum_{s}^{} p(o|s,a) b(s) \\ &= \sum_{s}^{} \sum_{s’}^{} T(s’|s,a) O(o|s’,a) b(s) \end{align}