Houjun Liu

HSVI

Improving PBVI without sacrificing quality.

Initialization

We first initialize HSVI with a set of alpha vectors \(\Gamma\), representing the lower bound, and a list of tuples \((b, U(b))\) named \(\Upsilon\), representing the upper bound. We call the value functions they generate \(\underline{V}\) and \(\bar{V}\), respectively.

Lower Bound

Set of alpha vectors: best-action worst-state (HSVI1), blind lower bound (HSVI2)
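As a rough illustration of the best-action worst-state idea: always take the single action whose worst-case immediate reward is largest, and assume that worst case happens forever. The reward table `R[s][a]`, the function name, and the toy numbers are all assumptions made for this sketch, not part of the original notes.

```python
def best_action_worst_state(R, gamma):
    """Return one alpha vector that lower-bounds the value everywhere.

    R[s][a] is a hypothetical reward table; gamma is the discount factor.
    """
    n_states = len(R)
    n_actions = len(R[0])
    # Worst immediate reward each action can yield, over all states.
    worst_per_action = [
        min(R[s][a] for s in range(n_states)) for a in range(n_actions)
    ]
    # Repeat the best of those worst cases forever (geometric series).
    value = max(worst_per_action) / (1.0 - gamma)
    return [value] * n_states


# Toy problem: 2 states x 2 actions (made-up numbers).
R = [[-1.0, 1.0],
     [2.0, 0.5]]
alpha = best_action_worst_state(R, gamma=0.5)
print(alpha)
```

The result is a single flat alpha vector, so \(\Gamma\) starts with one element and is refined by later backups.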

Calculating \(\underline{V}(b)\)

\begin{equation} \underline{V}_{\Gamma}(b) = \max_{\alpha \in \Gamma} \alpha^{\top}b \end{equation}
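A minimal sketch of this evaluation, assuming beliefs and alpha vectors are stored as plain Python lists indexed by state (the names `lower_bound`, `Gamma`, and the numbers are illustrative only):

```python
def lower_bound(Gamma, b):
    """Evaluate the lower bound at belief b: the best dot product over Gamma."""
    return max(
        sum(alpha_s * b_s for alpha_s, b_s in zip(alpha, b))
        for alpha in Gamma
    )


# Three made-up alpha vectors over two states.
Gamma = [[1.0, 0.0], [0.0, 2.0], [0.75, 0.75]]
v_mid = lower_bound(Gamma, [0.5, 0.5])    # dot products: 0.5, 1.0, 0.75
v_corner = lower_bound(Gamma, [1.0, 0.0]) # dot products: 1.0, 0.0, 0.75
print(v_mid, v_corner)
```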

Upper Bound

Fast Informed Bound

  • Solve the fully observable MDP
  • Project \(b\) into the point-set
  • Project the upper bound onto a convex hull (HSVI2: via approximate convex hull projection)

Calculating \(\bar{V}(b)\)

Recall that though the lower-bound is given by alpha vectors, the upper bound is given in terms of a series of tuples \((b, U(b)) \in \Upsilon\).

  • HSVI1: we compute the upper bound for any given \(b\) by projecting onto the convex hull formed by the points in \(\Upsilon\)
  • HSVI2: approximate linear projection
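One approximation commonly used for this step is the sawtooth interpolation. The sketch below assumes that form, and it assumes \(\Upsilon\) is split into corner values (the upper bound at each belief concentrated on a single state) plus a list of interior points \((b_i, u_i)\). The names and numbers are hypothetical, and this is not necessarily the exact procedure the notes have in mind.

```python
def sawtooth_upper_bound(corner_values, points, b):
    """Approximate the upper bound at belief b.

    corner_values[s]: upper bound at the belief concentrated on state s.
    points: list of (b_i, u_i) interior points taken from Upsilon.
    """
    # Linear interpolation of the corner values at b.
    base = sum(v * p for v, p in zip(corner_values, b))
    best = base
    for b_i, u_i in points:
        base_i = sum(v * p for v, p in zip(corner_values, b_i))
        # Largest weight phi such that b - phi * b_i stays non-negative.
        phi = min(b[s] / b_i[s] for s in range(len(b)) if b_i[s] > 0.0)
        # Pull the interpolated value down toward the stored point.
        best = min(best, base + phi * (u_i - base_i))
    return best


corner_values = [10.0, 10.0]
points = [([0.5, 0.5], 6.0)]
u_mid = sawtooth_upper_bound(corner_values, points, [0.5, 0.5])
u_quarter = sawtooth_upper_bound(corner_values, points, [0.25, 0.75])
print(u_mid, u_quarter)
```

Each stored point is considered on its own, so the cost is linear in the number of points, whereas an exact convex hull projection requires solving a linear program.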

Update

Begin with belief \(b = b_0\).

Repeat:

At every step, we perform a local update of the upper and lower bounds at the current \(b\):

  • the lower bound is updated using PBVI Backup on \(b, \Gamma\)
  • the upper bound is updated using POMDP Bellman Update on \(b, \Upsilon\), putting the new \((b, U(b))\) in the set \(\Upsilon\).

Then, we update our belief via the usual:

\begin{equation} b \leftarrow update(b, a^{*}, o^{*}) \end{equation}
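For reference, a small sketch of the standard discrete belief update, \(b'(s') \propto O(o|s',a) \sum_{s} T(s'|s,a) b(s)\). The table layouts `T[s][a][sp]` and `O[a][sp][o]` and the toy model are assumptions made for this example.

```python
def update(b, a, o, T, O):
    """Bayesian belief update for a discrete POMDP.

    T[s][a][sp] = T(sp | s, a), O[a][sp][o] = O(o | sp, a)  (assumed layouts).
    """
    n = len(b)
    unnormalized = [
        O[a][sp][o] * sum(T[s][a][sp] * b[s] for s in range(n))
        for sp in range(n)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return [x / total for x in unnormalized]


# Toy model: 2 states, 1 action that leaves the state unchanged,
# and a sensor that reports the true state 75% of the time.
T = [[[1.0, 0.0]], [[0.0, 1.0]]]
O = [[[0.75, 0.25], [0.25, 0.75]]]
b_next = update([0.5, 0.5], a=0, o=0, T=T, O=O)
print(b_next)
```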

where \(a^{*}\) and \(o^{*}\) are determined by…

IE-MAX Heuristic

IE-MAX Heuristic is used to determine \(a^{*}\), whereby we choose the action such that:

\begin{equation} a^{*} = \arg\max_{a}Q^{(\bar{V})}(b, a) \end{equation}

That is, we choose the next action that maximizes the upper bound on the utility we can get.
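A sketch of this action choice, assuming the usual one-step lookahead \(Q^{(\bar{V})}(b,a) = \sum_{s} R(s,a) b(s) + \gamma \sum_{o} P(o|b,a) \bar{V}(update(b,a,o))\). The upper bound is passed in as a callable `V_upper`; that callable, the table layouts, and the toy model are all hypothetical stand-ins.

```python
def q_value(b, a, V, T, O, R, gamma):
    """One-step lookahead value of action a at belief b under value function V."""
    n, n_obs = len(b), len(O[a][0])
    q = sum(R[s][a] * b[s] for s in range(n))  # expected immediate reward
    for o in range(n_obs):
        # Unnormalized next belief; its sum is P(o | b, a).
        unnorm = [
            O[a][sp][o] * sum(T[s][a][sp] * b[s] for s in range(n))
            for sp in range(n)
        ]
        p_o = sum(unnorm)
        if p_o > 0.0:
            q += gamma * p_o * V([x / p_o for x in unnorm])
    return q


def ie_max_action(b, V_upper, T, O, R, gamma):
    """Pick the action with the largest upper-bound Q-value."""
    return max(
        range(len(R[0])),
        key=lambda a: q_value(b, a, V_upper, T, O, R, gamma),
    )


# Toy model: 2 states, 2 actions, static state, uninformative observations.
T = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
O = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
R = [[1.0, 0.0], [0.0, 3.0]]


def V_upper(b):
    """Placeholder upper bound: a constant."""
    return 10.0


a_star = ie_max_action([0.5, 0.5], V_upper, T, O, R, gamma=0.5)
print(a_star)
```

With a constant upper bound, the choice here is driven purely by expected immediate reward (0.5 for action 0 versus 1.5 for action 1), so action 1 is selected.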

Weighted Excess Uncertainty

Weighted excess uncertainty is used to determine \(o^{*}\). Suppose we are at depth \(t\) in the search tree (i.e. this is our \(t\)th step along the current chain). We define:

\begin{equation} \text{excess}(b,t) = (\bar{V}(b)-\underline{V}(b)) - \epsilon \gamma^{-t} \end{equation}

“How far away are we from converging to a value uncertainty of no more than \(\epsilon\), given we are depth \(t\) in?”

We then choose the observation \(o^{*}\) such that:

\begin{equation} o^{*} = \arg\max_{o} \qty[P(o|b,a^{*}) \text{excess}(update(b,a^{*},o), t+1)] \end{equation}

where,

\begin{align} P(o|b,a) &= \sum_{s}^{} P(o|s,a) b(s) \\ &= \sum_{s}^{} \sum_{s'}^{} T(s'|s,a) O(o|s',a) b(s) \end{align}
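Putting the pieces of this heuristic together, here is a rough sketch of the observation choice. The bounds are passed in as callables `V_upper` and `V_lower`, and the table layouts `T[s][a][sp]`, `O[a][sp][o]` and the toy numbers are assumptions for illustration rather than anything fixed by the notes.

```python
def obs_prob(b, a, o, T, O):
    """P(o | b, a) = sum_s sum_sp T(sp | s, a) O(o | sp, a) b(s)."""
    n = len(b)
    return sum(
        O[a][sp][o] * T[s][a][sp] * b[s] for s in range(n) for sp in range(n)
    )


def update(b, a, o, T, O):
    """Belief update; assumes P(o | b, a) > 0."""
    n = len(b)
    unnorm = [
        O[a][sp][o] * sum(T[s][a][sp] * b[s] for s in range(n))
        for sp in range(n)
    ]
    total = sum(unnorm)
    return [x / total for x in unnorm]


def excess(b, t, V_upper, V_lower, epsilon, gamma):
    """Bound gap at b beyond what is tolerated at depth t."""
    return (V_upper(b) - V_lower(b)) - epsilon * gamma ** (-t)


def select_observation(b, a_star, t, V_upper, V_lower, T, O, epsilon, gamma):
    """Return the observation with the largest probability-weighted excess."""
    best_o, best_score = None, float("-inf")
    for o in range(len(O[a_star][0])):
        p_o = obs_prob(b, a_star, o, T, O)
        if p_o == 0.0:
            continue  # impossible observation: no belief to descend into
        b_next = update(b, a_star, o, T, O)
        score = p_o * excess(b_next, t + 1, V_upper, V_lower, epsilon, gamma)
        if score > best_score:
            best_o, best_score = o, score
    return best_o


# Toy model: 2 states, 1 action, static state, 75%-accurate sensor.
T = [[[1.0, 0.0]], [[0.0, 1.0]]]
O = [[[0.75, 0.25], [0.25, 0.75]]]


def V_upper(b):
    """Placeholder upper bound: a constant."""
    return 10.0


def V_lower(b):
    """Placeholder lower bound: tight only when state 0 is certain."""
    return 10.0 * b[0]


b = [0.75, 0.25]
p0 = obs_prob(b, 0, 0, T, O)
o_star = select_observation(b, 0, 0, V_upper, V_lower, T, O,
                            epsilon=1.0, gamma=0.5)
print(p0, o_star)
```

In this toy case observation 0 is more likely, but it leads to a belief where the bounds are already nearly closed, so the less likely observation 1 wins once probability is weighted by excess.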