Houjun Liu

Point-Based Value Iteration

we keep track of a set of alpha vectors and a set of belief samples (which we get from point selection):

\begin{equation} \Gamma = \{\alpha_{1}, \dots, \alpha_{m}\} \end{equation}

and

\begin{equation} B = \{b_1, \dots, b_{m}\} \end{equation}

To preserve the lower-bound property of these alpha vectors, one should seed them with something like the blind lower bound.
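
One common seeding (a sketch of the blind lower bound as it is usually stated, so treat the exact form as an assumption): keep one alpha vector per action and iterate the value of blindly committing to that action forever,

\begin{equation} \alpha_{a}^{(k+1)}(s) = R(s,a) + \gamma \sum_{s'}^{} T(s'|s,a) \alpha_{a}^{(k)}(s') \end{equation}

which lower-bounds the optimal value, since always taking a single fixed action can do no better than acting optimally.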

We can estimate our utility at any belief by taking the best alpha vector in the set at that belief:

\begin{equation} U^{\Gamma}(b) = \max_{\alpha \in \Gamma} \alpha^{\top}b \end{equation}
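
A minimal sketch of this lookup, assuming alpha vectors and beliefs are plain NumPy arrays over the same state ordering and \(\Gamma\) is a Python list:

import numpy as np

def utility(Gamma, b):
    # U^Gamma(b): evaluate every alpha vector at b and keep the best lower bound
    return max(alpha @ b for alpha in Gamma)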

We now define a function named backup (see PBVI Backup), and call it on all of our beliefs to generate a new set of alpha vectors:

\begin{equation} \Gamma^{t+1} = \{backup(\Gamma, b) | b \in B\} \end{equation}

where:

\begin{equation} \alpha \leftarrow backup(\Gamma, b) \end{equation}

that is, we call backup once on each \(b \in B\), and each call produces one new alpha vector.
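
A minimal sketch of one such iteration, assuming a backup helper with the signature sketched in the PBVI Backup section below (the R, T, O array conventions there are assumptions as well):

def pbvi_step(Gamma, B, R, T, O, gamma):
    # Gamma^{t+1}: one backup per sampled belief
    return [backup(Gamma, b, R, T, O, gamma) for b in B]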

PBVI Backup

the backup procedure takes \(\Gamma\) and \(b\) as input.

we mint a single new alpha vector by building one candidate alpha vector per action and selecting the one with the highest value at \(b\):

\begin{equation} \alpha = \arg\max_{\alpha_{a}} \alpha_{a}^{\top} b \end{equation}

now, we define each \(\alpha_{a}\) as:

\begin{equation} \alpha_{a}(s) = R(s,a) + \gamma \sum_{s',o}^{} O(o|a,s')T(s'|s,a)\alpha_{a,o}(s') \end{equation}

where, for each action-observation pair \(a,o\), we obtain \(\alpha_{a,o}\) by selecting the vector in \(\Gamma\) that currently provides the highest value estimate at the updated belief:

\begin{equation} \alpha_{a,o} = \arg\max_{\alpha \in \Gamma} \alpha^{\top} update(b,a,o) \end{equation}
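
A minimal sketch of this backup, under assumed NumPy conventions: R has shape (S, A); T has shape (A, S, S) with T[a][s, s'] = T(s'|s,a); O has shape (A, S, Z) with O[a, s', o] = O(o|a,s'); and update is a hypothetical helper implementing the standard discrete belief update.

import numpy as np

def update(b, a, o, T, O):
    # belief update: b'(s') is proportional to O(o|a,s') * sum_s T(s'|s,a) b(s)
    bp = O[a, :, o] * (T[a].T @ b)
    total = bp.sum()
    return bp / total if total > 0 else bp

def backup(Gamma, b, R, T, O, gamma):
    # one PBVI backup at belief b: returns a single new alpha vector
    A, Z = R.shape[1], O.shape[2]
    candidates = []
    for a in range(A):
        alpha_a = R[:, a].astype(float)
        for o in range(Z):
            bp = update(b, a, o, T, O)
            # alpha_{a,o}: best existing vector at the updated belief
            alpha_ao = max(Gamma, key=lambda alpha: alpha @ bp)
            # add gamma * sum_{s'} O(o|a,s') T(s'|s,a) alpha_{a,o}(s') for every state s
            alpha_a = alpha_a + gamma * (T[a] @ (O[a, :, o] * alpha_ao))
        candidates.append(alpha_a)
    # keep the candidate alpha vector with the highest value at b
    return max(candidates, key=lambda alpha: alpha @ b)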

Randomized PBVI

see Perseus