Houjun Liu

value iteration

We apply the Bellman Expectation Equation, selecting the utility obtained by taking the optimal action with respect to the current utility estimate:

\begin{equation} U_{k+1}(s) = \max_{a} \qty(R(s,a) + \gamma \sum_{s'} T(s' | s,a) U_{k}(s')) \end{equation}

This iterative process is called the Bellman backup, or Bellman update.

\begin{equation} U_1 \dots U_{k} \dots U^{*} \end{equation}

eventually converges to the optimal value function. Once it has, we simply extract the greedy policy from the resulting utility to get a policy to use.
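As a minimal sketch of this loop (the array layout, with a reward matrix R[s, a] and a transition tensor T[s, a, s'] as NumPy arrays, is an assumption of mine, and a fixed iteration count stands in for the stopping test discussed next):

  import numpy as np

  def value_iteration(R, T, gamma, iters=100):
      # R: (S, A) reward matrix; T: (S, A, S') transition probabilities
      S = R.shape[0]
      U = np.zeros(S)
      for _ in range(iters):
          # Bellman backup: U_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' T(s'|s,a) U_k(s') ]
          U = (R + gamma * T @ U).max(axis=1)
      # greedy policy extraction from the final utility estimate
      policy = (R + gamma * T @ U).argmax(axis=1)
      return U, policy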

We stop when the Bellman Residual drops below the desired error threshold:

Bellman Residual

Take the \(L_{\infty}\) norm of \(U_{k+1}-U_{k}\) (that is, take \(||U_{k+1} - U_{k}||_{\infty}\)). We call that the Bellman Residual. If this Bellman Residual drops below \(\delta\), it can be shown that the error between \(U^{*}\) (the converged, optimal utility) and \(U_{k}\) is at most:

\begin{equation} \epsilon = \frac{\delta \gamma}{(1-\gamma)} \end{equation}

So as long as the Bellman Residual between two successive updates is \(\leq \delta\), you know that you are at most \(\epsilon\) away from the optimal utility.

Note that as the future discount \(\gamma \to 1\), this error bound becomes much larger: you need more iterations (a smaller residual) to guarantee the same \(\epsilon\).
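For a concrete sense of scale, take \(\delta = 0.01\): at \(\gamma = 0.9\) the bound gives \(\epsilon = \frac{0.01 \times 0.9}{0.1} = 0.09\), while at \(\gamma = 0.99\) it gives \(\epsilon = \frac{0.01 \times 0.99}{0.01} = 0.99\), roughly eleven times looser for the same residual.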

Notably, the policy loss of the greedy policy \(\pi\) extracted from a utility estimate within \(\epsilon\) of optimal is bounded by:

\begin{equation} || U^{\pi} - U^{*} || < \frac{2\epsilon \gamma}{1-\gamma} \end{equation}

asynchronous value iteration

We choose an ordering of the states, loop through the entire list updating the value function at each state, and then repeat these sweeps until the values converge.

That is, instead of storing the whole sequence of estimates \(U_{k}\), we keep only the current one in memory and update it in place:

\begin{equation} U(s) \leftarrow \max_{a} \qty(R(s,a) + \gamma \sum_{s'} T(s' | s,a) U(s')) \end{equation}

The idea: instead of keeping all of \(U_{k-1}\) around until \(U_{k}\) has been computed for every state, we sweep through the states in a fixed order and use whatever value was most recently computed for each state.
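A sketch of this in-place (Gauss-Seidel style) update, under the same assumed R/T array layout as the earlier sketch:

  import numpy as np

  def asynchronous_value_iteration(R, T, gamma, sweeps=100):
      # keep a single utility vector U and update it in place, state by state
      S, A = R.shape
      U = np.zeros(S)
      for _ in range(sweeps):
          for s in range(S):  # fixed ordering of states
              # later states in the same sweep already see the freshly updated values
              U[s] = max(R[s, a] + gamma * T[s, a] @ U for a in range(A))
      return U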

time complexity

\begin{equation} O(S^{2}A) \end{equation}

where \(S\) is the number of states and \(A\) the number of actions.

  1. loop over all states in each update
  2. loop over all actions to figure out the max
  3. loop over all next states to compute the expected utility (made explicit in the sketch below)
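Writing one synchronous sweep with the loops spelled out makes the \(O(S^{2}A)\) cost per iteration visible; this is just an unvectorized version of the backup sketched earlier, under the same assumed array layout:

  def bellman_sweep(U, R, T, gamma):
      # one synchronous Bellman backup over all states: O(S^2 A) work
      S, A = R.shape
      U_next = [0.0] * S
      for s in range(S):                  # 1. loop over all states
          best = float("-inf")
          for a in range(A):              # 2. loop over all actions
              total = R[s, a]
              for sp in range(S):         # 3. loop over all next states
                  total += gamma * T[s, a, sp] * U[sp]
              best = max(best, total)
          U_next[s] = best
      return U_next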

POMDP value-iteration

  1. compute alpha vectors for all one-step plans (i.e. conditional plans that take just one action and then give up)
  2. perform alpha vector pruning on any plans that are dominated
  3. generate all possible two-step conditional plans over all actions, using combinations of the non-pruned one-step plans above as subplans (yes, the same one-step plan can be reused for more than one observation; see the sketch after this list)
  4. repeat steps 2-3
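A sketch of steps 1-3, using simple pointwise dominance as a stand-in for full alpha vector pruning (exact pruning checks domination over all beliefs, e.g. with a linear program); the array layouts R[s, a], T[s, a, s'], and O[a, s', o] are assumptions of mine:

  import numpy as np
  from itertools import product

  def one_step_alphas(R):
      # step 1: one alpha vector per one-step plan, alpha_a(s) = R(s, a)
      return [R[:, a].astype(float) for a in range(R.shape[1])]

  def prune_pointwise(alphas):
      # step 2 (simplified): drop alpha vectors pointwise-dominated by another
      kept = []
      for i, a in enumerate(alphas):
          dominated = any(
              np.all(b >= a) and (np.any(b > a) or j < i)  # ties: keep the first copy
              for j, b in enumerate(alphas) if j != i
          )
          if not dominated:
              kept.append(a)
      return kept

  def expand_plans(alphas, R, T, O, gamma):
      # step 3: build every (k+1)-step plan: pick an action, then a surviving
      # subplan for each observation (the same subplan may be reused)
      A, n_obs = R.shape[1], O.shape[2]
      new = []
      for a in range(A):
          for choice in product(range(len(alphas)), repeat=n_obs):
              alpha = R[:, a].astype(float)
              for o, idx in enumerate(choice):
                  # gamma * sum_s' T(s'|s,a) O(o|s',a) alpha_sub(s')
                  alpha = alpha + gamma * T[:, a, :] @ (O[a, :, o] * alphas[idx])
              new.append(alpha)
      return new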

see also performing value-iteration naively with one-step lookahead in POMDP.

POMDP Bellman Update

Say you want to extract a policy out of a bunch of alpha vectors.

Let \(\alpha \in \Gamma\), where \(\Gamma\) is a set of alpha vectors; we obtain a new alpha vector \([U(s_0) \dots U(s_{n})]\), and with it a new utility \(U'(b)\), by:

\begin{equation} U'(b) = \max_{a}\qty[R(b,a)+\gamma \qty(\sum_{o}^{}P(o|b,a) U\qty(\text{Update}(b,a,o)))] \end{equation}

where \(\text{Update}(b,a,o)\) is the belief that results from taking action \(a\) in belief \(b\) and observing \(o\), and:

\begin{equation} R(b,a) = \sum_{s}^{} R(s,a)b(s) \end{equation}

\begin{align} P(o|b,a) &= \sum_{s}^{} p(o|s,a) b(s) \\ &= \sum_{s}^{} \sum_{s'}^{} T(s'|s,a) O(o|s',a) b(s) \end{align}

and

\begin{equation} U^{\Gamma}(b) = \max_{\alpha \in \Gamma} \alpha^{\top} b \end{equation}
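Putting these pieces together, a sketch of a single POMDP Bellman Update at a belief \(b\), under the same assumed array layouts; the belief update inside is the standard \(b'(s') \propto O(o|s',a) \sum_{s} T(s'|s,a) b(s)\):

  import numpy as np

  def utility(Gamma, b):
      # U^Gamma(b) = max_{alpha in Gamma} alpha^T b
      return max(alpha @ b for alpha in Gamma)

  def belief_update(b, a, o, T, O):
      # b'(s') proportional to O(o|s',a) * sum_s T(s'|s,a) b(s)
      bp = O[a, :, o] * (b @ T[:, a, :])
      return bp / bp.sum()

  def pomdp_bellman_update(Gamma, b, R, T, O, gamma):
      # one-step lookahead over actions and observations; returns (best action, value)
      A, n_obs = R.shape[1], O.shape[2]
      best_a, best_val = None, -np.inf
      for a in range(A):
          val = b @ R[:, a]                        # R(b, a) = sum_s R(s, a) b(s)
          for o in range(n_obs):
              p_o = (b @ T[:, a, :]) @ O[a, :, o]  # P(o | b, a)
              if p_o > 0:
                  val += gamma * p_o * utility(Gamma, belief_update(b, a, o, T, O))
          if val > best_val:
              best_a, best_val = a, val
      return best_a, best_val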