Our belief can be represented as a vector whose entries give the probability of us being in each state. Once we have that, we can just use our belief vector as our state vector. Now use any MDP solution method you'd like, keeping in mind that the reward is just the expected reward under the belief:

\begin{equation} \mathbb{E}[R(b,a)] = \sum_{s} R(s,a) b(s) \end{equation}
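As a minimal sketch of the expected-reward formula above, assume the reward is stored as a nested list `R[s][a]` and the belief as a list of probabilities. The two-state problem, the function name `expected_reward`, and all numbers are hypothetical, chosen only for illustration:

```python
def expected_reward(belief, R, action):
    """Expected immediate reward of taking `action` under `belief`.

    belief: list of probabilities, one per state (assumed to sum to 1).
    R:      R[s][a] is the reward for taking action a in state s.
    """
    return sum(R[s][action] * b_s for s, b_s in enumerate(belief))


# Hypothetical two-state, two-action problem (illustrative numbers only).
R = [[1.0, 0.0],
     [-1.0, 2.0]]
belief = [0.75, 0.25]

r0 = expected_reward(belief, R, 0)  # 0.75 * 1.0 + 0.25 * (-1.0)
r1 = expected_reward(belief, R, 1)  # 0.75 * 0.0 + 0.25 * 2.0
```

The belief simply acts as a set of weights over the per-state rewards, so the cost of one evaluation is linear in the number of states.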

We can write the transition between belief states like so:

\begin{align} T(b'|b,a) &= P(b'|b,a) \\ &= \sum_{o} P(b'|b,a,o) P(o|b,a) \\ &= \sum_{o} P(b' = \text{Update}(b,a,o)) \sum_{s'} O(o|a,s') \sum_{s} T(s'|s,a) b(s) \end{align}

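“The probability of the next belief being \(b'\) is equal to how probable it is to arrive at belief \(b'\) from conditions \(b, a, o\), times the probability of getting that particular observation, summed over all observations.” Since the belief update is deterministic, the first factor is 1 when \(\text{Update}(b,a,o) = b'\) and 0 otherwise.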
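The belief transition can be sketched in code by enumerating observations: each observation yields one successor belief, reached with probability \(P(o|b,a)\). This is only a sketch under assumptions not stated in the text: `update` is taken to be the standard discrete Bayes filter, the models are stored as nested lists `T[s][a][s_next]` and `O[a][s_next][o]`, and the small example problem is made up.

```python
def predicted_state_probs(b, a, T):
    """sum_s T(s' | s, a) b(s) for every next state s'."""
    n = len(b)
    return [sum(T[s][a][sp] * b[s] for s in range(n)) for sp in range(n)]


def observation_probability(b, a, o, T, O):
    """P(o | b, a) = sum_{s'} O(o | a, s') sum_s T(s' | s, a) b(s)."""
    pred = predicted_state_probs(b, a, T)
    return sum(O[a][sp][o] * pred[sp] for sp in range(len(b)))


def update(b, a, o, T, O):
    """Assumed Bayes filter: b'(s') is proportional to O(o|a,s') sum_s T(s'|s,a) b(s)."""
    pred = predicted_state_probs(b, a, T)
    unnormalized = [O[a][sp][o] * pred[sp] for sp in range(len(b))]
    total = sum(unnormalized)  # this normalizer equals P(o | b, a)
    return [u / total for u in unnormalized]


def belief_transitions(b, a, T, O, num_obs):
    """List of (next_belief, probability) pairs reachable from b under action a."""
    out = []
    for o in range(num_obs):
        p_o = observation_probability(b, a, o, T, O)
        if p_o > 0:  # skip observations that cannot occur
            out.append((update(b, a, o, T, O), p_o))
    return out


# Hypothetical model: 2 states, 1 action, 2 observations.
T = [[[0.5, 0.5]],      # T[s][a][s_next], from state 0
     [[0.0, 1.0]]]      # from state 1
O = [[[0.75, 0.25],     # O[a][s_next][o], landing in state 0
      [0.25, 0.75]]]    # landing in state 1
b = [0.5, 0.5]

transitions = belief_transitions(b, 0, T, O, num_obs=2)
```

If two different observations happen to produce the same successor belief, their probabilities should be added to obtain \(T(b'|b,a)\); the list above keeps them as separate entries for simplicity. Note the nested sums over \(s\) and \(s'\) for every observation, which is where the cost grows with the size of the state space.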

However, this expression is quite unwieldy if your state space is large. Hence, we turn to techniques like conditional plans, which forgo considering individual states altogether.