One alpha vector per action:

\begin{equation} \alpha^{(k+1)}_{a}(s) = R(s,a) + \gamma \sum_{s’}^{}T(s’|s,a) \max_{a’} \alpha^{(k)}_{a’} (s’) \end{equation}

This is going to give you a set of alpha vectors, one corresponding to each action.

time complexity: \(O(|S|^{2}|A|^{2})\)

you will note we don’t ever actually use anything partially-observable in this. Once we get the alpha vector, we need to use one-step lookahead in POMDP (which does use transitions) to actually turn this alpha vector into a policy, which then does create you

We can deal with continuous state space by using some estimation of the value function (instead of alpha-vectors, we will just use a value-function estimate like q learning)