Houjun Liu


A CPOMDP, or Constrained Partially Observable Markov Decision Process, gives the system two things to handle at once: a reward function \(R(s,a)\) to maximize and a set of constraints with costs \(C_{i}(s,a) \geq 0\) to keep within budget. Specifically, we formulate it as a POMDP \((S, A, \Omega, T, O, R)\) with an additional set of constraints \(\bold{C}\) and budgets \(\beta\).
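As a rough illustration of how the pieces fit together, the tuple plus constraints and budgets can be held in one small container. This is only a sketch: the class name `CPOMDP`, its field names, and the observation convention \(O(o \mid s)\) are assumptions made for this example, not part of the formulation above.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class CPOMDP:
    """Hypothetical container for (S, A, Omega, T, O, R) plus C and beta."""

    states: Sequence[str]                          # S
    actions: Sequence[str]                         # A
    observations: Sequence[str]                    # Omega
    transition: Callable[[str, str, str], float]   # T(s, a, s') = P(s' | s, a)
    observation: Callable[[str, str], float]       # assumed form O(s, o) = P(o | s)
    reward: Callable[[str, str], float]            # R(s, a)
    costs: Sequence[Callable[[str, str], float]]   # C: one cost function per constraint
    budgets: Sequence[float]                       # beta: one budget per constraint

    def __post_init__(self) -> None:
        # Every constraint C_i must be paired with exactly one budget beta_i.
        if len(self.costs) != len(self.budgets):
            raise ValueError("each constraint C_i needs exactly one budget beta_i")
```

Pairing `costs` and `budgets` by index mirrors the subscript \(i\) shared by \(C_{i}\) and \(\beta_{i}\).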

We seek to maximize the expected discounted infinite-horizon reward \(\mathbb{E}_{t} \qty[R(s_{t}, a_{t})]\), subject to:

\begin{equation} C_{i}(s,a) \leq \beta_{i}, \quad \forall\, C_{i} \in \bold{C},\ \beta_{i} \in \beta \end{equation}
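To make the constrained objective concrete, here is a minimal sketch that scores a fixed policy on a made-up two-state problem. It relies on several assumptions that are not in the text above: the constraint is read as a bound on the expected discounted cumulative cost (a common reading of CPOMDPs), observations are emitted from the current state, the policy is memoryless (it maps the latest observation directly to an action), and all names and numbers (`GAMMA`, `T`, `O`, `R`, `C0`, the budget of 5) are hypothetical.

```python
GAMMA = 0.9  # assumed discount factor
STATES = ["safe", "risky"]

# T[s][a] -> {s': probability}; "go" always leads to "risky", "stay" stays put.
T = {
    "safe": {"stay": {"safe": 1.0}, "go": {"risky": 1.0}},
    "risky": {"stay": {"risky": 1.0}, "go": {"risky": 1.0}},
}
# O[s] -> {o: probability}; noisy reading of the current state.
O = {
    "safe": {"see_safe": 0.8, "see_risky": 0.2},
    "risky": {"see_safe": 0.2, "see_risky": 0.8},
}


def R(s, a):
    """Reward: 1 for every 'go' action, 0 otherwise."""
    return 1.0 if a == "go" else 0.0


def C0(s, a):
    """Single constraint cost: 1 for every step spent in 'risky'."""
    return 1.0 if s == "risky" else 0.0


COSTS = [C0]     # C
BUDGETS = [5.0]  # beta


def discounted_value(signal, policy, sweeps=2000):
    """Expected discounted sum of signal(s, a) from each start state,
    under a memoryless policy given as a dict {observation: action}."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):  # repeated Bellman backups for the fixed policy
        new_V = {}
        for s in STATES:
            total = 0.0
            for o, p_o in O[s].items():
                a = policy[o]
                future = sum(p * V[s2] for s2, p in T[s][a].items())
                total += p_o * (signal(s, a) + GAMMA * future)
            new_V[s] = total
        V = new_V
    return V


def evaluate(policy, start="safe"):
    """Return (discounted reward, list of discounted costs, feasibility flag)."""
    reward = discounted_value(R, policy)[start]
    costs = [discounted_value(c, policy)[start] for c in COSTS]
    feasible = all(c <= b for c, b in zip(costs, BUDGETS))
    return reward, costs, feasible
```

In this toy problem, the policy that always chooses `go` earns the most reward but spends nearly every step in `risky`, so it exceeds the budget. The policy that always chooses `stay` satisfies the constraint and earns nothing. A CPOMDP solver searches for the best policy among those that remain within every budget.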