_index.org

POMCP

Last edited: August 8, 2025

Previous Monte-Carlo tree search methods were not competitive with offline solvers like PBVI, SARSOP, etc.; those offline solvers, however, suffer from the curse of history.

key point: Monte-Carlo rollouts + best-first tree search + an unweighted particle filter (instead of categorical beliefs)

Background

  • History: a trajectory of actions and observations \(h = \{a_1, o_1, \dots\}\)
  • generative model: from the current state, sample a possible next state (weighted by the action taken, i.e. an instantiation of \(s' \sim T(\cdot \mid s,a)\)) and a reward \(R(s,a)\)
  • Rollout: keep sampling at each step, rolling the trajectory forward and accumulating discounted future reward
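The rollout idea above can be sketched as follows. This is a minimal illustration, not the paper's code: `step(s, a)` is an assumed generative-model function returning `(s_next, obs, reward)`, and the rollout policy is uniform random.

```python
import random

def rollout(step, actions, s, depth, gamma=0.95):
    """Estimate discounted future reward by sampling a random trajectory.

    `step(s, a)` is an assumed generative model returning (s_next, obs, reward);
    only sampling access to the POMDP is needed, not explicit T/O/R tables.
    """
    if depth == 0:
        return 0.0
    a = random.choice(actions)        # rollout policy: uniform random
    s_next, _obs, r = step(s, a)      # one sampled transition
    return r + gamma * rollout(step, actions, s_next, depth - 1, gamma)
```

Note that only a simulator is required: this is what makes the approach applicable when the transition and observation tables are too large to write down.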

monte-carlo tree search

  1. loop:
    1. sample \(s\) from the belief distribution \(B(h)\) at the root node and call that the node state
    2. loop until we reach a leaf:
      1. sample an exploratory action using UCB1 via the belief
      2. get observation, reward, next state
    3. add a leaf node, plus a child node for each available action
    4. Rollout
    5. backpropagate the obtained value with discounts backwards via the POMDP Bellman backup

During runtime, we choose the action with the best estimated value, prune the tree given the observation received, and repeat the search from the new root.
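The loop above can be sketched as a small POMCP-style planner. A minimal sketch under stated assumptions: `step(s, a) -> (s_next, obs, reward)` is a generative model, observations are hashable, and histories are tuples used as tree keys; all names are illustrative, not the original implementation.

```python
import math
import random

def ucb1(node, child, c=1.0):
    # UCB1: exploit the running-mean value, explore rarely tried actions
    if child["n"] == 0:
        return float("inf")
    return child["v"] + c * math.sqrt(math.log(node["n"]) / child["n"])

class POMCP:
    def __init__(self, step, actions, gamma=0.95, depth=10):
        self.step, self.actions = step, actions
        self.gamma, self.depth = gamma, depth
        self.tree = {}  # history tuple -> node dict

    def search(self, belief, n_sims=100):
        for _ in range(n_sims):
            s = random.choice(belief)       # sample s from root particles B(h)
            self.simulate(s, (), self.depth)
        return max(self.actions, key=lambda a: self.tree[(a,)]["v"])

    def simulate(self, s, h, depth):
        if depth == 0:
            return 0.0
        if h not in self.tree:
            # add leaf node, plus a child node for each available action
            self.tree[h] = {"n": 0, "v": 0.0, "belief": []}
            for a in self.actions:
                self.tree[h + (a,)] = {"n": 0, "v": 0.0, "belief": []}
            return self.rollout(s, depth)
        node = self.tree[h]
        node["belief"].append(s)            # unweighted particle belief B(h)
        a = max(self.actions, key=lambda a: ucb1(node, self.tree[h + (a,)]))
        s2, o, r = self.step(s, a)
        q = r + self.gamma * self.simulate(s2, h + (a, o), depth - 1)
        node["n"] += 1
        child = self.tree[h + (a,)]
        child["n"] += 1
        child["v"] += (q - child["v"]) / child["n"]  # running-mean backup
        return q

    def rollout(self, s, depth):
        if depth == 0:
            return 0.0
        a = random.choice(self.actions)
        s2, _o, r = self.step(s, a)
        return r + self.gamma * self.rollout(s2, depth - 1)
```

After `search` returns the best root action, pruning corresponds to re-rooting the tree at the child history matching the taken action and received observation.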

POMCPOW


POMDPs with continuous actions are hard for POMCP (belief update + MCTS).

So instead, let's try improving that. Unlike plain POMCP, we maintain not only \(B(h)\) but also \(W(h)\), the weight of each sampled state (a weighted particle filter). Naively applying POMCP to continuous spaces gives an extremely wide tree, because each sampled state/observation will never match one seen before, so every history node is visited only once.

double progressive widening

We use sampling to limit how many observation children each action node gets. With unweighted particles this still converges to a suboptimal QMDP-like policy: each new belief node contains only a single state particle, so it represents no state uncertainty.
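The widening criterion can be sketched in a few lines. A minimal sketch, assuming hyperparameters `k` and `alpha` (hypothetical defaults here) and a `sample_new()` callback that draws a fresh action or observation from the continuous space; "double" progressive widening applies this same rule at both the action and the observation layer.

```python
import random

def dpw_select(children, n_visits, sample_new, k=1.0, alpha=0.5):
    """Progressive widening: only admit a new child while the child count
    is below k * N^alpha, where N is the node's visit count. Otherwise
    revisit an existing child so values can actually accumulate."""
    if len(children) < k * max(n_visits, 1) ** alpha:
        child = sample_new()            # widen: admit a freshly sampled child
        children.append(child)
        return child
    return random.choice(children)      # reuse an existing child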

POMDP Approximation


Upper bounds of alpha vectors

QMDP and FIB represent upper bounds on the true optimal alpha vector values.

FIB generally gives a tighter (lower) upper bound than QMDP.
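As a concrete illustration of the QMDP bound, a toy tabular sketch (my own minimal implementation, not from the course): solve the underlying fully observable MDP by value iteration, then use \(Q(s,a)\) as one alpha vector per action. The result upper-bounds the optimal POMDP value because it assumes the state becomes fully observable after one step.

```python
import numpy as np

def qmdp_alphas(T, R, gamma=0.95, iters=100):
    """QMDP upper bound.

    T: (A, S, S) transition tensor, T[a, s, s'] = P(s' | s, a)
    R: (S, A) reward matrix
    Returns an (A, S) array: one alpha vector per action.
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)                          # MDP value function
        Q = R + gamma * np.einsum("ast,t->sa", T, V)
    return Q.T

# value of a belief b under the bound: max over alpha vectors of alpha . b
```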

Lower bounds of alpha vectors

BAWS (best-action worst-state) and the blind lower bound represent lower bounds on the true optimal alpha vector values.
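For comparison with the upper bound above, a toy sketch of the blind lower bound (my own minimal implementation): the value of blindly repeating one action forever lower-bounds the optimal value, since any fixed policy does. Initializing at the worst reward forever is an assumption here, chosen so the iteration converges from below.

```python
import numpy as np

def blind_lower_bound(T, R, gamma=0.95, iters=100):
    """Blind policy lower bound: one alpha vector per action.

    T: (A, S, S) transitions, R: (S, A) rewards.
    alpha_a = R(., a) + gamma * T_a alpha_a, iterated to convergence.
    """
    A, S, _ = T.shape
    # pessimistic start: worst reward repeated forever
    alphas = np.full((A, S), R.min() / (1 - gamma))
    for _ in range(iters):
        for a in range(A):
            alphas[a] = R[:, a] + gamma * T[a] @ alphas[a]
    return alphas
```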

Faster: BAWS (closed form).

Slower (but tighter): the blind lower bound (computed iteratively).

point selection

see point selection

POMDP-lite


What if our hidden state never changes, or changes deterministically? For instance, in localization. This should make solving a POMDP easier.

POMDP-lite

  • \(X\) fully observable states
  • \(\theta\) hidden parameter: takes one of a finite set of values \(\theta_{1 \dots N}\)
  • where \(S = X \times \theta\)

we then assume conditional independence between \(x\) and \(\theta\). So \(T = P(x'\mid\theta, x, a)\), where \(P(\theta'\mid\theta,x,a) = 1\) (“our hidden parameter is static or deterministically changing”)
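Because \(\theta\) is static, the belief factorizes: \(x\) is observed exactly, so only a discrete belief over \(\theta\) needs updating, via Bayes' rule on each observed transition \(x \to x'\). A minimal sketch, with illustrative names; `trans_prob(x_next, theta, x, a)` stands in for \(P(x'\mid\theta,x,a)\).

```python
def update_theta_belief(b, x, a, x_next, trans_prob):
    """Bayes update of the belief over the static hidden parameter theta.

    b: dict theta -> probability (the current belief)
    trans_prob(x_next, theta, x, a): assumed model P(x' | theta, x, a)
    """
    posterior = {th: p * trans_prob(x_next, th, x, a) for th, p in b.items()}
    z = sum(posterior.values())              # normalizing constant
    return {th: p / z for th, p in posterior.items()}
```

This is why the model is easier than a general POMDP: the belief lives over \(N\) parameter values instead of the full state space.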

POMDPs Index


a class about POMDPs

| Theme                      | Topics                           |
|----------------------------+----------------------------------|
| Robot dogs                 | NeBula, AISR NeBula              |
| Applications               | POMDSoar,                        |
| Offline Solvers            | PBVI, HSVI, Perseus              |
| Offline Solvers            | SARSOP, E-PCA, CALP              |
| Policy Graphs              | Hansen, MCVI, PGA                |
| Online Solvers             | AEMS, POMCP, DESPOT              |
| Moar Online Methods        | IS-DESPOT, POMCPOW, AdaOPS       |
| POMDPish                   | MOMDP, POMDP-lite, rho-POMDPs    |
| Memoryless + Policy Search | Sarsa (Lambda), JSJ, Pegasus     |
| Hierarchical Decomposition | Option, MaxQ, LTRDP              |
| Hybrid Planning            | HybPlan, LetsDrive, BetaZero     |
| LQR + Shared Autonomy      | iLQR, Hindsight, TrustPOMDP      |
| Multi-Agent                | Factored MDPs, FV-POMCPs, G-DICE |

Other Content