Houjun Liu

Sarsa (Lambda)

Previous approaches for dealing with a Partially Observable Markov Decision Process:

  • memory-based state estimation (beliefs)
  • special planning methods

Can we use MDP reinforcement learning to deal with POMDPs?

Background

Recall MDP SARSA:

\begin{equation} Q(s,a) \leftarrow Q(s,a) + \alpha \left[ (r + \gamma Q(s',a')) - Q(s,a) \right] \end{equation}
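As a concrete reference, here is a minimal tabular sketch of this update in Python; the table \(Q\) and the values of \(\alpha\) and \(\gamma\) are illustrative assumptions, not fixed by these notes.

```python
import collections

# Tabular action-value function and hypothetical hyperparameters.
Q = collections.defaultdict(float)   # Q[(s, a)] -> estimated value
alpha, gamma = 0.1, 0.95             # learning rate, discount factor

def sarsa_update(s, a, r, s_next, a_next):
    """One SARSA step: Q(s,a) += alpha * [(r + gamma * Q(s',a')) - Q(s,a)]."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```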

Recall that learning from sparse rewards with SARSA can take a long time, because it takes time for reward information to propagate back through the value estimates.

Hence, we use Eligibility Traces, which keep track of what’s “eligible” for updates:

Let \(\lambda\) be some decay parameter; we have:

\begin{equation} \delta = r + \gamma Q(s',a') - Q(s,a) \end{equation}

and we can write:

\begin{equation} Q(s,a) \leftarrow Q(s,a) + \alpha \delta N(s,a) \end{equation}

whereby the visit counts (incremented each time \((s,a)\) is visited) are discounted such that:

\begin{equation} N(s,a) \leftarrow \gamma \lambda N(s,a) \end{equation}
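Putting the three pieces together, a single SARSA(\(\lambda\)) step might look like the sketch below; the dictionaries, the hyperparameter values, and the accumulating-trace increment are assumptions for illustration.

```python
import collections

Q = collections.defaultdict(float)   # Q[(s, a)]
N = collections.defaultdict(float)   # eligibility trace N[(s, a)]
alpha, gamma, lam = 0.1, 0.95, 0.9   # hypothetical learning rate, discount, lambda

def sarsa_lambda_step(s, a, r, s_next, a_next):
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]  # TD error
    N[(s, a)] += 1.0                  # mark the current pair as eligible
    for key in list(N):               # every eligible pair shares the update
        Q[key] += alpha * delta * N[key]
        N[key] *= gamma * lam         # then decay its eligibility
```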

Worry

The inability to fully observe the state seems to invalidate the point of \(Q\)-learning, SARSA, etc.

Applying Eligibility Traces to POMDPs

Instead of \(N(s,a)\), we initialize some \(\eta(x,a)\) over observation-action pairs and proceed in the same way:

\begin{equation} \begin{cases} \eta(x_{t},a_{t}) = 1 \\ \eta(x,a) = \gamma \lambda\, \eta_{t-1}(x,a), & (x,a) \neq (x_{t},a_{t}) \\ Q(x,a) \leftarrow Q(x,a) + \alpha \delta\, \eta(x,a) \end{cases} \end{equation}

Notably, we formulate \(x\) as a tuple of the TWO most recent observations, meaning we carry a single step of memory into the past and optimise over that.

This requires no belief propagation.
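A sketch of the observation-based variant follows, assuming \(x\) is the tuple of the two most recent observations and a replacing trace as in the equations above; all names and hyperparameter values here are illustrative.

```python
import collections

Q = collections.defaultdict(float)     # Q[(x, a)], x = (previous obs, current obs)
eta = collections.defaultdict(float)   # eligibility eta[(x, a)]
alpha, gamma, lam = 0.1, 0.95, 0.9     # hypothetical hyperparameters

def step(obs_prev, obs, a, r, obs_next, a_next):
    x = (obs_prev, obs)                # one step of memory
    x_next = (obs, obs_next)
    delta = r + gamma * Q[(x_next, a_next)] - Q[(x, a)]
    for key in list(eta):
        eta[key] *= gamma * lam        # decay all existing traces
    eta[(x, a)] = 1.0                  # replacing trace for the current pair
    for key in list(eta):
        Q[key] += alpha * delta * eta[key]
```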

aliasing

An important failure mode of this approach is aliasing, where you may be in one of two different states that look similar observationally, but taking the same action in those states leads to very different places.