How do we solve a Markov Decision Process whose state space is continuous?

Let there be a value function parameterized on \(\theta\):

\begin{equation} U_{\theta}(s) \end{equation}

Let us extract the greedy policy with respect to this utility:

\begin{equation} \pi(s) = \arg\max_{a} \qty(R(s,a) + \gamma \sum_{s'}^{} T(s'|s,a) U_{\theta}(s')) \end{equation}
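This one-step lookahead can be sketched in code. A minimal sketch, assuming discrete actions and a Monte Carlo estimate of the expectation over \(T\); the callables `R`, `sample_next`, and `U_theta` are hypothetical problem-specific functions, not anything from the notes:

```python
import numpy as np

def greedy_policy(s, actions, R, sample_next, U_theta, gamma=0.95, n_samples=20):
    """Greedy one-step lookahead: pick the action maximizing R + gamma * E[U].

    R(s, a) -> reward; sample_next(s, a) -> one sampled next state s';
    U_theta(s) -> approximate utility. The sum over T(s'|s,a) is
    estimated by averaging U_theta over sampled next states.
    """
    best_a, best_q = None, -np.inf
    for a in actions:
        # Monte Carlo estimate of sum_{s'} T(s'|s,a) U_theta(s')
        exp_u = np.mean([U_theta(sample_next(s, a)) for _ in range(n_samples)])
        q = R(s, a) + gamma * exp_u
        if q > best_q:
            best_a, best_q = a, q
    return best_a
```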

We now create a finite sampling of our state space, which may be infinitely large (for instance, continuous):

\begin{equation} S \subseteq \mathcal{S} \end{equation}

where \(S\) is a finite set of discrete states \(\{s_1, \dots, s_{m}\}\).

Now, what next?

## generally:

- Initialize \(\theta\)
- Loop until convergence:
  - For each \(s_{i} \in S\), let \(u_{i} = \max_{a} \qty(R(s_{i},a) + \gamma \sum_{s'}^{}T(s'|s_{i},a) U_{\theta}(s'))\), the utility backup at each sampled state \(s_{i}\)
  - Then, fit a \(\theta\) so that \(U_{\theta}(s_{i})\) is close to \(u_{i}\) for all \(i\)
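The loop above can be sketched as fitted value iteration with a linear-in-features approximator, so the "fit \(\theta\)" step is a least-squares solve. A minimal sketch; `R`, `sample_next`, and `features` are hypothetical problem-specific callables:

```python
import numpy as np

def fitted_value_iteration(S, actions, R, sample_next, features,
                           gamma=0.95, n_samples=10, n_iters=50):
    """Fitted value iteration over sampled states S.

    features(s) -> feature vector beta(s); we fit theta by least squares
    so that U_theta(s) = theta @ features(s) matches the backed-up u_i.
    """
    Phi = np.array([features(s) for s in S])   # design matrix over samples
    theta = np.zeros(Phi.shape[1])             # initialize theta
    for _ in range(n_iters):
        U = lambda s: theta @ features(s)      # current approximation U_theta
        u = []
        for s in S:
            # Bellman backup at each sampled state s_i, with a Monte Carlo
            # estimate of the expectation over next states s'
            q = [R(s, a) + gamma * np.mean(
                    [U(sample_next(s, a)) for _ in range(n_samples)])
                 for a in actions]
            u.append(max(q))
        # fit theta so that U_theta(s_i) is close to u_i (least squares)
        theta, *_ = np.linalg.lstsq(Phi, np.array(u), rcond=None)
    return theta
```

On a toy absorbing chain with \(R(s,a) = s\) this converges toward \(U(s) = s/(1-\gamma)\), as the backup contracts toward the fixed point.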

**to get \(T\)**: draw a finite sample of next states from the environment, or fit a transition model to observed transitions.

BUT: **Convergence is not guaranteed.**

There are two main families of approximation to achieve this:

## global approximation

- linear regression: fit a best-fit line of utility value against state value
- polynomial fit: \(U_{\theta}(s) = \theta^{T}\beta(s)\), where each basis function is \(\beta_{j}(s)=s^{j-1}\)
- a frigin neural network (train a model with parameters \(\theta\) which produces the utility estimates for you: \(M_{\theta}(s) = U_{\theta}(s)\))
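The polynomial variant can be sketched directly: build the basis \(\beta_{j}(s)=s^{j-1}\) and solve for \(\theta\) by least squares. The function names here are my own, not from the notes:

```python
import numpy as np

def poly_features(s, degree=3):
    """Polynomial basis: beta_j(s) = s**(j-1) for j = 1..degree+1."""
    return np.array([s**j for j in range(degree + 1)])

def fit_global(S, u, degree=3):
    """Least-squares fit of theta so that U_theta(s_i) = theta @ beta(s_i) ~ u_i."""
    Phi = np.array([poly_features(s, degree) for s in S])
    theta, *_ = np.linalg.lstsq(Phi, np.array(u), rcond=None)
    return theta

def U_global(theta, s, degree=3):
    """Evaluate the fitted global approximation at any state s."""
    return theta @ poly_features(s, degree)
```

Because the fit is global, one \(\theta\) is queried at arbitrary states \(s\), including states never sampled.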

## local approximation

- make a sampling in your continuous state space to discretize it
- do any utility-computation method you'd like (policy evaluation or value iteration) to get a set of \(\theta_{i}\), the utility of being in each sampled discrete state \(s_{i}\)
- whenever you need to calculate \(U(s)\) at a particular state, estimate it from nearby samples:
- linearly interpolate
- k nearest neighbor
- kernel smoothing
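The interpolation and nearest-neighbor options can be sketched as follows, assuming 1-D states for simplicity (`S` is the sorted sample of states, `theta` the per-sample utilities \(\theta_{i}\); the function names are my own):

```python
import numpy as np

def U_knn(s, S, theta, k=2):
    """k-nearest-neighbor estimate: average the utilities theta_i of the
    k sampled states s_i closest to the query state s."""
    S = np.asarray(S)
    idx = np.argsort(np.abs(S - s))[:k]   # indices of the k closest samples
    return np.mean(np.asarray(theta)[idx])

def U_interp(s, S, theta):
    """Piecewise-linear interpolation of the utility between sampled
    states (S must be sorted ascending)."""
    return np.interp(s, S, theta)
```

Kernel smoothing replaces the hard k-neighbor cutoff with a weighted average over all samples, weights decaying with distance.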