Memoryless policy search through fake determinism.

- uses a deterministic simulative function to evaluate a policy's value
- performs policy search with standard optimization methods

Primary contribution: transforming a **stochastic** POMDP into a deterministic simulative function; this forgoes alpha vectors entirely.

Suppose you have sampled \(m\) initial states \(s_{1}, \dots, s_{m}\); you can then search for the policy parameters \(\theta\) that maximize:

\begin{equation} \arg\max_{\theta} \tilde{V}(\theta), \qquad \tilde{V}(\theta) = \frac{1}{m} \sum_{i=1}^{m} V_{\theta}(s_{i}) \end{equation}
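A minimal sketch of this objective in Python, assuming a hypothetical `V(theta, s)` that returns the value of policy \(\theta\) from initial state \(s\) (how to make that value deterministic is the subject of the next sections):

```python
import numpy as np

def V_tilde(theta, initial_states, V):
    """Monte Carlo policy value: average V_theta over the m sampled
    initial states. Once V is deterministic, this is an ordinary
    function of theta and can be optimized directly."""
    return np.mean([V(theta, s) for s in initial_states])

# e.g. pick the best of a set of candidate parameters:
# theta_star = max(candidates, key=lambda th: V_tilde(th, states, V))
```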

To actually ensure that each \(V_{\theta}(s_{i})\) is a deterministic quantity, we need deterministic transitions…

## deterministic simulative function

Typically, a generative model, given a state and an action, draws its own randomness internally and returns a random next state. What we do instead is have a simulator which takes a **RANDOM NUMBER** as **INPUT**, alongside the state and action, and **DETERMINISTICALLY** gives the next state.
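As a concrete sketch (my own illustration, not the paper's code): for a discrete MDP with a hypothetical transition tensor `P[s, a]` holding next-state probabilities, inverse-CDF sampling turns an input \(u \in [0, 1)\) into a deterministic next state:

```python
import numpy as np

def deterministic_step(P, s, a, u):
    """Deterministic simulative function: map the random number u in [0, 1)
    through the inverse CDF of the next-state distribution P[s, a].
    The same (s, a, u) triple always yields the same next state."""
    cdf = np.cumsum(P[s, a])
    return int(np.searchsorted(cdf, u, side="right"))
```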

## Pegasus procedure

We augment the state:

\begin{equation} s' \in S \times [0,1] \times [0,1] \times \cdots \end{equation}

meaning every augmented state is an original state paired with an infinite sequence of random numbers between \(0\) and \(1\):

\begin{equation} (s, 0.91, 0.22, \dots) \end{equation}

At every transition, we eat up one random number from the sequence, take an action, and feed both into our deterministic simulative function to obtain the next state.
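A sketch of one such rollout, reusing the hypothetical `deterministic_step` from above plus assumed `policy(theta, s)` and `reward(s, a)` functions:

```python
def pegasus_rollout(theta, s0, random_numbers, P, policy, reward, gamma=0.95):
    """Trajectory from the augmented state (s0, u_1, u_2, ...): each step
    consumes one random number, so the return is a deterministic
    function of (theta, s0, random_numbers)."""
    s, ret, discount = s0, 0.0, 1.0
    for u in random_numbers:            # eat up one random number per step
        a = policy(theta, s)            # memoryless (reactive) policy
        ret += discount * reward(s, a)
        s = deterministic_step(P, s, a, u)
        discount *= gamma
    return ret
```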

## determinism

The idea is that if we have sampled enough scenarios, the policy parameters \(\theta\) that maximize the deterministic \(\tilde{V}\) will, with high probability, also approximately maximize the true \(V\).
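Putting it together (again a sketch under the same assumptions, concretizing the earlier `V_tilde`): fix the scenarios (initial states plus their random-number sequences) once, and \(\tilde{V}\) becomes a deterministic function of \(\theta\) that any standard optimizer can attack:

```python
import numpy as np

rng = np.random.default_rng(0)
m, H = 100, 50          # number of scenarios, rollout horizon
n_states = P.shape[0]   # assumes the hypothetical transition tensor P from above

# Draw the scenarios once and never resample them.
scenarios = [(int(rng.integers(n_states)), rng.random(H)) for _ in range(m)]

def V_tilde(theta):
    # Fixed scenarios => V_tilde is deterministic in theta, so grid search,
    # Nelder-Mead, or finite-difference gradients all apply unchanged.
    return np.mean([pegasus_rollout(theta, s0, us, P, policy, reward)
                    for s0, us in scenarios])
```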