## Step 1: Getting a Model

We want to learn a model of the environment, consisting of:

- \(T\): transition probability
- \(R\): rewards

### Maximum Likelihood Parameter Learning Method

We keep a table of counts:

\begin{equation} N(s,a,s') \end{equation}

the number of observed transitions from \(s,a\) to \(s'\), incremented each time \(s, a, s'\) is observed. The Maximum Likelihood estimate of the transition probability is then:

\begin{equation} T(s' | s,a) = \frac{N(s,a,s')}{\sum_{s''}^{} N(s,a,s'')} \end{equation}

We also keep a table:

\begin{equation} \rho(s,a) \end{equation}

the sum of rewards received when taking action \(a\) in state \(s\). To estimate the reward, we take the average:

\begin{equation} R(s,a) \approx \frac{\rho(s,a)}{\sum_{s''}^{} N(s,a,s'')} \end{equation}
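The two tables above can be sketched in a few lines. This is a minimal illustration, not a full implementation; the function names and the flat-dictionary layout are assumptions, and the names `N` and `rho` mirror the equations above.

```python
from collections import defaultdict

# Counts N(s, a, s') and reward sums rho(s, a), as defined above.
N = defaultdict(int)
rho = defaultdict(float)

def observe(s, a, r, s_next):
    """Record one observed transition (s, a, r, s')."""
    N[(s, a, s_next)] += 1
    rho[(s, a)] += r

def visits(s, a):
    """Total visit count N(s, a) = sum over s'' of N(s, a, s'')."""
    return sum(n for (si, ai, _), n in N.items() if (si, ai) == (s, a))

def T(s_next, s, a):
    """Maximum-likelihood estimate of T(s' | s, a)."""
    total = visits(s, a)
    return N[(s, a, s_next)] / total if total > 0 else 0.0

def R(s, a):
    """Average observed reward for taking a in s."""
    total = visits(s, a)
    return rho[(s, a)] / total if total > 0 else 0.0
```

Scanning all counts in `visits` is wasteful; a real implementation would cache \(N(s,a)\) in its own table.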

### Bayesian Parameter Learning Method

We build a Dirichlet Distribution; let:

\begin{equation} \vec{\theta}_{s,a} = \mqty[T(s_1 | s,a) \\ \vdots \\ T(s_{n} | s,a)] \end{equation}

We then maintain the posterior:

\begin{equation} \text{Dir}(\vec{\theta}_{s,a} | \vec{N}_{s,a}) \end{equation}

which gives a probability distribution over the transition parameters.

Then, when we need a transition model \(T\), we perform posterior sampling: once per episode (rather than every step, otherwise the model shifts too much), we draw \(\vec{\theta}_{s,a}\) from the Dirichlet distribution and optimize against the sampled model.
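A Dirichlet sample can be drawn with only the standard library by normalizing Gamma draws. This is a sketch; the helper name and the uniform prior of 1 added to the counts are assumptions.

```python
import random

def sample_dirichlet(alphas):
    """Draw one sample from Dirichlet(alphas) by normalizing Gamma(alpha, 1) draws."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical counts N(s, a, s') for one (s, a) pair over three successor states.
counts = [3, 1, 6]

# Once per episode: sample theta_(s,a) from the posterior (uniform prior of 1 assumed).
theta = sample_dirichlet([c + 1 for c in counts])
```

Each entry of `theta` is a sampled \(T(s_i | s, a)\), and the entries sum to 1.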

Learning rewards in the Bayesian setting is an advanced topic, so let's just use Maximum Likelihood Parameter Learning for \(R\) or assume it's given.

## Step 2: Getting a Value Function

### Full Update

One direct strategy is to use whatever transitions and rewards we have observed so far to perform value iteration or policy iteration: collect one or more observations, take a full value-iteration or policy-iteration sweep, then go back and collect more observations.

### Randomized Update

We randomly pick a single state \(s\) and update it:

\begin{equation} U(s) \leftarrow \max_{a} \qty[R(s,a) + \gamma \sum_{s'}^{} T(s' | s,a) U(s')] \end{equation}

and take another observation, update our model estimate, and move on.
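This asynchronous backup is short enough to sketch directly. The layout is an assumption: `T_hat[s][a]` maps successor states to probabilities, `R_hat[s][a]` is a scalar, and `U` is updated in place.

```python
import random

def random_update(U, T_hat, R_hat, gamma=0.9):
    """Pick one state uniformly at random and apply a single Bellman backup to it."""
    s = random.choice(list(U))
    U[s] = max(
        R_hat[s][a] + gamma * sum(p * U[s2] for s2, p in T_hat[s][a].items())
        for a in T_hat[s]
    )
```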

### Prioritized Updates

Say we are currently updating a state \(s\), and there are predecessor states \(s^{-}\) that could transition into \(s\) via actions \(a^{-}\). First we update \(s\) as before:

\begin{equation} U(s) \leftarrow \max_{a} \qty[R(s,a) + \gamma \sum_{s'}^{} T(s' | s,a) U(s')] \end{equation}

We then maintain a priority queue of predecessor states, ranked by:

\begin{equation} T(s|s^{-}, a^{-}) \times |U(s)-u(s)| \end{equation}

where \(u(s)\) is \(U(s)\) prior to the update: predecessors that are likely to reach \(s\), and for which \(s\)'s value changed a lot, get updated first.

We move on to the next state to update by popping off the queue.
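The queue mechanics can be sketched with `heapq`. The function names and the `predecessors` table (mapping each state to the \((s^{-}, a^{-})\) pairs that can reach it) are assumptions; `heapq` is a min-heap, so priorities are negated to pop the largest first.

```python
import heapq

def push_predecessors(queue, s, delta, predecessors, T_hat):
    """After updating s with value change delta = |U(s) - u(s)|, enqueue each
    predecessor (s_prev, a_prev) with priority T(s | s_prev, a_prev) * delta."""
    for s_prev, a_prev in predecessors.get(s, []):
        priority = T_hat[s_prev][a_prev].get(s, 0.0) * delta
        if priority > 0:
            heapq.heappush(queue, (-priority, s_prev))

def pop_next(queue):
    """The next state to update is the one with the largest priority."""
    return heapq.heappop(queue)[1]
```

A real implementation would also deduplicate states already in the queue, keeping only their highest priority.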

## Step 3: Explore a Little

### R-Max

Most strategies above rely on choosing random actions to explore. R-Max instead adapts the rewards and transitions to explicitly drive the agent toward under-explored states.

\begin{equation} R(s,a) = \begin{cases} r_{\max}, & \text{if } N(s,a) < m \\ \frac{\rho(s,a)}{N(s,a)}, & \text{otherwise} \end{cases} \end{equation}

you get the large reward \(r_{\max}\) if you have visited \((s,a)\) fewer than \(m\) times; otherwise you get the maximum-likelihood average reward.

\begin{equation} T(s'|s,a) = \begin{cases} 1, & \text{if } N(s,a) < m \text{ and } s' = s \\ 0, & \text{if } N(s,a) < m \text{ and } s' \neq s \\ \frac{N(s,a,s')}{N(s,a)}, & \text{otherwise} \end{cases} \end{equation}
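The two case expressions can be sketched directly: under-visited \((s,a)\) pairs look maximally rewarding and self-looping, which drives the planner toward them. The function names and the table layout (`N` for \(N(s,a)\), `N3` for \(N(s,a,s')\), `rho` for the reward sums) are assumptions.

```python
def r_max_reward(s, a, N, rho, m, r_max):
    """R-Max reward: r_max while (s, a) is under-explored, else the ML average."""
    if N.get((s, a), 0) < m:
        return r_max
    return rho[(s, a)] / N[(s, a)]

def r_max_transition(s_next, s, a, N, N3, m):
    """R-Max transition: a self-loop while (s, a) is under-explored, else the ML estimate."""
    if N.get((s, a), 0) < m:
        return 1.0 if s_next == s else 0.0
    return N3.get((s, a, s_next), 0) / N[(s, a)]
```

Planning (e.g. value iteration) on this modified model then produces a policy that naturally seeks out \((s,a)\) pairs with fewer than \(m\) visits.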