policy
constituents
the history: past states and actions \(h_{t} = (s_{1:t}, a_{1:t-1})\)
requirements
typically:
\begin{equation} a_{t} = \pi_{t}(h_{t}) \end{equation}
for a Markov Decision Process, past states are d-separated from the current action given the current state, so really we have \(\pi_{t}(s_{t})\)
Some policies can be stochastic:
\begin{equation} P(a_{t}) = \pi_{t}(a_{t} | h_{t}) \end{equation}
instead of prescribing a single action at each point, it gives the probability of choosing \(a_{t}\) given the history.
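As a minimal sketch (assuming a hypothetical MDP with 3 states and 2 actions), a deterministic Markov policy can be represented as a lookup from state to action, and a stochastic policy as a table of action probabilities per state:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state, 2-action MDP, used only for illustration.
n_states, n_actions = 3, 2

# Deterministic Markov policy: maps the current state directly to an action.
deterministic_pi = {0: 1, 1: 0, 2: 1}

# Stochastic Markov policy: one row of action probabilities pi(a | s) per state.
stochastic_pi = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])

def act(state: int) -> int:
    """Sample an action a_t ~ pi(. | s_t) from the stochastic policy."""
    return int(rng.choice(n_actions, p=stochastic_pi[state]))

print(deterministic_pi[1], act(1))
```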
policy evaluation
See also Roll-out utility if you don’t want to get a vector utility over all states.
solving for the utility of a policy
We can solve for the utility of the policy given the transitions \(T\) and reward \(R\) by solving the following equation
\begin{equation} \bold{U}^{\pi} = (I - \gamma T^{\pi})^{-1} \bold{R}^{\pi} \end{equation}
where \(T^{\pi}\) is an \(|S| \times |S|\) square matrix in which each row sums to \(1\); entry \((i, j)\) encodes the probability of transitioning from state \(i\) (the row) to state \(j\) (the column) under \(\pi\).
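A minimal sketch of this computation, assuming a hypothetical 3-state MDP with made-up \(T^{\pi}\) and \(R^{\pi}\); solving the linear system is preferred to forming the inverse explicitly:

```python
import numpy as np

gamma = 0.9

# T_pi[i, j] = probability of moving from state i to state j under the policy;
# each row sums to 1 (row-stochastic).
T_pi = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])

# R_pi[i] = expected reward collected in state i when following the policy.
R_pi = np.array([1.0, 0.0, 5.0])

# U^pi = (I - gamma * T^pi)^{-1} R^pi, computed via a linear solve.
U_pi = np.linalg.solve(np.eye(3) - gamma * T_pi, R_pi)
print(U_pi)
```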
Policy Gradient
Two steps:
- obtaining an expression for the gradient of the utility with respect to the policy parameters \(\theta\)
- improving those parameters by optimization (e.g., gradient ascent)
Throughout all of this, \(U(\theta)\) is \(U(\pi_{\theta})\).
Obtaining a policy gradient
Finite-Difference Gradient Estimation
We want some expression for:
\begin{equation} \nabla U(\theta) = \qty[\pdv{U}{\theta_{1}} (\theta), \dots, \pdv{U}{\theta_{n}} (\theta)] \end{equation}
we can estimate that with the finite-difference “epsilon trick”:
\begin{equation} \nabla U(\theta) \approx \qty[ \frac{U(\theta + \delta e^{1}) - U(\theta)}{\delta} , \dots, \frac{U(\theta + \delta e^{n}) - U(\theta)}{\delta} ] \end{equation}
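A minimal sketch of this estimator, using a toy quadratic standing in for roll-out estimates of \(U(\theta)\):

```python
import numpy as np

def finite_difference_gradient(U, theta, delta=1e-4):
    """Estimate grad U(theta) one coordinate at a time with a forward difference."""
    theta = np.asarray(theta, dtype=float)
    base = U(theta)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0                              # standard basis vector e^i
        grad[i] = (U(theta + delta * e_i) - base) / delta
    return grad

# Toy stand-in for the policy utility; in practice U(theta) would be
# estimated from roll-outs of pi_theta.
U = lambda th: -np.sum((th - 1.0) ** 2)
print(finite_difference_gradient(U, np.zeros(3)))
```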
policy iteration
policy iteration allows us to obtain an optimal policy.
- start with some initial policy \(\pi\) (this scheme converges to an optimal policy regardless of where you start)
- solve for \(U^{\pi}\)
- create a new policy \(\pi'\) by taking the value-function policy (greedy policy) with respect to \(U^{\pi}\)
- repeat 2-3
Since there are finitely many policies, this will eventually converge.
At each step, the utility of the resulting policy is greater than or equal to that of the previous one, since we greedily choose “better” (or equivalent) actions as measured by the utility of the previous policy.
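A minimal sketch of the whole loop, assuming hypothetical array shapes \(T[a, s, s']\) for transitions and \(R[s, a]\) for rewards:

```python
import numpy as np

def policy_iteration(T, R, gamma, max_iter=100):
    """T[a, s, s'] = transition probabilities, R[s, a] = rewards (hypothetical shapes)."""
    n_actions, n_states, _ = T.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    for _ in range(max_iter):
        # 1. policy evaluation: U^pi = (I - gamma T^pi)^{-1} R^pi
        T_pi = T[pi, np.arange(n_states)]         # T_pi[s, s'] = T[pi[s], s, s']
        R_pi = R[np.arange(n_states), pi]         # R_pi[s]     = R[s, pi[s]]
        U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
        # 2. policy improvement: greedy (value-function) policy w.r.t. U^pi
        Q = R.T + gamma * (T @ U)                 # Q[a, s]
        new_pi = np.argmax(Q, axis=0)
        if np.array_equal(new_pi, pi):            # policy is stable -> done
            break
        pi = new_pi
    return pi, U

# Tiny hypothetical 2-state, 2-action MDP just to exercise the routine.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0], [2.0, 0.0]])
print(policy_iteration(T, R, gamma=0.95))
```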
Policy Optimization
Policy Optimization deals with algorithms that directly optimize the policy parameters \(\theta\) of a policy \(\pi_{\theta}\), unlike value iteration, policy iteration, or online planning, which compute a policy through a surrogate (such as a value function or future discounted reward).
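A minimal sketch of what “directly optimizing against \(\theta\)” can look like, reusing a finite-difference gradient estimate and a toy stand-in for \(U(\theta)\); gradient ascent is one possible optimizer here, not the only one:

```python
import numpy as np

def gradient_ascent(U, theta0, step=0.05, delta=1e-4, n_steps=200):
    """Directly adjust the policy parameters theta by ascending an estimated grad U(theta)."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_steps):
        base = U(theta)
        grad = np.array([
            (U(theta + delta * np.eye(theta.size)[i]) - base) / delta
            for i in range(theta.size)
        ])
        theta += step * grad                      # ascend: we maximize utility
    return theta

# Toy surrogate utility standing in for roll-out estimates of U(pi_theta).
U = lambda th: -np.sum((th - 2.0) ** 2)
print(gradient_ascent(U, np.zeros(2)))
```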