policy
constituents
the history: past states and actions \(h_{t} = (s_{1:t}, a_{1:t-1})\)
requirements
typically:
\begin{equation} a_{t} = \pi_{t}(h_{t}) \end{equation}
for a Markov Decision Process, past states are d-separated from the current action given the current state, so really we have \(\pi_{t}(s_{t})\)
Some policies can be stochastic:
\begin{equation} P(a_{t}) = \pi_{t}(a_{t} | h_{t}) \end{equation}
instead of prescribing a single action at each point, it gives the probability of choosing \(a_{t}\) given the history.
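As a minimal sketch (assuming a hypothetical MDP with 3 states and 2 actions), a deterministic Markov policy can be represented as a lookup from state to action, and a stochastic policy as a table of action probabilities per state:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state, 2-action MDP, used only for illustration.
n_states, n_actions = 3, 2

# Deterministic Markov policy: maps the current state directly to an action.
deterministic_pi = {0: 1, 1: 0, 2: 1}

# Stochastic Markov policy: one row of action probabilities pi(a | s) per state.
stochastic_pi = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])

def act(state: int) -> int:
    """Sample an action a_t ~ pi(. | s_t) from the stochastic policy."""
    return int(rng.choice(n_actions, p=stochastic_pi[state]))

print(deterministic_pi[1], act(1))
```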
policy evaluation
See also Roll-out utility if you don’t want to get a vector utility over all states.
solving for the utility of a policy
We can solve for the utility of the policy given the transitions \(T\) and reward \(R\) by solving the following equation
\begin{equation} \bold{U}^{\pi} = (I - \gamma T^{\pi})^{-1} \bold{R}^{\pi} \end{equation}
where \(T^{\pi}\) is an \(|S| \times |S|\) square matrix in which each row sums to \(1\); entry \((i, j)\) encodes the probability of transitioning from state \(i\) (the row) to state \(j\) (the column) under \(\pi\).
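A minimal sketch of this computation, assuming a hypothetical 3-state MDP with made-up \(T^{\pi}\) and \(R^{\pi}\); solving the linear system is preferred to forming the inverse explicitly:

```python
import numpy as np

gamma = 0.9

# T_pi[i, j] = probability of moving from state i to state j under the policy;
# each row sums to 1 (row-stochastic).
T_pi = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])

# R_pi[i] = expected reward collected in state i when following the policy.
R_pi = np.array([1.0, 0.0, 5.0])

# U^pi = (I - gamma * T^pi)^{-1} R^pi, computed via a linear solve.
U_pi = np.linalg.solve(np.eye(3) - gamma * T_pi, R_pi)
print(U_pi)
```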
Policy Gradient
Two steps:
- obtaining an expression for the gradient of the utility with respect to the policy parameters \(\theta\)
- improving those parameters by optimization (e.g., gradient ascent)
Throughout all of this, \(U(\theta)\) is \(U(\pi_{\theta})\).
Obtaining a policy gradient
Finite-Difference Gradient Estimation
We want some expression for:
\begin{equation} \nabla U(\theta) = \qty[\pdv{U}{\theta_{1}} (\theta), \dots, \pdv{U}{\theta_{n}} (\theta)] \end{equation}
we can estimate that with the finite-difference “epsilon trick”:
\begin{equation} \nabla U(\theta) \approx \qty[ \frac{U(\theta + \delta e^{1}) - U(\theta)}{\delta} , \dots, \frac{U(\theta + \delta e^{n}) - U(\theta)}{\delta} ] \end{equation}
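A minimal sketch of this estimator, using a toy quadratic standing in for roll-out estimates of \(U(\theta)\):

```python
import numpy as np

def finite_difference_gradient(U, theta, delta=1e-4):
    """Estimate grad U(theta) one coordinate at a time with a forward difference."""
    theta = np.asarray(theta, dtype=float)
    base = U(theta)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0                              # standard basis vector e^i
        grad[i] = (U(theta + delta * e_i) - base) / delta
    return grad

# Toy stand-in for the policy utility; in practice U(theta) would be
# estimated from roll-outs of pi_theta.
U = lambda th: -np.sum((th - 1.0) ** 2)
print(finite_difference_gradient(U, np.zeros(3)))
```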
policy iteration
policy iteration allows us to obtain an optimal policy.
- start with some initial policy \(\pi\) (this scheme converges to an optimal policy regardless of where you start)
- solve for \(U^{\pi}\)
- create a new policy \(\pi'\) by taking the value-function policy (greedy policy) with respect to \(U^{\pi}\)
- repeat 2-3
Since there are finitely many policies, this will eventually converge.
At each step, the utility of the resulting policy is greater than or equal to that of the previous one, since we greedily choose “better” (or equivalent) actions as measured by the utility of the previous policy.
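A minimal sketch of the whole loop, assuming hypothetical array shapes \(T[a, s, s']\) for transitions and \(R[s, a]\) for rewards:

```python
import numpy as np

def policy_iteration(T, R, gamma, max_iter=100):
    """T[a, s, s'] = transition probabilities, R[s, a] = rewards (hypothetical shapes)."""
    n_actions, n_states, _ = T.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    for _ in range(max_iter):
        # 1. policy evaluation: U^pi = (I - gamma T^pi)^{-1} R^pi
        T_pi = T[pi, np.arange(n_states)]         # T_pi[s, s'] = T[pi[s], s, s']
        R_pi = R[np.arange(n_states), pi]         # R_pi[s]     = R[s, pi[s]]
        U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
        # 2. policy improvement: greedy (value-function) policy w.r.t. U^pi
        Q = R.T + gamma * (T @ U)                 # Q[a, s]
        new_pi = np.argmax(Q, axis=0)
        if np.array_equal(new_pi, pi):            # policy is stable -> done
            break
        pi = new_pi
    return pi, U

# Tiny hypothetical 2-state, 2-action MDP just to exercise the routine.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0], [2.0, 0.0]])
print(policy_iteration(T, R, gamma=0.95))
```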
Policy Optimization
Policy Optimization deals with algorithms that directly optimize the policy parameters \(\theta\) of a policy \(\pi_{\theta}\), unlike value iteration, policy iteration, or online planning, which compute a policy through a surrogate (such as a value function or future discounted reward).
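A minimal sketch of what “directly optimizing against \(\theta\)” can look like, reusing a finite-difference gradient estimate and a toy stand-in for \(U(\theta)\); gradient ascent is one possible optimizer here, not the only one:

```python
import numpy as np

def gradient_ascent(U, theta0, step=0.05, delta=1e-4, n_steps=200):
    """Directly adjust the policy parameters theta by ascending an estimated grad U(theta)."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_steps):
        base = U(theta)
        grad = np.array([
            (U(theta + delta * np.eye(theta.size)[i]) - base) / delta
            for i in range(theta.size)
        ])
        theta += step * grad                      # ascend: we maximize utility
    return theta

# Toy surrogate utility standing in for roll-out estimates of U(pi_theta).
U = lambda th: -np.sum((th - 2.0) ** 2)
print(gradient_ascent(U, np.zeros(2)))
```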