policy

constituents

the history: past states and actions \(h_{t} = (s_{1:t}, a_{1:t-1})\)

requirements

typically:

\begin{equation} a_{t} = \pi_{t}(h_{t}) \end{equation}

for a Markov Decision Process, our past states are d-separated from our current action given the current state, so really we only need \(\pi_{t}(s_{t})\)
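
A minimal sketch of the difference between the two signatures. The state names, action names, and the decision rule are hypothetical and only there for illustration; the point is that a Markov policy gives the same answer for any two histories that end in the same state.

```python
def markov_policy(t, state):
    """pi_t(s_t): decide from the current state only (hypothetical rule)."""
    return "recharge" if state == "low_battery" else "explore"


def history_policy(t, history):
    """pi_t(h_t): receives the full history h_t = (s_{1:t}, a_{1:t-1}).

    Under the Markov assumption only the last state matters, so this
    simply forwards s_t to the state-based policy.
    """
    states, actions = history
    return markov_policy(t, states[-1])


# Two different histories that end in the same state s_3.
h_a = (["ok", "ok", "low_battery"], ["explore", "explore"])
h_b = (["low_battery", "ok", "low_battery"], ["recharge", "explore"])

action_a = history_policy(3, h_a)
action_b = history_policy(3, h_b)
```

Both calls return the same action, because everything before the final state is ignored.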

Some policies can be stochastic:

\begin{equation} P(a_{t} \mid h_{t}) = \pi_{t}(a_{t} \mid h_{t}) \end{equation}

instead of telling you what to do at a specific point, a stochastic policy tells you the probability with which it chooses \(a_{t}\) given the history.
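
A small sketch of a stochastic policy, assuming a hypothetical two-action problem. The policy returns a distribution over actions, and acting means sampling from it; the probabilities below are made up for illustration.

```python
import random


def stochastic_policy(history):
    """Return {action: probability} for the given history (hypothetical numbers)."""
    states, actions = history
    if states[-1] == "low_battery":
        return {"recharge": 0.9, "explore": 0.1}
    return {"recharge": 0.2, "explore": 0.8}


def sample_action(policy, history, rng):
    """Draw a_t with probability pi_t(a_t | h_t)."""
    dist = policy(history)
    candidates = list(dist)
    weights = [dist[a] for a in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only so that runs are repeatable
history = (["ok", "low_battery"], ["explore"])
dist = stochastic_policy(history)
draw = sample_action(stochastic_policy, history, rng)
```

A deterministic policy is the special case where one action has probability 1.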

additional information

stationary policy

For infinite-horizon models, our policy need not care about how many time steps are left (i.e. we are not optimizing within some box with constrained time), and therefore we can drop the time index as well as the history. So we have:

\begin{equation} \pi(s) \end{equation}

this can be used for infinite-horizon models with a stationary Markov Decision Process.
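
To make the contrast concrete, here is a sketch with hypothetical states and actions: a finite-horizon policy needs one decision rule per time step, while a stationary policy is a single lookup that is reused at every step.

```python
# Hypothetical finite-horizon policy: one decision rule per time step.
finite_horizon_policy = [
    {"ok": "explore", "low_battery": "recharge"},  # t = 1
    {"ok": "explore", "low_battery": "explore"},   # t = 2, the last step
]

# Hypothetical stationary policy: a single rule pi(s) with no time index.
stationary_policy = {"ok": "explore", "low_battery": "recharge"}


def act_finite(t, state):
    """pi_t(s): the answer may change with the time step t (1-indexed)."""
    return finite_horizon_policy[t - 1][state]


def act_stationary(state):
    """pi(s): the answer depends on the state only."""
    return stationary_policy[state]
```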

optimal policy

\begin{equation} \pi^{*}(s) = \arg\max_{\pi} U^{\pi}(s) \end{equation}

“the optimal policy is the policy that maximizes the expected utility of following \(\pi\) when starting from \(s\)”

We call the utility from the best policy the “optimal value function”

\begin{equation} U^{*} = U^{\pi^{*}} \end{equation}
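
For a tiny problem, the \(\arg\max\) over policies can be taken literally by enumerating every deterministic stationary policy and evaluating each one. The sketch below assumes a made-up two-state MDP (the names `T`, `R`, `GAMMA` and all numbers are illustrative) and uses iterative policy evaluation. It ranks policies by the sum of their state utilities, which is safe here because an optimal policy of a finite MDP is at least as good as any other policy in every state.

```python
from itertools import product

GAMMA = 0.9
STATES = ["a", "b"]
ACTIONS = ["stay", "go"]

# Hypothetical dynamics T[s][action] = {next_state: probability}.
T = {
    "a": {"stay": {"a": 1.0}, "go": {"b": 1.0}},
    "b": {"stay": {"b": 1.0}, "go": {"a": 1.0}},
}
# Hypothetical rewards R[s][action]; only staying in "b" pays off.
R = {
    "a": {"stay": 0.0, "go": 0.0},
    "b": {"stay": 1.0, "go": 0.0},
}


def evaluate_policy(policy, sweeps=1000):
    """Approximate U^pi by repeatedly applying the Bellman expectation update."""
    U = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        U = {
            s: R[s][policy[s]]
            + GAMMA * sum(p * U[s2] for s2, p in T[s][policy[s]].items())
            for s in STATES
        }
    return U


def brute_force_optimal_policy():
    """Enumerate all |A|^|S| deterministic policies and keep the best one."""
    best_policy, best_U = None, None
    for choice in product(ACTIONS, repeat=len(STATES)):
        policy = dict(zip(STATES, choice))
        U = evaluate_policy(policy)
        if best_U is None or sum(U.values()) > sum(best_U.values()):
            best_policy, best_U = policy, U
    return best_policy, best_U


pi_star, U_star = brute_force_optimal_policy()
```

Enumeration grows exponentially with the number of states, so this is only a way to see the definition at work, not a practical solver.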

policy, utility, and value

creating a good utility function: either policy iteration or value iteration
creating a policy from a utility function: value-function policy (“in each state, choose the action with the best value”)
calculating the utility of a given policy: use policy evaluation
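
The three steps above can be sketched end to end on a toy problem. Everything here is an assumption for illustration: the two-state MDP, the names `T`, `R`, `GAMMA`, and the fixed number of sweeps used in place of a convergence check. Value iteration builds a utility function, the greedy one-step lookahead turns it into a policy, and policy evaluation recovers the utility of that policy.

```python
GAMMA = 0.5
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

# Hypothetical dynamics T[s][action] = {next_state: probability}.
T = {
    "s0": {"left": {"s0": 1.0}, "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left": {"s0": 1.0}, "right": {"s1": 1.0}},
}
# Hypothetical rewards R[s][action].
R = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 1.0},
}


def lookahead(U, s, a):
    """One-step lookahead: R(s, a) + gamma * sum_s' T(s' | s, a) U(s')."""
    return R[s][a] + GAMMA * sum(p * U[s2] for s2, p in T[s][a].items())


def value_iteration(sweeps=200):
    """Create a utility function by repeated Bellman optimality updates."""
    U = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        U = {s: max(lookahead(U, s, a) for a in ACTIONS) for s in STATES}
    return U


def greedy_policy(U):
    """Value-function policy: pick the best-valued action in each state."""
    return {s: max(ACTIONS, key=lambda a: lookahead(U, s, a)) for s in STATES}


def evaluate_policy(policy, sweeps=200):
    """Policy evaluation: the utility of following a fixed policy."""
    U = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        U = {s: lookahead(U, s, policy[s]) for s in STATES}
    return U


U = value_iteration()
pi = greedy_policy(U)
U_pi = evaluate_policy(pi)
```

On this toy problem the utility of the greedy policy matches the utility found by value iteration, which is what we expect when the utility function has converged.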

See policy evaluation