
poisson distribution

Last edited: August 8, 2025

Let’s say we want to know the chance of an event occurring \(k\) times in a unit of time, given that, on average, the event happens at a rate of \(\lambda\) per unit time.

“What’s the probability that there are \(k\) earthquakes in 1 year if there are, on average, \(2\) earthquakes per year?”

where:

  1. events have to be independent
  2. the probability of success in each trial doesn’t vary

constituents

  • \(\lambda\): the average number of events per unit time
  • \(X \sim Poi(\lambda)\)

requirements

the probability mass function:

\begin{equation} P(X=k) = \frac{\lambda^{k} e^{-\lambda}}{k!} \end{equation}
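
as a quick sanity check, here is a minimal Python sketch of the PMF above, applied to the earthquake example with \(\lambda = 2\) (the `poisson_pmf` helper name is made up purely for illustration):

#+begin_src python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poi(lam): lam^k * e^(-lam) / k!"""
    return (lam ** k) * exp(-lam) / factorial(k)

# "what's the chance of k earthquakes in a year if there are
# on average 2 earthquakes per year?"
for k in range(5):
    print(k, round(poisson_pmf(k, lam=2.0), 4))
#+end_src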

policy

Last edited: August 8, 2025

constituents

the history: the past states and actions, \(h_{t} = (s_{1:t}, a_{1:t-1})\)

requirements

typically:

\begin{equation} a_{t} = \pi_{t}(h_{t}) \end{equation}

for a Markov Decision Process, our past states are d-separated from our current action given the current state, so really we have \(\pi_{t}(s_{t})\)

Some policies can be stochastic:

\begin{equation} P(a_{t}) = \pi_{t}(a_{t} | h_{t}) \end{equation}

instead of telling you exactly what to do at a specific point, it tells you the probability of choosing \(a_{t}\) given the history.
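
a minimal Python sketch of the difference (the battery states, actions, and probabilities are made up purely for illustration): a deterministic MDP policy maps the current state to an action, while a stochastic policy returns a distribution over actions given the history:

#+begin_src python
import random

# deterministic policy for an MDP: the action depends only on the current state
def pi_det(state):
    table = {"low_battery": "recharge", "high_battery": "explore"}  # toy mapping
    return table[state]

# stochastic policy: a distribution over actions given the history h_t
def pi_stoch(history):
    states, actions = history      # h_t = (s_{1:t}, a_{1:t-1})
    last_state = states[-1]        # for an MDP, only s_t matters
    if last_state == "low_battery":
        return {"recharge": 0.9, "explore": 0.1}
    return {"recharge": 0.2, "explore": 0.8}

# sample a_t ~ pi_t(. | h_t)
h_t = (["high_battery", "low_battery"], ["explore"])
dist = pi_stoch(h_t)
a_t = random.choices(list(dist), weights=list(dist.values()))[0]
print(pi_det("low_battery"), a_t)
#+end_src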

Policy Gradient

Last edited: August 8, 2025

Two steps:

  1. obtaining an expression for the gradient of the policy’s utility \(U\) with respect to its parameters \(\theta\)
  2. improving those parameters by optimizing against that gradient (e.g., gradient ascent)

Throughout all of this, \(U(\theta)\) is shorthand for \(U(\pi_{\theta})\).

Obtaining a policy gradient

Finite-Difference Gradient Estimation

We want some expression for:

\begin{equation} \nabla U(\theta) = \qty[\pdv{U}{\theta_{1}}(\theta), \dots, \pdv{U}{\theta_{n}}(\theta)] \end{equation}

we can estimate that with the finite-difference “epsilon trick”:

\begin{equation} \nabla U(\theta) \approx \qty[ \frac{U(\theta + \delta e^{1}) - U(\theta)}{\delta} , \dots, \frac{U(\theta + \delta e^{n}) - U(\theta)}{\delta} ] \end{equation}
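
here is a minimal sketch of that estimator in Python; the \(U\) used below is a toy stand-in (in practice you’d estimate \(U(\theta)\) from rollouts of \(\pi_{\theta}\)), and the last line is one gradient-ascent step using the estimate:

#+begin_src python
import numpy as np

def finite_difference_gradient(U, theta, delta=1e-4):
    """Estimate each component of grad U(theta) as (U(theta + delta*e_i) - U(theta)) / delta."""
    theta = np.asarray(theta, dtype=float)
    base = U(theta)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0                               # i-th standard basis vector
        grad[i] = (U(theta + delta * e_i) - base) / delta
    return grad

# toy stand-in for U(theta) = U(pi_theta); maximized at theta = (1, 1)
U = lambda th: -np.sum((th - 1.0) ** 2)

theta = np.array([0.0, 2.0])
g = finite_difference_gradient(U, theta)           # approximately [ 2., -2.]
theta = theta + 0.1 * g                            # one gradient-ascent step on U(theta)
#+end_src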

policy iteration

Last edited: August 8, 2025

policy iteration will allow us to get an optimal policy.

  1. start with some initial policy \(\pi\) (this scheme converges to an optimal policy regardless of where you start)
  2. solve for \(U^{\pi}\)
  3. create a new policy \(\pi'\) by extracting a value-function policy from \(U^{\pi}\)
  4. repeat 2-3

Since there are only finitely many policies, this will eventually converge.

At each step, the utility of the policy does not decrease.

At each step, the utility of the resulting policy will necessarily be greater than or equal to that of the previous one, since we are greedily choosing “better” (or equivalent) actions as measured by the utility of the previous policy.
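
a minimal Python sketch of the loop above on a tiny made-up MDP (the transition and reward tables are arbitrary, just for illustration); step 2 solves for \(U^{\pi}\) exactly as a linear system, and step 3 is the greedy (value-function) policy:

#+begin_src python
import numpy as np

# tiny made-up MDP: 3 states, 2 actions, discount gamma
n_states, n_actions, gamma = 3, 2, 0.9
T = np.array([                                            # T[a][s, s'] = P(s' | s, a)
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],  # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])        # R[s, a]

def evaluate(pi):
    """Solve U = R_pi + gamma * T_pi U exactly for U^pi."""
    T_pi = np.array([T[pi[s], s] for s in range(n_states)])
    R_pi = np.array([R[s, pi[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)

pi = np.zeros(n_states, dtype=int)                  # 1. some initial policy
while True:
    U = evaluate(pi)                                # 2. solve for U^pi
    Q = R + gamma * np.einsum("ast,t->sa", T, U)    #    one-step lookahead values
    new_pi = Q.argmax(axis=1)                       # 3. value-function (greedy) policy
    if np.array_equal(new_pi, pi):
        break                                       #    converged: pi is optimal
    pi = new_pi                                     # 4. repeat
print(pi, U)
#+end_src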

Policy Optimization

Last edited: August 8, 2025

Policy Optimization deals with algorithms that directly optimize the parameters \(\theta\) of a policy \(\pi_{\theta}\), unlike value iteration, policy iteration, or online planning, which compute a policy via a surrogate (like a value function or some future discounted reward).