pointer
Last edited: August 8, 2025
A pointer is a variable which stores a memory address. Because C has no pass-by-reference, we use pointers to emulate pass-by-reference: by sharing addresses with other functions.
A pointer can identify a single byte OR a large data structure. We can dynamically allocate memory through pointers, and also refer to memory generically, without a type (a void pointer).
C is always pass-by-copy. Therefore, to emulate pass-by-reference, you pass a pointer to the object:
int x = 2; // declare object
int *xptr = &x; // get location of object (&: address of)
printf("%d\n", *xptr); // dereference the pointer
address operator
You will note the & in the line above: the address operator yields the memory address of its operand, which is how we obtained the location of x.
poisson distribution
Let’s say we want to know the chance of an event occurring \(k\) times in a unit of time, when on average the event happens at a rate of \(\lambda\) per unit time.
“What’s the probability that there are \(k\) earthquakes in 1 year if there are on average \(2\) earthquakes in 1 year?”
constituents
- \(\lambda\): rate of events per unit time
- \(X \sim Poi(\lambda)\)
requirements
- events have to be independent
- probability of success in each trial doesn’t vary
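The probability mass function answers the question above (this formula is the standard Poisson PMF, not stated explicitly in the original notes):
\begin{equation} P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!} \end{equation}
For the earthquake example, with \(\lambda = 2\) per year and \(k = 2\):
\begin{equation} P(X = 2) = \frac{2^{2} e^{-2}}{2!} = 2e^{-2} \approx 0.271 \end{equation}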
policy
constituents
the history: last states and actions \(h_{t} = (s_{1:t}, a_{1:t-1})\)
requirements
typically:
\begin{equation} a_{t} = \pi_{t}(h_{t}) \end{equation}
for a Markov Decision Process, past states are d-separated from the current action given the current state, so really we have \(\pi_{t}(s_{t})\)
Some policies can be stochastic:
\begin{equation} P(a_{t}) = \pi_{t}(a_{t} | h_{t}) \end{equation}
instead of telling you what to do at a specific point, it gives the probability of choosing \(a_{t}\) given the history.
Policy Gradient
Two steps:
- obtaining a function for the gradient of policy against some parameters \(\theta\)
- improving the parameters by optimizing \(U(\theta)\) (e.g. gradient ascent)
Throughout all of this, \(U(\theta)\) is \(U(\pi_{\theta})\).
Obtaining a policy gradient
Finite-Difference Gradient Estimation
We want some expression for:
\begin{equation} \nabla U(\theta) = \qty[\pdv{U}{\theta_{1}} (\theta), \dots, \pdv{U}{\theta_{n}} (\theta)] \end{equation}
we can estimate that with the finite-difference “epsilon trick”, where \(e^{i}\) is the \(i\)-th standard basis vector and \(\delta\) is a small step:
\begin{equation} \nabla U(\theta) \approx \qty[ \frac{U(\theta + \delta e^{1}) - U(\theta)}{\delta} , \dots, \frac{U(\theta + \delta e^{n}) - U(\theta)}{\delta} ] \end{equation}
policy iteration
policy iteration will allow us to get an optimal policy.
1. start with some initial policy \(\pi\) (this scheme converges to an optimal policy regardless of where you start)
2. solve for \(U^{\pi}\)
3. create a new policy \(\pi’\) by creating a value-function policy on \(U^{\pi}\)
4. repeat steps 2-3 until the policy stops changing
Since there are finitely many policies, this will eventually converge. At each step, the utility of the resulting policy is greater than or equal to that of the previous one, because we are greedily choosing “better” (or equivalent) actions as measured by the utility of the previous policy.
