reinforcement learning

Last edited: August 8, 2025

reinforcement learning is a decision-making method for settings where there is no known model of the environment at all.

the agent interacts with the environment directly
the designer provides a performance measure of the agent in the environment
the agent tries to optimize its decision-making algorithm to maximize the performance measure

Note: the agent’s own choice of action, in this case, actually influences how the environment evolves, and therefore which futures the agent sees.
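The loop described above can be sketched in a few lines. This is a minimal illustration, not a method these notes prescribe: the five-cell corridor environment, the reward of 1 at the goal, and the use of tabular Q-learning with the parameters `alpha`, `gamma`, and `epsilon` are all assumptions chosen for the example. The agent never reads the `step` function's rules; it only observes the next state and the reward (the designer's performance measure).

```python
import random

N_STATES = 5          # hypothetical corridor: positions 0..4, position 4 is the goal
ACTIONS = (-1, +1)    # action index 0 moves left, action index 1 moves right


def step(state, action):
    """Environment dynamics, unknown to the agent: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning: improve action-value estimates purely from interaction."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # cap the episode length
            best = max(q[state])
            greedy = [a for a in range(2) if q[state][a] == best]
            # Explore with probability epsilon, otherwise act greedily (random tie-break).
            a = rng.randrange(2) if rng.random() < epsilon else rng.choice(greedy)
            next_state, reward, done = step(state, ACTIONS[a])
            # The chosen action determines which state (and which future) the agent sees next.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy policy for the non-terminal states 0..3.
policy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
print(policy)
```

The design choice worth noticing is that `train` only ever calls `step` and looks at what comes back, which is the "no known model" setting in miniature.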

Rejection Sampling

Last edited: August 8, 2025

steps (coda)

our target is the unnormalized failure density, defined from the nominal trajectory distribution \(p\qty(\tau)\):

\begin{equation} \bar{p} \qty(\tau \mid \tau \not \in \psi) = \mathbb{1}\qty {\tau \not \in \psi} p\qty(\tau) \end{equation}

sample \(\tau \sim q\qty(\cdot)\)

where \(q\) is the proposal distribution from which you draw your samples; you want it to be as close as possible to the target failure distribution.

sample \(r \sim \mathcal{U}\qty(0,1)\), and reject if \(cq\qty(\tau) r > \bar{p}\qty(\tau)\)

first, choose a normalizing constant \(c\) which makes \(\bar{p}\qty(\tau) \leq c q\qty(\tau)\) for all \(\tau\)
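A minimal sketch of these steps, under loud assumptions: a scalar \(x\) stands in for the trajectory \(\tau\), the nominal distribution is a standard normal, "failure" means \(x > 1\), and the proposal is a normal centered at 1.5. None of these choices come from the notes; they are picked so that a valid \(c\) can be worked out by hand. For \(x > 1\) the ratio \(\bar{p}(x) / q(x) = \exp(-1.5x + 1.125)\) is largest at \(x = 1\), so \(c = \exp(-0.375)\) satisfies \(\bar{p} \leq cq\) everywhere.

```python
import math
import random


def normal_pdf(x, mean=0.0, std=1.0):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def unnormalized_target(x, threshold=1.0):
    """Indicator{x is a failure} times the nominal density (assumed standard normal)."""
    return normal_pdf(x) if x > threshold else 0.0


def rejection_sample(n, c, proposal_mean=1.5, seed=0):
    """Draw n samples from the failure distribution by rejection sampling."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        x = rng.gauss(proposal_mean, 1.0)   # sample from the proposal q
        r = 1.0 - rng.random()              # uniform on (0, 1], so r is never exactly 0
        if c * normal_pdf(x, proposal_mean) * r > unnormalized_target(x):
            continue                        # reject
        samples.append(x)                   # accept
    return samples


C = math.exp(-0.375)  # hand-derived bound for this particular target/proposal pair
samples = rejection_sample(1000, C)
sample_mean = sum(samples) / len(samples)
print(min(samples) > 1.0, round(sample_mean, 2))
```

Every accepted sample is a failure, and the closer the proposal is to the failure distribution, the fewer draws are thrown away.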

relative probability

Last edited: August 8, 2025

Let \(X \sim \mathcal{N}\).

“How much more likely is \(x=10\) than \(x=5\)?”

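We note that \(P(X=x) = 0\) for any value \(x\) if \(X\) is continuous. However, we can still get an answer by comparing the probability of landing in a tiny interval of width \(\dd{x}\) around each value, which is approximately \(f(x) \dd{x}\) for the PDF \(f\):

\begin{equation} \frac{f(10) \dd{x}}{f(5) \dd{x}} \end{equation}

the two \(\dd{x}\) factors cancel out. Therefore, you can just divide the PDFs: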

\begin{equation} \frac{f(x=10)}{f(x=5)} \end{equation}
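As a quick numerical check, here is a sketch using Python's standard-library `statistics.NormalDist`. The notes do not give the parameters of the normal, so the mean of 8 and standard deviation of 3 below are purely hypothetical. The code compares the PDF ratio against the ratio of probabilities of two tiny intervals of equal width.

```python
import math
from statistics import NormalDist

# Hypothetical parameters; the notes leave the distribution unspecified.
X = NormalDist(mu=8.0, sigma=3.0)

# Relative probability as a ratio of densities.
pdf_ratio = X.pdf(10) / X.pdf(5)

# The same comparison using actual probabilities of tiny intervals of width eps.
eps = 1e-4
interval_ratio = (X.cdf(10 + eps) - X.cdf(10)) / (X.cdf(5 + eps) - X.cdf(5))

print(pdf_ratio, interval_ratio)
```

For these assumed parameters the density ratio works out to \(\exp(5/18)\), and the interval ratio approaches it as `eps` shrinks, which is the cancellation argument in numerical form.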

Relativism

Last edited: August 8, 2025

Problems with Relativism:

missing disagreement problem
arbitrariness: anything could be right and wrong
social infallibility: if people all agree on something, it becomes right

World-to-Mind Direction of Fit

When you have a desire, you change the world so that it fits the desire. Desires are not “truth-apt”.

“when you are hungry, you want food”

Mind-to-World Direction of Fit

Your belief changes as a function of the world. “truth-apt”

“reasoning/cognition and truth”

Simple Subjectivism

“murder is wrong” <=> “I disapprove of murder.”

relaxation (algorithms)

Last edited: August 8, 2025

background info

Recall asymptotic analysis. We remember that:

constant time < logarithmic time < linear time < polynomial time < exponential time

The question: what happens if dynamic programming is too slow or not good enough for the problem? What if dynamic programming is not needed, and we can instead just settle for a pretty good solution?

Take, for instance, Nueva Courses. The optimal solution is “most students get their highest possible preferences.” However, this is impractical and pretty much impossible. Instead, what if we endeavor to find a schedule that generally maximizes happiness?