reinforcement learning
Last edited: August 8, 2025
reinforcement learning is a decision-making method for settings where no model of the environment is known at all.
- agent interacts with environment directly
- designer provides a performance measure of the agent in the environment
- agent tries to optimize the decision making algorithm to maximise the performance measure
Note: the agent’s own choice of action, in this case, influences how the environment evolves, and therefore which futures the agent sees.
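The interaction loop above can be sketched in a few lines. This is only an illustration under stated assumptions: the corridor environment, the reward of 1 at the goal, and the choice of tabular Q-learning with epsilon-greedy exploration are all hypothetical and not named in these notes. The agent never reads the dynamics inside `step`; it only observes the states and rewards that result from its own actions.

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = (-1, +1)    # move left or move right

def step(state, action):
    """Environment dynamics, hidden from the agent: (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0   # the designer's performance measure
    return next_state, reward, done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning: improve action values purely from interaction."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)             # explore
            else:
                best = max(q[state])             # exploit, breaking ties at random
                a = rng.choice([i for i in range(2) if q[state][i] == best])
            next_state, reward, done = step(state, ACTIONS[a])
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state                   # the action shaped what comes next
    return q

def greedy_policy(q):
    """Best known action for every non-terminal state."""
    return [ACTIONS[max(range(2), key=lambda i: q[s][i])] for s in range(N_STATES - 1)]
```

Calling `greedy_policy(train())` returns one action per non-terminal state; in this toy corridor the learned policy moves toward the goal.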
Rejection Sampling
Last edited: August 8, 2025
steps (coda)
for some unnormalized target failure density \(\bar{p}\) (with nominal trajectory distribution \(p\qty(\tau)\)):
\begin{equation} \bar{p} \qty(\tau \mid \tau \not \in \psi) = \mathbb{1}\qty {\tau \not \in \psi} p\qty(\tau) \end{equation}
sample \(\tau \sim q\qty(\cdot)\)
where \(q\) is the proposal distribution from which you generate your samples; you want this to be as close as you can to the target failure distribution.
draw \(r \sim \mathcal{U}\qty(0, 1)\) and reject \(\tau\) if \(cq\qty(\tau) r > \bar{p}\qty(\tau)\)
first, choose a normalizing constant \(c\) which makes
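A minimal sketch of these steps on a one-dimensional stand-in for a trajectory. Everything concrete here is an assumption made for illustration: the nominal distribution is taken to be a standard normal, the specification \(\psi\) is taken to hold when \(\tau < 2\), the proposal \(q\) is a unit-variance normal centered at 3, \(r\) is drawn uniformly from \((0, 1]\), and \(c\) is chosen using the usual rejection-sampling requirement that \(c q\qty(\tau) \geq \bar{p}\qty(\tau)\) everywhere.

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

THRESHOLD = 2.0   # hypothetical: psi holds when tau < THRESHOLD, so failure is tau >= THRESHOLD
Q_MU = 3.0        # hypothetical proposal mean, placed near the failure region

def p_bar(tau):
    """Unnormalized failure density: indicator(tau not in psi) * p(tau)."""
    return normal_pdf(tau) if tau >= THRESHOLD else 0.0

def q_pdf(tau):
    """Proposal density q(tau)."""
    return normal_pdf(tau, mu=Q_MU)

# p(tau) / q(tau) = exp(4.5 - 3 * tau) for these two unit-variance normals.
# It decreases in tau, so its largest value on the failure region is at THRESHOLD.
C = math.exp(4.5 - 3.0 * THRESHOLD)

def rejection_sample(n, seed=0):
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        tau = rng.gauss(Q_MU, 1.0)       # sample tau ~ q
        r = 1.0 - rng.random()           # uniform on (0, 1]
        if C * q_pdf(tau) * r > p_bar(tau):
            continue                     # reject
        samples.append(tau)              # accept
    return samples
```

Every accepted sample lies in the failure region, since \(\bar{p}\) is zero elsewhere. The closer \(q\) sits to the failure distribution, the smaller \(c\) can be and the fewer proposals are thrown away.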
relative probability
Last edited: August 8, 2025
Let \(X \sim \mathcal{N}\).
“How much more likely is \(x=10\) than \(x=5\)?”
We note that \(P(x=value) = 0\) for any value if \(X\) is continuous. However, we can still get an answer:
\begin{equation} \frac{\dd{X} P(x=10)}{\dd{X} P(x=5)} \end{equation}
The two \(\dd{X}\) factors cancel out. Therefore, you can just divide the PDF values:
\begin{equation} \frac{f(x=10)}{f(x=5)} \end{equation}
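As a small numeric sketch of this ratio: the notes do not give the parameters of the normal, so the mean of 8 and standard deviation of 2 used below are assumptions chosen only to make the numbers concrete.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density f(x) of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def relative_probability(a, b, mu, sigma):
    """How much more likely is a value near a than a value near b: f(a) / f(b)."""
    return normal_pdf(a, mu, sigma) / normal_pdf(b, mu, sigma)

# Hypothetical parameters: X ~ N(mu=8, sigma=2).
ratio = relative_probability(10, 5, mu=8.0, sigma=2.0)
```

With these assumed parameters the ratio is \(e^{0.625}\), about 1.87, so a value near 10 is a bit under twice as likely as a value near 5.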
Relativism
Last edited: August 8, 2025
Problems with Relativism:
- missing disagreement problem
- arbitrariness: anything could be right or wrong
- social infallibility: if people all agree on something, it becomes right
World-to-Mind Direction of Fit
When you have a desire, you change the world so that you fulfill it. Not “truth-apt”
“when you are hungry, you want food”
Mind-to-World Direction of Fit
Your belief changes as a function of the world. “truth-apt”
“reasoning/cognition and truth”
Simple Subjectivism
“murder is wrong” <=> “I disapprove of murder.”
relaxation (algorithms)
Last edited: August 8, 2025
background info
Recall asymptotic analysis. We remember that:
constant time < logarithmic time < linear time < polynomial time < exponential time
The question: what happens if dynamic programming is too slow or not good enough for the problem? What if an exact method like dynamic programming is not needed, and we instead settle for a pretty good solution?
Take, for instance, Nueva Courses. The optimal solution is “most students get their highest possible preferences.” However, computing it is impractical and pretty much impossible. Instead, what if we endeavor to figure out a schedule that generally maximizes happiness?
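One way to picture "settling for a pretty good solution" is a greedy pass over the students. This is a hypothetical sketch, not the method these notes go on to describe: the student names, course names, and seat counts are invented, and a single greedy pass is only one of many possible heuristics. It gives each student the best-ranked course that still has a seat, with no guarantee of optimality.

```python
def greedy_assign(preferences, capacity):
    """Assign each student their highest-ranked course with a free seat.

    preferences: {student: [course, ...]}, best choice first.
    capacity:    {course: number of seats}.
    """
    seats = dict(capacity)        # copy so the caller's dict is not modified
    assignment = {}
    for student, ranked in preferences.items():
        for course in ranked:
            if seats.get(course, 0) > 0:
                assignment[student] = course
                seats[course] -= 1
                break             # students with no available course stay unassigned
    return assignment

def count_first_choices(assignment, preferences):
    """A crude happiness measure: how many students got their top preference."""
    return sum(1 for s, c in assignment.items() if preferences[s][0] == c)

# Hypothetical data for illustration only.
preferences = {
    "ana": ["art", "bio"],
    "ben": ["art", "chem"],
    "cy": ["bio", "art"],
}
capacity = {"art": 1, "bio": 1, "chem": 1}
assignment = greedy_assign(preferences, capacity)
```

The result depends on the order in which students are processed, which is exactly the kind of trade-off accepted when exactness is given up for speed.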
