Key Sequence
Notation
New Concepts
Important Results / Claims
Questions
Interesting Factoids
- “sometimes we may want to model at a slower timescale than the data is collected at; for instance, your helicopter really doesn’t move anywhere in a 100th of a second, but you can collect data that fast”
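A minimal sketch of that idea, under assumptions not in the notes: state/action logs saved as NumPy arrays (`states.npy`, `actions.npy` are hypothetical file names), subsampled from 100 Hz down to a 10 Hz model timestep before fitting a simple linear dynamics model.

```python
import numpy as np

# Hypothetical logs: rows are timesteps recorded at 100 Hz.
states = np.load("states.npy")    # shape (T, state_dim)
actions = np.load("actions.npy")  # shape (T, action_dim)

log_hz, model_hz = 100, 10
step = log_hz // model_hz         # keep every 10th sample: model at 10 Hz

s = states[::step]
a = actions[::step]

# Fit s_{t+1} ≈ A s_t + B a_t by least squares at the coarser timescale.
X = np.hstack([s[:-1], a[:-1]])   # (T'-1, state_dim + action_dim)
Y = s[1:]                         # (T'-1, state_dim)
theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
A = theta[: s.shape[1]].T         # (state_dim, state_dim)
B = theta[s.shape[1]:].T          # (state_dim, action_dim)
```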
Debugging RL
RL should work when
- The simulator is good
- The RL algorithm correctly maximizes \(V^{\pi}\)
- The reward function is such that maximizing expected payoff corresponds to achieving your goal
Diagnostics
- Check your simulator: if the policy works in sim but not IRL, the simulator is bad
- Otherwise (assuming a human can still do the task better than the learned policy): if \(V^{\text{RL}} < V^{\text{human}}\), the RL algorithm is failing to maximize the objective, so the algorithm is the problem
- If \(V^{\text{RL}} \geq V^{\text{human}}\) but the human still performs better, then maximizing the objective does not produce the behavior you want, so the objective/reward function is the problem (see the sketch below)
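A minimal sketch of the diagnostic as a decision procedure; the function name and arguments are hypothetical, and the value estimates are assumed to come from evaluating both policies on the same reward and start state.

```python
def debug_rl(works_in_sim: bool, works_in_real: bool,
             v_rl: float, v_human: float) -> str:
    """Return which component to suspect, following the diagnostics above."""
    if works_in_sim and not works_in_real:
        return "simulator"        # sim-to-real gap: improve the simulator
    if v_rl < v_human:
        return "RL algorithm"     # optimizer is failing to maximize V
    return "objective/reward"     # maximizing V doesn't match the real goal

# Example: policy transfers to the real world but values favor the human pilot.
print(debug_rl(works_in_sim=True, works_in_real=True, v_rl=4.2, v_human=5.0))
# -> "RL algorithm"
```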
