SU-CS229 NOV102025

Key Sequence

Notation

New Concepts

Important Results / Claims

Questions

Interesting Factoids

  • “sometimes we may want to model at a slower timescale than we collect data; for instance, your helicopter really doesn’t move anywhere in a 100th of a second, but you can collect data that fast” (see the subsampling sketch below)
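
A minimal sketch of that idea, assuming a trajectory logged at 100 Hz but a dynamics model fit at 10 Hz; the arrays, rates, and the linear-dynamics form are all illustrative, not anything specified in lecture:

```python
import numpy as np

# Hypothetical setup: states/actions logged at 100 Hz, model fit at 10 Hz,
# so we subsample every `stride`-th sample before fitting dynamics.
log_hz, model_hz = 100, 10
stride = log_hz // model_hz

states = np.random.randn(5000, 4)   # placeholder trajectory (T, state_dim)
actions = np.random.randn(5000, 2)  # placeholder controls  (T, action_dim)

s_t = states[:-stride:stride]       # s_t at the coarser model timescale
a_t = actions[:-stride:stride]      # action treated as held over the coarse step
s_next = states[stride::stride]     # s_{t+1} at the model timescale

# Fit a linear dynamics model s_{t+1} ≈ A s_t + B a_t by least squares.
X = np.hstack([s_t, a_t])
theta, *_ = np.linalg.lstsq(X, s_next, rcond=None)
A, B = theta[:4].T, theta[4:].T
```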

Debugging RL

RL should work when

  1. The simulator is good
  2. The RL algorithm correctly maximizes \(V^{\pi}\) (estimated as sketched after this list)
  3. The reward is designed so that maximizing expected payoff corresponds to achieving your goal
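
To check point 2 concretely, you need an estimate of \(V^{\pi}\); the standard move is Monte Carlo rollouts in the simulator. A minimal sketch, where `simulator.step` and `policy` are assumed interfaces rather than any particular library’s API:

```python
import numpy as np

def estimate_value(simulator, policy, s0, horizon=200, gamma=0.99, n_rollouts=100):
    """Monte Carlo estimate of V^pi(s0): mean discounted return over rollouts.

    Assumed interfaces: simulator.step(s, a) -> (s_next, reward),
    policy(s) -> action. Horizon/gamma/rollout counts are illustrative.
    """
    returns = []
    for _ in range(n_rollouts):
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r = simulator.step(s, a)
            total += discount * r
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))
```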

Diagnostics

  • check your simulator: if your policy works in simulation but not in real life, your simulator is inaccurate
  • if \(V^{\text{RL}} < V^{\text{human}}\) in simulation, your RL algorithm is the problem: it is failing to maximize \(V^{\pi}\)
  • if \(V^{\text{RL}} \geq V^{\text{human}}\) yet the human still outperforms the learned policy, your objective function is the problem: maximizing it doesn’t correspond to your actual goal (see the decision-tree sketch below)
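
Put together, the diagnostics form a decision tree, applied when a human (e.g., a pilot) still outperforms the learned controller. A sketch under that assumption, reusing `estimate_value` from above; all names are illustrative:

```python
def diagnose(works_in_sim, works_in_real, v_rl_sim, v_human_sim):
    """Ng-style RL debugging decision tree (assumes the human currently
    outperforms the learned policy in the real world)."""
    if works_in_sim and not works_in_real:
        return "simulator problem: improve the simulator's fidelity"
    if v_rl_sim < v_human_sim:
        return "RL algorithm problem: it is failing to maximize V^pi"
    # V^RL >= V^human, yet the human performs better in reality: maximizing
    # this reward does not correspond to the behavior we actually want.
    return "objective problem: redesign the reward/cost function"
```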