Key Sequence
Notation
New Concepts
Important Results / Claims
Questions
Interesting Factoids
- “sometimes we may want to model at a slower timescale than the data is collected at; for instance, your helicopter really doesn’t move anywhere in a 100th of a second, but you can collect data that fast”
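A minimal sketch of that idea, under assumptions not in the notes: state/action logs saved as NumPy arrays (`states.npy`, `actions.npy` are hypothetical file names), subsampled from 100 Hz down to a 10 Hz model timestep before fitting a simple linear dynamics model.

```python
import numpy as np

# Hypothetical logs: rows are timesteps recorded at 100 Hz.
states = np.load("states.npy")    # shape (T, state_dim)
actions = np.load("actions.npy")  # shape (T, action_dim)

log_hz, model_hz = 100, 10
step = log_hz // model_hz         # keep every 10th sample: model at 10 Hz

s = states[::step]
a = actions[::step]

# Fit s_{t+1} ≈ A s_t + B a_t by least squares at the coarser timescale.
X = np.hstack([s[:-1], a[:-1]])   # (T'-1, state_dim + action_dim)
Y = s[1:]                         # (T'-1, state_dim)
theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
A = theta[: s.shape[1]].T         # (state_dim, state_dim)
B = theta[s.shape[1]:].T          # (state_dim, action_dim)
```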
Debugging RL
RL should work when
- The simulator is good
- The RL algorithm correctly maximizes \(V^{\pi}\)
- The reward function is such that maximizing expected payoff corresponds to achieving your goal
Diagnostics
- Check your simulator: if the policy works in sim but not IRL, the simulator is bad
- Otherwise (assuming a human can still do the task better than the learned policy): if \(V^{\text{RL}} < V^{\text{human}}\), the RL algorithm is failing to maximize the objective, so the algorithm is the problem
- If \(V^{\text{RL}} \geq V^{\text{human}}\) but the human still performs better, then maximizing the objective does not produce the behavior you want, so the objective/reward function is the problem (see the sketch below)
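A minimal sketch of the diagnostic as a decision procedure; the function name and arguments are hypothetical, and the value estimates are assumed to come from evaluating both policies on the same reward and start state.

```python
def debug_rl(works_in_sim: bool, works_in_real: bool,
             v_rl: float, v_human: float) -> str:
    """Return which component to suspect, following the diagnostics above."""
    if works_in_sim and not works_in_real:
        return "simulator"        # sim-to-real gap: improve the simulator
    if v_rl < v_human:
        return "RL algorithm"     # optimizer is failing to maximize V
    return "objective/reward"     # maximizing V doesn't match the real goal

# Example: policy transfers to the real world but values favor the human pilot.
print(debug_rl(works_in_sim=True, works_in_real=True, v_rl=4.2, v_human=5.0))
# -> "RL algorithm"
```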
