specification gaming
specification gaming, or reward hacking, is the phenomenon where a system scores well on its stated objective but performs poorly on the intended task, because it exploits an underspecified part of the reward.
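A minimal sketch of the idea, assuming a made-up 1-D track (the environment, the reward values, and the names `run`, `intended`, and `gaming` are all hypothetical, chosen only for illustration). The designer wants the agent to reach the goal, but the reward also pays for entering a checkpoint and never says the checkpoint pays only once.

```python
# Hypothetical toy environment: cells 0..5, the agent starts at cell 0.
# Intended task: reach GOAL. Reward as written: +1 on every entry into
# CHECKPOINT, +3 once for reaching GOAL (which ends the episode).
CHECKPOINT, GOAL, HORIZON = 2, 5, 20

def run(policy):
    """Roll out a policy and return (total_reward, reached_goal)."""
    pos, total = 0, 0
    for t in range(HORIZON):
        pos = max(0, min(GOAL, pos + policy(pos, t)))  # move, clamped to the track
        if pos == CHECKPOINT:
            total += 1          # the underspecified part: repeatable reward
        if pos == GOAL:
            return total + 3, True
    return total, False

def intended(pos, t):
    """What the designer had in mind: keep moving right to the goal."""
    return 1

def gaming(pos, t):
    """Exploit: step onto the checkpoint, step off, and repeat."""
    return 1 if pos < CHECKPOINT else -1

print(run(intended))  # (4, True)
print(run(gaming))    # (10, False)
```

The gaming policy earns more reward than the intended one without ever finishing the task. In this toy setup, one possible fix is to pay the checkpoint reward only on the first visit.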
challenges
- sparse rewards
- partial observability
- dynamic rewards (and reward shifting)
- sim-to-real transfer is hard
- computational costs
- specification gaming
AI alignment
AI alignment is the effort to ensure that AI systems act in accordance with human values and interests.
there is a spectrum of unexpected solutions, ranging from undesirable novel solutions to desirable novel solutions
Problems with RLHF
- RLHF can degrade model quality
Goodharting
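Overfitting is an example of Goodharting: training loss is only a proxy for performance on unseen data, and optimizing it too hard stops improving the real target.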