Houjun Liu

SU-CS120 OCT012024

specification gaming

specification gaming, or reward hacking, is the phenomenon where a system behaves suboptimally with respect to the intended goal because it exploits an underspecified part of the reward.
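
A minimal sketch of the definition above, under made-up assumptions: a 1-D track where the designer wants the agent to reach the goal, but the written-down reward also pays +1 every time the agent steps onto a respawning bonus tile. The environment, the reward values, and the names `run_episode`, `finish_policy`, and `loop_policy` are all hypothetical and chosen only for illustration.

```python
GOAL = 3        # hypothetical goal position on the track
BONUS_TILE = 1  # hypothetical tile that pays +1 on every visit


def run_episode(policy, horizon=20):
    """Roll out a policy; return (proxy_return, reached_goal)."""
    position, proxy_return, finished = 0, 0, False
    for _ in range(horizon):
        # The policy maps a position to a move of +1 or -1.
        position = max(0, position + policy(position))
        if position == BONUS_TILE:
            proxy_return += 1  # underspecified part: the bonus never runs out
        if position == GOAL:
            proxy_return += 5  # reward for what the designer actually wants
            finished = True
            break
    return proxy_return, finished


def finish_policy(position):
    """Intended behavior: always move toward the goal."""
    return +1


def loop_policy(position):
    """Gaming behavior: step on and off the bonus tile forever."""
    return +1 if position == 0 else -1


finish_result = run_episode(finish_policy)  # (6, True)
loop_result = run_episode(loop_policy)      # (10, False)
print("finish:", finish_result)
print("loop:  ", loop_result)
```

In this toy setup the looping policy earns a higher return under the reward as written, yet never reaches the goal, so an optimizer that only sees the reward would prefer it.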

challenges

  • sparse rewards
  • partial observability
  • dynamic rewards (and reward shifting)
  • sim-to-real transfer is hard
  • computational costs
  • specification gaming

AI alignment

AI alignment aims to ensure that AI systems act in accordance with human values and interests.

there is a spectrum of unexpected solutions: undesirable novel solutions and desirable novel solutions

Problems with RLHF

  • RLHF degrades model quality

Goodharting

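Overfitting is an example of Goodharting: the model optimizes the training metric (the measure) instead of the generalization we actually care about (the target).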
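
As a rough illustration of overfitting as Goodharting, the sketch below treats training accuracy as the measure and held-out accuracy as the target. Everything here is an assumption made for the example: the parity rule used as ground truth, the data ranges, and the `memorizer` model, which simply stores the training labels.

```python
def true_label(x):
    """Hypothetical ground-truth rule: the parity of x."""
    return x % 2


train_xs = list(range(10))     # inputs seen during "training"
test_xs = list(range(10, 20))  # held-out inputs

# "Training" is pure memorization of the training labels.
memory = {x: true_label(x) for x in train_xs}


def memorizer(x):
    """Return the stored label, or 0 for any input never seen before."""
    return memory.get(x, 0)


def accuracy(model, xs):
    """Fraction of inputs on which the model matches the true label."""
    return sum(model(x) == true_label(x) for x in xs) / len(xs)


train_acc = accuracy(memorizer, train_xs)  # 1.0
test_acc = accuracy(memorizer, test_xs)    # 0.5
print(f"train accuracy: {train_acc}, held-out accuracy: {test_acc}")
```

The memorizer is perfect on the measure being optimized and no better than a constant guess on held-out data, which is the gap the term Goodharting points at.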