SU-CS120 OCT012024

specification gaming

specification gaming, or reward hacking, is the phenomenon where a system performs poorly on the intended objective because it exploits an underspecified part of the reward.
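A minimal sketch of the definition above, using a made-up toy environment: the track layout, reward values, horizon, and both policies are assumptions chosen for illustration and do not come from the lecture. The designer intends for the agent to reach the finish, but the reward also pays out every time the agent enters a checkpoint, so a policy that oscillates around the checkpoint can collect more reward than one that finishes.

```python
# Hypothetical 1-D track: positions 0..5, checkpoint at 2, finish at 5.
# Assumed reward: +1 each time the agent enters the checkpoint, +3 for finishing.
TRACK_END = 5
CHECKPOINT = 2
HORIZON = 20  # maximum number of steps per episode


def run(policy):
    """Roll out a policy and return (total_reward, finished)."""
    pos, total, finished = 0, 0, False
    for t in range(HORIZON):
        move = policy(pos, t)  # each policy returns -1 (left) or +1 (right)
        pos = max(0, min(TRACK_END, pos + move))
        if pos == CHECKPOINT:
            total += 1  # the underspecified part: re-entry is rewarded too
        if pos == TRACK_END:
            total += 3
            finished = True
            break  # the episode ends at the finish line
    return total, finished


def intended_policy(pos, t):
    """What the designer had in mind: always drive toward the finish."""
    return 1


def gaming_policy(pos, t):
    """Exploits the reward: step onto the checkpoint, step off, repeat."""
    return 1 if pos < CHECKPOINT else -1


intended = run(intended_policy)
gaming = run(gaming_policy)
print("intended:", intended)
print("gaming:  ", gaming)
```

Under these assumptions the gaming policy earns a higher total reward than the intended policy while never reaching the finish, which is the gap between "what was rewarded" and "what was wanted" that the definition describes.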

challenges

sparse rewards
partial observability
dynamic rewards (and reward shifting)
sim-to-real transfer is hard
computational costs
specification gaming

AI alignment

AI alignment aims to ensure that AI systems act in accordance with human values and interests.

there is a spectrum of unexpected solutions, ranging from undesirable novel solutions to desirable novel solutions

Problems with RLHF

RLHF can degrade model quality

Goodharting

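Overfitting is an example of Goodharting: the model optimizes the training loss (the proxy) at the expense of performance on unseen data (the actual target).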
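A small sketch of overfitting as Goodharting, under stated assumptions: the data set, the alternating +1/-1 perturbation standing in for label noise, and both models are hypothetical and chosen so the run is deterministic. The proxy metric is training error; the quantity we actually care about is error on held-out points drawn from the underlying signal.

```python
# Hypothetical toy data: true signal y = 2x, with a fixed alternating +1/-1
# perturbation standing in for label noise (deterministic for repeatability).
train = [(float(i), 2.0 * i + (1.0 if i % 2 == 0 else -1.0)) for i in range(10)]
# Held-out points: the noise-free signal at inputs not seen during training.
test = [(i + 0.25, 2.0 * (i + 0.25)) for i in range(10)]


def mse(model, data):
    """Mean squared error of a model (a function x -> y) on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def memorizer(x):
    """1-nearest-neighbour lookup: returns the label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def fit_line(data):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data)
    sxx = sum((x - mx) ** 2 for x, _ in data)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


line = fit_line(train)
train_memo, train_line = mse(memorizer, train), mse(line, train)
test_memo, test_line = mse(memorizer, test), mse(line, test)
print(f"train MSE  memorizer={train_memo:.3f}  line={train_line:.3f}")
print(f"test MSE   memorizer={test_memo:.3f}  line={test_line:.3f}")
```

In this setup the memorizer drives the proxy (training error) to zero by reproducing the noise, yet it does worse than the fitted line on the held-out points. Selecting models by training error alone would therefore pick the worse model.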