Houjun Liu

reward model

feed both the chosen and the rejected completions into your model, and get two scalars out, \(r_{\text{chosen}}\) and \(r_{\text{rejected}}\); train with the pairwise loss:

\begin{equation} \mathcal{L}_{RM} = \log \qty(1 + e^{r_{\text{rejected}}-r_{\text{chosen}}}) \end{equation}
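As a concrete reference, here is a minimal sketch of this loss in PyTorch; the `reward_model` name is a hypothetical stand-in for any module that maps a batch of tokenized sequences to one scalar reward per sequence, and chosen/rejected rows are assumed to be paired by prompt.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    # One scalar reward per sequence, shape (batch,).
    r_chosen = reward_model(chosen_ids)
    r_rejected = reward_model(rejected_ids)
    # log(1 + exp(r_rejected - r_chosen)) = softplus(r_rejected - r_chosen),
    # i.e. -log sigmoid(r_chosen - r_rejected), computed stably.
    return F.softplus(r_rejected - r_chosen).mean()
```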

  1. train for only one epoch
  2. expect fairly low pairwise accuracy; preference data is noisy, so modest scores are normal
  3. you may need to ensemble several reward models, or add a margin to the loss (see the sketch after this list)
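A minimal sketch of the margin variant, assuming the margin is a constant (or per-pair) value that the chosen reward must beat the rejected reward by before the loss goes to zero:

```python
import torch.nn.functional as F

def reward_model_margin_loss(r_chosen, r_rejected, margin=0.0):
    # log(1 + exp(r_rejected - r_chosen + margin)): penalizes pairs where
    # r_chosen does not exceed r_rejected by at least `margin`.
    return F.softplus(r_rejected - r_chosen + margin).mean()
```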

  • PPO (against this reward model) tends to give the best final model