Houjun Liu

reward model

feed both the chosen and the rejected completions into your model, and get two scalars out, \(r_{\text{chosen}}\) and \(r_{\text{rejected}}\); train with the pairwise loss:

\begin{equation} \mathcal{L}_{RM} = \log \qty(1 + e^{r_{\text{rejected}}-r_{\text{chosen}}}) \end{equation}
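As a concrete reference, here is a minimal sketch of this loss in PyTorch; the `reward_model` name is a hypothetical stand-in for any module that maps a batch of tokenized sequences to one scalar reward per sequence, and chosen/rejected rows are assumed to be paired by prompt.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    # One scalar reward per sequence, shape (batch,).
    r_chosen = reward_model(chosen_ids)
    r_rejected = reward_model(rejected_ids)
    # log(1 + exp(r_rejected - r_chosen)) = softplus(r_rejected - r_chosen),
    # i.e. -log sigmoid(r_chosen - r_rejected), computed stably.
    return F.softplus(r_rejected - r_chosen).mean()
```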

  1. train for only one epoch
  2. expect fairly low pairwise accuracy; preference data is noisy, so modest scores are normal
  3. you may need to ensemble several reward models, or add a margin to the loss (see the sketch after this list)
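A minimal sketch of the margin variant, assuming the margin is a constant (or per-pair) value that the chosen reward must beat the rejected reward by before the loss goes to zero:

```python
import torch.nn.functional as F

def reward_model_margin_loss(r_chosen, r_rejected, margin=0.0):
    # log(1 + exp(r_rejected - r_chosen + margin)): penalizes pairs where
    # r_chosen does not exceed r_rejected by at least `margin`.
    return F.softplus(r_rejected - r_chosen + margin).mean()
```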

  • PPO (against this reward model) tends to give the best final model