feed both the chosen and the rejected response into your model, and get two scalar rewards out, \(r_{\text{chosen}}\) and \(r_{\text{rejected}}\):
\begin{equation} \mathcal{L}_{RM} = \log \qty(1 + e^{r_{\text{rejected}}-r_{\text{chosen}}}) \end{equation}
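A minimal sketch of this pairwise loss in plain Python (a real implementation would operate on batched tensors and use a numerically stable log-sigmoid; `rm_loss` is a hypothetical helper name):

```python
import math

def rm_loss(r_chosen: float, r_rejected: float) -> float:
    # Bradley-Terry pairwise loss: log(1 + exp(r_rejected - r_chosen)),
    # equivalent to -log(sigmoid(r_chosen - r_rejected)).
    # log1p gives better precision than log(1 + x) for small x.
    return math.log1p(math.exp(r_rejected - r_chosen))

# Chosen scored higher -> small loss:
print(rm_loss(2.0, 0.0))  # ~0.127
# Rejected scored higher -> large loss:
print(rm_loss(0.0, 2.0))  # ~2.127
```

Note the loss depends only on the score *gap*, not the absolute scores, so reward-model outputs are only meaningful relative to each other.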
- train for only one epoch: reward models overfit quickly on preference data
- expect modest accuracy: human preference labels are noisy, so pairwise accuracy well below 100% is normal
- you may need an ensemble of reward models, or a margin loss, to improve robustness
- PPO against the trained reward model yields the best final model
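The margin variant mentioned above can be sketched as follows: add a margin term inside the exponent, so the chosen response must beat the rejected one by at least that margin before the loss gets small (the function name and the default margin value are illustrative assumptions, not from the source):

```python
import math

def rm_margin_loss(r_chosen: float, r_rejected: float, margin: float = 1.0) -> float:
    # Margin pairwise loss: log(1 + exp(r_rejected - r_chosen + margin)).
    # A tie (r_chosen == r_rejected) is now penalized; the loss only
    # becomes small once the score gap exceeds `margin`.
    return math.log1p(math.exp(r_rejected - r_chosen + margin))

# A tie is still penalized under the margin loss:
print(rm_margin_loss(1.0, 1.0))  # log(1 + e) ~ 1.313
# A gap of 2 comfortably beats margin 1 -> loss drops:
print(rm_margin_loss(3.0, 1.0))  # ~0.313
```

A per-pair margin can also be set from annotator confidence, so strongly preferred pairs demand a larger score gap than borderline ones.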