SU-CS224N Paper Review
Last edited: August 8, 2025

Key Information
- Title: Fine-Grained Language Model Detoxification with Dense, Token-Level Rewards
- Team Member (in 224n): Houjun Liu <[email protected]>
- External Collaborators: Amelia Hardy <[email protected]>, Bernard Lange <[email protected]>
- Custom Project
- Mentor: we have no particular mentor within 224n
- Sharing Project: this project is shared with AA222, and is part of a research project PI’d by Mykel Kochenderfer <[email protected]>, in which Houjun is taking a leading role
Research Paper Summary
| Title | Fine-Grained Human Feedback Gives Better Rewards for Language Model Training |
|---|---|
| Venue | NeurIPS (Spotlight) |
| Year | 2023 |
| URL | https://arxiv.org/pdf/2306.01693 |
Background
Reinforcement Learning from Human Feedback (RLHF) has proven highly effective at improving language model (LM) performance by leveraging human preference judgments of output desirability, reducing the incidence of toxic or false generation trajectories (Ziegler et al. 2020). Naive application of RLHF has shown success in reducing toxicity in LM outputs, yet its effects can be inconsistent without further in-context guidance of the resulting model (Ouyang et al. 2022).
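For reference, the standard RLHF recipe this work builds on can be summarized as a pairwise reward-model loss plus a KL-regularized policy objective. The notation below is my own gloss of the common formulation in Ziegler et al. (2020) and Ouyang et al. (2022), not the paper’s exact equations.

```latex
% Standard RLHF formulation (Ziegler et al. 2020; Ouyang et al. 2022); notation is mine.
% 1. Reward model r_\theta fit on pairwise human preferences, y_w preferred over y_l:
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]
% 2. Policy \pi_\phi optimized against r_\theta with a KL penalty toward the reference model \pi_{\mathrm{ref}}:
\max_{\phi}\;\; \mathbb{E}_{x,\, y \sim \pi_\phi}\!\left[\, r_\theta(x, y) - \beta \log\frac{\pi_\phi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \,\right]
```

In the standard setup, the reward is a single scalar assigned to the whole output sequence; the reviewed paper’s contribution is to make this feedback fine-grained rather than sequence-level.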
SU-CS229 JAN062025
Last edited: August 8, 2025

SU-CS229 JAN082025
Last edited: August 8, 2025

Notation
New Concepts
- supervised learning
- machine learning evaluation
- root-mean-square error (the square root of mean squared error, so the error is reported in the same units as the target; this makes test-benchmark numbers easy to interpret)
- linear regression
- gradient descent
- normal equation (see the sketch after this list)
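
As a quick reference for the last two items, here is a minimal NumPy sketch (my own, not from the lecture) fitting the same least-squares objective two ways: iteratively with batch gradient descent and in closed form with the normal equation. The data and hyperparameters are made up for illustration.

```python
# Minimal NumPy sketch contrasting gradient descent with the normal equation
# for least-squares linear regression. Synthetic data; not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # design matrix with intercept column
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)      # noisy targets

# Batch gradient descent on J(theta) = (1/2) * sum((X @ theta - y)**2)
theta = np.zeros(3)
alpha = 0.001                                        # learning rate
for _ in range(5000):
    theta -= alpha * X.T @ (X @ theta - y)           # gradient of J w.r.t. theta

# Normal equation: closed-form minimizer, solving (X^T X) theta = X^T y
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

print(theta, theta_closed)                           # both should be close to true_theta
```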
Important Results / Claims
Questions
- Why can’t we use root-mean-square error for the training objective? It seems like it’s just more normalization… (see the note below)
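
One way to frame this (my own note, not from the lecture): the square root is monotone, so minimizing RMSE and minimizing the least-squares cost select the same parameters; the root mainly matters at evaluation time, where it restores the units of the target.

```latex
% RMSE versus the least-squares training cost (notation follows CS229 conventions; my own note).
\mathrm{RMSE}(\theta) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2},
\qquad
J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
% Since t \mapsto \sqrt{t/n} is monotone increasing,
% \arg\min_\theta \mathrm{RMSE}(\theta) = \arg\min_\theta J(\theta).
```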
