SU-CS224N Paper Review
Last edited: August 8, 2025

Key Information
- Title: Fine-Grained Language Model Detoxification with Dense, Token-Level Rewards
- Team Member (in 224n): Houjun Liu <[email protected]>
- External Collaborators: Amelia Hardy <[email protected]>, Bernard Lange <[email protected]>
- Custom Project
- Mentor: we have no particular mentor within 224n
- Sharing Project: this project is shared with AA222, and is part of a research project PI’d by Mykel Kochenderfer <[email protected]>, in which Houjun is taking a leading role
Research Paper Summary
| Title | Fine-Grained Human Feedback Gives Better Rewards for Language Model Training |
|---|---|
| Venue | NeurIPS (Spotlight) |
| Year | 2023 |
| URL | https://arxiv.org/pdf/2306.01693 |
Background
Reinforcement Learning from Human Feedback (RLHF) has proven highly effective at improving language model (LM) performance by leveraging human preference judgments of output desirability, reducing the incidence of toxic or false generation trajectories (Ziegler et al. 2020). Naive application of RLHF has shown success in reducing toxicity in LM outputs, yet its effects can be inconsistent without further in-context guidance of the resulting model (Ouyang et al. 2022).
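For reference, the standard RLHF recipe this work builds on can be summarized as a pairwise reward-model loss plus a KL-regularized policy objective. The notation below is my own gloss of the common formulation in Ziegler et al. (2020) and Ouyang et al. (2022), not the paper’s exact equations.

```latex
% Standard RLHF formulation (Ziegler et al. 2020; Ouyang et al. 2022); notation is mine.
% 1. Reward model r_\theta fit on pairwise human preferences, y_w preferred over y_l:
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]
% 2. Policy \pi_\phi optimized against r_\theta with a KL penalty toward the reference model \pi_{\mathrm{ref}}:
\max_{\phi}\;\; \mathbb{E}_{x,\, y \sim \pi_\phi}\!\left[\, r_\theta(x, y) - \beta \log\frac{\pi_\phi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \,\right]
```

In the standard setup, the reward is a single scalar assigned to the whole output sequence; the reviewed paper’s contribution is to make this feedback fine-grained rather than sequence-level.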
SU-CS229 JAN062025
Last edited: August 8, 2025

SU-CS229 JAN082025
Last edited: August 8, 2025

Notation
New Concepts
- supervised learning
- machine learning evaluation
- root-mean-square error (the square root of mean squared error, so the error is reported in the same units as the target; this makes test-benchmark numbers easy to interpret)
- linear regression
- gradient descent
- normal equation (see the sketch after this list)
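
As a quick reference for the last two items, here is a minimal NumPy sketch (my own, not from the lecture) fitting the same least-squares objective two ways: iteratively with batch gradient descent and in closed form with the normal equation. The data and hyperparameters are made up for illustration.

```python
# Minimal NumPy sketch contrasting gradient descent with the normal equation
# for least-squares linear regression. Synthetic data; not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # design matrix with intercept column
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)      # noisy targets

# Batch gradient descent on J(theta) = (1/2) * sum((X @ theta - y)**2)
theta = np.zeros(3)
alpha = 0.001                                        # learning rate
for _ in range(5000):
    theta -= alpha * X.T @ (X @ theta - y)           # gradient of J w.r.t. theta

# Normal equation: closed-form minimizer, solving (X^T X) theta = X^T y
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

print(theta, theta_closed)                           # both should be close to true_theta
```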
Important Results / Claims
Questions
- Why can’t we use root-mean-square error for the training objective? It seems like it’s just more normalization… (see the note below)
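
One way to frame this (my own note, not from the lecture): the square root is monotone, so minimizing RMSE and minimizing the least-squares cost select the same parameters; the root mainly matters at evaluation time, where it restores the units of the target.

```latex
% RMSE versus the least-squares training cost (notation follows CS229 conventions; my own note).
\mathrm{RMSE}(\theta) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2},
\qquad
J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
% Since t \mapsto \sqrt{t/n} is monotone increasing,
% \arg\min_\theta \mathrm{RMSE}(\theta) = \arg\min_\theta J(\theta).
```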
