## Key Information

**Title**: Fine-Grained Language Model Detoxification with Dense, Token-Level Rewards

**Team Member** (in 224n): Houjun Liu <[email protected]>

**External Collaborators**: Amelia Hardy <[email protected]>, Bernard Lange <[email protected]>

**Custom Project**

**Mentor**: we have no particular mentor within 224n

**Sharing Project**: this project is shared with AA222, and is part of a research project PI’d by Mykel Kochenderfer <[email protected]>, in which Houjun is taking a leading role

## Research Paper Summary

| Title | Fine-Grained Human Feedback Gives Better Rewards for Language Model Training |
|---|---|
| Venue | NeurIPS (Spotlight) |
| Year | 2023 |
| URL | https://arxiv.org/pdf/2306.01693 |

### Background

Reinforcement Learning from Human Feedback (RLHF) has proven effective at improving language model (LM) performance using human preference judgments of output desirability, reducing the incidence of toxic or false generation trajectories ((Ziegler et al. 2020)). Naive application of RLHF has succeeded in reducing the toxicity of LM outputs, yet its effects can be inconsistent without further in-context guidance of the resulting model ((Ouyang et al. 2022)).

Beyond RLHF itself, other approaches use human feedback on LM output trajectories to target toxicity reduction specifically: contrastive learning ((Lu et al. 2022)), or, through a combination of instruction-following RLHF and in-context learning (i.e. prompting), eliciting LM self-correction in output trajectories ((Ganguli et al. 2023)).

Though the works above highlight that human-reviewed preference data are largely helpful for LM detoxification, these approaches suffer from a major overriding issue. Because they are mostly preference- or ranking-based schemes over entire output trajectories, the naive RLHF reward signal is fairly sparse, especially given long contexts. (Ramamurthy et al. 2023) show empirically that reward signals from direct application of RLHF may be unreliable for long-form output trajectories.

### Contributions

Fine-Grained RLHF ((Wu et al. 2023)) (FG-RLHF) has demonstrated success in limiting the toxicity of LMs through a *dense*, token-level formulation of the “human-preference” reward, which in naive RLHF is usually constant throughout an entire output trajectory.

The authors first collected “fine-grained preference data” in the form of a sequence of manually annotated (or, in the case of toxicity, automatically annotated via the Google Jigsaw Perspective API ((Lees et al. 2022))) spans within sampled output trajectories of the language model. Each span contains two fields: a `class` indicating what type of comment is being made, and a `score` \(r \in [-1,1]\) indicating whether the behavior is good or bad. For instance, an annotator may mark a section of the output as `Relevant +0.3` or `Toxic -0.5`. In this work, each span is at least one sentence long, but can also encompass the entire document (the authors discuss “information completeness” as one such annotation that can only be applied document-wide).

Then, to tune the target language model to follow these human annotations, the authors provide a slight reformulation of RLHF that supplies reward to the model densely.

In particular, the authors formulate the task of language modeling as a token-level Markov Decision Process (MDP), where, at each timestep, the language model takes an “action” within the MDP framework by selecting a word \(w_{n} \sim P_{\theta}(\cdot | w_{1}, \dots, w_{n-1})\). Each \(w_{n}\) results in the “state” of the MDP being extended by that word, into \(\{w_1, \dots, w_{n-1}\} \cup \{w_{n}\}\).

This specific (and perhaps unsurprising) formulation leads to a reward assignment consistent with that of an MDP, given at *each timestep*: instead of giving reward as a preference score per generated trajectory, we assign a reward to each output token. This gives rise to \(r_{1}, \dots, r_{n}\) corresponding to each token in the output \(w_1, \dots, w_{n}\).
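As a minimal sketch of this formulation, the Python below steps through a toy token-level MDP. The names `rollout`, `toy_policy`, and `toy_reward` are hypothetical stand-ins introduced for illustration and are not part of FG-RLHF; a real setup would sample actions from the LM and score them with a learned or heuristic reward model.

```python
import random
from typing import Callable, List, Tuple

State = Tuple[str, ...]  # the prefix of tokens generated so far


def rollout(
    policy: Callable[[State, random.Random], str],
    reward_fn: Callable[[State, str], float],
    prompt: State,
    max_tokens: int,
    seed: int = 0,
) -> Tuple[State, List[float]]:
    """Generate tokens one at a time, collecting one reward per token."""
    rng = random.Random(seed)
    state = prompt
    rewards = []
    for _ in range(max_tokens):
        action = policy(state, rng)               # choose the next token
        rewards.append(reward_fn(state, action))  # dense, per-step reward
        state = state + (action,)                 # extend the state by it
    return state, rewards


# Hypothetical toy policy and reward, for illustration only.
def toy_policy(state: State, rng: random.Random) -> str:
    return rng.choice(["hello", "world", "rude"])


def toy_reward(state: State, action: str) -> float:
    return -1.0 if action == "rude" else 0.0


final_state, rewards = rollout(toy_policy, toy_reward, ("hi",), max_tokens=5)
```

Each entry of `rewards` lines up with exactly one generated token, in contrast to a trajectory-level scheme that would return a single scalar for the whole of `final_state`.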

To actually calculate \(r_{t}\), the authors simply sum the rewards given by each span covering token \(t\). For \(K\) annotation `class`es (e.g. “Relevant”), \(L_{k}\) spans for each class \(k\), a set \(T_{L_{k}}^{(j)}\) of the tokens within the \(j\)-th annotated span, a document-wide weight \(w_{k}\) for each `class`, and the score \(R_{k}(j)\) given by the annotator at span \(j\), we have:

\begin{equation} r_{t} = \sum_{k=1}^{K} \sum_{j=1}^{L_{k}} \qty(\bold{1}\qty(t \in T_{L_{k}}^{(j)}) w_{k} R_{k}(j)) - \beta \log \frac{P_{\theta}(w_{t}|w_{1} \dots w_{t-1})}{P_{\theta_{orig}}(w_{t}|w_{1} \dots w_{t-1})} \end{equation}

whereby, as in the original formulation of RLHF ((Ziegler et al. 2020)), we penalize the tuned model for drifting too far from the un-tuned model’s output distribution, with the penalty strength parameterized by \(\beta\).
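The reward above can be sketched in a few lines of Python. This is an illustrative reading of the equation, not the FG-RLHF authors’ code: the `Span` record, the function name, and the half-open `[start, end)` token indexing are assumptions made for this example, and the per-token log-probabilities are taken as given inputs.

```python
from typing import Dict, List, NamedTuple


class Span(NamedTuple):
    label: str    # the annotation class, e.g. "Toxic"
    start: int    # first token index covered (inclusive)
    end: int      # token index where the span stops (exclusive)
    score: float  # annotator score for this span, in [-1, 1]


def fine_grained_rewards(
    spans: List[Span],
    class_weights: Dict[str, float],
    logp_policy: List[float],
    logp_orig: List[float],
    beta: float,
) -> List[float]:
    """Per-token reward: weighted span scores minus a log-ratio penalty."""
    rewards = []
    for t in range(len(logp_policy)):
        span_term = sum(
            class_weights[s.label] * s.score
            for s in spans
            if s.start <= t < s.end  # indicator: token t lies in the span
        )
        penalty = beta * (logp_policy[t] - logp_orig[t])
        rewards.append(span_term - penalty)
    return rewards


# Four tokens: a document-wide "Relevant" span and a "Toxic" span on the
# last two tokens. Equal log-probabilities make the penalty term zero.
example = fine_grained_rewards(
    spans=[Span("Relevant", 0, 4, 0.3), Span("Toxic", 2, 4, -0.5)],
    class_weights={"Relevant": 1.0, "Toxic": 2.0},
    logp_policy=[-1.0, -1.0, -1.0, -1.0],
    logp_orig=[-1.0, -1.0, -1.0, -1.0],
    beta=0.1,
)
```

In the example, the first two tokens receive only the “Relevant” score, while the last two also absorb the doubly weighted “Toxic” score, so the reward differs within a single trajectory.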

By applying this reward signal with standard Proximal Policy Optimization ((Schulman et al. 2017)) on the GPT-2 model, using the Jigsaw Perspective API’s “toxicity score” ((Lees et al. 2022)) as a heuristic for \(R_{k}\), the authors reduced the toxicity elicitation success rate from \(13.0\%\) to \(8.1\%\) on the RealToxicityPrompts ((Gehman et al. 2020)) set, a set of harmless prefixes that nonetheless commonly result in toxic outputs (e.g. “they looked at her up and down, and…”). More excitingly, the model was able to preserve its perplexity on the dataset (i.e. the prompts, which are not toxic alone), obtaining a mean perplexity of \(9.77\), compared to a non-fine-tuned perplexity of \(9.58\) and \(11.75\) for full RLHF. Non-RLHF techniques ((Lu et al. 2022; Ganguli et al. 2023)) benchmarked on the same task were worse than or equal to the naive RLHF implementation on these metrics.

These results indicate that the densely formulated rewards both yielded better results on the actual task of avoiding toxic output trajectories and retained more of the model’s language modeling performance, with the authors arguing that the latter indicates higher sample efficiency in the training procedure.

### Limitation

First, given the work’s emphasis on a “fine-grained” reward model, it is interesting that the actual reward formulation still uses an indicator function over what are essentially *sentence-level* spans. The choice of sentence-level spans was never really justified and, in particular, was not benchmarked against simply applying RLHF to shorter generation trajectories instead of formulating a token-level reward. The authors correctly recognized that trajectory length is a major factor in the suitability of the reward signal, but simply reducing the length of the trajectory before RLHF ((Singhal et al. 2023)) would have been a helpful baseline for that claim.

Second, the in-context LM detoxification work the authors cite, against which this work was benchmarked as an alternative to RLHF, was done with dramatically larger models than GPT-2 ((Ganguli et al. 2023), where the authors even commented on the necessity of larger-scale models). It is therefore difficult to gauge whether the non-RLHF baseline results reported in the paper were an adequate benchmark for the model chosen here.

### Why this Paper

The paper provides a helpful reformulation of language modeling as an MDP, and actually takes advantage of this reformulation by proposing a way in which step-wise action rewards can be *non-sparsely* assigned. One long-standing challenge in LM alignment, especially in detoxification, is the shifting ((Welbl et al. 2021)) of an LM’s distribution away from any coverage of texts about sensitive topics that frequently co-occur with toxic texts, limiting the LM’s capabilities and creating unintentional *representational harm* ((Dixon et al. 2018)). We hope that a more fine-grained reward of this kind can lessen this behavior. More generally, this token-level reformulation should allow more fine-grained control over model generation.

### Wider research context

This paper is part of a body of work on post-training techniques in language modeling. It extends a widely used post-training technique, RLHF ((Ziegler et al. 2020)), which leverages human preferences as a final step in language modeling. The specific technique this paper investigates also draws on research in language model toxicity elicitation and prevention ((Gehman et al. 2020; Ganguli et al. 2023; Lu et al. 2022; Welbl et al. 2021)), promoting safer language models in deployment.

As mentioned above, the paper was selected both for its specific application to reducing LM output toxicity and for its general formulation of token-level rewards, which enables further work on aligning LM behavior to human or heuristic-based metrics that are not necessarily focused on toxicity. This property is particularly advantageous given the development of more specific, span-level metrics for language model evaluation ((Min et al. 2023)), which would complement the work here directly.

## Project Proposal

### Goal

Classic formulations of RLHF have demonstrated the capability to align language model outputs to specific preference or ranking data in an unsupervised manner. However, RLHF formulates its rewards over the entire generation trajectory, leading to a fairly sparse assignment of rewards.

Turn-based conversational dialogue is one such long-context language modeling task that is particularly susceptible to the risks of sparse, document-wide rewards ((Mehrabi et al. 2022; Wallace et al. 2019; Ramamurthy et al. 2023)). Furthermore, many application domains of conversational agents where toxicity may be elicited involve conversations about marginalized groups or sensitive topics, which themselves are not toxic; yet typical mitigation strategies may ((Welbl et al. 2021)) also shift the LM’s distribution away from any coverage of texts about these sensitive topics, limiting the LM’s capabilities and creating unintentional *representational harm* ((Dixon et al. 2018)).

Recent work on Fine-Grained RLHF ((Wu et al. 2023)) (FG-RLHF) has demonstrated success in limiting the toxicity of LMs through a novel formulation of language modeling as a step-wise Markov Decision Process (MDP), treating each token as a timestep, whereby rewards are *densely* assigned at each token based on *span*-level annotations of the target objective. This formulation avoids some of the problems of reward sparsity previously found in naive applications of RLHF in long-form contexts.

In this project, we propose an extension of FG-RLHF to the dialogue domain, in particular as a means to lower the susceptibility of LMs to long-form dialogue toxicity elicitation attacks while retaining their representational capability for sensitive topics.

We hypothesize that 1) applying an even more densely specified (word-, turn-, and multi-turn-level) RLHF scheme (using the same technique as proposed by (Ziegler et al. 2020), but importantly **not** keeping the reward constant at the sentence level as did (Wu et al. 2023)) can reduce the susceptibility of a language model to multi-turn adversarial toxicity attacks, while 2) due to our proposed method’s localized application of reward, the resulting policy will better retain its modeling performance in general (non-toxic) discussion of topics co-occurring with toxic content, thereby limiting the model’s *representational harm*.

### Task

The baseline task mirrors exactly that of the FG-RLHF work: to use a heuristic-based toxicity evaluation ((Lees et al. 2022)) as a densely formulated reward signal for *fine-grained* reinforcement learning on possibly toxic LM output trajectories. We add an evaluation beyond those presented in the FG-RLHF work, described in greater detail below, which measures the resulting model’s ability to model non-toxic yet sensitive topics after both naive RLHF and fine-grained reward schemes.

### Methods

The proposed work involves four key steps: first, we aim to leverage an LM that has not been tuned with RLHF to elicit toxic turn-based dialogue; second, we aim to use automated metrics to create localized utterance- and turn-level tags of toxicity within the elicited conversations; third, we aim to apply the Fine-Grained RLHF scheme ((Wu et al. 2023)) to those conversations, using the localized toxicity scores as a negative reward signal; lastly, we aim to evaluate the resulting policy both for toxic behavior, again via toxicity elicitation, and for its modeling capability on non-toxic sensitive topics, by scoring its perplexity over a linguistic bias dataset.

#### Data Gathering

##### Language Modeling and Toxicity Elicitation

We aim to leverage a large language model (LLM), Mistral 7B ((Jiang et al. 2023)), whose base LM variant was not conversation fine-tuned and therefore has not been supervised by existing variants of RLHF, together with the *RealToxicityPrompts* ((Gehman et al. 2020)) dataset to elicit toxic responses. Consistent with previous work ((Wu et al. 2023)), we will use nucleus sampling with \(p=0.9\) and temperature \(t=1.0\) to elicit a series of decoding sequences following an open-domain dialogue prompt ((Bae et al. 2022)). Within the last user turns, we will insert adversarial toxicity elicitation trajectories given by *RealToxicityPrompts*, and sample model decoding sequences.
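For readers unfamiliar with nucleus sampling, the sketch below shows the idea on a plain list of logits, using only the Python standard library. The function names are hypothetical and chosen for this example; an actual run would rely on the sampler built into the LM library rather than this code.

```python
import math
import random
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs: List[float], p: float) -> List[int]:
    """Smallest set of most-probable token ids whose total mass is >= p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept


def nucleus_sample(logits, p=0.9, temperature=1.0, rng=random) -> int:
    """Sample one token id from the renormalized top-p set."""
    probs = softmax(logits, temperature)
    kept = nucleus(probs, p)
    # random.choices treats the weights as relative, so no explicit
    # renormalization over the kept ids is needed.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With \(p=0.9\), low-probability tail tokens are never sampled, while a temperature of \(1.0\) leaves the model’s distribution over the remaining tokens unscaled.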

##### Toxicity Scoring and Reward Signal

The resulting conversation will be scored turn-wise via the Perspective API from Google Jigsaw ((Lees et al. 2022)), which has been used ((Ziegler et al. 2020; Wu et al. 2023; Mehrabi et al. 2022; Gehman et al. 2020)) as a standard heuristic for LM output toxicity. In particular, the API model confidence (“Toxicity Score”) has been shown to be a workable negative reward signal ((Wu et al. 2023)) for toxicity filtering training.

Consistent with previous work, we treat an open-domain conversation as a finite-horizon MDP, whereby rewards are densely assigned at each turn in inverse proportion to its toxicity rating. We will vary the span over which each “turn” is defined, as well as the discount rate, and evaluate model performance.

#### NLP/Neural Method

We will train our model following the procedure and objective outlined by ((Schulman et al. 2017; Wu et al. 2023)), consistent with previous literature. In particular, we will apply a span-level RLHF metric and optimize it using Proximal Policy Optimization (PPO).

Recall that we consider the task of language modeling as a fully observable MDP, in which each new token is drawn from \(P_{\theta}(a_{t} | S_{t})\): the language model \(P_{\theta}\) gives a distribution over words, from which an action \(a_{t} \in W\) is chosen given the prompt and tokens generated so far, \(S_{t}\).

We formalize the Jigsaw Perspective API as a model \(J\) which assigns a score to a sequence of words \(w_1, \dots, w_{N}\), such that:

\begin{equation} J(w_1, \dots, w_{N}) \in [0, 1] \end{equation}

where if a highly toxic statement exists among \(w_1, \dots, w_{N}\), \(J \to 1\), and otherwise \(J \to 0\). Investigations into the behavior of \(J\) indicate that, for a truly toxic subsequence of length \(k\) embedded within a larger sequence of length \(N\), \(J \to 1\) smoothly as \(k \to N\), and the inverse holds as well ((Lees et al. 2022) discuss this by framing the “non-toxic overall sequence” as a form of obfuscation).

To address this property of length-based decay, we define our optimization objective as an expectation of the toxicity score over multiple candidate spans around a center word. As a slight modification to the FG-RLHF framework ((Wu et al. 2023)), then, we first sample an output and formulate a token-level reward as:

\begin{equation} r_{t} = \sum_{n=1}^{N} \sum_{j=\min\qty(0, (t-n+1))}^{t+n} -J\qty(w_{j}, \dots, w_{j+N}) \frac{\alpha}{(\mid t-j\mid)} - \beta \log \frac{P_{\theta}(w_{t} | w_{t-1} \dots w_{0})}{P_{\theta_{init}}\qty(w_{t} | w_{t-1} \dots w_{0})} \end{equation}

where, under the framework of FG-RLHF, we essentially treat all windows of size \(N\) and below in the text as “spans”, score each span using the Perspective API, and weight each span in inverse proportion to its distance from the “center word” of the window (\(\frac{\alpha}{| t - j|}\)).
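The sliding-window idea can be sketched as follows. This is one possible reading of the reward rather than a definitive implementation, and it makes several assumptions that are not in the proposal: window start indices are clamped to the text, only windows containing token \(t\) contribute, the weight is \(\alpha / (1 + |t - j|)\) so that the \(j = t\) case is defined, the \(\beta\) log-ratio penalty is omitted, and a hypothetical keyword-based `stub_toxicity` stands in for the Perspective API.

```python
from typing import Callable, List, Sequence


def windowed_token_rewards(
    tokens: Sequence[str],
    toxicity: Callable[[Sequence[str]], float],
    max_window: int,
    alpha: float,
) -> List[float]:
    """Token rewards from toxicity scores of all windows covering a token."""
    total = len(tokens)
    rewards = [0.0] * total
    for t in range(total):
        for n in range(1, max_window + 1):
            first = max(0, t - n + 1)  # earliest start still covering t
            last = min(t, total - n)   # latest start that fits in the text
            for j in range(first, last + 1):
                score = toxicity(tokens[j:j + n])
                # Assumption: +1 in the denominator avoids dividing by zero
                # when the window starts exactly at token t.
                rewards[t] -= score * alpha / (1 + abs(t - j))
    return rewards


# Hypothetical stand-in for the Perspective API, for illustration only.
def stub_toxicity(window: Sequence[str]) -> float:
    return 1.0 if "bad" in window else 0.0


window_rewards = windowed_token_rewards(
    ["a", "bad", "b", "c"], stub_toxicity, max_window=2, alpha=1.0
)
```

In this toy example the flagged token receives the most negative reward, its neighbors receive smaller penalties through the windows they share with it, and the final token, which shares no window with it, receives none.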

Given \(L\) trajectory samples \(Y_1, …, Y_{L}\) from a single toxic prompt, then, we desire to:

\begin{equation} \max_{\theta} \mathbb{E}_{Y \sim Y_{j}} \mathbb{E}_{t} r_{t} \end{equation}

The remainder of the procedure exactly follows Proximal Policy Optimization ((Schulman et al. 2017)); however, we consider a single timestep to be a *token*, rather than an entire generation sequence. This will dramatically increase the number of reward model and evaluation calls, but does not increase memory usage, because the evaluation of each token can be computed separately (conditioned upon the actual output trajectory sampled from the LM, which can be offloaded from active memory) and need not fit within one batch.

In particular, let us define a symbol \(s_{t}\) as a partial output trajectory \(w_1, \dots, w_{t}\); we further define a surrogate reward model \(V_{\phi}: S \to \mathbb{R}\) to estimate the quality of a particular partial output. We will use a smaller model (such as the T5 encoder ((Raffel et al. 2023))) and learn its parameters \(\phi\).

For an output of length \(T\), we formulate our **advantage** at a timestep as:

\begin{equation} A_{t} = \sum_{t'=t}^{T} (\gamma\lambda)^{t'-t} \qty(r_{t'} + \gamma V_{\phi} \qty(s_{t' + 1}) - V_{\phi} \qty(s_{t'})) \end{equation}

we further define a “target value” as:

\begin{equation} V^{(t)}\qty(s_{t}) = \sum_{t' = t}^{T-1} \gamma^{t' - t} r_{t'} + \gamma^{T-t} V_{\phi} \qty(s_{T}) \end{equation}

where \(\gamma\) and \(\lambda\) are both hyperparameters, with \(\gamma\) being the discount factor and \(\lambda\) an additional decay applied to the terms of the advantage. Finally, we update our model parameters via classic PPO:

\begin{equation} \begin{cases} \theta \leftarrow \arg \max_{\theta} \mathbb{E}_{Y \sim Y_{j}} \mathbb{E}_{t} \min \qty( \frac{P_{\theta}(a_{t}|s_{t})}{P_{\theta_{old}}(a_{t}|s_{t})} A_{t}, \text{clip} \qty( \frac{P_{\theta}(a_{t}|s_{t})}{P_{\theta_{old}}(a_{t}|s_{t})}, 1-\epsilon, 1+\epsilon) A_{t})\\ \phi \leftarrow \arg \min_{\phi} \mathbb{E}_{Y \sim Y_{j}} \mathbb{E}_{t} \qty(V_{\phi}(s_{t}) - V^{(t)}(s_{t}))^{2} \end{cases} \end{equation}
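The quantities above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not a training loop: tokens are 0-indexed, `rewards` holds \(T\) per-token rewards, `values` holds \(T + 1\) value estimates (one per state, including the final one), and the clipped term uses the standard PPO form in which the clipped ratio is multiplied by the advantage. The function names are hypothetical.

```python
from typing import List, Tuple


def advantages_and_targets(
    rewards: List[float],  # one reward per generated token
    values: List[float],   # value estimates per state; len(rewards) + 1
    gamma: float,
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Backward pass computing per-token advantages and value targets."""
    steps = len(rewards)
    advantages = [0.0] * steps
    targets = [0.0] * steps
    running_adv = 0.0
    running_return = values[steps]  # bootstrap from the final state's value
    for t in reversed(range(steps)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running_adv = delta + gamma * lam * running_adv
        running_return = rewards[t] + gamma * running_return
        advantages[t] = running_adv
        targets[t] = running_return
    return advantages, targets


def clipped_surrogate(ratio: float, advantage: float, epsilon: float) -> float:
    """Per-token PPO objective: min(ratio * A, clip(ratio) * A)."""
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)


def value_loss(value: float, target: float) -> float:
    """Squared error between the value estimate and its target."""
    return (value - target) ** 2


adv, tgt = advantages_and_targets(
    rewards=[1.0, 0.0, -1.0], values=[0.5, 0.5, 0.5, 0.0], gamma=1.0, lam=1.0
)
```

The backward recursion avoids recomputing the nested sums for every token, and with \(\gamma = \lambda = 1\) each advantage reduces to the value target minus the value estimate of the same state, which is a convenient sanity check.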

### Baselines and Evaluation

After obtaining the improved policy, we aim to evaluate our resulting scheme against naive applications of RLHF (applying the same exact elicitation and tuning procedure, with the Perspective API toxicity score uniformly calculated and applied over the entire conversation).

We aim to perform our evaluation using two key metrics.

#### Toxicity Elicitation

We again follow the procedure outlined above, and measure the change in average discounted toxicity score given by the Perspective API for model output trajectories on the test partition of *RealToxicityPrompts*. We expect that both the incidence of toxicity and the average toxicity over turns will decrease after this procedure; previous work ((Wu et al. 2023)) has shown that fine-grained rewards may yield lower incidences of toxicity than a naive RLHF application.

#### Language Modeling

To benchmark our model against *representational harm* in shifting its modeling distribution away from sensitive non-toxic discussion, we will further measure the resulting policy’s fluency in sensitive topics.

We aim to use the LM perplexity on target sequences as a proxy for model fluency, and in particular to measure the change in parity and calibration of the policy before and after tuning when modeling different partitions of the BOLD dataset ((Dhamala et al. 2021)). We expect that, due to the localized nature of our reward and tuning procedure, our policy will have lower perplexities than a naive RLHF application, because samples of such sensitive conversations fall outside the spans of the actual localized toxicity.
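As a reminder of how the fluency proxy is computed, the sketch below derives perplexity from per-token log-probabilities (natural log) of a target sequence. The function names are hypothetical, and the log-probabilities are assumed to come from scoring the same text with the policy before and after tuning.

```python
import math
from typing import List


def perplexity(token_logprobs: List[float]) -> float:
    """exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_shift(before: List[float], after: List[float]) -> float:
    """Positive values mean the tuned policy models the text worse."""
    return perplexity(after) - perplexity(before)


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 3)
```

Comparing `perplexity_shift` across dataset partitions is one way to check whether tuning degraded some groups’ text more than others.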

### Ethics

First, this project involves eliciting and responding to toxicity arising in language models. Naturally, it will first involve demonstrating and evaluating an LM’s ability to produce toxic generation trajectories. Beyond examples of corrected behavior given in our article, we do not intend to release any completions produced during the experiment (except to the Perspective API, for evaluation), and will clearly demarcate toxic content in our article, as is standard. Second, though we can evaluate the safety profile of our resulting model on a fixed dataset, we have no holistic counterfactual on the objective safety of our resulting model. As such, when we release the artifacts (code, weights) of our experiments, we expect to note clearly that this is a research work not intended for broader deployment without further evaluation. Ideally, this will limit the extent to which our work is misused as a general technique for detoxification without further evaluation. If the weights show no improvement in toxicity over the base model, we will not release them, to mitigate any harm they may cause.

### Note on Hardware

This project involves the use of a 7-billion parameter language model, as well as a roughly 770-million parameter reward model. We are in the process of making separate arrangements to obtain the compute necessary for this project, and are confident in our ability to obtain sufficient allocations to tune these models.

*Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency*, 862–72. Virtual Event Canada: ACM. doi:10.1145/3442188.3445924.

*Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society*, 67–73. New Orleans LA USA: ACM. doi:10.1145/3278721.3278729.

*Advances in Neural Information Processing Systems*35: 27591–609.

*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, edited by Houda Bouamor, Juan Pino, and Kalika Bali, 12076–100. Singapore: Association for Computational Linguistics. doi:10.18653/v1/2023.emnlp-main.741.

*Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)*, 2153–62. Hong Kong, China: Association for Computational Linguistics. doi:10.18653/v1/D19-1221.

*Findings of the Association for Computational Linguistics: EMNLP 2021*, 2447–69. Punta Cana, Dominican Republic: Association for Computational Linguistics. doi:10.18653/v1/2021.findings-emnlp.210.