## Introduction

Reinforcement Learning with Human Feedback (RLHF) has proven highly effective for aligning the behavior of a language model (LM) with human preference judgements over LM output trajectories ((Ziegler et al. 2020)). However, the original RLHF formulation showed little direct improvement in model toxicity without further prompting, though it conferred some advantage when the model was prompted specifically to be respectful ((Ouyang et al. 2022)).

To specifically target the reduction of harmful, toxic LM output, varying approaches have been explored: contrastive learning ((Lu et al., n.d.)), and, through a combination of instruction-following RLHF and in-context learning (i.e., prompting), sampling and self-correcting LM output trajectories ((Ganguli et al. 2023)).

These approaches, however, suffer from a major overriding issue: because they are mostly preference- or ranking-based schemes over the entire distribution of sampled output trajectories, the naive RLHF reward signal is fairly sparse over each output trajectory. (Ramamurthy et al. 2023) shows empirically that reward signals from a direct application of RLHF may be unreliable for long-form output trajectories.

Turn-based conversational dialogue is one such long-context language modeling task that is particularly susceptible to the risks of sparse, document-wide rewards. Two key issues make this domain especially challenging. First, previous work ((Mehrabi et al. 2022; Wallace et al. 2019)) has shown that model-specific adversarial contexts, which themselves appear harmless, can be optimized to elicit toxicity and placed in conversation turns before the toxic trigger. A reward applied over the entire conversation is therefore too sparse to isolate the attack, while a reward applied only over the triggering utterance does not generalize well enough to prevent the original elicitation. Second, many applications of conversational agents in which toxicity may be elicited involve conversations about marginalized groups or sensitive topics, which are not themselves toxic; yet typical mitigation strategies ((Welbl et al. 2021)) may also shift the LM's distribution away from any coverage of text about these sensitive topics, limiting the LM's capabilities and creating unintentional *representational harm* ((Dixon et al. 2018)).

Recent work on Fine-Grained RLHF (FG-RLHF) ((Wu et al. 2023)) has demonstrated success in limiting the toxicity of LMs through a novel formulation of language modeling as a step-wise Markov Decision Process (MDP), treating each token as a timestep, whereby rewards are *densely* assigned at each token based on *span*-level annotations of the target objective. The decision model is then improved via Proximal Policy Optimization (PPO) ((Schulman et al. 2017)), similar to the scheme given in RLHF.

In our work, we propose an extension of FG-RLHF to the dialogue domain, in particular as a means to lower the susceptibility of LMs to multi-turn dialogue attacks while retaining their representational capability. We hypothesize that 1) a more densely specified (word-, turn-, and multi-turn-level) RLHF scheme (using the same technique as proposed by (Ziegler et al. 2020)) can reduce the susceptibility of a language model to multi-turn adversarial toxicity attacks, while 2) due to the localized application of reward, the resulting policy will better retain its modeling performance on general (non-toxic) discussion of topics co-occurring with toxic content, thereby limiting the model's *representational harm*.

## Methods

The proposed work involves four key steps: first, we aim to leverage an LM which has not previously been tuned with RLHF to elicit toxic turn-based dialogue; second, we aim to use automated metrics to create localized utterance- and turn-level tags of toxicity within the elicited conversations; third, we aim to apply the Fine-Grained RLHF scheme ((Wu et al. 2023)) to those conversations, using the localized toxicity scores as a negative reward signal; lastly, we aim to evaluate the resulting policy for toxic behavior again via toxicity elicitation, as well as for its modeling capability of non-toxic sensitive topics by scoring its perplexity over a linguistic bias dataset.

### Language Modeling and Toxicity Elicitation

We aim to leverage a large language model (LLM)—Mistral 7B ((Jiang et al. 2023)), whose base LM variant was not conversation fine-tuned and therefore has not been supervised by existing variants of RLHF—and the *RealToxicityPrompts* ((Gehman et al. 2020)) dataset to elicit toxic responses.

Consistent with previous work ((Wu et al. 2023)), we will use nucleus sampling with \(p=0.9\) and \(t=1.0\) to elicit a series of decoding sequences following an open-domain dialogue prompt ((Bae et al. 2022)). Within the last user turns, we will insert adversarial toxicity elicitation trajectories given by *RealToxicityPrompts*, and sample model decoding sequences.
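For concreteness, nucleus (top-p) sampling restricts each decoding step to the smallest set of tokens whose cumulative probability exceeds \(p\), then renormalizes and samples within that set. The following is a minimal pure-Python sketch (the `nucleus_sample` helper and its inputs are illustrative, not part of any cited codebase); `probs` stands for the next-token distribution produced by the LM:

```python
import random

def nucleus_sample(probs, p=0.9, rng=random):
    """Sample a token index from `probs`, restricted to the smallest
    set of tokens whose cumulative probability reaches p (top-p)."""
    # Sort token indices by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break
    # Renormalize over the nucleus and draw a sample from it.
    mass = sum(probs[i] for i in nucleus)
    r, acc = rng.random() * mass, 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]
```

With \(t=1.0\) the distribution is used as-is; lower temperatures would sharpen `probs` before this truncation step.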

### Toxicity Scoring and Reward Signal

The resulting conversation will be scored turn-wise via the Perspective API from Google Jigsaw ((Lees et al. 2022)), which has been used ((Ziegler et al. 2020; Wu et al. 2023; Mehrabi et al. 2022; Gehman et al. 2020)) as a standard heuristic for LM output toxicity. In particular, the API model confidence (“Toxicity Score”) has been shown to be a workable negative reward signal ((Wu et al. 2023)) for toxicity filtering training.

Consistent with previous work, we treat an open-domain conversation as a finite-horizon MDP, whereby rewards are densely assigned at each turn inversely proportional to its toxicity rating. We will vary the span and discount rate to which each “turn” is defined, and evaluate model performance.
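As a small illustration of this turn-level assignment (a sketch under our assumptions: Perspective-style scores in \([0, 1]\), with a negated linear mapping as one simple choice of "inversely proportional" reward; the function names are hypothetical):

```python
def turn_rewards(toxicity_scores, scale=1.0):
    """Dense per-turn rewards from toxicity scores in [0, 1]:
    more toxic turns receive lower (more negative) reward."""
    return [-scale * s for s in toxicity_scores]

def span_rewards(toxicity_scores, span=2, scale=1.0):
    """Variant where each turn's reward averages the toxicity of the
    trailing window of `span` turns, to experiment with span width."""
    out = []
    for i in range(len(toxicity_scores)):
        window = toxicity_scores[max(0, i - span + 1): i + 1]
        out.append(-scale * sum(window) / len(window))
    return out
```

Varying `span` (and the discount rate applied over these rewards) is the knob we propose to sweep when evaluating model performance.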

### Model Improvement

We will train our model following the procedure and objective outlined by ((Schulman et al. 2017; Wu et al. 2023)), consistent with previous literature. In particular, we will apply a span-level RLHF metric and optimize it using Proximal Policy Optimization (PPO).

Recall that we consider the task of language modeling as a fully-observable MDP, whereby each new token is generated according to \(P_{\theta}(a_{t} | s_{t})\), where the language model \(P_{\theta}\) defines a distribution over the vocabulary \(W\) from which the next token \(a_{t} \in W\) is drawn given the context \(s_{t}\).

We formalize the Jigsaw Perspective API as a model which assigns a score to a sequence of words \(w_1, \dots, w_{N}\):

\begin{equation} J(w_1, \dots, w_{N}) \in [0, 1] \end{equation}

where if a highly toxic statement exists among \(w_1, \dots, w_{N}\), \(J \to 1\), and otherwise \(J \to 0\). Investigations into the behavior of \(J\) indicate that, for a truly toxic subsequence of length \(k\) embedded within a larger sequence of length \(N\), \(J \to 1\) smoothly as \(k \to N\), and the inverse holds as well ((Lees et al. 2022) discuss this behavior by treating the surrounding non-toxic text as a form of obfuscation).

To address this property of length-based decay, we define our optimization objective as an expectation of the toxicity score over multiple candidate spans around a center word. As a slight modification to the FG-RLHF framework ((Wu et al. 2023)), we first sample an output trajectory and formulate a token-level reward as:

\begin{equation} r_{t} = \sum_{n=1}^{N} \sum_{j=\max\qty(0,\, t-n+1)}^{t} -J\qty(w_{j}, \dots, w_{j+n-1}) \frac{\alpha}{1 + \mid t-j\mid} - \beta \log \frac{P_{\theta}(w_{t} | w_{t-1} \dots w_{0})}{P_{\theta_{init}}\qty(w_{t} | w_{t-1} \dots w_{0})} \end{equation}

where, under the framework of FG-RLHF, we essentially consider every window of size \(N\) or below that contains token \(t\) a "span", score each span using the Perspective API, and weight each span inversely to the distance of its start \(j\) from the center word \(t\) (\(\frac{\alpha}{1 + | t - j|}\)).
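The windowed (toxicity) term of this reward can be checked directly; below is a minimal sketch under our assumptions, with `J` standing in for the Perspective API scorer (a stub, not the real API) and the KL-regularization term omitted for brevity:

```python
def token_reward(tokens, t, J, N=3, alpha=1.0):
    """Windowed toxicity reward for token position t: score every
    window of length <= N containing t with J, weight each window by
    alpha / (1 + |t - j|) where j is the window start, and accumulate
    the negated scores. (KL regularization term omitted.)"""
    r = 0.0
    for n in range(1, N + 1):
        for j in range(max(0, t - n + 1), t + 1):
            window = tokens[j:j + n]
            r += -J(window) * alpha / (1 + abs(t - j))
    return r
```

In practice `J` would be one batched Perspective API call per span; the stub form above is only meant to make the index arithmetic of the equation concrete.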

Given \(L\) trajectory samples \(Y_1, \dots, Y_{L}\) from a single toxic prompt, we then desire to:

\begin{equation} \max_{\theta} \mathbb{E}_{Y \sim Y_{j}} \mathbb{E}_{t} r_{t} \end{equation}

To do this, we will optimize this objective via Proximal Policy Optimization ((Schulman et al. 2017)).

In particular, let us define a symbol \(s_{t}\) as a partial output trajectory \(w_1, \dots, w_{t}\); we further define a surrogate value model \(V_{\phi}: S \to \mathbb{R}\) to estimate the quality of a particular partial output. We will use a smaller model (such as the T5 encoder ((Raffel et al. 2023))) to parameterize \(V_{\phi}\) and learn its parameters \(\phi\).

For an output of length \(T\), we formulate our **advantage** at a timestep as:

\begin{equation} A_{t} = \sum_{t’=t}^{T} (\gamma\lambda)^{t’-t} \qty(r_{t’} + \gamma V_{\phi} \qty(s_{t’ + 1}) - V_{\phi} \qty(s_{t’})) \end{equation}

we further define a “target value” as:

\begin{equation} V^{(t)}\qty(s_{t}) = \sum_{t’ = t}^{T-1} \gamma^{t’ - t} r_{t’} + \gamma^{T-t} V_{\phi} \qty(s_{T}) \end{equation}

where \(\lambda\) and \(\gamma\) are both hyperparameters: \(\gamma\) is the discount factor and \(\lambda\) the trace-decay parameter of the advantage estimator. Finally, we update our model parameters via classic PPO:

\begin{equation} \begin{cases} \theta \leftarrow \arg \max_{\theta} \mathbb{E}_{Y \sim Y_{j}} \mathbb{E}_{t} \min \qty( \frac{P_{\theta}(a_{t}|s_{t})}{P_{\theta_{old}}(a_{t}|s_{t})} A_{t}, \text{clip} \qty( \frac{P_{\theta}(a_{t}|s_{t})}{P_{\theta_{old}}(a_{t}|s_{t})}, 1-\epsilon, 1+\epsilon) A_{t})\\ \phi \leftarrow \arg \min_{\phi} \mathbb{E}_{Y \sim Y_{j}} \mathbb{E}_{t} \qty(V_{\phi}(s_{t}) - V^{(t)}(s_{t}))^{2} \end{cases} \end{equation}
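The advantage, target value, and clipped per-token objective above can be sketched as plain functions (an illustrative sketch only; in practice these quantities are computed over batched tensors inside a PPO training loop):

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Advantage estimates A_t from per-token rewards r_0..r_{T-1}
    and value estimates V(s_0)..V(s_T), computed backwards."""
    T = len(rewards)
    adv, a = [0.0] * T, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        a = delta + gamma * lam * a
        adv[t] = a
    return adv

def target_values(rewards, values, gamma=1.0):
    """Target V^{(t)}(s_t) = sum_{t'=t}^{T-1} gamma^{t'-t} r_{t'}
    + gamma^{T-t} V(s_T), used for the value-model regression."""
    T = len(rewards)
    return [sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
            + gamma ** (T - t) * values[T] for t in range(T)]

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped per-token objective:
    min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

The policy update maximizes the mean of `clipped_surrogate` over sampled trajectories and timesteps, while the value model minimizes the squared error against `target_values`.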

### Evaluation

After obtaining the improved policy, we aim to evaluate our resulting scheme against a naive application of RLHF (applying the same exact elicitation and tuning procedure, with the exception that the turn-based reward is applied uniformly over the entire conversation).

We aim to perform our evaluation following two key metrics.

#### Toxicity Elicitation

We again follow the procedure outlined above, and measure the change in average discounted reward given by the toxicity model on the test partition of *RealToxicityPrompts*. We expect that the incidence of toxicity and the average toxicity over turns will decrease after this procedure; previous work ((Wu et al. 2023)) has shown that this approach may yield lower incidences of toxicity than a naive RLHF application.
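The comparison metric itself is straightforward; a minimal sketch, where `conversations` is assumed to be a list of per-turn reward sequences (one per elicited conversation):

```python
def avg_discounted_reward(conversations, gamma=0.95):
    """Mean discounted sum of per-turn rewards across conversations,
    used to compare toxicity before and after tuning."""
    totals = []
    for rewards in conversations:
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        totals.append(g)
    return sum(totals) / len(totals)
```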

#### Language Modeling

To benchmark our model against *representational harm* in shifting its modeling distribution away from sensitive non-toxic discussion, we will further measure the resulting policy’s fluency in sensitive topics.

We aim to use LM perplexity on target sequences as a proxy for model fluency, and in particular to measure the change in parity and calibration of the policy, before and after tuning, when modeling different partitions of the BOLD dataset ((Dhamala et al. 2021)). We expect that, due to the localized nature of our reward and tuning procedure, our policy will have lower perplexities than a naive RLHF application, since samples of such sensitive conversations fall outside the spans of actual localized toxicity.
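Perplexity here follows the standard definition, the exponential of the negative mean per-token log-likelihood; a minimal sketch:

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sequence from per-token natural-log
    probabilities: exp(-mean log p). Lower is more fluent."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A model assigning each token probability 0.5, for instance, scores a perplexity of 2, regardless of sequence length.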

Dhamala, Jwala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. “BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation.” In *Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency*, 862–72. Virtual Event Canada: ACM. doi:10.1145/3442188.3445924.

Dixon, Lucas, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. “Measuring and Mitigating Unintended Bias in Text Classification.” In *Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society*, 67–73. New Orleans LA USA: ACM. doi:10.1145/3278721.3278729.

Wallace, Eric, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. “Universal Adversarial Triggers for Attacking and Analyzing NLP.” In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)*, 2153–62. Hong Kong, China: Association for Computational Linguistics. doi:10.18653/v1/D19-1221.

Welbl, Johannes, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. 2021. “Challenges in Detoxifying Language Models.” In *Findings of the Association for Computational Linguistics: EMNLP 2021*, 2447–69. Punta Cana, Dominican Republic: Association for Computational Linguistics. doi:10.18653/v1/2021.findings-emnlp.210.