Zen's Defense

levels of interp

  • probes: no causality
  • attribution (e.g. integrated gradients): no interpretation

methods of causal interventions

activation patching / interchange interventions

Record the activations from one run, swap them into another run, and observe the resulting output.
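A minimal sketch of the record-and-swap idea on a toy two-layer network. The weights, the inputs, and the `run`/`hidden`/`output` helpers are all made up for illustration and are not from the talk.

```python
# Toy network: hidden = relu(W1 x), output = W2 . hidden.
# All weights here are hypothetical, chosen so the arithmetic is easy to follow.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [1.0, -1.0]

def hidden(x):
    """Hidden activations for input x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]

def output(h):
    """Scalar output computed from hidden activations h."""
    return sum(w * hi for w, hi in zip(W2, h))

def run(x, patch=None):
    """Forward pass; `patch` maps a hidden-unit index to a value to swap in."""
    h = hidden(x)
    if patch:
        for idx, value in patch.items():
            h[idx] = value
    return output(h)

base, source = [2.0, 1.0], [0.0, 3.0]
source_h = hidden(source)               # record activations on the source run
clean = run(base)                       # unpatched base run
patched = run(base, {1: source_h[1]})   # base run with unit 1 swapped from source
print(clean, patched)
```

If swapping a single unit moves the output the way the source input would, that unit is causally implicated in the behavior.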

Features are not axis-aligned. The equality task can be found efficiently after (a rotation?)

three worlds of causal interventions

…as interp

“can we find interpretable causal mechanisms?” That is, “searching for a rotation” and then running interchange interventions.

  1. propose a model to align
  2. figure out if counterfactual matches
  3. solve for alignment
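The three steps above can be sketched on a two-dimensional toy. Everything here is an assumption for illustration: the hypothetical `encode` function stores two features in a basis rotated by 45 degrees, the proposed high-level model says "coordinate 0 holds feature a", and a coarse grid search stands in for the gradient-based search for a rotation that would be used in practice.

```python
import math

THETA_TRUE = math.pi / 4  # hypothetical: the angle the toy model uses internally

def rotate(v, theta):
    """Rotate a 2-D vector counterclockwise by theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def encode(a, b):
    """Toy 'network': stores features (a, b) in a non-axis-aligned basis."""
    return rotate([a, b], THETA_TRUE)

def decode(h):
    """Read the features back out of the hidden vector."""
    return rotate(h, -THETA_TRUE)

def interchange(h_base, h_source, theta, dim=0):
    """Move both vectors into the candidate basis, swap one coordinate, move back."""
    z_base = rotate(h_base, -theta)
    z_source = rotate(h_source, -theta)
    z_base[dim] = z_source[dim]
    return rotate(z_base, theta)

h_base = encode(1.0, 5.0)
h_source = encode(-2.0, 7.0)

# Step 1: the proposed model predicts the counterfactual (a from source, b from base).
expected = (-2.0, 5.0)

# Step 2: measure how far the intervened network is from that counterfactual.
def mismatch(theta):
    a_new, b_new = decode(interchange(h_base, h_source, theta))
    return abs(a_new - expected[0]) + abs(b_new - expected[1])

# Step 3: solve for the alignment (grid search as a stand-in for learning the rotation).
candidates = [k * math.pi / 16 for k in range(16)]
best = min(candidates, key=mismatch)

aligned = decode(interchange(h_base, h_source, best))
naive = decode(interchange(h_base, h_source, 0.0))  # axis-aligned swap, no rotation
print(best, aligned, naive)
```

The axis-aligned swap (`theta = 0.0`) mixes both features, while the swap in the recovered basis transfers feature a alone, which is the sense in which the rotation makes the mechanism interpretable.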

…as control

ReFT: “can we optimize our intervention for any task?” That is, can interventions be a good way to derive control?

ReFT applies only limited interventions to prompt tokens using the same notion of minor control
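A rough sketch of what a limited, prompt-only intervention could look like, assuming the low-rank form h + Rᵀ(Wh + b − Rh) from the ReFT paper (LoReFT). The notes do not say which variant was discussed, and the rank-1 parameters, hidden states, and `prompt_len` below are hypothetical values rather than trained ones.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def loreft(h, R, W, b):
    """Return h + R^T (W h + b - R h): edit h only inside the subspace spanned by R's rows."""
    target = [wh + bi for wh, bi in zip(matvec(W, h), b)]   # where the subspace should land
    current = matvec(R, h)                                  # where it currently is
    delta = [t - c for t, c in zip(target, current)]
    # Map the low-rank edit back into the full hidden space via R^T.
    return [hi + sum(R[k][i] * delta[k] for k in range(len(R)))
            for i, hi in enumerate(h)]

# Hypothetical rank-1 parameters (in practice these would be learned).
R = [[1.0, 0.0, 0.0]]   # one orthonormal row: the subspace is coordinate 0
W = [[0.0, 0.0, 0.0]]
b = [4.0]

# One hidden vector per token position; only the first `prompt_len` are edited.
hidden_states = [[1.0, 2.0, 3.0], [5.0, 6.0, 7.0], [9.0, 8.0, 7.0]]
prompt_len = 2
edited = [loreft(h, R, W, b) if t < prompt_len else h
          for t, h in enumerate(hidden_states)]
print(edited)
```

Only the chosen subspace of the prompt-token states changes; every other coordinate and every later position is left as the base model computed it.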

…as steering

AxBench tells us that most current steering objectives are untenable.

However, we can steer better simply by contrastive learning on both positive and negative cases. Via the notion of “negative steering”, we find that negative steering