levels of interp
- probes: correlational only, no causal evidence
- attribution (e.g. integrated gradients): highlights important inputs but offers no interpretation of the mechanism
methods of causal interventions
activation patching / interchange interventions
Record activations from one run, then swap them into the forward pass of another run and observe how the output changes.
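The recipe above can be sketched on a toy network. This is a minimal illustration, not any particular library's API: the two-layer MLP, its weights, and the inputs are all hypothetical stand-ins.

```python
import numpy as np

# Toy 2-layer MLP; all weights and inputs are hypothetical stand-ins.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 2))

def forward(x, patch=None):
    """Run the net; if `patch` is given, swap it in for the hidden activation."""
    h = np.maximum(x @ W1, 0.0)   # hidden activation: the intervention site
    if patch is not None:
        h = patch                 # interchange intervention: use the cached activation
    return h @ W2

x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)

# 1. Record the hidden activation on the clean run.
h_clean = np.maximum(x_clean @ W1, 0.0)

# 2. Re-run on the corrupted input with the clean activation patched in.
out_patched = forward(x_corrupt, patch=h_clean)
out_clean = forward(x_clean)
```

Because the patch fully overwrites the hidden layer here, the patched output recovers the clean output exactly; in a real model one patches a single site and measures how much of the behavior it restores.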
distributed alignment search
Features are not axis-aligned; search for a rotation of the representation space so that, e.g., an equality variable can be located efficiently in the rotated basis.
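A distributed interchange intervention can be sketched as: rotate into a learned basis, swap a few coordinates, rotate back. In DAS the rotation is optimized; here it is a fixed random orthogonal matrix purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 4, 1                                    # hidden size, intervened subspace size
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # columns: the (stand-in) learned basis

def distributed_interchange(h_base, h_source):
    z_base = Q.T @ h_base       # express both activations in the rotated basis
    z_source = Q.T @ h_source
    z_base[:k] = z_source[:k]   # swap only the aligned k-dim subspace
    return Q @ z_base           # rotate back to model coordinates

h_base = rng.normal(size=d)
h_source = rng.normal(size=d)
h_new = distributed_interchange(h_base, h_source)
```

In the rotated basis, the first k coordinates of the result come from the source run and the rest stay from the base run, so the intervention edits a direction no single neuron needs to encode.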
three worlds of causal interventions
…as interp
“can we find interpretable causal mechanisms?” That is, search for a rotation, then run interchange interventions.
- propose an alignment between high-level causal variables and model representations
- check whether counterfactual behavior matches under interchange interventions
- solve for the alignment (e.g., by optimizing the rotation)
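The “does the counterfactual match?” step is often scored as interchange-intervention accuracy (IIA). A minimal sketch on a toy task where the correct alignment is known by construction (the names and the 1-D “cause” variable are hypothetical; real DAS would optimize the alignment rather than hard-code it):

```python
import numpy as np

def low_level_model(x):
    h = x.sum()              # the hidden variable our proposed alignment points at
    return float(h > 0)

def intervened(x_base, x_source):
    # Interchange intervention: run on x_base but patch the aligned
    # variable with its value from the x_source run.
    h = x_source.sum()
    return float(h > 0)

def high_level_counterfactual(x_base, x_source):
    # What the high-level causal model predicts after the same swap.
    return float(x_source.sum() > 0)

rng = np.random.default_rng(4)
pairs = [(rng.normal(size=3), rng.normal(size=3)) for _ in range(100)]
iia = np.mean([intervened(b, s) == high_level_counterfactual(b, s)
               for b, s in pairs])
```

Here the alignment is correct by construction, so IIA is 1.0; a bad proposed alignment would produce mismatched counterfactuals and a lower score.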
…as control
ReFT: “can we optimize our intervention for any task?” That is, can interventions be a good way to exert control?
ReFT applies only limited, low-rank interventions at prompt-token positions, keeping to the same notion of minimal intervention.
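A LoReFT-style edit can be sketched as a rank-r update phi(h) = h + R^T (W h + b - R h), where R has orthonormal rows. The sizes and parameters below are hypothetical; in ReFT, R, W, and b are trained for the downstream task.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 8, 2                                      # hidden size, intervention rank
R = np.linalg.qr(rng.normal(size=(d, r)))[0].T   # (r, d) with orthonormal rows
W = rng.normal(size=(r, d))
b = rng.normal(size=r)

def loreft(h):
    """Edit only the r-dim subspace spanned by R's rows; leave the rest of h alone."""
    return h + R.T @ (W @ h + b - R @ h)

h = rng.normal(size=d)
h_edit = loreft(h)
```

The edit moves the projection R h to the learned target W h + b while leaving the orthogonal complement untouched, which is what makes the intervention low-rank and minimal.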
…as steering
AxBench tells us that most current steering objectives are untenable.
However, we can steer better by simply training contrastively on both positive and negative cases; this yields a notion of “negative steering,” where subtracting the learned direction suppresses the concept.
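One simple contrastive construction is a difference-of-means steering vector: average activations on positive vs. negative prompts and subtract. The activations below are hypothetical stand-ins for a real model's hidden states, and this is one common recipe, not necessarily the objective AxBench evaluates.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
acts_pos = rng.normal(loc=1.0, size=(32, d))   # activations on concept-positive prompts
acts_neg = rng.normal(loc=-1.0, size=(32, d))  # activations on concept-negative prompts

# Contrastive steering vector: difference of the two mean activations.
v = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)

def steer(h, alpha):
    """alpha > 0 steers toward the concept; alpha < 0 is negative steering."""
    return h + alpha * v

h = rng.normal(size=d)
```

Negative steering is then just a sign flip on alpha: the same vector that promotes the concept can be subtracted to suppress it.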
