Zen's Defense

levels of interp

probes: no causality
attribution (e.g. integrated gradients): no interpretation

methods of causal interventions

activation patching / interchange interventions

Record an activation on one input, then swap it into the run on another input (can thus see how the output changes)
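
A minimal sketch of the record-and-swap idea on a toy two-layer network. Everything here (the weights W1 and W2, the forward helper, the base and source inputs) is made up for illustration and not taken from the talk; a real setup would hook into a trained model instead.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # toy weights: input (3) -> hidden (4)
W2 = rng.normal(size=(2, 4))   # toy weights: hidden (4) -> output (2)

def forward(x, patch=None):
    """Run the toy network, optionally overwriting some hidden units.

    patch is (indices, values): hidden units at `indices` are replaced
    by `values` before the output layer is applied.
    """
    h = np.tanh(W1 @ x)
    if patch is not None:
        idx, values = patch
        h = h.copy()
        h[idx] = values
    return W2 @ h, h

base = np.array([1.0, 0.0, -1.0])     # hypothetical base input
source = np.array([0.0, 2.0, 0.5])    # hypothetical source input

# 1. Record activations on both inputs.
base_out, base_h = forward(base)
source_out, source_h = forward(source)

# 2. Swap every hidden unit from the source run into the base run.
all_units = np.arange(4)
patched_out, _ = forward(base, patch=(all_units, source_h[all_units]))

# 3. Swap a single unit to measure that unit's effect on the output.
one_unit = np.array([0])
partial_out, _ = forward(base, patch=(one_unit, source_h[one_unit]))
```

Swapping the whole hidden layer reproduces the source output, while swapping one unit isolates how much of the output change that unit carries.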

distributed alignment search

Features are not axis-aligned. After learning a rotation of the activation space, the variables of the equality task can be found efficiently.
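
A rough sketch of an interchange intervention in a rotated basis. Assumptions: the rotation here is random (in distributed alignment search it would be learned by gradient descent), and the names random_rotation, rotated_interchange and the subspace size k are hypothetical helpers, not an existing API.

```python
import numpy as np

def random_rotation(d, seed=0):
    """Return a d x d orthogonal matrix (stand-in for a learned rotation)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def rotated_interchange(base_h, source_h, R, k):
    """Swap the first k rotated coordinates from source into base.

    Columns of R are the directions of the rotated basis, so R.T @ h
    gives coordinates in that basis and R @ coords maps back.
    """
    base_rot = R.T @ base_h
    source_rot = R.T @ source_h
    mixed = base_rot.copy()
    mixed[:k] = source_rot[:k]   # interchange only the chosen subspace
    return R @ mixed

R = random_rotation(4)
base_h = np.array([1.0, -2.0, 0.5, 3.0])     # made-up activations
source_h = np.array([0.0, 1.0, 1.0, -1.0])
patched_h = rotated_interchange(base_h, source_h, R, k=2)
```

The patched activation agrees with the source along the first k rotated directions and with the base along the rest, which is what lets a feature spread over several neurons be swapped as one unit.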

three worlds of causal interventions

…as interp

“can we find interpretable causal mechanisms?” That is, search for a rotation and then run interchange interventions.

propose a model to align
figure out if the counterfactual matches
solve for alignment
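
The three steps above can be sketched on a toy equality task. All of this is an illustrative assumption: the toy "network" stores (a - b, a + b) in a basis rotated by 30 degrees, the proposed high-level model has one variable V = a - b, and the alignment is found by grid search over angles rather than by gradient descent. The score is the fraction of (base, source) pairs where the intervened network agrees with the high-level counterfactual.

```python
import itertools
import numpy as np

def rotation(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

TRUE_ANGLE = np.deg2rad(30.0)   # assumed hidden basis of the toy network
R_TRUE = rotation(TRUE_ANGLE)

def hidden(a, b):
    # Toy low-level network: stores (a - b, a + b) in a rotated basis.
    return R_TRUE @ np.array([a - b, a + b], dtype=float)

def readout(h):
    # Predicts "equal" when the (a - b) direction is near zero.
    return abs((R_TRUE.T @ h)[0]) < 0.5

def high_level_counterfactual(base, source):
    # Step 1, the proposed causal model: V = a - b, output = (V == 0).
    # Taking V from the source makes the output depend on the source only.
    return source[0] == source[1]

def iia(angle):
    """Step 2: how often does the intervened network match the counterfactual?"""
    R = rotation(angle)
    inputs = list(itertools.product(range(3), repeat=2))
    hits = []
    for base, source in itertools.product(inputs, repeat=2):
        mixed = R.T @ hidden(*base)
        mixed[0] = (R.T @ hidden(*source))[0]   # swap the aligned coordinate
        hits.append(readout(R @ mixed) == high_level_counterfactual(base, source))
    return float(np.mean(hits))

# Step 3: solve for the alignment (grid search as a stand-in for training).
angles = np.linspace(0.0, np.pi, 181)
scores = [iia(a) for a in angles]
best_score = max(scores)
```

Several angles close to the true one can tie at a perfect score, so the sketch reports the best score rather than claiming a unique best angle.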

…as control

ReFT: “can we optimize our intervention for any task?” That is, can interventions be a good way to derive control?

ReFT applies only a limited set of interventions, at prompt tokens, using the same notion of minor control
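
A hedged sketch of what a limited intervention on prompt tokens could look like. The notes do not give a formula; the low-rank edit h + R^T (W h + b - R h) below is the form I recall from the ReFT paper and should be treated as an assumption, as should the choice of positions and the helper names loreft and intervene_on_prompt. Here R, W and b are random; in practice they would be trained while the base model stays frozen.

```python
import numpy as np

def loreft(h, R, W, b):
    """Low-rank edit h + R^T (W h + b - R h), applied to each row of h.

    h: (n, d) hidden states, R: (r, d) with orthonormal rows,
    W: (r, d) projection, b: (r,) bias.
    """
    return h + (h @ W.T + b - h @ R.T) @ R

def intervene_on_prompt(hidden, positions, R, W, b):
    """Edit only the given token positions; all others are left untouched."""
    out = hidden.copy()
    out[positions] = loreft(hidden[positions], R, W, b)
    return out

rng = np.random.default_rng(0)
d, r, seq_len = 8, 2, 6
q, _ = np.linalg.qr(rng.normal(size=(d, r)))
R = q.T                                   # (r, d), orthonormal rows
W = rng.normal(size=(r, d))               # random stand-in for learned weights
b = rng.normal(size=r)

hidden = rng.normal(size=(seq_len, d))    # made-up hidden states for one prompt
positions = np.array([0, seq_len - 1])    # e.g. first and last prompt token
edited = intervene_on_prompt(hidden, positions, R, W, b)
```

Because R has orthonormal rows, the edited states satisfy R h' = W h + b inside the r-dimensional subspace, and nothing changes outside it or at other positions.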

…as steering

AxBench tells us that most current steering objectives are untenable.

However, we can steer better simply by contrastive learning on both positive and negative cases. Via the notion of “negative steering”, we find that negative steering