Houjun Liu

meeting 8/1

Updates

Yay, MEND works!

.mean() vs. .sum() for the dW maps?
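A minimal sketch of the tradeoff (the dW tensor and shapes here are hypothetical): .sum() scales with however many elements get pooled, so bigger batches or wider layers score systematically higher, while .mean() stays comparable across layers.

  import torch

  # Hypothetical per-example update map for one layer: (n_examples, d_out, d_in).
  dW = torch.randn(32, 768, 768)

  # .sum() scales with the number of pooled elements, so larger batches or
  # wider layers get systematically larger scores.
  per_weight_sum = dW.abs().sum(dim=0)    # (d_out, d_in)

  # .mean() is invariant to pool size, keeping maps comparable across layers.
  per_weight_mean = dW.abs().mean(dim=0)  # (d_out, d_in)

  print(per_weight_sum.norm() / per_weight_mean.norm())  # exactly n_examples = 32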

PPL Isn’t the Only Possible Metric

Even if our model is better on PPL, it's worse at SQuAD than Facebook's model (granted, it's been trained a lot less); we will rerun with the new pretraining model (we expect that no dropout will be better; see the paper above).
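For reference (a toy sketch, numbers made up): PPL is just exp of the mean per-token negative log-likelihood, so a small PPL edge can easily fail to transfer to SQuAD.

  import math

  def perplexity(nll_per_token):
      # exp of the mean token negative log-likelihood
      return math.exp(sum(nll_per_token) / len(nll_per_token))

  print(perplexity([2.1, 1.8, 2.4]))  # ~8.17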

Pretraining Updates (smaller BERT, which dataset?)

https://wandb.ai/jemoka/dropfree?nw=nwuserjemoka

Binary Masking with the Pretraining Above

Edit success

                     Our BERT (No Dropout)   Our BERT (Dropout)
  edit success       0.9709                  0.9723
  edit localization  0.8375                  0.8452
  mean activations   3853                    22511
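A minimal sketch of the kind of learned binary mask behind these numbers (straight-through sigmoid relaxation; the class and its names are assumptions, not the exact harness). The "mean activations" row corresponds to counting how many mask entries stay on.

  import torch
  import torch.nn as nn

  class BinaryNeuronMask(nn.Module):
      """Learnable {0,1} mask over one layer's hidden units."""
      def __init__(self, hidden_size):
          super().__init__()
          self.logits = nn.Parameter(torch.zeros(hidden_size))

      def forward(self, activations):
          probs = torch.sigmoid(self.logits)
          hard = (probs > 0.5).float()
          # Straight-through: hard 0/1 mask forward, soft gradient backward.
          mask = hard + probs - probs.detach()
          return activations * mask

      def n_active(self):
          # Analogue of the "mean activations" row: units that stay on.
          return int((torch.sigmoid(self.logits) > 0.5).sum())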

Yay! (more seriously)

Question

Paper Plan

Part 1: Skipping Dropout isn’t bad, and may even be good

  • pretraining
  • SQuAD

Part 2: Empirics: Dropout Has Knowledge Storage Consequences

  • knowledge neurons
  • integrated gradients (sketch after this list)
  • binary masking
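A minimal Riemann-sum sketch of integrated gradients over a layer's activations, as used for knowledge neurons (f, the baseline, and the step count are assumptions):

  import torch

  def integrated_gradients(f, x, baseline, steps=50):
      # IG_i = (x_i - x'_i) * integral_0^1 df/dx_i(x' + a * (x - x')) da,
      # approximated by a Riemann sum; f maps an activation vector to a
      # scalar, e.g. the correct-answer logit at the <mask> position.
      total = torch.zeros_like(x)
      for k in range(1, steps + 1):
          point = (baseline + (k / steps) * (x - baseline)).detach().requires_grad_(True)
          f(point).backward()
          total += point.grad
      return (x - baseline) * total / steps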

Part 3: Impact: look, editing is easier without dropout (no data yet)

  • consistency (this is weak, so theory may help here, or we can skip it)

  • MEND (just worked, yay!)

  • Finetune (ties back to SQuAD)

  • LoRA

    slowly reduce the ReFT update rank and see how edit success drops (see the sketch after this list)

(x verbs y)

  • for each, train/(MEND infer) on 90%, test on the other 10%, see if it works
    • “correct” with respect to the IID example
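For the rank sweep, a toy proxy (deliberately not the real harness, which would train a ReFT/LoRA editor at each rank and measure edit success on the held-out 10%): truncate a full edit update dW to rank r by SVD and watch the degradation curve.

  import torch

  torch.manual_seed(0)
  dW = torch.randn(768, 64) @ torch.randn(64, 768)  # a rank-64 "edit" update

  U, S, Vh = torch.linalg.svd(dW)
  for r in (64, 32, 16, 8, 4, 2, 1):
      approx = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]
      rel_err = ((dW - approx).norm() / dW.norm()).item()
      print(f"rank={r}: relative reconstruction error = {rel_err:.3f}")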

Causal Interventions

shall we? a simple interchange intervention? how does it work for a BERT? does it fit into this story?

(Ottawa capital <mask>) => (DC capital <mask>)
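A minimal interchange-intervention sketch with HuggingFace hooks (bert-base-uncased, layer 6, and the subject position are all assumptions; which layer/position to swap is exactly the open design question). It caches the source run's hidden state for the subject token and patches it into the base run:

  import torch
  from transformers import AutoTokenizer, AutoModelForMaskedLM

  tok = AutoTokenizer.from_pretrained("bert-base-uncased")
  model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()
  layer = model.bert.encoder.layer[6]  # which layer to swap at is an open choice

  source = tok("Ottawa is the capital of [MASK].", return_tensors="pt")
  base = tok("DC is the capital of [MASK].", return_tensors="pt")

  # 1) Cache the source run's output at the chosen layer.
  cached = {}
  hook = layer.register_forward_hook(lambda m, i, o: cached.update(h=o[0]))
  with torch.no_grad():
      model(**source)
  hook.remove()

  # 2) Interchange: swap in the source subject state at position 1 (the token
  # after [CLS]), assuming each subject is a single wordpiece.
  def patch(m, i, o):
      h = o[0].clone()
      h[0, 1] = cached["h"][0, 1]
      return (h,) + o[1:]

  hook = layer.register_forward_hook(patch)
  with torch.no_grad():
      logits = model(**base).logits
  hook.remove()

  mask_pos = (base["input_ids"][0] == tok.mask_token_id).nonzero().item()
  print(tok.decode([logits[0, mask_pos].argmax().item()]))  # now "canada"?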