One-Liner
Removing the low-singular-value components of weight matrices can actually improve model performance.
Motivation
Previous work has shown that SVD components can be pruned without significant performance degradation. This work shows that by choosing where to prune more carefully, we can obtain better-than-baseline performance.
Notable Methods
We do this by trying all reductions indexed by tuples \(\qty(\tau, \ell, \rho)\), where \(\tau\) is the parameter type (the q, k, v projections, the attention output, and the MLP input and output matrices), \(\ell\) is the layer number, and \(\rho\) is the rate of reduction.
We keep the reductions that improve performance on a factuality dataset.
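A minimal sketch of the search described above, to make the tuple notation concrete. It is not the paper's implementation. Assumptions: `reduce_rank`, `search_reductions`, `weights`, and `score_fn` are hypothetical names; \(\rho\) is treated as the fraction of singular values kept (the notes only call it a rate of reduction); and the "factuality" score is replaced by a toy score, closeness to a known clean matrix.

```python
import numpy as np


def reduce_rank(weight: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Truncated-SVD approximation of `weight`.

    `keep_fraction` is the share of singular values retained (an assumed
    convention for rho, not taken from the notes).
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    k = max(1, int(np.ceil(keep_fraction * s.size)))
    # Keep the k largest singular components and drop the rest.
    return (u[:, :k] * s[:k]) @ vt[:k, :]


def search_reductions(weights, score_fn, keep_fractions):
    """Try every (tau, layer, rho) tuple on its own and keep the improvements.

    `weights` maps (tau, layer) -> matrix; `score_fn` maps such a dict to a
    score where higher is better. Both are hypothetical stand-ins for a real
    model and a real validation set.
    """
    baseline = score_fn(weights)
    improved = []
    for (tau, layer), w in weights.items():
        for rho in keep_fractions:
            trial = dict(weights)  # shallow copy; only one matrix is replaced
            trial[(tau, layer)] = reduce_rank(w, rho)
            score = score_fn(trial)
            if score > baseline:
                improved.append((tau, layer, rho, score))
    return baseline, improved


# Toy setup: a rank-1 "signal" plus small noise, scored by closeness to the signal.
rng = np.random.default_rng(0)
signal = 5.0 * np.ones((16, 16))
weights = {("mlp_out", 0): signal + 0.1 * rng.standard_normal((16, 16))}
score_fn = lambda ws: -np.linalg.norm(ws[("mlp_out", 0)] - signal)
baseline, improved = search_reductions(weights, score_fn, [0.0625, 0.5])
```

In this toy case the low singular components are pure noise, so truncation moves the matrix closer to the signal. Whether the same intuition explains the gains on a real model is exactly what the notes leave open.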
Notes
Unclear whether this generalizes OOD. Is Table 1 OOD in the sense that the search is run on one training set and evaluated on many test sets, or is it run on many training sets and evaluated on many test sets?
