MOEReview Sharma: LASER

One-Liner

Removing the low singular value components of weight matrices can actually improve model performance.

Motivation

Previous work has shown that pruning SVD components is possible without significant performance degradation. This work shows that by choosing more carefully where to prune, we can obtain better-than-baseline performance.

Notable Methods

We do this by trying all reductions indexed by tuples \(\qty(\tau, \ell, \rho)\), where \(\tau\) is the parameter type (the q, k, v projections, attention output, MLP in and out), \(\ell\) is the layer number, and \(\rho\) is the rate of reduction.
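
For reference, a single reduction can be written as a truncated SVD of the chosen weight matrix. This is the standard low-rank approximation, not something spelled out in these notes: the symbols \(W\), \(\sigma_i\), \(u_i\), \(v_i\), \(r\), and \(k\) are introduced here for illustration, and the exact mapping from \(\rho\) to the retained rank \(k\) is an assumption.

\[
W = U \Sigma V^\top = \sum_{i=1}^{r} \sigma_i u_i v_i^\top, \qquad \sigma_1 \ge \dots \ge \sigma_r > 0, \qquad W_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top, \quad k \le r.
\]

Dropping the terms with \(i > k\) removes the lowest singular value components. By the Eckart-Young theorem, \(W_k\) is the best rank-\(k\) approximation of \(W\) in Frobenius norm.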

We keep the reductions that work, as measured on a factuality dataset.
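
The search-and-keep step might be sketched as below. This is a minimal illustration, not the paper's code: `search_reductions`, `evaluate`, `baseline`, and the toy scores are all hypothetical, and "works" is assumed to mean "scores above the unmodified model on the factuality dataset".

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple

# A configuration is a (tau, ell, rho) tuple:
# parameter type, layer number, rate of reduction.
Config = Tuple[str, int, float]


def search_reductions(
    param_types: Iterable[str],
    num_layers: int,
    rates: Iterable[float],
    evaluate: Callable[[Config], float],
    baseline: float,
) -> List[Tuple[Config, float]]:
    """Try every (tau, ell, rho) and keep those scoring above the baseline.

    `evaluate` is a hypothetical callback that applies one reduction to the
    model and returns its score on the factuality dataset.
    """
    kept = []
    for tau, ell, rho in product(param_types, range(num_layers), rates):
        score = evaluate((tau, ell, rho))
        if score > baseline:  # assumption: "works" means beats the baseline
            kept.append(((tau, ell, rho), score))
    # Sort best-scoring configurations first.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept


def toy_evaluate(config: Config) -> float:
    """Made-up scores for illustration only: one layer/type pair helps."""
    tau, ell, _rho = config
    return 0.6 if (tau == "mlp_in" and ell == 2) else 0.4


results = search_reductions(
    ["q", "mlp_in"], 3, [0.5, 0.9], toy_evaluate, baseline=0.5
)
```

With the toy scorer, only the two `mlp_in` configurations at layer 2 survive. In a real run, `evaluate` would be the expensive part, since it is called once per tuple.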

Notes

Unclear whether this generalizes OOD. In particular, is Table 1 OOD, i.e. was the search done on one training set and evaluated on many test sets, or done on many training sets and evaluated on many test sets?