One-Liner
“If you have infinite compute but limited samples, how do you pretrain?”
Novelty
- a closed-form best recipe that minimizes loss given a fixed data budget
Notable Methods
Outline
- Take 200 million tokens from a corpus and a ~300-million-parameter model
- Measure validation loss
- Vary training recipes
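A minimal sketch of the setup the bullets above describe: fix the token budget, sweep recipes, and compare validation loss. Everything below (aside from the recipe names and budget numbers taken from these notes) is a placeholder, not the paper's code.

```python
import random

DATA_BUDGET_TOKENS = 200_000_000   # fixed data budget from the notes
BASE_PARAMS = 300_000_000          # reference model size from the notes

def train_with_recipe(recipe: str, tokens: int, params: int) -> dict:
    """Placeholder for pretraining a model under a given recipe."""
    return {"recipe": recipe, "tokens": tokens, "params": params}

def validation_loss(model: dict) -> float:
    """Placeholder for measuring loss on a held-out validation split."""
    return random.uniform(3.0, 4.0)   # dummy value, not a real result

for recipe in ["epoching", "regularized parameter scaling", "ensembling"]:
    model = train_with_recipe(recipe, DATA_BUDGET_TOKENS, BASE_PARAMS)
    print(recipe, round(validation_loss(model), 3))
```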
regularized parameter scaling
Add more weight decay as models grow: more parameters, more weight decay (see the config sketch after the table).
Model | Weight decay
---|---
150M | 0.8
300M | 1.6
600M | 3.2
1.4B | 3.2
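A hedged sketch of what the table amounts to in a training config: pick the weight-decay coefficient from the model size when building the optimizer. The use of AdamW, the learning rate, and the helper name below are assumptions for illustration, not details from these notes.

```python
import torch

# Weight-decay coefficients from the table above, keyed by parameter count.
WEIGHT_DECAY_BY_PARAMS = {
    150_000_000: 0.8,
    300_000_000: 1.6,
    600_000_000: 3.2,
    1_400_000_000: 3.2,
}

def make_optimizer(model: torch.nn.Module, n_params: int, lr: float = 3e-4):
    """Illustrative helper: larger models get a larger weight-decay value."""
    weight_decay = WEIGHT_DECAY_BY_PARAMS[n_params]
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
```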
ensembling
Train several separate models (e.g., with different random data shuffles, different initializations, etc.) and then parameter-merge them.
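A minimal sketch of one reading of the line above: train K members that differ only in seed (initialization and data order), then parameter-merge them as a uniform average of their state dicts. Averaging member predictions instead of parameters is the other common variant; the tiny Linear stand-ins below are just for illustration.

```python
import copy
import torch

def merge_parameters(models: list[torch.nn.Module]) -> torch.nn.Module:
    """Uniformly average the parameters of identically shaped models."""
    merged = copy.deepcopy(models[0])
    avg_state = {
        name: torch.stack([m.state_dict()[name] for m in models]).mean(dim=0)
        for name in merged.state_dict()
    }
    merged.load_state_dict(avg_state)
    return merged

# Each ensemble member would be pretrained with its own seed (different
# initialization and data shuffle); tiny Linear layers stand in for full LMs.
members = []
for seed in range(4):
    torch.manual_seed(seed)
    members.append(torch.nn.Linear(16, 16))
merged_model = merge_parameters(members)
```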
Key Results
Different training recipes give different loss behavior:
- epoching: eventually overfits, and larger models do so more quickly
- regularized parameter scaling: loss improves faster at larger parameter scales
- ensembling: a lower loss asymptote (see the fit sketch below)
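The asymptote comparison is the kind of thing you would check by fitting a saturating scaling law per recipe and comparing the fitted asymptotes. The functional form L(N) = E + A·N^(−α) and all numbers below are assumptions for illustration, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, E, A, alpha):
    """Assumed form: loss decays toward an irreducible asymptote E."""
    return E + A * n ** (-alpha)

# Made-up illustrative points: (parameters in units of 100M, validation loss).
n = np.array([1.5, 3.0, 6.0, 14.0])         # 150M, 300M, 600M, 1.4B params
loss = np.array([3.60, 3.45, 3.36, 3.30])   # dummy numbers, not paper results

(E, A, alpha), _ = curve_fit(scaling_law, n, loss, p0=[3.2, 0.5, 0.5])
print(f"fitted loss asymptote E ~ {E:.2f}")  # compare E across recipes
```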
Takeaways
training recipes
- current approaches overfit
- regularization restores a scaling law (loss keeps improving with parameter count)
- ensembling decreases the loss asymptote
inference time efficiency
This is an ad for ensembling, and ensembling is expensive at inference time! But you can distill the ensemble down to a single dense model, and a 4-member ensemble distilled into one dense model can even outperform an optimal 4-ensemble.
Self-distillation can be good as well!
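A hedged sketch of the distillation idea: form the teacher distribution by averaging the ensemble members' predictions and train the single dense student against it with a KL term. The temperature and model interfaces are assumptions; as I read the notes, self-distillation would be the same loop with a single teacher of the same size as the student.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teachers, input_ids, temperature=1.0):
    """One step of ensemble-to-dense distillation via KL to the mean teacher."""
    with torch.no_grad():
        # Teacher target: average the ensemble members' token distributions.
        teacher_probs = torch.stack(
            [F.softmax(t(input_ids) / temperature, dim=-1) for t in teachers]
        ).mean(dim=0)
    student_logprobs = F.log_softmax(student(input_ids) / temperature, dim=-1)
    # KL(teacher || student), with log-prob inputs and prob targets.
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")
```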
continual pre-training
Training with ensembling still gives efficiency gains at larger data scales (e.g., 4B tokens).