One-Liner
“If you have infinite compute but limited samples, how do you pretrain?”
Novelty
- a closed-form best recipe that minimizes loss given a fixed data budget
Notable Methods
Outline
- Take 200 million tokens from a corpus and a ~300-million-parameter model
- Measure validation loss
- Vary training recipes
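A minimal sketch of the setup the bullets above describe: fix the token budget, sweep recipes, and compare validation loss. Everything below (aside from the recipe names and budget numbers taken from these notes) is a placeholder, not the paper's code.

```python
import random

DATA_BUDGET_TOKENS = 200_000_000   # fixed data budget from the notes
BASE_PARAMS = 300_000_000          # reference model size from the notes

def train_with_recipe(recipe: str, tokens: int, params: int) -> dict:
    """Placeholder for pretraining a model under a given recipe."""
    return {"recipe": recipe, "tokens": tokens, "params": params}

def validation_loss(model: dict) -> float:
    """Placeholder for measuring loss on a held-out validation split."""
    return random.uniform(3.0, 4.0)   # dummy value, not a real result

for recipe in ["epoching", "regularized parameter scaling", "ensembling"]:
    model = train_with_recipe(recipe, DATA_BUDGET_TOKENS, BASE_PARAMS)
    print(recipe, round(validation_loss(model), 3))
```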
regularized parameter scaling
Add more weight decay as models grow: more parameters, more weight decay (see the config sketch after the table).
Model | Weight decay
---|---
150M | 0.8
300M | 1.6
600M | 3.2
1.4B | 3.2
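A hedged sketch of what the table amounts to in a training config: pick the weight-decay coefficient from the model size when building the optimizer. The use of AdamW, the learning rate, and the helper name below are assumptions for illustration, not details from these notes.

```python
import torch

# Weight-decay coefficients from the table above, keyed by parameter count.
WEIGHT_DECAY_BY_PARAMS = {
    150_000_000: 0.8,
    300_000_000: 1.6,
    600_000_000: 3.2,
    1_400_000_000: 3.2,
}

def make_optimizer(model: torch.nn.Module, n_params: int, lr: float = 3e-4):
    """Illustrative helper: larger models get a larger weight-decay value."""
    weight_decay = WEIGHT_DECAY_BY_PARAMS[n_params]
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
```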
ensembling
Train several separate models (e.g., with different random data shuffles, different initializations, etc.) and then parameter-merge them.
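A minimal sketch of one reading of the line above: train K members that differ only in seed (initialization and data order), then parameter-merge them as a uniform average of their state dicts. Averaging member predictions instead of parameters is the other common variant; the tiny Linear stand-ins below are just for illustration.

```python
import copy
import torch

def merge_parameters(models: list[torch.nn.Module]) -> torch.nn.Module:
    """Uniformly average the parameters of identically shaped models."""
    merged = copy.deepcopy(models[0])
    avg_state = {
        name: torch.stack([m.state_dict()[name] for m in models]).mean(dim=0)
        for name in merged.state_dict()
    }
    merged.load_state_dict(avg_state)
    return merged

# Each ensemble member would be pretrained with its own seed (different
# initialization and data shuffle); tiny Linear layers stand in for full LMs.
members = []
for seed in range(4):
    torch.manual_seed(seed)
    members.append(torch.nn.Linear(16, 16))
merged_model = merge_parameters(members)
```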
Key Results
Different training recipes give different loss behavior:
- epoching: eventually overfits, and larger models do so more quickly
- regularized parameter scaling: loss improves faster at larger parameter scales
- ensembling: a lower loss asymptote (see the fit sketch below)
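The asymptote comparison is the kind of thing you would check by fitting a saturating scaling law per recipe and comparing the fitted asymptotes. The functional form L(N) = E + A·N^(−α) and all numbers below are assumptions for illustration, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, E, A, alpha):
    """Assumed form: loss decays toward an irreducible asymptote E."""
    return E + A * n ** (-alpha)

# Made-up illustrative points: (parameters in units of 100M, validation loss).
n = np.array([1.5, 3.0, 6.0, 14.0])         # 150M, 300M, 600M, 1.4B params
loss = np.array([3.60, 3.45, 3.36, 3.30])   # dummy numbers, not paper results

(E, A, alpha), _ = curve_fit(scaling_law, n, loss, p0=[3.2, 0.5, 0.5])
print(f"fitted loss asymptote E ~ {E:.2f}")  # compare E across recipes
```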
Takeaways
training recipes
- current approaches overfit
- regularization restores a scaling law (loss keeps improving with parameter count)
- ensembling decreases the loss asymptote
inference time efficiency
This is an ad for ensembling, and ensembling is expensive at inference time! But you can distill the ensemble down to a single dense model, and a 4-member ensemble distilled into one dense model can even outperform an optimal 4-ensemble.
Self-distillation can be good as well!
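A hedged sketch of the distillation idea: form the teacher distribution by averaging the ensemble members' predictions and train the single dense student against it with a KL term. The temperature and model interfaces are assumptions; as I read the notes, self-distillation would be the same loop with a single teacher of the same size as the student.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teachers, input_ids, temperature=1.0):
    """One step of ensemble-to-dense distillation via KL to the mean teacher."""
    with torch.no_grad():
        # Teacher target: average the ensemble members' token distributions.
        teacher_probs = torch.stack(
            [F.softmax(t(input_ids) / temperature, dim=-1) for t in teachers]
        ).mean(dim=0)
    student_logprobs = F.log_softmax(student(input_ids) / temperature, dim=-1)
    # KL(teacher || student), with log-prob inputs and prob targets.
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")
```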
continual pre-training
Training with ensembling still gives efficiency gains at larger data scales (e.g., 4B tokens).