Limited Samples and Infinite Compute

One-Liner

“If you have infinite compute but limited samples, how do you pretrain?”

Novelty

  • a closed-form best approach that minimizes loss given the current data budget

Notable Methods

Outline

  • Take 200 million tokens from a corpus and a ~300-million-parameter model
  • Measure validation loss
  • Vary training recipes

regularized parameter scaling

Stick in more weight decay, scaled with model size: more parameters, more weight decay.

Model   WD
150M    0.8
300M    1.6
600M    3.2
1.4B    3.2
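
A minimal sketch of what this recipe might look like in PyTorch, using the weight-decay values from the table; `WD_BY_SCALE`, `make_optimizer`, the learning rate, and the toy model are illustrative assumptions, not details from the paper.

```python
# Sketch: scale weight decay up with model size instead of keeping it fixed.
import torch
import torch.nn as nn

# Weight decay per model scale (values from the table above).
WD_BY_SCALE = {"150M": 0.8, "300M": 1.6, "600M": 3.2, "1.4B": 3.2}

def make_optimizer(model: nn.Module, scale: str, lr: float = 3e-4):
    # Larger models get heavier regularization.
    return torch.optim.AdamW(model.parameters(), lr=lr,
                             weight_decay=WD_BY_SCALE[scale])

# Toy usage: a tiny stand-in model at the "300M" setting.
toy_model = nn.Linear(512, 512)
opt = make_optimizer(toy_model, scale="300M")
```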

ensembling

Train a bunch of separate models (e.g., with different data shuffles, different initializations, etc.) and then parameter-merge them.
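
A rough sketch of this recipe as described above: independent runs that differ only in seed / data order, merged by averaging parameters. `train_one_member`, the toy `nn.Linear` model, and the member count are stand-ins; averaging the members' outputs instead of their parameters is another common way to ensemble.

```python
# Sketch: train K independent members, then merge them by averaging parameters.
import copy
import torch
import torch.nn as nn

def train_one_member(seed: int) -> nn.Module:
    torch.manual_seed(seed)        # different init (and data shuffle) per member
    model = nn.Linear(512, 512)    # toy stand-in for the real model
    # ... pretrain on the fixed token budget here ...
    return model

def merge_parameters(members: list[nn.Module]) -> nn.Module:
    merged = copy.deepcopy(members[0])
    with torch.no_grad():
        for name, p in merged.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in members])
            p.copy_(stacked.mean(dim=0))
    return merged

members = [train_one_member(seed) for seed in range(4)]
merged_model = merge_parameters(members)
```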

Key Results

Different training recipes give different results:

  • epoching: eventually overfits, and larger models do so more quickly
  • regularized parameter scaling: loss improves faster at larger parameter scales
  • ensembling: lower loss asymptote

Takeaways

training recipes

  1. current approaches overfit
  2. regularization recovers a well-behaved scaling law
  3. ensembling lowers the asymptotic loss

inference-time efficiency

This is an ad for ensembling, and ensembling is expensive at inference time. You can distill the ensemble down to a single dense model, and a 4-model ensemble distilled into a dense model can even outperform an optimal 4-model ensemble.

Self-distillation can help as well.
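
A minimal sketch of ensemble distillation: the student is trained to match the averaged teacher distribution with a KL loss. The loss name, temperature, and the choice to average probabilities are assumptions for illustration, not details from the paper.

```python
# Sketch: distill an ensemble into a single dense student via KL to the
# averaged teacher distribution.
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: list[torch.Tensor],
                 temperature: float = 1.0) -> torch.Tensor:
    # Ensemble teacher: average the members' softened probability distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits]
    ).mean(dim=0)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(log_student, teacher_probs, reduction="batchmean")

# Toy usage: random logits for a batch of 8 examples over a 100-way vocabulary.
student = torch.randn(8, 100)
teachers = [torch.randn(8, 100) for _ in range(4)]
loss = distill_loss(student, teachers)

# Self-distillation is the same loss with a single teacher: a copy of the
# model (or the ensemble/merged model) teaching a fresh student.
```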

continual pre-training

Training with ensembling still gives efficiency gains at larger data scales (4B tokens, etc.).

New Concepts

Notes