Limited Samples and Infinite Compute

Last edited: October 10, 2025

One-Liner

“If you have infinite compute but limited samples, how do you pretrain?”

Novelty

closed-form best approach which reduces loss given the current data budget

Notable Methods

Outline

Takes 200 million tokens, 300 million parameters, from a corpus
Measure validation loss
Vary training recipes

regularized parameter scaling

Stick in some more weight decay: more parameters, more weight decay

Model	WDK
150M	0.8
300M	1.6
600M	3.2
1.4B	3.2

ensembling

Train a bunch of seperate models (i.e. with random shuffles?) and then parameter merge; different initialization, etc.

op

Last edited: October 10, 2025

power series

Last edited: October 10, 2025

a power series centered at \(a\) is defined with \(c_{n} \in \mathbb{R}\), whereby:

\begin{equation} f(x) = \sum_{n=0}^{\infty} c_{n}(x-a)^{n} \end{equation}

meaning it is written as \(c_0 + c_1(x-a) + c_2(x-a)^{2} + c_3 (x-a)^{3} + \cdots\)

radius of convergence

there is a radius of convergence \(R \geq 0\) for any power series, possibly infinite, by which the series is absolutely convergent where \(|x-a| < R\), and it does not converge when \(|x-a| > R\) , the case where \(|x-a| = R\) is uncertain
ratio test: if all coefficients \(c_{n}\) are nonzero, and some \(\lim_{n \to \infty} \left| \frac{c_{n}}{c_{n+1}} \right|\) evaluates to some \(c\) — if \(c\) is positive or \(+\infty\), then that limit is equivalent to the radius of convergence
Taylor’s Formula: a power series \(f(x)\) can be differentiated, integrated on the bounds of \((a-R, a+R)\), the derivatives and integrals will have radius of convergence \(R\) and \(c_{n} = \frac{f^{(n)}(a)}{n!}\) to construct the series

linear combinations of power series

When \(\sum_{n=0}^{\infty} a_{n}\) and \(\sum_{n=0}^{\infty} b_{n}\) are both convergent, linear combinations of them can be described in the usual fashion:

Limited Samples and Infinite Compute

One-Liner

Novelty

Notable Methods

Outline

regularized parameter scaling

ensembling

op

power series

radius of convergence

linear combinations of power series

SU-CS229 OCT012025

Key Sequence

Notation

New Concepts

Important Results / Claims

Questions

Interesting Factoids

Scratch

SU-CS161 OCT022025

Key Sequence

New Concepts

Important Results / Claims

Questions

Interesting Factoids

scratchpad