Limited Samples and Infinite Compute
One-Liner
“If you have infinite compute but limited samples, how do you pretrain?”
Novelty
- a closed-form characterization of the recipe that minimizes loss at a fixed data budget
Notable Methods
Outline
- Take 200 million tokens from a corpus and train a 300-million-parameter model
- Measure validation loss
- Vary training recipes (a minimal harness sketch follows)
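A minimal sketch of this loop, under stated assumptions: `train_and_eval` is a hypothetical stand-in for a real pretraining run plus held-out evaluation, and the returned losses are placeholders, not results.

```python
import random

TOKEN_BUDGET = 200_000_000  # fixed data budget (tokens)
PARAM_COUNT = 300_000_000   # model size; compute is treated as unconstrained

RECIPES = ["baseline", "regularized parameter scaling", "ensembling"]

def train_and_eval(recipe: str, tokens: int, params: int) -> float:
    # Hypothetical stand-in: a real run would pretrain with `recipe` on
    # `tokens` tokens using a `params`-parameter model, then return the
    # held-out validation loss. The value below is a placeholder.
    return random.uniform(2.0, 4.0)

losses = {r: train_and_eval(r, TOKEN_BUDGET, PARAM_COUNT) for r in RECIPES}
print(losses, "-> best:", min(losses, key=losses.get))
```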
regularized parameter scaling
Increase weight decay alongside model size: the more parameters, the more weight decay (see the table below and the sketch after it).
| Model size | Weight decay |
|---|---|
| 150M | 0.8 |
| 300M | 1.6 |
| 600M | 3.2 |
| 1.4B | 3.2 |
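A minimal sketch of this scaling rule, assuming AdamW-style decoupled weight decay; the size-to-decay mapping mirrors the table, and the learning rate and nearest-size lookup are illustrative choices, not the paper's exact configuration.

```python
import torch
from torch import nn

# Weight decay keyed by parameter count, mirroring the table above.
WEIGHT_DECAY_BY_SIZE = {
    150_000_000: 0.8,
    300_000_000: 1.6,
    600_000_000: 3.2,
    1_400_000_000: 3.2,
}

def make_optimizer(model: nn.Module, lr: float = 3e-4) -> torch.optim.AdamW:
    """Choose weight decay from the nearest listed model size."""
    n_params = sum(p.numel() for p in model.parameters())
    nearest = min(WEIGHT_DECAY_BY_SIZE, key=lambda s: abs(s - n_params))
    return torch.optim.AdamW(
        model.parameters(), lr=lr, weight_decay=WEIGHT_DECAY_BY_SIZE[nearest]
    )
```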
ensembling
Train several separate models (e.g., with different data shuffles, different initializations, etc.) and then merge their parameters; a parameter-averaging sketch follows.
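A minimal parameter-averaging sketch, assuming all models share one architecture and every state entry is a floating-point tensor; this is plain uniform averaging (souping), not necessarily the paper's exact merge.

```python
import copy
import torch
from torch import nn

def merge_parameters(models: list[nn.Module]) -> nn.Module:
    """Uniformly average the parameters of same-architecture models."""
    merged = copy.deepcopy(models[0])
    state = merged.state_dict()
    with torch.no_grad():
        for key in state:
            # Assumes every state entry is a float tensor (no integer
            # buffers such as BatchNorm's num_batches_tracked).
            state[key] = torch.stack(
                [m.state_dict()[key] for m in models]
            ).mean(dim=0)
    merged.load_state_dict(state)
    return merged

# Example: merge three models trained from different seeds/shuffles.
models = [nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1)) for _ in range(3)]
souped = merge_parameters(models)
```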
power series
a power series centered at \(a\) is defined with \(c_{n} \in \mathbb{R}\), whereby:
\begin{equation} f(x) = \sum_{n=0}^{\infty} c_{n}(x-a)^{n} \end{equation}
meaning it is written as \(c_0 + c_1(x-a) + c_2(x-a)^{2} + c_3 (x-a)^{3} + \cdots\)
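As a concrete example, taking \(a = 0\) and \(c_{n} = 1\) for all \(n\) gives the geometric series:
\begin{equation} \sum_{n=0}^{\infty} x^{n} = \frac{1}{1-x} \quad \text{for } |x| < 1 \end{equation}
which converges exactly when \(|x| < 1\), so its radius of convergence is \(R = 1\).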
radius of convergence
- every power series has a radius of convergence \(R \geq 0\), possibly infinite: the series converges absolutely when \(|x-a| < R\) and diverges when \(|x-a| > R\); the boundary case \(|x-a| = R\) must be checked separately
- ratio test: if all coefficients \(c_{n}\) are nonzero and \(\lim_{n \to \infty} \left| \frac{c_{n}}{c_{n+1}} \right| = c\) exists (finite or \(+\infty\)), then \(R = c\) (worked example after this list)
- Taylor’s Formula: a power series \(f(x)\) can be differentiated and integrated term by term on \((a-R, a+R)\); the resulting series have the same radius of convergence \(R\), and the coefficients satisfy \(c_{n} = \frac{f^{(n)}(a)}{n!}\)
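A worked check of the ratio test using the standard series for \(e^{x}\) centered at \(a = 0\): Taylor’s Formula gives \(c_{n} = \frac{f^{(n)}(0)}{n!} = \frac{1}{n!}\), so
\begin{equation} R = \lim_{n \to \infty} \left| \frac{c_{n}}{c_{n+1}} \right| = \lim_{n \to \infty} \frac{(n+1)!}{n!} = \lim_{n \to \infty} (n+1) = +\infty \end{equation}
meaning the series converges for every real \(x\).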
linear combinations of power series
When \(\sum_{n=0}^{\infty} a_{n}\) and \(\sum_{n=0}^{\infty} b_{n}\) are both convergent, linear combinations of them converge in the usual fashion: for constants \(\alpha, \beta \in \mathbb{R}\),
\begin{equation} \sum_{n=0}^{\infty} (\alpha a_{n} + \beta b_{n}) = \alpha \sum_{n=0}^{\infty} a_{n} + \beta \sum_{n=0}^{\infty} b_{n} \end{equation}