Pretraining Data
Problems of pre-training data
- pre-training data influences downstream capabilities
- …and problems in the data can therefore escape into model generations
- real-world users expect novelty
Changes in Distribution
Big Pretraining Data
GPT-2
- deduplicated data
- removed Wikipedia (to prevent leakage into test sets)
- heuristic-based cleaning
GPT-3
- deduplicated
- removed data leaked from benchmark test sets
Llama
the usual spiel:
- removed high-perplexity data using an n-gram model trained on Wikipedia (see the sketch after this list)
- removed non-English data
- deduplicated
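A minimal sketch of such a perplexity filter, assuming a KenLM n-gram model trained on Wikipedia text sits at wiki.arpa (hypothetical path) and a hypothetical cutoff:

```python
import kenlm  # Python bindings for the KenLM n-gram toolkit

# Hypothetical path to an n-gram LM trained on Wikipedia text.
model = kenlm.Model("wiki.arpa")

# Hypothetical cutoff; real pipelines (e.g. CCNet) tune per-language thresholds.
MAX_PERPLEXITY = 1000.0

def keep(paragraph: str) -> bool:
    """Keep text only if the Wikipedia LM finds it unsurprising enough."""
    return model.perplexity(paragraph) <= MAX_PERPLEXITY

docs = ["The capital of France is Paris.", "zz9!! buy n0w cl1ck herE $$$"]
print([d for d in docs if keep(d)])
```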
Llama 2
- removed sources with a high volume of PII
- removed non-English data
Pretraining Curation Decisions
- what to include
- when was the data scraped (timestamp)?
- heuristic-based cleaning, data cleaning, etc. (see the sketch after this list)
- language filtering (only take English?)
- PII removal
- dedup
- Toxicity + SafeURL filtering
- “quality filtering”
- sampling distributions
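A toy sketch tying a few of these decisions together; every heuristic, regex, and threshold here is a hypothetical stand-in, not any lab's actual pipeline:

```python
import hashlib
import re

# Crude PII pattern (hypothetical); real pipelines use far richer detectors.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
MIN_WORDS = 5  # heuristic cleaning: drop very short documents

def clean(doc: str) -> str | None:
    """Heuristic cleaning + PII removal; returns None to drop the document."""
    doc = doc.strip()
    if len(doc.split()) < MIN_WORDS:
        return None
    return EMAIL_RE.sub("<EMAIL>", doc)  # redact rather than drop

def dedup(docs: list[str]) -> list[str]:
    """Exact dedup via content hashes (real pipelines add fuzzy dedup too)."""
    seen: set[str] = set()
    kept: list[str] = []
    for d in docs:
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(d)
    return kept

raw = [
    "Contact alice@example.com for the full meeting notes from Tuesday.",
    "Contact alice@example.com for the full meeting notes from Tuesday.",
    "too short",
]
cleaned = [c for c in (clean(d) for d in raw) if c is not None]
print(dedup(cleaned))  # -> one document, with the email redacted
```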
Change in Model Age
Good alignment is shown between the validation year and the pre-training year, even when mixing in older data.
Pretraining Long Transformers
Unfortunately, this note is not published online.
price
The price
prime
An integer \(p > 1\) is prime if it has no positive divisors other than \(1\) and itself.
No even number except \(2\) is prime, because every even number greater than \(2\) is divisible by \(2\).
additional information
There are infinitely many primes
Credit: Euclid.
Proof:
Assume to the contrary that there are finitely many primes \(p_1, \dots, p_n\). We aim to produce a new prime, reaching a contradiction.
Consider:
\begin{equation} N = p_1 \times \dots \times p_{n} + 1 \end{equation}
Note that \(p_1 \times \dots \times p_n\) is divisible by each of the \(p_j\). If some \(p_i \mid N\), then \(p_i\) also divides \(N - p_1 \times \dots \times p_n = 1\), which is impossible, since no prime divides \(1\). So no \(p_i\) divides \(N\). But \(N > 1\), so \(N\) has some prime divisor, which therefore lies outside \(p_1, \dots, p_n\): a contradiction.
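For example, \(N\) need not itself be prime; it only needs a prime factor outside the list. Taking the first six primes,
\begin{equation} N = 2 \times 3 \times 5 \times 7 \times 11 \times 13 + 1 = 30031 = 59 \times 509, \end{equation}
so \(N\) is composite, yet both of its prime factors are new.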