Pretraining Data
Problems of pre-training data
- pre-training data influences downstream capabilities
- …and problems in the data can therefore escape into model generations
- real-world users expect novelty
Changes in Distribution
Big Pretraining Data
GPT-2
- deduplicated data
- removed Wikipedia (to prevent leakage into test sets)
- heuristic-based cleaning
GPT-3
- deduplicated
- removed data leaked from benchmark test sets
Llama
the usual spiel:
- removed high-perplexity data using an n-gram model trained on Wikipedia (see the sketch after this list)
- removed non-English data
- deduplicated
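A minimal sketch of such a perplexity filter, assuming a KenLM n-gram model trained on Wikipedia text sits at wiki.arpa (hypothetical path) and a hypothetical cutoff:

```python
import kenlm  # Python bindings for the KenLM n-gram toolkit

# Hypothetical path to an n-gram LM trained on Wikipedia text.
model = kenlm.Model("wiki.arpa")

# Hypothetical cutoff; real pipelines (e.g. CCNet) tune per-language thresholds.
MAX_PERPLEXITY = 1000.0

def keep(paragraph: str) -> bool:
    """Keep text only if the Wikipedia LM finds it unsurprising enough."""
    return model.perplexity(paragraph) <= MAX_PERPLEXITY

docs = ["The capital of France is Paris.", "zz9!! buy n0w cl1ck herE $$$"]
print([d for d in docs if keep(d)])
```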
Llama 2
- removed sources with a high volume of PII
- removed non-English data
Pretraining Curation Decisions
- what to include
- when was the data scraped (timestamp)?
- heuristic-based cleaning, data cleaning, etc. (see the sketch after this list)
- language filtering (only take English?)
- PII removal
- dedup
- Toxicity + SafeURL filtering
- “quality filtering”
- sampling distributions
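A toy sketch tying a few of these decisions together; every heuristic, regex, and threshold here is a hypothetical stand-in, not any lab's actual pipeline:

```python
import hashlib
import re

# Crude PII pattern (hypothetical); real pipelines use far richer detectors.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
MIN_WORDS = 5  # heuristic cleaning: drop very short documents

def clean(doc: str) -> str | None:
    """Heuristic cleaning + PII removal; returns None to drop the document."""
    doc = doc.strip()
    if len(doc.split()) < MIN_WORDS:
        return None
    return EMAIL_RE.sub("<EMAIL>", doc)  # redact rather than drop

def dedup(docs: list[str]) -> list[str]:
    """Exact dedup via content hashes (real pipelines add fuzzy dedup too)."""
    seen: set[str] = set()
    kept: list[str] = []
    for d in docs:
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(d)
    return kept

raw = [
    "Contact alice@example.com for the full meeting notes from Tuesday.",
    "Contact alice@example.com for the full meeting notes from Tuesday.",
    "too short",
]
cleaned = [c for c in (clean(d) for d in raw) if c is not None]
print(dedup(cleaned))  # -> one document, with the email redacted
```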
Change in Model Age
Good alignment is shown between the validation year and the pre-training year, even when mixing in older data.
Pretraining Long Transformers
Unfortunately, this note is not published online.
price
The price
prime
An integer \(p > 1\) is prime if it has no positive divisors other than \(1\) and itself.
No even number except \(2\) is prime, because every even number greater than \(2\) is divisible by \(2\).
additional information
There are infinitely many primes
Credit: Euclid.
Proof:
Assume to the contrary that there are finitely many primes \(p_1, \dots, p_n\). We aim to produce a new prime, reaching a contradiction.
Consider:
\begin{equation} N = p_1 \times \dots \times p_{n} + 1 \end{equation}
Note that \(p_1 \times \dots \times p_n\) is divisible by each of the \(p_j\). If some \(p_i \mid N\), then \(p_i\) also divides \(N - p_1 \times \dots \times p_n = 1\), which is impossible, since no prime divides \(1\). So no \(p_i\) divides \(N\). But \(N > 1\), so \(N\) has some prime divisor, which therefore lies outside \(p_1, \dots, p_n\): a contradiction.
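For example, \(N\) need not itself be prime; it only needs a prime factor outside the list. Taking the first six primes,
\begin{equation} N = 2 \times 3 \times 5 \times 7 \times 11 \times 13 + 1 = 30031 = 59 \times 509, \end{equation}
so \(N\) is composite, yet both of its prime factors are new.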