Pretraining Data

Last edited: August 8, 2025

Problems with pre-training data

  1. pre-training data influences downstream capabilities
  2. …and can therefore escape verbatim into model generations (see the overlap sketch after this list)
  3. real-world users expect novelty
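
A quick empirical check for item 2 (my own sketch, on hypothetical toy strings): compare n-grams of a generation against the training corpus.

  # Sketch: flag possible memorization via 8-gram overlap between a model
  # generation and the training corpus. Toy strings; large-scale audits use
  # suffix arrays or Bloom filters instead.
  def ngrams(text: str, n: int = 8) -> set:
      toks = text.split()
      return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

  train = "the quick brown fox jumps over the lazy dog near the river bank today"
  gen = "model output: the quick brown fox jumps over the lazy dog near the river"

  leaked = ngrams(gen) & ngrams(train)
  print(len(leaked), "overlapping 8-grams")  # nonzero => possible memorization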

Changes in Distribution

Big Pretraining Data

GPT2

  • deduplicated data
  • removed Wikipedia (to prevent data leakage into benchmarks)
  • heuristic-based cleaning

GPT3

  • deduplicated
  • filtered to remove leaked data (overlap with benchmark test sets)

Llama

the usual spiel:

  • removed high-perplexity data using an n-gram model trained on Wikipedia (see the sketch after this list)
  • removed non-English data
  • deduplicated
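
A minimal sketch of that perplexity filter, assuming the kenlm Python bindings and a hypothetical pre-built Wikipedia model file (wiki.en.bin); the threshold is illustrative, not Llama's actual cutoff.

  # Perplexity filtering à la CCNet: score documents with an n-gram LM trained
  # on Wikipedia and drop the ones the LM finds too surprising.
  import kenlm

  model = kenlm.Model("wiki.en.bin")  # hypothetical path to a Wikipedia LM

  def keep(doc: str, max_ppl: float = 1000.0) -> bool:
      # kenlm computes perplexity from its internal log10 scores
      return model.perplexity(doc) <= max_ppl

  docs = ["The capital of France is Paris.", "zzz BUY NOW!!! qwxv 88 88 88"]
  kept = [d for d in docs if keep(d)]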

Llama 2

  • removed data from sites with high volumes of PII
  • removed non-English data

Pretraining Curation Decisions

  • what to include
  • what timestamp (crawl snapshot) is being scraped
  • heuristic-based cleaning? data cleaning? etc.
  • language filtering (only take English?)
  • PII removal
  • deduplication (see the sketch after this list)
  • Toxicity + SafeURL filtering
  • “quality filtering”
  • sampling distributions
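
A toy sketch tying a few of these decisions together (exact-hash dedup, a crude English heuristic, email masking); my own illustration, not any lab's actual pipeline, which would use MinHash dedup, fastText language ID, and richer PII detection.

  import hashlib
  import re

  EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # crude email matcher

  def curate(docs):
      seen = set()
      for doc in docs:
          h = hashlib.sha256(doc.encode()).hexdigest()
          if h in seen:                    # dedup: drop exact duplicates
              continue
          seen.add(h)
          ascii_ratio = sum(c.isascii() for c in doc) / max(len(doc), 1)
          if ascii_ratio < 0.9:            # language filter: English proxy
              continue
          yield EMAIL.sub("[EMAIL]", doc)  # PII removal: mask email addresses

  docs = ["mail me: a@b.com", "mail me: a@b.com", "こんにちは、世界"]
  print(list(curate(docs)))  # ['mail me: [EMAIL]']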

Change in Model Age

Good alignment is shown between the validation year and the pre-training year, even when older data is mixed in.

prime

Last edited: August 8, 2025

An integer \(p > 1\) is prime if it has no positive divisors other than \(1\) and itself.

No even number except \(2\) is prime, because \(2\) divides every even number, so any even number greater than \(2\) has \(2\) as a divisor other than \(1\) and itself.
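
A direct translation of the definition into code (my own sketch, trial division up to \(\sqrt{p}\)):

  def is_prime(p: int) -> bool:
      """True iff p > 1 has no positive divisors other than 1 and itself."""
      if p <= 1:
          return False
      d = 2
      while d * d <= p:  # any factorization p = a * b has min(a, b) <= sqrt(p)
          if p % d == 0:
              return False
          d += 1
      return True

  print([n for n in range(2, 30) if is_prime(n)])  # [2, 3, 5, 7, 11, 13, ...]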

additional information

There are infinitely many primes

Credit: Euclid.

Proof:

Assume to the contrary that there are finitely many primes \(p_1, \dots, p_n\). We construct a new prime to reach a contradiction.

Consider:

\begin{equation} N = p_1 \times \dots \times p_{n} + 1 \end{equation}

Note that \(p_1 \times \dots \times p_n\) is divisible by each of the \(p_j\). If some \(p_i \mid N\), then \(p_i\) divides \(N - p_1 \times \dots \times p_n = 1\), which is impossible, as \(1\) is not divisible by any prime. So no \(p_i\) divides \(N\). But \(N > 1\), so \(N\) has at least one prime factor, and that factor cannot be any of \(p_1, \dots, p_n\): a contradiction.
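
A worked example (mine, not from the original note), showing that \(N\) itself need not be prime: with \(2, 3, 5\), \(N = 2 \times 3 \times 5 + 1 = 31\), which is prime; with \(2, 3, 5, 7, 11, 13\), \(N = 30030 + 1 = 30031 = 59 \times 509\), which is composite, but its prime factors \(59\) and \(509\) are both new.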

prime factorization

Last edited: August 8, 2025