mixed-autonomy traffic
Vehicle Platooning
advantages:
- reduced congestion
- greater fuel economy
hierarchical control of platoons
- vehicle: lateral and longitudinal control
- link: platoon formation, splitting, reordering
- network: planning, goods assignment, etc.
Goal: can we dynamically form platoons that minimize travel time while also minimizing fuel cost?
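One hedged way to write this as a single objective (the scalarization weight λ and the symbols are my notation, not from these notes):

```latex
\min_{\pi \in \Pi} \; \sum_{v \in V} T_v(\pi) \;+\; \lambda \sum_{v \in V} F_v(\pi)
```

where π is a platoon assignment over vehicles V, T_v is vehicle v's travel time, F_v its fuel consumption, and λ trades the two off.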
Traffic Coordination with MARL
To coordinate UAVs, we can formulate the problem as a decentralized POMDP (Dec-POMDP). Key insight: roll out both your own policy and a simulation of the other agents' policies.
Also run MPC with truncated rollouts.
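A minimal sketch of the idea, assuming toy longitudinal dynamics, a placeholder model of the other agents, and a made-up cost (none of this is from a specific paper): each agent enumerates short action sequences, simulates the others with an assumed policy, scores truncated rollouts, and applies only the first action of the best plan.

```python
# Sketch: truncated-rollout MPC where the ego agent optimizes its own short action
# sequence while the other vehicles are advanced with a simulated (assumed) policy.
# Dynamics, costs, and the others' policy are hypothetical placeholders.
import itertools
import numpy as np

HORIZON = 5                   # truncated rollout length
ACTIONS = [-1.0, 0.0, 1.0]    # candidate accelerations for the ego vehicle

def step(positions, speeds, accels, dt=0.5):
    """Toy longitudinal dynamics for all vehicles."""
    speeds = np.maximum(speeds + accels * dt, 0.0)
    positions = positions + speeds * dt
    return positions, speeds

def others_policy(speeds):
    """Assumed behavior of the other vehicles: hold current speed (placeholder)."""
    return np.zeros_like(speeds)

def rollout_cost(positions, speeds, ego_idx, ego_plan):
    """Cost of one truncated rollout: travel-time proxy plus a rough fuel penalty."""
    cost = 0.0
    for a in ego_plan:
        accels = others_policy(speeds)
        accels[ego_idx] = a
        positions, speeds = step(positions, speeds, accels)
        cost += -speeds[ego_idx]      # slower -> longer travel time
        cost += 0.1 * a * a           # effort/fuel proxy
    return cost

def mpc_action(positions, speeds, ego_idx):
    """Enumerate short plans, keep the best, apply only its first action."""
    best = min(itertools.product(ACTIONS, repeat=HORIZON),
               key=lambda plan: rollout_cost(positions.copy(), speeds.copy(), ego_idx, plan))
    return best[0]

pos = np.array([0.0, 20.0, 45.0])
vel = np.array([15.0, 14.0, 13.0])
print("ego accel this step:", mpc_action(pos, vel, ego_idx=0))
```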
MoE Review Index
Project Thoughts
Overall Question: “Why is growing a better idea than training larger models from scratch?”
Cost of Specialization
Sub-Question: “how much performance does the load-balancing loss cost us relative to training on specialized data?” (see the sketch of the loss after this list)
- For our goals, our data is much more specific (i.e., personalized), so we don’t necessarily need to rely on ModuleFormer’s load-balancing loss tricks.
- Switch Transformers shows that with standard regularization (including dropout), a single expert can be sufficient to answer many questions (perhaps 1+1 as in shared-expert setups).
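For reference, a minimal numpy sketch of the Switch-style load-balancing auxiliary loss the sub-question refers to; the shapes, the toy input, and the weight α are assumptions for illustration.

```python
# Switch Transformers load-balancing auxiliary loss:
# aux = alpha * N * sum_i f_i * P_i, where f_i is the fraction of tokens whose
# top-1 expert is i and P_i is the mean router probability assigned to expert i.
import numpy as np

def load_balancing_loss(router_logits, alpha=0.01):
    # router_logits: [num_tokens, num_experts]
    num_tokens, num_experts = router_logits.shape
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)                   # softmax over experts
    top1 = probs.argmax(axis=-1)                                 # hard top-1 assignment
    f = np.bincount(top1, minlength=num_experts) / num_tokens    # dispatch fractions
    P = probs.mean(axis=0)                                       # mean router probabilities
    return alpha * num_experts * float(np.dot(f, P))

print(load_balancing_loss(np.random.randn(1024, 8)))
```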
How Much Expert is an Expert?
Sub-Question: “do all experts have to have the same representational power?”
MoE Review Fedus: Switch Transformers
At scale, with regularization (including dropout), k=1 on expert routing is fine!
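A minimal numpy sketch of what k=1 routing means at the layer level; the shapes and the use of plain linear experts are assumptions for illustration, not the paper's implementation.

```python
# Switch-style k=1 routing: each token goes to exactly one expert, and the
# expert output is scaled by that expert's router probability.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts = 16, 32, 4
x = rng.normal(size=(num_tokens, d_model))
router_w = rng.normal(size=(d_model, num_experts)) * 0.02
experts = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(num_experts)]

logits = x @ router_w
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
top1 = probs.argmax(axis=-1)                    # k=1: a single expert per token

y = np.zeros_like(x)
for e in range(num_experts):
    idx = np.where(top1 == e)[0]                # tokens routed to expert e
    if idx.size:
        y[idx] = (x[idx] @ experts[e]) * probs[idx, e][:, None]   # gate-scaled output
```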
MoE Review Gale: MegaBlocks
Standard MoEs either waste computation by padding unused capacity within each expert, or drop tokens assigned to an expert when it exceeds capacity (i.e. truncate so that we don’t have to pad too much).
Method
Instead of computing each expert as a batched dense matmul with a fixed capacity (padding unused slots or dropping overflow tokens), cast the MoE layer as block-sparse matrix multiplication, and leverage efficient block-sparse kernels to have variably sized experts with no padding and no dropped tokens.
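A rough numpy sketch of the dropless idea under my own toy setup (the real method runs block-sparse matmul kernels on GPU; here the math is just emulated with per-expert, variable-sized dense matmuls).

```python
# Contrast capacity-based MoE compute with the dropless formulation:
# with a fixed per-expert capacity some tokens must be dropped (or slots padded);
# in the dropless version every expert processes exactly its assigned tokens.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts = 64, 16, 4
x = rng.normal(size=(num_tokens, d_model))
experts = [rng.normal(size=(d_model, d_model)) * 0.05 for _ in range(num_experts)]
assignment = rng.integers(0, num_experts, size=num_tokens)   # top-1 routing result

# Capacity-based MoE: fixed capacity forces padding or token dropping.
capacity = int(1.25 * num_tokens / num_experts)
dropped = sum(max(0, (assignment == e).sum() - capacity) for e in range(num_experts))

# Dropless formulation: variable-sized token groups, nothing padded or dropped.
y = np.zeros_like(x)
for e in range(num_experts):
    idx = np.where(assignment == e)[0]
    y[idx] = x[idx] @ experts[e]

print(f"tokens dropped with fixed capacity: {dropped}; dropless: 0")
```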
MoE Review Kaushik: Universal Subspace Hypothesis
One-Liner
There’s a low-rank “shared” universal subspace across many pretrained LMs, which could thus be leveraged to adapt a model to new tasks more easily.
Notable Methods
Ran PCA and projected the variance from one architecture onto others (i.e., LoRAs trained for different tasks).
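A hedged sketch of the kind of analysis this describes, on synthetic matrices; the function names, flattened-update representation, and fraction-of-variance metric are my assumptions, not the paper's exact procedure.

```python
# Fit a low-rank principal subspace from one set of LoRA updates (PCA via SVD),
# then measure how much variance of another set is captured when projected onto it.
import numpy as np

def principal_subspace(deltas, rank):
    """deltas: [num_adapters, dim] matrix of flattened LoRA updates."""
    centered = deltas - deltas.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]                                   # [rank, dim] orthonormal basis

def variance_captured(deltas, basis):
    centered = deltas - deltas.mean(axis=0, keepdims=True)
    projected = centered @ basis.T @ basis             # projection onto the subspace
    return float((projected ** 2).sum() / (centered ** 2).sum())

rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 256))                     # pretend shared directions
lora_a = rng.normal(size=(20, 8)) @ shared + 0.1 * rng.normal(size=(20, 256))
lora_b = rng.normal(size=(20, 8)) @ shared + 0.1 * rng.normal(size=(20, 256))

basis = principal_subspace(lora_a, rank=8)
print("variance of B captured by A's subspace:", variance_captured(lora_b, basis))
```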
