k-means clustering
constituents
- dataset \(\qty {x^{(1)}, \dots, x^{(n)}}\)
- number of clusters \(k\)
requirements
Initialize cluster centroids \(\mu_{1}, \dots, \mu_{k}\) randomly, and repeat (a code sketch follows below):
- assignment step: for each \(i \in [1, \dots, n]\), assign point \(x^{(i)}\) to its nearest centroid by setting \(c^{(i)} = \arg\min_{j} \norm{x^{(i)}- \mu_{j}}^{2}_{2}\)
- update step: for each \(j \in [1, \dots, k]\), set \(\mu_{j} = \frac{\sum_{i=1}^{n} \mathbbm{1}\qty {c^{(i)}=j} x^{(i)}}{\sum_{i=1}^{n} \mathbbm{1}\qty {c^{(i)}=j}}\), i.e. the mean of the points currently assigned to cluster \(j\)
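A minimal NumPy sketch of this loop (illustrative only; it initializes centroids from sampled data points and runs a fixed number of iterations instead of checking for convergence):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means on X of shape (n, d); returns assignments c and centroids mu."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points (see "additional information").
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    c = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: each point goes to the centroid with smallest squared L2 distance.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
        c = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it.
        for j in range(k):
            if np.any(c == j):
                mu[j] = X[c == j].mean(axis=0)
    return c, mu
```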
additional information
some ways to pick better initial centroids
- sample the data points as the initial centroids
- k-means++
k-means++
- pick a data point uniformly at random as the first centroid
- pick each subsequent centroid with probability proportional to its squared distance to the nearest centroid chosen so far (sketch below)
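A short NumPy sketch of this initialization (the standard k-means++ procedure; variable names are mine):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ initialization: returns k initial centroids drawn from the rows of X."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform over data points
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen centroid.
        d2 = ((X[:, None, :] - np.asarray(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Next centroid: sampled with probability proportional to that squared distance.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centroids)
```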
distortion function
Consider the following function:
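Presumably the standard k-means objective, written with the assignments \(c^{(i)}\) and centroids \(\mu_{j}\) defined above:

\[
J(c, \mu) = \sum_{i=1}^{n} \norm{x^{(i)} - \mu_{c^{(i)}}}^{2}_{2}
\]

The assignment step minimizes \(J\) over \(c\) with \(\mu\) fixed, and the update step minimizes \(J\) over \(\mu\) with \(c\) fixed, so \(J\) is non-increasing and k-means converges to a local minimum of the distortion.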
Kaplan et al., 2020
OG scaling laws paper
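For reference (from the paper, not these notes), it fits power laws in model size \(N\), dataset size \(D\), and compute \(C\), roughly of the form

\[
L(N) = \qty(\frac{N_c}{N})^{\alpha_N}, \qquad L(D) = \qty(\frac{D_c}{D})^{\alpha_D}, \qquad L(C) = \qty(\frac{C_c}{C})^{\alpha_C},
\]

with the scale constants \(N_c, D_c, C_c\) and exponents \(\alpha\) fitted empirically.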
Predicting Scaling Laws
Following the Kaplan et al., 2020 scaling laws, at scale we get reasonably smooth trends for MMLU / ARC / etc.
We would like to predict these IN ADVANCE, from smaller-scale runs.
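A toy sketch of that idea (hypothetical numbers and an assumed power-law-plus-floor form, not the exact functional form or data from Kaplan et al., 2020): fit a curve on small-scale runs, then extrapolate to the target scale.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical small-scale measurements: log10(training FLOPs) vs. benchmark error rate.
log_compute = np.array([18.0, 18.5, 19.0, 19.5, 20.0])
error = np.array([0.72, 0.66, 0.60, 0.55, 0.50])

# Assumed form: error = a * C^(-b) + floor, i.e. a power law in compute
# plus an irreducible-error floor (illustrative, not the paper's fit).
def scaling_curve(log_c, a, b, floor):
    return a * np.power(10.0, -b * log_c) + floor

params, _ = curve_fit(scaling_curve, log_compute, error, p0=[300.0, 0.15, 0.3], maxfev=20000)

# Extrapolate to a larger run (1e22 FLOPs) to "predict in advance".
print(f"predicted error at 1e22 FLOPs: {scaling_curve(22.0, *params):.3f}")
```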
Pretraining Data
- Small scale: DCLM Baseline Data
- Legally friendly data: CommonPile
- Web-scraped data with quality grouping: NemoTron
people measure isoFLOP curves (performance at a fixed total compute budget)
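For reference, an isoFLOP sweep fixes a total compute budget \(C\) and trades off model size against data, often using the standard approximation

\[
C \approx 6ND,
\]

where \(N\) is the parameter count and \(D\) the number of training tokens; holding \(C\) fixed, one trains several \((N, D \approx C / 6N)\) pairs and picks the allocation with the lowest loss. (This approximation comes from the scaling-laws literature, not from these notes.)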
Problems of pre-training data
- pre-training data influences downstream capabilities
- …and can therefore leak into model generations
- real-world users expect novelty
Changes in Distribution
Big Pretraining Data
GPT2
- deduplicated data
- removed Wikipedia (to prevent data leakage)
- heuristic-based cleaning
GPT3
- deduplicated
- filtered based on leaked (benchmark-overlapping) data
Llama
the usual spiel:
- removed high-perplexity data using a Wikipedia n-gram model
- removed non-English
- deduplicated
Llama 2
- removed documents with a high volume of PII
- removed non-English
Pretraining Curation Decisions
- what to include
- what timestamp / scrape date to use
- heuristic-based cleaning, data cleaning, etc. (a toy filtering sketch follows after this list)
- language filtering (only take English?)
- PII removal
- dedup
- Toxicity + SafeURL filtering
- “quality filtering”
- sampling distributions
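A toy Python sketch of a few of these steps (illustrative only; real pipelines use proper language-ID models, fuzzy dedup such as MinHash, and much richer PII, toxicity, and quality filters):

```python
import hashlib
import re

# Toy corpus; in practice these would be web-scraped documents.
docs = [
    "The quick brown fox jumps over the lazy dog. Contact me at jane@example.com.",
    "The quick brown fox jumps over the lazy dog. Contact me at jane@example.com.",
    "%%%### $$$ ???",
    "Short.",
]

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def looks_english(text: str) -> bool:
    # Crude stand-in for a real language-ID model.
    ascii_letters = sum(c.isascii() and c.isalpha() for c in text)
    return ascii_letters / max(len(text), 1) > 0.5

def heuristic_clean(text: str) -> bool:
    # Heuristic quality rules: minimum length, not mostly symbols.
    if len(text.split()) < 5:
        return False
    alnum_ratio = sum(c.isalnum() or c.isspace() for c in text) / len(text)
    return alnum_ratio > 0.8

def scrub_pii(text: str) -> str:
    # Minimal PII removal: mask email addresses only.
    return EMAIL_RE.sub("<EMAIL>", text)

seen_hashes = set()
kept = []
for doc in docs:
    if not looks_english(doc) or not heuristic_clean(doc):
        continue
    doc = scrub_pii(doc)
    h = hashlib.sha256(doc.encode()).hexdigest()  # exact-match dedup
    if h in seen_hashes:
        continue
    seen_hashes.add(h)
    kept.append(doc)

print(kept)  # one cleaned, deduplicated, PII-scrubbed document survives
```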
Change in Model Age
Good alignment is shown between validation-set year and pre-training data year, even when mixing in older data.
