Posts

k-means clustering


constituents

  • dataset \(\qty {x^{(1)}, \dots, x^{(n)}}\)
  • number of clusters \(k\)

requirements

Initialize cluster centroids \(\mu_{1}, \dots, \mu_{k}\) randomly, and repeat (a NumPy sketch follows this list):

  • assign points \(x^{(i)}\) to cluster centers \(c^{(i)}\): for each \(i \in \qty{1, \dots, n}\), write \(c^{(i)} = \arg\min_{j} \norm{x^{(i)}- \mu_{j}}^{2}_{2}\)
  • update centroids: for each \(j \in \qty{1, \dots, k}\), write \(\mu_{j} = \frac{\sum_{i=1}^{n} \mathbbm{1}\qty {c^{(i)}=j}\, x^{(i)}}{\sum_{i=1}^{n} \mathbbm{1}\qty {c^{(i)}=j}}\)
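
A minimal NumPy sketch of this loop (function and variable names are mine; it initializes by sampling data points, one of the better-init options mentioned below, and keeps a centroid unchanged if its cluster empties):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means on X of shape (n, d); returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # init from data points
    for _ in range(n_iters):
        # assignment step: c_i = argmin_j ||x_i - mu_j||^2
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (n, k)
        c = d2.argmin(axis=1)
        # update step: mu_j = mean of the points assigned to cluster j
        new_mu = np.stack([X[c == j].mean(axis=0) if (c == j).any() else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):  # converged: centroids stopped moving
            break
        mu = new_mu
    return mu, c
```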

additional information

some ways to pick better initial centroids

  • sample the data points as the initial centroids
  • k-means++

k-means++

  • pick a data point uniformly at random to be the first centroid
  • pick each subsequent centroid with probability proportional to the squared distance from each point to its nearest already-chosen centroid (sketch below)
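
A sketch of that initialization, under the same made-up naming as above:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ init on X of shape (n, d); returns k initial centroids."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first: a uniform random data point
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(-1).min(axis=1)
        # next centroid: sample a point with probability proportional to d^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```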

distortion function

Consider the following function, the distortion:

\(J(c, \mu) = \sum_{i=1}^{n} \norm{x^{(i)} - \mu_{c^{(i)}}}^{2}_{2}\)

k-means is coordinate descent on \(J\): the assignment step minimizes \(J\) over \(c\) holding \(\mu\) fixed, and the update step minimizes \(J\) over \(\mu\) holding \(c\) fixed, so \(J\) decreases monotonically. \(J\) is non-convex, so only a local minimum is guaranteed.
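
A one-liner for \(J\), matching the k-means sketch above (names are mine):

```python
import numpy as np

def distortion(X, mu, c):
    """J(c, mu) = sum_i ||x_i - mu_{c_i}||^2; k-means never increases this."""
    return ((X - mu[c]) ** 2).sum()
```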

Kaplan et al., 2020


OG scaling laws paper

Predicting Scaling Laws


Following Kaplan et al., 2020-style scaling laws, at scale we can get reasonably smooth trends for MMLU / ARC / etc.

We would like to predict these IN ADVANCE, from smaller-scale runs.
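
A sketch of one common approach: fit a saturating power law to benchmark scores measured at small scales, then extrapolate. The functional form is one standard choice, and every number below is made up for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

flops = np.array([1e18, 3e18, 1e19, 3e19, 1e20])  # small-scale compute (hypothetical)
acc   = np.array([0.28, 0.31, 0.35, 0.40, 0.46])  # benchmark accuracy (hypothetical)

def law(C, a, b, c):
    return c - a * C ** (-b)  # power-law approach to a performance ceiling c

(a, b, c), _ = curve_fit(law, flops, acc, p0=[10.0, 0.1, 0.9], maxfev=10000)
print(f"predicted accuracy at 1e21 FLOPs: {law(1e21, a, b, c):.3f}")
```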

Pretraining Data

  • Small scale: DCLM Baseline Data
  • Legally friendly data: CommonPile
  • Web-scraped data with quality groups: NemoTron

People measure isoFLOP curves: fix a compute budget \(C \approx 6ND\), sweep model size \(N\) against tokens \(D\), and find the loss-minimizing trade-off at each budget (toy sweep below).
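
A toy isoFLOP sweep under an assumed Chinchilla-style parametric loss \(L(N, D) = E + A/N^{\alpha} + B/D^{\beta}\) (the coefficients below are invented for illustration):

```python
import numpy as np

E, A, B, alpha, beta = 1.7, 400.0, 2000.0, 0.34, 0.28  # made-up coefficients

def loss(N, D):
    return E + A / N**alpha + B / D**beta

for C in [1e20, 1e21, 1e22]:      # fixed compute budgets (FLOPs)
    N = np.logspace(8, 12, 400)   # sweep parameter count
    D = C / (6 * N)               # tokens implied by C ~ 6 N D
    best = N[np.argmin(loss(N, D))]
    print(f"C = {C:.0e}: loss-minimizing N ~ {best:.2e} params")
```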

Problems of pre-training data

  1. pre-training data influences downstream capabilities
  2. …and can therefore leak into model generations
  3. real-world users expect novelty

Changes in Distribution

Big Pretraining Data

GPT2

  • deduplicated data
  • Removed Wikipedia (to prevent data leakage into evaluations, which often source from Wikipedia)
  • Heuristic based cleaning

GPT3

  • Deduplicated (including against benchmark data)
  • a filtering bug left some benchmark data leaked into training, so contamination was measured after the fact

Llama

the usual spiel:

  • removed high-perplexity data using a Wikipedia n-gram model (toy sketch below)
  • removed non-English
  • deduplicated
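
A toy illustration of that perplexity filter, with an add-one-smoothed unigram model standing in for the real Wikipedia n-gram LM (the corpus, threshold, and names here are all invented):

```python
import math
from collections import Counter

def unigram_perplexity(text, counts, total, vocab):
    """Perplexity of text under an add-one-smoothed unigram model."""
    toks = text.lower().split()
    logp = sum(math.log((counts[t] + 1) / (total + vocab)) for t in toks)
    return math.exp(-logp / max(len(toks), 1))

ref = "the cat sat on the mat the dog sat on the rug".split()  # stand-in for Wikipedia
counts, total, vocab = Counter(ref), len(ref), len(set(ref))

docs = ["the cat sat on the rug", "zxq qqq glorp blarg"]
kept = [d for d in docs if unigram_perplexity(d, counts, total, vocab) < 10.0]
print(kept)  # keeps the Wikipedia-like document, drops the gibberish
```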

Llama 2

  • removed data from sites with high volumes of PII
  • removed non-English

Pretraining Curation Decisions

  • what to include
  • what timestamp/snapshot is being scraped
  • heuristic-based cleaning? other data cleaning? etc. (toy sketch after this list)
  • language filtering (only take English?)
  • PII removal
  • dedup
  • Toxicity + SafeURL filtering
  • “quality filtering”
  • sampling distributions
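
A toy pipeline wiring a few of these knobs together: exact-hash dedup, a crude ASCII-ratio English filter, and regex email redaction. Real pipelines use fuzzy dedup, trained language ID, and far more careful PII handling; everything below is a simplified stand-in:

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def curate(docs):
    """Toy pipeline: exact dedup -> crude English filter -> PII redaction."""
    seen, out = set(), []
    for doc in docs:
        h = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if h in seen:          # exact duplicate (after whitespace/case normalization)
            continue
        seen.add(h)
        letters = sum(ch.isascii() and ch.isalpha() for ch in doc)
        if letters / max(len(doc), 1) < 0.5:   # ASCII-letter ratio as English proxy
            continue
        out.append(EMAIL.sub("[EMAIL]", doc))  # redact email addresses
    return out

print(curate(["Contact me at a@b.com  today.", "contact me at a@b.com today.", "日本語のテキスト"]))
```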

Change in Model Age

Good alignment shown between validation-data year and pre-training-data year, even when mixing in older data.

SU-CS229 OCT152025


Key Sequence

Notation

New Concepts

Important Results / Claims

Questions

Interesting Factoids