Posts

topological sort

Last edited: August 8, 2025

A topological sort of a directed graph is an ordering of its vertices such that, for every edge \(A \to B\), \(A\) comes before \(B\) in the ordering.

For directed acyclic graphs, a topological sort always exists.
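Kahn's algorithm makes this definition constructive: repeatedly emit a vertex with no remaining incoming edges. A minimal sketch in Python (the node names are made up for illustration):

```python
from collections import deque

def topological_sort(graph):
    """Kahn's algorithm: repeatedly emit a node whose remaining
    in-degree is zero.  `graph` maps each node to its out-neighbors."""
    indegree = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indegree[v] = indegree.get(v, 0) + 1
    ready = deque(u for u, d in indegree.items() if d == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in graph.get(u, ()):
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != len(indegree):
        raise ValueError("cycle detected: no topological sort exists")
    return order

# A -> B, A -> C, B -> D, C -> D: "A" must come first, "D" last.
order = topological_sort({"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []})
```

If the graph has a cycle, some vertex never reaches in-degree zero, which is exactly why a topological sort only exists for acyclic graphs.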

Train a Bert

Last edited: August 8, 2025

So you want to train a Bert? Either your name is David or you stumbled upon this from the internet. Well, hello. I'm tired and it's 1 AM, so I can't vouch for how accurate any of this is. Oh, and if you are reading this as an NLPer, I apologize for the notation; again, it's 1 AM.

CLS Tokens

A Bert is a bi-directional transformer encoder model. A Transformer encoder takes a sequence of tokenized input text (each token represented as an embedding) and produces a dense embedding for each token. Hence, for word vectors \(w \in W \subset \mathbb{R}^{n}\), a Bert \(B\) applies a mapping in \(\mathcal{L}\qty(W, \mathbb{R}^{m})\) to each input token.
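To make the shapes concrete, here is a toy sketch, not a real Bert: the "encoder" below is a single random linear map applied to every token (a real Bert stacks self-attention and MLP layers), just to show the per-token \(n\)-dim-in, \(m\)-dim-out mapping, with position 0 conventionally playing the role of the CLS token. All sizes are made up.

```python
import random

random.seed(0)
n, m = 4, 3  # input / output embedding dims (made-up sizes)

# Stand-in "encoder": one linear map applied to every token.
# A real Bert uses stacked self-attention + MLP layers instead.
W_map = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]

def encode(tokens):
    """Map each n-dim token embedding to an m-dim output embedding."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in W_map]
            for tok in tokens]

# A sequence of 3 token embeddings; position 0 plays the role of [CLS].
sequence = [[random.gauss(0, 1) for _ in range(n)] for _ in range(3)]
outputs = encode(sequence)
cls_embedding = outputs[0]  # often used as a pooled sentence representation
```

The point is the type signature: one output vector per input token, and the CLS position's output is what classifiers typically read off.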

Train-Test Split

Last edited: August 8, 2025
  • training data: to train the network
  • validation data: to select the best model/fitting strategy
  • test data: held out to report final error without biasing any modeling choices (it tells you how the model actually performs)
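A minimal sketch of making such a three-way split (the fractions and seed here are illustrative, not a recommendation):

```python
import random

def train_val_test_split(data, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve off the test and validation sets."""
    data = list(data)
    random.Random(seed).shuffle(data)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = data[:n_test]                 # only touched once, at the very end
    val = data[n_test:n_test + n_val]    # used to pick models / hyperparameters
    train = data[n_test + n_val:]        # used to fit parameters
    return train, val, test

train, val, test = train_val_test_split(range(100))
```

The discipline matters more than the code: any choice informed by the test set silently turns it into a second validation set.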

Training Data Sourcing

Last edited: August 8, 2025

Finding training data for AI is hard. So instead:

Intentional training data

  • curated for training data
  • Spent time thinking about bias, control, etc.

Training set of convenience

  • Dataset that just comes about
  • Problematic:

Accidentally introduces bias into the data: for a while, Googling images of CEOs, which is convenient, returned almost exclusively white men.

Training Helpful Chatbots

Last edited: August 8, 2025

“What we have been building since ChatGPT at H4.”

  • No pretraining in any way

Basic Three Steps

Goal: “helpful, harmless, honest, and huggy” bots.

  1. Pretraining step: large-scale next-token prediction
  2. In-context learning: few-shot learning without updating parameters
  3. “Helpful” steps
    1. Taking supervised data to perform supervised fine tuning
  4. “Harmless” steps
    1. Training a classifier for result ranking
    2. RLHF
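The "classifier for result ranking" in the harmless step is typically a reward model trained with a pairwise (Bradley–Terry-style) loss: it should score the human-preferred response above the rejected one. A toy sketch with scalar scores (the function name and numbers are made up for illustration):

```python
import math

def pairwise_ranking_loss(score_chosen, score_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred
    response outscores the rejected one by a wide margin."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy reward-model outputs for two (chosen, rejected) pairs:
good_margin = pairwise_ranking_loss(2.0, -1.0)  # chosen clearly preferred
bad_margin = pairwise_ranking_loss(-1.0, 2.0)   # model ranks them backwards
```

The resulting reward model then supplies the training signal for the RLHF step.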

Benchmarking

Before we start training, we have a problem. Most benchmarks test generic reasoning, which only evaluates steps 1) and 2). Therefore, we need new metrics for steps 3) and 4).