Posts

window-based co-occurrence

Last edited: August 8, 2025

window-based co-occurrence is a matrix in which each row corresponds to a center word and each column to a context word; each entry counts how many times the context word occurs within a fixed-size window around the center word.

This approach is fine (not great), but if your vocabulary is HUGE, each word vector will be exactly as long as the vocabulary, which is bad. Therefore, we take the SVD of this matrix and chop off the smaller singular values to create a low-dimensional approximation of it.
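A minimal sketch of the count-then-SVD idea described above, assuming NumPy is available. The function names, the toy sentence, and the choices `window=1` and `k=2` are all made up for illustration and are not from the original notes.

```python
import numpy as np

def cooccurrence_matrix(tokens, window=2):
    """Count how often each word appears within `window` positions of each center word."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)), dtype=np.float64)
    for pos, center in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:  # skip the center word itself
                counts[index[center], index[tokens[ctx_pos]]] += 1
    return vocab, counts

def reduce_with_svd(counts, k):
    """Keep only the k largest singular values to get k-dimensional word vectors."""
    u, s, _vt = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]  # singular values come back sorted largest first

# Toy corpus (an assumption for illustration only).
tokens = "i like deep learning i like nlp i enjoy flying".split()
vocab, counts = cooccurrence_matrix(tokens, window=1)
vectors = reduce_with_svd(counts, k=2)
```

Each row of `counts` has one entry per vocabulary word, while each row of `vectors` has only `k` entries, which is the point of the truncation.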

Windows FAT

Last edited: August 8, 2025

A linked-files architecture for the filesystem, except that the file links are cached in memory (as the file allocation table) while the OS is running.

problems

  • data is still scattered across the disk
  • we had to construct the file allocation table
  • though it's much faster because jumping to the middle of a file now follows links in memory, we are still doing an O(n) search to reach a specific part of the file
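To make the O(n) point concrete, here is a toy sketch of following an in-memory table of links. The dictionary layout, the `END_OF_FILE` sentinel, the block numbers, and the 512-byte block size are assumptions for illustration, not the real on-disk FAT format.

```python
END_OF_FILE = -1  # hypothetical sentinel marking the last block of a file

# Hypothetical in-memory table: fat[b] is the block that follows block b.
fat = {4: 9, 9: 2, 2: 7, 7: END_OF_FILE}

def nth_block(fat, first_block, n):
    """Follow the chain n links from first_block; O(n) even though it is all in memory."""
    block = first_block
    for _ in range(n):
        block = fat[block]
        if block == END_OF_FILE:
            raise IndexError("offset is past the end of the file")
    return block

def block_for_offset(fat, first_block, byte_offset, block_size=512):
    """Find which disk block holds a given byte offset of the file."""
    return nth_block(fat, first_block, byte_offset // block_size)
```

Reaching the block for a byte offset still means walking the chain link by link; the table only makes each hop a memory lookup instead of a disk read.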

Word Normalization

Last edited: August 8, 2025

Pay attention to:

  1. cases (all letters to lower case?)
  2. lemmatization

Lemmatization is often done with morphological parsing; as a cruder alternative, you can try stemming, which simply strips affixes.
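A rough sketch of the two steps above: case folding followed by a very naive suffix-stripping stemmer. The suffix list and the minimum stem length are toy assumptions; this is not the Porter stemmer or any real library's behavior.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # toy rule list, checked in this order

def naive_stem(word):
    """Strip the first matching suffix, as long as at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Lowercase the text, split on whitespace, and stem each token."""
    return [naive_stem(token) for token in text.lower().split()]
```

For example, `normalize("Walking walked Walks")` maps all three tokens to `walk`. A real lemmatizer would instead return dictionary forms and handle irregular words.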

word2vec

Last edited: August 8, 2025

We will train a classifier on a binary prediction task: “are the context words \(c_{1:L}\) likely to show up near some target word \(w_0\)?”

We estimate the probability that \(w_{0}\) occurs within this window as a product over the context words: each factor is a probability derived from the similarity between that context word's embedding and the target word's embedding.
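One common way to turn similarity into a probability is to pass the dot product through a sigmoid. The sketch below assumes that choice; the function names and the plain-list vectors are illustrative and not from the original notes.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def window_probability(target_vec, context_vecs):
    """Product over context words of sigmoid(context . target)."""
    prob = 1.0
    for context_vec in context_vecs:
        prob *= sigmoid(dot(context_vec, target_vec))
    return prob
```

Context vectors that point in the same direction as the target push the product toward 1, while opposing vectors push it toward 0.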

  • we have a corpus of text
  • each word is represented by a vector
  • go through each position \(t\) in the text, which has a center word \(c\) and set of context words \(o \in O\)
  • use similarity of word vectors \(c\) and \(o\) to calculate \(P(o|c)\)

Meaning, we want to devise a model that assigns high probability \(P(w_{t-n}|w_{t})\) for small \(n\) and low probability for large \(n\).
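For the \(P(o|c)\) step, a frequently used choice is a softmax over dot products between the center vector and every candidate outside vector. The sketch below assumes that formulation; the function name and arguments are hypothetical.

```python
import math

def softmax_probability(center_vec, outside_vecs, o):
    """P(o | c): softmax of dot-product scores, evaluated at index o."""
    scores = [sum(a * b for a, b in zip(u, center_vec)) for u in outside_vecs]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(score - top) for score in scores]
    return exps[o] / sum(exps)
```

Training would then adjust the vectors so that words actually seen near the center word get higher scores than the rest of the vocabulary.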

Works Progress Administration

Last edited: August 8, 2025

The WPA was the largest relief program of the Great Depression-era New Deal. It promoted public infrastructure and funded artistic murals, and it employed unskilled men to carry out public works projects.

The project started in May 1935 and was dissolved in June 1943.