window-based co-occurrence
Last edited: August 8, 2025
A window-based co-occurrence matrix has one row per center word and one column per context word. Each entry counts how many times the column word appears inside a fixed-size window around the row word.
This approach is fine (not great), but if your vocabulary is HUGE, your word vectors will be exactly that length, which is bad. Therefore, we take the SVD of this matrix and chop off the smaller singular values to create a low-dimensional approximation of it.
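A minimal sketch of the two steps, counting and then truncating the SVD, assuming NumPy is available. The toy sentence, the window size, and the function names `cooccurrence_matrix` and `reduce_with_svd` are made up for illustration.

```python
import numpy as np

def cooccurrence_matrix(tokens, window=2):
    """For each center word (row), count words (columns) within `window` positions."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for t, center in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:  # skip the center word itself
                counts[index[center], index[tokens[j]]] += 1
    return vocab, counts

def reduce_with_svd(counts, k):
    """Keep only the k largest singular values to get k-dimensional word vectors."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)  # s is sorted descending
    return u[:, :k] * s[:k]

# Hypothetical toy corpus, for illustration only.
tokens = "i like deep learning i like nlp i enjoy flying".split()
vocab, counts = cooccurrence_matrix(tokens, window=1)
vectors = reduce_with_svd(counts, k=2)  # one 2-dimensional vector per vocabulary word
```

Each row of `vectors` is now a short dense vector, rather than a row as long as the vocabulary.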
Windows FAT
Last edited: August 8, 2025
A linked-files filesystem architecture, except that the file links are cached in memory while the OS is running.
problems
- data is still scattered across the disk
- we had to construct the file allocation table
- though it's much faster because jumping to the middle of a file now happens in memory, we are still doing an O(n) search for a specific sub-part
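The last point can be seen in a toy model of the table. This is only a sketch, not the real on-disk FAT layout: the 8-block disk, the `END_OF_FILE` marker, and the helper names are assumptions made for illustration.

```python
END_OF_FILE = -1  # hypothetical end-of-chain marker for this sketch

# fat[i] holds the index of the block that follows block i in its file.
fat = [END_OF_FILE] * 8
disk = [b""] * 8

def write_file(blocks, data_chunks):
    """Store chunks in the given (possibly scattered) blocks and link them in the table."""
    for i, (block, chunk) in enumerate(zip(blocks, data_chunks)):
        disk[block] = chunk
        fat[block] = blocks[i + 1] if i + 1 < len(blocks) else END_OF_FILE
    return blocks[0]

def nth_block(first_block, n):
    """Follow n links in the in-memory table: no disk reads, but still O(n) steps."""
    block = first_block
    for _ in range(n):
        block = fat[block]
        if block == END_OF_FILE:
            raise IndexError("offset past end of file")
    return block

# The file's data lives in scattered blocks 5 -> 2 -> 7.
start = write_file([5, 2, 7], [b"aaa", b"bbb", b"ccc"])
third = disk[nth_block(start, 2)]
```

Finding the third block touches the disk only once, for the final read, but the walk through `fat` still takes one step per preceding block.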
Word Normalization
Last edited: August 8, 2025
Pay attention to:
- cases (all letters to lower case?)
- lemmatization
This is often done with morphological parsing; for instance, you can try stemming.
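A rough sketch of case folding followed by a very naive suffix-stripping stemmer. The suffix list and the three-character minimum are made-up rules for illustration; they are not a real stemming algorithm such as Porter's, and they will mangle many words.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, deliberately tiny rule set

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Case-fold, split on whitespace, then stem each token."""
    return [naive_stem(token) for token in text.lower().split()]

tokens = normalize("The Cats walked")
```

Here `Cats` and `cat` collapse to the same token, which is the point of normalization. A true lemmatizer would also need a dictionary to handle irregular forms.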
word2vec
Last edited: August 8, 2025
We will train a classifier on a binary prediction task: “are context words \(c_{1:L}\) likely to show up near some target word \(w_0\)?”
We estimate the probability that \(w_{0}\) occurs within this window as a product over the context words: each factor is a probability derived from the similarity between that context word's embedding and the target word's embedding.
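A minimal sketch of that product, assuming (the note does not say so) that similarity is the dot product and that it is turned into a probability with a sigmoid. The function name and the random vectors are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prob_context_near_target(target_vec, context_vecs):
    """Product over context words of sigmoid(dot(context, target))."""
    scores = context_vecs @ target_vec  # one similarity score per context word
    return float(np.prod(sigmoid(scores)))

# Hypothetical embeddings: one target word, three context words, 4 dimensions.
rng = np.random.default_rng(0)
w0 = rng.normal(size=4)
c = rng.normal(size=(3, 4))
p = prob_context_near_target(w0, c)
```

Multiplying the factors treats the context words as independent given the target, which is a simplifying assumption.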
- we have a corpus of text
- each word is represented by a vector
- go through each position \(t\) in the text, which has a center word \(c\) and set of context words \(o \in O\)
- use similarity of word vectors \(c\) and \(o\) to calculate \(P(o|c)\)
Meaning, we want to devise a model that predicts high probabilities \(P(w_{t-n}|w_{t})\) for small \(n\) and low probabilities for large \(n\).
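One common way to turn vector similarity into \(P(o|c)\) is a softmax over dot products with every word in the vocabulary. The sketch below assumes that choice, which the note itself does not spell out. The 3-word vocabulary and its vectors are invented for illustration.

```python
import numpy as np

def prob_context_given_center(center_vec, outside_vecs):
    """Softmax over dot-product similarities: one probability per vocabulary word."""
    scores = outside_vecs @ center_vec
    scores = scores - scores.max()  # shift for numerical stability; result unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Hypothetical 3-word vocabulary with 2-dimensional vectors.
outside_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
center_vec = np.array([2.0, 0.0])
probs = prob_context_given_center(center_vec, outside_vecs)
```

Training would then adjust the vectors so that words actually seen near the center word receive most of this probability mass.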
Works Progress Administration
Last edited: August 8, 2025
The WPA was the largest relief program of the Great Depression-era New Deal. It promoted public infrastructure and funded artistic murals, and it employed unskilled men to carry out public works projects.
The project started in 5/1935 and was dissolved in 6/1943.
