## Stochastic Gradient Descent

See stochastic gradient descent

## Word2Vec

See word2vec

Alternatively, we can use an even simpler approach: window-based co-occurrence counts.
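A minimal sketch of building such a window-based co-occurrence matrix (the toy corpus and window size below are assumptions for illustration):

```python
import numpy as np

def cooccurrence_matrix(corpus, window=2):
    """Count how often each word appears within ±window positions of another."""
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    M[idx[w], idx[sent[j]]] += 1
    return M, vocab

# toy corpus, assumed for illustration
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
M, vocab = cooccurrence_matrix(corpus)
```

Reducing this count matrix with SVD (as in LSA) yields dense word vectors.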

## GloVe

**goal**: we want to correctly capture linear meaning components in a word vector space

**insight**: the *ratio* of co-occurrence probabilities encodes a linear meaning component

GloVe therefore builds its word vectors from a log-bilinear model:

\begin{equation} w_{i} \cdot w_{j} = \log P(i|j) \end{equation}

such that:

\begin{equation} w_{x} \cdot (w_{a} - w_{b}) = \log \frac{P(x|a)}{P(x|b)} \end{equation}
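The second identity follows from the first by linearity of the dot product, one step at a time:

\begin{equation} w_{x} \cdot (w_{a} - w_{b}) = w_{x} \cdot w_{a} - w_{x} \cdot w_{b} = \log P(x|a) - \log P(x|b) = \log \frac{P(x|a)}{P(x|b)} \end{equation}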

## Evaluating an NLP System

### Intrinsic

- evaluate on the specific intermediate subtask the system is trained on (e.g. word-vector analogies)
- fast to compute
- helps with understanding the system

### Extrinsic

- evaluate on a real task, e.g. by attempting to replace an older system with the new one
- may be expensive to compute

## Word Sense Ambiguity

Each word may have multiple different meanings, and each of those separate word senses should live in a different place in the vector space. However, polysemous words have related senses, so in practice we usually represent the word with a weighted average:

\begin{equation} v = a_1 v_1 + a_2v_2 + a_3v_3 \end{equation}

where \(a_{j}\) is the relative frequency of the \(j\)th sense of the word (i.e. \(a_{1} = \frac{f_{1}}{f_{1}+f_{2}+f_{3}}\) for sense frequencies \(f_{j}\)), and \(v_{1}, \dots, v_{3}\) are the vectors for the separate senses.
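A minimal numpy sketch of this frequency-weighted average (the sense vectors and counts below are made-up placeholders):

```python
import numpy as np

# hypothetical sense vectors for one polysemous word, one row per sense
v = np.random.randn(3, 300)        # v_1, v_2, v_3
f = np.array([120.0, 45.0, 10.0])  # assumed corpus frequency of each sense

a = f / f.sum()                    # relative frequencies a_1, a_2, a_3
v_word = a @ v                     # v = a_1 v_1 + a_2 v_2 + a_3 v_3
```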

### Sparse Coding

If each sense is relatively common, then at **high enough dimensions**, sparse coding allows you to recover the component sense vectors from the averaged word vector, because of the general sparsity of the vector space.
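A hedged sketch of that recovery idea using scikit-learn's dictionary learning (the data here is synthetic; a real experiment would use actual word vectors and tuned hyperparameters):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# synthetic stand-in for a matrix of word vectors (n_words x dim)
X = np.random.randn(200, 50)

# learn an overcomplete dictionary: each word vector becomes a sparse
# combination of atoms, and atoms can align with individual senses
dl = DictionaryLearning(n_components=100, alpha=1.0,
                        transform_algorithm="lasso_lars", max_iter=20)
codes = dl.fit_transform(X)  # sparse codes, shape (n_words, n_atoms)
atoms = dl.components_       # candidate sense directions, shape (n_atoms, dim)
```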

## Word-Vector NER

Create a window of ±n words around each word we want to classify; feed the concatenated embeddings of the entire window into a neural classifier, whose target says whether the center word is an entity/person/no label, etc.
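A small sketch of the window construction (the embedding table, vocabulary, and window size here are assumptions):

```python
import numpy as np

def window_features(tokens, emb, idx, n=2, pad="<pad>"):
    """Concatenate the embeddings of the ±n window around each position."""
    padded = [pad] * n + tokens + [pad] * n
    feats = []
    for i in range(n, n + len(tokens)):
        window = padded[i - n : i + n + 1]
        feats.append(np.concatenate([emb[idx[w]] for w in window]))
    return np.stack(feats)  # shape: (len(tokens), (2n+1) * dim)

vocab = ["<pad>", "museums", "in", "paris", "are", "amazing"]
idx = {w: i for i, w in enumerate(vocab)}
emb = np.random.randn(len(vocab), 4)  # tiny toy embedding table
X = window_features(["museums", "in", "paris", "are", "amazing"], emb, idx)
```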

The simplest such classifiers use a plain softmax over these features, which, without other activations, gives a linear decision boundary.

With a neural classifier, we can add enough nonlinearity in the middle layers to make the learned representation nonlinear, but the final output layer is still a linear (softmax) classifier over that representation.
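A minimal PyTorch sketch of such a window classifier, with a nonlinear hidden layer and a linear softmax output (layer sizes and class set are illustrative assumptions):

```python
import torch
import torch.nn as nn

class WindowClassifier(nn.Module):
    def __init__(self, window_dim, hidden=128, n_classes=3):
        super().__init__()
        self.hidden = nn.Linear(window_dim, hidden)  # nonlinearity lives here
        self.out = nn.Linear(hidden, n_classes)      # final layer is still linear

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.out(h)  # logits; softmax is folded into the loss

model = WindowClassifier(window_dim=5 * 300)  # e.g. a ±2 window of 300-d vectors
loss_fn = nn.CrossEntropyLoss()               # applies log-softmax internally
```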