Generalized Linear Model
A Generalized Linear Model is a model of data with the following properties:
- The model for \(P(y \mid x; \theta)\) should come from an exponential family (which one depends on the distribution of \(y\): for real-valued data we pick a Gaussian, for binary data a Bernoulli, for counts a Poisson, for values in \(\mathbb{R}^{+}\) a gamma or exponential distribution, and for distributions over distributions a Beta or Dirichlet distribution).
- \(\eta = \theta^{T}x\), where \(\theta,x \in \mathbb{R}^{d}\)
- at test time…
- we want to output \(\mathbb{E}\qty [y|x; \theta]\)
- so our predictor is written as \(h_{\theta}\qty(x) = \mathbb{E}\qty [y|x; \theta]\)
- at train time, we maximize the log-likelihood \(\max_{\theta} \sum_{i=1}^{n} \log P\qty(y^{(i)} | \theta^{T}x^{(i)})\)
- to update using gradient ascent, \(\theta_{j} = \theta_{j} + \alpha \sum_{i=1}^{n} \qty(y^{(i)} - h_{\theta}\qty(x^{(i)}))x_{j}^{(i)}\) (sketched below)
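For concreteness, here is a minimal NumPy sketch of that batch update (the function names, learning rate, and step count are illustrative, not from these notes); plugging in the sigmoid as the response function makes this logistic regression, i.e. a Bernoulli GLM:

```python
import numpy as np

def sigmoid(z):
    # canonical response function for the Bernoulli GLM
    return 1.0 / (1.0 + np.exp(-z))

def glm_ascent(X, y, response, alpha=0.01, steps=1000):
    # X: (n, d) design matrix; y: (n,) targets
    # `response` maps eta = X @ theta to h_theta(x) = E[y | x; theta]
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        h = response(X @ theta)           # h_theta(x) for every sample
        theta += alpha * X.T @ (y - h)    # the batch update above, vectorized
    return theta

# usage on toy data
theta = glm_ascent(np.random.randn(100, 3), np.random.randint(0, 2, 100), sigmoid)
```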
We also have two fancy names for things: \(g\qty(\eta) = \mathbb{E}\qty[y; \eta]\), which maps the natural parameter to the distribution's mean, is called the canonical response function, and its inverse \(g^{-1}\) is called the canonical link function.
k-select
\begin{equation} \text{SELECT}\qty(A,k) \end{equation}
returns the kth smallest element of \(A\).
additional information
properties of k-select
- \(\text{SELECT}\qty(A,1)\) is the smallest
- \(\text{SELECT}\qty(A, \frac{n}{2})\) is the median
- \(\text{SELECT}\qty(A,n)\) is the maximum
a naive solution
Naively, \(\text{SELECT}\qty(A,C)\) for any constant \(C\) is in \(O\qty(n)\), because you can scan the array once while keeping track of the \(C\) smallest elements seen so far (sketched below). But it turns out that for something like the median, \(\text{SELECT}\qty(A, \frac{n}{2})\), this approach is basically insertion sort, so it takes \(O\qty(n^{2})\).
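A sketch of that constant-\(C\) scan (names are illustrative):

```python
import bisect

def select_constant(A, C):
    # Track the C smallest elements seen so far; each step costs O(C),
    # so the total is O(C * n) = O(n) when C is a constant.
    smallest = []
    for x in A:
        bisect.insort(smallest, x)   # keep the running list sorted
        if len(smallest) > C:
            smallest.pop()           # evict the largest of the C + 1
    return smallest[-1]              # the C-th smallest element overall
```

With \(C = \frac{n}{2}\), each `insort` does \(O\qty(n)\) work, which is exactly the \(O\qty(n^{2})\) insertion-sort behavior noted above.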
an actual linear-time solution
Proof idea: pick a “pivot” value from the list, partition the array into elements less than the pivot and elements greater than it, then recurse only into the side that contains the \(k\)-th smallest element (sketched below).
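A quickselect-style sketch of that idea (this version picks the pivot at random, which gives expected linear time; the worst-case linear algorithm chooses the pivot by median-of-medians):

```python
import random

def quickselect(A, k):
    # SELECT(A, k): the k-th smallest element of A (1-indexed)
    pivot = random.choice(A)
    less    = [x for x in A if x < pivot]
    equal   = [x for x in A if x == pivot]
    greater = [x for x in A if x > pivot]
    if k <= len(less):
        return quickselect(less, k)        # answer lies left of the pivot
    if k <= len(less) + len(equal):
        return pivot                       # the pivot itself is the answer
    return quickselect(greater, k - len(less) - len(equal))
```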
Softmax Regression
Suppose you have a dataset \(\qty(x^{(1)}, y^{(1)}), \dots, \qty(x^{(n)}, y^{(n)})\), where \(y^{(j)} \in \qty {1,2,3,4}\).
We can learn a model of this by maximizing the likelihood:
\begin{align} \max_{\theta} L\qty(\theta) &= \prod_{i=1}^{n} p\qty(y^{(i)} \mid x^{(i)}; \theta) \\ &= \prod_{i=1}^{n}\theta_{1}^{1\qty {y^{(i)} = 1}} \dots \theta_{4}^{1\qty {y^{(i)} = 4}} \end{align}
the derivative ends up being nice: for each class \(k\), the gradient of the log-likelihood takes the form \(\sum_{i=1}^{n} \qty(1\qty{y^{(i)} = k} - p\qty(y^{(i)} = k \mid x^{(i)}; \theta)) x^{(i)}\), i.e. (indicator minus predicted probability) times the input, just like the GLM update (see the sketch below).
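A minimal NumPy sketch of one such gradient step (names and hyperparameters are illustrative; labels are 0-indexed here, whereas the notes use \(\qty{1,2,3,4}\)):

```python
import numpy as np

def softmax(Z):
    # rows of Z are logits; subtract the row max for numerical stability
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_step(Theta, X, y, alpha=0.1, num_classes=4):
    # Theta: (d, K) parameters; X: (n, d) inputs; y: (n,) labels in {0..K-1}
    P = softmax(X @ Theta)          # predicted class probabilities
    Y = np.eye(num_classes)[y]      # one-hot rows: the indicators 1{y = k}
    return Theta + alpha * X.T @ (Y - P)   # (indicator - probability) times x
```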
Derivation
Consider a multinomial distribution over 4 outcomes. Let’s write this as an exponential family. Define:
\begin{equation} \begin{cases} T\qty(1) = \mqty(1 & 0 & 0) \\ T\qty(2) = \mqty(0&1&0) \\ T\qty(3) = \mqty(0&0&1) \end{cases} \end{equation}
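Continuing from here (a sketch of the standard exponential-family rewrite; \(\phi_{k}\) denotes the probability of class \(k\), a symbol not used above, and \(T\qty(4) = \mqty(0&0&0)\)):

\begin{align} p\qty(y) &= \phi_{1}^{T\qty(y)_{1}} \phi_{2}^{T\qty(y)_{2}} \phi_{3}^{T\qty(y)_{3}} \phi_{4}^{1 - \sum_{k=1}^{3} T\qty(y)_{k}} \\ &= \exp \qty(\sum_{k=1}^{3} T\qty(y)_{k} \log \frac{\phi_{k}}{\phi_{4}} + \log \phi_{4}) \end{align}

so the natural parameters are \(\eta_{k} = \log \frac{\phi_{k}}{\phi_{4}}\), and inverting the map \(\eta \mapsto \phi\) is exactly what produces the softmax function.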
substitution method
Consider:
\begin{equation} T\qty(n) = 2 T\qty(\frac{n}{2}) + n \end{equation}
We want to solve this recurrence relation.
guess the answer
We can try to expand it by plugging the recurrence into itself:
- \(T\qty(n) = 2T\qty(\frac{n}{2}) +n\)
- \(T\qty(n) = 2\qty(2 T\qty(\frac{n}{4})+ \frac{n}{2}) + n\)
- \(T\qty(n) = 4T\qty(\frac{n}{4}) + 2n\)
- …
We can guess the pattern by staring at it: \(T\qty(n) = 2^{j} T\qty(\frac{n}{2^{j}}) + j\cdot n\) is true for all \(j\).
Now take \(j = \log_{2}\qty(n)\), which is where the recursion bottoms out (\(T\qty(\frac{n}{2^{j}}) = T\qty(1)\)); assuming \(T\qty(1) = 1\), we get \(T\qty(n) = n + n \log_{2}\qty(n) = n\qty(\log_{2}\qty(n) + 1)\).
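A quick numerical check of that closed form (assuming \(T\qty(1) = 1\) and \(n\) a power of two):

```python
import math

def T(n):
    # the recurrence T(n) = 2 T(n/2) + n with base case T(1) = 1
    return 1 if n == 1 else 2 * T(n // 2) + n

for n in [1, 2, 8, 64, 1024]:
    print(n, T(n), n * (math.log2(n) + 1))   # the last two columns agree
```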
least-squares error
requirements
- \(h_{\theta}\qty(x)\), the predictor function
- \(x,y\), the samples of data
definition
\begin{equation} J\qty(\theta) = \frac{1}{2} \sum_{i=1}^{n}\qty(h_{\theta }\qty(x^{(i)}) - y^{(i)})^{2} \end{equation}
see also example: gradient descent for least-squares error.
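As a minimal NumPy sketch (assuming the linear predictor \(h_{\theta}\qty(x) = \theta^{T}x\); names are illustrative):

```python
import numpy as np

def least_squares_error(theta, X, y):
    # J(theta) = 1/2 * sum_i (theta^T x_i - y_i)^2
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)
```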
additional information
“why the 1/2?”
Because when you take \(\nabla J\qty(\theta)\), the \(2\) that comes down from the square cancels the \(\frac{1}{2}\), as shown below.
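Concretely, differentiating a single term of the sum for the linear predictor \(h_{\theta}\qty(x) = \theta^{T}x\):

\begin{equation} \frac{\partial}{\partial \theta_{j}} \frac{1}{2}\qty(h_{\theta}\qty(x) - y)^{2} = \frac{1}{2} \cdot 2\qty(h_{\theta}\qty(x) - y) \frac{\partial h_{\theta}\qty(x)}{\partial \theta_{j}} = \qty(h_{\theta}\qty(x) - y) x_{j} \end{equation}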
probabilistic intuition for least-squares error in linear regression
Assume that our dataset \(\qty(x^{(i)}, y^{(i)}) \sim D\) has the following property: “the true \(y\) value is just our model’s output, plus some error.” Meaning:
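\begin{equation} y^{(i)} = \theta^{T}x^{(i)} + \epsilon^{(i)}, \qquad \epsilon^{(i)} \sim \mathcal{N}\qty(0, \sigma^{2}) \end{equation}
Under this Gaussian error assumption, maximizing the likelihood of the data is equivalent to minimizing \(J\qty(\theta)\).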
