Generalized Linear Model
A Generalized Linear Model is a model of data with the following properties:
- The model for \(P(y \mid x; \theta)\) should come from an exponential family (which one depends on the distribution of \(y\): for real-valued data we pick a Gaussian, for binary data a Bernoulli, for counts a Poisson, for values in \(\mathbb{R}^{+}\) a gamma or exponential distribution, and for distributions over distributions a Beta or Dirichlet distribution).
- \(\eta = \theta^{T}x\), where \(\theta,x \in \mathbb{R}^{d}\)
- at test time…
- we want to output \(\mathbb{E}\qty [y|x; \theta]\)
- so our predictor is written as \(h_{\theta}\qty(x) = \mathbb{E}\qty [y|x; \theta]\)
- at train time, we maximize the log-likelihood \(\max_{\theta} \sum_{i=1}^{n} \log P\qty(y^{(i)} | \theta^{T}x^{(i)})\)
- to update using gradient ascent, \(\theta_{j} = \theta_{j} + \alpha \sum_{i=1}^{n} \qty(y^{(i)} - h_{\theta}\qty(x^{(i)}))x_{j}^{(i)}\) (sketched below)
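For concreteness, here is a minimal NumPy sketch of that batch update (the function names, learning rate, and step count are illustrative, not from these notes); plugging in the sigmoid as the response function makes this logistic regression, i.e. a Bernoulli GLM:

```python
import numpy as np

def sigmoid(z):
    # canonical response function for the Bernoulli GLM
    return 1.0 / (1.0 + np.exp(-z))

def glm_ascent(X, y, response, alpha=0.01, steps=1000):
    # X: (n, d) design matrix; y: (n,) targets
    # `response` maps eta = X @ theta to h_theta(x) = E[y | x; theta]
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        h = response(X @ theta)           # h_theta(x) for every sample
        theta += alpha * X.T @ (y - h)    # the batch update above, vectorized
    return theta

# usage on toy data
theta = glm_ascent(np.random.randn(100, 3), np.random.randint(0, 2, 100), sigmoid)
```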
We also have two fancy names for things: \(g\qty(\eta) = \mathbb{E}\qty[y; \eta]\), which maps the natural parameter to the distribution's mean, is called the canonical response function, and its inverse \(g^{-1}\) is called the canonical link function.
k-select
\begin{equation} \text{SELECT}\qty(A,k) \end{equation}
returns the kth smallest element of \(A\).
additional information
properties of k-select
- \(\text{SELECT}\qty(A,1)\) is the smallest
- \(\text{SELECT}\qty(A, \frac{n}{2})\) is the median
- \(\text{SELECT}\qty(A,n)\) is the maximum
a naive solution
Naively, \(\text{SELECT}\qty(A,C)\) for any constant \(C\) is in \(O\qty(n)\), because you can scan the array once while keeping track of the \(C\) smallest elements seen so far (sketched below). But it turns out that for something like the median, \(\text{SELECT}\qty(A, \frac{n}{2})\), this approach is basically insertion sort, so it takes \(O\qty(n^{2})\).
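A sketch of that constant-\(C\) scan (names are illustrative):

```python
import bisect

def select_constant(A, C):
    # Track the C smallest elements seen so far; each step costs O(C),
    # so the total is O(C * n) = O(n) when C is a constant.
    smallest = []
    for x in A:
        bisect.insort(smallest, x)   # keep the running list sorted
        if len(smallest) > C:
            smallest.pop()           # evict the largest of the C + 1
    return smallest[-1]              # the C-th smallest element overall
```

With \(C = \frac{n}{2}\), each `insort` does \(O\qty(n)\) work, which is exactly the \(O\qty(n^{2})\) insertion-sort behavior noted above.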
an actual linear-time solution
Proof idea: pick a “pivot” value from the list, partition the array into elements less than the pivot and elements greater than it, then recurse only into the side that contains the \(k\)-th smallest element (sketched below).
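A quickselect-style sketch of that idea (this version picks the pivot at random, which gives expected linear time; the worst-case linear algorithm chooses the pivot by median-of-medians):

```python
import random

def quickselect(A, k):
    # SELECT(A, k): the k-th smallest element of A (1-indexed)
    pivot = random.choice(A)
    less    = [x for x in A if x < pivot]
    equal   = [x for x in A if x == pivot]
    greater = [x for x in A if x > pivot]
    if k <= len(less):
        return quickselect(less, k)        # answer lies left of the pivot
    if k <= len(less) + len(equal):
        return pivot                       # the pivot itself is the answer
    return quickselect(greater, k - len(less) - len(equal))
```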
Softmax Regression
Suppose you have a dataset \(\qty(x^{(1)}, y^{(1)}), \dots, \qty(x^{(n)}, y^{(n)})\), where \(y^{(j)} \in \qty {1,2,3,4}\).
We can learn a model of this by maximizing the likelihood:
\begin{align} \max_{\theta} L\qty(\theta) &= \prod_{i=1}^{n} p\qty(y^{(i)} \mid x^{(i)}; \theta) \\ &= \prod_{i=1}^{n}\theta_{1}^{1\qty {y^{(i)} = 1}} \dots \theta_{4}^{1\qty {y^{(i)} = 4}} \end{align}
the derivative ends up being nice: for each class \(k\), the gradient of the log-likelihood takes the form \(\sum_{i=1}^{n} \qty(1\qty{y^{(i)} = k} - p\qty(y^{(i)} = k \mid x^{(i)}; \theta)) x^{(i)}\), i.e. (indicator minus predicted probability) times the input, just like the GLM update (see the sketch below).
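A minimal NumPy sketch of one such gradient step (names and hyperparameters are illustrative; labels are 0-indexed here, whereas the notes use \(\qty{1,2,3,4}\)):

```python
import numpy as np

def softmax(Z):
    # rows of Z are logits; subtract the row max for numerical stability
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_step(Theta, X, y, alpha=0.1, num_classes=4):
    # Theta: (d, K) parameters; X: (n, d) inputs; y: (n,) labels in {0..K-1}
    P = softmax(X @ Theta)          # predicted class probabilities
    Y = np.eye(num_classes)[y]      # one-hot rows: the indicators 1{y = k}
    return Theta + alpha * X.T @ (Y - P)   # (indicator - probability) times x
```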
Derivation
Consider a multinomial distribution over 4 outcomes. Let’s write this as an exponential family. Define:
\begin{equation} \begin{cases} T\qty(1) = \mqty(1 & 0 & 0) \\ T\qty(2) = \mqty(0&1&0) \\ T\qty(3) = \mqty(0&0&1) \end{cases} \end{equation}
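Continuing from here (a sketch of the standard exponential-family rewrite; \(\phi_{k}\) denotes the probability of class \(k\), a symbol not used above, and \(T\qty(4) = \mqty(0&0&0)\)):

\begin{align} p\qty(y) &= \phi_{1}^{T\qty(y)_{1}} \phi_{2}^{T\qty(y)_{2}} \phi_{3}^{T\qty(y)_{3}} \phi_{4}^{1 - \sum_{k=1}^{3} T\qty(y)_{k}} \\ &= \exp \qty(\sum_{k=1}^{3} T\qty(y)_{k} \log \frac{\phi_{k}}{\phi_{4}} + \log \phi_{4}) \end{align}

so the natural parameters are \(\eta_{k} = \log \frac{\phi_{k}}{\phi_{4}}\), and inverting the map \(\eta \mapsto \phi\) is exactly what produces the softmax function.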
substitution method
Consider:
\begin{equation} T\qty(n) = 2 T\qty(\frac{n}{2}) + n \end{equation}
We want to solve this recurrence relation.
guess the answer
We can try to expand it by plugging the recurrence into itself:
- \(T\qty(n) = 2T\qty(\frac{n}{2}) +n\)
- \(T\qty(n) = 2\qty(2 T\qty(\frac{n}{4})+ \frac{n}{2}) + n\)
- \(T\qty(n) = 4T\qty(\frac{n}{4}) + 2n\)
- …
We can guess the pattern by staring at it: \(T\qty(n) = 2^{j} T\qty(\frac{n}{2^{j}}) + j\cdot n\) is true for all \(j\).
Now take \(j = \log_{2}\qty(n)\), which is where the recursion bottoms out (\(T\qty(\frac{n}{2^{j}}) = T\qty(1)\)); assuming \(T\qty(1) = 1\), we get \(T\qty(n) = n + n \log_{2}\qty(n) = n\qty(\log_{2}\qty(n) + 1)\).
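A quick numerical check of that closed form (assuming \(T\qty(1) = 1\) and \(n\) a power of two):

```python
import math

def T(n):
    # the recurrence T(n) = 2 T(n/2) + n with base case T(1) = 1
    return 1 if n == 1 else 2 * T(n // 2) + n

for n in [1, 2, 8, 64, 1024]:
    print(n, T(n), n * (math.log2(n) + 1))   # the last two columns agree
```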
least-squares error
requirements
- \(h_{\theta}\qty(x)\), the predictor function
- \(x,y\), the samples of data
definition
\begin{equation} J\qty(\theta) = \frac{1}{2} \sum_{i=1}^{n}\qty(h_{\theta }\qty(x^{(i)}) - y^{(i)})^{2} \end{equation}
see also example: gradient descent for least-squares error.
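As a minimal NumPy sketch (assuming the linear predictor \(h_{\theta}\qty(x) = \theta^{T}x\); names are illustrative):

```python
import numpy as np

def least_squares_error(theta, X, y):
    # J(theta) = 1/2 * sum_i (theta^T x_i - y_i)^2
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)
```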
additional information
“why the 1/2?”
Because when you take \(\nabla J\qty(\theta)\), the \(2\) that comes down from the square cancels the \(\frac{1}{2}\), as shown below.
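Concretely, differentiating a single term of the sum for the linear predictor \(h_{\theta}\qty(x) = \theta^{T}x\):

\begin{equation} \frac{\partial}{\partial \theta_{j}} \frac{1}{2}\qty(h_{\theta}\qty(x) - y)^{2} = \frac{1}{2} \cdot 2\qty(h_{\theta}\qty(x) - y) \frac{\partial h_{\theta}\qty(x)}{\partial \theta_{j}} = \qty(h_{\theta}\qty(x) - y) x_{j} \end{equation}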
probabilistic intuition for least-squares error in linear regression
Assume that our dataset \(\qty(x^{(i)}, y^{(i)}) \sim D\) has the following property: “the true \(y\) value is just our model’s output, plus some error.” Meaning:
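\begin{equation} y^{(i)} = \theta^{T}x^{(i)} + \epsilon^{(i)}, \qquad \epsilon^{(i)} \sim \mathcal{N}\qty(0, \sigma^{2}) \end{equation}
Under this Gaussian error assumption, maximizing the likelihood of the data is equivalent to minimizing \(J\qty(\theta)\).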
