matrix exponentiation
If we have some system:
\begin{equation} x' = Ax \end{equation}
the solution to this system should involve \(e^{At}\), which we define via the power series:
\begin{equation} e^{At} = I + At + \frac{1}{2!} \qty(At)^{2} + \frac{1}{3!} \qty(At)^{3}+ \dots \end{equation}
the derivative of which:
\begin{align} \dv{t} e^{At} &= A + A^{2}t + \frac{A^{3}t^{2}}{2} + \dots \\ &= A\qty(I + At + \frac{A^{2}t^{2}}{2} + \dots) \\ &= Ae^{At} \end{align}
This intuition holds for all square matrices \(A\), since the power series converges for any \(A\). Meaning the general solution is:
\begin{equation} x(t) = e^{At}x_{0} \end{equation}
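As a quick numerical sanity check (a sketch using numpy and scipy, with a made-up \(A\) and \(t\), not part of the original notes), the truncated power series should agree with scipy.linalg.expm:

```python
import numpy as np
from scipy.linalg import expm

# a small example system x' = Ax (matrix chosen arbitrarily for illustration)
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
t = 0.5

def expm_series(A, t, terms=20):
    """Truncated power series I + At + (At)^2/2! + ... """
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ (A * t) / k  # multiply in another factor of (At)/k
        result = result + term
    return result

print(np.allclose(expm_series(A, t), expm(A * t)))  # True
```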
matrix multiplication
matrix multiplication is defined such that \(\mathcal{M}(ST) = \mathcal{M}(S)\mathcal{M}(T)\) holds for linear maps \(S\) and \(T\):
\begin{equation} (AC)_{j,k} = \sum_{r=1}^{n}A_{j,r}C_{r,k} \end{equation}
While matrix multiplication is distributive and associative, it is NOT commutative: in general, \(ST \neq TS\).
memorization
- it's always row-by-column: move down the rows first, then across the columns
- multiply element-wise and add (row times column and add)
other ways of thinking about matrix multiplication
- it is “row times column”: \((AC)_{j,k} = A_{j, .} \cdot C_{., k}\)
- it is “matrix times columns”: \((AC)_{. , k} = A C_{., k}\) (both views are checked in the sketch below)
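A quick numpy check of the entrywise formula and of non-commutativity (a sketch with made-up matrices, not from the notes):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
C = np.array([[0.0, 1.0],
              [5.0, 2.0]])

# entrywise formula: (AC)_{j,k} = sum_r A_{j,r} C_{r,k}
AC = np.zeros((2, 2))
for j in range(2):
    for k in range(2):
        AC[j, k] = sum(A[j, r] * C[r, k] for r in range(2))

print(np.allclose(AC, A @ C))     # True: matches built-in multiplication
print(np.allclose(A @ C, C @ A))  # False: multiplication is not commutative
```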
matrix as a linear combinator
Suppose \(A\) is an \(m\) by \(n\) matrix, and \(c = \mqty(c_1\\ \vdots\\ c_{n})\) is an \(n\) by \(1\) matrix; then:
\begin{equation} Ac = c_1 A_{., 1} + \dots + c_{n} A_{., n} \end{equation}
that is, \(Ac\) is a linear combination of the columns of \(A\), weighted by the entries of \(c\).
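And a small check of the linear-combination view (again a numpy sketch with a made-up \(A\) and \(c\)): \(Ac\) is exactly the columns of \(A\) weighted by the entries of \(c\):

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])   # m = 2, n = 3
c = np.array([2.0, -1.0, 4.0])    # n-vector of weights

# Ac as a linear combination of the columns of A
combo = sum(c[i] * A[:, i] for i in range(A.shape[1]))

print(np.allclose(A @ c, combo))  # True
```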
maximal interval
a maximal interval is the largest interval you can fit on which the function stays finite. For example, the solution \(x(t) = \frac{1}{1-t}\) of \(x' = x^{2}\), \(x(0) = 1\) blows up as \(t \to 1\), so its maximal interval of existence is \((-\infty, 1)\).
maximum a posteriori estimate
maximum a posteriori estimate is a parameter learning scheme that uses a prior (such as the Beta Distribution) and Bayesian inference to get a posterior distribution over the parameter, and returns the argmax (i.e. the mode) of that posterior.
This differs from MLE because we are considering a distribution of possible parameters:
\begin{equation} p\qty (\theta \mid x_1, \dots, x_{n}) \end{equation}
Calculating the MAP estimate, in general:
\begin{equation} \theta_{MAP} = \arg\max_{\theta} P(\theta|x_1, \dots, x_{n}) = \arg\max_{\theta} \frac{f(x_1, \dots, x_{n} | \theta) g(\theta)}{h(x_1, \dots, x_{n})} \end{equation}
Since the evidence \(h(x_1, \dots, x_{n})\) does not depend on \(\theta\), it can be dropped from the \(\arg\max\).
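As a concrete sketch (assuming the conjugate setup these notes reference: a Bernoulli likelihood with a Beta(\(a, b\)) prior; the data below is made up), the MAP estimate is just the mode of the Beta posterior:

```python
import numpy as np

# coin-flip data: 1 = heads, 0 = tails (made-up sample for illustration)
x = np.array([1, 1, 0, 1, 1, 0, 1, 1])
n, heads = len(x), int(x.sum())

# Beta(a, b) prior over theta; posterior is Beta(a + heads, b + n - heads)
a, b = 2.0, 2.0

# MAP = mode of the posterior (valid when both posterior parameters > 1)
theta_map = (a + heads - 1) / (a + b + n - 2)

# MLE for comparison: just the sample frequency
theta_mle = heads / n

print(f"MAP: {theta_map:.3f}, MLE: {theta_mle:.3f}")
```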
Maximum Likelihood Parameter Learning
“We find the parameter that maximizes the likelihood.”
- write down the log-likelihood of a single observation \(X_{i}\)
- sum the log-likelihoods over all observations \(X_{1}, \dots, X_{n}\)
- take the derivative w.r.t. \(\theta\) and set it to \(0\)
- solve for \(\theta\)
(this maximizes the log-likelihood of the data!)
that is:
\begin{equation} \theta_{MLE} = \arg\max_{\theta} P(x_1, \dots, x_{n}|\theta) = \arg\max_{\theta} \qty(\sum_{i=1}^{n} \log(f(x_{i}|\theta)) ) \end{equation}
If your \(\theta\) is a vector of more than one parameter, take the gradient (i.e. the partial derivative with respect to each parameter) of the log-likelihood and solve for where the gradient is identically \(0\) (each slot is \(0\)). That is, we want:
\begin{equation} \nabla_{\theta} \sum_{i=1}^{n} \log f(x_{i}|\theta) = 0 \end{equation}
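For instance, assuming a Gaussian model \(N(\mu, \sigma^{2})\) with \(\theta = (\mu, \sigma^{2})\) (this model choice and the data are my own illustration, not from the notes), setting the gradient to zero gives the familiar closed forms:

```python
import numpy as np

# made-up sample, assuming a Gaussian model N(mu, sigma^2) for the data
x = np.array([2.1, 1.9, 2.4, 2.0, 2.3, 1.8])
n = len(x)

# setting the gradient of the log-likelihood to zero gives closed forms:
#   d/d mu      sum log f(x_i | mu, sigma^2) = 0  ->  mu_hat = sample mean
#   d/d sigma^2 sum log f(x_i | mu, sigma^2) = 0  ->  sigma2_hat = mean squared deviation
mu_mle = x.mean()
sigma2_mle = ((x - mu_mle) ** 2).mean()  # note: divides by n, not n - 1

print(f"mu_MLE: {mu_mle:.3f}, sigma^2_MLE: {sigma2_mle:.3f}")
```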