Numerical Linear Algebra

FLOP

A FLOP is a “floating-point operations”. One addition, subtraction, multiplication, or division of two floating-point numbers.

Basic Linear Algebra Subroutines

BLAS Level 1

Vector-Vector operations

inner product \(x^{T}y\) which is \(2n-1\) flops
sum \(x+y\) and scale \(\alpha x\) which is \(n\) flops

BLAS Level 2

Matrix-vector product

\(y = Ax\) with \(A \in \mathbb{R}^{m \times n}\)

standard multiply \(m\qty(2n - 1)\)
\(2N\) if \(A\) is sparse with \(N\) non-zero elements
\(2p\qty(n+m)\) if \(A\) is given as \(A = UV^{T}\), where \(U \in \mathbb{R}^{m \times p}\) and \(V \in \mathbb{R}^{n \times p}\), which is called a “banded” matrix (sparse everywhere except for a strip nexht to diagonal)

BLAS Level 3

Matrix-matrix operations

for \(A \in \mathbb{R}^{m\times n}\) and \(B \in \mathbb{R}^{n \times p}\)

\(mp\qty(2n-1)\) flops (on the order of \(2 mnp\))
less if \(A\) and or \(B\) is false
\(\frac{1}{2}m\qty(m+1)\qty(2n-1) \approx m^{2}n\) if \(m = p\) and \(AB\) is symmetric

BLAS Speeds

CPU single thread does at around 1-10 GFlops/s
CPU multi-tread 10-100 GFlops/s
GPU typically 100 GFlops/s - 1 TFloaps/s

GPUs uses “blocking” (divide and concur)

Solving Linear Equations

solution of \(Ax = b\) is \(x = A^{-1}b\)

no one actually materializes \(A^{-1}\)
for general methods, \(O\qty(n^{3})\)
for \(A\) with half-bandwidth \(k\), we have \(O\qty(k^{2}n)\)

For special structures, we can have:

special matricies

lower-triangular — forward substitution
upper-triangular — do it backwards
orthogonal matrices — \(2n^{2}\) flops to compute with householder transformation

factorization

Factor \(A\) as a product of simple matricies \(A = A_1 … A_{k}\). And then we can compute:

\begin{equation} x = A^{-1}b = A_{k}^{-1} \dots A_{1}^{-1} b \end{equation}

Cost of solver is dominated by this factoring instead of actually the solving. We can solve a modest number of equations with the same coefficinets at the same cost as one (because you are just factoring most of the time, you can just cache the factored results).

LU Factorization

Every nonsingular matrix \(A\) can be factored as:

\begin{equation} A = PLU \end{equation}

where \(P\) is a permutation, \(L\) is lower triangular, \(U\) is upper triangular. Solving procedure:

LU factorization
permutation (shuffle \(X\) around backwards)
forward substitution
backward substitution

Sparse LU Factorization

For \(A\) sparse and invertible, we can factor it into four matricies \(A = P_1 L U P_2\).

Doing these row and column permutations allows \(L\) and \(U\) to be sparse. They are then chosen heuristically to yield sparse \(L\), \(U\).

Cholesky Factorization

Suppose \(A\) is positive definite.

Factor \(A\) into \(A = LL^{T}\). \(L\) is lower-triangular with positive diagonal entries.

Sparse Cholesky Factorization

for sparse positive definite \(A\), factor as \(A = PLL^{T}P^{T}\)
adding permutation matrix \(P\) makes sparser \(L\)

This is same as:

permuting rows and columns of \(A\) to get \(\tilde{A} = P^{T}AP\)
and then Cholesky

Interestingly we can figure out \(P\) just with the sparsity pattern of \(A\), with no value.

LDL-transpose Factorization

Every nonsingular symmetric matrix \(A\) can be factored as:

\begin{equation} A = PLDL^{T} P^{T} \end{equation}

with \(P\) permutaiton matrix, \(L\) lower triangular, \(D\) block diagonal with \(1 \times 1\) or \(2 \times 2\) diagonal blocks.

Structured Sub-Blocks

For:

\begin{equation} \mqty(A_{11} & A_{12} \\ A_{21} &A_{22}) \mqty(x_1 \\ x_2) = \mqty(b_1 \\ b_2) \end{equation}

We can first solve:

Form \(A_{11}^{-1}A_{12}\) and \(A_{11}^{-1}b_1\) (factor \(A_{11}\), solve with each column of \(A_{12}\) and \(b\))
Form Shure’s complement: \(S = A_{22} - A_{21}A^{-1}_{11} A_{12}\) and \(\tilde{b} = b_2 - A_{21} A_{11}^{-1}b_1\)
Determine \(x_2\) by solving \(S x_2 = \tilde{b}\)
Determine \(x_1\) by solving \(A_{11}x_1 = b_1 - A_{12} x_2\)

Matrix Inversion Lemma

\begin{equation} \qty(A+BC)^{-1} = A^{-1} - A^{-1}B\qty(I + CA^{-1}B)^{-1}CA^{-1} \end{equation}