Houjun Liu

random variable

A random variable is a quantity that can take on different values, with a probability associated with each value:

  • discrete: countably many possible values (possibly finite)
  • continuous: uncountably many possible values

probability mass function

A discrete random variable is described by a probability mass function, which assigns a probability to each possible value.

probability density function

A continuous random variable is described by a probability density function; probabilities come from integrating the density over an interval.
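
A quick sketch contrasting the two (assuming scipy; the Binomial and Normal here are arbitrary example distributions, not from the notes above):

```python
from scipy import stats

# Discrete: a PMF returns an actual probability.
# P(X = 3) for X ~ Binomial(n=10, p=0.5):
print(stats.binom.pmf(3, n=10, p=0.5))             # ~0.117

# Continuous: a PDF returns a density, not a probability.
# Density of X ~ N(0, 1) at x = 0:
print(stats.norm.pdf(0.0))                         # ~0.399

# Probabilities for continuous variables come from integrating the density:
print(stats.norm.cdf(1.0) - stats.norm.cdf(-1.0))  # P(-1 < X < 1) ~0.683
```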

summary statistics

adding random variables

“What’s the probability that \(X + Y = n\) for IID \(X\) and \(Y\)?” In other words: what’s the probability of two independent samples from the same distribution adding up to \(n\)?

\begin{equation} P(X+Y=n) = \sum_{i=-\infty}^{\infty} P(X=i,\ Y=n-i) = \sum_{i=-\infty}^{\infty} P(X=i)\,P(Y=n-i) \end{equation}

where the second equality uses independence. For continuous variables, replace the sum with an integral over the PDFs.

Intuitively: for every possible assignment \(i\) to \(X\), we pair it with the assignment \(n-i\) to \(Y\) so that the two sum to \(n\). This is a convolution: we enumerate every combination of assignments that adds to the target value and sum their probabilities together.
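
A minimal sketch of this idea (assuming numpy; the fair-die PMF is an illustrative choice, not part of the notes):

```python
import numpy as np

# PMF of one fair six-sided die: P(X = k) = 1/6 for k = 1, ..., 6.
die = np.full(6, 1 / 6)

# np.convolve sums, for each total, the probabilities of every (i, n - i)
# pair -- exactly the sum in the equation above.
pmf_sum = np.convolve(die, die)

# pmf_sum[k] is P(X + Y = k + 2), since both supports start at 1.
for n, p in enumerate(pmf_sum, start=2):
    print(f"P(X + Y = {n}) = {p:.4f}")
```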

If you add many IID variables together, the distribution of the sum approaches a Gaussian: the central limit theorem.

averaging random variables

adding random variables + linear transformations on a Gaussian

You end up with:

\begin{equation} \mathcal{N}\qty(\mu, \frac{1}{n} \sigma^{2}) \end{equation}

Note: as you average more IID variables together, the expected value of the average stays the same, but its variance shrinks as \(\frac{1}{n}\).
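
A simulation sketch of this shrinkage (assuming numpy; \(\mu = 5\), \(\sigma = 2\) are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0   # hypothetical true parameters

for n in (1, 10, 100, 1000):
    # 10,000 trials, each averaging n IID draws.
    means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
    # The mean of the averages stays near mu; their variance shrinks like sigma^2 / n.
    print(f"n={n:4d}  mean={means.mean():.3f}  var={means.var():.4f}  "
          f"sigma^2/n={sigma**2 / n:.4f}")
```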

maxing random variables

Gumbel distribution: by the Fisher–Tippett–Gnedenko theorem, the (suitably normalized) maximum of many IID samples converges to one of three extreme-value distributions (Gumbel, Fréchet, or Weibull).
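
A simulation sketch (assuming numpy; uses the standard fact that the maximum of \(n\) IID Exponential(1) draws, shifted by \(\ln n\), approaches a standard Gumbel):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000          # draws per maximum

# Maximum of n IID Exp(1) draws, shifted by ln(n): approximately standard Gumbel.
maxima = rng.exponential(size=(5_000, n)).max(axis=1) - np.log(n)

# The standard Gumbel has mean ~0.5772 (the Euler-Mascheroni constant)
# and variance pi^2 / 6 ~ 1.645.
print(f"mean={maxima.mean():.3f} (expect ~0.577)  "
      f"var={maxima.var():.3f} (expect ~1.645)")
```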

sampling statistics

We assume that there’s some underlying distribution with some true mean \(\mu\) and true variance \(\sigma^{2}\). We would like to model it with some confidence.

Consider a series of measured samples \(x_1, \ldots, x_{n}\), each an instantiation of one of the IID random variables \(X_1, \ldots, X_{n}\) drawn from the underlying distribution.

sample mean

Let us estimate the true population mean by creating a random variable that averages the \(n\) random variables representing the observations:

\begin{equation} \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_{i} \end{equation}

This works because \(\mathbb{E}[\bar{X}] = \mathbb{E}\qty[\frac{1}{n} \sum_{i=1}^{n} X_i] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[X_{i}] = \frac{1}{n}\, n \mu = \mu\); as long as each underlying variable has the same expected value (they do, because they are IID), the sample mean is an unbiased estimate of the population mean.

sample variance

We can’t just take the variance of the sample directly: the sample mean is, by construction, closer to the sampled points than the true mean is, so the uncorrected estimate is biased low. We correct for this by dividing by \(n-1\) instead of \(n\) (Bessel’s correction). This estimator is a random variable too:

\begin{equation} S^{2} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{i} - \bar{X})^{2} \end{equation}
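
A simulation sketch showing why the \(n-1\) correction matters (assuming numpy; the true \(\mu\) and \(\sigma^{2}\) are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 5.0, 4.0
n = 5                # small samples make the bias visible

samples = rng.normal(mu, np.sqrt(sigma2), size=(100_000, n))

# Dividing by n systematically underestimates sigma^2;
# dividing by n - 1 (ddof=1, Bessel's correction) is unbiased on average.
print(f"divide by n:     {samples.var(axis=1, ddof=0).mean():.3f}  (true: {sigma2})")
print(f"divide by n - 1: {samples.var(axis=1, ddof=1).mean():.3f}  (true: {sigma2})")
```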

standard error of the mean

\begin{equation} Var(\bar{X}) = \frac{\sigma^{2}}{n} \approx \frac{S^{2}}{n} \end{equation}

This is the estimated variance of the sample mean given what you measured; its square root, \(S/\sqrt{n}\), is the standard error of the mean. Because of the central limit theorem, \(\bar{X}\) is approximately normally distributed with this variance.
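
Putting the pieces together on a set of measurements (a sketch assuming numpy; the samples here are synthetic, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=100)   # hypothetical measured samples

x_bar = x.mean()
s2 = x.var(ddof=1)                   # sample variance, with the n - 1 correction
sem = np.sqrt(s2 / len(x))           # standard error of the mean

# By the CLT, the true mean lies within ~1.96 standard errors of x_bar
# about 95% of the time.
print(f"mean = {x_bar:.3f} +/- {1.96 * sem:.3f} (95% CI)")
```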