Houjun Liu

bootstrap

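Bootstrapping lets you estimate the distribution of a statistic, calculate p-values, etc., without relying on parametric statistical tests such as the t-test.

Big idea: treat your sample as if it were the population, and resample from it to estimate the properties of the sampling distribution. That is, the empirical distribution \(\hat{D}\) of the sample stands in for the true distribution \(D\):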

\begin{equation} D \approx \hat{D} \end{equation}

So, to calculate the distribution of any given statistic from a sample:

  1. estimate the PMF using the sample
  2. my_statistic_dist = [] (holds values of the statistic: sample mean, sample variance, etc.)
  3. for i in range(N), with N large (on the order of 10,000 or more):
    1. take a resample of len(sample) draws from the PMF
    2. my_statistic_dist.append(my_statistic(resample)) (recall it has to be a sample statistic, e.g. dividing by len(sample) - 1 for sample variance)
  4. now you have a distribution of my_statistic

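Statistics such as the sample mean and sample variance are averages over independent draws of the same random variable, and we compute them \(N\) times. So the central limit theorem applies: their bootstrap distributions are approximately normal (for a reasonably large sample), which makes them easy to work with.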
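
For the sample mean specifically, the normal approximation can be written out. The symbols here are assumptions not used elsewhere in these notes: \(\mu\) and \(\sigma^{2}\) are the mean and variance of the distribution being sampled, and \(n\) is the sample size. For large \(n\), approximately:

\begin{equation} \bar{X} \sim \mathcal{N}\left(\mu, \frac{\sigma^{2}}{n}\right) \end{equation}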

For step 3.1, a resample of len(sample) draws can be obtained with:

np.random.choice(sample_pop, len(sample_pop), replace=True)

because we essentially want to draw from the empirical distribution of the input sample, WITH REPLACEMENT (otherwise we would get back exactly the same set of data, just reordered, instead of a new sample from it).
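
Putting the steps together, a minimal sketch of the whole procedure might look like the following. The function name `bootstrap_distribution`, its parameters, and the example data are hypothetical and only for illustration; it uses NumPy's `Generator.choice` rather than the legacy `np.random.choice`, which behaves the same way for this purpose.

```python
import numpy as np

def bootstrap_distribution(sample, statistic, n_resamples=10_000, seed=0):
    """Evaluate `statistic` on many bootstrap resamples of `sample`."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    dist = np.empty(n_resamples)
    for i in range(n_resamples):
        # step 3.1: draw len(sample) points from the sample, with replacement
        resample = rng.choice(sample, size=n, replace=True)
        # step 3.2: record the statistic of this resample
        dist[i] = statistic(resample)
    return dist

# made-up data, for illustration only
sample = np.array([2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7])

mean_dist = bootstrap_distribution(sample, np.mean)
# ddof=1 gives the sample variance (divides by len(resample) - 1)
var_dist = bootstrap_distribution(sample, lambda x: np.var(x, ddof=1))
```

From here, something like `np.percentile(mean_dist, [2.5, 97.5])` gives a rough interval for the mean without assuming any particular parametric form.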

p-value from bootstrap

The p-value is defined as “the probability, under the null hypothesis that the two samples came from the same distribution, of seeing a difference in sample means (called the effect size) greater than the one actually observed”.

so:

\begin{equation} P(|\mu_{1} - \mu_{2}| > x \mid \text{null}) \end{equation}

We can simply compute a distribution of effect sizes by bootstrapping on the combined (pooled) sample from both groups, then read off the probability above, where \(x\) is the effect size we actually observed.
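
A minimal sketch of that calculation, assuming the effect size is the absolute difference in sample means. The function name `bootstrap_p_value` and its parameters are hypothetical. Pooling the two samples encodes the null hypothesis that both came from the same distribution.

```python
import numpy as np

def bootstrap_p_value(sample_a, sample_b, n_resamples=10_000, seed=0):
    """Estimate P(|mu_1 - mu_2| > x | null) by resampling the pooled data."""
    rng = np.random.default_rng(seed)
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    # x: the effect size actually observed
    observed = abs(a.mean() - b.mean())
    # under the null, both samples come from one distribution: pool them
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_resamples):
        # draw two fresh samples of the original sizes, with replacement
        resample_a = rng.choice(pooled, size=len(a), replace=True)
        resample_b = rng.choice(pooled, size=len(b), replace=True)
        if abs(resample_a.mean() - resample_b.mean()) > observed:
            count += 1
    # fraction of null resamples with a larger effect size than observed
    return count / n_resamples
```

A small returned value means an effect size as large as the observed one rarely arises when both samples really share a distribution.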