Houjun Liu

stochastic gradient descent

\begin{equation} \theta^{t+1} = \theta^{t} - \eta \nabla_{\theta} L(f_{\theta}(x), y) \end{equation}

this terminates when the change in \(\theta\) between iterations becomes small, or when progress halts: for instance, when the loss begins going up instead of down.
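as a rough sketch of this loop (assuming a per-sample gradient function grad_L, a learning rate eta, and a stopping tolerance tol, none of which are defined in this note), the per-sample update plus the "small change" stopping rule might look like:

```python
import random
import numpy as np

# illustrative sketch only; grad_L, data, eta, tol are assumptions.
# grad_L(theta, x, y) is assumed to return the gradient of
# L(f_theta(x), y) with respect to theta.
def sgd(theta, data, grad_L, eta=0.01, tol=1e-6, max_epochs=100):
    for _ in range(max_epochs):
        random.shuffle(data)                 # visit samples in random order
        for x, y in data:
            step = eta * grad_L(theta, x, y)
            theta = theta - step             # theta^{t+1} = theta^t - eta * grad
            if np.linalg.norm(step) < tol:   # stop when the update becomes small
                return theta
    return theta
```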

batch gradient descent

stochastic gradient descent gives choppy movements because it computes the gradient from one sample at a time. We could instead compute the gradient over the entire dataset, but then each update is slow. The full-batch update is written out below.
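written out, the full-batch update averages the per-sample gradients over all \(N\) training examples (the \(\frac{1}{N}\) averaging is one common convention, not stated in the note):

\begin{equation} \theta^{t+1} = \theta^{t} - \eta \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} L(f_{\theta}(x_i), y_i) \end{equation}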

mini-batches help take advantage of both: averaging the gradient over a small batch of samples smooths out the updates, while each step remains cheap to compute.
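concretely, one common way to write the mini-batch update is to average the gradient over a sampled batch \(B\) (with \(|B|\) much smaller than the dataset size):

\begin{equation} \theta^{t+1} = \theta^{t} - \eta \frac{1}{|B|} \sum_{(x_i, y_i) \in B} \nabla_{\theta} L(f_{\theta}(x_i), y_i) \end{equation}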