\begin{equation} \theta^{t+1} = \theta^{t} - \eta \nabla_{\theta} L(f_{\theta}(x), y) \end{equation}

training terminates when the change in \(\theta\) between iterations becomes small, or when progress halts: for example, when the validation loss begins going up instead of down.
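a minimal sketch of that stopping rule, assuming a hypothetical `grad_step` helper that performs one update and a tolerance `tol` (names are illustrative, not from the text):

```python
import numpy as np

def train(theta, grad_step, tol=1e-6, max_steps=10_000):
    """Run updates until theta barely moves between iterations."""
    for _ in range(max_steps):
        theta_new = grad_step(theta)              # one gradient update
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new                      # converged: theta stopped changing
        theta = theta_new
    return theta                                  # progress halted (step budget hit)

# toy example: minimize L(theta) = theta^2, whose gradient is 2*theta
theta = train(np.array([5.0]), lambda t: t - 0.1 * (2 * t))
```

in practice the loss-based check (stop when validation loss rises) is usually layered on top of this one.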

in SGD we update the weights using a **single random sample**, moving the weights in the direction of that sample's negative gradient.

```
while not converged:
    x, y = sample(corpus)                 # draw ONE random example
    theta = theta - lr * grad(loss(f(theta, x), y))
```

In theory this is a noisy *approximation* of gradient descent; in practice, neural networks often train *better* with the jiggle, since the noise helps the optimizer escape saddle points and poor local minima.
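a runnable version of the loop above, on toy one-dimensional linear regression (the data and names here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: y = 3*x + small noise; we fit a single weight theta
x = rng.normal(size=1000)
y = 3.0 * x + 0.01 * rng.normal(size=1000)

theta, lr = 0.0, 0.05
for step in range(2000):
    i = rng.integers(len(x))              # one random sample
    pred = theta * x[i]
    grad = 2 * (pred - y[i]) * x[i]       # d/dtheta of (pred - y)^2
    theta -= lr * grad                    # noisy step toward the minimum
```

each step only touches one example, so the trajectory is jittery, but `theta` still ends up near the true weight of 3.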

## batch gradient descent

stochastic gradient descent gives choppy updates because it looks at one sample at a time.

batch gradient descent computes the gradient over the entire dataset instead, which gives smooth updates but is slow, since every step requires a full pass over the data.
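the same toy regression with full-batch updates, for contrast (again a sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + 0.01 * rng.normal(size=1000)

theta, lr = 0.0, 0.05
for step in range(200):
    grad = (2 * (theta * x - y) * x).mean()   # gradient averaged over ALL samples
    theta -= lr * grad                        # one smooth, expensive step
```

every step is exact and the path to the minimum is smooth, but each step costs a full sweep over the dataset.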

## mini-batch gradient descent

mini-batches take advantage of both: the gradient is averaged over a group of \(m\) samples, which smooths out SGD's noise while staying far cheaper than a full pass over the dataset.

## Regularization

See Regularization