(Py)Torch is a great C++/Python library to construct and train complex neural networks. It has taken over academia over the last few years and is slowly taking over industry. Let’s learn about how it works!

**This document is meant to be read cover-to-cover. It makes NO SENSE unless read like that. I focus on building intuition about why PyTorch works, so we will be writing unorthodox code until the very end where we put all ideas together.**

The chapters below walk you through the major steps of a machine-learning journey. Before we can do anything, though, we need to import a few things:

```
import numpy as np
import torch
```

## Autograd

I believe that anybody learning a new ML framework should learn how its differentiation tools work. Yes, this means that we should first understand how they work not with a giant matrix, but with just two simple variables.

At the heart of PyTorch are its built-in gradient backpropagation facilities. To demonstrate them, let us create two such variables.

```
var_1 = torch.tensor(3.0, requires_grad=True)
var_2 = torch.tensor(4.0, requires_grad=True)
(var_1, var_2)
```

```
(tensor(3., requires_grad=True), tensor(4., requires_grad=True))
```

There is secretly a lot going on here, so let’s dive in. First, just to get a technicality out of the way: `torch.tensor` (used here) is the generic tensor creator, while `torch.Tensor` (capital!) is the tensor class itself, whose constructor you will almost **never** need to call directly.

What is a `tensor`? A `tensor` is simply a very efficient matrix that can update its own values dynamically while keeping the same variable name. The above commands create two such `tensor`s, each holding a single number (a scalar, which you can think of as a `1x1` matrix).

Note that, for the initial values, I used *floats* instead of *ints*. The above code will crash if you use ints: PyTorch only tracks gradients for floating-point (and complex) tensors, because we want the surface on which the values change to be smooth for things like gradient descent to work.
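If you want to see the float-versus-int behaviour for yourself, here is a small sketch. The names `ok` and `int_failed` are made up for this illustration; the exact wording of the error message may differ between PyTorch versions.

```python
import torch

# A float tensor is happy to track gradients.
ok = torch.tensor(3.0, requires_grad=True)
print(ok.dtype, ok.shape)  # a single number has an empty shape

# An integer tensor is not: PyTorch raises a RuntimeError.
try:
    torch.tensor(3, requires_grad=True)
    int_failed = False
except RuntimeError as err:
    int_failed = True
    print("RuntimeError:", err)
```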

Lastly, we have an argument `requires_grad=True`. This argument tells PyTorch to keep track of the gradient of the `tensor`. For now, understand this as “permit PyTorch to change this variable if needed.” More on that in a sec.

Naturally, if we have two tensors, we would love to multiply them!

```
var_mult = var_1*var_2
var_mult
```

```
tensor(12., grad_fn=<MulBackward0>)
```

Wouldyalookatthat! Another tensor, with the value \(12\).

Now. Onto the main event. Back-Propagation! The core idea of a neural network is actually quite simple: figure out how much each input parameter (for us `var_1`, `var_2`) influences the output, then adjust the inputs accordingly to get the output to be \(0\).

To see what I mean, recall our output `tensor` named:

```
var_mult
```

```
tensor(12., grad_fn=<MulBackward0>)
```

How much does changing `var_1` and `var_2`, its inputs, influence this output `tensor`? This is not immediately obvious, so let’s write out what we are doing:

\begin{equation} v_1 \cdot v_2 = v_{m} \implies 3 \cdot 4 = 12 \end{equation}

with \(v_1\) being `var_1`, \(v_2\) being `var_2`, and \(v_{m}\) being `var_mult`.

As you vary `var_1`, by **what factor** does the output change? For instance, if `var_1` (the \(3\)) suddenly became a \(2\), how much *less* will `var_mult` be? Well, \(2\cdot 4=8\), so the output is exactly \(4\) less than before. Hence, `var_1` influences the value of `var_mult` by a factor of \(4\); meaning every time you add/subtract \(1\) to the value of `var_1`, `var_mult` goes up/down by \(4\).

Similarly, as you vary `var_2`, by what factor does the output change? For instance, if `var_2` (the \(4\)) suddenly became a \(5\), how much *more* will `var_mult` be? Well, \(3\cdot 5=15\), so the output is exactly \(3\) more than before. Hence, `var_2` influences the value of `var_mult` by a factor of \(3\); meaning every time you add/subtract \(1\) to the value of `var_2`, `var_mult` goes up/down by \(3\).

For those of you who have exposure to Multi-Variable Calculus: this is indeed the same concept as the partial derivatives of `var_mult` w.r.t. `var_1` and `var_2`, for the previous two paragraphs respectively.

These relative-change-units (\(4\) and \(3\)) are called **gradients**: the factor by which changing any given variable changes the output.
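The nudge-and-compare reasoning above can be written out in plain Python, no PyTorch needed. This is just a sketch: `product`, `step`, `grad_v1` and `grad_v2` are names invented for the illustration. Nudging by a whole unit gives the exact answer here only because a product is a straight line in each input; for curvier functions you would want a much smaller nudge.

```python
def product(v1, v2):
    """The same computation as var_mult: multiply the two inputs."""
    return v1 * v2

v1, v2 = 3.0, 4.0
step = 1.0  # nudge each input by one unit, as in the text

# Change in output divided by change in input, one input at a time.
grad_v1 = (product(v1 + step, v2) - product(v1, v2)) / step
grad_v2 = (product(v1, v2 + step) - product(v1, v2)) / step

print(grad_v1, grad_v2)  # 4.0 3.0
```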

Now, gradient calculation is awfully manual! Surely we don’t want to keep track of these tiny rates-of-change ourselves! This is where PyTorch autograd comes in. Autograd is the automated tool that figures out these relative changes for you! It is built into all PyTorch tensors.

In the previous paragraphs, we figured out the relative influences of `var_1` and `var_2` on `var_mult`. Now let’s ask a computer to give us the same result, in much less time.

First, we will ask PyTorch to calculate gradients for all variables that contributed to `var_mult`.

```
var_mult.backward()
```

The `backward` function is a magical function that finds and calculates these relative-change-values of `var_mult` with respect to every variable that contributed to its value. To view the actual relative values, we will now use `.grad` on the actual variables:

```
var_1.grad
```

```
tensor(4.)
```

Recall! We used our big brains to deduce above that changing `var_1` by \(1\) unit will change `var_mult` by \(4\) units. So this works!

The other variable works as expected:

```
var_2.grad
```

```
tensor(3.)
```

Yayyy! Still what we expected.
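Autograd is not limited to a single multiplication. As a hedged sketch, here is a slightly bigger expression, \(a \cdot b + a\), with fresh variables `a` and `b` (names made up for this example). By the same nudge reasoning, moving `a` by \(1\) changes the output by \(b + 1 = 5\), and moving `b` by \(1\) changes it by \(a = 3\).

```python
import torch

a = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(4.0, requires_grad=True)

# Before any backward() call there is no gradient stored yet.
grad_before = a.grad
print(grad_before)  # None

out = a * b + a
out.backward()

print(a.grad)  # tensor(5.)
print(b.grad)  # tensor(3.)
```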

## Gradient Descent

Relative changes are cool, but they aren’t all that useful unless we are actually doing some changing. We want to use our epic knowledge about the relative influences of `var_1` and `var_2` to manipulate those variables such that `var_mult` is the value we want.

**THE REST OF THIS DOCUMENT IS UNDER CONSTRUCTION**

```
import torch.optim as optim
```

To start an optimizer, you give it all the variables for which it should keep track of updating.

```
optim = torch.optim.SGD([var_1, var_2], lr=1e-2, momentum=0.9)
```

And then, to apply the gradients and update the variables, you just have to:

```
optim.step()
# IMPORTANT
optim.zero_grad()
```

What’s that `zero_grad`? That clears the gradients from the variables (after applying them with `.step()`) so that the gradients from this update don’t pile up and influence the next one.
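Here is a minimal sketch of why the clearing matters, using fresh variables `x` and `y` (hypothetical names) and plain SGD without momentum so the arithmetic stays easy to follow. Calling `.backward()` twice without clearing in between *adds* the gradients together.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
optimizer = torch.optim.SGD([x, y], lr=1e-2)  # no momentum, for simplicity

(x * y).backward()
grad_once = x.grad.item()   # 4.0

(x * y).backward()          # no zero_grad in between!
grad_twice = x.grad.item()  # 8.0: the two gradients were added up

# Clear, recompute, then take one clean step: new = old - lr * gradient
optimizer.zero_grad()
(x * y).backward()
optimizer.step()
optimizer.zero_grad()

print(x.item(), y.item())   # roughly 2.96 and 3.97
```

After the step, `x` moved by \(0.01 \cdot 4\) and `y` by \(0.01 \cdot 3\), exactly the gradients we worked out by hand, scaled by the learning rate.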

## Your First Neural Network

```
import torch.nn as nn
```

### Layers

```
m = nn.Linear(20, 30)
input = torch.randn(128, 20)
output = m(input)
output, output.size()
```

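The \(20, 30\) are the input and output sizes of the layer: `nn.Linear(20, 30)` takes vectors with \(20\) numbers and produces vectors with \(30\) numbers. Here the input is a batch of \(128\) such vectors, so the output has size \(128 \times 30\).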
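To see what the two numbers passed to `nn.Linear` do, it helps to peek at the shapes involved. A small sketch (the names `layer`, `batch`, `out` and `manual` are made up here):

```python
import torch
import torch.nn as nn

layer = nn.Linear(20, 30)     # 20 numbers in, 30 numbers out
batch = torch.randn(128, 20)  # 128 inputs, each with 20 numbers
out = layer(batch)

print(layer.weight.shape)  # torch.Size([30, 20])
print(layer.bias.shape)    # torch.Size([30])
print(out.shape)           # torch.Size([128, 30])

# The layer is just y = x W^T + b under the hood.
manual = batch @ layer.weight.T + layer.bias
```

Up to tiny floating-point differences, `manual` and `out` hold the same values.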

Ok one layer is just lame. What if you want a bunch of layers?

```
m1 = nn.Linear(20, 30)
m2 = nn.Linear(30, 30)
m3 = nn.Linear(30, 40)
input = torch.randn(128, 20)
# function call syntax! Calls run from the inside out, so m1 goes first!
output = m3(m2(m1(input)))
output, output.size()
```

And guess what? If you want to adjust the values here, you would just do:

```
m1 = nn.Linear(20, 30)
m2 = nn.Linear(30, 30)
m3 = nn.Linear(30, 40)
input = torch.randn(128, 20)
# function call syntax! Calls run from the inside out, so m1 goes first!
output = m3(m2(m1(input)))
(output.sum() - 12).backward()
```

```
None
```

But wait! What are the options you give to your optimizer?

```
optim = torch.optim.SGD([m1.weight, m1.bias ... ... ], lr=1e-2, momentum=0.9)
```

That’s a *lot of variables!!* Each linear layer has an \(m\) and a \(b\) (of \(y=mx+b\) fame), and you will end up with a bajillion of those! Also, that function call syntax, chaining one layer after another, is so gnarly! Can we do better? Yes.
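To get a feel for how quickly this adds up, here is a sketch that spells out the list by hand for the three layers above and counts what is inside. The names `all_params` and `total_numbers` are invented for this illustration.

```python
import torch.nn as nn

m1 = nn.Linear(20, 30)
m2 = nn.Linear(30, 30)
m3 = nn.Linear(30, 40)

# One weight (the m) and one bias (the b) per layer, listed by hand.
all_params = [
    m1.weight, m1.bias,
    m2.weight, m2.bias,
    m3.weight, m3.bias,
]

total_numbers = sum(p.numel() for p in all_params)
print(len(all_params))  # 6 tensors to keep track of
print(total_numbers)    # 2800 individual numbers inside them
```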

### An Honest-to-Goodness Neural Network

PyTorch provides the `nn.Module` framework to make model creators’ lives easier. This is the best practice for creating a neural network.

Let’s replicate the example above with the new `nn.Module` framework:

```
class MyNetwork(nn.Module):
    def __init__(self):
        # important: runs early calls to make sure that
        # the module is correct
        super().__init__()
        # we declare our layers. We don't use them yet.
        self.m1 = nn.Linear(20, 30)
        self.m2 = nn.Linear(30, 30)
        self.m3 = nn.Linear(30, 40)

    # this is a special function that is called when
    # the module is called
    def forward(self, x):
        # we want to pass our input through to every layer
        # like we did before, but now more declaratively
        x = self.m1(x)
        x = self.m2(x)
        x = self.m3(x)
        return x

Explain all of this.

But now, we have essentially built our entire network into its own “layer” (actually we literally did: all layers are just `nn.Module`s) that does the job of all the other layers acting together. To use it, we just:

```
my_network = MyNetwork()
input = torch.randn(128, 20)
# calling the module runs its forward() function for us
output = my_network(input)
output
```

```
tensor([[-0.1694, 0.0095, 0.4306, ..., 0.1580, 0.2644, 0.1509],
[-0.2346, -0.0269, -0.1191, ..., 0.0229, -0.0819, -0.1452],
[-0.4871, -0.2868, -0.2488, ..., 0.0637, 0.1832, 0.0619],
...,
[-0.1323, 0.2531, -0.1086, ..., 0.0975, 0.0426, -0.2092],
[-0.4765, 0.1441, -0.0520, ..., 0.2364, 0.0253, -0.1914],
[-0.5044, -0.3263, 0.3102, ..., 0.1938, 0.1427, -0.0587]],
grad_fn=<AddmmBackward0>)
```

But wait! What are the options you give to your optimizer? Surely you don’t have to pass `my_network.m1.weight`, `my_network.m1.bias`, etc. etc. to the optimizer, right?

You don’t. One of the things that `super().__init__()` did was set up the bookkeeping that lets your network class keep track of everything there is to optimize: every layer you assign to `self` gets registered, and its weights and biases show up in `my_network.parameters()`. So now, to ask the optimizer to update the entire network, you just have to write:

```
optim = torch.optim.SGD(my_network.parameters(), lr=1e-2, momentum=0.9)
optim
```

```
SGD (
Parameter Group 0
dampening: 0
differentiable: False
foreach: None
lr: 0.01
maximize: False
momentum: 0.9
nesterov: False
weight_decay: 0
)
```

TODO make students recall original backprop example, backprope and step and zero_grad with this new optim.

Look! Optimizing an entire network works in the *exact same way* as optimizing two lone variables.
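As a hedged sketch of what that looks like, here is one full update on a network. To keep the block runnable on its own, it defines `TinyNetwork`, a stand-in with the same three layers as the network above; the names `net`, `batch`, `before` and `changed` are made up for the illustration.

```python
import torch
import torch.nn as nn

class TinyNetwork(nn.Module):
    """Stand-in for the three-layer network built earlier."""
    def __init__(self):
        super().__init__()
        self.m1 = nn.Linear(20, 30)
        self.m2 = nn.Linear(30, 30)
        self.m3 = nn.Linear(30, 40)

    def forward(self, x):
        return self.m3(self.m2(self.m1(x)))

net = TinyNetwork()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9)
batch = torch.randn(128, 20)

# Keep a copy of one weight matrix so we can see that it moved.
before = net.m1.weight.detach().clone()

output = net(batch)
(output.sum() - 12).backward()  # same toy "loss" as before
optimizer.step()                # apply the gradients
optimizer.zero_grad()           # and clear them for next time

changed = not torch.equal(before, net.m1.weight)
print(changed)
```

Same three moves as with the two lone variables: `.backward()`, `.step()`, `.zero_grad()`.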

## Putting it together

TODO

- training loop (zero first, call model, get diff/loss, .backward(), .step())
- best practices
- saving and restoring models
- GPU
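
Until this section is written, here is a rough sketch of the first bullet, the training loop, following the order listed above. Everything specific in it is an assumption made for illustration: the one-layer `model`, the random `inputs` and `targets`, the hand-written mean-squared-error loss, the learning rate, and the number of steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make this toy run repeatable

# Hypothetical stand-ins: a one-layer model and random data to fit.
model = nn.Linear(20, 40)
inputs = torch.randn(128, 20)
targets = torch.randn(128, 40)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)

losses = []
for step in range(100):
    optimizer.zero_grad()                         # zero first
    predictions = model(inputs)                   # call model
    loss = ((predictions - targets) ** 2).mean()  # get diff/loss
    loss.backward()                               # .backward()
    optimizer.step()                              # .step()
    losses.append(loss.item())

print(losses[0], losses[-1])  # the loss should have gone down
```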