(Py)Torch is a great C++/Python library to construct and train complex neural networks. It has taken over academia over the last few years and is slowly taking over industry. Let’s learn about how it works!
This document is meant to be read cover-to-cover. It makes NO SENSE unless read like that. I focus on building intuition about why PyTorch works, so we will be writing unorthodox code until the very end where we put all ideas together.
The chapters below take you through the major stages of a machine-learning journey. But before we can do anything, we need to import a couple of things:
import numpy as np
import torch
Autograd
I believe that anybody learning a new ML framework should learn how its differentiation tools work. Yes, this means that we should first understand how they work, not with a giant matrix, but with just two simple variables.
At the heart of PyTorch are its built-in gradient backpropagation facilities. To demonstrate them, let us create two such variables.
var_1 = torch.tensor(3.0, requires_grad=True)
var_2 = torch.tensor(4.0, requires_grad=True)
(var_1, var_2)
(tensor(3., requires_grad=True), tensor(4., requires_grad=True))
There is secretly a lot going on here, so let’s dive in. First, just to get a technicality out of the way: torch.tensor (lowercase, used here) is the general-purpose function for creating a tensor from data, while torch.Tensor (capital!) is the tensor class itself, which you will rarely need to call directly.
What is a tensor? A tensor is simply a very efficient matrix that can update its own values dynamically while keeping the same variable name. The commands above create two such tensors, each holding a single number (strictly speaking a zero-dimensional scalar, but you can picture it as a 1x1 matrix).
Note that, for the initial values, I used floats instead of ints! The above code will crash if you use ints: PyTorch only tracks gradients for floating-point (and complex) tensors, because we want the surface on which the values change to be smooth for things like gradient descent to work.
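If you want to see the float-versus-int point for yourself, here is a small sketch. The names `float_var` and `int_failed` are mine, made up for illustration; the only PyTorch call involved is the same torch.tensor used above.

```python
import torch

# A float initial value is accepted: gradients are defined for floating-point dtypes.
float_var = torch.tensor(3.0, requires_grad=True)
print(float_var.dtype)

# An int initial value is refused when we also ask for gradient tracking.
try:
    torch.tensor(3, requires_grad=True)
    int_failed = False
except RuntimeError as err:
    int_failed = True
    print("PyTorch refused:", err)
```

Dropping requires_grad=True makes the int version work again, which hints that the restriction is about gradients and not about tensors in general.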
Lastly, we have the argument requires_grad=True. This argument tells PyTorch to keep track of the gradient of the tensor. For now, understand this as “permit PyTorch to change this variable if needed.” More on that in a sec.
Naturally, if we have two tensors, we would love to multiply them!
var_mult = var_1*var_2
var_mult
tensor(12., grad_fn=<MulBackward0>)
Wouldyalookatthat! Another tensor, with the value \(12\).
Now, onto the main event: back-propagation! The core idea of a neural network is actually quite simple: figure out how much each input parameter (for us, var_1 and var_2) influences the output, then adjust the inputs accordingly to push the output toward the value we want (here, \(0\)).
To see what I mean, recall our output tensor:
var_mult
tensor(12., grad_fn=<MulBackward0>)
How much does changing var_1 and var_2, its inputs, influence this output tensor? This is not immediately obvious, so let’s write out what we are doing:
\begin{equation} v_1 \cdot v_2 = v_{m} \implies 3 \cdot 4 = 12 \end{equation}
with \(v_1\) being var_1, \(v_2\) being var_2, and \(v_{m}\) being var_mult.
As you vary var_1, by what factor does the output change? For instance, if var_1 (the \(3\)) suddenly became a \(2\), how much less will var_mult be? Well, \(2\cdot 4=8\), so the output is exactly \(4\) less than before. Hence, var_1 influences the value of var_mult by a factor of \(4\): every time you add or subtract \(1\) from the value of var_1, var_mult goes up or down by \(4\).
Similarly, as you vary var_2, by what factor does the output change? For instance, if var_2 (the \(4\)) suddenly became a \(5\), how much more will var_mult be? Well, \(3\cdot 5=15\), so the output is exactly \(3\) more than before. Hence, var_2 influences the value of var_mult by a factor of \(3\): every time you add or subtract \(1\) from the value of var_2, var_mult goes up or down by \(3\).
For those of you who have exposure to multivariable calculus: this is indeed the same concept as the partial derivatives of var_mult w.r.t. var_1 and var_2, for the previous two paragraphs respectively.
These relative-change-units (\(4\) and \(3\)) are called gradients: the factor by which changing any given variable changes the output.
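The nudge-and-compare reasoning above can be sketched in plain Python, with no PyTorch at all. This is only an illustration: `product`, `nudge_gradient`, and the step size `h` are names and values I picked, not anything PyTorch provides.

```python
def product(v1, v2):
    # the same computation as var_mult = var_1 * var_2
    return v1 * v2


def nudge_gradient(f, v1, v2, h=1e-6):
    # Nudge one input at a time by a tiny amount h and measure how much
    # the output moves per unit of nudge.
    base = f(v1, v2)
    d_v1 = (f(v1 + h, v2) - base) / h
    d_v2 = (f(v1, v2 + h) - base) / h
    return d_v1, d_v2


grad_v1, grad_v2 = nudge_gradient(product, 3.0, 4.0)
print(round(grad_v1, 3), round(grad_v2, 3))
```

The two numbers come out at roughly \(4\) and \(3\), matching what we reasoned out by hand.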
Now, gradient calculation is awfully manual! Surely we don’t want to keep track of these tiny rates-of-change ourselves! This is where PyTorch autograd comes in. Autograd is the automated tool that helps you figure out these relative changes! It is built in to all PyTorch tensors.
In the previous paragraphs, we figured out the relative influences of var_1 and var_2 on var_mult. Now let’s ask a computer to give us the same result, in much less time.
First, we will ask PyTorch to calculate gradients for all variables that contributed to var_mult.
var_mult.backward()
The backward function is a magical function that finds and calculates these relative-change-values of var_mult with respect to every variable (with requires_grad=True) that contributed to its value. To view the actual relative values, we now read .grad on the input variables themselves:
var_1.grad
tensor(4.)
Recall! We used our big brains to deduce above that changing var_1 by \(1\) unit will change var_mult by \(4\) units. So this works!
The other variable works as expected:
var_2.grad
tensor(3.)
Yayyy! Still what we expected.
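To check that this is not a fluke of plain multiplication, here is a sketch with a slightly busier expression. The names `a`, `b`, and `out`, and the expression itself, are made up for this example; the hand-derived values are in the comments.

```python
import torch

a = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(4.0, requires_grad=True)

# out = a*b + a^2 = 12 + 9 = 21
out = a * b + a ** 2
out.backward()

# By hand: d(out)/da = b + 2a = 4 + 6 = 10, and d(out)/db = a = 3
print(out.item(), a.grad.item(), b.grad.item())
```

Autograd follows every operation that touched `a` and `b`, so the bookkeeping we did by hand for one multiplication carries over to longer expressions.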
Gradient Descent
Relative changes are cool, but they aren’t all that useful unless we are actually doing some changing. We want to use our epic knowledge about the relative influences of var_1 and var_2 to manipulate those variables such that var_mult is the value we want.
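Here is one way this manipulation could look if done by hand, before any optimizer enters the picture. Treat it as a sketch under assumptions: the `target` of \(6\), the `learning_rate`, the squared-difference loss, and the step count are all hypothetical choices of mine.

```python
import torch

v1 = torch.tensor(3.0, requires_grad=True)
v2 = torch.tensor(4.0, requires_grad=True)
target = 6.0          # hypothetical value we want v1 * v2 to reach
learning_rate = 0.01  # hypothetical step size

losses = []
for _ in range(200):
    # how far off are we? (squared so that the loss is never negative)
    loss = (v1 * v2 - target) ** 2
    loss.backward()
    # move each variable a small step against its gradient,
    # without recording the update itself for autograd
    with torch.no_grad():
        v1 -= learning_rate * v1.grad
        v2 -= learning_rate * v2.grad
    # clear the stored gradients before the next round
    v1.grad.zero_()
    v2.grad.zero_()
    losses.append(loss.item())

print(losses[0], losses[-1], (v1 * v2).item())
```

The loss starts at \(36\) and shrinks as the product drifts toward the target; an optimizer automates exactly this update-and-clear routine.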
THE REST OF THIS DOCUMENT IS IN CONSTRUCTION
import torch.optim as optim
To start an optimizer, you give it all the variables it should keep track of and update.
optim = torch.optim.SGD([var_1, var_2], lr=1e-2, momentum=0.9)
And then, to update the variables using their gradients, you just have to:
optim.step()
# IMPORTANT
optim.zero_grad()
What’s that zero_grad? It clears the gradients stored on the variables (after .step() has applied them). Gradients accumulate across calls to .backward(), so without clearing them, the current gradients would leak into the next update.
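A quick sketch of why zero_grad matters. The names `v1`, `v2`, and `opt` are mine, and I left out momentum so that the arithmetic stays easy to follow by hand.

```python
import torch

v1 = torch.tensor(3.0, requires_grad=True)
v2 = torch.tensor(4.0, requires_grad=True)
opt = torch.optim.SGD([v1, v2], lr=1e-2)  # no momentum, to keep the numbers simple

(v1 * v2).backward()
first = v1.grad.item()         # 4.0, as before

# A second backward pass without clearing ADDS to the stored gradient.
(v1 * v2).backward()
accumulated = v1.grad.item()   # 4.0 + 4.0 = 8.0

# Clear, recompute, and only then apply the update.
opt.zero_grad()
(v1 * v2).backward()
fresh = v1.grad.item()         # back to 4.0
opt.step()                     # v1 becomes 3 - 0.01 * 4 = 2.96

print(first, accumulated, fresh, v1.item())
```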
Your First Neural Network
import torch.nn as nn
Layers
m = nn.Linear(20, 30)
input = torch.randn(128, 20)
output = m(input)
output, output.size()
Explain what the \(20, 30\) means.
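As a starting point for that explanation, the shapes can be inspected directly. This is a sketch: `layer`, `batch`, and `result` are illustrative names, and the two numbers are nn.Linear's in_features and out_features.

```python
import torch
import torch.nn as nn

layer = nn.Linear(20, 30)      # in_features=20, out_features=30
batch = torch.randn(128, 20)   # 128 samples, each with 20 numbers
result = layer(batch)

print(layer.weight.shape)  # one row of 20 weights per output: (30, 20)
print(layer.bias.shape)    # one bias per output: (30,)
print(result.shape)        # 128 samples, each now with 30 numbers: (128, 30)
```

So the layer takes 20 numbers per sample and produces 30, and the batch dimension (128) passes through untouched.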
Ok one layer is just lame. What if you want a bunch of layers?
m1 = nn.Linear(20, 30)
m2 = nn.Linear(30, 30)
m3 = nn.Linear(30, 40)
input = torch.randn(128, 20)
# function call syntax! Nested calls run from right to left (innermost first)!
output = m3(m2(m1(input)))
output, output.size()
And guess what? If you want to adjust the values here, you would just do:
m1 = nn.Linear(20, 30)
m2 = nn.Linear(30, 30)
m3 = nn.Linear(30, 40)
input = torch.randn(128, 20)
# function call syntax! Nested calls run from right to left (innermost first)!
output = m3(m2(m1(input)))
(output.sum() - 12).backward()
None
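The backward call prints nothing, but it does leave something behind. A sketch of how one might check, reusing the same three layers (the `layers` dictionary and the loop are my own additions):

```python
import torch
import torch.nn as nn

m1 = nn.Linear(20, 30)
m2 = nn.Linear(30, 30)
m3 = nn.Linear(30, 40)

output = m3(m2(m1(torch.randn(128, 20))))
(output.sum() - 12).backward()

# Every weight and bias now carries a gradient of the same shape as itself.
layers = {"m1": m1, "m2": m2, "m3": m3}
for name, layer in layers.items():
    print(name, tuple(layer.weight.grad.shape), tuple(layer.bias.grad.shape))
```

This is the same .grad we read off var_1 and var_2 earlier, just with one entry per number in each layer.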
But wait! What are the options you give to your optimizer?
optim = torch.optim.SGD([m1.weight, m1.bias ... ... ], lr=1e-2, momentum=0.9)
That’s a lot of variables!! Each linear layer has an \(m\) and a \(b\) (of \(y=mx+b\) fame), and you will end up with a bajillion of those! Also, that function call syntax, chaining one layer after another, is so gnarly! Can we do better? Yes.
An Honest-to-Goodness Neural Network
PyTorch provides the module framework to make model creators’ lives easier. This is the best practice for creating a neural network.
Let’s replicate the example above with the new module framework:
class MyNetwork(nn.Module):
    def __init__(self):
        # important: runs nn.Module's own setup first to make sure that
        # the module is correct
        super().__init__()
        # we declare our layers. We don't use them yet.
        self.m1 = nn.Linear(20, 30)
        self.m2 = nn.Linear(30, 30)
        self.m3 = nn.Linear(30, 40)

    # this is a special function that is called when
    # the module is called
    def forward(self, x):
        # we want to pass our input through every layer
        # like we did before, but now more declaratively
        x = self.m1(x)
        x = self.m2(x)
        x = self.m3(x)
        return x
Explain all of this.
But now, we have essentially built our entire network into its own “layer” (actually we literally did: all layers are just nn.Module subclasses) that does the job of all the other layers acting together. To use it, we just:
my_network = MyNetwork()
input = torch.randn(128, 20)
# function call syntax! One call now runs all three layers.
output = my_network(input)
output
tensor([[-0.1694, 0.0095, 0.4306, ..., 0.1580, 0.2644, 0.1509],
[-0.2346, -0.0269, -0.1191, ..., 0.0229, -0.0819, -0.1452],
[-0.4871, -0.2868, -0.2488, ..., 0.0637, 0.1832, 0.0619],
...,
[-0.1323, 0.2531, -0.1086, ..., 0.0975, 0.0426, -0.2092],
[-0.4765, 0.1441, -0.0520, ..., 0.2364, 0.0253, -0.1914],
[-0.5044, -0.3263, 0.3102, ..., 0.1938, 0.1427, -0.0587]],
grad_fn=<AddmmBackward0>)
But wait! What are the options you give to your optimizer? Surely you don’t have to pass my_network.m1.weight, my_network.m1.bias, etc. etc. to the optimizer, right?
You don’t. One of the things that super().__init__() did was set up the bookkeeping that lets your network class keep track of everything there is to optimize: every layer you assign to self gets registered, and the inherited parameters() method hands them all over. So now, to ask the optimizer to update the entire network, you just have to write:
optim = torch.optim.SGD(my_network.parameters(), lr=1e-2, momentum=0.9)
optim
SGD (
Parameter Group 0
dampening: 0
differentiable: False
foreach: None
lr: 0.01
maximize: False
momentum: 0.9
nesterov: False
weight_decay: 0
)
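To peek at what the network is tracking, one could list its parameters by name. This sketch restates the class so that it runs on its own (with forward condensed into one line); `names` and `total` are illustrative names.

```python
import torch.nn as nn


class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.m1 = nn.Linear(20, 30)
        self.m2 = nn.Linear(30, 30)
        self.m3 = nn.Linear(30, 40)

    def forward(self, x):
        return self.m3(self.m2(self.m1(x)))


my_network = MyNetwork()

# named_parameters() walks every registered layer and yields (name, tensor) pairs
names = [name for name, _ in my_network.named_parameters()]
# numel() counts the individual numbers inside each parameter tensor
total = sum(p.numel() for p in my_network.parameters())

print(names)
print(total)  # (20*30 + 30) + (30*30 + 30) + (30*40 + 40) = 2800
```

Six tensors and 2800 individual numbers, none of which we had to list by hand.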
TODO make students recall original backprop example, backprop and step and zero_grad with this new optim.
Look! Optimizing an entire network works in the exact same way as optimizing two lone variables.
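A sketch of what that could look like, following the same zero_grad, backward, step routine as with the two lone variables. The all-zero `targets`, the mean-squared-error loss, the seed, and the step count are hypothetical choices of mine, picked only so that there is something to minimize.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so that repeated runs give the same numbers


class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.m1 = nn.Linear(20, 30)
        self.m2 = nn.Linear(30, 30)
        self.m3 = nn.Linear(30, 40)

    def forward(self, x):
        return self.m3(self.m2(self.m1(x)))


my_network = MyNetwork()
opt = torch.optim.SGD(my_network.parameters(), lr=1e-2, momentum=0.9)

inputs = torch.randn(128, 20)
targets = torch.zeros(128, 40)  # hypothetical: we want every output to be 0

losses = []
for _ in range(50):
    opt.zero_grad()                          # clear the old gradients
    output = my_network(inputs)              # run the whole network
    loss = ((output - targets) ** 2).mean()  # how far off are we?
    loss.backward()                          # gradients for every parameter
    opt.step()                               # nudge every parameter
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The last loss should come out smaller than the first, with the optimizer quietly updating all 2800 numbers on each step.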
Putting it together
TODO
- training loop (zero first, call model, get diff/loss, .backward(), .step())
- best practices
- saving and restoring models
- GPU
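Until those sections are written, here is a rough sketch of the saving/restoring and GPU items. It rests on several assumptions: the in-memory `buffer` stands in for a real file, the names `original`, `restored`, and `same` are mine, and the GPU is used only if PyTorch reports one available.

```python
import io

import torch
import torch.nn as nn


class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.m1 = nn.Linear(20, 30)
        self.m2 = nn.Linear(30, 30)
        self.m3 = nn.Linear(30, 40)

    def forward(self, x):
        return self.m3(self.m2(self.m1(x)))


# Use the GPU when there is one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
original = MyNetwork().to(device)

# Save only the parameters (the state_dict), here into an in-memory buffer
# that stands in for a file on disk.
buffer = io.BytesIO()
torch.save(original.state_dict(), buffer)
buffer.seek(0)

# Restore into a freshly built network of the same architecture.
restored = MyNetwork()
restored.load_state_dict(torch.load(buffer, map_location="cpu"))
restored.to(device)

same = all(
    torch.equal(a.cpu(), b.cpu())
    for a, b in zip(original.state_dict().values(), restored.state_dict().values())
)
print(device, same)
```

Saving the state_dict instead of the whole object means you rebuild the network from its class first and then pour the saved numbers back in.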