ML / PyTorch Crash Course - Stanford University
Alex Tamkin | @alextamkin
Prelude
○ Anything you especially want to focus on?
○ Don't expect to understand all of this perfectly from today!
○ Drinking from a firehose
○ Slides will be uploaded
ML Crash Course
Neural Network Classifiers
[Diagram: input image → Neural Network → Defective 0.1% / OK 99.9%]
Inside a neural net
[Diagram: input vector [.2, .4, .5, ...] → lots of matrices → Defective 0.1% / OK 99.9%]
Neural nets learn transformations from inputs to outputs
Training a neural net
○ You don't program the neural net, the data programs the neural net
○ It learns through examples
[Diagram: example images labeled OK, Defective, OK]
Inside a neural net
○ Simplest neural net: y = softmax(Ax)
○ x = input image (size 16x16?)
○ y = label (size 2): [0.0, 1.0] or [1.0, 0.0]
○ A: a matrix that maps the (flattened) 16x16 image to 2 numbers
○ Softmax: makes sure the numbers add up to 1, so it's a probability distribution!
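A minimal sketch of this in code (the image and matrix here are random made-up values, just to see the shapes):

```python
import torch

x = torch.rand(256)              # a 16x16 image, flattened to 256 numbers
A = torch.randn(2, 256)          # matrix mapping 256 inputs to 2 outputs
y = torch.softmax(A @ x, dim=0)  # probabilities for [Defective, OK]
print(y, y.sum())                # two numbers that add up to 1
```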
Optimization (example: faucet)
○ Loss / objective
○ Faucet: how happy you are with the temp/pressure
○ NN: how much your network's predictions line up with the labels
○ Rewarded based on the probability you assigned to the correct answer
Optimization (example: faucet)
○ Parameters: determine behavior of your model
○ Start close to 0
○ Faucet: positions of the handles (sad, no pressure)
○ NN: entries of the matrices
Optimization (example: faucet)
○ Optimization algorithm
○ SGD: stochastic gradient descent
○ Compute loss on a "batch" of data (several examples)
○ Gradient: what direction to push each parameter to decrease the loss?
Optimization (example: faucet)
○ Learning rate: how much to push parameters in that direction
○ Too big: can overshoot! Think of when the handles are sticky and the water suddenly gets too hot
○ Too small: takes forever
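A toy sketch of a single SGD step on one parameter (the loss function and numbers are made up):

```python
import torch

w = torch.tensor(0.0, requires_grad=True)  # a parameter, starting near 0
lr = 0.1                                   # learning rate

loss = (w - 3.0) ** 2    # a toy loss: happiest when w is 3
loss.backward()          # gradient: which way to push w to decrease the loss
with torch.no_grad():
    w -= lr * w.grad     # push w a little in that direction
print(w)                 # w has moved from 0.0 toward 3.0 (now 0.6)
```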
Optimization
○ Activation functions: needed for nonlinear relationships (can't capture the input-output relationship with just matrices)
○ "ReLU" is just max(0, x)
[Diagram: input vector [.2, .4, .5, ...] → Matrix + ReLU → Matrix + ReLU → Matrix + ReLU → Loss]
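A one-line check of ReLU (input values made up):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(F.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 1.5000]): negatives clipped to 0
```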
Optimization
○ Computational graph: series of steps traversed from input to output
[Same diagram: input vector → Matrix + ReLU → Matrix + ReLU → Matrix + ReLU → Loss]
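PyTorch (coming up below) builds this graph for you as you compute; a tiny illustration with made-up numbers:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = (3 * x + 1) ** 2  # each operation adds a node to the computational graph
y.backward()          # traverse the graph backwards to compute gradients
print(x.grad)         # dy/dx = 2 * (3x + 1) * 3 = 42 at x = 2
```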
Can be pretty wacky (InceptionNet)
Loss curve
Loss curve
A lot better!
Distributed representations
○ Vectors between the layers
○ Especially towards the end of the NN
○ Group similar things near each other
○ Some insight into what models are doing!
[Same diagram: input vector → Matrix + ReLU → Matrix + ReLU → Matrix + ReLU → Loss]
PyTorch Crash Course
The magic of PyTorch
○ Would be a huge pain to write all the matrices ourselves
○ And an even bigger pain to compute the gradients
○ PyTorch lets us:
○ Describe the steps from input to output
○ Define the loss, optimizer, learning rate
○ Input the data
○ Then it updates the parameters accordingly! :)
Defining the model
○ nn.Module: lets PyTorch keep track of params
○ __init__: define the parameters in initialization
○ forward: the "forward pass", i.e. how the net goes from input to output
○ Linear: a linear layer; "fc" stands for fully connected
○ Holds a matrix A and a vector b; input x, output Ax + b
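A minimal sketch of this structure (layer sizes here are made up for illustration, not the ones from the lecture's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # parameters are defined in __init__ so nn.Module can track them
        self.fc1 = nn.Linear(256, 64)  # "fully connected": computes Ax + b
        self.fc2 = nn.Linear(64, 2)

    def forward(self, x):
        # the forward pass: how the net goes from input to output
        x = F.relu(self.fc1(x))
        return self.fc2(x)

model = TinyNet()
print(model(torch.rand(8, 256)).shape)  # batch of 8 inputs -> torch.Size([8, 2])
```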
Defining the model
○ Conv2d: 2D convolutional layers
○ Special layers for images
○ Lets us tile a tiny matrix across the image, instead of one big matrix
○ Works better and takes less memory
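A quick sketch of a conv layer (the channel counts and image size are made up):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)  # 8 tiny 3x3 filters
x = torch.rand(4, 1, 28, 28)  # batch of 4 single-channel 28x28 images
print(conv(x).shape)          # torch.Size([4, 8, 26, 26])
```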
Defining the model
○ max_pool2d: makes the output of a layer smaller by taking the max of adjacent entries
○ Helps get from a large image to a binary decision
○ Dropout: helps prevent memorization
○ Randomly "zeros out" some entries of the layer's output each forward pass
○ Slightly magic
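A small illustration of both (values made up):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]]).reshape(1, 1, 2, 2)
print(F.max_pool2d(x, kernel_size=2))  # tensor([[[[4.]]]]): keeps only the max

drop = torch.nn.Dropout(p=0.5)
drop.train()                # dropout is only active in training mode
print(drop(torch.ones(8)))  # about half the entries zeroed, the rest scaled up to 2
```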
Defining the model
○ log_softmax: softmax takes exp() of every number in the vector
○ Then normalizes them to sum to one
○ This gets us a probability distribution
○ We return the log probabilities for numerical stability
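A quick check (the logits are made up):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 2.0, 3.0])
probs = F.softmax(logits, dim=0)     # exp() each entry, then normalize
print(probs, probs.sum())            # a probability distribution summing to 1
print(F.log_softmax(logits, dim=0))  # same as probs.log(), but numerically stabler
```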
Defining the data
○ datasets.MNIST: MNIST is a handwritten digit classification dataset
○ Helpful for post offices!
○ train=True/False defines the train/test split
○ We want to test our model on things it hasn't been trained on
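A sketch of the standard torchvision call (the "data" directory name is arbitrary):

```python
from torchvision import datasets

train_set = datasets.MNIST("data", train=True, download=True)
test_set = datasets.MNIST("data", train=False, download=True)
print(len(train_set), len(test_set))  # 60000 10000
```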
Defining the data
○ Transforms: in this case, just tensorizes + normalizes
○ Can apply data augmentations to images
○ Makes the dataset "bigger"
○ Harder to just memorize
○ E.g. random flipping, cropping
○ Don't want to do this for digits, though: a flipped digit isn't the same digit!
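A sketch of a transform pipeline (the Normalize constants are MNIST's commonly used mean/std; the commented-out flip is the kind of augmentation you'd skip for digits):

```python
from torchvision import transforms

transform = transforms.Compose([
    # transforms.RandomHorizontalFlip(),         # augmentation -- a bad idea for digits!
    transforms.ToTensor(),                       # tensorize: PIL image -> tensor in [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)),  # normalize with MNIST's mean/std
])
```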
Defining the data
○ DataLoader: data processing can take a while, and you don't want your GPU to be waiting
○ Applies transformations in parallel
○ Returns batches
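A sketch (batch size and worker count are arbitrary choices; uses the transform defined above):

```python
from torch.utils.data import DataLoader
from torchvision import datasets

train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=2)

data, target = next(iter(train_loader))
print(data.shape, target.shape)  # torch.Size([64, 1, 28, 28]) torch.Size([64])
```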
Setting up training
○ .to(device): sends it to the GPU, if you have one
○ Optimizer: a smarter version of SGD
○ Tunes the learning rate for each parameter
○ Training loop: updates parameters on the full dataset, then evaluates it
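A sketch of that setup (Adam stands in here for the "smarter SGD"; the model is a hypothetical small MNIST net, not the lecture's exact one):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)  # 10 classes: the digits 0-9

    def forward(self, x):
        x = x.view(x.size(0), -1)                  # flatten [B, 1, 28, 28] -> [B, 784]
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)   # log probabilities

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MNISTNet().to(device)  # send the parameters to the GPU, if you have one
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```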
Training + testing
○ for batch_idx, (data, target) in enumerate(train_loader): fetches a batch
○ data is a tensor of size [batch_size, num_channels, height, width]
○ target is the label (which number)
Training loop
○ .train(): enables layers only used during training (e.g. dropout)
○ optimizer.zero_grad(): discards the gradients computed on the last batch, for the old parameters
○ output = model(data): runs the model on a batch of data!
Training loop
○ F.nll_loss(output, target): negative log likelihood, i.e. -log(p_correct_answer)
○ This is lower the higher the probability you assigned to the correct answer!
○ loss.backward(): compute gradients
○ optimizer.step(): update params with the gradients!
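Putting those pieces together, a sketch of one training epoch (assumes the model, optimizer, device, and train_loader from the sketches above):

```python
import torch.nn.functional as F

model.train()                          # enable training-only layers like dropout
for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()              # discard the last batch's gradients
    output = model(data)               # forward pass: log probabilities
    loss = F.nll_loss(output, target)  # -log(p_correct_answer), averaged over the batch
    loss.backward()                    # compute gradients
    optimizer.step()                   # update the parameters
```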
Evaluation loop
○ model.eval(): disables stuff like dropout
○ torch.no_grad(): don't keep track of the computational graph (we're not computing gradients)
○ Computes accuracy based on the class with the highest predicted probability
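And a sketch of that evaluation loop (test_set/test_loader mirror the training ones above):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets

test_set = datasets.MNIST("data", train=False, download=True, transform=transform)
test_loader = DataLoader(test_set, batch_size=1000)

model.eval()                              # disable dropout etc.
correct = 0
with torch.no_grad():                     # no graph needed: we won't compute gradients
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        pred = model(data).argmax(dim=1)  # class with the highest predicted probability
        correct += (pred == target).sum().item()
print(f"accuracy: {correct / len(test_set):.4f}")
```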
PyTorch Lightning
○ Organization: PyTorch is super useful, but can be kinda messy / disorganized
○ PL provides a nice way to structure your code
○ Functionality: in PyTorch, you have to write both research code (modeling) and engineering code (loading the model onto the GPU, remembering best practices about data loading)
○ PL automates a lot of this (you can just set gpus=8 and it will do it for you)
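A sketch of the shape PL code takes (a hypothetical module mirroring the model above; exact Trainer flags like gpus= vary across PL versions):

```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F

class LitMNIST(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(784, 128)
        self.fc2 = torch.nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        return F.log_softmax(self.fc2(F.relu(self.fc1(x))), dim=1)

    def training_step(self, batch, batch_idx):
        data, target = batch                   # PL moves the batch to the device for you
        return F.nll_loss(self(data), target)  # PL handles zero_grad/backward/step

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

trainer = pl.Trainer(max_epochs=1)  # engineering knobs (devices, precision, ...) go here
trainer.fit(LitMNIST(), train_loader)
```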
Weights and Biases
Keep track of experiments easily
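The basic wandb pattern looks something like this (the project name and logged values are made up):

```python
import wandb

wandb.init(project="mnist-crash-course", config={"lr": 1e-3, "batch_size": 64})
for step in range(100):
    loss = 1.0 / (step + 1)          # stand-in for your real training loss
    wandb.log({"train_loss": loss})  # appears as a live chart in the web UI
wandb.finish()
```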
Hydra
○ You'll run a lot of experiments with different configurations
○ Hydra is a tool to help you manage these
○ Without changing them manually in your code each time!
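A minimal sketch of the Hydra pattern (config fields are made up; decorator arguments vary a bit across Hydra versions):

```python
# config.yaml (next to this script):
#   lr: 0.001
#   batch_size: 64
import hydra
from omegaconf import DictConfig

@hydra.main(config_path=".", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # override from the command line without touching the code:
    #   python train.py lr=0.01 batch_size=128
    print(cfg.lr, cfg.batch_size)

if __name__ == "__main__":
    main()
```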
Google Cloud
○ Colab is nice, easy, and free
○ But it can be a pain to use (you keep getting disconnected)
○ If you want to train longer, you can use Google Cloud
○ You start with $300 free; then we can supplement with an extra $50
○ More of a pain to set up, but you get a dedicated GPU
○ Really helpful guide: https://github.com/cs231n/gcloud
○ Crucial: stop your instances when you're not using them
○ Otherwise the charges will keep rolling and you'll be sad