Machine learning pt.1: Artificial Neural Networks ® All Rights Reserved
Machine Learning pt1: Classification, Regression, and Artificial Neural Networks
Self-Driving-Cars Los Angeles
By Jonathan Mitchell
github.com/jonathancmitchell
linkedin.com/in/[email protected]
Self Driving Cars Los Angeles
https://www.meetup.com/Los-Angeles-Self-Driving-Car-Meetup/
Welcome to Machine Learning
aka computational statistics
How did I learn this?
Sources:
● Udacity’s Self-Driving Car Nanodegree program (udacity.com/drive)
● MIT Self-Driving Car program (selfdrivingcars.mit.edu)
● Stanford’s cs-231n (cs231n.github.io)
Topics
A) Probability basics - Basics to Logits
B) Linear Classification / Logistic regression overview
C) Perceptron
D) Perceptron (biological inspiration)
E) Neuron
F) Forward Pass
G) Computing a loss function
H) Visualizing Hidden Layers
I) Setting up training data
J) Preprocessing / Normalization
K) Overfitting / Hyperparameter intro
L) Epochs
M) Minibatch
N) Gradient Descent / Stochastic Gradient Descent
O) Backpropagation
P) Cross Entropy Loss
Probability basics
Probability p = number of outcomes of interest / number of all possible outcomes
p ranges over [0, 1], inclusive. p = 1 means 100% likelihood; p = 0 means 0% likelihood.
Odds: the probability that an event happens divided by the probability that it does not: odds = p / (1 - p)
Coin Toss: Toss a coin in the air; it can land either heads or tails.
P_heads = 0.5 => P_tails = 1 - P_heads = 0.5
P1 = Heads probability, P0 = Tails probability, each in [0, 1]
Odds_heads = P(Heads) / P(Not Heads) = 0.5 / 0.5 = 1
Odds_tails = P(Tails) / P(Not Tails) = 0.5 / 0.5 = 1
The odds ratio can be written as Odds1 : Odds2. In this case 1:1, an equal chance of getting heads or tails.
Bernoulli Probability (a specific case of the binomial distribution)
Bernoulli trial: a yes-or-no question with two possible outcomes, success and failure.
p = probability of success (of one trial); p is unknown
q = probability of failure (of one trial) = 1 - p
N = number of trials = 1
K = number of successes; K is 0 or 1 for Bernoulli
A Bernoulli distribution is a special case of a binomial distribution with N = 1 trial.
Binomial Probability
P(K successes in N trials) = C(N, K) * p^K * q^(N - K)
Probability Basics -> Logistic Regression
Goal (logistic regression): estimate an unknown probability p for any given linear combination of the independent variables.
Link the independent variables to the Ber(p) distribution.
Call the estimate of p "p-hat".
We need a function that maps a linear combination of variables, which can take any real value, onto the Bernoulli probability, which lies between 0 and 1.
Use the logit: the natural log of the odds, p / (1 - p).
Logistic Regression
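Logit: natural log of the odds, logit(p) = ln(p / (1 - p))
Undefined at p = 0 and p = 1
[Graph: logit(p) against p. p has a good domain of (0, 1); the logit is the linear combination and can take any real value]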
graph from http://www.graphpad.com/support/faqid/1465/
Logistic Regression
Logit: logit(p) = ln(p / (1 - p)) = α
α will be the linear combination of the independent variables and their coefficients: α = B0 + B·x
Recall: logit(p) = ln(p) - ln(1 - p)
The inverse logit gives us the probability that the dependent variable is a “1”: logit^-1(α) = 1 / (1 + e^(-α)) = e^α / (1 + e^α)
Probability given x, with the linear combination mapping (B and B0): p(x) = 1 / (1 + e^(-(B0 + B·x)))
Binary Output variable Y. We want to model the conditional probability Pr(Y = 1 | X = x) as a function of x; any unknown parameters are to be estimated by max likelihood
graph from http://www.graphpad.com/support/faqid/1465/
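The logit and its inverse can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the talk: the function names `logit`, `inverse_logit`, and `predict_probability` are made up here, and `coefficients` / `intercept` stand in for B and B0.

```python
import math


def logit(p):
    """Natural log of the odds p / (1 - p); only defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is undefined at p <= 0 and p >= 1")
    return math.log(p / (1.0 - p))


def inverse_logit(alpha):
    """Map any real number alpha back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-alpha))


def predict_probability(x, coefficients, intercept):
    """Pr(Y = 1 | X = x) for the linear combination intercept + coefficients . x."""
    alpha = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return inverse_logit(alpha)


# Even odds (p = 0.5) correspond to a logit of 0.
even_odds_logit = logit(0.5)
# Applying the inverse logit to a logit recovers the original probability.
round_trip = inverse_logit(logit(0.8))
```

Because the inverse logit accepts any real number, the linear combination never has to be constrained to stay between 0 and 1.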
Logistic Regression -> Linear Classification
To classify we seek a binary output variable Y = 1 or 0.
Recall Pr(Y = 1 | X = x). We modeled this as p(x;b,w)
Predict Y = 1 when p >= 0.5. Y = 1 = Class A
Predict Y = 0 when p < 0.5. Y = 0 = Class B
Guess 1 when B0 + B·x is non-negative
Guess 0 when B0 + B·x is negative
This is a linear classifier.
We can also infer that the probabilities depend on the distance from the boundary.
This is known as a Binary Logistic Classifier (Binary = 2 options, Class A or Class B)
The decision boundary separates the two predicted classes and is the solution to the equation B0 + B·x = 0
Graph from http://pubs.rsc.org/services/images/RSCpubs.ePlatform.Service.FreeContent.ImageService.svc/ImageService/Articleimage/2010/AN/b918972f/b918972f-f7.gif
Neuron: Building block of a neural network
src: MIT-Self-Driving-Cars, Fridman, Lex
A neuron is a computational building block of the brain.
Human brain: 1000T synapses
10x that of an Artificial Neuron
An artificial neuron is a computational building block of an artificial neural network.
~1-10B synapses
An artificial neuron:
* Takes a set of inputs
* Places a weight on each input
* Sums them together
* Applies a bias value on each neuron
* Uses an activation function that takes in the sum plus bias and squeezes the value into the range (0, 1), so it can be read as a probability
Takes a few inputs and produces an output
Classification output: 1 or 0
This can serve as a linear classifier
Perceptron Algorithm
[Figure: perceptron with inputs X1, X2, X3 and one output]
1. Initialize the perceptron with random weights.
2. Compute the perceptron's output
3. If the output does not match the known output:
a. if the output should have been 0 but was 1, decrease the weights that had an input of 1
b. if the output should have been 1 but was 0, increase the weights that had an input of 1
4. Move on to the next example in the training set until the perceptron makes no more mistakes
src: MIT-Self-Driving-Cars, Fridman, Lex
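The four steps above can be sketched in plain Python. This is a rough illustration under stated assumptions rather than the talk's code: the names `perceptron_output` and `train_perceptron` are made up, the step size of 1.0 and the pass limit are arbitrary, and the bias is updated alongside the weights (the slide only mentions weights). The toy training set is a 3-input OR, chosen because it is linearly separable.

```python
import random


def perceptron_output(weights, bias, inputs):
    """Weighted sum plus bias, thresholded to 1 or 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else 0


def train_perceptron(examples, n_inputs, step_size=1.0, max_passes=1000, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial weights (and bias, an assumption of this sketch).
    weights = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
    bias = rng.uniform(-1.0, 1.0)
    for _ in range(max_passes):
        mistakes = 0
        for inputs, target in examples:
            # Step 2: compute the output.
            output = perceptron_output(weights, bias, inputs)
            if output == target:
                continue
            # Step 3: decrease (3a) or increase (3b) the weights whose input was 1.
            mistakes += 1
            direction = 1.0 if target == 1 else -1.0
            for i, x in enumerate(inputs):
                if x == 1:
                    weights[i] += direction * step_size
            bias += direction * step_size
        # Step 4: stop once a full pass makes no mistakes.
        if mistakes == 0:
            break
    return weights, bias


# Toy data: output is 1 if any of X1, X2, X3 is 1.
or_examples = [((a, b, c), int(a or b or c))
               for a in (0, 1) for b in (0, 1) for c in (0, 1)]
trained_weights, trained_bias = train_perceptron(or_examples, n_inputs=3)
```

On data that is not linearly separable the loop would simply run until `max_passes`, which is one reason hidden layers come up later in the deck.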
If output does not match expected output = Punish!
Your output neurons didn’t match the expected output.
[Figure: training images fed into a perceptron with inputs X1, X2, X3. Expected output: Cat, but we got Burrito]
Why Neural Networks are great.
[Figure: perceptron with inputs X1, X2, X3 and one output]
We can use the hidden layer to approximate any function.
Universality: we can closely approximate any continuous function f(x) with a single hidden layer, given enough hidden neurons.
Driving: Input (sensor data from the world)
Output: Drive (use steering data etc.)
src: MIT-Self-Driving-Cars, Fridman, Lex
Dual class Linear Classification with Binary Logistic Regression
Goal: To predict class A or class B from input data. Two possible outputs!
Input Data x -> Linear Combination -> Logistic Regression -> Scores in the range (0, 1) -> Predictor
Logistic regression squeezes values between 0 and 1 (from P = 0 to P = 1).
Predictor: Class A is Y >= 0.5, Class B is Y < 0.5
Notation changeup: logit^-1 -> sigmoid
Input Data x -> Linear Combination -> Sigmoid -> Scores in the range (0, 1) -> Predictor
The output of the linear combination is the unnormalized log probabilities. The sigmoid squeezes those values between 0 and 1 and puts them into a probability distribution.
Predictor: Class A is Y >= 0.5, Class B is Y < 0.5
Generalizing Logistic Regression to multiple classes
If we have two classes we can have two possible outputs: 1 or 0
Binary: two outputs, Y is either 1 or 0
What if we have 10 classes?
Suppose we have J classes.
Let’s switch up some notation: s = f(x; W, b)
Now set each score s_k to the result of that function for class k.
Probability that output Y = class k, where we have J possible classes: P(Y = k | X = x) = exp(s_k) / (sum over j = 1..J of exp(s_j))
Perform softmax on the scores
The Softmax Classifier is binary logistic regression generalized to multiple classes
Output = scores between 0 and 1
Unnormalized log probabilities: the output of the linear function, AKA linear scores.
Linear(x) = xW + b or Wx + b
Textbooks: Wx + b
TensorFlow: xW + b
Computing derivatives is easier for xW + b.
A few notes
f(xi, W, b) = xW + b
Assume image x has all of its pixels flattened out into a single row vector: x = [x[0] x[1] x[2] x[3] x[4]], with each pixel in the range (0, 255).
X’s size is [n x m]. n: # examples/images, m: # features (pixels per image in this case)
Matrix W of size [m x k]. m = # features, k = # classes
Bias b of size [1 x k] (a row vector, one value per class)
Consider our input data (xi, yi) as being fixed. We can set W and b to approximate any function (remember the universality principle).
We use the training data to learn W and b. Once our model has been trained, we can discard the training data and test our model on test data, or on any other data for that matter.
W and b will be tensors if you are using TensorFlow. They can be arrays if you are using NumPy.
Example
The biases allow us to have these lines NOT all cross through the origin
W causes the lines to rotate about our pixel space
b pushes the lines away from the origin
src: Andrej Karpathy
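Bias Trick (in practice)
It would be annoying to worry about the bias term separately during classification. Therefore we simply append the bias row vector to the end of our weights matrix, and append a constant 1 to the end of each input row so that the product still adds the bias.

Before: Weights (4 x 3) and Bias (1 x 3)
Weights
0.1 0.25 0.3
0.63 0.12 -0.64
0.26 0.62 0.58
0.99 -0.14 0.333
Bias
0.12 3.1 -0.5

After: Weights with the bias appended as the last row (5 x 3)
0.1 0.25 0.3
0.63 0.12 -0.64
0.26 0.62 0.58
0.99 -0.14 0.333
0.12 3.1 -0.5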
You may see this in the code as:
logits = tf.add(tf.matmul(x, weights), bias)
OR
logits = tf.matmul(x, weights)
logits = tf.nn.bias_add(logits, bias)
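To see why the bias trick works, the two forms can be compared directly. The sketch below uses plain Python lists and a hypothetical `matmul` helper instead of TensorFlow, with the weight and bias values shown on this slide and the 4-pixel example image used later in the deck.

```python
def matmul(a, b):
    """Multiply an (n x m) list-of-rows matrix by an (m x k) one."""
    return [[sum(a_row[i] * b[i][j] for i in range(len(b)))
             for j in range(len(b[0]))]
            for a_row in a]


x = [[20, 254, 40, 1]]                 # 1 image, 4 pixels (1 x 4)
weights = [[0.1, 0.25, 0.3],
           [0.63, 0.12, -0.64],
           [0.26, 0.62, 0.58],
           [0.99, -0.14, 0.333]]       # 4 x 3
bias = [0.12, 3.1, -0.5]               # 1 x 3

# Separate bias: xW + b
scores_separate = [[s + b for s, b in zip(row, bias)]
                   for row in matmul(x, weights)]

# Bias trick: append 1 to every input row and the bias row to the weights.
x_augmented = [row + [1] for row in x]          # 1 x 5
weights_augmented = weights + [bias]            # 5 x 3
scores_trick = matmul(x_augmented, weights_augmented)
```

Both `scores_separate` and `scores_trick` hold the same 1 x 3 row of linear scores; the trick only changes the bookkeeping.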
Input image X (n x m = 1 x 4), pixels stretched into a single row:
img1: 20 254 40 1
1 image, 4 pixels (pretend this image only has 4 pixels). Each pixel is a feature, so 1 image, 4 features. Pixels range (0, 255).

Weights W (m x k = 4 x 3):
0.1 0.25 0.3
0.63 0.12 -0.64
0.26 0.62 0.58
0.99 -0.14 0.333
m: # features (pixels per img) = 4
n: # images = 1
k: # classes = 3 (Cat, Car, Dog)

Bias b (1 x k = 1 x 3):
0.12 3.1 -0.5

Linear Scores = xW + b, output (1 x 3):
Cat Car Dog
3.2 5.1 -1.7
values from Andrej Karpathy (illustrative; they are not the exact product of the X, W, and b shown here)

Initialize weights with values between 0 and 1. You can initialize biases to start at 0 or very small values if you like.
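Applying softmax
Linear Scores, f(x; w, b) -> apply exponential -> normalize so sum = 1

                                                Cat    Car    Dog
Linear scores (unnormalized log probabilities)  3.2    5.1    -1.7
After exponential (unnormalized probabilities)  24.5   164    0.18
After normalizing (probabilities)               0.13   0.87   0.00

P(Y = k | X = x) = exp(s_k) / (sum over j = 1..J of exp(s_j))
k = index of a specific class (different from k on the last slide, where it was the number of classes). J = # classes
values from Andrej Karpathy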
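The softmax steps on this slide can be reproduced with a short sketch in plain Python. The function name `softmax` is ours, and subtracting the maximum score before exponentiating is a common numerical-stability habit that is not on the slide; it does not change the result.

```python
import math


def softmax(scores):
    """Exponentiate the scores, then normalize so they sum to 1."""
    shift = max(scores)  # stability tweak; cancels out in the ratio
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


linear_scores = [3.2, 5.1, -1.7]          # Cat, Car, Dog
probabilities = softmax(linear_scores)
rounded = [round(p, 2) for p in probabilities]  # [0.13, 0.87, 0.0]
```

The rounded values match the Cat / Car / Dog probabilities in the table above.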
Process so far: Forward Pass
Input image X (n x m = 1 x 4), pixels stretched to a single row -> Linear Scores = xW + b (1 x 3) -> Softmax -> Normalized Probabilities
Cat Car Dog
Linear Scores: 3.2 5.1 -1.7
Normalized Probabilities: 0.13 0.87 0.00
Each pixel can be considered a neuron
values from Andrej Karpathy
Loss Function: How we learn
Recall:
Your output neurons didn’t match the expected output.
Normalized Probabilities
Cat Car Dog
0.13 0.87 0.00
Maximize the log likelihood of the true class
OR
Minimize the negative log likelihood of the true class (a negative feedback loop is easier to run than a positive feedback loop).
values from Andrej Karpathy
Use the loss to adjust the weights of the inputs that led to the incorrect classification.
There are many different types of loss functions. More on this later.
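As a small illustration of "minimize the negative log likelihood of the true class", the sketch below scores the probabilities from this slide. The helper name `negative_log_likelihood` and the choice of the natural log are assumptions of this example; which class is the true one depends on the image's label.

```python
import math


def negative_log_likelihood(probabilities, true_class):
    """Loss for one example: -ln of the probability given to the true class."""
    return -math.log(probabilities[true_class])


probabilities = [0.13, 0.87, 0.00]   # Cat, Car, Dog

# If the image really is a cat, the network gave it only 0.13: large loss.
loss_if_cat = negative_log_likelihood(probabilities, 0)
# If the image really is a car, the network gave it 0.87: small loss.
loss_if_car = negative_log_likelihood(probabilities, 1)
```

The loss grows as the probability assigned to the true class shrinks, which is what gives the network something to correct.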
Visualizing a hidden layer
Graph: X -> Linear L1 (W1, b1) -> Linear L2 (W2, b2) -> Softmax
X: n x m (examples x features)
W: m x k (features x classes)
b: 1 x k (classes row vector)

X: 4 x 10
W1: 10 x 100
b1: 1 x 100
L1 = X W1 + b1: 4 x 100
Since b1 has a dimension with value 1, its values are broadcast across the rows of the X W1 product automatically.

W2: 100 x 10
b2: 1 x 10
L2 = L1 W2 + b2: 4 x 10
Desired output size: 4 x 10 (examples x classes)

We can add a wide layer by adding columns to W1 and then add a skinny layer by giving W2 k columns, so that our output still has the desired shape of examples x classes.
These layers are hidden because we cannot see their output as we run the graph.
Neurons are not classes or objects; they are values. They are the values that are moving through the pipeline. Follow a pixel of an example image through a network and consider it to be a neuron.
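The shape bookkeeping for the two linear layers can be checked with a small sketch. It uses plain Python lists with random placeholder values; the helpers `random_matrix`, `matmul`, `add_row_bias`, and `shape` are hypothetical names for this example, and the bias is stored as one row that is added to every row of the product (the broadcasting described on the slide).

```python
import random

rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(a_row[i] * b[i][j] for i in range(len(b)))
             for j in range(len(b[0]))]
            for a_row in a]


def add_row_bias(matrix, bias_row):
    """Add the same bias row to every row of the matrix (broadcasting)."""
    return [[value + b for value, b in zip(row, bias_row)] for row in matrix]


def shape(matrix):
    return (len(matrix), len(matrix[0]))


X = random_matrix(4, 10)        # 4 examples x 10 features
W1 = random_matrix(10, 100)     # wide hidden layer: 100 columns
b1 = [0.0] * 100
W2 = random_matrix(100, 10)     # skinny output layer: 10 classes
b2 = [0.0] * 10

L1 = add_row_bias(matmul(X, W1), b1)    # 4 x 100
L2 = add_row_bias(matmul(L1, W2), b2)   # 4 x 10 = examples x classes
```

Widening or narrowing the hidden layer only changes the inner dimension shared by W1 and W2; the output stays examples x classes.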
When we implement a neural network we use a graph.
Graph: X -> Linear L1 (W1, b1) -> Softmax S1 -> Probabilities (0.13 0.87 0.00)
Inputs to the graph: Training Data (images) and Labels. Labels tell you the true class of each image.
Note: Softmax S1 and Sigmoid S1 play the same role here; softmax with two classes reduces to the sigmoid.

One-Hot Training Labels
Training label   One-hot
‘Cat’            0 0 1
‘Car’            0 1 0
‘Dog’            1 0 0

If we run our network on just one image in the training set and take its corresponding label:
Output: 0.13 0.87 0.00
One-hot label: 0 1 0
Error! The output should be 0 1 0.
Run network on all training data and training labels
Graph: X -> Linear L1 (W1, b1) -> Sigmoid S1 -> logits. Inputs: Training Data (images) and Labels.

One-Hot Training Labels
‘Cat’   0 0 1
‘Car’   0 1 0
‘Dog’   1 0 0

Run network on 3 images. Output, one row per image:
       Cat    Car    Dog
Cat    0.13   0.87   0.12
Car    0.55   0.91   0.2
Dog    0.88   0.66   0.11
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Find the accuracy of our network on the training data. More on this later.
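The same accuracy calculation can be traced by hand in plain Python, using the 3-image output and the one-hot labels from this slide. The helper name `argmax` is ours; it mirrors what `tf.argmax(..., 1)` does per row.

```python
def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)


outputs = [[0.13, 0.87, 0.12],
           [0.55, 0.91, 0.2],
           [0.88, 0.66, 0.11]]
one_hot_labels = [[0, 0, 1],   # 'Cat'
                  [0, 1, 0],   # 'Car'
                  [1, 0, 0]]   # 'Dog'

# An image counts as correct when the largest output lines up with the 1 in its label.
correct_prediction = [argmax(out) == argmax(label)
                      for out, label in zip(outputs, one_hot_labels)]
# Averaging the True/False values (as 1/0) gives the accuracy.
accuracy = sum(correct_prediction) / len(correct_prediction)
```

With these numbers two of the three rows line up with their labels, so the accuracy is 2/3.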
Setting up training data
[Figure: the data split into Training Data and Test Data, then into Training Data, Validation Data, and Test Data]
Split up your training data into validation data and training data. Use validation data as test data as you train and tune your network.
Train Data: 80% of original training data
Validation Data: 20% of original training data
Then shuffle!
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
# Split the original training data 80/20 into training and validation sets
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, train_size=0.80, test_size=0.20)
# Shuffle examples and labels together so the label indices stay aligned
X_train, y_train = shuffle(X_train, y_train)
Preprocessing: Normalization
In our examples we used raw pixel values (0, 255) as our inputs to train our network.
In practice, we preprocess this data before running it through our network.
Mean-centered normalization: we subtract the mean pixel value from each pixel and divide by the standard deviation. This gives values with zero mean and unit standard deviation, so most of them fall roughly within [-1, 1].
(0, 255) => roughly (-1, 1)
Min-max normalization: we subtract the minimum from each value and divide by the difference (max - min).
This gives us a range of [0, 1].
(0, 255) => (0, 1)
[Figure: bell curve centered at 0 with -1 and 1 marked. img from Wikipedia]
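Both normalization schemes fit in a few lines of plain Python. This is a sketch on the 4-pixel example image from earlier in the deck; the function names `standardize` and `min_max` are made up here, and the population standard deviation is an arbitrary choice for the example.

```python
import statistics


def standardize(pixels):
    """Mean-centered normalization: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(pixels)
    std = statistics.pstdev(pixels)
    return [(p - mean) / std for p in pixels]


def min_max(pixels):
    """Min-max normalization: subtract the minimum, divide by (max - min)."""
    lowest, highest = min(pixels), max(pixels)
    return [(p - lowest) / (highest - lowest) for p in pixels]


pixels = [20, 254, 40, 1]
standardized = standardize(pixels)   # zero mean, unit standard deviation
scaled = min_max(pixels)             # smallest pixel -> 0.0, largest -> 1.0
```

In practice the mean, standard deviation, minimum, and maximum would usually be computed on the training set and reused for validation and test data.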
Overfitting - Introduction to Hyperparameters
The goal of building an artificial neural network is to generalize.
● We want to apply new data to our network and classify inputs
● If we overtrain / overfit our network to our training data then our accuracy will be deceiving. It might work very well for training data, but will not work on test data.
● In order to prevent overfitting we implement preprocessing techniques and tune our hyperparameters.
● Tuning hyperparameters is basically all that we can do after we set up our network architecture
● It should be the last step in setting up your network
● Test on validation data while you tune (don’t touch test data)
http://docs.aws.amazon.com/machine-learning/latest/dg/images/mlconcepts_image5.png
Epochs
An epoch is a single forward pass and backward pass of the entire training set through the network.
The number of epochs is a hyperparameter, and we must tune it to fit our data / increase our accuracy.
The larger the number of epochs, the longer it takes to train.
We increase epochs to increase the number of training iterations. If we increase it too far we may overfit.
[Figure: X -> Linear L1 (W1, b1) -> Softmax S1 -> logits, with a forward arrow and a backward arrow]
More on the backward pass to come
Stop early!
https://qph.ec.quoracdn.net/main-qimg-d23fbbc85b7d18b4e07b7942ecdfd856?convert_to_webp=true
Minibatch
We don’t feed all of the training examples into our network at once. Instead, we choose a batch of examples and feed them in, perform forward and backward propagation on them, and then feed in the next batch.
We do this so we can perform Stochastic Gradient Descent, and to help prevent our network from overfitting.
So in minibatch gradient descent, you process a small subset of training data forwards and backwards and update the weights / biases with the gradient update formula (shown on next page).
Mini batch
We feed only segments into our neural network at a time.
[Figure: Training Data split into batches that are fed into the Network one at a time]
● The number of examples in each batch (the batch size) is a hyperparameter.
● This also depends on GPU memory size
● Typically use 128 or 256
Gradient Descent
The “Learning” in Machine Learning.
Update the value of X (punish it) when it is wrong: X -= η∇(x)
X: weights or biases
η: Learning rate (typically 0.01 to 0.001). The rate at which our network learns. This can change over time with methods such as Adam, Adagrad etc. (hyperparameter)
∇(x): Gradient of the loss with respect to X
We seek to update the weights and biases by a value indicating how “off” they were from their target.
The gradient points in the direction of increasing loss, so we put a negative sign in front of it to go downwards.
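The update X -= η∇(x) can be tried on a one-parameter toy problem. The sketch below is an illustration only: the function name `gradient_descent`, the loss (x - 3)^2, and the learning rate of 0.1 are choices made for this example, not values from the talk.

```python
def gradient_descent(gradient, x0, learning_rate=0.01, steps=1000):
    """Repeatedly step against the gradient: x -= learning_rate * gradient(x)."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


# Toy loss: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3) and whose minimum is at x = 3.
def toy_gradient(x):
    return 2.0 * (x - 3.0)


x_min = gradient_descent(toy_gradient, x0=0.0, learning_rate=0.1, steps=100)
```

Starting from 0, each step moves x a fraction of the way toward 3; a larger learning rate takes bigger steps but can overshoot.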
Stochastic Gradient Descent
Recall Gradient Descent: X -= η∇(x) (eq 1)
Stochastic Gradient Descent (SGD) is a version of Gradient Descent where, on each forward pass, a batch of data is randomly sampled from the total dataset and gradient descent is performed on that batch.
The more batches the network processes, the better the approximation.
1. Randomly sample a batch of data (1) from the total dataset
2. Run the network forward and backward to calculate the gradient from data (1)
3. Apply the gradient descent update (eq 1)
4. Repeat 1-3 until convergence or epoch limit
Visualizing Batch and SGD
Training images split into batches: 256, 256, 256, 232
If we start out with 1000 images and use a batch size of 256, the last batch will have 232 images in it.
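The batch arithmetic above can be checked with a short sketch. The helper `make_batches` is a hypothetical name; it shuffles a copy of the examples (here just the indices 0..999 standing in for images) and slices it into consecutive batches.

```python
import random


def make_batches(examples, batch_size, seed=None):
    """Shuffle a copy of the examples, then slice it into batches."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]


batches = make_batches(range(1000), batch_size=256, seed=0)
batch_sizes = [len(batch) for batch in batches]   # [256, 256, 256, 232]
```

Every image lands in exactly one batch, and only the final batch is smaller than the batch size.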
Stochastic Gradient Descent sample size
Maybe take ~5 images from the 256-image batch at a time and run SGD on them. Then go back and select 5 more.
X -= η∇(x)
Each gradient ∇(x) is computed from the images in the SGD batch.
Backpropagation
We need to figure out how to alter a parameter to minimize the cost (loss). First we must find out what effect that parameter has on the cost.
(we can’t just blindly change parameter values and hope that our network converges)
The gradient measures the effect each parameter has on the cost.
How do we determine the effect of a parameter on the cost?
We use backpropagation, which is an application of the chain rule from calculus.
Did somebody say Chain Rule?
Backpropagation
Derivative Review: for f(g(x)), df/dx = (df/dg) * (dg/dx)
In order to know the effect x has on f, we must first find the effect g has on f, then the effect x has on g.
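The chain rule can be sanity-checked numerically. In this sketch the functions g(x) = 3x + 1 and f(g) = g^2 are arbitrary examples chosen for illustration; the analytic derivative multiplies the two local derivatives, and a finite-difference estimate confirms it.

```python
def g(x):
    return 3.0 * x + 1.0


def f(u):
    return u ** 2


def chain_rule_derivative(x):
    """df/dx = (df/dg) * (dg/dx) for f(g(x)) = (3x + 1)^2."""
    df_dg = 2.0 * g(x)   # effect g has on f
    dg_dx = 3.0          # effect x has on g
    return df_dg * dg_dx


def numerical_derivative(x, h=1e-5):
    """Central finite difference of f(g(x)), used only as a check."""
    return (f(g(x + h)) - f(g(x - h))) / (2.0 * h)


analytic = chain_rule_derivative(2.0)    # 2 * 7 * 3 = 42
numeric = numerical_derivative(2.0)
```

Backpropagation applies this same product of local derivatives gate by gate, from the loss back to each parameter.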
Backpropagation
You want to stage backpropagation at each gate level locally. This is much easier to implement than storing each weight value and trying to compute everything at the end. Simply accumulate the gradients along an individual neuron's path.
Andrej Karpathy
More Backpropagation
[Figure: a gate f with inputs X and Y and output Z]
dL/dX = (dL/dZ) * (dZ/dX): the change in L with respect to Z times the change in Z with respect to X
[Figure: X, W1, b1 -> Linear L1 -> S1. It is called S1 because it goes to the sigmoid; S = WX + b]
dL/dX: loss with respect to X
This comes together on the next slide!
Backpropagation cont.
[Figure: X, W1, b1 -> Linear L1 -> Sigmoid S1 -> any gate output. Andrej Karpathy]
X has a relationship to L1, and S1 has a relationship to L1. We can use that relationship in an application of the chain rule to compute the change in L1 with respect to X. Then we perform a gradient descent update on X.
Accumulator of all the gradients up to the L1 gate (sum of all gradients in red box), aka Accumulated Loss
Gradient descent equation (update X): X -= η∇(x)
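Cross Entropy
Cross entropy is a distance, D(S, L), between the softmax output S and the labels L.
X (input) -> Wx + b (linear) -> Y (logits) -> S(Y) (softmax) -> D(S, L) (cross entropy) <- L (labels)
Y (logits): 2.0 1.0 0.1
S(Y) (softmax): 0.7 0.2 0.1
L (labels): 1.0 0.0 0.0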
Tells us how accurate we are
Minimize cross entropy
● Want a high distance for incorrect class
● Want a low distance for correct class
● Training loss = average cross entropy over the entire training set.
● Want all the distances to be small
● Want loss to be small
● So we attempt to minimize this function.
[Figure: Training Loss surface plotted over weight 1 and weight 2. src: Udacity]
Cross Entropy Loss (continued)
[Figure: Training Loss surface plotted over weight 1 and weight 2. src: Udacity]
Training Loss: average cross entropy over the entire training set. Minimize this function.
We want to find the weights that make this loss the smallest. This turns the M.L. problem into numerical optimization.
● Take the derivative of the loss with respect to the parameters and follow it downhill by taking a step in the opposite direction.
● Repeat until you get to the bottom.
● In this case we have 2 parameters (w1, w2)
● Typically we have millions of parameters
cross_entropy = -tf.reduce_sum(tf.multiply(one_hot, tf.log(softmax)))
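The same cross entropy can be computed by hand for the logits and one-hot label from the earlier Cross Entropy slide. This is a plain-Python sketch with made-up helper names (`softmax`, `cross_entropy`); it skips zero entries of the one-hot label, which leaves only the true class in the sum.

```python
import math


def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(softmax_output, one_hot):
    """D(S, L) = -sum over classes of L_i * ln(S_i)."""
    return -sum(l * math.log(s)
                for s, l in zip(softmax_output, one_hot) if l > 0)


logits = [2.0, 1.0, 0.1]
one_hot = [1.0, 0.0, 0.0]

s = softmax(logits)                        # roughly 0.7, 0.2, 0.1
distance = cross_entropy(s, one_hot)       # -ln of the first softmax value
```

Averaging this distance over every training example gives the training loss that gradient descent minimizes.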
Installing Dependencies
You can use pip3 or pip. I recommend using an anaconda environment with python3:
https://www.continuum.io/downloads to Download Anaconda, (get Python 3.4+ version)
conda create --name=IntroToTensorFlow python=3 anaconda
source activate IntroToTensorFlow (Your conda environment is named “IntroToTensorFlow”)
conda install -c anaconda numpy=1.11.3
conda install -c conda-forge matplotlib=2.0.0
conda install -c anaconda scipy=0.18.1
conda install scikit-learn
or pip install -U scikit-learn
conda install -c conda-forge tensorflow
conda install -c menpo opencv3=3.2.0
jupyter notebook (to run in browser)
git clone https://github.com/JonathanCMitchell/TensorFlowLab.git
Installing TensorFlow
Recommended: Python 3.4 or higher and Anaconda
Install TensorFlow
conda create --name=IntroToTensorFlow python=3 anaconda
source activate IntroToTensorFlow
conda install -c conda-forge tensorflow
docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow (Docker if you need it)
# Hello World!
import tensorflow as tf

# Create a TensorFlow constant tensor
hello_constant = tf.constant('Hello World!')

with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)
    print(output)
git clone https://github.com/JonathanCMitchell/TensorFlowLab.git
If you have questions here is my info:
Jonathan Mitchell
github.com/jonathancmitchell
linkedin.com/in/[email protected]
Self Driving Cars Los Angeles
https://www.meetup.com/Los-Angeles-Self-Driving-Car-Meetup/
Thank you!