Machine learning pt.1: Artificial Neural Networks ® All Rights Reserved

Machine Learning pt1: Classification, Regression, and Artificial Neural Networks

Self-Driving-Cars Los Angeles

By Jonathan Mitchell

github.com/jonathancmitchell

linkedin.com/in/[email protected]

Self Driving Cars Los Angeles: https://www.meetup.com/Los-Angeles-Self-Driving-Car-Meetup/


Welcome to Machine Learning

aka computational statistics

How did I learn this?

Sources:
● Udacity’s Self-Driving Car Nanodegree program (udacity.com/drive)
● MIT Self-Driving Car program (selfdrivingcars.mit.edu)
● Stanford’s cs-231n (cs231n.github.io)

Topics
A) Probability basics - Basics to Logits
B) Linear Classification / Logistic regression overview
C) Perceptron
D) Perceptron (biological inspiration)
E) Neuron
F) Forward Pass
G) Computing a loss function
H) Visualizing Hidden Layers
I) Setting up training data
J) Preprocessing / Normalization
K) Overfitting / Hyperparameter intro
L) Epochs
M) Minibatch
N) Gradient Descent / Stochastic Gradient Descent
O) Backpropagation
P) Cross Entropy Loss

Probability basics

Probability p = (number of outcomes of interest) / (number of all possible outcomes)

p ranges over [0, 1], endpoints included. p = 1 means 100% likelihood; p = 0 means 0% likelihood.

Odds: the probability that an event happens divided by the probability that it does not:

$$\text{odds} = \frac{p}{1 - p}$$

Coin toss: toss a coin in the air; it lands either heads or tails.

P_heads = 0.5 => P_tails = 1 - P_heads = 0.5

P1 = probability of heads, P0 = probability of tails, each in [0, 1].

$$\text{Odds}_{heads} = \frac{P(\text{Heads})}{P(\text{Not Heads})}, \qquad \text{Odds}_{tails} = \frac{P(\text{Tails})}{P(\text{Not Tails})}$$

The odds can be written as a ratio Odds1 : Odds2. In this case 1:1, an equal chance of getting heads or tails.

Bernoulli Probability (A specific case of binomial distribution)

Bernoulli trial: a yes-or-no question with two possible outcomes, success and failure.

p = probability of success (of one trial). p is unknown and has to be estimated.

q = probability of failure (of one trial) = 1 - p

N = number of trials = 1

K = number of successes. For a Bernoulli trial K is 0 or 1, and P(K = 1) = p.

A Bernoulli distribution is a special case of a Binomial Distribution with N = 1 trial.

Binomial Probability
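The slide's formula did not survive extraction. As a sketch, here is the standard binomial probability of exactly K successes in N independent trials, C(N, K) p^K q^(N-K). The function name `binomial_pmf` and the example values are illustrative assumptions, not taken from the slides.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p (so q = 1 - p)."""
    q = 1.0 - p
    return comb(n, k) * p**k * q ** (n - k)

# Bernoulli is the N = 1 case: one fair coin toss landing heads.
bernoulli_success = binomial_pmf(1, 1, 0.5)

# Exactly two heads in three fair tosses.
two_heads_in_three = binomial_pmf(2, 3, 0.5)

print(bernoulli_success, two_heads_in_three)
```

With N = 1 the expression collapses to p for K = 1 and q for K = 0, which is the Bernoulli case described above.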

Probability Basics -> Logistic Regression

Goal: estimate an unknown probability p for any given linear combination of the independent variables.

Link the independent variables to the Ber(p) distribution.

Logistic regression produces the estimate $\hat{p}$ of p.

We need a function that links a linear combination of variables, which can take any real value, to the Bernoulli probability p, which lies between 0 and 1.

Use the logit: the natural log of the odds.

Logistic Regression

Logit: natural log of the odds

$$\text{logit}(p) = \ln\left(\frac{p}{1 - p}\right)$$

Undefined at p = 0 and p = 1.

Valid domain for p: (0, 1). The output can be any real value, so it can be set equal to the linear combination.

graph from http://www.graphpad.com/support/faqid/1465/

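Logistic Regression

Logit:

$$\text{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = \alpha$$

α is the linear combination of the independent variables and their coefficients.

Recall:

$$\text{logit}(p) = \ln(p) - \ln(1 - p)$$

The inverse logit gives us the probability that the dependent variable is a “1”:

$$\text{logit}^{-1}(\alpha) = \frac{1}{1 + e^{-\alpha}} = \frac{e^{\alpha}}{1 + e^{\alpha}}$$

Linear combination: α = B0 + x · B

Probability for input x under the linear combination mapping (B and B0):

$$p(x; B_0, B) = \frac{1}{1 + e^{-(B_0 + x \cdot B)}}$$

Binary output variable Y. We want to model the conditional probability Pr(Y = 1 | X = x) as a function of x; any unknown parameters are estimated by maximum likelihood.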

graph from http://www.graphpad.com/support/faqid/1465/
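A minimal sketch of the logit and its inverse, assuming a single example x with hand-picked coefficients. The names `logit`, `inverse_logit`, `predict_probability` and all numeric values are illustrative and do not come from the slides.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p); only defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is undefined outside (0, 1)")
    return math.log(p / (1.0 - p))

def inverse_logit(alpha):
    """Maps any real alpha back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-alpha))

def predict_probability(x, coefficients, intercept):
    """Pr(Y = 1 | X = x) for a linear combination intercept + x . coefficients."""
    alpha = intercept + sum(c * xi for c, xi in zip(coefficients, x))
    return inverse_logit(alpha)

# The two functions undo each other.
round_trip = inverse_logit(logit(0.8))

# Hypothetical coefficients chosen so the linear combination is exactly 0.
p_at_boundary = predict_probability([1.0, 2.0], [0.5, -0.25], 0.0)

print(round_trip, p_at_boundary)
```

A linear combination of 0 maps to a probability of 0.5, which is why the sign of the linear combination can later be used to pick a class.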

Logistic Regression -> Linear Classification

To classify we seek a binary output variable Y = 1 or 0.

Recall Pr(Y = 1 | X = x). We modeled this as p(x;b,w)

Predict Y = 1 when p >= 0.5. Y = 1 = Class A

Predict Y = 0 when p < 0.5. Y = 0 = Class B

Since p >= 0.5 exactly when the linear combination is non-negative:

Guess 1 when B0 + x · B is non-negative

Guess 0 when B0 + x · B is negative

This is a linear classifier.

We can also infer that the probabilities depend on the distance from the boundary.

This is known as a Binary Logistic Classifier (binary = 2 options, Class A or Class B).

The decision boundary separates the two predicted classes and is the solution to the equation B0 + x · B = 0.

Graph from http://pubs.rsc.org/services/images/RSCpubs.ePlatform.Service.FreeContent.ImageService.svc/ImageService/Articleimage/2010/AN/b918972f/b918972f-f7.gif

Neuron: Building block of a neural network

src: MIT-Self-Driving-Cars, Fridman, Lex

A neuron is a computational building block of the brain.
Human brain: ~1000T synapses, far more than an artificial neural network.

An artificial neuron is a computational building block of an artificial neural network.
Artificial neural network: ~1-10B synapses

An artificial neuron:
* Takes a set of inputs
* Places a weight on each input
* Sums them together
* Applies a bias value on each neuron
* Uses an activation function that takes in the sum plus bias and squeezes the value into the range (0, 1), so it can be read as a probability

It takes a few inputs and produces an output.

Classification: the output is 1 or 0. This can serve as a linear classifier.


Perceptron Algorithm

[Diagram: a perceptron with inputs X1, X2, X3 and a single output.]

1. Initialize the perceptron with random weights.
2. Compute the perceptron's output.
3. If the output does not match the known output:
   a. If the output should have been 0 but was 1, decrease the weights that had an input of 1.
   b. If the output should have been 1 but was 0, increase the weights that had an input of 1.
4. Move on to the next example in the training set, and repeat until the perceptron makes no more mistakes.
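The steps above can be sketched as follows. This is a toy version under stated assumptions: inputs are 0 or 1, the training set is an AND gate (chosen because a single perceptron can separate it), the step size is 0.1, and a bias is treated like a weight on a constant input of 1. None of these choices come from the slides.

```python
import random

def predict(weights, bias, inputs):
    """Output 1 when the weighted sum plus bias is non-negative, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else 0

def train_perceptron(examples, step=0.1, max_passes=1000, seed=0):
    rng = random.Random(seed)
    n_inputs = len(examples[0][0])
    # Step 1: random initial weights (and bias).
    weights = [rng.uniform(-1, 1) for _ in range(n_inputs)]
    bias = rng.uniform(-1, 1)
    for _ in range(max_passes):
        mistakes = 0
        for inputs, target in examples:
            # Step 2: compute the output. error is +1, 0 or -1.
            error = target - predict(weights, bias, inputs)
            if error != 0:
                mistakes += 1
                # Step 3: only weights whose input was 1 change;
                # they go down if we said 1 wrongly, up if we said 0 wrongly.
                weights = [w + step * error * x for w, x in zip(weights, inputs)]
                bias += step * error
        # Step 4: stop once a full pass makes no mistakes.
        if mistakes == 0:
            break
    return weights, bias

and_gate = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, x) for x, _ in and_gate]
print(predictions)
```

The loop only terminates early when the data is linearly separable; for data a single perceptron cannot separate, `max_passes` is what stops it.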


If output does not match expected output = Punish!

Your output neurons didn’t match the expected output.

[Diagram: training images feed a perceptron with inputs X1, X2, X3 and one output. Expected output: Cat, but we got Burrito.]

Why Neural Networks are great.

[Diagram: perceptron with inputs X1, X2, X3 and one output.]

We can use the hidden layer to approximate any function.
Universality: a network with a single hidden layer can closely approximate any continuous function f(x), given enough hidden neurons.

Driving: Input (sensor data from the world), Output: Drive (use steering data etc.)


Dual class Linear Classification with Binary Logistic Regression

Goal: To predict class A or class B from input data.

Two possible outputs!

[Diagram: input data x -> linear combination -> logistic regression (squeezes values between 0 and 1) -> scores in the (0, 1) range -> predictor: Class A if Y >= 0.5 (toward P = 1), Class B if Y < 0.5 (toward P = 0).]

Notation change: logit^-1 (the inverse logit) -> sigmoid

[Diagram, relabeled: input data x -> linear combination (outputs unnormalized log probabilities) -> logistic regression / sigmoid (squeezes values between 0 and 1, putting them into a probability distribution) -> scores in the (0, 1) range -> predictor: Class A if Y >= 0.5, Class B if Y < 0.5.]

Generalizing Logistic Regression to multiple classes

If we have two classes we can have two possible outputs: 1 or 0.

Binary: two outputs, Y is either 1 or 0.

What if we have 10 classes?

Suppose we have J classes.

Let’s switch up some notation:

$$f(x; W, b) = xW + b$$

Now set each score s to the result of that function:

$$s = f(x; W, b)$$

Probability that output Y = class k, where we have J possible classes:

$$P(Y = k \mid X = x) = \frac{e^{s_k}}{\sum_{j=1}^{J} e^{s_j}}$$

Perform softmax on the scores.

The softmax classifier is binary logistic regression generalized to multiple classes.

Output = scores between 0 and 1 that sum to 1.


Output of the linear function, a.k.a. linear scores

Linear(x) = xW + b or Wx + b
Textbooks: Wx + b
TensorFlow: xW + b

Computing derivatives is easier for xW + b.

A few notes

f(xi, W, b) = xW + b

Assume image x has all of its pixels flattened out into a single row vector, x = [x[0] x[1] x[2] x[3] x[4] ...]. Each pixel value is in the range (0, 255).

X’s size is [n x m]. n: # examples/images, m: # features (pixels per image in this case)

Matrix W has size [m x k]. m = # features, k = # classes

Bias b has size [1 x k], one value per class, so it can be added to every row of xW.

Consider our input data (xi, yi) as being fixed. We can set W and b to approximate any function (remember the universality principle).

We use the training data to learn W and b. Once our model has been trained, we can discard the training data and test our model on test data, or on any other data for that matter.

W and b will be tensors if you are using TensorFlow. They can be arrays if you are using NumPy.

Example

The biases allow these lines to NOT all cross through the origin.

W causes the lines to rotate about our pixel space.

b pushes the lines away from the origin.

src: Andrej Karpathy

Bias Trick (in practice)

It would be annoying to track the bias term separately during classification. Therefore we simply append the bias row vector to the bottom of our weights matrix, and append a constant 1 to the end of each input row so that the product is unchanged.

Weights (4 x 3) and Bias (1 x 3), kept separate:

Weights
0.1    0.25   0.3
0.63   0.12  -0.64
0.26   0.62   0.58
0.99  -0.14   0.333

Bias
0.12   3.1   -0.5

Weights with the bias appended as the last row (5 x 3):

0.1    0.25   0.3
0.63   0.12  -0.64
0.26   0.62   0.58
0.99  -0.14   0.333
0.12   3.1   -0.5

You may see the linear scores computed in code as:

logits = tf.add(tf.matmul(x, weights), bias)

OR

logits = tf.matmul(x, weights)
logits = tf.nn.bias_add(logits, bias)
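A small sketch of the bias trick in plain Python, using the weight and bias values from the slide and the 4-pixel image used in the example. The helper `row_times_matrix` is a hypothetical name; in practice a library such as NumPy or TensorFlow would do the matrix product.

```python
def row_times_matrix(x, W):
    """Row vector x (length m) times matrix W (m rows, k columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

x = [20, 254, 40, 1]                 # one image, 4 pixels
W = [[0.1, 0.25, 0.3],
     [0.63, 0.12, -0.64],
     [0.26, 0.62, 0.58],
     [0.99, -0.14, 0.333]]           # 4 features x 3 classes
b = [0.12, 3.1, -0.5]                # one bias per class

# Bias handled separately: xW + b.
scores_separate = [s + bias for s, bias in zip(row_times_matrix(x, W), b)]

# Bias trick: append 1 to x and append b as the last row of W.
scores_trick = row_times_matrix(x + [1], W + [b])

print(scores_separate)
print(scores_trick)
```

Both routes give the same scores. With these particular numbers the result is about 173.53, 63.24 and -133.53, not the 3.2, 5.1, -1.7 shown on the slide; the slide's scores are illustrative values rather than the product of this exact x and W.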

Input image (pretend this image only has 4 pixels); stretch the pixels into a single row:

X, n x m = 1 x 4
img1: 20  254  40  1

1 image, 4 pixels. Each pixel is a feature, so 1 image, 4 features. Pixels range (0, 255).

Weights, m x k = 4 x 3

0.1    0.25   0.3
0.63   0.12  -0.64
0.26   0.62   0.58
0.99  -0.14   0.333

m: # features (pixels per img) = 4
n: # images = 1
k: # classes = 3 (Cat, Car, Dog)

Bias, 1 x k = 1 x 3

Cat    Car    Dog
0.12   3.1   -0.5

Linear Scores = xW + b

Output, 1 x 3

Cat    Car    Dog
3.2    5.1   -1.7

values from Andrej Karpathy

Initialize the weights with values between 0 and 1. You can initialize the biases at 0 or at very small values if you like.

Linear Scores, f(x; w, b)

Applying softmax

$$P(Y = k \mid X = x) = \frac{e^{s_k}}{\sum_{j=1}^{J} e^{s_j}}$$

k = index of a specific class (different from k on the last slide). J = # classes

                                               Cat    Car    Dog
Linear scores                                  3.2    5.1   -1.7
Apply exponential (unnormalized probabilities) 24.5   164    0.18
Normalize so sum = 1 (probabilities)           0.13   0.87   0.00
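The softmax numbers on this slide can be reproduced with a few lines of Python. The function below is a sketch; subtracting the largest score before exponentiating is an added numerical-stability step that does not change the result and is not shown on the slide.

```python
import math

def softmax(scores):
    """Exponentiate each score, then normalize so the results sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

scores = [3.2, 5.1, -1.7]            # Cat, Car, Dog linear scores
probabilities = softmax(scores)
rounded = [round(p, 2) for p in probabilities]

print(rounded)  # [0.13, 0.87, 0.0]
```

The largest score (Car) ends up with most of the probability mass, while a strongly negative score (Dog) is pushed close to 0.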


[Diagram, process so far: input image stretched into a single row X (1 x 4) -> linear scores xW + b with weights (4 x 3) and bias (1 x 3), giving Cat 3.2, Car 5.1, Dog -1.7 -> softmax -> normalized probabilities Cat 0.13, Car 0.87, Dog 0.00.]

Process so far:

Each pixel can be considered a neuron



Process so far: Forward Pass

Loss Function: How we learn

Recall:

Your output neurons didn’t match the expected output.


For the probabilities 0.13 (Cat), 0.87 (Car), 0.00 (Dog):

Maximize the log likelihood of the true class
OR
minimize the negative log likelihood of the true class (it is easier to do a negative feedback loop than a positive feedback loop).

Use the loss to adjust the weights on the inputs that contributed to the incorrect classification.

There are many different types of loss functions. More on this later.

Visualizing a hidden layer

[Graph: X -> Linear L1 (W1, b1) -> Linear L2 (W2, b2) -> Softmax]

Shapes through the first layer:

X:             4 x 10
W1:            10 x 100
b1:            1 x 100
xW1:           4 x 100
L1 = xW1 + b1: 4 x 100

Since b1 has a dimension with value 1, its values can be broadcast across the rows of the xW1 product automatically.

X: n x m (examples x features)
W: m x k (features x classes)
b: 1 x k (classes row vector)

Shapes through the second layer:

L1:             4 x 100
W2:             100 x 10
b2:             1 x 10
L2 = L1W2 + b2: 4 x 10

We can add a wide layer by adding columns to W1 and then add a skinny layer by giving W2 k columns, so that our output still has the desired shape of examples x classes.

These layers are hidden because we cannot see their output as we run the graph.

Desired output size: 4 x 10 (examples x classes)

Neurons are not classes or objects; they are values. They are the values that move through the pipeline. Follow a pixel of an example image through a network and consider it to be a neuron.
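To check the shapes, here is a sketch of the two linear layers in plain Python with random placeholder values. The helpers `matmul`, `add_bias`, `random_matrix` and `shape` are hypothetical names, the weight range is arbitrary, and, as on the slide, no activation is applied between the two layers.

```python
import random

def matmul(A, B):
    """Matrix product of A (n x m) and B (m x k), both lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add_bias(M, bias_row):
    """Add a 1 x k bias row to every row of M (what broadcasting does)."""
    return [[v + b for v, b in zip(row, bias_row)] for row in M]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def shape(M):
    return (len(M), len(M[0]))

rng = random.Random(0)
X = random_matrix(4, 10, rng)        # 4 examples x 10 features
W1 = random_matrix(10, 100, rng)     # wide hidden layer: 100 columns
b1 = [0.0] * 100
W2 = random_matrix(100, 10, rng)     # skinny output layer: 10 classes
b2 = [0.0] * 10

L1 = add_bias(matmul(X, W1), b1)
L2 = add_bias(matmul(L1, W2), b2)

print(shape(L1), shape(L2))  # (4, 100) (4, 10)
```

The inner dimensions have to match at each product (10 with 10, then 100 with 100), and the final shape is examples x classes regardless of how wide the hidden layer is.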

Neuron

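When we implement a neural network we use a graph.

[Graph: X -> Linear L1 (W1, b1) -> Softmax S1 -> probabilities, e.g. 0.13 0.87 0.00. The training data consists of images and labels.]

Labels tell you the true class of each image.

Softmax S1 / Sigmoid S1

Note: these play the same role. Softmax is the multi-class generalization of the sigmoid, and with two classes it reduces to the sigmoid.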

[Graph: X -> Linear L1 (W1, b1) -> Sigmoid S1 -> probabilities 0.13 0.87 0.00, compared against the label 0 1 0.]

Labels tell you the true class of each image.

One-hot training labels:

Training label   One-hot
‘Cat’            0 0 1
‘Car’            0 1 0
‘Dog’            1 0 0

If we run our network on just one image in the training set and take its corresponding label, we can compare the two. Error! The output 0.13 0.87 0.00 should be 0 1 0.

[Graph: X -> Linear L1 (W1, b1) -> Sigmoid S1 -> outputs, compared against the labels.]

Run the network on all training data and training labels.

Run network on 3 images:

Image   Output (Cat  Car  Dog)   Training label   One-hot label
Cat     0.13  0.87  0.12         ‘Cat’            0 0 1
Car     0.55  0.91  0.2          ‘Car’            0 1 0
Dog     0.88  0.66  0.11         ‘Dog’            1 0 0

correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

Find the accuracy of our network on the training data. More on this later.
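The same accuracy computation can be traced by hand in plain Python, using the three output rows and one-hot labels from the slide. The helper name `argmax` is an assumption for illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

outputs = [[0.13, 0.87, 0.12],
           [0.55, 0.91, 0.2],
           [0.88, 0.66, 0.11]]
one_hot_labels = [[0, 0, 1],
                  [0, 1, 0],
                  [1, 0, 0]]

# A prediction is correct when the largest output sits at the label's 1.
correct = [argmax(o) == argmax(l) for o, l in zip(outputs, one_hot_labels)]
accuracy = sum(correct) / len(correct)

print(correct, accuracy)
```

Here the first image is misclassified and the other two are right, so the accuracy is 2/3. Booleans have to be turned into numbers before averaging, which is the role of the cast in the TensorFlow version.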

Setting up training data

Training Data | Test Data

Training Data | Validation Data | Test Data

Split up your training data into validation data and training data. Use the validation data in place of test data as you train and tune your network.
Train data: 80% of original training data
Validation data: 20% of original training data
Then shuffle!

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# Split the original training data 80/20 into train and validation sets
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, train_size=0.80, test_size=0.20)

# Shuffle the examples while keeping each label aligned with its example
X_train, y_train = shuffle(X_train, y_train)

Preprocessing: Normalization

In our examples we used raw pixel values (0, 255) as our inputs to train our network.

In practice, we preprocess this data before running it through our network.

Mean-centered normalization: we subtract the mean pixel value from each pixel and divide by the standard deviation. This gives values with zero mean and unit standard deviation, so most of them fall roughly within [-1, 1].

(0, 255) => roughly (-1, 1)

Min-max normalization: we subtract the minimum from each value and divide by the difference (max - min): x' = (x - min) / (max - min).

This gives us a range of (0, 1).

(0, 255) => (0, 1)

[Figure: distribution of normalized values centered between -1 and 1. Image from Wikipedia.]
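Both normalizations can be sketched on a handful of pixel values. The four sample pixels are made up for illustration, and the population standard deviation is used here as one reasonable choice.

```python
import statistics

pixels = [0, 64, 128, 255]           # hypothetical raw pixel values

# Mean-centered normalization: subtract the mean, divide by the std.
mean = statistics.fmean(pixels)
std = statistics.pstdev(pixels)
standardized = [(p - mean) / std for p in pixels]

# Min-max normalization: (x - min) / (max - min).
lo, hi = min(pixels), max(pixels)
min_max = [(p - lo) / (hi - lo) for p in pixels]

print(standardized)
print(min_max)
```

The standardized values have mean 0 and standard deviation 1 but are not strictly bounded, whereas the min-max values always land in [0, 1].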

Overfitting - Introduction to Hyperparameters

The goal of building an artificial neural network is to generalize.

● We want to apply new data to our network and classify the inputs.
● If we overtrain / overfit our network to our training data then our accuracy will be deceiving. It might work very well on training data but will not work on test data.
● In order to prevent overfitting we apply preprocessing techniques and tune our hyperparameters.
● Tuning hyperparameters is basically all that we can do after we set up our network architecture.
● It should be the last step in setting up your network.
● Test on validation data while you tune (don’t touch the test data).

http://docs.aws.amazon.com/machine-learning/latest/dg/images/mlconcepts_image5.png

Epochs

An epoch is a single forward pass and backward pass of the entire training set through the network.

The number of epochs is a hyperparameter, and we must tune it to fit our data / increase our accuracy.

The larger the number of epochs, the longer it takes to train.

We increase the number of epochs to increase the number of training iterations. If we increase it too much we may overfit.

[Graph: X -> Linear L1 (W1, b1) -> Softmax S1, with the forward pass running left to right and the backward pass running right to left.]

More on the backward pass to come.

Stop early!

https://qph.ec.quoracdn.net/main-qimg-d23fbbc85b7d18b4e07b7942ecdfd856?convert_to_webp=true

Minibatch

We don’t feed all the training examples into our network at once. Instead, we choose a batch of examples and feed them in, perform forward and backward propagation on them, and then feed in the next batch.

We do this so we can perform Stochastic Gradient Descent and help prevent our network from overfitting.

So in mini-batch gradient descent, you process a small subset of the training data forwards and backwards and update the weights / biases with the gradient update formula (shown on the next page).

Mini batch

We feed only segments of the training data into our neural network at a time.

[Diagram: Training Data -> Batch, Batch, ... -> Network]

● The number of examples in each batch is a hyperparameter.
● This also depends on GPU memory size.
● Typically use 128 or 256.

Gradient Descent

The “Learning” in Machine Learning.

Update the value of X (punish it) when it is wrong:

X -= η∇(X)

X: weights or biases

η: learning rate (typically 0.01 to 0.001). The rate at which our network learns. This is a hyperparameter, and it can change over time with methods such as Adam, Adagrad, etc.

∇(X): gradient of the loss with respect to X

We seek to update the weights and biases by a value indicating how “off” they were from their target.

The gradient points in the direction in which the loss increases fastest, so we put a negative sign in front of it to go downwards.
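A minimal sketch of the update rule on a one-parameter toy problem. The loss (x - 3)^2, the starting point, and the learning rate of 0.1 are assumptions picked so the behavior is easy to follow; they are not from the slides.

```python
def loss(x):
    """Toy loss with its minimum at x = 3."""
    return (x - 3.0) ** 2

def gradient(x):
    """Derivative of the toy loss with respect to x."""
    return 2.0 * (x - 3.0)

x = 0.0                  # initial parameter value
learning_rate = 0.1      # eta

for step in range(100):
    # Step against the gradient: X -= eta * grad(X)
    x -= learning_rate * gradient(x)

print(x, loss(x))
```

Each step shrinks the distance to the minimum by a constant factor here, so x approaches 3. With a learning rate that is too large for the problem the same loop would overshoot instead.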

Stochastic Gradient Descent

Recall Gradient Descent: X -= η∇(x) (eq 1)

Stochastic Gradient Descent (SGD) is a version of Gradient Descent where, on each forward pass, a batch of data is randomly sampled from the total dataset and gradient descent is performed on that batch.

The more batches the network processes, the better the approximation.

1. Randomly sample a batch of data (1) from the total dataset
2. Run the network forward and backward to calculate the gradient from data (1)
3. Apply the gradient descent update (eq 1)
4. Repeat 1-3 until convergence or epoch limit
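The four steps above can be sketched on a tiny linear model. Everything specific here is an assumption for illustration: the synthetic data y = 2x + 1, the mean-squared-error loss, the batch size of 10, the learning rate of 0.1, and the fixed step count standing in for "until convergence".

```python
import random

rng = random.Random(0)

# Synthetic dataset: y = 2x + 1, with no noise.
xs = [i / 100 for i in range(100)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.1
batch_size = 10

for step in range(3000):
    # 1. Randomly sample a batch of example indices.
    batch = rng.sample(range(len(xs)), batch_size)
    # 2. Forward pass (prediction errors) and backward pass (gradients
    #    of the mean squared error with respect to w and b).
    errors = [(w * xs[i] + b) - ys[i] for i in batch]
    grad_w = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / batch_size
    grad_b = 2.0 * sum(errors) / batch_size
    # 3. Gradient descent update (eq 1).
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    # 4. Repeat.

print(w, b)
```

Each update only sees 10 of the 100 examples, yet w and b still move toward 2 and 1, because the batch gradients agree with the full gradient on average.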

Visualizing Batch and SGD

If we start out with 1000 images and use a batch size of 256, we get batches of 256, 256, 256, and 232 images: the last batch holds the 232 images left over.
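The batch arithmetic can be checked with a short helper. `batch_sizes` is a hypothetical name used only for this sketch.

```python
def batch_sizes(n_examples, batch_size):
    """Sizes of consecutive batches; the last one holds the remainder."""
    return [min(batch_size, n_examples - start)
            for start in range(0, n_examples, batch_size)]

sizes = batch_sizes(1000, 256)
print(sizes)  # [256, 256, 256, 232]
```

Code that loops over batches should therefore not assume every batch has exactly `batch_size` examples.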

Training images: batch size

Stochastic Gradient Descent: sample size

Maybe take ~5 images from the 256-image batch at a time and run SGD on them. Then go back and select 5 more.

X -= η∇(x)

The gradient in each update is computed from the images in the SGD sample; X is still the weights or biases being updated.

Backpropagation

We need to figure out how to alter a parameter to minimize the cost (loss). First we must find out what effect that parameter has on the cost.

(We can’t just blindly change parameter values and hope that our network converges.)

The gradient captures the effect each parameter has on the cost.

How do we determine the effect of a parameter on the cost?

We use backpropagation, which is an application of the chain rule from calculus.

Did somebody say Chain Rule?

Backpropagation

Derivative review, for f(g(x)):

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

In order to know the effect x has on f, we must first find the effect g has on f, then the effect x has on g.

Backpropagation

You want to stage backpropagation locally at each gate. This is much easier to implement than storing each weight value and trying to compute everything at the end. Simply multiply the local gradients along an individual neuron’s path, and add up the contributions where paths merge.

Andrej Karpathy

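[Diagram: a gate f takes inputs X and Y and produces output Z. The gate receives the change in loss with respect to Z and computes the local change in Z with respect to X.]

$$\frac{\partial \text{Loss}}{\partial X} = \frac{\partial \text{Loss}}{\partial Z} \cdot \frac{\partial Z}{\partial X}$$

(Loss with respect to X = change in loss with respect to Z times change in Z with respect to X)

[Graph: X -> Linear L1 (W1, b1) -> S1. The output is called S1 because it goes to the sigmoid; S = WX + b.]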

More Backpropagation




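[Graph: X -> Linear L1 (W1, b1) -> Sigmoid S1 -> (any gate) -> output]

X has a relationship to L1, and L1 has a relationship to S1. We can use those relationships in an application of the chain rule to compute the change in the loss with respect to X. Then we perform a gradient descent update on X.

$$\frac{\partial \text{Loss}}{\partial X} = \frac{\partial \text{Loss}}{\partial S_1} \cdot \frac{\partial S_1}{\partial L_1} \cdot \frac{\partial L_1}{\partial X}$$

The first factor is the accumulator of all the gradients flowing back to this gate (the sum of all gradients in the red box on the slide), a.k.a. the accumulated loss.

Gradient descent equation (update X):

$$X \leftarrow X - \eta \, \frac{\partial \text{Loss}}{\partial X}$$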
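A sketch of the chain rule on a single linear-then-sigmoid neuron, checked against a numerical derivative. The squared-error loss and all numeric values are assumptions chosen for illustration (the slides move on to cross entropy), and the parameter being updated here is the weight w.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x, target):
    """Linear gate, sigmoid gate, then a squared-error loss (an assumption)."""
    l1 = w * x + b
    s1 = sigmoid(l1)
    loss = (s1 - target) ** 2
    return l1, s1, loss

w, b, x, target = 0.5, -0.2, 1.5, 1.0
l1, s1, loss = forward(w, b, x, target)

# Backward pass: one local derivative per gate, multiplied along the path.
dloss_ds1 = 2.0 * (s1 - target)      # loss gate
ds1_dl1 = s1 * (1.0 - s1)            # sigmoid gate
dloss_dl1 = dloss_ds1 * ds1_dl1      # accumulated gradient at L1
dloss_dw = dloss_dl1 * x             # linear gate, with respect to w
dloss_db = dloss_dl1 * 1.0           # linear gate, with respect to b

# Numerical check of dloss_dw using a central difference.
eps = 1e-6
numeric_dw = (forward(w + eps, b, x, target)[2]
              - forward(w - eps, b, x, target)[2]) / (2.0 * eps)

# One gradient descent update on w lowers the loss.
learning_rate = 0.1
new_w = w - learning_rate * dloss_dw
new_loss = forward(new_w, b, x, target)[2]

print(dloss_dw, numeric_dw, loss, new_loss)
```

The analytic and numerical gradients agree closely, which is a common sanity check when implementing backpropagation by hand.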

Backpropagation cont

Andrej Karpathy


Cross Entropy (distance)

[Diagram: input X -> linear model Wx + b -> logits y = (2.0, 1.0, 0.1) -> softmax S(y) = (0.7, 0.2, 0.1) -> cross entropy D(S, L) against the one-hot labels L = (1.0, 0.0, 0.0).]

$$D(S, L) = -\sum_i L_i \log(S_i)$$

Cross entropy tells us how accurate we are.

Minimize cross entropy:
● Want a high distance for the incorrect class
● Want a low distance for the correct class
● Training loss = average cross entropy over the entire training set
● Want all the distances to be small
● Want the loss to be small
● So we attempt to minimize this function

[Plot (src: Udacity): training loss as a surface over weight 1 and weight 2.]

Cross Entropy Loss (continued)

Training loss = average cross entropy over the entire training set. Minimize this function.

We want to find the weights that make this loss the smallest. This turns the machine learning problem into a numerical optimization problem.

● Take the derivative of the loss with respect to the parameters and follow the derivative by taking a step backwards (against the gradient).
● Repeat until you get to the bottom.
● In this case we have 2 parameters (w1, w2).
● Typically we have millions of parameters.

cross_entropy = -tf.reduce_sum(tf.multiply(one_hot, tf.log(softmax)))

Cross Entropy Loss
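The cross entropy for the logits and one-hot label used in the earlier diagram can be worked through in plain Python. This is a sketch with hypothetical helper names; skipping the zero-label terms is an added guard against taking log(0).

```python
import math

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, one_hot):
    """D(S, L) = -sum(L_i * log(S_i)); terms with L_i = 0 contribute nothing."""
    return -sum(l * math.log(p) for p, l in zip(probabilities, one_hot) if l > 0)

logits = [2.0, 1.0, 0.1]
labels = [1.0, 0.0, 0.0]

probabilities = softmax(logits)
loss = cross_entropy(probabilities, labels)

print([round(p, 1) for p in probabilities], round(loss, 3))
```

With a one-hot label the sum reduces to the negative log of the probability assigned to the true class, here about -ln(0.66) ≈ 0.417. A more confident correct prediction would push the loss toward 0.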

Installing Dependencies

You can use pip3 or pip. I recommend using an Anaconda environment with Python 3:

Download Anaconda from https://www.continuum.io/downloads (get the Python 3.4+ version)

conda create --name=IntroToTensorFlow python=3 anaconda

source activate IntroToTensorFlow (your conda environment is named “IntroToTensorFlow”)

conda install -c anaconda numpy=1.11.3

conda install -c conda-forge matplotlib=2.0.0

conda install -c anaconda scipy=0.18.1

conda install scikit-learn (or pip install -U scikit-learn)

conda install -c conda-forge tensorflow

conda install -c menpo opencv3=3.2.0

jupyter notebook (to run in browser)

git clone https://github.com/JonathanCMitchell/TensorFlowLab.git

Installing TensorFlow

Recommended: Python 3.4 or higher and Anaconda

Install TensorFlow

conda create --name=IntroToTensorFlow python=3 anaconda

source activate IntroToTensorFlow

conda install -c conda-forge tensorflow

docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow (Docker if you need it)

# Hello World!
import tensorflow as tf

# Create a TensorFlow constant tensor
hello_constant = tf.constant('Hello World!')

# tf.Session is the TensorFlow 1.x API
with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)
    print(output)

git clone https://github.com/JonathanCMitchell/TensorFlowLab.git

If you have questions here is my info:

Jonathan Mitchell

github.com/jonathancmitchell

linkedin.com/in/[email protected]

Self Driving Cars Los Angeles: https://www.meetup.com/Los-Angeles-Self-Driving-Car-Meetup/

Thank you!


