
ECE421 – TUT0102
Tutorial 1: Introduction to Tensorflow

Bolin Gao

bolin.gao@mail.utoronto.ca

9/12/2019

Today’s Agenda 4:10 – 6:00 pm

First part: downloading and installing Anaconda, Tensorflow 1.x and Jupyter Notebook

Second part: basic concepts, Tensorflow 1.x models, Tensor graph, and eager execution mode

Third part: tf.placeholder and additional topics such as Tensorflow optimizers, Numpy and Tensorflow 2.x

Last part: code demo


Fun Fact About Tensorflow

President Obama catches a pregnant woman who almost fainted during his speech on Healthcare.gov in the Rose Garden (2013).

Meet Karmel Allison, manager for the TensorFlow team at Google Brain (2019).

"This happens when I talk too long."


What is Tensorflow?

Open-source library for numerical computation in large-scale machine learning.

Uses Python for building models, executes models in C++ for performance. In other words, Python API with C++ runtime.

Operations are implemented by both CPU and GPU.

CPU: few cores/processors, cores are fast, good at sequential tasks
GPU: thousands of cores, cores are slow, good at parallel tasks (e.g., matrix multiplication!)


What is Jupyter Notebook?

A web-browser based interactive coding environment.

Consists of three parts:

1. Web-browser for users to write code snippets

2. Notebook server for saving file with .ipynb extension on local disk

3. Kernel responsible for executing code snippets and returning results

General purpose, not just for writing Tensorflow code

Can be run in the cloud by using Google Colab, i.e., no local resources necessary.


Setting up Tensorflow and Jupyter (Windows)

(Follow the guide on Course Website for Linux or MacOS)

Steps:

1. Install Anaconda via Course Link

2. Open Anaconda Prompt and install Tensorflow

3. Open Anaconda Navigator and install Jupyter notebook

Note: Anaconda helps to install, modify and manage Python libraries


Setting up Tensorflow and Jupyter (Windows)

(Follow the guide on Course Website for Linux or MacOS)

Steps:

1. Install Anaconda via Course Link: https://repo.continuum.io/archive/index.html

Search for the file ‘Anaconda3-4.2.0-Windows-x86_64.exe’

Link in handout: “Tensorflow Installation Guide Using Anaconda”

Note: we are installing the CPU version only

GPU version requires carefully choosing CUDA Toolkit and cuDNN

(recommended to do it on your own time)


Setting up Tensorflow and Jupyter (Windows)

(Follow the guide on Course Website for Linux or MacOS)

Steps:

1. Install Anaconda via Course Link

2. Open Anaconda Prompt and Install Tensorflow (Version 1.2.1)

This will also download Numpy, an appropriate version of Python, BLAS (Basic Linear Algebra Subprogram), etc.

Note: Tensorflow does not work with Python 3.7! Use Python 3.6 or lower.


conda create -n tensorflow
activate tensorflow
conda install tensorflow (or pip install tensorflow)


Setting up Tensorflow and Jupyter (Windows)

(Follow the guide on Course Website for Linux or MacOS)

Steps:

1. Install Anaconda via Course Link

2. Open Anaconda Prompt and Install Tensorflow

3. Open Anaconda Navigator and install Jupyter notebook


Make sure to set the environment to tensorflow instead of root.


Alternatively, launch Jupyter Notebook by opening any web browser and typing 'localhost:8888'.


Shift + Enter to run the code.

Next up: Tensorflow basics, creating and manipulating tensors, building a model in Tensorflow, and using Tensorboard.


Tensorflow Basics

Array: a container of items of the same type and size

Tensor: a multidimensional numerical array

Rank (number of axes): number of dimensions of the tensor

Shape: the number of elements along each dimension of a tensor

Type: the data type (dtype) of a tensor

Examples:

A single 28 by 28 image: Rank = 2, Shape = (28, 28), Type = int32

A stack of N images (perhaps used as a "training set"): Rank = 3, Shape = (28, 28, N), Type = int32

A stack of N RGB images: Rank = 4, Shape = (28, 28, 3, N), Type = float32
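As a quick check (a minimal sketch with made-up data; the array and variable names are ours, not from the slides), the shape, rank, and dtype of a tensor can be inspected directly:

import numpy as np
import tensorflow as tf

images = tf.constant(np.zeros((28, 28, 5), dtype=np.int32))   # a fake stack of N = 5 grayscale 28 by 28 images
print(images.shape)          # (28, 28, 5)
print(len(images.shape))     # 3, i.e., the rank
print(images.dtype)          # <dtype: 'int32'>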


Tensorflow Basics – Creating Tensors

tf.constant(): instantiates a constant tensor

x = tf.constant(1)
>> Tensor("Constant:0", shape=(), dtype=int32)

tf.Variable(): instantiates a variable tensor whose value can be changed, i.e., the tensor can be assigned a different value. This is usually used for trainable variables such as weights and biases.

x = tf.Variable([[1],[2]])
>> <tf.Variable "Variable:0" shape=(2, 1) dtype=int32_ref>

tf.placeholder(): instantiates a container whose value we will provide when running the program. This is usually used for our input data and training labels.

x = tf.placeholder(tf.int32)
>> Tensor("Placeholder:0", dtype=int32)


Tensorflow Basics – Creating Tensors

tf.fill(): similar to tf.constant, but fills a tensor of a given shape with a single scalar value

x = tf.fill([2,2], 1) #creates a 2x2 tensor of ones

tf.random.normal(): creates a tensor of the given shape with values drawn from a normal (Gaussian) distribution with the specified mean and standard deviation.

x = tf.random.normal([2,2], 5.0, 10.0)
# creates a Gaussian-distributed 2x2 tensor with mean 5 and standard deviation 10

This can be used to initialize weights associated with a neural network.

x = tf.Variable(tf.random.normal([2,2], 5.0, 10.0))


Tensorflow Basics – Datatypes (dtypes)

Get the datatype using the .dtype attribute of a tensor

tf.Variable(3.14159265359).dtype

>> tf.float32_ref

The 'incompatible type conversion' error is very common!

Typecasting using tf.cast(Tensor, dtype): convert from float32 to float64

y = tf.cast(tf.Variable(3.14159265359),tf.float64)

>> Tensor(“Cast:0”, shape = (), dtype = float64)
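For instance (a toy example of our own, not from the slides), mixing dtypes in a single operation fails until one operand is cast:

a = tf.constant(1.0)                     # float32 by default
b = tf.constant(2.0, dtype=tf.float64)   # float64
# c = a + b                              # raises an incompatible-type error
c = tf.cast(a, tf.float64) + b           # works; c has dtype float64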


Tensorflow Basics – Shape, Size and Axis

tensor.get_shape(): operation that returns the shape of a tensor

x = tf.constant([[1,2,3], [4,5,6]])

print(x.get_shape())

>> (2,3)

Size: the total number of elements in a tensor.

Axes: indices corresponding to the shape of a tensor.

For the previous example: the shape is (2,3),

axis 0 corresponds to the first entry, which are the rows

axis 1 corresponds to the second entry, which are the columns.



Tensorflow Basics – Manipulating Tensors

tf.matmul: multiplies two tensors of appropriate dimensions

a = tf.constant([[1, 2, 3], [4, 5, 6]])

b = tf.constant([[7, 8], [9, 10], [11, 12]])

c = tf.matmul(a, b)

tf.square: element-wise square of all entries of a tensor

x = tf.constant([1, 2, 3, 4, 5])

y = tf.square(x)

Summing up the squared entries and taking the square root, we obtain the Euclidean norm of x.


Tensorflow Basics – Manipulating Tensors

tf.reduce_sum: sum across a given axis (if no axis provided, sum the entire tensor)

x = tf.constant([[1., 1., 1.], [2., 2., 2.]])

tf.reduce_sum(x) → 9.0
tf.reduce_sum(x, 0) → [3., 3., 3.]  # sum vertically
tf.reduce_sum(x, 1) → [3., 6.]  # sum horizontally
tf.reduce_sum(x, 1, keepdims=True) → [[3.], [6.]]

tf.reduce_mean: sum across a given axis, then divide by the number of elements

x = tf.constant([[1., 1., 1.], [2., 2., 2.]])

tf.reduce_mean(x) → 1.5
tf.reduce_mean(x, 0) → [1.5, 1.5, 1.5]
tf.reduce_mean(x, 1) → [1., 2.]
tf.reduce_mean(x, 1, keepdims=True) → [[1.], [2.]]
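Putting tf.square and tf.reduce_sum together (a small sketch of our own), the Euclidean norm mentioned earlier can be computed as:

x = tf.constant([1., 2., 3., 4., 5.])
norm = tf.sqrt(tf.reduce_sum(tf.square(x)))   # sqrt(1 + 4 + 9 + 16 + 25) = sqrt(55)
# tf.norm(x) computes the same quantity directly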


Tensorflow Basics – Broadcasting

Broadcasting: set of rules to add/sub/mult … tensors of different shapes

x = tf.constant([1, 2, 3])
y = tf.constant(4)
z = x + y → [5, 6, 7]

Not very different from MATLAB…

x = tf.constant([1, 2, 3])
y = tf.constant([4, 5])
z = x + y → Error! (the shapes are incompatible)

You can look up all the broadcasting rules on the Tensorflow website.
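A common machine-learning use of broadcasting (our own sketch, not from the slides) is adding a bias row vector to every row of a data matrix:

X = tf.constant([[1., 2.], [3., 4.], [5., 6.]])   # shape (3, 2)
b = tf.constant([10., 20.])                       # shape (2,)
Z = X + b                                         # b is broadcast across the 3 rows
# Z = [[11., 22.], [13., 24.], [15., 26.]]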


Building a Tensorflow Model – Part 1

Tensorflow (version 1.x) allows for two modes of computation

1. Computation (or dataflow) graph: builds a graph consisting of tensors (edges) and operations (nodes), then “run the graph” to get the value. This is the default setting and the most common way to build a model (in version 1.x).

Why a computation graph?


A GPU or a multi-core CPU can compute the values of different branches or subgraphs in parallel, which parallelizes training and scales up models.

The computation graph resides in CPU or GPU memory.

Computation graphs are visualized using Tensorboard (more to come).


Aside: deep neural networks are usually visualized as computation graphs as well in many publications. This diagram is from a recent paper titled "A Style-Based Generator Architecture for Generative Adversarial Networks" by Karras, Laine, Aila (Dec 12, 2018).


Computation graph = Block diagrams
Tensors = Signals
Tensor Operations = Systems
Optimizers = Feedback control laws
Weights = State variables

Conclusion: most ML models are discrete-time nonlinear feedback control systems! Certainly all the ones in this course.

Next tutorial: perceptron algorithm

Example 1: Given x = 3, y = 4, find f = x²y + y + 2. Solve using the computation graph.

import tensorflow as tf           # Import tensorflow module (must have!)

x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2                 # Build computation graph (we will show later)


We do not have the value of f at this stage!!!

All tensorflow does at this point is construct the computation graph

It is not possible to get the value of any tensor without running the “graph”.

How to run the graph?

f.eval()
>> Error: cannot evaluate tensor.


Example 1: Given x = 3, y = 4, find f = x²y + y + 2. Solve using the computation graph.

import tensorflow as tf                         # Import tensorflow module (must have!)

x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2                               # Build computation graph

init_op = tf.global_variables_initializer()     # Create a node to initialize all variables (x, y) of the graph

with tf.Session() as sess:
    sess.run(init_op)
    print(sess.run(f))

>> 42

Run the graph within a "session", which is the execution environment. sess.run(init_op) must be run before any other node, as it actually sets x = 3, y = 4. If you do not include it: "Error: Attempting to use uninitialized value Variable".


Example 1: Given x = 3, y = 4, find f = x²y + y + 2. Solve using the computation graph – alternative version.

import tensorflow as tf

sess = tf.InteractiveSession()

x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2

init_op = tf.global_variables_initializer()
sess.run(init_op)
print(sess.run(f))

sess.close()

This method expands the scope of the session; just remember to close it.


Let’s visualize the computation graph using Tensorboard!


Example 1: Given x = 3, y = 4, find f = x²y + y + 2. Solve using the computation graph.

import tensorflow as tf

x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2

init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    writer = tf.summary.FileWriter("path to directory", sess.graph)
    sess.run(init_op)
    print(sess.run(f))
    writer.close()

Procedure to open Tensorboard:

1. Add the two writer lines of code in the 'with' block
2. Open Anaconda Prompt, and type:
   tensorboard --logdir=path to directory --host=127.0.0.1
3. Type localhost:6006 in a browser

e.g., "C:/Users/bolin/Desktop/Tutorial_1/Example" (absolutely no spaces in directory names)


A lot more going on under the hood…

Notice the difference between tf.constant (2) and tf.Variable (y). A constant is just an "injection" into a node. A variable has more going on because it can be reassigned during execution.


Example 2: Given x = 3, y = 4, find f = x²y + y + 2. Solve using the computation graph, but using tf.constant.

import tensorflow as tf           # Import tensorflow module (must have!)

x = tf.constant(3, name="x")
y = tf.constant(4, name="y")
f = x*x*y + y + 2                 # Build computation graph

Example 2: Given x = 3, y = 4, find f = x²y + y + 2. Solve using the computation graph, but using tf.constant.

import tensorflow as tf

x = tf.constant(3, name="x")
y = tf.constant(4, name="y")
f = x*x*y + y + 2

init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    writer = tf.summary.FileWriter("path to directory", sess.graph)
    sess.run(init_op)
    print(sess.run(f))
    writer.close()

>> 42

If only using tf.constants, we do not need to initialize.


Example 2: Given x = 3, y = 4, find f = x²y + y + 2. Solve using the computation graph, but using tf.constant. Since only constants are used, the initializer can be dropped:

import tensorflow as tf

x = tf.constant(3, name="x")
y = tf.constant(4, name="y")
f = x*x*y + y + 2

with tf.Session() as sess:
    writer = tf.summary.FileWriter("path to directory", sess.graph)
    print(sess.run(f))
    writer.close()

Building a Tensorflow Model – Part 2

Tensorflow allows for two modes of computation

2. Eager execution: unlike the computation-graph mode, eager execution executes operations immediately.

This mode is not the default setting and must be declared ahead of time.

Eager execution works just like MATLAB: add two numbers, immediately get the result.


Example 3: Given x = 3, y = 4, find f = x²y + y + 2. Same operation, but using Eager Execution.

import tensorflow as tf          # Import tensorflow module (must have!)

tf.enable_eager_execution()      # Activate eager execution mode

x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2                # The tensor f is evaluated immediately (just like MATLAB)

print(f)                         # Printing f shows that f is a tensor
>> tf.Tensor(42, shape=(), dtype=int32)

print(f.numpy())                 # To access the actual value, convert f into a numpy array, then print its value
>> 42


import tensorflow as tf
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
m, n = housing.data.shape

# scale the features, then prepend the bias column of ones
scaled_housing_data = StandardScaler().fit_transform(housing.data)
scaled_housing_data_plus_bias = np.c_[np.ones((m, 1)), scaled_housing_data]

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")

y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(1000):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)
    best_theta = theta.eval()   # theta.eval() is the same as sess.run(theta)

5 minute break. Meanwhile, think about what this code is doing (where are we defining the graph? Where are we running the graph?)

Next up: Optimizer, tf.Variable versus tf.placeholder, quick words about Numpy, Tensorflow 2.0, and Demo

Detour: optimization using Tensorflow Optimizers

Suppose that we want to solve the following unconstrained optimization problem

    min_x f(x)

We can find x_opt using gradient descent, x_{k+1} = x_k − η∇f(x_k), with η our learning rate.

To create an optimization node in our graph, we write:

optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.01)

training_op = optimizer.minimize(f)

Optimization node automatically finds all the variable tensors that the loss function depends on, and updates them one step at a time during execution.


Example: minimize f(x) = (log(x))²

import tensorflow as tf

x = tf.Variable(5, name='x', dtype=tf.float32)       # Set initial value to be 5

f = tf.square(tf.log(x))                              # Define function

optimizer = tf.train.GradientDescentOptimizer(0.1)    # Define optimizer (with learning rate 0.1)

train = optimizer.minimize(f)                         # Define optimizer node

init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    print("starting at", "x:", sess.run(x), "log(x)^2:", sess.run(f))
    for step in range(100):                           # Run gradient descent for 100 steps; note range(100) = 0, 1, 2, …, 99
        sess.run(train)
        print("step", step+1, "x:", sess.run(x), "log(x)^2:", sess.run(f))

>> … step 100: x = 1.000216 log(x)^2: 4.6555176e-10

import matplotlib.pyplot as plt

import numpy as np

x = np.linspace(0.1,10,100)

f = np.square(np.log(x))

plt.plot(x, f, 'b')

plt.show()


Difference between tf.Variable and tf.placeholder

tf.Variable has a value upon construction and can be "mutated" (assigned different values). For us, these are the trainable weights.

tf.placeholder does not possess a value at the construction phase of the graph and cannot be mutated. For us, these are the training data and labels.

Only the type of a placeholder needs to be specified (although you can also specify the shape).

We "feed" data as a dictionary to the tf.placeholder at the same time as we evaluate the graph using sess.run (syntax: {placeholder: value}).


Example 4: Given x = 3, y = 4, find f = x²y + y + c. Solve using the computation graph, where c is a placeholder.

import tensorflow as tf

x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
c = tf.placeholder(tf.int32)                  # Create a placeholder

f = x*x*y + y + c

init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    writer = tf.summary.FileWriter("path to directory", sess.graph)
    sess.run(init_op)
    print(sess.run(f, {c: 2}))                # Get the value of tensor f by feeding in the value of c
    writer.close()

>> 42


Example 4 (continued): Given x = 3, y = 4, find f = x²y + y + c + d and g = x*y + d, where c, d are placeholders.

import tensorflow as tf

x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
c = tf.placeholder(tf.int32)
d = tf.placeholder(tf.int32)

f = x*x*y + y + c + d
g = x*y + d

init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    print(sess.run([f, g], {c: 2, d: 3}))

>> [45, 15]

Fetch multiple values at once by grouping them into a list.


A quick word about Numpy…

Numpy is the numerical computing library for Python.

import numpy as np

Similar to MATLAB, low learning curve.

It is used together with Tensorflow! (For instance, Tensorflow automatically converts feed-dictionary numpy arrays into Tensor objects.)

Popular applications:

Loading data: np.load('./data.npy')
Initializing variables: np.zeros
Math operations: np.argmax, np.transpose


About Tensorflow 2.0

Tensorflow is migrating from 1.x to 2.0

Here are some main changes:

1. Fully integrated with Keras, a high level API for Tensorflow (much cleaner syntax)

2. Uses eager execution by default; the with tf.Session() workflow is removed

3. Tensorboard is fully integrated with Tensorflow (no more need to resolve localhost:6006)

4. Miscellaneous removal of duplicate functionalities (major problem in TF 1.x)

5. Syntactical consistency


Current status of Tensorflow 2.0 (Sept, 2019)

Tensorflow 2.0 Release Candidate (RC) version is available as nightly build, but full release not available until around December of 2019.

All Tensorflow 1.x code will have to be migrated… many 1.x APIs are deprecated, and you have to write "tf.compat.v1" everywhere.

Should I learn TF 1.x or 2.0? What do I do?

From Tensorflow engineers at Google:

Learn Keras.

Keras wraps around TF 1.x code and will be fully compatible with TF 2.0.
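For a sense of what that looks like, here is a minimal Keras sketch (our own toy example with made-up layer sizes and random data, not part of the tutorial):

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='sgd', loss='mse')

X = np.random.rand(100, 10).astype(np.float32)   # fake inputs
y = np.random.rand(100, 1).astype(np.float32)    # fake targets
model.fit(X, y, epochs=5, batch_size=16)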


Topics that we left out…

tf.data (possible replacement for tf.placeholder)

tf.assign (another way to assign values into variables / alternative to initializers)

tf.expand_dims and tf.squeeze (commonly used shape manipulation)

Saving and resuming your session (very simple code: see other document)

Trainability of Variables (non-trainable variables cannot be modified by Optimizer)

Dynamic versus static shape (placeholders often do not have a dynamic shape)

Plotting results using Matplotlib and seaborn

Estimators and feature columns (another high-level tensorflow API)

Other software and tools: Colab, GCE, Pytorch, Pandas, Scikit, JAX, Sonnet, etc.

You will find out about them as we go along…


Recommended Programming References

Deep Learning with Python – François Chollet (author of Keras)

Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow – Aurélien Géron

(electronic version already out, paperback out in two weeks!)

Python Data Science Handbook – Jake VanderPlas

An introduction to Computer Science using Python 3.6 – Paul Gries, Greg Wilson, Jason Montojo, Jennifer T. Campbell


Tutorial 2

Binary Linear Classification and the Perceptron Algorithm

Bolin Gao

Sept 19, 2019


Binary Linear Classification

In binary linear classification, we are given a data set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$, $d \geq 1$, and $y_n \in \mathcal{Y} = \{+1, -1\}$. For instance, suppose we have a data set of images of cats and dogs. Each example $x_n$ is a vectorized RGB image of a cat or a dog. We assume $y_n = +1$ if it is a cat, $y_n = -1$ if it is a dog.

Next, we seek a function that takes in $x_n \in \mathcal{X}$ and outputs a prediction $y \in \mathcal{Y}$. The function we select will come from the following hypothesis set,
$$\mathcal{H} = \left\{ h : \mathcal{X} \to \mathcal{Y},\; x \mapsto y \;\Big|\; h(x) = \mathrm{sgn}\Big(\sum_{i=1}^{d} w_i x_i + b\Big),\; w \in \mathbb{R}^d,\; b \in \mathbb{R} \right\} \tag{1}$$
where $\mathrm{sgn}$ is the signum function,
$$\mathrm{sgn}\Big(\sum_{i=1}^{d} w_i x_i + b\Big) = \begin{cases} +1 & \text{if } \sum_{i=1}^{d} w_i x_i > -b \\[4pt] -1 & \text{if } \sum_{i=1}^{d} w_i x_i < -b \end{cases}$$

Suppose we have selected an arbitrary hypothesis function $h \in \mathcal{H}$; the next step would be to evaluate how good such a function is at predicting our label within the examples that we are given. Our assumption is that, if our hypothesis does well on the examples we are given, then hopefully it will do well on the examples we are not given. To do so, we introduce the $0\text{–}1$ loss, $L_{0-1} : \mathcal{Y} \times \mathcal{Y} \to \{0, 1\}$,
$$L_{0-1}(h(x), y) = \begin{cases} 0 & \text{if } h(x) = y \\ 1 & \text{if } h(x) \neq y \end{cases} \tag{2}$$

Remark 1. We note that the total $0\text{–}1$ loss over the data set is equivalently written (in your book) as $\sum_{n=1}^{N} [\![\, h(x_n) \neq y_n \,]\!]$. This is the Iverson bracket notation.

Then our in-sample error is the sum total of our loss, $E_{\text{in}} : \mathcal{H} \to \mathbb{R}$,
$$E_{\text{in}}(h) = \frac{1}{N}\sum_{n=1}^{N} L_{0-1}(h(x_n), y_n) \tag{3}$$

Formally, binary linear classification involves the following problem.

Binary Linear Classification Problem
Given a data set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$ and $y_n \in \{+1, -1\}$.
Find a hypothesis $g \in \mathcal{H}$ such that $g = \arg\min_{h \in \mathcal{H}} E_{\text{in}}(h)$, where $E_{\text{in}}(h) = \frac{1}{N}\sum_{n=1}^{N} L_{0-1}(h(x_n), y_n)$.

Observe that the above problem is formulated in terms of minimizing over a function space. Since our hypothesis is associated with the weight parameter $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$, which uniquely identify each hypothesis function, we can equivalently state the problem explicitly in terms of the weight vector as follows.

Binary Linear Classification Problem (Equivalent formulation)
Given a data set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$ and $y_n \in \{+1, -1\}$.
Find a weight vector $w^\star \in \mathbb{R}^d$ and bias $b^\star \in \mathbb{R}$ such that $(w^\star, b^\star) = \arg\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} E_{\text{in}}(w, b)$, where $E_{\text{in}}(w, b) = \frac{1}{N}\sum_{n=1}^{N} L_{0-1}\big(\mathrm{sgn}\big(\sum_{i=1}^{d} w_i x_{n,i} + b\big), y_n\big)$.

We note that it is common for the bias to be grouped in with the weights. To this end, we can redefine our weight vector and our data vector as
$$w = \begin{bmatrix} b & w_1 & w_2 & \ldots & w_d \end{bmatrix}^\top \in \mathbb{R}^{d+1} \tag{4}$$
and
$$x = \begin{bmatrix} 1 & x_1 & x_2 & \ldots & x_d \end{bmatrix}^\top \in \mathbb{R}^{d+1} \tag{5}$$
This is the notation we will use in the following section.


Perceptron Learning algorithm

We claim that an optimal weight vector $w^\star$ can be found whenever the data set $\mathcal{D}$ is linearly separable. Assuming that our data is linearly separable, the perceptron learning algorithm (PLA) is as follows:

Algorithm 1: Perceptron Learning Algorithm (PLA)
Input: $\mathcal{D} \neq \emptyset$, assumed to be linearly separable
Output: Optimal weight $w^\star$

while there exists $n \in \{1, \ldots, N\}$ such that $y_n \neq \mathrm{sgn}(w_k^\top x_n)$ do
    find any $n$ such that $y_n \neq \mathrm{sgn}(w_k^\top x_n)$
    $w_{k+1} = w_k + y_n x_n$
end

Remark 2. Observe that if our data is not linearly separable, then the weight will forever oscillate.
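As a minimal NumPy sketch (ours, not part of the handout) of the update rule above, assuming the data matrix already contains the leading column of 1s:

import numpy as np

def pla(X, y, max_iters=1000):
    # X: (N, d+1) data matrix with leading 1s, y: (N,) labels in {+1, -1}
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:
            return w                 # every example is classified correctly
        n = misclassified[0]         # pick any misclassified example
        w = w + y[n] * X[n]          # perceptron update: w_{k+1} = w_k + y_n x_n
    return w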

An intuitive proof showing why $w_k$ converges to $w^\star$ is as follows. Suppose $(x_n, y_n)$ is a data point that was misclassified at iteration $k$. This must mean that $w_k^\top x_n$ has the opposite sign of $y_n$ (by definition of what misclassified means), i.e., $y_n(w_k^\top x_n) < 0$. Now we need to show that the next weight improves upon the classification, which means that $y_n(w_k^\top x_n) < y_n(w_{k+1}^\top x_n)$. This can be shown by a direct calculation:
$$\begin{aligned}
y_n(w_{k+1}^\top x_n) &= y_n\big((w_k + y_n x_n)^\top x_n\big) \\
&= y_n\big(w_k^\top x_n + y_n x_n^\top x_n\big) \\
&= y_n\big(w_k^\top x_n + y_n \|x_n\|_2^2\big) \\
&= y_n w_k^\top x_n + y_n^2 \|x_n\|_2^2 \\
&= y_n w_k^\top x_n + \|x_n\|_2^2 \\
&> y_n w_k^\top x_n
\end{aligned}$$
We note that the last inequality always holds, because the first entry of $x_n$ is assumed to be $1$, i.e., $x_{n,0} = 1$, and the norm $\|x\| = 0$ if and only if $x = 0$; hence $\|x_n\|_2^2 > 0$. This means that we are moving in the correct direction.

Theorem 1. Assuming that $w(0) = 0$, the perceptron learning algorithm converges in a finite number of iterations.

Remark 3. Following the convention in your textbook, we will use the notation $w(t)$ to denote the weight vector at iteration $t = 1, \ldots$. The proof roughly follows the one given by Block and Novikoff in 1962. The symbol $\|\cdot\|_2$ denotes the Euclidean norm, i.e., $\|v\|_2 = \sqrt{v_1^2 + v_2^2 + \cdots + v_d^2}$, for any vector $v \in \mathbb{R}^d$.

Proof. We will divide the proof into the following parts:

1. Let $\rho = \min_{1 \leq n \leq N} y_n(w^{\star\top} x_n)$; show $\rho > 0$.

If $w^\star$ is optimal, then $y_n(w^{\star\top} x_n) > 0, \forall n$. This implies that $\rho > 0$.

2. Show that $w(t)^\top w^\star \geq w(t-1)^\top w^\star + \rho$ and conclude $w(t)^\top w^\star \geq t\rho$.

(Use definition)
$$\begin{aligned}
w(t)^\top w^\star &= (w(t-1) + y_n x_n)^\top w^\star \\
&= w(t-1)^\top w^\star + y_n x_n^\top w^\star \\
&\geq w(t-1)^\top w^\star + \min_{1 \leq n \leq N} y_n x_n^\top w^\star \\
&= w(t-1)^\top w^\star + \rho
\end{aligned}$$


Starting from $t = 1$, we have,
$$\begin{aligned}
w(1)^\top w^\star &\geq \rho \\
w(2)^\top w^\star &\geq w(1)^\top w^\star + \rho \geq 2\rho \\
&\;\;\vdots \\
w(t)^\top w^\star &\geq t\rho.
\end{aligned}$$

3. Show that $\|w(t)\|_2^2 \leq \|w(t-1)\|_2^2 + \|x(t-1)\|_2^2$, where $(x(t-1), y(t-1))$ is a misclassified instance at the previous time instant.

(Use definition)
$$\begin{aligned}
\|w(t)\|_2^2 &= \|w(t-1) + y(t-1)x(t-1)\|_2^2 \\
&= \big(w(t-1) + y(t-1)x(t-1)\big)^\top \big(w(t-1) + y(t-1)x(t-1)\big) \\
&= \|w(t-1)\|_2^2 + 2\,y(t-1)\,w(t-1)^\top x(t-1) + \|y(t-1)x(t-1)\|_2^2 \\
&\leq \|w(t-1)\|_2^2 + \|y(t-1)x(t-1)\|_2^2 \\
&= \|w(t-1)\|_2^2 + |y(t-1)|^2\|x(t-1)\|_2^2 \\
&= \|w(t-1)\|_2^2 + \|x(t-1)\|_2^2
\end{aligned}$$
(The inequality uses the fact that $(x(t-1), y(t-1))$ was misclassified, so $y(t-1)\,w(t-1)^\top x(t-1) < 0$.)

4. Show by induction that $\|w(t)\|_2^2 \leq tR^2$, where $R = \max_{1 \leq n \leq N} \|x_n\|_2$.

(Use (3))
$$\begin{aligned}
t = 1:\quad \|w(1)\|_2^2 &\leq \underbrace{\|w(0)\|_2^2}_{=0} + \|x(0)\|_2^2 \leq \|x(0)\|_2^2 \\
\|w(2)\|_2^2 &\leq \|w(1)\|_2^2 + \|x(1)\|_2^2 \leq \|x(0)\|_2^2 + \|x(1)\|_2^2 \\
&\;\;\vdots \\
\|w(t)\|_2^2 &\leq \sum_{k=0}^{t-1} \|x(k)\|_2^2, \qquad x(k) \in \{x_1, \ldots, x_N\} \\
\|w(t)\|_2^2 &\leq \sum_{k=0}^{t-1} \Big(\max_{1 \leq n \leq N} \|x_n\|_2\Big)^2 = tR^2
\end{aligned}$$

5. Use (2) and (4) to show that $\dfrac{w(t)^\top w^\star}{\|w(t)\|_2} \geq \dfrac{\sqrt{t}\,\rho}{R}$, and hence $t \leq \dfrac{R^2\|w^\star\|_2^2}{\rho^2}$.

Recall that from (2) we have $w(t)^\top w^\star \geq t\rho$, and from (4) we have $\|w(t)\|_2^2 \leq tR^2$, which implies $\|w(t)\|_2 \leq \sqrt{t}\,R$. Therefore,
$$\frac{w(t)^\top w^\star}{\|w(t)\|_2} \geq \frac{t\rho}{\sqrt{t}\,R} = \frac{\sqrt{t}\,\rho}{R}.$$
By the Cauchy–Schwarz inequality, $w(t)^\top w^\star \leq \|w(t)\|_2\|w^\star\|_2$, so we have,
$$\frac{\|w(t)\|_2\|w^\star\|_2}{\|w(t)\|_2} = \|w^\star\|_2 \geq \frac{\sqrt{t}\,\rho}{R} \implies t \leq \frac{R^2\|w^\star\|_2^2}{\rho^2}$$
where $R = \max_{1 \leq n \leq N} \|x_n\|_2$, $\rho = \min_{1 \leq n \leq N} y_n(w^{\star\top} x_n)$, and $w^\star$ is the optimal weight.

Remark 4. The interpretation of this result is as follows: the perceptron updates every time an incorrectly labeled example is found. Hence $t$ represents the total number of updates that the perceptron makes. Since the number of updates is bounded (at most $T = \frac{R^2\|w^\star\|_2^2}{\rho^2}$), $w(t) \to w^\star$ in a finite number of iterations.


Observe also that we can write the bound as,
$$t \leq \frac{R^2\|w^\star\|_2^2}{\rho^2} = \frac{R^2}{\left[\dfrac{\rho}{\|w^\star\|_2}\right]^2} = R^2\left[\frac{\rho}{\|w^\star\|_2}\right]^{-2}.$$
Let $d = \dfrac{\rho}{\|w^\star\|_2}$; then $d$ is by definition the smallest distance between a point in the data set and the hyperplane $H = \{x \in \mathbb{R}^d \mid w^{\star\top}x = 0\}$. Therefore, if the distance $d$ is too small, then it may possibly take more iterations for the perceptron to get the weight correct, although in practice the perceptron algorithm converges to the optimal weight very quickly.

© Copyright by Bolin Gao 2019


Tutorial 3

Linear Regression

Bolin Gao

Sept 26, 2019


Linear Regression

In linear regression, we are given a data set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathcal{X} = \mathbb{R}^d$, $d \geq 1$, and $y_n \in \mathcal{Y} = \mathbb{R}$. Next, we seek a function that takes in $x_n \in \mathcal{X}$ and outputs a prediction $y \in \mathcal{Y}$. The function we select will come from the following hypothesis set,
$$\mathcal{H} = \left\{ h : \mathcal{X} \to \mathcal{Y},\; x \mapsto y \;\Big|\; h(x) = \sum_{i=1}^{d} w_i x_i + b,\; w \in \mathbb{R}^d,\; x \in \mathbb{R}^d \right\} \tag{1}$$
or in compact form,
$$\mathcal{H} = \left\{ h : \{1\} \times \mathcal{X} \to \mathcal{Y},\; x \mapsto y \;\Big|\; h(x) = \sum_{i=0}^{d} w_i x_i,\; w \in \mathbb{R}^{d+1},\; x \in \{1\} \times \mathbb{R}^d \right\} \tag{2}$$
The notation $\{1\} \times \mathbb{R}^d$ represents the set of vectors whose leading coefficient is $1$, i.e., $\{v \in \mathbb{R}^{d+1} \mid v = (1, x_1, x_2, \ldots, x_d)\}$.

We will use the convention
$$w = \begin{bmatrix} b & w_1 & w_2 & \ldots & w_d \end{bmatrix}^\top \in \mathbb{R}^{d+1} \tag{3}$$
$$x = \begin{bmatrix} 1 & x_1 & x_2 & \ldots & x_d \end{bmatrix}^\top \in \mathbb{R}^{d+1} \tag{4}$$
throughout the rest of this section.

We evaluate our hypothesis function through the squared loss,
$$L_{\text{sq}}(h(x), y) = (h(x) - y)^2, \qquad y \in \mathbb{R} \tag{5}$$
Then the in-sample error is given by,
$$E_{\text{in}}(h) = \frac{1}{N}\sum_{n=1}^{N} L_{\text{sq}}(h(x_n), y_n) = \frac{1}{N}\sum_{n=1}^{N} (w^\top x_n - y_n)^2. \tag{6}$$

A more compact form of the in-sample error can be constructed as follows. Define the data matrix and the target vector as,
$$X = \begin{bmatrix} x_1^\top \\ \vdots \\ x_N^\top \end{bmatrix} \in \mathbb{R}^{N \times (d+1)} \quad \text{and} \quad y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} \tag{7}$$
where we assume that the number of data points is much larger than the number of features, i.e., $N \gg d + 1$. The data matrix can also be expressed as a set of column vectors,
$$X = \begin{bmatrix} q_0 & q_1 & \ldots & q_d \end{bmatrix} \tag{8}$$
where each $q_i$, $i \in \{0, \ldots, d\}$, is one feature across all training examples, with $q_0 = \mathbf{1}$ (the column of ones).

Then the in-sample error function can be re-written as,
$$E_{\text{in}}(w) = \frac{1}{N}\sum_{n=1}^{N}(w^\top x_n - y_n)^2 = \frac{1}{N}\left\| \begin{bmatrix} w^\top x_1 \\ \vdots \\ w^\top x_N \end{bmatrix} - \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} \right\|_2^2 = \frac{1}{N}\left\| \begin{bmatrix} x_1^\top \\ \vdots \\ x_N^\top \end{bmatrix} w - \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} \right\|_2^2$$
$$\implies E_{\text{in}}(w) = \frac{1}{N}\|Xw - y\|_2^2$$

Linear Regression Problem
Given a data set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$ and $y_n \in \mathbb{R}$.
Find a weight vector $w^\star \in \mathbb{R}^{d+1}$ such that $w^\star = \arg\min_{w \in \mathbb{R}^{d+1}} E_{\text{in}}(w)$, where $E_{\text{in}}(w) = \frac{1}{N}\|Xw - y\|_2^2$.

Remark 1. The solution to the linear regression problem is referred to as the least-squares solution. Suppose that $X$ is square and invertible; then the solution to the above problem is simply $w^\star = X^{-1}y$. This solution yields an in-sample error $E_{\text{in}}(w^\star) = 0$. However, this would mean that $N = d + 1$ (the number of data points is equal to the dimension of our data vector, plus bias). In practice, this is almost never true.


Solution to the Linear Regression problem

Theorem 1. Suppose that the matrix $X^\top X$ is invertible; then $w^\star = (X^\top X)^{-1}X^\top y$ is the solution to the linear regression problem.

Remark 2. We will offer the proof by using linear algebra. The proof using differentiation is in the lecture notes.

Proof. To begin, we need to define four fundamental objects in linear algebra:

Column space of $X$: $C(X) = \{y \in \mathbb{R}^N \mid y = Xw = \sum_{i=0}^{d} w_i q_i,\; w_i \in \mathbb{R},\; q_i \in \mathbb{R}^N\}$. This is the set of all linear combinations of the columns of $X$.

Orthogonal complement of the column space $C(X)$: $C(X)^\perp = \{e \in \mathbb{R}^N \mid e^\top y = 0,\; \forall y \in C(X)\} \subseteq \mathbb{R}^N$

Null space of $X$: $N(X) = \{w \in \mathbb{R}^{d+1} \mid Xw = 0,\; 0 \in \mathbb{R}^N\} \subseteq \mathbb{R}^{d+1}$

Left null space of $X$: $N(X^\top) = \{e \in \mathbb{R}^N \mid X^\top e = 0,\; 0 \in \mathbb{R}^{d+1}\} \subseteq \mathbb{R}^N$

We claim that $C(X)^\perp = N(X^\top)$.
($\subseteq$) First, we show that $N(X^\top) \subseteq C(X)^\perp$. Let $e \in N(X^\top)$; then $X^\top e = 0$ and hence $e^\top X = 0^\top$. Multiplying both sides by $w$ yields $e^\top X w = 0^\top w \implies e^\top y = 0$ for every $y = Xw \in C(X)$, so $e \in C(X)^\perp$.
($\supseteq$) Next, we show that $C(X)^\perp \subseteq N(X^\top)$. Let $e \in C(X)^\perp$; then $e^\top y = 0$ for all $y = Xw \in C(X)$, i.e., $e^\top(Xw) = (X^\top e)^\top w = 0$ for all $w$, which forces $X^\top e = 0$. Hence $e \in N(X^\top)$.

Next, we define the projection of $y \in \mathbb{R}^N$ onto $C(X)$ as,
$$y_{\text{ls}} = Xw^\star,\quad w^\star \in \mathbb{R}^{d+1}, \quad \text{such that} \quad (y_{\text{ls}} - y)^\top y_w = 0,\; \forall y_w \in C(X). \tag{9}$$
Since $(y_{\text{ls}} - y)^\top y_w = 0,\; \forall y_w \in C(X)$, we have $y_{\text{ls}} - y \in C(X)^\perp$. But recall that $C(X)^\perp = N(X^\top)$, therefore $y_{\text{ls}} - y \in N(X^\top)$. This means $X^\top(y_{\text{ls}} - y) = 0$. Substituting in the definition of $y_{\text{ls}}$ we have $X^\top(Xw^\star) = X^\top y$, therefore,
$$w^\star = (X^\top X)^{-1}X^\top y \tag{10}$$
whenever $X^\top X$ is invertible, and the orthogonal projection is $y_{\text{ls}} = X(X^\top X)^{-1}X^\top y$.

Remark 3. Note that $X(X^\top X)^{-1}X^\top \neq I$ in general. A possible wrong proof is as follows: $X(X^\top X)^{-1}X^\top = X(X^{-1}X^{\top^{-1}})X^\top = I$. However, the error is that $X$ is not invertible (it is not even square).

Remark 4. The symmetric matrix $X^\top X$ is always positive semidefinite for any $X$: this means $z^\top X^\top X z \geq 0$ for all $z \in \mathbb{R}^{d+1}$. The proof is very straightforward: $z^\top X^\top X z = (Xz)^\top(Xz) = \|Xz\|_2^2 \geq 0$. The eigenvalues of a symmetric positive semidefinite matrix are all nonnegative, and some of them may be zero.

If at least one eigenvalue is zero, the matrix is not invertible. This is because the determinant is the product of the eigenvalues of a matrix, so it is zero in that case. Think back to the inverse formula of $2 \times 2$ matrices; this causes a division-by-zero problem.

To ensure that $X^\top X$ is invertible, we assume that it is positive definite: this means $z^\top X^\top X z > 0$ for all $z \in \mathbb{R}^{d+1}\setminus\{0\}$. A symmetric positive definite matrix has all positive eigenvalues.
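Numerically (a small NumPy sketch of our own, with random data), the least-squares solution of Theorem 1 is computed by solving the normal equations:

import numpy as np

N, d = 100, 3
X = np.c_[np.ones((N, 1)), np.random.rand(N, d)]   # data matrix with a leading column of 1s
y = np.random.rand(N, 1)                           # targets

w_star = np.linalg.solve(X.T @ X, X.T @ y)         # solves (X^T X) w = X^T y
# np.linalg.lstsq(X, y, rcond=None)[0] returns the same solution more stably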


Analysis of the In-Sample Error

Theorem 2. Assuming $X^\top X$ is positive definite (hence invertible), the in-sample error (ignoring the $\frac{1}{N}$ coefficient),
$$E_{\text{in}}(w) = \|y - Xw\|_2^2 \tag{11}$$
can be written as,
$$\begin{aligned}
E_{\text{in}}(w) &= \|y_w - y_{\text{ls}}\|_2^2 + \|y - y_{\text{ls}}\|_2^2 \\
&= \big(w - (X^\top X)^{-1}X^\top y\big)^\top X^\top X \big(w - (X^\top X)^{-1}X^\top y\big) + y^\top\big(I - X(X^\top X)^{-1}X^\top\big)y
\end{aligned} \tag{12}$$
where $y_w = Xw$, $y_{\text{ls}} = X(X^\top X)^{-1}X^\top y = Xw^\star$, and $w^\star = (X^\top X)^{-1}X^\top y$. At the optimal weight $w^\star$, the in-sample error is given by,
$$E_{\text{in}}(w^\star) = y^\top\big(I - X(X^\top X)^{-1}X^\top\big)y \tag{13}$$

Proof. First, we show that $E_{\text{in}}(w)$ can be decomposed into two parts. Observe that the two vectors $y_w - y_{\text{ls}}$ and $y - y_{\text{ls}}$ are orthogonal to each other (written as $y_w - y_{\text{ls}} \perp y - y_{\text{ls}}$), as one lies in the column space of $X$ and the other lies in the orthogonal complement of the column space.

For the vector lying in the column space of $X$, we have,
$$\begin{aligned}
\|y_w - y_{\text{ls}}\|_2^2 &= \big(y_w - X(X^\top X)^{-1}X^\top y\big)^\top \big(y_w - X(X^\top X)^{-1}X^\top y\big) \\
&= \big(Xw - X(X^\top X)^{-1}X^\top y\big)^\top \big(Xw - X(X^\top X)^{-1}X^\top y\big) \\
&= \big(w - (X^\top X)^{-1}X^\top y\big)^\top X^\top X \big(w - (X^\top X)^{-1}X^\top y\big)
\end{aligned} \tag{14}$$

For the vector that lies in the orthogonal complement of the column space of $X$, we have,
$$\begin{aligned}
\|y - y_{\text{ls}}\|_2^2 &= \big(y - X(X^\top X)^{-1}X^\top y\big)^\top \big(y - X(X^\top X)^{-1}X^\top y\big) \\
&= y^\top y - 2y^\top X(X^\top X)^{-1}X^\top y + y^\top X(X^\top X)^{-1}\underbrace{(X^\top X)(X^\top X)^{-1}}_{=I}X^\top y \\
&= y^\top y - y^\top X(X^\top X)^{-1}X^\top y \\
&= y^\top\big(I - X(X^\top X)^{-1}X^\top\big)y
\end{aligned} \tag{15}$$

Therefore, since $y - Xw = (y - y_{\text{ls}}) + (y_{\text{ls}} - y_w)$, by the Pythagorean theorem ($\|u + v\|_2^2 = \|u\|_2^2 + \|v\|_2^2$ if and only if $u \perp v$) we obtain,
$$\begin{aligned}
E_{\text{in}}(w) &= \|y_w - y_{\text{ls}}\|_2^2 + \|y - y_{\text{ls}}\|_2^2 \\
&= \big(w - (X^\top X)^{-1}X^\top y\big)^\top X^\top X \big(w - (X^\top X)^{-1}X^\top y\big) + y^\top\big(I - X(X^\top X)^{-1}X^\top\big)y
\end{aligned} \tag{16}$$

At $w^\star = (X^\top X)^{-1}X^\top y$, we obtain,
$$E_{\text{in}}(w^\star) = y^\top\big(I - X(X^\top X)^{-1}X^\top\big)y \tag{17}$$
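A quick numerical check of the decomposition in (16) (our own sketch, with random data):

import numpy as np

N, d = 50, 2
X = np.c_[np.ones((N, 1)), np.random.rand(N, d)]
y = np.random.rand(N, 1)
w = np.random.rand(d + 1, 1)                                # an arbitrary weight vector

w_star = np.linalg.solve(X.T @ X, X.T @ y)
y_w, y_ls = X @ w, X @ w_star

lhs = np.sum((y - y_w) ** 2)                                # E_in(w), ignoring the 1/N factor
rhs = np.sum((y_w - y_ls) ** 2) + np.sum((y - y_ls) ** 2)   # the two orthogonal pieces
print(np.isclose(lhs, rhs))                                 # True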

The Level Sets of the In-Sample Error

Recall that the level set of a function $f : \mathbb{R}^n \to \mathbb{R}$ is the set
$$L_c(f) = \{x \in \mathbb{R}^n \mid f(x) = c\}, \qquad c \in \mathbb{R} \tag{18}$$
In this section, we briefly examine the level set of the in-sample error, and show that the level set is characterized by the $X^\top X$ matrix.

This set is most easily visualized in 2D and lower. Hence we assume that $w = \begin{bmatrix} w_1 & w_2 \end{bmatrix}^\top \in \mathbb{R}^2$ and, similarly, $x = \begin{bmatrix} x_1 & x_2 \end{bmatrix}^\top \in \mathbb{R}^2$ (ignoring the bias term). We assume that the matrix $X^\top X$ is positive definite (all eigenvalues are positive).


Figure 1: A possible graph of the in-sample error in 2D. The level sets are ellipses, as clearly shown. Why is this the case, how do these ellipses tilt, and where is the center of the ellipse?

From the previous section, the in-sample error can be written as,
$$E_{\text{in}}(w) = \big(w - (X^\top X)^{-1}X^\top y\big)^\top X^\top X \big(w - (X^\top X)^{-1}X^\top y\big) + \underbrace{y^\top\big(I - X(X^\top X)^{-1}X^\top\big)y}_{\text{constant}} \tag{19}$$
Since the constant term only shifts the graph up or down, we can safely ignore it in our analysis of the level set. Hence we obtain,
$$E_{\text{in}}(w) = \big(w - (X^\top X)^{-1}X^\top y\big)^\top X^\top X \big(w - (X^\top X)^{-1}X^\top y\big) = c \tag{20}$$
where we assume that our constant $c$ is equal to one, i.e., $c = 1$.

Let $\Delta w = w - (X^\top X)^{-1}X^\top y = w - w^\star$; then,
$$E_{\text{in}}(w) = (\Delta w)^\top X^\top X (\Delta w) = 1 \tag{21}$$

We proceed with simplifying the above expression further using the following well-known theorem.

Spectral Theorem

Theorem 3. Every symmetric matrix $A$ has the factorization $A = Q\Lambda Q^\top$, where
$$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n) = \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix}$$
is a diagonal matrix, $\lambda_i$ is the $i$-th real eigenvalue of $A$, and
$$Q = \begin{bmatrix} v_1 \,|\, \cdots \,|\, v_n \end{bmatrix}$$
is an orthogonal matrix ($Q^\top = Q^{-1}$) and $v_i$ is an orthonormal (i.e., $v_i^\top v_j = 0$ if $i \neq j$, and $v_i^\top v_i = 1$) eigenvector associated with $\lambda_i$ (i.e., $Av_i = \lambda_i v_i$).

Since our matrix $X^\top X$ is symmetric, by the spectral theorem we can write $X^\top X = Q\Lambda Q^\top$, where $\Lambda$ and $Q$ are defined above. In this case, $\Lambda, Q \in \mathbb{R}^{2 \times 2}$. Hence,
$$E_{\text{in}}(w) = (\Delta w)^\top X^\top X (\Delta w) = (Q^\top \Delta w)^\top \Lambda\, Q^\top \Delta w = 1 \tag{22}$$
Let $\Delta z = Q^\top \Delta w$; then,


$$\begin{aligned}
E_{\text{in}}(w) &= (\Delta w)^\top X^\top X (\Delta w) \\
&= (Q^\top \Delta w)^\top \Lambda\, Q^\top \Delta w \\
&= (\Delta z)^\top \Lambda (\Delta z) \\
&= \begin{bmatrix} \Delta z_1 & \Delta z_2 \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \begin{bmatrix} \Delta z_1 \\ \Delta z_2 \end{bmatrix} \\
&= \lambda_1 \Delta z_1^2 + \lambda_2 \Delta z_2^2 \\
&= \frac{\Delta z_1^2}{(1/\sqrt{\lambda_1})^2} + \frac{\Delta z_2^2}{(1/\sqrt{\lambda_2})^2} = 1
\end{aligned} \tag{23}$$

This is precisely the equation of an axis-aligned ellipse, centered at the origin, in the $\Delta z$-coordinates. If we assume that $\lambda_2 > \lambda_1$, then the ellipse has a major axis of length $\frac{1}{\sqrt{\lambda_1}}$ and a minor axis of length $\frac{1}{\sqrt{\lambda_2}}$. It is a circle whenever $\lambda_1 = \lambda_2$. However, this is the equation in terms of the $\Delta z$-coordinates. We wish to know what the level set looks like in the $w$-coordinates.

Transforming back involves the following two steps.

1. First, recall that $\Delta z = Q^\top \Delta w$. Using the fact that $Q^\top = Q^{-1}$, we obtain $Q\Delta z = \Delta w$. This means, given the unit vector $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ in the $\Delta z$-coordinates, we obtain $Q\begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} v_1 & v_2 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} = v_1 = \Delta w$. This implies that the unit axis in the $\Delta w$-coordinates is $v_1$, the first eigenvector of $X^\top X$. The other unit axis maps to the other eigenvector, $v_2$.

2. Now we have the representation of the ellipse in the $\Delta w$-coordinates; to transform back into the $w$-coordinates, simply note that $\Delta w = w - w^\star$. Consider the origin $\begin{bmatrix} 0 \\ 0 \end{bmatrix}$ in the $\Delta w$-coordinates; this is exactly equivalent to $w^\star$ in the $w$-coordinates.

The following figure represents the series of transformations.

Figure 2: Far left: the original ellipse in the $\Delta z$-coordinates; the lengths of the major and minor axes, $\frac{1}{\sqrt{\lambda_1}}$ and $\frac{1}{\sqrt{\lambda_2}}$, are shown (assuming $\lambda_2 > \lambda_1$). Middle: the ellipse in the $\Delta w$-coordinates; the eigenvectors $v_1, v_2$ of $X^\top X$ specify the direction (tilt) of the ellipse. Far right: the ellipse in the $w$-coordinates; all vectors are shifted by $w^\star$.

Remark 5. It is important to note that such a transformation from the $\Delta z$-coordinates to the $\Delta w$-coordinates will not distort the ellipse, meaning that the lengths of the major and minor axes remain the same. To see this, simply consider the point $\begin{bmatrix} \frac{1}{\sqrt{\lambda_1}} \\ 0 \end{bmatrix}$. A multiplication by $Q$ yields $Q\begin{bmatrix} \frac{1}{\sqrt{\lambda_1}} \\ 0 \end{bmatrix} = \frac{1}{\sqrt{\lambda_1}}v_1$. But this new vector has the exact same length as $\begin{bmatrix} \frac{1}{\sqrt{\lambda_1}} \\ 0 \end{bmatrix}$: indeed, $\left\|\frac{1}{\sqrt{\lambda_1}}v_1\right\| = \frac{1}{\sqrt{\lambda_1}}\|v_1\| = \frac{1}{\sqrt{\lambda_1}}$ since $v_1^\top v_1 = \|v_1\|_2^2 = 1 \implies \|v_1\| = 1$. (We made use of the orthonormality of $v_1$.)
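The quantities appearing in the figures can be computed directly (a sketch of ours, with random data): np.linalg.eigh returns $\Lambda$ and $Q$ of the symmetric matrix $X^\top X$, and the semi-axis lengths of the level-set ellipse (for $c = 1$) are $1/\sqrt{\lambda_i}$.

import numpy as np

N = 50
X = np.random.rand(N, 2)          # 2D data, ignoring the bias term
A = X.T @ X                       # symmetric and (generically) positive definite

lam, Q = np.linalg.eigh(A)        # A = Q diag(lam) Q^T, eigenvalues in ascending order
print(lam)                        # lambda_1, lambda_2
print(1 / np.sqrt(lam))           # semi-axis lengths of the level-set ellipse
print(Q)                          # columns are the eigenvectors v_1, v_2 (the tilt directions)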


A Related Problem: Polynomial Curve Fitting

We show that polynomial curve fitting is a generalization of linear regression for 1D data. Given $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, $x_n \in \mathbb{R}$, $y_n \in \mathbb{R}$, we wish to find a polynomial
$$h(x) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M,$$
with $M \geq 0$ denoting the order of the polynomial and $w = \begin{bmatrix} w_0 & w_1 & \ldots & w_M \end{bmatrix}^\top \in \mathbb{R}^{M+1}$, such that,
$$w^\star = \arg\min_{w \in \mathbb{R}^{M+1}} E_{\text{in}}(w) \tag{24}$$
where $E_{\text{in}}(w) = \frac{1}{N}\sum_{n=1}^{N}(h(x_n) - y_n)^2$.

This problem almost looks like a linear regression problem. Recall that for linear regression, $h(x_n) = w^\top x_n = w^\top \psi(x_n)$, where $\psi$ is the identity function, $\psi : \mathbb{R}^{d+1} \to \mathbb{R}^{d+1}$, $x \mapsto x$.

Following this idea, we define the feature map $\psi : \mathbb{R} \to \mathbb{R}^{M+1}$,
$$\psi(x) = \begin{bmatrix} 1 & x & x^2 & \ldots & x^M \end{bmatrix}^\top \tag{25}$$
then $h(x) = w^\top \psi(x) = \sum_{i=0}^{M} w_i \psi_i(x)$.

Therefore,
$$E_{\text{in}}(w) = \frac{1}{N}\sum_{n=1}^{N}(h(x_n) - y_n)^2 = \frac{1}{N}\sum_{n=1}^{N}\big(w^\top \psi(x_n) - y_n\big)^2 = \frac{1}{N}\sum_{n=1}^{N}\Big(\sum_{i=0}^{M} w_i \psi_i(x_n) - y_n\Big)^2 = \frac{1}{N}\|\Psi w - y\|_2^2,$$
where,
$$\Psi = \begin{bmatrix} 1 & x_1 & x_1^2 & \ldots & x_1^M \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_N & x_N^2 & \ldots & x_N^M \end{bmatrix} \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
Then the least-squares solution corresponding to this problem can be written as,
$$w = (\Psi^\top \Psi)^{-1}\Psi^\top y$$
and when $M + 1 = N$ (the number of features is equal to the number of data points) and $\Psi$ is invertible, we obtain,
$$w = (\Psi^\top \Psi)^{-1}\Psi^\top y = \Psi^{-1}\Psi^{\top^{-1}}\Psi^\top y = \Psi^{-1}y$$

% The following MATLAB code generates 10 randomly generated data points in the range 1 to 20,
% then fits a 4th order polynomial to them.

N = 10;
x = linspace(1, N, N);
y = round(1 + (20-1).*rand(N,1));

M = 4;
P = zeros(N, M+1);

for n = 1:1:N
    for m = 1:1:M+1
        P(n, m) = x(n)^(m-1);
    end
end

w = inv(P.'*P)*P.'*y;
plot(x, y, 'ro')

t = linspace(min(x), max(x), 1000);

L = 0;
for i = 1:1:M+1
    L_i = w(i).*t.^(i - 1);
    L = L + L_i;
end

hold on
plot(t, L, 'b')
axis([min(x), max(x), min(min(y), min(L)), max(max(y), max(L))])


Figure 3: Fitting 4th order polynomial to 10 randomly generated data points.

©Copyright by Bolin Gao 2019



Tutorial 4

Logistic Regression

Bolin Gao

Oct 3, 2019


Logistic Regression

In logistic regression, we are given a data set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, $x_n \in \mathcal{X} = \mathbb{R}^d$, $d \geq 1$, and $y_n \in \mathcal{Y} = \{-1, +1\}$. Even though it is called regression, the purpose of logistic regression is to classify the data into the two labels given above. Therefore, officially, we would like to learn from the following hypothesis class:
$$\mathcal{H} = \left\{ h : \{1\} \times \mathcal{X} \to \mathcal{Y},\; x \mapsto y \;\Big|\; h(x) = \mathrm{sgn}(\theta(w^\top x) - 0.5),\; w \in \mathbb{R}^{d+1},\; x \in \mathbb{R}^{d+1} \right\}$$

where $\theta$ is the logistic function, $\theta : \mathbb{R} \to (0,1) = \mathrm{int}([0,1])$, $z \mapsto \theta(z) = \dfrac{\exp(z)}{1 + \exp(z)} = \dfrac{1}{1 + \exp(-z)}$.

We briefly list some properties of the logistic function:

• $\lim_{z \to \infty} \theta(z) = 1$, $\lim_{z \to -\infty} \theta(z) = 0$

• $\theta(0) = 0.5$ (Interpretation: if a data point falls on the hyperplane, then $w^\top x = 0$, and $\theta(0) = 0.5$ simply means that the classifier is not sure which class the data point belongs to.)

• $1 - \theta(z) = \theta(-z)$

• $\theta(z) = \dfrac{d \log(1 + \exp(z))}{dz} = \nabla \log(1 + \exp(z))$

• $\nabla \theta(z) = \theta(z)(1 - \theta(z)) = \dfrac{d^2 \log(1 + \exp(z))}{dz^2} = \nabla^2 \log(1 + \exp(z))$
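These properties are easy to verify numerically. The following small NumPy sketch (not part of the original notes) checks the symmetry and derivative identities with finite differences:

import numpy as np

def theta(z):
    # logistic function
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
h = 1e-6
# 1 - theta(z) = theta(-z)
print(np.allclose(1 - theta(z), theta(-z)))
# d/dz log(1 + exp(z)) = theta(z), checked by central finite differences
num_grad = (np.log(1 + np.exp(z + h)) - np.log(1 + np.exp(z - h))) / (2 * h)
print(np.allclose(num_grad, theta(z)))
# theta'(z) = theta(z) (1 - theta(z)), checked the same way
num_dtheta = (theta(z + h) - theta(z - h)) / (2 * h)
print(np.allclose(num_dtheta, theta(z) * (1 - theta(z))))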

Alternatively, we can treat our hypothesis simply as $h(x) = \theta(w^\top x) \in (0,1)$ (e.g., Shalev-Shwartz and Ben-David's Understanding Machine Learning book, page 98). However, notice that this hypothesis function does not map to the target space $\mathcal{Y} = \{-1, 1\}$. So the output $\hat{y} = h(x)$ is not a prediction of $y \in \{-1, 1\}$, but rather a probability that implicitly predicts $y$. We will assume this hypothesis for the rest of this section.

Assume that the probability of $y$ given $x$ is given by,
$$\Pr[y = +1 \mid x] = h(x), \qquad \Pr[y = -1 \mid x] = 1 - h(x)$$
where $\Pr$ denotes the probability measure and $x, y$ are random variables. This generates a conditional probability mass function,
$$P_w(y \mid x) = \begin{cases} h(x) & y = +1 \\ 1 - h(x) & y = -1 \end{cases} = \theta(y w^\top x) \qquad (1)$$

We wish to maximize the joint probability that given x1, . . . , xN , we obtain y1, . . . , yN .

Derivation of the In-Sample Error and Its Gradient

The optimal weight vector $w^\star$ is the solution to the maximum likelihood problem,
$$w^\star = \arg\max_{w \in \mathbb{R}^{d+1}} P_w(y_1, \ldots, y_N \mid x_1, \ldots, x_N) = \arg\max_{w \in \mathbb{R}^{d+1}} \prod_{n=1}^{N} P_w(y_n \mid x_n) \quad \text{(i.i.d. assumption)}$$
In general, this problem is difficult to solve. Instead, we solve the related maximum log-likelihood problem, which yields the same optimizer. Define the log-likelihood as $\log\left(\prod_{n=1}^{N} P_w(y_n \mid x_n)\right) = \sum_{n=1}^{N} \log(P_w(y_n \mid x_n))$, which we can simplify to,



$$\log\left(\prod_{n=1}^{N} P_w(y_n \mid x_n)\right) = \sum_{n=1}^{N} \log(P_w(y_n \mid x_n)) = -\sum_{n=1}^{N} \log\left(\frac{1}{P_w(y_n \mid x_n)}\right) = -\sum_{n=1}^{N} \log(1 + \exp(-y_n w^\top x_n))$$

Then equivalently,
$$\begin{aligned}
w^\star &= \arg\max_{w \in \mathbb{R}^{d+1}} \prod_{n=1}^{N} P_w(y_n \mid x_n) \\
&= \arg\max_{w \in \mathbb{R}^{d+1}} \log\left(\prod_{n=1}^{N} P_w(y_n \mid x_n)\right) \\
&= \arg\max_{w \in \mathbb{R}^{d+1}} -\sum_{n=1}^{N} \log(1 + \exp(-y_n w^\top x_n)) \\
&= \arg\min_{w \in \mathbb{R}^{d+1}} \sum_{n=1}^{N} \log(1 + \exp(-y_n w^\top x_n))
\end{aligned}$$

Then we define $E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \log(1 + \exp(-y_n w^\top x_n))$.

Remark 1. (In-Sample Error in Entropy Form) Observe that $E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \log\left(\frac{1}{P_w(y_n \mid x_n)}\right)$ and $P_w(y \mid x) = \begin{cases} h(x) & y = +1 \\ 1 - h(x) & y = -1 \end{cases}$. Hence, we can rewrite our in-sample error as,
$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \log\left(\frac{1}{P_w(y_n \mid x_n)}\right) = \begin{cases} \dfrac{1}{N} \sum_{n=1}^{N} \log\left(\dfrac{1}{h(x_n)}\right) & y_n = +1 \\[3mm] \dfrac{1}{N} \sum_{n=1}^{N} \log\left(\dfrac{1}{1 - h(x_n)}\right) & y_n = -1 \end{cases}$$
or all in one line,
$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \left[ \llbracket y_n = +1 \rrbracket \log\left(\frac{1}{h(x_n)}\right) + \llbracket y_n = -1 \rrbracket \log\left(\frac{1}{1 - h(x_n)}\right) \right].$$
This is the entropy representation of the in-sample error.

In order to use a first-order (i.e., gradient-based) method to find the optimal weight of the model, we need to calculate the gradient of the in-sample error. This follows directly from the chain rule.

$$\begin{aligned}
\nabla E_{\text{in}}(w) &= \frac{1}{N} \sum_{n=1}^{N} \frac{1}{1 + \exp(-y_n w^\top x_n)} \exp(-y_n w^\top x_n)(-y_n x_n) \\
&= \frac{1}{N} \sum_{n=1}^{N} \frac{\exp(-y_n w^\top x_n)}{1 + \exp(-y_n w^\top x_n)} (-y_n x_n) \\
&= \frac{1}{N} \sum_{n=1}^{N} -y_n x_n \theta(-y_n w^\top x_n)
\end{aligned}$$



Remark 2. Observe that when a data point is misclassified (by definition, $y_n w^\top x_n < 0$), the argument of $\theta$ is positive, so the output of $\theta$ is closer to 1. Otherwise, if the point was correctly classified, the output of $\theta$ is closer to 0. This means that a method such as gradient descent or stochastic gradient descent updates the model more aggressively upon encountering a misclassified point, and keeps the model relatively unchanged when a point is correctly classified.
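The in-sample error and its gradient above translate directly into code. The following NumPy sketch (vectorized over the data set, with made-up variable names and toy data, purely for illustration) implements the two formulas:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def E_in(w, X, y):
    # X: N x (d+1) matrix of bias-augmented inputs, y: N-vector of labels in {-1, +1}
    margins = y * (X @ w)                     # y_n * w^T x_n
    return np.mean(np.log(1 + np.exp(-margins)))

def grad_E_in(w, X, y):
    margins = y * (X @ w)
    # (1/N) sum_n  -y_n x_n theta(-y_n w^T x_n)
    return -(X.T @ (y * logistic(-margins))) / len(y)

# tiny usage example with random data
rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])
y = np.array([1, -1, 1, 1, -1])
w = np.zeros(3)
print(E_in(w, X, y), grad_E_in(w, X, y))   # E_in at w = 0 is log(2)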

To Summarize

The logistic regression model is shown in Figure 1. The encircled portion on top is the training phase of the logistic regression model. The gradient is fed into the gradient descent optimizer, and a decision rule is applied to determine whether to continue training. After the training has stopped (the weights are no longer moving), we build our hypothesis $h(x) = \theta(w^{\star\top} x)$, which outputs a probability. We may threshold the probability using the sign function to determine the label.

[Figure 1 block diagram: the training data $x_1, \ldots, x_N$, $y_1, \ldots, y_N$ feed the gradient computation $e_k = \nabla E_{\text{in}}(w_k) = \frac{1}{N}\sum_{n=1}^{N} -y_n x_n \theta(-y_n w_k^\top x_n)$; the update $w_{k+1} = w_k - \eta e_k$ repeats until $\|w_{k+1} - w_k\|_2 \leq \epsilon$ (NO: keep training; YES: stop, set $w^\star = w_{k+1}$). In the testing phase, a new input $x$ is fed into $h(x) = \theta(w^{\star\top} x) = \Pr[y = +1 \mid x] \in (0,1)$.]

Figure 1: Full block diagram representation of logistic regression. ε is the threshold for stopping the training, and η > 0 is a learning rate. All supervised learning can be represented using a block diagram such as the one above (ignoring validation).

This is in fact a model for a very basic neural network (as shown in tutorial). However, there are a couple of difficulties in extending this to a multi-layer neural network model. One is that the gradient is usually very difficult to compute; the gradient block at the very top is replaced with the so-called backpropagation block. Another issue is that most data are not streamed all at once, but one point at a time (or in mini-batches). If the data are streamed one point at a time, we change our optimizer to the stochastic gradient optimizer. If the data are streamed in mini-batches, we use mini-batch gradient descent.

©Copyright by Bolin Gao 2019



Tutorial 5

Softmax Regression and some problems that were left out

Bolin Gao

Oct 10, 2019


Softmax Regression

The softmax regression (or multi-class logistic regression) is a generalization of logistic regression. We are given a data set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathcal{X} = \mathbb{R}^d$, $d \geq 1$, and $y_n \in \mathcal{Y} = \{1, \ldots, c\}$, $c \geq 2$, where $c$ is the number of classes that the data belong to.

Officially, we seek a model from the following hypothesis class,
$$\mathcal{H} = \left\{ h : \{1\} \times \mathcal{X} \to \mathcal{Y},\; x \mapsto y \;\Big|\; h(x) = \arg\max_{i \in \{1, \ldots, c\}} e_i^\top \sigma(W^\top x),\; W = \begin{bmatrix} w^{(1)} & \ldots & w^{(c)} \end{bmatrix},\; w^{(i)} \in \mathbb{R}^{d+1},\; W \in \mathbb{R}^{(d+1) \times c} \right\} \qquad (1)$$

where $\sigma$ is the softmax function,
$$\sigma(W^\top x) = \left[ \frac{\exp(w^{(i)\top} x)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x)} \right]_{i=1}^{c} = \frac{1}{\sum_{j=1}^{c} \exp(w^{(j)\top} x)} \begin{bmatrix} \exp(w^{(1)\top} x) \\ \vdots \\ \exp(w^{(c)\top} x) \end{bmatrix}. \qquad (2)$$

The hypothesis $h : \{1\} \times \mathcal{X} \to \mathcal{Y}$ is constructed as follows:

1. First, we multiply the weight matrix $W$ with an example $x$. Since $W$ is a matrix, this yields a vector $W^\top x = \begin{bmatrix} w^{(1)\top} x & \ldots & w^{(c)\top} x \end{bmatrix}^\top \in \mathbb{R}^c$.

2. Next, we feed this vector $W^\top x$ into the softmax function $\sigma$ to form a probability vector $\sigma(W^\top x) \in \mathrm{int}(\Delta^c) = \{v \in \mathbb{R}^c \mid v_1 + v_2 + \ldots + v_c = 1,\; v_i > 0,\; \forall i \in \{1, \ldots, c\}\}$ (this is the interior of the simplex in $\mathbb{R}^c$).

3. We form the inner product between this probability vector $\sigma(W^\top x)$ and a basis vector $e_i \in \mathbb{R}^c$, where $e_i$ has 0 for all entries except a 1 at the $i$th entry.

4. Finally, we choose the value of $i$ that maximizes the inner product $e_i^\top \sigma(W^\top x)$. This optimal $i$ is the index of the largest entry of the probability vector $\sigma(W^\top x)$. Since $i \in \{1, \ldots, c\}$, it is a prediction of $y$.

Alternatively, we can treat our hypothesis simply as $h(x) = \sigma(W^\top x)$. However, notice that this hypothesis function does not map to the target space $\mathcal{Y} = \{1, \ldots, c\}$, so the output $\hat{y} = h(x)$ is not a prediction of $y \in \{1, \ldots, c\}$. Despite this issue, we will assume this hypothesis for the rest of this section.

Derivation of the Sample Error and Its Gradient

Assume the conditional probability of predicting $y = i$ given $x$ is given by,
$$\Pr[y = i \mid x] = h_i(x) = \frac{\exp(w^{(i)\top} x)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x)}, \quad i \in \{1, \ldots, c\}$$

This gives rise to a conditional probability mass function parameterized by the weights $W$,
$$P_W(y \mid x) = \begin{cases} h_1(x) & y = 1 \\ \vdots & \vdots \\ h_c(x) & y = c \end{cases} = \sigma_y(W^\top x) = e_y^\top \sigma(W^\top x) = e_y^\top \sigma\!\left( \begin{bmatrix} w^{(1)\top} x \\ \vdots \\ w^{(c)\top} x \end{bmatrix} \right)$$
where $e_y = \begin{bmatrix} 0 & \ldots & 1 & \ldots & 0 \end{bmatrix}^\top$ and the 1 occupies the $y$th position (this is also referred to as the one-hot encoding of $y$).



The optimal weight matrix $W^\star = \begin{bmatrix} w^{(1)\star} & \ldots & w^{(c)\star} \end{bmatrix}$ is the solution to the maximum likelihood problem,
$$\begin{aligned}
W^\star &= \arg\max_{W \in \mathbb{R}^{(d+1) \times c}} \prod_{n=1}^{N} P_W(y_n \mid x_n) \\
&= \arg\max_{W \in \mathbb{R}^{(d+1) \times c}} \log\left(\prod_{n=1}^{N} P_W(y_n \mid x_n)\right) \\
&= \arg\max_{W \in \mathbb{R}^{(d+1) \times c}} \sum_{n=1}^{N} \log(P_W(y_n \mid x_n)) \\
&= \arg\max_{W \in \mathbb{R}^{(d+1) \times c}} \sum_{n=1}^{N} \log\left(\frac{\exp(w^{(y_n)\top} x_n)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)}\right) \\
&= \arg\min_{W \in \mathbb{R}^{(d+1) \times c}} -\sum_{n=1}^{N} \log\left(\frac{\exp(w^{(y_n)\top} x_n)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)}\right)
\end{aligned}$$

Remark 1. The formulation of the likelihood is slightly different from the one presented in the tutorial, which was somewhat based on Chris Bishop's formulation of the softmax regression problem (page 209, Pattern Recognition and Machine Learning). Please make appropriate changes in your notes. Thanks to Arnav Goel for alerting me to this problem. In any case, for us the summation is not important; we will only consider the sample error (the terms within the sum).

We then define the in-sample error as
$$E_{\text{in}}(W) = \frac{1}{N} \sum_{n=1}^{N} e_n(W) \qquad (3)$$

where $e_n(W)$ is the (per-)sample error, given by,
$$e_n(W) = -\log\left(\frac{\exp(w^{(y_n)\top} x_n)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)}\right) \qquad (4)$$

which we can rearrange to be,
$$e_n(W) = -\log(\exp(w^{(y_n)\top} x_n)) + \log\left(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\right) = -w^{(y_n)\top} x_n + \log\left(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\right)$$

Next, we wish to compute the gradient of the sample error with respect to some weight vector $w^{(i)} \in \mathbb{R}^{d+1}$. In order to do so, we need to consider two cases,
$$e_n(W) = -w^{(y_n)\top} x_n + \log\left(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\right) = \begin{cases} -w^{(i)\top} x_n + \log\left(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\right) & y_n = i \\[3mm] -w^{(l)\top} x_n + \log\left(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\right) & y_n = l \neq i \end{cases}$$
Then it is clear,
$$\nabla_{w^{(i)}} e_n(W) = \begin{cases} \nabla_{w^{(i)}}\left[-w^{(i)\top} x_n + \log\left(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\right)\right] & y_n = i \\[3mm] \nabla_{w^{(i)}}\left[-w^{(l)\top} x_n + \log\left(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\right)\right] & y_n = l \neq i \end{cases} = \begin{cases} -x_n + \dfrac{\exp(w^{(i)\top} x_n)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)}\, x_n & y_n = i \\[4mm] \dfrac{\exp(w^{(i)\top} x_n)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)}\, x_n & y_n = l \neq i \end{cases}$$



where we have used the fact, $\nabla_{w^{(i)}}\left[\log\left(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\right)\right] = \dfrac{\exp(w^{(i)\top} x_n)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)}\, x_n = \sigma_i(W^\top x_n)\, x_n$.

We can express everything in a more succinct way,
$$\nabla_{w^{(i)}} e_n(W) = -x_n \llbracket y_n = i \rrbracket + \frac{\exp(w^{(i)\top} x_n)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)}\, x_n.$$

One may also wish to combine the above two terms into a single term.
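To make the two-case gradient concrete, here is a small NumPy sketch of the per-sample error $e_n(W)$ and its gradient with respect to each column $w^{(i)}$ of $W$. The names are made up for illustration; note that column $i$ of the returned matrix is exactly the combined single-term form $(\sigma_i(W^\top x_n) - \llbracket y_n = i \rrbracket)\, x_n$ mentioned above.

import numpy as np

def softmax(z):
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

def sample_error(W, x_n, y_n):
    # W: (d+1) x c, x_n: (d+1,), y_n: class index in {0, ..., c-1}
    p = softmax(W.T @ x_n)             # sigma(W^T x_n)
    return -np.log(p[y_n])

def sample_error_grad(W, x_n, y_n):
    # column i is  -x_n [y_n = i] + sigma_i(W^T x_n) x_n
    p = softmax(W.T @ x_n)             # shape (c,)
    onehot = np.zeros_like(p)
    onehot[y_n] = 1.0
    return np.outer(x_n, p - onehot)   # (d+1) x c

# usage with random toy values
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))            # d+1 = 4, c = 3
x_n = rng.normal(size=4)
print(sample_error(W, x_n, y_n=2), sample_error_grad(W, x_n, y_n=2).shape)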

Feature Transform

The idea of the feature transform is simple, and we have seen some of it when we discussed polynomial regression. Suppose that we wish to perform binary classification using linear classifiers, but the data points in our data set are not linearly separable. How can this be done?

The intuitive approach is to rearrange the data so that they become linearly separable. This operation is called the feature transform.

Example 1. Suppose we have integer data $x_n \in \{-10, -9, -8, \ldots, 8, 9, 10\}$ and $y_n = \begin{cases} -1 & |x_n| > 2 \\ +1 & \text{otherwise.} \end{cases}$ Clearly the data are not linearly separable.

But they become linearly separable when we apply the transform $\Phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix}$. This function maps the space of our data $\mathcal{X} = \{-10, -9, -8, \ldots, 8, 9, 10\}$ into a new space $\mathcal{Z}$.

One possible linear classifier in the new space is (plot it!),
$$h(x) = \mathrm{sgn}(w^\top \Phi(x)) = \mathrm{sgn}\left( \begin{bmatrix} -5 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix} \right), \qquad w = \begin{bmatrix} -5 & 0 & 1 \end{bmatrix}^\top$$
Then given a new data point $x$, we can simply output a prediction of the label using the new classifier $h(x) = \mathrm{sgn}(w^\top \Phi(x))$.

Following the approach in the textbook, we find such a transform in the following way,

1. Find a nonlinear boundary in the space of the data X that separates the data

2. Re-write the nonlinear boundary in the form of a hyperplane w>Φ(x)

3. Obtain the feature transform Φ(x)

Exercise 3.13

In this exercise, we wish to find $w$ to represent a list of boundaries. Taking one of the examples, say (b): the circle centered at $(3, 4)$ is given by $(x_1 - 3)^2 + (x_2 - 4)^2 = 1 \implies x_1^2 - 6x_1 + 9 + x_2^2 - 8x_2 + 16 = 1$. We then rewrite this boundary as a hyperplane, $0 = x_1^2 - 6x_1 + 9 + x_2^2 - 8x_2 + 16 - 1 = w^\top \Phi(x)$.

In this case, $\Phi(x)$ is given as $\begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 \end{bmatrix}^\top$, so we obtain $w = \begin{bmatrix} 24 & -6 & -8 & 1 & 0 & 1 \end{bmatrix}^\top$.

The other cases are dealt with similarly.



Problem 3.17

(a)

Recall that given a function $f : \mathbb{R}^n \to \mathbb{R}$ and a vector $p \in \mathbb{R}^n$, the first-order (inexact) Taylor series expansion is,
$$f(x + p) \approx f(x) + \nabla f(x)^\top p \qquad (5)$$
Then,
$$E_1(\Delta u, \Delta v) = E(u + \Delta u, v + \Delta v)\big|_{(u,v)=(0,0)} \approx E(0,0) + \nabla E(0,0)^\top \begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix} = 3 + \begin{bmatrix} -2 & -3 \end{bmatrix} \begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix} = 3 - 2\Delta u - 3\Delta v$$

where we have used the fact that
$$\nabla_{(u,v)} E(u,v) = \begin{bmatrix} \exp(u) + v\exp(uv) + 2u - 3v - 3 \\ 2\exp(2v) + u\exp(uv) - 3u + 8v - 5 \end{bmatrix} \implies \nabla_{(u,v)} E(0,0) = \begin{bmatrix} -2 \\ -3 \end{bmatrix}$$

(b)

The way to approach this problem is to first form a unit vector in the negative gradient direction, then scale it by 0.5. Consider the unit vector,
$$-\nabla E_1 / \|\nabla E_1\|_2 = \begin{bmatrix} \frac{2}{\sqrt{13}} \\[1mm] \frac{3}{\sqrt{13}} \end{bmatrix}$$
Then we form the vector with length 0.5,
$$\begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix} = 0.5\left(-\nabla E_1 / \|\nabla E_1\|_2\right) = 0.5 \begin{bmatrix} \frac{2}{\sqrt{13}} \\[1mm] \frac{3}{\sqrt{13}} \end{bmatrix}$$
We then obtain $E(u + \Delta u, v + \Delta v)\big|_{(u,v)=(0,0)} \approx 2.25$.

(c)

The second-order Taylor series expansion is given by,
$$E_1(\Delta u, \Delta v) = E(u + \Delta u, v + \Delta v)\big|_{(u,v)=(0,0)} \approx E(0,0) + \nabla E(0,0)^\top \begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix} + \frac{1}{2} \begin{bmatrix} \Delta u & \Delta v \end{bmatrix} \nabla^2 E(u,v)\big|_{(0,0)} \begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix}$$
We can show,
$$\nabla^2_{(u,v)} E(u,v) = \begin{bmatrix} \exp(u) + v^2\exp(uv) + 2 & uv\exp(uv) + \exp(uv) - 3 \\ uv\exp(uv) + \exp(uv) - 3 & 4\exp(2v) + u^2\exp(uv) + 8 \end{bmatrix} \implies \nabla^2_{(u,v)} E(u,v)\big|_{(0,0)} = \begin{bmatrix} 3 & -2 \\ -2 & 12 \end{bmatrix}$$
Plugging in everything, we obtain,
$$E_1(\Delta u, \Delta v) = E(u + \Delta u, v + \Delta v)\big|_{(u,v)=(0,0)} \approx 3 - 2\Delta u - 3\Delta v + \frac{3}{2}\Delta u^2 - \frac{4}{2}\Delta u \Delta v + \frac{12}{2}\Delta v^2$$

(d)

For arbitrary $\Delta u$, $\Delta v$,
$$E_2(\Delta u, \Delta v) = E(u + \Delta u, v + \Delta v) \approx E(u,v) + \nabla E(u,v)^\top \begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix} + \frac{1}{2} \begin{bmatrix} \Delta u & \Delta v \end{bmatrix} \nabla^2 E(u,v) \begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix}$$
Assuming that $\nabla^2 E(u,v)$ is positive definite, hence invertible, taking the gradient of $E_2(\Delta u, \Delta v)$ with respect to $(\Delta u, \Delta v)$ and setting it to zero,



$$\nabla E_2(\Delta u, \Delta v) = \nabla E(u,v) + \nabla^2 E(u,v) \begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix} = 0 \implies \begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix} = -[\nabla^2 E(u,v)]^{-1} \nabla E(u,v)$$
This is the optimal direction when we have second-order information. This particular update is known as the Newton step.

(e)

(i) We first calculate the Newton step using the previously calculated Hessian,
$$p = -[\nabla^2 E(0,0)]^{-1} \nabla E(0,0)$$
Then we normalize it to length 1 and multiply it by 0.5 to obtain the desired value,
$$\begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix} = 0.5\, p / \|p\|_2 = \begin{bmatrix} 0.4588 \\ 0.1988 \end{bmatrix} \implies E(u + \Delta u, v + \Delta v) = 1.8805$$
(ii) Using MATLAB, and using the hint in the problem, we find
$$\begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix} = \begin{bmatrix} 0.4363 \\ 0.2440 \end{bmatrix} \implies E(u + \Delta u, v + \Delta v) = 1.8685$$
Observe that the value of $E(u + \Delta u, v + \Delta v)$ for the different parts satisfies the relationship $e_{(ii)} < e_{(i)} < e_{(b)}$.
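The Newton-step computation in (e)(i) is easy to reproduce; the following NumPy sketch plugs in the gradient and Hessian at $(0,0)$ derived above (a sketch only, not part of the original solution):

import numpy as np

grad = np.array([-2.0, -3.0])                  # nabla E(0, 0) from part (a)
hess = np.array([[3.0, -2.0], [-2.0, 12.0]])   # nabla^2 E(0, 0) from part (c)

p = -np.linalg.solve(hess, grad)               # Newton step of part (d)
step = 0.5 * p / np.linalg.norm(p)             # rescaled to length 0.5, part (e)(i)
print(p, step)                                 # step is approximately [0.4588, 0.1988]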

©Copyright by Bolin Gao 2019



ECE421 – TUT0102
Tutorial 6: Introduction to Convolutional Neural Networks
Bolin Gao

[email protected]

10/24/2019

For Midterm Questions: Friday 3 pm - 4 pm BA4162 see Sindhu Gowda

History: Kunihiko Fukushima proposed the first convolutional neural network in 1980. Popularized by Yann LeCun and Yoshua Bengio et al. in 1998 for document recognition.


Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton used AlexNet to win the competition.

It almost halved the error achieved by the top non-neural-net based vision algorithms (15.5% of the time the correct label was not one of the top 5 answers; compare with 5–10% for humans).

This event revolutionized machine learning in many ways.

2012 ImageNet Challenge: best 5 out of 1,000 categories


2012 ImageNet Challenge: best 5 out of 1,000 categories

1. Popularized the Rectified Linear Unit (ReLU), Dropout, and the Convolutional Neural Network

2. Beginning of “deep learning” hype

3. Popularized training with GPUs

4. Put UofT solidly on the map (in terms of machine learning and AI)

5. UofT ECE department started to consider teaching machine learning courses…the rest is history.


AlexNet is a Convolutional Neural Network




CNN for image classification (base architecture)

Image → (1) Convolution → (2) Nonlinear mapping → (3) Pooling → repeat steps 1–3 → Flatten → Fully connected (dense) layer → Softmax → Output probabilities


CNN for image classification (with training loop)

Image → (1) Convolution → (2) Nonlinear mapping → (3) Pooling → repeat steps 1–3 → Flatten → Fully connected (dense) layer → Softmax → Output probabilities ŷ

During training, ŷ is compared with the (one-hot encoded) true label y, the gradient of the loss function is computed via backpropagation, and stochastic gradient descent (or variants) produces the weight update w_{k+1}.


Convolution

Slide a matrix (or filter, kernel) across the image, pairwise multiply each entry in the image with the corresponding value of the filter, then sum all values. This creates a new matrix.

In practice, filter elements or "weights" are randomly initialized, e.g., sampled from a Gaussian.

Mathematically (for this simple example with an image and a 2x2 filter):

Output(x, y) = Σ_m Σ_n Image(x + m, y + n) · Filter(m, n)

where Image and Filter represent their respective matrices.
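This sliding-window sum is easy to express directly in NumPy. The sketch below (illustrative only, with made-up names) also reproduces the 3x3 image / 2x2 filter example used on the following slides:

import numpy as np

def conv2d(image, filt, stride=1):
    # "valid" convolution as used here (really a cross-correlation):
    # slide the filter over the image, multiply entrywise and sum
    H, W = image.shape
    h, w = filt.shape
    out_h = (H - h) // stride + 1
    out_w = (W - w) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride + h, j*stride:j*stride + w]
            out[i, j] = np.sum(patch * filt)
    return out

image = np.arange(1, 10).reshape(3, 3)   # [[1,2,3],[4,5,6],[7,8,9]]
filt = np.array([[1, 0], [0, 1]])
print(conv2d(image, filt))               # [[ 6.  8.] [12. 14.]]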


Convolution (continued)

Slide the filter across the image, multiply each entry in the image with the corresponding value of the filter, then sum all values. This creates a new matrix.

(Misnomer: this is not a convolution as in signal processing; it is more like a cross-correlation.)

[Figure: Image, 2x2 Filter, Output]


Most crucial idea in CNN:

Each convolution implements a small neural network. Therefore the filters are learned.

Worked example: Image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]], 2x2 filter = [[1, 0], [0, 1]]. Sliding the filter across the image (stride 1) and summing each windowed product gives Σ = 6, 8, 12, 14, i.e., Output = [[6, 8], [12, 14]].

Why supervised learning using CNN is called Deep Learning


Convolution

In practice, several filters are used to learn different representations of the same data. [Figure: an image and four 2x2 filters]


In practice, several filters are used to learn different representations of the same data. Convolving the image with the four 2x2 filters produces four output matrices; pictorially (both representations are equivalent), these outputs form a box of dimension 4 x 2 x 2. Explain.


Convolution on real images

Real images are represented by matrices stacked together. Each color specifies a "channel": an RGB image has three channels, while a black-and-white (grayscale) image has one channel.

Assuming an RGB image, we need to perform the convolution on all three channels.


[Figure: an image in three RGB channels and a 2x2 filter with depth 3.] For simplicity, assume all RGB layers have the same values and all filter slices have the same values.


Output: one matrix, because we are convolving using one filter.


Pictorial representation: the input image is a box with a length, a width, and a depth (= 3); the 2x2 filter with depth 3 slides over it and produces a single output matrix.


Convolution – stride

Stride: how many rows/columns the filter shifts by.

[Figure: some matrix, a 2x2 filter with stride 1, and a 2x2 filter with stride 2]


Nonlinear mapping

Convolution is a completely linear operation, but we want nonlinearity in the network!

Add nonlinearity by sending each output of the convolution to a nonlinear function.

[Figure: Image → 2x2 Filter → Output → Nonlinearity (ReLU) → Output]


Continuing the worked example: each convolution output (6, 8, 12, 14) is passed through the nonlinearity, here a max (ReLU-style) unit, before forming the output of the layer.


Pooling

Another nonlinear transform, takes a norm over a specified region of the input matrix

Max pooling: take the max-element of a region

[Figure: a matrix, a 2x2 max pool filter with stride 1, and a 2x2 max pool filter with stride 2]
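A max pooling layer can be sketched in a few lines of NumPy (an illustrative sketch with made-up names, not a course reference implementation):

import numpy as np

def max_pool(x, size=2, stride=2):
    # take the maximum over each size x size window, moving by `stride`
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride + size, j*stride:j*stride + size].max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool(x, size=2, stride=2))   # [[6. 8.] [3. 4.]]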


Pooling enforces translational invariance

Translational invariance: give the same prediction after a shift of the object.


Imagine a portion of a matrix (the output of the nonlinearity). We now apply a 2x2 max pooling filter with stride 2.


Now imagine this matrix shifted upward by one row.

Shift up: applying the same 2x2 max pooling filter with stride 2 to the shifted matrix gives no change in the pooled output (translation invariance achieved).


Flatten, Dense Layer and classify

Flatten: vectorize all outputs, stack them together.

Feed the flattened vector into one or several fully-connected (dense) layers, then feed into a softmax to acquire a probability over the categories:

ŷ = σ(W^T x), where x is the flattened vector.
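Concretely, the flatten → dense → softmax step can be sketched as follows (NumPy, illustrative shapes and names only):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(4, 2, 2))   # e.g., 4 pooled 2x2 feature maps
x = feature_maps.reshape(-1)                # flatten: vector of length 16
W = rng.normal(size=(16, 10))               # dense layer weights, 10 classes
y_hat = softmax(W.T @ x)                    # probability vector over the categories
print(y_hat.sum())                          # 1.0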


Summary (from Stanford slides)


Training the CNN

We can imagine the entire system as a (highly nonlinear) hypothesis h_w : x → h_w(x).

This function can have millions of weight parameters; they consist of the weights of the last fully connected layer and the weights of each filter.

Let L = Σ_i L_i be the loss function (same as for softmax regression); then we simply need to update the weights using stochastic gradient descent (or variants),

w_{k+1} = w_k − η ∇L_i(w_k), i ~ Unif{1, …, N}

Use backpropagation to speed up the calculation of the gradient.


AlexNet revisited

Figure split into two halves because it was trained on two GPUs


Things we left out

Zero padding – add 0s around the border of the image so the filter operation won't run off the image when the stride is large.

Dropout – randomly disconnect certain edges to loosen the dependency between neurons; prevents overfitting.

Batch normalization – a method to normalize data across layers to speed up training.

Xavier and He initialization – principled methods to initialize the weights of the filters.


Bonus: a convolutional neural network can be used to play games (Deep Q-Learning – Mnih et al.)

Click here to see my project on Vimeo

Two outputs: flap (up) and do nothing (down)


ECE1504 Tutorial: TensorFlow (part II)


Outline

1 Challenges of Training Deep Neural Networks

2 Initialization

3 Activation Functions

4 Batch Normalization

5 Reusing Pretrained Layer

6 Avoiding Overfitting Through Regularization

7 Convolutional Neural Network



Challenges of Training Deep Neural Networks

Training Deep Neural Nets

If you want to tackle complex problems, you need to design a very deep neural network.

Training a deep neural net which contains hundreds of neurons connected by hundreds of thousands of connections is really challenging because:

1. The vanishing gradients problem affects DNNs and makes the lower layers very hard to train.
2. Due to the number of parameters, training would be extremely slow.
3. There is a risk of overfitting.

In this tutorial we will present available techniques in Python to solve these problems.


Initialization

Vanishing Gradient Problem

Gradients get smaller and smaller as training progresses down to the lower layers.

There are two suspects for this problem: initialization and the activation function.


Initialization

Initialization

We would like the variance of the outputs of each layer to be equal to the variance of its inputs. It is not possible to guarantee this unless the layer has an equal number of input and output connections.

TensorFlow implementation:

Figure: Initialization parameters for each type of activation function

he_init = tf.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name="hidden1")


Activation Functions

Activation Functions

Vanishing/exploding gradients problems are in part due to a poor choice of activation functions.

ReLU and its variants:
1. Leaky ReLU
2. Randomized Leaky ReLU
3. Parametric Leaky ReLU

ELU:
ELU_α(z) = α(exp(z) − 1) if z < 0, and z if z ≥ 0

Figure: ELU activation function


Activation Functions

Activation Functions (cont'd)

Comparison of activation functions:

ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic

Implementation of ELU in TensorFlow:
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name="hidden1")

Implementation of Leaky ReLU:
def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu, name="hidden1")


Batch Normalization

Batch Normalization

He initialization along with ELU can significantly reduce the vanishing gradients problem at the beginning of training.

However, it doesn't guarantee that it won't come back during training.

Batch Normalization is a technique to provide any layer with inputs that are zero mean/unit variance.

BN consists of zero-centring and normalizing the input, then scaling and shifting the result using two new parameters per layer.

The optimal scale and mean of the inputs of each layer are learnt during training.


Batch Normalization

Implementation of Batch Normalization in TensorFlow

Training placeholder:
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
training = tf.placeholder_with_default(False, shape=(), name='training')

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)

Execution phase:
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")


Reusing Pretrained Models

Reusing Pretrained Models

What is reusing pretrained Models?

Figure: Reusing pretrained layer

Benefits: speeding up training, less training data!



Reusing Pretrained Models

Implementation in TensorFlow

Import the operations into the default graph:
saver = tf.train.import_meta_graph("./my_model_final.ckpt.meta")

Get a handle on the operations and tensors:
for op in tf.get_default_graph().get_operations():
    print(op.name)

Example:
X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")
accuracy = tf.get_default_graph().get_tensor_by_name("eval/accuracy:0")
training_op = tf.get_default_graph().get_operation_by_name("GradientDescent")

Freezing the lower layers:
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                   scope="hidden[34]|outputs")
    training_op = optimizer.minimize(loss, var_list=train_vars)


Avoiding Overfitting Through Regularization

Early Stopping

Early stopping = stop training when its performance on the validation set starts dropping!

Implementation:
max_checks_without_progress = 20
checks_without_progress = 0
best_loss = np.infty

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        [..training..]
        loss_val, acc_val = sess.run([loss, accuracy], feed_dict={X: X_valid, y: y_valid})
        if loss_val < best_loss:
            save_path = saver.save(sess, "./model.ckpt")
            best_loss = loss_val
            checks_without_progress = 0
        else:
            checks_without_progress += 1
            if checks_without_progress > max_checks_without_progress:
                print("Early stopping!")
                break


Avoiding Overfitting Through Regularization

ℓ1 and ℓ2 regularization

ℓ1 and ℓ2 regularization can be used to constrain a neural network's connection weights.

You can add the regularization term to the cost function.

Also, TensorFlow provides the following built-in function:
regularizer = tf.contrib.layers.l2_regularizer(scale=0.001)
hidden1 = tf.layers.dense(X, num_layer_1, kernel_regularizer=regularizer)
hidden1_act = tf.nn.relu(hidden1)


Avoiding Overfitting Through Regularization

Dropout

In training, every neuron has a probability p of being temporarily "dropped out".

Implementation:
training = tf.placeholder_with_default(False, shape=(), name='training')
dropout_rate = 0.5  # == 1 - keep_prob
X_drop = tf.layers.dropout(X, dropout_rate, training=training)

During test, you need to multiply each neuron connection by the keep probability. Why?


Convolutional Neural Network

Convolutional Neural Network

Convolutional layer:
conv = tf.layers.conv2d(x_image, filters=32, kernel_size=5,
                        strides=[1,1], kernel_initializer=he_init, activation=tf.nn.relu,
                        kernel_regularizer=regularizer, padding="VALID", name="cnn_layer")

Max pooling layer:
conv_pool = tf.nn.max_pool(value=conv, ksize=[1, 3, 3, 1],
                           strides=[1, 2, 2, 1], padding="VALID", name="max_pool_layer")


Tutorial 8

K-means, Review of Probability, Jointly Gaussian Random

Variables

Bolin Gao

Nov 7, 2019


K-means and Lloyd’s Algorithm

We now discuss a classic unsupervised learning problem called K-means. Suppose we are given a set of data, where we assume that

(i) each data point $x_n \in \mathbb{R}^d$,

(ii) the data belong to $K$ clusters (groups of data that share commonality with each other),

(iii) similar data are close in the Euclidean distance.

Is it possible to partition the data into these $K$ different clusters? The problem is formally given below.

K-means clustering problem (set theoretic)

Given a data set $\mathcal{D} = \{x_n\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$, find $K$ clusters $\{B_k\}_{k=1}^{K} = \{B_1, B_2, \ldots, B_K\}$, each $B_k \subseteq \mathbb{R}^d$, and vectors $\{\mu_k\}_{k=1}^{K}$, $\mu_k \in \mathbb{R}^d$, such that,
$$L(B_1, \ldots, B_K, \mu_1, \ldots, \mu_K) = \sum_{k=1}^{K} \sum_{x_n \in B_k} \|x_n - \mu_k\|_2^2 \qquad (1)$$
is minimized.

The loss $L$ is sometimes referred to as the distortion measure. We note that the variables here are the vectors $\mu_1, \ldots, \mu_K$ as well as the sets $B_1, \ldots, B_K$.

The following algorithm describes a popular method for minimizing the loss $L$, usually referred to as the K-means algorithm (K-means is the problem; there are various algorithms to solve it).

Lloyd's Algorithm

Data: $\mathcal{D} = \{x_n\}_{n=1}^{N}$, $x_n \in \mathbb{R}^d$
Input: $\{\mu_k\}_{k=1}^{K}$ set to some random values, e.g., points in $\mathcal{D}$.
Output: $\{\mu_k\}_{k=1}^{K}$ corresponding to the optimal loss value
Repeat until convergence:

1. For all $n = 1, \ldots, N$, assign $x_n$ to the nearest $\mu_k$, that is, compute
$$k = \arg\min_{i=1,\ldots,K} \|x_n - \mu_i\|_2 \in \{1, \ldots, K\} \qquad (2)$$
and assign $x_n$ to the set $B_k$.

2. For all $k = 1, \ldots, K$, compute $\mu_k$ via,
$$\mu_k = \frac{1}{|B_k|} \sum_{x_n \in B_k} x_n, \qquad (3)$$
where $|B_k|$ denotes the number of elements in the set $B_k$.

Algorithm 1: Lloyd's Algorithm ("K-means algorithm")
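A direct NumPy translation of Algorithm 1 might look like the following sketch (the initialization and stopping rule are kept deliberately simple, and the names are mine):

import numpy as np

def lloyd(X, K, n_iters=100, seed=0):
    # X: N x d data matrix; returns the K centroids and the cluster assignments
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialize from the data
    for _ in range(n_iters):
        # step 1: assign each x_n to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # N x K
        assign = dists.argmin(axis=1)
        # step 2: recompute each centroid as the mean of its cluster
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, assign

# usage on toy data with two well-separated groups
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
mu, assign = lloyd(X, K=2)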

The vectors $\{\mu_k\}_{k=1}^{K}$ are referred to as the "mean vectors" or sometimes the centroids. The following facts are known about the K-means algorithm:

(i) the loss function is monotonically decreasing;

(ii) there is no guarantee on the number of iterations to convergence;

(iii) there is no nontrivial lower bound on the gap between the value of the K-means loss at the algorithm's output and the minimum achievable value of the loss;



(iv) K-means might converge to a point which is not a local minimum.

The recommendation is to run K-means with different initializations and pick the best clustering.

Alternative description of the K-means problem (non-set theoretic)

While the previously mentioned algorithm is intuitive, it unfortunately requires us to associate vectors with the sets $\{B_k\}_{k=1}^{K}$, which are not explicitly computed. We now offer an equivalent description of the K-means problem, as well as Lloyd's algorithm for this problem, whereby the membership is explicitly specified. The following is taken from Chapter 9 of Bishop's book, which offers a non-set theoretic description of the K-means problem.

K-means clustering problem (non-set theoretic)

Given a data set $\mathcal{D} = \{x_n\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$, find $\{\mu_k\}_{k=1}^{K}$, $\mu_k \in \mathbb{R}^d$, and responsibilities $\{r_{n,k}\}_{n=1,k=1}^{N,K}$, $r_{n,k} \in \{0, 1\}$, such that,
$$L(r_{1,1}, \ldots, r_{N,K}, \mu_1, \ldots, \mu_K) = \sum_{k=1}^{K} \sum_{n=1}^{N} r_{n,k} \|x_n - \mu_k\|_2^2 \qquad (4)$$
is minimized and $\sum_{k=1}^{K} r_{n,k} = 1, \forall n$.

Remark 1. Here $r_{n,k} = 1$ if $x_n$ is assigned to cluster $k$. The constraint $\sum_{k=1}^{K} r_{n,k} = 1$ says that each data point $x_n$ is only allowed to be assigned to a single class. We can gather all the responsibilities into a single vector,
$$r_n = (r_{n,1}, \ldots, r_{n,K}) = \begin{bmatrix} r_{n,1} \\ \vdots \\ r_{n,K} \end{bmatrix} \qquad (5)$$
Since only one of the entries of $r_n$ is allowed to be 1, for example $r_n = (0, 0, 1, 0, 0)$, we say that $r_n$ is a "one-hot" encoding of the class membership of $x_n$. Observe that each $r_n$ is a vertex of the simplex $\Delta$.

Lloyd's Algorithm (non-set theoretic)

Data: $\mathcal{D} = \{x_n\}_{n=1}^{N}$, $x_n \in \mathbb{R}^d$
Input: $\{\mu_k\}_{k=1}^{K}$ set to some random values, e.g., points in $\mathcal{D}$.
Output: $\{\mu_k\}_{k=1}^{K}$ corresponding to the optimal loss value
Repeat until convergence:

1. For all $n = 1, \ldots, N$, $k = 1, \ldots, K$,
$$r_{n,k} = \begin{cases} 1 & k = \arg\min_{i=1,\ldots,K} \|x_n - \mu_i\|_2 \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

2. For all $k = 1, \ldots, K$, compute $\mu_k$ via,
$$\mu_k = \frac{\sum_{n=1}^{N} r_{n,k}\, x_n}{\sum_{n=1}^{N} r_{n,k}} \qquad (7)$$

Algorithm 2: Lloyd's Algorithm ("K-means algorithm") – Equivalent



Soft K-means

Suppose that a data point $x_n$ falls precisely on the point of intersection of two clusters. Then either the data point belongs to one of the clusters (which we resolve, perhaps, using a coin flip), or perhaps it is more appropriate to say that the data point belongs to both clusters. The second approach is the so-called soft K-means.

Soft K-means clustering problem (non-set theoretic)

Given a data set $\mathcal{D} = \{x_n\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$, find $\{\mu_k\}_{k=1}^{K}$, $\mu_k \in \mathbb{R}^d$, and responsibilities $\{r_{n,k}\}_{n=1,k=1}^{N,K}$, $r_{n,k} \in [0, 1]$, such that,
$$L(r_{1,1}, \ldots, r_{N,K}, \mu_1, \ldots, \mu_K) = \sum_{k=1}^{K} \sum_{n=1}^{N} r_{n,k} \|x_n - \mu_k\|_2^2 \qquad (8)$$
is minimized and $\sum_{k=1}^{K} r_{n,k} = 1, \forall n$.

Remark 2. In this case, we may view $r_{n,k}$ as the "percentage" to which a data point $x_n$ belongs to cluster $k$. We can gather all the responsibilities into a single vector,
$$r_n = (r_{n,1}, \ldots, r_{n,K}) = \begin{bmatrix} r_{n,1} \\ \vdots \\ r_{n,K} \end{bmatrix} \qquad (9)$$
Observe that each $r_n$ is an element of the simplex $\Delta$.

Lloyd's Algorithm for Soft K-means

Data: $\mathcal{D} = \{x_n\}_{n=1}^{N}$, $x_n \in \mathbb{R}^d$
Input: $\{\mu_k\}_{k=1}^{K}$ set to some random values, e.g., points in $\mathcal{D}$.
Output: $\{\mu_k\}_{k=1}^{K}$ corresponding to the optimal loss value
Repeat until convergence:

1. For all $n = 1, \ldots, N$, $k = 1, \ldots, K$,
$$r_{n,k} = \frac{\exp\left(-\lambda \|x_n - \mu_k\|_2^2\right)}{\sum_{l=1}^{K} \exp\left(-\lambda \|x_n - \mu_l\|_2^2\right)} = \sigma_k\!\left( -\lambda \begin{bmatrix} \|x_n - \mu_1\|_2^2 \\ \vdots \\ \|x_n - \mu_K\|_2^2 \end{bmatrix} \right), \qquad (10)$$
where $\lambda > 0$ and $\sigma_k$ is the $k$th component of the softmax function.

2. For all $k = 1, \ldots, K$, compute $\mu_k$ via,
$$\mu_k = \frac{\sum_{n=1}^{N} r_{n,k}\, x_n}{\sum_{n=1}^{N} r_{n,k}} \qquad (11)$$

Algorithm 3: Lloyd's Algorithm for Soft K-means ("Soft K-means algorithm")
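The soft assignment step (10) is just a softmax over negative scaled squared distances, and the mean update (11) is a responsibility-weighted average. A small NumPy sketch (illustrative names only):

import numpy as np

def soft_responsibilities(X, mu, lam=1.0):
    # X: N x d, mu: K x d; returns the N x K matrix of responsibilities r_{n,k}
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # N x K
    logits = -lam * sq_dists
    logits -= logits.max(axis=1, keepdims=True)                      # numerical stability
    R = np.exp(logits)
    return R / R.sum(axis=1, keepdims=True)                          # rows sum to 1

def update_means(X, R):
    # mu_k = (sum_n r_{n,k} x_n) / (sum_n r_{n,k}), as in (11)
    return (R.T @ X) / R.sum(axis=0)[:, None]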

Note that there was a slight typo from last time: the Euclidean distance term inside the exponential should be raised to the power of 2.

Remark 3. Note the following,

• For xn close to µk, ‖xn − µk‖2 is small, hence rn,k → 1

• For xn far from µk, ‖xn − µk‖2 is big, hence rn,k → 0



Review of Probability and Jointly Gaussian Random Variables

Random Variable

Our (abridged) story of probability starts with the notion of a random variable.

Definition 1. A random variable is a function X : Ω→ R, ω 7→ X(ω).

$\Omega$ is referred to as the sample space, which consists of the collection of outcomes of some underlying random experiment. This set is very general; it can contain the names of all the students in the classroom. Random variables provide us a way to talk about these very general objects using mathematics. Common examples include $\Omega = \{\text{Head}, \text{Tail}\}$ or $\Omega = [0, \infty)$. The symbol $\omega$ denotes a single outcome in the sample space $\Omega$. We refer to any subset $E \subseteq \Omega$ as an event. It is important to note that $\omega$ is an element of $\Omega$, but not a subset of it. That being said, the set containing $\omega$, i.e., $\{\omega\}$, is a subset of $\Omega$; it is referred to as an elementary event.

We say that the random variable $X$ is discrete if it maps to a discrete set $\{x_1, \ldots, x_N\}$ (finite) or $\{x_1, \ldots, x_N, \ldots\}$ (countable), and continuous if it maps to some interval $C \subseteq \mathbb{R}$ (uncountable).

Probability Measure

To talk about probability of certain outcomes or events, we need the notion of the probability measure

Definition 2. A probability measure is a function $\Pr$ defined on events (subsets of $\Omega$), $E \mapsto \Pr[E]$, that satisfies the following properties:

(i) $\Pr[E] \in [0, 1]$

(ii) $\Pr[\Omega] = 1$, $\Pr[\emptyset] = 0$

(iii) If $E_i \cap E_j = \emptyset$ for all $i \neq j$, then $\Pr\left[\bigcup_{n=1}^{\infty} E_n\right] = \sum_{n=1}^{\infty} \Pr[E_n]$.

The final property is called countable additivity. We note that a curious thing about the notion of a probability measure is that not every subset $E \subset [0, 1]$ (or of any set containing a continuum of numbers) has a defined $\Pr[E]$; therefore, not all subsets $E \subset [0, 1]$ can be events! Such sets are called Vitali sets, and their construction is intimately linked to the Axiom of Choice, which is at the heart of the foundations of mathematics. For details, see page 74 of the text by Alberto Leon-Garcia.

Some properties of the probability measure that can be derived:

Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B]

Pr[A ∪ B ∪ C] = Pr[A] + Pr[B] + Pr[C] − Pr[A ∩ B] − Pr[B ∩ C] − Pr[A ∩ C] + Pr[A ∩ B ∩ C]

Pr[A ∩ B^c] = Pr[A] − Pr[A ∩ B]
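The first identity can be sanity-checked by simulation; the following sketch uses a fair six-sided die with arbitrarily chosen events A and B (my own example):

import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=200_000)   # uniform rolls on {1, ..., 6}

A = (rolls <= 4)          # event A = {1, 2, 3, 4}
B = (rolls % 2 == 0)      # event B = {2, 4, 6}

lhs = np.mean(A | B)                               # empirical Pr[A ∪ B]
rhs = np.mean(A) + np.mean(B) - np.mean(A & B)     # Pr[A] + Pr[B] - Pr[A ∩ B]
print(lhs, rhs)           # both close to 5/6 ≈ 0.833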

Remark 4. There are two shorthands used in probability that cause confusion for beginners.

(1) X is a function of ω, but in practice ω is omitted and one writes X = X(ω), where X is now treated as a real number. This is of course bad practice, as we confuse the function with its value, but it has stuck; virtually all applied probability textbooks use this notation. For example, when one writes X = 2Y, what is actually meant is that for all ω ∈ Ω, X(ω) = 2Y(ω).

(2) Often, when one wishes to talk about the probability of the random variable falling at, below, or within an interval around some number, one writes,

Pr[X = x] Pr[X ≤ x] Pr[x1 ≤ X ≤ x2] (12)

But this is strange, as Pr is a function of events (not of statements such as "X ≤ x"). What we actually mean when we write these is, for example, Pr[X = x] = Pr[{ω ∈ Ω | X(ω) = x}].

Probability Mass Function, Cumulative Distribution Function, Probability Density Function

The three famous functions (the probability mass function, the cumulative distribution function, and the probability density function) are simply different ways of talking about the probability measure of certain events.



Probability Mass Function

Definition 3. The probability mass function (PMF) for a discrete random variable X is the function P_X(x) = Pr[X = x], and for a pair of random variables X, Y it is the function P_{X,Y}(x, y) = Pr[X = x, Y = y].

The PMF is only defined for discrete random variables because for a continuous random variable, the probability at any single point x, Pr[X = x], is 0.

Example 1. (Poisson RV)

P_X(k) = (α^k / k!) e^{−α}, k = 0, 1, . . . ,    E[X] = α, VAR[X] = α    (13)
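A quick numerical check of (13), with an arbitrarily chosen α (my own sketch): the PMF sums to one, and the sample mean and variance of Poisson draws are both approximately α.

import numpy as np
from math import factorial, exp

alpha = 3.0
pmf = lambda k: alpha**k / factorial(k) * exp(-alpha)    # eq. (13)

print(sum(pmf(k) for k in range(100)))    # ≈ 1: the PMF sums to one

samples = np.random.default_rng(0).poisson(alpha, size=500_000)
print(samples.mean(), samples.var())      # both ≈ alpha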

Cumulative Distribution Function

Definition 4. The cumulative distribution function (CDF) for a (discrete or continuous) random variable X is the function F_X(x) = Pr[X ≤ x], and for a pair of random variables X, Y it is the function F_{X,Y}(x, y) = Pr[X ≤ x, Y ≤ y].

Example 2. (Exponential RV)

F_X(x) = { 1 − exp(−λx), x ≥ 0;  0, x < 0 },    E[X] = 1/λ,  E[X^k] = k!/λ^k,  VAR[X] = 1/λ^2    (14)
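A numerical check of (14), with an arbitrarily chosen λ (my own sketch); note that NumPy parameterizes the exponential distribution by the scale 1/λ.

import numpy as np

lam = 2.0
rng = np.random.default_rng(0)
samples = rng.exponential(scale=1 / lam, size=500_000)   # Exponential(λ) samples

print(samples.mean(), 1 / lam)        # E[X] = 1/λ
print(samples.var(), 1 / lam**2)      # VAR[X] = 1/λ²

x = 1.0
print(np.mean(samples <= x), 1 - np.exp(-lam * x))   # empirical CDF at x vs F_X(x)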

The CDF is non-decreasing and right-continuous. The limit of F_X(x) as x → ∞ is 1, and as x → −∞ it is 0. Here are some properties involving the CDF that might be good to be reminded of (you do not need to know how they are derived).

Pr[X = x_1] = F_X(x_1) − F_X(x_1^−)

Pr[x_1 ≤ X ≤ x_2] = F_X(x_2) − F_X(x_1) + Pr[X = x_1]

And for a pair of random variables,

Pr[X > x, Y > y] = 1 − Pr[{X ≤ x} ∪ {Y ≤ y}] = 1 − Pr[X ≤ x] − Pr[Y ≤ y] + Pr[X ≤ x, Y ≤ y] = 1 − F_X(x) − F_Y(y) + F_{XY}(x, y)

Pr[x_1 ≤ X ≤ x_2, y_1 ≤ Y ≤ y_2] = F_{XY}(x_2, y_2) − F_{XY}(x_1, y_2) − F_{XY}(x_2, y_1) + F_{XY}(x_1, y_1) + Pr[X = x_1, y_1 ≤ Y ≤ y_2] + Pr[x_1 < X ≤ x_2, Y = y_1]

Pr[x_1 < X ≤ x_2, Y ≤ y_2] = Pr[X ≤ x_2, Y ≤ y_2] − Pr[X ≤ x_1, Y ≤ y_2] = F_{XY}(x_2, y_2) − F_{XY}(x_1, y_2)

Probability Density Function

Definition 5. The probability density function (PDF) for a continuous random variable X is the function f_X(x) = dF_X(x)/dx, and for a pair of random variables X, Y it is the function f_{X,Y}(x, y) = ∂²F_{X,Y}(x, y)/∂x∂y.

Remark 5. We note that if a random variable has a distribution f_X(x), we write X ∼ f_X(x). Confusingly, in machine learning the random variable is often written as a lower-case character, x, and the subscript of the PDF is dropped, resulting in x ∼ f(x) or x ∼ p(x). So it is very important to have a clear understanding of what the random variable is.

Some properties of the PDF:

f_X(x) ≥ 0

Pr[x_1 ≤ X ≤ x_2] = ∫_{x_1}^{x_2} f_X(x) dx

F_X(x) = ∫_{−∞}^{x} f_X(t) dt

1 = ∫_{−∞}^{∞} f_X(x) dx
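For instance, the third property can be checked numerically for the exponential distribution of Example 2, whose density is f_X(t) = λ exp(−λt) for t ≥ 0 (a simple sketch of mine using hand-rolled trapezoidal integration; λ and x are arbitrary):

import numpy as np

lam = 2.0
f = lambda t: lam * np.exp(-lam * t)   # exponential PDF on t >= 0

x = 1.5
t = np.linspace(0.0, x, 10_001)
dt = t[1] - t[0]
cdf_numeric = np.sum((f(t[:-1]) + f(t[1:])) / 2) * dt   # ∫_0^x f(t) dt (density is 0 for t < 0)
cdf_closed = 1 - np.exp(-lam * x)                       # F_X(x) from eq. (14)
print(cdf_numeric, cdf_closed)                          # agree to several decimal places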



Example 3. (Exponential RV)

f_X(x) = { λ exp(−λx), x ≥ 0;  0, x < 0 }.    (15)

Example 4. (Gaussian RV)

f_X(x) = (1 / (√(2π) σ)) exp(−(x − m)² / (2σ²)),    E[X] = m, VAR[X] = σ²    (16)

The Gaussian random variable is often written as N(m, σ²).

Theorem 1. Y = aX + b, X ∼ N(m, σ²) =⇒ Y ∼ N(am + b, (|a|σ)²)
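A simulation sketch consistent with Theorem 1 (my own example; it only checks the stated mean and standard deviation, not full Gaussianity, and the values of a, b, m, σ are arbitrary):

import numpy as np

m, sigma = 1.0, 2.0
a, b = -3.0, 5.0
rng = np.random.default_rng(0)

X = rng.normal(m, sigma, size=500_000)
Y = a * X + b

print(Y.mean(), a * m + b)        # ≈ am + b
print(Y.std(), abs(a) * sigma)    # ≈ |a|σ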

Theorem 2. If X,Y are independent and Gaussian, their sum Z = X + Y is also Gaussian.

Theorem 3. (Page 217, Papoulis, 4th Ed) (Cramér) If X, Y are independent and Z = X + Y is Gaussian, then X and Y are both Gaussian.

Jointly Gaussian Random Variables

We close our review of probability with a description of jointly Gaussian random variables.

Definition 6. We say that a pair of random variables X, Y is jointly Gaussian if

f_{XY}(x, y) = (1 / (2πσ_1σ_2 √(1 − ρ²))) exp( −(x_1² − 2ρ x_1 y_1 + y_1²) / (2(1 − ρ²)) )    (17)

where x_1 = (x − m_1)/σ_1, y_1 = (y − m_2)/σ_2, ρ = COV(X, Y)/(σ_1σ_2) is the correlation coefficient, and COV(X, Y) = E[(X − E[X])(Y − E[Y])] is the covariance.
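Equation (17) can be cross-checked against scipy.stats.multivariate_normal, assuming SciPy is available (the parameter values below are arbitrary, and the helper f_xy is my own):

import numpy as np
from scipy.stats import multivariate_normal

m1, m2, s1, s2, rho = 1.0, -2.0, 1.5, 0.5, 0.6

def f_xy(x, y):
    """Bivariate Gaussian density, eq. (17)."""
    x1 = (x - m1) / s1
    y1 = (y - m2) / s2
    z = x1**2 - 2 * rho * x1 * y1 + y1**2
    return np.exp(-z / (2 * (1 - rho**2))) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]
ref = multivariate_normal(mean=[m1, m2], cov=cov)

x, y = 0.3, -1.7
print(f_xy(x, y), ref.pdf([x, y]))   # the two values agree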

Theorem 4. Let X, Y be jointly Gaussian. Then X, Y are independent (f_{XY}(x, y) = f_X(x)f_Y(y)) if and only if COV(X, Y) = 0 (i.e., X, Y are uncorrelated).

Remark 6. This is not true if X, Y are not jointly Gaussian. For general random variables, X, Y independent implies uncorrelated, but uncorrelated does not imply independent.

Theorem 5. If X, Y are jointly Gaussian, their marginal PDFs are Gaussian.

Theorem 6. If X, Y are jointly Gaussian, the conditional PDF of X given Y = y is Gaussian.

Theorem 7. A linear transform of jointly Gaussian random variables is jointly Gaussian. That is, suppose X, Y are jointly Gaussian random variables and define random variables Z, W via

(Z, W)^T = A (X, Y)^T,    A ∈ R^{2×2}    (18)

then Z, W are jointly Gaussian.
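A simulation sketch consistent with Theorem 7 (my own example): it checks that the covariance of (Z, W) equals AΣA^T, where Σ is the covariance of (X, Y); the matrices below are arbitrary, and the check does not by itself establish joint Gaussianity.

import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])      # covariance of (X, Y)
A = np.array([[1.0, 2.0],
              [-1.0, 3.0]])         # the linear map in eq. (18)

XY = rng.multivariate_normal([0.0, 0.0], Sigma, size=500_000)  # samples of (X, Y)
ZW = XY @ A.T                       # each row is A applied to (x, y)

print(np.cov(ZW, rowvar=False))     # ≈ A Σ Aᵀ
print(A @ Sigma @ A.T)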

Finally, we note that if X, Y are each Gaussian, their joint distribution is not necessarily jointly Gaussian. Despite being a very important fact, such examples usually take a bit of effort to construct.
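One standard construction (my own addition, not from the notes): let X ∼ N(0, 1) and Y = SX, where S is a random sign independent of X. Then X and Y are each N(0, 1) and uncorrelated, yet they are clearly dependent, and X + Y equals 0 with probability 1/2, so X + Y is not Gaussian and the pair cannot be jointly Gaussian.

import numpy as np

rng = np.random.default_rng(0)
n = 500_000
X = rng.normal(0, 1, n)
S = rng.choice([-1, 1], n)      # random sign, independent of X
Y = S * X                       # Y is also N(0, 1)

print(np.cov(X, Y)[0, 1])       # ≈ 0: X and Y are uncorrelated
print(np.mean(X + Y == 0))      # ≈ 0.5: X + Y has an atom at 0, hence is not Gaussian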

© Copyright by Bolin Gao 2019
