Deep learning & tensorflow

Xiamen University

杞坚玮

2017.3.22


What I want to talk about

1. Let’s start with machine learning

2. Perceptron

3. Neural network

4. Why deep

5. CNN, RNN, LSTM

6. Tensorflow

7. Some demos

8. Other frameworks

9. Future work

1. Let’s start with machine learning

When we talk about machine learning, we are really talking about looking for a function.

• Speech recognition: f(audio clip) = “Deep learning is great.”

• Image recognition: f(image) = “Cat”

• Sentiment analysis: f(“I think this car is great”) = “Positive attitude”

• Dialogue system: f(“How are you?”) = “I’m fine, and you?”

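The framework of machine learning

Training:

Step 1: Define a set of functions (the model): f1, f2, f3, …

Step 2: Measure the goodness of each function f on the training data (for example, images labeled “cat”, “dog”, “cat”).

Step 3: Pick the best function f*.

Testing:

Apply f* to the test data to answer: “Is this a cat or not?”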
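
The training and testing steps above can be sketched in a few lines of Python. This is only a toy illustration: the threshold classifiers, the single numeric feature, and the data points are all made up for the example and do not come from the slides.

```python
# Toy version of the framework: choose the best function from a small candidate set.
# The candidate functions and all data values are hypothetical.

def make_threshold_classifier(threshold):
    """Return f(x) that answers 'cat' when the feature x exceeds the threshold."""
    return lambda x: "cat" if x > threshold else "dog"

# A set of functions (the model): f1, f2, f3.
candidates = {t: make_threshold_classifier(t) for t in (0.2, 0.5, 0.8)}

# Training data: (feature, label) pairs.
training_data = [(0.9, "cat"), (0.7, "cat"), (0.6, "cat"), (0.3, "dog"), (0.1, "dog")]

def goodness(f, data):
    """Goodness of function f: fraction of examples it labels correctly."""
    return sum(f(x) == label for x, label in data) / len(data)

# Training: pick the best function f*.
best_threshold = max(candidates, key=lambda t: goodness(candidates[t], training_data))
f_star = candidates[best_threshold]

# Testing: apply f* to data it has not seen during training.
test_data = [(0.95, "cat"), (0.25, "dog")]
test_accuracy = goodness(f_star, test_data)
print(best_threshold, test_accuracy)
```

In real machine learning the set of functions is usually infinite (all settings of a model's parameters), so the best function is found by optimization rather than by trying every candidate.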


2. Perceptron

The perceptron is a discriminative model for supervised learning: a linear classifier that separates two classes with a hyperplane.
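
As a minimal sketch, the classic perceptron learning rule can be written as below. The slides do not give a training procedure, so the update rule, the labels +1/-1, the learning rate, and the AND data set are assumptions chosen for illustration.

```python
def predict(w, b, x):
    """Linear decision rule: +1 on one side of the hyperplane, -1 on the other."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def train_perceptron(data, epochs=20, lr=1.0):
    """Perceptron rule: on each mistake, move the hyperplane toward the example."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            if predict(w, b, x) != y:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return w, b

# Hypothetical training set: logical AND with labels +1 / -1 (linearly separable).
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(data)
print(w, b, [predict(w, b, x) for x, _ in data])
```

The rule only converges when the two classes are linearly separable, which is one motivation for the multi-layer networks in the next section.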


3. Neural network

Neural networks give a way of defining a complex, non-linear form of hypotheses $h_{W,b}(x)$, with parameters $W, b$ that we can fit to our data.

[Figure: a single neuron with inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$, bias $b$, activation $\sigma(z)$, and output $h_{W,b}(x)$.]

This neuron is a computational unit that takes $x_1, x_2, x_3$ as input and outputs

$$h_{W,b}(x) = \sigma\left(\sum_i w_i x_i + b\right),$$

where $\sigma(z)$ is called the activation function.

There are several common choices for the activation function:

Sigmoid: $\sigma(z) = \frac{1}{1+e^{-z}}$; Tanh: $\sigma(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$; Rectified linear: $\sigma(z) = \max(0, z)$
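
The neuron and the three activation functions translate directly into code. The sketch below uses only the Python standard library; the input, weight, and bias values are arbitrary numbers picked so the result is easy to check by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def relu(z):
    """Rectified linear activation."""
    return max(0.0, z)

def neuron(x, w, b, activation=sigmoid):
    """Compute h(x) = activation(sum_i w_i * x_i + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

# Arbitrary example values: z = 0.5*1 - 0.25*2 + 0*3 + 0 = 0.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
b = 0.0
print(neuron(x, w, b))                    # sigmoid(0)
print(neuron(x, w, b, activation=tanh))   # tanh(0)
print(neuron(x, w, b, activation=relu))   # max(0, 0)
```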

We define the overall cost function to be:

$$J(W,b) = \frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}\left\|h_{W,b}(x^{(i)}) - y^{(i)}\right\|^2 + \frac{\theta}{2}\sum_{l=1}^{n_l-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \left(W_{ji}^{(l)}\right)^2$$

where $m$ is the number of training examples, $n_l$ is the number of layers, and $s_l$ is the number of nodes in layer $l$ (not counting the bias unit).

The first term is the average sum-of-squares error term; the second is the weight decay term.
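
To make the two terms concrete, here is a small sketch that evaluates the cost for the simplest possible network: a single sigmoid neuron, so there is one weight vector and the weight decay sum runs over its entries only. The training examples and parameter values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(W, b, x):
    """h_{W,b}(x) for a single sigmoid neuron (an assumption made for brevity)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(W, x)) + b)

def cost(W, b, examples, theta):
    m = len(examples)
    # Average sum-of-squares error term.
    error_term = sum(0.5 * (hypothesis(W, b, x) - y) ** 2 for x, y in examples) / m
    # Weight decay term: penalizes large weights, but not the bias.
    decay_term = (theta / 2.0) * sum(wi ** 2 for wi in W)
    return error_term + decay_term

# Hypothetical training examples: (input vector, target).
examples = [([1.0, 2.0], 1.0), ([0.5, -1.0], 0.0)]
print(cost([0.0, 0.0], 0.0, examples, theta=0.1))
print(cost([1.0, -1.0], 0.0, examples, theta=0.1))
```

With all weights at zero the decay term vanishes and only the error term remains; a larger theta pushes the minimizer toward smaller weights.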

We can update the parameters $W, b$ by gradient descent as follows:

$$W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b)$$

$$b_i^{(l)} = b_i^{(l)} - \alpha \frac{\partial}{\partial b_i^{(l)}} J(W,b)$$

where $\alpha$ is the learning rate. The difficulty lies in computing the partial derivatives above; the backpropagation algorithm gives an efficient way to compute them.

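In the backpropagation algorithm, we first describe how to compute the partial derivatives of the cost function with respect to a single example, $J(W,b;x,y)$. Once we can compute these, the derivatives of the overall cost function can be computed as follows (the sum over training examples is indexed by $k$ here, because $i$ already denotes a unit):

$$\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) = \left[\frac{1}{m}\sum_{k=1}^{m} \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x^{(k)},y^{(k)})\right] + \theta W_{ij}^{(l)}$$

$$\frac{\partial}{\partial b_i^{(l)}} J(W,b) = \frac{1}{m}\sum_{k=1}^{m} \frac{\partial}{\partial b_i^{(l)}} J(W,b;x^{(k)},y^{(k)})$$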

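Here is the backpropagation algorithm:

1. Perform a feedforward pass, computing the activations for layers $L_2$, $L_3$, and so on up to the output layer $L_{n_l}$.

2. For each output unit $i$ in layer $L_{n_l}$ (the output layer), set

$$\delta_i^{(n_l)} = \frac{\partial}{\partial z_i^{(n_l)}} \frac{1}{2}\left\|y - h_{W,b}(x)\right\|^2 = -\left(y_i - a_i^{(n_l)}\right) f'\left(z_i^{(n_l)}\right),$$

where

$$z_i^{(l)} = \sum_{j=1}^{s_{l-1}} W_{ij}^{(l-1)} a_j^{(l-1)} + b_i^{(l-1)}, \qquad a_i^{(l)} = f\left(z_i^{(l)}\right),$$

$f$ is the activation function, and $a^{(1)} = x$.

3. For $l = n_l - 1, n_l - 2, n_l - 3, \dots, 2$, for each node $i$ in layer $l$, set

$$\delta_i^{(l)} = \left(\sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_j^{(l+1)}\right) f'\left(z_i^{(l)}\right)$$

4. Compute the desired partial derivatives, which are given as:

$$\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x,y) = a_j^{(l)} \delta_i^{(l+1)}$$

$$\frac{\partial}{\partial b_i^{(l)}} J(W,b;x,y) = \delta_i^{(l+1)}$$

Note: In steps 2 and 3 above, we need to compute $f'(z_i^{(l)})$ for each value of $i$. Assuming we choose the sigmoid as the activation function, it can be computed as $f'(z_i^{(l)}) = a_i^{(l)}\left(1 - a_i^{(l)}\right)$.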
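
The four steps can be sketched for a tiny network with two inputs, two sigmoid hidden units, and one sigmoid output. The network size and all parameter values below are assumptions chosen to keep the example short; the final lines compare one backpropagated derivative with a central-difference estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, b1, W2, b2, x):
    """Step 1: feedforward pass through the hidden layer and the output layer."""
    a2 = [sigmoid(sum(w * xj for w, xj in zip(row, x)) + bi) for row, bi in zip(W1, b1)]
    a3 = [sigmoid(sum(w * aj for w, aj in zip(row, a2)) + bi) for row, bi in zip(W2, b2)]
    return a2, a3

def example_cost(W1, b1, W2, b2, x, y):
    """J(W, b; x, y) = 1/2 * ||y - h(x)||^2 for a single example."""
    _, a3 = forward(W1, b1, W2, b2, x)
    return 0.5 * sum((yi - ai) ** 2 for yi, ai in zip(y, a3))

def backprop(W1, b1, W2, b2, x, y):
    a2, a3 = forward(W1, b1, W2, b2, x)
    # Step 2: output-layer deltas; for the sigmoid, f'(z) = a * (1 - a).
    d3 = [-(yi - ai) * ai * (1 - ai) for yi, ai in zip(y, a3)]
    # Step 3: hidden-layer deltas, propagated back through W2.
    d2 = [sum(W2[j][i] * d3[j] for j in range(len(d3))) * a2[i] * (1 - a2[i])
          for i in range(len(a2))]
    # Step 4: dJ/dW_ij = a_j * (delta_i of the next layer); dJ/db_i = that delta_i.
    grad_W2 = [[aj * di for aj in a2] for di in d3]
    grad_W1 = [[xj * di for xj in x] for di in d2]
    return grad_W1, d2, grad_W2, d3

# Arbitrary parameters and a single hypothetical training example.
W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]
W2, b2 = [[0.5, -0.6]], [0.2]
x, y = [1.0, 0.5], [1.0]
grad_W1, grad_b1, grad_W2, grad_b2 = backprop(W1, b1, W2, b2, x, y)

# Numerical check of one entry of grad_W1 by central differences.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus[0][1] -= eps
numeric = (example_cost(W1_plus, b1, W2, b2, x, y)
           - example_cost(W1_minus, b1, W2, b2, x, y)) / (2 * eps)
print(grad_W1[0][1], numeric)
```

Backpropagation obtains every derivative from one forward and one backward pass, whereas the numerical check needs two forward passes per parameter.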

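Finally, we can describe the full gradient descent algorithm in the pseudo-code below.

1. Set $\Delta W^{(l)} = 0$, $\Delta b^{(l)} = 0$ for all $l$.

2. For $i = 1$ to $m$:

   1. Use backpropagation to compute $\nabla_{W^{(l)}} J(W,b;x^{(i)},y^{(i)})$ and $\nabla_{b^{(l)}} J(W,b;x^{(i)},y^{(i)})$.

   2. Set $\Delta W^{(l)} = \Delta W^{(l)} + \nabla_{W^{(l)}} J(W,b;x^{(i)},y^{(i)})$.

   3. Set $\Delta b^{(l)} = \Delta b^{(l)} + \nabla_{b^{(l)}} J(W,b;x^{(i)},y^{(i)})$.

3. Update the parameters:

$$W^{(l)} = W^{(l)} - \alpha\left[\frac{1}{m}\Delta W^{(l)} + \theta W^{(l)}\right], \qquad b^{(l)} = b^{(l)} - \alpha\left[\frac{1}{m}\Delta b^{(l)}\right]$$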
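
The pseudo-code can be exercised on the smallest possible model. The sketch below assumes a single linear unit $h(x) = wx + b$ (identity activation, chosen so the per-example gradient can be written by hand instead of calling backpropagation), no weight decay, and a learning rate and step count picked for this data set.

```python
def batch_gradient_descent(examples, w, b, alpha=0.1, theta=0.0, steps=2000):
    """Repeat the accumulate-then-update loop for a fixed number of steps."""
    m = len(examples)
    for _ in range(steps):
        delta_w, delta_b = 0.0, 0.0              # 1. reset the accumulators
        for x, y in examples:                    # 2. loop over the training examples
            error = (w * x + b) - y              #    h(x) - y
            delta_w += error * x                 #    d/dw of 1/2 * (h(x) - y)^2
            delta_b += error                     #    d/db of 1/2 * (h(x) - y)^2
        w = w - alpha * (delta_w / m + theta * w)  # 3. update, with weight decay on w
        b = b - alpha * (delta_b / m)
    return w, b

# Data points that lie exactly on the line y = -x + 1.
examples = list(zip([1, 2, 3, 4], [0, -1, -2, -3]))
w, b = batch_gradient_descent(examples, w=0.3, b=-0.3)
print(w, b)
```

Setting theta above zero would shrink w toward zero at every step, so the fitted line would no longer pass exactly through the data.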


4. Why deep?

More parameters generally give a network more capacity and, with enough data, better performance. We can add parameters by building either a deep neural network or a fat (wide) neural network. Which is better?

The answer is the deep neural network. Each layer of a deep neural network can be trained as a classifier module that later layers reuse. This process is called modularization, and it needs less data than a neural network with one hidden layer.


5. CNN, RNN, LSTM

CNN: Convolutional Neural Network

From images to other kinds of data (for example, documents), CNNs rely on three properties:

1. Some patterns are much smaller than the whole image.
   • In text: some patterns are much smaller than the whole document.

2. The same patterns appear in different regions.
   • In text: the same patterns appear in different positions of the document.

3. Subsampling the pixels will not change the object.
   • In text: subsampling the words will not change the topic of the document.

So we can use convolution to exploit properties 1 and 2, and pooling to exploit property 3. We then flatten the features and use them as the input to a fully connected feedforward network.

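Convolution

Input (6*6):

1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1 (3*3):

 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2 (3*3):

-1  1 -1
-1  1 -1
-1  1 -1

[Figure: each filter slides over the input with a given stride; every position yields one entry of that filter's feature map. The entries shown in the figure are 3 and -1.]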
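
The sliding-window computation can be sketched in plain Python using the 6*6 input and the two 3*3 filters from the slide. The function name `convolve2d` is made up for this example, and the sketch assumes no padding.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image (no padding) and sum the elementwise products."""
    k = len(kernel)
    n_rows = (len(image) - k) // stride + 1
    n_cols = (len(image[0]) - k) // stride + 1
    return [[sum(image[r * stride + i][c * stride + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(n_cols)]
            for r in range(n_rows)]

image = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
filter_1 = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]   # responds to a diagonal line
filter_2 = [[-1, 1, -1], [-1, 1, -1], [-1, 1, -1]]   # responds to a vertical line

feature_map_1 = convolve2d(image, filter_1)            # stride 1: 6*6 -> 4*4
feature_map_2 = convolve2d(image, filter_2)
feature_map_1_stride_2 = convolve2d(image, filter_1, stride=2)  # stride 2: 6*6 -> 2*2

for row in feature_map_1:
    print(row)
```

The same small filter is applied at every position, which is how convolution uses the first two properties: the pattern is small, and it can appear anywhere in the image.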

Max pooling

Feature map of the first filter (4*4):

 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

After 2*2 max pooling (2*2):

3 0
3 1

Convolution takes the 6*6 input to a 4*4 feature map, using two properties: some patterns are much smaller than the whole image, and the same patterns appear in different regions.

Max pooling takes the 4*4 feature map to 2*2, using the third property: subsampling the pixels will not change the object.
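
Max pooling keeps only the largest value in each window. The sketch below applies non-overlapping 2*2 windows to the 4*4 feature map obtained by convolving the slide's 6*6 input with its first 3*3 filter at stride 1; the function name `max_pool` is made up for this example.

```python
def max_pool(feature_map, size=2):
    """Take the maximum over each non-overlapping size*size window."""
    return [[max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]

# 4*4 feature map of the first filter.
feature_map = [
    [3, -1, -3, -1],
    [-3, 1, 0, -3],
    [-3, -3, 0, 1],
    [3, -2, -2, -1],
]
pooled = max_pool(feature_map)   # 4*4 -> 2*2
for row in pooled:
    print(row)
```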

The whole CNN

begin -> image -> convolution -> pooling -> convolution -> pooling -> flatten

The convolution and pooling steps can repeat many times.

RNN: Recurrent neural network

The outputs of the hidden layer are stored in memory. The memory can be considered as another input.

[Figure: a network with inputs x1, x2 and memory cells a1, a2.]

Example

Query 1: I will leave Xiamen on March 17th.

Query 2: I will arrive in Xiamen on March 20th.

The values stored in memory differ after “leave” and “arrive”, so the network can label Xiamen differently in the two queries: as the place of departure in Query 1 and as the destination in Query 2.
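
The role of the memory can be illustrated with a one-unit recurrent network. Everything numeric here is hypothetical: the words are encoded as made-up scalars, and the weights are hand-picked rather than trained. The point is only that the same input ("Xiamen") produces different outputs depending on what was stored in memory by the previous word.

```python
import math

def rnn_step(x, memory, w_in, w_mem, b):
    """One recurrent step: the hidden value depends on the input and on the memory."""
    return math.tanh(w_in * x + w_mem * memory + b)

def run(sequence, w_in=1.0, w_mem=2.0, b=0.0):
    memory = 0.0                 # the memory starts empty
    outputs = []
    for x in sequence:
        memory = rnn_step(x, memory, w_in, w_mem, b)  # hidden output is stored as memory
        outputs.append(memory)
    return outputs

# Hypothetical scalar encodings of three words.
LEAVE, ARRIVE, XIAMEN = -1.0, 1.0, 0.1

out_leave = run([LEAVE, XIAMEN])[-1]     # output at "Xiamen" after "leave"
out_arrive = run([ARRIVE, XIAMEN])[-1]   # output at "Xiamen" after "arrive"
print(out_leave, out_arrive)
```

A downstream classifier could map the negative output to "departure" and the positive one to "destination", even though the current input is identical in both cases.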

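LSTM: Long short-term memory

[Figure: an LSTM cell with input $z$, input gate signal $z_i$, forget gate signal $z_f$, output gate signal $z_o$, gate activation $f$, input activation $g$, output activation $h$, stored cell value $c$, and updated cell value $c'$.]

$$c' = g(z)\, f(z_i) + c\, f(z_f)$$

$$a = h(c')\, f(z_o)$$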
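
The two cell equations can be checked numerically. The sketch assumes the common choices of a sigmoid for the gate activation f and tanh for g and h; the slides do not say which functions are used, and the input values are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell(z, z_i, z_f, z_o, c, f=sigmoid, g=math.tanh, h=math.tanh):
    """One LSTM cell update: returns the output a and the new cell value c'."""
    c_new = g(z) * f(z_i) + c * f(z_f)   # input gate scales new content, forget gate scales old
    a = h(c_new) * f(z_o)                # output gate scales what leaves the cell
    return a, c_new

# Input gate closed, forget gate open: the cell keeps its old value.
a_keep, c_keep = lstm_cell(z=2.0, z_i=-100.0, z_f=100.0, z_o=100.0, c=0.7)

# Input gate open, forget gate closed: the cell is overwritten with g(z).
a_write, c_write = lstm_cell(z=2.0, z_i=100.0, z_f=-100.0, z_o=100.0, c=0.7)

print(c_keep, c_write)
```

Because the forget gate can hold its value near 1 for many steps, the cell can carry information much further than the plain recurrent memory, which is overwritten at every step.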


6. Tensorflow

TensorFlow is an open source software library for numerical computation using

data flow graphs.

Nodes in the graph represent mathematical operations, while the graph edges

represent the multidimensional data arrays (tensors) communicated between

them.

The flexible architecture allows you to deploy computation to one or more CPUs

or GPUs in a desktop, server, or mobile device with a single API.
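
The data flow graph idea can be illustrated without TensorFlow at all. The toy classes below are not TensorFlow's API or implementation; the names `Placeholder`, `Constant`, `Add`, and `run` are invented for this sketch. It only shows the separation between building a graph of operations and later running it with concrete values.

```python
class Placeholder:
    """A graph node whose value is supplied when the graph is run."""

class Constant:
    def __init__(self, value):
        self.value = value

class Add:
    """An operation node; its two incoming edges carry the values to add."""
    def __init__(self, left, right):
        self.left = left
        self.right = right

def run(node, feed=None):
    """Evaluate a node by recursively evaluating the nodes it depends on."""
    feed = feed or {}
    if isinstance(node, Placeholder):
        return feed[node]
    if isinstance(node, Constant):
        return node.value
    if isinstance(node, Add):
        return run(node.left, feed) + run(node.right, feed)
    raise TypeError("unknown node type: %r" % (node,))

# Build the graph first: nothing is computed yet.
a = Placeholder()
b = Placeholder()
adder_node = Add(a, b)
shifted = Add(adder_node, Constant(1.0))

# Run it later with concrete inputs.
print(run(adder_node, {a: 3, b: 4.5}))
print(run(shifted, {a: 3, b: 4.5}))
```

Keeping the graph separate from its execution is what allows a framework to place the same computation on different devices.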

Tensor

[1, 2, 3] # a rank 1 tensor with shape [3]

[[1, 2, 3], [4, 5, 6]] # a rank 2 tensor with shape [2, 3]

[[[1, 2, 3]], [[4, 5, 6]]] # a rank 3 tensor with shape [2, 1, 3]

Node

import tensorflow as tf  # TensorFlow 1.x graph API

node = tf.constant(3.0, tf.float32)  # a constant node; it is not evaluated yet
print(node)
# => Tensor("Const:0", shape=(), dtype=float32)

sess = tf.Session()  # a session evaluates nodes of the graph
print(sess.run([node]))
# => [3.0]

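import tensorflow as tf

sess = tf.Session()

# Placeholders are inputs whose values are supplied when the graph is run.
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b

print(sess.run(adder_node, {a: 3, b: 4.5}))
# => 7.5
print(sess.run(adder_node, {a: [1, 3], b: [2, 4]}))
# => [ 3.  7.]

[Figure: graph in which placeholders a and b feed adder_node.]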

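# Variables hold trainable parameters; dtype must be passed by keyword.
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W * x + b

# Variables must be initialized before the graph is run.
sess.run(tf.global_variables_initializer())
print(sess.run(linear_model, {x: [1, 2, 3, 4]}))
# => approximately [0. 0.3 0.6 0.9]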

# Target values; the loss compares linear_model against y.
y = tf.placeholder(tf.float32)
loss = tf.reduce_sum(tf.square(linear_model - y))  # sum of squared errors

optimizer = tf.train.GradientDescentOptimizer(0.01)  # learning rate 0.01
train = optimizer.minimize(loss)

x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]

init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)  # set W and b to their initial values

for i in range(1000):
    sess.run(train, {x: x_train, y: y_train})

# Evaluate the trained parameters and the final loss.
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s" % (curr_W, curr_b, curr_loss))


7. Some demos

CNN in image recognition

begin -> 28*28 images of handwritten digits (1 28*28 pic)
-> 5*5 convolution, 4 features, zero padding (4 28*28 pics)
-> 2*2 pooling (4 14*14 pics)
-> 5*5 convolution, 8 features, zero padding (8 14*14 pics)
-> 2*2 pooling (8 7*7 pics)
-> flatten (8*7*7 input nodes)
-> 1024 hidden nodes
-> 10 output nodes, softmax

Accuracy rate is 99.2%
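
The picture sizes in this pipeline follow from simple arithmetic. The sketch below assumes that "0 padding" means zero padding of 2 pixels on each side with stride 1, which is what keeps a 5*5 convolution from shrinking the picture; the helper names are made up for this example.

```python
def conv_output_size(size, kernel, padding, stride=1):
    """Spatial size after a convolution: (n + 2p - k) // s + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_output_size(size, window):
    """Spatial size after non-overlapping pooling."""
    return size // window

size = 28
shapes = [(1, size, size)]                 # (pictures, height, width) of the input
for features in (4, 8):
    size = conv_output_size(size, kernel=5, padding=2)   # 5*5 convolution, zero padding
    shapes.append((features, size, size))
    size = pool_output_size(size, window=2)              # 2*2 pooling
    shapes.append((features, size, size))

pictures, height, width = shapes[-1]
flat_nodes = pictures * height * width     # input nodes of the fully connected part
print(shapes, flat_nodes)
```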

RNN in Segmentation

begin -> PKU corpus -> dictionary index -> embedding -> 2 RNN layers -> dense -> {b; m; e; s}
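
The output tags can be turned back into words with a few lines of code. The sketch assumes the usual reading of the tag set in word segmentation (b = begin of a word, m = middle, e = end, s = single-character word); the example sentence and its tags are hypothetical, not model output.

```python
def tags_to_words(chars, tags):
    """Group characters into words according to their b/m/e/s tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "s":                 # single-character word
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "b":               # a new word begins
            if current:
                words.append(current)
            current = ch
        elif tag == "m":               # the current word continues
            current += ch
        else:                          # "e": the current word ends here
            words.append(current + ch)
            current = ""
    if current:                        # flush a word left open by malformed tags
        words.append(current)
    return words

chars = list("我爱厦门大学")
tags = ["s", "s", "b", "e", "b", "e"]
words = tags_to_words(chars, tags)
print(words)
```

Framing segmentation as per-character tagging is what lets a sequence model such as an RNN solve it: the network emits one of four labels for each input character.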


8. Other frameworks

| Software | Open source | Platform | Written in | Interface | OpenMP support | OpenCL support | CUDA support | Automatic differentiation | Has pretrained models | Recurrent nets | Convolutional nets | RBM/DBNs | Parallel execution (multi node) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Apache Singa | Yes | Linux, Mac OS X, Windows | C++ | Python, C++, Java | No | Yes | Yes | ? | Yes | Yes | Yes | Yes | Yes |
| Deeplearning4j | Yes | Linux, Mac OS X, Windows, Android (Cross-platform) | C, C++ | Java, Scala, Clojure, Python (Keras) | Yes | On roadmap | Yes | Computational Graph | Yes | Yes | Yes | Yes | Yes |
| Dlib | Yes | Cross-platform | C++ | C++ | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes |
| Keras | Yes | Linux, Mac OS X, Windows | Python | Python | Only if using Theano as backend | Under development for the Theano backend (and on roadmap for the TensorFlow backend) | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Microsoft Cognitive Toolkit - CNTK | Yes | Windows, Linux (OSX via Docker on roadmap) | C++ | Python, C++, Command line, BrainScript (.NET on roadmap) | Yes | No | Yes | Yes | Yes | Yes | Yes | No | Yes |
| MXNet | Yes | Linux, Mac OS X, Windows, AWS, Android, iOS, JavaScript | Small C++ core library | C++, Python, Julia, Matlab, JavaScript, Go, R, Scala, Perl | Yes | On roadmap | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Neural Designer | No | Linux, Mac OS X, Windows | C++ | Graphical user interface | Yes | No | No | ? | ? | No | No | No | ? |
| OpenNN | Yes | Cross-platform | C++ | C++ | Yes | No | No | ? | ? | No | No | No | ? |
| TensorFlow | Yes | Linux, Mac OS X, Windows | C++, Python | Python, (C/C++ public API only for executing graphs) | No | On roadmap | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Theano | Yes | Cross-platform | Python | Python | Yes | Under development | Yes | Yes | Through Lasagne's model zoo | Yes | Yes | Yes | Yes |
| Torch | Yes | Linux, Mac OS X, Windows, Android, iOS | C, Lua | Lua, LuaJIT, C, utility library for C++/OpenCL | Yes | Third party implementations | Yes | Through Twitter's Autograd | Yes | Yes | Yes | Yes | Yes |
| Wolfram Mathematica | No | Windows, Mac OS X, Linux, Cloud computing | C++ | Command line, Java, C++ | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |


9. Future work

Next move

Event extraction; identification of short-term or long-term influence.

Change the framework?

Keras seems to be a better choice.


References

1. http://ufldl.stanford.edu/tutorial

2. Deep Learning Tutorial, Hung-yi Lee, NTU

3. https://www.tensorflow.org

4. https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software