Deep learning & TensorFlow
杞坚玮
2017.3.22
What I want to talk about
1. Let’s start with machine learning
2. Perceptron
3. Neural network
4. Why deep
5. CNN, RNN, LSTM
6. TensorFlow
7. Some demos
8. Other frameworks
9. Future work
1. Let’s start with machine learning
Machine learning is, at its core, the search for a function that maps an input to the desired output.
• Speech recognition: f(audio clip) = “Deep learning is great.”
• Image recognition: f(image) = “Cat”
• Sentiment analysis: f(“I think this car is great”) = “Positive attitude”
• Dialogue system: f(“How are you?”) = “I’m fine, and you?”
The framework of machine learning
Training:
1. Start from a set of functions (Model 1, 2, 3…).
2. Use the training data (e.g. images labeled “cat” or “dog”) to measure the goodness of each function f.
3. Pick the best function f*.
Testing:
4. Apply f* to the test data: “Is this a cat or not?”
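The training and testing framework can be sketched in a few lines of Python. This is only a toy illustration: the one-number feature, the data, the three threshold models, and the names `candidates`, `goodness`, and `best_f` are all assumptions made for the example.

```python
# Toy data: (feature, label) pairs; label 1 = "cat", 0 = "not a cat".
# The single numeric feature is hypothetical.
train_data = [(0.9, 1), (0.8, 1), (0.2, 0), (0.1, 0)]
test_data = [(0.7, 1), (0.3, 0)]

# A set of functions (Model 1, 2, 3): simple threshold classifiers.
candidates = {
    "model_1": lambda x: 1 if x > 0.95 else 0,
    "model_2": lambda x: 1 if x > 0.5 else 0,
    "model_3": lambda x: 1 if x > 0.05 else 0,
}

def goodness(f, data):
    """Fraction of examples that f labels correctly."""
    return sum(1 for x, y in data if f(x) == y) / len(data)

# Training: pick the best function f* according to the training data.
best_name = max(candidates, key=lambda name: goodness(candidates[name], train_data))
best_f = candidates[best_name]

# Testing: apply f* to data it has not seen.
test_accuracy = goodness(best_f, test_data)
print(best_name, test_accuracy)
```

Real systems search a much larger, parameterized set of functions, which is where gradient descent comes in.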
2. Perceptron
The perceptron is a discriminative model for supervised learning: a linear binary classifier.
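As a rough sketch of how a perceptron can be trained, the code below applies the classic perceptron update rule to the logical AND problem. The data set, the learning rate, the number of epochs, and the function names are assumptions chosen for illustration.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Learn weights w and bias b with the perceptron update rule."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:                      # y is +1 or -1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                    # misclassified: move the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Sign of the linear score w.x + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Logical AND with labels in {+1, -1} (a linearly separable toy problem)
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(and_data)
print([predict(w, b, x) for x, _ in and_data])
```

The update only fires on misclassified points, so training stops changing the weights once every sample is on the correct side of the boundary.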
3. Neural network
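Neural networks give a way of defining a complex, non-linear form of hypotheses $h_{W,b}(x)$, with parameters $W, b$ that we can fit to our data.
[Figure: a single neuron with inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$, bias $b$, activation $\sigma(z)$, and output $h_{W,b}(x)$.]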
This neuron is a computational unit that takes $x_1, x_2, x_3$ as input and outputs $h_{W,b}(x) = \sigma\left(\sum_i w_i x_i + b\right)$, where $\sigma(z)$ is called the activation function.
There are several common choices for the activation function:
• Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
• Tanh: $\sigma(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
• Rectified linear: $\sigma(z) = \max(0, z)$
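The neuron and the three activation functions translate directly into plain Python. This is a minimal sketch; the input values and parameters are hypothetical and chosen so that the weighted sum is exactly zero.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def relu(z):
    return max(0.0, z)

def neuron(x, w, b, activation=sigmoid):
    """h_{W,b}(x) = activation(sum_i w_i * x_i + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

# Hypothetical inputs and parameters: z = 0.5 - 2.0 + 1.5 + 0.0 = 0
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.5]
b = 0.0
out = neuron(x, w, b)
print(out)
```

Passing `activation=tanh` or `activation=relu` swaps the non-linearity without touching the weighted sum.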
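We define the overall cost function to be:
$$J(W,b) = \left[\frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}\left\|h_{W,b}(x^{(i)}) - y^{(i)}\right\|^2\right] + \frac{\theta}{2}\sum_{l=1}^{n_l-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \left(W_{ji}^{(l)}\right)^2$$
where $m$ is the number of training examples, $n_l$ is the number of layers, and $s_l$ is the number of nodes in layer $l$ (not counting the bias unit). The first term is the average sum-of-squares error term; the second is the weight decay term.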
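To make the two terms concrete, here is a small sketch that evaluates the cost for given network outputs. The function name `cost`, its argument layout, and the numbers are assumptions for illustration; the network outputs are passed in directly rather than computed.

```python
def cost(predictions, targets, weights, theta):
    """J = average of 0.5*||h - y||^2 plus (theta/2) * sum of squared weights.

    predictions, targets: one output vector per training example.
    weights: list of weight matrices W^(l) as lists of rows (biases excluded).
    """
    m = len(predictions)
    error_term = sum(
        0.5 * sum((h - y) ** 2 for h, y in zip(h_vec, y_vec))
        for h_vec, y_vec in zip(predictions, targets)
    ) / m
    decay_term = (theta / 2.0) * sum(
        w ** 2 for matrix in weights for row in matrix for w in row
    )
    return error_term + decay_term

# Hypothetical values: two examples, one 1x2 weight matrix, theta = 0.1
J = cost(
    predictions=[[1.0], [0.0]],
    targets=[[0.0], [0.0]],
    weights=[[[1.0, 2.0]]],
    theta=0.1,
)
print(J)  # error term 0.25 + decay term 0.25
```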
We can update the parameters $W, b$ by gradient descent as follows:
$$W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b)$$
$$b_{i}^{(l)} = b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W,b)$$
where $\alpha$ is the learning rate. The difficulty lies in computing the partial derivatives above; the backpropagation algorithm gives an efficient way to compute them.
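In the backpropagation algorithm, we first describe how to compute the partial derivatives of the cost function $J(W,b;x,y)$ defined with respect to a single example $(x, y)$. Once we can compute these, the derivatives of the overall cost function can be computed as:
$$\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) = \left[\frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x^{(i)},y^{(i)})\right] + \theta W_{ij}^{(l)}$$
$$\frac{\partial}{\partial b_{i}^{(l)}} J(W,b) = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial b_{i}^{(l)}} J(W,b;x^{(i)},y^{(i)})$$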
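Here is the backpropagation algorithm:
1. Perform a feedforward pass, computing the activations for layers $L_2$, $L_3$, and so on up to the output layer $L_{n_l}$.
2. For each output unit $i$ in layer $L_{n_l}$ (the output layer), set
$$\delta_i^{(n_l)} = \frac{\partial}{\partial z_i^{(n_l)}} \frac{1}{2}\left\|y - h_{W,b}(x)\right\|^2 = -\left(y_i - a_i^{(n_l)}\right) f'\left(z_i^{(n_l)}\right)$$
where
$$z_i^{(l)} = \sum_{j=1}^{s_{l-1}} W_{ij}^{(l-1)} a_j^{(l-1)} + b_i^{(l-1)}, \qquad a_i^{(l)} = f\left(z_i^{(l)}\right), \qquad a^{(1)} = x$$
3. For $l = n_l - 1, n_l - 2, n_l - 3, \ldots, 2$, for each node $i$ in layer $l$, set
$$\delta_i^{(l)} = \left(\sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_j^{(l+1)}\right) f'\left(z_i^{(l)}\right)$$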
4. Compute the desired partial derivatives, which are given as:
$$\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x,y) = a_j^{(l)} \delta_i^{(l+1)}$$
$$\frac{\partial}{\partial b_{i}^{(l)}} J(W,b;x,y) = \delta_i^{(l+1)}$$
Note: In steps 2 and 3 above, we need to compute $f'(z_i^{(l)})$ for each value of $i$. If we choose the sigmoid as the activation function, it can be computed as $f'(z_i^{(l)}) = a_i^{(l)}\left(1 - a_i^{(l)}\right)$.
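The four steps can be checked on a tiny network. The sketch below assumes a network with two inputs, two sigmoid hidden units, and one sigmoid output; all parameter values are hypothetical. It computes the gradients for one example by backpropagation and compares one of them with a numerical estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, b1, W2, b2, x):
    """Step 1: feedforward pass; returns the activations of all three layers."""
    a1 = x
    a2 = [sigmoid(sum(W1[i][j] * a1[j] for j in range(len(a1))) + b1[i])
          for i in range(len(W1))]
    a3 = [sigmoid(sum(W2[i][j] * a2[j] for j in range(len(a2))) + b2[i])
          for i in range(len(W2))]
    return a1, a2, a3

def backprop(W1, b1, W2, b2, x, y):
    """Gradients of J = 0.5*||y - h(x)||^2 for a single example."""
    a1, a2, a3 = forward(W1, b1, W2, b2, x)
    # Step 2: output deltas, using f'(z) = a(1 - a) for the sigmoid
    d3 = [-(y[i] - a3[i]) * a3[i] * (1 - a3[i]) for i in range(len(a3))]
    # Step 3: hidden deltas
    d2 = [sum(W2[j][i] * d3[j] for j in range(len(d3))) * a2[i] * (1 - a2[i])
          for i in range(len(a2))]
    # Step 4: dJ/dW_ij = a_j * delta_i of the next layer; dJ/db_i = delta_i
    gW2 = [[a2[j] * d3[i] for j in range(len(a2))] for i in range(len(d3))]
    gW1 = [[a1[j] * d2[i] for j in range(len(a1))] for i in range(len(d2))]
    return gW1, d2, gW2, d3

def loss(W1, b1, W2, b2, x, y):
    h = forward(W1, b1, W2, b2, x)[2]
    return 0.5 * sum((yi - hi) ** 2 for yi, hi in zip(y, h))

# Hypothetical parameters and one training example
W1 = [[0.1, -0.2], [0.4, 0.3]]
b1 = [0.0, 0.1]
W2 = [[0.5, -0.6]]
b2 = [0.2]
x, y = [1.0, 0.5], [1.0]

gW1, gb1, gW2, gb2 = backprop(W1, b1, W2, b2, x, y)

# Numerical check of dJ/dW1[0][1] with central differences
eps = 1e-6
W1[0][1] += eps
plus = loss(W1, b1, W2, b2, x, y)
W1[0][1] -= 2 * eps
minus = loss(W1, b1, W2, b2, x, y)
W1[0][1] += eps  # restore the original weight
numeric = (plus - minus) / (2 * eps)
print(gW1[0][1], numeric)
```

A gradient check like this is a cheap way to catch index or sign mistakes before training.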
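Finally, we can describe the full gradient descent algorithm in the pseudo-code below.
1. Set $\Delta W^{(l)} = 0$, $\Delta b^{(l)} = 0$ for all $l$.
2. For $i = 1$ to $m$:
   1. Use backpropagation to compute $\nabla_{W^{(l)}} J(W,b;x^{(i)},y^{(i)})$ and $\nabla_{b^{(l)}} J(W,b;x^{(i)},y^{(i)})$.
   2. Set $\Delta W^{(l)} = \Delta W^{(l)} + \nabla_{W^{(l)}} J(W,b;x^{(i)},y^{(i)})$.
   3. Set $\Delta b^{(l)} = \Delta b^{(l)} + \nabla_{b^{(l)}} J(W,b;x^{(i)},y^{(i)})$.
3. Update the parameters:
$$W^{(l)} = W^{(l)} - \alpha\left[\frac{1}{m}\Delta W^{(l)} + \theta W^{(l)}\right], \qquad b^{(l)} = b^{(l)} - \alpha\left[\frac{1}{m}\Delta b^{(l)}\right]$$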
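A minimal sketch of this loop is shown below. To keep it short it assumes the simplest possible network, a single sigmoid unit with no hidden layer, so the backpropagation step reduces to the output delta. The OR data set, the learning rate, and the iteration count are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_gradient_descent(data, alpha=1.0, theta=0.0, iterations=2000):
    n = len(data[0][0])
    m = len(data)
    W = [0.0] * n
    b = 0.0
    for _ in range(iterations):
        dW = [0.0] * n                      # 1. reset the accumulators
        db = 0.0
        for x, y in data:                   # 2. accumulate per-example gradients
            a = sigmoid(sum(w * xi for w, xi in zip(W, x)) + b)
            delta = -(y - a) * a * (1 - a)  # output delta with f'(z) = a(1 - a)
            dW = [dw + delta * xi for dw, xi in zip(dW, x)]
            db += delta
        # 3. update: W = W - alpha*(dW/m + theta*W), b = b - alpha*(db/m)
        W = [w - alpha * (dw / m + theta * w) for w, dw in zip(W, dW)]
        b = b - alpha * db / m
    return W, b

def mean_cost(W, b, data):
    """Average sum-of-squares error (weight decay term omitted)."""
    return sum(
        0.5 * (y - sigmoid(sum(w * xi for w, xi in zip(W, x)) + b)) ** 2
        for x, y in data
    ) / len(data)

# Logical OR as a toy training set (hypothetical)
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
W, b = batch_gradient_descent(or_data)
print(mean_cost([0.0, 0.0], 0.0, or_data), mean_cost(W, b, or_data))
```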
4. Why deep?
More parameters generally give a network more capacity and, with enough data, better performance. With the same number of parameters we can build a deep neural network (many layers) or a fat one (a single wide hidden layer). Which is better?
The answer is the deep neural network. Each layer of a deep network can be trained as a classifier module that later layers build on. This process is called modularization, and it needs less training data than a network with one hidden layer.
5. CNN, RNN, LSTM
CNN: Convolutional Neural Network
CNNs rely on three properties that hold for images and carry over to other data such as documents:
• Some patterns are much smaller than the whole image.
  • For text: much smaller than the whole document.
• The same patterns appear in different regions.
  • For text: at different positions in the document.
• Subsampling the pixels will not change the object.
  • For text: subsampling the words will not change the topic of the document.
So we can use convolution to exploit properties 1 and 2, and pooling to exploit property 3.
We then flatten the features and use them as the input to a fully connected feedforward network.
Convolution
Input (6*6):
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
Filter 1 (3*3):
 1 -1 -1
-1  1 -1
-1 -1  1
Filter 2 (3*3):
-1  1 -1
-1  1 -1
-1  1 -1
Feature map (filter 1, stride 1), first two values: 3 -1
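The sliding-window computation can be reproduced in plain Python with the input and filters from the slide. This is a sketch for square inputs without padding; the function name `conv2d` is an assumption, not a library call.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over a square image (no padding) and sum the products."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    return [
        [
            sum(image[r * stride + a][c * stride + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]

image = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
filter_1 = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]   # responds to a diagonal
filter_2 = [[-1, 1, -1], [-1, 1, -1], [-1, 1, -1]]   # responds to a vertical bar

feature_map_1 = conv2d(image, filter_1)   # 6*6 input -> 4*4 feature map
feature_map_2 = conv2d(image, filter_2)
for row in feature_map_1:
    print(row)
```

Filter 1 gives its largest response (3) wherever the diagonal pattern appears, at the top-left and again at the bottom-left, which is the "same pattern in different regions" property.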
Max pooling
Feature map (4*4, from the 6*6 input and filter 1):
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
After 2*2 max pooling:
3 0
3 1
• Convolution (6*6 to 4*4): some patterns are much smaller than the whole image, and the same patterns appear in different regions.
• Max pooling (4*4 to 2*2): subsampling the pixels will not change the object.
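Max pooling is equally short in plain Python. The sketch below pools the 4*4 feature map that filter 1 produces on the slide's 6*6 input; the function name `max_pool` is an assumption.

```python
def max_pool(feature_map, size=2):
    """Non-overlapping size*size max pooling on a square feature map."""
    n = len(feature_map) // size
    return [
        [
            max(feature_map[r * size + a][c * size + b]
                for a in range(size) for b in range(size))
            for c in range(n)
        ]
        for r in range(n)
    ]

feature_map = [
    [3, -1, -3, -1],
    [-3, 1, 0, -3],
    [-3, -3, 0, 1],
    [3, -2, -2, -1],
]
pooled = max_pool(feature_map)   # 4*4 -> 2*2
print(pooled)
```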
The whole CNN
image → convolution → pooling → convolution → pooling → flatten
(The convolution and pooling steps can repeat many times.)
RNN: Recurrent neural network
The outputs of the hidden layer are stored in memory. The memory can be considered as another input.
[Figure: network with inputs x1, x2 and stored hidden outputs a1, a2.]
Example
Query 1: I will leave Xiamen on March 17th.
Query 2: I will arrive in Xiamen on March 20th.
The values stored in memory differ after ‘leave’ and after ‘arrive’, so the network gives a different result for ‘Xiamen’ in the two queries: place of departure in Query 1 and destination in Query 2.
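The effect of the memory can be shown with a one-unit recurrent layer. This is a toy sketch: the tanh activation, the weights, and the one-number word encodings are all hypothetical, chosen only to show that the same input word gives different outputs after different histories.

```python
import math

def rnn_step(x, memory, w_x, w_h, b):
    """New hidden value from the current input and the stored memory."""
    return math.tanh(w_x * x + w_h * memory + b)

def run(sequence, w_x=1.0, w_h=0.5, b=0.0):
    memory = 0.0                       # memory starts empty
    outputs = []
    for x in sequence:
        memory = rnn_step(x, memory, w_x, w_h, b)   # store the hidden output
        outputs.append(memory)
    return outputs

# Hypothetical one-number encodings of the words
LEAVE, ARRIVE, XIAMEN = 1.0, -1.0, 0.5

out_leave = run([LEAVE, XIAMEN])
out_arrive = run([ARRIVE, XIAMEN])
print(out_leave[1], out_arrive[1])     # same word, different outputs
```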
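LSTM: Long short-term memory
[Figure: LSTM cell with input $z$, input gate signal $z_i$, forget gate signal $z_f$, output gate signal $z_o$, gate activation $f$, activations $g$ and $h$, and memory $c$.]
The memory is updated to $c'$ and the cell outputs $a$:
$$c' = g(z) f(z_i) + c f(z_f)$$
$$a = h(c') f(z_o)$$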
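The two equations can be tried out directly. In the sketch below, the choice of the sigmoid for the gate activation f and of tanh for g and h is an assumption (the slide does not fix them), and the input values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell(z, z_i, z_f, z_o, c, f=sigmoid, g=math.tanh, h=math.tanh):
    """One LSTM step: returns the new memory c' and the output a."""
    c_new = g(z) * f(z_i) + c * f(z_f)   # c' = g(z) f(z_i) + c f(z_f)
    a = h(c_new) * f(z_o)                # a = h(c') f(z_o)
    return c_new, a

# Input gate closed, forget gate open: the old memory is kept
c_keep, a_keep = lstm_cell(z=2.0, z_i=-100.0, z_f=100.0, z_o=100.0, c=0.7)

# Input gate open, forget gate closed: the memory is overwritten by g(z)
c_write, a_write = lstm_cell(z=2.0, z_i=100.0, z_f=-100.0, z_o=100.0, c=0.7)

print(c_keep, c_write)
```

Driving a gate signal far from zero pushes f close to 0 or 1, which is how the cell decides to keep, overwrite, or hide its memory.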
6. TensorFlow
TensorFlow is an open source software library for numerical computation using
data flow graphs.
Nodes in the graph represent mathematical operations, while the graph edges
represent the multidimensional data arrays (tensors) communicated between
them.
The flexible architecture allows you to deploy computation to one or more CPUs
or GPUs in a desktop, server, or mobile device with a single API.
Tensor
[1, 2, 3] # a rank 1 tensor with shape [3]
[[1, 2, 3], [4, 5, 6]] # a rank 2 tensor with shape [2, 3]
[[[1, 2, 3]], [[4, 5, 6]]] # a rank 3 tensor with shape [2, 1, 3]
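Rank and shape can be read off a nested list without TensorFlow. The helper below is a hypothetical illustration, not part of the TensorFlow API; it assumes the nesting is regular.

```python
def shape(tensor):
    """Shape of a regularly nested list; the rank is len(shape(tensor))."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0] if tensor else None   # descend into the first element
    return dims

print(shape([1, 2, 3]))
print(shape([[1, 2, 3], [4, 5, 6]]))
print(shape([[[1, 2, 3]], [[4, 5, 6]]]))
```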
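Node
import tensorflow as tf  # TensorFlow 1.x API
node = tf.constant(3.0, dtype=tf.float32)
print(node)  # prints the node itself, not its value
# => Tensor("Const:0", shape=(), dtype=float32)
sess = tf.Session()
print(sess.run([node]))  # running the graph evaluates the node
# => [3.0]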
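import tensorflow as tf
a = tf.placeholder(tf.float32)  # values are supplied when the graph runs
b = tf.placeholder(tf.float32)
adder_node = a + b  # shorthand for tf.add(a, b)
sess = tf.Session()
print(sess.run(adder_node, {a: 3, b: 4.5}))
# => 7.5
print(sess.run(adder_node, {a: [1, 3], b: [2, 4]}))
# => [ 3.  7.]
[Figure: graph in which placeholders a and b feed adder_node.]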
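W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W * x + b
sess.run(tf.global_variables_initializer())  # variables must be initialized before use
print(sess.run(linear_model, {x: [1, 2, 3, 4]}))
# => approximately [0. 0.3 0.6 0.9]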
y = tf.placeholder(tf.float32)  # target values
loss = tf.reduce_sum(tf.square(linear_model - y))  # sum of squared errors
optimizer = tf.train.GradientDescentOptimizer(0.01)  # learning rate 0.01
train = optimizer.minimize(loss)
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)  # set W and b to their initial values
for i in range(1000):
    sess.run(train, {x: x_train, y: y_train})
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s" % (curr_W, curr_b, curr_loss))
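For readers without TensorFlow 1.x installed, the same training loop can be sketched in plain Python. This is an illustration of what plain gradient descent on this loss does, with hand-derived gradients; it is not how TensorFlow implements the optimizer internally, and the function name `fit_linear` is an assumption.

```python
def fit_linear(x_train, y_train, w=0.3, b=-0.3, lr=0.01, steps=1000):
    """Minimize sum((w*x + b - y)^2) by gradient descent."""
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(x_train, y_train)]
        grad_w = sum(2 * e * x for e, x in zip(errors, x_train))  # d loss / d w
        grad_b = sum(2 * e for e in errors)                       # d loss / d b
        w -= lr * grad_w
        b -= lr * grad_b
    loss = sum((w * x + b - y) ** 2 for x, y in zip(x_train, y_train))
    return w, b, loss

w, b, loss = fit_linear([1, 2, 3, 4], [0, -1, -2, -3])
print("W: %s b: %s loss: %s" % (w, b, loss))
```

The training data lie exactly on the line y = -x + 1, so W should approach -1 and b should approach 1.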
7. Some demos
CNN in image recognition
Input: 28*28 images of handwritten digits (1 28*28 image per example)
1. 5*5 convolution, 4 features, zero padding: 4 28*28 feature maps
2. 2*2 pooling: 4 14*14 feature maps
3. 5*5 convolution, 8 features, zero padding: 8 14*14 feature maps
4. 2*2 pooling: 8 7*7 feature maps
5. Flatten: 8*7*7 input nodes
6. 1024 hidden nodes
7. 10 output nodes, softmax
The accuracy is 99.2%.
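The sizes in this pipeline follow from two rules, which the sketch below applies step by step. It assumes stride-1 convolutions whose zero padding keeps the height and width unchanged, as the listed sizes suggest; the helper names are hypothetical.

```python
def conv_same(shape, filters):
    """Zero-padded, stride-1 convolution: height and width stay the same."""
    height, width, _ = shape
    return (height, width, filters)

def pool(shape, size=2):
    """size*size pooling divides height and width by size."""
    height, width, channels = shape
    return (height // size, width // size, channels)

shape = (28, 28, 1)            # one 28*28 input image
shape = conv_same(shape, 4)    # 4 feature maps of 28*28
shape = pool(shape)            # 4 feature maps of 14*14
shape = conv_same(shape, 8)    # 8 feature maps of 14*14
shape = pool(shape)            # 8 feature maps of 7*7
flat = shape[0] * shape[1] * shape[2]
print(shape, flat)             # number of input nodes of the dense layer
```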
RNN in segmentation
PKU corpus → dictionary index → embedding → 2 RNN layers → dense → {b; m; e; s}
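The output labels {b; m; e; s} are assumed here to follow the usual character tagging scheme for word segmentation: b = begin of a word, m = middle, e = end, s = single-character word. Under that assumption, the conversion between segmented words and tags can be sketched as follows; the example sentence and function names are illustrative only.

```python
def to_bmes(words):
    """One tag per character: b/m/e inside multi-character words, s for singles."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("s")
        else:
            tags.extend(["b"] + ["m"] * (len(word) - 2) + ["e"])
    return tags

def from_bmes(chars, tags):
    """Rebuild the words: a word ends at every 'e' or 's' tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("e", "s"):
            words.append(current)
            current = ""
    if current:                  # tolerate a sequence that ends mid-word
        words.append(current)
    return words

words = ["我", "爱", "北京", "天安门"]
tags = to_bmes(words)
restored = from_bmes("".join(words), tags)
print(tags)
```

With this encoding, segmentation becomes a per-character classification task, which is what the dense output layer predicts.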
8. Other frameworks
Software | Open source | Platform | Written in | Interface | OpenMP support | OpenCL support | CUDA support | Automatic differentiation | Has pretrained models | Recurrent nets | Convolutional nets | RBM/DBNs | Parallel execution (multi node)
Apache Singa | Yes | Linux, Mac OS X, Windows | C++ | Python, C++, Java | No | Yes | Yes | ? | Yes | Yes | Yes | Yes | Yes
Deeplearning4j | Yes | Linux, Mac OS X, Windows, Android (cross-platform) | C, C++ | Java, Scala, Clojure, Python (Keras) | Yes | On roadmap | Yes | Computational graph | Yes | Yes | Yes | Yes | Yes
Dlib | Yes | Cross-platform | C++ | C++ | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes
Keras | Yes | Linux, Mac OS X, Windows | Python | Python | Only if using Theano as backend | Under development for the Theano backend (and on roadmap for the TensorFlow backend) | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Microsoft Cognitive Toolkit (CNTK) | Yes | Windows, Linux (OSX via Docker on roadmap) | C++ | Python, C++, command line, BrainScript (.NET on roadmap) | Yes | No | Yes | Yes | Yes | Yes | Yes | No | Yes
MXNet | Yes | Linux, Mac OS X, Windows, AWS, Android, iOS, JavaScript | Small C++ core library | C++, Python, Julia, Matlab, JavaScript, Go, R, Scala, Perl | Yes | On roadmap | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Neural Designer | No | Linux, Mac OS X, Windows | C++ | Graphical user interface | Yes | No | No | ? | ? | No | No | No | ?
OpenNN | Yes | Cross-platform | C++ | C++ | Yes | No | No | ? | ? | No | No | No | ?
TensorFlow | Yes | Linux, Mac OS X, Windows | C++, Python | Python (C/C++ public API only for executing graphs) | No | On roadmap | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Theano | Yes | Cross-platform | Python | Python | Yes | Under development | Yes | Yes | Through Lasagne's model zoo | Yes | Yes | Yes | Yes
Torch | Yes | Linux, Mac OS X, Windows, Android, iOS | C, Lua | Lua, LuaJIT, C, utility library for C++/OpenCL | Yes | Third party implementations | Yes | Through Twitter's Autograd | Yes | Yes | Yes | Yes | Yes
Wolfram Mathematica | No | Windows, Mac OS X, Linux, cloud computing | C++ | Command line, Java, C++ | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
9. Future work
Next steps
Event extraction; identification of short- and long-term influence.
Switch to another framework?
Keras seems to be a better choice.
References
1. http://ufldl.stanford.edu/tutorial
2. Deep Learning Tutorial, Hung-yi Lee, NTU
3. https://www.tensorflow.org
4. https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software