Neural Networks
Instructor: Yoav Artzi
CS5740: Natural Language Processing, Spring 2017
Slides adapted from Dan Klein, Dan Jurafsky, Chris Manning, Michael Collins, Luke Zettlemoyer, Yejin Choi, and Slav Petrov
Overview
• Introduction to Neural Networks
• Word representations
• NN optimization tricks
Some History
• Neural network algorithms date from the 80’s
  – Originally inspired by early neuroscience
• Historically slow, complex, and unwieldy
• Now: the term is abstract enough to encompass almost any model – but useful!
• Dramatic shift in the last 2-3 years away from MaxEnt (linear, convex) to “neural net” (non-linear) architectures
The “Promise”
• Most ML works well because of human-designed representations and input features
• ML becomes just optimizing weights
• Representation learning attempts to automatically learn good features and representations
• Deep learning attempts to learn multiple levels of representation of increasing complexity/abstraction
(Slide figures: hand-engineered features and resources in Named Entity Recognition, WordNet, and Semantic Role Labeling)
Neuron
• Neural networks come with their own terminological baggage
• Parameters:
  – Weights: w_i and b
  – Activation function
• If we drop the activation function, does this remind you of something?
Biological “Inspiration”
Neural Network
Matrix Notation

$$o = W''(W'a + b') + b''$$

With inputs $a_1, a_2$, hidden-layer weights $W'$, and output-layer weights $W''$:

$$h_1 = W'_{11}a_1 + W'_{12}a_2 + b'_1 \qquad h_2 = W'_{21}a_1 + W'_{22}a_2 + b'_2$$

$$o_1 = W''_{11}h_1 + W''_{12}h_2 + b''_1 \qquad o_2 = W''_{21}h_1 + W''_{22}h_2 + b''_2$$
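A minimal numpy sketch of the 2-2-2 network above (all weight values here are made up for illustration), checking the per-unit equations against the vectorized form:

```python
import numpy as np

a = np.array([1.0, 2.0])                 # inputs a1, a2
W1 = np.array([[0.1, 0.2],
               [0.3, 0.4]])              # W' (input-to-hidden)
b1 = np.array([0.5, 0.6])                # b'
W2 = np.array([[0.7, 0.8],
               [0.9, 1.0]])              # W'' (hidden-to-output)
b2 = np.array([0.1, 0.2])                # b''

# Vectorized form: o = W''(W'a + b') + b''
h = W1 @ a + b1
o = W2 @ h + b2

# Per-unit equations from the slide
h1 = W1[0, 0]*a[0] + W1[0, 1]*a[1] + b1[0]
h2 = W1[1, 0]*a[0] + W1[1, 1]*a[1] + b1[1]
o1 = W2[0, 0]*h1 + W2[0, 1]*h2 + b2[0]
o2 = W2[1, 0]*h1 + W2[1, 1]*h2 + b2[1]
assert np.allclose(o, [o1, o2])
```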
Neuron and Other Models
• A single neuron is a perceptron
• Strong connection to MaxEnt – how?
From MaxEnt to Neural Nets
• Vector form MaxEnt:

$$P(y \mid x; w) = \frac{e^{w^\top \phi(x,y)}}{\sum_{y'} e^{w^\top \phi(x,y')}}$$

• For two classes:

$$P(y_1 \mid x; w) = \frac{e^{w^\top \phi(x,y_1)}}{e^{w^\top \phi(x,y_1)} + e^{w^\top \phi(x,y_2)}} = \frac{e^{w^\top \phi(x,y_1)}}{e^{w^\top \phi(x,y_1)} + e^{w^\top \phi(x,y_2)}} \cdot \frac{e^{-w^\top \phi(x,y_1)}}{e^{-w^\top \phi(x,y_1)}}$$

$$= \frac{1}{1 + e^{w^\top (\phi(x,y_2) - \phi(x,y_1))}} = \frac{1}{1 + e^{-w^\top z}} = f(w^\top z)$$

where $z = \phi(x,y_1) - \phi(x,y_2)$ and $f$ is the logistic function (sigmoid).
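To see the equivalence numerically, a small sketch (weights and feature vectors made up for illustration) comparing the two-class MaxEnt probability with the sigmoid of the score difference:

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])     # made-up weight vector
phi1 = np.array([1.0, 0.0, 1.0])   # phi(x, y1)
phi2 = np.array([0.0, 1.0, 0.5])   # phi(x, y2)

# Two-class MaxEnt probability of y1
p_maxent = np.exp(w @ phi1) / (np.exp(w @ phi1) + np.exp(w @ phi2))

# Sigmoid of the score difference, with z = phi(x,y1) - phi(x,y2)
z = phi1 - phi2
p_sigmoid = 1.0 / (1.0 + np.exp(-w @ z))

assert np.isclose(p_maxent, p_sigmoid)
```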
From MaxEnt to Neural Nets
• Vector form MaxEnt:

$$P(y \mid x; w) = \frac{e^{w^\top \phi(x,y)}}{\sum_{y'} e^{w^\top \phi(x,y')}}$$

• For two classes:

$$P(y_1 \mid x; w) = \frac{1}{1 + e^{-w^\top z}} = f(w^\top z)$$

• Neuron:
  – Add an “always on” feature for the class prior → bias term (b):

$$h_{w,b}(z) = f(w^\top z + b) \qquad f(u) = \frac{1}{1 + e^{-u}}$$
From MaxEnt to Neural Nets
• Vector form MaxEnt:

$$P(y \mid x; w) = \frac{e^{w^\top \phi(x,y)}}{\sum_{y'} e^{w^\top \phi(x,y')}}$$

• Neuron:

$$h_{w,b}(z) = f(w^\top z + b) \qquad f(u) = \frac{1}{1 + e^{-u}}$$

• Neuron parameters: w, b
Neural Net = Several MaxEnt Models
• Feed the input to a number of MaxEnt models → a vector of outputs
• And repeat …
Neural Net = Several MaxEnt Models
• But: how do we tell the hidden layer what to do?
  – Learning will figure it out
How to Train?
• No hidden layer:
  – Supervised
  – Just like MaxEnt
• With hidden layers:
  – Latent units → not convex
  – What do we do?
    • Back-propagate the gradient (a sketch follows)
    • About the same, but no guarantees
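A minimal sketch of back-propagation for one hidden layer. The sigmoid activations, squared-error loss, and all values are choices made here for brevity; the slide does not fix them:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Tiny 2-3-1 network with made-up input and random initial weights
rng = np.random.default_rng(0)
x, y = np.array([1.0, -2.0]), 1.0
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
w2, b2 = rng.normal(size=3), 0.0

# Forward pass
h = sigmoid(W1 @ x + b1)
y_hat = sigmoid(w2 @ h + b2)
loss = 0.5 * (y_hat - y) ** 2

# Backward pass: propagate the error signal layer by layer
delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # dL/d(output pre-activation)
grad_w2, grad_b2 = delta2 * h, delta2
delta1 = delta2 * w2 * h * (1 - h)           # chain rule through the hidden layer
grad_W1, grad_b1 = np.outer(delta1, x), delta1

# One gradient-descent step
lr = 0.1
W1 -= lr * grad_W1; b1 -= lr * grad_b1
w2 -= lr * grad_w2; b2 -= lr * grad_b2
```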
Probabilistic Output from Neural Nets
• What if we want the output to be a probability distribution over possible outputs?
• Normalize the output activations using softmax:

$$y = \text{softmax}(W \cdot z + b) \qquad \text{softmax}(q)_i = \frac{e^{q_i}}{\sum_{j=1}^{k} e^{q_j}}$$

  – Where $q$ is the output layer
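A numpy sketch of softmax; subtracting the max before exponentiating is a standard numerical-stability trick added here, not part of the slide's formula (mathematically it cancels out):

```python
import numpy as np

def softmax(q):
    # Stability: shift by max(q) so the exponentials cannot overflow
    e = np.exp(q - np.max(q))
    return e / e.sum()

q = np.array([2.0, 1.0, 0.1])    # made-up output activations
y = softmax(q)
assert np.isclose(y.sum(), 1.0)  # a valid probability distribution
```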
Word Representations
• So far, atomic symbols:
  – “hotel”, “conference”, “walking”, “___ing”
• But neural networks take vector input
• How can we bridge the gap?
• One-hot vectors
  – Dimensionality: size of vocabulary
    • 20K for speech
    • 500K for broad-coverage domains
    • 13M for Google corpora

hotel      = [0 0 0 0 … 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
conference = [0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
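A sketch of one-hot construction over a toy three-word vocabulary (real vocabularies run 20K to 13M entries, as above):

```python
import numpy as np

vocab = {"hotel": 0, "conference": 1, "walking": 2}  # toy vocabulary

def one_hot(word, vocab):
    # A vector of vocabulary size with a single 1 at the word's index
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

print(one_hot("hotel", vocab))  # [1. 0. 0.]
```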
Word Representations
• One-hot vectors:
  – Problems?
  – Information sharing? Every pair of one-hot vectors is orthogonal, so “hotel” and “hotels” look as unrelated as any other pair of words

hotel      = [0 0 0 0 … 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
conference = [0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
hotels     = [0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
Word Embeddings
• Each word is represented using a dense, low-dimensional vector
  – Low-dimensional: dimension << vocabulary size
• If trained well, similar words will have similar vectors (see the sketch below)
• How to train? What objective to maximize?
  – Soon …
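A sketch of “similar words have similar vectors” using cosine similarity over made-up 4-dimensional embeddings (trained embeddings are typically tens to hundreds of dimensions):

```python
import numpy as np

# Toy embeddings, invented for illustration only
emb = {
    "hotel":      np.array([0.8, 0.1, 0.0, 0.3]),
    "hotels":     np.array([0.7, 0.2, 0.1, 0.3]),
    "conference": np.array([0.0, 0.9, 0.4, 0.1]),
}

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Unlike one-hot vectors, related words can score high:
print(cosine(emb["hotel"], emb["hotels"]))      # high
print(cosine(emb["hotel"], emb["conference"]))  # lower
```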
Word Embeddings as Features
• Example: sentiment classification
  – very positive, positive, neutral, negative, very negative
• Feature-based models: bag of words
• Any good neural net architecture?
  – Concatenate all the vectors
    • Problem: different documents → different lengths
  – Instead: sum, average, etc. (a sketch follows)
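A sketch of the averaging option: any document length maps to one fixed-size vector (the toy 2-dimensional embeddings are made up):

```python
import numpy as np

def doc_vector(tokens, emb):
    # Average the word vectors: documents of any length become one
    # fixed-size input, unlike concatenation
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

emb = {"great": np.array([0.9, 0.1]), "hotel": np.array([0.2, 0.8])}
x = doc_vector(["great", "hotel"], emb)  # fixed-length feature vector
```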
Neural Bag-of-Words / Deep Averaging Networks
[Iyyer et al. 2015; Wang and Manning 2012]

IMDB sentiment analysis:

| Model                       | Accuracy |
|-----------------------------|----------|
| BOW + fancy smoothing + SVM | 88.23    |
| NBOW + DAN                  | 89.4     |
Practical Tips
• Select a network structure appropriate for the problem
  – Window vs. recurrent vs. recursive
  – Non-linearity function
• Gradient checks to identify bugs (see the sketch after this list)
• Parameter initialization
• Is the model powerful enough?
  – If not, make it larger
  – If yes, regularize, otherwise it will overfit
• Know your non-linearity function and its gradient
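A sketch of a finite-difference gradient check: compare the analytic gradient with centered differences on a toy function (here f(w) = ||w||², whose gradient is 2w; both functions are chosen just for this example):

```python
import numpy as np

def grad_check(f, grad_f, w, eps=1e-5):
    """Max deviation between analytic gradient and centered differences."""
    num = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        num[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return np.max(np.abs(num - grad_f(w)))

w = np.array([1.0, -2.0, 3.0])
assert grad_check(lambda w: w @ w, lambda w: 2 * w, w) < 1e-6
```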
Avoiding Overfitting
• Reduce model size (but not too much)
• L1 and L2 regularization
• Early stopping (e.g., patience)
• Dropout (Hinton et al. 2012)
  – Randomly set 50% of the inputs to each layer to 0 (a sketch follows)
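A sketch of a dropout layer. This is the “inverted dropout” variant commonly used in practice (rescaling at training time), which differs slightly from the 2012 formulation:

```python
import numpy as np

def dropout(h, p=0.5, train=True):
    # Inverted dropout: zero each unit with probability p during
    # training and rescale the survivors by 1/(1-p), so the expected
    # activation matches test time, when no mask is applied
    if not train:
        return h
    mask = (np.random.rand(*h.shape) >= p) / (1.0 - p)
    return h * mask

h = np.ones(8)
print(dropout(h))               # roughly half the units zeroed, rest scaled to 2.0
print(dropout(h, train=False))  # identity at test time
```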