Statistical Natural Language Processing
Recurrent and convolutional networks

Çağrı Çöltekin

University of Tübingen
Seminar für Sprachwissenschaft

Summer Semester 2020


Deep neural networks

(Figure: a deep feed-forward network with inputs x1 … xm and several hidden layers)

• Deep neural networks (>2 hidden layers) have recently been successful in many tasks

• They often use sparse connectivity and shared weights

• We will focus on two important architectures: recurrent and convolutional networks


Why deep networks?

• We saw that a feed-forward network with a single hidden layer is a universal approximator

• However, this is a theoretical result – it is not clear how many units one may need for the approximation

• Successive layers may learn different representations

• Deeper architectures have been found to be useful in many tasks


Why now?

• Increased computational power, especially advances in graphics processing unit (GPU) hardware

• Availability of large amounts of data
  – mainly unlabeled data (more on this later)
  – but also labeled data through 'crowdsourcing' and other sources

• Some new developments in theory and applications


Recurrent neural networks

• Feed-forward networks
  – can only learn associations
  – have no memory of earlier inputs: they cannot handle sequences

• Recurrent neural networks are the ANN solution for sequence learning

• This is achieved by recurrent loops in the network


Recurrent neural networks

(Figure: a recurrent network; inputs x1–x4 feed hidden units h1–h4, which have recurrent connections among themselves and feed the output y)

• Recurrent neural networks are similar to standard feed-forward networks

• They include loops that use the previous output of the hidden layers as well as the input

• Forward calculation is straightforward, but learning becomes somewhat tricky
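As a minimal sketch (not part of the original slides), the forward pass of such a network can be written in a few lines of numpy; the names W_in, W_rec, and W_out are placeholder weight matrices for the input-to-hidden, hidden-to-hidden (recurrent), and hidden-to-output connections:

import numpy as np

def rnn_forward(xs, W_in, W_rec, W_out, b_h, b_y):
    # xs: sequence of input vectors; returns the outputs and the final hidden state
    h = np.zeros(W_rec.shape[0])                # initial hidden state
    ys = []
    for x in xs:
        # the new hidden state depends on the current input and the previous state
        h = np.tanh(W_in @ x + W_rec @ h + b_h)
        ys.append(W_out @ h + b_y)              # output at this time step
    return np.array(ys), h

For a many-to-one task, only the final output (or the final hidden state h) would be used.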


A simple version: SRNs (Elman 1990)

(Figure: an SRN; the input units and the context units feed the hidden units, the hidden units feed the output units, and the hidden state is copied back to the context units at each step)

• The network keeps the previous hidden states (context units)

• The rest is just like a feed-forward network

• Training is simple, but the network cannot learn long-distance dependencies


Processing sequences with RNNs

• RNNs process sequences one unit at a time

• The earlier inputs affect the output through the recurrent links

(Figure: the words of the input 'not really worth seeing' are fed to the network one at a time; each word updates the hidden states h1–h4, which determine the output y)


Learning in recurrent networks

(Figure: a recurrent network where W0 connects the input x to the hidden layer h, W1 is the recurrent hidden-to-hidden connection, and W2 connects the hidden layer to the output y)

• We need to learn three sets of weights: W0, W1 and W2

• Backpropagation in RNNs is at first not that obvious

• The main difficulty is in propagating the error through the recurrent connections


Unrolling a recurrent network: backpropagation through time (BPTT)

(Figure: the network unrolled over time; each input x(0), x(1), …, x(t) updates a hidden state h(0), h(1), …, h(t), each hidden state feeds the next one, and each produces an output y(0), y(1), …, y(t))

Note: the unrolled copies of the network share the same weights at every time step.
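A minimal illustration (not from the slides) of backpropagation through time for a scalar RNN, assuming a single squared-error loss at the final step; it shows how the gradient for the shared recurrent weight w accumulates contributions from every time step:

import numpy as np

def bptt_scalar(xs, w, u, target):
    # forward pass of h_t = tanh(w * h_{t-1} + u * x_t), storing all hidden states
    hs = [0.0]
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    # backward pass: loss = 0.5 * (h_T - target)^2
    delta = hs[-1] - target                # dLoss/dh_T
    grad_w = 0.0
    for t in range(len(xs), 0, -1):
        delta *= 1.0 - hs[t] ** 2          # back through the tanh at step t
        grad_w += delta * hs[t - 1]        # contribution of step t to dLoss/dw
        delta *= w                         # propagate the error to h_{t-1}
    return grad_w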


Unstable gradients

• A common problem in deep networks is unstable gradients

• The partial derivatives with respect to weights in the early layers are calculated using the chain rule

• A long chain of multiplications may result in
  – vanishing gradients if the values are in the range (−1, 1)
  – exploding gradients if the absolute values are larger than 1

• A practical solution for exploding gradients is called gradient clipping

• The solution to vanishing gradients is more involved (coming soon)
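A brief numpy sketch of both phenomena and the clipping remedy (an illustration, not the lecturer's code; max_norm is an arbitrary threshold chosen here for demonstration):

import numpy as np

# for a linear recurrent unit h_t = w * h_{t-1}, the gradient over T steps scales as w**T
for w in (0.5, 1.5):
    print(w, [w ** T for T in (1, 10, 50)])   # vanishes for |w| < 1, explodes for |w| > 1

def clip_gradient(grad, max_norm=5.0):
    # rescale the gradient vector if its norm exceeds max_norm (norm-based gradient clipping)
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)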


RNN architectures: many-to-many (e.g., POS tagging)

(Figure: the unrolled network produces an output y(0), y(1), …, y(t) for every input x(0), x(1), …, x(t))


RNN architectures: many-to-one (e.g., document classification)

(Figure: the unrolled network reads the whole input sequence x(0), x(1), …, x(t) and produces a single output y(t) at the end)


RNN architectures: many-to-many with a delay (e.g., machine translation)

(Figure: the unrolled network reads the input sequence x(0), x(1), …, x(t), and the outputs y(t−1), y(t), … are produced only after a delay)


Bidirectional RNNs

(Figure: a bidirectional RNN; a set of forward states processes the input x(t−1), x(t), x(t+1) from left to right, a set of backward states processes it from right to left, and both contribute to each output y(t−1), y(t), y(t+1))
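A rough numpy sketch of the idea (an illustration with assumed weight names, not from the slides): run one recurrent pass left to right and another right to left, then concatenate the two hidden states at each position:

import numpy as np

def birnn_forward(xs, W_in_f, W_rec_f, W_in_b, W_rec_b):
    # forward pass over the sequence
    h_f, fwd = np.zeros(W_rec_f.shape[0]), []
    for x in xs:
        h_f = np.tanh(W_in_f @ x + W_rec_f @ h_f)
        fwd.append(h_f)
    # backward pass over the reversed sequence
    h_b, bwd = np.zeros(W_rec_b.shape[0]), []
    for x in reversed(xs):
        h_b = np.tanh(W_in_b @ x + W_rec_b @ h_b)
        bwd.append(h_b)
    bwd.reverse()
    # each position gets the concatenation of its forward and backward states
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]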


Unstable gradients revisited

• We noted earlier that the gradients may vanish or explode during backpropagation in deep networks

• This is especially problematic for RNNs, since the effective depth of the network can be extremely large

• Although RNNs can theoretically learn long-distance dependencies, this is affected by the unstable-gradient problem

• The most popular solution is to use gated recurrent networks


Gated recurrent networks

(Figure: an LSTM cell; the previous cell state c(t−1) and hidden state h(t−1) are combined with the input x(t) through the forget, input, and output gates (σf, σi, σo) and a tanh candidate to produce the new cell state c(t) and hidden state h(t))

• Most modern RNN architectures are ‘gated’

• The main idea is learning a mask that controls what to remember (or forget) from the previous hidden layers

• Two popular architectures are
  – long short-term memory (LSTM) networks (above)
  – gated recurrent units (GRU)
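A single LSTM step in numpy as a rough sketch, following the standard gate equations rather than any code from the lecture (the weight names are assumptions, and biases are omitted for brevity):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_o, W_c):
    # all gates look at the previous hidden state and the current input
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z)              # forget gate: what to erase from the cell state
    i = sigmoid(W_i @ z)              # input gate: what to write to the cell state
    o = sigmoid(W_o @ z)              # output gate: what to expose as the hidden state
    c_tilde = np.tanh(W_c @ z)        # candidate cell content
    c = f * c_prev + i * c_tilde      # new cell state
    h = o * np.tanh(c)                # new hidden state
    return h, c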


Convolutional networks

• Convolutional networks are particularly popular in image processing applications

• They have also been used with success in some NLP tasks

• Unlike the feed-forward networks we have discussed so far,
  – CNNs are not fully connected
  – the hidden layer(s) receive input from only a set of neighboring units
  – some weights are shared

• A CNN learns features that are location invariant

• CNNs are also computationally less expensive compared to fully connected networks


Convolution in image processing

• Convolution is a common operation in image processing for effects like edge detection, blurring, sharpening, …

• The idea is to transform each pixel with a function of the local neighborhood

(Figure: a filter W slides over the input X; each output value in Y is the weighted sum of the input values under the filter, y = Σ_i w_i x_i)
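A minimal numpy sketch of this operation (strictly speaking a cross-correlation, as commonly used in neural networks; an illustration only):

import numpy as np

def conv2d(X, W):
    # slide the filter W over the input X and take the weighted sum at every position
    kh, kw = W.shape
    out_h, out_w = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    Y = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            Y[i, j] = np.sum(W * X[i:i + kh, j:j + kw])
    return Y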


Example convolutions

• Blurring (filter scaled by 1/16):
    1 2 1
    2 4 2
    1 2 1

• Edge detection:
    −1 −1 −1
    −1  8 −1
    −1 −1 −1
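These two filters can be tried out directly, for instance with scipy (assuming scipy is available; the image here is just random data standing in for a real grayscale image):

import numpy as np
from scipy.signal import convolve2d

blur = np.array([[1, 2, 1],
                 [2, 4, 2],
                 [1, 2, 1]]) / 16.0
edge = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]])

image = np.random.rand(64, 64)                    # stand-in for a real grayscale image
blurred = convolve2d(image, blur, mode='same')    # smoothed version of the image
edges = convolve2d(image, edge, mode='same')      # highlights local intensity changes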


Learning convolutions

• Some filters produce features that are useful for classification (e.g., of images or sentences)

• In machine learning we want to learn the convolutions

• Typically, we learn multiple convolutions, each resulting in a different feature map

• Repeated application of convolutions allows learning higher-level features

• The last layer is typically a standard fully-connected classifier


Convolution in neural networks

(Figure: a one-dimensional convolution; each hidden unit h2, h3, h4 is connected to a window of three neighboring inputs (x1–x3, x2–x4, x3–x5) using the same shared weights w−1, w0, w1)

• Each hidden unit corresponds to a local window in the input

• Weights are shared: each convolution detects the same type of features
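A short numpy sketch of such a one-dimensional convolution over a sequence (an illustration with assumed names, not the lecturer's code):

import numpy as np

def conv1d(xs, w):
    # the same weight vector w is applied to every window of len(w) neighboring inputs
    k = len(w)
    return np.array([np.dot(w, xs[i:i + k]) for i in range(len(xs) - k + 1)])

xs = np.array([1.0, 3.0, 2.0, 5.0, 2.0])
w = np.array([-1.0, 0.0, 1.0])        # an arbitrary filter of width 3
print(conv1d(xs, w))                  # one value per window: [1. 2. 0.]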


Pooling

(Figure: a convolution layer maps the inputs x1–x5 to h1–h5, and a pooling layer combines neighboring convolution outputs into h′1–h′3)

• Convolution is combined with pooling

• The pooling ‘layer’ simply calculates a statistic (e.g., max) over the convolution layer

• Location invariance comes from pooling
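A max-pooling sketch in numpy to go with the convolution above (again a minimal, assumed illustration):

import numpy as np

def max_pool1d(hs, size=3, stride=1):
    # each pooled value is the maximum over a window of convolution outputs
    return np.array([hs[i:i + size].max() for i in range(0, len(hs) - size + 1, stride)])

print(max_pool1d(np.array([1, 3, 2, 5, 2])))   # -> [3 5 5], as in the example that follows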


Pooling and location invariance

(Example: max pooling with a window of 3 over the convolution outputs; as the input pattern shifts, the convolution outputs shift too, but the pooled values stay largely stable:
  convolution: 1 3 2 5 2 → max pooling: 3 5 5
  convolution: 2 1 3 2 5 → max pooling: 3 3 5
  convolution: 4 2 1 3 2 → max pooling: 4 3 3)

• Note that the numbers at the pooling layer are stable in comparison to the convolution layer


Padding in CNNs

• With successive layers of convolution and pooling, the later layers shrink

• One way to avoid this is padding the input and hidden layers with enough zeros
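A small sketch of such zero padding in numpy (an illustration; padding by k//2 on each side keeps the output of a width-k convolution as long as the input):

import numpy as np

xs = np.array([1.0, 3.0, 2.0, 5.0, 2.0])
k = 3                                   # filter width
padded = np.pad(xs, k // 2)             # zeros on both sides: [0. 1. 3. 2. 5. 2. 0.]
# a width-3 convolution over `padded` now yields len(xs) outputs instead of len(xs) - 2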


CNNs: the bigger picture

• At each convolution/pooling step, we often want to learn multiple feature maps

• After a (long) chain of hierarchical feature maps, the final layer is typically a fully-connected layer (e.g., softmax for classification)

(Figure: a typical CNN stacks convolution and pooling layers, followed by a fully connected classifier that produces the output)


Real-world examples are complex

(Figure: the GoogLeNet architecture, a long chain of convolution, max-pooling, local response normalization, depth-concatenation, and average-pooling layers with extensive branching and three softmax classifier outputs)

Real-world ANNs tend to be complex:
• Many layers (sometimes with repetition)
• A large amount of branching

* Diagram describes an image classification network, GoogLeNet (Szegedy et al. 2014).


CNNs in natural language processing

• The use of CNNs in image applications is rather intuitive
  – the first convolutional layer learns local features, e.g., edges
  – successive layers learn more complex features that are combinations of these features

• In NLP, it is a bit less straightforward
  – CNNs are typically used in combination with word vectors
  – convolutions of different sizes correspond to (word) n-grams of different sizes
  – pooling picks important ‘n-grams’ as features for classification
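A condensed numpy sketch of this idea (hypothetical names and randomly initialized filters, for illustration only): convolutions of width 2 and 3 slide over the word vectors, and max-over-time pooling keeps the strongest response of each filter as a feature for a classifier:

import numpy as np

def ngram_features(word_vectors, filters):
    # word_vectors: (sentence_length, dim); filters: list of (width, dim) arrays
    feats = []
    for W in filters:
        k = W.shape[0]
        # one convolution value per n-gram window (an "n-gram detector")
        conv = np.array([np.sum(W * word_vectors[i:i + k])
                         for i in range(len(word_vectors) - k + 1)])
        feats.append(conv.max())          # max-over-time pooling: keep the strongest n-gram
    return np.array(feats)                # fixed-size input for a downstream classifier

dim = 4
sentence = np.random.rand(5, dim)                           # stand-in vectors for 5 words
filters = [np.random.rand(2, dim), np.random.rand(3, dim)]  # a bigram and a trigram filter
print(ngram_features(sentence, filters))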


An example: sentiment analysis

(Figure: the input 'not really worth seeing' is mapped to word vectors; convolutions produce feature maps, pooling turns them into a fixed set of features, and a classifier predicts the sentiment)


Summary

• Deep networks use more than one hidden layer

• Common (deep) ANN architectures include:
  CNN  shared feed-forward weights, location invariance
  RNN  sequence learning

Next:
• N-gram language models
