Deep Learning, an interactive introduction for NLP-ers

@graphific Roelof Pieters. Introduction to Deep Learning for NLP. 22 January 2015, Stockholm Natural Language Processing Meetup, Feeda. Slides at: http://www.slideshare.net/roelofp/220115dlmeetup

1


Page 2: Deep Learning, an interactive introduction for NLP-ers

Deep Learning ???

2

Page 3: Deep Learning, an interactive introduction for NLP-ers

A couple of headlines… [all November ’14]

3

Page 4: Deep Learning, an interactive introduction for NLP-ers

(source: Google Trends)

4

Page 5: Deep Learning, an interactive introduction for NLP-ers

Machine Learning ??

- Audience Check -

5

Page 6: Deep Learning, an interactive introduction for NLP-ers

• “Brain” inspired / simulations:

• vision: make learning algorithms better and easier to use

• goal: revolutions in (practical) advances for machine learning and AI

• Deep Learning = subfield of Machine Learning

Deep Learning ??

6

Page 7: Deep Learning, an interactive introduction for NLP-ers

Biological Inspiration

7

Page 8: Deep Learning, an interactive introduction for NLP-ers

Deep Learning ??

8

Page 9: Deep Learning, an interactive introduction for NLP-ers

DL: Impact

9

Speech Recognition

Page 10: Deep Learning, an interactive introduction for NLP-ers

DL: Impact

10

Deep Learning for the win! A few examples:

• IJCNN 2011 Traffic Sign Recognition Competition
• ISBI 2012 Segmentation of neuronal structures in EM stacks challenge
• ICDAR 2011 Chinese handwriting recognition

Page 11: Deep Learning, an interactive introduction for NLP-ers

• Deals with “construction and study of systems that can learn from data”

Machine Learning ??

A computer program is said to learn from experience (E) with respect to some class of tasks (T) and performance measure (P), if its performance at tasks in T, as measured by P, improves with experience E

— T. Mitchell 1997

11

Page 12: Deep Learning, an interactive introduction for NLP-ers

Machine Learning ??

Traditional Programming:

Data + Program → Output

Machine Learning:

Data + Output → Program

12

Page 13: Deep Learning, an interactive introduction for NLP-ers

Supervised (inductive) learning

• Training data includes desired outputs

Unsupervised learning

• Training data does not include desired outputs

Semi-supervised learning

• Training data includes a few desired outputs

Reinforcement learning

• Rewards from sequence of actions

Types of Learning

13

Page 14: Deep Learning, an interactive introduction for NLP-ers

ML: Traditional Approach

1. Gather as much LABELED data as you can get

2. Throw some algorithms at it (often just an SVM, and leave it at that)

3. If you actually tried more algorithms: pick the best

4. Spend hours hand-engineering features / feature selection / dimensionality reduction (PCA, SVD, etc.)

5. Repeat…

For each new problem/question (a code sketch of this workflow follows below):

14
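A minimal sketch of this workflow (not from the talk), using scikit-learn as an assumed stand-in; the digits dataset, the PCA size, and the two candidate models are illustrative choices for "throw some algorithms at it and pick the best":

```python
# A minimal sketch of the "traditional" ML workflow described above,
# using scikit-learn (assumed available); the dataset and pipeline
# choices are illustrative, not from the talk.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)                       # 1. labeled data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

candidates = {                                             # 2./3. throw algorithms at it
    "svm": make_pipeline(StandardScaler(), PCA(n_components=30), SVC()),
    "logreg": make_pipeline(StandardScaler(), PCA(n_components=30),
                            LogisticRegression(max_iter=1000)),
}
for name, model in candidates.items():                     # 4. hand-tuned features / PCA
    model.fit(X_tr, y_tr)
    print(name, model.score(X_te, y_te))                   # pick the best, then repeat…
```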

Page 15: Deep Learning, an interactive introduction for NLP-ers

Machine Learning for NLP

Data

Classic Approach: Data is fed into a learning algorithm:

Learning Algorithm

15

Page 16: Deep Learning, an interactive introduction for NLP-ers

Machine Learning for NLP

some of the (many) treebank datasets

source: http://www-nlp.stanford.edu/links/statnlp.html#Treebanks

!

16

Page 17: Deep Learning, an interactive introduction for NLP-ers

Penn Treebank

That’s a lot of “manual” work:

17

Page 18: Deep Learning, an interactive introduction for NLP-ers

• the students went to class

DT NN VB P NN

• plays well with others

VB ADV P NN

NN NN P DT

• fruit flies like a banana

NN NN VB DT NN

NN VB P DT NN

NN NN P DT NN

NN VB VB DT NN

With a lot of issues:

Penn Treebank

18
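As a hedged illustration of how such taggers behave in practice (NLTK is an assumption here, not something the slides use, and it needs its 'punkt' and tagger resources downloaded), a statistical tagger is forced to commit to one tag sequence even when several readings are plausible:

```python
# Illustration of POS ambiguity with NLTK (assumed installed, with the
# 'punkt' and 'averaged_perceptron_tagger' resources downloaded); the
# tagger picks one analysis even though several are plausible.
import nltk

for sent in ["the students went to class",
             "plays well with others",
             "fruit flies like a banana"]:
    tokens = nltk.word_tokenize(sent)
    print(nltk.pos_tag(tokens))
```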

Page 19: Deep Learning, an interactive introduction for NLP-ers

Machine Learning for NLP

train set: Data → “Features” → Learning Algorithm

test set: Data → “Features” → Prediction/Classifier → Prediction

19

Page 20: Deep Learning, an interactive introduction for NLP-ers

Machine Learning for NLP

Learning Algorithm

“Features”

Prediction/Classifier → Prediction

train set

test set

20
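A small sketch of the pipeline in these two slides, with scikit-learn as an assumed stand-in: hand-specified bag-of-words “features” feed a learning algorithm on the train set, and the resulting classifier makes predictions on the test set. The texts and labels below are made up for illustration:

```python
# A minimal sketch of the feature-based pipeline above (not code from the
# talk): bag-of-words "features" feed a learning algorithm, which yields a
# classifier we then run on held-out test examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["great movie", "loved it", "terrible plot", "boring and bad"]
train_labels = [1, 1, 0, 0]              # toy sentiment labels (illustrative)
test_texts = ["a great plot", "bad movie"]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)       # train set -> learning algorithm
print(clf.predict(test_texts))           # test set -> prediction/classifier
```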

Page 21: Deep Learning, an interactive introduction for NLP-ers

Machine Learning for NLP

• Until the early 1990’s, NLP systems were built manually with hand-crafted dictionaries and rules.

• As large electronic text corpora became increasingly available, researchers began using machine learning techniques to automatically build NLP systems.

• Today, the vast majority of NLP systems use machine learning.

21

Page 22: Deep Learning, an interactive introduction for NLP-ers

2. Neural Networks and a short history lesson

22

Page 23: Deep Learning, an interactive introduction for NLP-ers

Perceptron (1957)

Frank Rosenblatt (1928-1971)

Original Perceptron

Simplified model:

(From Perceptrons by M. L. Minsky and S. Papert, 1969, Cambridge, MA: MIT Press. Copyright 1969 by MIT Press.)

23

Page 24: Deep Learning, an interactive introduction for NLP-ers

Perceptron (1957)

Perceptron Research, youtube clip: https://www.youtube.com/watch?v=cNxadbrN_aI&feature=youtu.be&t=12

24

Page 25: Deep Learning, an interactive introduction for NLP-ers

Perceptron (1957)

25

Page 26: Deep Learning, an interactive introduction for NLP-ers

or

Multilayer Perceptron (1986)

inputs

weights

bias

activation

26

Page 27: Deep Learning, an interactive introduction for NLP-ers

Neuron Model

All you need to know:

27

Page 28: Deep Learning, an interactive introduction for NLP-ers

Activation functions

28
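To make the neuron model and the activation functions concrete, here is a small NumPy sketch (the input, weight, and bias values are made up): the neuron computes a weighted sum of its inputs plus a bias, then passes it through an activation function such as sigmoid, tanh, or ReLU:

```python
# A small NumPy sketch of the neuron model and a few common activation
# functions (sigmoid, tanh, ReLU); values are illustrative, not from the slides.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.5, -1.0, 2.0])          # inputs
w = np.array([0.1, 0.4, -0.3])          # weights
b = 0.2                                 # bias

z = w @ x + b                           # weighted sum of inputs plus bias
print(sigmoid(z), np.tanh(z), relu(z))  # activation
```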

Page 29: Deep Learning, an interactive introduction for NLP-ers

Backpropagation (1974/1986)

1974: Paul Werbos invents the backpropagation algorithm for neural networks
1986: Backprop popularized by Rumelhart, Hinton, and Williams
1990s: Renewed interest in neural networks

29

Page 30: Deep Learning, an interactive introduction for NLP-ers

Backprop Renaissance

Forward Propagation

• Sum inputs, produce activation, feed-forward

30

Page 31: Deep Learning, an interactive introduction for NLP-ers

Backprop Renaissance

Back Propagation (of error)

• Calculate total error at the top

• Calculate contributions to error at each step going backwards

31

Page 32: Deep Learning, an interactive introduction for NLP-ers

• Compute gradient of example-wise loss wrt parameters

• Simply applying the derivative chain rule wisely

• If computing the loss (example, parameters) is O(n) computation, then so is computing the gradient

Backpropagation

32

Page 33: Deep Learning, an interactive introduction for NLP-ers

Simple Chain Rule

33
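A worked example of that chain rule, as a sketch (the tiny one-neuron loss and all numbers are assumptions for illustration): differentiate the loss step by step and compare against a finite-difference estimate:

```python
# Worked chain-rule example behind backprop (a sketch, not the talk's code):
# loss = (sigmoid(w*x + b) - y)^2, differentiated step by step and checked
# against a finite-difference estimate.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.5, 1.0
w, b = 0.3, -0.1

z = w * x + b
a = sigmoid(z)
loss = (a - y) ** 2

# chain rule: dloss/dw = dloss/da * da/dz * dz/dw
dloss_da = 2 * (a - y)
da_dz = a * (1 - a)
dz_dw = x
grad_w = dloss_da * da_dz * dz_dw

# numerical check with a small perturbation of w
eps = 1e-6
loss_eps = (sigmoid((w + eps) * x + b) - y) ** 2
print(grad_w, (loss_eps - loss) / eps)   # the two numbers should roughly agree
```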

Page 34: Deep Learning, an interactive introduction for NLP-ers

Training procedure

• Initialize randomly
• Sequentially give it data
• See what the difference is between network output and actual output
• Update the weights according to this error
• End result: give the model input, and it produces a proper output

Quest for the weights. The weights are the model! (A code sketch of this loop follows below.)

To reiterate:

34
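The loop sketched here (purely illustrative: a single linear neuron on made-up data) follows the recipe above: initialize randomly, feed examples one at a time, measure the error, and nudge the weights; at the end, the learned weights are the model:

```python
# A minimal sketch of the training procedure above: random init, feed
# examples one at a time, compare output with the target, nudge the weights.
# Single linear neuron and made-up data, purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w                          # targets for a toy regression problem

w = rng.normal(size=3)                  # initialize randomly
lr = 0.05
for epoch in range(20):
    for xi, yi in zip(X, y):            # sequentially give it data
        error = (w @ xi) - yi           # difference between output and target
        w -= lr * error * xi            # update the weights according to the error

print(w)                                # should approach [2.0, -1.0, 0.5]
```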

Page 35: Deep Learning, an interactive introduction for NLP-ers

So why only now?

• Inspired by the architectural depth of the brain, researchers had wanted for decades to train deep multi-layer neural networks.

• No successful attempts were reported before 2006… Exception: convolutional neural networks (LeCun, 1998).

• SVM: Vapnik and his co-workers developed the Support Vector Machine (1993), a shallow architecture.

• Breakthrough in 2006!

35

Page 36: Deep Learning, an interactive introduction for NLP-ers

2006 Breakthrough

• More data

• Faster hardware: GPU’s, multi-core CPU’s

• Working ideas on how to train deep architectures

36

Page 37: Deep Learning, an interactive introduction for NLP-ers

2006 Breakthrough

• More data

• Faster hardware: GPU’s, multi-core CPU’s

• Working ideas on how to train deep architectures

37

Page 38: Deep Learning, an interactive introduction for NLP-ers

2006 Breakthrough

38

Page 39: Deep Learning, an interactive introduction for NLP-ers

2006 Breakthrough

• More data

• Faster hardware: GPU’s, multi-core CPU’s

• Working ideas on how to train deep architectures

39

Page 40: Deep Learning, an interactive introduction for NLP-ers

2006 Breakthrough

40

Page 41: Deep Learning, an interactive introduction for NLP-ers

2006 Breakthrough

• More data

• Faster hardware: GPU’s, multi-core CPU’s

• Working ideas on how to train deep architectures

41

Page 42: Deep Learning, an interactive introduction for NLP-ers

2006 Breakthrough

Stacked Restricted Boltzmann Machines* (RBM): Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554.

Stacked Autoencoders (AE): Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H. (2007). Greedy Layer-Wise Training of Deep Networks. Advances in Neural Information Processing Systems 19.

* called Deep Belief Networks (DBN)

42

Page 43: Deep Learning, an interactive introduction for NLP-ers

3. Deep Learning onwards we go…

43

Page 44: Deep Learning, an interactive introduction for NLP-ers

44

Page 45: Deep Learning, an interactive introduction for NLP-ers

Hierarchies

Efficient

Generalization

Distributed

Sharing

Unsupervised*

Black Box

Training Time

Major PWNAGE!

Much Data

Why go Deep ?

45

Page 46: Deep Learning, an interactive introduction for NLP-ers

No More Handcrafted Features !

46

Page 47: Deep Learning, an interactive introduction for NLP-ers

“I’ve worked all my life in Machine Learning, and I’ve never seen one algorithm knock over benchmarks like Deep Learning.”

— Andrew Ng

Deep Learning: Why?

47

Page 48: Deep Learning, an interactive introduction for NLP-ers

Biological Justification

Deep Learning = Brain “inspired”
Audio/Visual Cortex has multiple stages == Hierarchical

• Computational Biology • CVAP

• Jorge Dávila-Chacón • “that guy”

“Brainiacs” vs “Pragmatists”

48

Page 49: Deep Learning, an interactive introduction for NLP-ers

Different Levels of Abstraction

49

Page 50: Deep Learning, an interactive introduction for NLP-ers

Hierarchical Learning

• Natural progression from low level to high level structure as seen in natural complexity

Different Levels of Abstraction

Feature Representation

50

Page 51: Deep Learning, an interactive introduction for NLP-ers

Hierarchical Learning

• Natural progression from low level to high level structure as seen in natural complexity

• Easier to monitor what is being learnt and to guide the machine to better subspaces

Different Levels of Abstraction

Feature Representation

51

Page 52: Deep Learning, an interactive introduction for NLP-ers

Hierarchical Learning

• Natural progression from low level to high level structure as seen in natural complexity

• Easier to monitor what is being learnt and to guide the machine to better subspaces

• A good lower level representation can be used for many distinct tasks

Different Levels of Abstraction

Feature Representation

52

Page 53: Deep Learning, an interactive introduction for NLP-ers

Hierarchical Learning

• Natural progression from low level to high level structure as seen in natural complexity

• Easier to monitor what is being learnt and to guide the machine to better subspaces

• A good lower level representation can be used for many distinct tasks

Different Levels of Abstraction

Feature Representation

53

Page 54: Deep Learning, an interactive introduction for NLP-ers

• Shared Low Level Representations

• Multi-Task Learning

• Unsupervised Training

Generalizable Learning

54

Page 55: Deep Learning, an interactive introduction for NLP-ers

• Shared Low Level Representations

• Multi-Task Learning

• Unsupervised Training

• Partial Feature Sharing

• Mixed Mode Learning

• Composition of Functions

Generalizable Learning

55

Page 56: Deep Learning, an interactive introduction for NLP-ers

Classic Deep Architecture

Input layer

Hidden layers

Output layer

56

Page 57: Deep Learning, an interactive introduction for NLP-ers

Modern Deep Architecture

Input layer

Hidden layers

Output layer

57

Page 58: Deep Learning, an interactive introduction for NLP-ers

Deep Learning: Why? (again)

Beat state of the art in many areas:

• Language Modeling (Mikolov et al., 2012)
• Image Recognition (Krizhevsky won the 2012 ImageNet competition)
• Sentiment Classification (Socher et al., 2011)
• Speech Recognition (Dahl et al., 2010)
• MNIST hand-written digit recognition (Ciresan et al., 2010)

58

Page 59: Deep Learning, an interactive introduction for NLP-ers

One Model rules them all?

DL approaches have been successfully applied to:

Deep Learning: Why for NLP ?

Automatic summarization Coreference resolution Discourse analysis

Machine translation Morphological segmentation Named entity recognition (NER)

Natural language generation

Natural language understanding

Optical character recognition (OCR)

Part-of-speech tagging

Parsing

Question answering

Relationship extraction

sentence boundary disambiguation

Sentiment analysis

Speech recognition

Speech segmentation

Topic segmentation and recognition

Word segmentation

Word sense disambiguation

Information retrieval (IR)

Information extraction (IE)

Speech processing

59

Page 60: Deep Learning, an interactive introduction for NLP-ers

- COFFEE BREAK -
After the break we return with: CODE

Download the code samples already now from: https://github.com/graphific/DL-Meetup-intro

shortened url: http://goo.gl/abX1E2

60

Page 61: Deep Learning, an interactive introduction for NLP-ers

• Deep Neural Network

• Multilayer Perceptron (MLP) or Artificial Neural Network (ANN)

1. MLP

Logistic regression

Training regime: Stochastic Gradient Descent (SGD) with minibatches

MNIST dataset

Simple hidden layer

61
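The actual MLP code for this section lives in the DL-Meetup-intro repo linked above (written with Theano). As a hedged stand-in that captures the same ingredients, here is a scikit-learn sketch: one simple hidden layer, a softmax output in place of the logistic regression layer, minibatch SGD, and scikit-learn's small digits set instead of full MNIST:

```python
# A hedged stand-in for the talk's Theano MLP (the real code is in the linked
# DL-Meetup-intro repo): one hidden layer, softmax output, minibatch SGD,
# on scikit-learn's small digits dataset instead of full MNIST.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X / 16.0, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(128,),    # simple hidden layer
                    solver="sgd", batch_size=32,  # SGD with minibatches
                    learning_rate_init=0.1, max_iter=200, random_state=0)
mlp.fit(X_tr, y_tr)
print("test accuracy:", mlp.score(X_te, y_te))
```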

Page 62: Deep Learning, an interactive introduction for NLP-ers

2. Convolutional Neural Network

62

from: Krizhevsky, Sutskever, Hinton. (2012). ImageNet Classification with Deep Convolutional Neural Networks [breakthrough in object recognition, ImageNet 2012]

Page 63: Deep Learning, an interactive introduction for NLP-ers

Convolutional Neural Network

http://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolution

movie time: http://www.cs.toronto.edu/~hinton/adi/index.htm

63
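To show what "feature extraction using convolution" means (see the UFLDL link above), here is a tiny NumPy sketch, not from the talk: a hand-picked edge filter slides over a toy image; in a real convnet the filters are learned rather than hand-picked:

```python
# A tiny sketch of the convolution step that CNNs use for feature extraction:
# a hand-picked edge filter slides over an image. Purely illustrative; a real
# convnet learns its filters from data.
import numpy as np

def conv2d_valid(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                       # left half dark, right half bright
edge_filter = np.array([[-1.0, 1.0]])    # responds to vertical edges
print(conv2d_valid(image, edge_filter))  # strongest response at the boundary
```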

Page 64: Deep Learning, an interactive introduction for NLP-ers

That’s it, no more code! (for now)

64

Page 65: Deep Learning, an interactive introduction for NLP-ers

Deep Learning: Future Developments

Currently an explosion of developments:
• Hessian-Free networks (2010)
• Long Short-Term Memory (2011)
• Large convolutional nets, max-pooling (2011)
• Nesterov’s Gradient Descent (2013)

Currently state of the art but…
• No way of doing logical inference (extrapolation)
• No easy integration of abstract knowledge
• Hypothesis space bias might not conform with reality

65

Page 66: Deep Learning, an interactive introduction for NLP-ers

Deep Learning: Future Challenges


66

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R. (2013). Intriguing properties of neural networks

Left: correctly identified, Center: added noise x10, Right: “Ostrich”

Page 67: Deep Learning, an interactive introduction for NLP-ers

• cuda-convnet2 (Alex Krizhevsky, Toronto) (c++/CUDA, optimized for GTX 580) https://code.google.com/p/cuda-convnet2/

• Caffe (Berkeley) (C++/CUDA, with Python bindings) http://caffe.berkeleyvision.org/

• OverFeat (NYU) http://cilvr.nyu.edu/doku.php?id=code:start

Wanna Play ?

Page 68: Deep Learning, an interactive introduction for NLP-ers

• Theano - CPU/GPU symbolic expression compiler in python (from LISA lab at University of Montreal). http://deeplearning.net/software/theano/

• Pylearn2 - library designed to make machine learning research easy. http://deeplearning.net/software/pylearn2/

• Torch - MATLAB-like environment for state-of-the-art machine learning algorithms in Lua (from Ronan Collobert, Clement Farabet and Koray Kavukcuoglu) http://torch.ch/

• more info: http://deeplearning.net/software_links/

Wanna Play ?


Page 69: Deep Learning, an interactive introduction for NLP-ers

as PhD candidate KTH/CSC: “Always interested in discussing Machine Learning, Deep Architectures, Graphs, and Language Technology”

In touch!

[email protected]/~roelof/

Internship / Entrepreneurship
Academic / Research

as CIO/CTO Feeda:

“Always looking for additions to our brand new R&D team”

[Internships upcoming on KTH exjobb website…]

[email protected]

Feeda

69

Page 70: Deep Learning, an interactive introduction for NLP-ers

We’re Hiring!

[email protected]

Feeda

• Dev Ops • Software Developers • Data Scientists

70

Page 71: Deep Learning, an interactive introduction for NLP-ers

Thanks for listening

Mingling time!

71

Page 72: Deep Learning, an interactive introduction for NLP-ers

72

Can’t get enough? Come to my talk tomorrow (Friday).

Description on KTH website

Visual-Semantic Embeddings: some thoughts on Language

Roelof Pieters, TCS/CSC. Friday, Jan 23, 13:30.

Room 304, Teknikringen 14 level 3

Page 73: Deep Learning, an interactive introduction for NLP-ers

Addendum

Some of the exciting recent developments in NLP, especially distributed semantics

73

Page 74: Deep Learning, an interactive introduction for NLP-ers

Word Embeddings: Turian (2010)

Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning

code & info: http://metaoptimize.com/projects/wordreprs/ 74

Page 75: Deep Learning, an interactive introduction for NLP-ers

Word Embeddings: Turian (2010)

Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning

code & info: http://metaoptimize.com/projects/wordreprs/ 75

Page 76: Deep Learning, an interactive introduction for NLP-ers

Word Embeddings: Collobert & Weston (2011)

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011) . Natural Language Processing (almost) from Scratch

76

Page 77: Deep Learning, an interactive introduction for NLP-ers

Multi-embeddings: Stanford (2012)

Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng (2012). Improving Word Representations via Global Context and Multiple Word Prototypes

77

Page 78: Deep Learning, an interactive introduction for NLP-ers

Linguistic Regularities: Mikolov (2013)

code & info: https://code.google.com/p/word2vec/ Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations

78
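A hedged way to reproduce the analogy demo with the word2vec vectors linked above is via gensim (an assumption: gensim and its downloader are installed, and the pretrained model is a multi-gigabyte download):

```python
# A sketch of the "linguistic regularities" demo using gensim (assumed
# installed) and its downloadable word2vec vectors; the classic
# king - man + woman ≈ queen analogy.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # pretrained vectors (large download)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```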

Page 79: Deep Learning, an interactive introduction for NLP-ers

Word Embeddings for MT: Mikolov (2013)

Mikolov, T., Le, Q. V., Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation

79

Page 80: Deep Learning, an interactive introduction for NLP-ers

Recursive Deep Models & Sentiment: Socher (2013)

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Chris Manning, Andrew Ng and Chris Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP 2013

code & demo: http://nlp.stanford.edu/sentiment/index.html

80

Page 81: Deep Learning, an interactive introduction for NLP-ers

Paragraph Vectors: Le & Mikolov (2014)

Le, Q., Mikolov,. T. (2014) Distributed Representations of Sentences and Documents

81

• add context (sentence, paragraph, document) to word vectors during training

!

Results on Stanford Sentiment Treebank dataset:
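The sentiment results themselves are in the paper; to illustrate just the paragraph-vector idea (each document gets its own vector trained alongside the word vectors), here is a minimal gensim Doc2Vec sketch (gensim 4.x assumed; the toy corpus and hyperparameters are purely illustrative):

```python
# A minimal gensim Doc2Vec sketch of the paragraph-vector idea: each document
# gets its own vector trained alongside the word vectors. Toy corpus and
# illustrative hyperparameters only (gensim >= 4 API).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=doc.split(), tags=[i])
          for i, doc in enumerate(["the movie was great",
                                   "a truly terrible film",
                                   "great acting and a great plot"])]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)
print(model.dv[0])                # vector for document 0
print(model.dv.most_similar(0))   # documents nearest to document 0
```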

Page 82: Deep Learning, an interactive introduction for NLP-ers

Global Vectors, GloVe: Stanford (2014)

Pennington, J., Socher, R., Manning, C. D. (2014). GloVe: Global Vectors for Word Representation

code & demo: http://nlp.stanford.edu/projects/glove/

results on the word analogy task (GloVe vs word2vec)

“similar accuracy”

82

Page 83: Deep Learning, an interactive introduction for NLP-ers

Dependency-based Embeddings: Levy & Goldberg (2014)

Levy, O., Goldberg, Y. (2014). Dependency-Based Word Embeddings

code & demo: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/

- Syntactic Dependency Context

Australian scientist discovers star with telescope

- Bag of Words (BoW) Context

[precision–recall plot from the paper, comparing dependency-based and BoW contexts]

“Dependency-based embeddings have more functional similarities”

83