Deep Learning: A Statistical Perspective (stat.snu.ac.kr/mcp/NLP_lecture-2.pdf · 2018-07-23)
Introduction
Deep Learning: A Statistical Perspective
Myunghee Cho Paik
Guest lectures by Gisoo Kim, Yongchan Kwon, Young-geun Kim, Wonyoung Kim and Youngwon Choi
Seoul National University
March-June, 2018
Seoul National University, Deep Learning, March-June 2018
Natural Language Processing
Natural Language Processing (NLP) includes:
Sentiment analysis
Machine translation
Text generation
...
How do we train a model on language?
How can we convert language into numbers?
Word Embedding
How do we map words into $\mathbb{R}^d$?
One-hot encoding
Each vector has nothing to do with the others: $\forall u \neq v$, $\|u - v\| = \sqrt{2}$, $u^{\top}v = 0$
However...
Each word is related to the company it keeps: "Ice" is closer to "Solid" than to "Gas".
Main Questions in Word Embedding
Vocabulary set: $V = \{a, the, deep, statistics, \ldots\}$
Size-$N$ corpus: $C = (v^{(1)}, v^{(2)}, \ldots, v^{(N)})$, $v^{(1)}, v^{(2)}, \ldots, v^{(N)} \in V$
Given the corpus data, how can we measure similarity between words, e.g. sim(deep, statistics)?
How can we define $f$ and learn $w_{\text{deep}}, w_{\text{statistics}}$ such that sim(deep, statistics) $= f(w_{\text{deep}}, w_{\text{statistics}})$?
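One common concrete choice, once embeddings are learned, is cosine similarity of the word vectors. A minimal sketch; the 3-dimensional vectors below are invented purely for illustration:

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings: "deep" and "statistics" should be close,
# a function word like "the" should not be.
w_deep = np.array([0.9, 0.1, 0.3])
w_statistics = np.array([0.8, 0.2, 0.4])
w_the = np.array([-0.5, 0.9, -0.1])

assert cosine_sim(w_deep, w_statistics) > cosine_sim(w_deep, w_the)
```

Cosine similarity ignores vector length, which is convenient because frequent words tend to get longer vectors.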
Some Famous Word Embedding Techniques
Latent Semantic Analysis (LSA) (Deerwester et al. 1990)
Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996)
Word2Vec (Mikolov et al. 2013a)
GloVe (Pennington et al. 2014)
LSA (Deerwester et al. 1990)
Term-document matrix: $X_{t\times d}$
sim(a, b) ∝ co-occurrence in the same documents.
Singular value decomposition: $X_{t\times d} = T S D^{\top}$
With the $k$ largest singular values:
$\hat{X}_{t\times d} = T_{t\times k} S_{k\times k} (D_{d\times k})^{\top}$
$T_{t\times k}$: $k$-dim term vectors; $D_{d\times k}$: $k$-dim document vectors
Figure: from (Deerwester et al. 1990)
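The truncated SVD above can be sketched with NumPy on a toy term-document matrix; the counts are invented for illustration:

```python
import numpy as np

# Toy term-document matrix: 4 terms x 3 documents (made-up counts).
X = np.array([[2., 0., 1.],
              [1., 0., 1.],
              [0., 3., 0.],
              [0., 2., 1.]])

# Full SVD, then keep the k largest singular values.
T, s, Dt = np.linalg.svd(X, full_matrices=False)
k = 2
X_hat = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]   # rank-k approximation
term_vectors = T[:, :k] * s[:k]                  # k-dim term representations

# Frobenius error of the rank-k approximation is bounded by the
# largest dropped singular value.
err = np.linalg.norm(X - X_hat)
assert err <= s[2] + 1e-8
```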
HAL (Lund and Burgess, 1996)
Term-context matrix: $X_{V\times V}$
How many times does the column word appear in front of the row word?
sim(a, b) ∝ co-occurrence in nearby context.
Concatenate row and column to make a $2V$-dim vector.
Dimension reduction with $k$ principal components.
Trained on 160M terms, with $V = 70{,}000$.
Figure: from (Lund and Burgess, 1996)
sim(·, ·) ∝ Co-occurrence?
Co-occurrence with "and" or "the" does not mean semantic similarity.
Does a word just appear frequently, or is it genuinely similar?
Remedies: a transformation, or a new measure of similarity:
Entropy/correlation-based normalization (Rohde et al., 2006)
Positive pointwise mutual information (PPMI): $\max\{0, \log \frac{p(\text{context}\mid\text{term})}{p(\text{context})}\}$ (Bullinaria and Levy, 2007)
Square-root-type transformation (Lebret and Collobert, 2014)
Train $p(\text{context}\mid\text{term})$ within every local window (Word2Vec)
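The PPMI transform can be sketched on a made-up co-occurrence count matrix:

```python
import numpy as np

def ppmi(X):
    """Positive pointwise mutual information transform of a
    co-occurrence count matrix X (terms x contexts)."""
    total = X.sum()
    p_term = X.sum(axis=1, keepdims=True) / total   # p(term)
    p_ctx = X.sum(axis=0, keepdims=True) / total    # p(context)
    p_joint = X / total                             # p(term, context)
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(p_joint / (p_term * p_ctx))
    pmi[~np.isfinite(pmi)] = 0.0                    # zero counts -> 0
    return np.maximum(pmi, 0.0)                     # clip at zero (PPMI)

# Invented counts for illustration.
X = np.array([[10., 0., 2.],
              [0., 8., 1.],
              [5., 5., 5.]])
M = ppmi(X)
assert (M >= 0).all()
```

Note that $\log \frac{p(\text{context}\mid\text{term})}{p(\text{context})} = \log \frac{p(\text{term}, \text{context})}{p(\text{term})\,p(\text{context})}$, which is what the function computes.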
Word2Vec
Model Setup (Mikolov et al., 2013)
Vocabulary set: $V = \{e_1, e_2, \ldots, e_V\} \subset \{0,1\}^V$
Size-$N$ corpus: $C = (v^{(1)}, v^{(2)}, \ldots, v^{(N)})$, $v^{(1)}, v^{(2)}, \ldots, v^{(N)} \in V$
Embedded word vectors:
$W_{d\times V} = \begin{bmatrix} w_1 & w_2 & \cdots & w_V \end{bmatrix}, \qquad W'_{d\times V} = \begin{bmatrix} w'_1 & w'_2 & \cdots & w'_V \end{bmatrix}$
Model Setup (Mikolov et al., 2013)
Thus, the model becomes:
$P(v^{(\text{output})} \mid v^{(\text{input})}) = \dfrac{\exp(w'_{\text{output}} \cdot w_{\text{input}})}{\sum_{j=1}^{V} \exp(w'_j \cdot w_{\text{input}})}$
$W$ and $W'$ are called the input and output representations.
Note that $W \neq W'$. If $W = W'$, $P(\cdot\mid\cdot)$ would be maximized when the context word equals the input word, which is a rare event.
If the output (or context) word appears in the window, $w'_{\text{output}} \cdot w_{\text{input}}$ increases.
Training the Model (Rong, 2014)
Initialize $W$ → Read a (context, input) pair → Update $W'$ → Update $W$ → Read another (context, input) pair → · · ·
Initialization: $W_{ij} \sim U[-0.5, 0.5]$, $\forall i, j$
Suppose $v^{(\text{output})} = e_o$ appeared in the context of $v^{(\text{input})} = e_i$.
Update $W'$ by minimizing the negative log-likelihood:
$L \equiv -\log P(e_o \mid e_i) = \log\Big(\sum_{j=1}^{V} \exp(u_{ij})\Big) - u_{io}$
where $u_{ij} = w'_j \cdot w_i$, $j = 1, \ldots, V$
Training the Model (Rong, 2014)
Taking derivatives:
$\dfrac{\partial L}{\partial u_{ik}} = \dfrac{\exp(u_{ik})}{\sum_{j=1}^{V} \exp(u_{ij})} - \delta_{(k=o)}, \qquad \dfrac{\partial u_{ik}}{\partial w'_k} = w_i$
$\dfrac{\partial L}{\partial w'_k} = [P(e_k \mid e_i) - \delta_{(k=o)}]\, w_i, \quad \forall k = 1, \ldots, V$
With gradient descent, the updating equation is:
$w'^{(\text{new})}_k = w'^{(\text{old})}_k - \alpha [P(e_k \mid e_i) - \delta_{(k=o)}]\, w_i, \quad \forall k = 1, \ldots, V$
If $k = o$, $[P(e_k \mid e_i) - \delta_{(k=o)}] < 0$: the output word is underestimated, so the update adds a $w_i$-direction component to $w'_k$. In summary, the update increases $u_{io}$ and decreases $u_{ik}$ for all $k \neq o$.
Training the Model (Rong, 2014)
Given $W'$, update $W$.
Reminder: $v^{(\text{output})} = e_o$ appeared in the context of $v^{(\text{input})} = e_i$.
Taking derivatives w.r.t. $w_i$:
$\dfrac{\partial L}{\partial u_{ik}} = \dfrac{\exp(u_{ik})}{\sum_{j=1}^{V} \exp(u_{ij})} - \delta_{(k=o)}, \qquad \dfrac{\partial u_{ik}}{\partial w_i} = w'_k$
$\dfrac{\partial L}{\partial w_i} = \sum_{j=1}^{V} \dfrac{\partial L}{\partial u_{ij}} \dfrac{\partial u_{ij}}{\partial w_i} = \sum_{j=1}^{V} [P(e_j \mid e_i) - \delta_{(j=o)}]\, w'_j$
Define $EH = \sum_{j=1}^{V} [P(e_j \mid e_i) - \delta_{(j=o)}]\, w'_j$: the sum of output vectors, weighted by their prediction errors.
With gradient descent, the updating equation is: $w_i^{(\text{new})} = w_i^{(\text{old})} - \alpha\, EH$
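Both updating equations for one (input, output) pair can be sketched in NumPy. The vocabulary size, learning rate, and the repeated training pair below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, alpha = 8, 4, 0.1                  # toy vocabulary size, dim, learning rate
W = rng.uniform(-0.5, 0.5, (d, V))       # input representation
Wp = rng.uniform(-0.5, 0.5, (d, V))      # output representation W'

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def update_pair(i, o):
    """One gradient step for (input e_i, output e_o); returns -log P(e_o|e_i)."""
    w_i = W[:, i].copy()
    p = softmax(Wp.T @ w_i)              # P(e_j | e_i), scores u_ij = w'_j . w_i
    err = p.copy()
    err[o] -= 1.0                        # P(e_j | e_i) - delta_(j=o)
    EH = Wp @ err                        # sum_j err_j * w'_j
    Wp[:, :] -= alpha * np.outer(w_i, err)   # update every w'_k
    W[:, i] = w_i - alpha * EH               # update w_i
    return -np.log(p[o])

losses = [update_pair(2, 5) for _ in range(50)]
assert losses[-1] < losses[0]            # repeated updates reduce the loss
```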
CBOW and Skip-gram (Mikolov et al., 2013)
Figure: CBOW Model
Figure: Skip-gram Model
Training CBOW (Rong, 2014)
Input: $v^{(t-c)}, \ldots, v^{(t-1)}, v^{(t+1)}, \ldots, v^{(t+c)}$. Output: $v^{(t)}$.
Suppose $v^{(t-c)} = e_{t(1)}, \ldots, v^{(t+c)} = e_{t(2c)}$, and $v^{(t)} = e_o$.
Define:
$h_t \equiv \dfrac{1}{2c} \sum_{j=-c,\, j\neq 0}^{c} W v^{(t+j)} = \dfrac{1}{2c} \sum_{k=1}^{2c} w_{t(k)}$
Then the model becomes:
$P(e_o \mid e_{t(1)}, \ldots, e_{t(2c)}) = \dfrac{\exp(w'_o \cdot h_t)}{\sum_{j=1}^{V} \exp(w'_j \cdot h_t)}$
The loss is defined by the negative log-likelihood:
$L \equiv \log \sum_{j=1}^{V} \exp(w'_j \cdot h_t) - w'_o \cdot h_t$
where $u_{tj} = w'_j \cdot h_t$, $\forall j = 1, \ldots, V$.
Training CBOW (Rong, 2014)
With a similar calculation, the updating equation for $W'$ becomes:
$w'^{(\text{new})}_k = w'^{(\text{old})}_k - \alpha [P(e_k \mid v^{(t-c)}, \ldots, v^{(t+c)}) - \delta_{(k=o)}]\, h_t$
For $W$, note that $u_{tj} = w'_j \cdot h_t = \dfrac{1}{2c} \sum_{k=1}^{2c} w'_j \cdot w_{t(k)}$.
For back-propagation:
$\dfrac{\partial L}{\partial w_{t(k)}} = \sum_{j=1}^{V} \dfrac{\partial L}{\partial u_{tj}} \dfrac{\partial u_{tj}}{\partial w_{t(k)}} = \dfrac{1}{2c} \sum_{j=1}^{V} [P(e_j \mid v^{(t-c)}, \ldots, v^{(t+c)}) - \delta_{(j=o)}]\, w'_j = \dfrac{1}{2c} EH$
Thus the updating equation becomes:
$w_{t(k)}^{(\text{new})} = w_{t(k)}^{(\text{old})} - \alpha \dfrac{1}{2c} EH, \quad \forall k = 1, \ldots, 2c$
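The averaged hidden vector $h_t$ that CBOW uses can be sketched as follows; the sizes and context indices are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, c = 10, 4, 2
W = rng.uniform(-0.5, 0.5, (d, V))   # input representation

context = [3, 7, 1, 4]               # hypothetical indices t(1), ..., t(2c)
h_t = W[:, context].mean(axis=1)     # h_t = (1/2c) * sum_k w_{t(k)}

# Averaging is why each context word's update is the shared EH scaled by 1/2c.
assert np.allclose(h_t, sum(W[:, k] for k in context) / (2 * c))
```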
Training Skip-gram (Rong, 2014)
Input: $v^{(t)}$. Output: $v^{(t-c)}, \ldots, v^{(t-1)}, v^{(t+1)}, \ldots, v^{(t+c)}$.
Suppose $v^{(t)} = e_i$, and $v^{(t-c)} = e_{t(1)}, \ldots, v^{(t+c)} = e_{t(2c)}$. Then the model becomes:
$P(v^{(t-c)}, \ldots, v^{(t-1)}, v^{(t+1)}, \ldots, v^{(t+c)} \mid v^{(t)}) = \prod_{k=1}^{2c} P(e_{t(k)} \mid e_i) = \prod_{k=1}^{2c} \dfrac{\exp(w'_{t(k)} \cdot w_i)}{\sum_{j=1}^{V} \exp(w'_j \cdot w_i)}$
The loss becomes:
$L \equiv \sum_{k=1}^{2c} L_k = \sum_{k=1}^{2c} \Big[\log \sum_{j=1}^{V} \exp(u^{(k)}_{ij}) - u^{(k)}_{i\,t(k)}\Big]$
where $u^{(k)}_{ij} = w'_j \cdot w_i$: the score for the $k$-th loss only.
Training Skip-gram (Rong, 2014)
For $W'$, $j = 1, \ldots, V$:
$\dfrac{\partial L}{\partial w'_j} = \sum_{k=1}^{2c} \dfrac{\partial L_k}{\partial u^{(k)}_{ij}} \dfrac{\partial u^{(k)}_{ij}}{\partial w'_j} = \sum_{k=1}^{2c} [P(e_j \mid e_i) - \delta_{(j=t(k))}]\, w_i$
Thus, the updating equation for $W'$ becomes:
$w'^{(\text{new})}_j = w'^{(\text{old})}_j - \alpha \sum_{k=1}^{2c} [P(e_j \mid e_i) - \delta_{(j=t(k))}]\, w_i, \quad \forall j = 1, \ldots, V$
Training Skip-gram (Rong, 2014)
For $W$:
$\dfrac{\partial L}{\partial w_i} = \sum_{k=1}^{2c} \sum_{j=1}^{V} \dfrac{\partial L_k}{\partial u^{(k)}_{ij}} \dfrac{\partial u^{(k)}_{ij}}{\partial w_i} = \sum_{k=1}^{2c} \sum_{j=1}^{V} [P(e_j \mid e_i) - \delta_{(j=t(k))}]\, w'_j \equiv \sum_{k=1}^{2c} EH^{(k)}$
Thus, the updating equation for $W$ becomes:
$w_i^{(\text{new})} = w_i^{(\text{old})} - \alpha \sum_{k=1}^{2c} EH^{(k)}$
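The skip-gram loss over one window can be sketched as follows; the sizes and indices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 10, 4
W = rng.uniform(-0.5, 0.5, (d, V))   # input representation
Wp = rng.uniform(-0.5, 0.5, (d, V))  # output representation W'

def skipgram_loss(i, context):
    """L = sum_k [log sum_j exp(u_ij) - u_{i t(k)}]; the scores are
    shared across the 2c output positions."""
    u = Wp.T @ W[:, i]                               # u_ij = w'_j . w_i
    log_norm = np.log(np.exp(u - u.max()).sum()) + u.max()  # stable log-sum-exp
    return float(sum(log_norm - u[t] for t in context))

L = skipgram_loss(0, [2, 5, 7, 9])                   # 2c = 4 context positions
assert L > 0.0
```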
Computational Problem
For each (input, output) pair in the corpus $C = (v^{(1)}, v^{(2)}, \ldots, v^{(N)})$, the model must calculate:
$P(e_o \mid e_i) = \dfrac{\exp(w'_o \cdot w_i)}{\sum_{j=1}^{V} \exp(w'_j \cdot w_i)}$
Each epoch requires almost $N \times V$ inner products of $d$-dim vectors (skip-gram: $N \times V \times 2c$).
The computation is proportional to $V \approx 10^6$.
(Mikolov et al., 2013) suggest two alternative formulations: hierarchical softmax and negative sampling.
Hierarchical softmax (Mikolov et al., 2013)
An efficient way of computing the softmax.
Build a Huffman binary tree using word frequencies.
Instead of $w'_j$, the model uses $w'_{n(e_j, l)}$.
$n(e_j, l)$: the $l$-th node on the path from the root to the word $e_j$.
Figure: Binary tree for HS
Let $h_i$ be the hidden node. Then the probability model becomes:
$P(e_o \mid e_i) = \prod_{l=1}^{L(e_o)-1} \sigma\big([n(e_o, l+1) \text{ is the left child of } n(e_o, l)] \times w'_{n(e_o,l)} \cdot h_i\big)$
where $[\cdot]$ is $+1$ if its argument is true and $-1$ otherwise.
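The path product above can be sketched numerically; the path vectors, signs, and hidden vector below are random stand-ins for a real Huffman tree:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
d = 4
path_vecs = rng.normal(size=(3, d))   # w'_{n(e_o,l)} for L(e_o) - 1 = 3 inner nodes
signs = np.array([+1.0, -1.0, +1.0])  # +1 if the next node is the left child
h_i = rng.normal(size=d)

# P(e_o | e_i) = prod_l sigma(sign_l * w'_{n(e_o,l)} . h_i)
p = float(np.prod(sigmoid(signs * (path_vecs @ h_i))))
assert 0.0 < p < 1.0
```

Because $\sigma(x) + \sigma(-x) = 1$ at every inner node, the probabilities over all leaves sum to 1, so no $V$-term normalization is needed.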
Training Hierarchical Softmax (Rong, 2014)
Let $L = -\log P(e_o \mid e_i)$, and write $w'_{n(e_o,l)} = w'_l$. Then:
$\dfrac{\partial L}{\partial (w'_l \cdot h_i)} = \{\sigma([\cdot]\, w'_l \cdot h_i) - 1\}[\cdot] = \begin{cases} \sigma(w'_l \cdot h_i) - 1 & [\cdot] = 1 \\ \sigma(w'_l \cdot h_i) & [\cdot] = -1 \end{cases}$
$\sigma(w'_l \cdot h_i)$ is the probability that $n(e_o, l+1)$ is the left child of $n(e_o, l)$. Thus,
$\dfrac{\partial L}{\partial (w'_l \cdot h_i)} = P[n(e_o, l+1) \text{ is the left child of } n(e_o, l)] - \delta_{[\cdot]}$
Thus the updating equation becomes, for $l = 1, \ldots, L(e_o)-1$:
$w'^{(\text{new})}_l = w'^{(\text{old})}_l - \alpha\big(P[n(e_o, l+1) \text{ is the left child of } n(e_o, l)] - \delta_{[\cdot]}\big) h_i$
For the skip-gram model, repeat this procedure for the $2c$ outputs.
The updating equation for $W$ becomes:
$w_i^{(\text{new})} = w_i^{(\text{old})} - \alpha\, EH\, \dfrac{\partial h_i}{\partial w_i}, \qquad \text{where } EH = \sum_{l=1}^{L(e_o)-1} \big(P[\cdot] - \delta_{[\cdot]}\big) w'_l$
Negative Sampling (Mikolov et al., 2013)
Generate $e_{n(1)}, \ldots, e_{n(k)}$ from a noise distribution $P_n$.
The goal is to discriminate $(h_i, e_o)$ from $(h_i, e_{n(1)}), \ldots, (h_i, e_{n(k)})$.
For the skip-gram model, repeat this procedure for each of the $2c$ outputs.
$k = 5$–$20$ is useful for small datasets; for large datasets, $k$ can be as small as $2$–$5$.
The noise distribution $P_n(e_n) \propto \left[\frac{\#(e_n)}{N}\right]^{3/4}$ significantly outperformed the alternatives.
Figure: 5-Negative Sampling
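Sampling from the 3/4-power noise distribution can be sketched as follows; the word counts are made up:

```python
import numpy as np

# Hypothetical word counts #(e_n) in a corpus.
counts = np.array([100, 50, 10, 5, 1], dtype=float)

# P_n(e_n) proportional to count^{3/4}: flattens the raw unigram distribution.
p_noise = counts ** 0.75
p_noise /= p_noise.sum()

rng = np.random.default_rng(4)
negatives = rng.choice(len(counts), size=5, p=p_noise)  # k = 5 noise samples

assert abs(p_noise.sum() - 1.0) < 1e-12
# Rare words are up-weighted relative to their raw frequency.
assert p_noise[-1] / p_noise[0] > counts[-1] / counts[0]
```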
Objective in Negative Sampling (Goldberg and Levy, 2014)
Suppose $(h_i, e_o)$ and $(h_i, e_{n(1)}), \ldots, (h_i, e_{n(k)})$, with $o \neq n(j)$, $\forall j = 1, \ldots, k$, are given.
Let $[D = 1 \mid h_i, e_j]$ be the event that the pair $(h_i, e_j)$ came from the original corpus.
The model assumes $P(D = 1 \mid h_i, e_j) = \sigma(w'_j \cdot h_i)$. Thus the likelihood becomes:
$\sigma(w'_o \cdot h_i) \times \prod_{j=1}^{k} \big[1 - \sigma(w'_{n(j)} \cdot h_i)\big]$
Taking the log leads to the objective in (Mikolov et al., 2013):
$\log \sigma(w'_o \cdot h_i) + \sum_{j=1}^{k} \log \sigma(-w'_{n(j)} \cdot h_i), \qquad e_{n(j)} \sim P_n$
Note that training $h_i$ given $w'_o, w'_{n(1)}, \ldots, w'_{n(k)}$ is a logistic regression.
Training Negative Sampling (Rong, 2014)
Define the loss as:
$L = -\log \sigma(w'_o \cdot h_i) - \sum_{j=1}^{k} \log \sigma(-w'_{n(j)} \cdot h_i)$
Let $W_{\text{neg}} = \{w'_{n(1)}, \ldots, w'_{n(k)}\}$. Then the derivative:
$\dfrac{\partial L}{\partial (w'_j \cdot h_i)} = \begin{cases} \sigma(w'_j \cdot h_i) - 1 & w'_j = w'_o \\ \sigma(w'_j \cdot h_i) & w'_j \in W_{\text{neg}} \end{cases} = P(D = 1 \mid h_i, e_j) - \delta_{(j=o)}$
Thus the updating equation for $W'$: for $j = o, n(1), \ldots, n(k)$,
$w'^{(\text{new})}_j = w'^{(\text{old})}_j - \alpha [P(D = 1 \mid h_i, e_j) - \delta_{(j=o)}]\, h_i$
Let $\dfrac{\partial L}{\partial h_i} = \sum_{j \in \{o, n(1), \ldots, n(k)\}} \big(P(D = 1 \mid h_i, e_j) - \delta_{(j=o)}\big) w'_j \equiv EH$. Then the updating equation for $W$:
$w_i^{(\text{new})} = w_i^{(\text{old})} - \alpha\, EH\, \dfrac{\partial h_i}{\partial w_i}$
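One negative-sampling gradient step can be sketched in NumPy. The dimensions, learning rate, and random vectors are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
d, k, alpha = 4, 3, 0.05
h = rng.normal(size=d)                   # hidden vector h_i
w_pos = rng.normal(size=d)               # w'_o
W_neg = rng.normal(size=(k, d))          # w'_{n(1)}, ..., w'_{n(k)}

def ns_loss():
    return float(-np.log(sigmoid(w_pos @ h)) - np.log(sigmoid(-W_neg @ h)).sum())

before = ns_loss()
g_pos = sigmoid(w_pos @ h) - 1.0         # P(D=1|h, e_o) - 1
g_neg = sigmoid(W_neg @ h)               # P(D=1|h, e_n(j)) - 0
EH = g_pos * w_pos + g_neg @ W_neg       # gradient w.r.t. h (old weights)
w_pos = w_pos - alpha * g_pos * h        # update the true output vector
W_neg = W_neg - alpha * g_neg[:, None] * h[None, :]  # update the k noise vectors
h = h - alpha * EH                       # update the hidden vector
assert ns_loss() < before                # one small step reduces the loss
```

The cost per step is $O((k+1)d)$ instead of $O(Vd)$, which is the whole point of the method.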
Two Pre-processing Techniques (Mikolov et al., 2013)
Frequent words (such as "a", "the", "in") provide less information value than rare words.
Let $V = \{v_1, \ldots, v_V\}$ be the vocabulary set. Discard each word $v_i$ with probability:
$P(v_i) = 1 - \sqrt{\dfrac{t}{\#(v_i)/N}}$
where $t = 10^{-5}$ is a proper threshold value.
"New York Times" or "Toronto Maple Leafs" can be considered as one word.
In order to find such phrases, define a score:
$\text{score}(v_i, v_j) = \dfrac{\#(v_i v_j) - \delta}{\#(v_i)\, \#(v_j)}$
Over 2–4 passes of the training set, calculate the score with decreasing $\delta$. Above some threshold value, treat $v_i v_j$ as a single word.
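The subsampling probability can be sketched directly; the corpus size and word counts are hypothetical:

```python
import numpy as np

t = 1e-5                 # threshold from the slide
N = 1_000_000            # hypothetical corpus size

def discard_prob(count):
    """P(discard v_i) = 1 - sqrt(t / (#(v_i)/N)), clipped at 0."""
    return max(0.0, 1.0 - np.sqrt(t / (count / N)))

# A very frequent word is usually discarded; a rare one is always kept
# (the formula goes to 0 once the frequency drops below t).
assert discard_prob(50_000) > 0.9
assert discard_prob(10) == 0.0
```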
GloVe
Motivation (Pennington et al., 2014)
Let $V = \{v_1, \ldots, v_V\}$ be the vocabulary set.
Throughout the corpus $C$, define some statistics:
$X_{ij}$: #(word $v_j$ appears in the context of word $v_i$)
$X_i \equiv \sum_k X_{ik}$: #(any word appears in the context of $v_i$)
$P_{ij} = X_{ij}/X_i$: probability that $v_j$ appears in the context of $v_i$
How can we measure similarity between words, $\text{sim}(v_i, v_j)$?
Motivation (Pennington et al., 2014)
Co-occurrence probabilities for "ice" and "steam" with selected context words from a corpus (N = 6 billion).
If $v_k$ is related to $v_i$ rather than $v_j$, then $P_{ik}/P_{jk}$ will be larger than 1.
If $v_k$ is related (or unrelated) to both $v_i$ and $v_j$, then $P_{ik}/P_{jk}$ will be close to 1.
The ratio $P_{ik}/P_{jk}$ is useful for finding out whether $v_k$ is close to $v_i$ (or $v_j$).
Figure: from (Pennington et al., 2014)
Model Setup (Pennington et al., 2014)
With this motivation, the model becomes:
$\dfrac{P_{ik}}{P_{jk}} = F(w_i, w_j, w'_k), \qquad w_i, w_j, w'_k \in \mathbb{R}^d$
Keeping two sets of parameters $W, W'$ can help reduce overfitting and noise, and generally improves results (Ciresan et al., 2012).
In vector space, knowing $w_1, \ldots, w_V$ is the same as knowing $w_1 - w_i, \ldots, w_V - w_i$. Thus $F$ can be restricted to:
$\dfrac{P_{ik}}{P_{jk}} = F(w_i - w_j, w'_k)$
In order to match the dimensions and preserve the linear structure, use dot products:
$\dfrac{P_{ik}}{P_{jk}} = F\big[(w_i - w_j) \cdot w'_k\big]$
Model Setup (Pennington et al., 2014)
For any $i, j, k, l = 1, \ldots, V$:
$F\big[(w_i - w_j) \cdot w'_k\big]\, F\big[(w_j - w_l) \cdot w'_k\big] = \dfrac{P_{ik}}{P_{lk}} = F\big[(w_i - w_l) \cdot w'_k\big]$
It is natural to define $F$ satisfying $F(x)F(y) = F(x+y)$. This implies $F = \exp(\cdot)$.
Moreover:
$F\big[(w_i - w_j) \cdot w'_k\big] = \dfrac{\exp(w_i \cdot w'_k)}{\exp(w_j \cdot w'_k)} = \dfrac{P_{ik}}{P_{jk}}$
Thus, $w_i \cdot w'_k = \log P_{ik} = \log X_{ik} - \log X_i$.
Since the roles of a word and a context are exchangeable, we want $w_i \cdot w'_k = w_k \cdot w'_i$.
Model Setup (Pennington et al., 2014)
Absorb $\log X_i$ into a bias of the input representation, $b_i$, and add another bias $b'_k$.
Finally, the model becomes:
$w_i \cdot w'_k + b_i + b'_k = \log X_{ik}$
Now, define a weighted cost function:
$L = \sum_{i,j=1}^{V} f(X_{ij})\big(w_i \cdot w'_j + b_i + b'_j - \log X_{ij}\big)^2$
The weight $f$ must satisfy:
$f(0) = 0$: to avoid the case $X_{ij} = 0$.
$f$ must be non-decreasing: frequent co-occurrences should be emphasized.
$f$ should be relatively small for large values: the case of "in", "the", "and".
Training GloVe (Pennington et al., 2014)
$f$ is suggested as:
$f(x) = \begin{cases} (x/x_{\max})^{\alpha} & x < x_{\max} \\ 1 & x \geq x_{\max} \end{cases}$
$x_{\max}$ is reported to have a weak impact on performance (fixed at $x_{\max} = 100$).
$\alpha = 3/4$ gives a modest improvement over $\alpha = 1$.
Training is done with AdaGrad (Duchi et al., 2011), stochastically sampling non-zero elements of $X$.
The model produces $W$ and $W'$; the final word vectors are $W + W'$.
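The weighting function and the weighted cost can be sketched on a toy co-occurrence matrix; the counts and initialization below are invented, and the AdaGrad loop itself is omitted:

```python
import numpy as np

x_max, alpha_w = 100.0, 0.75

def f(x):
    """GloVe weighting: (x / x_max)^alpha below x_max, 1 above."""
    return np.where(x < x_max, (x / x_max) ** alpha_w, 1.0)

rng = np.random.default_rng(6)
V, d = 6, 3
X = rng.integers(1, 200, size=(V, V)).astype(float)  # toy co-occurrence counts
W = rng.normal(scale=0.1, size=(V, d))               # input vectors
Wp = rng.normal(scale=0.1, size=(V, d))              # output vectors W'
b = np.zeros(V)
bp = np.zeros(V)

def loss():
    """L = sum_ij f(X_ij) (w_i . w'_j + b_i + b'_j - log X_ij)^2."""
    resid = W @ Wp.T + b[:, None] + bp[None, :] - np.log(X)
    return float((f(X) * resid ** 2).sum())

assert f(0.0) == 0.0 and f(200.0) == 1.0
assert loss() > 0.0
```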
Toy Implementation
Data and Model Descriptions
Movie review data from the NLTK corpus.
Consists of plot summaries and critiques.
Corpus size $N = 1.5$ million, vocabulary size $V = 39{,}768$.
Embedding dimension: $d = 100$; window size: $c = 5$.
Negative sample size: $k = 5$.
GloVe trained for 10 epochs.
Time elapsed for training (Intel Core i7 CPU @ 3.60GHz):

Model | CBOW+HS | CBOW+NEG | SG+HS | SG+NEG | GloVe
Time  | 9.14s   | 4.53s    | 12.4s | 12.3s  | 44.2s
Results
Similarity between two vectors
Similarity between two vectors (most frequent words)
Top 5 similar words with “villian”
Linear relationship: (“actor”+”she”-”actress”=?)
Linear relationship: (“king”+”she”-”he”=?)
Performances
Intrinsic Performances (Pennington et al., 2014)
Word analogies task: 19,544 questions
Semantic: "Athens" is to "Greece" as "Berlin" is to ( ? )
Syntactic: "dance" is to "dancing" as "fly" is to ( ? )
Corpus: Gigaword5 + Wikipedia2014
Percentage of correct answers:
Model d N Sem. Syn. Tot.
CBOW 300 6B 63.6 67.4 65.7
SG 300 6B 73.0 66.0 69.1
GloVe 300 6B 77.4 67.0 71.7
Table: From (Pennington et al., 2014)
Extrinsic Performances (Pennington et al., 2014)
Named entity recognition (NER) with Conditional Random Field(CRF) model
Input: Jim bought 300 shares of Acme Corp. in 2006
Output: [Jim](person) bought 300 shares of [Acme Corp.](organization) in 2006
4 entity types: person, location, organization, miscellaneous.
Extrinsic Performances (Pennington et al., 2014)
Trained with CoNLL-03 training set and 50-dimensional word vectors.
F1 score on validation set and 3 kinds of test sets:
Model Validation CoNLL-Test ACE MUC7
Discrete 91.0 85.4 77.4 73.4
CBOW 93.1 88.2 82.2 81.1
SG None None None None
GloVe 93.2 88.3 82.9 82.2
Table: From (Pennington et al., 2014)
Word Embedding + RNN
How to Add Embedded Vectors to RNN
Recall the RNN model:
Input: $x_t$
Hidden unit: $h_t = \tanh(b + U_h h_{t-1} + U_i x_t)$
Output unit: $o_t = c + U_o h_t$
Predicted probability: $p_t = \mathrm{softmax}(o_t)$
Unknown parameters: $(U_i, U_o, U_h, b, c)$
How to Add Embedded Vectors to RNN
With word embeddings:
Input: $w_{i(t)} = W x_t$
Hidden unit: $h_t = \tanh(b + U_h h_{t-1} + U_i w_{i(t)})$
Output unit: $o_t = c + U_o h_t$
Predicted probability: $p_t = \mathrm{softmax}(o_t)$
Unknown parameters: $(W, U_i, U_o, U_h, b, c)$
$W$ is not just an input; it is the initial value of the word-vector weights.
The word vectors are fine-tuned for the specific task.
Another derivative is added: for $k = 1, \ldots, V$,
$\dfrac{\partial L}{\partial w_k} = \sum_{t:\, i(t)=k} \dfrac{\partial L}{\partial o_t} \dfrac{\partial o_t}{\partial h_t} \dfrac{\partial h_t}{\partial w_k}$
Can be generalized to LSTM and GRU.
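The forward pass with embedded inputs can be sketched as follows; all sizes and the random initialization are hypothetical, and the backward pass is omitted:

```python
import numpy as np

rng = np.random.default_rng(7)
V, d, H = 12, 5, 8
W = rng.normal(scale=0.1, size=(d, V))   # word vectors (e.g. GloVe, as init)
Ui = rng.normal(scale=0.1, size=(H, d))
Uh = rng.normal(scale=0.1, size=(H, H))
Uo = rng.normal(scale=0.1, size=(V, H))
b = np.zeros(H)
c = np.zeros(V)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def forward(word_ids):
    """RNN forward pass over a sentence of word indices."""
    h = np.zeros(H)
    probs = []
    for t in word_ids:
        w_t = W[:, t]                       # embedded input w_{i(t)} = W x_t
        h = np.tanh(b + Uh @ h + Ui @ w_t)  # hidden unit
        o = c + Uo @ h                      # output unit
        probs.append(softmax(o))            # predicted next-word distribution
    return probs

probs = forward([3, 1, 7])
assert all(abs(p.sum() - 1.0) < 1e-9 for p in probs)
```

Because `W` is a parameter rather than fixed input, back-propagation through `w_t` is what fine-tunes the word vectors.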
Word-rnn (Eidnes, 2015)
Goal: generating clickbait headlines.
Trained on 2M clickbait headlines scraped from BuzzFeed, Gawker, Jezebel, Huffington Post and Upworthy.
An RNN model using GloVe word vectors ($N$ = 6B, $d$ = 200) as initial weights.
A 3-layer LSTM model with $T = 1200$.
Word-rnn (Eidnes, 2015)
The first 8 completions of "Barack Obama Says":
Barack Obama Says It's Wrong To Talk About Iraq
Barack Obama Says He's Like 'A Single Mother' And 'Over The Top'
Barack Obama Says He Did 48 Things Over
Barack Obama Says About Ohio Law
Barack Obama Says He Is Wrong
Barack Obama Says He Will Get The American Idol
Barack Obama Says Himself Are "Doing Well Around The World"
Barack Obama Says As He Leaves Politics With His Wife
More examples are on the website listed in the references.
Most of the generated sentences are grammatically correct and make sense.
Word-rnn (Eidnes, 2015)
The model seems to understand gender and political context:
"Mary J. Williams On Coming Out As A Woman"
"Romney Camp: 'I Think You Are A Bad President'"
Updating $W$ for only 2 layers works best.
Figure: From (Eidnes, 2015)
Conclusion
Summary
Embedding discrete words into $\mathbb{R}^d$ yields interesting results.
Similar words have vectors with high cosine similarity.
Linear relationships: "king" + "she" − "he" = ?
Embedded vectors can be used as inputs or initial weights of a deep neural network.
References
Key References
Goldberg, Y., & Levy, O. (2014). word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J.(2013). Distributed representations of words and phrases and theircompositionality. In Advances in neural information processingsystems (pp. 3111-3119).
Pennington, J., Socher, R., & Manning, C. (2014). Glove: Globalvectors for word representation. In Proceedings of the 2014conference on empirical methods in natural language processing(EMNLP) (pp. 1532-1543).
Rong, X. (2014). word2vec parameter learning explained. arXivpreprint arXiv:1411.2738.
Eidnes, L. (2015). Auto-Generating Clickbait With Recurrent NeuralNetworks. [online] Lars Eidnes’ blog. Available at:https://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/ [Accessed 8 May2018].