Deep Learning: A Statistical Perspective (stat.snu.ac.kr/mcp/NLP_lecture-2.pdf · 2018-07-23)
Introduction
Deep Learning: A Statistical Perspective
Myunghee Cho Paik
Guest lectures by Gisoo Kim, Yongchan Kwon, Young-geun Kim, Wonyoung Kim and Youngwon Choi
Seoul National University
March-June, 2018
Seoul National University, Deep Learning, March-June 2018
Natural Language Processing
Natural Language Processing (NLP) includes:
Sentiment analysis
Machine translation
Text generation
...
How do we train a model on language?
How can we convert language into numbers?
Word Embedding
How do we map words into $\mathbb{R}^d$?
One-hot encoding
Each vector has nothing to do with the others: $\forall u \neq v$, $\|u - v\| = \sqrt{2}$, $u^{\top}v = 0$
However...
Each word is related to the company it keeps: "Ice" is closer to "Solid" than to "Gas".
Main Questions in Word Embedding
Vocabulary set: $V = \{a, the, deep, statistics, \ldots\}$
Size-$N$ corpus: $C = (v^{(1)}, v^{(2)}, \ldots, v^{(N)})$, $v^{(1)}, v^{(2)}, \ldots, v^{(N)} \in V$
Given the corpus data, how can we measure similarity between words, e.g. sim(deep, statistics)?
How can we define $f$ and learn $w_{\text{deep}}, w_{\text{statistics}}$ such that sim(deep, statistics) $= f(w_{\text{deep}}, w_{\text{statistics}})$?
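One common concrete choice, once embeddings are learned, is cosine similarity of the word vectors. A minimal sketch; the 3-dimensional vectors below are invented purely for illustration:

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings: "deep" and "statistics" should be close,
# a function word like "the" should not be.
w_deep = np.array([0.9, 0.1, 0.3])
w_statistics = np.array([0.8, 0.2, 0.4])
w_the = np.array([-0.5, 0.9, -0.1])

assert cosine_sim(w_deep, w_statistics) > cosine_sim(w_deep, w_the)
```

Cosine similarity ignores vector length, which is convenient because frequent words tend to get longer vectors.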
Some Famous Word Embedding Techniques
Latent Semantic Analysis (LSA) (Deerwester et al. 1990)
Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996)
Word2Vec (Mikolov et al. 2013a)
GloVe (Pennington et al. 2014)
LSA (Deerwester et al. 1990)
Term-document matrix: $X_{t\times d}$
sim(a, b) ∝ co-occurrence in the same documents.
Singular value decomposition: $X_{t\times d} = T S D^{\top}$
With the $k$ largest singular values:
$\hat{X}_{t\times d} = T_{t\times k} S_{k\times k} (D_{d\times k})^{\top}$
$T_{t\times k}$: $k$-dim term vectors; $D_{d\times k}$: $k$-dim document vectors
Figure: from (Deerwester et al. 1990)
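The truncated SVD above can be sketched with NumPy on a toy term-document matrix; the counts are invented for illustration:

```python
import numpy as np

# Toy term-document matrix: 4 terms x 3 documents (made-up counts).
X = np.array([[2., 0., 1.],
              [1., 0., 1.],
              [0., 3., 0.],
              [0., 2., 1.]])

# Full SVD, then keep the k largest singular values.
T, s, Dt = np.linalg.svd(X, full_matrices=False)
k = 2
X_hat = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]   # rank-k approximation
term_vectors = T[:, :k] * s[:k]                  # k-dim term representations

# Frobenius error of the rank-k approximation is bounded by the
# largest dropped singular value.
err = np.linalg.norm(X - X_hat)
assert err <= s[2] + 1e-8
```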
HAL (Lund and Burgess, 1996)
Term-context matrix: $X_{V\times V}$
How many times does the column word appear in front of the row word?
sim(a, b) ∝ co-occurrence in nearby context.
Concatenate row and column to make a $2V$-dim vector.
Dimension reduction with $k$ principal components.
Trained on 160M terms, with $V = 70{,}000$.
Figure: from (Lund and Burgess, 1996)
sim(·, ·) ∝ Co-occurrence?
Co-occurrence with "and" or "the" does not mean semantic similarity.
Does a word just appear frequently, or is it genuinely similar?
Remedies: a transformation, or a new measure of similarity:
Entropy/correlation-based normalization (Rohde et al., 2006)
Positive pointwise mutual information (PPMI): $\max\{0, \log \frac{p(\text{context}\mid\text{term})}{p(\text{context})}\}$ (Bullinaria and Levy, 2007)
Square-root-type transformation (Lebret and Collobert, 2014)
Train $p(\text{context}\mid\text{term})$ within every local window (Word2Vec)
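The PPMI transform can be sketched on a made-up co-occurrence count matrix:

```python
import numpy as np

def ppmi(X):
    """Positive pointwise mutual information transform of a
    co-occurrence count matrix X (terms x contexts)."""
    total = X.sum()
    p_term = X.sum(axis=1, keepdims=True) / total   # p(term)
    p_ctx = X.sum(axis=0, keepdims=True) / total    # p(context)
    p_joint = X / total                             # p(term, context)
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(p_joint / (p_term * p_ctx))
    pmi[~np.isfinite(pmi)] = 0.0                    # zero counts -> 0
    return np.maximum(pmi, 0.0)                     # clip at zero (PPMI)

# Invented counts for illustration.
X = np.array([[10., 0., 2.],
              [0., 8., 1.],
              [5., 5., 5.]])
M = ppmi(X)
assert (M >= 0).all()
```

Note that $\log \frac{p(\text{context}\mid\text{term})}{p(\text{context})} = \log \frac{p(\text{term}, \text{context})}{p(\text{term})\,p(\text{context})}$, which is what the function computes.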
Word2Vec
Model Setup (Mikolov et al., 2013)
Vocabulary set: $V = \{e_1, e_2, \ldots, e_V\} \subset \{0,1\}^V$
Size-$N$ corpus: $C = (v^{(1)}, v^{(2)}, \ldots, v^{(N)})$, $v^{(1)}, v^{(2)}, \ldots, v^{(N)} \in V$
Embedded word vectors:
$W_{d\times V} = \begin{bmatrix} w_1 & w_2 & \cdots & w_V \end{bmatrix}, \qquad W'_{d\times V} = \begin{bmatrix} w'_1 & w'_2 & \cdots & w'_V \end{bmatrix}$
Model Setup (Mikolov et al., 2013)
Thus, the model becomes:
$P(v^{(\text{output})} \mid v^{(\text{input})}) = \dfrac{\exp(w'_{\text{output}} \cdot w_{\text{input}})}{\sum_{j=1}^{V} \exp(w'_j \cdot w_{\text{input}})}$
$W$ and $W'$ are called the input and output representations.
Note that $W \neq W'$. If $W = W'$, $P(\cdot\mid\cdot)$ would be maximized when the context word equals the input word, which is a rare event.
If the output (or context) word appears in the window, $w'_{\text{output}} \cdot w_{\text{input}}$ increases.
Training the Model (Rong, 2014)
Initialize $W$ → Read a (context, input) pair → Update $W'$ → Update $W$ → Read another (context, input) pair → · · ·
Initialization: $W_{ij} \sim U[-0.5, 0.5]$, $\forall i, j$
Suppose $v^{(\text{output})} = e_o$ appeared in the context of $v^{(\text{input})} = e_i$.
Update $W'$ by minimizing the negative log-likelihood:
$L \equiv -\log P(e_o \mid e_i) = \log\Big(\sum_{j=1}^{V} \exp(u_{ij})\Big) - u_{io}$
where $u_{ij} = w'_j \cdot w_i$, $j = 1, \ldots, V$
Training the Model (Rong, 2014)
Taking derivatives:
$\dfrac{\partial L}{\partial u_{ik}} = \dfrac{\exp(u_{ik})}{\sum_{j=1}^{V} \exp(u_{ij})} - \delta_{(k=o)}, \qquad \dfrac{\partial u_{ik}}{\partial w'_k} = w_i$
$\dfrac{\partial L}{\partial w'_k} = [P(e_k \mid e_i) - \delta_{(k=o)}]\, w_i, \quad \forall k = 1, \ldots, V$
With gradient descent, the updating equation is:
$w'^{(\text{new})}_k = w'^{(\text{old})}_k - \alpha [P(e_k \mid e_i) - \delta_{(k=o)}]\, w_i, \quad \forall k = 1, \ldots, V$
If $k = o$, $[P(e_k \mid e_i) - \delta_{(k=o)}] < 0$: the output word is underestimated, so the update adds a $w_i$-direction component to $w'_k$. In summary, the update increases $u_{io}$ and decreases $u_{ik}$ for all $k \neq o$.
Training the Model (Rong, 2014)
Given $W'$, update $W$.
Reminder: $v^{(\text{output})} = e_o$ appeared in the context of $v^{(\text{input})} = e_i$.
Taking derivatives w.r.t. $w_i$:
$\dfrac{\partial L}{\partial u_{ik}} = \dfrac{\exp(u_{ik})}{\sum_{j=1}^{V} \exp(u_{ij})} - \delta_{(k=o)}, \qquad \dfrac{\partial u_{ik}}{\partial w_i} = w'_k$
$\dfrac{\partial L}{\partial w_i} = \sum_{j=1}^{V} \dfrac{\partial L}{\partial u_{ij}} \dfrac{\partial u_{ij}}{\partial w_i} = \sum_{j=1}^{V} [P(e_j \mid e_i) - \delta_{(j=o)}]\, w'_j$
Define $EH = \sum_{j=1}^{V} [P(e_j \mid e_i) - \delta_{(j=o)}]\, w'_j$: the sum of output vectors, weighted by their prediction errors.
With gradient descent, the updating equation is: $w_i^{(\text{new})} = w_i^{(\text{old})} - \alpha\, EH$
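Both updating equations for one (input, output) pair can be sketched in NumPy. The vocabulary size, learning rate, and the repeated training pair below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, alpha = 8, 4, 0.1                  # toy vocabulary size, dim, learning rate
W = rng.uniform(-0.5, 0.5, (d, V))       # input representation
Wp = rng.uniform(-0.5, 0.5, (d, V))      # output representation W'

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def update_pair(i, o):
    """One gradient step for (input e_i, output e_o); returns -log P(e_o|e_i)."""
    w_i = W[:, i].copy()
    p = softmax(Wp.T @ w_i)              # P(e_j | e_i), scores u_ij = w'_j . w_i
    err = p.copy()
    err[o] -= 1.0                        # P(e_j | e_i) - delta_(j=o)
    EH = Wp @ err                        # sum_j err_j * w'_j
    Wp[:, :] -= alpha * np.outer(w_i, err)   # update every w'_k
    W[:, i] = w_i - alpha * EH               # update w_i
    return -np.log(p[o])

losses = [update_pair(2, 5) for _ in range(50)]
assert losses[-1] < losses[0]            # repeated updates reduce the loss
```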
CBOW and Skip-gram (Mikolov et al., 2013)
Figure: CBOW Model
Figure: Skip-gram Model
Training CBOW (Rong, 2014)
Input: $v^{(t-c)}, \ldots, v^{(t-1)}, v^{(t+1)}, \ldots, v^{(t+c)}$. Output: $v^{(t)}$.
Suppose $v^{(t-c)} = e_{t(1)}, \ldots, v^{(t+c)} = e_{t(2c)}$, and $v^{(t)} = e_o$.
Define:
$h_t \equiv \dfrac{1}{2c} \sum_{j=-c,\, j\neq 0}^{c} W v^{(t+j)} = \dfrac{1}{2c} \sum_{k=1}^{2c} w_{t(k)}$
Then the model becomes:
$P(e_o \mid e_{t(1)}, \ldots, e_{t(2c)}) = \dfrac{\exp(w'_o \cdot h_t)}{\sum_{j=1}^{V} \exp(w'_j \cdot h_t)}$
The loss is defined by the negative log-likelihood:
$L \equiv \log \sum_{j=1}^{V} \exp(w'_j \cdot h_t) - w'_o \cdot h_t$
where $u_{tj} = w'_j \cdot h_t$, $\forall j = 1, \ldots, V$.
Training CBOW (Rong, 2014)
With a similar calculation, the updating equation for $W'$ becomes:
$w'^{(\text{new})}_k = w'^{(\text{old})}_k - \alpha [P(e_k \mid v^{(t-c)}, \ldots, v^{(t+c)}) - \delta_{(k=o)}]\, h_t$
For $W$, note that $u_{tj} = w'_j \cdot h_t = \dfrac{1}{2c} \sum_{k=1}^{2c} w'_j \cdot w_{t(k)}$.
For back-propagation:
$\dfrac{\partial L}{\partial w_{t(k)}} = \sum_{j=1}^{V} \dfrac{\partial L}{\partial u_{tj}} \dfrac{\partial u_{tj}}{\partial w_{t(k)}} = \dfrac{1}{2c} \sum_{j=1}^{V} [P(e_j \mid v^{(t-c)}, \ldots, v^{(t+c)}) - \delta_{(j=o)}]\, w'_j = \dfrac{1}{2c} EH$
Thus the updating equation becomes:
$w_{t(k)}^{(\text{new})} = w_{t(k)}^{(\text{old})} - \alpha \dfrac{1}{2c} EH, \quad \forall k = 1, \ldots, 2c$
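The averaged hidden vector $h_t$ that CBOW uses can be sketched as follows; the sizes and context indices are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, c = 10, 4, 2
W = rng.uniform(-0.5, 0.5, (d, V))   # input representation

context = [3, 7, 1, 4]               # hypothetical indices t(1), ..., t(2c)
h_t = W[:, context].mean(axis=1)     # h_t = (1/2c) * sum_k w_{t(k)}

# Averaging is why each context word's update is the shared EH scaled by 1/2c.
assert np.allclose(h_t, sum(W[:, k] for k in context) / (2 * c))
```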
Training Skip-gram (Rong, 2014)
Input: $v^{(t)}$. Output: $v^{(t-c)}, \ldots, v^{(t-1)}, v^{(t+1)}, \ldots, v^{(t+c)}$.
Suppose $v^{(t)} = e_i$, and $v^{(t-c)} = e_{t(1)}, \ldots, v^{(t+c)} = e_{t(2c)}$. Then the model becomes:
$P(v^{(t-c)}, \ldots, v^{(t-1)}, v^{(t+1)}, \ldots, v^{(t+c)} \mid v^{(t)}) = \prod_{k=1}^{2c} P(e_{t(k)} \mid e_i) = \prod_{k=1}^{2c} \dfrac{\exp(w'_{t(k)} \cdot w_i)}{\sum_{j=1}^{V} \exp(w'_j \cdot w_i)}$
The loss becomes:
$L \equiv \sum_{k=1}^{2c} L_k = \sum_{k=1}^{2c} \Big[\log \sum_{j=1}^{V} \exp(u^{(k)}_{ij}) - u^{(k)}_{i\,t(k)}\Big]$
where $u^{(k)}_{ij} = w'_j \cdot w_i$: the score for the $k$-th loss only.
Training Skip-gram (Rong, 2014)
For $W'$, $j = 1, \ldots, V$:
$\dfrac{\partial L}{\partial w'_j} = \sum_{k=1}^{2c} \dfrac{\partial L_k}{\partial u^{(k)}_{ij}} \dfrac{\partial u^{(k)}_{ij}}{\partial w'_j} = \sum_{k=1}^{2c} [P(e_j \mid e_i) - \delta_{(j=t(k))}]\, w_i$
Thus, the updating equation for $W'$ becomes:
$w'^{(\text{new})}_j = w'^{(\text{old})}_j - \alpha \sum_{k=1}^{2c} [P(e_j \mid e_i) - \delta_{(j=t(k))}]\, w_i, \quad \forall j = 1, \ldots, V$
Training Skip-gram (Rong, 2014)
For $W$:
$\dfrac{\partial L}{\partial w_i} = \sum_{k=1}^{2c} \sum_{j=1}^{V} \dfrac{\partial L_k}{\partial u^{(k)}_{ij}} \dfrac{\partial u^{(k)}_{ij}}{\partial w_i} = \sum_{k=1}^{2c} \sum_{j=1}^{V} [P(e_j \mid e_i) - \delta_{(j=t(k))}]\, w'_j \equiv \sum_{k=1}^{2c} EH^{(k)}$
Thus, the updating equation for $W$ becomes:
$w_i^{(\text{new})} = w_i^{(\text{old})} - \alpha \sum_{k=1}^{2c} EH^{(k)}$
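The skip-gram loss over one window can be sketched as follows; the sizes and indices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 10, 4
W = rng.uniform(-0.5, 0.5, (d, V))   # input representation
Wp = rng.uniform(-0.5, 0.5, (d, V))  # output representation W'

def skipgram_loss(i, context):
    """L = sum_k [log sum_j exp(u_ij) - u_{i t(k)}]; the scores are
    shared across the 2c output positions."""
    u = Wp.T @ W[:, i]                               # u_ij = w'_j . w_i
    log_norm = np.log(np.exp(u - u.max()).sum()) + u.max()  # stable log-sum-exp
    return float(sum(log_norm - u[t] for t in context))

L = skipgram_loss(0, [2, 5, 7, 9])                   # 2c = 4 context positions
assert L > 0.0
```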
Computational Problem
For each (input, output) pair in the corpus $C = (v^{(1)}, v^{(2)}, \ldots, v^{(N)})$, the model must calculate:
$P(e_o \mid e_i) = \dfrac{\exp(w'_o \cdot w_i)}{\sum_{j=1}^{V} \exp(w'_j \cdot w_i)}$
Each epoch requires almost $N \times V$ inner products of $d$-dim vectors (skip-gram: $N \times V \times 2c$).
The computation is proportional to $V \approx 10^6$.
(Mikolov et al., 2013) suggest two alternative formulations: hierarchical softmax and negative sampling.
Hierarchical softmax (Mikolov et al., 2013)
An efficient way of computing the softmax.
Build a Huffman binary tree using word frequencies.
Instead of $w'_j$, the model uses $w'_{n(e_j, l)}$.
$n(e_j, l)$: the $l$-th node on the path from the root to the word $e_j$.
Figure: Binary tree for HS
Let $h_i$ be the hidden node. Then the probability model becomes:
$P(e_o \mid e_i) = \prod_{l=1}^{L(e_o)-1} \sigma\big([n(e_o, l+1) \text{ is the left child of } n(e_o, l)] \times w'_{n(e_o,l)} \cdot h_i\big)$
where $[\cdot]$ is $+1$ if its argument is true and $-1$ otherwise.
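The path product above can be sketched numerically; the path vectors, signs, and hidden vector below are random stand-ins for a real Huffman tree:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
d = 4
path_vecs = rng.normal(size=(3, d))   # w'_{n(e_o,l)} for L(e_o) - 1 = 3 inner nodes
signs = np.array([+1.0, -1.0, +1.0])  # +1 if the next node is the left child
h_i = rng.normal(size=d)

# P(e_o | e_i) = prod_l sigma(sign_l * w'_{n(e_o,l)} . h_i)
p = float(np.prod(sigmoid(signs * (path_vecs @ h_i))))
assert 0.0 < p < 1.0
```

Because $\sigma(x) + \sigma(-x) = 1$ at every inner node, the probabilities over all leaves sum to 1, so no $V$-term normalization is needed.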
Training Hierarchical Softmax (Rong, 2014)
Let $L = -\log P(e_o \mid e_i)$, and write $w'_{n(e_o,l)} = w'_l$. Then:
$\dfrac{\partial L}{\partial (w'_l \cdot h_i)} = \{\sigma([\cdot]\, w'_l \cdot h_i) - 1\}[\cdot] = \begin{cases} \sigma(w'_l \cdot h_i) - 1 & [\cdot] = 1 \\ \sigma(w'_l \cdot h_i) & [\cdot] = -1 \end{cases}$
$\sigma(w'_l \cdot h_i)$ is the probability that $n(e_o, l+1)$ is the left child of $n(e_o, l)$. Thus,
$\dfrac{\partial L}{\partial (w'_l \cdot h_i)} = P[n(e_o, l+1) \text{ is the left child of } n(e_o, l)] - \delta_{[\cdot]}$
Thus the updating equation becomes, for $l = 1, \ldots, L(e_o)-1$:
$w'^{(\text{new})}_l = w'^{(\text{old})}_l - \alpha\big(P[n(e_o, l+1) \text{ is the left child of } n(e_o, l)] - \delta_{[\cdot]}\big) h_i$
For the skip-gram model, repeat this procedure for the $2c$ outputs.
The updating equation for $W$ becomes:
$w_i^{(\text{new})} = w_i^{(\text{old})} - \alpha\, EH\, \dfrac{\partial h_i}{\partial w_i}, \qquad \text{where } EH = \sum_{l=1}^{L(e_o)-1} \big(P[\cdot] - \delta_{[\cdot]}\big) w'_l$
Negative Sampling (Mikolov et al., 2013)
Generate $e_{n(1)}, \ldots, e_{n(k)}$ from a noise distribution $P_n$.
The goal is to discriminate $(h_i, e_o)$ from $(h_i, e_{n(1)}), \ldots, (h_i, e_{n(k)})$.
For the skip-gram model, repeat this procedure for each of the $2c$ outputs.
$k = 5$–$20$ is useful for small datasets; for large datasets, $k$ can be as small as $2$–$5$.
The noise distribution $P_n(e_n) \propto \left[\frac{\#(e_n)}{N}\right]^{3/4}$ significantly outperformed the alternatives.
Figure: 5-Negative Sampling
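Sampling from the 3/4-power noise distribution can be sketched as follows; the word counts are made up:

```python
import numpy as np

# Hypothetical word counts #(e_n) in a corpus.
counts = np.array([100, 50, 10, 5, 1], dtype=float)

# P_n(e_n) proportional to count^{3/4}: flattens the raw unigram distribution.
p_noise = counts ** 0.75
p_noise /= p_noise.sum()

rng = np.random.default_rng(4)
negatives = rng.choice(len(counts), size=5, p=p_noise)  # k = 5 noise samples

assert abs(p_noise.sum() - 1.0) < 1e-12
# Rare words are up-weighted relative to their raw frequency.
assert p_noise[-1] / p_noise[0] > counts[-1] / counts[0]
```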
Objective in Negative Sampling (Goldberg and Levy, 2014)
Suppose $(h_i, e_o)$ and $(h_i, e_{n(1)}), \ldots, (h_i, e_{n(k)})$, with $o \neq n(j)$, $\forall j = 1, \ldots, k$, are given.
Let $[D = 1 \mid h_i, e_j]$ be the event that the pair $(h_i, e_j)$ came from the original corpus.
The model assumes $P(D = 1 \mid h_i, e_j) = \sigma(w'_j \cdot h_i)$. Thus the likelihood becomes:
$\sigma(w'_o \cdot h_i) \times \prod_{j=1}^{k} \big[1 - \sigma(w'_{n(j)} \cdot h_i)\big]$
Taking the log leads to the objective in (Mikolov et al., 2013):
$\log \sigma(w'_o \cdot h_i) + \sum_{j=1}^{k} \log \sigma(-w'_{n(j)} \cdot h_i), \qquad e_{n(j)} \sim P_n$
Note that training $h_i$ given $w'_o, w'_{n(1)}, \ldots, w'_{n(k)}$ is a logistic regression.
Training Negative Sampling (Rong, 2014)
Define the loss as:
$L = -\log \sigma(w'_o \cdot h_i) - \sum_{j=1}^{k} \log \sigma(-w'_{n(j)} \cdot h_i)$
Let $W_{\text{neg}} = \{w'_{n(1)}, \ldots, w'_{n(k)}\}$. Then the derivative:
$\dfrac{\partial L}{\partial (w'_j \cdot h_i)} = \begin{cases} \sigma(w'_j \cdot h_i) - 1 & w'_j = w'_o \\ \sigma(w'_j \cdot h_i) & w'_j \in W_{\text{neg}} \end{cases} = P(D = 1 \mid h_i, e_j) - \delta_{(j=o)}$
Thus the updating equation for $W'$: for $j = o, n(1), \ldots, n(k)$,
$w'^{(\text{new})}_j = w'^{(\text{old})}_j - \alpha [P(D = 1 \mid h_i, e_j) - \delta_{(j=o)}]\, h_i$
Let $\dfrac{\partial L}{\partial h_i} = \sum_{j \in \{o, n(1), \ldots, n(k)\}} \big(P(D = 1 \mid h_i, e_j) - \delta_{(j=o)}\big) w'_j \equiv EH$. Then the updating equation for $W$:
$w_i^{(\text{new})} = w_i^{(\text{old})} - \alpha\, EH\, \dfrac{\partial h_i}{\partial w_i}$
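One negative-sampling gradient step can be sketched in NumPy. The dimensions, learning rate, and random vectors are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
d, k, alpha = 4, 3, 0.05
h = rng.normal(size=d)                   # hidden vector h_i
w_pos = rng.normal(size=d)               # w'_o
W_neg = rng.normal(size=(k, d))          # w'_{n(1)}, ..., w'_{n(k)}

def ns_loss():
    return float(-np.log(sigmoid(w_pos @ h)) - np.log(sigmoid(-W_neg @ h)).sum())

before = ns_loss()
g_pos = sigmoid(w_pos @ h) - 1.0         # P(D=1|h, e_o) - 1
g_neg = sigmoid(W_neg @ h)               # P(D=1|h, e_n(j)) - 0
EH = g_pos * w_pos + g_neg @ W_neg       # gradient w.r.t. h (old weights)
w_pos = w_pos - alpha * g_pos * h        # update the true output vector
W_neg = W_neg - alpha * g_neg[:, None] * h[None, :]  # update the k noise vectors
h = h - alpha * EH                       # update the hidden vector
assert ns_loss() < before                # one small step reduces the loss
```

The cost per step is $O((k+1)d)$ instead of $O(Vd)$, which is the whole point of the method.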
Two Pre-processing Techniques (Mikolov et al., 2013)
Frequent words (such as "a", "the", "in") provide less information value than rare words.
Let $V = \{v_1, \ldots, v_V\}$ be the vocabulary set. Discard each word $v_i$ with probability:
$P(v_i) = 1 - \sqrt{\dfrac{t}{\#(v_i)/N}}$
where $t = 10^{-5}$ is a proper threshold value.
"New York Times" or "Toronto Maple Leafs" can be considered as one word.
In order to find such phrases, define a score:
$\text{score}(v_i, v_j) = \dfrac{\#(v_i v_j) - \delta}{\#(v_i)\, \#(v_j)}$
Over 2–4 passes of the training set, calculate the score with decreasing $\delta$. Above some threshold value, treat $v_i v_j$ as a single word.
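The subsampling probability can be sketched directly; the corpus size and word counts are hypothetical:

```python
import numpy as np

t = 1e-5                 # threshold from the slide
N = 1_000_000            # hypothetical corpus size

def discard_prob(count):
    """P(discard v_i) = 1 - sqrt(t / (#(v_i)/N)), clipped at 0."""
    return max(0.0, 1.0 - np.sqrt(t / (count / N)))

# A very frequent word is usually discarded; a rare one is always kept
# (the formula goes to 0 once the frequency drops below t).
assert discard_prob(50_000) > 0.9
assert discard_prob(10) == 0.0
```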
GloVe
Motivation (Pennington et al., 2014)
Let $V = \{v_1, \ldots, v_V\}$ be the vocabulary set.
Throughout the corpus $C$, define some statistics:
$X_{ij}$: #(word $v_j$ appears in the context of word $v_i$)
$X_i \equiv \sum_k X_{ik}$: #(any word appears in the context of $v_i$)
$P_{ij} = X_{ij}/X_i$: probability that $v_j$ appears in the context of $v_i$
How can we measure similarity between words, $\text{sim}(v_i, v_j)$?
Motivation (Pennington et al., 2014)
Co-occurrence probabilities for "ice" and "steam" with selected context words from a corpus (N = 6 billion).
If $v_k$ is related to $v_i$ rather than $v_j$, then $P_{ik}/P_{jk}$ will be larger than 1.
If $v_k$ is related (or unrelated) to both $v_i$ and $v_j$, then $P_{ik}/P_{jk}$ will be close to 1.
The ratio $P_{ik}/P_{jk}$ is useful for finding out whether $v_k$ is close to $v_i$ (or $v_j$).
Figure: from (Pennington et al., 2014)
Model Setup (Pennington et al., 2014)
With this motivation, the model becomes:
$\dfrac{P_{ik}}{P_{jk}} = F(w_i, w_j, w'_k), \qquad w_i, w_j, w'_k \in \mathbb{R}^d$
Keeping two sets of parameters $W, W'$ can help reduce overfitting and noise, and generally improves results (Ciresan et al., 2012).
In vector space, knowing $w_1, \ldots, w_V$ is the same as knowing $w_1 - w_i, \ldots, w_V - w_i$. Thus $F$ can be restricted to:
$\dfrac{P_{ik}}{P_{jk}} = F(w_i - w_j, w'_k)$
In order to match the dimensions and preserve the linear structure, use dot products:
$\dfrac{P_{ik}}{P_{jk}} = F\big[(w_i - w_j) \cdot w'_k\big]$
Model Setup (Pennington et al., 2014)
For any $i, j, k, l = 1, \ldots, V$:
$F\big[(w_i - w_j) \cdot w'_k\big]\, F\big[(w_j - w_l) \cdot w'_k\big] = \dfrac{P_{ik}}{P_{lk}} = F\big[(w_i - w_l) \cdot w'_k\big]$
It is natural to define $F$ satisfying $F(x)F(y) = F(x+y)$. This implies $F = \exp(\cdot)$.
Moreover:
$F\big[(w_i - w_j) \cdot w'_k\big] = \dfrac{\exp(w_i \cdot w'_k)}{\exp(w_j \cdot w'_k)} = \dfrac{P_{ik}}{P_{jk}}$
Thus, $w_i \cdot w'_k = \log P_{ik} = \log X_{ik} - \log X_i$.
Since the roles of a word and a context are exchangeable, we want $w_i \cdot w'_k = w_k \cdot w'_i$.
Model Setup (Pennington et al., 2014)
Absorb $\log X_i$ into a bias of the input representation, $b_i$, and add another bias $b'_k$.
Finally, the model becomes:
$w_i \cdot w'_k + b_i + b'_k = \log X_{ik}$
Now, define a weighted cost function:
$L = \sum_{i,j=1}^{V} f(X_{ij})\big(w_i \cdot w'_j + b_i + b'_j - \log X_{ij}\big)^2$
The weight $f$ must satisfy:
$f(0) = 0$: to avoid the case $X_{ij} = 0$.
$f$ must be non-decreasing: frequent co-occurrences should be emphasized.
$f$ should be relatively small for large values: the case of "in", "the", "and".
Training GloVe (Pennington et al., 2014)
$f$ is suggested as:
$f(x) = \begin{cases} (x/x_{\max})^{\alpha} & x < x_{\max} \\ 1 & x \geq x_{\max} \end{cases}$
$x_{\max}$ is reported to have a weak impact on performance (fixed at $x_{\max} = 100$).
$\alpha = 3/4$ gives a modest improvement over $\alpha = 1$.
Training is done with AdaGrad (Duchi et al., 2011), stochastically sampling non-zero elements of $X$.
The model produces $W$ and $W'$; the final word vectors are $W + W'$.
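The weighting function and the weighted cost can be sketched on a toy co-occurrence matrix; the counts and initialization below are invented, and the AdaGrad loop itself is omitted:

```python
import numpy as np

x_max, alpha_w = 100.0, 0.75

def f(x):
    """GloVe weighting: (x / x_max)^alpha below x_max, 1 above."""
    return np.where(x < x_max, (x / x_max) ** alpha_w, 1.0)

rng = np.random.default_rng(6)
V, d = 6, 3
X = rng.integers(1, 200, size=(V, V)).astype(float)  # toy co-occurrence counts
W = rng.normal(scale=0.1, size=(V, d))               # input vectors
Wp = rng.normal(scale=0.1, size=(V, d))              # output vectors W'
b = np.zeros(V)
bp = np.zeros(V)

def loss():
    """L = sum_ij f(X_ij) (w_i . w'_j + b_i + b'_j - log X_ij)^2."""
    resid = W @ Wp.T + b[:, None] + bp[None, :] - np.log(X)
    return float((f(X) * resid ** 2).sum())

assert f(0.0) == 0.0 and f(200.0) == 1.0
assert loss() > 0.0
```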
Toy Implementation
Data and Model Descriptions
Movie review data from the NLTK corpus.
Consists of plot summaries and critiques.
Corpus size $N = 1.5$ million, vocabulary size $V = 39{,}768$.
Embedding dimension: $d = 100$; window size: $c = 5$.
Negative sample size: $k = 5$.
GloVe trained for 10 epochs.
Time elapsed for training (Intel Core i7 CPU @ 3.60GHz):

Model | CBOW+HS | CBOW+NEG | SG+HS | SG+NEG | GloVe
Time  | 9.14s   | 4.53s    | 12.4s | 12.3s  | 44.2s
Results
Similarity between two vectors
Similarity between two vectors (most frequent words)
Top 5 similar words with “villian”
Linear relationship: (“actor”+”she”-”actress”=?)
Linear relationship: (“king”+”she”-”he”=?)
Performances
Intrinsic Performances (Pennington et al., 2014)
Word analogies task: 19,544 questions
Semantic: "Athens" is to "Greece" as "Berlin" is to ( ? )
Syntactic: "dance" is to "dancing" as "fly" is to ( ? )
Corpus: Gigaword5 + Wikipedia2014
Percentage of correct answers:
Model d N Sem. Syn. Tot.
CBOW 300 6B 63.6 67.4 65.7
SG 300 6B 73.0 66.0 69.1
GloVe 300 6B 77.4 67.0 71.7
Table: From (Pennington et al., 2014)
Extrinsic Performances (Pennington et al., 2014)
Named entity recognition (NER) with Conditional Random Field(CRF) model
Input: Jim bought 300 shares of Acme Corp. in 2006
Output: [Jim](person) bought 300 shares of [Acme Corp.](organization) in 2006
4 entity types: person, location, organization, miscellaneous.
Extrinsic Performances (Pennington et al., 2014)
Trained with CoNLL-03 training set and 50-dimensional word vectors.
F1 score on validation set and 3 kinds of test sets:
Model Validation CoNLL-Test ACE MUC7
Discrete 91.0 85.4 77.4 73.4
CBOW 93.1 88.2 82.2 81.1
SG None None None None
GloVe 93.2 88.3 82.9 82.2
Table: From (Pennington et al., 2014)
Word Embedding + RNN
How to Add Embedded Vectors to RNN
Recall the RNN model:
Input: $x_t$
Hidden unit: $h_t = \tanh(b + U_h h_{t-1} + U_i x_t)$
Output unit: $o_t = c + U_o h_t$
Predicted probability: $p_t = \mathrm{softmax}(o_t)$
Unknown parameters: $(U_i, U_o, U_h, b, c)$
How to Add Embedded Vectors to RNN
With word embeddings:
Input: $w_{i(t)} = W x_t$
Hidden unit: $h_t = \tanh(b + U_h h_{t-1} + U_i w_{i(t)})$
Output unit: $o_t = c + U_o h_t$
Predicted probability: $p_t = \mathrm{softmax}(o_t)$
Unknown parameters: $(W, U_i, U_o, U_h, b, c)$
$W$ is not just an input; it is the initial value of the word-vector weights.
The word vectors are fine-tuned for the specific task.
Another derivative is added: for $k = 1, \ldots, V$,
$\dfrac{\partial L}{\partial w_k} = \sum_{t:\, i(t)=k} \dfrac{\partial L}{\partial o_t} \dfrac{\partial o_t}{\partial h_t} \dfrac{\partial h_t}{\partial w_k}$
Can be generalized to LSTM and GRU.
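The forward pass with embedded inputs can be sketched as follows; all sizes and the random initialization are hypothetical, and the backward pass is omitted:

```python
import numpy as np

rng = np.random.default_rng(7)
V, d, H = 12, 5, 8
W = rng.normal(scale=0.1, size=(d, V))   # word vectors (e.g. GloVe, as init)
Ui = rng.normal(scale=0.1, size=(H, d))
Uh = rng.normal(scale=0.1, size=(H, H))
Uo = rng.normal(scale=0.1, size=(V, H))
b = np.zeros(H)
c = np.zeros(V)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def forward(word_ids):
    """RNN forward pass over a sentence of word indices."""
    h = np.zeros(H)
    probs = []
    for t in word_ids:
        w_t = W[:, t]                       # embedded input w_{i(t)} = W x_t
        h = np.tanh(b + Uh @ h + Ui @ w_t)  # hidden unit
        o = c + Uo @ h                      # output unit
        probs.append(softmax(o))            # predicted next-word distribution
    return probs

probs = forward([3, 1, 7])
assert all(abs(p.sum() - 1.0) < 1e-9 for p in probs)
```

Because `W` is a parameter rather than fixed input, back-propagation through `w_t` is what fine-tunes the word vectors.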
Word-rnn (Eidnes, 2015)
Goal: generating clickbait headlines.
Trained on 2M clickbait headlines scraped from BuzzFeed, Gawker, Jezebel, Huffington Post and Upworthy.
An RNN model using GloVe word vectors ($N$ = 6B, $d$ = 200) as initial weights.
A 3-layer LSTM model with $T = 1200$.
Word-rnn (Eidnes, 2015)
The first 8 completions of "Barack Obama Says":
Barack Obama Says It's Wrong To Talk About Iraq
Barack Obama Says He's Like 'A Single Mother' And 'Over The Top'
Barack Obama Says He Did 48 Things Over
Barack Obama Says About Ohio Law
Barack Obama Says He Is Wrong
Barack Obama Says He Will Get The American Idol
Barack Obama Says Himself Are "Doing Well Around The World"
Barack Obama Says As He Leaves Politics With His Wife
More examples are on the website listed in the references.
Most of the generated sentences are grammatically correct and make sense.
Word-rnn (Eidnes, 2015)
The model seems to understand gender and political context:
"Mary J. Williams On Coming Out As A Woman"
"Romney Camp: 'I Think You Are A Bad President'"
Updating $W$ for only 2 layers works best.
Figure: From (Eidnes, 2015)
Conclusion
Summary
Embedding discrete words into $\mathbb{R}^d$ yields interesting results.
Similar words have vectors with high cosine similarity.
Linear relationships: "king" + "she" − "he" = ?
Embedded vectors can be used as inputs or initial weights of a deep neural network.
References
Key References
Goldberg, Y., & Levy, O. (2014). word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J.(2013). Distributed representations of words and phrases and theircompositionality. In Advances in neural information processingsystems (pp. 3111-3119).
Pennington, J., Socher, R., & Manning, C. (2014). Glove: Globalvectors for word representation. In Proceedings of the 2014conference on empirical methods in natural language processing(EMNLP) (pp. 1532-1543).
Rong, X. (2014). word2vec parameter learning explained. arXivpreprint arXiv:1411.2738.
Eidnes, L. (2015). Auto-Generating Clickbait With Recurrent NeuralNetworks. [online] Lars Eidnes’ blog. Available at:https://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/ [Accessed 8 May2018].