Improving Distributional Similarity with Lessons Learned from Word Embeddings

Omer Levy, Yoav Goldberg, Ido Dagan
Bar-Ilan University, Israel
Word Similarity & Relatedness
• How similar is pizza to pasta?
• How related is pizza to Italy?
• Representing words as vectors allows easy computation of similarity
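Similarity between word vectors is typically measured with cosine. A minimal sketch, assuming hypothetical toy vectors (the values and the 4-dimensional space are made up for illustration; real embeddings use hundreds of dimensions):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: the dot product of the L2-normalized vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors, for illustration only
pizza = np.array([0.9, 0.1, 0.4, 0.0])
pasta = np.array([0.8, 0.2, 0.5, 0.1])
italy = np.array([0.4, 0.7, 0.1, 0.3])

print(cosine(pizza, pasta))  # high: similar words
print(cosine(pizza, italy))  # lower: related, but less similar
```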
Approaches for Representing Words
Distributional Semantics (Count)
• Used since the 90’s
• Sparse word-context PMI/PPMI matrix
• Decomposed with SVD

Word Embeddings (Predict)
• Inspired by deep learning
• word2vec (Mikolov et al., 2013)
• GloVe (Pennington et al., 2014)

Underlying Theory: The Distributional Hypothesis (Harris, ’54; Firth, ’57)
“Similar words occur in similar contexts”
Approaches for Representing Words
Both approaches:
• Rely on the same linguistic theory
• Use the same data
• Are mathematically related
  • “Neural Word Embedding as Implicit Matrix Factorization” (NIPS 2014)

• How come word embeddings are so much better?
  • “Don’t Count, Predict!” (Baroni et al., ACL 2014)
• More than meets the eye…
What’s really improving performance?
The Contributions of Word Embeddings

Novel Algorithms (objective + training method)
• Skip-Grams + Negative Sampling
• CBOW + Hierarchical Softmax
• Noise Contrastive Estimation
• GloVe
• …

New Hyperparameters (preprocessing, smoothing, etc.)
• Subsampling
• Dynamic Context Windows
• Context Distribution Smoothing
• Adding Context Vectors
• …
Our Contributions
1) Identifying the existence of new hyperparameters
   • Not always mentioned in papers

2) Adapting the hyperparameters across algorithms
   • Must understand the mathematical relation between algorithms

3) Comparing algorithms across all hyperparameter settings
   • Over 5,000 experiments
Background
What is word2vec?
How is it related to PMI?
What is word2vec?
• word2vec is not a single algorithm
• It is a software package for representing words as vectors, containing:
  • Two distinct models: CBOW, Skip-Gram (SG)
  • Various training methods: Negative Sampling (NS), Hierarchical Softmax
  • A rich preprocessing pipeline: Dynamic Context Windows, Subsampling, Deleting Rare Words
Skip-Grams with Negative Sampling (SGNS)
“word2vec Explained…” (Goldberg & Levy, arXiv 2014)

Marco saw a furry little wampimuk hiding in the tree.

(data)
words      contexts
wampimuk   furry
wampimuk   little
wampimuk   hiding
wampimuk   in
…          …
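A sketch of how such (word, context) pairs can be extracted, assuming a fixed symmetric window of 2 (word2vec’s real pipeline also applies dynamic windows and subsampling, covered later):

```python
sentence = "Marco saw a furry little wampimuk hiding in the tree".split()
window = 2

pairs = []
for i, word in enumerate(sentence):
    # Every word within `window` positions of `word` is a context
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((word, sentence[j]))

print([c for w, c in pairs if w == "wampimuk"])
# ['furry', 'little', 'hiding', 'in']
```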
Skip-Grams with Negative Sampling (SGNS)
• SGNS finds a vector $\vec{w}$ for each word $w$ in our vocabulary $V_W$
• Each such vector has $d$ latent dimensions
• Effectively, it learns a matrix $W$ whose rows represent $V_W$
• Key point: it also learns a similar auxiliary matrix $C$ of context vectors
• In fact, each word has two embeddings
[Figure: $W$ is a $V_W \times d$ matrix of word vectors; $C$ is a $V_C \times d$ matrix of context vectors; $\vec{w}_{wampimuk} \neq \vec{c}_{wampimuk}$]
Skip-Grams with Negative Sampling (SGNS)
• Maximize: $\sigma(\vec{w} \cdot \vec{c})$, where $c$ was observed with $w$

words      contexts
wampimuk   furry
wampimuk   little
wampimuk   hiding
wampimuk   in
• Minimize: $\sigma(\vec{w} \cdot \vec{c}\,')$, where $c'$ was hallucinated with $w$

words      contexts
wampimuk   Australia
wampimuk   cyber
wampimuk   the
wampimuk   1985
Skip-Grams with Negative Sampling (SGNS)
• “Negative Sampling”
• SGNS samples contexts $c'$ at random as negative examples
• “Random” = the unigram distribution $P(c) = \frac{\#(c)}{|D|}$
• Spoiler: Changing this distribution has a significant effect
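A sketch of the sampling step, assuming toy context counts (the numbers are hypothetical): negatives are drawn from the unigram distribution, so frequent contexts dominate.

```python
import numpy as np

counts = {"the": 100, "Australia": 5, "cyber": 3, "1985": 2}  # toy counts
contexts = list(counts)
freqs = np.array([counts[c] for c in contexts], dtype=float)
p_unigram = freqs / freqs.sum()         # P(c) = #(c) / |D|

rng = np.random.default_rng(0)
k = 5                                    # negative samples per positive pair
negatives = rng.choice(contexts, size=k, p=p_unigram)
print(negatives)                         # mostly "the"
```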
What is SGNS learning?
• Take SGNS’s embedding matrices ($W$ and $C$)
• Multiply them: $W \cdot C^\top$
• What do you get?
“Neural Word Embeddings as Implicit Matrix Factorization” (Levy & Goldberg, NIPS 2014)
• The result $W \cdot C^\top$ is a $V_W \times V_C$ matrix
• Each cell describes the relation between a specific word-context pair
• We proved that for large enough $d$ and enough iterations, we get the word-context PMI matrix, shifted by a global constant:

$$W \cdot C^\top = M^{PMI} - \log k$$
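A toy numeric sketch of this target matrix (the counts are made up; it only illustrates what SGNS’s product $W \cdot C^\top$ converges to):

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, cols: contexts)
counts = np.array([[10.0, 2.0],
                   [ 1.0, 7.0]])
p_wc = counts / counts.sum()           # joint probabilities
p_w = p_wc.sum(axis=1, keepdims=True)  # word marginals
p_c = p_wc.sum(axis=0, keepdims=True)  # context marginals

M_pmi = np.log(p_wc / (p_w * p_c))     # word-context PMI matrix
k = 5                                   # number of negative samples
shifted = M_pmi - np.log(k)            # what SGNS implicitly factorizes
print(shifted)
```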
• SGNS is doing something very similar to the older approaches
• SGNS is factorizing the traditional word-context PMI matrix
• So does SVD!
• GloVe factorizes a similar word-context matrix
But embeddings are still better, right?
• Plenty of evidence that embeddings outperform traditional methods
  • “Don’t Count, Predict!” (Baroni et al., ACL 2014)
  • GloVe (Pennington et al., EMNLP 2014)
• How does this fit with our story?
The Big Impact of “Small” Hyperparameters
• word2vec & GloVe are more than just algorithms…
• Introduce new hyperparameters
• May seem minor, but make a big difference in practice
Identifying New Hyperparameters
New Hyperparameters
• Preprocessing (word2vec)
  • Dynamic Context Windows
  • Subsampling (see the sketch below)
  • Deleting Rare Words

• Postprocessing (GloVe)
  • Adding Context Vectors

• Association Metric (SGNS)
  • Shifted PMI
  • Context Distribution Smoothing
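A sketch of word2vec-style subsampling, following the published formula from Mikolov et al. (2013): a token of word $w$ is discarded with probability $1 - \sqrt{t / f(w)}$, where $f(w)$ is the word’s relative frequency and $t$ is a threshold (e.g. $10^{-5}$). The frequencies below are hypothetical, and word2vec’s C code uses a slight variant of this formula.

```python
import math
import random

def keep_probability(freq: float, t: float = 1e-5) -> float:
    """Probability of keeping a token whose relative frequency is `freq`."""
    return min(1.0, math.sqrt(t / freq))

def subsample(tokens, frequencies, t=1e-5, seed=0):
    rng = random.Random(seed)
    return [w for w in tokens
            if rng.random() < keep_probability(frequencies[w], t)]

freqs = {"the": 0.05, "wampimuk": 1e-7}      # hypothetical frequencies
print(keep_probability(freqs["the"]))         # ~0.014: "the" is mostly dropped
print(keep_probability(freqs["wampimuk"]))    # 1.0: rare words are always kept
```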
Dynamic Context Windows
Marco saw a furry little wampimuk hiding in the tree.
Each scheme weights a context word by its distance $m$ from the focus word (maximum window size $L$):
• word2vec: samples the actual window size uniformly from $\{1, \dots, L\}$; in expectation, weight $= \frac{L - m + 1}{L}$
• GloVe: harmonic weighting, weight $= \frac{1}{m}$
• Aggressive: decays much faster with distance (e.g. weight $= 2^{1-m}$)
The Word-Space Model (Sahlgren, 2006)
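A sketch comparing the three weighting schemes for a maximum window of $L = 4$ (the word2vec and GloVe schemes follow the paper; the “aggressive” halving rule is only an illustrative example):

```python
L = 4  # maximum window size
for m in range(1, L + 1):
    w2v = (L - m + 1) / L        # word2vec: expected weight under a
                                 # uniformly sampled dynamic window
    glove = 1 / m                # GloVe: harmonic weighting
    aggressive = 2.0 ** (1 - m)  # e.g. halve the weight with each step
    print(f"distance {m}: word2vec={w2v:.2f}  GloVe={glove:.2f}  "
          f"aggressive={aggressive:.3f}")
```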
Adding Context Vectors
• SGNS creates word vectors $\vec{w}$
• SGNS also creates auxiliary context vectors $\vec{c}$
  • So do GloVe and SVD

• Instead of just $\vec{w}$, represent a word as: $\vec{w} + \vec{c}$

• Introduced by Pennington et al. (2014)
• Only applied to GloVe
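A minimal sketch of this postprocessing step, assuming trained matrices $W$ and $C$ whose rows are aligned by vocabulary index (random stand-ins here):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 1000, 100
W = rng.normal(size=(vocab_size, d))  # stand-in for trained word vectors
C = rng.normal(size=(vocab_size, d))  # stand-in for trained context vectors

vectors = W + C                       # represent each word by w + c
```

Summing the two changes the similarity function: the dot product of $\vec{w_1} + \vec{c_1}$ and $\vec{w_2} + \vec{c_2}$ mixes word-word terms with word-context terms, adding first-order similarity to the usual second-order similarity.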
Adapting Hyperparameters across Algorithms
Context Distribution Smoothing
• SGNS samples $c' \sim P(c)$ to form negative examples

• Our analysis assumes $P(c)$ is the unigram distribution $P(c) = \frac{\#(c)}{|D|}$

• In practice, it’s a smoothed unigram distribution: $P^{0.75}(c) = \frac{\#(c)^{0.75}}{\sum_{c'} \#(c')^{0.75}}$
• This little change makes a big difference
• We can adapt context distribution smoothing to PMI!
• Replace $\hat{P}(c)$ with $\hat{P}^{0.75}(c)$:

$$PMI_{0.75}(w,c) = \log \frac{\hat{P}(w,c)}{\hat{P}(w) \cdot \hat{P}^{0.75}(c)}$$
• Consistently improves PMI on every task
• Always use Context Distribution Smoothing!
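A sketch of PMI with context distribution smoothing, assuming a toy count matrix; the only change from vanilla PMI is raising the context counts to the 0.75 power before normalizing:

```python
import numpy as np

def pmi(counts: np.ndarray, cds: float = 1.0) -> np.ndarray:
    """PMI with the context marginal raised to the `cds` power."""
    p_wc = counts / counts.sum()
    p_w = p_wc.sum(axis=1, keepdims=True)
    c_counts = counts.sum(axis=0, keepdims=True)
    p_c = (c_counts ** cds) / (c_counts ** cds).sum()  # smoothed P(c)
    return np.log(p_wc / (p_w * p_c))

counts = np.array([[10.0, 2.0], [1.0, 7.0]])  # toy co-occurrence counts
print(pmi(counts))                            # vanilla PMI
print(pmi(counts, cds=0.75))                  # smoothed: rare contexts' PMI shrinks
```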
Comparing Algorithms
Controlled Experiments
• Prior art was unaware of these hyperparameters
• Essentially, comparing “apples to oranges”
• We allow every algorithm to use every hyperparameter*
* If transferable
Systematic Experiments
• 9 Hyperparameters
  • 6 New

• 4 Word Representation Algorithms
  • PPMI (Sparse & Explicit)
  • SVD(PPMI)
  • SGNS
  • GloVe

• 8 Benchmarks
  • 6 Word Similarity Tasks
  • 2 Analogy Tasks
• 5,632 experiments
Hyperparameter Settings
Classic Vanilla Setting (commonly used for distributional baselines)
• Preprocessing: <None>
• Postprocessing: <None>
• Association Metric: Vanilla PMI/PPMI

Recommended word2vec Setting (tuned for SGNS)
• Preprocessing: Dynamic Context Window, Subsampling
• Postprocessing: <None>
• Association Metric: Shifted PMI/PPMI, Context Distribution Smoothing
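As a rough illustration, the recommended word2vec setting maps onto gensim’s Word2Vec hyperparameters as sketched below (gensim 4.x argument names; the paper itself used the original word2vec C tool, and the corpus path is hypothetical):

```python
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_file="corpus.txt",  # hypothetical path, one sentence per line
    sg=1,                      # Skip-Gram
    negative=5,                # Negative Sampling, k = 5 (shifted PMI)
    window=5,                  # dynamic context window
    sample=1e-5,               # subsampling threshold
    ns_exponent=0.75,          # context distribution smoothing
    vector_size=300,           # dimensionality d
)
```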
Experiments
[Bar chart: Spearman’s correlation on WordSim-353 Relatedness, built up across slides titled “Prior Art”, “Apples to Apples”, “Oranges to Oranges”, and “Hyperparameter Tuning”]

Setting              PPMI (Sparse Vectors)   SGNS (Embeddings)
Vanilla Setting      0.54                    0.587
word2vec Setting     0.688                   0.623
Optimal Setting*     0.697                   0.681

* optimal settings differ per method
Overall Results
• Hyperparameters often have stronger effects than algorithms
• Hyperparameters often have stronger effects than more data
• Prior superiority claims were not accurate
Re-evaluating Prior Claims
Don’t Count, Predict! (Baroni et al., 2014)
• “word2vec is better than count-based methods”
• Hyperparameter settings account for most of the reported gaps
• Embeddings do not really outperform count-based methods*
* Except for one task…
GloVe (Pennington et al., 2014)
• “GloVe is better than word2vec”
• Hyperparameter settings account for most of the reported gaps
  • Adding context vectors applied only to GloVe
  • Different preprocessing

• We observed the opposite
  • SGNS outperformed GloVe on every task

• Our largest corpus: 10 billion tokens
  • Perhaps larger corpora behave differently?
Linguistic Regularities in Sparse and Explicit Word Representations (Levy and Goldberg, 2014)

• “PPMI vectors perform on par with SGNS on analogy tasks”

• Holds for semantic analogies
• Does not hold for syntactic analogies (MSR dataset)

• Hyperparameter settings account for most of the reported gaps
  • Different context type for PPMI vectors
• Syntactic Analogies: there is a real gap in favor of SGNS
Conclusions
Conclusions: Distributional Similarity
The Contributions of Word Embeddings:
• Novel Algorithms
• New Hyperparameters

What’s really improving performance?
• Hyperparameters (mostly)
• The algorithms are an improvement
• SGNS is robust & efficient
![Page 69: Improving Distributional Similarity with Lessons Learned from Word Embeddings Omer Levy Yoav Goldberg Ido Dagan Bar-Ilan University Israel 1.](https://reader035.fdocuments.net/reader035/viewer/2022081515/5697bfa31a28abf838c96de6/html5/thumbnails/69.jpg)
69
Conclusions: Methodology
• Look for hyperparameters
• Adapt hyperparameters across different algorithms
• For good results: tune hyperparameters
• For good science: tune baselines’ hyperparameters
Thank you :)