Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to...
![Page 1: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/1.jpg)
Paraphrase Detection in NLP
Yuriy Guts, ML Engineer
=?
![Page 2: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/2.jpg)
Paraphrasing
Any trip to Italy should include a visit to Tuscany to sample their exquisite wines.
Be sure to include a Tuscan wine-tasting experience when visiting Italy.
![Page 3: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/3.jpg)
Paraphrase Identification
Where can I get very professional and reliable envelope printing service in Sydney?
Where can I get very affordable branded envelope printing service in Sydney?
Why are doctors always late?
Why doctors always make you wait for 15-20 minutes before they see you?
![Page 4: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/4.jpg)
Training set: 404,290 pairs
Test set: 2,345,796 pairs
Data: Q1 (ID, Title), Q2 (ID, Title)
Target: Binary (metric: log-loss)
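The log-loss metric named above can be sketched in a few lines. This is a minimal illustration rather than the competition's scoring code; the function name `log_loss` and the clipping constant `eps` are assumptions made for the example.

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip so that an over-confident 0.0 or 1.0 never hits log(0).
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

Because the penalty grows without bound as a wrong prediction approaches 0 or 1, the metric rewards well-calibrated probabilities rather than hard yes/no answers.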
![Page 5: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/5.jpg)
Challenges
![Page 6: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/6.jpg)
Sometimes a single word matters
What are the best books on IT leadership?
What are the best books about leadership?
Will the Miami Heat win the NBA championship in 2011?
Will the Miami Heat win the NBA championship in 2012?
How do I lose 20 pounds?
How do I lose 15 pounds?
![Page 7: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/7.jpg)
Sometimes there are almost no shared words
I am unable to talk to girls, leave being friendly with them. Why?
I am shy to talk to any woman because i get nervous and freaked out around them. What is the solution?
Is there a Quora user who have seen a UFO?
Have you seen an alien?
![Page 8: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/8.jpg)
Sometimes ALL the words are shared
Is the Government of Pakistan encouraging India by not taking any real action against ceasefire violations?
Is the Government of India encouraging Pakistan by not taking any real action against ceasefire violations?
What is the most interesting thing we learned about Portugal's World Cup team in their match against Germany?
What is the most interesting thing we learned about Germany's World Cup team in their match against Portugal?
![Page 9: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/9.jpg)
Approaches
1. Counting Stuff.
2. Information Theory.
3. Linguistics.
4. Deep Learning.
![Page 10: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/10.jpg)
BoW Representations
Image copyright © S. Mukherjee
![Page 11: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/11.jpg)
Cosine Similarity
Image copyright © C. Perone
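A bag-of-words vector and the cosine similarity between two such vectors can be sketched with the standard library alone. The whitespace tokenizer and the helper names `bow` and `cosine_similarity` are assumptions for illustration; a real pipeline would normally use a proper tokenizer and TF-IDF weighting.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words: map each lowercased whitespace token to its count."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

q1 = bow("How do I lose 20 pounds?")
q2 = bow("How do I lose 15 pounds?")
print(round(cosine_similarity(q1, q2), 3))  # 0.833
```

The two questions differ only in the number, so five of six tokens match and the score stays high. That is exactly the "single word matters" failure mode: a count-based similarity alone cannot tell these two questions apart.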
![Page 12: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/12.jpg)
Jaccard Similarity
Image copyright © C. Wagner
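Jaccard similarity is the size of the intersection of two token sets divided by the size of their union. The sketch below assumes whitespace tokenization; the function name is hypothetical.

```python
def jaccard_similarity(text_a, text_b):
    """|A ∩ B| / |A ∪ B| over lowercased whitespace tokens."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

a = ("Is the Government of Pakistan encouraging India by not taking "
     "any real action against ceasefire violations?")
b = ("Is the Government of India encouraging Pakistan by not taking "
     "any real action against ceasefire violations?")
print(jaccard_similarity(a, b))  # 1.0
```

Set-based measures ignore word order, so this pair from the "ALL the words are shared" slide scores a perfect 1.0 even though the two questions ask opposite things.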
![Page 13: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/13.jpg)
Edit Distances
1. Levenshtein
2. Damerau-Levenshtein
3. Jaro
4. Jaro-Winkler
… … ...
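As an illustration of the first item, Levenshtein distance can be computed with a standard dynamic program that keeps only two rows of the table. This is a generic textbook sketch, not code from the talk.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Damerau-Levenshtein additionally counts a transposition of two adjacent characters as a single edit, which suits typing errors.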
![Page 14: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/14.jpg)
Lexical Databases and Ontologies
[Figure: lexical graph around the synset “house, home”, with the definition “A place that serves as the living quarters of one or more families”. Related nodes: dwelling, hermitage, cottage, penthouse, backyard, veranda, study. Edge labels: hypernym, hyponym, meronym, definition.]
![Page 15: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/15.jpg)
WordNet
![Page 16: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/16.jpg)
“The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously”
Distributional Hypothesis of Language
John R. Firth. The technique of semantics. Transactions of the Philological Society, 1935.
![Page 17: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/17.jpg)
The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied transportation problem.
Word Mover’s Distance (WMD)
M. Kusner et al. “From Word Embeddings to Document Distances”, 2015. http://proceedings.mlr.press/v37/kusnerb15.pdf
![Page 18: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/18.jpg)
WMD: Linear Optimization Problem
M. Kusner et al. “From Word Embeddings to Document Distances”, 2015. http://proceedings.mlr.press/v37/kusnerb15.pdf
nBOW frequency of the i-th word in the document
“How much” of word i travels to word j
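For reference, the optimization problem in the cited paper can be written as below. The notation follows Kusner et al. and is an assumption about what the slide displayed: $d_i$ and $d'_j$ are the nBOW frequencies of words in the two documents, $T_{ij}$ is how much of word $i$ travels to word $j$, and $x_i$ is the embedding of word $i$.

```latex
\min_{T \ge 0} \; \sum_{i,j=1}^{n} T_{ij}\, c(i,j)
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i,
\qquad
\sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j,
\qquad
c(i,j) = \lVert x_i - x_j \rVert_2
```

The two constraints say that all the mass of each word in document 1 must be shipped out, and each word in document 2 must receive exactly its own mass.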
![Page 19: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/19.jpg)
Morphology & Syntax Features
https://demos.explosion.ai/displacy/
![Page 20: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/20.jpg)
Architectural Principles for State-of-the-Art Neural NLP
![Page 21: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/21.jpg)
Embed
Encode
Attend
Predict
![Page 22: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/22.jpg)
State-of-the-Art NLP Pipeline
Image copyright © Explosion AI
![Page 23: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/23.jpg)
Recurrent Highway Networks
Highway layer:
Zilly, Srivastava et al. “Recurrent Highway Networks”, 2016. https://arxiv.org/pdf/1607.03474.pdf
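For reference, a highway layer as defined in the earlier “Highway Networks” paper (Srivastava, Greff, Schmidhuber, 2015) mixes a nonlinear transform $H$ with the unchanged input through a learned transform gate $T$. The symbols below follow that paper and are an assumption about what the slide showed.

```latex
y = H(x, W_H) \odot T(x, W_T) + x \odot \bigl(1 - T(x, W_T)\bigr),
\qquad
T(x, W_T) = \sigma\!\left(W_T^{\top} x + b_T\right)
```

When the gate is close to 0 the layer passes its input through unchanged, which is what lets gradients flow through deep stacks.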
![Page 24: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/24.jpg)
Two Input Documents? Siamese Network
Run the same encoding step for every input.
Share encoder weights, don’t learn WE1 and WE2 separately.
![Page 25: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/25.jpg)
Distance Learning, Contrastive Loss
Penalize similar pairs by a monotonically increasing function of their learned distance.
Penalize different pairs by a monotonically decreasing function of their learned distance.
Hadsell, Chopra, LeCun “Dimensionality Reduction by Learning an Invariant Mapping”, 2006. http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
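The two penalties can be sketched for a single pair as follows, using the form of the loss given in the cited paper. The vectors `u` and `v` stand for the outputs of the shared encoder; the function names and the default `margin` are assumptions for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, is_similar, margin=1.0):
    """Contrastive loss for one pair of encoded documents."""
    d = euclidean(u, v)
    if is_similar:
        # Similar pairs: the loss grows with their distance.
        return 0.5 * d ** 2
    # Different pairs: the loss shrinks with distance and is zero beyond the margin.
    return 0.5 * max(0.0, margin - d) ** 2

print(contrastive_loss([0, 0], [3, 4], is_similar=True))   # 12.5
print(contrastive_loss([0, 0], [3, 4], is_similar=False))  # 0.0
```

The margin keeps the encoder from pushing already well-separated pairs even further apart.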
![Page 26: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/26.jpg)
Training set: 404,290 pairs
Test set: 2,345,796 pairs
Data: Q1 (ID, Title), Q2 (ID, Title)
Target: Binary (metric: log-loss)
![Page 27: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/27.jpg)
Sometimes the labeling is just plain wrong
How can I prevent pneumonia?
How do you prevent pneumonia?
What is the car in this picture?
What car is in this picture?
How do I protect single phasing of a 3 phase motor?
What is single phase and 3 phase?
![Page 28: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/28.jpg)
People misspell. A lot.
demonitization demonitozation masturbution masturbtion
demonitizing demonetiztion masturation mastburation
demonetising demoneitisation mastrubating masturabate
demonitisation demoneticzation masturbuation mastrubration
demonitize demonitizition masterbuting mastubrate
demonatisation demonetisations mastubate masturbute
demonitazation demonitasation mastribution mastribute
demonitesation demonestisation masturburation masterbuate
demoneytisation demotenization mastrabutation mastubating
demonestation demonitised mastubation
demonitising demonsation
100,000+ spelling corrections
![Page 29: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/29.jpg)
Different target balance for Train and Test
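One common way to handle such a shift is to rescale predicted probabilities by the ratio of class priors. The sketch below shows that generic correction; it is not necessarily the calibration step used in this solution, and the prior values in the usage line are placeholders rather than figures from the talk.

```python
def rescale_for_prior_shift(p, train_pos_rate, test_pos_rate):
    """Adjust a probability learned under one positive-class rate to another rate."""
    a = test_pos_rate / train_pos_rate
    b = (1.0 - test_pos_rate) / (1.0 - train_pos_rate)
    return a * p / (a * p + b * (1.0 - p))

# Placeholder priors: 40% positives in train, 20% in test.
print(rescale_for_prior_shift(0.4, train_pos_rate=0.4, test_pos_rate=0.2))
```

A prediction equal to the training prior maps to the test prior, and the ranking of predictions is preserved, which matters because log-loss punishes systematically over-confident probabilities.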
![Page 30: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/30.jpg)
Final Solution (Simplified)
Tokenized Text
Sequences of Pre-trained Embeddings
Statistical Features
Fuzzy Distances
Linguistic Features
Topic Features
Leaks ¯\_(ツ)_/¯
Deep NN Predictions
Embedding Features
Final Model (GBM)
Prediction Calibration
~200 features for the final model
![Page 31: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/31.jpg)
![Page 32: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/32.jpg)
![Page 33: Paraphrase Detection in NLP - Tian Jun · 2019-08-14 · The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied](https://reader035.fdocuments.net/reader035/viewer/2022081405/5f0a4cff7e708231d42afb9d/html5/thumbnails/33.jpg)
facebook.com/yuriy.guts
github.com/YuriyGuts/kaggle-quora-question-pairs