Word and Graph Embeddings for Machine Learning Models
Steven Skiena, Dept. of Computer Science
Which is the most useful definition of “cat”?
Distributed Word Representations
• Words are represented by low-dimensional vectors (20-500 dimensions).
• The neural network methods that learn these representations from phrase fluency are language-agnostic and can scale to huge datasets.
Similar words share similar representations.
[Figure: word pairs such as pine/oak, rose/daisy, reading/writing and read/write, shown first in |V| explicit dimensions (|V|: size of vocabulary) and then in d latent dimensions, with d << |V|.]
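To make the contrast between explicit and latent dimensions concrete, here is a minimal sketch. The six-word vocabulary comes from the slide, but the dense vectors are invented by hand for illustration (they are not learned embeddings), and the helper names `cosine` and `similarity` are hypothetical.

```python
import numpy as np

vocab = ["pine", "oak", "rose", "daisy", "reading", "writing"]
index = {word: i for i, word in enumerate(vocab)}

# Explicit representation: one dimension per vocabulary word, so |V| dimensions.
one_hot = np.eye(len(vocab))

# Latent representation with d = 3 dimensions (d << |V| for a real vocabulary).
# ASSUMPTION: these values are hand-picked for illustration, not trained.
dense = np.array([
    [0.9, 0.1, 0.0],  # pine
    [0.8, 0.2, 0.0],  # oak
    [0.1, 0.9, 0.0],  # rose
    [0.2, 0.8, 0.1],  # daisy
    [0.0, 0.1, 0.9],  # reading
    [0.1, 0.0, 0.9],  # writing
])

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity(table, a, b):
    """Cosine similarity between the rows of `table` for words a and b."""
    return cosine(table[index[a]], table[index[b]])

print(similarity(one_hot, "pine", "oak"))    # distinct one-hot vectors are orthogonal
print(similarity(dense, "pine", "oak"))      # related words score close to 1
print(similarity(dense, "pine", "reading"))  # unrelated words score close to 0
```

In the explicit space every pair of distinct words looks equally unrelated, while the low-dimensional space leaves room for related words to sit near each other.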
The Fluency Test for Training Embeddings
● A good representation should be able to distinguish between real and randomly corrupted phrases.
● For each phrase S, sample a random word from the vocabulary and substitute it for one of the phrase's words to get a corrupted phrase S'.
S = ("When", "I", "visited", "New", "York")
S' = ("When", "I", "fix", "New", "York")
● The model is asked to perturb the representation of each word so that the score satisfies the following condition, where 1 is the margin:
Score(S) > Score(S') + 1
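The margin condition is commonly trained with a hinge loss, which the sketch below illustrates. Several things here are assumptions rather than details from the slides: the function names `corrupt` and `hinge_loss`, the choice to replace the middle word (as in the example phrase), and the tiny candidate vocabulary.

```python
import random

MARGIN = 1.0

def corrupt(phrase, vocabulary, rng):
    """Copy `phrase`, replacing its middle word with a different random word.

    ASSUMPTION: the middle position is corrupted, matching the slide's example.
    """
    middle = len(phrase) // 2
    candidates = [w for w in vocabulary if w != phrase[middle]]
    corrupted = list(phrase)
    corrupted[middle] = rng.choice(candidates)
    return tuple(corrupted)

def hinge_loss(score_real, score_corrupt, margin=MARGIN):
    """Zero once Score(S) is at least Score(S') + margin, positive otherwise."""
    return max(0.0, margin - score_real + score_corrupt)

rng = random.Random(0)
S = ("When", "I", "visited", "New", "York")
S_prime = corrupt(S, ["visited", "fix", "banana", "run"], rng)
print(S_prime)

print(hinge_loss(2.5, 0.5))  # 0.0: the real phrase already wins by more than the margin
print(hinge_loss(1.0, 0.5))  # 0.5: the gap is too small, so the model is penalized
```

Training would lower this loss by adjusting the word representations, so only phrases that violate the margin contribute updates.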
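Neural Network (Word2Vec) [Mikolov et al. 2013]
[Figure: the words "Imagination is greater than detail" are looked up in the embedding matrix C (size |V|) to give the vectors C_Imagination, C_is, C_greater, C_than and C_detail in the projection layer. These feed the hidden layer H, which produces the score; W1 and W2 are the weight matrices. Arrows mark the forward pass and the backward pass.]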
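A forward pass through a network of this shape might look like the sketch below. The layer sizes, the tanh nonlinearity, the random initialization, and the placement of W1 (projection to hidden) and W2 (hidden to score) are all assumptions made for illustration; the slide only names the layers and matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["imagination", "is", "greater", "than", "detail"]
index = {word: i for i, word in enumerate(vocab)}

# ASSUMPTION: illustrative sizes, not taken from the slides.
d, window, hidden = 4, 5, 8

C = rng.normal(scale=0.1, size=(len(vocab), d))        # embedding matrix, one row per word
W1 = rng.normal(scale=0.1, size=(window * d, hidden))  # projection layer -> hidden layer
W2 = rng.normal(scale=0.1, size=hidden)                # hidden layer -> scalar score

def score(phrase):
    """Forward pass: look up word vectors, concatenate, apply the hidden layer, score."""
    ids = [index[w] for w in phrase]
    projection = C[ids].reshape(-1)   # concatenated word vectors (projection layer)
    H = np.tanh(projection @ W1)      # hidden layer activations
    return float(H @ W2)              # phrase score

print(score(["imagination", "is", "greater", "than", "detail"]))
```

The backward pass would propagate the gradient of the training loss through W2, W1 and the looked-up rows of C, which is how the word vectors themselves get updated.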
Visualizing Word Embeddings: Animals, Colors, Numbers, Countries
Talk Outline
● Introduction: Word Embeddings
● Multilingual Language Processing (NLP)
● Name Embeddings
● DeepWalk: Feature Extraction from Graphs
● AI Institute
Polyglot Embeddings Demo @ https://bit.ly/embeddings
Examples of the nearest five neighbors of every word in several languages.
Applications of Word Embeddings
● Part of Speech (POS) tagging: what is a noun, verb, adjective? [CoNLL 2013]
● Entity Recognition: what are the people, places and things mentioned in the text? [SDM 2015]
● Sentiment Analysis: is a document saying positive or negative things? [ACL 2014]
● Transliteration: how can we say the same thing in different languages and recognize friends?
Polyglot-NER Demo @ https://bit.ly/polyglot-ner
R. Al-Rfou, B. Perozzi, and S. Skiena, “Polyglot-NER: Massive Multilingual Named Entity Recognition”, SIAM Conf. Data Mining (SDM 2015)
Legend: Location, Organization, Person
Talk Outline
● Introduction: Word Embeddings
● Multilingual Language Processing (NLP)
● Name Embeddings
● DeepWalk: Feature Extraction from Graphs
● AI Institute
Your Name Tells a Lot About You
● Your gender (male, female)
● Your ethnicity (white, black, hispanic, asian/pacific islander)
● Your nationality (which country is your family from?)
● Your marriage status (X Y-Z)
● Your socio-economic status (Jethro vs. Archibald)
● Your age (Fannie vs. Caitlin)
How can we capture these nuances into a feature representation for classification and other machine learning tasks?
Homophily and Communications Patterns
● Brad Pitt: Angelina Jolie, Jennifer Aniston, George Clooney, Cate Blanchett, Julia Roberts
● Saddam Hussein: Tarik Aziz, Uday Hussein, Samira Shahbandar, Sajida Talfah
● Donald Trump: Mike Pence, Vladimir Putin, Paul Ryan, Ivanka Trump, Mitch McConnell
● Xi Jinping: Hu Jintao, Jiang Zemin, Peng Liyuan, Xi Mingze, Ke Lingling
Homophily (“love of the same”) is the tendency for people to associate with people similar to them.
Our analysis of 57 million email contact lists from a major Internet company gave us sequences of name tokens sufficient for training distributed word embeddings.
Nationality classification using name embeddings (with Junting Ye, Shuchu Han, Yifan Hu, Baris Coskun, Meizhu Liu, and Steven Skiena), CIKM 2017.
Regions in Name Space
These embeddings make natural features in any learning task where you have names:
● Nationality/ethnicity detection for biomedical/sociology research
● Demographic analysis
● Social media analysis
● Security
NamePrism: A nationality classifier (www.name-prism.com)
| Research Project Goal | Research Group | Country |
| --- | --- | --- |
| “the impact of having foreign-sounding name on job search in financial” | Nanyang Technological University | Singapore |
| “determine if ethnic group size impacts national cabinet diversity” | Department of Political Science, Washington University in St. Louis | U.S. |
| “promote the contributions of Iranian Americans to members with-in and outside of the Iranian community living in America.” | Iranian Americans' Contributions Project | U.S. |
| “determine if ethnicity plays a part/plays no part in whether a written evidence submitted to a Parliamentary Inquiry is accepted or rejected” | Parliamentary Digital Service | UK |
| “working on a study on the network effects for long term unemployed” | German Institute for Employment Research | Germany |
| “unveiling the origins of French citizens in order to study discrimination in several areas of the French society” | Laboratoire Interdisciplinaire Sciences Innovations Sociétés (LISIS) | France |
| “Investigate whether hosts on Airbnb get discriminated based on their ethnicity” | Stockholm School of Economics | Sweden |
● WIRED Magazine
● API used by 156 social science research projects
● 69.9 million names analyzed (the population of France)
Talk Outline
● Introduction: Word Embeddings
● Multilingual Language Processing (NLP)
● Name Embeddings
● DeepWalk: Feature Extraction from Graphs
● AI Institute
Features From Graphs
[Figure: an adjacency matrix with |V| columns, the raw input for tasks such as:]
● Anomaly Detection
● Attribute Prediction
● Clustering
● Link Prediction
● ...
A first step in machine learning for graphs is to extract graph features:
● node degree
● pairs: # of common neighbors
● groups: cluster assignments
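The first two hand-crafted features are easy to compute from an adjacency structure, as the following sketch shows. The toy graph and the function names `degree` and `common_neighbors` are made up for illustration; cluster assignments would need a separate clustering step and are left out.

```python
# Undirected toy graph stored as an adjacency mapping: vertex -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

def degree(g, v):
    """Node feature: the number of neighbors of v."""
    return len(g[v])

def common_neighbors(g, u, v):
    """Pair feature: the number of vertices adjacent to both u and v."""
    return len(g[u] & g[v])

print(degree(graph, "b"))                 # 3
print(common_neighbors(graph, "a", "b"))  # 1 (only "c" is adjacent to both)
```

Features like these have to be designed and computed per task, which is the motivation for learning latent graph features instead.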
Advantages of DeepWalk
Bryan Perozzi, “DeepWalk: Online Learning of Social Representations”
[Figure: DeepWalk maps the adjacency matrix (|V| columns) to latent dimensions (d << |V|), which then serve tasks such as:]
● Anomaly Detection
● Attribute Prediction
● Clustering
● Link Prediction
● ...
Advantages:
● Scalable: an online algorithm that does not use the entire graph at once
● Walks as sentences metaphor
● Works great!
● Implementation available: bit.ly/deepwalk
DeepWalk: The Entire Idea
Short random walks = sentences
■ We generate random walks for each vertex in the graph.
■ Each short random walk has length t.
■ Pick the next step uniformly from the vertex neighbors.
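The three steps above can be sketched in a few lines. This is an illustrative version, not the released DeepWalk implementation: the function names, the `walks_per_vertex` parameter, the fixed seed, and the choice to count t in vertices rather than edges are all assumptions.

```python
import random

def random_walk(graph, start, t, rng):
    """Walk of up to t vertices from `start`, choosing each next step uniformly."""
    walk = [start]
    while len(walk) < t:
        neighbors = graph[walk[-1]]
        if not neighbors:          # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

def build_corpus(graph, walks_per_vertex, t, seed=0):
    """Generate `walks_per_vertex` walks from every vertex; each walk is a 'sentence'."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_vertex):
        vertices = list(graph)
        rng.shuffle(vertices)      # visit start vertices in a random order on each pass
        for v in vertices:
            corpus.append(random_walk(graph, v, t, rng))
    return corpus

# Toy undirected graph: vertex -> list of neighbors.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}

corpus = build_corpus(toy_graph, walks_per_vertex=2, t=4)
for sentence in corpus:
    print(" ".join(sentence))
```

Each printed line plays the role of a sentence, so the resulting corpus can be handed to a word-embedding trainer with vertices standing in for words.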
Everyone is doing the DeepWalk!
● 2177 citations since August 2014!
● 28th in downloads from the ACM Digital Library over the past 12 months, among all 4,188 KDD papers ever! (By comparison, the second best from KDD '14 ranks 104th.)
You, too, can do the DeepWalk!
Identifying Historically Similar Entities
We can construct embeddings for people, places and things, to recognize similar entities.
Y. Chen, B. Perozzi, and S. Skiena, “Vector-Based Similarity Measurements for Historical Figures”, SISAP 2015.
DeepWalk: Nearest Neighbors
Scarlett Johansson
● Kirsten Dunst (0.784)
● Natalie Portman (0.786)
● Gwyneth Paltrow (0.796)
● Brad Pitt (0.858)
● Cameron Diaz (0.891)
Steven Skiena
● Larry Page (1.597)
● Sergey Brin (1.598)
● Danny Hillis (1.644)
● Andrei Broder (1.652)
● Mark Weiser (1.653)
Barack Obama
● George W. Bush (0.474)
● Hillary Clinton (0.657)
● Bill Clinton (0.658)
● Joe Biden (0.750)
● Al Gore (0.791)
Albert Einstein
● Richard Feynman (1.049)
● Max Planck (1.073)
● Freeman Dyson (1.107)
● Stephen Hawking (1.153)
● Robert Oppenheimer (1.156)
Ludwig van Beethoven
● Franz Schubert (0.489)
● Johannes Brahms (0.532)
● Wolfgang Mozart (0.567)
● Robert Schumann (0.576)
● Gustav Mahler (0.635)
Mick Jagger
● John Lennon (0.687)
● Keith Richards (0.687)
● Paul McCartney (0.796)
● Ronnie Wood (0.822)
● Eric Clapton (0.833)
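A nearest-neighbor lookup over entity embeddings can be sketched as follows. The slides do not say which measure produced the numbers in parentheses; this example assumes Euclidean distance (smaller means more similar), and the vectors, labels and the function name `nearest_neighbors` are invented for illustration.

```python
import numpy as np

def nearest_neighbors(embeddings, query, k=5):
    """Return the k entities closest to `query` as (name, distance) pairs.

    ASSUMPTION: Euclidean distance between embedding vectors.
    """
    names = [name for name in embeddings if name != query]
    matrix = np.array([embeddings[name] for name in names], dtype=float)
    target = np.asarray(embeddings[query], dtype=float)
    distances = np.linalg.norm(matrix - target, axis=1)
    order = np.argsort(distances)[:k]
    return [(names[i], float(distances[i])) for i in order]

# Made-up 2-D embeddings; real entity embeddings would have many more dimensions.
toy = {
    "q": [0.0, 0.0],
    "a": [1.0, 0.0],
    "b": [0.0, 2.0],
    "c": [3.0, 4.0],
}

print(nearest_neighbors(toy, "q", k=2))  # [('a', 1.0), ('b', 2.0)]
```

A brute-force scan like this is fine for small collections; larger ones would typically use an approximate nearest-neighbor index instead.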
Institute for AI-Driven Discovery and Innovation
Professor Steven Skiena
Director, AI Institute
November 2019
Mission of the Institute
• Promote activity in AI and related areas towards attracting more federal, state, industrial and private funding.
• Stimulate research and educational activities in AI in CS and across CEAS.
• Make Stony Brook a more attractive destination for graduate students and faculty interested in AI and related areas.
• Check out our website: ai.stonybrook.edu and @AI_SBU on Twitter.
Stony Brook AI Institute: Major Accomplishments 2018-19
• Building a stronger Stony Brook AI community
• Three SUNY EIP hires (Haibin Ling, Michael Ryoo, Zhaozheng Yin)!!
• Junior hire in machine learning (Yifan Sun, starting Fall 2020)
• NSF Major Research Infrastructure (MRI) grant for large AI/ML cluster
• First substantial philanthropic gift ($375K) for postdoctoral scholars program
• Communications hire (Dan Olawski)
• Bloomberg AI Institute kickoff
Core Faculty
Research Interests
We have strong research groups in several areas of AI:
• Biomedical Informatics
• Computer Vision
• Computational Logic and Reasoning
• Data Science
• Machine Learning
• Natural Language Processing
• Social Media Analysis
Educational Initiatives
• New undergraduate courses in Data Science and Natural Language Processing
• New undergraduate concentration in Artificial Intelligence
• New graduate concentration in Data Science, approved
• DS+X initiative with College of Arts and Sciences (CAS)
Student Collaborators: Stony Brook Data Science Lab
Bryan Perozzi, Vivek Kulkarni, Rami Al-Rfou, Junting Ye, Haochen Chen, Yingtao Tian
Manual Laboring
Questions?