Word and Graph Embeddings for Machine Learning Models

Steven Skiena, Dept. of Computer Science

Page 1:

Word and Graph Embeddings for Machine Learning Models

Steven Skiena, Dept. of Computer Science

Page 2:

Which is the most useful definition of “cat”?

Page 3:

Distributed Word Representations

• Words are represented by low-dimensional vectors (20-500 dimensions).

• The NN learning methods based on phrase fluency to learn these representations are language-agnostic, and can scale to huge datasets.

[Figure: the words pine, oak, rose, daisy, reading, writing, read and write shown first as rows over |V| explicit dimensions (|V|: size of vocabulary), then as rows over d latent dimensions, with d << |V|.]

Similar words share similar representations.

Page 4:

The Fluency Test for Training Embeddings

● A good representation should be able to distinguish between real and randomly corrupted phrases.

● For each phrase, sample a random word from the vocabulary and substitute it for one word of the phrase.

S = ("When", "I", "visited", "New", "York")

S' = ("When", "I", "fix", "New", "York")

● The model is asked to perturb the representation for each word so the score satisfies the following condition.

Score(S) > Score(S') + 1   (margin = 1)
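The margin condition above is commonly trained as a hinge (ranking) loss that drops to zero once the real phrase outscores the corrupted one by the margin. A minimal Python sketch, assuming a margin of 1 as on the slide; the names `corrupt` and `hinge_loss`, the toy vocabulary, and the choice to corrupt the middle word are illustrative assumptions, not details from the talk:

```python
import random

def corrupt(phrase, vocabulary, rng, position=None):
    """Return a copy of `phrase` with one word swapped for a random vocabulary word."""
    if position is None:
        position = len(phrase) // 2  # assumption: corrupt the middle word
    # Exclude the original word so the phrase is guaranteed to change.
    candidates = [w for w in vocabulary if w != phrase[position]]
    corrupted = list(phrase)
    corrupted[position] = rng.choice(candidates)
    return tuple(corrupted)

def hinge_loss(score_real, score_corrupt, margin=1.0):
    """Zero when score_real >= score_corrupt + margin, positive otherwise."""
    return max(0.0, margin - score_real + score_corrupt)

rng = random.Random(0)
S = ("When", "I", "visited", "New", "York")
vocabulary = ["When", "I", "visited", "fix", "New", "York", "cat"]  # toy vocabulary
S_prime = corrupt(S, vocabulary, rng)
```

During training, the gradient of this loss with respect to the word vectors is what "perturbs" the representations; pairs that already satisfy the margin contribute nothing.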

Page 5:

Neural Network (Word2Vec) [Mikolov et al. 2013]

[Figure: network diagram. Word vectors C_Imagination, C_is, C_greater, C_than and C_detail are looked up in the embedding matrix C (labels: M, |V|) and passed through a projection layer and a hidden layer H to a score; W1 and W2 label the weights, and arrows mark the forward pass and backward pass.]
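One way to read the diagram labels (C, projection layer, hidden layer H, W1, W2, score) is as a small feed-forward scorer over a window of words. The sketch below shows only a forward pass in plain Python. The layer sizes, the tanh activation, the missing bias terms, the random initialization, and the assignment of W1 and W2 to particular layers are all assumptions made for illustration; the slide does not specify them.

```python
import math
import random

rng = random.Random(0)
window = ["Imagination", "is", "greater", "than", "detail"]
word_to_id = {word: i for i, word in enumerate(window)}

d, hidden = 4, 6  # embedding and hidden sizes: illustrative assumptions

def random_matrix(rows, cols):
    """Small random weights, standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

C = random_matrix(len(word_to_id), d)        # one d-dimensional row per word
W1 = random_matrix(hidden, len(window) * d)  # assumed: projection layer -> hidden layer
W2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]  # assumed: hidden layer -> score

def project(phrase):
    """Projection layer: concatenate the embedding rows of the phrase's words."""
    return [x for word in phrase for x in C[word_to_id[word]]]

def score(phrase):
    """Forward pass: projection -> tanh hidden layer H -> scalar score."""
    projection = project(phrase)
    H = [math.tanh(sum(w * x for w, x in zip(row, projection))) for row in W1]
    return sum(w * h for w, h in zip(W2, H))

phrase_score = score(window)
```

The backward pass would propagate the training loss through W2, W1 and finally into the rows of C, which is how the embeddings themselves get updated.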

Page 6:

Visualizing Word Embeddings: Animals, Colors, Numbers, Countries

Page 7:

Talk Outline

● Introduction: Word Embeddings
● Multilingual Language Processing (NLP)
● Name Embeddings
● DeepWalk: Feature Extraction from Graphs
● AI Institute

Page 8:

Polyglot Embeddings Demo @ https://bit.ly/embeddings

Examples of the nearest five neighbors of every word in several languages.
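A nearest-neighbor lookup of this kind can be sketched in a few lines of Python. Cosine similarity is one common choice of metric and is an assumption here (the slide does not say which metric the demo uses); the function names and the toy 2-dimensional vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbors(word, embeddings, k=5):
    """Return the k words whose vectors are most similar to `word`'s vector."""
    query = embeddings[word]
    scored = [(cosine(query, vec), other)
              for other, vec in embeddings.items() if other != word]
    return [other for _, other in sorted(scored, reverse=True)[:k]]

embeddings = {  # toy vectors, not real Polyglot embeddings
    "pine": [0.9, 0.1],
    "oak": [0.8, 0.2],
    "rose": [0.1, 0.9],
    "daisy": [0.2, 0.8],
}
neighbors = nearest_neighbors("pine", embeddings, k=3)
```

With a real vocabulary this brute-force scan is linear in the number of words per query; an approximate nearest-neighbor index would usually replace it at scale.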


Page 9:

Applications of Word Embeddings

● Part of Speech (POS) tagging: what is a noun, verb, adjective? [CoNLL 2013]

● Entity Recognition: what are the people, places and things mentioned in the text? [SDM 2015]

● Sentiment Analysis: is a document saying positive or negative things? [ACL 2014]

● Transliteration: how can we say the same thing in different languages and recognize friends?

Page 10:

Polyglot-NER Demo @ https://bit.ly/polyglot-ner

R. Al-Rfou, B. Perozzi, and S. Skiena, “Polyglot-NER: Massive Multilingual Named Entity Recognition”, SIAM Conf. Data Mining (SDM 2015)

Legend: Location, Organization, Person

Page 11:

Talk Outline

● Introduction: Word Embeddings
● Multilingual Language Processing (NLP)
● Name Embeddings
● DeepWalk: Feature Extraction from Graphs
● AI Institute

Page 12:

Your Name Tells a Lot About You

● Your gender (male, female)
● Your ethnicity (white, black, Hispanic, Asian/Pacific Islander)
● Your nationality (which country is your family from?)
● Your marital status (X Y-Z)
● Your socio-economic status (Jethro vs. Archibald)
● Your age (Fannie vs. Caitlin)

How can we capture these nuances into a feature representation for classification and other machine learning tasks?

Page 13:

Homophily and Communications Patterns

Brad Pitt: Angelina Jolie, Jennifer Aniston, George Clooney, Cate Blanchett, Julia Roberts
Saddam Hussein: Tarik Aziz, Uday Hussein, Samira Shahbandar, Sajida Talfah
Donald Trump: Mike Pence, Vladimir Putin, Paul Ryan, Ivanka Trump, Mitch McConnell
Xi Jinping: Hu Jintao, Jiang Zemin, Peng Liyuan, Xi Mingze, Ke Lingling

Homophily (“love of the same”) is the tendency for people to associate with people similar to them.

Our analysis of 57 million email contact lists from a major Internet company provided us with sequences of name tokens sufficient for training distributed word embeddings.

“Nationality Classification Using Name Embeddings”, Junting Ye, Shuchu Han, Yifan Hu, Baris Coskun, Meizhu Liu, Steven Skiena (CIKM 2017).

Page 14:

Regions in Name Space

These embeddings make natural features in any learning task where you have names:

● Nationality/ethnicity detection for biomedical/sociology research
● Demographic analysis
● Social media analysis
● Security

Page 15:

NamePrism: A nationality classifier (www.name-prism.com)

Research Project Goal | Research Group | Country
“the impact of having foreign-sounding name on job search in financial” | Nanyang Technological University | Singapore
“determine if ethnic group size impacts national cabinet diversity” | Department of Political Science, Washington University in St. Louis | U.S.
“promote the contributions of Iranian Americans to members within and outside of the Iranian community living in America.” | Iranian Americans' Contributions Project | U.S.
“determine if ethnicity plays a part/plays no part in whether a written evidence submitted to a Parliamentary Inquiry is accepted or rejected” | Parliamentary Digital Service | UK
“working on a study on the network effects for long term unemployed” | German Institute for Employment Research | Germany
“unveiling the origins of French citizens in order to study discrimination in several areas of the French society” | Laboratoire Interdisciplinaire Sciences Innovations Sociétés (LISIS) | France
“Investigate whether hosts on Airbnb get discriminated based on their ethnicity” | Stockholm School of Economics | Sweden

● WIRED Magazine;

● API used by 156 social science research projects

● 69.9 million names analyzed (the population of France)

Page 16:

Talk Outline

● Introduction: Word Embeddings
● Multilingual Language Processing (NLP)
● Name Embeddings
● DeepWalk: Feature Extraction from Graphs
● AI Institute

Page 17:

Features From Graphs

● Anomaly Detection
● Attribute Prediction
● Clustering
● Link Prediction
● ...

[Figure: adjacency matrix of size |V|]

A first step in machine learning for graphs is to extract graph features:

● node degree
● pairs: # of common neighbors
● groups: cluster assignments
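The first two hand-built features listed above are simple to compute from an adjacency structure. A minimal Python sketch, assuming an undirected graph stored as a dictionary of neighbor sets; the toy graph and the function names `degree` and `common_neighbors` are illustrative assumptions:

```python
graph = {  # toy undirected graph: node -> set of neighbors
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

def degree(graph, node):
    """Node feature: the number of neighbors of `node`."""
    return len(graph[node])

def common_neighbors(graph, u, v):
    """Pair feature: the number of neighbors shared by `u` and `v`."""
    return len(graph[u] & graph[v])

degrees = {node: degree(graph, node) for node in graph}
shared = common_neighbors(graph, "a", "d")
```

Features like these must be designed one at a time per task, which is the motivation for learning latent features directly from the graph.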

Page 18:

Advantages of DeepWalk

Bryan Perozzi, DeepWalk: Online Learning of Social Representations

[Figure: DeepWalk takes the adjacency matrix (|V| columns) and produces latent dimensions (d << |V|), which serve the tasks below.]

● Anomaly Detection
● Attribute Prediction
● Clustering
● Link Prediction
● ...

● Scalable: an online algorithm that does not use the entire graph at once
● Walks as sentences metaphor
● Works great!
● Implementation available: bit.ly/deepwalk

Page 19:

DeepWalk: The Entire Idea

Short random walks = sentences

■ We generate random walks for each vertex in the graph.
■ Each short random walk has length t.
■ Pick the next step uniformly from the vertex neighbors.
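The three steps above can be sketched directly in Python. This is an illustration under stated assumptions rather than the reference implementation: the graph is a dictionary of neighbor lists, "length t" is taken to mean t vertices per walk, vertices are visited in shuffled order on each pass, and the names `random_walk` and `build_corpus` are hypothetical.

```python
import random

def random_walk(graph, start, length, rng):
    """One short walk: repeatedly step to a uniformly chosen neighbor."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

def build_corpus(graph, walks_per_vertex, length, seed=0):
    """Generate `walks_per_vertex` walks starting from every vertex."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_vertex):
        vertices = list(graph)
        rng.shuffle(vertices)  # assumption: visit vertices in random order
        for vertex in vertices:
            corpus.append(random_walk(graph, vertex, length, rng))
    return corpus

graph = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
corpus = build_corpus(graph, walks_per_vertex=2, length=5)
```

Each walk in `corpus` then plays the role of a sentence whose "words" are vertex ids, so the same embedding training used for text can be applied to it.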

Page 20:

Everyone is doing the DeepWalk!

2177 citations since August 2014!

28th in downloads from the ACM Digital Library over the past 12 months, from among all 4,188 KDD papers ever! (By comparison, the second best from KDD ’14 ranks 104th.)

You, too, can do the DeepWalk!

Page 21:

Identifying Historically Similar Entities

We can construct embeddings for people, places and things, to recognize similar entities.

Y. Chen, B. Perozzi, and S. Skiena, “Vector-Based Similarity Measurements for Historical Figures”, SISAP 2015.

Page 22:

DeepWalk: Nearest Neighbors

Scarlett Johansson
● Kirsten Dunst (0.784)
● Natalie Portman (0.786)
● Gwyneth Paltrow (0.796)
● Brad Pitt (0.858)
● Cameron Diaz (0.891)

Steven Skiena
● Larry Page (1.597)
● Sergey Brin (1.598)
● Danny Hillis (1.644)
● Andrei Broder (1.652)
● Mark Weiser (1.653)

Barack Obama
● George W. Bush (0.474)
● Hillary Clinton (0.657)
● Bill Clinton (0.658)
● Joe Biden (0.750)
● Al Gore (0.791)

Albert Einstein
● Richard Feynman (1.049)
● Max Planck (1.073)
● Freeman Dyson (1.107)
● Stephen Hawking (1.153)
● Robert Oppenheimer (1.156)

Ludwig van Beethoven
● Franz Schubert (0.489)
● Johannes Brahms (0.532)
● Wolfgang Mozart (0.567)
● Robert Schumann (0.576)
● Gustav Mahler (0.635)

Mick Jagger
● John Lennon (0.687)
● Keith Richards (0.687)
● Paul McCartney (0.796)
● Ronnie Wood (0.822)
● Eric Clapton (0.833)

Page 23:

Institute for AI-Driven Discovery and Innovation

Professor Steven Skiena
Director, AI Institute
November 2019

Page 24:

Mission of the Institute

• Promote activity in AI and related areas towards attracting more federal, state, industrial and private funding.
• Stimulate research and educational activities in AI in CS and across CEAS.
• Make Stony Brook a more attractive destination for graduate students and faculty interested in AI and related areas.
• Check out our website: ai.stonybrook.edu and @AI_SBU on Twitter.

Page 25:

Stony Brook AI Institute
Major Accomplishments 2018-19

• Building a stronger Stony Brook AI community
• Three SUNY EIP hires (Haibin Ling, Michael Ryoo, Zhaozheng Yin)!!
• Junior hire in machine learning (Yifan Sun, starting Fall 2020)
• NSF Major Research Infrastructure (MRI) grant for large AI/ML cluster
• First substantial philanthropic gift ($375K) for postdoctoral scholars program
• Communications hire (Dan Olawski)
• Bloomberg AI Institute kickoff

Page 26:

Core Faculty

Page 27:

Research Interests

We have strong research groups in several areas of AI:

• Biomedical Informatics
• Computer Vision
• Computational Logic and Reasoning
• Data Science
• Machine Learning
• Natural Language Processing
• Social Media Analysis

Page 28:

Educational Initiatives

• New undergraduate courses in Data Science and Natural Language Processing
• New undergraduate concentration in Artificial Intelligence
• New graduate concentration in Data Science, approved
• DS+X initiative with College of Arts and Sciences (CEAS)

Page 29:

Student Collaborators: Stony Brook Data Science Lab

Bryan Perozzi, Vivek Kulkarni, Rami al-Rfou, Junting Ye, Haochen Chen, Yingtao Tian

Page 30:

Manual Laboring

Page 31:

Questions?