Multilinguality to the Rescue

Manaal Faruqui & Chris Dyer, Language Technologies Institute, SCS, CMU


Page 1: Multilinguality to the Rescue

Multilinguality to the Rescue

Manaal Faruqui & Chris Dyer

Language Technologies Institute

SCS, CMU

Page 2: Multilinguality to the Rescue

Multilinguality

Using more than one language at a time

Image source: https://buffy.eecs.berkeley.edu/PHP/resabs/images/2006//101268-1.png

Page 3: Multilinguality to the Rescue

Multilinguality

Why ?

Bank

Images: http://www.realestategolfodulce.com/ , http://thetrustadvisor.com/

बैंक (financial institution)

तट (river bank, shore)

Cross-lingual Word Sense Disambiguation (Diab and Resnik, 2002)
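The idea on this slide can be sketched in a few lines: if an aligned Hindi translation of "bank" is available, the translation itself identifies the sense. This is only a toy lookup, not the algorithm of Diab and Resnik (2002); the sense labels and the function name are hypothetical.

```python
# Toy sense inventory keyed by the Hindi translation.
# The sense labels are hypothetical names chosen for this illustration.
SENSE_BY_TRANSLATION = {
    "बैंक": "bank/financial_institution",
    "तट": "bank/river_side",
}


def sense_from_translation(english_word, hindi_translation):
    """Return a sense label for `english_word` given its aligned Hindi word."""
    if english_word.lower() != "bank":
        return None  # this toy inventory only covers "bank"
    return SENSE_BY_TRANSLATION.get(hindi_translation)


print(sense_from_translation("Bank", "तट"))
```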

Page 4: Multilinguality to the Rescue

Multilinguality

Why ?

Bilingual Word Clustering (Faruqui & Dyer, 2013)

Page 5: Multilinguality to the Rescue

Multilinguality

Why ?

Bilingual Word Clustering (Faruqui & Dyer, 2013)

Page 6: Multilinguality to the Rescue

Multilinguality

Using data from other languages

Direct: Assume foreign = original language

Indirect: Extract information from foreign language

Page 7: Multilinguality to the Rescue

Direct Information Transfer

[Diagram: Language 1 data and Language 2 data are fed into an NLP System, which produces Output]

Page 8: Multilinguality to the Rescue

Direct Information Transfer

Why would it work ?

• Works for specific tasks like NER
• Many NEs retain their “orthographic” form
• Across languages that use the same “alphabet”
  • English, German, French, Spanish
  • Hindi, Marathi, Bihari
• Especially proper nouns
  • Names of locations
    • USA, London, New York, Pittsburgh
  • Names of people
    • Obama, William, Roger

Page 9: Multilinguality to the Rescue

Direct Information Transfer

Barack Obama hat 2012 mit dieser Strategie die Präsidentschaftswahlen gewonnen.

The Obama administration has poured billions of dollars into expanding the reach of the Internet.

Pour finir, en défendant les bonus et en tentant de faire dérailler les nouvelles règles prudentielles, ce démocrate s'est mis à dos Barack Obama.

... sagte Jimmy Wales dem Wall Street Journal in einem Interview in Hongkong.

Mads Refslund, executive chef at Acme, forages in the overgrown spaces and hidden markets of Hongkong for regional delicacies.

Les sacs de luxe, nouvelle monnaie d'échange à Hongkong.
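The point of these examples can be checked mechanically: the named entity is the one capitalized token that survives unchanged in all three languages. The sketch below is a rough illustration only; treating "capitalized and not sentence-initial" as a stand-in for "possible named entity" is an assumption made here, and the helper names are hypothetical.

```python
import re

OBAMA = [
    "Barack Obama hat 2012 mit dieser Strategie die Präsidentschaftswahlen gewonnen.",
    "The Obama administration has poured billions of dollars into expanding the reach of the Internet.",
    "Pour finir, en défendant les bonus et en tentant de faire dérailler les nouvelles "
    "règles prudentielles, ce démocrate s'est mis à dos Barack Obama.",
]
HONGKONG = [
    "... sagte Jimmy Wales dem Wall Street Journal in einem Interview in Hongkong.",
    "Mads Refslund, executive chef at Acme, forages in the overgrown spaces and hidden "
    "markets of Hongkong for regional delicacies.",
    "Les sacs de luxe, nouvelle monnaie d'échange à Hongkong.",
]


def capitalized_tokens(sentence):
    """Capitalized word tokens, skipping the first word of the sentence."""
    tokens = re.findall(r"\w+", sentence)
    return {tok for tok in tokens[1:] if tok[0].isupper()}


def shared_capitalized(sentences):
    """Capitalized tokens that appear, in identical form, in every sentence."""
    return set.intersection(*(capitalized_tokens(s) for s in sentences))


print(shared_capitalized(OBAMA))     # {'Obama'}
print(shared_capitalized(HONGKONG))  # {'Hongkong'}
```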

Page 10: Multilinguality to the Rescue

Direct Information Transfer

Semantic Generalization

Deutschland (100)

Ostdeutschland (5)

Westdeutschland (0)

LOC
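A minimal sketch of why clusters help, assuming the numbers on the slide are training-data counts and that all three words fall into one word cluster: a word never seen in training can inherit the label of its cluster. The cluster ID "c17" and the function names are hypothetical, and a real NER system such as Stanford NER would use the cluster ID as one feature among many rather than as a direct lookup.

```python
from collections import Counter, defaultdict

# Counts from the slide, read here as occurrences in the NER training data.
TRAIN_COUNTS = {"Deutschland": 100, "Ostdeutschland": 5, "Westdeutschland": 0}

# Hypothetical cluster assignment learned from unlabeled text.
WORD_TO_CLUSTER = {word: "c17" for word in TRAIN_COUNTS}

# Toy tagged training data: every seen occurrence is labeled LOC.
TAGGED = [(word, "LOC") for word, n in TRAIN_COUNTS.items() for _ in range(n)]


def train_cluster_labels(tagged_words, word_to_cluster):
    """Majority NE label observed for each cluster in the training data."""
    votes = defaultdict(Counter)
    for word, label in tagged_words:
        cluster = word_to_cluster.get(word)
        if cluster is not None:
            votes[cluster][label] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}


def predict(word, cluster_labels, word_to_cluster):
    """Label a word through its cluster; 'O' means not a named entity."""
    return cluster_labels.get(word_to_cluster.get(word), "O")


cluster_labels = train_cluster_labels(TAGGED, WORD_TO_CLUSTER)
print(predict("Westdeutschland", cluster_labels, WORD_TO_CLUSTER))  # LOC
```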

Page 11: Multilinguality to the Rescue

Direct Information Transfer

How?

[Diagram: NER System with the labels Language 1, Language 2, Training data, Word Clusters, Input, and NE-tagged Text]

Page 12: Multilinguality to the Rescue

Evaluation

Tools

• Stanford NER for training (Finkel and Manning, 2009)
  • In-built functionality to use word clusters for generalization
• Word clustering software (distributional + morphological) (Clark, 2003)

Data

• NER training data
  • German, English: CoNLL 2003
  • Dutch, Spanish: CoNLL 2002
• Generalization data
  • WMT-2012 news commentary: 200 million tokens
  • English, German, French, Spanish, Czech

Page 13: Multilinguality to the Rescue

Results

Page 14: Multilinguality to the Rescue

Results

Page 15: Multilinguality to the Rescue

Results

Improvement in F1 scores by NE type

Page 16: Multilinguality to the Rescue

Quick Takeaways

• Multilingual data can be put to use for monolingual benefits

• The amount of help depends on how similar the two languages are “orthographically”

Page 17: Multilinguality to the Rescue

Indirect Information Transfer

[Diagram: Language 1 data “+” Language 2 data are fed into an NLP System, which produces Output]

Page 18: Multilinguality to the Rescue

Vector Space Word Models

Image: http://www.emeraldinsight.com

Page 19: Multilinguality to the Rescue

Vector Space Models

Image: http://d1avok0lzls2w.cloudfront.net/

Page 20: Multilinguality to the Rescue

Vector Space Models

Monolingual Word Vectors 1 + Monolingual Word Vectors 2

Better Monolingual Word Vectors 1 ??

Page 21: Multilinguality to the Rescue

Indirect Information Transfer

“+” = Canonical Correlation Analysis

[Diagram: an n × d1 matrix and an n × d2 matrix are combined (“+”) into two n × k matrices]
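As a reminder of what CCA optimizes, the standard textbook objective for a single pair of projection directions is sketched below. The symbols x, y, wx, wy follow the next slide; the covariance matrices Σxx, Σyy, Σxy are notation assumed here and do not appear on the slides.

```latex
\[
(w_x, w_y) \;=\; \arg\max_{w_x,\, w_y}
\frac{w_x^{\top} \Sigma_{xy}\, w_y}
     {\sqrt{w_x^{\top} \Sigma_{xx}\, w_x}\;\sqrt{w_y^{\top} \Sigma_{yy}\, w_y}}
\]
```

Repeating this for k mutually uncorrelated directions gives projection matrices of size d1 × k and d2 × k.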

Page 22: Multilinguality to the Rescue

Canonical Correlation Analysis

x (n × d1) * wx (d1 × k) = n × k

y (n × d2) * wy (d2 × k) = n × k
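The projection on this slide can be sketched with NumPy. The slides state that a Matlab toolkit was used for CCA, so the code below is not the authors' implementation; it is one common way to compute CCA (whitening followed by an SVD). The function name `cca`, the `reg` term, and the random data are assumptions made for illustration.

```python
import numpy as np


def cca(x, y, k, reg=1e-8):
    """Return projections wx (d1 x k), wy (d2 x k) and the top-k correlations."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    # Covariance estimates; `reg` keeps the matrices safely invertible.
    cxx = x.T @ x / (n - 1) + reg * np.eye(x.shape[1])
    cyy = y.T @ y / (n - 1) + reg * np.eye(y.shape[1])
    cxy = x.T @ y / (n - 1)

    def inv_sqrt(c):
        # Inverse square root of a symmetric positive-definite matrix.
        vals, vecs = np.linalg.eigh(c)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    cxx_is, cyy_is = inv_sqrt(cxx), inv_sqrt(cyy)
    # Singular values of the whitened cross-covariance are the correlations.
    u, s, vt = np.linalg.svd(cxx_is @ cxy @ cyy_is)
    wx = cxx_is @ u[:, :k]
    wy = cyy_is @ vt.T[:, :k]
    return wx, wy, s[:k]


# Toy data: y shares its first two dimensions with x, up to noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
y = np.hstack([x[:, :2] + 0.1 * rng.normal(size=(200, 2)),
               rng.normal(size=(200, 2))])
wx, wy, corrs = cca(x, y, k=2)
x_proj, y_proj = x @ wx, y @ wy
print(x_proj.shape, y_proj.shape)  # (200, 2) (200, 2)
```

Here `x_proj` and `y_proj` play the role of the two n × k matrices on the slide.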

Page 23: Multilinguality to the Rescue

Indirect Information Transfer

[Diagram: Word Vectors in Language 1 and Word Vectors in Language 2 → Obtain 1-to-1 mapping using word alignments → “+” → Word Vectors in Language 1 and Word Vectors in Language 2]
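The 1-to-1 mapping step can be sketched as follows, assuming each Language 1 word is mapped to the Language 2 word it is aligned to most often. The slide does not spell out this rule, so treat it as an assumption; the aligned pairs and the function name are made up for illustration.

```python
from collections import Counter, defaultdict


def one_to_one_mapping(aligned_pairs):
    """Map each source word to its most frequently aligned target word."""
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    return {src: tgts.most_common(1)[0][0] for src, tgts in counts.items()}


# Hypothetical (English, German) word pairs read off a word-aligned corpus.
pairs = [
    ("bank", "Bank"), ("bank", "Bank"), ("bank", "Ufer"),
    ("house", "Haus"),
]
print(one_to_one_mapping(pairs))  # {'bank': 'Bank', 'house': 'Haus'}
```

Stacking the vectors of the mapped word pairs row by row yields two matrices with the same number of rows n, which is the input shape CCA expects.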

Page 24: Multilinguality to the Rescue

Experiments

Task: Word Pair Reranking
• Rank a list of word pairs according to semantic similarity

Datasets
• WS-353: 353 word pairs
• RG-65: 65 noun pairs

Truncation
• Maybe the correlation introduces noise
• Keep only the top k% of correlated dimensions
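A minimal sketch of this setup, assuming cosine similarity as the model score and Spearman rank correlation against human ratings as the metric (the usual choice for WS-353 and RG-65, although the slide does not name it). The toy vectors and gold scores are invented, and `truncate` assumes the dimensions are already sorted from most to least correlated.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def ranks(values):
    """Rank positions (0 = smallest); assumes no ties, for simplicity."""
    order = np.argsort(values)
    r = np.empty(len(values))
    r[order] = np.arange(len(values))
    return r


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])


def truncate(vectors, keep_fraction):
    """Keep the leading fraction of dimensions of an (n x k) matrix."""
    k = max(1, int(round(keep_fraction * vectors.shape[1])))
    return vectors[:, :k]


def evaluate(word_vectors, scored_pairs):
    """Correlate model similarities with gold similarity scores."""
    model = [cosine(word_vectors[a], word_vectors[b]) for a, b, _ in scored_pairs]
    gold = [score for _, _, score in scored_pairs]
    return spearman(model, gold)


# Hypothetical toy vectors and gold ratings.
vectors = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.9, 0.1]),
    "car": np.array([0.0, 1.0]),
}
gold_pairs = [("cat", "dog", 9.0), ("cat", "car", 1.0), ("dog", "car", 2.0)]
print(evaluate(vectors, gold_pairs))
print(truncate(np.ones((3, 10)), 0.8).shape)  # (3, 8)
```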

Page 25: Multilinguality to the Rescue

Evaluation

Tools

• Word vectors: RNNLM Toolkit (Mikolov, 2009)
• Word alignments: cdec (Dyer et al., 2013)
• CCA: Matlab Toolkit

Data

• Word vector monolingual training data
  • WMT news commentary: 2011, 2012
  • English, French, Spanish, German
• Word alignment data
  • WMT news commentary 2010, 09, 08, 07, 06
  • {French, Spanish, German} - English

Page 26: Multilinguality to the Rescue

Results

Page 27: Multilinguality to the Rescue

Results

Page 28: Multilinguality to the Rescue

Original English Vectors

Page 29: Multilinguality to the Rescue

German Projected on English

Page 30: Multilinguality to the Rescue

Conclusion

• Word vector quality can be improved using multilingual data
  • At least for lexical semantic tasks

• The amount of help provided by these languages depends on how similar they are to each other

• A task like NER can use data from multiple languages in a simple framework

Page 31: Multilinguality to the Rescue

Thank You!