June 14 2017 Berlin NLP crosslinguistic Kate McCurdy word...

Grammatical and topical gender in crosslinguistic word embeddings Kate McCurdy Berlin NLP June 14 2017


Page 1: June 14 2017 Berlin NLP crosslinguistic Kate McCurdy word ...anacode.de/.../06/Kate-McCurdy-Grammatical-gender... · German - grammatical gender Spanish - grammatical gender Dutch

Grammatical and topical gender in crosslinguistic word embeddings
Kate McCurdy
Berlin NLP
June 14 2017

Page 2:

Word embeddings: From (almost) scratch to NLP

● Goal: word representations that...
○ capture maximal semantic/syntactic information, yet
○ require minimal task-specific feature engineering
● Neural embeddings to the rescue!
○ Input: barely processed, massive corpora
■ In general: tokenization + trimming the long tail in vocab
■ Collobert et al.: capitalization as feature + a few extra tweaks
■ Mikolov et al.: n-gram phrase identification
○ Output: dense, magically performant vectors
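The "barely processed" input step can be illustrated with a minimal sketch. It assumes a crude regex tokenizer and a hypothetical `min_count` threshold, neither of which comes from the talk. Note that it replaces rare tokens with a placeholder, whereas toolkits such as word2vec typically drop them outright.

```python
import re
from collections import Counter

def tokenize(line):
    """Lowercase a line and split it into runs of letters, digits, and apostrophes.
    This is a deliberately crude tokenizer, used only for illustration."""
    return re.findall(r"[a-z0-9']+", line.lower())

def trim_vocab(sentences, min_count=2, unk="<unk>"):
    """Trim the long tail: replace tokens seen fewer than min_count times.
    min_count and the <unk> placeholder are assumptions, not values from the slides."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok if counts[tok] >= min_count else unk for tok in sent]
            for sent in sentences]

raw = ["The cat sat on the mat.", "The dog sat on the log."]
sentences = [tokenize(line) for line in raw]
trimmed = trim_vocab(sentences, min_count=2)
print(trimmed[0])  # ['the', '<unk>', 'sat', 'on', 'the', '<unk>']
```

Everything beyond this step (the actual embedding training) is left to the toolkit.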

Page 3:

… but there are pitfalls

Page 4:

You shall know a word by the company it keeps.

Firth 1957

Page 5:

Pitfall #1

What if your words keep company with some unsavory stereotypes?

Page 6:

Figure: analogous relations in the GloVe word embedding (from Caliskan-Islam et al. 2016)

Page 7:

Stereotypes in word embeddings: Bolukbasi et al. 2016

addiction : eating disorder
accountant : paralegal
pilot : flight attendant
athlete : gymnast
professor emeritus : associate professor

Page 8:

Bias in humans: the Implicit Association Test

● Standard psychological test to assess implicit bias
● Design:
○ Two sets of attribute words
■ Male, man, boy, …
■ Female, woman, …
○ Two sets of target words
■ Children, wedding, …
■ Office, salary, …
○ Task: left vs right fast categorization of both sets
● Measurement: differential association in average response time

Greenwald et al. 1998

Page 9:

● WEAT: the Word Embedding Association Test
● Parallels the Implicit Association Test
● Measures the differential association between paired target and attribute word sets via cosine similarity
● Core finding: nearly every single prejudice uncovered by the IAT is replicated by the WEAT on Google News + GloVe word embeddings
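A minimal sketch of the WEAT measurement, following the test statistic and effect size defined in Caliskan et al. (2017): each target word's association is its mean cosine similarity to attribute set A minus its mean cosine similarity to attribute set B. The toy 2-D vectors are invented for illustration, and the use of the sample standard deviation is an assumption of this sketch.

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def association(w, A, B):
    """s(w, A, B): mean similarity of w to attribute set A minus that to B."""
    return mean([cosine(w, a) for a in A]) - mean([cosine(w, b) for b in B])

def weat_statistic(X, Y, A, B):
    """Differential association of target sets X and Y with attribute sets A and B."""
    return (sum(association(x, A, B) for x in X)
            - sum(association(y, A, B) for y in Y))

def weat_effect_size(X, Y, A, B):
    """Difference of mean associations, scaled by the spread over all target words.
    Uses the sample standard deviation (an assumption of this sketch)."""
    assoc_x = [association(x, A, B) for x in X]
    assoc_y = [association(y, A, B) for y in Y]
    return (mean(assoc_x) - mean(assoc_y)) / stdev(assoc_x + assoc_y)

# Toy vectors, invented for illustration: X leans toward A, Y leans toward B.
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]
X = [[1.0, 0.1], [0.9, 0.2]]
Y = [[0.1, 1.0], [0.2, 0.9]]

print(weat_statistic(X, Y, A, B) > 0)    # True: X is closer to A, Y to B
print(weat_effect_size(X, Y, A, B) > 0)  # True
```

With real embeddings, X and Y would hold the vectors for the target word sets and A and B those for the attribute word sets.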

Page 10:

Pitfall #1

What if your words keep company with some unsavory stereotypes?

Page 11:

Pitfall #2

What if your content words hang out with your function words and make weird artefacts?

Page 12:

Work with Oguz Serbetci (not pictured)

Crosslinguistic word embeddings

Page 13:

Data

● Corpus: OpenSubtitles
● ~5.5K movies with subtitles in 4 languages (2.6-2.9m ws):
○ German - grammatical gender
○ Spanish - grammatical gender
○ Dutch - grammatical gender orthogonal to “natural” gender
○ English - “natural” gender
● Lemmatized each corpus to remove gender
● Trained 10 word2vec CBOW embeddings per condition:
○ Language (4) x
○ Corpus version (2 - unprocessed vs lemmatized)
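The experimental design above can be enumerated with a short sketch. The constant names and the use of an integer seed to distinguish the 10 embeddings per condition are assumptions for illustration; the slides do not say how the repeated runs differ.

```python
from itertools import product

LANGUAGES = ["German", "Spanish", "Dutch", "English"]
CORPUS_VERSIONS = ["unprocessed", "lemmatized"]
RUNS_PER_CONDITION = 10  # "10 word2vec CBOW embeddings per condition"

def training_runs():
    """Yield one (language, corpus_version, seed) triple per embedding to train.
    The seed is a hypothetical way of telling repeated runs apart."""
    for language, version in product(LANGUAGES, CORPUS_VERSIONS):
        for seed in range(RUNS_PER_CONDITION):
            yield language, version, seed

runs = list(training_runs())
print(len(runs))  # 4 languages x 2 corpus versions x 10 runs = 80
```

Each triple would correspond to one word2vec CBOW training run on the matching corpus.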

Page 14:

Method

● Measurement:
○ differential association using the Word Embedding Association Test (WEAT - Caliskan et al.)

{male} {female}

{career} {family}


Page 16:

Method

● Measurement:
○ differential association using the Word Embedding Association Test (WEAT - Caliskan et al.)
● Comparisons:
○ “Topical” semantic gender bias
■ replicate IAT findings of Caliskan et al. on dimension male:career::female:family
○ Grammatical gender bias
■ use stimuli from Phillips & Boroditsky on dimension male:masculine::female:feminine
■ e.g. Spanish el sol (m), German die Sonne (f)

Page 17:

Topical gender bias

≈ average increase in cosine similarity per word

Page 18:

[Figure panels: Topical gender bias; Grammatical gender bias]

Page 19:

Pitfall #2

What if your content words hang out with your function words and make weird artefacts?

Page 20:

Words can keep strange company!

And arbitrary properties like grammatical gender can distort your embeddings.

Page 21:

Thanks! Questions?

Page 22:

References

Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Quantifying and reducing stereotypes in word embeddings. arXiv Preprint arXiv:1606.06121.

Caliskan-Islam, A., Bryson, J. J., & Narayanan, A. (2016). Semantics derived automatically from language corpora necessarily contain human biases. arXiv Preprint arXiv:1608.07187.

Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug), 2493–2537.

Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis (pp. 1–32). Oxford: Blackwell.

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. (1998). Measuring individual differences in implicit cognition: the implicit association test. Journal of Personality and Social Psychology, 74(6), 1464.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

Page 23:

Appendix

Page 24:

Interaction between topical and grammatical gender effects in DE + ES


Page 27:

Stereotypes in word embeddings: Bolukbasi et al. 2016

1. Define gender subspace
2. Project profession names onto subspace
3. Generate analogies & get stereotype ratings from MTurk

addiction : eating disorder
accountant : paralegal
pilot : flight attendant
athlete : gymnast
professor emeritus : associate professor

Page 28:

Stereotypes in word embeddings: Bolukbasi et al. 2016

1. Define gender subspace
2. Project profession names onto subspace
3. Generate analogies & get stereotype ratings from MTurk
4. Compute transformation matrix to debias designated words
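Steps 1, 2, and 4 can be sketched on toy vectors. This is a simplified stand-in, not the authors' procedure: the gender subspace is reduced to a single direction obtained by averaging difference vectors of word pairs, and debiasing is shown as removing a word's component along that direction rather than learning a transformation matrix. All vectors and function names are invented for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

def gender_direction(pairs):
    """Step 1 (simplified): average the difference vectors of gendered word
    pairs and normalize. A one-dimensional stand-in for a gender subspace."""
    dim = len(pairs[0][0])
    total = [0.0] * dim
    for first, second in pairs:
        for i in range(dim):
            total[i] += first[i] - second[i]
    return normalize([t / len(pairs) for t in total])

def project(w, g):
    """Step 2: scalar projection of word vector w onto the unit direction g."""
    return dot(w, g)

def neutralize(w, g):
    """Step 4 (stand-in): remove the component of w along g."""
    p = dot(w, g)
    return [wi - p * gi for wi, gi in zip(w, g)]

# Toy 3-D vectors, invented for illustration.
he, she = [1.0, 0.2, 0.0], [-1.0, 0.2, 0.0]
pilot = [0.4, 0.5, 0.3]

g = gender_direction([(he, she)])
print(project(pilot, g))                  # 0.4
print(project(neutralize(pilot, g), g))   # 0.0
```

After neutralizing, the toy "pilot" vector has no remaining component along the gender direction, while its other coordinates are unchanged.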