Gender, language, and Twitter: Social theory and computational methods


Tyler Schnoebelen (including work with David Bamman and Jacob Eisenstein)

Tweet this talk! @Tschnoebelen

Welcome to the slide-u-ment

• Hi, you may want to check out the “Notes” fields for additional context.

At its most basic

• Assumption 1: Men and women use different vocabularies
– Hypothesis I: Computational methods can cut through noise and predict speaker gender based on the words they use
• Assumption 2: Social networks are typically “homophilous” (birds of a feather flock together)
– Hypothesis II: Adding the gender make-up of a user’s social network should get even better prediction

Let’s say we can predict gender

• So what?
• Does it license us to connect words/word groups to the social category in question?
• This assumes that gender is
– Stable
– The primary driving force

Our actual goal

• Problematize gender prediction as a task
– Define a system where we could just “stop” and call it good
– But NOT ACTUALLY STOP

• Demonstrate that simple gender binaries aren’t actually descriptively accurate

• Show ways to combine social theory and computational methods that expand the questions on both sides

QUICK LITERATURE REVIEW

“Standard” is a keyword

Typical findings
• Women use standard variables more often than men.
– In fact, early dialectologists ignored women completely because they wanted “NORMS”: non-mobile, older, rural male speakers, seen as preserving the purest regional (non-standard) forms (see Chambers and Trudgill 1980).
– Did they do it for prestige (to acquire social capital)?
– To avoid losing status?
– Are women actually creating norms, not following them?

Computational/corpus work

• People are fascinated by gender differences
• To get statistical significance, you need enough data to detect a signal

• In the past, this has led researchers to roll up words into word classes

The most common distinctions

• Men use informative language
– Prepositions (to), attributive adjectives (fat), higher word lengths (gargantuan)
• Women use involved language
– First and second person pronouns (you), present tense verbs (goes), contractions (don’t)

• (Argamon, Koppel, Fine, & Shimoni, 2003; Herring & Paolillo, 2006b; Schler, Koppel, Argamon, & Pennebaker, 2006…they are working off of dimensions in Biber 1995 and Chafe 1982)

Or “contextuality”

• Men are formal and explicit
– Nouns (floor), adjectives (big), prepositions (to), articles (the)
• Women are deictic and contextual
– Pronouns (you), verbs (run), adverbs (happily), interjections (oh!)
• “Contextuality” decreases when an unambiguous understanding is more important or difficult, that is, when people are physically or socially farther away

• (Mukherjee & Liu, 2010; Nowson, Oberlander, & Gill, 2005 building off of Heylighen and Dewaele 2002)

Are all nouns really the same?

And what about…

Our approach also lumps

• It’s just at a lower level: instead of “nouns” or “blog words”, we assume all usages of a unigram are identical
• Lumping itself isn’t a problem. In fact, you have to.
– But ideologies are going to structure your lumpings and divisions, so watch out!

OUR WORK (WITH DAVID BAMMAN AND JACOB EISENSTEIN)

Data
• Public Twitter messages in same-gender and cross-gender social networks
– Word frequencies (unigrams)
– Gender (induced from first names using the Social Security Administration data)
• 14,464 Twitter users (56% male)
– Geolocated in the US
– Must use 50 of top 1,000 most frequent words
– Between 4 and 100 ties (at least 2 “mutual @’s” separated by 14 days)
• Women have 58% female friends
• Men have 67% male friends
• 9.2M tweets, Jan-Jun 2011

Twitter has a pretty good swath (Pew)

• Nearly identical usage among women and men:
– 15% of female internet users are on Twitter
– 14% of male internet users
• High usage among non-Hispanic Blacks (28%)
• Even distribution across income and education levels
• Higher usage among young adults (26% for ages 18-29, 4% for ages 65+)

First names are highly gendered

Name | % male | % female
Matt | 100 | 0
Alex | 97 | 3
Chris | 86 | 14
Kelly | 15 | 85
Sarah | 0 | 100

95% of users have a name that is at least 85% associated with one gender
Median user name is 99.6% associated with its majority gender
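A minimal sketch of how gender might be induced from first names, assuming a lookup table of per-name counts in the spirit of the Social Security Administration data. The counts below are invented placeholders, and `majority_gender` with its 0.85 cutoff is a hypothetical helper for illustration, not the authors' actual pipeline.

```python
# Hypothetical per-name counts as (female, male); placeholders, not real SSA figures.
NAME_COUNTS = {
    "sarah": (1000, 0),
    "kelly": (850, 150),
    "chris": (140, 860),
    "matt": (0, 1000),
}

def majority_gender(first_name, counts=NAME_COUNTS, threshold=0.85):
    """Return (gender, share) for a name, or None if unknown or too ambiguous."""
    entry = counts.get(first_name.lower())
    if entry is None:
        return None
    female, male = entry
    total = female + male
    if total == 0:
        return None
    if female >= male:
        gender, share = "F", female / total
    else:
        gender, share = "M", male / total
    # Names whose majority share falls below the threshold are left unlabelled.
    return (gender, share) if share >= threshold else None

for name in ["Sarah", "Kelly", "Chris", "Matt", "Zzyzx"]:
    print(name, majority_gender(name))
```

Raising `threshold` trades coverage for label reliability: fewer users get a label, but each label is more strongly tied to one gender.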

First step: gender prediction

• Logistic regression:
– Will you have a heart attack Y/N?
– Will you vote for X or Y?
– Will your Brazilian Portuguese nouns and modifiers agree in number?
• Logistic regression is the statistical technique at the core of variable rule analysis (Tagliamonte 2006)
• But we’re going to reverse the direction from what sociolinguists typically do

First step: gender prediction

• The relevant linguistic variables aren’t known beforehand

• So the dependent variable—the thing we are trying to predict—is author gender

• The independent variables are the 10,000 most frequent lexical items in the tweets
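To make the setup concrete, here is a small sketch of a bag-of-words logistic regression that predicts author gender from word counts. It is a toy under stated assumptions: the four "tweets", the tiny vocabulary, the learning rate, and the helper names (`featurize`, `train_logreg`, `predict`) are all invented for illustration, and the fit is plain gradient ascent rather than whatever solver the authors used.

```python
import math
from collections import Counter

# Toy labelled tweets; texts and labels are invented purely for illustration.
TRAIN = [
    ("omg love this recipe xoxo", "F"),
    ("love you too xoxo", "F"),
    ("nhl playoffs tonight qb stats", "M"),
    ("new api for ios developers", "M"),
]

def featurize(text, vocab):
    """Bag of words: count each vocabulary item, ignoring word order."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(rows, labels, epochs=200, lr=0.5):
    """Unregularized logistic regression fit by batch gradient ascent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = y - p  # gradient of the log-likelihood w.r.t. the logit
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        bias += lr * grad_b / len(rows)
        weights = [w + lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias

def predict(text, vocab, weights, bias):
    """Label a text 'F' if the modelled probability of a female author is >= 0.5."""
    x = featurize(text, vocab)
    p_female = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
    return "F" if p_female >= 0.5 else "M"

vocab = sorted({w for text, _ in TRAIN for w in text.split()})
X = [featurize(text, vocab) for text, _ in TRAIN]
y = [1 if label == "F" else 0 for _, label in TRAIN]
weights, bias = train_logreg(X, y)
print(predict("xoxo love", vocab, weights, bias))
print(predict("ios api", vocab, weights, bias))
```

Each learned weight plays the role of a per-word "marker" strength: positive weights pull a prediction toward female authors, negative weights toward male authors.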

Preventing overfitting

• This involves estimating a lot of parameters.
• Which raises the risk of overfitting: learning parameter values that perfectly describe the training data but won’t generalize to new data

Why regularize?

Regularization dampens the effect of an individual variable (Hastie et al 2009).

A single regularization parameter controls the tradeoff between perfectly describing the training data and generalizing to unseen data.
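One common way to write this tradeoff is as a penalized log-likelihood. The slides do not say which penalty was used, so the L2 (ridge) form and the symbols below are assumptions made for illustration:

```latex
\hat{\beta} \;=\; \arg\max_{\beta}
\left[ \sum_{i=1}^{N} \log P(y_i \mid x_i; \beta)
\;-\; \lambda \sum_{j=1}^{V} \beta_j^{2} \right]
```

Here $y_i$ is the gender label of author $i$, $x_i$ is that author's vector of word counts, $V$ is the vocabulary size, and $\lambda \ge 0$ is the regularization parameter. With $\lambda = 0$ the model is free to fit the training data as closely as it can; larger values of $\lambda$ shrink every weight $\beta_j$ toward zero, which dampens the influence of any single word.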

Evaluating accuracy

• We use the typical method of cross-validation.
1. Randomly divide the full dataset into 10 parts.
2. Train on 80% of the data
3. Use 10% of the data to tune the regularization parameter
4. Now, use the model to predict the other 10%
5. Compare the predictions to what really happened
• Do this 10 times and take the average.
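The procedure above can be sketched as a generator of train/tune/test splits. This is an illustration only: the function names are hypothetical, and using the "next" fold as the tuning set is one simple way to rotate the 80/10/10 roles, not necessarily the authors' choice.

```python
import random

def ten_fold_splits(n_items, n_folds=10, seed=0):
    """Yield (train, dev, test) index lists: 8 folds train, 1 tunes, 1 tests."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        dev_k = (k + 1) % n_folds  # assumption: the next fold is used for tuning
        test = folds[k]
        dev = folds[dev_k]
        train = [i for j, fold in enumerate(folds)
                 if j not in (k, dev_k) for i in fold]
        yield train, dev, test

def cross_validate(n_items, evaluate):
    """Average the score returned by evaluate(train, dev, test) over all folds."""
    scores = [evaluate(train, dev, test)
              for train, dev, test in ten_fold_splits(n_items)]
    return sum(scores) / len(scores)

# Hypothetical stand-in for "fit on train, tune on dev, score on test".
print(cross_validate(100, lambda train, dev, test: 0.88))
```

Every item lands in the held-out test fold exactly once, so the averaged score never rewards a model for data it was trained or tuned on.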

Gender prediction results

• State-of-the-art accuracy: 88.0%
– Lexical features strongly predict gender
– Ignoring syntax (treating tweets as “bags of words”) does pretty well

Feature | Previous literature | In our data
Pronouns | F | F
Emotion terms | F | F
Family terms | F | Mixed results
"Blog words" (lol, omg) | F | F
Conjunctions | F | F (weakly)
Articles | M | No results
Numbers | M | M
Quantifiers | M | No results
Technology words | M | M
Prepositions | Mixed results | F (weakly)
Swear words | Mixed results | M
Assent | Mixed results | Mixed results
Negation | Mixed results | Mixed results
Emoticons | Mixed results | F
Hesitation markers | Mixed results | F

Top 500 markers for each gender

At a corpus level, women use more non-dictionary words and men use more named entities. In a moment we’ll ask how universal this is.

Hand classification of most frequent 10k words (90.0% agreement)

Category | Female authors | Male authors
Common words in a standard dictionary | 74.2% | 74.9%
Punctuation | 14.6% | 14.2%
Non-standard, unpronounceable words (e.g., :), lmao) | 4.28% | 2.99%
Non-standard, pronounceable words (e.g., luv) | 3.55% | 3.35%
Named entities | 1.94% | 2.51%
Numbers | 0.83% | 0.99%
Taboo words | 0.47% | 0.69%
Hashtags | 0.16% | 0.18%

Involvement

• Using traditional definitions, it looks as if our data confirms:
– men as more informational (all those named entities)
– women as more interactive/involved (pronouns, emoticons, etc.)
• Note that most of the named entities for the men are sports figures and teams

Right. These guys are not “involved”.

Clustering without regard to gender

• We apply probabilistic clustering in order to group authors who are linguistically similar

• Each author is represented as a list of word counts across the 10,000 words used in the classification experiment

Clustering! (Hastie et al 2009)

Easy example: 2 clusters

“Expectation Maximization”
1. Randomly assign all authors to one of 20 clusters
2. Calculate the center of the cluster from the average word counts of all authors put in it
3. Assign each author to the nearest cluster, based on the distance between their word counts and the average word counts of the cluster center
4. Keep iterating through this, moving from random clustering to meaningful clusters
5. Repeat steps 1-4 (25 times)
6. Pick the best
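The steps above can be sketched in a few lines. This is the hard-assignment version the slide describes (each author belongs to exactly one cluster at a time); a fully probabilistic variant would use soft assignments and is not shown. The toy vectors, the squared-distance measure, the handling of empty clusters, and "best" meaning lowest total distance are all assumptions made for this illustration.

```python
import random

def centroid(vectors):
    """Average each word-count dimension over the authors in a cluster."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_once(authors, k, rng, max_iter=100):
    """One run: random start, then alternate centre updates and reassignment."""
    assign = [rng.randrange(k) for _ in authors]
    for _ in range(max_iter):
        centres = []
        for c in range(k):
            members = [a for a, lab in zip(authors, assign) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random author.
            centres.append(centroid(members) if members else list(rng.choice(authors)))
        new_assign = [min(range(k), key=lambda c: sq_dist(a, centres[c]))
                      for a in authors]
        if new_assign == assign:
            break  # no author changed cluster, so the run has converged
        assign = new_assign
    cost = sum(sq_dist(a, centres[lab]) for a, lab in zip(authors, assign))
    return cost, assign

def cluster(authors, k, restarts=25, seed=0):
    """Repeat the whole procedure and keep the run with the lowest total distance."""
    rng = random.Random(seed)
    runs = [cluster_once(authors, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0])

# Toy word-count vectors over a 3-word vocabulary; invented for illustration.
authors = [[5, 0, 1], [4, 1, 0], [6, 0, 0], [0, 5, 4], [1, 6, 5], [0, 4, 6]]
cost, labels = cluster(authors, k=2)
print(labels)
```

Restarting matters because a single random start can settle into a poor local arrangement; keeping the best of several runs makes the result less dependent on luck. Note that gender never enters the computation.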

Some definitions

• Style: combinations of linguistic resources
• Cluster: a group of authors who use a particular style
• Social network: each author has a social network made up of the people they both send messages to AND receive messages from
• An author’s social network does not have to be a part of that author’s cluster

Majority female clusters

Cluster  Size  % fem  Top words

c14 1,345 89.60% hubs blogged bloggers giveaway @klout recipe fabric recipes blogging tweetup

c7 884 80.40% kidd hubs xo =] xoxoxo muah xoxo darren scotty ttyl

c6 661 80.00% authors pokemon hubs xd author arc xxx ^_^ bloggers d:

c16 200 78.00% xo blessings -) xoxoxo #music #love #socialmedia slash :)) xoxo

c8 318 72.30% xxx :') xx tyga youu (: wbu thankyou heyy knoww

c5 539 71.10% (: :') xd (; /: <333 d: <33 </3 -___-

c4 1,376 63.00% && hipster #idol #photo #lessambitiousmovies hipsters #americanidol #oscars totes #goldenglobes

c9 458 60.00% wyd #oomf lmbo shyt bruh cuzzo #nowfollowing lls niggas finna

Looks like “women are trying to destroy the English language”

Category | Female authors | Male authors
Common words in a standard dictionary | 74.2% | 74.9%
Punctuation | 14.6% | 14.2%
Non-standard, unpronounceable words (e.g., :), lmao) | 4.28% | 2.99%
Non-standard, pronounceable words (e.g., luv) | 3.55% | 3.35%
Named entities | 1.94% | 2.51%
Numbers | 0.83% | 0.99%
Taboo words | 0.47% | 0.69%
Hashtags | 0.16% | 0.18%

Clusters that are majority female

• At the population level, women use many non-dictionary words.

• But there are clusters of (mostly) women who actually use fewer words like lol, nah, haha than men do

Cluster  Size  % fem  Top words

c14 1,345 89.60% hubs blogged bloggers giveaway @klout recipe fabric recipes blogging tweetup

c6 661 80.00% authors pokemon hubs xd author arc xxx ^_^ bloggers d:

c4 1,376 63.00% && hipster #idol #photo #lessambitiousmovies hipsters #americanidol #oscars totes #goldenglobes

Consider xo
• A lot more women use xo than men
– 11% of all women
– 2.5% of all men
• But that means that 89% of women aren’t using it at all.
• People who use xo are three times more likely to use ttyl (‘talk to you later’)
– The style is more commonly adopted by women
– But there’s other stuff going on here: age, job, etc.
– It’s not clear that gender is even the most important, it’s just that we’re starting with gender-colored glasses

Shit Girls Say

http://www.youtube.com/watch?feature=player_embedded&v=u-yLGIH7W9Y

Meme-splosion!

Group (coded by: Gender, Activity/social role, Interactions, Geography)
Shit Guys Don't Say Out Loud
Shit College Freshmen Say
Shit Girlfriends Say
Shit Asian Dads Say
Shit Redneck Guys Say
Shit Girls Say to Gay Guys
Shit Black Girls Say
Shit Black Guys Say
Shit People Say in LA
Shit White Girls Say…to Black Girls
Shit New Yorkers Say
Shit Frat Guys Say
Shit Whipped Guys Say
Shit Guys Don't Say
Shit Asian Girls Say
Shit Tumblr Girls Say
Shit Brides Say
Shit Spanish Girls Say
Shit Asian Moms Say
Shit Vegans Say
Shit Hipsters Say
Shit Cyclists Say
Shit Yogis Say
Shit Skiers Say

Notice

• That gender wasn’t really limited to the “gender” column
– “Moms” and “dads” are gendered social roles
• And that the words “guys” and “girls” aren’t really the same as “male” and “female”
– What are the plausible age ranges and social styles for “guys” and “girls”?

Clusters that are majority male

Cluster  Size  % male  Top words

c13 761 89.40% #nhl #bruins #mlb nhl #knicks qb @darrenrovell inning boozer jimmer

c10 1,865 85.40% /cc api ios ui portal developer e3 apple's plugin developers

c18 623 81.10% @macmiller niggas flyers cena bosh pacers @wale bruh melo @fucktyler

c11 432 73.80% niggas wyd nigga finna shyt lls ctfu #oomf lmaoo lmaooo

c20 429 72.50% gop dems senate unions conservative democrats liberal palin republican republicans

c15 963 65.30% #photo /cc #fb (@ brewing #sxsw @getglue startup brewery @foursquare

Looks like “men are Twitter-headed sailor-swearing accountants”

Category | Female authors | Male authors
Common words in a standard dictionary | 74.2% | 74.9%
Punctuation | 14.6% | 14.2%
Non-standard, unpronounceable words (e.g., :), lmao) | 4.28% | 2.99%
Non-standard, pronounceable words (e.g., luv) | 3.55% | 3.35%
Named entities | 1.94% | 2.51%
Numbers | 0.83% | 0.99%
Taboo words | 0.47% | 0.69%
Hashtags | 0.16% | 0.18%

Aggregates generally don’t hold

Cluster | Top words | Notes
c13 | #nhl #bruins #mlb nhl #knicks qb @darrenrovell inning boozer jimmer | Few Taboo/Hashes, Lots of Punc
c10 | /cc api ios ui portal developer e3 apple's plugin developers | Few Taboo/Hashes, Lots of Punc
c18 | @macmiller niggas flyers cena bosh pacers @wale bruh melo @fucktyler |
c11 | niggas wyd nigga finna shyt lls ctfu #oomf lmaoo lmaooo | Few Dict words, Lots of unPron and Pron
c20 | gop dems senate unions conservative democrats liberal palin republican republicans | Few Taboo/Hashes, Lots of Punc
c15 | #photo /cc #fb (@ brewing #sxsw @getglue startup brewery @foursquare | Few Taboo, Lots of Punc

Small exceptions

• At the population level, men use many named entities and numbers

• Clusters use these at various rates, but:
– No female-skewed clusters use them *more* than the male average
– No male-skewed clusters use them *less* than the female average
• But again, the other 6 generalizations about gender we might have made at an aggregate level aren’t supported once we get to clusters

Erasure!
• Clusters are highly gendered
• For example, let’s consider clusters made up of 60% or more of people of the same gender
– That covers 82.95% of all the authors
– But what about the 1,242 men who are part of female-majority clusters?
– The 1,052 women who are part of male-majority clusters?
– Are they just noise? Odd-balls? Is there no structure to what they’re doing?
– These people are using language to do identity work, even as they construct identities at odds with conventional notions of masculinity and femininity.

Clusters vs. social networks

• The more skewed a cluster is, the more skewed the social networks of its members

Women with female networks use the most female markers

Men with male networks use the most male markers

Women with male networks use more male markers (and vice versa)

Women with highly female networks are easier to classify (and vice versa)

In other words

• The classifier is picking up on the fact that if you insist upon a gender binary then people with same-gender networks use language in a more “gender-coherent” way.

Does social network help prediction?

• 88% accuracy with text alone
– Logistic regression, 10-fold cross-validation
– State-of-the-art accuracy
• Add network information…
– Still 88% accuracy

Once we have 1000 words/author, network info doesn’t help

Wait, why not?

• A new feature is only going to improve classification accuracy if it adds new information.

• There is strong homophily: 63% of the connections are between same-gender individuals.

• But language and social network can’t mutually disambiguate because they aren’t independent views on gender.

• Individuals who use linguistic resources from “the other gender” consistently have denser social network connections to the other gender.
– Performance, style, accommodation

• Gender is not an “A or B” kind of thing

If we seek only predictive accuracy…

We’re awesome!

Not so simple

• If we want to understand categories, we should start with people in interactions.
– Counting is great but we have to watch our bins and investigate them, too.

Look at words a different way

Not markers…makers

Positioning

Positioning and stance
• “Stance” is usually seen as an expression of a speaker’s relationship to their talk and their interlocutors
– E.g., Kiesling (2009); Du Bois (2007); Bednarek (2008)

• But “stance” (and “roles”) seem static

• I’d like something with more motion and dynamism

• I develop positioning to connect linguistic forms to social structures

• (Particularly affect, actually)

Positioning in a social grid

[Figure: a social grid with example positions: Sister, Daughter, Spinster, Subject, Object, Dentist, Farmer, Father]

Positioning in a social grid

• Social structures are created, maintained, and changed by specific interactions

• People enter interactions already positioned

• Interactions change these positions, people are attentive to changes

Conventions

• Different linguistic resources come to be associated with different positionings

• Distributions of experiences are usually maintained

• The maintenance and disruption of expectations has (affective) consequences

A LITTLE BIT OF LITTLE

CHILDES (MacWhinney, 2000)

• 4,676 transcripts of parent-child interactions
– American English

Speaker-to-addressee | Observed little | Expected little | O/E
Mothers-to-boys | 4,313 | 4,158 | 1.037
Fathers-to-boys | 1,516 | 1,381 | 1.098
Mothers-to-girls | 6,312 | 5,441 | 1.160
Fathers-to-girls | 230 | 281 | 0.819
Girls-to-mothers | 1,221 | 1,533 | 0.796
Girls-to-fathers | 4 | 3 | 1.482
Boys-to-mothers | 875 | 1,526 | 0.573
Boys-to-fathers | 117 | 265 | 0.441
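A short sketch of how an observed/expected (O/E) ratio like those in the CHILDES table can be computed. The slides do not spell out how the expected counts were derived; `expected_count` below assumes the simplest definition (the word's corpus-wide rate applied to each group's token total), and both function names are hypothetical. Because the expected counts on the slide are rounded, a recomputed ratio can differ from the slide in the last digit.

```python
def observed_over_expected(observed, expected):
    """Above 1: the word is used more than expected; below 1: less than expected."""
    return observed / expected

def expected_count(group_tokens, corpus_hits, corpus_tokens):
    """Assumed definition: the corpus-wide rate of the word times the group's tokens."""
    return corpus_hits * group_tokens / corpus_tokens

# Observed and expected counts of "little" copied from the CHILDES table.
rows = {
    "Mothers-to-boys": (4313, 4158),
    "Fathers-to-boys": (1516, 1381),
    "Fathers-to-girls": (230, 281),
}
for group, (obs, exp) in rows.items():
    print(group, round(observed_over_expected(obs, exp), 3))
```

Working with ratios rather than raw counts keeps groups comparable even when one group (here, mothers) contributes far more speech than another.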

Gender and little
• Women tend to use little more: multiple corpora show significant differences
• But this misses the point

Speaker | Buckeye O/E | CALLHOME O/E
Female | 1.170 | 1.073
Male | 0.855 | 0.725

Add interlocutor gender

Direction | CHILDES Parent-Child O/E | CHILDES Child-Parent O/E | Buckeye O/E | Fisher Am. Eng. O/E | Fisher Ohioans O/E | CALLHOME O/E
Female to female | 1.160 | 0.796 | 0.936 | 1.051 | 1.160 | 1.088
Female to male | 1.037 | 1.482 | 1.290 | 0.887 | 0.771 | 1.064
Male to male | 1.098 | 0.441 | 0.879 | 1.071 | 0.830 | 0.685
Male to female | 0.819 | 0.573 | 0.908 | 0.842 | 0.836 | 0.727

Gender and topics
• Some topics are more face-threatening than others.
– Face-threatening topics get less little.
• When topic is held constant, men and women mostly have the same little usage.
– Regardless of the gender of the person they’re talking to.
• But there are some exceptions, which are connected to issues of masculinity, femininity, and emotional regulation.
– Some examples:

• Generally, people don’t use little to talk about terrorism. EXCEPT women speaking to women use little to modify emotions (terrified, scared)

• Generally, people DO use little to talk about fitness. EXCEPT men talking to men. The men talking to women use little to talk about their pudgy, flabby bodies. The few men talking to men who use little use it to talk about working out a little harder or putting on a little more muscle mass.

ICSI meeting corpus (Janin et al., 2003)

• 75 meetings from Berkeley’s International Computer Science Institute (2000-2002)
– 3-10 participants (avg of 6)
– 17-103 minutes each (usually an hour)
– 72 hours of data

Role | # speakers (avg age) | Observed little | Expected little | O/E
Undergrad | 6 (30 yo) | 59 | 34 | 1.734
Grad | 14 (29 yo) | 234 | 223 | 1.049
Postdoc | 1 (not given) | 51 | 75 | 0.676
Ph.D. | 11 (37 yo) | 152 | 228 | 0.667
Professor | 4 (52 yo) | 278 | 213 | 1.302

Gender, genre, topic, style

• “Different ways of saying things are intended to signal different ways of being, which includes different potential things to say.” (Eckert 2008)

Majority female clusters

Cluster  Size  % fem  Top words

c14 1,345 89.60% hubs blogged bloggers giveaway @klout recipe fabric recipes blogging tweetup

c7 884 80.40% kidd hubs xo =] xoxoxo muah xoxo darren scotty ttyl

c6 661 80.00% authors pokemon hubs xd author arc xxx ^_^ bloggers d:

c16 200 78.00% xo blessings -) xoxoxo #music #love #socialmedia slash :)) xoxo

c8 318 72.30% xxx :') xx tyga youu (: wbu thankyou heyy knoww

c5 539 71.10% (: :') xd (; /: <333 d: <33 </3 -___-

c4 1,376 63.00% && hipster #idol #photo #lessambitiousmovies hipsters #americanidol #oscars totes #goldenglobes

c9 458 60.00% wyd #oomf lmbo shyt bruh cuzzo #nowfollowing lls niggas finna

Clusters that are majority male

Cluster  Size  % male  Top words

c13 761 89.40% #nhl #bruins #mlb nhl #knicks qb @darrenrovell inning boozer jimmer

c10 1,865 85.40% /cc api ios ui portal developer e3 apple's plugin developers

c18 623 81.10% @macmiller niggas flyers cena bosh pacers @wale bruh melo @fucktyler

c11 432 73.80% niggas wyd nigga finna shyt lls ctfu #oomf lmaoo lmaooo

c20 429 72.50% gop dems senate unions conservative democrats liberal palin republican republicans

c15 963 65.30% #photo /cc #fb (@ brewing #sxsw @getglue startup brewery @foursquare

Gender is not something people have

It’s something people *do*

And there are a lot of ways to “do” gender.

Computational Judith Butler!

Gender is binary only with blinders

• “My mom doesn’t say that’s lovely or omg!...”
– “Nevermind that!”

• Problem: Sliding from predictive accuracy to causal stories

• Realistic finding: There are lots of ways to do gender

Big data, big opportunities

• Big data offers us the opportunity to let clusters emerge (and test them against our big bins)

• We can show how language reflects and creates the social worlds we live in

THANKS!