Inferring Author Location in Social Media. Sanjay Krishnan and Evan Sparks, CS 294-1 Behavioral Data Mining


Page 1: Inferring Author Location in Social Media, Sanjay Krishnan ...bid.berkeley.edu/cs294-1-spring13/images/5/55/TweetLocalize... · Sanjay Krishnan and Evan Sparks, CS 294-1 Behavioral Data Mining

Inferring Author Location in Social Media

Sanjay Krishnan and Evan Sparks
CS 294-1 Behavioral Data Mining

Page 2:

Motivation

• Users in social media systems (e.g., Twitter) often self-report location.
  – Roughly 1.5% of tweets.

• Location attributes may be incomplete, masked by privacy settings, or missing altogether.

Page 3:

Problem Statement and Dataset

• 1 year of the Twitter firehose (10% sample).

• Given a tweet, predict the current U.S. state location of the author.
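Training labels for this task have to come from geotagged tweets: a (lat, long) pair is mapped to the state whose boundary contains it, which the Models slide calls a point-in-polygon lookup. The sketch below shows one standard way to do that (ray casting with a bounding-box pre-filter). It is an illustration only: the function names, the pre-filter, and the two toy rectangles (rough outlines, not real state boundaries) are assumptions, not the authors' code.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]   # (longitude, latitude)
Polygon = List[Point]         # vertices in order; last vertex connects back to the first


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray casting: cast a ray toward +x and count how many edges it crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # odd number of crossings means inside
    return inside


def bounding_box(polygon: Polygon) -> Tuple[float, float, float, float]:
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def lookup_state(point: Point, states: Dict[str, Polygon]) -> Optional[str]:
    """Return the code of the first state polygon containing the point, else None."""
    x, y = point
    for code, polygon in states.items():
        min_x, min_y, max_x, max_y = bounding_box(polygon)
        # Cheap rejection before the full edge-by-edge test (assumed optimization).
        if not (min_x <= x <= max_x and min_y <= y <= max_y):
            continue
        if point_in_polygon(point, polygon):
            return code
    return None


# Hypothetical toy data: rough rectangles, NOT real state boundaries.
TOY_STATES: Dict[str, Polygon] = {
    "co": [(-109.05, 37.0), (-102.05, 37.0), (-102.05, 41.0), (-109.05, 41.0)],
    "wy": [(-111.05, 41.0), (-104.05, 41.0), (-104.05, 45.0), (-111.05, 45.0)],
}
```

In a real pipeline the bounding boxes would be computed once and stored (or placed in a spatial index) rather than recomputed per lookup.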

Page 4:

Related Work

• Geo-location inference is well studied in entity resolution and image recognition.

• Content-based inference on blogs [Fink et al. 2008]

• Intra-city locations based on Twitter content [Cheng et al. 2010]

• Infer locations from Twitter follower networks [Davis et al. 2011]

• Applied to other micro-blogs [Ikawa 2012]

Page 5:

Models and Algorithms

• Geometry of the problem is complex
  – Multi-class classification problem.

• Bag of NER Features

• Prediction: State (from lat, long)
  – Efficient point-in-polygon lookup.

• Multinomial Logistic Regression
  – Implemented on Spark.

• Stochastic Gradient Descent
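To make the model on this slide concrete, here is a minimal single-machine sketch of multinomial (softmax) logistic regression trained with stochastic gradient descent over sparse bag-of-features inputs. It is not the authors' Spark implementation: the function names, learning rate, epoch count, and toy data are all assumptions chosen for illustration.

```python
import math
import random
from typing import Dict, List, Tuple

Features = Dict[int, float]        # sparse vector: {feature index: value}
Example = Tuple[Features, int]     # (features, class id)


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)                # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights: List[List[float]], bias: List[float],
                  features: Features) -> List[float]:
    scores = [b + sum(w[j] * v for j, v in features.items())
              for w, b in zip(weights, bias)]
    return softmax(scores)


def train_sgd(data: List[Example], num_features: int, num_classes: int,
              epochs: int = 50, lr: float = 0.5, seed: int = 0):
    """One gradient step per example; hyperparameters are illustrative guesses."""
    rng = random.Random(seed)
    weights = [[0.0] * num_features for _ in range(num_classes)]
    bias = [0.0] * num_classes
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for features, label in data:
            probs = predict_proba(weights, bias, features)
            for k in range(num_classes):
                # Cross-entropy gradient w.r.t. the score of class k:
                # p_k - 1[k == label]
                err = probs[k] - (1.0 if k == label else 0.0)
                bias[k] -= lr * err
                for j, v in features.items():
                    weights[k][j] -= lr * err * v
    return weights, bias


def predict(weights: List[List[float]], bias: List[float],
            features: Features) -> int:
    probs = predict_proba(weights, bias, features)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical toy problem: features 0 and 1 each signal one class,
# feature 2 appears equally in both classes and carries no signal.
TOY_DATA: List[Example] = [
    ({0: 1.0, 2: 1.0}, 0), ({0: 1.0}, 0),
    ({1: 1.0}, 1), ({1: 1.0, 2: 1.0}, 1),
]
```

Because each update touches only the features present in one example, the per-step cost scales with the number of active features rather than the vocabulary size, which is the usual reason to pair sparse text features with SGD.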

Page 6:

Analysis Pipeline

• 2.6 TB, gzip: 1 year of 10% firehose

• 42 GB, gzip: tweets with location

• 6 GB, raw: NER-tagged features

• Top 60k features

• Model

Example NER-tagged record (user, tagged text, state label):

mariangeeve    In/O the/O Apple/O store/O in/O Atlanta/LOCATION ./O @Reinier_Sanders/O with/O Janneke/PERSON trying/O to/O buy/O an/O IPad/O ./O http://t.co/AtX5E9Vc/O    ga
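The tagged record above uses a token/TAG format, so turning it into a bag of NER features is mostly a parsing step. The sketch below shows one way to do it. Several choices here are assumptions rather than anything the slides state: dropping tokens tagged O, lowercasing tokens, and picking the "top k" vocabulary by raw frequency. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split 'token/TAG' items; split on the LAST slash so URLs stay intact."""
    pairs = []
    for item in text.split():
        token, sep, tag = item.rpartition("/")
        if sep:  # skip anything that carries no tag
            pairs.append((token, tag))
    return pairs


def bag_of_ner(text: str, keep_untagged: bool = False) -> Counter:
    """Count 'token/TAG' features, by default keeping only named entities."""
    bag: Counter = Counter()
    for token, tag in parse_tagged(text):
        if tag == "O" and not keep_untagged:
            continue  # assumption: 'O' (no entity) tokens are not used as features
        bag[f"{token.lower()}/{tag}"] += 1
    return bag


def top_k_vocabulary(bags: Iterable[Counter], k: int) -> Dict[str, int]:
    """Map the k most frequent features to column indices (assumed selection rule)."""
    totals: Counter = Counter()
    for bag in bags:
        totals.update(bag)
    return {feat: idx for idx, (feat, _) in enumerate(totals.most_common(k))}


# The tagged text from the example record on this slide.
SLIDE_EXAMPLE = (
    "In/O the/O Apple/O store/O in/O Atlanta/LOCATION ./O @Reinier_Sanders/O "
    "with/O Janneke/PERSON trying/O to/O buy/O an/O IPad/O ./O "
    "http://t.co/AtX5E9Vc/O"
)
```

With these assumptions, the slide's example reduces to two features, one LOCATION and one PERSON, which are then mapped to column indices by the vocabulary.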

Page 7:

Results

• F1 score by state.

• Typically between 0.4 and 0.6.

• Precision vs. Recall.

• Tends to be a little lower in low-population states.
  – R=0.45, p=0.003

Figure: F1 by State (map; x-axis: Longitude, y-axis: Latitude; color scale: F1, about 0.3 to 0.6)
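For reference, per-state precision, recall, and F1 can all be derived from (actual, predicted) label pairs. The sketch below is a generic illustration of those definitions, not the authors' evaluation code. The function name and the toy labels are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def per_class_scores(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """pairs: (actual, predicted) labels, one per classified tweet."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for actual, predicted in pairs:
        if actual == predicted:
            tp[actual] += 1
        else:
            fp[predicted] += 1   # predicted class gains a false positive
            fn[actual] += 1      # actual class gains a false negative
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Hypothetical toy labels: one "nv" tweet is wrongly predicted as "ca".
TOY_PAIRS = [("ca", "ca"), ("ca", "ca"), ("nv", "ca"), ("nv", "nv")]
```

In the toy data the class that attracts the misclassified tweet loses precision, while the class that loses it has lower recall.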

Page 8:

Feature Engineering Helps

• Named entity recognition (NER) location features account for most of the predictive power.

Figure: F1 Before and After Feature Extraction (per-state panels F1 and F1.No.Locations; x-axis: State; y-axis: F1.Score, 0.0 to 0.6)

Page 9:

Precision and Recall

Figure: Precision vs. Recall by State (scatter plot of state labels; x-axis: Precision, y-axis: Recall)

Figure: Classification − Predictions vs. Actuals (state-by-state matrix; x-axis: Prediction, y-axis: Actual; color scale: Count, legend ticks from 1.0 to about 403.4)

• Populated states tend to have lots of false positives; small western states tend to be more precise.

Page 10:

Misclassifications and Geography

• False positives have a strong regional dependence.

Figure: Misclassification Toward California or Texas (map; x-axis: Longitude, y-axis: Latitude; legend: Misclassification, California or Texas)

Page 11:

Neat Features – Apple/Organization

• Good predictors are not always locations.

• Apple tagged as an organization.
  – States where people have money?
  – States where there’s an Apple store?
  – Connection?

Figure: Strength of Apple (Organization) Feature by State (map; x-axis: Longitude, y-axis: Latitude; color scale: Apple, 0.0 to 0.3)

Page 12:

Summary

• Designed a distributed framework for inferring author location from NER-tagged textual data.

• Implemented Multinomial Logistic Regression in Spark.

• Pretty good: some classes had F1 scores > 0.6

• Regional and social insights from the model.
  – Texas vs. California
  – Apple