計算社會科學初探-當電腦科學家遇上社會科學

Post on 20-Feb-2017

6.349 views 1 download

Transcript of 計算社會科學初探-當電腦科學家遇上社會科學

1

陳昇瑋台灣資料科學協會理事長

中央研究院資訊科學研究所研究員

計算社會科學初探-當電腦科學家遇上社會科學

2

Sheng-Wei (Kuan-Ta) Chen

Institute of Information ScienceAcademia Sinica

Computational Social ScienceThe Collaborative Futures of Big Data, Computer

Science, and Social Sciences

3

PEOPLE

4

The Favorite Major for US College Athletes

(Source: USA Today, http://usatoday30.usatoday.com/sports/college/2008-11-18-majors-graphic_N.htm)

5

Social Science

6

Social Life is Hard to See

We can interview friends, but we cannot interview a friendship

Fleeting interactionIn privateTedious to record over time, especially in large groups

7

Bigger Problems

Social phenomena involve many individuals interacting to produce collective entities

firms, markets, cultures, political parties, social movements, audiences“Micro-Macro” problem (aka “Emergence”)

Micro-macro problems are hard to study empirically

Difficult to collect observational data about individuals, networks, and populations at same timeEven more difficult to do “macro” scale experiments

8

1890 US Census

1st time Hollerith machines were used to tabulate US Census data (population: 62,947,714)

9

The Era of Big Data

Past: Government data, national survey dataToday: A variety of new data sources

Economic data: trade, finance, e-cash / e-wallet, ...GIS data: satellite, GPS loggers, laser scanning cars, …Sensor data: video surveillance, smart phones, wearable devices, mobile apps, beacons, …

10

New kinds of data

11

12

陳昇瑋 / 資料科學的第一堂課

Computer vision for healthcare

Video magnification(Slide Credit: Jia-Bin Huang)

14

15

New kinds of data

16

Engagement and Exploration

Standing face-to-face?Physical distanceHand gesture, postureConversation patternsFrequency of interruptions

18

Web as a Record of Social Interaction

Public web pages / discussionsTwitter, Facebook, blogs, news groups, wikis, MMOGs, Instagram, LastFM, Flickr, SpotifyPrivate email, Whatsapp, LINE, SlackText, images, sounds: speeches, commercials

19

New kinds of data

20

Computational Social Science

The science that investigates social phenomena through the medium of computing and

statistical data processing.

23

Computational Social ScienceAn instrument-enabled scientific discipline

microbiology

microscope

radio astronomy

radar

nanoscience

electron microscope

24

25

26

27

Technical Challenges

Computational infrastructures for dealing withMore data: analyzing large amounts of dataFuzzy data: cleaning up inprecise and noisy dataNew kinds of data: processing real-time sensor streams and web data

Need for new substantive ideasNeed for new statistical methods (WHY in addition to WHAT and HOW)

28

3 Common Approaches

Macroscope Virtual LabEmpirical Modeling

29

APPROACHES#1 MACROSCOPE #2 VIRTUAL LAB#3 EMPIRICAL MODELING

31

WE ARE WHAT WE SAYLinguistics

Schwartz, H. Andrew, et al. "Personality, gender, and age in the language of social media: The open-vocabulary approach." PloS one 8.9 (2013): e73791.

Macroscope

32

Dataset

700 million words, phrases, and topic instances collected from 75,000 volunteers’ FB posts Record users’ personality (5-factor), gender and age

33

What Words Do You Use?

male

female

34

How Old Are You? (#1)13 - 18

19 - 22

35

How Old Are You? (#2)

23 - 29

30 - 65

36

Personality Traits

Extraversion

Introversion

38

Topics Across 4 Age-groups

39

Warm and Negative Words

40

Usage of “I” & “We”

Huge-volume data + simple analysis crystal clear language use patterns

42

APPROACHES#1 MACROSCOPE #2 VIRTUAL LAB#3 EMPIRICAL MODELING

43

Scaling up the Lab

Social science experimental heavily constrained by scale and speed

Unit of analysis was individuals or small groupsExperiments took months to design and run

Potentially “virtual labs” lift both constraintsState of the art ~ 5000 workers, but in principle could construct subject panel ~ 100K – 1M Could shrink hypothesis-testing cycle to days or hours

44

MOOD CONTAGION (& MANIPULATION) ON FACEBOOK

Social Psychology

Kramer, Adam DI, Jamie E. Guillory, and Jeffrey T. Hancock. "Experimental evidence of massive-scale emotional contagion through social networks.” Proceedings of the National Academy of Sciences111.24 (2014): 8788-8790.

Virtual Lab

45

Facebook Mood Contagion

0.7 million (~ 0.04%) users on Facebook3 million posts manipulated in one weekHide some “positive” or “negative” emotional posts from users (in the experimental group)

46

Observations

Negative posts hidden

People who see more positive posts, tend to post more positively, and vice versa.Facebook users’ emotion can be easily

manipulated by changing ALGORITHMS

Positive posts hidden

47

Ethical Issues (!)

Unethical experiment because it’s conducted without users’ consent

Serious invasion of users’ perceptions about their friend circles (and the society)

Well, Facebook's data use policy states that users' information will be used "for internal operations, including troubleshooting, data analysis, testing, research and service improvement," meaning that any user can become a lab rat.

48

FACEBOOK “I VOTED” BUTTONSocial Psychology & Politics

Bond, Robert M., et al. "A 61-million-person experiment in social influence and political mobilization." Nature 489.7415 (2012): 295-298.

Virtual Lab

49

“I Voted” Button

Direct messages to 61 million users on FBInformational: 1% users receivedSocial: 98% users receivedControl group: 1% (no message received)

Informational

Social

51

Effect of Manipulation

Ratio of friends voted

Prob. of oneself claimed voted

53

2% more likely to click “I voted” button and 0.3%more likely to seek information about a polling place, and 0.4% more likely to head to the polls.

54

Real-world Consequence (!)

In total there were about 60,000 votes of turnout, and estimated 280,000 indirect turnout (out of 61 million users)

What if Facebook did not randomize the control/experimental groups?

62

APPROACHES#1 MACROSCOPE #2 VIRTUAL LAB#3 EMPIRICAL MODELING

63

Empirical Modeling

Traditional mathematical or computational modeling

Tends to rely on many, often unrealistic, assumptions Not generally tested in detail against dataResult is proliferation of models that exist in parallel and are often incompatible with each other

New sources/scales of data allow both to learn/test models and also calibrate them

Observations Models Lab Field Observations

65

Google Flu Trends

Nature 457, 1012-1014 (2009)

66

PREDICTION OF COUNTY-LEVEL HEART DISEASE MORTALITY

Medicine and Linguistics

Empirical Modeling

Eichstaedt, Johannes C., et al. "Psychological language on twitter predicts county-level heart disease mortality." Psychological science 26.2 (2015): 159-169.

68

Datsets

Heart disease Arteriosclerotic heart diseasemortality rates during 2009 -- 2010

Predictors826 million tweets collected between June 2009 and March 2010Socioeconomic (income and education)Demographic (percentages of Black, Hispanic, married, and female residents)Health status (diabetes, obesity, smoking, and hypertension)

69

Prediction Accuracy

70

71

74

Language Use in Tweets

75

Social media opens up a new window of what humans actually feel and think

76

YOU ARE WHAT YOU LIKESocial Psychology

Empirical Modeling

Kosinski, Michal, David Stillwell, and Thore Graepel. "Private traits and attributes are predictable from digital records of human behavior." Proceedings of the National Academy of Sciences 110.15 (2013): 5802-5805.

78

Personality Prediction

Personality traitsGender, age, relationship status, # friendsSexual orientation, ethnicity, religion, political inclinationAddictive substances (alcohol, drugs, cigarette), parental separationIQ, 5-Factor model, satisfaction with Life

79

Data Collection

9,939,220 Likes (55,814 unique ones) from 58,466 Facebook volunteers

SportsMusicBooksRestaurantsPopular websites

80

Ground truth

Political Inclination

Sexual Orientation

Democrat Republican

Democratic GOP (Grand Old Party)

Democratic Party Republican Party

Homosexual Heterosexual

1 / 0 1 / 0

82

Ground truth

5-Factor ModelOpennessConscientiousnessExtraversionAgreeablenessStability

83

Ground truth

Satisfaction with Life (SWL)

85

Methodology

User-Like matrix dimension reduction: Singular Value Decomposition (SVD)Prediction models: Logistic Regression & Linear Regression

87

Prediction ResultsSolid: Pearson corr. coef. between pred. & actual valuesTransparent: baseline acc. of the questionnaire, in terms of test-retest reliability

89

Discriminative Likes (#1)

90

Discriminative Likes (#2)

91

Discriminative Likes (#3)

92

Likes are Culture-Dependent (#1)

卡提諾正妹抱報 Catworld小舖

Garena《英雄聯盟 LOL》 QUEEN FASHION SHOP

遊戲大亂鬥 范范范瑋琪

Garena-TW 撿便宜特賣會

好色龍 衣芙日系

放棄治療 王大陸

這樣變型男 就愛網拍特賣會

Taipei Assassins (台北暗殺星) 86小舖商城

你為什麼要放棄治療呢 H.H先生

Toyz LOVFEE

93

Likes are Culture-Dependent (#2)

已婚

嬰兒與母親懷孕生產情報站 學生愛打工

未婚

味全MyWei Duncan

Estee Lauder Taiwan 雅詩蘭黛 林俊傑 JJ Lin

光泉"HOT"鮮奶 Cherng

舒潔溫柔心感動 Byebyechuchu

綠巨人 Dcard

AVON Taiwan 雅芳粉絲團 田馥甄Hebe

人人玩遊戲 彭于晏 Eddie Peng

Creative Baby -台灣 Dorothy

阿默典藏蛋糕 韋禮安Weibird

Can we have real privacy on social media?Unprecedented opportunity to observe individuals in a society

100

資料分析如何幫我們更瞭解捐款人?

101

x 3,518

in 10.5 years (since May 2003)

102

103

AppleDaily Charity Case Dataset

3000+ cases along with detailed description and donation records

104

20 50 80

捐款金額分布 (每戶個案家庭)

105

106

DATA COLLECTION

107

Crawling

http://search.appledaily.com.tw/charity/projlist/

108

Web page parsing

113

# donors

114

# donors w/ linear fitting

115

Adjusted Time Series

116

ANNOTATION

118

119

http://bountyworkers.net/

120

人工編碼成果

431編碼者

6532人次

255小時

8436家庭成員

1590個案

121

Sample Annotations

122

Variables we got (290+)

124

Methodology

Predict # donors and donation amountFeature selection based on mutation informationUsing libsvm to do 2-class classification

Classifying top 25% and bottom 25% cases by removing the middle 50% cases10-fold cross validation

Find out significant factors that determine the dependent variable(s)

126

Factor Categories

SubjectStructureFinance Member PresentationMeta

127

Factor – Members Category

Subject & MemberAge, gender, marital statusDisability, disease, accident, habit, status

129

Factor – Structure Category

StructureCount and ratio of particular types of family membersRelationships between members

130

Factor – Finance Category

FinanceIs the family below the poverty line?Regular income & expense

131

Factor – Presentation

PresentationCurrently, only title and images are evaluatedSubjective ratings from human subjects

133

Title & picture rating

http://mmnet.iis.sinica.edu.tw/~cslin/rating/welcome.php

134

Factor – Meta Information

Meta informationInformation unrelated to the family & its situationE.g., article writer and when was the article published

135

136

捐款意願與時間點高度相關

137

星期幾很重要

日 一 二 三 四 五

138

哪個月份也重要

一 二 三 四 五 六 七 八 九 十 十一 十二

139

受訪者的胖瘦會影響捐款決策

140

142

誰收到較多捐款?

144

捐款人對各式疾病及身心障礙有差別待遇

145

147

149

不可抗力因素較讓人同情

150

意外失業 離婚入獄 人為

意外輟學

152

154

捐款與固定支出成反比

個案家庭固定支出

捐款金額

155

捐款者期待能看見「希望」

156

CASE STUDY

157

Successful Case

158

Less Successful Case

162

TEXT MINING APPROACH

163

C-LIWC簡介從James Pennebaker的LIWC (Linguistic Inquiry and Word Count) 發展而來由台科大與台大心理團隊,依照中文特性增刪類別與語詞,編製而成總計88個類別,6862個詞與詞幹語言特性與寫作風格多少能反應個人特質、影響讀者的感受此文本分析方法,逐漸被廣泛使用在心理學相關研究主題。如:道歉與原諒、測謊、治療過程的語言變化、心理位移等C-LIWC官網:http://cliwc.weebly.com/

164

中文版語文探索與字詞計算字典(C-LIWC)

165

家庭詞、死亡詞、健康詞相關:家庭詞、死亡詞、健康詞大致和捐款皆成正相關推論:當事件主題符合傳統價值時較易引起捐款

(r, p-value) 家庭詞 死亡詞 健康詞log(捐款總額) (r=0.148,

p=0.000)(r=0.101, p=0.000)

(r=0.056, p=0.026)

捐款人數 (r=0.131, p=0.000)

(r=0.113, p=0.000)

(r=0.058, p=0.021)

每人平均捐款額 (r=0.129, p=0.000)

(r=0.084, p=0.001)

(r=0.007,p=0.771)

範例 母親、婆婆、阿公、家屬、堂妹、繼父、雙親

火化、死者、自殺、告別式、往生、致死

中風、糖尿病、結石、住院、安眠藥

166

文章總詞數相關:文章總詞數和捐款成正相關推論:將事件敘述越詳盡,越容易募到款

(r, p-value) 總詞數 (word count)log(捐款總額) (r=0.101, p=0.000)

捐款人數 (r=0.056, p=0.027)

每人平均捐款額 (r=0.143, p=0.000)

167

工作詞、成就詞、金錢詞

相關:工作詞、成就詞、金錢詞大致和捐款皆成負相關推論:和工作相關的主題,相較不易募得款項

(r, p-value) 工作詞 成就詞 金錢詞log(捐款總額) (r=-0.079, p=0.002) (r=-0.064, p=0.011) (r=-0.072, p=0.004)

捐款人數 (r=-0.099, p=0.000) (r=-0.085, p=0.000) (r=-0.025, p=0.319)

每人平均捐款額 (r=-0.022, p=0.380) (r=-0.020, p=0.001) (r=-0.101, p=0.000)

範例 勞工、契約、付費、裁員、生意、員工、職業

升遷、職權、權威、嘉獎、能幹、高層、榮耀

帳戶、租金、商店、現金、消費、捐贈

168

其它否定詞

範例:不滿、不幸、不能、無關、不料、不須相關:和平均每人捐款額呈負相關(r=-0.063, p=0.013)推論:正面描述較佳

副詞範例:真的、終於、確實、一定、一向、不管、全然相關:和平均每人捐款額呈負相關(r=-0.084, p=0.001)推論:平實地描述即可,過度誇大或多加贅述易有反效果

170

ONGOING WORK

171

174

Opportunities to explore

Incentive provisioningLet doners keep track their own donation recordDoner profile, like KivaRe-visit the families being helped

Viral marketing

Cognitive biasesAnchoring effectEndowed progress effect

(see https://en.wikipedia.org/wiki/List_of_cognitive_biases)

175

176

CONCLUSION & OUTLOOK

177

WE ARE STILL AT THE VERY START

178

LOTS of Big Questions

The polarization of global economic inequality What explains the success of social movements?The emergence of pro-sociality behaviorThe causality of video gaming and propensity of violence?The politics of censorshipThe causality of social selection and social influence?…

179

The Data Divide

Social scientists have good questions but…IT tools are not part of their toolkitsNot clear that we will/should make the investment

Computer scientists have powerful methods but…

Trained to resolve technical problemsIt seems there are less “methodological” contributions

180

The Challenges

Education and habits of social and computer scientists

Different ways of thinking Different methodologies Differences in framing questions and defining contributions

Data access and fragmentation issueData privacy issue Ethics issueOrganizational issue

182

Institutional Innovations

New platforms and protocols for data management

Better coordination of data collection, storage, sharingRecruitment and management of subject pools, field panels

Integrated research designsCoordination across theoretical, experimental and observational studies

Collaborative interdisciplinary teamsFor a given data set, often unclear what the most interesting question isFor a given question, often unclear how to collect the right data

184

Techniques to collect, manage, and process large

datasets

Knowledge about social theories, methods, and

issues

ComputationalSocial Science

185

Sheng-Wei ChenAcademia Sinica