國家科會委員會人文社會科學研究中心 『補助國內學術研討會』成 … · 會場五 古希臘哲學專題小組討論【a306】 主持人:徐學庸(台灣大學哲學系
計算社會科學初探-當電腦科學家遇上社會科學
-
Upload
sheng-wei-kuan-ta-chen -
Category
Data & Analytics
-
view
6.349 -
download
1
Transcript of 計算社會科學初探-當電腦科學家遇上社會科學
1
陳昇瑋台灣資料科學協會理事長
中央研究院資訊科學研究所研究員
計算社會科學初探-當電腦科學家遇上社會科學
2
Sheng-Wei (Kuan-Ta) Chen
Institute of Information ScienceAcademia Sinica
Computational Social ScienceThe Collaborative Futures of Big Data, Computer
Science, and Social Sciences
3
PEOPLE
4
The Favorite Major for US College Athletes
(Source: USA Today, http://usatoday30.usatoday.com/sports/college/2008-11-18-majors-graphic_N.htm)
5
Social Science
6
Social Life is Hard to See
We can interview friends, but we cannot interview a friendship
Fleeting interactionIn privateTedious to record over time, especially in large groups
7
Bigger Problems
Social phenomena involve many individuals interacting to produce collective entities
firms, markets, cultures, political parties, social movements, audiences“Micro-Macro” problem (aka “Emergence”)
Micro-macro problems are hard to study empirically
Difficult to collect observational data about individuals, networks, and populations at same timeEven more difficult to do “macro” scale experiments
8
1890 US Census
1st time Hollerith machines were used to tabulate US Census data (population: 62,947,714)
9
The Era of Big Data
Past: Government data, national survey dataToday: A variety of new data sources
Economic data: trade, finance, e-cash / e-wallet, ...GIS data: satellite, GPS loggers, laser scanning cars, …Sensor data: video surveillance, smart phones, wearable devices, mobile apps, beacons, …
10
New kinds of data
11
12
陳昇瑋 / 資料科學的第一堂課
Computer vision for healthcare
Video magnification(Slide Credit: Jia-Bin Huang)
14
15
New kinds of data
16
Engagement and Exploration
Standing face-to-face?Physical distanceHand gesture, postureConversation patternsFrequency of interruptions
18
Web as a Record of Social Interaction
Public web pages / discussionsTwitter, Facebook, blogs, news groups, wikis, MMOGs, Instagram, LastFM, Flickr, SpotifyPrivate email, Whatsapp, LINE, SlackText, images, sounds: speeches, commercials
19
New kinds of data
20
Computational Social Science
The science that investigates social phenomena through the medium of computing and
statistical data processing.
23
Computational Social ScienceAn instrument-enabled scientific discipline
microbiology
microscope
radio astronomy
radar
nanoscience
electron microscope
24
25
26
27
Technical Challenges
Computational infrastructures for dealing withMore data: analyzing large amounts of dataFuzzy data: cleaning up inprecise and noisy dataNew kinds of data: processing real-time sensor streams and web data
Need for new substantive ideasNeed for new statistical methods (WHY in addition to WHAT and HOW)
28
3 Common Approaches
Macroscope Virtual LabEmpirical Modeling
29
APPROACHES#1 MACROSCOPE #2 VIRTUAL LAB#3 EMPIRICAL MODELING
31
WE ARE WHAT WE SAYLinguistics
Schwartz, H. Andrew, et al. "Personality, gender, and age in the language of social media: The open-vocabulary approach." PloS one 8.9 (2013): e73791.
Macroscope
32
Dataset
700 million words, phrases, and topic instances collected from 75,000 volunteers’ FB posts Record users’ personality (5-factor), gender and age
33
What Words Do You Use?
male
female
34
How Old Are You? (#1)13 - 18
19 - 22
35
How Old Are You? (#2)
23 - 29
30 - 65
36
Personality Traits
Extraversion
Introversion
38
Topics Across 4 Age-groups
39
Warm and Negative Words
40
Usage of “I” & “We”
Huge-volume data + simple analysis crystal clear language use patterns
42
APPROACHES#1 MACROSCOPE #2 VIRTUAL LAB#3 EMPIRICAL MODELING
43
Scaling up the Lab
Social science experimental heavily constrained by scale and speed
Unit of analysis was individuals or small groupsExperiments took months to design and run
Potentially “virtual labs” lift both constraintsState of the art ~ 5000 workers, but in principle could construct subject panel ~ 100K – 1M Could shrink hypothesis-testing cycle to days or hours
44
MOOD CONTAGION (& MANIPULATION) ON FACEBOOK
Social Psychology
Kramer, Adam DI, Jamie E. Guillory, and Jeffrey T. Hancock. "Experimental evidence of massive-scale emotional contagion through social networks.” Proceedings of the National Academy of Sciences111.24 (2014): 8788-8790.
Virtual Lab
45
Facebook Mood Contagion
0.7 million (~ 0.04%) users on Facebook3 million posts manipulated in one weekHide some “positive” or “negative” emotional posts from users (in the experimental group)
46
Observations
Negative posts hidden
People who see more positive posts, tend to post more positively, and vice versa.Facebook users’ emotion can be easily
manipulated by changing ALGORITHMS
Positive posts hidden
47
Ethical Issues (!)
Unethical experiment because it’s conducted without users’ consent
Serious invasion of users’ perceptions about their friend circles (and the society)
Well, Facebook's data use policy states that users' information will be used "for internal operations, including troubleshooting, data analysis, testing, research and service improvement," meaning that any user can become a lab rat.
48
FACEBOOK “I VOTED” BUTTONSocial Psychology & Politics
Bond, Robert M., et al. "A 61-million-person experiment in social influence and political mobilization." Nature 489.7415 (2012): 295-298.
Virtual Lab
49
“I Voted” Button
Direct messages to 61 million users on FBInformational: 1% users receivedSocial: 98% users receivedControl group: 1% (no message received)
Informational
Social
51
Effect of Manipulation
Ratio of friends voted
Prob. of oneself claimed voted
53
2% more likely to click “I voted” button and 0.3%more likely to seek information about a polling place, and 0.4% more likely to head to the polls.
54
Real-world Consequence (!)
In total there were about 60,000 votes of turnout, and estimated 280,000 indirect turnout (out of 61 million users)
What if Facebook did not randomize the control/experimental groups?
62
APPROACHES#1 MACROSCOPE #2 VIRTUAL LAB#3 EMPIRICAL MODELING
63
Empirical Modeling
Traditional mathematical or computational modeling
Tends to rely on many, often unrealistic, assumptions Not generally tested in detail against dataResult is proliferation of models that exist in parallel and are often incompatible with each other
New sources/scales of data allow both to learn/test models and also calibrate them
Observations Models Lab Field Observations
65
Google Flu Trends
Nature 457, 1012-1014 (2009)
66
PREDICTION OF COUNTY-LEVEL HEART DISEASE MORTALITY
Medicine and Linguistics
Empirical Modeling
Eichstaedt, Johannes C., et al. "Psychological language on twitter predicts county-level heart disease mortality." Psychological science 26.2 (2015): 159-169.
68
Datsets
Heart disease Arteriosclerotic heart diseasemortality rates during 2009 -- 2010
Predictors826 million tweets collected between June 2009 and March 2010Socioeconomic (income and education)Demographic (percentages of Black, Hispanic, married, and female residents)Health status (diabetes, obesity, smoking, and hypertension)
69
Prediction Accuracy
70
71
74
Language Use in Tweets
75
Social media opens up a new window of what humans actually feel and think
76
YOU ARE WHAT YOU LIKESocial Psychology
Empirical Modeling
Kosinski, Michal, David Stillwell, and Thore Graepel. "Private traits and attributes are predictable from digital records of human behavior." Proceedings of the National Academy of Sciences 110.15 (2013): 5802-5805.
78
Personality Prediction
Personality traitsGender, age, relationship status, # friendsSexual orientation, ethnicity, religion, political inclinationAddictive substances (alcohol, drugs, cigarette), parental separationIQ, 5-Factor model, satisfaction with Life
79
Data Collection
9,939,220 Likes (55,814 unique ones) from 58,466 Facebook volunteers
SportsMusicBooksRestaurantsPopular websites
80
Ground truth
Political Inclination
Sexual Orientation
Democrat Republican
Democratic GOP (Grand Old Party)
Democratic Party Republican Party
Homosexual Heterosexual
1 / 0 1 / 0
82
Ground truth
5-Factor ModelOpennessConscientiousnessExtraversionAgreeablenessStability
83
Ground truth
Satisfaction with Life (SWL)
85
Methodology
User-Like matrix dimension reduction: Singular Value Decomposition (SVD)Prediction models: Logistic Regression & Linear Regression
87
Prediction ResultsSolid: Pearson corr. coef. between pred. & actual valuesTransparent: baseline acc. of the questionnaire, in terms of test-retest reliability
89
Discriminative Likes (#1)
90
Discriminative Likes (#2)
91
Discriminative Likes (#3)
92
Likes are Culture-Dependent (#1)
男
卡提諾正妹抱報 Catworld小舖
女
Garena《英雄聯盟 LOL》 QUEEN FASHION SHOP
遊戲大亂鬥 范范范瑋琪
Garena-TW 撿便宜特賣會
好色龍 衣芙日系
放棄治療 王大陸
這樣變型男 就愛網拍特賣會
Taipei Assassins (台北暗殺星) 86小舖商城
你為什麼要放棄治療呢 H.H先生
Toyz LOVFEE
93
Likes are Culture-Dependent (#2)
已婚
嬰兒與母親懷孕生產情報站 學生愛打工
未婚
味全MyWei Duncan
Estee Lauder Taiwan 雅詩蘭黛 林俊傑 JJ Lin
光泉"HOT"鮮奶 Cherng
舒潔溫柔心感動 Byebyechuchu
綠巨人 Dcard
AVON Taiwan 雅芳粉絲團 田馥甄Hebe
人人玩遊戲 彭于晏 Eddie Peng
Creative Baby -台灣 Dorothy
阿默典藏蛋糕 韋禮安Weibird
Can we have real privacy on social media?Unprecedented opportunity to observe individuals in a society
100
資料分析如何幫我們更瞭解捐款人?
101
x 3,518
in 10.5 years (since May 2003)
102
103
AppleDaily Charity Case Dataset
3000+ cases along with detailed description and donation records
104
20 50 80
捐款金額分布 (每戶個案家庭)
105
106
DATA COLLECTION
107
Crawling
http://search.appledaily.com.tw/charity/projlist/
108
Web page parsing
113
# donors
114
# donors w/ linear fitting
115
Adjusted Time Series
116
ANNOTATION
118
119
http://bountyworkers.net/
120
人工編碼成果
431編碼者
6532人次
255小時
8436家庭成員
1590個案
121
Sample Annotations
122
Variables we got (290+)
124
Methodology
Predict # donors and donation amountFeature selection based on mutation informationUsing libsvm to do 2-class classification
Classifying top 25% and bottom 25% cases by removing the middle 50% cases10-fold cross validation
Find out significant factors that determine the dependent variable(s)
126
Factor Categories
SubjectStructureFinance Member PresentationMeta
127
Factor – Members Category
Subject & MemberAge, gender, marital statusDisability, disease, accident, habit, status
129
Factor – Structure Category
StructureCount and ratio of particular types of family membersRelationships between members
130
Factor – Finance Category
FinanceIs the family below the poverty line?Regular income & expense
131
Factor – Presentation
PresentationCurrently, only title and images are evaluatedSubjective ratings from human subjects
133
Title & picture rating
http://mmnet.iis.sinica.edu.tw/~cslin/rating/welcome.php
134
Factor – Meta Information
Meta informationInformation unrelated to the family & its situationE.g., article writer and when was the article published
135
136
捐款意願與時間點高度相關
137
星期幾很重要
日 一 二 三 四 五
138
哪個月份也重要
一 二 三 四 五 六 七 八 九 十 十一 十二
139
受訪者的胖瘦會影響捐款決策
140
142
誰收到較多捐款?
144
捐款人對各式疾病及身心障礙有差別待遇
145
147
149
不可抗力因素較讓人同情
150
意外失業 離婚入獄 人為
意外輟學
152
154
捐款與固定支出成反比
個案家庭固定支出
捐款金額
155
捐款者期待能看見「希望」
156
CASE STUDY
157
Successful Case
158
Less Successful Case
162
TEXT MINING APPROACH
163
C-LIWC簡介從James Pennebaker的LIWC (Linguistic Inquiry and Word Count) 發展而來由台科大與台大心理團隊,依照中文特性增刪類別與語詞,編製而成總計88個類別,6862個詞與詞幹語言特性與寫作風格多少能反應個人特質、影響讀者的感受此文本分析方法,逐漸被廣泛使用在心理學相關研究主題。如:道歉與原諒、測謊、治療過程的語言變化、心理位移等C-LIWC官網:http://cliwc.weebly.com/
164
中文版語文探索與字詞計算字典(C-LIWC)
165
家庭詞、死亡詞、健康詞相關:家庭詞、死亡詞、健康詞大致和捐款皆成正相關推論:當事件主題符合傳統價值時較易引起捐款
(r, p-value) 家庭詞 死亡詞 健康詞log(捐款總額) (r=0.148,
p=0.000)(r=0.101, p=0.000)
(r=0.056, p=0.026)
捐款人數 (r=0.131, p=0.000)
(r=0.113, p=0.000)
(r=0.058, p=0.021)
每人平均捐款額 (r=0.129, p=0.000)
(r=0.084, p=0.001)
(r=0.007,p=0.771)
範例 母親、婆婆、阿公、家屬、堂妹、繼父、雙親
火化、死者、自殺、告別式、往生、致死
中風、糖尿病、結石、住院、安眠藥
166
文章總詞數相關:文章總詞數和捐款成正相關推論:將事件敘述越詳盡,越容易募到款
(r, p-value) 總詞數 (word count)log(捐款總額) (r=0.101, p=0.000)
捐款人數 (r=0.056, p=0.027)
每人平均捐款額 (r=0.143, p=0.000)
167
工作詞、成就詞、金錢詞
相關:工作詞、成就詞、金錢詞大致和捐款皆成負相關推論:和工作相關的主題,相較不易募得款項
(r, p-value) 工作詞 成就詞 金錢詞log(捐款總額) (r=-0.079, p=0.002) (r=-0.064, p=0.011) (r=-0.072, p=0.004)
捐款人數 (r=-0.099, p=0.000) (r=-0.085, p=0.000) (r=-0.025, p=0.319)
每人平均捐款額 (r=-0.022, p=0.380) (r=-0.020, p=0.001) (r=-0.101, p=0.000)
範例 勞工、契約、付費、裁員、生意、員工、職業
升遷、職權、權威、嘉獎、能幹、高層、榮耀
帳戶、租金、商店、現金、消費、捐贈
168
其它否定詞
範例:不滿、不幸、不能、無關、不料、不須相關:和平均每人捐款額呈負相關(r=-0.063, p=0.013)推論:正面描述較佳
副詞範例:真的、終於、確實、一定、一向、不管、全然相關:和平均每人捐款額呈負相關(r=-0.084, p=0.001)推論:平實地描述即可,過度誇大或多加贅述易有反效果
170
ONGOING WORK
171
174
Opportunities to explore
Incentive provisioningLet doners keep track their own donation recordDoner profile, like KivaRe-visit the families being helped
Viral marketing
Cognitive biasesAnchoring effectEndowed progress effect
(see https://en.wikipedia.org/wiki/List_of_cognitive_biases)
175
176
CONCLUSION & OUTLOOK
177
WE ARE STILL AT THE VERY START
178
LOTS of Big Questions
The polarization of global economic inequality What explains the success of social movements?The emergence of pro-sociality behaviorThe causality of video gaming and propensity of violence?The politics of censorshipThe causality of social selection and social influence?…
179
The Data Divide
Social scientists have good questions but…IT tools are not part of their toolkitsNot clear that we will/should make the investment
Computer scientists have powerful methods but…
Trained to resolve technical problemsIt seems there are less “methodological” contributions
180
The Challenges
Education and habits of social and computer scientists
Different ways of thinking Different methodologies Differences in framing questions and defining contributions
Data access and fragmentation issueData privacy issue Ethics issueOrganizational issue
182
Institutional Innovations
New platforms and protocols for data management
Better coordination of data collection, storage, sharingRecruitment and management of subject pools, field panels
Integrated research designsCoordination across theoretical, experimental and observational studies
Collaborative interdisciplinary teamsFor a given data set, often unclear what the most interesting question isFor a given question, often unclear how to collect the right data
184
Techniques to collect, manage, and process large
datasets
Knowledge about social theories, methods, and
issues
ComputationalSocial Science
185
Sheng-Wei ChenAcademia Sinica