Social Data Science
Data and Big Data
David Dreyer Lassen
UCPH ECON, August 12, 2016

In God we trust, all others must bring data
W. Edwards Deming

Different types of data
Today:
1. Empirical design
2. Data generating process
3. Modes of collection: standard vs. big data; examples
4. Strategic data provision

roadmap
• Different data for different questions
• Theory and empirics, forecasting and hypothesis testing
• Effects of causes vs. causes of effects
• Data generating process
• Modes of data collection – pros and cons
• Strategic data management and data production

Different data for different questions
or
Different questions for different data
Sometimes it is possible to separate the data collection process from the underlying data generating process – and sometimes not
Fundamental difference between what people do and what they say they do: ‘cheap talk’ / ‘put your money where your mouth is’ / honest / costly signaling

What is your question, again?
1. Research question from theory
2. Ideal empirical design
3. Feasible empirical design/collection
4. Results
5. Adjustment of theory/question/design
6. New results
7. …
A. What data do we have
B. What question can they answer
C. Research question
D. Results

All models are wrong – but some are useful
George Box
Two key goals
1. Forecasting: individual behavior, policy consequences, voting, Champions League, grades, …
   Data science/machine learning (but also macroeconomics)
2. Hypothesis testing, derived from theory
   ‘Traditional’ social science

1. Forecasting
• Example: Bank wants to forecast non-payment on loans (P_d: probability of default)
• Couldn’t care less about theory
• Rough “data science”: try to predict from all available data
• Suppose we find that birth weight predicts default
  – Bank is happy, better fit (defer ethics etc.)
  – Policy: does investing in pre-natal care reduce defaults?
• In practice: set of predictors typically taken from (some) theory, even if casual
• Complications: if customers know that P_d depends on birth weight, would/should they disclose it? What if loans go only to disclosers? Would they tell the truth?
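
A minimal sketch of the forecasting exercise, under loudly stated assumptions: the data are simulated, the negative link between birth weight and default is built in by hand, and the one-variable logistic model is only one of many ways a bank might estimate P_d. The names `fit_logistic`, `birth_weight` and `default` are hypothetical. The point is that the fitted model predicts well without saying anything about why birth weight matters.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=500):
    """Fit P(y=1 | x) = sigmoid(a + b*x) by batch gradient descent."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(a + b * x) - y  # gradient of the log-loss
            grad_a += err
            grad_b += err * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


random.seed(0)

# Hypothetical, simulated customers: standardized birth weight and a default
# indicator whose probability falls with birth weight (assumed, not estimated).
birth_weight = [random.gauss(0.0, 1.0) for _ in range(1000)]
default = [1 if random.random() < sigmoid(-1.0 - 0.8 * w) else 0
           for w in birth_weight]

a_hat, b_hat = fit_logistic(birth_weight, default)

# Predicted P_d one standard deviation below and above mean birth weight.
p_d_low = sigmoid(a_hat - b_hat)
p_d_high = sigmoid(a_hat + b_hat)
print(f"slope on birth weight: {b_hat:.2f}")
print(f"P_d at low birth weight: {p_d_low:.2f}, at high: {p_d_high:.2f}")
```

The fitted slope is a forecasting device only: it cannot tell the bank whether pre-natal care would reduce defaults, which is the hypothesis-testing question.
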

2. Hypothesis testing
• Theory (rational choice, sociology, biology, common sense, …) posits effect of X on Y
  A. Selection/type theory: People who are impatient cannot defer immediate pleasures -> smoke and drink while pregnant -> give birth sooner. If impatient parents -> impatient children (whether by nature or nurture), we have an explanation.
  B. Biological theory: low birth weight affects brain development and neurological wiring for patience.
• If (A), little role for policy; also, both can be true at the same time
• How to distinguish: exogenous shock to birth weight, but ethically tricky...

Goodhart’s law
• Most popular: “When a measure becomes a target, it ceases to be a good measure.”
• What he wrote: “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”

Targets and Measures
• You cannot be told how your bank constructs your P_d. Why?
  – Goodhart’s law: people will attempt to outmaneuver the measure
  – (thought) example: spending on shoes is a good indicator of account overdraft -> shoe lovers will have others buy for them, and it ceases to be a good measure

Case of Google Flu
• Google Flu: web searches for flu symptoms predicted actual flu cases
• By-product of Google’s main service
• But from 2010, not so well: overestimated actual flu cases, partly as a result of the autosuggest feature, partly because the model was overfitted (we’ll return to that)
• Best predictor: number of cases in the past week

Effects of causes vs. causes of effects
Different questions
• Effects of causes: intervention, what is the effect of policy X on outcome Y
• Causes of effects: Why does Z occur?

Effects of causes (forward causal questions)
• Narrow questions, sometimes (but not always) policy interventions
  – Effect of tax change on behavior
  – Effect of regulation on risk taking
  – Effect of schooling on earnings
  – Effect of smoking on lung cancer propensity
  – Effect of public health on schooling in Africa
  – …
• Often, but not always, amenable to treatments/randomization/experimentation

Causes of effects (reverse causal inference)
• Much harder, but often more interesting
  – Why do some people smoke?
  – What are the causes of democratization?
  – Why do some people pursue a PhD while others drop out after primary school?
  – Why did Greece (almost) go bankrupt?
• Tensions with “effects of causes”
  – search for causes sometimes derided as ‘party chatter’

What is the data generating process?
Observational: endogenous decisions, researcher passive collector of data
Randomization: treatment-control
(Some) exogeneity: policy interventions, sometimes with comparisons, researchers sometimes involved
Important: more data does not give better results/more precision if the estimator is biased
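
The last point can be illustrated with a small simulation. This is a sketch under assumed numbers, not anything taken from the lecture: the true effect of x on y is set to 1, an unobserved factor u drives both x and y, and the researcher regresses y on x alone. With these assumed parameters the OLS slope converges to 1 + 2 · cov(x, u)/var(x) = 2, so a larger sample only makes the wrong answer more precise.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var


def simulate(n, rng):
    """Draw n observations with an omitted confounder u and return the OLS slope."""
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]            # unobserved
    x = [ui + rng.gauss(0.0, 1.0) for ui in u]             # endogenous regressor
    y = [1.0 * xi + 2.0 * ui + rng.gauss(0.0, 1.0)         # true effect of x is 1
         for xi, ui in zip(x, u)]
    return ols_slope(x, y)


rng = random.Random(1)
estimates = {n: simulate(n, rng) for n in (100, 10_000, 100_000)}
for n, slope in estimates.items():
    print(f"n = {n:>7}: estimated slope = {slope:.3f} (true effect = 1.0)")
```
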

Data generating process
Randomized experiments
• Distinguish
  – Lab experiments: traditionally computer-based in econ, but also eye tracking/brain images (fMRI)/physiological
  – Survey experiments: assign survey respondents to different frames/treatments/primings, e.g. have SocDems and Liberals say the same thing and look at support
  – Field experiments: experimental control in the real world, e.g. banks charging different rates to learn about mobility of customers; interventions against teacher absenteeism in India; …

Randomized experiments
• Distinguish
  – Natural experiments (weather induced: effects of poverty on violence, randomization of names on election ballots, …)
  – Quasi-experiments (effects of change in policy; effect of tax reform on tax planning; effect of immigrant allocation on crime)
• Throughout: exogenous (outside of the individual) change

Randomized experiments
• Large, important current debate in (development) economics
• EofC (effects of causes): what are the effects of penalties on teachers’ absence in Indian village schools – evidence from randomized experiments
• Randomly selected teachers get harsh penalty for no-shows -> difference in absenteeism is the causal effect of the penalty
• (Broader CofE (causes of effects) Q: why is the education sector in rural India so inefficient?)
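
A minimal sketch of why randomization makes the simple comparison causal. All numbers are invented for illustration (the baseline absence rates, the assumed effect of -0.05, the noise), and the variable names are hypothetical. Because a coin flip decides who gets the penalty, treated and control teachers have the same baseline absence on average, so the difference in mean absence recovers the assumed effect.

```python
import math
import random

rng = random.Random(2)

n_teachers = 2000
true_effect = -0.05  # assumed effect of the penalty on the share of days absent

# Hypothetical baseline absence rates differ across teachers (unobserved traits).
baseline = [rng.uniform(0.2, 0.5) for _ in range(n_teachers)]
# Random assignment: a coin flip decides who faces the harsh penalty.
penalty = [rng.random() < 0.5 for _ in range(n_teachers)]
absence = [b + (true_effect if p else 0.0) + rng.gauss(0.0, 0.03)
           for b, p in zip(baseline, penalty)]

treated = [a for a, p in zip(absence, penalty) if p]
control = [a for a, p in zip(absence, penalty) if not p]


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


# Difference in means and its standard error (unequal-variance formula).
effect_hat = mean(treated) - mean(control)
std_err = math.sqrt(variance(treated) / len(treated)
                    + variance(control) / len(control))
print(f"estimated effect: {effect_hat:.3f} (s.e. {std_err:.3f}), "
      f"assumed true effect: {true_effect}")
```
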

Randomized experiments
• Strong on internal validity: from randomization, any effect on absenteeism is from harsher penalties; good for testing theory
• Weak(er) on external validity – would the effect be similar in Africa? Would an effect from the lab work outside the lab? Why, why not?
• (compare: medicine works in similar ways across locations)

Randomized experiments
• Challenges
  – Limits to what can be studied by experimentation (ethics; law; feasibility)
  – Funding (field experiments expensive, survey experiments less so)
  – Often participation constraint – voluntary participants’ gain >= 0 or no incentive
  – Subjects leave for various (systematic) reasons
  – Large-scale randomization can be hard in field experiments

Observational data
• Generated without experimental or exogenous intervention
• Typically reveals correlations or descriptive patterns that can be interesting in themselves

Example: Inequality
Source: Piketty and Saez, Science 2014, tax return data

Observational data
• Generated without experimental or exogenous intervention
• Typically reveals correlations or descriptive patterns that can be interesting in themselves
  – Are in themselves silent about causality
  – Theory may provide structure to learn about causal mechanism under strong assumptions
  – May conflate correlation and causality

Observational data
• Example: Does being in private school affect grades?
  – Classic: Catholic schools and grades in the US
  – Collect attendance and grades -> run regression
• But: suppose some parents are more focused on schooling than others
  – Send kids to private school more
  – More involved in school + homework
• What do higher grades measure?
  – Effect of private school OR effect of involved parents?

Observational data
• What to do?
  – Assign kids/parents randomly to private schools?
• More complicated
  – Waiting-list experiment design: people who sign up reveal themselves as school interested, compare grades between those in program and on waiting list -> much narrower design
  – Modeling (US case): use fact that Catholics are much more likely to choose Catholic schools

Modes of data collection
• (Ethnographic/participant observer)
• Survey
  – Interview survey (in person), phone survey, internet survey, …
• Administrative data
  – Used for administrative purposes
  – Some countries: census, tax return
  – DK: CPR-registry based
• (Primary collection: texts, counting)
• “Big data”: in social sciences typically a by-product of digital information

Modes of data collection
• Note: survey, admin data, big data can all have randomized/exogenous elements or be purely observational
• Often in lab/field experiments: ask about income, education etc. – but may be biased
• Sometimes: combine experimental data with admin or big data (but rare)

Ethnographic
• Pros
  – Attempt to understand situations from participants’ perspective
  – Very detailed observations (e.g. dynamics at a meeting: who speaks when, who listens, who nods off and flirts etc.)
• Cons
  – Very difficult to generalize (if even the goal)
  – Typically very small n, not for stats
  – Hard to reproduce/replicate

Surveys
• Pros
  – Can be cheap
  – Elicit info on attitudes, beliefs, expectations
  – Necessary when no other means exist
  – Combine with open-ended info
  – Easily anonymized (firms; China)
• Cons
  – Can be expensive
  – Non-random samples, sometimes very much so (paid surveys)
  – Cheap talk
  – Diverse interpretations (e.g. 1-10 scales, Maasai example)
  – Very different quality: interview vs. internet
  – Not full researcher control: interviewer completions

Administrative data
• Denmark, Norway, Sweden
  – Population-wide
  – Ex: Know population ‘by pressing Enter’
• Most other countries: census (counting people), surveys, rough approximations
  – In DK, built on Central Person Registry number
  – System constructed for source taxation in 1960s, now used as ubiquitous identifier
• Why do some countries have CPR-like systems and some not?

Big Data in Economics

Administrative data
• Pros
  – Often full population
  – In DK: third party reported -> no reporting bias, no survey bias
  – Very detailed, no survey fatigue
  – Often very precise, since used for admin purposes
• Cons
  – No soft data (attitudes, expectations); can be linked to surveys
  – Privacy concerns
  – Restricted to what is collected for admin reasons, both type and frequency (e.g. annual)

Administrative data
• Lots of work in Danish econ utilizes register data
  – Taxation
  – Education
  – Health
  – Financial decisions
  – Labor market
• Combined with
  – Personality measures
  – Attitudes/political prefs from surveys
  – Expectations from surveys
  – Biological data (neuro-measures, genetics)
  – Data from experiments

Viva la revolución?
Harnessing the Data Revolution for Good
Human Development Report Office

Big data

No agreed-upon definition of what Big Data is
• Large N?
• High frequency/much detail?
• Many different measurements?
• Based on what people do (‘honest signals’)
  – in contrast to surveys
  – Not always honest
• Different to different people/traditions
• To Americans, Danish admin/register data is big data

‘Big data’
• Pros
  – Often based on real decisions (as admin data), but more detail, e.g. auctions
  – High frequency (e.g. wifi), high granularity -> almost ‘large N ethnographic data’
  – Sometimes cheap/free
• Cons
  – No established protocol for collection
  – Sometimes dubious quality, selection issues (both known/unknown)
  – Start-up costs
  – Even more privacy concerns
  – Corporate gatekeepers -> bias in access (Facebook, Google)

Characteristics of ‘big data’
• Structured (row/column-style) vs. unstructured (images/sound)
• Temporally referenced (date, time, frequency)
• Geographically referenced (wifi, bluetooth, Google)
• Person identifiable (identify vs. distinguish individuals vs. not distinguish individuals)
  – Separate medium (e.g. phone) from owner

Example: Social Fabric
• Large-scale (N=1000) big data project
• Handed out smartphones to DTU freshmen
• Collected phone, SMS/text/email (not content), GPS, wifi, bluetooth data
• -> Where, when, with whom
• -> social networks

Why phone data
• Phones as sociometers
• Many/most people carry phone with them all the time
• Would be IMPOSSIBLE to have people report in detail for every 10 min every day for a year
• For this project: tailored software, but realized that many apps collect detailed wifi data without telling
• Concern: take-up of phones

Example: Social Fabric
Phone locations, 05:00 Monday morning -> can predict where people are at a given time with 85% accuracy

Example: Social Fabric
10 min GPS, wifi

Example: peer effects in education economics
• Students allocated to study and social groups, called vector groups (randomly)
• Are there peer effects, i.e. are students’ grades/health behavior/study behavior affected by the group?
• Literature: sometimes yes, sometimes no; very heterogeneous
• Why? Perhaps being allocated to a group is not the same as actually meeting/using the group

Example: peer effects
• Think of allocation to group as intention to treat (similar to offering treatment)
• Interesting example: Carrell et al., ECMA 2013. Small groups: yes, peer effects; large groups: no/negative peer effects – WHY?
• Use phone to measure frequency of group members being together physically, measured by bluetooth
• Three parts: (i) yes, they are more together; (ii) more together => work better together; (iii) peer effects?
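
The intention-to-treat logic can be sketched with simulated data. The setup is a deliberate simplification and an assumption throughout: a binary random allocation `z`, a binary indicator `d` for actually meeting the group (in the project this would come from bluetooth data), and an unobserved `sociability` that raises both meeting and the outcome. The Wald ratio used at the end is a standard textbook estimator swapped in for illustration; the slides do not say which estimator the project uses.

```python
import random

rng = random.Random(3)
n = 20000
true_effect = 0.3  # assumed effect of actually meeting the group on the outcome

assigned, met, outcome = [], [], []
for _ in range(n):
    sociability = rng.gauss(0.0, 1.0)    # unobserved by the researcher
    z = 1 if rng.random() < 0.5 else 0   # random allocation (offer of treatment)
    # Meeting is more likely when allocated, but also when sociable.
    d = 1 if 0.8 * z + 0.5 * sociability + rng.gauss(0.0, 0.5) > 0.5 else 0
    y = true_effect * d + 0.5 * sociability + rng.gauss(0.0, 0.5)
    assigned.append(z)
    met.append(d)
    outcome.append(y)


def mean_by(values, groups, g):
    """Mean of values among observations whose group label equals g."""
    picked = [v for v, k in zip(values, groups) if k == g]
    return sum(picked) / len(picked)


# Intention to treat: compare outcomes by allocation, ignoring actual meeting.
itt = mean_by(outcome, assigned, 1) - mean_by(outcome, assigned, 0)
# Part (i): does allocation make people meet more?
first_stage = mean_by(met, assigned, 1) - mean_by(met, assigned, 0)
# Wald ratio: ITT effect scaled by the change in meeting.
wald = itt / first_stage
# Naive comparison by actual meeting is confounded by sociability.
naive = mean_by(outcome, met, 1) - mean_by(outcome, met, 0)

print(f"ITT: {itt:.3f}, first stage: {first_stage:.3f}")
print(f"Wald: {wald:.3f}, naive: {naive:.3f}, assumed true effect: {true_effect}")
```

In this simulation the ITT effect is diluted because allocation does not guarantee meeting, while the naive comparison overstates the effect because sociable students both meet more and do better.
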

Broader issue: Who meets, and how close are they?
• Again: use bluetooth signals to measure meetings (duration, participants)
• Analyzes 3.1 mio meetings over two months
• Some results:
  – Women/women pairs -> closer
  – Facebook friends -> closer
  – Same study -> closer
  – Difference in beauty -> further apart
  – One overweight, one not -> further apart
• People who stand very (too) close to others have fewer friends (!?)

Prediction vs causality
• Measure class attendance from phone data (wifi/GPS/bluetooth)
  – Either: construct clusters at slots known as teaching time; or: use admin info on class locations and construct GPS overlays
• Facebook activity
• Predict grades

Prediction vs causality
Attendance -> grades/comprehension
  – People who attend more learn more
  – People who spend less time on Facebook have more time for studying
AND/OR
Grades/comprehension -> attendance
  – Find courses hard -> stay at home, more tempted by Facebook

Example: CSS
Heatmap of people with mobile devices on CSS (anonymous)

Example: David on Saturday

Example: David some Saturday
Flea market

Example: how to measure consumer spending
• Economically important:
  – Indicator of health of economy
  – Important for understanding individual responses to policy
  – Ditto for responses to economic shocks
  – Important for consumer prices -> inflation -> adjustments of wages and transfers
  – In developing countries: important for estimates of poverty, inequality

Example: consumer spending
• Traditional methods:
  – Consumer expenditure surveys (DK: forbrugsundersøgelsen)
  – Diary or scanner
  – Errors, selection
• Economists wanted access to individual spending data from Dankort for a long time
  – No luck
• Recently, Statistics Denmark got access to COOP-card data to measure inflation
  – To be made public soon, pretty good fit with existing measures (and much faster)
  – Nice idea, incentive compatible
  – Indep. of payment type
  – But selection?

Example: consumer spending
• Attempts in development economics
  – Use smartphones as scanner or means of payment
  – What can we infer about individuals from smartphone use (dedicated users)?
  – Selection into who has smartphones
  – But should be seen against other ways of collecting data
• Qs:
  – How can we use smartphones to infer spending better?
  – What kinds of economically interesting data can we collect via smartphones?

Statistical analysis of Big Data
• Many observations: what does statistical significance mean?
  – And what is practical relevance? Size of effects
• Multiple testing problems? If big data generates many variables, why not run through them all to see what is significant?
  – Correct standard errors
• In some cases, ‘eyeball econometrics’ can be difficult
  – Need systematic approach
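
A small simulation of the multiple testing problem, as a sketch under assumptions: the outcome and all 200 candidate predictors are pure noise, so every "significant" result is a false positive. The Bonferroni threshold is one simple correction chosen here for illustration (the slides only say to correct the standard errors), and the normal approximation to the t distribution is used for the critical values.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(4)
n_obs, n_vars, alpha = 500, 200, 0.05


def t_stat(x, y):
    """t-statistic for the correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Outcome and all candidate predictors are independent noise by construction.
y = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
t_values = [t_stat([rng.gauss(0.0, 1.0) for _ in range(n_obs)], y)
            for _ in range(n_vars)]

# Two-sided critical values from the normal approximation (reasonable for n = 500).
crit_naive = NormalDist().inv_cdf(1 - alpha / 2)
crit_bonferroni = NormalDist().inv_cdf(1 - alpha / (2 * n_vars))

naive_hits = sum(abs(t) > crit_naive for t in t_values)
bonferroni_hits = sum(abs(t) > crit_bonferroni for t in t_values)
print(f"'significant' at 5% without correction: {naive_hits} of {n_vars}")
print(f"'significant' with Bonferroni correction: {bonferroni_hits} of {n_vars}")
```

With a 5% test and 200 unrelated variables, roughly ten spurious hits are expected without any correction, which is why running through all variables and reporting the significant ones is not informative.
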

Statistical/machine learning
• Suppose you have no or very little theory to guide you
• OLS is not only linear, but also presumes some idea of what actually goes in there and how
• Varian’s Titanic example: who survived the Titanic
  – Two variables: class and age
  – Researcher decides/guesses vs. data analysis yields most likely (decision tree, but lots more complicated -> Sebastian, later)
  – Einav, Levin: Econ should consider machine learning

Statistical analysis of Big Data
• But what if you have theory (or think you have)
  – e.g. combine econometrics and machine learning
• Goes back to old debate in economics
  – Milton Friedman (1953): judge a model by its predictions, not its assumptions
  – Machine learning made for prediction, not for hypothesis testing and theory (in)validation

Strategic data management and production
• People/firms/governments do not always provide truthful and/or complete data
• Example: No penalty for lying in surveys – but no reason not to either
• Political reasons for obscuring or inventing data: Greece in EU, Chinese economy
• Firms: Proprietary info, competition reasons, fooling customers and regulators (VW)

Strategic data management and production
• Individual demand for privacy (we return to this)
  – Could be instrumental:
    • Lack of privacy decreases consumer surplus through a better estimate of the reservation price (e.g. steering: Mac vs PC when ordering online)
    • Concerns about political issues
  – Or an objective in itself: privacy as a political goal

Social desirability bias I
• Key concern in surveys, but more general problem: What if people answer so as to conform with general notions of what’s desirable?
  – Examples: Won’t admit to not voting or having sexually transmitted diseases, exaggerate income
  – Report buying healthy food vs unhealthy food
  – Important for asking/assessing sensitive questions

Social desirability bias II
• Why?
• Distinguish
  a) self-deception
  b) impression management
• Example: What do you value most in a potential mate?
  – People say: “kind and understanding”
  – From dating data: physical attractiveness, status
  – Bias could be both (a) and (b)