Post on 09-Aug-2020

Social Data Science

Data and Big Data

David Dreyer Lassen

UCPH ECON, August 12, 2016

In God we trust, all others must bring data

W. Edwards Deming

Different types of data

Today:
1. Empirical design
2. Data generating process
3. Modes of collection: standard vs. big data; examples
4. Strategic data provision

Roadmap

• Different data for different questions
• Theory and empirics, forecasting and hypothesis testing
• Effects of causes vs. causes of effects
• Data generating process
• Modes of data collection – pros and cons
• Strategic data management and data production

Different data for different questions
or
Different questions for different data

Sometimes it is possible to separate the data collection process from the underlying data generating process, and sometimes not

Fundamental difference between what people do and what they say they do: ‘cheap talk’ / ‘put your money where your mouth is’ / honest / costly signaling

What is your question, again?

1. Research question from theory
2. Ideal empirical design
3. Feasible empirical design/collection
4. Results
5. Adjustment of theory/question/design
6. New results
7. …

A. What data do we have
B. What question can they answer
C. Research question
D. Results

All models are wrong – but some are useful

George Box

Two key goals
1. Forecasting: individual behavior, policy consequences, voting, Champions League, grades…
   Data science/machine learning (but also macroeconomics)
2. Hypothesis testing, derived from theory
   ‘Traditional’ social science

1. Forecasting

• Example: Bank wants to forecast non-payment on loans (P_d: probability of default)
• Couldn’t care less about theory
• Rough “Data Science”: try to predict from all available data
• Suppose we find that birth weight predicts default
– Bank is happy, better fit (defer ethics etc.)
– Policy: does investing in pre-natal care reduce defaults?
• In practice: set of predictors typically taken from (some) theory, even if casual
• Complications: if customers know that P_d depends on birth weight, would/should they disclose it? What if loans only go to disclosers? Would they tell the truth?
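
A minimal sketch of the forecasting idea, not the bank's actual method: it fits a one-variable logistic model for P_d on purely synthetic data. The data generating process (lower standardized birth weight raises default risk), the coefficient values, and the helper name `fit_logit` are all assumptions made for illustration.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic customers (assumption): x = standardized birth weight,
# d = 1 if the customer defaults. Lower x means higher default risk.
TRUE_INTERCEPT, TRUE_SLOPE = -1.0, -0.8
data = []
for _ in range(2000):
    x = random.gauss(0.0, 1.0)
    p = sigmoid(TRUE_INTERCEPT + TRUE_SLOPE * x)
    d = 1 if random.random() < p else 0
    data.append((x, d))

def fit_logit(rows, steps=500, lr=0.5):
    """Fit P(d=1|x) = sigmoid(a + b*x) by gradient ascent on the mean log-likelihood."""
    a = b = 0.0
    n = len(rows)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, d in rows:
            err = d - sigmoid(a + b * x)
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

a_hat, b_hat = fit_logit(data)

# Forecast P_d for a low and a high birth-weight customer.
p_low = sigmoid(a_hat + b_hat * -1.0)
p_high = sigmoid(a_hat + b_hat * 1.0)
print(f"slope = {b_hat:.2f}, P_d(low) = {p_low:.2f}, P_d(high) = {p_high:.2f}")
```

The fit says nothing about why birth weight predicts default; it only uses the correlation, which is the point of the bullet list above.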

2. Hypothesis testing

• Theory (rational choice, sociology, biology, common sense, …) posits effect of X on Y
A. Selection/type theory: People who are impatient cannot defer immediate pleasures -> smoke and drink while pregnant -> give birth sooner. If impatient parents -> impatient children (whether by nature or nurture), we have an explanation.
B. Biological theory: low birth weight affects brain development and neurological wiring for patience.
• If (A), little role for policy; also, both can be true at the same time
• How to distinguish: exogenous shock to birth weight, but ethically tricky...

Goodhart’s law

• Most popular: “When a measure becomes a target, it ceases to be a good measure.”
• What he wrote: “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”

Targets and Measures

• You cannot be told how your bank constructs your P_d. Why?
– Goodhart’s law: people will attempt to outmaneuver the measure
– (thought) example: spending on shoes is a good indicator of account overdraft -> shoe lovers will have others buy for them -> it ceases to be a good measure

Case of Google Flu

• Google Flu: web searches for flu symptoms predicted actual flu cases
• By-product of Google’s main service
• But from 2010, not so well: overestimated actual flu cases, partly as a result of the autosuggest feature, partly because the model was overfitted (we’ll return to that)
• Best predictor: number of cases in the past week

Effects of causes
vs.
Causes of effects

Different questions
• Effects of causes: intervention, what is the effect of policy X on outcome Y?
• Causes of effects: Why does Z occur?

Effects of causes (forward causal questions)

• Narrow questions, sometimes (but not always) policy interventions
– Effect of tax change on behavior
– Effect of regulation on risk taking
– Effect of schooling on earnings
– Effect of smoking on lung cancer propensity
– Effect of public health on schooling in Africa
– …
• Often, but not always, amenable to treatments/randomization/experimentation

Causes of effects (reverse causal inference)

• Much harder, but often more interesting
– Why do some people smoke?
– What are the causes of democratization?
– Why do some people pursue a PhD while others drop out after primary school?
– Why did Greece (almost) go bankrupt?
• Tensions with “effects of causes”
– search for causes sometimes derided as ‘party chatter’

What is the data generating process?

Observational: endogenous decisions, researcher passive collector of data
Randomization: treatment-control
(Some) exogeneity: policy interventions, sometimes with comparisons, researchers sometimes involved

Important: more data does not give a better result/more precision if the estimator is biased
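
A small simulation of the last point, under assumptions chosen only for illustration: the true population mean is 0, but units with higher values are more likely to end up in the sample. The selection rule and the function name `selected_sample` are hypothetical.

```python
import random
import statistics

random.seed(1)
TRUE_MEAN = 0.0

def selected_sample(n):
    """Draw until n units are observed; higher-valued units are more likely to be observed."""
    sample = []
    while len(sample) < n:
        y = random.gauss(TRUE_MEAN, 1.0)
        # Assumed selection rule: observation depends on y plus noise.
        if y + random.gauss(0.0, 1.0) > 0:
            sample.append(y)
    return sample

results = {}
for n in (100, 10_000, 100_000):
    s = selected_sample(n)
    mean = statistics.fmean(s)
    se = statistics.stdev(s) / n ** 0.5
    results[n] = (mean, se)
    print(f"n = {n:>7}: sample mean = {mean:.3f}, standard error = {se:.4f}")
```

The standard error shrinks as n grows, but the sample mean settles on a value away from the true mean of 0: more data makes the wrong answer more precise.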

Data generating process

Randomized experiments
• Distinguish
– Lab experiments: traditionally computer-based in econ, but also eye tracking/brain images (fMRI)/physiological
– Survey experiments: assign survey respondents to different frames/treatments/primings, e.g. have Soc Dems and Liberals say the same thing and look at support
– Field experiments: experimental control in the real world, e.g. banks charging different rates to learn about mobility of customers; interventions against teacher absenteeism in India; …

Randomized experiments

• Distinguish
– Natural experiments (weather-induced: effects of poverty on violence, randomization of names on election ballots, …)
– Quasi-experiments (effects of change in policy; effect of tax reform on tax planning; effect of immigrant allocation on crime)
• Throughout: exogenous (outside of the individual) change

Randomized experiments

• Large, important current debate in (development) economics
• EofC: what are the effects of penalties on teachers’ absence in Indian village schools? Evidence from randomized experiments
• Randomly selected teachers get a harsh penalty for no-shows -> difference in absenteeism is the causal effect of the penalty
• (Broader CofE Q: why is the education sector in rural India so inefficient?)
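
A sketch of why the difference in absenteeism identifies the effect of the penalty. The teachers, the unobserved "motivation" trait, and the assumed effect size are all made up; the only feature taken from the slide is that the penalty is assigned at random.

```python
import random
import statistics

random.seed(2)
TRUE_EFFECT = -0.10  # assumption: penalty lowers absence probability by 10 points

treated, control = [], []
for _ in range(20_000):
    motivation = random.random()              # unobserved teacher trait
    base_absence = 0.40 - 0.20 * motivation   # absence probability without penalty
    penalized = random.random() < 0.5         # coin flip, independent of motivation
    p_absent = base_absence + (TRUE_EFFECT if penalized else 0.0)
    absent = 1 if random.random() < p_absent else 0
    (treated if penalized else control).append(absent)

# Because assignment is random, both groups have the same mix of motivation,
# so the raw difference in absence rates estimates the causal effect.
effect_hat = statistics.fmean(treated) - statistics.fmean(control)
print(f"estimated effect of penalty on absence rate: {effect_hat:.3f}")
```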

Randomized experiments

• Strong on internal validity: from randomization, any effect on absenteeism is from harsher penalties; good for testing theory
• Weak(er) on external validity – would the effect be similar in Africa? Would an effect from the lab work outside the lab? Why, why not?
• (compare: medicine works in similar ways across locations)

Randomized experiments

• Challenges
– Limits to what can be studied by experimentation (ethics; law; feasibility)
– Funding (field experiments expensive, survey experiments less so)
– Often participation constraint – voluntary participants’ gain >= 0 or no incentive
– Subjects leave for various (systematic) reasons
– Large-scale randomization can be hard in field experiments

Observational data

• Generated without experimental or exogenous intervention
• Typically reveals correlations or descriptive patterns that can be interesting in themselves

Example: Inequality

Source: Piketty and Saez, Science 2014, tax return data

Observational data

• Generated without experimental or exogenous intervention
• Typically reveals correlations or descriptive patterns that can be interesting in themselves
– Are in themselves silent about causality
– Theory may provide structure to learn about causal mechanisms under strong assumptions
– May conflate correlation and causality

Observational data

• Example: Does being in private schools affect grades?
– Classic: Catholic schools and grades in the US
– Collect attendance and grades -> run regression
• But: suppose some parents are more focused on schooling than others
– Send kids to private school more
– More involved in school + homework
• What do higher grades measure?
– Effect of private school OR effect of involved parents?
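
The confounding story can be made concrete with synthetic data. In this sketch the true effect of private school on grades is set to zero by assumption, and involved parents both choose private school more often and raise grades; all numbers and names are hypothetical.

```python
import random
import statistics

random.seed(3)
TRUE_SCHOOL_EFFECT = 0.0   # assumption for illustration: private school does nothing
PARENT_EFFECT = 1.0        # assumption: involved parents raise grades by one point

rows = []
for _ in range(20_000):
    involved = random.random() < 0.5
    private = random.random() < (0.7 if involved else 0.2)
    grade = (7.0 + PARENT_EFFECT * involved
             + TRUE_SCHOOL_EFFECT * private + random.gauss(0.0, 1.0))
    rows.append((involved, private, grade))

def gap(data):
    """Mean grade in private school minus mean grade in public school."""
    priv = [g for _, p, g in data if p]
    pub = [g for _, p, g in data if not p]
    return statistics.fmean(priv) - statistics.fmean(pub)

naive_gap = gap(rows)
# Compare like with like: the same gap within each parent type.
within_gaps = [gap([r for r in rows if r[0] == inv]) for inv in (True, False)]

print(f"naive private-public gap: {naive_gap:.2f}")
print("gaps within parent type:", [round(g, 2) for g in within_gaps])
```

The naive gap is clearly positive even though the school effect is zero; it measures parental involvement. The within-type comparison is only available here because the simulation observes involvement, which real attendance-and-grades data typically do not.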

Observational data

• What to do?
– Assign kids/parents randomly to private schools?
• More complicated
– Waiting-list experiment design: people who sign up reveal themselves as interested in schooling; compare grades between those in the program and those on the waiting list -> much narrower design
– Modeling (US case): use the fact that Catholics are much more likely to choose Catholic schools

Modes of data collection

• (Ethnographic/participant observer)
• Survey
– Interview survey (in person), phone survey, internet survey, …
• Administrative data
– Used for administrative purposes
– Some countries: census, tax return
– DK: CPR-registry based
• (Primary collection: texts, counting)
• “Big data”: in social sciences typically a by-product of digital information

Modes of data collection

• Note: survey, admin data, big data can all have randomized/exogenous elements or be purely observational
• Often in lab/field experiments: ask about income, education etc. – but may be biased
• Sometimes: combine experimental data with admin or big data (but rare)

Ethnographic

• Pros
– Attempt to understand situations from participants’ perspective
– Very detailed observations (e.g. dynamics at a meeting: who speaks when, who listens, who nods off and flirts etc.)
• Cons
– Very difficult to generalize (if even the goal)
– Typically very small n, not for stats
– Hard to reproduce/replicate

Surveys

• Pros
– Can be cheap
– Elicit info on attitudes, beliefs, expectations
– Necessary when no other means exist
– Combine with open-ended info
– Easily anonymized (firms; China)
• Cons
– Can be expensive
– Non-random samples, sometimes very much so (paid surveys)
– Cheap talk
– Diverse interpretations (e.g. 1-10 scales, Maasai example)
– Very different quality: interview vs. internet
– Not full researcher control: interviewer completions

Administrative data

• Denmark, Norway, Sweden
– Population-wide
– Ex: Know population ‘by pressing Enter’
• Most other countries: census (counting people), surveys, rough approximations
– In DK, built on Central Person Registry number
– System constructed for taxation at source in 1960s, now used as ubiquitous identifier
• Why do some countries have CPR-like systems and some not?

Big Data in Economics

Administrative data

• Pros
– Often full population
– In DK: third-party reported -> no reporting bias, no survey bias
– Very detailed, no survey fatigue
– Often very precise, since used for admin purposes
• Cons
– No soft data (attitudes, expectations); can be linked to surveys
– Privacy concerns
– Restricted to what is collected for admin reasons, both type and frequency (e.g. annual)

Administrative data

• Lots of work in Danish econ utilizes register data
– Taxation
– Education
– Health
– Financial decisions
– Labor market
• Combined with
– Personality measures
– Attitudes/political prefs from surveys
– Expectations from surveys
– Biological data (neuro-measures, genetics)
– Data from experiments

Viva la revolución? Harnessing the Data Revolution for Good

Human Development Report Office

Big data

No agreed-upon definition of what Big Data is

• Large N?
• High frequency/much detail?
• Many different measurements?
• Based on what people do (‘honest signals’)
– in contrast to surveys
– Not always honest
• Different to different people/traditions
• To Americans, Danish admin/register data is big data

‘Big data’

• Pros
– Often based on real decisions (as admin data), but more detail, e.g. auctions
– High frequency (e.g. wifi), high granularity -> almost ‘large N ethnographic data’
– Sometimes cheap/free
• Cons
– No established protocol for collection
– Sometimes dubious quality, selection issues (both known/unknown)
– Start-up costs
– Even more privacy concerns
– Corporate gatekeepers -> bias in access (Facebook, Google)

Characteristics of ‘big data’

• Structured (row/column-style) vs. unstructured (images/sound)
• Temporally referenced (date, time, frequency)
• Geographically referenced (wifi, bluetooth, Google)
• Person identifiable (identify vs. distinguish individuals vs. not distinguish individuals)
– Separate medium (e.g. phone) from owner

Example: Social Fabric

• Large-scale (N=1000) big data project
• Handed out smartphones to DTU freshmen
• Collected phone, SMS/text/email (not content), GPS, wifi, bluetooth data
• -> Where, when, with whom
• -> social networks

Why phone data

• Phones as sociometers
• Many/most people carry their phone with them all the time
• Would be IMPOSSIBLE to have people report in detail every 10 min, every day, for a year
• For this project: tailored software, but realized that many apps collect detailed wifi data without telling
• Concern: take-up of phones

Example: Social Fabric

Phone locations, 0500h Monday morning -> can predict where people are at a given time with 85% accuracy

Example: Social Fabric

Figure labels: 10 min, GPS, wifi

Example: peer effects in education economics

• Students allocated (randomly) to study and social groups, called vector groups
• Are there peer effects, i.e. are students’ grades/health behavior/study behavior affected by the group?
• Literature: sometimes yes, sometimes no; very heterogeneous
• Why? Perhaps being allocated to a group is not the same as actually meeting/using the group

Example: peer effects

• Think of allocation to group as intention to treat (similar to offering treatment)
• Interesting example: Carrell et al., ECMA 2013. Small groups: yes, peer effects; large groups: no/negative peer effects – WHY?
• Use phone to measure frequency of group members being together physically, measured by bluetooth
• Three parts: (i) yes, they are more together; (ii) more together => work better together; (iii) peer effects?
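
A hedged sketch of the intention-to-treat logic, not the project's analysis. It assumes a simplified binary setup: students are randomly assigned (or not) to a hypothetical high-achieving group, only some actually meet their group, and the group only matters for those who meet. Rescaling the ITT by the share who meet is a standard adjustment swapped in here for illustration; the share stands in for what bluetooth data could measure.

```python
import random
import statistics

random.seed(4)
EFFECT_IF_MEETING = 0.5  # assumption: grade gain for assigned students who actually meet

assigned, not_assigned, meets_among_assigned = [], [], []
for _ in range(80_000):
    ability = random.gauss(0.0, 1.0)
    z = random.random() < 0.5                                  # random group assignment
    meets = random.random() < (0.6 if ability > 0 else 0.2)    # would meet the group
    gain = EFFECT_IF_MEETING if (z and meets) else 0.0
    grade = 7.0 + ability + gain + random.gauss(0.0, 1.0)
    if z:
        assigned.append(grade)
        meets_among_assigned.append(1.0 if meets else 0.0)
    else:
        not_assigned.append(grade)

# Intention to treat: effect of being assigned, whether or not students meet.
itt = statistics.fmean(assigned) - statistics.fmean(not_assigned)
# Share of assigned students who actually meet (the bluetooth-style measure).
share_meeting = statistics.fmean(meets_among_assigned)
# Rescaled estimate: effect for those who actually meet their group.
effect_among_meeters = itt / share_meeting

print(f"ITT = {itt:.2f}, share meeting = {share_meeting:.2f}, "
      f"effect among meeters = {effect_among_meeters:.2f}")
```

The ITT is diluted by students who never meet their group, which is one candidate reason for small or mixed peer-effect estimates in allocation-only data.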

Broader issue: Who meets, and how close are they?

• Again: use bluetooth signals to measure meetings (duration, participants)
• Analyzes 3.1 million meetings over two months
• Some results:
– Women/women pairs -> closer
– Facebook friends -> closer
– Same study -> closer
– Difference in beauty -> further apart
– One overweight, one not -> further apart
• People who stand very (too) close to others have fewer friends (!?)

Prediction vs. causality

• Measure class attendance from phone data (wifi/GPS/bluetooth)
– Either: construct clusters at slots known as teaching time; or: use admin info on class locations and construct GPS overlays
• Facebook activity
• Predict grades

Prediction vs. causality

Attendance -> grades/comprehension
– People who attend more learn more
– People who spend less time on Facebook have more time for studying

AND/OR

Grades/comprehension -> attendance
– Find courses hard -> stay at home, more tempted by Facebook

Example: CSS

Heatmap of people with mobile devices on CSS (anonymous)

Example: David on Saturday

Example: David some Saturday

Flea market

Example: how to measure consumer spending

• Economically important:
– Indicator of health of economy
– Important for understanding individual responses to policy
– Ditto for responses to economic shocks
– Important for consumer prices -> inflation -> adjustments of wages and transfers
– In developing countries: important for estimates of poverty, inequality

Example: consumer spending

• Traditional methods:
– Consumer expenditure surveys (DK: forbrugsundersøgelsen)
– Diary or scanner
– Errors, selection
• Economists wanted access to individual spending data from Dankort for a long time
– No luck
• Recently, Statistics Denmark got access to COOP-card data to measure inflation
– To be made public soon, pretty good fit with existing measures (and much faster)
– Nice idea, incentive compatible
– Indep. of payment type
– But selection?

Example: consumer spending

• Attempts in development economics
– Use smartphones as scanner or means of payment – what can we infer about individuals from smartphone use (dedicated users)
– Selection into who has smartphones
– But should be seen against other ways of collecting data
• Qs:
– How can we use smartphones to infer spending better?
– What kinds of economically interesting data can we collect via smartphones?

Statistical analysis of Big Data

• Many observations: what does statistical significance mean?
– And what is practical relevance? Size of effects
• Multiple testing problems? If big data generates many variables, why not run through them all to see what is significant?
– Correct standard errors
• In some cases, ‘eyeball econometrics’ can be difficult
– Need systematic approach
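
To illustrate the multiple testing problem, the sketch below tests many pure-noise variables against a pure-noise outcome. Everything is synthetic. The large-sample z-test on the correlation and the Bonferroni adjustment are choices made for this example; the slide only says that the standard errors need correcting, not how.

```python
import random
import statistics
from statistics import NormalDist

random.seed(5)
N_OBS, N_VARS, ALPHA = 500, 200, 0.05

def corr(x, y):
    """Pearson correlation computed by hand (avoids needing Python 3.10+)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Outcome and candidate predictors are independent noise: no true relationships.
y = [random.gauss(0.0, 1.0) for _ in range(N_OBS)]
z_stats = []
for _ in range(N_VARS):
    x = [random.gauss(0.0, 1.0) for _ in range(N_OBS)]
    # Under the null, r * sqrt(n) is approximately standard normal for large n.
    z_stats.append(corr(x, y) * N_OBS ** 0.5)

naive_crit = NormalDist().inv_cdf(1 - ALPHA / 2)
bonf_crit = NormalDist().inv_cdf(1 - ALPHA / (2 * N_VARS))  # Bonferroni-adjusted

naive_hits = sum(abs(z) > naive_crit for z in z_stats)
bonf_hits = sum(abs(z) > bonf_crit for z in z_stats)
print(f"'significant' at 5% without correction: {naive_hits} of {N_VARS}")
print(f"'significant' with Bonferroni correction: {bonf_hits} of {N_VARS}")
```

With a 5% test and 200 unrelated variables, roughly ten spurious "findings" are expected on average without any correction.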

Statistical/machine learning

• Suppose you have no or very little theory to guide you
• OLS is not only linear, but also presumes some idea of what actually goes in there and how
• Varian’s Titanic example: who survived the Titanic
– Two variables: class and age
– Researcher decides/guesses vs. data analysis yields most likely (decision tree, but lots more complicated -> Sebastian, later)
– Einav, Levin: Econ should consider machine learning

Statistical analysis of Big Data

• But what if you have theory (or think you have)
– e.g. combine econometrics and machine learning
• Goes back to old debate in economics
– Milton Friedman (1953): judge a model by its predictions, not its assumptions
– Machine learning is made for prediction, not for hypothesis testing and theory (in)validation

Strategic data management and production

• People/firms/governments do not always provide truthful and/or complete data
• Example: No penalty for lying in surveys – but no reason not to either
• Political reasons for obscuring or inventing data: Greece in EU, Chinese economy
• Firms: Proprietary info, competition reasons, fooling customers and regulators (VW)

Strategic data management and production

• Individual demand for privacy (we return to this)
– Could be instrumental:
• lack of privacy decreases consumer surplus through better estimates of reservation price (e.g. steering: Mac vs. PC when ordering online)
• Concerns about political issues
– Or an objective in itself: privacy as a political goal

Social desirability bias I

• Key concern in surveys, but a more general problem: What if people answer so as to conform with general notions of what’s desirable?
– Examples: Won’t admit to not voting or having sexually transmitted diseases, exaggerate income
– Report buying healthy food vs. unhealthy food
– Important for asking/assessing sensitive questions

Social desirability bias II

• Why?
• Distinguish
a) self-deception
b) impression management
• Example: What do you value most in a potential mate?
– People say: “kind and understanding”
– From dating data: physical attractiveness, status
– Bias could be both (a) and (b)
