大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random...

53
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 大数据时代的管理决策 Lecture 2 : Econometrics, Machine Learning and Causal Inference Zhaopeng Qu Business School,Nanjing University March. 21, 2020 Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 1 / 53

Transcript of 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random...

Page 1: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

大数据时代的管理决策Lecture 2 : Econometrics, Machine Learning and Causal Inference

Zhaopeng Qu

Business School,Nanjing University

March. 21, 2020

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 1 / 53

Page 2: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Outlines

1 Review to the last lecture

2 The Purpose of Econometrics and Machine Learning

3 Causal Inference in Social Science

4 Experimental Design as an Benchmark

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 2 / 53

Page 3: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Review to the last lecture

Review to the last lecture

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 3 / 53

Page 4: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Review to the last lecture

Introduction: Two Missions for Data Science

My View: In general, a series of scientific methods tosearching for economic logics from data.It could include two jobs

Making a causal inference,such asTesting economic theories.Estimating causal effects.Using data to give policy recommendations.

Forecasting or predicting future valuesMore and more prevalence in

other social science such as political science, sociology,law and education studies etcand business practice, like the hottest one: Data Science.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 4 / 53

Page 5: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Review to the last lecture

Conditional Expectation Function

Conditional ExpectationConditional on X, Y’s Conditional Expectation is

E(Y|X) =

{∑yfY|X(y|x) discrete Y∫

yfY|X(y|x)dy continuous Y

Conditional Expectation Function(CEF) is a function of x, since X isa random variable, so CEF is also a random variable.Intuition:期望就是求平均值,而条件期望就是“分组取平均”或“在... 条件下的均值”。

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 5 / 53

Page 6: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Review to the last lecture

Properties of Conditional Expectation

1 E[c(X) | X] = c(X) for any function c(X).Thus if we know X, thenwe also know c(X).

eg. E[(X2 + 2X3) | X] =X2 + 2X3

if X and Y are independent r.v.s, then

E[Y | X = x] = E[Y]

if X and Y independent conditional on Z, thus X ⊥ Y | Z ,

E[Y | X = x,Z = z] = E[Y | Z = z]

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 6 / 53

Page 7: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

The Purpose of Econometrics and Machine Learning

The Purpose of Econometrics and Machine Learning

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 7 / 53

Page 8: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

The Purpose of Econometrics and Machine Learning

The Purposes of Econometrics(Machine Learning)

To prove or disprove a theory(a relations)“The objective of science is the discovery of the relations”—Lord Kelvin

In most cases,we often want to explore the relationship betweentwo variables in one study.

eg. education and wageThen, in simplicity, there are two relationships between twovariables.

Correlation(相关)V.S. Causality(因果)

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 8 / 53

Page 9: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

The Purpose of Econometrics and Machine Learning

A Classical Example: Hemline Index(裙边指数)

George Taylor, an economist in the United States, made up thephrase it in the 1920s. The phrase is derived from the idea thathemlines on skirts are shorter or longer depending on theeconomy.

Before 1930s, fashion women favored middle skirts most.In 1929, long skirts became popular. While the Dow Jones IndustrialIndex(DJII) plunged from about 400 to 200 and to 40 two years later.In 1960s, DJII rushed to 1000. At the same time, short skirts showedup.In 1970s, DJII fell to 590 and women began to wear long skirts again.In 1990s, mini skirt debuted, DJII rushed to 10000.In 2000s, bikini became a nice choice for girls, DJII was high up to13000.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 9 / 53

Page 10: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

The Purpose of Econometrics and Machine Learning

Hemline Index:1920s-2010s

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 10 / 53

Page 11: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

The Purpose of Econometrics and Machine Learning

The Core of Empirical Studies: Causality v.s. Forecasting

Questions you would like to ask

1 What was the real relationship between two variables? TheAmerican economy depended on how long the skirts or the viceverse?

In other words, is the length of skirts the reason(at least oneof) why economy are good or bad?

2 Now the economy seems extremely bad. But if we could shortenthe length of skirts, could the economy boom again?

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 11 / 53

Page 12: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

The Purpose of Econometrics and Machine Learning

The Core of Empirical Studies: Explaining v.s. Forecasting

1 The first case is a typical “explanatory”question: Trying tofind the reason behind the phenomenon or calculate the effect ofa counterfactual intervention.

2 The second case is a typical“predictive” question: Tryingcalculating the effect of a hypothetical intervention.

Having a good prediction does work sometimes but does NOTmean understanding causality.While In most cases, if we could find real reasons to behind thephenomena, we could safely to predict the phenomenon by aproper intervention.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 12 / 53

Page 13: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

The Purpose of Econometrics and Machine Learning

The Core of Empirical Studies: Causality v.s. Forecasting

Questions you would like to ask

1 What was the real relationship between two variables? TheAmerican economy depended on how long the skirts or the viceverse?

In other words, is the length of skirts the reason(at least oneof) why economy are good or bad?

2 Now the economy seems extremely bad. But if we could shortenthe length of skirts, could the economy boom again?

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 13 / 53

Page 14: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

The Purpose of Econometrics and Machine Learning

Machine Learning is mostly about prediction

Some Big Data researchers think causality is not important anymore in our times..“Look at correlations. Look at the ’what’ rather than the

’why’, because that is often good enough.”-ViktorMayer-Schonberger(2013)

Machine learning is a set of data-driven algorithms that usedata to predict or classify some variable Y as a function of othervariables X.

There are many machine learning algorithm. The bestmethods vary with the particular data applicationMachine learning is mostly about prediction.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 14 / 53

Page 15: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

The Purpose of Econometrics and Machine Learning

Econometrics is mostly about causal inference

Most empirical economists think that correlation only tell us thesuperficial, even false relationship while causal relationship canprovide solid evidence to make interference to the realrelationship.

Today, empirical economists care more about the causalrelationship of their interests than ever before.“the most interesting and challenging research in social

science is about cause and effect”——Angrist andLavy(2008)

The biggest difference between machine learning andeconometrics(or causal inference)

Different goals: Causal inference v.s Prediction

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 15 / 53

Page 16: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

The Purpose of Econometrics and Machine Learning

Econometrics and Machine Learning

By a perspective of methodology, they share some commonstatistical methods.Machine learning use a lot of classical statistical method andseveral advanced econometric method(semi and non parametricalmethods)

Supervised learning: regression and classification,K-nearstneighbors(matching)Unsupervised learning: cluster analysis, principal components analysis.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 16 / 53

Page 17: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

The Purpose of Econometrics and Machine Learning

Our Roadmap

Causal Inference and Random Controlled TrailOLS RegressionLogistical RegressionIntroduction to ML

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 17 / 53

Page 18: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

The Purpose of Econometrics and Machine Learning

An Amazing Journey

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 18 / 53

Page 19: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Causal Inference in Social Science

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 19 / 53

Page 20: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

The Central Question of Causality(I)

A simple example: Do hospitals make people healthier? (Q:Dependent variable and Independent variable?)A naive solution: compare the health status of those who havebeen to the hospital to the health of those who have not.Two key questions are documented by the questionnaires(问卷)from The National Health Interview Survey(NHIS)

1“During the past 12 months, was the respondent a patient ina hospital overnight?”

2“Would you say your health in general is excellent, verygood, good ,fair and poor”and scale it from the number“1”to “5”respectively.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 20 / 53

Page 21: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

The Central Question of Causality(II)Hospital v.s. No Hospital

Group Sample Size Mean Health Status Std.DevHospital 7774 2.79 0.014

No Hospital 90049 2.07 0.003

In favor of the non-hospitalized, WHY?Hospitals not only cure but also hurt people.

1 hospitals are full of other sick people who might infect us2 dangerous machines and chemicals that might hurt us.

More important : people having worse health tends to visithospitals.

This simple case exhibits that it is not easy to answer an causalquestion, so let us formalize an model to show where the problemis.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 21 / 53

Page 22: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

The Central Question of Causality(III)

So A right way to answer a causal questions is construct acounterfactual world, thus “What If ....then”, Such asAn example: How much wage premium you can get from collegeattendance(上大学使工资增加多少?)

For any worker, we want to compareWage if he have a college degree (上了大学后的工资)Wage if he had not a college degree (假设没上大学,工作后的工资)

Then make a difference. This is the right answer to ourquestion.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 22 / 53

Page 23: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Difficulty in Identification

Others are the same asMilitary serviceMigrationRoad buildingJob trainingParty membershipPublic policiesOthers...

Difficulty: only one state can be observed

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 23 / 53

Page 24: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Formalization: Rubin Causal Model

Treatment : Di is a dummy that indicate whether individual ireceive treatment or not

Di =

{1 if individual i received the treatment0 otherwise

Examples:Go to college or notHave health insurance or notJoin a training program or notMake an online-advertisement or not....

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 24 / 53

Page 25: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Formalization: Treatment

Treatment : Di can be a multiple valued(continues) variable

Di = s

Examples:Schooling yearsNumber of ChildrenNumber of advertisementsMoney Supply

For simplicity, we assume treatment variable Di is just a dummy.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 25 / 53

Page 26: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Formalization: Potential Outcomes

A potential outcome is the outcome that would be realized if theindividual received a specific value of the treatment.

Annual earnings if attending to collegeAnnual earnings if not attending to college

For each individual, we has two potential outcomes,Y1i and Y0i,one for each value of the treatment

Y1i : Potential outcome for an individual i with treatment.Y0i : Potential outcome for an individual i with treatment.

Potential Outcomes ={

Y1i if Di = 1

Y0i if Di = 0

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 26 / 53

Page 27: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Stable Unit Treatment Value Assumption (SUTVA)

Observed outcomes are realized as

Yi = Y1iDi + Y0i(1− Di)

Implies that potential outcomes for an individual i are unaffectedby the treatment status of other individual jIndividual j ’ s potential outcomes are only affected by his/herown treatment.Rules out possible treatment effect from other individuals(spillover effect/externality)

ContagionDisplacement

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 27 / 53

Page 28: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Causal effect for an Individual

To know the difference between Y1i and Y0i, which can be saidto be the causal effect of going to college for individual i. (Doyou agree with it?)

DefinitionCausal inference is the process of estimating a comparison ofcounterfactuals under different treatment conditions on the same setof units. It also call Individual Treatment Effect(ICE)

δi = Y1i − Y0i

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 28 / 53

Page 29: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Formalization: Estimate ICE

Due to unobserved counterfactual outcome, we need to makestrong assumptions to estimate ICE.

Rule out that the ICE differs across individuals (“heterogeneity effect”)

Knowing individual effect is not our final goal. As a socialscientist, we would like more to know the Average effect as asocial pattern.So it make us focus on the average wage for a group of people.

How can we get the average wage premium for collegeattendance?

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 29 / 53

Page 30: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Conditional Expectation:

Expectation: We usually use E[Yi] (the expectation of avariable Yi) to denote population average of Yi

Suppose we have a population with N individuals

E[Yi] =1

NΣNi=1Yi

Conditional Expectation:The average wage for those who attend college: E[Yi|Di = 1]The average wage for those who did not attend college: E[Yi|Di = 0]

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 30 / 53

Page 31: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Average Causal Effects

Average Treatment Effect (ATE)

αATE = E[δi] = E[Y1i − Y0i]

It is average of ICEs over the population.

Average treatment effect on the treated(ATT)

αATT = E[δi|Di = 1] = E[Y1i − Y0i|Di = 1]

Average of ICEs over the treated population

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 31 / 53

Page 32: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Fundamental Problem of Causal Inference

We can never directly observe causal effects (ICE, ATE or ATT)Because we can never observe both potential outcomes (Y0i,Y1i)for any individual.We need to compare potential outcomes, but we only haveobserved outcomesSo by this view, causal inference is a missing data problem.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 32 / 53

Page 33: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Fundamental Problem of Causal Inference

Imagine a population with 4 people

i Yi1 Y0i Yi Di Yi1 − Y0i

Tom 3 ? 3 1 ?Jerry 2 ? 2 1 ?

Scarlett ? 1 1 0 ?Nicole ? 1 1 0 ?

What is Individual causal effect (ICE) of attending college forTom? for Nicole?

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 33 / 53

Page 34: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Individual Causal Effect

Suppose we can observe counterfactual outcomes

i Yi1 Y0i Yi Di Yi1 − Y0i

Tom 3 2 3 1 1Jerry 2 1 2 1 1

Scarlett 1 1 1 0 0Nicole 1 1 1 0 0

The ICE for TomδTom = 3− 2 = 11

The ICE for NicoleδNicole = 1− 1 = 0

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 34 / 53

Page 35: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Average Treatment Effect(ATE)

Missing data problem also arises when we estimate ATE

i Yi1 Y0i Yi Di Yi1 − Y0i

Tom 3 ? 3 1 ?Jerry 2 ? 2 1 ?

Scarlett ? 1 1 0 ?Nicole ? 1 1 0 ?E[Y1i] ?E[Y0i] ?

E[Y1i − Y0i] ?

What is the effect of attending college on average wage ofpopulation(ATE)

αATE = E[δi] = E[Y1i − Y0i]

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 35 / 53

Page 36: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Average Treatment Effect(ATE)Missing data problem also arises when we estimate ATE

i Yi1 Y0i Yi Di Yi1 − Y0i

Tom 3 2 3 1 1Jerry 2 1 2 1 1

Scarlett 1 1 1 0 0Nicole 1 1 1 0 0E[Y1i]

3+2+1+14 = 1.75

E[Y0i]2+1+1+1

4 = 1.25

E[Y1i − Y0i] 0.5

What is the effect of attending college on average wage of thepopulation(ATE)

αATE = E[δi] = E[Y1i − Y0i] =1 + 1 + 0 + 0

4= 0.5

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 36 / 53

Page 37: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Average Treatment Effect on the Treated(ATT)

Missing data problem arises when we estimate ATT

i Yi1 Y0i Yi Di Yi1 − Y0i

Tom 3 ? 3 1 ?Jerry 2 ? 2 1 ?

Scarlett ? 1 1 0 ?Nicole ? 1 1 0 ?

E[Y1i|Di = 1] ?E[Y0i|Di = 1] ?

E[Y1i − Y0i|Di = 1] ?

What is the effect of attending college on average wage for thosewho attend college(ATT)

αATE = E[δi] = E[Y1i − Y0i|Di = 1]

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 37 / 53

Page 38: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Average Treatment Effect on the Treated(ATT)Missing data problem also arises when we estimate ATE

i Yi1 Y0i Yi Di Yi1 − Y0i

Tom 3 2 3 1 1Jerry 2 1 2 1 1

Scarlett 1 1 1 0 0Nicole 1 1 1 0 0

E[Y1i|Di = 1] 3+22 = 2.5

E[Y0i|Di = 1] 2+12 = 1.5

E[Y1i − Y0i|Di = 1] 1

The effect of attending college on average wage for those whoattend college(ATT)

αATE = E[Y1i − Y0i|Di = 1] =1 + 1

2= 1

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 38 / 53

Page 39: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Observed Association and Selection Bias

Causality is defined by potential outcomes, not by realized(observed) outcomes.In fact, we can not observe all potential outcomes .Therefore, wecan not estimate the above causal effects without furtherassumptions.By using observed data, we can only establish association(correlation), which is the observed difference in averageoutcome between those getting treatment and those not gettingtreatment.

αcorr = E[Y1i|Di = 1]− E[Y0i|Di = 0]

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 39 / 53

Page 40: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

College vs Non-College Wage Differentials:

Comparing the average wage in labor market who went to collegeand did not go.

College vs Non-College Wage Differentials:

=E[Y1i|Di = 1]− E[Y0i|Di = 0]

={E[Y1i|Di = 1]−E[Y0i|Di = 1]}+ {E[Y0i|Di = 1]− E[Y0i|Di = 0]}

Question 1: Which one defines the causal effect of collegeattendance?

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 40 / 53

Page 41: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Formalization: Rubin Causal Model

Selection Bias(SB) implies the potential outcomes oftreatment and control groups are different even if both groupsreceive the same treatment

E[Y0i|Di = 1]− E[Y0i|Di = 0]

Question 2: Selection Bias is positive or negative in the case?This means two groups could be quite different in otherdimensions: other things are not equal.Observed association is neither necessary nor sufficient forcausality.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 41 / 53

Page 42: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Observed Association:College vs Non-College WageDifferentials:

Missing data problem also arises when we estimate ATE

i Yi1 Y0i Yi Di Yi1 − Y0i

Tom 3 ? 3 1 ?Jerry 2 ? 2 1 ?

Scarlett ? 1 1 0 ?Nicole ? 1 1 0 ?

E[Y1i|Di = 1] 3+22 = 2.5

E[Y0i|Di = 0] 1+12 = 1

E[Y1i|Di = 1]− E[Y0i|Di = 0] 1.5

The Observed Association of attending college on average wage

αcorr = 2.5− 1 = 1.5

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 42 / 53

Page 43: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Observed Association and Selection Bias

But we are interested in causal effect, here is ATT

αATT = E[δi|Di = 1] = E[Y1i − Y0i|Di = 1] = 1

So the selection bias

E[Y0i|Di = 1]− E[Y0i|Di = 0] = 0.5

The Selection Bias is positive: Those who attend college couldbe more intelligent so they can earn more even if they did notattend college.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 43 / 53

Page 44: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Causal Inference in Social Science

Causal Effect and Identification Strategy

Many Many Other examplesthe effect of job training program on worker’s earningsthe effect of class size on students performance....

Identification strategy tells us what we can learn about a causaleffect from the available data.The main goal of identification strategy is to eliminate theselection bias.Identification depends on assumptions, not on estimationstrategies.“What’s your identification strategy?”= what are theassumptions that allow you to claim you’ve estimated a causaleffect?

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 44 / 53

Page 45: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Experimental Design as an Benchmark

Experimental Design as an Benchmark

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 45 / 53

Page 46: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Experimental Design as an Benchmark

Randomized Controlled Trial

A randomized controlled trial (RCT) is a form of investigation inwhich units of observation (e.g. individuals, households, schools,states) are randomly assigned to treatment and control groups.RCT has two features that can help us hold “other things equal”and then eliminates selection bias

Random assign treatment:Randomly assign treatment (such as a coin flip) ensures that everyobservation has the same probability of being assigned to the treatmentgroup.Therefore, the probability of receiving treatment is unrelated to anyother confounding factors.

Sufficient large sampleLarge sample size can ensure that the group differences in individualcharacteristics wash out

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 46 / 53

Page 47: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Experimental Design as an Benchmark

How to Solves the Selection Problem

Random assignment of treatment Di can eliminates selectionbias. It means that the treated group is a random sample fromthe population.Being a random sample, we know that those included in thesample are the same, on average, as those not included in thesample on any measure.Mathematically ,it makes Di independent of potentialoutcomes, thus

Di ⊥ (Y0i,Y1i)

Independence: Two variables are said to be independent ifknowing the outcome of one provides no useful information aboutthe outcome of the other.

Knowing outcome of Di(0, 1) does not help us understand whatpotential outcomes of (Y0i,Y1i) will be

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 47 / 53

Page 48: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Experimental Design as an Benchmark

Random Assignment Solves the Selection Problem

So we haveE[Y0i|Di = 1] = E[Y0i|Di = 0]

Thus the Selection Bias equals to ZERO.Then ATT equals Observed Association because the

E[Y1i|Di = 1]− E[Y0i|Di = 0] = E[Y1i|Di = 1]− E[Y0i|Di = 1]

=E[Y1i − Y0i|Di = 1]

No matter what assumptions we make about the distribution ofY , we can always estimate it with the difference in means.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 48 / 53

Page 49: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Experimental Design as an Benchmark

Our Benchmark: Randomized Experimental Methods

Think of causal effects in terms of comparing counterfactuals orpotential outcomes. However, we can never observe bothcounterfactuals —fundamental problem of causal inference.To construct the counterfactuals, we could use two broadcategories of empirical strategies.

Random Controlled Trials/Experiments:it can eliminates selection bias which is the mostimportant bias arises in empirical research. If we couldobserve the counterfactual directly, then there is noevaluation problem, just simply difference.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 49 / 53

Page 50: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Experimental Design as an Benchmark

Our Benchmark: Randomized Experimental Methods

We can generate the data of our interest by controllingexperiments just as physical scientists or biologists do. But tooobviously, we face more difficult and controversy situation thanthose in any other sciences.The various approaches using naturally-occurring data providealternative methods of constructing the proper counterfactual

EconometricsCongratulation! We are working and studying in a more tough andintractable area than others including most science knowledge.

We should take the randomized experimental methods as ourbenchmark when we do empirical research whatever the methodswe apply.

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 50 / 53

Page 51: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Experimental Design as an Benchmark

Program Evaluation Econometrics(项目评估计量经济学)

Question: How to do empirical research scientifically when wecan not do experiments? It means that we always have selectionbias in our data, or in term of “endogeneity”.Answer: Build a reasonable counterfactual world by naturallyoccurring data to find a proper control group is the core ofeconometrical methods.Here you Furious Seven Weapons in Applied Econometrics(七种盖世武器)

1 Random Controlled Trials(RCT)2 OLS(回归)3 Matching(匹配)4 Decomposition(分解)5 Instrumental Variable(工具变量)6 Regression Discontinuity(断点回归)7 Differences in Differences(双差分) and Synthetic Control (合成控制法)

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 51 / 53

Page 52: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Experimental Design as an Benchmark

Program Evaluation Econometrics(项目评估计量经济学)

Question: How to do empirical research scientifically when wecan not do experiments? It means that we always have selectionbias in our data, or in term of “endogeneity”.

Answer: Build a reasonable counterfactual world by naturallyoccurring data to find a proper control group is the core ofeconometrical methods.Here you Furious Seven Weapons in AppliedEconometrics(七种盖世武器)

1 Random Controlled Trials(RCT)2 OLS(回归)3 Matching(匹配)4 Decomposition(分解)5 Instrumental Variable(工具变量)6 Regression Discontinuity(断点回归)7 Differences in Differences(双差分) and Synthetic Control (合成控制法)

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 52 / 53

Page 53: 大数据时代的管理决策 - byelenin.github.io · a random variable, so CEF is also a random variable. Intuition:期望就是求平均值,而条件期望就是“分组取平均”或

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Experimental Design as an Benchmark

Q & A

Question?

Zhaopeng Qu (Nanjing University) 大数据时代的管理决策 March. 21, 2020 53 / 53