March 19 - King's Facultydkerr.kingsfaculty.ca/dkerr/assets/lecture11_3306_2015... · 2016. 2....

March 19th

Tamara Stallard; Cynthia Wheeler; Chris Tuskan; Daniella

Odesa; Kristine Petitta; Taylor Hood; Michelle Janulis; Diana

Rusin;

March 26th

Nicole Bullock; Kelly Robertson; Daniel Wood; Steph Boak;

Pouya Moghaddam; Joanna Lee; Nicole Corbo; Misghana

Ghebredingle

April 2nd (last class)

Blake Huggins; Jamie-Lee Bossenberry; Ryan O’Quinn;

Sarah Stewart; Brittany Jenkins; Ryan Higgins; Greg

Morrow; Baley Gofton; Cynthia Hill; Brandi Spitzig;

Class presentation (10 minutes each)

Brief introduction to your topic:

Dependent variable & independent variables

Sample (what’s your target population)

What are you anticipating? (optional: causal diagram)

What has previous literature suggested that you can expect?

Findings: optional

• 1 page handout and/or PowerPoint (for my records).

• Assignment 3: due next week, March 19th

• Do note: in this assignment I also get you to run the model with the “WEIGHTED” data.

• Today I will begin by talking a bit about

this (sampling and sample weights)

• To be followed by:

• Tips on creating models, regardless of

whether we are working with OLS or

Logistic Regression..

In Soc 2206, you learned a bit about sampling strategies..

1. Simple random sampling (SRS): everyone in the target population has an equal chance of being selected.

2. Stratified samples: first divide your sampling frame into strata (e.g. provinces, or anything else), and then use SRS within each stratum.

Proportionate: the number sampled within each stratum is proportionate to its population

Disproportionate: the number sampled within each stratum is not proportionate to its population

Typical of Stats Can: stratified by province, over-representing some provinces and under-representing others in their samples.

3. Cluster sampling (involving clusters, which can be almost anything: geographies, institutions, social groups, etc.)

Multistage sample of Canadian households (more than 10 million in Canada)

1. Randomly select metropolitan areas across Canada (35 CMAs currently)

2. Within the selected CMAs, randomly select census tracts (typically 100s)

3. Within the selected census tracts, randomly select city blocks (typically dozens within each)

4. Within the selected city blocks, randomly select households (typically 100s)

Sample size is important, BUT:

Beyond a certain size (e.g. 5000), there are limited returns on reducing sampling error

ANOTHER IMPORTANT POINT:

In predicting “sampling error”, it is “SAMPLE SIZE” that counts and not

the size of the “targeted POPULATION”

In the above example, to get a margin of error of +/- 2.5 %:

you require a sample of 1,534 persons for a population of 1,000,000

you require a roughly equal sample size for a population of 100,000

In other words, in Stats Can’s surveys, a sample of roughly a couple of 1000 in

each province would give estimates of roughly equivalent sampling error

for all provinces, regardless of the size of their respective populations..
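The slides do not state the formula behind the 1,534 figure. A minimal sketch, assuming the usual sample-size formula for a proportion with a finite population correction, 95 per cent confidence (z = 1.96) and the most conservative proportion (p = 0.5); the function name is hypothetical:

```python
def required_sample_size(population, margin, z=1.96, p=0.5):
    """Sample size needed to estimate a proportion within +/- margin.

    Assumptions (not stated on the slide): z = 1.96 (95 % confidence)
    and p = 0.5 (worst case), with a finite population correction.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # size for an infinite population
    return n0 / (1 + (n0 - 1) / population)     # shrink slightly for a finite one

for population in (100_000, 1_000_000, 35_540_419):
    n = required_sample_size(population, margin=0.025)
    print(f"population {population:>10,}: n is about {n:,.0f}")
```

Under these assumptions the required sample is about 1,534 for a population of 1,000,000 and about 1,513 for a population of 100,000, which is why the size of the population barely matters.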

Assume we take a simple random sample (SRS) of Canada:

Every individual is given an equal chance of selection in the final sample

Let’s assume that we interview 2000 Canadians in our sample (n = 2000)

Probability of selection: n/P, i.e. 2000/35,540,419 = 0.000056274

Each person in this sample has the same “weight” (the inverse of the probability of selection):

W = P/n = 35,540,419/2,000 = 17770.2

2014                         Population (P)   Sample (n)
Canada                       35,540,419       2000
Newfoundland and Labrador    526,977          30
Prince Edward Island         146,283          8
Nova Scotia                  942,668          53
New Brunswick                753,914          42
Quebec                       8,214,672        462
Ontario                      13,678,740       770
Manitoba                     1,282,043        72
Saskatchewan                 1,125,410        63
Alberta                      4,121,692        232
British Columbia             4,631,302        261

In an SRS (n=2000) of Canadians, each person in our “unweighted” sample would represent 17,770 persons…

It is possible to “weight your sample” by making each case in your sample represent 17770.2 cases (your weighted sample would then look like your population)
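The arithmetic above can be checked in a couple of lines. This is only a sketch using the two figures quoted on the slide; the variable names are illustrative:

```python
population = 35_540_419   # Canada, 2014 (figure quoted on the slide)
n = 2_000                 # size of the simple random sample

prob_selection = n / population   # same for every person under SRS
weight = population / n           # inverse of the probability of selection

print(f"probability of selection: {prob_selection:.9f}")
print(f"weight per case:          {weight:.1f}")
print(f"weighted sample size:     {weight * n:,.0f}")
```

Multiplying the weight by the number of cases returns the population total, which is what it means for the weighted sample to look like the population.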


While this sample works well at the national level (n=2000), we cannot draw inferences for specific provinces (for example, PEI = 8 cases)

For this reason, all of Stats Can’s surveys make sure that they have at least a few 1000 cases for every province

Why? To get reasonable quality statistical estimates for all provinces…

The GIS, NLSCY and Health Survey are all:

Stratified Samples (Disproportionate to size)

1. Divide Canada up into provinces

2. Take sufficient sample from each province to get good estimates

3. Simple random sample within provinces

Unweighted sample

Note: the probability of selection differs by province

PEI: 1500/146,283

Ontario: 4000/13,678,740

Similarly, weights differ by province

PEI: 146,283/1500 = 97.5

Ontario: 13,678,740/4000 = 3419.7

Unweighted sample here: roughly 6.00 per cent (1500/25,000) of the sample is in PEI. In the population it is 0.41 per cent (146,283/35,540,419).

NOTE: WEIGHTED RESULTS WILL HAVE THE SAME DISTRIBUTION AS THE POPULATION
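To see why weighting restores the population distribution, here is a minimal sketch using only the two provinces quoted above (PEI and Ontario). The dictionary layout and function name are hypothetical, and the shares it prints are shares of this two-province illustration only, not of Canada:

```python
# Population (2014) and stratum sample sizes quoted on the slide.
strata = {
    "PEI":     {"population": 146_283,    "sample": 1_500},
    "Ontario": {"population": 13_678_740, "sample": 4_000},
}

def design_weight(population, sample):
    """Inverse of the probability of selection within a stratum."""
    return population / sample

total_pop = sum(s["population"] for s in strata.values())
total_n = sum(s["sample"] for s in strata.values())

for name, s in strata.items():
    w = design_weight(s["population"], s["sample"])
    unweighted_share = s["sample"] / total_n
    weighted_share = w * s["sample"] / total_pop   # equals the population share
    print(f"{name}: weight {w:.1f}, "
          f"unweighted share {unweighted_share:.1%}, "
          f"weighted share {weighted_share:.1%}")
```

PEI is heavily over-represented in the unweighted sample, but once each case is multiplied by its weight the two provinces return to their population proportions.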

In assignment 3, I have you run your “model results” with the

appropriate “weights” (easy to do)

Corrects for potential biases due to your sampling strategy.

e.g. with the Census, I focus on “likelihood of

low income” as my dependent variable

e.g. assume my research interests relate to the higher than

average incidence of low income among immigrants in Canada

Immigrant status and the likelihood of low income will be my primary

emphasis

Other variables? What’s important in the literature?

Recommend either Social Science Citation Index or

Sociological Abstracts..

Therefore I ask you in Assignment 3 (necessary step):

Find 5 studies that explicitly focus on your topic (in my example: why immigrants are more likely to experience low income)

Note: your literature review is brief, so stick to research directly related to your topic.

Relevant in this context:

What, if any, sub-sample should be selected for your research?

No need to focus on a specific subsample here: I will be comparing immigrants with other Canadians (the full population will be involved)

http://www.kings.uwo.ca/academics/sociology/resources-and-information/sociology-department-academic-awards/

Example:

This study hypothesizes:

Canadian immigrants are more likely to experience low income than other Canadians.

This relationship is expected to be partially explained by “length of residence in Canada”.

The incidence of low income is expected to be highest among recent immigrants and lowest among well-established immigrants

Yet the disadvantage of being an immigrant is expected to persist, even among more established immigrants, even after controlling for other relevant variables (sex, language, age and education)

Relevant in developing hypotheses:

Types of Multivariate Relationships

With contingency tables, we covered:

1. Spuriousness (not likely in your paper)

2. Causal chains

3. Suppressor variables

4. Multiple causes (independent effects)

5. Interaction effects

With regression, we can test for all of these

1. Spurious relationships

• Initially an association is documented, yet with a control, the initial relationship disappears

Evidence in regression:

• Initial bivariate regression has a statistically significant slope or odds ratio

• When the control variable(s) are introduced, the coefficient is no longer significant

Bivariate: X1 -> Y


Multivariate: X1 <- X2 -> Y


Example: We conduct research on a sample of FORD

assembly line workers and document a positive

relationship between “Salary” and “Absenteeism”:


One might speculate a spurious relationship:

Salary <- Age -> Absenteeism
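A small simulation can illustrate what spuriousness looks like in regression output. This is a sketch on synthetic data, not the Ford data: it assumes that age raises both salary and absenteeism and that salary has no effect of its own. All numbers and helper names are invented for illustration. The slope for salary is sizeable in the bivariate regression and close to zero once age is controlled:

```python
import random

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(x, y):
    """Part of y that is left over after removing its linear relation with x."""
    intercept, slope = simple_ols(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Synthetic workers (assumption): age drives both variables.
rng = random.Random(3306)
age = [rng.uniform(20, 60) for _ in range(5000)]
salary = [a + rng.gauss(0, 5) for a in age]          # in $1000s
absent = [0.2 * a + rng.gauss(0, 2) for a in age]    # days absent per year

_, bivariate_slope = simple_ols(salary, absent)

# Slope of salary controlling for age: regress age-adjusted absenteeism on
# age-adjusted salary. This equals the salary coefficient in the multiple
# regression of absenteeism on salary and age.
_, partial_slope = simple_ols(residuals(age, salary), residuals(age, absent))

print(f"bivariate slope of salary:       {bivariate_slope:.3f}")
print(f"slope of salary controlling age: {partial_slope:.3f}")
```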

Types of Multivariate Relationships

2) Chain relationships

• A relationship exists between X1 and Y at the bivariate level, which is modified with the addition of control variable(s)

• Consider:

X1 -> Y

X1 -> X2 -> Y

In this case, X1 is said to be the antecedent variable in the causal chain,

whereas X2 is referred to as an “intervening” variable..

Consider the following example:

We consider gender and income; what sorts of intervening variables might explain the initial relationship?

SEX -> ? (X2) -> income

• Assume we are examining the relevance of “sex” to “market earnings”..

• Sex -> Market income

• (0-female; 1-male)

• Run a linear regression:

Men are earning $18,111 more than women

beta suggests a moderate effect (.154)

Why such a gap?

How about the simple fact that women are more likely to work part-time?

Sex -> # hrs worked (weekly) -> Market income

(0-female; 1-male)

Run a regression with both independent variables:

Difference persists!! Women are now making 12000 less

Yet not as strong an effect..

Part of the initial relationship between sex and income is explained by the intervening

variable (hours worked). Sex (as an antecedent variable) continues to be important,

even after controlling for hours worked.


Types of Multivariate Relationships

2) Chain relationships

• Consider: X1 -> X2 -> Y

• If we control for X2, there are various possibilities for the initial relationship (X1 and Y):

• - the initial effect of X1 on Y might disappear completely

• - the initial effect of X1 on Y is weakened (this is the most common outcome)

• - the initial effect of X1 on Y can even get larger (rarer, but it can happen)

• Note: the results in multiple regression can sometimes look the same as with spuriousness (i.e. the initial relationship disappears). The major difference is in interpretation

• Also: although the effect of X1 on Y might disappear, X1 is still involved in our “causal explanation” as an “indirect” cause

• Another example:

• Return to my initial hypothesis:

• Canadian immigrants are more likely to experience low income than other Canadians.

• This relationship is expected to be partially explained by “length of residence in Canada”.

• Length of residence in Canada -> Low income

We create a variable for this purpose, to be run in logistic regression:

YR OF IMMIGRATION -> Low income

0 – Canadian born (set as reference category)

1 – immigrated prior to 1980

2 – immigrated 1980-1999

3 – immigrated 2000 or later (recent immigrant)
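The recoding above can be sketched as follows. The function names are hypothetical, and using `None` to stand for the Canadian born is an assumption made for this illustration. The second helper shows the indicator (dummy) variables that a logistic regression would use, with the reference category coded as all zeros:

```python
def immigration_period(year_of_immigration):
    """Code year of immigration into the four categories listed above.

    None stands for the Canadian born (reference category, code 0).
    """
    if year_of_immigration is None:
        return 0
    if year_of_immigration < 1980:
        return 1
    if year_of_immigration <= 1999:
        return 2
    return 3

def dummies(code, n_categories=4):
    """Indicator variables for categories 1..n-1; the reference (0) is all zeros."""
    return [1 if code == k else 0 for k in range(1, n_categories)]

for year in (None, 1975, 1992, 2005):
    code = immigration_period(year)
    print(year, code, dummies(code))
```

Each odds ratio in the output is then read relative to the omitted category, the Canadian born.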






Immigrants who came to Canada since 2000 are the most disadvantaged:

314.1 per cent higher odds of being poor relative to our reference category, the Canadian born: (4.141 – 1.0) * 100

Immigrants who came 1980-1999 still have higher odds: 79 per cent higher odds of poverty, (1.790 – 1.0) * 100

Interestingly, the immigrants who arrived prior to 1980 have lower odds:

(0.744 – 1.0) * 100 = -25.6, i.e. 25.6 % lower
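The conversion from an odds ratio to a percentage difference in odds is the same in each case. A minimal sketch using the three odds ratios quoted above (the function name is illustrative):

```python
def percent_change_in_odds(odds_ratio):
    """(odds ratio - 1) * 100: positive means higher odds than the reference."""
    return (odds_ratio - 1.0) * 100

odds_ratios = {
    "immigrated prior to 1980": 0.744,
    "immigrated 1980-1999": 1.790,
    "immigrated 2000 or later": 4.141,
}

for group, ratio in odds_ratios.items():
    change = percent_change_in_odds(ratio)
    direction = "higher" if change > 0 else "lower"
    print(f"{group}: odds {abs(change):.1f} % {direction} than the Canadian born")
```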


• Might “Knowledge of Official Language” be relevant in this context, as an important intervening variable? Is this why recent immigrants are struggling?

YR OF IMMIGRATION -> Language -> Low income

• Might an important reason why recent immigration is so important in explaining “poverty” be the simple fact that immigrants are less likely to converse in English and/or French (Canada’s official languages)?












YR OF IMMIGRATION -> language -> Low income

0 – English (our reference category)

1 – French

2- English and French

3- NO KNOWLEDGE

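Odds of low income relative to the Canadian born:

Year of immigration    Before control    After controlling for language
Prior to 1980          25.6 % lower      24.9 % lower
1980-1999              79.0 % higher     75.4 % higher
2000 or later          314.1 % higher    303.3 % higher

Immigrants are only slightly less likely to be poor after introducing the control

Persons with no knowledge of either Eng/French have 67.3 % higher odds of poverty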



Multiple causes (independent effects)

• Occurs when independent variables have separate effects on the dependent variable

• The introduction of controls has little influence on initial bivariate associations

X1 -> Y

X2 -> Y

X3 -> Y

X4 -> Y

• Slope in simple regression very similar to that obtained in multiple regression

• eg. Age, sex, urban/rural residence & province, all have independent effects on Y

NOTE: in the context of my research on immigration and low income

Immigrant -> low income

I would want to control for age/sex as relevant background variables

Immigrant -> low income

Sex -> low income

Age -> low income

Do all these variables have independent effects in explaining low income?

Of these variables, how does each impact low income, when controlling for the others? (which seems to be most important?)

Or are immigrants more likely to be poor merely because they are much younger? With more women than men? Are the effects “not independent”?

Run initial logistic regression:

Dependent variable (0-not low income; 1- low income)

Independent variable (0-not immigrant; 1-immigrant)

Canadian immigrants are more likely to experience low income than

other Canadians.

Odds of low income

are 95.6 per cent

higher for immigrants

relative to other

Canadians.

What of the relevance of age/sex??

Run a second regression (with controls for age/sex)

We introduce controls for “age and sex”…

sex (0-female; 1-male); age (5 year age groups)

Initial regression: odds of low income are 95.6 per cent higher for immigrants relative to other Canadians.

With controls: odds of low income are now 90.7 per cent higher for immigrants relative to other Canadians, after controlling for age/sex

Slight decline in the effect of Immigrant, in controlling for these variables ..

The effects seem to be largely independent here..

RESULTS of MULTIPLE REGRESSION:


4. Suppressor variables

• initially we find no relationship between two variables (i.e., non-significant slope)

• After introducing control variable(s), the slope becomes significant

• OR

• initially we find a relationship between two variables, and the relationship gets stronger after the control variable is introduced

Example: Consider the earnings of women aged 18-39 (relative to men) in the 2006 Census Public Use File

Recall earlier: for all persons 18+, the difference was roughly $18,000

Sex (0-female, 1-male)

Among those aged 18-39, men are still earning $11,587 more than women

What’s going on here?

What do you predict if we control for education?

Are young women paid less because they are less educated?


Effect of sex is even greater when we control for whether or not someone

attended university!

Education does not explain the lower income of women! In fact, education

serves as a suppressor variable (the effect of gender gets even stronger)!



Interaction Effects

• When you enter a statistical control, the original bivariate association differs by category of the control variable

• Previous examples with contingency tables:

Contingency Table Relating Education, Income and Place of Birth

             Foreign born                             Canadian born
Education    High income   Low income    Total        High income   Low income    Total
High         125 (35.7%)   225 (64.3%)   350          125 (83.3%)   25 (16.7%)    150
Low          65 (34.2%)    125 (65.8%)   190          80 (39.0%)    125 (61.0%)   205
Total        190           350           540          205           150           355
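One way to read the interaction out of this table is to compare the education gap in the percentage with high income across the two groups. A minimal sketch using the counts above (the data layout and names are illustrative):

```python
# Counts (high income, low income) taken from the contingency table.
counts = {
    "Foreign born":  {"High education": (125, 225), "Low education": (65, 125)},
    "Canadian born": {"High education": (125, 25),  "Low education": (80, 125)},
}

def pct_high_income(high, low):
    """Per cent of the row that falls in the high income column."""
    return 100 * high / (high + low)

education_gap = {}
for birthplace, rows in counts.items():
    high_ed = pct_high_income(*rows["High education"])
    low_ed = pct_high_income(*rows["Low education"])
    education_gap[birthplace] = high_ed - low_ed
    print(f"{birthplace}: {high_ed:.1f}% vs {low_ed:.1f}% high income, "
          f"gap of {education_gap[birthplace]:.1f} points")
```

The gap is about 1.5 percentage points among the foreign born and about 44.3 points among the Canadian born: the association between education and income differs by category of the control variable, which is what an interaction means.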

Interaction Effects

• Can also test for interaction with regression when working with quantitative data (interval/ratio)

• When testing for interaction using multiple regression, we are testing whether the effect of an independent variable (X1) on Y differs by category of another independent variable (X2)

• Regression can test for this by introducing “interaction terms” into the multiple regression

Interaction Effects

Modeling Interaction effects:

• If variables interact, we can improve the fit of our

model by introducing “cross-product” terms (also

referred to as “interaction terms”)

• A “cross-product” term is an artificial variable created

by multiplying two variables together

• If we want to test for an interaction between X1 and X2 in explaining Y, we include a new variable which is simply the product of X1 and X2:

$\hat{Y} = a + b_1 X_1 + b_2 X_2 + b_3 X_1 X_2$

Interaction Effects

• If the slope of our interaction term (b3) is found to be significant, we find evidence of a significant interaction between X1 and X2 in explaining Y

• If b3 is not significant, it is best to drop this “interaction term” from the regression

• Example: 2 variables used to predict “income”

• -> immigrant (0-no; 1-yes)

• -> university (0-no; 1-yes)

• As in the cross tab, we will test whether immigrant status and education interact in explaining market income

• Using linear regression (interval/ratio dependent variable)

In our initial regression, we include both variables:

University (0-no; 1-yes)

Immigrant (0-no; 1-yes)

University grads make more (+$33,548) when controlling for Immigrant status

Immigrants make less (-$7879) when controlling for education

• We want to determine whether there’s an

interaction effect ..

• Does the effect of having a university

education differ for immigrants relative to

the Canadian born?

In this case, we clearly have an interaction effect…

Interaction term

is significant!!

CAN CONCLUDE:

The effect of education is not the same for immigrants/Canadian born…

Interaction Effects

Example (page 406-407 of text):

• Examining the relationship between:

Y mental impairment score (index)

X1 life events (number of stressful events)

X2 SES (index on socioeconomic status)

• Our research hypothesis: X1 has a positive effect on Y

X2 has a negative effect on Y

• As # of stressful events ↑, mental impairment ↑

• As SES ↑, mental impairment ↓

Interaction Effects

Results:

Coefficients(a): Unstandardized Coefficients

Model 1       B          Std. Error   t         Sig.
(Constant)    28.2298    2.1742       12.984    .0001
LIFE          0.1033     0.0325       3.177     .0030
SES           -0.0975    0.0291       -3.351    .0019

Interaction Effects

• Assume that we hypothesize an interaction: upper SES persons are better able to handle stressful life events than are lower SES persons

• That is, the effect of LIFE events is expected to interact with SES: as SES increases, the effect of LIFE events on mental impairment decreases

• Our results with the interaction term:

Interaction Effects

Coefficients(a): Unstandardized Coefficients

Model 1       B          Std. Error   t         Sig.
(Constant)    26.0366    3.9488       6.594     .0001
LIFE          0.1559     0.0853       1.826     .0761
SES           -0.0604    0.0627       -0.965    .3409
LIFE*SES      -0.0008    0.0013       -0.668    .5087

The interaction term is not significant (p = .5087) and the

direct effects of SES and Life are no longer significant

Thus, we don’t have support for the hypothesis that SES

and LIFE interact to affect mental impairment

Interaction Effects

• What if the interaction of LIFE * SES were significant?

• The negative slope suggests that the effect of Life Events on Mental Impairment gets weaker as SES gets higher

• That is, Life Events is more likely to lead to Mental Impairment for those of lower socio-economic status

• If the coefficient were positive? Life Events would have a stronger impact on Mental Impairment for people of higher SES
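With an interaction term in the model, the slope for Life Events is no longer a single number: it is b(LIFE) + b(LIFE*SES) × SES. A minimal sketch using the coefficients from the table with the interaction term. The SES values 0, 50 and 100 are hypothetical, since the range of the index is not given here, and the interaction was not significant in this example, so this only illustrates how the reading would work:

```python
B_LIFE = 0.1559          # slope for LIFE from the model with the interaction
B_LIFE_X_SES = -0.0008   # slope for the LIFE*SES cross-product term

def life_effect(ses):
    """Effect of one additional life event on mental impairment at a given SES."""
    return B_LIFE + B_LIFE_X_SES * ses

for ses in (0, 50, 100):   # hypothetical SES scores
    print(f"SES = {ses:>3}: effect of one more life event = {life_effect(ses):.4f}")
```

Because the cross-product coefficient is negative, the effect of each additional life event shrinks as SES rises.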

Interaction Effects

• Again, if the interaction effect is not significant, then drop it from the model

• You MUST ALWAYS include both variables you hypothesize to be interacting (called the “main effects”), along with the interaction term in your regression model

• You cannot introduce all potential interaction terms into your multivariate model just to see what is significant. Prior to testing for an interaction effect, you should be able to justify it theoretically

• If an interaction effect really exists, then it makes no sense to interpret the main effects

• PART C. Model Building

Examining the likelihood of low income, among Canadians with a

specific emphasis on the experience of immigrants.

The incidence of low income is expected to be highest among recent

immigrants and lowest among well-established immigrants

Yet the disadvantage of being an immigrant is expected to persist, even among more established immigrants, even after controlling for other relevant variables (language, education, sex and age)

• Model 1.

• Yr of immigration -> incidence of low income

The disadvantage of being an immigrant is expected to persist, even among more established immigrants, even after controlling for other relevant variables (sex, language, age and education)

Can merely plug in values, for interpretation if you wish:

The above results give us the following equation:

Y = 27864.481 + 38215.784 (UNIV) -4349.417 (IMMIG) – 15237.378 (UN_IMM)

Remembering that 2 variables used to predict “income” are coded as follows:

-> UNIV (0-no; 1 yes)

-> IMMIG (0-no; 1 yes)

Note: if either UNIV=0 or IMMIG=0, then UN_IMM must be 0 as well

What is the effect of having a university education for the Canadian born?

Y = 27864.481 + 38215.784 (UNIV) - 4349.417 (0) - 15237.378 (0)

it means a gain of $38215.78 for those with a degree

What is the effect of having a university education for an immigrant?

Y = 27864.481 + 38215.784 (UNIV) - 4349.417 (1) - 15237.378 (UNIV)

it means a gain of $22978.40 (38215.78 - 15237.38) for those with a degree
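Plugging in values can be done for all four groups at once. A minimal sketch using the coefficients of the equation above (the function names are illustrative). Note that for an immigrant IMMIG = 1, so the gain from a degree is 38215.784 - 15237.378 = 22978.41:

```python
# Coefficients from the regression equation with the interaction term.
A, B_UNIV, B_IMMIG, B_UN_IMM = 27864.481, 38215.784, -4349.417, -15237.378

def predicted_income(univ, immig):
    """Predicted market income; univ and immig are coded 0 (no) or 1 (yes)."""
    un_imm = univ * immig   # cross-product term, 0 unless both are 1
    return A + B_UNIV * univ + B_IMMIG * immig + B_UN_IMM * un_imm

def university_gain(immig):
    """Difference in predicted income between having and not having a degree."""
    return predicted_income(1, immig) - predicted_income(0, immig)

for immig, label in ((0, "Canadian born"), (1, "Immigrant")):
    print(f"{label}: no degree {predicted_income(0, immig):,.2f}, "
          f"degree {predicted_income(1, immig):,.2f}, "
          f"gain {university_gain(immig):,.2f}")
```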

NOTE:
