Graduate School Quantitative Research Methods Gwilym Pryce [email protected]

  • date post

    02-Jan-2016

  • Graduate School Quantitative Research Methods. Gwilym Pryce, [email protected]. Lecture 7: Two Way Tables

  • Notices: Register

  • Aims and Objectives:

    Aim: This session introduces methods of examining relationships between categorical variables.

    Objectives: By the end of this session the reader should be able to understand how to examine relationships between categorical variables using 2 way tables and the chi-square test for independence.

  • Plan: 1. Independent events; 2. Contingent events; 3. Chi-square test for independence; 4. Further study

  • 1. Probability of two independent events occurring

    If knowing that one event occurs does not affect the outcome of another event, we say those two events are independent. And if A and B are independent, and we know the probability of each of them occurring, we can calculate the probability of them both occurring.

  • Example: You have a two sided die and a coin; find Pr(1 and H).

    Answer: the two outcomes are independent, so Pr(1 and H) = Pr(1) x Pr(H).

    Rule: $P(A \cap B) = P(A) \times P(B)$

  • e.g. You have one coin which you toss twice: what's the probability of getting two heads? Suppose A = 1st toss is a head and B = 2nd toss is a head. What is the probability of $A \cap B$?

    Answer: A and B are independent and are not disjoint. P(A) = 0.5 and P(B) = 0.5, so $P(A \cap B) = 0.5 \times 0.5 = 0.25$.
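The product rule for the two-toss example can be checked by listing the sample space. This is only an illustrative sketch: it assumes a fair coin, and the helper name `prob` is invented for the example.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin: HH, HT, TH, TT (equally likely).
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event = favourable outcomes / all outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

p_a = prob(lambda o: o[0] == "H")            # A: 1st toss is a head
p_b = prob(lambda o: o[1] == "H")            # B: 2nd toss is a head
p_a_and_b = prob(lambda o: o == ("H", "H"))  # A and B both occur

print(p_a, p_b, p_a_and_b)     # 1/2 1/2 1/4
print(p_a_and_b == p_a * p_b)  # True: the product rule holds
```

Counting outcomes directly gives the same 0.25 as multiplying the two probabilities, which is exactly what independence means here.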

  • 2. Probability of two contingent events occurring

    If knowing that one event occurs does change the probability that the other occurs, then the two events are not independent and are said to be contingent upon each other. If events are contingent then we can say that there is some kind of relationship between them. So testing for contingency is one way of testing for a relationship.

  • Example of contingent events: There is a 70% chance that a child will go to university if their parents are middle class, but only a 10% chance if their parents are working class. Given that there is a 60% chance of a child's parents being working class: What are the chances that a child will be working class and go to University? What proportion of people at university will be from working class backgrounds?

  • A tricky one...

  • 6% of all children are both working class and end up going to University

  • % = as percent of all children

  • % at Uni from WC parents?

    Of all children, only 34% end up at university (6% WC + 28% MC). I.e. 6 out of every 34 University students are from WC parents: 6/34 = 17.6% of University students are WC.
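The arithmetic behind the class and university example can be laid out step by step. The sketch below uses only the percentages quoted in the example (60% working class, 10% and 70% chances of university); the variable names are made up for illustration.

```python
# Figures quoted in the example.
p_wc = 0.60            # P(parents are working class)
p_mc = 1 - p_wc        # P(parents are middle class)
p_uni_given_wc = 0.10  # P(university | working class)
p_uni_given_mc = 0.70  # P(university | middle class)

# Contingent events: P(x and y) = P(x) * P(y given x).
p_wc_and_uni = p_wc * p_uni_given_wc
p_mc_and_uni = p_mc * p_uni_given_mc

# Everyone at university is either WC or MC, so the two joint probabilities add up.
p_uni = p_wc_and_uni + p_mc_and_uni

# Share of university students who come from working class backgrounds.
p_wc_given_uni = p_wc_and_uni / p_uni

print(round(p_wc_and_uni, 4), round(p_mc_and_uni, 4))  # 0.06 0.28
print(round(p_uni, 4), round(p_wc_given_uni, 4))       # 0.34 0.1765
```

So 6% of all children are working class and go to university, 34% of all children go to university, and roughly 17.6% of university students are working class.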

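  • Probability theory states that: if x and y are independent, then the probability of events x and y simultaneously occurring is simply equal to the product of the probabilities of the two events occurring: $\text{Prob}(x \cap y) = \text{Prob}(x) \times \text{Prob}(y)$

    if x and y are not independent, then: $\text{Prob}(x \cap y) = \text{Prob}(x) \times \text{Prob}(y \mid x)$, where $\text{Prob}(y \mid x)$ is the probability of y given that x has occurred.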

  • Test for independence

    We can use these two rules to test whether events are independent. Does the distribution of observations across possible outcomes resemble the random distribution we would get if events were independent? I.e. if we assume independence and calculate the expected number of cases in each category, do these figures correspond fairly closely to the actual distribution of outcomes found in our data?

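  • Example 1: Is there a relationship between social class and education? We might test this by looking at categories in our data of WC, MC, University, no University. Suppose we have 300 observations distributed as follows:

                                  Working class   Middle class   Total
        Go to University                     18             84     102
        Do not go to University             162             36     198
        Total                               180            120     300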

  • To do the test for independence we need to compare expected with observed. How do we calculate $e_i$, the expected number of observations in category i (i.e. the number of cases expected in i assuming that the two categorical variables are independent, i.e. no contingency)? The formula for $e_i$ is the probability of an observation falling into category i multiplied simply by the total number of observations.

  • So, if UNIY or UNIN and WC or MC are independent (i.e. assuming H0) then: $\text{Prob}(\text{UNIY} \cap \text{WC}) = \text{Prob}(\text{UNIY}) \times \text{Prob}(\text{WC})$, so the expected numbers of cases for the four mutually exclusive categories are as follows:

                                  Working class         Middle class
        Go to University          P(UNIY) x P(WC) x n   P(UNIY) x P(MC) x n
        Do not go to University   P(UNIN) x P(WC) x n   P(UNIN) x P(MC) x n

  • But how do we work out Prob(UNIY) and Prob(WC), which are needed to calculate $\text{Prob}(\text{UNIY} \cap \text{WC}) = \text{Prob}(\text{UNIY}) \times \text{Prob}(\text{WC})$?

    Answer: we assume independence and so estimate them from our data by simply dividing the total number of cases in the given category by the total number of observations. E.g. Prob(UNIY) = Total no. cases UNIY / All observations = (18 + 84) / 300 = 0.34
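The expected counts under independence can be computed mechanically from the row and column totals. The following is a sketch only: the four observed counts (18, 84, 162, 36) are the ones used in the worked chi-square calculation in this lecture, and the dictionary layout and variable names are illustrative choices.

```python
# Observed counts, keyed by (education, class).
observed = {
    ("UNIY", "WC"): 18,  ("UNIY", "MC"): 84,
    ("UNIN", "WC"): 162, ("UNIN", "MC"): 36,
}
n = sum(observed.values())  # total number of observations

# Marginal totals for each education category and each class category.
edu_totals, class_totals = {}, {}
for (edu, cls), count in observed.items():
    edu_totals[edu] = edu_totals.get(edu, 0) + count
    class_totals[cls] = class_totals.get(cls, 0) + count

# Under H0 (independence): e_i = Prob(education) x Prob(class) x n.
expected = {
    (edu, cls): (edu_totals[edu] / n) * (class_totals[cls] / n) * n
    for (edu, cls) in observed
}

for cell, e in expected.items():
    print(cell, round(e, 1))
```

For the UNIY and WC cell this is 0.34 x 0.6 x 300 = 61.2, against an observed count of 18.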

  • Expected count in each category:

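  • So we have the actual count (i.e. from our data set) and the expected count (i.e. the numbers we'd expect if we assume class & education to be independent of each other):

        Category       Actual   Expected
        UNIY and WC        18       61.2
        UNIY and MC        84       40.8
        UNIN and WC       162      118.8
        UNIN and MC        36       79.2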

  • What does this table tell you?

  • It tells you that if class and education were indeed independent of each other (i.e. the outcome of one does not affect the chances of outcome of the other), then you'd expect a lot more working class people in the data to have gone to university than actually recorded (61 people, rather than 18). Conversely, you'd expect far fewer middle class people to have gone to university (about half the number actually recorded).

  • But remember, all this is based on a sample, not the entire population

    Q/ Is this discrepancy due to sampling variation alone or does it indicate that we must reject the assumption of independence?

  • 3. Chi-square test for independence (non-parametric, i.e. no presuppositions re distribution of variables; sample size not relevant)

    (1) H0: expected = actual, i.e. x & y are independent: Prob(x) is not affected by whether or not y occurs. H1: expected $\neq$ actual, i.e. there is some relationship: Prob(x) is affected by y occurring.

    (2) $\alpha = 0.05$. Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$, with $df = k - 1 - d = (r - 1)(c - 1)$, where:

    k = no. of categories
    $e_i$ = expected (given H0) no. of sample observations in the ith category
    $o_i$ = actual no. of sample observations in the ith category
    d = no. of parameters that have to be estimated from the sample data
    r = no. of rows in table
    c = no. of columns

  • Chi-square distribution changes shape for different df:

  • (3) Reject H0 iff P < $\alpha$

    (4) Calculate P: $P = \text{Prob}(\chi^2 > \chi^2_c)$. N.B. Chi-square tests are always an upper tail test.

    $\chi^2$ tables are usually set up like a t-table, with df down the side and the probabilities listed along the top row, with values of $\chi^2_c$ actually in the body of the table. So look up $\chi^2_c$ in the body of the table for the relevant df and then find the upper tail probability that heads that column.

    SPSS: CDF.CHISQ(chi2c, df) calculates $\text{Prob}(\chi^2 < \chi^2_c)$, where chi2c stands for the computed value $\chi^2_c$, so use the following syntax:

    COMPUTE chi_prob = 1 - CDF.CHISQ(chi2c, df).
    EXECUTE.
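For readers without SPSS, the same upper tail probability can be sketched in Python using only the standard library. This is an illustration, not a general-purpose routine: the function name `chi2_upper_tail` is invented here, and it covers only df = 1 and df = 2, where the upper tail has a simple closed form (erfc(sqrt(x/2)) and exp(-x/2) respectively).

```python
import math

def chi2_upper_tail(x, df):
    """Return Prob(chi-square > x) for df = 1 or df = 2 (closed forms only)."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if df == 1:
        # A chi-square with 1 df is the square of a standard normal variable.
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        # A chi-square with 2 df is exponential with mean 2.
        return math.exp(-x / 2.0)
    raise NotImplementedError("this sketch only handles df = 1 and df = 2")

# The familiar 5% critical value for df = 1 is about 3.84.
p = chi2_upper_tail(3.84, 1)
print(p < 0.0501 and p > 0.0499)  # True
```

The function plays the role of `1 - CDF.CHISQ(chi2c, df)` in the SPSS syntax: it returns the upper tail area directly, so it can be compared with $\alpha$ straight away.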

  • Do a chi-square test on the following table:

  • H0: expected = actual, i.e. class and Higher Education are independent

    H1: expected $\neq$ actual, i.e. there is some relationship between class and Higher Education

  • (2) State the formula & calc $\chi^2$:

    $\chi^2 = \frac{(18 - 61.2)^2}{61.2} + \frac{(84 - 40.8)^2}{40.8} + \frac{(162 - 118.8)^2}{118.8} + \frac{(36 - 79.2)^2}{79.2} = 30.49 + 45.74 + 15.71 + 23.56 = 115.51$

    $df = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1$

    Sig = $P(\chi^2 > 115.51) \approx 0$

  • (3) Reject H0 iff P < $\alpha$

    (4) Calculate P:

    COMPUTE chi_prob = 1 - CDF.CHISQ(115.51,1).
    EXECUTE.

    Sig = $P(\chi^2 > 115.51) \approx 0$

    Since P < 0.05, reject H0.
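The whole worked example can be reproduced in a few lines. This is a sketch under stated assumptions: it takes the observed and expected counts from the calculation above, and it uses the closed form erfc(sqrt(x/2)) for the upper tail, which is valid only because df = 1 here.

```python
import math

# Observed and expected counts, in the same cell order as the worked example.
observed = [18, 84, 162, 36]
expected = [61.2, 40.8, 118.8, 79.2]

# Each cell contributes (o_i - e_i)^2 / e_i to the statistic.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(terms)

# Upper tail probability for df = 1 only.
p_value = math.erfc(math.sqrt(chi2 / 2.0))

alpha = 0.05
print([round(t, 2) for t in terms])  # [30.49, 45.74, 15.71, 23.56]
print(round(chi2, 2))                # 115.51
print(p_value < alpha)               # True, so reject H0
```

The p-value is not literally zero, but it is far smaller than anything SPSS would display, which is why the output reads as 0.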

  • Caveat: As with the 2 proportions tests, the chi-square test is an approximate method that becomes more accurate as the counts in the cells of the table get larger (Moore, Basic Practice of Statistics, 2000, p. 485).

    Cell counts required for the chi-square test: You can safely use the chi-square test with critical values from the chi-square distribution when no more than 20% of the expected counts are less than 5 and all individual expected counts are 1 or greater. In particular, all four expected counts in a 2x2 table should be 5 or greater (Moore, Basic Practice of Statistics, 2000, p. 485).
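The cell-count rule quoted from Moore is easy to apply by eye to a 2x2 table, but a small check can help with larger tables. The function below is a hypothetical helper written for this illustration (its name and the list-of-rows input format are assumptions); it simply encodes the two conditions as stated.

```python
def chi_square_counts_ok(expected_rows):
    """Return True if the expected counts satisfy the quoted rule of thumb."""
    cells = [e for row in expected_rows for e in row]
    if not cells:
        raise ValueError("expected_rows must contain at least one count")

    # Special case stated in the rule: every cell of a 2x2 table must be >= 5.
    is_2x2 = len(expected_rows) == 2 and all(len(row) == 2 for row in expected_rows)
    if is_2x2:
        return min(cells) >= 5

    # General case: at most 20% of cells below 5, and no cell below 1.
    share_below_5 = sum(1 for e in cells if e < 5) / len(cells)
    return share_below_5 <= 0.20 and min(cells) >= 1

# Expected counts from the class and education example.
print(chi_square_counts_ok([[61.2, 40.8], [118.8, 79.2]]))  # True
```

Note that the check is applied to the expected counts, not the observed ones.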

  • Example 2: Is there a relationship between whether a borrower is a first time buyer and whether they live in Durham or Cumberland? The only real problem is how we calculate $e_i$, the expected number of observations in category i (i.e. the number of cases expected in i assuming that the variables are independent). The formula for $e_i$ is the probability of an observation falling into category i multiplied by the total number of observations.

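  • Probability theory states that: if x and y are independent, then the probability of events x and y simultaneously occurring is simply equal to the product of the probabilities of the two events occurring: $\text{Prob}(x \cap y) = \text{Prob}(x) \times \text{Prob}(y)$

    if x and y are not independent, then: $\text{Prob}(x \cap y) = \text{Prob}(x) \times \text{Prob}(y \mid x)$, where $\text{Prob}(y \mid x)$ is the probability of y given that x has occurred.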

  • So, if FTBY or N and CountyD or C are independent (i.e. assuming H0) then: $\text{Prob}(\text{FTBY} \cap \text{CountyD}) = \text{Prob}(\text{FTBY}) \times \text{Prob}(\text{CountyD})$, so the expected numbers of cases for the four mutually exclusive categories are as follows:

                CountyC                          CountyD
        FTBN    Prob(FTBN) x Prob(CountyC) x n   Prob(FTBN) x Prob(CountyD) x n
        FTBY    Prob(FTBY) x Prob(CountyC) x n   Prob(FTBY) x Prob(CountyD) x n

  • Prob(FTBN) = Total no. cases FTBN / All observations

  • This gives us the expected count. To obtain this table in SPSS, go to Analyze, Descriptive Statistics, Crosstabs, Cells, and choose expected count rather than observed.

  • What does this table tell you? Does it suggest that the probability of being an FTB is independent of location? Or does it suggest that the two factors are contingent on each other in some way? Can it tell you anything about the direction of causation? What about sampling variation?

  • Summary o