COMP6053 lecture: Multivariate ANOVA
Multiple dependent variables?
• All the analyses we've done so far have had
a single dependent or outcome variable.
• E.g., regression, ANOVA.
• These are known as Univariate Tests.
• What if you are interested in more than one
outcome variable?
• Multivariate Test
Multiple analyses
• You could do multiple analyses in turn,
focusing on one outcome measure at a time.
A simple solution, used often.
• But: we risk inflating the type-I error rate,
because each separate test brings its own
chance of a false positive.
• And we might miss effects that only show up
in the relationships between our dependent
variables.
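To see how quickly the type-I error rate grows, here is a minimal Python sketch. It assumes the separate tests are independent, which is a simplification (in practice the DVs are often correlated); the function name is made up for illustration.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

# One test per outcome measure, each at the usual 0.05 level.
for k in (1, 2, 4):
    print(k, "tests:", round(familywise_error_rate(0.05, k), 4))
```

With four outcome measures tested one at a time, the chance of at least one spurious "significant" result is already well above the nominal 5%.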
Multivariate ANOVA
• Essentially this is ANOVA applied to a vector
(list) of dependent variables (DVs), rather
than just one.
• The logic is very similar: instead of different
means across groups, we look for different
locations in dependent-variable-space
across groups.
Multivariate ANOVA
• The null hypothesis is that the different
groups all have a common centroid in our
DV-vector-space.
• The alternative hypothesis is that at least
one group has a distinct centroid in DV-
space.
An example scenario
• Let's say we're trying to compare the
effectiveness of two different mathematics
textbooks.
• We have a large group of people (N=100)
taking a mathematics course, and we
randomly allocate them to two textbook
groups: A and B.
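One way to picture the random allocation is the short Python sketch below. The participant IDs, the fixed seed, and the equal 50/50 split are assumptions made for illustration; the lecture only says that allocation was random.

```python
import random

def allocate(n_people, groups=("A", "B"), seed=0):
    """Shuffle participant IDs and deal them out to the groups in turn."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    ids = list(range(n_people))
    rng.shuffle(ids)
    # The person at shuffled position i goes to group number i mod len(groups).
    return {pid: groups[i % len(groups)] for i, pid in enumerate(ids)}

allocation = allocate(100)
counts = {g: list(allocation.values()).count(g) for g in ("A", "B")}
print(counts)
```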
Our dependent variables
• We care about the average exam
performances of the two groups: we want to
know if one textbook is better than the other
at helping people to learn.
• DV1 = Performance.
• If that was all we wanted to know, this would
be an ANOVA of performance on group.
• But we also want to compare the textbooks
on how much people enjoyed using them.
• DV2 = Enjoyment
Our dependent variables
• So our dependent variable is not a scalar
quantity but a vector: each person is a point
in the space ( Performance, Enjoyment ).
• Performance ranges from 0 to 100;
Enjoyment ranges from 0 to 10.
• We want to know whether those using
textbook A and those using textbook B end
up in the same or different parts of this
space.
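"Where a group ends up in this space" can be summarised by its centroid, the mean of each DV taken separately. A minimal Python sketch, using made-up scores rather than the lecture's data:

```python
from statistics import fmean

def centroid(points):
    """Mean of each coordinate: the group's location in DV-space."""
    return tuple(fmean(coord) for coord in zip(*points))

# Hypothetical (Performance, Enjoyment) scores for three people per group.
group_a = [(62.0, 7.0), (55.0, 6.0), (70.0, 8.0)]
group_b = [(60.0, 4.0), (58.0, 5.0), (65.0, 3.0)]

print("A:", centroid(group_a))
print("B:", centroid(group_b))
```

In this toy data the two centroids are close on performance but well apart on enjoyment.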
Scenario 1: No relationship
• Suppose that there is no relationship
between performance and enjoyment, and also
that group membership has no effect on
either score.
Scenario 1: No relationship
• Group A shown in red, group B shown in blue.
• Note the near-complete overlap between the
two groups in DV-space.
Scenario 1: DVs by group
Scenario 1: MANOVA
• Step 1 is to link the performance and
enjoyment variables in a vector format.
Y = cbind(d1$Performance,d1$Enjoyment)
• The model-fitting command is simple: we fit
the DV-vector on the group variable.
m1 = manova(Y~d1$Group)
summary(m1)
Scenario 1: MANOVA output
• The output from MANOVA is analogous to
an F-test.
• In general there is no single exact test: four
different statistics are available (Pillai,
Wilks, Hotelling-Lawley, Roy). Wilks' lambda
is a reasonably robust option.
summary(m1,test="Wilks")
          Df   Wilks approx F num Df den Df Pr(>F)
d1$Group   1 0.96252   1.8886      2     97 0.1568
Residuals 98
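Wilks' lambda is the ratio of two determinants: the within-group sums-of-squares-and-cross-products (SSCP) matrix W over the total SSCP matrix T. Values near 1 mean the groups overlap; values near 0 mean they are well separated. The Python sketch below computes it by hand for two DVs. It uses made-up toy scores, and it is an illustration of the definition, not a description of how R's manova works internally.

```python
def mean_point(points):
    """Centroid of a list of (dv1, dv2) pairs."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def sscp(points):
    """Sums of squares and cross-products about the centroid: (Sxx, Sxy, Syy)."""
    mx, my = mean_point(points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return (sxx, sxy, syy)

def det2(m):
    """Determinant of the symmetric 2x2 matrix [[Sxx, Sxy], [Sxy, Syy]]."""
    sxx, sxy, syy = m
    return sxx * syy - sxy * sxy

def wilks_lambda(groups):
    """det(W) / det(T): within-group SSCP over total SSCP."""
    within = tuple(sum(parts) for parts in zip(*(sscp(g) for g in groups)))
    total = sscp([p for g in groups for p in g])
    return det2(within) / det2(total)

# Made-up toy scores, not the lecture's data.
group_a = [(0, 0), (2, 1), (1, 2)]
group_b = [(3, 0), (5, 1), (4, 2)]  # group_a shifted by 3 on the first DV

print(wilks_lambda([group_a, group_a]))  # identical centroids
print(wilks_lambda([group_a, group_b]))  # separated centroids
```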
Scenario 1: MANOVA output
• P-value is not significant in this case, as
expected.
• We can't reject the null hypothesis that
groups A and B have the same centroid in
the Performance-Enjoyment space.
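With only two groups, the conversion from Wilks' lambda to F is exact: F = ((1 - lambda) / lambda) x ((N - p - 1) / p), with p and N - p - 1 degrees of freedom, where p is the number of DVs. The Python sketch below (function name made up for illustration) checks this against the output above, taking N = 100 and p = 2 from the scenario.

```python
def wilks_to_f(wilks, n_total, n_dvs):
    """F statistic for the two-group case, with (n_dvs, n_total - n_dvs - 1) df."""
    df_num = n_dvs
    df_den = n_total - n_dvs - 1
    f_value = (1.0 - wilks) / wilks * df_den / df_num
    return f_value, df_num, df_den

# Wilks' lambda copied from the Scenario 1 output.
f_value, df_num, df_den = wilks_to_f(0.96252, n_total=100, n_dvs=2)
print(round(f_value, 4), df_num, df_den)
```

The result agrees with the approx F of 1.8886 on 2 and 97 degrees of freedom reported by R. The formula applies only to the two-group case; with more groups R falls back on an approximation.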
Scenario 2: Enjoyment
advantage for group A
• Suppose that people enjoy textbook A more
than B, but that there's no difference in
exam performance.
• Note the differing
centroid positions.
Scenario 2: DVs by group
Scenario 2: MANOVA output
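# Rebuild Y from this scenario's data (column names assumed to match d1).
Y = cbind(d2$Performance,d2$Enjoyment)
m2 = manova(Y~d2$Group)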
summary(m2,test="Wilks")
          Df   Wilks approx F num Df den Df    Pr(>F)
d2$Group   1 0.50786   46.999      2     97 5.351e-15 ***
Residuals 98
summary.aov(m2)
 Response 1 :
          Df Sum Sq Mean Sq F value Pr(>F)
d2$Group   1   99.8  99.768  1.0703 0.3034
Residuals 98 9135.3  93.218
 Response 2 :
          Df Sum Sq Mean Sq F value    Pr(>F)
d2$Group   1 79.162  79.162  93.018 7.152e-16 ***
Residuals 98 83.402   0.851
Scenario 2: MANOVA output
• As expected, we can reject the null
hypothesis that both groups share the same
centroid in DV-space.
• From MANOVA we know they're different,
but not exactly how they're different.
• Univariate analyses confirm that there's a
significant difference on enjoyment
(Response 2) but not performance
(Response 1).
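Each univariate F value is just the mean square for the group effect divided by the residual mean square. A minimal Python check against the Scenario 2 figures (the function name is made up; the group sum of squares for performance is taken from the Mean Sq column, since Sum Sq is printed rounded):

```python
def f_ratio(ss_effect, df_effect, ss_resid, df_resid):
    """F = mean square for the effect / mean square for the residuals."""
    return (ss_effect / df_effect) / (ss_resid / df_resid)

# Values copied from the Scenario 2 summary.aov output.
f_performance = f_ratio(99.768, 1, 9135.3, 98)
f_enjoyment = f_ratio(79.162, 1, 83.402, 98)
print(round(f_performance, 4), round(f_enjoyment, 3))
```

Both values match the F column printed by R to the precision shown.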
Scenario 3: Performance and
enjoyment are related
• Suppose that performance and enjoyment are
correlated across our subjects (r = 0.69),
but not in a group-specific way.
Scenario 3: Performance and
enjoyment are related
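• The link between performance and
enjoyment doesn't change the fact that the
two groups overlap in DV-space.
• So the MANOVA results are not significant.
# Rebuild Y from this scenario's data (column names assumed to match d1).
Y = cbind(d3$Performance,d3$Enjoyment)
m3 = manova(Y~d3$Group)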
summary(m3,test="Wilks")
          Df   Wilks approx F num Df den Df Pr(>F)
d3$Group   1 0.95923   2.0613      2     97 0.1328
Residuals 98
Scenario 4: Performance and
enjoyment both higher in group A
• Performance and enjoyment are again
correlated (r = 0.82), but this time those
who used textbook A tended to perform better
on the exam and to enjoy themselves more.
Scenario 4: Performance and
enjoyment both higher in group A
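# Rebuild Y from this scenario's data (column names assumed to match d1).
Y = cbind(d4$Performance,d4$Enjoyment)
m4 = manova(Y~d4$Group)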
summary(m4,test="Wilks")
          Df   Wilks approx F num Df den Df    Pr(>F)
d4$Group   1 0.42272   66.234      2     97 < 2.2e-16 ***
Residuals 98
summary.aov(m4)
 Response 1 :
          Df Sum Sq Mean Sq F value    Pr(>F)
d4$Group   1  11810   11810  132.65 < 2.2e-16 ***
Residuals 98   8725      89
 Response 2 :
          Df  Sum Sq Mean Sq F value    Pr(>F)
d4$Group   1  90.949  90.949  54.483 5.168e-11 ***
Residuals 98 163.591   1.669
Scenario 4: Performance and
enjoyment both higher in group A
• Unsurprisingly, the overall MANOVA model
is highly significant (because the two groups
occupy different parts of DV-space).
• The univariate analyses confirm significant
between-group differences on both
performance and enjoyment.
A more complex scenario?
• The examples so far have been very simple:
two continuous DVs, and a single binary
group variable.
• In these cases the results of the analysis are
unlikely to surprise us: the labelled
scatterplot of enjoyment on performance
tells us all we need to know.
A more complex scenario:
more predictor variables
• Suppose that instead of 2 textbooks, we look
at 4. We now have groups A, B, C, and D.
• We also expand our study to include another
factor: whether or not the person had access
to a web-based demo that accompanied
each textbook ("demo").
• We're also interested in the potential
interaction between these two variables.
Maybe one text is especially useful with its
demo but not without.
A more complex scenario:
more outcome variables
• Previously we had two dependent variables:
exam performance and enjoyment of the
textbook.
• We expand the study to look at attendance
in class (perhaps some textbooks inspire
people to come to class more often) and the
long-term knowledge imparted by the book
(e.g., test scores one year later).
Complex scenario:
Descriptive statistics
• The pairs command in R can show us all four
DVs and their relationships with the group
variable.
• Effect of demo not shown.
Complex scenario: MANOVA
• In this case the point of a MANOVA
becomes clearer.
• It's not easy to see whether the four groups
(and the with-demo and without-demo
variants of each) lie far apart in the four-
dimensional DV-space of performance,
enjoyment, attendance, and long-term
knowledge.
Complex scenario: MANOVA
Y = cbind(d5$Performance,d5$Enjoyment,d5$Attendance,d5$Longterm)
m5 = manova(Y~d5$Group*d5$Demo)
summary(m5,test="Wilks")
                 Df   Wilks approx F num Df den Df Pr(>F)
d5$Group          3 0.22864  14.6702     12 235.76 <2e-16 ***
d5$Demo           1 0.97937   0.4686      4  89.00 0.7586
d5$Group:d5$Demo  3 0.92475   0.5896     12 235.76 0.8497
Residuals        92
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• Groups are significantly separated in DV-
space, but the group-demo interaction and
the demo variable are not significant.
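The Df column follows directly from the design: 4 textbook groups crossed with 2 demo conditions gives 8 cells. A minimal Python sketch of the bookkeeping, assuming N = 100 (inferred from the degrees of freedom in the output, not stated for this scenario) and using a made-up function name:

```python
def factorial_df(n_total, levels_a, levels_b):
    """Degrees of freedom for a two-way design with interaction."""
    df_a = levels_a - 1
    df_b = levels_b - 1
    return {
        "group": df_a,
        "demo": df_b,
        "group:demo": df_a * df_b,
        # One degree of freedom is used up per cell mean.
        "residuals": n_total - levels_a * levels_b,
    }

print(factorial_df(n_total=100, levels_a=4, levels_b=2))
```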
Complex scenario: MANOVA
summary.aov(m5)
 Response 1 :
                 Df Sum Sq Mean Sq F value    Pr(>F)
d5$Group          3 3360.4 1120.14 13.2840 2.806e-07 ***
d5$Demo           1   67.9   67.89  0.8051    0.3719
d5$Group:d5$Demo  3  139.0   46.32  0.5493    0.6499
Residuals        92 7757.7   84.32
 Response 2 :
                 Df  Sum Sq Mean Sq F value    Pr(>F)
d5$Group          3  86.387 28.7957 18.9683 1.165e-09 ***
d5$Demo           1   0.915  0.9147  0.6025    0.4396
d5$Group:d5$Demo  3   1.449  0.4831  0.3182    0.8121
Residuals        92 139.665  1.5181
(Response 1 = Performance, Response 2 = Enjoyment)
Complex scenario: MANOVA
 Response 3 :
                 Df  Sum Sq Mean Sq F value    Pr(>F)
d5$Group          3 1610.15  536.72 18.2334 2.286e-09 ***
d5$Demo           1   17.96   17.96  0.6100    0.4368
d5$Group:d5$Demo  3   11.35    3.78  0.1285    0.9430
Residuals        92 2708.11   29.44
 Response 4 :
                 Df  Sum Sq Mean Sq F value    Pr(>F)
d5$Group          3  3200.8 1066.94  8.8547 3.237e-05 ***
d5$Demo           1    11.3   11.29  0.0937    0.7602
d5$Group:d5$Demo  3   518.9  172.98  1.4356    0.2375
Residuals        92 11085.4  120.49
• The group-by-demo interaction and the demo
main effect can be dropped: neither is
significant in the multivariate test or in
any of the univariate analyses.
(Response 3 = Attendance, Response 4 = Long-term knowledge)
Complex scenario: reduced model
m5b = manova(Y~d5$Group)
summary(m5b,test="Wilks")
          Df   Wilks approx F num Df den Df    Pr(>F)
d5$Group   3 0.23335   15.053     12 246.35 < 2.2e-16 ***
Residuals 96
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• The overall analysis by group is highly
significant.
Complex scenario: reduced model
summary.aov(m5b)
 Response 1 :
         Df Sum Sq Mean Sq F value    Pr(>F)
d5$Group  3 3360.4 1120.14  13.502 2.017e-07 ***
 Response 2 :
         Df Sum Sq Mean Sq F value    Pr(>F)
d5$Group  3 86.387 28.7957  19.463 6.132e-10 ***
 Response 3 :
         Df Sum Sq Mean Sq F value    Pr(>F)
d5$Group  3 1610.2  536.72  18.823 1.108e-09 ***
 Response 4 :
         Df Sum Sq Mean Sq F value    Pr(>F)
d5$Group  3 3200.8  1066.9  8.8179 3.201e-05 ***
Complex scenario: reduced model
• The univariate comparisons indicate that all
four dependent variables differ significantly
across the four groups.
• If we wanted to go further, we could perform
multiple comparisons for each DV by group,
using the TukeyHSD method described in
the lecture on ANOVA.
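To get a feel for how many follow-up comparisons that involves, the Python sketch below simply enumerates the pairs of groups. It only counts the comparisons; it does not perform the Tukey adjustment itself, which TukeyHSD handles in R.

```python
from itertools import combinations

groups = ["A", "B", "C", "D"]
n_dvs = 4  # performance, enjoyment, attendance, long-term knowledge

# Every unordered pair of textbooks is one comparison per DV.
pairs = list(combinations(groups, 2))
print(pairs)
print(len(pairs), "pairs per DV,", len(pairs) * n_dvs, "comparisons in total")
```

With this many comparisons, some correction for multiple testing is clearly needed, which is why an adjusted procedure is used rather than a series of plain t-tests.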
Additional material
• Here is the Python program for generating
the various data sets; here are the five data
files produced: 1 2 3 4 5.
• Here is the R script that runs the MANOVA
analyses and produces the figures.