Inferential Statistics & Correlational Research Days 5 & 6.


For Monday

• Submit the first draft of your literature review (Chapter 2).
  – It should read like one document rather than a series of abstracts.
  – Use transition sentences.
  – Group studies into a logical order.
  – You should include at least ten research-based articles/dissertations.
  – Include a Reference list in perfect APA format.

Statistics Review

Measurement Scales (Levels of Measurement)

• [NOIR]
• Nominal
  – Categorical, frequency counts (gender, color, yes/no, etc.)
• Ordinal
  – Rank-order (contest ratings, Likert data??)
• Interval
  – Continuous scale with consistent distances between points. No meaningful absolute zero (test scores, singing range, Fahrenheit or Celsius temperature, knowledge).
• Ratio
  – Continuous scale with consistent distances between points and an absolute zero (decibels, money, age)
• Examples: N – choir robes; O – Div. 1 at contest; I – 96/100 score; R – Kelvin scale for temperature (has absolute 0)

Normal Curve/Distribution (±1 SD = 68.26%, ±2 SD = 95.44%, ±3 SD = 99.73%)

Altogether, this describes the shape of a distribution.

More on Distributions

• Normal Curve (bell curve)
  – Most scores cluster at the middle, with fewer scores falling at the extreme highs and lows
• Kurtosis - When the distribution is…
  – …PEAKED = positive kurtosis = leptokurtic = greater than +2 or +3, depending on who you ask
  – …FLATTER (smaller peak) THAN A NORMAL CURVE = negative kurtosis = platykurtic = less than -2 or -3
• Skewness - When the scores tend to bunch up…
  – …on the HIGH END = negative skew = less than -1
  – …on the LOW END = positive skew = greater than +1
• Bi-Modal
  – When there are two humps in the curve (more than one mode)

Standard Error of the Mean & Confidence Limits

• SEM is the standard deviation of an infinite number of sample means
  – Example: If I tested 30 students on music theory
    • Test 0-100
    • Mean 75; SD 10
• The Standard Error of the Mean (SEM) estimates how much the means of same-size samples taken from the population would vary
• SEM = SD / √N
  – Calculate for the example above.
• 95% Confidence Interval
  – 95% of the area under a normal curve lies within roughly 1.96 SD units above or below the Mean (rounded to +/-2)
• 95% CI = M ± (SEM × 1.96)
• 99% CI = M ± (SEM × 2.58)
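The music theory example (N = 30, Mean 75, SD 10) can be worked through in a few lines. This is a minimal sketch that uses the normal-curve multiplier of 1.96 given in the slide; the function names are illustrative only, and with small samples many texts substitute a t critical value for 1.96.

```python
import math

def sem(sd, n):
    """Standard error of the mean: SD divided by the square root of N."""
    return sd / math.sqrt(n)

def confidence_interval(mean, sd, n, z=1.96):
    """Return (lower limit, upper limit) as M -/+ (SEM x z)."""
    margin = z * sem(sd, n)
    return mean - margin, mean + margin

# Values from the slide: 30 students, Mean 75, SD 10
standard_error = sem(10, 30)
low, high = confidence_interval(75, 10, 30)

print(f"SEM = {standard_error:.2f}")         # SEM = 1.83
print(f"95% CI = {low:.2f} to {high:.2f}")   # 95% CI = 71.42 to 78.58
```

The two printed numbers are the confidence limits; the distance between them is the confidence interval.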

Confidence Limits/Interval
http://graphpad.com/quickcalcs/CImean1.cfm

• Attempts to define the range of the true population mean based on the Standard Error estimate.
• Confidence level
  – 95% chance vs. 99% chance
• Confidence Limits
  – 2 numbers that define the range
• Confidence Interval
  – The range b/w confidence limits

Confidence

• Calculate at a 95% confidence level
  – ACT Explore scores
    • Confidence Limits
    • Interval
• How closely do your scores represent the true population?

Inferential Statistics

• Statistic = number describing a variable
• Descriptive statistics = describe the data actually collected (the sample)
• Inferential statistics = used when making inferences about a population based on the sample

• Stat. used based on type of data and other assumptions

• Stats used to compare and find differences

Two Types of Inferential Stats

• Parametric
  – Interval & ratio data
  – Normal or near normal curve (distribution)
  – Equal variances (Levene’s test)
  – Sample reflects pop. (randomized)
  – Most powerful
• Non-Parametric
  – Nominal & ordinal data
  – Not normal distribution (skewness or kurtosis)
  – Unequal variances
  – Less powerful
  – More conservative

Statistical Significance

• Probability of getting a result at least this extreme by chance alone if the treatment had no effect
  – Expressed as p
  – p < .1 = less than 10% (1/10) probability…
  – p < .05 = less than 5% (1/20) probability…
  – p < .01 = less than 1% (1/100) probability…
  – p < .001 = less than .1% (1/1000) probability…
• Computer software reports actual p
• Alpha level = probability level to be accepted as significant, set b/f study begins
  – .05 standard for significance
  – Only reason to set higher bar is for multiple comparisons
    • Bonferroni Correction

• Statistical significance does not equal practical significance

Review-Hypothesis Testing

• Research Hypothesis (H)– The predicted outcome of the research study

• The chunking method will be more effective than the whole song method in teaching a song by rote

• Null Hypothesis (Ho)– An outcome in which no difference/relationship exists

• There will be no difference in effectiveness between the chunking and whole song methods (value free)

Statistical Power

• Likelihood that a particular test of statistical significance will lead to the rejection of a false null hypothesis
  – Parametric tests are more powerful than nonparametric (parametric tests are more likely to discover differences b/w groups; the choice depends on type of data)

• The larger the sample size, the more likely you will be to find statistically significant effects.

• The less stringent your criteria (e.g., .05 vs. .01 vs. .001), the easier it is to find statistical significance

Review-Type I and Type II Error

• Type I Error is erroneously claiming statistical significance, or rejecting the null hypothesis when in fact it’s true (claiming success when the experiment failed to produce results)
  – Possible w/ incorrect statistical test
  – Or when conducting multiple tests on same data (e.g., comparing 2 groups on multiple variables, such as achievement test parts) [solution: lower alpha level]
• Type II Error is when a researcher fails to reject the null hypothesis when it is in fact false
  – The smaller the sample size, the more difficult it is to detect statistical significance
  – In this case, a researcher could be missing an important finding because of study design

Statistical Tests - Parametric

http://faculty.vassar.edu/lowry/VassarStats.html


Parametric Assumptions

• Interval Data
• Normality - Scores are normally distributed in each group
• Homogeneity of Variance - The amount of variability in scores is similar between each group (Levene’s test)

One- vs. Two-Tailed Tests

• If a hypothesis is directional in nature it is one-tailed
  – The chunking method will be more effective than the whole song method
• If a hypothesis is not directional in nature it is two-tailed
  – There will be a difference in effectiveness between the chunking method and the whole song method
• Two-tailed tests are most commonly used since specific hypotheses are rare in music education research.
• If study is designed knowing that results can only go one direction (e.g., beginning violin), a one tail test is OK. If treatment can only lead to positive results (improvement) use a one tail test. If treatment could result in positive or negative results, use a two tail test.

• One Tailed test more powerful. If your experiment led to improvement but a two tail test only comes close to significance, try a one tail test. (specify which you used in your study)

• GO TO: http://vassarstats.net/


Independent Samples t-test [see data set]

• Used to determine whether differences between two independent group means are statistically significant
• n ≤ 30 for each group, though many researchers have used the t test with larger groups.
• Groups do not have to be equal in size. Only concerned with overall group differences w/o considering pairs
  – [A robust statistical technique is one that performs well even if its assumptions are somewhat violated by the true model from which the data were generated. Unequal variances = alternative t test or, better, Mann-Whitney U]
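As a rough illustration of what the independent samples t-test computes, here is a minimal sketch of the classic pooled-variance version, which assumes the two groups have similar variances. The scores are made up for illustration and are not the class data set; the p value would still come from a t table or a tool such as VassarStats.

```python
import math
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Weighted average of the two sample variances
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err, df

# Hypothetical achievement scores; note the groups are not the same size
chunking = [82, 75, 90, 68, 77, 85]
whole_song = [70, 65, 80, 72, 60]

t_stat, df = independent_t(chunking, whole_song)
print(f"t({df}) = {t_stat:.2f}")
```

A positive t here means the first group had the higher mean; swapping the groups only flips the sign.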

Correlated (paired, dependent) Samples t-test [see data set]

• Used to determine whether differences between two means taken from the same group, or from two groups with matched pairs, are statistically significant
  – e.g., pre-test achievement scores for the whole song group vs. post-test achievement scores for the whole song group
• Group sizes must be equal (paired)
• N ≤ 30 for each group
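The paired version works on the difference between each person's two scores rather than on the two group means separately. The sketch below uses invented pre-test and post-test scores, and the function name is only illustrative.

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """t statistic and degrees of freedom for paired (correlated) samples."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same number of scores")
    diffs = [a - b for b, a in zip(before, after)]
    # Mean difference divided by the standard error of the differences
    t = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Hypothetical scores for the same five students
pretest = [10, 12, 9, 14, 11]
posttest = [15, 14, 13, 18, 12]

t_stat, df = paired_t(pretest, posttest)
print(f"t({df}) = {t_stat:.2f}")   # t(4) = 4.35
```

The length check reflects the requirement that every score in one set has a partner in the other.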

ANOVA (use ACT Explore test data from Day 4)

• Analyze means of 2+ groups
• Homogeneity of variance
• Independent or correlated (paired) groups
• More rigorous than t-test (b/w group & w/i group variance). Often used today instead of t test.
• F statistic
• One-Way = 1 independent variable
• Two-Way/Three-Way = 2-3 independent variables (one active & one or two attribute)
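To show how the F statistic combines between-group and within-group variance, here is a minimal sketch of a one-way ANOVA for independent groups. The three score lists are invented for illustration and are not the ACT Explore data.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df between, df within) for a list of independent groups."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Between-group variation: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: how far each score sits from its own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical reading scores for three years
scores_2007 = [14, 16, 15, 17]
scores_2008 = [15, 18, 17, 18]
scores_2009 = [19, 20, 18, 21]

f_stat, df_b, df_w = one_way_anova_f([scores_2007, scores_2008, scores_2009])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")   # F(2, 9) = 9.19
```

A large F means the group means differ by more than the spread inside the groups would suggest.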

One-Way ANOVA

• Calculate a One-Way ANOVA for data-set
  – Difference b/w subject tests for 2007 Non-band students? (correlated) Implications?
  – Difference b/w reading scores for non-band students 2007/2008/2009
• Post Hoc tests
  – Used to find differences b/w groups using one test. You could compare all pairs w/ individual t tests or ANOVA, but that leads to problems w/ multiple comparisons on same data
    • Bonferroni correction – reduce alpha to compensate for multiple comparisons (.05 / number of comparisons)
  – Tukey – Equal sample sizes (though can be used for unequal sample sizes as well) [HSD = honest significant difference]
  – Scheffé – Unequal sample sizes (though can be used for equal sample sizes as well)
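The Bonferroni arithmetic is simple enough to sketch directly. The example assumes the three years of reading scores are compared pairwise; the helper name is illustrative.

```python
from itertools import combinations

def bonferroni_alpha(alpha, n_comparisons):
    """Divide the overall alpha by the number of comparisons made on the same data."""
    return alpha / n_comparisons

years = ["2007", "2008", "2009"]
pairs = list(combinations(years, 2))   # every pair of years that could be compared
adjusted = bonferroni_alpha(0.05, len(pairs))

print(len(pairs), "comparisons")             # 3 comparisons
print(f"adjusted alpha = {adjusted:.4f}")    # adjusted alpha = 0.0167
```

Each individual comparison would then need p below the adjusted alpha to count as significant.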

Two-Way ANOVA (2×2) [see data set day 5]

                      Method
Achievement Level     CAI      Traditional
High                  Test     Test
Low                   Test     Test

Interpreting Results of 2x2 ANOVA

• (columns) CAI was more effective than Traditional methods for both high and low achieving students

• (rows) High Achieving students scored significantly higher than Low achieving students, regardless of teaching method

• There was no significant interaction between rows & columns
  – If there were a significant interaction, we would need to do a post hoc Tukey or Scheffé test to determine where the differences lie.

ANCOVA – Analysis of Covariance [data set day 5]

• Statistical control for unequal groups

• Adjusts posttest means based on pretest means.

• [example – see data set] http://faculty.vassar.edu/lowry/VassarStats.html

• [The homogeneity of regression assumption is met if within each of the groups there is a linear correlation between the dependent variable and the covariate, and the correlations are similar b/w groups]


Effect Size (Cohen’s d)
http://www.uccs.edu/~lbecker/

• d = (Mean of Experimental group – Mean of Control group) / average SD
• The average percentile standing of the average treated (or experimental) participant relative to the average untreated (or control) participant.
• Effect sizes can also be interpreted in terms of the percent of nonoverlap of the treated group's scores with those of the untreated group.
• Use table to find where someone ranked in the 50th percentile in the experimental group would be in the control group
• Good for showing practical significance
  – When test is non-significant
  – When both groups got significantly better (really effective vs. really really effective!)
• Calculate effect size:
  – Pretest: M=10.8; SD=5.7
  – Posttest: M=24.6; SD=4.3
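The pretest/posttest exercise can be checked with a short sketch. It follows the slide's formula, which divides by the simple average of the two SDs; other texts divide by a pooled SD instead, so treat the choice of denominator as an assumption of this example.

```python
def cohens_d(mean_1, sd_1, mean_2, sd_2):
    """Difference between two means divided by the average of their SDs."""
    average_sd = (sd_1 + sd_2) / 2
    return (mean_2 - mean_1) / average_sd

# Values from the slide
d = cohens_d(mean_1=10.8, sd_1=5.7, mean_2=24.6, sd_2=4.3)
print(f"d = {d:.2f}")   # d = 2.76
```

By Cohen's standards (0.2 small, 0.5 medium, 0.8 large), a d of this size is far beyond "large".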


Effect Size (d) Interpretation

Cohen's Standard   Effect Size   Percentile Standing   Percent of Nonoverlap
                   2.0           97.7                  81.1%
                   1.9           97.1                  79.4%
                   1.8           96.4                  77.4%
                   1.7           95.5                  75.4%
                   1.6           94.5                  73.1%
                   1.5           93.3                  70.7%
                   1.4           91.9                  68.1%
                   1.3           90                    65.3%
                   1.2           88                    62.2%
                   1.1           86                    58.9%
                   1.0           84                    55.4%
                   0.9           82                    51.6%
LARGE              0.8           79                    47.4%
                   0.7           76                    43.0%
                   0.6           73                    38.2%
MEDIUM             0.5           69                    33.0%
                   0.4           66                    27.4%
                   0.3           62                    21.3%
SMALL              0.2           58                    14.7%
                   0.1           54                    7.7%
                   0.0           50                    0%
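The two right-hand columns of the table can be reproduced from d, assuming both groups are normally distributed with equal spread. In this sketch the percentile standing is the normal-curve area below d, and the percent of nonoverlap uses Cohen's U1 formula; the function names are illustrative.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def percentile_standing(d):
    """Percentile of the control group at which the average treated participant falls."""
    return 100 * normal_cdf(d)

def percent_nonoverlap(d):
    """Cohen's U1: percent of the two distributions that does not overlap."""
    p = normal_cdf(abs(d) / 2)
    return 100 * (2 * p - 1) / p

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: percentile {percentile_standing(d):.0f}, "
          f"nonoverlap {percent_nonoverlap(d):.1f}%")
# d = 0.2: percentile 58, nonoverlap 14.7%
# d = 0.5: percentile 69, nonoverlap 33.0%
# d = 0.8: percentile 79, nonoverlap 47.4%
```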

Correlational Research

Correlational Research Basics

• Relationships among two or more variables are investigated

• The researcher does not manipulate the variables

• Direction (positive [+] or negative [-]) and degree (how strong) in which two or more variables are related

Uses of Correlational Research

• Clarifying and understanding important phenomena (relationship b/w variables—e.g., height and voice range in MS boys)

• Explaining human behaviors (class periods per week correlated to practice time)

• Predicting likely outcomes (one test predicts another)

Uses of Correlational Research (cont.)

• Particularly beneficial when experimental studies are difficult or impossible to design

• Allows for examinations of relationships among variables measured in different units (decibels, pitch; retention numbers and test scores, etc.)

• DOES NOT indicate causation
  – Reciprocal effect (a change in weight may affect body image, but body image does not cause a change in weight)
  – Third (other) variable actually responsible for difference (tendency of smart kids to persist in music is the cause of higher SATs among HS music students rather than music study itself)

Interpreting Correlations – r

• Correlation coefficient (Pearson, Spearman)
• Can range from -1.00 to +1.00
  – Direction
    • Positive
      – As X increases, so does Y and vice versa
    • Negative
      – As X decreases, Y increases and vice versa
  – Degree or Strength (rough indicators)
    • < ±.30: weak
    • < ±.65: moderate
    • > ±.65: strong
    • > ±.85: very strong
  – r² (% of shared variance)
    • Overlap b/w two variables
    • Percent of the variation in one variable that is related to the variation in the other.
    • Example: Correlation b/w musical achievement and minutes of instruction is r = .86. What is the % of shared variance (r²)?
  – Easy to obtain significant results w/ correlation. Strength is most important
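The r = .86 example can be worked as follows. The strength labels follow the rough cutoffs listed above; how a value falling exactly on a cutoff is labeled is an assumption of this sketch.

```python
def describe_strength(r):
    """Label a correlation using the rough cutoffs .30, .65, and .85."""
    size = abs(r)   # strength ignores direction
    if size > 0.85:
        return "very strong"
    if size > 0.65:
        return "strong"
    if size >= 0.30:
        return "moderate"
    return "weak"

r = 0.86
shared_variance = r ** 2

print(describe_strength(r))                       # very strong
print(f"shared variance = {shared_variance:.0%}") # shared variance = 74%
```

So about 74% of the variation in one variable is related to variation in the other, and about 26% is not.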

Interpreting Correlations (cont.)

• Words typically used to describe correlations
  – Direct (large values w/ large values or small values w/ small values; moving parallel; 0 to +1)
  – Indirect or inverse (large values w/ small values; moving in opposite directions; 0 to -1)
  – Perfect (exactly 1 or -1)
  – Strong, weak
  – High, moderate, low
  – Positive, negative
• Correlations vs. Mean Differences
  – Groups of scores that are correlated will not necessarily have similar means. Correlation also works w/ different units of measurement.

50   75    9
40   62   14
35   53   20
24   35   45
15   21   58
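Reading the scores above as three columns of five values each (an assumption about how the slide's table was laid out), a short Pearson calculation shows the point about correlation versus mean differences. The column names are invented labels.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation: shared variation divided by the product of the spreads."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

col_1 = [50, 40, 35, 24, 15]
col_2 = [75, 62, 53, 35, 21]
col_3 = [9, 14, 20, 45, 58]

r_12 = pearson_r(col_1, col_2)   # direct: large values go with large values
r_13 = pearson_r(col_1, col_3)   # inverse: large values go with small values

print(f"means: {mean(col_1)}, {mean(col_2)}, {mean(col_3)}")
print(f"r (col 1, col 2) = {r_12:.2f}")
print(f"r (col 1, col 3) = {r_13:.2f}")
```

The first two columns are almost perfectly correlated even though their means are far apart, while the first and third move in opposite directions.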

Statistical Assumptions

• The mathematical equations used to determine various correlation coefficients carry with them certain assumptions about the nature of the data used…
  – Level of data (types of correlation for different levels)
  – Normal curve (Pearson; if not, Spearman)
  – Linearity (relationships move parallel or inverse)
    • Young students initially have a low level of performance anxiety, but it rises with each performance as they realize the pressure and potential rewards that come with performance. However, once they have several performances under their belts, the anxiety subsides. (nonlinear relationship b/w # of performances & anxiety scores)
  – Presence of outliers (all)
  – Homoscedasticity – relationship consistent throughout
    • Performance anxiety levels off after several performances and remains static (relationship lacks homoscedasticity)
  – Subjects have only one score for each variable
  – Minimum sample size needed for significance

Correlational Approaches for Assessing Measurement Reliability

• Consistency over time
  – test-retest (Pearson, Spearman)
• Consistency within the measure
  – internal consistency (split-half, KR-20, Cronbach’s alpha)
  – Spearman-Brown Prophecy formula
    • 2r/(1 + r)
• Among judges
  – Interjudge (Cronbach’s alpha)
• Consistency b/w one measure and another
  – (Pearson, Spearman)
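The Spearman-Brown formula listed above, 2r/(1 + r), estimates the reliability of a full-length test from the correlation between its two halves. A minimal sketch, with a made-up split-half correlation of .70:

```python
def spearman_brown(r_half):
    """Estimate full-test reliability from a split-half correlation: 2r / (1 + r)."""
    return 2 * r_half / (1 + r_half)

r_half = 0.70   # hypothetical correlation between odd and even items
full_test = spearman_brown(r_half)
print(f"estimated full-test reliability = {full_test:.2f}")   # 0.82
```

The corrected value is higher than the split-half correlation because each half is only half as long as the real test.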

Reliability of Survey

• What broad single dimension is being studied?
  – e.g., attitudes towards elementary music
  – Preference for Western art music
  – “People who answered a on #3 answered c on #5”
• Use Cronbach’s alpha
  – Measure of internal consistency
  – Extent to which responses on individual items correspond to each other
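For readers who want to see what an online calculator is doing, here is a minimal sketch of the standard Cronbach's alpha formula: k/(k - 1) × (1 - sum of item variances / variance of total scores). The survey responses are invented for illustration.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per item, with respondents in the same order in each list."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # each respondent's total score
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))

# Hypothetical responses from five people to three survey items
item_1 = [4, 3, 5, 2, 4]
item_2 = [5, 3, 4, 2, 4]
item_3 = [4, 2, 5, 1, 3]

alpha = cronbach_alpha([item_1, item_2, item_3])
print(f"alpha = {alpha:.2f}")   # alpha = 0.94
```

The same function could be applied to judges' ratings by treating each judge as an "item" and each performer as a respondent.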

Examples

• Calculate the Pearson correlation between each subject test and the combined score on the ACT Explore for 2007-2009. (each take one subject)

• Calculate a Spearman Correlation for Contest ratings each judge vs. final rating

• Calculate internal consistency (reliability) of all three judges using Cronbach’s alpha.
  – http://www.wessa.net/rwasp_cronbach.wasp

Other Stats

Chi-Squared
http://vassarstats.net/

• Measure statistical significance b/w frequency counts (nominal/categorical data)
• Test for independence/association: Compare 2 or more proportions
  – Example: Proportion of females to males in choirs at 4 district HSs.
• Goodness of Fit: compare what you have with what is expected
  – Example: Proportions of females to males in choir compared to school population
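The goodness of fit example can be sketched with made-up counts: a choir of 45 females and 15 males in a school assumed to be half female and half male. The function name and the numbers are illustrative only.

```python
def chi_square_gof(observed, expected_proportions):
    """Chi-square goodness of fit: sum of (observed - expected)^2 / expected."""
    total = sum(observed)
    expected = [p * total for p in expected_proportions]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1   # statistic and degrees of freedom

choir_counts = [45, 15]            # females, males (hypothetical)
school_proportions = [0.5, 0.5]    # assumed school population split

chi2, df = chi_square_gof(choir_counts, school_proportions)
print(f"chi-square({df}) = {chi2:.1f}")   # chi-square(1) = 15.0
```

With 1 degree of freedom the critical value at the .05 level is 3.84, so a result this large would indicate the choir's makeup differs from the school's.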


Method Discussions
