Measures of Central Tendency & Variability Dhon G. Dungca, M.Eng’g.
MEASURES OF CENTRAL TENDENCY & VARIABILITY + NORMAL DISTRIBUTION
Day 3 Summer 2017.07.31
7/30/2017
Fang Chen 陈芳, ECNU 华东师大 (East China Normal University), English Department 英语系
DISTRIBUTION
Symmetry
Modality: unimodal, bimodal 单峰,双峰
Skewness: positive or negative 正偏或负偏
Kurtosis
CHAPTER 4 Measures of Central Tendency 集中趋势
One major purpose of statistical procedures is to summarize raw data in a meaningful way so that conclusions can be drawn.
e.g. You wonder how the students in your colleague’s class are doing in the final exam this year. There is a number you REALLY want to know: ___________
Statistics that describe central tendency are
numerical values that describe the center of a distribution of scores for a variable.
CENTRAL TENDENCY
Three common measures of central tendency: Mode 众数 Median 中数 Mean 平均数
The most common measure of central tendency is the mean. The statistical notation for the mean is:
$\mu$ when we are dealing with populations
$\bar{X}$ when we are dealing with samples
$\bar{X} = \frac{\sum X}{N}$
The median may or may not be an observed value. How about the mode?
MEAN AND MEDIAN
MODE
COMPARE AND CONTRAST
The more symmetric a distribution is, the closer these three measures of central tendency will be.
If a distribution is truly normal (symmetric and unimodal), then the mean, median, and mode will be exactly the same.
Unfortunately, this rarely happens. We must choose a measure that best suits our purposes and data.
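To make the comparison concrete, here is a minimal sketch using Python's standard `statistics` module. The two score lists are made up for illustration (they are not the course data); the second list differs from the first only in its last, extreme value.

```python
from statistics import mean, median, mode

# Hypothetical exam scores, invented for this illustration.
symmetric = [60, 70, 70, 80, 80, 80, 90, 90, 100]
skewed = [60, 70, 70, 80, 80, 80, 90, 90, 190]  # same data, one extreme score

for name, scores in [("symmetric", symmetric), ("skewed", skewed)]:
    # Print all three measures of central tendency side by side.
    print(name, "mean:", mean(scores), "median:", median(scores), "mode:", mode(scores))
```

In the symmetric list the three measures coincide; in the skewed list only the mean moves toward the extreme score.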
ADVANTAGES AND DISADVANTAGES - MEAN
Advantages:
The mean can be defined mathematically with a simple equation and can easily be manipulated algebraically.
The sample mean is a more stable estimate of the population's central tendency than the sample median or mode.
Disadvantages:
It is influenced by extreme values (very sensitive to outliers).
The sample mean may not be an actual value observed in the data.
ADVANTAGES AND DISADVANTAGES - MEDIAN
Advantages:
It is unaffected by extreme scores (outliers) …
Disadvantages:
It depends on the sample of data and is not easily generalized to the greater population.
It does not enter statistical equations readily and is therefore more difficult to work with than the mean.
It may not be an actual value observed in the data.
SOME ADVANTAGES AND DISADVANTAGES - MODE
Advantages:
Any randomly selected observation, Xi, is more likely to be the mode than any other score.
It is the only measure of central tendency that can be used with nominal data.
It is not affected by extreme scores.
Disadvantages:
It depends on the sample of data and may not be representative of the population.
It can depend on the way the data is grouped.
It cannot be defined in a simple mathematical equation.
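Two of the points above can be sketched in a few lines: the mode works on nominal data, and the modal group can change with the bin width. The data and the helper `modal_bin` are hypothetical, introduced only for this illustration.

```python
from collections import Counter
from statistics import mode, multimode

# Nominal data: categories have no order, but the mode is still defined.
majors = ["English", "Math", "English", "History", "Math", "English"]
print(mode(majors))

# Every score below occurs once, so there is no single mode.
scores = [51, 53, 55, 62, 64, 71, 75, 88]
print(multimode(scores))  # all values are tied

def modal_bin(values, width):
    """Return the lower bound of the most frequent bin of the given width."""
    counts = Counter((v // width) * width for v in values)
    return counts.most_common(1)[0][0]

# The modal bin depends on how the scores are grouped.
print(modal_bin(scores, 10))
print(modal_bin(scores, 20))
```

With 10-point bins the fifties are the modal group; with 20-point bins the modal group starts at 60.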
CHAPTER 5 Measures of Variability 分散趋势/变异性
VARIABILITY / DISPERSION 变异性
Variability is defined as how the data is distributed around a measure of central tendency (e.g. the mean).
Measures of variability describe the way and degree to which the data is spread.
Measures of variability quantify how similar the scores in a sample are to one another.
CONSIDER THE FOLLOWING:
Two classes were assigned to the same teacher. In the first class, the kids come from various backgrounds; in the second class, every kid comes from a family where at least one parent is a teacher or professor. How similar do you expect the pretest scores within the two groups to be?
THE RESULTING DATA…
THE DATA FROM A GRAPHICAL PERSPECTIVE
Class1- more variability Class 2- less variability
MEASURES OF VARIABILITY
The range 全距
The interquartile range 四分位距
Deviation 离差
Average deviation
Mean absolute deviation
Variance 方差
Standard deviation 标准差
RANGE 全距
The distance between the lowest and highest value.
Data from the previous example:
The range can be heavily influenced by extreme scores.
THE INTERQUARTILE RANGE 四分位距
The interquartile range is the range of the middle 50% of the observations.
A trimmed statistic: how much from the lower end and the upper end respectively?
Calculated by taking the difference between the 75th percentile and the 25th percentile.
The interquartile range has the opposite problem from the range: it discards too much of the data.
Percentile: the score value below which a given percentage of the observations fall.
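The contrast between the range and the interquartile range can be sketched as follows. The scores are hypothetical, and the quartiles use the "inclusive" interpolation rule of `statistics.quantiles`, which is one convention among several; other software (including SPSS) may report slightly different quartiles for the same data.

```python
from statistics import quantiles

# Hypothetical scores, invented for this illustration.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95]
with_outlier = scores + [200]  # add one extreme score

def value_range(data):
    """Distance between the lowest and highest value."""
    return max(data) - min(data)

def iqr(data):
    """75th percentile minus 25th percentile (inclusive interpolation)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return q3 - q1

print("range:", value_range(scores), "->", value_range(with_outlier))
print("IQR:  ", iqr(scores), "->", iqr(with_outlier))
```

One extreme score more than triples the range, while the interquartile range barely changes.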
FINDING THE INTERQUARTILE RANGE:
DEVIATION 离差
The difference between every data point and the mean
The average deviation
The mean absolute deviation (MAD)
Variance
Standard deviation (SD)
AVERAGE DEVIATION
We could find the deviation for each observed value:
$d_i = (X_i - \bar{X})$
Then use the mean of the deviations to look at how far, on average, the observations are from the mean:
$\text{mean}(d) = \frac{\sum_{i=1}^{N} d_i}{N}$
While the logic is sound, the average deviation for any sample will always be equal to zero. Why?
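One way to see why the deviations always cancel is to expand the sum, using only the definition of the mean, $\bar{X} = \frac{1}{N}\sum X_i$. The following is a short worked derivation offered as a sketch:

```latex
\sum_{i=1}^{N} d_i
  = \sum_{i=1}^{N} (X_i - \bar{X})
  = \sum_{i=1}^{N} X_i - N\bar{X}
  = \sum_{i=1}^{N} X_i - N \cdot \frac{1}{N}\sum_{i=1}^{N} X_i
  = 0
```

Because the sum of the deviations is zero, their mean is zero as well, whatever the data look like.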
There are two ways to eliminate the problems connected with positive and negative deviations:
Take the absolute value of each deviation (ignore the sign), which leads to the MAD
Square each deviation, since the square of a negative number is positive
MAD
Mean absolute deviation
$MAD = \frac{\sum |X_i - \bar{X}|}{N}$
Not convenient for statistical manipulation
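A minimal sketch of the mean absolute deviation, assuming a small made-up data set. The function name `mean_absolute_deviation` is introduced here for illustration only.

```python
from statistics import mean

def mean_absolute_deviation(data):
    """Average distance of each value from the mean, ignoring the sign."""
    m = mean(data)
    return sum(abs(x - m) for x in data) / len(data)

scores = [2, 4, 4, 6, 9]  # hypothetical scores with mean 5
deviations = [x - mean(scores) for x in scores]

print(sum(deviations))                  # signed deviations cancel out
print(mean_absolute_deviation(scores))  # absolute deviations do not
```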
VARIANCE
We start by finding how each observed value differs from the mean:
$(X_i - \bar{X})$
To get rid of the negative deviations, we square each of these values:
$(X_i - \bar{X})^2$
Then, we sum the squared deviations (often called the “sum of squares”):
$\sum_{i=1}^{N} (X_i - \bar{X})^2$
Calculate the average.
VARIANCE: FINAL EQUATIONS
Population: $\sigma_x^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$
Sample: $s_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
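The steps for the variance can be traced in code. This is a sketch on a small made-up data set; the final lines compare the hand-computed values against `statistics.pvariance` (divides by N) and `statistics.variance` (divides by n - 1).

```python
from statistics import mean, pvariance, variance

scores = [2, 4, 4, 6, 9]  # hypothetical scores

m = mean(scores)                           # step 1: the mean
deviations = [x - m for x in scores]       # step 2: deviation of each score
squared = [d ** 2 for d in deviations]     # step 3: square each deviation
sum_of_squares = sum(squared)              # step 4: sum of squares

population_variance = sum_of_squares / len(scores)    # divide by N
sample_variance = sum_of_squares / (len(scores) - 1)  # divide by n - 1

print(population_variance, pvariance(scores))
print(sample_variance, variance(scores))
```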
STANDARD DEVIATION - SD 标准差
Because we squared the deviations while calculating the variance, we have altered the original scale. This makes the variance difficult to interpret.
To convert this back to the original scale, we take the square root, which is called the standard deviation.
σ is the population standard deviation
s is the sample standard deviation
Think of SD as a measure of how far our data values deviate from the mean, on average
STANDARD DEVIATION: FINAL EQUATIONS
Population: $\sigma_x = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$
Sample: $s_x = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
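As a quick check of the square-root relationship, the sketch below uses `statistics.pstdev` (population SD) and `statistics.stdev` (sample SD) on a small made-up data set and squares the results to recover the two variances.

```python
from statistics import pstdev, stdev

scores = [2, 4, 4, 6, 9]  # hypothetical scores

population_sd = pstdev(scores)  # square root of (sum of squares / N)
sample_sd = stdev(scores)       # square root of (sum of squares / (n - 1))

# Squaring the SD gives the variance back; the SD itself is in score units.
print(population_sd, population_sd ** 2)
print(sample_sd, sample_sd ** 2)
```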
OUR EXAMPLE…
BACK TO OUR EXAMPLE…
A loose interpretation:
Class 1 deviated, either positively or negatively, on average, 24 points from the mean
Class 2 deviated, either positively or negatively, on average, 12 points from the mean
In general, we can conclude that the values in class 2 tend to be more similar to one another (homogeneous) than those in class 1.
Interpretation in terms of our example: Teachers’ kids all performed very similarly, whereas those from other families were much more variable in their performance.
CHARACTERISTICS OF SD
Basically a measure of the average of the deviations of each score from the mean.
Can be used to build confidence intervals to see how many scores fall below or above the mean (more on this in Chapter 6).
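DON’T BE SCARED….
Definitional:
$\sigma_x^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$
$s_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Computational:
$\sigma_x^2 = \frac{\sum_{i=1}^{N} X_i^2 - \frac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N}$
$s_x^2 = \frac{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n - 1}$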
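The definitional and computational forms give the same number. The sketch below checks this on a small made-up data set; the two function names are introduced here for illustration.

```python
def variance_definitional(data, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 (or N)."""
    n = len(data)
    m = sum(data) / n
    ss = sum((x - m) ** 2 for x in data)
    return ss / (n - 1 if sample else n)

def variance_computational(data, sample=True):
    """Sum of X squared minus (sum of X) squared over n, divided by n - 1 (or N)."""
    n = len(data)
    ss = sum(x * x for x in data) - sum(data) ** 2 / n
    return ss / (n - 1 if sample else n)

scores = [2, 4, 4, 6, 9]  # hypothetical scores
print(variance_definitional(scores), variance_computational(scores))
print(variance_definitional(scores, sample=False),
      variance_computational(scores, sample=False))
```

The computational form avoids computing each deviation, which mattered for hand calculation. In floating-point arithmetic it can lose precision when the scores are large relative to their spread, so the definitional form is usually the safer choice in code.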
THE PERPETUAL QUESTION: “WHY DIVIDE BY n-1 FOR SAMPLE STATISTICS”?
Adjustment to produce an unbiased estimate 无偏估计.
1. Concrete examples in the book: Gravetter & Wallnau, p. 100-101
2. Seeing the statistics: Howell, p. 99-101
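The unbiasedness claim can be checked by brute force on a tiny population. The sketch below uses a made-up population of six scores (an assumption for illustration, not a reproduction of the cited pages) and enumerates every possible ordered sample of size 2 drawn with replacement, then averages the two versions of the sample variance.

```python
from itertools import product

population = [0, 0, 3, 3, 9, 9]  # hypothetical population

def variance(values, divisor):
    """Sum of squared deviations from the mean of `values`, over `divisor`."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / divisor

sigma2 = variance(population, len(population))  # true population variance

n = 2
samples = list(product(population, repeat=n))  # all 36 ordered samples

# Average the sample variance over every possible sample.
avg_dividing_by_n = sum(variance(s, n) for s in samples) / len(samples)
avg_dividing_by_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print("population variance:", sigma2)
print("average when dividing by n:    ", avg_dividing_by_n)
print("average when dividing by n - 1:", avg_dividing_by_n_minus_1)
```

Dividing by n underestimates the population variance on average, while dividing by n - 1 recovers it exactly in this enumeration.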
SEEING THE STATISTICS
www.uvm.edu/~dhowell/fundamentals8/SeeingStatisticsApplets/Applets.html
The true mean of the population is 50, and SD is 29.2. Now we will sample from this population...
REPRESENTING DISTRIBUTIONS WITH GRAPHICS --- BOXPLOT
A boxplot (or box and whisker plot) includes a measure of central tendency (the median) and a measure of dispersion (the interquartile range)
Hinges = 1st and 3rd quartiles = 25th and 75th percentiles
H-spread: the range between the two quartiles
Whisker: extends at most 1.5*H-spread from the top and bottom of the box
BOXPLOT: HOWELL
The whiskers stop at the farthest values observed in the data set, but extend no more than 1.5*H-spread from the box. Observed values beyond that point are marked as outliers. You see the full range of the upper whisker, but a very short lower whisker. Why?
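The boxplot ingredients can be computed directly. This is a sketch on made-up scores; it approximates the hinges with the "inclusive" quartiles from `statistics.quantiles`, which is an assumption, since hinge and quartile conventions differ slightly between textbooks and software. The helper name `boxplot_summary` is hypothetical.

```python
from statistics import median, quantiles

def boxplot_summary(data):
    """Median, hinges, whisker ends and outliers under the 1.5 * H-spread rule."""
    data = sorted(data)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    h_spread = q3 - q1
    lower_fence = q1 - 1.5 * h_spread
    upper_fence = q3 + 1.5 * h_spread
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "median": median(data),
        "hinges": (q1, q3),
        # Whiskers stop at the farthest observed values inside the fences.
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 200]  # hypothetical scores
summary = boxplot_summary(scores)
print(summary)
```

Note that the whiskers end at observed scores, not at the fences themselves, which is why a whisker can be much shorter than 1.5 times the H-spread.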
OUR EXAMPLE IN SPSS
At least two routes:
Graphs > Boxplot
Analyze > Descriptive Statistics > Explore
KEY TERMS
Describing distribution (4): _______________, _______________, _______________, _______________
Measures of central tendency (3): ______________, _______________, _____________
Measures of variability (2): ______________, _______________
Displaying distribution (1): _______________
BREAK
THE NORMAL DISTRIBUTION & Z-SCORES Summer 2017.07.31
OVERVIEW
Probability for discrete vs. continuous data
The normal distribution
Standard Normal Distribution
z-transformations and z-scores
Using z-scores to find probabilities
Think of discrete variables in terms of the probability of a specific outcome.
We have a known number (100) of purple (10), red (40) and white (50) marbles. What is the probability of choosing a red marble?
FREQUENCY, AREA, AND PROBABILITY FOR DISCRETE VARIABLES
The pie chart represents the frequency distribution of red, purple and white marbles in a bag: purple 10%, red 40%, white 50%.
We think of continuous variables in terms of the probability of obtaining a value that falls within a range.
With the distribution of IQ scores that I collected for a study, what is the probability that somebody will have an IQ score of 90?
IQ Score Range   Frequency   Proportion   Cumulative
70-74            1           0.02         0.02
75-79            2           0.04         0.06
80-84            3           0.06         0.12
85-89            5           0.10         0.22
90-94            6           0.12         0.34
95-99            12          0.24         0.58
100-104          8           0.16         0.74
105-109          6           0.12         0.86
110-114          3           0.06         0.92
115-119          3           0.06         0.98
120-124          1           0.02         1.00
Total            50          1
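The Proportion and Cumulative columns follow mechanically from the Frequency column. The sketch below rebuilds them from the frequencies in the table, using `itertools.accumulate` for the running total.

```python
from itertools import accumulate

# Score ranges and frequencies copied from the IQ table.
ranges = ["70-74", "75-79", "80-84", "85-89", "90-94", "95-99",
          "100-104", "105-109", "110-114", "115-119", "120-124"]
frequencies = [1, 2, 3, 5, 6, 12, 8, 6, 3, 3, 1]

total = sum(frequencies)
proportions = [f / total for f in frequencies]
cumulative = [running / total for running in accumulate(frequencies)]

for label, f, p, c in zip(ranges, frequencies, proportions, cumulative):
    print(f"{label:<8} {f:>3} {p:>6.2f} {c:>6.2f}")
```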
A PROBABLY NON-PROFESSIONAL WAY TO EXPLAIN
As with the pie chart earlier, we can relate area to probability. Think of the area as the interval area for each bar. Your answer?
How many potential ranges could we create?
What does this mean?
I AM INTERESTED IN THE SCORE 90: AN INTERVAL OF 20 POINTS / 3 GROUPS
90-109:31/50=0.62
AN INTERVAL OF 10 POINTS / 6 GROUPS
90-99: 18/50=0.36
WITH AN INTERVAL OF 5 POINTS/11 GROUPS
90-94: 6/50=0.12
WITH AN INTERVAL OF 2 POINTS/50 GROUPS
90-92: 4/50=0.08
A CHANGE OF CONCEPT
The probability of exactly any single value is 0, because we can break the intervals down into finer and finer ones without limit, so the bar width shrinks toward 0.
But we want to talk about a specific value, and we want to keep using probability to interpret the score, so we use the probability density function (PDF).
Each x value corresponds to exactly one PDF value, which plays a role similar to frequency and is the height of the point on the normal curve.
How does this work?
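The shrinking-interval idea can be sketched with a normal model. The assumption here is that IQ scores follow a normal distribution with the mean (97.74) and SD (11.33) quoted for the IQ data in these slides, so the numbers are model-based and will not match the observed histogram proportions exactly. `interval_probability` is a helper name introduced for illustration.

```python
from statistics import NormalDist

# Assumed normal model for the IQ data (mean and SD as quoted in the slides).
iq = NormalDist(mu=97.74, sigma=11.33)

def interval_probability(dist, low, high):
    """Area under the curve between low and high."""
    return dist.cdf(high) - dist.cdf(low)

widths = [20, 10, 5, 2, 0.1, 0.001]
probabilities = [interval_probability(iq, 90, 90 + w) for w in widths]

for w, p in zip(widths, probabilities):
    print(f"P(90 <= IQ <= {90 + w}) = {p:.6f}")
```

As the interval starting at 90 gets narrower, the probability heads toward 0, while the ratio of probability to width settles near the height of the curve at 90.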
GRAPHING THE PDF AND RELATE TO AREA
[Figure: three panels titled "Graphing probability density function"; x-axis: IQ Scores from 70 to 130, in steps of 10, 5 and 2 respectively; y-axis: Density from 0.0000 to 0.0400]
PROBABILITY DENSITY FUNCTION / PDF 概率密度函数
For every X value, we can plug the value into the function and get a number f(X), which is the height of the point on the normal curve at that X value. We call it the density. This is the y value in your z-table. The largest y value is at the center of the normal distribution, where z=0.
$f(X) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(X-\mu)^2 / 2\sigma^2}$, where $\pi = 3.14$ and $e = 2.718$
E.g.
$f(90) = \frac{1}{11.33\sqrt{2 \times 3.14}} (2.718)^{-(90-97.74)^2 / (2 \times 11.33^2)} = 0.0279$
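The worked value can be reproduced in code. The sketch below writes the normal density formula out by hand (the function name `normal_pdf` is introduced for illustration) and compares it with `statistics.NormalDist.pdf`, using the mean 97.74 and SD 11.33 from the example.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for a given mean and SD."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

density_at_90 = normal_pdf(90, 97.74, 11.33)
density_at_mean = normal_pdf(97.74, 97.74, 11.33)

print(round(density_at_90, 4))               # density at X = 90
print(NormalDist(97.74, 11.33).pdf(90))      # library value for comparison
print(density_at_mean > density_at_90)       # the curve peaks at the mean
```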
PERCENTILES
Percentile: the point below which a specified percentage of scores in the distribution fall
Percentile rank: the percentage of scores equal to or less than the given score. Getting the percentile rank involves integration in calculus.
You don’t have to do that calculation yourself; someone has already prepared the table for us (the z table). We just need to know how to use it.
A percentile is a score; a percentile rank is a percentage.
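The two directions can be sketched with `statistics.NormalDist`, again assuming a normal model with the IQ mean (97.74) and SD (11.33) used in these slides: `cdf` goes from a score to its percentile rank, and `inv_cdf` goes from a percentage back to the score.

```python
from statistics import NormalDist

# Assumed normal model for the IQ data.
iq = NormalDist(mu=97.74, sigma=11.33)

# Percentile rank: proportion of scores at or below a given score.
rank_of_90 = iq.cdf(90)

# Percentile: the score below which a given proportion of scores fall.
score_at_75th = iq.inv_cdf(0.75)

print(f"percentile rank of 90: {rank_of_90:.4f}")
print(f"75th percentile:       {score_at_75th:.2f}")
```

The `cdf` call performs the integration that the z table tabulates.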
NORMAL DISTRIBUTION 正态分布
The normal distribution is important because:
Many dependent variables are assumed to be normally distributed in the population.
The sampling distribution of the mean is approximately normally distributed for sufficiently large samples (more coming).
Many statistical models are based on an assumption of a normally distributed variable.
NORMAL DISTRIBUTION
[Figure: "Graphing probability density function"; x-axis: IQ Scores from 70 to 130; y-axis: Density from 0.0000 to 0.0400]
Bell-shaped curve
Unimodal
Symmetric: mean, median and mode are all in the center
Not skewed
Extends from -∞ to +∞
The total area under the curve is 1
NORMAL DISTRIBUTION
About 68% of the distribution lies within 1 SD of the mean, about 95% lies within 2 SD of the mean, and about 99.7% lies within 3 SD of the mean.
We can immediately make some inferences.
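These three percentages can be checked against the standard normal distribution. A minimal sketch using `statistics.NormalDist`, whose default is mean 0 and SD 1:

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} SD: {area:.4f}")
```

The familiar 68, 95 and 99.7 are rounded versions of these areas.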
STANDARD NORMAL DISTRIBUTION 标准正态分布
The standard normal distribution is just a special case of the normal distribution with mean=0 and SD=1.
Any normal distribution can be transformed into the standard normal distribution.
Why bother transforming, or standardizing, a distribution?
HOW MANY TABLES DO WE NEED?
For our IQ data, the mean is 97.74 and SD=11.33. One SD below the mean is 97.74-11.33=86.41; one SD above the mean is 97.74+11.33=109.07. The percentile rank of 84.13% corresponds to a raw score of 109.07.
For SAT scores, mean=500 and SD=100. One SD below the mean is 400; one SD above the mean is 600. The percentile rank of 84.13% corresponds to a raw score of 600.
……
STANDARDIZED SCORES 标准分
When we transform our variables to the z-distribution (the standard normal distribution), we are standardizing our scores.
This essentially means we put all of our values on the same scale and end up with a distribution of mean=0 and SD=1.
We call the process the z-transformation.
The standardized scores that come out of this process are called z-scores.
Z-SCORE TRANSFORMATION
$z_i = \frac{X_i - \mu}{\sigma}$
X is our original data
µ is the mean of the population
σ is the population standard deviation
The end result will be a set of standardized scores.
All scores that are below the mean will be negative and all scores above the mean will be positive.
We can interpret the value of the z-score as how many standard deviations a score lies above or below the mean.
A z-score of 1.0 is a score that is exactly 1 SD above the mean.
A z-score of -1.5 is a score that is exactly 1.5 SD below the mean.
Z-SCORE EXAMPLE
Test score: Mean = 50, Standard deviation = 10
So the z-score if you received a 60 is
$z = \frac{X - \mu}{\sigma} = \frac{60 - 50}{10} = \frac{10}{10} = 1$
and the z-score if you received a 45 is
$z = \frac{X - \mu}{\sigma} = \frac{45 - 50}{10} = \frac{-5}{10} = -0.5$
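The same arithmetic in code, as a minimal sketch. The first part reproduces the two z-scores from the example; the second part standardizes a small made-up set of scores (an assumption for illustration) to show that the resulting z-scores have mean 0 and SD 1.

```python
from statistics import mean, pstdev

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# The example from the slide: mean 50, SD 10.
print(z_score(60, 50, 10))
print(z_score(45, 50, 10))

# Standardizing a whole (hypothetical) set of scores.
scores = [35, 45, 50, 55, 65]
mu, sigma = mean(scores), pstdev(scores)
z_scores = [z_score(x, mu, sigma) for x in scores]
print(z_scores, mean(z_scores), pstdev(z_scores))
```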
SO?
Now we can refer to the z-table to see what percentile rank a score value of 60 or 45 corresponds to.
A full z-score table can be found in Howell p598-601 Table E-10.
A z-score of 1 corresponds to a percentile rank of 0.8413. This means 84.13% of scores fall at or below a z-score of 1, or the raw score of 60.
A z-score of -.5 corresponds to a percentile rank of 0.3085. This means 30.85% of scores fall at or below a z-score of -.5, or a raw score of 45.
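The table lookup can be mimicked with the cumulative distribution function of the standard normal distribution. This sketch uses `statistics.NormalDist` in place of the printed z table, and also shows that working on the raw-score scale (mean 50, SD 10) gives the same answer.

```python
from statistics import NormalDist

standard_normal = NormalDist()            # z scale: mean 0, SD 1
test_scores = NormalDist(mu=50, sigma=10)  # raw-score scale from the example

rank_z_1 = standard_normal.cdf(1.0)        # z = 1
rank_z_neg_half = standard_normal.cdf(-0.5)  # z = -0.5

print(round(rank_z_1, 4), round(test_scores.cdf(60), 4))
print(round(rank_z_neg_half, 4), round(test_scores.cdf(45), 4))
```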
SUMMARY
The PDF is introduced to get to probability for continuous variables.
How to transform any score within a distribution into a z-score (that is, how to standardize the raw scores).