
MEASURES OF CENTRAL TENDENCY & VARIABILITY + NORMAL DISTRIBUTION

Day 3 Summer 2017.07.31

7/30/2017
Fang Chen (陈芳), English Department, East China Normal University (ECNU, 华东师大英语系)

DISTRIBUTION
Symmetry
Modality: unimodal or bimodal (单峰，双峰)
Skewness: positive or negative skew (正偏或负偏)
Kurtosis


CHAPTER 4 Measures of Central Tendency 集中趋势


One major purpose of statistical procedures is to summarize raw data in a meaningful way so that we can draw conclusions.

e.g. You wonder how the students in your colleague’s class are doing in the final exam this year. There is a number you REALLY want to know: ___________

Statistics that describe central tendency are numerical values that describe the center of a distribution of scores for a variable.


CENTRAL TENDENCY

Three common measures of central tendency:
Mode 众数
Median 中数
Mean 平均数


The most common measure of central tendency is the mean. The statistical notation for the mean is:
$\bar{X}$ when we are dealing with samples
$\mu$ when we are dealing with populations

$$\bar{X} = \frac{\sum X}{N}$$

The median can be an observed number or not. How about the mode?

MEAN AND MEDIAN


MODE


COMPARE AND CONTRAST
The more symmetric a distribution is, the closer these three measures of central tendency will be.
If a distribution is truly normal (symmetric and unimodal), then the mean, median, and mode will be exactly the same.
Unfortunately, this rarely happens. We must choose the measure that best suits our purposes and data.
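As a rough illustration of the comparison above, the sketch below computes all three measures for two small data sets. The scores are made up for this example (they do not come from the slides): the first set is symmetric, and the second replaces one score with an extreme value.

```python
import statistics

# Hypothetical scores, invented for illustration only.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
skewed = [1, 2, 2, 3, 3, 3, 4, 4, 50]  # same data, but the top score is an outlier


def summarize(scores):
    """Return (mean, median, mode) for a list of scores."""
    return (
        statistics.mean(scores),
        statistics.median(scores),
        statistics.mode(scores),
    )


print("symmetric:", summarize(symmetric))  # all three measures coincide at 3
print("skewed:   ", summarize(skewed))     # the mean moves to 8; median and mode stay at 3
```

The single extreme score pulls the mean away from the center while the median and mode do not move, which is one reason the choice of measure depends on the data.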


ADVANTAGES AND DISADVANTAGES - MEAN
Advantages:
The mean can be defined mathematically with a simple equation and can easily be manipulated algebraically.
It is the most stable estimate of the population's central tendency: sample means vary less from sample to sample than sample medians or modes do.
Disadvantages:
It is influenced by extreme values (very sensitive to outliers).
The sample mean may not be an actual value observed in the data.


ADVANTAGES AND DISADVANTAGES - MEDIAN
Advantages:
It is unaffected by extreme scores (outliers) …
Disadvantages:
It depends on the sample of data and is not easily generalized to the greater population.
It does not enter statistical equations readily and is therefore more difficult to work with than the mean.
It may not be an actual value observed in the data.


SOME ADVANTAGES AND DISADVANTAGES - MODE
Advantages:
Any randomly selected observation, Xi, is more likely to be the mode than any other score.
It is the only measure of central tendency that can be used with nominal data.
It is not affected by extreme scores.
Disadvantages:
It depends on the sample of data and may not be representative of the population.
It can depend on the way the data are grouped.
It cannot be defined by a simple mathematical equation.


CHAPTER 5 Measures of Variability 分散趋势/变异性


VARIABILITY / DISPERSION 变异性
Variability is defined as how the data are distributed around a measure of central tendency (e.g., the mean).
Measures of variability describe the way and degree to which the data are spread.
Measures of variability quantify how similar the scores in a sample are to one another.


CONSIDER THE FOLLOWING:

Two classes were assigned to the same teacher. In the first class, the kids come from varied backgrounds; in the second class, all the kids come from families where at least one parent is a teacher or professor. How similar do you expect the pretest scores within each of the two groups to be?


THE RESULTING DATA…


THE DATA FROM A GRAPHICAL PERSPECTIVE

Class 1: more variability
Class 2: less variability


MEASURES OF VARIABILITY

The range 全距
The interquartile range 四分位距
Deviation 离差
Average deviation
Mean absolute deviation
Variance 方差
Standard deviation 标准差

RANGE 全距
The distance between the lowest and highest value.
Data from the previous example:
The range can be heavily influenced by extreme scores.


THE INTERQUARTILE RANGE 四分位距
The interquartile range is the range of the middle 50% of the observations.
A trimmed statistic: how much from the lower end and the upper end respectively?
Calculated by taking the difference between the 75th percentile and the 25th percentile.
The interquartile range has the opposite problem from the range: it gets rid of too much of the data.
Percentile: the score value below which a given percentage of observations fall (e.g., 25% of observations fall below the 25th percentile).
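The range and interquartile range can be sketched in a few lines. The scores below are hypothetical (the slides' class data are not reproduced in the text), and the quartile rule used here, Python's `statistics.quantiles` with `method="inclusive"`, is only one of several conventions; SPSS or a textbook may give slightly different quartiles.

```python
import statistics

# Hypothetical scores, invented for illustration; 99 is an extreme value.
scores = [55, 60, 62, 65, 68, 70, 72, 75, 78, 80, 99]

# Range: distance between the lowest and highest value.
score_range = max(scores) - min(scores)

# Quartiles: 25th, 50th and 75th percentiles (inclusive interpolation rule).
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")

# Interquartile range: spread of the middle 50% of the observations.
iqr = q3 - q1

print("range:", score_range)
print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
```

Dropping the extreme score 99 would shrink the range from 44 to 25 but leave the IQR almost unchanged, which shows both the sensitivity of the range and how much of the data the IQR ignores.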


FINDING THE INTERQUARTILE RANGE:


DEVIATION 离差
The difference between every data point and the mean
The average deviation
The mean absolute deviation, m.a.d.
Variance
Standard deviation / SD


AVERAGE DEVIATION
We could find $d_i = (X_i - \bar{X})$ for each observed value.
Then use $\text{mean}(d) = \frac{\sum_{i=1}^{N} d_i}{N}$ to look at, on average, how far the observations are from the mean.
While the logic is sound, the average deviation for any sample will always be equal to zero. Why?
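One way to answer the question is a short derivation. It is a sketch that uses only the definition of the sample mean, $\bar{X} = \frac{\sum X_i}{N}$, so that $\sum X_i = N\bar{X}$:

```latex
\begin{aligned}
\sum_{i=1}^{N} d_i
  &= \sum_{i=1}^{N} (X_i - \bar{X}) \\
  &= \sum_{i=1}^{N} X_i - N\bar{X} \\
  &= N\bar{X} - N\bar{X} \\
  &= 0
\end{aligned}
```

Because the sum of the deviations is zero, their mean is zero as well: the positive and negative deviations always cancel exactly.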


There are two ways to eliminate the problem of positive and negative deviations canceling out:
Take the absolute value of the deviations (ignore the sign), which gives the MAD
Square each deviation, since the square of a negative number is positive


MAD
Mean absolute deviation

$$\text{MAD} = \frac{\sum |X_i - \bar{X}|}{N}$$

Not convenient for statistical manipulation
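A minimal sketch of the two ideas so far, using five made-up scores (not data from the slides): the plain average of the deviations comes out as zero, while the mean absolute deviation does not.

```python
# Hypothetical scores, invented for illustration only.
scores = [2, 4, 4, 6, 9]

n = len(scores)
mean = sum(scores) / n                      # 25 / 5 = 5.0

# Deviation of each score from the mean: d_i = X_i - mean
deviations = [x - mean for x in scores]     # [-3.0, -1.0, -1.0, 1.0, 4.0]

# Average deviation: positives and negatives cancel.
mean_deviation = sum(deviations) / n

# Mean absolute deviation: ignore the sign before averaging.
mad = sum(abs(d) for d in deviations) / n

print("mean deviation:", mean_deviation)
print("MAD:", mad)
```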


VARIANCE
We start by finding how each observed value differs from the mean:

$$(X_i - \bar{X})$$

To get rid of the negative deviations, we square each of these values:

$$(X_i - \bar{X})^2$$

Then, we sum the squared deviations (often called the “sum of squares”):

$$\sum_{i=1}^{N} (X_i - \bar{X})^2$$

Calculate the average.


VARIANCE: FINAL EQUATIONS

$$\sigma_x^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$

$$s_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

STANDARD DEVIATION (SD) 标准差

Because we squared the deviations while calculating the variance, we have altered the original scale. This makes the variance difficult to interpret.

To convert this back to the original scale, we take the square root of the variance, which is called the standard deviation.
σ is the population standard deviation
s is the sample standard deviation

Think of SD as a measure of how far our data values deviate from the mean, on average.


STANDARD DEVIATION: FINAL EQUATIONS

$$\sigma_x = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$$

$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
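The variance and standard deviation equations can be traced step by step in code. This is a sketch on five made-up scores (not the class data from the slides); the cross-check against Python's `statistics` module is only there to confirm the arithmetic.

```python
import math
import statistics

# Hypothetical scores, invented for illustration only.
scores = [2, 4, 4, 6, 9]

n = len(scores)
mean = sum(scores) / n

# Sum of squares: sum of the squared deviations from the mean.
sum_of_squares = sum((x - mean) ** 2 for x in scores)   # 9 + 1 + 1 + 1 + 16

# Population formulas divide by N; sample formulas divide by n - 1.
population_variance = sum_of_squares / n
sample_variance = sum_of_squares / (n - 1)

# The standard deviation is the square root of the variance.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

# Cross-check against the standard library.
assert math.isclose(population_variance, statistics.pvariance(scores))
assert math.isclose(sample_variance, statistics.variance(scores))

print("variance (N, n-1):", population_variance, sample_variance)
print("SD       (N, n-1):", round(population_sd, 3), round(sample_sd, 3))
```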

OUR EXAMPLE…


BACK TO OUR EXAMPLE…
A loose interpretation:
Class 1 deviated, either positively or negatively, on average, 24 points from the mean.
Class 2 deviated, either positively or negatively, on average, 12 points from the mean.
In general, we can conclude that the values in Class 2 tend to be more similar to one another (homogeneous) than those in Class 1.
Interpretation in terms of our example: teachers’ kids all performed very similarly, whereas those from other families were much more variable in their performance.


CHARACTERISTICS OF SD
Basically a measure of the average of the deviations of each score from the mean.
Can be used to build confidence intervals to see how many scores fall within a given distance below or above the mean (more on this in Chapter 6).


DON’T BE SCARED….

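Definitional:

$$\sigma_x^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N} \qquad s_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Computational:

$$\sigma_x^2 = \frac{\sum_{i=1}^{N} X_i^2 - \frac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N} \qquad s_x^2 = \frac{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n - 1}$$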
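To see that the definitional and computational formulas give the same answer, the sketch below implements both versions of the sample variance. The function names and the five scores are invented for this example.

```python
import math


def sample_variance_definitional(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_computational(xs):
    """(sum of X^2 minus (sum of X)^2 / n), divided by n - 1."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x_squared = sum(x * x for x in xs)
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)


# Hypothetical scores: sum X = 25, sum X^2 = 153.
scores = [2, 4, 4, 6, 9]

definitional = sample_variance_definitional(scores)
computational = sample_variance_computational(scores)   # (153 - 625/5) / 4

print(definitional, computational)
```

The computational form needs only two running totals, which made it convenient for hand calculation; the definitional form shows more directly what the variance means.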

THE PERPETUAL QUESTION: “WHY DIVIDE BY n-1 FOR SAMPLE STATISTICS”?

Adjustment to produce an unbiased estimate 无偏估计.
1. Concrete examples in the book: Gravetter & Wallnau, pp. 100-101
Seeing the statistics: Howell, pp. 99-101


SEEING THE STATISTICS
www.uvm.edu/~dhowell/fundamentals8/SeeingStatisticsApplets/Applets.html
The true mean of the population is 50, and SD is 29.2. Now we will sample from this population...
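The sampling demonstration can be imitated with a small simulation. This is a sketch, not a reproduction of the applet: it assumes the population is the integers 0 to 100, each equally likely, which has mean 50 and SD of about 29.2 and so matches the figures quoted above. The sample size and number of trials are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Assumption: a uniform population of the integers 0..100.
population = list(range(0, 101))
population_variance = statistics.pvariance(population)   # 850, so SD is about 29.2

sample_size = 5
trials = 10_000

divide_by_n = []          # variance estimates that divide by n
divide_by_n_minus_1 = []  # variance estimates that divide by n - 1

for _ in range(trials):
    sample = random.choices(population, k=sample_size)
    divide_by_n.append(statistics.pvariance(sample))
    divide_by_n_minus_1.append(statistics.variance(sample))

mean_biased = statistics.fmean(divide_by_n)
mean_unbiased = statistics.fmean(divide_by_n_minus_1)

print("population variance:      ", population_variance)
print("average estimate, / n:    ", round(mean_biased, 1))
print("average estimate, / (n-1):", round(mean_unbiased, 1))
```

Averaged over many samples, dividing by n lands well below the population variance (near 4/5 of it for samples of 5), while dividing by n - 1 lands close to it. That is what "unbiased" means here.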


REPRESENTING DISTRIBUTIONS WITH GRAPHICS --- BOXPLOT
A boxplot (or box-and-whisker plot) includes a measure of central tendency (the median) and a measure of dispersion (the interquartile range).
Hinges = 1st and 3rd quartiles = 25th and 75th percentiles
H-spread: the range between the two quartiles
Whisker: extends up to 1.5*H-spread from the top and bottom of the box


BOXPLOT: HOWELL


The whiskers stop at the farthest values observed in the data set that are no more than 1.5*H-spread from the box. Observed values beyond that point are marked as outliers. You see the full range of the upper whisker, but a very short lower whisker. Why?
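The whisker rule can be written out as a short calculation. This is a sketch on made-up scores, and it approximates the hinges with the quartiles from `statistics.quantiles(method="inclusive")`; Howell's hinges and SPSS may use a slightly different rule.

```python
import statistics

# Hypothetical scores, invented for illustration; 120 is an extreme value.
scores = [52, 60, 62, 65, 68, 70, 72, 75, 78, 80, 120]

# Hinges (approximated by the 1st and 3rd quartiles) and the median.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
h_spread = q3 - q1

# Whiskers may reach at most 1.5 * H-spread beyond the box.
lower_limit = q1 - 1.5 * h_spread
upper_limit = q3 + 1.5 * h_spread

# Each whisker stops at the farthest observed value inside the limits.
inside = [x for x in scores if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)

# Anything beyond the limits is plotted separately as an outlier.
outliers = [x for x in scores if x < lower_limit or x > upper_limit]

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Here the upper whisker stops at 80, well short of its limit of 96, because no observed value lies between 80 and 96. A whisker is short whenever the data run out before the 1.5*H-spread limit is reached.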

OUR EXAMPLE IN SPSS
At least two routes:
Graphs > Boxplot
Analyze > Descriptive Statistics > Explore


KEY TERMS
Describing distribution (4): _______________, _______________, _______________, _______________.
Measures of central tendency (3): ______________, _______________, _____________
Measures of variability (2): ______________, _______________
Displaying distribution (1): _______________


BREAK


THE NORMAL DISTRIBUTION & Z-SCORES Summer 2017.07.31


OVERVIEW
Probability for discrete vs. continuous data
The normal distribution
Standard normal distribution
z-transformations and z-scores
Using z-scores to find probabilities


Think of discrete variables in terms of the probability of a specific outcome.
We have a known number (100) of purple (10), red (40) and white (50) marbles. What is the probability of choosing a red marble?


FREQUENCY, AREA, AND PROBABILITY FOR DISCRETE VARIABLES

The pie chart represents the frequency distribution of red, purple and white marbles in a bag: purple 10%, red 40%, white 50%.

We think of continuous variables in terms of the probability of obtaining a value that falls within a range.
With the distribution of IQ scores that I collected for a study, what is the probability that somebody will have an IQ score of 90?


IQ Score Range   Frequency   Proportion   Cumulative
70-74                1          0.02         0.02
75-79                2          0.04         0.06
80-84                3          0.06         0.12
85-89                5          0.10         0.22
90-94                6          0.12         0.34
95-99               12          0.24         0.58
100-104              8          0.16         0.74
105-109              6          0.12         0.86
110-114              3          0.06         0.92
115-119              3          0.06         0.98
120-124              1          0.02         1.00
Total               50          1
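The Proportion and Cumulative columns follow directly from the frequencies. The sketch below rebuilds them from the counts in the table; the variable names are made up for this example.

```python
from itertools import accumulate

# Frequencies copied from the IQ table (5-point intervals).
frequencies = {
    "70-74": 1, "75-79": 2, "80-84": 3, "85-89": 5,
    "90-94": 6, "95-99": 12, "100-104": 8, "105-109": 6,
    "110-114": 3, "115-119": 3, "120-124": 1,
}

total = sum(frequencies.values())

# Proportion: share of all scores that fall in the interval.
proportions = {k: f / total for k, f in frequencies.items()}

# Cumulative: share of all scores in this interval or any lower one.
running_counts = accumulate(frequencies.values())
cumulative = {k: c / total for k, c in zip(frequencies, running_counts)}

for interval, count in frequencies.items():
    print(f"{interval:>8} {count:>3} {proportions[interval]:.2f} {cumulative[interval]:.2f}")

# Probability that a score falls in a range = summed frequency / total.
p_90_to_94 = frequencies["90-94"] / total
p_90_to_99 = (frequencies["90-94"] + frequencies["95-99"]) / total
print("P(90-94) =", p_90_to_94, " P(90-99) =", p_90_to_99)
```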


A PROBABLY NON-PROFESSIONAL WAY TO EXPLAIN
Like with the pie chart earlier, we can relate area to probability. Think of the area of each bar as the probability for that interval. Your answer?
How many potential ranges could we create?
What does this mean?


I AM INTERESTED IN THE SCORE 90: AN INTERVAL OF 20 POINTS / 3 GROUPS


90-109: 32/50 = 0.64 (from the frequency table: 6 + 12 + 8 + 6 = 32)


AN INTERVAL OF 10 POINTS / 6 GROUPS

90-99: 18/50 = 0.36

WITH AN INTERVAL OF 5 POINTS/11 GROUPS

90-94: 6/50 = 0.12

WITH AN INTERVAL OF 2 POINTS/50 GROUPS

90-92: 4/50 = 0.08

A CHANGE OF CONCEPT
The probability of exactly any single value is 0, because we can break the intervals down into finer and finer ones without limit, meaning the bar width will become smaller and smaller until it reaches 0.
But we want to talk about a specific value, and we want to use probability to interpret the score. For this we will use the probability density function (PDF).
Each x value corresponds to exactly one PDF value, which plays a role similar to a frequency and is the height of the normal curve at that point.
How does this work?
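One way to watch this happen is to shrink the interval numerically. The sketch assumes the IQ scores follow a normal curve with the mean (97.74) and SD (11.33) that the slides report for this data set, so its probabilities are model values rather than the observed proportions from the histograms.

```python
from statistics import NormalDist

# Assumption: a normal model with the mean and SD reported for the IQ data.
iq = NormalDist(mu=97.74, sigma=11.33)

widths = [20, 10, 5, 2, 1, 0.1, 0.001]
rows = []
for width in widths:
    # Probability of a score between 90 and 90 + width (area of the "bar").
    probability = iq.cdf(90 + width) - iq.cdf(90)
    # Area divided by width gives the average height of the bar.
    height = probability / width
    rows.append((width, probability, height))

for width, probability, height in rows:
    print(f"width={width:<6} probability={probability:.6f} height={height:.4f}")

print("density at 90:", round(iq.pdf(90), 4))
```

As the interval narrows, the probability heads toward 0, but the height (probability divided by width) settles at the density of the curve at 90. The density is the quantity that stays meaningful for a single value.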


PDF


GRAPHING THE PDF AND RELATING IT TO AREA
[Figure: three graphs of the probability density function. x-axis: IQ Scores, 70 to 130, with tick marks every 10, 5, and 2 points respectively; y-axis: Density, 0 to 0.04.]

PROBABILITY DENSITY FUNCTION/ PDF 概率密度函数

For every X value, we can plug the value into the function and get an f(X) number, which is the height of the normal curve at that X value. We call it the density. This is the y value in your z-table. The largest y value is at the center of the normal distribution, where z=0.

$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(X-\mu)^2 / (2\sigma^2)}, \quad \text{where } \pi = 3.14,\ e = 2.718$$

E.g.

$$f(90) = \frac{1}{11.33\sqrt{2 \times 3.14}}\,(2.718)^{-(90-97.74)^2 / (2 \times 11.33^2)} = 0.0279$$
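The worked example can be checked by coding the density formula directly. The function name `normal_pdf` is made up for this sketch; the comparison with `statistics.NormalDist` is only a cross-check.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)


# Mean and SD of the IQ data, as used in the worked example.
mu, sigma = 97.74, 11.33

density_at_90 = normal_pdf(90, mu, sigma)
density_at_mean = normal_pdf(mu, mu, sigma)   # the highest point of the curve

print("f(90)    =", round(density_at_90, 4))
print("f(97.74) =", round(density_at_mean, 4))

# Cross-check against the standard library's implementation.
assert math.isclose(density_at_90, NormalDist(mu, sigma).pdf(90))
```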

PERCENTILES
Percentile: the point below which a specified percentage of scores in the distribution fall.
Percentile rank: the percentage of scores equal to or less than the given score.
Getting the percentile rank involves integration in calculus. You don’t have to calculate it yourself: someone has already prepared the table for us (the z table). We just need to know how to use it.
A percentile is a score; a percentile rank is a percentage.


NORMAL DISTRIBUTION 正态分布
Normal distribution is important because:
Many dependent variables are assumed to be normally distributed in the population.
The sampling distribution of the mean is approximately normally distributed when samples are large enough (more coming).
Many statistical models are based on an assumption of a normally distributed variable.


NORMAL DISTRIBUTION

[Figure: normal curve. x-axis: IQ Scores, 70 to 130; y-axis: Density, 0 to 0.04.]
Bell-shaped curve
Unimodal
Symmetric: mean, median and mode are all in the center
Not skewed
Extends from -∞ to +∞
The total area under the curve is 1

NORMAL DISTRIBUTION

About 68% of the distribution lies within 1 SD of the mean, 95% lies within 2 SD of the mean, and 99.7% lies within 3 SD of the mean.
We can immediately make some inferences.
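These three percentages can be recovered from the normal curve itself. The sketch uses Python's `statistics.NormalDist` as a stand-in for the printed z-table.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

# Area under the curve between -k and +k standard deviations.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, area in within.items():
    print(f"within {k} SD: {area:.4f}")
```

The exact areas are about 0.6827, 0.9545 and 0.9973, which the rule rounds to 68%, 95% and 99.7%.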

STANDARD NORMAL DISTRIBUTION 标准正态分布
The standard normal distribution is just a special case of the normal distribution with mean=0 and SD=1.
Any normal distribution can be transformed into the standard normal distribution.
Why bother transforming, or standardizing, a distribution?


HOW MANY TABLES DO WE NEED?
For our IQ data, the mean is 97.74 and SD=11.33. One SD below the mean is 97.74-11.33=86.41; one SD above the mean is 97.74+11.33=109.07. The percentile rank of 84.13% corresponds to a raw score of 109.07.
For SAT scores, mean=500 and SD=100. One SD below the mean is 400; one SD above the mean is 600. The percentile rank of 84.13% corresponds to a raw score of 600.
……


STANDARDIZED SCORES 标准分
When we transform our variables to the z-distribution (the standard normal distribution), we are standardizing our scores.
This essentially means we put all of our values on the same scale and end up with a distribution with mean=0 and SD=1.
We call the process the z-transformation.
The standardized scores that come out of this process are called z-scores.


Z-SCORE TRANSFORMATION

$$z_i = \frac{X_i - \mu}{\sigma}$$

X is our original data
µ is the mean of the population
σ is the population standard deviation

The end result will be a set of standardized scores.
All scores that are below the mean will be negative and all scores above the mean will be positive.
We can interpret the value of the z-score as how many standard deviations a score is above or below the mean.
A z-score of 1.0 is a score that is exactly 1 SD above the mean.
A z-score of -1.5 is a score that is exactly 1.5 SD below the mean.

Z-SCORE EXAMPLE
Test score:
Mean = 50
Standard deviation = 10
So the z-score if you received a 60 is

$$z = \frac{X - \mu}{\sigma} = \frac{60 - 50}{10} = \frac{10}{10} = 1$$

and the z-score if you received a 45 is

$$z = \frac{X - \mu}{\sigma} = \frac{45 - 50}{10} = \frac{-5}{10} = -0.5$$

SO?
Now we can refer to the z-table to see what percentile rank a score value of 60 or 45 corresponds to.
A full z-score table can be found in Howell p598-601 Table E-10.
A z-score of 1 corresponds to a percentile rank of 0.8413. This means 84.13% of scores fall at or below a z-score of 1, or a raw score of 60.
A z-score of -.5 corresponds to a percentile rank of 0.3085. This means 30.85% of scores fall at or below a z-score of -.5, or a raw score of 45.
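The table lookup can be mirrored in a few lines. In this sketch the helper names `z_score` and `percentile_rank` are made up for illustration, and `statistics.NormalDist().cdf` stands in for the printed z-table; it assumes the scores are normally distributed.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


def percentile_rank(z):
    """Proportion of a normal distribution at or below z (the z-table lookup)."""
    return NormalDist().cdf(z)


# Test score example: mean = 50, SD = 10.
z_60 = z_score(60, mu=50, sigma=10)
z_45 = z_score(45, mu=50, sigma=10)

print("score 60: z =", z_60, " percentile rank =", round(percentile_rank(z_60), 4))
print("score 45: z =", z_45, " percentile rank =", round(percentile_rank(z_45), 4))

# One table is enough: an SAT score of 600 (mean 500, SD 100) is also z = 1.
z_sat_600 = z_score(600, mu=500, sigma=100)
print("SAT 600:  z =", z_sat_600, " percentile rank =", round(percentile_rank(z_sat_600), 4))
```

Because every normal distribution reduces to the same z scale, a single table (or a single function) serves for test scores, IQ scores and SAT scores alike.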


SUMMARY
The PDF is introduced to get to the probability for a continuous variable.
How do we transform any score within a distribution into a z-score (that is, standardize the raw scores)?
