Review: what is a distribution?


Page 1: Review: what is a  distribution ?

VARIABILITY

Page 2: Review: what is a  distribution ?

Case no.  Age  Height  M/F
1         23   68      M
2         22   64      F
3         23   69      F
4         25   71      M
5         27   64      F
6         22   72      M
7         24   65      F
8         23   66      M
9         23   66      F
10        25   68      F
11        21   68      M
12        21   62      F
13        24   71      M
14        27   66      F
15        21   62      F
16        25   56      F
17        22   71      M
18        22   70      M
19        25   66      F
20        26   60      F
21        21   52      F
22        31   70      F
23        24   71      M
24        31   61      F
25        23   72      M
26        27   71      F
27        25   71      M
28        26   64      F
29        22   66      F
30        29   69      M
31        24   67      F

Summary statistics
Age: mean = 24
Height: mean = 67
M/F: 39% M, 61% F

Review: Distribution
An arrangement of cases according to their score or value on one or more variables

• Categorical variable

• Continuous variable
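A minimal Python sketch of the distinction, using the age and M/F columns from the 31-case table on this slide. The variable names are illustrative and not from the slides. A categorical variable is summarized by counting the cases in each category; a continuous variable can be summarized by a statistic such as the mean.

```python
from collections import Counter
from statistics import mean

# Ages and M/F codes for cases 1-31, copied from the slide's table.
ages = [23, 22, 23, 25, 27, 22, 24, 23, 23, 25,
        21, 21, 24, 27, 21, 25, 22, 22, 25, 26,
        21, 31, 24, 31, 23, 27, 25, 26, 22, 29, 24]
sexes = list("MFFMFMFMFF" "MFMFFFMMFF" "FFMFMFMFFM" "F")

# Categorical variable: the distribution is a count (or percent) per category.
sex_counts = Counter(sexes)
sex_percent = {k: round(100 * v / len(sexes)) for k, v in sex_counts.items()}

# Continuous variable: the distribution can be summarized by its mean.
mean_age = mean(ages)

print(sex_percent)      # percent of cases in each category
print(round(mean_age))  # mean age, rounded to a whole year as on the slide
```

The rounded results should agree with the slide's summary statistics for age and M/F.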

Page 3: Review: what is a  distribution ?

Dispersion and the mean
• Dispersion: How scores or values arrange themselves around the mean
• If most scores cluster about the mean, the shape of the distribution is peaked
– This is the so-called “normal” distribution
– In social science the scores or values for many variables are normally or near-normally distributed
– This allows use of the mean to describe the dataset (that’s why it’s called a “summary statistic”)
• When scores are more dispersed, a distribution’s shape is flatter
– Distance between most scores and the mean is greater
– Many scores are at a considerable distance from the mean
– The mean loses value as a summary statistic

[Figure: two distributions of arrests. Normal distribution: mean 3.0, a good descriptor. “Flat” distribution: mean 3.65, a poor descriptor.]

Page 4: Review: what is a  distribution ?

Normal distributions
• Characteristics:
– Unimodal and symmetrical: shapes on both sides of the mean are identical
– 68.26 percent of the area “under” the curve (meaning 68.26 percent of the cases) falls within one “standard deviation” (+/- 1 SD) from the mean
– NOTE: The fact that a distribution is “normal” or “near-normal” does NOT imply that the mean is of any particular value. All it implies is that scores distribute themselves around the mean “normally”.
• Means depend on the data. In this distribution the mean could be any value.
• By definition, the standard deviation score that corresponds with the mean of a normal distribution - whatever that score might be - is zero.

[Figure: normal curve. Horizontal axis shows the mean (whatever it is) and the standard deviation score (always 0 at the mean).]
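The one-standard-deviation figure can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist`; the two distributions and the helper name `share_within_one_sd` are hypothetical, chosen only to show that the share of cases within 1 SD does not depend on what the mean happens to be.

```python
from statistics import NormalDist

def share_within_one_sd(dist: NormalDist) -> float:
    """Proportion of a normal distribution lying within 1 SD of its mean."""
    return dist.cdf(dist.mean + dist.stdev) - dist.cdf(dist.mean - dist.stdev)

# Two normal distributions with very different (hypothetical) means and SDs.
standard = NormalDist(mu=0, sigma=1)
arrests = NormalDist(mu=3.0, sigma=1.5)

print(round(share_within_one_sd(standard), 4))
print(round(share_within_one_sd(arrests), 4))
```

Both calls give roughly 0.6827. The 68.26 percent on the slide is the same quantity read from a printed z-table, which lists .3413 for z = 1.00 and is doubled to cover both sides of the mean.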

Page 5: Review: what is a  distribution ?

Measuring dispersion
• Average deviation: $\frac{\sum |x - \bar{x}|}{n}$
– Average distance between the mean and the values (scores) for each case
– Uses absolute distances (no + or -)
– Affected by extreme scores
• Variance ($s^2$): A sample’s cumulative dispersion

$s^2 = \frac{\sum (x - \bar{x})^2}{n}$ (use n-1 for small samples)

• Standard deviation ($s$): A standardized form of variance, comparable between samples

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$ (use n-1 for small samples)

– Square root of the variance
– Expresses dispersion in units of equal size for that particular distribution
– Less affected by extreme scores
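The three measures can be written out as a short Python sketch. The function names, the `small_sample` flag and the eight scores are hypothetical and chosen for easy arithmetic; they are not from the slides. The flag switches between dividing by n and by n-1, as the slide describes.

```python
from math import sqrt

def average_deviation(scores):
    """Mean absolute distance between each score and the mean."""
    m = sum(scores) / len(scores)
    return sum(abs(x - m) for x in scores) / len(scores)

def variance(scores, small_sample=True):
    """Sum of squared deviations divided by n-1 (small samples) or n."""
    m = sum(scores) / len(scores)
    sum_of_squares = sum((x - m) ** 2 for x in scores)
    divisor = len(scores) - 1 if small_sample else len(scores)
    return sum_of_squares / divisor

def standard_deviation(scores, small_sample=True):
    """Square root of the variance."""
    return sqrt(variance(scores, small_sample))

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores, mean = 5

print(average_deviation(scores))                       # 12 / 8
print(variance(scores, small_sample=False))            # 32 / 8
print(standard_deviation(scores, small_sample=False))  # sqrt(4)
print(variance(scores))                                # 32 / 7, the n-1 version
```

Dividing by n-1 always gives a slightly larger variance than dividing by n, and the gap shrinks as the sample grows.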

Page 6: Review: what is a  distribution ?

How well do means represent (summarize) a sample?

13 officers scored on numbers of tickets written in one week
Officer A: 1 ticket
Officers B & C: 2 tickets each
Officers D & E: 3 tickets each
Officers F & G: 4 tickets each
Officers H & I: 5 tickets each
Officer J: 6 tickets
Officers K & L: 7 tickets each
Officer M: 9 tickets

Mean = 4.46 SD = 2.33

[Figure: officers A through M plotted by number of tickets under a bell-shaped curve, with 2.13 (-1 SD), 4.46 (mean) and 6.79 (+1 SD) marked.]

In a normal distribution about 68% of cases fall within 1 SD of the mean. .68 X 13 cases = 8.84, or about 9 cases. But here only 7 cases (Officers D-J) fall within 1 SD of the mean. Six officers wrote very few or very many tickets, making the distribution considerably more dispersed than “normal.”

So…for this sample, the mean does NOT seem to be a good summary statistic. It is NOT a good shortcut for describing how officers in this sample performed.

If variable “no. of tickets” was “normally” distributed most cases would fall inside the bell-shaped curve. Here they don’t.

Page 7: Review: what is a  distribution ?

13 officers scored on numbers of tickets written in one week
Officer A: 1 ticket
Officer B: 2 tickets
Officer C: 3 tickets
Officers D, E, F: 4 tickets each
Officers G, H, I: 5 tickets each
Officers J & K: 6 tickets each
Officer L: 7 tickets
Officer M: 9 tickets

Mean = 4.69 SD = 2.1

[Figure: officers A through M plotted by number of tickets under a bell-shaped curve, with 2.59 (-1 SD), 4.69 (mean) and 6.79 (+1 SD) marked.]

In a normal distribution about 68 percent of the cases fall within 1 SD of the mean. .68 X 13 = 8.84, or about 9 cases.

Here, 9 of the 13 cases (officers C-K) do fall within 1 SD of the mean. The distribution is normal because most officers wrote close to the same number of tickets, so the cases “clustered” around the mean.

So, for this sample the mean is a good summary statistic - a good shortcut for describing officer performance.

If variable “no. of tickets” was “normally” distributed most cases would fall inside the bell-shaped curve. Here they do!
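The comparison of the two ticket samples (this slide and the previous one) can be reproduced with a short Python sketch. The list and function names are illustrative; `statistics.stdev` uses the n-1 formula, which appears to be what the slides use for these samples.

```python
from statistics import mean, stdev

def within_one_sd(scores):
    """Return the mean, the SD (n-1) and the number of cases within 1 SD of the mean."""
    m, s = mean(scores), stdev(scores)
    inside = [x for x in scores if m - s <= x <= m + s]
    return m, s, len(inside)

# Tickets written by officers A through M in each of the two samples.
first_sample = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 9]
second_sample = [1, 2, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 9]

for name, scores in [("first", first_sample), ("second", second_sample)]:
    m, s, n_inside = within_one_sd(scores)
    print(f"{name}: mean={m:.2f} SD={s:.2f} within 1 SD: {n_inside} of {len(scores)}")
```

The counts of 7 and 9 cases match the slides. With only 13 cases this is a rough check on clustering, not a formal test of normality.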

Page 8: Review: what is a  distribution ?

Going beyond description…
• As we’ve seen, when variables are normally or near-normally distributed, the mean, variance and standard deviation can help describe datasets
• But they are also useful in explaining why things change; that is, in testing hypotheses
• For example, assume that patrol officers in the XYZ police dept. were tested for effectiveness, and that on a scale of 1 (least eff.) to 5 (most eff.) their mean score was 3.2, distributed about normally
• You want to use XYZ P.D. to test the hypothesis that college-educated cops are more effective: college → greater effectiveness
– Independent variable: college (Y/N)
– Dependent variable: effectiveness (scale 1-5)
• You draw two officer samples (we’ll cover this later in the term) and compare their mean effectiveness scores
– 10 college grads (mean 3.7)
– 10 non-college (mean 2.8)
• On its face, the difference between means is in the hypothesized direction: college grads seem more effective. But that’s not the end of it. Each group’s variance would then be used to determine whether the difference in scores is “statistically significant.” Don’t worry - we’ll cover this later!

[Figure: Are college-educated cops more effective? College grads compared with non-college grads.]

Page 9: Review: what is a  distribution ?

Variability exercise

Random sample of patrol officers, each scored 1-5 on a cynicism scale

Sample 1 (n=10)

Officer  Score  Mean  Diff.  Sq.
1        3      2.9    .1     .01
2        3      2.9    .1     .01
3        3      2.9    .1     .01
4        3      2.9    .1     .01
5        3      2.9    .1     .01
6        3      2.9    .1     .01
7        3      2.9    .1     .01
8        1      2.9   -1.9   3.61
9        2      2.9   -.9     .81
10       5      2.9   2.1    4.41

Sum 8.90
Variance (sum of squares / n-1) s2 .99
Standard deviation (sq. root of variance) s .99

This is not an acceptable graph; it’s only to illustrate dispersion

Page 10: Review: what is a  distribution ?

Another random sample of patrol officers, each scored 1-5 on a cynicism scale

Sample 2 (n=10)

Officer  Score  Mean  Diff.  Sq.
1        2      ___   ___    ___
2        1      ___   ___    ___
3        1      ___   ___    ___
4        2      ___   ___    ___
5        3      ___   ___    ___
6        3      ___   ___    ___
7        3      ___   ___    ___
8        3      ___   ___    ___
9        4      ___   ___    ___
10       2      ___   ___    ___

Sum ____
Variance s2 ____
Standard deviation s ____

Compute ...

Page 11: Review: what is a  distribution ?

Two random samples of patrol officers, each scored 1-5 on a cynicism scale

Sample 2 (n=10)

Officer  Score  Mean  Diff.  Sq.
1        2      2.4   -.4     .16
2        1      2.4   -1.4   1.96
3        1      2.4   -1.4   1.96
4        2      2.4   -.4     .16
5        3      2.4    .6     .36
6        3      2.4    .6     .36
7        3      2.4    .6     .36
8        3      2.4    .6     .36
9        4      2.4   1.6    2.56
10       2      2.4   -.4     .16

Sum 8.40
Variance (sum of squares / n-1) s2 .93
Standard deviation (sq. root of variance) s .97

Sample 1 (n=10)

Officer  Score  Mean  Diff.  Sq.
1        3      2.9    .1     .01
2        3      2.9    .1     .01
3        3      2.9    .1     .01
4        3      2.9    .1     .01
5        3      2.9    .1     .01
6        3      2.9    .1     .01
7        3      2.9    .1     .01
8        1      2.9   -1.9   3.61
9        2      2.9   -.9     .81
10       5      2.9   2.1    4.41

Sum 8.90
Variance (sum of squares / n-1) s2 .99
Standard deviation (sq. root of variance) s .99

These are not acceptable graphs; they’re only used here to illustrate how the scores disperse around the mean
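The manual procedure in these tables (mean, difference, squared difference, sum, divide by n-1, square root) can be mirrored step by step in Python. This is only a sketch for checking hand calculations; the function name `variability_table` and its dictionary keys are illustrative.

```python
from math import sqrt

def variability_table(scores):
    """Follow the slide's columns: Mean, Diff., Sq., then Sum, s2 and s."""
    n = len(scores)
    m = sum(scores) / n
    diffs = [x - m for x in scores]    # "Diff." column
    squares = [d ** 2 for d in diffs]  # "Sq." column
    sum_sq = sum(squares)              # "Sum" row
    var = sum_sq / (n - 1)             # variance with the n-1 divisor
    return {"mean": m, "sum_sq": sum_sq, "variance": var, "sd": sqrt(var)}

sample_1 = [3, 3, 3, 3, 3, 3, 3, 1, 2, 5]
sample_2 = [2, 1, 1, 2, 3, 3, 3, 3, 4, 2]

for name, scores in [("Sample 1", sample_1), ("Sample 2", sample_2)]:
    t = variability_table(scores)
    print(f"{name}: mean={t['mean']:.1f} sum={t['sum_sq']:.2f} "
          f"s2={t['variance']:.2f} s={t['sd']:.2f}")
```

Rounded to two decimals, the results agree with the sums, variances and standard deviations shown for both samples.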

Page 12: Review: what is a  distribution ?

z-score (a “standard” score)
• If the distribution of a variable (e.g., number of arrests) is approximately normal, we can estimate where any score would fall in relation to the mean.
• We first convert the sample score into a z-score using the sample standard deviation

[Figure: normal curve with z-scores -3, -2, -1, 0, +1, +2, +3 along the horizontal axis.]

Page 13: Review: what is a  distribution ?

• We then look up the z-score in a table. It gives the proportion of cases in the distribution…
– Between a case and the mean
– Beyond the case, away from the mean (left for negative z’s, right for positive z’s)
• Z-scores can be used to identify the percentile bracket into which a case falls (e.g., bottom ten percent)
• Since z-scores are standardized like percentages, they can be used to compare samples
• The z-table indicates the proportion of the area under the curve (the proportion of scores) between the mean and any z score, and the proportion of the area beyond that score (to the left or right)
• In a normal distribution 95 percent of all z-scores fall between +/- 1.96
• In a normal distribution 5 percent of all z-scores fall beyond +/- 1.96

[Figure: normal curve covering 100 percent of cases. 95 percent of cases lie between z = -1.96 and z = +1.96 (proportion .475 on each side of the mean). The rare/unusual cases lie in the two tails beyond +/- 1.96, 2½ pct. (.025) each.]
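The +/- 1.96 cutoffs can be checked against the standard normal curve with Python's `statistics.NormalDist`. This is a sketch in place of a printed z-table; the variable names are illustrative.

```python
from statistics import NormalDist

z_dist = NormalDist()  # standard normal: mean 0, SD 1

# Proportion of the area between the mean and z = 1.96 (one side).
between_mean_and_z = z_dist.cdf(1.96) - 0.5

# Proportion of the area beyond z = 1.96 (one tail).
beyond_z = 1 - z_dist.cdf(1.96)

# Proportion of the area between -1.96 and +1.96 (both sides).
middle = z_dist.cdf(1.96) - z_dist.cdf(-1.96)

print(round(between_mean_and_z, 3), round(beyond_z, 3), round(middle, 2))
```

Rounded as a z-table would show them, the three values are .475, .025 and .95.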

Page 14: Review: what is a  distribution ?

Variability exercise
Sample of twenty officers drawn from the Anywhere police department, each measured for number of arrests

[Figure: frequency chart. Horizontal axis: arrests (0 to 6); vertical axis: frequency (1 to 6).]

Unit of analysis: officers
Case: one officer
Variable: number of arrests

Number of arrests is presumably normally distributed in the population of officers, meaning the whole police department. That is, most officers make about the same number of arrests; a few make fewer, and a few make more.

Page 15: Review: what is a  distribution ?

Officer   #Arrests  Mean  Diff.  Diff. Squared  Z-score
1         2
2         4
3         5
4         3
5         1
6         3
7         2
8 (Jay)   0
9         3
10        4
11        5
12        3
13        2
14        1
15        4
16        6
17        3
18        4
19        2
20        3

Sum of squared differences
Variance (sum of squares/n-1)
Standard deviation (sq root var)

Assignment

1. Compute the sample standard deviation

2. Obtain the z-score for 0, 1, 2, 3, 4, 5 and 6 arrests

$z = \frac{x - \bar{x}}{s}$

NOTE: There are only seven values: 0, 1, 2, 3, 4, 5, 6. Only need to compute their statistics once.

Page 16: Review: what is a  distribution ?

Ofcr      #Arr  Mean  Diff.  Diff. Sq
1         2     3     -1     1
2         4     3     1      1
3         5     3     2      4
4         3     3     0      0
5         1     3     -2     4
6         3     3     0      0
7         2     3     -1     1
8 (Jay)   0     3     -3     9
9         3     3     0      0
10        4     3     1      1
11        5     3     2      4
12        3     3     0      0
13        2     3     -1     1
14        1     3     -2     4
15        4     3     1      1
16        6     3     3      9
17        3     3     0      0
18        4     3     1      1
19        2     3     -1     1
20        3     3     0      0

Sum of squared differences 42
Variance (sum of squares/n-1) 2.21
Standard Deviation (sq. root) 1.49

Page 17: Review: what is a  distribution ?

[Figure: chart of no. of officers (vertical axis, 1 to 6) by no. of arrests (horizontal axis, 0 to 6), with z-scores -2, -1, 0, +1, +2 marked along the horizontal axis.]

Arrests      Calculate z   z      Prop. between mean and z  Prop. beyond z
0 (Jay)      (0-3)/1.49    -2.01  48% (.4778)               2% (.0222)
1            (1-3)/1.49    -1.34  41% (.4099)               9% (.0901)
2            (2-3)/1.49    -.67   25% (.2486)               25% (.2514)
3            (3-3)/1.49    0      0                         50% (.50)
4            (4-3)/1.49    +.67   25% (.2486)               25% (.2514)
5            (5-3)/1.49    +1.34  41% (.4099)               9% (.0901)
6 (Dudley)   (6-3)/1.49    +2.01  48% (.4778)               2% (.0222)

Jay’s score falls in the bottom two percent of a normal distribution

Dudley’s score falls in the top two percent of a normal distribution
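The whole exercise can be checked with a short Python sketch: compute the sample standard deviation from the twenty arrest counts, convert each possible value to a z-score, and look up the two proportions. Here `statistics.NormalDist` stands in for the printed z-table. The helper `z_row` is a hypothetical name, and the rounding of s and z to two decimals is an assumption made so that the results line up with a hand calculation and a two-decimal z-table.

```python
from statistics import NormalDist, mean, stdev

# Arrests for officers 1-20, copied from the exercise table.
arrests = [2, 4, 5, 3, 1, 3, 2, 0, 3, 4, 5, 3, 2, 1, 4, 6, 3, 4, 2, 3]

m = mean(arrests)             # sample mean
s = round(stdev(arrests), 2)  # n-1 formula, rounded to two decimals

z_table = NormalDist()        # standard normal curve, in place of a z-table

def z_row(x):
    """Return z, the proportion between the mean and z, and the proportion beyond z."""
    z = round((x - m) / s, 2)
    between = z_table.cdf(abs(z)) - 0.5
    beyond = 1 - z_table.cdf(abs(z))
    return z, between, beyond

for x in range(7):
    z, between, beyond = z_row(x)
    print(f"{x} arrests: z={z:+.2f} between={between:.4f} beyond={beyond:.4f}")
```

Because the normal curve is symmetrical, the proportions for 0 and 6 arrests, 1 and 5, and 2 and 4 come out the same; only the sign of z differs.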

Page 18: Review: what is a  distribution ?

Exam information

• You must bring a z-table and a regular, non-scientific calculator with no functions beyond a square root key.
• You need to understand the concept of a distribution.
• You will be given data and asked to create graph(s) depicting the distribution of a single variable.
• You will compute basic statistics, including mean, median, mode, standard deviation and z-score. All computations must be shown on the answer sheet.
• You will be given the formulas for variance (s2) and z. You must use and display the procedure described in the slides and practiced in class for manually calculating variance (s2) and standard deviation (s).
• You will use the z-table to calculate where cases from a given sample would fall in a normal distribution.
• This is a relatively brief exam. You will have one hour to complete it. We will then take a break and move on to the next topic.