Measures of Center and Variation Prof. Felix Apfaltrer [email protected] Office:N518 Phone:...

28
Measures of Center and Variation Prof. Felix Apfaltrer [email protected] Office:N518 Phone: X7421 Office hours: Tue, Thu 1:30-3pm
  • date post

    20-Dec-2015
  • Category

    Documents


Measures of Center and Variation

Prof. Felix Apfaltrer

[email protected]

Office:N518

Phone: X7421

Office hours:

Tue, Thu 1:30-3pm

2

Measures of center - mean

• A measure of center is a value that represents the center of the data set

• The mean is the most important measure of center (also called arithmetic mean)

• sample mean: $\bar{x} = \frac{\sum x}{n}$

• population mean: $\mu = \frac{\sum x}{N}$

where $\sum$ denotes addition of values, $x$ is the variable (individual data values), $n$ is the sample size and $N$ is the population size.

Example. Lead (Pb) in air at BMCC (μg/m³; 1.5 is considered high): 5.4, 1.1, 0.42, 0.73, 0.48, 1.1

Outlier has strong effect on mean!
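A minimal Python sketch of the effect of the outlier on the mean, using the lead measurements above. The comparison without the value 5.4 is an illustration added here, not part of the slide.

```python
# Lead measurements from the slide (units as given there).
lead = [5.4, 1.1, 0.42, 0.73, 0.48, 1.1]

def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

mean_all = mean(lead)

# Illustrative comparison: the same data with the largest value (5.4) left out.
largest = max(lead)
mean_without_outlier = mean([v for v in lead if v != largest])

print(round(mean_all, 3))              # mean of all six values
print(round(mean_without_outlier, 3))  # mean of the five remaining values
```

The single value 5.4 roughly doubles the mean of the other five measurements.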

3

Measures of center - median

• Mean is good but sensitive to outliers!

• Large values can have dramatic effect!

The median is the middle value of the original data arranged in increasing order

– If n odd: exact middle value

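– If n even: average of the 2 middle values

Previous example, data reordered: 0.42, 0.48, 0.73, 1.1, 1.1, 5.4. Here n = 6 is even, so the median is (0.73 + 1.1)/2 = 0.915.

If we had an extra data point: 5.4, 1.1, 0.42, 0.73, 0.48, 1.1, 0.66

After reordering we have 0.42, 0.48, 0.66, 0.73, 1.1, 1.1, 5.4. Now n = 7 is odd, so the median is the middle value, 0.73.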

Outlier has strong effect on mean, not so on median!

Used for example in median household income: $ 36,078
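The odd/even rule for the median can be sketched in Python as follows. The function name `median` and the variable names are illustrative choices; the data are the lead values from the previous example.

```python
lead = [5.4, 1.1, 0.42, 0.73, 0.48, 1.1]
lead_extra = lead + [0.66]   # the same data with the extra point

def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # n odd: exact middle value
    return (ordered[mid - 1] + ordered[mid]) / 2 # n even: average 2 middle values

median_even = median(lead)        # n = 6
median_odd = median(lead_extra)   # n = 7
print(median_even, median_odd)
```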

4

Measures of Center - mode and midrange

• Mode M: the value that occurs most frequently

– if 2 values are most frequent: bimodal

– if more than 2: multimodal

– if no value is repeated: no mode

• Does not require numerical values

• Midrange = (highest value + lowest value)/2

• Outliers have very strong weight on the midrange

Examples:
a. 5.4, 1.1, 0.42, 0.73, 0.48, 1.1
b. 27, 27, 27, 55, 55, 55, 88, 88, 99
c. 1, 2, 3, 6, 7, 8, 9, 10

Mean      172    61   16
Median    170    2    0
Mode      1      1    0
Midrange  245.5  276  154.5

Solutions (mode):
a. unimodal: 1.1
b. bimodal: 27 and 55
c. no mode

Solutions (midrange):
a. (0.42+5.4)/2 = 2.91
b. (27+99)/2 = 63
c. (1+10)/2 = 5.5
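The mode and midrange solutions for examples a, b and c can be checked with a short Python sketch. The helper names `modes` and `midrange` are illustrative; returning an empty list for "no mode" is a convention chosen here.

```python
from collections import Counter

def modes(values):
    """Most frequent value(s); an empty list if no value is repeated."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                      # no value repeated: no mode
    return sorted(v for v, c in counts.items() if c == top)

def midrange(values):
    """Halfway point between the highest and the lowest value."""
    return (max(values) + min(values)) / 2

a = [5.4, 1.1, 0.42, 0.73, 0.48, 1.1]
b = [27, 27, 27, 55, 55, 55, 88, 88, 99]
c = [1, 2, 3, 6, 7, 8, 9, 10]

for name, data in (("a", a), ("b", b), ("c", c)):
    print(name, modes(data), midrange(data))
```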

5

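Mode and more …

• Mode: not much used with numerical data

Example: Survey shows students own:
• 84% TV
• 76% VCR
• 69% CD player
• 39% video game player
• 35% DVD

TV is the mode! No mean, median or midrange!

• Mean from frequency distribution: $\bar{x} = \frac{\sum (f \cdot x)}{\sum f}$, where $x$ is a class midpoint and $f$ its frequency

• Weighted mean: $\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$, where $w$ is the weight of value $x$

(example on page 23 of BMCC booklet)

Advantages and disadvantages of different measures of center

Skewness

Round-off: carry one more decimal than in data!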

6

Measures of variation

• Variation measures consistency

• Range = highest value - lowest value

• Standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Precision arrows vs. jungle arrows: same mean length, but different variation!

7

Standard deviation

• Measure of variation of all values from mean

• Positive or zero (zero only when all data values are equal)

• Larger deviations, larger s

• Can increase dramatically with outliers

• Same units as original data values

Recipe:
1. Compute the mean.
2. Subtract the mean from the individual values.
3. Square the differences.
4. Add the squared differences.
5. Divide by n-1.
6. Take the square root.

Example: waiting times

Bank Consistency: 6 5 4 4 6 5

Bank Unpredictable: 0 15 5 0 0 10

Bank Consistency:
1. Mean: (6+5+4+4+6+5)/6 = 5
2. (6-5)=1, (5-5)=0, (4-5)=-1, (4-5)=-1, (6-5)=1, (5-5)=0
3. 1²=1, 0²=0, (-1)²=1, (-1)²=1, 1²=1, 0²=0
4. ∑ = 1+0+1+1+1+0 = 4
5. n-1 = 6-1 = 5, 4/5 = 0.8
6. √0.8 = 0.9 min (vs 6.3 min for Bank Unpredictable)

Calculating standard deviation, Bank Unpredictable:

x    x-mean   (x-mean)²
0    -5       25
15   10       100
5    0        0
0    -5       25
0    -5       25
10   5        25

Total: ∑x = 30, ∑(x-mean)² = 200

mean = 30/6 = 5 min

s = sqrt(200/(6-1)) = sqrt(40) = 6.3 min
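The six-step recipe translates almost line by line into Python. This is a sketch with an illustrative function name, `sample_std`; the two lists are the bank waiting times from the example.

```python
import math

def sample_std(values):
    """Sample standard deviation, following the six-step recipe."""
    n = len(values)
    mean = sum(values) / n                    # 1. compute the mean
    deviations = [x - mean for x in values]   # 2. subtract the mean from each value
    squares = [d ** 2 for d in deviations]    # 3. square the differences
    total = sum(squares)                      # 4. add the squared differences
    variance = total / (n - 1)                # 5. divide by n-1
    return math.sqrt(variance)                # 6. take the square root

bank_consistency = [6, 5, 4, 4, 6, 5]
bank_unpredictable = [0, 15, 5, 0, 0, 10]

print(round(sample_std(bank_consistency), 1))
print(round(sample_std(bank_unpredictable), 1))
```

Both banks have the same mean waiting time of 5 minutes; only the standard deviation tells them apart.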

8

Standard deviation of sample and population

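Example using fast formula:

$s = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n-1)}}$

• Find values of $n$, $\sum x$, $\sum x^2$

n = 6 (6 values in sample)

$\sum x$ = 30 (adding the values)

$\sum x^2$ = 6² + 5² + 4² + 4² + 5² + 6² = 154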
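As a check, the values n = 6, ∑x = 30 and ∑x² = 154 can be plugged into the usual shortcut form s = √((n∑x² − (∑x)²)/(n(n−1))). The sketch below assumes that this is the "fast formula" the slide refers to; variable names are illustrative.

```python
import math

waits = [6, 5, 4, 4, 6, 5]          # Bank Consistency waiting times

n = len(waits)                      # 6 values in sample
sum_x = sum(waits)                  # adding the values
sum_x2 = sum(x ** 2 for x in waits) # adding the squared values

# Shortcut formula: no need to compute the mean or the deviations first.
s_fast = math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

print(n, sum_x, sum_x2, round(s_fast, 1))
```

The result agrees with the step-by-step recipe: (6·154 − 30²)/(6·5) = 24/30 = 0.8, and √0.8 ≈ 0.9.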

Standard deviation of a population

$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$

• divide by N
• μ: mu (population mean)
• σ: sigma (st. dev. of population)
• Different notations in calculators
– Excel: STDEVP instead of STDEV

Estimating s and σ: (highest value - lowest value)/4

9

Example: class grades

A statistics class of 20 students obtains the following grades:

Name     Grade    Name      Grade
Peter    83       Albert    69
Kathy    98       John      71
Pat      57       John B.   64
Nina     73       Hughes    85
Nancy    78       Zak       89
Victor   86       Zoe       84
Vikki    82       Lena      83
Jen      95       Mary      92
Jay      92       Joe       74
Fred     66       Betty     78

To rapidly approximate the mean, we take a random sample of 5 students. At random, we pick Nancy, Mary, John B., Betty and Lena:

x̄ = (78+92+64+83+78)/5 = 395/5 = 79

s = √( ((78-79)² + (92-79)² + (64-79)² + (83-79)² + (78-79)²)/4 )

= √( ((-1)² + (13)² + (-15)² + (4)² + (-1)²)/4 )

= √( (1 + 169 + 225 + 16 + 1)/4 )

= √( 412/4 ) = √( 103 ) = 10.15

The population mean is obtained by adding all grades and dividing by 20, which is 79.95.

The population standard deviation is 10.71, which we can obtain using Excel:

Name      Grade x   x-μ     (x-μ)²
Peter     83        3.1     9.3
Kathy     98        18.1    325.8
Pat       57        -23.0   526.7
Nina      73        -7.0    48.3
Nancy     78        -2.0    3.8
Victor    86        6.1     36.6
Vikki     82        2.1     4.2
Jen       95        15.1    226.5
Jay       92        12.1    145.2
Fred      66        -14.0   194.6
Albert    69        -11.0   119.9
John      71        -9.0    80.1
John B.   64        -16.0   254.4
Hughes    85        5.1     25.5
Zak       89        9.1     81.9
Zoe       84        4.1     16.4
Lena      83        3.1     9.3
Mary      92        12.1    145.2
Joe       74        -6.0    35.4
Betty     78        -2.0    3.8

sum: 2293.0
sum / N (N = 20): 114.6
root: 10.71
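The same numbers can be reproduced outside Excel. The sketch below uses Python's standard `statistics` module, where `stdev` divides by n-1 (sample) and `pstdev` divides by N (population); the dictionary and variable names are illustrative.

```python
import statistics

grades = {
    "Peter": 83, "Kathy": 98, "Pat": 57, "Nina": 73, "Nancy": 78,
    "Victor": 86, "Vikki": 82, "Jen": 95, "Jay": 92, "Fred": 66,
    "Albert": 69, "John": 71, "John B.": 64, "Hughes": 85, "Zak": 89,
    "Zoe": 84, "Lena": 83, "Mary": 92, "Joe": 74, "Betty": 78,
}

# The five students picked at random on the slide.
sample_names = ["Nancy", "Mary", "John B.", "Betty", "Lena"]
sample = [grades[name] for name in sample_names]
population = list(grades.values())

sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)               # divides by n-1
population_mean = statistics.mean(population)
population_sd = statistics.pstdev(population)      # divides by N

print(sample_mean, round(sample_sd, 2))
print(round(population_mean, 2), round(population_sd, 2))
```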

10

Variance and coefficient of variation

Variance

Variance = square of standard deviation

sample: $s^2$

population: $\sigma^2$

General terms referring to variation: dispersion, spread, variation

Variance: specific definition

Ex: finding a variance: in the bank example, s² = 0.8 (Bank Consistency) and s² = 40 (Bank Unpredictable)

Examples:

In class grade case, sample standard deviation was 10.15.

Therefore, s² = 103.

The population standard deviation was 10.71, therefore,

σ² = 10.71² = 114.7.

11

Coefficient of variation

• Coefficient of variation allows us to compare the dispersion of completely different data sets
– ex:

• consistent bank data set

6,5,4,4,6,5; x̄=5, s=0.9

CV=0.9/5=0.18

• Class sample: x̄=79, s=10.1

CV=10.1/79=0.13

– Variation of consistent bank is larger than that of the class in relative terms!

Coefficient of variation CV

Describes the standard deviation relative to the mean:

sample: $CV = \frac{s}{\bar{x}} \cdot 100\%$, population: $CV = \frac{\sigma}{\mu} \cdot 100\%$

In previous example,

CV sample = 10.1/79 = 12.8%

CV population = 10.71/79.95 = 13.4%
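A short Python sketch of the comparison above, using the sample standard deviation from the standard `statistics` module. The function name `coefficient_of_variation` is illustrative, and the result is returned as a fraction rather than a percentage.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean, as a fraction."""
    return statistics.stdev(values) / statistics.mean(values)

bank = [6, 5, 4, 4, 6, 5]               # consistent bank waiting times
class_sample = [78, 92, 64, 83, 78]     # the five sampled grades

cv_bank = coefficient_of_variation(bank)
cv_class = coefficient_of_variation(class_sample)

print(round(cv_bank, 2), round(cv_class, 2))
```

Although the grades have a much larger standard deviation in absolute terms, the bank data vary more relative to their mean.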

12

More on variance and standard deviation

• Why use variance when standard deviation is more intuitive?
– (Independent) variances have additive properties
– Probabilistic properties
– Standard deviation is more intuitive

• Why divide sample st. dev. by n-1?
– Only n-1 free parameters

• Skewness: Pearson’s index
– I = 3(mean - median)/s
– If I < -1 or I > 1: significantly skewed

Empirical rule for data with normal distribution

[Figure: bell curve over the range -3 to 3 standard deviations from the mean: 68% of data within 1 standard deviation, 95% within 2, 99.7% within 3]

Example: Adult IQ scores have a bell-shaped distribution with mean of 100 and a standard deviation of 15. What percentage of adults have IQ in 55:145 range?

s=15, 3s=45, x̄-3s=55, x̄+3s=145

Hence, 99.7% of adults have IQs in that range.

Chebyshev’s theorem: At least a fraction 1-1/k² of the data lie within k standard deviations of the mean (for k > 1). Ex: At least 1-1/3² = 8/9 = 89% of the data lie within 3 st. dev. of the mean.
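The two rules can be compared numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module for the IQ example and a small illustrative helper, `chebyshev_lower_bound`, for the theorem.

```python
from statistics import NormalDist

def chebyshev_lower_bound(k):
    """Minimum fraction of any data set within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2

# IQ example: bell-shaped with mean 100 and standard deviation 15.
iq = NormalDist(mu=100, sigma=15)
share_within_3sd = iq.cdf(145) - iq.cdf(55)

print(round(share_within_3sd, 4))          # normal distribution, 3 st. dev.
print(round(chebyshev_lower_bound(3), 3))  # guaranteed for any distribution
print(chebyshev_lower_bound(2))
```

Chebyshev's bound is weaker than the empirical rule because it makes no assumption about the shape of the distribution.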

13

And more on variance and standard deviation

• Finding s from a frequency distribution

• Interpreting a known value of the standard deviation s: If the standard deviation s is known, use it to find rough estimates of the minimum and maximum “usual” sample values by using

max “usual” value ≈ mean + 2(st. dev)
min “usual” value ≈ mean - 2(st. dev)

N-1: DATA 3, 6, 9

μ = 6, σ² = 6

Samples of size 2 (with replacement):

Sample   x̄     ∑(x-x̄)²   s² (divide by n-1 = 2-1)   s² (divide by n = 2)
3, 3     3     0          0                           0
3, 6     4.5   4.5        4.5                         2.25
3, 9     6     18         18                          9
6, 3     4.5   4.5        4.5                         2.25
6, 6     6     0          0                           0
6, 9     7.5   4.5        4.5                         2.25
9, 3     6     18         18                          9
9, 6     7.5   4.5        4.5                         2.25
9, 9     9     0          0                           0

Mean value of s² (dividing by n-1) = 54/9 = 6, which equals σ²

Mean value of s² (dividing by n) = 27/9 = 3, which underestimates σ²
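The n-1 demonstration can be reproduced by enumerating all nine samples in Python. Function and variable names are illustrative.

```python
from itertools import product

population = [3, 6, 9]

def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by the given divisor."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / divisor

# All 9 ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

mean_s2_n_minus_1 = sum(variance(s, len(s) - 1) for s in samples) / len(samples)
mean_s2_n = sum(variance(s, len(s)) for s in samples) / len(samples)
population_variance = variance(population, len(population))

print(mean_s2_n_minus_1, mean_s2_n, population_variance)
```

Averaged over all possible samples, dividing by n-1 recovers the population variance, while dividing by n gives only half of it in this example.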

Example: cotinine levels of smokers

Range     Midpoint   Smokers
0-99      49.5       11
100-199   149.5      12
200-299   249.5      14
300-399   349.5      1
400-499   449.5      2
500-599   549.5      0

using Excel we obtain

Range     Frequency f   Midpoint x   f·x     f·x²
0-99      11            49.5         544.5   26952.75
100-199   12            149.5        1794    268203
200-299   14            249.5        3493    871503.5
300-399   1             349.5        349.5   122150.25
400-499   2             449.5        899     404100.5
500-599   0             549.5        0       0
Totals:   40                         7080    1692910

with which we calculate:

x̄ = 7080/40 = 177

$s = \sqrt{\frac{n \sum (f \cdot x^2) - (\sum f \cdot x)^2}{n(n-1)}}$ = √((40·1692910 - 7080²)/(40·39)) ≈ 106.2
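The grouped-data calculation can be sketched in Python as below. It assumes the usual shortcut formula for a frequency distribution, with every value in a class represented by the class midpoint; variable names are illustrative.

```python
import math

# (midpoint x, frequency f) for each cotinine class in the table
classes = [(49.5, 11), (149.5, 12), (249.5, 14),
           (349.5, 1), (449.5, 2), (549.5, 0)]

n = sum(f for _, f in classes)                  # total frequency
sum_fx = sum(f * x for x, f in classes)         # column f·x
sum_fx2 = sum(f * x ** 2 for x, f in classes)   # column f·x²

grouped_mean = sum_fx / n
grouped_sd = math.sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))

print(n, sum_fx, sum_fx2)
print(grouped_mean, round(grouped_sd, 1))
```

Because midpoints stand in for the individual values, the result is an approximation of the standard deviation of the raw data.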

14

Measures of relative standing

Useful for comparing different data sets

• z scores
– Number of standard deviations that a value x is above or below the mean

sample: $z = \frac{x - \bar{x}}{s}$

population: $z = \frac{x - \mu}{\sigma}$

Example:
• NBA Jordan 78, μ=69, σ=2.8

• WNBA Lobo 76, μ=63.6, σ=2.5

– J: z=(x-μ)/σ=(78-69)/2.8=3.21
– L: z=(x-μ)/σ=(76-63.6)/2.5=4.96
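The two z scores in this example can be computed with a one-line helper. The function name `z_score` is an illustrative choice.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std

z_jordan = z_score(78, 69, 2.8)     # NBA population: mean 69, st. dev. 2.8
z_lobo = z_score(76, 63.6, 2.5)     # WNBA population: mean 63.6, st. dev. 2.5

print(round(z_jordan, 2), round(z_lobo, 2))
```

Lobo's value is smaller in absolute terms but lies further from the mean of her own group, which is what the z score measures.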

• Percentiles:

– Percentile of value x: $P_x = \frac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$

[Figure: z axis from -3 to 3, with ordinary values in the middle and unusual values in both tails]

Example

data point 48 in Smoker data

8/40*100 = 20th percentile = P20

Exercise:

Locate the percentiles of data points 1, 130 and 250.

15

Quartiles and percentiles

       SMOKERS                           CLASS
pos    value   value   sorted   sorted   grade   sorted
1      1       130     0        173      83      57
2      0       234     1        173      98      64
3      131     164     1        198      57      66
4      173     198     3        208      73      69
5      265     17      17       210      78      71
6      210     253     32       222      86      73
7      44      87      35       227      82      74
8      277     121     44       234      95      78
9      32      266     48       245      92      78
10     3       290     86       250      66      82
11     35      123     87       253      69      83
12     112     167     103      265      71      83
13     477     250     112      266      64      84
14     289     245     121      277      85      85
15     227     48      123      284      89      86
16     103     86      130      289      84      89
17     222     284     131      290      83      92
18     149     1       149      313      92      92
19     313     208     164      477      74      95
20     491     173     167      491      78      98

The two SMOKERS columns continue one another: the second sorted column holds positions 21 to 40.

16

Percentiles and Quartiles

Conversely, if you are looking for data in the kth percentile:

L=(k/100)*n

n total number of values

k percentiles being used

L locator that gives position of a value

(the 12th value in the sorted list L=12)

Pk kth percentile (ex: P25 is 25th percentile)

Example: In class table ( n = 20 )

• find value of 21st percentile
– L = 21/100 * 20 = 4.2
– not a whole number: round up to 5th data point
– --> P21 = 71

• find the 80th percentile:
– L = 80/100 * 20 = 16
– WHOLE NUMBER: average the 16th and 17th values
– P80 = (89+92)/2 = 90.5

• Quartiles:

– Q1 = P25, Q2 = P50 = median, Q3 = P75

Percentile of the data point at position L in the sorted list: k = (L – 1)/n • 100, that is, k = (number of values less than x)/(total number of values) • 100

Example: data point 48 in Smoker data is 9th on table, n = 40.

(9 – 1)/40 • 100 = 20, so 48 is at P20, the 20th percentile, which lies within the first quartile.

Data point 234 is 28th. k = (28 – 1)/40 • 100 = 67.5, or about the 68th percentile, which lies within the third quartile.
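The percentile of a data point, as in the two smoker examples, can be sketched in Python by counting the values below it. The list holds the 40 sorted smoker values from the table; the function name `percentile_of_value` is illustrative.

```python
smokers_sorted = [
    0, 1, 1, 3, 17, 32, 35, 44, 48, 86,
    87, 103, 112, 121, 123, 130, 131, 149, 164, 167,
    173, 173, 198, 208, 210, 222, 227, 234, 245, 250,
    253, 265, 266, 277, 284, 289, 290, 313, 477, 491,
]

def percentile_of_value(values, x):
    """Percentage of the values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

p_48 = percentile_of_value(smokers_sorted, 48)
p_234 = percentile_of_value(smokers_sorted, 234)

print(p_48, p_234)
```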

Finding the kth percentile Pk (flowchart):

1. START
2. SORT DATA
3. Compute L = (k/100)*n, where n = number of values and k = percentile
4. Is L a whole number?
– Yes: take the average of the Lth and (L+1)st values as Pk
– No: ROUND UP; Pk is the Lth value
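The flowchart can be written as a small Python function. This is a sketch with illustrative names; it assumes an integer percentile k with 0 < k < 100 and keeps the locator arithmetic in integers to avoid floating-point rounding when testing for a whole number.

```python
def percentile_value(values, k):
    """Value of the kth percentile using the locator L = (k/100)*n."""
    ordered = sorted(values)                 # SORT DATA
    n = len(ordered)
    whole, remainder = divmod(k * n, 100)    # L = k*n/100 as quotient and remainder
    if remainder == 0:
        # L is a whole number: average the Lth and (L+1)st values (1-based positions).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not whole: round up, so Pk is the value at 1-based position whole + 1.
    return ordered[whole]

class_grades = [83, 98, 57, 73, 78, 86, 82, 95, 92, 66,
                69, 71, 64, 85, 89, 84, 83, 92, 74, 78]

p21 = percentile_value(class_grades, 21)
p80 = percentile_value(class_grades, 80)
q2 = percentile_value(class_grades, 50)      # Q2 = P50 = median

print(p21, p80, q2)
```

Note that statistical software often uses slightly different interpolation rules, so library percentile functions may return somewhat different values.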


17

Exploratory Data Analysis

Exploratory data analysis is the process of using statistical tools (graphs, measures of center and variation) to investigate data sets in order to understand their characteristics.

• Box plots have less information than histograms and stem-and-leaf plots

• Not that often used with only one set of data

• Good when comparing many different sets of data

Outlier: Extreme value (often a typo made when collecting data, but not always).

• can have a dramatic effect on mean

• can have a dramatic effect on standard deviation

• … and on histogram

[Figure: box plot marking Min, Q1, Median, Q3 and Max on an axis from 0 to 500]

18

Probability - Chapter 3

False positives and negatives

• False positive: test incorrectly indicates woman pregnant when she is not.

• False negative: test incorrectly indicates woman is not pregnant when she is pregnant.

• True positive: test correctly indicates woman pregnant when she is.

• True negative: test correctly indicates woman not pregnant when she is not.

• Test sensitivity: the probability of a true positive.

• Test specificity: the probability of a true negative.

• Ex: Abbot test pack indicates that their urine test has a 0.2% false positive and a 0.6% false negative rate.

Pregnancy test results

                       Positive test result        Negative test result
                       (test indicates pregnant)   (test indicates not pregnant)
Subject pregnant       80                          5
Subject not pregnant   3                           11
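Sensitivity and specificity can be estimated from the table above. The sketch below assumes the usual conditional reading: sensitivity is the share of pregnant subjects who test positive, and specificity is the share of non-pregnant subjects who test negative. Variable names are illustrative.

```python
# Counts from the pregnancy test table.
true_pos, false_neg = 80, 5    # subject pregnant: positive / negative result
false_pos, true_neg = 3, 11    # subject not pregnant: positive / negative result

sensitivity = true_pos / (true_pos + false_neg)   # positive result given pregnant
specificity = true_neg / (true_neg + false_pos)   # negative result given not pregnant

print(round(sensitivity, 3), round(specificity, 3))
```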

19

Overview

• Rare event rule: If under a given assumption (lottery is fair) the probability of a particular observed event (5 consecutive lottery wins by the same person) is extremely small, the assumption is probably not correct.

20

Fundamentals

Definitions:

• Procedure: an action whose outcome (result) is random, e.g. rolling a die, rolling 2 dice, tossing a coin, …

• Event: Any collection of outcomes of a procedure.

• Simple events: an event that cannot be simplified even further.

• Sample space of a procedure: The set of all simple events.

Examples:

• Procedure: rolling a die, 2 dice

• Event: For 1 die, any of 1, 2, 3, 4, 5, 6, “even”, “greater than 3”.

• For 2 dice: “sum is 7”, “sum is bigger than 10”, “1-1”, “1-2”, “2-1”, “both even”.

• Simple events: for 1 die:1, 2, 3,4, 5, 6. For 2 dice: 1-1, 1-2,1-3,1-4,1-5,1-6, 2-1, 2-2, 2-3, 2-4, 2-5, 2-6, 3-1, …, 6-6

• Sample space of a procedure: The set of all simple events.

Notation:

• P probability

• A, B, C specific events

• P(A) the probability of the event A occurring

21

Defining a probability

• Relative Frequency Approach: Observe a procedure a large number of times and count the number of times that event A occurs, then P(A) is estimated by

$P(A) = \frac{\text{number of times A occurs}}{\text{number of trials}}$

Example:

• A tack falls up: repeat the experiment 1000 times and count how many times the tack falls up, then P(A) is the ratio of the number of times it falls up over the number of times the tack was thrown.

• Classical Approach: If a procedure has n simple (different) events that can occur that are equally likely, and there are s different ways that A can occur then

$P(A) = \frac{\text{number of ways A can occur}}{\text{number of simple events}} = \frac{s}{n}$

Example:

• Rolling a die: assuming the die is not loaded, each face has the same chance of landing face up, so

$P(\text{even}) = \frac{\text{\# of ways face even}}{\text{total \# of options}} = \frac{3}{6}$

• Subjective Probability: P(A), the probability of the event A, is estimated based on knowledge of relevant circumstances.

Example:

• Weather forecast: need to be expert to estimate wisely if it will rain tomorrow or not.

22

More examples

• Flying on a commercial plane. Find the probability that a randomly selected adult has flown on a plane.

• 2 events: flown, or not.

• events not equally likely (cannot use classical approach)

• use relative frequency approach. Gallup poll: of 815 randomly selected adults, 710 indicated they have flown

P(flew on commercial plane) = 710/815 = 0.871

• Roulette: Bet on number 13 on a roulette game. What is the probability that you will lose?

• 38 slots, all equally likely, use classical approach. 37 result in loss.

P(loss) = 37/38

• Meteorites: What is the probability that your house will be hit by a meteorite?

• In absence of historical data, need 3rd approach. We know the chance is very small, say 0.000,000,001. This is a subjective estimate, a general ballpark.

23

Law of large numbers

Law of large numbers: As a procedure is repeated again and again, the relative frequency probability of an event tends to approach the actual probability.

• Example: Death penalty. In a Gallup poll, adults are randomly selected and asked if they are in favor or against the death penalty. The responses include 319 who are for it, 133 who are against it, and 39 that have no opinion. Based on these results, estimate the probability that a randomly selected person is in favor of the death penalty.

• 319 for
• 133 against
• 39 no opinion
• 491 total

P(for) = 319/491 = 0.65

• Example: 2 boys, 1 girl. What is the probability that when a couple has 3 children, exactly 2 out of the 3 are boys?

• Assuming that having boys or girls is equally likely, use classical approach.

• Options are:
– boy-boy-boy
– boy-boy-girl
– boy-girl-boy
– boy-girl-girl
– girl-boy-boy
– girl-boy-girl
– girl-girl-boy
– girl-girl-girl

• 8 possible outcomes, 3 correspond to exactly 2 boys

P(exactly 2 boys) = s/n = 3/8 = 0.375
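The classical approach for the three-children example can be checked by listing all outcomes in Python. As on the slide, the sketch assumes that boys and girls are equally likely; names are illustrative.

```python
from itertools import product

# All equally likely birth orders for 3 children.
outcomes = list(product(["boy", "girl"], repeat=3))

# Outcomes with exactly 2 boys among the 3 children.
exactly_two_boys = [o for o in outcomes if o.count("boy") == 2]

p_two_boys = len(exactly_two_boys) / len(outcomes)   # s / n

print(len(outcomes), len(exactly_two_boys), p_two_boys)
```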

24

Complementary probabilities and properties

• Thanksgiving day. What is the probability that Thanksgiving day falls on a

a) Wednesday?

b) Thursday?

– Thanksgiving is always on a Thursday!

a) Impossible: P(Thxgiv. Wed)=0

b) Always true: P(Thxgiv. Thu)=1

Examples:

• If X denotes the number on the face a die shows when it lands, then

– P( X = 7 ) = 0

– P( X ≤ 7 ) = 1

– P( X not even ) = 1- P( X even )

– P( { X ≤ 2 }^c ) = 1 - P( { X ≤ 2 } ) = 1 - 2/6 = 4/6 = 2/3 = P( X > 2 )

– P( X ≥0 ) = 1

– For any event A, P(A)≥0

– P(A)=0 only if A cannot happen

– For any event A, P(A ) ≤ 1

– P(A)=1 only if A happens for sure

• If Y denotes the sum of the numbers on the faces when throwing 2 dice:

– P( Y = 1) =0

– P( 2 ≤ Y ≤ 12 ) =1

– P(Y=4) = 3/36 namely 1-3, 2-2, and 3-1

– P( {Y=2}^c ) = 1 - P(Y=2) = 1 - 1/36 = 35/36
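The probabilities for the sum of two dice can be verified by enumerating the 36 equally likely outcomes. The sketch uses exact fractions; the helper name `prob` is illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (die 1, die 2).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Classical probability: favorable outcomes over all outcomes."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

p_sum_is_1 = prob(lambda o: sum(o) == 1)          # impossible event
p_sum_2_to_12 = prob(lambda o: 2 <= sum(o) <= 12) # certain event
p_sum_is_4 = prob(lambda o: sum(o) == 4)          # 1-3, 2-2, 3-1
p_sum_not_2 = prob(lambda o: sum(o) != 2)         # complement of {Y=2}

print(p_sum_is_1, p_sum_2_to_12, p_sum_is_4, p_sum_not_2)
```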

The probability of the impossible event is 0: P(ø) = 0.

The probability of the certain event is 1: P(Ω) = 1, where Ω denotes the whole sample space.

For any event A, 0 ≤ P(A) ≤ 1.

If A^c denotes the complement event to A, then

P(A) + P(A^c) = 1

HW: p.120 #1-7

25

Addition Rule

A compound event is an event combining 2 or more simple events.

Events A and B are disjoint (or mutually exclusive) if they cannot both occur together.

In such a case, the intersection of the events is empty: A ∩ B = ø and we recall that P(ø) = 0. We then have

P(A ∪ B) = P(A) + P(B)

Notation:

P(A ∩ B): intersection of A and B (both A and B occur)

P(A ∪ B): union of A and B (either A or B or both occur)

Addition Rule:

P(A ∪ B) = P(A) + P(B) – P(A ∩ B)

Mendel: hybridization experiments. Peas with purple (p) and white (w) flowers, green (g) and yellow (y) pods.

8 p, 6 w; 9 g, 5 y

P(g ∪ p) = P(g) + P(p) – P(g ∩ p) = 9/14 + 8/14 – 5/14 = 12/14

Idea: count data only once!

Venn diagrams

[Figure: Venn diagrams of overlapping events, where P(A ∪ B) = P(A) + P(B) – P(A ∩ B), and of non-overlapping (disjoint) events]

26

Examples: addition rule

Clinical trials of pregnancy test:

Assuming that 1 person is selected at random from the 99 people in the test, find the probability of selecting a subject who is pregnant or had a positive test result.

P(pregnant) = (80 + 5)/99

P(test positive) = (80 +3 ) / 99

P(pregnant and test positive) = 80 / 99

P(pregnant or test positive) =

P(pregnant) + P(test positive)

- P(pregnant and test positive)

= 85/99 + 83/99 - 80/99

= 88/99 = 8/9 = 0.889

Alternatively

P(pregnant or positive) = P(pregnant and positive) + P(pregnant and negative) + P(not pregnant but positive) = 80/99 + 5/99 + 3/99 = 88/99

Note that
Pregnant = (pregnant & pos) + (preg. & neg)
Positive = (pregnant & pos) + (pos. & not preg.)

Subtract to avoid double counting!

Pregnancy test results

                       Positive test result        Negative test result
                       (test indicates pregnant)   (test indicates not pregnant)
Subject pregnant       80                          5
Subject not pregnant   3                           11
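Both routes to P(pregnant or positive) can be checked with exact fractions in Python. The dictionary layout and names are illustrative; the counts are those of the table.

```python
from fractions import Fraction

# Counts from the table, keyed by (subject status, test result).
counts = {
    ("pregnant", "positive"): 80,
    ("pregnant", "negative"): 5,
    ("not pregnant", "positive"): 3,
    ("not pregnant", "negative"): 11,
}
total = sum(counts.values())

def prob(condition):
    """Probability that a randomly selected subject satisfies the condition."""
    return Fraction(sum(c for key, c in counts.items() if condition(key)), total)

p_pregnant = prob(lambda k: k[0] == "pregnant")
p_positive = prob(lambda k: k[1] == "positive")
p_both = prob(lambda k: k == ("pregnant", "positive"))

# Addition rule: subtract the overlap so it is not counted twice.
p_either_rule = p_pregnant + p_positive - p_both
# Direct count of every cell that is pregnant or positive.
p_either_direct = prob(lambda k: k[0] == "pregnant" or k[1] == "positive")

print(p_either_rule, float(p_either_rule))
```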

27

Multiplication rule

• P( A and B ) = P( A ∩ B )

Example:

Answer at random

1. True/false: A pound of feathers is heavier than a pound of gold.

2. Which has affected society most:
a) Remote control
b) Sneakers with high heels
c) Hostess twinkies
d) Computers
e) Phone

• To answer at random q. 1, each choice has probability 1/2.

• To answer at random q. 2, each choice has probability 1/5.

• P(both answers correct)

= P( T and (d) )

= P(T) P(d) = 1/2 * 1/5 = 1/10

[Tree diagram: first branch T or F; from each of these, branches a, b, c, d, e]

28

Multiplication rule: independent event

If events A and B are independent, then

P( A ∩ B ) = P(A) P(B)

Example:

Throwing 2 dice. What is the probability that the first number is even and the second one is larger than 4?

Outcomes (die #1 - die #2):

1-1  1-2  1-3  1-4  1-5  1-6
2-1  2-2  2-3  2-4  2-5  2-6
3-1  3-2  3-3  3-4  3-5  3-6
4-1  4-2  4-3  4-4  4-5  4-6
5-1  5-2  5-3  5-4  5-5  5-6
6-1  6-2  6-3  6-4  6-5  6-6

Answer: Independent? YES!

A: 1st die even

B: second die larger than 4

P(A ) = 3/6 = 1/2

P(B) = P(“face shows 5 or 6”)

= 2 / 6 = 1/3

P( A ∩ B ) = P(A) P(B)

= 1/2 * 1/3 = 1/6

From the table there are 6 options that are good: 2-5, 2-6, 4-5, 4-6, 6-5, 6-6:

P( A ∩ B ) = 6/36 = 1/6
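The agreement between the multiplication rule and the direct count can be verified by enumerating the 36 outcomes. This is a sketch with illustrative names, using exact fractions.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (die #1, die #2).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Classical probability: favorable outcomes over all outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def first_even(o):
    return o[0] % 2 == 0      # event A: 1st die even

def second_gt_4(o):
    return o[1] > 4           # event B: second die larger than 4

p_a = prob(first_even)
p_b = prob(second_gt_4)
p_a_and_b = prob(lambda o: first_even(o) and second_gt_4(o))

print(p_a, p_b, p_a_and_b, p_a * p_b)
```

The direct count equals the product P(A) P(B), as expected for independent events.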