Descriptive Statistics: Numerical Measures Measures of Shape of a Distribution, Relative Location...

Post on 19-Dec-2015


Descriptive Statistics: Numerical Measures

Measures of Shape of a Distribution, Relative Location, and Outliers

Measures of Association between Two Variables

Weighted Mean and Grouped Data

Shape of a Distribution, Relative Location, and Outliers

Shape of a Distribution

z-Scores (Standardized Values)

Chebyshev’s Theorem

Empirical Rule

Detecting Outliers

Distribution Shape: Skewness

An important measure of the shape of a distribution is called skewness.

The formula for computing the skewness of a data set is somewhat complex.

$$S = \frac{E\left[(X - \mu)^3\right]}{\sigma_x^3}$$

Skewness (S)

A measure of the asymmetry of a probability distribution:

S = 0: the distribution is symmetrical.

S > 0: the distribution is right (positively) skewed.

S < 0: the distribution is left (negatively) skewed.
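As a rough illustration of the skewness measure, the sketch below computes the third central moment divided by the cube of the standard deviation, using population moments. The function name `skewness` and both data sets are made up for this example. Spreadsheet and statistics packages often apply a sample-size adjustment, so their output may differ slightly from this moment-based version.

```python
from statistics import fmean, median, pstdev


def skewness(values):
    """Moment-based skewness: E[(X - mu)^3] / sigma^3 (population moments)."""
    mu = fmean(values)
    sigma = pstdev(values)
    third_moment = fmean([(x - mu) ** 3 for x in values])
    return third_moment / sigma ** 3


# Hypothetical data sets, chosen only to show the sign of S.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric))      # zero for symmetric data
print(skewness(right_skewed))   # positive for a long right tail
print(fmean(right_skewed), median(right_skewed))  # mean exceeds median
```

In the right-skewed list the single large value pulls the mean above the median, which is the pattern described for positively skewed distributions.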


Distribution Shape: Skewness

Symmetric (not skewed)
• Skewness is zero.
• Mean and median are equal.

[Figure: relative frequency histogram, vertical axis from 0 to .35; Skewness = 0]

Distribution Shape: Skewness

Moderately Skewed Left
• Skewness is negative.
• Mean will usually be less than the median.

[Figure: relative frequency histogram, vertical axis from 0 to .35; Skewness = -.31]

Distribution Shape: Skewness

Highly Skewed Right
• Skewness is positive (often above 1.0).
• Mean will usually be more than the median.

[Figure: relative frequency histogram, vertical axis from 0 to .35; Skewness = 1.25]

z-Score (Standardized Value)

The z-score is a measure of the relative location of an observation in a data set.

$$z_i = \frac{x_i - \bar{x}}{s}$$

The z-score denotes the number of standard deviations a given data value x_i lies from the mean.

For this reason, the z-score is also called a standardized value.

A data value less than the sample mean will always have a z-score less than zero.

A data value greater than the sample mean will always have a z-score greater than zero.

A data value equal to the sample mean will always have a z-score of zero.
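The three sign rules above can be checked with a short sketch. The helper `z_scores` and the five-value sample are hypothetical; the sample standard deviation (n - 1 denominator) is assumed, to match the s in the z-score definition.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return (x_i - x_bar) / s for each value, with s the sample std. dev."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(x - x_bar) / s for x in values]


# Hypothetical sample: mean is 44 and sample standard deviation is 8.
sample = [46, 54, 42, 46, 32]
print(z_scores(sample))  # [0.25, 1.25, -0.25, 0.25, -1.5]
```

Values below the mean of 44 get negative z-scores and values above it get positive ones. A value equal to 44 would get exactly zero.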


Chebyshev’s Theorem

A theorem that gives the minimum proportion of observations in any data set that lie within a given number of standard deviations of the mean.

The theorem states that, for any data set:

• At least 75% of the data values must be within z = 2 standard deviations of the mean.

• At least 89% of the data values must be within z = 3 standard deviations of the mean.

• At least 94% of the data values must be within z = 4 standard deviations of the mean.
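The percentages quoted for z = 2, 3, and 4 follow from the general Chebyshev bound 1 - 1/z^2, which holds for any z greater than 1. A minimal sketch, with a function name chosen only for this example:

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of values within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("the bound is informative only for z > 1")
    return 1 - 1 / z ** 2


for z in (2, 3, 4):
    # Prints 75%, 89%, and 94% respectively.
    print(z, f"{chebyshev_lower_bound(z):.0%}")
```

Because the bound makes no assumption about the shape of the distribution, it is usually much looser than what is observed in bell-shaped data.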

Empirical Rule

For data with a bell-shaped distribution:

• 68.26% of the values of a normal random variable are within +/- 1 standard deviation of its mean.

• 95.44% of the values of a normal random variable are within +/- 2 standard deviations of its mean.

• 99.72% of the values of a normal random variable are within +/- 3 standard deviations of its mean.

[Figure: bell-shaped curve over x, with the horizontal axis marked at μ - 3σ, μ - 2σ, μ - 1σ, μ, μ + 1σ, μ + 2σ, and μ + 3σ, and the areas 68.26%, 95.44%, and 99.72% indicated]
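The empirical-rule percentages can be reproduced from the standard normal distribution. The sketch below uses the identity P(|Z| < k) = erf(k / sqrt(2)); the function name is an assumption for this example. The computed values (about 68.27%, 95.45%, and 99.73%) agree with the slide figures to within a few hundredths of a percentage point, a gap presumably due to rounding in printed normal tables.

```python
import math


def normal_within(k):
    """P(|Z| < k) for a standard normal random variable Z."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, f"{normal_within(k):.4%}")
```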

Z-Scores allow us to Detect Outliers

An outlier is an unusually small or unusually large value in a data set.

A data value with a z-score less than -3 or greater than +3 might be considered an outlier.

It might be the result of:
• an incorrect recording
• an incorrectly included value in the data set
• a correctly recorded data value, but an unusual occurrence
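A minimal sketch of the z-score screening rule, assuming the sample standard deviation and a cutoff of 3. The function `find_outliers` and the 20 readings are hypothetical. Note that in very small samples no value can reach |z| > 3 (with n = 10 the largest possible z-score is about 2.85), so the rule is only useful once the sample has more than about ten observations.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation
    return [x for x in values if abs((x - x_bar) / s) > threshold]


# Hypothetical readings: nineteen values near 50 and one suspicious value.
readings = [50, 52, 48, 51, 49, 50, 53, 47, 50, 51,
            49, 52, 48, 50, 51, 49, 50, 52, 48, 120]

print(find_outliers(readings))  # [120]
```

A flagged value is only a candidate for review. Whether to correct, remove, or keep it depends on which of the three causes listed above applies.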

Exploratory Data Analysis

Five-Number Summary and Box Plot

Five-Number Summary

1. Smallest Value
2. First Quartile
3. Median
4. Third Quartile
5. Largest Value

Box Plot

A box is drawn with its ends located at the first and third quartiles.

A vertical line is drawn in the box at the location of the median (second quartile).

[Figure: box plot on a horizontal axis running from 375 to 625, with Q1 = 445, Q2 = 475, and Q3 = 525]
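The five-number summary can be sketched in a few lines. The example assumes the box plot refers to the sample of 70 monthly apartment rents listed later in this transcript, since those values reproduce the quartiles quoted for the plot. The helper name `five_number_summary` is made up. Quartile conventions differ between textbooks and software, but here the values on either side of each quartile position are equal, so the choice of method should not change the result.

```python
from statistics import quantiles

# Monthly rents ($) for 70 efficiency apartments, from the grouped-data section.
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]


def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest)."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4)  # default 'exclusive' method
    return ordered[0], q1, q2, q3, ordered[-1]


print(five_number_summary(rents))  # (425, 445.0, 475.0, 525.0, 615)
```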

Measures of Association between Two Variables

Covariance

Correlation Coefficient

Covariance

The covariance between two random variables (X and Y) is computed as follows:

For samples:

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

For populations:

$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$

Covariance

The covariance is a measure of the direction of movement and linear association between two variables.

Positive values indicate a positive relationship.

Negative values indicate a negative relationship.
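The sample covariance can be sketched directly from its definition: the sum of products of deviations from the two means, divided by n - 1. The function name and the paired data below are hypothetical and serve only to show how the sign reflects the direction of the relationship.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum((x - x_bar) * (y - y_bar)) / (n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]     # tends to rise with x
y_down = [5, 4, 5, 4, 2]   # tends to fall as x rises

print(sample_covariance(x, y_up))    # 1.5
print(sample_covariance(x, y_down))  # -1.5
```

The magnitude of the covariance depends on the units of both variables, which is one reason the correlation coefficient is often preferred for judging strength.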

Correlation Coefficient

Correlation is a measure of the degree of linear association between two variables.

However, it does not indicate causation. That is, just because two variables are highly correlated, it does not mean that one variable is the cause of the other.

The correlation coefficient is computed as follows:

For samples:

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

For populations:

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

Correlation Coefficient

The coefficient can take on values between -1 and +1.

Values near -1 indicate a strong negative linear relationship.

Values near +1 indicate a strong positive linear relationship.
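A short sketch of the sample correlation coefficient, computed as the sample covariance divided by the product of the two sample standard deviations. The function name and the data are hypothetical.

```python
from statistics import mean, stdev


def sample_correlation(xs, ys):
    """Sample correlation: r_xy = s_xy / (s_x * s_y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    return s_xy / (stdev(xs) * stdev(ys))


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_linear = [2 * v + 1 for v in x]  # exact positive linear relationship

print(round(sample_correlation(x, y), 4))  # 0.7746
print(sample_correlation(x, y_linear))     # 1, up to floating-point rounding
```

Unlike the covariance, the result has no units and always falls between -1 and +1, so values from different data sets can be compared.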

Weighted Mean

Mean, Variance, and Standard Deviation of Grouped Data

You are taking five courses. The following table shows the credit hours associated with each course and your grades. Compute your GPA for the semester.

Course      Credit Hours  Grade
Calculus    4             B
Psychology  3             A
Marketing   3             C
Economics   3             D
Stat        2             A

Weighted Mean

$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$

where: x_i = value of observation i

w_i = weight for observation i

Course      Credit Hours (w_i)  Grade (x_i)  w_i x_i
Calculus     4                  B (3)        12
Psychology   3                  A (4)        12
Marketing    3                  C (2)         6
Economics    3                  D (1)         3
Stat         2                  A (4)         8
Total       15                               41

GPA = 41/15 ≈ 2.73

Weighted Mean

When the mean is computed by giving each data value a weight that reflects its importance, it is referred to as a weighted mean.

When data values vary in importance, the analyst must choose the weight that best reflects the importance of each value.
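The GPA example can be worked as a weighted mean with credit hours as the weights. The sketch below uses the grade-point scale shown with the course table (A = 4, B = 3, C = 2, D = 1); the function and variable names are made up. The credit hours 4 + 3 + 3 + 3 + 2 add up to 15, so the result is 41/15.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


grade_points = {"A": 4, "B": 3, "C": 2, "D": 1}

# (course, credit hours, letter grade) from the GPA example.
courses = [
    ("Calculus", 4, "B"),
    ("Psychology", 3, "A"),
    ("Marketing", 3, "C"),
    ("Economics", 3, "D"),
    ("Stat", 2, "A"),
]

credit_hours = [hours for _, hours, _ in courses]
points = [grade_points[grade] for _, _, grade in courses]

gpa = weighted_mean(points, credit_hours)
print(round(gpa, 2))  # 2.73
```

An unweighted mean of the five grades would be 14/5 = 2.8. The weighted value is lower because the D was earned in a 3-credit course while one of the A grades was earned in a 2-credit course.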

Working with Grouped Data

Given below is a sample of monthly rents for 70 efficiency apartments.

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Grouped Data

Rent ($)  Frequency
420-439    8
440-459   17
460-479   12
480-499    8
500-519    7
520-539    4
540-559    2
560-579    4
580-599    2
600-619    6

Computing the Mean and Variance of Grouped Data

To compute the weighted mean from grouped data, we treat the midpoint of each class as though it were the mean of all items in the class.

We compute a weighted mean of the data using class midpoints and class frequencies as weights.

Similarly, in computing the variance and standard deviation, the class frequencies are used as weights.

Mean for Grouped Data

Sample Data

$$\bar{x} = \frac{\sum f_i M_i}{n}$$

Population Data

$$\mu = \frac{\sum f_i M_i}{N}$$

where: f_i = frequency of class i, M_i = midpoint of class i

Sample Mean for Grouped Data

Given below is a sample of monthly rents for 70 efficiency apartments as grouped data, in the form of a frequency distribution.

Rent ($)   f_i    M_i     f_i M_i
420-439      8   429.5     3436.0
440-459     17   449.5     7641.5
460-479     12   469.5     5634.0
480-499      8   489.5     3916.0
500-519      7   509.5     3566.5
520-539      4   529.5     2118.0
540-559      2   549.5     1099.0
560-579      4   569.5     2278.0
580-599      2   589.5     1179.0
600-619      6   609.5     3657.0
Total       70            34525.0

$$\bar{x} = \frac{34{,}525}{70} = 493.21$$

This approximation differs by $2.41 from the actual sample mean of $490.80.
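The grouped-data mean can be sketched as a weighted mean of class midpoints, with the class frequencies as weights. The class limits and frequencies are taken from the rent table; the variable names are made up for this example.

```python
# (lower limit, upper limit, frequency) for each rent class.
classes = [
    (420, 439, 8), (440, 459, 17), (460, 479, 12), (480, 499, 8),
    (500, 519, 7), (520, 539, 4), (540, 559, 2), (560, 579, 4),
    (580, 599, 2), (600, 619, 6),
]

midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
frequencies = [f for _, _, f in classes]

n = sum(frequencies)                                       # 70
total = sum(f * m for f, m in zip(frequencies, midpoints)) # 34525.0
grouped_mean = total / n

print(n, total, round(grouped_mean, 2))  # 70 34525.0 493.21
```

The result is an approximation because every rent in a class is treated as if it equalled the class midpoint.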

Variance for Grouped Data

For sample data

$$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$$

For population data

$$\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$$

Sample Variance for Grouped Data

Rent ($)   f_i    M_i    M_i - x̄   (M_i - x̄)^2   f_i(M_i - x̄)^2
420-439      8   429.5    -63.7      4058.96        32471.71
440-459     17   449.5    -43.7      1910.56        32479.59
460-479     12   469.5    -23.7       562.16         6745.97
480-499      8   489.5     -3.7        13.76          110.11
500-519      7   509.5     16.3       265.36         1857.55
520-539      4   529.5     36.3      1316.96         5267.86
540-559      2   549.5     56.3      3168.56         6337.13
560-579      4   569.5     76.3      5820.16        23280.66
580-599      2   589.5     96.3      9271.76        18543.53
600-619      6   609.5    116.3     13523.36        81140.18
Total       70                                     208234.29

$$\bar{x} = \frac{34{,}525}{70} = 493.21$$

Sample Variance

s^2 = 208,234.29/(70 - 1) = 3,017.89

Sample Standard Deviation

$$s = \sqrt{3{,}017.89} = 54.94$$

This approximation differs by only $.20 from the actual standard deviation of $54.74.
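The grouped-data variance and standard deviation can be sketched the same way, weighting each squared deviation of a class midpoint by its class frequency. The variable names are made up for this example. The sketch keeps the mean unrounded, whereas the table rounds it to 493.21, so individual terms may differ slightly in the last digit while the final results agree to two decimals.

```python
import math

# Class midpoints and frequencies from the rent frequency distribution.
midpoints = [429.5, 449.5, 469.5, 489.5, 509.5,
             529.5, 549.5, 569.5, 589.5, 609.5]
frequencies = [8, 17, 12, 8, 7, 4, 2, 4, 2, 6]

n = sum(frequencies)
x_bar = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Sum of f_i * (M_i - x_bar)^2 over all classes.
weighted_squares = sum(f * (m - x_bar) ** 2
                       for f, m in zip(frequencies, midpoints))

sample_variance = weighted_squares / (n - 1)
sample_std_dev = math.sqrt(sample_variance)

print(round(sample_variance, 2))  # 3017.89
print(round(sample_std_dev, 2))   # 54.94
```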