Descriptive Statistics: Numerical Measures Measures of Shape of a Distribution, Relative Location...

Descriptive Statistics: Numerical Measures Measures of Shape of a Distribution,

Relative Location and Outliers

Measures of Association between Two Variables

Weighted Mean and Grouped Data

Shape of A Distribution Depends On the Relative Location, and Outliers

Shape of a Distribution

z-Scores (Standardized Values)

Chebyshev’s Theorem Empirical Rule

Detecting Outliers

Distribution Shape: Skewness

An important measure of the shape of a distribution is called skewness.

The formula for computing the skewness of a data set is somewhat complex.

Skeweness (S)

Is a measure of the asymmetry of a probability distribution

S=0: Symmetrical S>0: the distribution is right (positively)

skewed S<0: the distribution is left (negatively)

skewed

Symmetric (not skewed)• Skewness is zero.• Mean and median are equal.

tive F

.05.05

.10.10

.15.15

.20.20

.25.25

.30.30

.35.35

Skewness = 0

Distribution Shape: Skewness Moderately Skewed Left

Skewness is negative. Mean will usually be less than the median.

tive F

.05.05

.10.10

.15.15

.20.20

.25.25

.30.30

.35.35

Skewness = - .31

Highly Skewed Right• Skewness is positive (often above 1.0).• Mean will usually be more than the median.

tive F

.05.05

.10.10

.15.15

.20.20

.25.25

.30.30

.35.35

Skewness = 1.25

Is a measure of the relative location of the observation in a data set.

z-Score (Standardized Value)

zx xsii

Z-Score denotes the number of standard units (deviations) a given data value xi is located from the mean.

As a result, z-score is also called a standardized value.

A data value less than the sample mean will always have a z-score less than zero.

A data value greater than the sample mean will always have a z-score greater than zero.

A data value equal to the sample mean will always have a z-score of zero.

z-Score (Standardized Value)

zx xsii

Chebyshev’s Theorem

A theorem that shows the position of a certain proportion of observation in any data set relative to the mean of the data when the values are standardized.

At least of the data values must be

within of the mean.

75%75%

z = 2 standard deviations z = 2 standard deviations

within of the mean.

89%89%

within of the mean.

94%94%

The theorem states that, for any data set---

For a data with a bell-shaped distribution:

of the values of a normal random variable are within of its mean.

of the values of a normal random variable are within of its mean.68.26%68.26%

+/- 1 standard deviation+/- 1 standard deviation

+/- 2 standard deviations+/- 2 standard deviations

+/- 3 standard deviations+/- 3 standard deviations

Empirical Rule

xm – 3s m – 1s

m – 2sm + 1s

m + 2sm + 3sm

68.26%

95.44%99.72%

Z-Scores allow us to Detect Outliers

An outlier is an unusually small or unusually large value in a data set.

A data value with a z-score less than -3 or greater than +3 might be considered an outlier.

It might be the result of:• an incorrect recording• an incorrectly included value in the data set• a correctly recorded data value but an unusual

occurrence

Exploratory Data Analysis

Five-Number Summary and Box Plot

Five-Number Summary

1 Smallest Value

First Quartile

Median

Third Quartile

Largest Value

375375

400400

425425

450450

475475

500500

525525

550550

575575

600600

625625

A box is drawn with its ends located at the first and third quartiles.

Box Plot

A vertical line is drawn in the box at the location of the median (second quartile).

Q1 = 445 Q3 = 525Q2 = 475

Measures of Association between Two Variables

CovarianceCorrelation Coefficient

Covariance

Covariance between two random variables ( X and Y) is computed as follows:

forsamples

forpopulations

sx x y ynxy

( )( )

xyi x i yx y

( )( )

Covariance

Positive values indicate a positive relationship. Positive values indicate a positive relationship.

Negative values indicate a negative relationship. Negative values indicate a negative relationship.

The covariance is a measure of the direction of movement and linear association between two variables. The covariance is a measure of the direction of movement and linear association between two variables.

Correlation Coefficient

However, it doesn’t indicate the causation. That is, just because two variables are highly correlated, it does not mean that one variable is the cause of the other.

Correlation is a measure of the degree of linear association between two variables. Correlation is a measure of the degree of linear association between two variables.

The correlation coefficient is computed as follows:

forsamples

forpopulations

s sxyxy

Values near +1 indicate a strong positive linear relationship. Values near +1 indicate a strong positive linear relationship.

Values near -1 indicate a strong negative linear relationship. Values near -1 indicate a strong negative linear relationship.

The coefficient can take on values between -1 and +1.

The correlation coefficient is computed as follows:

forsamples

s sxyxy

Weighted Mean Mean, Variance, and Standard Deviation of

Grouped Data

You are taking five courses. The following table depicts the credit hours associated with each course and your grades. Compute your GPA for the semester?

Courses Credit Hours GradeCalculus 4 BPsychology 3 AMarketing 3 CEconomics 3 DStat 2 A

Weighted Mean

where: xi = value of observation i

wi = weight for observation i

You are taking five courses. The following table depicts the credit hours associated with each course and your grades. Compute your GPA for the semester?

Courses Credit Hours (Wi)

Grade WiX G

Calculus 4 B(3) 12Psychology 3 A(4) 12Marketing 3 C(2) 6Economics 3 D(1) 3Stat 2 A(4) 8

Weighted Mean

where: xi = value of observation i

wi = weight for observation i

Weighted Mean

When the mean is computed by giving each data value a weight that reflects its importance, it is referred to as a weighted mean.

When data values vary in importance, the analyst must choose the weight that best reflects the importance of each value.

Working with Grouped Data

425 430 430 435 435 435 435 435 440 440440 440 440 445 445 445 445 445 450 450450 450 450 450 450 460 460 460 465 465465 470 470 472 475 475 475 480 480 480480 485 490 490 490 500 500 500 500 510510 515 525 525 525 535 549 550 570 570575 575 580 590 600 600 600 600 615 615

Given below is a sample of monthly rents for 70 efficiency apartments.

Grouped Data

Rent ($) Frequency420-439 8440-459 17460-479 12480-499 8500-519 7520-539 4540-559 2560-579 4580-599 2600-619 6

Computing the mean and variance of a grouped data

To compute the weighted mean from a grouped data we treat the midpoint of each class as though it were the mean of all items in the class.

We compute a weighted mean of the data using class midpoints and class frequencies as weights.

Similarly, in computing the variance and standard deviation, the class frequencies are used as weights.

Mean for Grouped Datai if M

where: fi = frequency of class i Mi = midpoint of class i

Sample Data

Population Data

Given below is a sample of monthly rents for 70 efficiency apartments as grouped data--- in the form of a frequency distribution.

Rent ($) Frequency420-439 8440-459 17460-479 12480-499 8500-519 7520-539 4540-559 2560-579 4580-599 2600-619 6

Sample Mean for Grouped Data

This approximationdiffers by $2.41 fromthe actual samplemean of $490.80.

34,525 493.21

Rent ($) f i

420-439 8440-459 17460-479 12480-499 8500-519 7520-539 4540-559 2560-579 4580-599 2600-619 6

Total 70

429.5449.5469.5489.5509.5529.5549.5569.5589.5609.5

f iM i

3436.07641.55634.03916.03566.52118.01099.02278.01179.03657.034525.0

Variance for Grouped Data

sf M xn

Ni i( )

For sample data

For population data

Rent ($) f i

420-439 8440-459 17460-479 12480-499 8500-519 7520-539 4540-559 2560-579 4580-599 2600-619 6

Total 70

429.5449.5469.5489.5509.5529.5549.5569.5589.5609.5

Sample Variance for Grouped Data

M i - x

-63.7-43.7-23.7-3.716.336.356.376.396.3116.3

f i(M i - x )2

32471.7132479.596745.97110.11

1857.555267.866337.13

23280.6618543.5381140.18

208234.29

(M i - x )2

4058.961910.56562.1613.76

265.361316.963168.565820.169271.76

13523.36

34,525 493.21

3,017.89 54.94s

s2 = 208,234.29/(70 – 1) = 3,017.89

This approximation differs by only $.20 from the actual standard deviation of $54.74.

Sample Variance for Grouped Data

Sample Variance

Sample Standard Deviation

Descriptive Statistics: Numerical Measures Measures of Shape of a Distribution, Relative Location...

Documents

Transcript of Descriptive Statistics: Numerical Measures Measures of Shape of a Distribution, Relative Location...

Descriptive Discriminant Analysis for Repeated Measures Data

Descriptive Statistics II: Measures of Dispersion

Descriptive Statistics Measures of Central Tendency.

Descriptive Statistics: Numerical Measures Exploratory Data Analysis

Agenda Descriptive Statistics Measures of Spread - Variability.

Numerical Descriptive Measures Interpreting Correlation ...

Slide 3-2 Chapter 3 Descriptive Measures Section 1 Measures of Center.

Chapter 3 Descriptive Statistics: Numerical Methods Part B n Measures of Relative Location and Detecting Outliers n Exploratory Data Analysis n Measures.

Statistik: Numerical Descriptive Measures

SBE10_03b (ca) Descriptive Stats-Numerical measures

3.4 Measures of Position and Outliers. The Z-Score.

Descriptive Statistics: Numerical Measures Location and Variability

Chapter 3, Part A Descriptive Statistics: Numerical Measures n Measures of Location n Measures of Variability.

Numerical Descriptive Measures

Descriptive Statistics II: Measures of Dispersion.

Detecting Outliers in Data with Correlated Measures

Chapter 3 Descriptive Measures Measures Central Tendency MeanMedianModeDispersionRange Variance & Standard Deviation Measures Central Tendency MeanMedianModeDispersionRange.

1 Chapter 3 – Descriptive Statistics Numerical Measures.

Descriptive Statistics. Measures of Central Tendency Measures of Location Measures of Dispersion Measures of Symmetry Measures of Peakdness.

2.Descriptive Statistics-measures of Central Tendency