Graphical Descriptive Techniques

2.1 Introduction

Descriptive statistics involves the arrangement, summary, and presentation of data to enable meaningful interpretation and to support decision making.

Descriptive statistics methods make use of
• graphical techniques
• numerical descriptive measures.

The methods presented apply to both
• the entire population
• the population sample

3

2.2 Types of data and information

A variable - a characteristic of a population or sample that is of interest to us.
• Cereal choice
• Capital expenditure
• The waiting time for medical services

Data - the actual values of variables
• Interval data are numerical observations
• Nominal data are categorical observations
• Ordinal data are ordered categorical observations

4

Types of data - examples

Interval data

Age    Income
55     75000
42     68000
.      .
.      .

Weight gain
+10
+5
.
.

Nominal data

Person    Marital status
1         married
2         single
3         single
.         .

Computer    Brand
1           IBM
2           Dell
3           IBM
.           .

With nominal data, all we can do is calculate the proportion of data that falls into each category.

              IBM    Dell    Compaq    Other    Total
Count         25     11      8         6        50
Proportion    50%    22%     16%       12%
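The proportion calculation for nominal data can be sketched in a few lines of Python. This is only an illustration: the brand counts are the ones in the computer-brand table, while the function name `proportions` is an assumption introduced here.

```python
# Brand counts from the computer-brand table (50 computers in total).
counts = {"IBM": 25, "Dell": 11, "Compaq": 8, "Other": 6}


def proportions(category_counts):
    """Return the share of observations that falls into each category."""
    total = sum(category_counts.values())
    return {name: count / total for name, count in category_counts.items()}


shares = proportions(counts)
for brand, share in shares.items():
    print(f"{brand}: {share:.0%}")
```

Counting and dividing by the total is the only arithmetic used, which matches the restriction stated for nominal data.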

6

Types of data – analysis

Knowing the type of data is necessary to properly select the technique to be used when analyzing data.

Type of analysis allowed for each type of data
• Interval data - arithmetic calculations
• Nominal data - counting the number of observations in each category
• Ordinal data - computations based on an ordering process

7

Cross-Sectional/Time-Series Data

Cross-sectional data are collected at a certain point in time
• Marketing survey (observe preferences by gender, age)
• Test scores in a statistics course
• Starting salaries of an MBA program's graduates

Time-series data are collected over successive points in time
• Weekly closing price of gold
• Amount of crude oil imported monthly

8

2.3 Graphical Techniques for Interval Data

Example 2.1: Providing information concerning the monthly bills of new subscribers in the first month after signing on with a telephone company.
• Collect data
• Prepare a frequency distribution
• Draw a histogram

Collect data

Bills: 42.19, 38.45, 29.23, 89.35, 118.04, 110.46, 0.00, 72.88, 83.05, ... (There are 200 data points.)

Prepare a frequency distribution

How many classes to use?

Number of observations    Number of classes
Less than 50              5-7
50 - 200                  7-9
200 - 500                 9-10
500 - 1,000               10-11
1,000 - 5,000             11-13
5,000 - 50,000            13-17
More than 50,000          17-20

Class width = [Range] / [# of classes] = [Largest observation - Smallest observation] / [# of classes]

[119.63 - 0] / [8] = 14.95 ≈ 15
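The class-width step of Example 2.1 can be sketched as follows. The smallest and largest observations (0 and 119.63) and the choice of 8 classes come from the example; the function name and the decision to round up with `math.ceil` are assumptions made for this illustration.

```python
import math


def class_width(smallest, largest, n_classes):
    """Range divided by the number of classes, before any rounding."""
    return (largest - smallest) / n_classes


raw_width = class_width(0.0, 119.63, 8)   # values from Example 2.1
width = math.ceil(raw_width)              # round up to a convenient width
upper_limits = [width * k for k in range(1, 9)]

print(round(raw_width, 2), width, upper_limits)
```

Rounding the width up rather than down keeps the largest observation inside the last class.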

Draw a histogram

Bin    Frequency
15     71
30     37
45     13
60     9
75     10
90     18
105    28
120    14

[Figure: histogram of Bills; horizontal axis in steps of 15 from 15 to 120, Frequency axis from 0 to 80.]


What information can we extract from this histogram?

• About half of all the bills are small: 71+37=108
• A few bills are in the middle range: 13+9+10=32
• A relatively large number of bills are large: 18+28+14=60


12

It is often preferable to show the relative frequency (proportion) of observations falling into each class, rather than the frequency itself.

Relative frequencies should be used when
• the population relative frequencies are studied
• comparing two or more histograms
• the number of observations of the samples studied are different

Class relative frequency = Class frequency / Total number of observations
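As a small illustration of the class relative frequency formula, the sketch below converts the class frequencies of the telephone-bill histogram in Example 2.1 into proportions. The dictionary layout (upper class limit mapped to frequency) is an assumption made for this example.

```python
# Upper class limits and frequencies from the histogram of Example 2.1.
frequencies = {15: 71, 30: 37, 45: 13, 60: 9, 75: 10, 90: 18, 105: 28, 120: 14}

total = sum(frequencies.values())

# Class relative frequency = class frequency / total number of observations.
relative = {limit: count / total for limit, count in frequencies.items()}

for limit, share in relative.items():
    print(f"up to {limit}: {share:.3f}")
```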

Relative frequency

13

It is generally best to use equal class widths, but sometimes unequal class widths are called for.

• Unequal class widths are used when the frequency associated with some classes is too low. Then, several classes are combined to form a wider and "more populated" class.
• It is possible to form an open-ended class at the higher end or lower end of the histogram.

Class width

14

There are four typical shape characteristics

Shapes of histograms

15

Positively skewed

Negatively skewed

Shapes of histograms

16

A modal class is the one with the largest number of observations.

A unimodal histogram

The modal class

Modal classes

17

Modal classes

A bimodal histogram

A modal class A modal class

18

• Many statistical techniques require that the population be bell shaped.

• Drawing the histogram helps verify the shape of the population in question

Bell shaped histograms

19

Example 2.2: Selecting an investment

An investor is considering investing in one of two investments. The returns on these investments were recorded. From the two histograms, how can the investor interpret
• the expected returns
• the spread of the returns (the risk involved with each investment)?

Interpreting histograms

20

Example 2.2 - Histograms

[Figure: histograms of Return on investment A and Return on investment B; sample size = 50 for each; horizontal axis from -15 to 75, frequency axis from 0 to 18. Counts annotated on the two panels (A, B): 17 and 16; 34 and 26; 46 and 43.]

Interpretation: The center of the returns of Investment A is slightly lower than that for Investment B.

Interpretation: The spread of returns for Investment A is less than that for Investment B.

Interpretation: Both histograms are slightly positively skewed. There is a possibility of large returns.


23

Example 2.2: Conclusion

It seems that investment A is better, because:
• Its expected return is only slightly below that of investment B.
• The risk from investing in A is smaller.
• The possibility of having a high rate of return exists for both investments.

Providing information

24

Example 2.3: Comparing students' performance

Students' performance in two statistics classes was compared. The two classes differed in their teaching emphasis:
• Class A - mathematical analysis and development of theory.
• Class B - applications and computer-based analysis.

The final mark for each student in each course was recorded.

Draw histograms and interpret the results.


25

[Figure: two histograms, Marks (Manual) and Marks (Computer); horizontal axis from 50 to 100, Frequency axis from 0 to 40.]

The mathematical emphasis creates two groups, and a larger spread.

26

2.5 Describing the Relationship Between Two Variables

We are interested in the relationship between two interval variables.

Example 2.7

A real estate agent wants to study the relationship between house price and house size. Twelve houses recently sold are sampled and their size and price recorded. Use a graphical technique to describe the relationship between size and price.

Size    Price
23      315
24      229
26      335
27      261
...     ...

27

Solution

The size (independent variable, X) affects the price (dependent variable, Y). We use Excel to create a scatter diagram.

[Figure: scatter diagram of price (Y, from 0 to 400) against size (X, from 0 to 40).]

The greater the house size, the greater the price.

28

Typical Patterns of Scatter Diagrams

• Positive linear relationship
• Negative linear relationship
• No relationship
• Negative nonlinear relationship
• Nonlinear (concave) relationship

This is a weak linear relationship. A nonlinear relationship seems to fit the data better.

29

2.6 Describing Time-Series Data

Data can be classified according to the time it is collected.
• Cross-sectional data are all collected at the same time.
• Time-series data are collected at successive points in time.

Time-series data are often depicted on a line chart (a plot of the variable over time).

30

Line Chart

Example 2.9

The total amounts of income tax paid by individuals in 1987 through 1999 are listed below.

Draw a graph of these data and describe the information produced.

31

Line Chart

[Figure: line chart of total income tax paid, 1987 through 1999; vertical axis from 0 to 1,200,000.]

For the first five years, total tax was relatively flat. From 1993 there was a rapid increase in tax revenues.

Line charts can be used to describe nominal data time series.

32

Numerical Descriptive Techniques

33

4.2 Measures of Central Location

Usually, we focus our attention on two types of measures when describing population characteristics:
• Central location (e.g. average)
• Variability or spread

The measure of central location reflects the locations of all the actual data points.

34

How?

With one data point, clearly the central location is at the point itself.

With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).

But if the third data point appears on the left-hand side of the midrange, it should "pull" the central location to the left.

35

Mean = (Sum of the observations) / (Number of observations)

This is the most popular and useful measure of central location

The Arithmetic Mean

36

Sample mean (n is the sample size):

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Population mean (N is the population size):

\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]

The Arithmetic Mean

37

• Example 4.1

The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.

\[ \bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{x_1 + x_2 + \cdots + x_{10}}{10} = \frac{0 + 7 + \cdots + 22}{10} = 11.0 \]

• Example 4.2

Suppose the telephone bills of Example 2.1 represent the population of measurements. The population mean is

\[ \mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{x_1 + x_2 + \cdots + x_{200}}{200} = \frac{42.19 + 38.45 + \cdots + 45.77}{200} = 43.59 \]
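The calculation in Example 4.1 can be checked with a short Python sketch. The data are the ten reported times from the example; the variable names are illustrative only.

```python
from statistics import mean

# Hours on the Internet reported by the 10 adults in Example 4.1.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Sum of the observations divided by the number of observations.
sample_mean = sum(hours) / len(hours)

# The standard library gives the same value.
library_mean = mean(hours)

print(sample_mean, library_mean)
```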

38

The Median

The Median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.

Example 4.3

Find the median of the time on the Internet for the 10 adults of Example 4.1.

Even number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33. The median is the midpoint of the two middle observations, (8 + 9)/2 = 8.5.

Comment

Suppose only 9 adults were sampled (exclude, say, the longest time (33)).

Odd number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22. The median is the middle observation, 8.
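The two cases of Example 4.3 (ten observations, then nine after dropping the longest time) can be reproduced with the standard library. This is a minimal sketch; the variable names are assumptions.

```python
from statistics import median

# Hours on the Internet from Example 4.1.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Even number of observations: midpoint of the two middle values.
median_ten = median(hours)

# Odd number of observations: drop the longest time (33), take the middle value.
nine_adults = sorted(hours)[:-1]
median_nine = median(nine_adults)

print(median_ten, median_nine)
```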

39

The Mode of a set of observations is the value that occurs most frequently. A set of data may have one mode (or modal class), or two or more modes.

The modal class: for large data sets, the modal class is much more relevant than a single-value mode.

The Mode

40

Example 4.5

Find the mode for the data in Example 4.1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22

Solution

All observations except "0" occur once. There are two "0"s. Thus, the mode is zero.

Is this a good measure of central location? The value "0" does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).
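The mode of the Example 4.1 data can be found by counting how often each value occurs. The sketch below uses `statistics.multimode` (available from Python 3.8), which returns every value tied for the highest count; that choice is an assumption of this illustration, made because a data set may have more than one mode.

```python
from collections import Counter
from statistics import multimode

# Hours on the Internet from Example 4.1.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

value_counts = Counter(hours)   # how many times each value occurs
modes = multimode(hours)        # all values with the highest count

print(value_counts[0], modes)
```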

41

Relationship among Mean, Median, and Mode

If a distribution is symmetrical, the mean, median and mode coincide.

If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.

[Figure: a positively skewed distribution ("skewed to the right") with the mode, median and mean marked.]

[Figure: a negatively skewed distribution ("skewed to the left") with the mean, median and mode marked.]

43

This is a measure of the average growth rate.

Let R_i denote the rate of return in period i (i = 1, 2, ..., n). The geometric mean of the returns R_1, R_2, ..., R_n is the constant R_g that produces the same terminal wealth at the end of period n as do the actual returns for the n periods.

The Geometric Mean

44

For the given series of rates of return, the nth period return is calculated by:

\[ (1+R_1)(1+R_2)\cdots(1+R_n) \]

If the rate of return was R_g in every period, the nth period return would be calculated by:

\[ (1+R_g)^n \]

R_g is selected such that

\[ (1+R_g)^n = (1+R_1)(1+R_2)\cdots(1+R_n) \]

so that

\[ R_g = \sqrt[n]{(1+R_1)(1+R_2)\cdots(1+R_n)} - 1 \]
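The geometric mean formula can be sketched in Python as below. The two-period return series (a 100% gain followed by a 50% loss) is a hypothetical illustration, not data from the slides; it is chosen because the terminal wealth equals the starting wealth, so the geometric mean should be zero even though the arithmetic mean is positive.

```python
import math


def geometric_mean_return(returns):
    """Constant per-period rate R_g giving the same terminal wealth."""
    growth = math.prod(1 + r for r in returns)   # (1+R_1)(1+R_2)...(1+R_n)
    return growth ** (1 / len(returns)) - 1      # n-th root, minus 1


# Hypothetical returns: +100% in period 1, -50% in period 2.
example_returns = [1.00, -0.50]

r_g = geometric_mean_return(example_returns)
arithmetic = sum(example_returns) / len(example_returns)

print(r_g, arithmetic)
```

The sketch assumes every return is greater than -100%, so that the product of growth factors stays positive.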

45

4.3 Measures of variability

Measures of central location fail to tell the whole story about the distribution. A question of interest still remains unanswered:

How much are the observations spread out around the mean value?

46


Observe two hypothetical data sets:

Small variability: the average value provides a good representation of the observations in the data set.

Larger variability: the same average value does not provide as good a representation of the observations in the data set as before.

48

The range of a set of observations is the difference between the largest and smallest observations.

Its major advantage is the ease with which it can be computed.

Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.

[Figure: the range, spanning the smallest observation to the largest observation.]

But how do all the observations spread out? The range cannot assist in answering this question.

The range

49

This measure reflects the dispersion of all the observations.

The variance of a population of size N, x_1, x_2, ..., x_N, whose mean is \mu is defined as

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]

The variance of a sample of n observations x_1, x_2, ..., x_n whose mean is \bar{x} is defined as

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]

The Variance

50

Why not use the sum of deviations?

Consider two small populations:

A: 8, 9, 10, 11, 12
B: 4, 7, 10, 13, 16

The mean of both populations is 10, but measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.

Can the sum of deviations be a good measure of dispersion?

A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2; Sum = 0
B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6; Sum = 0

The sum of deviations is zero for both populations; therefore, it is not a good measure of dispersion.

51

Let us calculate the variance of the two populations:

\[ \sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2 \]

\[ \sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18 \]

Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead?

After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!
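The comparison of populations A and B can be reproduced with a few lines of Python. The data are the two small populations above; the helper name `sum_of_deviations` is an assumption for this sketch.

```python
from statistics import pvariance

population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]


def sum_of_deviations(values):
    """Sum of (x_i - mean); this is zero for any data set."""
    centre = sum(values) / len(values)
    return sum(x - centre for x in values)


for name, values in (("A", population_a), ("B", population_b)):
    # pvariance divides the sum of squared deviations by N.
    print(name, sum_of_deviations(values), pvariance(values))
```

The sum of deviations cannot separate the two populations, whereas the population variance ranks B as the more dispersed one.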

The Variance

52

Which data set has a larger dispersion?

Data set A: observations at 1 and 3 (mean 2). Data set B: observations at 1 and 5 (mean 3).

Data set B is more dispersed around the mean.

Let us calculate the sum of squared deviations for both data sets:

\[ \text{Sum}_A = (1-2)^2 + \cdots + (1-2)^2 + (3-2)^2 + \cdots + (3-2)^2 = 10 \]

\[ \text{Sum}_B = (1-3)^2 + (5-3)^2 = 8 \]

Sum_A > Sum_B. This is inconsistent with the observation that set B is more dispersed.

The Variance

54

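However, when calculated on a "per observation" basis (variance), the data set dispersions are properly ranked. Sum_A consists of ten terms, each equal to 1, so data set A has N = 10 observations.

\[ \sigma_A^2 = \frac{\text{Sum}_A}{N} = \frac{10}{10} = 1 \]

\[ \sigma_B^2 = \frac{\text{Sum}_B}{N} = \frac{8}{2} = 4 \]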

The Variance

55

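Example 4.7

The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance.

Solution

\[ \bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14 \text{ jobs} \]

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{1}{6-1}\left[(17-14)^2 + (15-14)^2 + \cdots + (13-14)^2\right] = 33.2 \text{ jobs}^2 \]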
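Example 4.7 can be verified with the standard library. The sketch below computes the sample variance twice: once term by term, following the definition with divisor n - 1, and once with `statistics.variance`, which uses the same divisor.

```python
from statistics import mean, variance

# Number of jobs the six students applied for (Example 4.7).
jobs = [17, 15, 23, 7, 9, 13]

x_bar = mean(jobs)

# Definition: sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(x - x_bar) ** 2 for x in jobs]
s2_by_hand = sum(squared_deviations) / (len(jobs) - 1)

s2_library = variance(jobs)

print(x_bar, s2_by_hand, s2_library)
```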

The Variance

56

The standard deviation of a set of observations is the square root of the variance.

Sample standard deviation:

\[ s = \sqrt{s^2} \]

Population standard deviation:

\[ \sigma = \sqrt{\sigma^2} \]

Standard Deviation

57

Example 4.8

To examine the consistency of shots for a new innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club.

The distances were recorded. Which 7-iron is more consistent?

Standard Deviation

58

Example 4.8 – solution

Standard Deviation

Excel printout, from the “Descriptive Statistics” sub-menu.

Statistic             Current     Innovation
Mean                  150.5467    150.1467
Standard Error        0.668815    0.357011
Median                151         150
Mode                  150         149
Standard Deviation    5.792104    3.091808
Sample Variance       33.54847    9.559279
Kurtosis              0.12674     -0.88542
Skewness              -0.42989    0.177338
Range                 28          12
Minimum               134         144
Maximum               162         156
Sum                   11291       11261
Count                 75          75

The innovation club is more consistent and, because the means are close, is considered a better club.

59

Interpreting Standard Deviation

The standard deviation can be used to
• compare the variability of several distributions
• make a statement about the general shape of a distribution.

The empirical rule: If a sample of observations has a mound-shaped distribution, the interval

• (\bar{x} - s, \bar{x} + s) contains approximately 68% of the measurements
• (\bar{x} - 2s, \bar{x} + 2s) contains approximately 95% of the measurements
• (\bar{x} - 3s, \bar{x} + 3s) contains approximately 99.7% of the measurements

60

Example 4.9

A statistics practitioner wants to describe the way returns on investment are distributed.
• The mean return = 10%
• The standard deviation of the return = 8%
• The histogram is bell shaped.

Example 4.9 - solution

The empirical rule can be applied (bell shaped histogram). Describing the return distribution:
• Approximately 68% of the returns lie between 2% and 18%: [10 - 1(8), 10 + 1(8)]
• Approximately 95% of the returns lie between -6% and 26%: [10 - 2(8), 10 + 2(8)]
• Approximately 99.7% of the returns lie between -14% and 34%: [10 - 3(8), 10 + 3(8)]
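The intervals in the solution to Example 4.9 follow mechanically from the mean and standard deviation, as this sketch shows. The function name is an assumption, and the approximate shares only apply when the histogram is mound shaped, as the empirical rule requires.

```python
def empirical_rule_intervals(mean, sd):
    """Intervals of 1, 2 and 3 standard deviations with approximate shares."""
    shares = {1: 0.68, 2: 0.95, 3: 0.997}
    return {k: (mean - k * sd, mean + k * sd, share) for k, share in shares.items()}


# Example 4.9: mean return 10%, standard deviation 8%.
intervals = empirical_rule_intervals(10, 8)

for k, (low, high, share) in intervals.items():
    print(f"about {share:.1%} of returns lie between {low}% and {high}%")
```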


62

Chebysheff's Theorem

The proportion of observations in any sample that lie within k standard deviations of the mean is at least 1 - 1/k^2 for k > 1.

This theorem is valid for any set of measurements (sample, population) of any shape!

k    Interval                        Chebysheff                     Empirical Rule
1    (\bar{x} - s, \bar{x} + s)      at least 0% (1 - 1/1^2)        approximately 68%
2    (\bar{x} - 2s, \bar{x} + 2s)    at least 75% (1 - 1/2^2)       approximately 95%
3    (\bar{x} - 3s, \bar{x} + 3s)    at least 89% (1 - 1/3^2)       approximately 99.7%

63

Example 4.10

The annual salaries of the employees of a chain of computer stores produced a positively skewed histogram. The mean and standard deviation are $28,000 and $3,000, respectively. What can you say about the salaries at this chain?

Solution

At least 75% of the salaries lie between $22,000 and $34,000: 28000 - 2(3000) and 28000 + 2(3000).

At least 88.9% of the salaries lie between $19,000 and $37,000: 28000 - 3(3000) and 28000 + 3(3000).
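Example 4.10 can be sketched in Python as below. The function names are assumptions introduced for illustration; the lower bound 1 - 1/k^2 is the one stated in the theorem, and it is only informative for k > 1.

```python
def chebysheff_bound(k):
    """Minimum proportion of observations within k standard deviations."""
    if k < 1:
        raise ValueError("the bound is only meaningful for k >= 1")
    return 1 - 1 / k ** 2


def interval(mean, sd, k):
    """Interval from mean - k*sd to mean + k*sd."""
    return (mean - k * sd, mean + k * sd)


# Example 4.10: mean salary $28,000, standard deviation $3,000.
for k in (2, 3):
    low, high = interval(28000, 3000, k)
    print(f"at least {chebysheff_bound(k):.1%} of salaries lie between ${low:,} and ${high:,}")
```

Because the salary histogram is positively skewed, the empirical rule does not apply, and these weaker guarantees are the ones that can be stated.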

64

The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.

This coefficient provides a proportionate measure of variation.

Sample coefficient of variation:

\[ cv = \frac{s}{\bar{x}} \]

Population coefficient of variation:

\[ CV = \frac{\sigma}{\mu} \]

A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.

The Coefficient of Variation


4.4 Measures of Relative Standing and Box Plots

Percentile

The pth percentile of a set of measurements is the value for which
• p percent of the observations are less than that value
• (100 - p) percent of all the observations are greater than that value.

Example

Suppose your score is the 60th percentile of an SAT test. Then 60% of all the scores lie below your score and 40% lie above it.

66

Commonly used percentiles
• First (lower) decile = 10th percentile
• First (lower) quartile, Q1 = 25th percentile
• Second (middle) quartile, Q2 = 50th percentile
• Third quartile, Q3 = 75th percentile
• Ninth (upper) decile = 90th percentile

Quartiles

67

Quartiles

Example

Find the quartiles of the following set of measurements 7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8

68

Solution

Sort the observations:

2, 4, 4, 5, 7, 8, 8, 10, 12, 17, 18, 21, 27, 29, 30 (15 observations)

At most (.25)(15) = 3.75 observations should appear below the first quartile. Check the first 3 observations on the left-hand side.

At most (.75)(15) = 11.25 observations should appear above the first quartile. Check 11 observations on the right-hand side.

The one observation left unchecked, 5, is the first quartile.

Comment: If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.
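The counting procedure for the first quartile can be sketched as follows. The function name is an assumption, and so is the fallback used when no observation is left unchecked (the slides do not cover that case); quartile conventions differ between textbooks and software, so other tools may return slightly different values.

```python
import math


def first_quartile(values):
    """First quartile by the 'check off' counting method of the example."""
    data = sorted(values)
    n = len(data)
    below = math.floor(0.25 * n)   # at most 25% of observations below Q1
    above = math.floor(0.75 * n)   # at most 75% of observations above Q1
    unchecked = data[below:n - above]
    if not unchecked:
        # Assumption, not from the slides: when nothing is left unchecked,
        # use the two observations on either side of the boundary.
        unchecked = data[below - 1:below + 1]
    # One value left: that value. Two values left: their midpoint.
    return sum(unchecked) / len(unchecked)


measurements = [7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8]
q1 = first_quartile(measurements)
print(q1)
```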

69

4.5 Measures of Linear Relationship

The covariance and the coefficient of correlation are used to measure the direction and strength of the linear relationship between two variables.

• Covariance - is there any pattern to the way two variables move together?
• Coefficient of correlation - how strong is the linear relationship between two variables?

70

Covariance

Population covariance:

\[ COV(X,Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N} \]

\mu_x (\mu_y) is the population mean of the variable X (Y). N is the population size.

Sample covariance:

\[ cov(x,y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1} \]

\bar{x} (\bar{y}) is the sample mean of the variable X (Y). n is the sample size.

71

Covariance

Compare the following three sets:

Set 1 (\bar{x} = 5, \bar{y} = 20)

x_i    y_i    (x - \bar{x})    (y - \bar{y})    (x - \bar{x})(y - \bar{y})
2      13     -3               -7               21
6      20     1                0                0
7      27     2                7                14

cov(x,y) = 17.5

Set 2 (\bar{x} = 5, \bar{y} = 20)

x_i    y_i    (x - \bar{x})    (y - \bar{y})    (x - \bar{x})(y - \bar{y})
2      27     -3               7                -21
6      20     1                0                0
7      13     2                -7               -14

cov(x,y) = -17.5

Set 3 (\bar{x} = 5, \bar{y} = 20)

x_i    y_i
2      20
6      27
7      13

cov(x,y) = -3.5
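The three covariances can be checked with a direct implementation of the sample covariance formula (divisor n - 1). The function name is an assumption; the data are the three small sets compared above, which share the same x values.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


xs = [2, 6, 7]
set_1 = sample_covariance(xs, [13, 20, 27])   # x and y rise together
set_2 = sample_covariance(xs, [27, 20, 13])   # y falls as x rises
set_3 = sample_covariance(xs, [20, 27, 13])   # no clear pattern

print(set_1, set_2, set_3)
```

Reordering the same y values changes the sign and size of the covariance, which is why it indicates how the two variables move together.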

72

If the two variables move in opposite directions, (one increases when the other one decreases), the covariance is a large negative number.

If the two variables are unrelated, the covariance will be close to zero.

If the two variables move in the same direction, (both increase or both decrease), the covariance is a large positive number.

Covariance

73

This coefficient answers the question: How strong is the association between X and Y?

Population coefficient of correlation:

\[ \rho = \frac{COV(X,Y)}{\sigma_x \sigma_y} \]

Sample coefficient of correlation:

\[ r = \frac{cov(X,Y)}{s_x s_y} \]

The coefficient of correlation

74

ρ or r = +1: Strong positive linear relationship, COV(X,Y) > 0

ρ or r = 0: No linear relationship, COV(X,Y) = 0

ρ or r = -1: Strong negative linear relationship, COV(X,Y) < 0


75

If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).

If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).

No straight line relationship is indicated by a coefficient close to zero.


76

The coefficient of correlation and the covariance - Example 4.16

Compute the covariance and the coefficient of correlation to measure how GMAT scores and GPA in an MBA program are related to one another.

Solution

We believe GMAT affects GPA. Thus
• GMAT is labeled X
• GPA is labeled Y

77

The coefficient of correlation and the covariance - Example 4.16

Student    x        y        x^2          y^2      xy
1          599      9.6      358801       92.16    5750.4
2          689      8.8      474721       77.44    6063.2
3          584      7.4      341056       54.76    4321.6
4          631      10       398161       100      6310
...        ...      ...      ...          ...      ...
11         593      8.8      351649       77.44    5218.4
12         683      8        466489       64       5464
Total      7,587    106.4    4,817,755    957.2    67,559.2

Shortcut formulas

\[ cov(x,y) = \frac{1}{n-1}\left[\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}\right] \]

\[ s^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right] \]

cov(x,y) = (1/(12-1))[67,559.2 - (7587)(106.4)/12] = 26.16

s_x = {(1/(12-1))[4,817,755 - (7587)^2/12]}^{.5} = 43.56

s_y = 1.12 (computed in the same way as s_x)

r = cov(x,y)/(s_x s_y) = 26.16/[(43.56)(1.12)] = .5362
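The shortcut calculations of Example 4.16 can be reproduced from the column totals alone. In the sketch below the totals are those of the table; the function names are assumptions made for illustration.

```python
import math

# Column totals from the Example 4.16 table (12 students).
n = 12
sum_x, sum_y = 7587, 106.4
sum_x2, sum_y2, sum_xy = 4817755, 957.2, 67559.2


def shortcut_covariance(total_xy, total_x, total_y, count):
    """(1/(n-1)) * [sum(xy) - sum(x) * sum(y) / n]"""
    return (total_xy - total_x * total_y / count) / (count - 1)


def shortcut_variance(total_sq, total, count):
    """(1/(n-1)) * [sum(x^2) - (sum(x))^2 / n]"""
    return (total_sq - total ** 2 / count) / (count - 1)


cov_xy = shortcut_covariance(sum_xy, sum_x, sum_y, n)
s_x = math.sqrt(shortcut_variance(sum_x2, sum_x, n))
s_y = math.sqrt(shortcut_variance(sum_y2, sum_y, n))
r = cov_xy / (s_x * s_y)

print(round(cov_xy, 2), round(s_x, 2), round(s_y, 2), round(r, 4))
```

Carrying unrounded values through gives r close to .5365, whereas dividing the rounded figures 26.16, 43.56 and 1.12 by hand gives .5362; the two differ only by rounding.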

78

The coefficient of correlation and the covariance - Example 4.16 - Excel

Use the Covariance option in Data Analysis.

If your version of Excel returns the population covariance and variances, multiply each one by n/(n-1) to obtain the corresponding sample values.

Use the Correlation option to produce the correlation matrix.

Variance-Covariance Matrix

Population values

        GPA      GMAT
GPA     1.15
GMAT    23.98    1739.52

Sample values (population values multiplied by 12/(12-1))

        GPA      GMAT
GPA     1.25
GMAT    26.16    1897.66

79

Interpretation

The covariance (26.16) indicates that GMAT score and performance in the MBA program are positively related.

The coefficient of correlation (.5365) indicates that there is a moderately strong positive linear relationship between GMAT and MBA GPA.