Graphical Descriptive Techniques
2.1 Introduction
Descriptive statistics involves the arrangement, summary, and presentation of data to enable meaningful interpretation and to support decision making. Descriptive statistics methods make use of
• graphical techniques
• numerical descriptive measures.
The methods presented apply to both
• the entire population
• the population sample
2.2 Types of data and information
A variable - a characteristic of a population or sample that is of interest to us.
• Cereal choice
• Capital expenditure
• The waiting time for medical services
Data - the actual values of variables
• Interval data are numerical observations
• Nominal data are categorical observations
• Ordinal data are ordered categorical observations
Types of data - examples
Interval data
Age    Income
55     75000
42     68000
.      .
Weight gain
+10
+5
.
Nominal data
Person    Marital status
1         married
2         single
3         single
.         .
Computer    Brand
1           IBM
2           Dell
3           IBM
.           .
With nominal data, all we can do is calculate the proportion of data that falls into each category.
Brand         IBM    Dell    Compaq    Other    Total
Frequency     25     11      8         6        50
Proportion    50%    22%     16%       12%
Types of data – analysis
Knowing the type of data is necessary to properly select the technique to be used when analyzing data.
Type of analysis allowed for each type of data
• Interval data - arithmetic calculations
• Nominal data - counting the number of observations in each category
• Ordinal data - computations based on an ordering process
Cross-Sectional/Time-Series Data
Cross-sectional data is collected at a certain point in time
• Marketing survey (observe preferences by gender, age)
• Test score in a statistics course
• Starting salaries of an MBA program's graduates
Time-series data is collected over successive points in time
• Weekly closing price of gold
• Amount of crude oil imported monthly
2.3 Graphical Techniques for Interval Data
Example 2.1: Providing information concerning the monthly bills of new subscribers in the first month after signing on with a telephone company.
• Collect data
• Prepare a frequency distribution
• Draw a histogram
Collect data
Bills: 42.19, 38.45, 29.23, 89.35, 118.04, 110.46, 0.00, 72.88, 83.05, ... (there are 200 data points)
Prepare a frequency distribution
How many classes to use?
Number of observations    Number of classes
Less than 50              5-7
50 - 200                  7-9
200 - 500                 9-10
500 - 1,000               10-11
1,000 - 5,000             11-13
5,000 - 50,000            13-17
More than 50,000          17-20
Class width = [Range] / [# of classes] = [Largest observation - Smallest observation] / [# of classes]
[119.63 - 0] / [8] = 14.95 ≈ 15
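The class-width calculation and the counting step can be sketched in Python. This is only an illustration: it uses the nine bills shown on the slide rather than the full 200, and it assumes that each class includes its upper limit (a bill of exactly 15 would fall in the first class) and that the smallest observation is 0. The function names are made up for this sketch.

```python
import math

# The nine bills listed on the slide (the full data set has 200 values).
bills_excerpt = [42.19, 38.45, 29.23, 89.35, 118.04, 110.46, 0.00, 72.88, 83.05]


def class_width(largest, smallest, n_classes):
    """Range divided by the number of classes, rounded up to a whole number."""
    return math.ceil((largest - smallest) / n_classes)


def frequency_distribution(values, width, n_classes):
    """Count observations per class, assuming classes start at 0 and
    each class includes its upper limit."""
    counts = [0] * n_classes
    for v in values:
        # A value of exactly 0 is placed in the first class.
        index = max(math.ceil(v / width) - 1, 0)
        counts[index] += 1
    return counts


width = class_width(119.63, 0, 8)  # largest and smallest bills from the slide
counts = frequency_distribution(bills_excerpt, width, 8)
print(width, counts)
```

With all 200 bills the same counting step would yield the frequency column used to draw the histogram.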
Draw a histogram
Bin    Frequency
15     71
30     37
45     13
60     9
75     10
90     18
105    28
120    14
[Figure: histogram of the bills; x-axis: Bills (class limits 15 to 120), y-axis: Frequency (0 to 80)]
What information can we extract from this histogram?
• About half of all the bills are small: 71+37=108
• A few bills are in the middle range: 13+9+10=32
• There is a relatively large number of large bills: 18+28+14=60
Relative frequency
It is often preferable to show the relative frequency (proportion) of observations falling into each class, rather than the frequency itself.
Relative frequencies should be used when
• the population relative frequencies are studied
• comparing two or more histograms
• the numbers of observations in the samples studied are different
Class relative frequency = Class frequency / Total number of observations
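As a small illustration, the class frequencies of the telephone-bill histogram in Example 2.1 can be converted to relative frequencies as follows. The variable names are chosen for this sketch only.

```python
# Class frequencies from the histogram of Example 2.1 (bins 15, 30, ..., 120).
frequencies = [71, 37, 13, 9, 10, 18, 28, 14]

total = sum(frequencies)  # total number of observations
relative = [f / total for f in frequencies]

for upper_limit, share in zip(range(15, 121, 15), relative):
    print(f"up to {upper_limit}: {share:.3f}")
```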
It is generally best to use equal class widths, but sometimes unequal class widths are called for.
• Unequal class width is used when the frequency associated with some classes is too low. Then, several classes are combined together to form a wider and “more populated” class.
• It is possible to form an open-ended class at the higher end or lower end of the histogram.
Class width
Shapes of histograms
There are four typical shape characteristics.
Skewness: a histogram can be positively skewed or negatively skewed.
[Figure: a positively skewed histogram and a negatively skewed histogram]
Modal classes
A modal class is the one with the largest number of observations.
[Figure: a unimodal histogram, with the modal class marked]
[Figure: a bimodal histogram, with two modal classes marked]
Bell shaped histograms
• Many statistical techniques require that the population be bell shaped.
• Drawing the histogram helps verify the shape of the population in question.
Interpreting histograms
Example 2.2: Selecting an investment
• An investor is considering investing in one out of two investments.
• The returns on these investments were recorded.
• From the two histograms, how can the investor interpret
  • the expected returns
  • the spread of the returns (the risk involved with each investment)
Example 2.2 - Histograms
[Figure: histograms of the return on investment A and the return on investment B; x-axis: return, -15 to 75; y-axis: frequency, 0 to 18; sample size = 50 for each; figure annotations (A, B): 17, 16; 34, 26; 46, 43]
Interpretation: The center of the returns of Investment A is slightly lower than that for Investment B.
Interpretation: The spread of returns for Investment A is less than that for Investment B.
Interpretation: Both histograms are slightly positively skewed. There is a possibility of large returns.
Example 2.2: Conclusion
It seems that investment A is better, because:
• Its expected return is only slightly below that of investment B.
• The risk from investing in A is smaller.
• The possibility of having a high rate of return exists for both investments.
Interpreting histograms
Example 2.3: Comparing students’ performance
• Students’ performance in two statistics classes were compared.
• The two classes differed in their teaching emphasis:
  • Class A - mathematical analysis and development of theory.
  • Class B - applications and computer based analysis.
• The final mark for each student in each course was recorded.
• Draw histograms and interpret the results.
[Figure: two histograms, Marks (Manual) and Marks (Computer); x-axis: marks, 50 to 100; y-axis: Frequency, 0 to 40]
The mathematical emphasis creates two groups, and a larger spread.
2.5 Describing the Relationship Between Two Variables
We are interested in the relationship between two interval variables.
Example 2.7
• A real estate agent wants to study the relationship between house price and house size.
• Twelve houses recently sold are sampled and their size and price recorded.
• Use a graphical technique to describe the relationship between size and price.
Size    Price
23      315
24      229
26      335
27      261
...     ...
Solution
• The size (independent variable, X) affects the price (dependent variable, Y).
• We use Excel to create a scatter diagram.
[Figure: scatter diagram; x-axis: X (size), 0 to 40; y-axis: Y (price), 0 to 400]
The greater the house size, the greater the price.
Typical Patterns of Scatter Diagrams
• Positive linear relationship
• Negative linear relationship
• No relationship
• Negative nonlinear relationship
• Nonlinear (concave) relationship
This is a weak linear relationship. A nonlinear relationship seems to fit the data better.
2.6 Describing Time-Series Data
Data can be classified according to the time it is collected.
• Cross-sectional data are all collected at the same time.
• Time-series data are collected at successive points in time.
Time-series data is often depicted on a line chart (a plot of the variable over time).
Line Chart
Example 2.9
• The total amounts of income tax paid by individuals in 1987 through 1999 are listed below.
• Draw a graph of this data and describe the information produced.
[Figure: line chart of total income tax paid; x-axis: years 87 to 99; y-axis: 0 to 1,200,000]
For the first five years, total tax was relatively flat. From 1993 there was a rapid increase in tax revenues.
Line charts can be used to describe nominal data time series.
Numerical Descriptive Techniques
4.2 Measures of Central Location
Usually, we focus our attention on two types of measures when describing population characteristics:
• Central location (e.g. average)
• Variability or spread
The measure of central location reflects the locations of all the actual data points. How?
• With one data point, clearly the central location is at the point itself.
• With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
• But if a third data point appears on the left-hand side of the midrange, it should “pull” the central location to the left.
The Arithmetic Mean
This is the most popular and useful measure of central location.
Mean = Sum of the observations / Number of observations
Sample mean (n is the sample size):
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Population mean (N is the population size):
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
• Example 4.1
The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.
$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{x_1 + x_2 + \dots + x_{10}}{10} = \frac{0 + 7 + \dots + 22}{10} = 11.0$
• Example 4.2
Suppose the telephone bills of Example 2.1 represent the population of measurements. The population mean is
$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{x_1 + x_2 + \dots + x_{200}}{200} = \frac{42.19 + 38.45 + \dots + 45.77}{200} = 43.59$
The Median
The median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.
Example 4.3
Find the median of the time on the Internet for the 10 adults of Example 4.1.
Even number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33. The median is the midpoint of the two middle values, 8 and 9: 8.5.
Comment
Suppose only 9 adults were sampled (exclude, say, the longest time (33)).
Odd number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22. The median is the middle value: 8.
The Mode
The mode of a set of observations is the value that occurs most frequently.
A set of data may have one mode (or modal class), or two or more modes.
For large data sets the modal class is much more relevant than a single-value mode.
Example 4.5
Find the mode for the data in Example 4.1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution
All observations except “0” occur once. There are two “0”s. Thus, the mode is zero.
Is this a good measure of central location? The value “0” does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).
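The three measures for the Internet-time data of Example 4.1 can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

# Hours on the Internet reported by the 10 adults of Example 4.1.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

mean_hours = statistics.mean(hours)      # sum of observations / number of observations
median_hours = statistics.median(hours)  # midpoint of the two middle values (even n)
mode_hours = statistics.mode(hours)      # most frequent value

# Dropping the longest time (33) leaves an odd number of observations,
# so the median is the single middle value.
nine_adults = sorted(hours)[:-1]
median_nine = statistics.median(nine_adults)

print(mean_hours, median_hours, mode_hours, median_nine)
```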
Relationship among Mean, Median, and Mode
If a distribution is symmetrical, the mean, median and mode coincide.
If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.
[Figure: a positively skewed distribution (“skewed to the right”), with the mode, median, and mean marked]
[Figure: a negatively skewed distribution (“skewed to the left”), with the mean, median, and mode marked]
The Geometric Mean
This is a measure of the average growth rate.
Let $R_i$ denote the rate of return in period i (i = 1, 2, …, n). The geometric mean of the returns $R_1, R_2, \dots, R_n$ is the constant $R_g$ that produces the same terminal wealth at the end of period n as do the actual returns for the n periods.
For the given series of rates of return, the nth period return is calculated by:
$(1 + R_1)(1 + R_2) \cdots (1 + R_n)$
If the rate of return was $R_g$ in every period, the nth period return would be calculated by:
$(1 + R_g)^n$
$R_g$ is selected such that $(1 + R_g)^n = (1 + R_1)(1 + R_2) \cdots (1 + R_n)$, that is,
$R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$
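A short Python sketch of the geometric mean follows. The two returns used here (+100% followed by -50%) are hypothetical values chosen for illustration and do not come from the slides; the function name is also made up.

```python
import math


def geometric_mean_return(returns):
    """Constant per-period return R_g that gives the same terminal wealth
    as the actual sequence of returns."""
    growth = math.prod(1 + r for r in returns)  # (1+R_1)(1+R_2)...(1+R_n)
    return growth ** (1 / len(returns)) - 1


# Hypothetical returns: the investment doubles, then loses half its value.
returns = [1.00, -0.50]

r_g = geometric_mean_return(returns)
arithmetic = sum(returns) / len(returns)
print(r_g, arithmetic)
```

In this hypothetical case the investment ends where it started, which the geometric mean reflects, while the arithmetic mean of the two returns is 25%.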
4.3 Measures of variability
Measures of central location fail to tell the whole story about the distribution. A question of interest still remains unanswered: How much are the observations spread out around the mean value?
Observe two hypothetical data sets:
• Small variability: the average value provides a good representation of the observations in the data set.
• Larger variability: the same average value does not provide as good a representation of the observations in the data set as before.
The range
The range of a set of observations is the difference between the largest and smallest observations.
Its major advantage is the ease with which it can be computed.
Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.
[Figure: the range, from the smallest observation to the largest observation]
But how do all the observations spread out? The range cannot assist in answering this question.
The Variance
This measure reflects the dispersion of all the observations.
The variance of a population of size N, $x_1, x_2, \dots, x_N$, whose mean is $\mu$ is defined as
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
The variance of a sample of n observations $x_1, x_2, \dots, x_n$ whose mean is $\bar{x}$ is defined as
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Why not use the sum of deviations?
Consider two small populations:
A: 8, 9, 10, 11, 12
B: 4, 7, 10, 13, 16
The mean of both populations is 10, but measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.
Can the sum of deviations be a good measure of dispersion?
A: (8-10) + (9-10) + (10-10) + (11-10) + (12-10) = -2 - 1 + 0 + 1 + 2 = 0
B: (4-10) + (7-10) + (10-10) + (13-10) + (16-10) = -6 - 3 + 0 + 3 + 6 = 0
The sum of deviations is zero for both populations; therefore, it is not a good measure of dispersion.
Let us calculate the variance of the two populations:
$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$
$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$
Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead? After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!
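The comparison of the two small populations (A: 8, 9, 10, 11, 12 and B: 4, 7, 10, 13, 16) can be reproduced with a few lines of Python. This is a sketch using the standard `statistics` module; the helper name is illustrative.

```python
import statistics

population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]


def sum_of_deviations(values):
    """Sum of (x_i - mean); always zero, so it cannot measure dispersion."""
    mean = statistics.mean(values)
    return sum(x - mean for x in values)


dev_a = sum_of_deviations(population_a)
dev_b = sum_of_deviations(population_b)

# Population variance: average squared deviation from the mean (divide by N).
var_a = statistics.pvariance(population_a)
var_b = statistics.pvariance(population_b)
print(dev_a, dev_b, var_a, var_b)
```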
Which data set has a larger dispersion?
Data set A: five observations equal to 1 and five equal to 3 (mean 2)
Data set B: 1, 5 (mean 3)
Data set B is more dispersed around the mean.
Let us calculate the sum of squared deviations for both data sets:
Sum_A = (1-2)^2 + … + (1-2)^2 + (3-2)^2 + … + (3-2)^2 = 10
Sum_B = (1-3)^2 + (5-3)^2 = 8
Sum_A > Sum_B. This is inconsistent with the observation that set B is more dispersed.
However, when calculated on a “per observation” basis (variance), the data set dispersions are properly ranked.
$\sigma_A^2 = Sum_A / N = 10/10 = 1$
$\sigma_B^2 = Sum_B / N = 8/2 = 4$
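The point about dividing by the number of observations can be checked numerically. The sketch below assumes data set A consists of five observations equal to 1 and five equal to 3 (the only composition consistent with a mean of 2 and a sum of squared deviations of 10), and data set B consists of 1 and 5.

```python
import statistics

data_a = [1] * 5 + [3] * 5  # assumed composition of set A (mean 2)
data_b = [1, 5]             # set B (mean 3)


def sum_squared_deviations(values):
    mean = statistics.mean(values)
    return sum((x - mean) ** 2 for x in values)


sum_a = sum_squared_deviations(data_a)
sum_b = sum_squared_deviations(data_b)

# Dividing by the number of observations gives the population variance.
var_a = sum_a / len(data_a)
var_b = sum_b / len(data_b)
print(sum_a, sum_b, var_a, var_b)
```

The raw sums rank A above B, while the per-observation values rank B above A, in line with the visual impression that B is more dispersed.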
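Example 4.7
The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance.
Solution
$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14$ jobs
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{6 - 1}\left[(17-14)^2 + (15-14)^2 + \dots + (13-14)^2\right] = 33.2$ jobs$^2$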
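The mean and sample variance of the six job counts in Example 4.7 (17, 15, 23, 7, 9, 13) can be verified with the standard `statistics` module, whose `variance` function divides by n - 1. A minimal sketch:

```python
import statistics

# Number of jobs each of the six students applied for (Example 4.7).
jobs = [17, 15, 23, 7, 9, 13]

mean_jobs = statistics.mean(jobs)          # 84 / 6
sample_var = statistics.variance(jobs)     # divides by n - 1
sample_sd = statistics.stdev(jobs)         # square root of the sample variance
print(mean_jobs, sample_var, sample_sd)
```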
Standard Deviation
The standard deviation of a set of observations is the square root of the variance.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Example 4.8
• To examine the consistency of shots for a new innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club.
• The distances were recorded. Which 7-iron is more consistent?
Example 4.8 - solution
Excel printout, from the “Descriptive Statistics” sub-menu.
                      Current      Innovation
Mean                  150.5467     150.1467
Standard Error        0.668815     0.357011
Median                151          150
Mode                  150          149
Standard Deviation    5.792104     3.091808
Sample Variance       33.54847     9.559279
Kurtosis              0.12674      -0.88542
Skewness              -0.42989     0.177338
Range                 28           12
Minimum               134          144
Maximum               162          156
Sum                   11291        11261
Count                 75           75
The innovation club is more consistent and, because the means are close, is considered a better club.
Interpreting Standard Deviation
The standard deviation can be used to
• compare the variability of several distributions
• make a statement about the general shape of a distribution.
The empirical rule: If a sample of observations has a mound-shaped distribution, the interval
• $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements
• $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements
• $(\bar{x} - 3s, \bar{x} + 3s)$ contains approximately 99.7% of the measurements
Example 4.9
A statistics practitioner wants to describe the way returns on investment are distributed.
• The mean return = 10%
• The standard deviation of the return = 8%
• The histogram is bell shaped.
Example 4.9 - solution
The empirical rule can be applied (bell shaped histogram). Describing the return distribution:
• Approximately 68% of the returns lie between 2% and 18% [10 - 1(8), 10 + 1(8)]
• Approximately 95% of the returns lie between -6% and 26% [10 - 2(8), 10 + 2(8)]
• Approximately 99.7% of the returns lie between -14% and 34% [10 - 3(8), 10 + 3(8)]
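The three empirical-rule intervals of Example 4.9 (mean return 10%, standard deviation 8%) can be generated with a small helper. The function name is an assumption made for this sketch, and the rule applies only when the histogram is roughly bell shaped.

```python
def empirical_rule_intervals(mean, sd):
    """Intervals mean +/- k*sd for k = 1, 2, 3, paired with the approximate
    percentage of observations they contain for a mound-shaped distribution."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {k: (mean - k * sd, mean + k * sd, pct) for k, pct in coverage.items()}


intervals = empirical_rule_intervals(10, 8)  # values from Example 4.9, in percent
for k, (low, high, pct) in intervals.items():
    print(f"about {pct}% of returns lie between {low}% and {high}%")
```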
Chebysheff’s Theorem
The proportion of observations in any sample that lie within k standard deviations of the mean is at least $1 - 1/k^2$ for k > 1.
This theorem is valid for any set of measurements (sample, population) of any shape!
k    Interval                          Chebysheff                      Empirical rule
1    $(\bar{x} - s, \bar{x} + s)$      at least 0% ($1 - 1/1^2$)       approximately 68%
2    $(\bar{x} - 2s, \bar{x} + 2s)$    at least 75% ($1 - 1/2^2$)      approximately 95%
3    $(\bar{x} - 3s, \bar{x} + 3s)$    at least 89% ($1 - 1/3^2$)      approximately 99.7%
Example 4.10
The annual salaries of the employees of a chain of computer stores produced a positively skewed histogram. The mean and standard deviation are $28,000 and $3,000, respectively. What can you say about the salaries at this chain?
Solution
• At least 75% of the salaries lie between $22,000 and $34,000 [28000 - 2(3000), 28000 + 2(3000)]
• At least 88.9% of the salaries lie between $19,000 and $37,000 [28000 - 3(3000), 28000 + 3(3000)]
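The bound and the salary intervals of Example 4.10 (mean $28,000, standard deviation $3,000) can be computed with two small helpers. The function names are illustrative only.

```python
def chebysheff_bound(k):
    """Minimum proportion of observations within k standard deviations
    of the mean: 1 - 1/k^2 (informative only for k > 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return 1 - 1 / k**2


def interval(mean, sd, k):
    """The interval mean - k*sd to mean + k*sd."""
    return (mean - k * sd, mean + k * sd)


for k in (2, 3):
    low, high = interval(28000, 3000, k)  # salary figures from Example 4.10
    print(f"at least {chebysheff_bound(k):.1%} of salaries lie between {low} and {high}")
```

Unlike the empirical rule, this bound does not require a bell shaped histogram, which is why it can be applied to the positively skewed salary data.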
The Coefficient of Variation
The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.
Sample coefficient of variation: $cv = \frac{s}{\bar{x}}$
Population coefficient of variation: $CV = \frac{\sigma}{\mu}$
This coefficient provides a proportionate measure of variation.
A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
4.4 Measures of Relative Standing and Box Plots
Percentile
The pth percentile of a set of measurements is the value for which
• p percent of the observations are less than that value
• (100 - p) percent of all the observations are greater than that value.
Example
Suppose your score is the 60th percentile of a SAT test. Then 60% of all the scores lie below your score, and 40% lie above it.
Quartiles
Commonly used percentiles
• First (lower) decile = 10th percentile
• First (lower) quartile, Q1 = 25th percentile
• Second (middle) quartile, Q2 = 50th percentile
• Third quartile, Q3 = 75th percentile
• Ninth (upper) decile = 90th percentile
Quartiles
Example
Find the quartiles of the following set of measurements: 7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8
Solution
Sort the observations (15 observations):
2, 4, 4, 5, 7, 8, 8, 10, 12, 17, 18, 21, 27, 29, 30
At most (.25)(15) = 3.75 observations should appear below the first quartile. Check the first 3 observations on the left-hand side.
At most (.75)(15) = 11.25 observations should appear above the first quartile. Check 11 observations on the right-hand side.
The first quartile is the one observation left unchecked: 5.
Comment: If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.
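The counting procedure for the first quartile can be sketched for this particular data set as follows. This is an illustration of the slide's procedure for these 15 values only, not a general-purpose quartile routine; other software may use different conventions and return slightly different quartiles.

```python
# Measurements from the quartile example.
data = [7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8]

ordered = sorted(data)
n = len(ordered)

# At most 25% of the observations may lie below Q1 and at most 75% above it.
below = int(0.25 * n)  # 3.75 -> check 3 observations on the left
above = int(0.75 * n)  # 11.25 -> check 11 observations on the right

# Whatever is left unchecked determines the first quartile; if two values
# remain, their midpoint is used.
remaining = ordered[below:n - above]
first_quartile = sum(remaining) / len(remaining)
print(ordered, remaining, first_quartile)
```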
4.5 Measures of Linear Relationship
The covariance and the coefficient of correlation are used to measure the direction and strength of the linear relationship between two variables.
• Covariance - is there any pattern to the way two variables move together?
• Coefficient of correlation - how strong is the linear relationship between two variables?
Covariance
Population covariance: $COV(X,Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
$\mu_x$ ($\mu_y$) is the population mean of the variable X (Y). N is the population size.
Sample covariance: $cov(x,y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
$\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y). n is the sample size.
Covariance
Compare the following three sets. In each set, $\bar{x} = 5$ and $\bar{y} = 20$.
Set 1:
x_i    y_i    (x - x̄)    (y - ȳ)    (x - x̄)(y - ȳ)
2      13     -3         -7         21
6      20     1          0          0
7      27     2          7          14
Cov(x,y) = 17.5
Set 2:
x_i    y_i    (x - x̄)    (y - ȳ)    (x - x̄)(y - ȳ)
2      27     -3         7          -21
6      20     1          0          0
7      13     2          -7         -14
Cov(x,y) = -17.5
Set 3:
x_i    y_i
2      20
6      27
7      13
Cov(x,y) = -3.5
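The three covariances can be reproduced with a direct implementation of the sample covariance formula. The function name is chosen for this sketch.

```python
def sample_covariance(xs, ys):
    """Sum of (x_i - x_bar)(y_i - y_bar) divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


x = [2, 6, 7]
cov_1 = sample_covariance(x, [13, 20, 27])  # y rises with x
cov_2 = sample_covariance(x, [27, 20, 13])  # y falls as x rises
cov_3 = sample_covariance(x, [20, 27, 13])  # no clear pattern
print(cov_1, cov_2, cov_3)
```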
If the two variables move in opposite directions, (one increases when the other one decreases), the covariance is a large negative number.
If the two variables are unrelated, the covariance will be close to zero.
If the two variables move in the same direction, (both increase or both decrease), the covariance is a large positive number.
Covariance
The coefficient of correlation
This coefficient answers the question: How strong is the association between X and Y?
Population coefficient of correlation: $\rho = \frac{COV(X,Y)}{\sigma_x \sigma_y}$
Sample coefficient of correlation: $r = \frac{cov(x,y)}{s_x s_y}$
The coefficient of correlation
$\rho$ or r =
+1    Strong positive linear relationship (COV(X,Y) > 0)
0     No linear relationship (COV(X,Y) = 0)
-1    Strong negative linear relationship (COV(X,Y) < 0)
If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).
If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).
No straight line relationship is indicated by a coefficient close to zero.
The coefficient of correlation
The coefficient of correlation and the covariance - Example 4.16
Compute the covariance and the coefficient of correlation to measure how GMAT scores and GPA in an MBA program are related to one another.
Solution
We believe GMAT affects GPA. Thus
• GMAT is labeled X
• GPA is labeled Y
Student    x        y        x^2          y^2      xy
1          599      9.6      358801       92.16    5750.4
2          689      8.8      474721       77.44    6063.2
3          584      7.4      341056       54.76    4321.6
4          631      10       398161       100      6310
...        ...      ...      ...          ...      ...
11         593      8.8      351649       77.44    5218.4
12         683      8        466489       64       5464
Total      7,587    106.4    4,817,755    957.2    67,559.2
Shortcut formulas:
$cov(x,y) = \frac{1}{n-1}\left[\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}\right]$
$s^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right]$
cov(x,y) = (1/(12-1))[67,559.2 - (7587)(106.4)/12] = 26.16
s_x = {(1/(12-1))[4,817,755 - (7587)^2/12]}^.5 = 43.56
s_y (computed in the same way as s_x) = 1.12
r = cov(x,y)/(s_x s_y) = 26.16/[(43.56)(1.12)] = .5362 (.5365 if the intermediate values are not rounded)
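The shortcut formulas can be applied directly to the column totals of the GMAT/GPA table (n = 12, Σx = 7,587, Σy = 106.4, Σx² = 4,817,755, Σy² = 957.2, Σxy = 67,559.2). The sketch below uses only those totals; the variable names are illustrative.

```python
import math

# Column totals from the Example 4.16 table (x = GMAT, y = GPA).
n = 12
sum_x, sum_y = 7587, 106.4
sum_x2, sum_y2, sum_xy = 4817755, 957.2, 67559.2

# Shortcut formula for the sample covariance.
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)

# Shortcut formula for the sample variances, then standard deviations.
s_x = math.sqrt((sum_x2 - sum_x**2 / n) / (n - 1))
s_y = math.sqrt((sum_y2 - sum_y**2 / n) / (n - 1))

# Sample coefficient of correlation, without rounding intermediate values.
r = cov_xy / (s_x * s_y)
print(round(cov_xy, 2), round(s_x, 2), round(s_y, 2), round(r, 4))
```

Because no intermediate value is rounded here, r comes out slightly higher than the hand calculation that uses 26.16, 43.56, and 1.12.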
The coefficient of correlation and the covariance - Example 4.16 - Excel
• Use the Covariance option in Data Analysis.
• If your version of Excel returns the population covariance and variances, multiply each one by n/(n-1) to obtain the corresponding sample values.
• Use the Correlation option to produce the correlation matrix.
Variance-Covariance Matrix
Population values:
        GPA      GMAT
GPA     1.15
GMAT    23.98    1739.52
Multiply by 12/(12-1) to obtain the sample values:
        GPA      GMAT
GPA     1.25
GMAT    26.16    1897.66
Interpretation
• The covariance (26.16) indicates that GMAT score and performance in the MBA program are positively related.
• The coefficient of correlation (.5365) indicates that there is a moderately strong positive linear relationship between GMAT and MBA GPA.