(6) Graphical Presentation 2
-
7/30/2019 (6) Graphical Presentation 2
Applied Statistics and Computing Lab
GRAPHICAL PRESENTATIONS 2 (REPRESENTATION OF NUMERICAL DATA)
Applied Statistics and Computing Lab
Indian School of Business
-
Learning Goals
Why we use graphs
What are the various types of graphs for presenting numerical data
Which graph to use in which scenario
Graphical Distortion
-
Why use Graphs: An example
A private insurance firm is interested in marketing its insurance products in region A.
To target precisely, it needs to know the age distribution.
Questions-
In which age group does the highest number of people lie?
It needs to divide the population into 4 different age groups, to sell 4 different products
It has the following data-
-
Data on ages of 507 people
23,21,23,26,22,27,29,37,55,53,21,19,20,18,32,20,28,19,23,33,40,28,24,36,23,29,34,31,34,42,45,23,46,26,30,25,20,37,24,36,28,29,23,23,25,24,37,42,30,28,29,39,26,
20,21,20,19,20,40,25,45,28,21,22,19,24,24,20,29,27,27,40,43,22,22,21,24,23,23,45,20,25,25,33,21,23,20,34,20,41,25,32,24,65,28,25,38,23,22,20,35,34,67,38,33,26,25,52,21,32,43,24,28,62,45,40,21,23,30,20,28,41,32,26,37,38,27,23,50,25,23,43,33,22,26,37,32,23,37,23,27,23,27,24,21,25,23,23,46,34,25,29,45,44,35,55,25,31,19,45,34,19,20,29,33,37,21,23,51,31,27,27,37,25,37,33,25,29,25,20,25,28,24,31,25,27,23,20,28,40,21,62,44,49,34,25,29,19,20,20,26,19,36,34,24,27,23,20,28,40,21,62,
44,49,34,25,29,19,20,20,26,19,36,34,24,27,22,22,48,21,27,33,34,54,25,35,22,21,41,23,19,29,27,36,21,20,20,24,35,33,25,45,55,49,30,28,25,23,26,21,26,32,32,32,35,19,26,22,23,25,38,30,43,60,32,26,23,24,21,28,25,20,64,39,27,32,23,24,23,29,44,20,24,42,27,43,37,20,47,45,20,28,21,37,27,26,22,21,62,27,27,22,22,52,42,30,19,19,19,24,21,36,32,52,26,56,30,23,21,44,37,51,38,23,44,26,23,20,44,25,18,22,35,24,25,23,22,24,26,26,28,34,24,33,46,51,25,19,35,19,19,20,41,33,44,19,29,35,33,22,33,
44,29,46,19,30,26,20,32,20,27,22,40,42,29,31,22,29,36,37,25,46,25,43,43,24,24,19,46,29,26,32,29,34,26,34,22,25,41,38,21,34,37,56,28,35,29,22,22,24,36,40,40,37,23,34,20,23,40,20,30,32,30,21,39,37,22,39,49,24,20,40,24,39,32,24,22,20,27,21,26,28,26,18,30,22,30,18,52,25,28,42,23,41,32,22,24,25,27,24,27,31,35,21,36,20,23,19,25,31,32,40,41,36,43,34,26,29,23,45,33,29,29,45,48,19,38,26,48,22,32,44,44,19,32,30,
-
Inferences
What can you infer from the data? Practically nothing!
How long before you come up with answers?
Probably the first thing you do is count the observations for each age.
Note down each age along with the corresponding number of observations
That makes a frequency table for you!
Frequency table- Just like in categorical data, a frequency table for
discrete numerical data lists each possible value (either individually or
grouped into intervals), the associated frequency and sometimes the
corresponding relative frequency.
Note: Age is, in theory, a continuous variable as it can assume any value.
But here the variable is age in whole years, which is discrete.
But there are 44 distinct values in your data!
Hence a frequency table with 44 rows and one frequency column
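As a rough illustration of the counting step described above, the sketch below builds a frequency table in R. The vector holds only the first 20 ages from the data slide (an assumption made to keep the example short, not the full 507 observations), and the object names are hypothetical.

```r
# Hypothetical subset: the first 20 ages listed on the data slide
age <- c(23, 21, 23, 26, 22, 27, 29, 37, 55, 53,
         21, 19, 20, 18, 32, 20, 28, 19, 23, 33)

freq <- table(age)             # absolute frequency of each distinct age
rel_freq <- prop.table(freq)   # relative frequency (the column sums to 1)

# One row per distinct age, with its frequency and relative frequency
freq_table <- data.frame(
  age = as.integer(names(freq)),
  frequency = as.vector(freq),
  relative_frequency = as.vector(rel_freq)
)
print(freq_table)
```

With the full data the same calls would give one row per distinct age, which is why grouping into intervals becomes attractive.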
-
So, list individually or group?
List individually or group into intervals?-
In the ages data, there are 44 distinct values. If we list individually, we have a table with 44 rows! Cumbersome to interpret
The insurance company is interested in selling 4 different products catering to the needs of 4 different age groups.
So it is interested in 4 age categories
In general, depending on need and size of data, decide whether to group or not (for discrete data).
For continuous data it is necessary to group.
How to make groups?- Find max and min. Choose a suitable class width = (max-min)/(desired no. of classes). If the result is a decimal, round up to the next integer; if it is already an integer, take the next integer
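The grouping rule above can be sketched in R as follows. The extremes 17 and 67 are assumptions taken from the values used in the R code slides, the short age vector is hypothetical, and all object names are made up for illustration.

```r
# Assumed extremes and desired number of classes (not computed from the raw data here)
min_age <- 17
max_age <- 67
n_classes <- 4

# "Next integer" rule: floor(x) + 1 rounds a decimal up and moves an exact integer up by one
class_width <- floor((max_age - min_age) / n_classes) + 1

# Inclusive class limits for whole-year ages
lower <- seq(min_age, by = class_width, length.out = n_classes)
upper <- lower + class_width - 1
class_labels <- paste(lower, upper, sep = "-")

# Group a few hypothetical whole-year ages into these classes
age <- c(18, 29, 30, 42, 43, 55, 56, 67)
age_group <- cut(age, breaks = c(lower[1] - 1, upper), labels = class_labels)
print(table(age_group))
```

Under these assumptions the width comes out as 13, which gives four classes of whole-year ages starting at 17.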
-
Construction of Frequency table
Class Interval    Frequency
17-29             298
30-42             142
43-55             56
56-68             10
Do we have the answers in a minute from this table?
The age group 17-29 has the maximum number of people
We also have the exact number of people in each age group
This same data can be represented pictorially in a number of ways!
-
Types of Graph
Graphs for presenting Numerical data:
Bar chart (for discrete variable)
Histogram
Frequency Polygon
Ogive
Line Diagram
-
Bar Chart (Numerical Data)
Graph of the frequency distribution
Similar to bar chart for categorical data
Each frequency or relative frequency is represented by a rectangle centered over the corresponding value (or range of values for grouped data)
Area of the rectangle is proportional to the corresponding frequency or relative frequency
We could name the groups group 1, group 2, group 3 and group 4 and plot the corresponding frequencies, exactly like in the case of categorical data (Exercise)
Conceptually, hence, there is no difference between the two
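One possible way to attempt the exercise above is sketched here, using the grouped frequencies from the frequency table slide. The axis labels and title are illustrative choices, not taken from the original slides.

```r
# Grouped frequencies from the frequency table slide, named by class interval
freq <- c("17-29" = 298, "30-42" = 142, "43-55" = 56, "56-68" = 10)

# barplot() draws one bar per group, with gaps between bars as for categorical data,
# and returns the horizontal position of each bar
bar_mids <- barplot(freq,
                    xlab = "Age group",
                    ylab = "Frequency",
                    main = "Bar chart of grouped ages")
```

Replacing the names with "group 1" to "group 4" would give the same picture, which is the point of the exercise.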
-
Histogram (for continuous numerical data)-
Graph of the frequency distribution of continuous data
Suppose we are given the ages of 507 people in continuous form- (Now age is not reported in whole years, it can take any value on the real line)
We draw a histogram instead of a bar chart
Similar to bar chart for numerical data except that there are no gaps between the bars
With equal class intervals, the height of each rectangle represents the class frequency, so that the area covered by the histogram is proportional to the total frequency
If class intervals are not equal, then the height represents relative frequency density (= class relative frequency/class width), so that the area of each rectangle equals the class relative frequency and the total area enclosed by the histogram = 1
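The area property for unequal class intervals can be checked numerically. This is a minimal sketch on simulated ages (an assumption, since the continuous ages file is not reproduced here); the break points are arbitrary illustrative choices.

```r
set.seed(1)
age <- runif(200, min = 17, max = 67)   # hypothetical continuous ages
breaks <- c(17, 20, 25, 35, 50, 67)     # deliberately unequal class intervals

h <- hist(age, breaks = breaks, plot = FALSE)
widths <- diff(h$breaks)

# Frequency density: bar areas add up to the total frequency
freq_density <- h$counts / widths
total_area_counts <- sum(freq_density * widths)

# h$density is the relative frequency density: bar areas add up to 1
total_area_density <- sum(h$density * widths)

print(c(total_area_counts, total_area_density))
```

When the breaks are unequal, hist() plots densities rather than counts by default, so the picture it draws corresponds to the second total.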
-
Inferences:
Maximum concentration is in the age group 20-25
Gives an idea about the shape of the distribution- for eg, we can say that the distribution of ages is not symmetric, it is highly right skewed (See module on Skewness and kurtosis)
Extent of spread or variation (see module on dispersion)
What is bin width?- Bin width refers to the length of each class interval.
How to choose bin width?- Well, R chose a bin width for you!
By default R picks the number of bins using Sturges' rule
Some other thumb rules- Doane's formula, Rice rule, Scott's rule, Freedman-Diaconis rule
For the rules built into R (Sturges, Scott, Freedman-Diaconis), all you need to do is name the rule in breaks= (see histogram in R-code slide)
For more details on these rules- http://en.wikipedia.org/wiki/Histogram
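To compare what the different rules suggest, one can call R's helper functions directly. The sketch below uses simulated ages as a stand-in for the real file (an assumption), and the Rice rule is written out by hand because, as far as I know, base R has no built-in helper for it.

```r
set.seed(1)
age <- runif(507, min = 17, max = 67)   # hypothetical stand-in for the 507 ages

k_sturges <- nclass.Sturges(age)             # ceiling(log2(n) + 1)
k_scott   <- nclass.scott(age)               # based on the standard deviation
k_fd      <- nclass.FD(age)                  # based on the interquartile range
k_rice    <- ceiling(2 * length(age)^(1/3))  # Rice rule, computed by hand

print(c(Sturges = k_sturges, Scott = k_scott, FD = k_fd, Rice = k_rice))

# Naming a rule in breaks= asks hist() to use it
h <- hist(age, breaks = "FD", plot = FALSE)
```

Note that hist() treats the suggested number of bins as a hint and then picks "pretty" break points, so the number of bars drawn may differ slightly from the rule's value.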
-
Frequency Polygon (for representing continuous data)
A frequency polygon is formed by plotting the frequencies of each class against their midpoints and joining the points by straight lines
To get a closed polygon, we take two additional classes, one at each end, that have zero frequencies. (The midpoints corresponding to these classes thus have zero frequencies)
Basically, if superimposed on a histogram, it joins the midpoints of the tops of the rectangular bars by straight line segments
We draw the frequency polygon for the ages data over the histogram itself
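The construction above (class midpoints plus one zero-frequency class at each end) can be sketched as follows. The ages are simulated (an assumption) and the names poly_x and poly_y are hypothetical.

```r
set.seed(1)
age <- runif(507, min = 17, max = 67)   # hypothetical continuous ages
bins <- seq(17, 67, by = 10)

h <- hist(age, breaks = bins, plot = FALSE)
width <- diff(bins)[1]

# Midpoints of the real classes, padded with one empty class at each end
poly_x <- c(h$mids[1] - width, h$mids, h$mids[length(h$mids)] + width)
poly_y <- c(0, h$counts, 0)

# Draw the histogram, then the closed polygon over it
plot(h, xlim = range(poly_x), main = "Histogram with frequency polygon", xlab = "age")
lines(poly_x, poly_y, type = "b", lwd = 2)
```

Taking the midpoints and counts from the histogram object keeps the polygon aligned with the bars even if the bins are changed.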
-
Inferences
But is there any additional information you can derive from a frequency polygon, over and above what the histogram gives?
Not really! In fact the histogram gives more information, since it shows the entire class intervals while a frequency polygon only shows the midpoints. To appreciate this fully, look at a frequency polygon without the corresponding histogram-
In the construction we have made a simplification by plotting each class frequency against the midpoint of the class interval, thereby losing more information
-
Why use Frequency Polygon?
For comparing two sets of data, the corresponding frequency polygons can be drawn on the same graph
Drawing two histograms on the same diagram for comparison purposes is confusing
The insurance company is looking at the profitability of investing in two regions- region A and region B.
The region with a higher proportion of 50 plus population demands more insurance.
The ages.both.regions.csv data gives the ages of a random sample of 507 people in each of region A and region B
Draw two histograms on the same diagram and try to compare-
-
Why use Frequency Polygon
Q. What can you infer? Practically nothing, right!
-
Why Frequency Polygon (Contd)
Draw two frequency polygons on the same diagram and compare.
What can you infer? Can you infer better?
Which region should have the higher insurance demand?
-
Ogives- Cumulative Frequency Curves
Now suppose the insurance company wants answers to more particular questions-
In region A, how many people are 50 years or more?
In region A, how many people are 20 years or less?
In region A, how many people are 60 years or more?
(Similar questions for region B)
It wants to design separate products for the age groups 20 or
less, 20-50, 50 and above and a few additional schemes for 60
plus people
Clearly, it needs to know the cumulative frequency for each
age group!
-
Ogives for Region A
A cumulative frequency curve or ogive is obtained by plotting cumulative, rather than individual, class frequencies. There may be two types of ogives-
A curve showing the number of observations equal to or greater than the lower class limit of each corresponding class- referred to as more than type ogive
A curve showing the number of observations equal to or less than the upper class limit of each corresponding class- referred to as less than type ogive
Successive points are joined by line segments to give the ogive
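The two kinds of cumulative frequency can be worked out from any grouped table. As a small sketch, the code below uses the four-class frequency table of whole-year ages shown earlier (treating it as the Region A data is an assumption); the column names are hypothetical.

```r
# Class limits and frequencies from the grouped frequency table
lower <- c(17, 30, 43, 56)
upper <- c(29, 42, 55, 68)
freq  <- c(298, 142, 56, 10)

# Less than type: observations less than or equal to each upper class limit
less_than <- cumsum(freq)

# More than type: observations greater than or equal to each lower class limit
more_than <- rev(cumsum(rev(freq)))

ogive_table <- data.frame(lower, upper, freq, less_than, more_than)
print(ogive_table)
```

Plotting less_than against the upper limits and more_than against the lower limits, and joining the points, would give the two ogives.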
-
Ogives- For Region A
The black plot gives the less than type ogive
The purple plot gives the more than type ogive
From the diagram the insurance company readily has the answers-
From the less than type ogive, observe that there are 114 people aged less than 22 (around 100 people aged less than 20) and 491 people aged less than 52 (roughly 500 people aged less than 50)
From the more than type ogive, we infer there are very few people above 60, something around 5
Q. Draw the ogives for Region B and try to answer the above questions. Compare with the age distribution for region A
-
Line Diagram
The ages data is an example of cross section data.
Use any of the above diagrams depending on the nature of the cross section data.
But what if we are given time series data- a series of observations corresponding to each time point?
For eg, consider the following data
Year    Households with computer (%)
1985    8.2
1990    15
1994    22.8
1995    24.1
1998    36.6
1999    42.1
2000    51
Source: Falling Through The Net: Toward Digital Inclusion (U.S Department of Commerce, October 2000)
How to represent this graphically?
Need to represent each value corresponding to each given year
-
Line Chart
Plot years on the horizontal axis and mark the values corresponding to each year on the vertical axis
Join the points by line segments. We have our line graph ready!
Think: Can we construct a histogram, ogive or bar chart with this data? Why or why not?
Line diagram is meant for representing chronological data. It exhibits the relationship of the variable with time.
-
Line Chart: Inferences
Shows an increasing trend over the years- that is, from 1985 to 2000, the percentage of households with computers has been consistently rising
From under 10% in 1985 it has crossed 50% in 2000, signifying an over 400% increase from 1985 to 2000
Useful for analysing time trend- that is, the long-term movement of time series data
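The two inferences above can be checked directly from the table of households with computers. This is a small sketch with the table values typed in by hand; the variable names are hypothetical.

```r
# Values from the households-with-computer table
year <- c(1985, 1990, 1994, 1995, 1998, 1999, 2000)
pct  <- c(8.2, 15, 22.8, 24.1, 36.6, 42.1, 51)

# Consistently rising: every successive difference is positive
rising <- all(diff(pct) > 0)

# Percentage increase from the first year to the last
pct_increase <- (pct[length(pct)] - pct[1]) / pct[1] * 100

print(rising)
print(pct_increase)
```

The computed increase is comfortably above 400%, which agrees with the statement on the slide.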
-
Graphical Distortion of Data
As much as graphs can be used to summarize and represent various aspects of data succinctly, they can also be used to distort data
First might be inadequate representation of data. Consider the following line graph
showing the population above poverty line of a hypothetical country A-
Seeing this graph, we conclude that poverty has been falling in this country as the
number of people above poverty line is rising.
[Figure: line chart "People above poverty line", 1990 to 2010, vertical axis from 0 to 1200]
-
Graphical Distortion: Continued
But this graph used inadequate information- this is the table from which the graph has been produced
Draw a line chart showing the relative share of people above poverty line
We see that the relative share of people above poverty line is actually declining, and thus the relative share of people below poverty line is actually rising
Our earlier conclusion, based on an inadequate representation of the data, was fallacious
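The effect described above is easy to reproduce with made-up numbers. In the sketch below every figure is hypothetical (the table from the original slide is not available in this transcript); it only illustrates how a rising count can go together with a falling share when the total population grows faster.

```r
# Hypothetical data for country A (NOT the figures from the original slide)
year  <- c(1990, 1995, 2000, 2005, 2010)
above <- c(400, 550, 700, 850, 1000)     # people above poverty line
total <- c(600, 900, 1300, 1800, 2500)   # total population

share <- above / total                   # relative share above poverty line

print(data.frame(year, above, total, share = round(share, 3)))

# The count rises in every period while the share falls in every period
count_rising  <- all(diff(above) > 0)
share_falling <- all(diff(share) < 0)
```

Plotting above against year gives the reassuring picture; plotting share against year gives the opposite impression from the same underlying data.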
[Figure: line chart "Relative share of people above poverty line", 1990 to 2010, vertical axis from 0 to 0.8]
-
Graphical Distortion of data Contd..
The above is just an example. There are numerous ways in which data can be misrepresented
For eg, one common misuse is distortion with scale
With the explosion of data visualization techniques and sophisticated displays like 3-D charts, data distortion has become easier to achieve
For more information read- http://lilt.ilstu.edu/gmklass/pos138/datadisplay/badchart.htm
-
R Codes
Histogram
# Read the continuous ages data (one column named age)
data=read.csv('ages.continuous.csv',header=TRUE,sep=',')
View(data)
age=data$age
max(age)
colors=c("red", "bisque", "darkslategray", "violet", "orange",
"blue", "pink", "cyan","brown","cornsilk")
# hist for histogram, right=TRUE means right-closed, left-open intervals
hist(age,right=TRUE,col=colors)
# To specify bin widths on your own
bins=seq(17,67,by=5)
hist(age,right=TRUE,breaks=bins,col=colors)
# Example of Histogram with too small binwidth
bins=seq(17,67,by=2)
hist(age,right=TRUE,breaks=bins,col=colors)
# Example of Histogram with too large binwidth
bins=seq(17,67,by=25)
hist(age,right=TRUE,breaks=bins,col=colors)
# Drawing a frequency polygon over a histogram
bins=seq(17,67,by=10)
h=hist(age,right=TRUE,breaks=bins,col=colors,xlim=c(10,75)) # draw the histogram and keep its counts and midpoints
# draw the frequency polygon through the class midpoints, with a zero-frequency class at each end
lines(c(12,h$mids,72),c(0,h$counts,0),lwd=2)
-
R Codes
Frequency Polygon
# Read the ages for both regions (column names as used below)
data=read.csv('ages.both.regions.csv',header=TRUE,sep=',')
RegionA.age=data$RegionA.age
RegionB.age=data$RegionB.age
max(RegionA.age)
min(RegionA.age)
max(RegionB.age)
min(RegionB.age)
bins.A=seq(17,67,by=10)
bins.B=seq(15,75,by=10)
# To draw two frequency polygons on the same graph
plot(c(12,seq(22,62,by=10),72),c(0,as.vector(table(cut(RegionA.age,bins.A))),0),type="b",main="Frequency distribution of age",xlab="age",ylab="frequency",xlim=c(10,80),ylim=c(0,270))
lines(c(10,seq(20,70,by=10),80),c(0,as.vector(table(cut(RegionB.age,bins.B))),0),lwd=2,col="violet")
Line Chart
data=read.csv('Households with computer.csv',header=TRUE,sep=',')
household.comp=data$Households.with.computer.percentage
Year=data$Year
# Plot the percentage against the year and join the points by line segments
plot(Year,household.comp,type="b",col="blue",xlab="Year",ylab="Percentage of Households with Computer",xlim=c(1985,2000),ylim=c(5,65))
title("Line Chart")
-
R Codes
Ogives
# data is assumed to hold the ages, with Region A in the first column
min(data)
max(data)
NumberOfClasses = 10
ClassInterval = (67 - 17)/NumberOfClasses
ClassInterval
ClassEnds = seq(17,67,by=ClassInterval)
classes=cut(data[,1], breaks=ClassEnds)
FrequencyDistribution = table(classes)
CumulativeFrequencies = cumsum(FrequencyDistribution)
cbind(CumulativeFrequencies)
#Less than type Ogive
plot(ClassEnds,c(0,as.vector(CumulativeFrequencies)),type="b",xlim=c(10,70),ylim=c(0,700),main="Ogives",xlab="Class Intervals",ylab="Cumulative Frequency of Age")
#More than type Ogive
cbind(FrequencyDistribution)
Frequency=as.vector(FrequencyDistribution)
# Accumulate from the highest class downwards and plot against the class ends in reverse order
More.than.cum.freq=cumsum(rev(Frequency))
Upper.limit=rev(ClassEnds)
lines(Upper.limit,c(0,More.than.cum.freq),type="b",col="violet")
-
Thank you