Descriptive Statistics and Data Visualization
![Page 1: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/1.jpg)
Diversity in Datasets: (d)econstructing Descriptive Statistics and Data Visualization
Douglas James Joubert
National Institutes of Health Library
![Page 2: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/2.jpg)
Outline
Types of Scale
Levels of Measurement
Descriptive vs. Inferential Statistics
Univariate Analysis
Graphical Methods for Displaying Data
![Page 3: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/3.jpg)
Before you Survey
Consult with a Statistician
Vital to your success
Great way to collaborate
![Page 4: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/4.jpg)
Analysis Always Follows Design
Johnson (2005)
Question
Hypothesis
Experimental Design
Samples
Data
Analysis
![Page 5: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/5.jpg)
Descriptive Statistics
Location: Mean, Mode, Median
Spread (Dispersion): SD, Variance, COV
Shape of the Distribution: Skewness (+ or -), Kurtosis
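The three families of statistics above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's method: it reuses the birth weight values (oz) from the ungrouped frequency distribution slide, assumes "COV" means the coefficient of variation (SD divided by the mean), and uses population-moment formulas for skewness and excess kurtosis, which is only one of several conventions in use.

```python
import statistics as st

# Birth weights (oz) as listed on the ungrouped frequency distribution slide
weights = [32, 58, 64, 64, 67, 88, 88, 91, 93, 94,
           94, 89, 98, 98, 100, 101, 103, 103, 155, 161]

def describe(data):
    """Return location, spread, and shape statistics for a list of numbers."""
    n = len(data)
    mean = st.mean(data)
    sd = st.stdev(data)  # sample SD (n - 1 denominator)
    # Central moments with an n denominator (assumed convention for shape)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return {
        "mean": mean,
        "median": st.median(data),
        "modes": st.multimode(data),   # every value tied for most frequent
        "sd": sd,
        "variance": st.variance(data),
        "cov": sd / mean,              # coefficient of variation (assumed meaning of COV)
        "skewness": m3 / m2 ** 1.5,    # + means a longer right tail, - a longer left tail
        "kurtosis": m4 / m2 ** 2 - 3,  # excess kurtosis; 0 for a normal distribution
    }

summary = describe(weights)
for name, value in summary.items():
    print(name, value)
```

Statistical packages may report slightly different skewness and kurtosis values because they apply small-sample corrections.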
![Page 6: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/6.jpg)
Levels of Measurement
The questions you ask are just as important as what is being measured
Consult, confer, and pick apart your hypothesis
Results are only as good as your poorest measurement
Your measurement will never provide the absolute truth
Try to control as much as possible to reduce error
  Random error: due to chance, in either direction
  Systematic error: due to bias, in one direction
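The difference between the two kinds of error can be illustrated with a small simulation. This is a sketch under invented assumptions: the true value of 100, the noise level of 5, and the bias of +3 are all hypothetical numbers chosen for illustration.

```python
import random
import statistics as st

random.seed(0)       # fixed seed so the sketch is repeatable
TRUE_VALUE = 100.0   # hypothetical true value being measured

# Random error: chance noise that falls on either side of the truth
random_only = [TRUE_VALUE + random.gauss(0, 5) for _ in range(10_000)]

# Systematic error: a constant bias (here +3) pushing every reading one way
biased = [x + 3.0 for x in random_only]

random_gap = st.mean(random_only) - TRUE_VALUE  # shrinks toward 0 as readings accumulate
bias_gap = st.mean(biased) - TRUE_VALUE         # stays near +3 however many readings are taken

print(f"random error only: mean is off by {random_gap:.3f}")
print(f"with systematic error: mean is off by {bias_gap:.3f}")
```

Averaging many readings reduces random error but does nothing for bias, which is why the design of the measurement matters more than the number of measurements.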
![Page 7: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/7.jpg)
Reducing Measurement Error
Triangulate
Different measures (X1, X2) for the same construct
![Page 8: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/8.jpg)
Types of Scale
Nominal or Categorical
  Mutually exclusive groups: gender, sick vs. healthy, remote user vs. library user
  Used for identification purposes only
  Cannot be ranked from smallest to largest
Ordinal
  Mutually exclusive groups that are also ordered in a meaningful manner
  Distance between categories is unknown; you cannot say that a person with a job satisfaction of 2 is twice as satisfied as a person rated as a 1
![Page 9: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/9.jpg)
Types of Scale
Interval
  Ordered groups with equal intervals between any two adjacent classes
  No absolute zero, so you cannot compute ratios, for example, temperature in °C or °F
Ratio
  Interval scale with a true absolute zero, for example, weight
  You can tell how much larger or smaller one value is compared with another
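A short worked example may make the interval versus ratio distinction concrete. The choice of Celsius for the temperature example and the specific readings are assumptions made for illustration.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale with a true zero)."""
    return c + 273.15

# Interval scale: differences are meaningful, ratios are not.
naive_ratio = 20 / 10  # 2.0, yet 20 °C is not "twice as hot" as 10 °C
true_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)  # ratio on a scale with a true zero

# Ratio scale: weight has a true zero, so 10 kg really is twice 5 kg.
weight_ratio = 10 / 5

print(naive_ratio, round(true_ratio, 4), weight_ratio)
```

The ratio computed in Celsius changes if the same temperatures are expressed in Fahrenheit, which is the practical sign that ratios on an interval scale carry no meaning.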
![Page 10: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/10.jpg)
Hierarchy of Measurement
Ratio: Absolute zero
Interval: Distance is meaningful
Ordinal: Characteristics can be ordered
Nominal: Classification is arbitrary
Trochim (2001)
![Page 11: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/11.jpg)
Descriptive vs. Inferential Statistics
Descriptive (summary) statistics describe or characterize data in such a way that none of the original information is lost or distorted (Munro, 2002)
Inferential statistics allow one to draw conclusions about a population based on data obtained from a sample
[Figure: a Sample of known values (S1 to S6) next to a Population of unknown values (?)]
![Page 12: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/12.jpg)
Univariate Descriptive Analysis
Allows one to examine each variable separately to check for data inconsistencies and to see how much each variable varies
Also allows one to check statistical assumptions about the shape of the distribution before moving on to more complex analysis
Univariate descriptive statistics can also be used to determine central tendency, variability, skewness, and kurtosis
![Page 13: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/13.jpg)
Graphical Methods for Displaying Data
Frequency Distributions
Histograms
Plots
Pareto Charts
Boxplots
Error Bar Charts
![Page 14: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/14.jpg)
Frequency distributions
Frequency distributions are a useful tool for categorizing data into meaningful groups
Organize data in tabular form using classes or frequencies
Two main types:
  Categorical: qualitative data such as gender, treatment group or not, religious affiliation
  Ungrouped or grouped quantitative data
![Page 15: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/15.jpg)
Categorical Frequency distributions
AB B A O
O O B B
A A AB AB
A B O A
Class Frequency f
A 5
B 4
O 4
AB 3
Total 16
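The table above can be reproduced by counting each class in the raw data. A minimal sketch using Python's standard library; reading the classes as blood types is an assumption, since the slide does not name the variable.

```python
from collections import Counter

# The 16 raw observations as listed on the slide
observations = "AB B A O O O B B A A AB AB A B O A".split()

freq = Counter(observations)   # class -> frequency f
total = sum(freq.values())

print("Class Frequency f")
for label, f in freq.most_common():  # largest class first
    print(f"{label:<5} {f}")
print(f"Total {total}")
```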
![Page 16: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/16.jpg)
Ungrouped Frequency distributions
Birth weight data (oz)
32 58 64 64
67 88 88 91
93 94 94 89
98 98 100 101
103 103 155 161
![Page 17: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/17.jpg)
Ungrouped Frequency distributions
Birth weight Count (Frequency f)
32 1
58 1
64 2
67 1
88 2
91 1
93 1
…
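An ungrouped frequency distribution simply counts how often each distinct value occurs. A minimal sketch using the birth weight values listed on the previous slide; the variable names are illustrative.

```python
from collections import Counter

# Birth weights (oz) as listed on the slide
weights = [32, 58, 64, 64, 67, 88, 88, 91, 93, 94,
           94, 89, 98, 98, 100, 101, 103, 103, 155, 161]

# One row per distinct value, sorted from smallest to largest
ungrouped = sorted(Counter(weights).items())

print("Birth weight  Count (f)")
for value, f in ungrouped:
    print(f"{value:>12}  {f}")
```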
![Page 18: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/18.jpg)
Grouped Frequency Distribution
A grouped frequency distribution is obtained by constructing classes (intervals) for the data
If the difference between the minimum and maximum values exceeds 15, divide the data into classes
Use a minimum of 5 classes and a maximum of 20
A histogram is a graphical representation of a frequency distribution
![Page 19: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/19.jpg)
Grouped Frequency Distribution
Typically grouped frequency distributions will contain:
  Frequency: the number of values within each category
  Relative frequency: the percentage of values within each category based on the total number of cases
  Valid percent: the percentage of cases in each category based on non-missing scores
  Cumulative frequency: sum of the frequencies for all values at or below the given value
  Cumulative relative frequency: sum of the relative frequencies for all values at or below the given value
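The columns described above can be computed with a short function. This is a sketch under stated assumptions: it reuses the birth weight data (oz) from the earlier slide, and the class layout (7 classes of width 20 starting at 30) is a hypothetical choice that happens to satisfy the 5 to 20 class guideline. Valid percent is omitted because the data have no missing values.

```python
def grouped_frequency(data, start, width, n_classes):
    """Build rows of (class label, f, rf, cf, crf) for equal-width classes."""
    counts = [0] * n_classes
    for x in data:
        idx = int((x - start) // width)
        if 0 <= idx < n_classes:   # values outside the classes are ignored
            counts[idx] += 1
    n = sum(counts)
    rows, cf = [], 0
    for i, f in enumerate(counts):
        cf += f                    # running (cumulative) frequency
        lo = start + i * width
        label = f"{lo}-{lo + width - 1}"  # label assumes whole-number data
        rows.append((label, f, f / n, cf, cf / n))
    return rows

weights = [32, 58, 64, 64, 67, 88, 88, 91, 93, 94,
           94, 89, 98, 98, 100, 101, 103, 103, 155, 161]

# Assumed layout: 7 classes of width 20 oz starting at 30
table = grouped_frequency(weights, start=30, width=20, n_classes=7)

print("Class     f    rf   cf   crf")
for label, f, rf, cf, crf in table:
    print(f"{label:<8} {f:>2} {rf:>5.2f} {cf:>4} {crf:>5.2f}")
```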
![Page 20: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/20.jpg)
Grouped Frequency Distribution of CA patients
Age Frequency rf* cf crf
0 – 10 2 .0069 2 .0069
10 – 20 71 .2473 73 .2542
20 – 30 59 .2055 132 .4597
30 – 40 70 .2439 202 .7036
40 – 50 43 .1498 245 .8534
More 42 .1463 287 .9997
Total 287 1.00
* In Excel, =(E2/$E$8)*100 returns the relative frequency as a percentage; the $ signs force an absolute reference to the total
![Page 21: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/21.jpg)
Table Tips
Use tables to highlight major facts
Keep it simple: tables are usually intended to demystify your data, not make it more difficult to understand
If you are using a software program to create class intervals, make sure the default works with your data
Think of your audience: how can I convey my message without losing important data?
![Page 22: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/22.jpg)
Table Tips
The clustering that best describes the data should be the ultimate guide
Too few or too many class intervals will obscure important information about your data
Tables used to analyze data are rarely published
![Page 23: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/23.jpg)
Charts
Effective way to give the reader a snapshot of the differences and patterns in a set of data
Primary disadvantage of charts is that you lose the details
Things to consider when constructing charts:
  Does my data represent a single moment in time (cross sectional) or does it occur over time (time series)?
  Do I have a qualitative or a quantitative variable?
  If my variable is quantitative, is it discrete or continuous?
Munro (2002)
![Page 24: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/24.jpg)
Bar Charts
For nominal or ordinal data use simple bar charts
Simple bar charts have spaces between categories
Cluster bar charts can be used to represent univariate distributions
Cluster bar charts can also be stacked
![Page 25: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/25.jpg)
Simple Bar Chart
Nominal data
[Figure: simple bar chart by Location: Cytoplasm, Plasma Membrane, Extracellular Space, Nucleus]
![Page 26: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/26.jpg)
Stacked Bar Chart
You are really just stacking two or more columns into a single new column
Compares the percentage that each group contributes to the total across categories
Use 100% stacked columns so you can compare the percentages in each group
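Converting raw counts to a 100% stack is a matter of dividing each group's count by the category total. A minimal sketch; the group names X1 to X3 follow the chart legend, but the counts are hypothetical values invented for illustration.

```python
def to_percent_stack(groups):
    """Convert raw counts per category into percentages that sum to 100 per category."""
    n_categories = len(next(iter(groups.values())))
    totals = [sum(series[i] for series in groups.values())
              for i in range(n_categories)]
    return {
        name: [100 * value / total for value, total in zip(series, totals)]
        for name, series in groups.items()
    }

# Hypothetical counts for three groups across three categories
counts = {"X1": [10, 20, 5], "X2": [30, 20, 5], "X3": [60, 10, 10]}
shares = to_percent_stack(counts)

for name, series in shares.items():
    print(name, series)
```

Each column of the resulting chart has the same height, so differences in the segments reflect composition rather than category size.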
![Page 27: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/27.jpg)
Stacked Bar Chart
[Figure: 100% stacked bar chart of X1, X2, and X3 across categories 1 to 9; vertical axis from 0% to 100%]
![Page 28: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/28.jpg)
Histograms
Best for interval and ratio data
Represent percentages rather than counts
Each histogram has a total area of 100%
Bars have no gaps between them because the data cover a continuous range of values
From a descriptive standpoint, allows one to look at the distribution of variables
Consider grouping the data if range > 15
Height of the vertical axis is important
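The statement that a histogram has a total area of 100% can be checked numerically. In this sketch each bar's height is the share of cases in its class divided by the class width, so the areas sum to 1. The birth weight data come from the earlier slide; the class boundaries are an assumption.

```python
def density_histogram(data, edges):
    """Return bar heights scaled so that the total area of the bars is 1 (100%)."""
    n = len(data)
    heights = []
    for lo, hi in zip(edges, edges[1:]):
        count = sum(1 for x in data if lo <= x < hi)  # class includes lo, excludes hi
        heights.append(count / (n * (hi - lo)))
    return heights

weights = [32, 58, 64, 64, 67, 88, 88, 91, 93, 94,
           94, 89, 98, 98, 100, 101, 103, 103, 155, 161]

# Assumed contiguous class boundaries, so there are no gaps between bars
edges = [30, 50, 70, 90, 110, 130, 150, 170]
heights = density_histogram(weights, edges)

widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
area = sum(h * w for h, w in zip(heights, widths))
print(f"total area = {area:.2f}")
```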
![Page 29: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/29.jpg)
Histogram of Family Terms
Family term Count
G-protein coupled receptor 16
cytokine 14
enzyme 100
growth factor 12
ion channel 16
kinase 68
ligand-dependent nuclear receptor 10
peptidase 24
phosphatase 14
transcription regulator 107
translation regulator 1
transmembrane receptor 35
transporter 57
![Page 30: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/30.jpg)
[Figure: histogram with standard error bars and a normal distribution fit]
![Page 31: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/31.jpg)
Histogram: SEM and Normal Distributions
Standard error of the mean (SEM) is an estimate of how much we would expect the sample mean to vary across repeated samples from the same population
Fit distribution (Normal) estimates the parameters of the normal distribution (mean and standard deviation) based on the analysis sample
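Both quantities can be computed directly from a sample. A minimal sketch, assuming the usual formula SEM = s / sqrt(n) and reusing the birth weight data (oz) from the earlier slide; the normal fit here just estimates the mean and standard deviation, which may differ in detail from what a given statistics package does.

```python
import math
from statistics import NormalDist, stdev

weights = [32, 58, 64, 64, 67, 88, 88, 91, 93, 94,
           94, 89, 98, 98, 100, 101, 103, 103, 155, 161]

n = len(weights)
sem = stdev(weights) / math.sqrt(n)  # standard error of the mean

# Fit a normal distribution by estimating mu and sigma from the sample
fit = NormalDist.from_samples(weights)

print(f"SEM = {sem:.2f}")
print(f"fitted normal: mean = {fit.mean:.2f}, sd = {fit.stdev:.2f}")
```

The SEM shrinks as the sample size grows, while the fitted standard deviation describes the spread of individual values and does not.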
![Page 32: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/32.jpg)
Pareto Charts
A Pareto chart is a special type of histogram that is arranged from largest to smallest
Allows one to determine which values are least important and which are most important
Pareto charts combine a bar chart displaying percentages of categories in the data with a line plot showing cumulative percentages of the categories
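The numbers behind a Pareto chart are a sorted frequency table plus a running cumulative percentage. A minimal sketch: the category names follow the SAS (1990) failure example, but the individual counts are hypothetical, chosen only to be consistent with N=31 and the cumulative percentages printed on that chart.

```python
def pareto_table(counts):
    """Sort categories by count (largest first) and add cumulative percentages."""
    total = sum(counts.values())
    rows, running = [], 0
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((name, count, round(100 * running / total, 2)))
    return rows

# Hypothetical failure counts (N = 31), not taken from the original dataset
failures = {
    "contamination": 14, "oxide defect": 8, "miscellaneous": 3,
    "corrosion": 2, "metallization": 2, "doping": 1, "silicon defect": 1,
}

rows = pareto_table(failures)
for name, count, cum_pct in rows:
    print(f"{name:<15} {count:>3} {cum_pct:>7.2f}")
```

Reading down the cumulative column shows how few categories account for most of the failures.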
![Page 33: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/33.jpg)
Pareto Chart
[Figure: Pareto chart of failure causes (contamination, oxide defect, miscellaneous, corrosion, metallization, doping, silicon defect), N=31; bars show Count and the line shows Cum Percent: 45.16, 70.97, 80.65, 87.10, 93.55, 96.77]
SAS (1990)
![Page 34: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/34.jpg)
2-Way Comparative Pareto Chart
[Figure: grid of Pareto charts of failure causes (contamination, metallization, corrosion, miscellaneous, silicon defect, oxide defect, doping), with rows for before and after and columns for OCT 1, OCT 2, and OCT 3; axes: Count and Cum Percent]
SAS (1990)
![Page 35: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/35.jpg)
Overlay Chart
[Figure: overlay chart of Y (Test1 and Test2) plotted against April]
Similar to a scatterplot, but you are only looking at one variable
SAS (1989–2004)
![Page 37: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/37.jpg)
Plots
Scatterplots look at the relationship between two or more variables
Great way to identify outliers Typically the Y-axis is the DV and X-axis the
IV Using a control variable allows one to
identify different groups For example, the relationship between bp
and weight, and controlling for smoking vs. non-smoking
Why? Because we are controlling for some factor
![Page 38: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/38.jpg)
Simple Scatterplot
[Figure: scatterplot of Humid1:PM (vertical axis) against wrSpeed (horizontal axis)]
SAS (1989–2004)
![Page 39: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/39.jpg)
Simple Scatterplot
[Figure: scatterplot of Humid1:PM (vertical axis) against wrSpeed (horizontal axis) with a fitted line]
In correlation, this is the least-squares line (scary math, but very important)
SAS (1989–2004)
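The "scary math" behind the least-squares line is short when written out. This sketch uses the standard formulas (slope = Sxy / Sxx, and the line passes through the point of means); the speed and humidity readings are hypothetical values, not the data plotted on the slide.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the line minimizing squared vertical errors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return slope, intercept

# Hypothetical readings: X is the IV (speed), Y is the DV (humidity)
speed = [0, 2.5, 5, 7.5, 10]
humidity = [90, 80, 70, 60, 50]

slope, intercept = least_squares(speed, humidity)
print(f"humidity = {intercept:.1f} + ({slope:.1f}) * speed")
```

With real data the points scatter around the line, and the size of that scatter is what the correlation coefficient summarizes.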
![Page 40: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/40.jpg)
Box-and-Whisker Plots
A graphical method based on percentiles
Useful for visualizing the distribution of a variable
Simultaneously displays the median, the IQR, and the smallest and largest values for a group
More compact than a histogram but less revealing
Good tool for identifying outliers and extreme values
Two common types: Outlier Box Plot and Quantile Box Plot
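The values a box plot displays can be computed without drawing anything. This sketch reuses the birth weight data (oz) from the earlier slide and assumes the common 1.5 × IQR rule for flagging possible outliers; the slides do not state which rule the software applies, and quartile conventions vary between packages.

```python
from statistics import quantiles

def box_summary(data):
    """Return quartiles, IQR, whisker ends, and points flagged as possible outliers."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: fences sit 1.5 IQRs beyond the quartiles
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whiskers": (min(inside), max(inside)),  # smallest/largest values not outliers
        "outliers": sorted(x for x in data if x < low_fence or x > high_fence),
    }

weights = [32, 58, 64, 64, 67, 88, 88, 91, 93, 94,
           94, 89, 98, 98, 100, 101, 103, 103, 155, 161]

summary = box_summary(weights)
for name, value in summary.items():
    print(name, value)
```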
![Page 41: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/41.jpg)
Outlier Box Plot
[Figure: outlier box plot on an axis from 0 to 5, labeled with the 25th, 50th (median), and 75th percentiles, the IQR, the smallest and largest values that are not outliers, and possible outliers]
![Page 42: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/42.jpg)
Quantile Box Plot
[Figure: quantile box plot on an axis from 0 to 5]
Quantiles
100.0% maximum 4.7605
99.5% 4.7605
97.5% 3.9211
90.0% 1.8560
75.0% quartile 1.0298
50.0% median 0.4325
25.0% quartile 0.1671
10.0% 0.0451
2.5% 0.0064
0.5% 0.00041
0.0% minimum 0.00041
![Page 43: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/43.jpg)
Contact Information
Douglas J. Joubert, MLIS
Biomedical Informationist
National Institutes of Health Library
Bldg. 10, Room 1L09A
Bethesda, MD 20906-1150
Phone: 301.594.6282
Fax: 301.402.0254
E-mail: [email protected]: [email protected]
http://nihlibrary.nih.gov/
![Page 44: Descriptive Statistics and Data Visualization](https://reader035.fdocuments.net/reader035/viewer/2022062404/554b1a77b4c90569098b5015/html5/thumbnails/44.jpg)
References
1. Johnson, L. L. (2004). Principles and Practices of Clinical Research (Lecture). NIH.
2. SAS (1990). Common causes of failure during the fabrication of integrated circuits. Data from "Selected SAS/QC Software Examples, Release 6.06," SAS Users Group International Conference, April 2, 1990, p. 383.
3. Munro, B. H. (2001). Statistical methods for health care research (4th ed.). Philadelphia: Lippincott Williams & Wilkins.
4. SAS Institute Inc. (1989-2004). SAS Help Files. Cary, North Carolina.