9/8/09
FPP 3-6
Exploratory Data Analysis: One Variable
The five steps of statistical analyses
1. Form the question
2. Collect data
3. Model the observed data
4. Check the model for reasonableness
5. Make and present conclusions
Just to make sure we are on the same page: more (or repeated) vocabulary
Individuals are the objects described by a set of data.
  Examples: employees, lab mice, states…
A variable is any characteristic of an individual that is of interest to the researcher. It takes on different values for different individuals.
  Examples: age, salary, weight, location…
How is this different from a mathematical variable?
Just to make sure we are on the same page #2
A measurement is the value of a variable obtained and recorded on an individual.
  Example: 145 recorded as a person’s weight, 65 recorded as the height of a tree, etc.
Data is a set of measurements made on a group of individuals
The distribution of a variable tells us what values it takes and how often it takes these values
Chest size (possible values)     33-34  35-36  37-38  39-40  41-42  43-44  45-46  47-48
Count (how often each occurs)       21    266   1169   2152   1592    462     71      5
Chest Sizes of 5,738 Militiamen
Two Types of Variables
A categorical/qualitative variable places an individual into one of several groups or categories.
  Examples: Gender, Race, Job Type, Geographic location…
A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.
  Examples: Height, Age, Salary, Price, Cost…
Why two types? Each requires its own summaries (graphical and numerical) and analysis.
I don’t think FPP talks about this enough so I will provide information that is not in FPP.
Example
Age: quantitative; Gender: categorical; Race: categorical; Salary: quantitative; Job type: categorical
Name                 Age  Gender  Race   Salary  Job Type
Fleetwood, Delores    39  Female  White  62,100  Management
Perez, Juan           27  Male    White  47,350  Technical
Wang, Lin             20  Female  Asian  18,250  Clerical
Johnson, LaVerne      48  Male    Black  77,600  Management
Exploratory data analysis
Statistical tools that help examine data in order to describe their main features
Basic strategy
  Examine variables one by one, then look at the relationships among the different variables
  Start with graphs, then add numerical summaries of specific aspects of the data
Exploratory data analysis: One variable
Graphical displays
  Qualitative/categorical data: bar chart, pie chart, etc.
  Quantitative data: histogram, stem-leaf, boxplot, timeplot, etc.
Summary statistics
  Qualitative/categorical: contingency tables
  Quantitative: mean, median, standard deviation, range, etc.
Probability models
  Qualitative: Binomial distribution (others we won’t cover in this class)
  Quantitative: Normal curve (others we won’t cover in this class)
Example categorical data
Summary table
We summarize categorical data using a table. Note that proportions (or percentages) are often called relative frequencies.
Highest Degree Obtained   Number of CEOs   Proportion
(Class)                   (Frequency)      (Relative Frequency)
None                             1              0.04
Bachelors                        7              0.28
Masters                         11              0.44
Doctorate / Law                  6              0.24
Totals                          25              1.00
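As a minimal sketch of where the relative frequencies in the table come from, the Python below divides each count by the total. The variable names are illustrative and not part of the course materials.

```python
# Counts of CEOs by highest degree obtained, copied from the table above.
counts = {"None": 1, "Bachelors": 7, "Masters": 11, "Doctorate / Law": 6}

total = sum(counts.values())

# Relative frequency = count in the category / total count.
relative = {degree: n / total for degree, n in counts.items()}

for degree, n in counts.items():
    # The run of '#' characters is a crude text stand-in for a bar graph.
    print(f"{degree:<16} {n:>2}  {relative[degree]:.2f}  {'#' * n}")
```

The relative frequencies always add up to 1, which is a quick check that no category was left out.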
Bar graph
The bar graph quickly compares the sizes of the four degree groups
The heights of the four bars show the counts for the four degree categories
Pie chart
A pie chart helps us see what part of the whole each group forms
To make a pie chart, you must include all the categories that make up a whole
The biggest change in Durham demographics
Graphically: bar graphs, pie charts
  A bar graph is nearly always preferable to a pie chart. It is easier to compare bar heights than slices of a pie
Numerically: tables with total counts or percents
Quantitative variables
Graphical summary
  Histogram
  Stemplots
  Time plots
  more
Numerical summary
  Mean
  Median
  Quartiles
  Range
  Standard deviation
  more
Histograms
The bins are:
  1.0 ≤ rate < 1.5
  1.5 ≤ rate < 2.0
  2.0 ≤ rate < 2.5
  2.5 ≤ rate < 3.0
  3.0 ≤ rate < 3.5
  3.5 ≤ rate < 4.0
  4.0 ≤ rate < 4.5
  4.5 ≤ rate < 5.0
  5.0 ≤ rate < 5.5
  5.5 ≤ rate < 6.0
  6.0 ≤ rate < 6.5
Histograms
[Figure: histogram of unemployment rates, titled “Distributions”; x-axis: Unemployment Rate, 1.0 to 6.0]
Histograms
Where did the bins come from?
  They were chosen rather arbitrarily
Does choosing other bins change the picture?
  Yes!! And sometimes dramatically
What do we do about this?
  Some pretty smart people have come up with some “optimal” bin widths and we will rely on their suggestions
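To make the effect of bin choice concrete, here is a small sketch that counts the same values with two different bin widths. The `rates` list is hypothetical (it is not the unemployment data from the slides), and `bin_counts` is an illustrative helper, not a function from JMP or any library.

```python
import math
from collections import Counter

def bin_counts(values, start, width):
    """Count values in half-open bins [left, left + width), as in the bin list above."""
    counts = Counter()
    for v in values:
        k = math.floor((v - start) / width)  # index of the bin containing v
        counts[k] += 1
    # Map each non-empty bin's left edge to its count.
    return {round(start + k * width, 10): counts[k] for k in sorted(counts)}

# Hypothetical rates, for illustration only.
rates = [1.6, 2.1, 2.4, 2.6, 2.9, 3.1, 3.2, 3.4, 3.8, 4.4, 4.9, 5.7]

narrow = bin_counts(rates, start=1.0, width=0.5)
wide = bin_counts(rates, start=1.0, width=2.0)

print("width 0.5:", narrow)
print("width 2.0:", wide)
```

With the wide bins, the detail visible with the narrow bins is merged into three bars, so the two pictures of the same data look quite different.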
JMP
Histogram
The purpose of a graph is to help us understand the data
After you make a graph, always ask, “What do I see?”
Once you have displayed a distribution you can see the important features
Histograms
We will describe the features of the distribution that the histogram is displaying with three characteristics
1. Shape
  Symmetric, skewed right, skewed left, uni-modal, multi-modal, bell shaped
2. Center
  Mean, median
3. Spread
  Standard deviation
Body temperatures of 30 people
Incomes from 500 households in 2000 current population survey
Histogram vs. Bar graph
  Spaces mean something in histograms but not in bar graphs
  Shape means nothing with bar graphs
  The biggest difference is that they are displaying fundamentally different types of variables
[Figure: histogram of unemployment rates, titled “Distributions”; x-axis: Unemployment Rate, 1.0 to 6.0]
Numerical summaries of quantitative variables
Want a numerical summary for center and spread
Center
  Mean
  Median
  Mode
Spread
  Range
  Inter-quartile range
  Standard deviation
Mean
To find the mean of a set of observations, add their values and divide by the number of observations
equation 1:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
equation 2:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Mean example
The average age of 20 people in a room is 25. A 28-year-old leaves while a 30-year-old enters the room. Does the average age change? If so, what is the new average age?
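One way to reason about this example is to work with the total rather than the individual ages, since the mean is the total divided by the count. The sketch below follows that reasoning; the variable names are illustrative.

```python
n = 20          # number of people in the room (unchanged by the swap)
old_mean = 25   # average age before the swap

# The mean is total / n, so the total of all ages is n * mean.
old_total = n * old_mean

# One person aged 28 leaves and one aged 30 enters.
new_total = old_total - 28 + 30

new_mean = new_total / n
print(new_mean)
```

The total rises by 2, so the mean rises by 2/20 = 0.1.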
Median
The median is the midpoint of a distribution
  The number such that half the observations are smaller and the other half are larger
To compute a median
  Order observations
  If number of observations is odd the median is the center observation
  If number of observations is even the median is the average of the two center observations
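The three steps for computing a median can be sketched directly in Python. The function name `median_by_hand` and the small example lists are illustrative only; Python's standard library also provides `statistics.median`.

```python
def median_by_hand(values):
    """Median following the steps above: order, then take the center."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)          # step 1: order the observations
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd n: the center observation
    # even n: average of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = median_by_hand([3, 1, 2])
even_example = median_by_hand([4, 1, 3, 2])
print(odd_example, even_example)
```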
Median example
The median age of 20 people in a room is 25. A 28-year-old leaves while a 30-year-old enters the room. Does the median age change? If so, what is the new median age?
Mean vs Median
When histogram is symmetric mean and median are similar
Mean and median are different when histogram is skewed
  Skewed to the right: mean is larger than median
  Skewed to the left: mean is smaller than median
Mean vs Median Symmetric distribution
Mean vs Median Right skewed distribution
Mean vs Median Left skewed distribution
Extreme example
Income in small town of 6 people
  $25,000 $27,000 $29,000 $35,000 $37,000 $38,000
Mean is about $31,833 and median is $32,000
Bill Gates moves to town
  $25,000 $27,000 $29,000 $35,000 $37,000 $38,000 $40,000,000
Mean is $5,741,571 and median is $35,000
Mean is pulled by the outlier while the median is not. The median is a better measure of center for these data
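The figures in this example can be checked with Python's `statistics` module. This is only a verification sketch using the incomes listed above.

```python
import statistics

town = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]
with_gates = town + [40_000_000]

# Before: mean and median are close to each other.
mean_before = statistics.mean(town)
median_before = statistics.median(town)

# After: one extreme value drags the mean far more than the median.
mean_after = statistics.mean(with_gates)
median_after = statistics.median(with_gates)

print(f"before: mean = {mean_before:,.0f}, median = {median_before:,.0f}")
print(f"after:  mean = {mean_after:,.0f}, median = {median_after:,.0f}")
```

The median moves by only $3,000 while the mean moves by several million dollars.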
Is a central measure enough?
A warm, stable climate greatly affects some individuals’ health. Atlanta and San Diego have about equal average temperatures (62° vs. 64°). If a person’s health requires a stable climate, in which city would you recommend they live?
Measures of spread
Range: subtract the smallest value from the largest
Inter-quartile range: subtract the 25th percentile from the 75th percentile
We will focus on and use the Standard Deviation (SD)
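A short sketch of the first two measures, using the six town incomes from the earlier example. Software packages use slightly different conventions for percentiles, so the quartiles from `statistics.quantiles` (default method) may differ a little from those of JMP or a textbook rule; treat the inter-quartile range here as one reasonable convention rather than the only answer.

```python
import statistics

incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]

# Range: largest value minus smallest value.
value_range = max(incomes) - min(incomes)

# quantiles(..., n=4) returns the three cut points: 25th, 50th, 75th percentiles.
q1, q2, q3 = statistics.quantiles(incomes, n=4)

# Inter-quartile range: 75th percentile minus 25th percentile.
iqr = q3 - q1

print(f"range = {value_range:,}, IQR = {iqr:,.0f}")
```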
Standard Deviation
The standard deviation looks at how far observations are from their mean
It is the square root of the average squared deviations from the mean
  Compute distance of each value from mean
  Square each of these distances
  Take the average of these squares (the formula below divides by n − 1 rather than n) and take the square root
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
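The steps above translate almost line by line into code. The sketch below uses the n − 1 divisor from the formula; the data values are hypothetical and `sample_sd` is an illustrative name. The result is compared with `statistics.stdev`, which uses the same divisor.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    # Distance of each value from the mean, squared.
    squared_distances = [(x - mean) ** 2 for x in values]
    # "Average" the squares (dividing by n - 1), then take the square root.
    return math.sqrt(sum(squared_distances) / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical numbers, for illustration only
s = sample_sd(data)
print(round(s, 3), round(statistics.stdev(data), 3))
```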
Example
Standard deviation
Order these histograms by the SD of the numbers they portray. Go from smallest to largest
Histograms on same scale
Problem from text (p. 74, #2)
Which of the following sets of numbers has the smaller SD?
  a) 50, 40, 60, 30, 70, 25, 75
  b) 50, 40, 60, 30, 70, 25, 75, 50, 50, 50
Repeat for these two sets
  c) 50, 40, 60, 30, 70, 25, 75
  d) 50, 40, 60, 30, 70, 25, 75, 99, 1
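After reasoning about the problem by eye, the comparison can be checked numerically. This sketch uses `statistics.stdev`, which divides by n − 1; the text itself may define the SD with a different divisor, so the exact values could differ slightly even though the ordering is the same here.

```python
import statistics

a = [50, 40, 60, 30, 70, 25, 75]
b = a + [50, 50, 50]   # three extra values sitting exactly at the mean
c = a                  # set (c) is the same list as set (a)
d = a + [99, 1]        # two extra values far from the mean

sd = {name: statistics.stdev(values)
      for name, values in {"a": a, "b": b, "c": c, "d": d}.items()}

for name, value in sd.items():
    print(f"set {name}: SD = {value:.2f}")
```

All four sets have mean 50. Adding values at the mean pulls the SD down, and adding values far from the mean pushes it up.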
Properties of SD
Measures the “average” distance from the MEAN (the square root of the average squared distance)
Let s denote the sample standard deviation
  Then s ≥ 0
  When is s = 0?
Has the same unit of measurement as the original observations
Inflated by outliers
Mean and SD
What happens to the mean if you add 5 to every number in a list? What happens to the SD?
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
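A quick numerical check of this question, using the list from the textbook problem as example data. Adding the same constant to every value changes each x_i and the mean by the same amount, so every deviation from the mean is unchanged.

```python
import math
import statistics

numbers = [50, 40, 60, 30, 70, 25, 75]
shifted = [x + 5 for x in numbers]   # add 5 to every number

mean_change = statistics.mean(shifted) - statistics.mean(numbers)
sd_before = statistics.stdev(numbers)
sd_after = statistics.stdev(shifted)

print(f"mean changes by {mean_change}, SD before = {sd_before:.3f}, SD after = {sd_after:.3f}")
```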
Standard deviation
SDs are like measurement units on a ruler
Any quantitative variable can be converted into SD units
  These are often called z-scores
Important formula
$$z\text{-score} = \frac{\text{value} - \text{mean}}{\text{SD}}$$
Example: ACT versus SAT scores
  Which is more impressive: a 1340 on the SAT, or a 32 on the ACT?
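The slide does not give the means and SDs of the two tests, so the sketch below uses made-up placeholder values purely to show the mechanics of the comparison. The numbers in `SAT_MEAN`, `SAT_SD`, `ACT_MEAN`, and `ACT_SD` are assumptions, not actual test statistics.

```python
def z_score(value, mean, sd):
    """Convert a value into SD units: (value - mean) / SD."""
    return (value - mean) / sd

# Hypothetical parameters, for illustration only.
SAT_MEAN, SAT_SD = 1000, 200
ACT_MEAN, ACT_SD = 21, 5

z_sat = z_score(1340, SAT_MEAN, SAT_SD)
z_act = z_score(32, ACT_MEAN, ACT_SD)

print(f"SAT z-score: {z_sat:.2f}, ACT z-score: {z_act:.2f}")
```

Whichever score has the larger z-score is further above its own mean in SD units. With different (real) means and SDs the comparison could come out differently.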
The normal curve
When histogram looks like a bell-shaped curve, SD units are associated with percentages
The percentage of the data in between two different values of SD units equals the area under the normal curve in between the two values of SD units
A bit of notation here.
  We will use the Greek letter µ (pronounced “mew”) to denote a normal curve’s mean
  We will use the Greek letter σ (pronounced “sigma”) to denote a normal curve’s SD
  N(µ, σ) is shorthand for writing normal curve with mean µ and standard deviation σ
Normal curves
Normal curves
Properties of normal curve
In the Normal distribution with mean µ and standard deviation σ, approximately:
  68% of the observations fall within 1 σ of µ
  95% of the observations fall within 2 σs of µ
  99.7% of the observations fall within 3 σs of µ
By remembering these numbers, you can think about Normal curves without constantly making detailed calculations
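These three percentages are areas under the normal curve, and they can be reproduced with `statistics.NormalDist` from Python's standard library (available in Python 3.8 and later). The helper name `share_within` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)   # the curve in SD units

def share_within(k):
    """Area under the normal curve between -k and +k SD units."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

Because the curve is expressed in SD units, the same areas apply to any N(µ, σ).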
Properties of normal curves
IQ
A person is considered to have mental retardation when
1. IQ is below 70
2. Significant limitations exist in two or more adaptive skill areas
3. Condition is present from childhood
What percentage of people have IQs that meet the first criterion of mental retardation?
IQ
A histogram of all people’s IQ scores has a mean of 100 and an SD of 16
How to get % of people with IQ < 70
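One way to get this percentage, assuming the IQ histogram is well described by a normal curve with mean 100 and SD 16, is to convert 70 into SD units and find the area to its left. The sketch below does this with `statistics.NormalDist` instead of a printed normal table.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)   # assumes IQ follows a normal curve

# Step 1: convert 70 to SD units.
z = (70 - iq.mean) / iq.stdev       # (70 - 100) / 16 = -1.875

# Step 2: area under the curve to the left of 70.
share_below_70 = iq.cdf(70)

print(f"z = {z}, share with IQ below 70 = {share_below_70:.1%}")
```

Under this assumption, roughly 3% of people have an IQ below 70.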
More IQ
Reggie Jackson, one of the greatest baseball players ever, has an IQ of 140. What percentage of people have bigger IQs than Reggie?
Marilyn vos Savant, self-proclaimed smartest person in the world, has a reported IQ of 205. What percentage of people have IQ scores smaller than Marilyn’s score?
Mensa is a society for “intelligent people.” To qualify for Mensa, one needs to be in at least the upper 2% of the population in IQ score. What is the score needed to qualify for Mensa?
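These three questions use the same normal curve in three different ways: an area to the right, an area to the left, and a percentile looked up in reverse. A sketch, again assuming IQ scores follow N(100, 16):

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)   # assumes IQ follows a normal curve

# Area to the right of 140 (people with a bigger IQ than Reggie).
share_above_140 = 1 - iq.cdf(140)

# Area to the left of 205 (people with a smaller IQ than Marilyn).
share_below_205 = iq.cdf(205)

# Upper 2% means the 98th percentile: go from an area back to a score.
mensa_cutoff = iq.inv_cdf(0.98)

print(f"above 140: {share_above_140:.2%}")
print(f"below 205: {share_below_205:.8%}")
print(f"Mensa cutoff: about {mensa_cutoff:.0f}")
```

A score of 205 is more than 6 SDs above the mean, so under the normal curve essentially everyone falls below it.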
Checking if data follow normal curve
Look for symmetric histogram
A different method is a normal probability plot. When normal curve is a good fit, points fall on a nearly straight line
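A normal probability plot pairs each ordered observation with the value a normal curve would predict for its rank. The sketch below computes those pairs without drawing them and, as a rough numerical stand-in for "nearly straight", reports their correlation. The plotting positions (i − 0.5)/n are one common convention, the correlation summary is an addition for illustration, and the `roughly_bell` data are hypothetical.

```python
from statistics import NormalDist, mean

def normal_plot_points(data):
    """Pair each sorted observation with a standard normal quantile for its rank."""
    ordered = sorted(data)
    n = len(ordered)
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(z, ordered))

def straightness(points):
    """Correlation of the plotted pairs; values near 1 suggest a nearly straight line."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

roughly_bell = [61, 64, 66, 67, 68, 68, 69, 70, 71, 72, 74, 77]        # hypothetical
skewed = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000, 40_000_000]  # town incomes with Bill Gates

r_bell = straightness(normal_plot_points(roughly_bell))
r_skewed = straightness(normal_plot_points(skewed))
print(f"roughly bell-shaped: r = {r_bell:.3f}; skewed: r = {r_skewed:.3f}")
```

The symmetric sample gives points close to a line, while the single extreme income bends the plot sharply.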