The five steps of statistical Exploratory Data Analysis ...gp42/sta101/notes/FPP3_6_6pp.pdf ·...


9/8/09


FPP 3-6

Exploratory Data Analysis: One Variable

The five steps of statistical analyses
  1. Form the question
  2. Collect data
  3. Model the observed data
  4. Check the model for reasonableness
  5. Make and present conclusions

Just to make sure we are on the same page

More (or repeated) vocabulary
  Individuals are the objects described by a set of data
    examples: employees, lab mice, states…
  A variable is any characteristic of an individual that is of interest to the researcher. Takes on different values for different individuals
    examples: age, salary, weight, location…
  How is this different from a mathematical variable?

Just to make sure we are on the same page #2
  Measurement: the value of a variable obtained and recorded on an individual
    Example: 145 recorded as a person’s weight, 65 recorded as the height of a tree, etc.
  Data is a set of measurements made on a group of individuals
  The distribution of a variable tells us what values it takes and how often it takes these values

Chest Size (possible values)   Count (how often each occurs)
33-34                            21
35-36                           266
37-38                          1169
39-40                          2152
41-42                          1592
43-44                           462
45-46                            71
47-48                             5

Chest Sizes of 5,738 Militiamen
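A distribution like the one above can be written down as values paired with counts. The short Python sketch below is only an illustration: it copies the counts from the table and turns each count into a proportion of the total. The variable names are made up for this example.

```python
# Counts copied from the chest-size table above.
chest_counts = {
    "33-34": 21, "35-36": 266, "37-38": 1169, "39-40": 2152,
    "41-42": 1592, "43-44": 462, "45-46": 71, "47-48": 5,
}

total = sum(chest_counts.values())  # number of individuals in the data

# Proportion of individuals at each possible value: count / total.
relative_freq = {size: count / total for size, count in chest_counts.items()}

for size, count in chest_counts.items():
    print(f"{size}  {count:5d}  {relative_freq[size]:.3f}")
```

The counts add up to the 5,738 individuals in the caption, and the proportions add up to 1.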

Two Types of Variables
  A categorical/qualitative variable places an individual into one of several groups or categories
    examples: Gender, Race, Job Type, Geographic location…
  A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense
    examples: Height, Age, Salary, Price, Cost…
  Why two types?
    Both require their own summaries (graphically and numerically) and analysis.
    I don’t think FPP talks about this enough so I will provide information that is not in FPP.

Example
  Age: quantitative
  Gender: categorical
  Race: categorical
  Salary: quantitative
  Job type: categorical

Name                 Age  Gender  Race   Salary  Job Type
Fleetwood, Delores   39   Female  White  62,100  Management
Perez, Juan          27   Male    White  47,350  Technical
Wang, Lin            20   Female  Asian  18,250  Clerical
Johnson, LaVerne     48   Male    Black  77,600  Management


Exploratory data analysis
  Statistical tools that help examine data in order to describe their main features
  Basic strategy
    Examine variables one by one, then look at the relationships among the different variables
    Start with graphs, then add numerical summaries of specific aspects of the data

Exploratory data analysis: One variable
  Graphical displays
    Qualitative/categorical data: bar chart, pie chart, etc.
    Quantitative data: histogram, stem-leaf, boxplot, timeplot, etc.
  Summary statistics
    Qualitative/categorical: contingency tables
    Quantitative: mean, median, standard deviation, range, etc.
  Probability models
    Qualitative: Binomial distribution (others we won’t cover in this class)
    Quantitative: Normal curve (others we won’t cover in this class)

Example categorical data

Summary table
  We summarize categorical data using a table. Note that percentages are often called Relative Frequencies.

Class                     Frequency        Relative Frequency
Highest Degree Obtained   Number of CEOs   Proportion
None                       1               0.04
Bachelors                  7               0.28
Masters                   11               0.44
Doctorate / Law            6               0.24
Totals                    25               1.00

Bar graph
  The bar graph quickly compares the degrees of the four groups
  The heights of the four bars show the counts for the four degree categories

Pie chart
  A pie chart helps us see what part of the whole each group forms
  To make a pie chart, you must include all the categories that make up a whole


The biggest change in Durham demographics

  Graphically
    Bar graphs, pie charts
    A bar graph is nearly always preferable to a pie chart. It is easier to compare bar heights than slices of a pie
  Numerically: tables with total counts or percents

Quantitative variables
  Graphical summary
    Histogram
    Stemplots
    Time plots
    more
  Numerical summary
    Mean
    Median
    Quartiles
    Range
    Standard deviation
    more

Histograms
  The bins are:
    1.0 ≤ rate < 1.5
    1.5 ≤ rate < 2.0
    2.0 ≤ rate < 2.5
    2.5 ≤ rate < 3.0
    3.0 ≤ rate < 3.5
    3.5 ≤ rate < 4.0
    4.0 ≤ rate < 4.5
    4.5 ≤ rate < 5.0
    5.0 ≤ rate < 5.5
    5.5 ≤ rate < 6.0
    6.0 ≤ rate < 6.5


Histograms

[Figure: histogram titled "Distributions: Unemployment Rate"; x-axis "Unemployment Rate" from 1.0 to 6.0; eleven bins of width 0.5 covering 1.0 ≤ rate < 6.5; bar counts from left to right: 0, 2, 5, 8, 12, 9, 6, 4, 2, 1, 1]

Histograms
  Where did the bins come from?
    They were chosen rather arbitrarily
  Does choosing other bins change the picture?
    Yes!! And sometimes dramatically
  What do we do about this?
    Some pretty smart people have come up with some “optimal” bin widths and we will rely on their suggestions
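To see how the choice of bins can change the picture, here is a small Python sketch. The rates in it are invented for illustration (they are not the data behind the slide), and `histogram_counts` is a hypothetical helper written for this example. It counts the same values with two different bin widths.

```python
import math
from collections import Counter

def histogram_counts(values, start, width):
    """Count values per bin; bin k covers start + k*width <= x < start + (k+1)*width.

    Bins that contain no values are left out of the result.
    """
    counts = Counter(math.floor((x - start) / width) for x in values)
    return {
        (start + k * width, start + (k + 1) * width): counts[k]
        for k in sorted(counts)
    }

# Hypothetical rates, invented for illustration only.
rates = [1.6, 2.1, 2.4, 2.6, 2.9, 3.1, 3.2, 3.4, 3.6, 4.4, 5.2]

narrow = histogram_counts(rates, start=1.0, width=0.5)
wide = histogram_counts(rates, start=1.0, width=2.0)

print("width 0.5:", narrow)
print("width 2.0:", wide)
```

With the narrow bins the counts rise to a peak and then tail off. With the wide bins the same values collapse into three bars and most of that detail is gone.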

JMP Histogram
  The purpose of a graph is to help us understand the data
  After you make a graph, always ask, “What do I see?”
  Once you have displayed a distribution you can see the important features

Histograms
  We will describe the features of the distribution that the histogram is displaying with three characteristics
  1. Shape
       Symmetric, skewed right, skewed left, uni-modal, multi-modal, bell shaped
  2. Center
       Mean, median
  3. Spread
       Standard deviation

Body temperatures of 30 people


Incomes from 500 households in 2000 Current Population Survey

Histogram vs. Bar graph
  Spaces mean something in histograms but not in bar graphs
  Shape means nothing with bar graphs
  The biggest difference is that they are displaying fundamentally different types of variables

[Figure: histogram titled "Distributions: Unemployment Rate"; x-axis "Unemployment Rate" from 1.0 to 6.0; bar counts from left to right: 0, 2, 5, 8, 12, 9, 6, 4, 2, 1, 1]

Numerical summaries of quantitative variables
  Want a numerical summary for center and spread
  Center
    Mean
    Median
    Mode
  Spread
    Range
    Inter-quartile range
    Standard deviation

Mean
  To find the mean of a set of observations, add their values and divide by the number of observations
  equation 1:
    $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
  equation 2:
    $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Mean example
  The average age of 20 people in a room is 25. A 28 year old leaves while a 30 year old enters the room.
    Does the average age change?
    If so, what is the new average age?
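One way to reason about this example is to go back to the sum, since the mean times the number of observations recovers the total. The Python sketch below is just that arithmetic written out. The variable names are chosen for this illustration.

```python
n = 20
old_mean = 25
old_total = n * old_mean          # mean times n recovers the sum of ages: 500

new_total = old_total - 28 + 30   # a 28 year old leaves, a 30 year old enters
new_mean = new_total / n          # there are still 20 people in the room

print(new_mean)                   # 25.1
```

The individual ages are never needed. Only the sum and the count matter for the mean.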

Median
  The median is the midpoint of a distribution
    The number such that half the observations are smaller and the other half are larger
  To compute a median
    Order observations
    If number of observations is odd the median is the center observation
    If number of observations is even the median is the average of the two center observations
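The recipe above can be sketched in a few lines of Python. This is a minimal illustration of those steps, not a replacement for a statistics package. The function name and the sample lists are made up for this example.

```python
def median(observations):
    """Median following the steps above: order, then take the center."""
    ordered = sorted(observations)          # step 1: order the observations
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:                          # odd count: the center observation
        return ordered[mid]
    # even count: average of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 2]))       # 2
print(median([4, 1, 3, 2]))    # 2.5
```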


Median example
  The median age of 20 people in a room is 25. A 28 year old leaves while a 30 year old enters the room.
    Does the median age change?
    If so, what is the new median age?

Mean vs Median
  When histogram is symmetric mean and median are similar
  Mean and median are different when histogram is skewed
    Skewed to the right: mean is larger than median
    Skewed to the left: mean is smaller than median

Mean vs Median  Symmetric distribution

Mean vs Median  Right skewed distribution

Mean vs Median  Left skewed distribution

Extreme example
  Income in small town of 6 people
    $25,000 $27,000 $29,000 $35,000 $37,000 $38,000
  Mean is about $31,833 and median is $32,000
  Bill Gates moves to town
    $25,000 $27,000 $29,000 $35,000 $37,000 $38,000 $40,000,000
  Mean is $5,741,571 and median is $35,000
  Mean is pulled by the outlier while the median is not. The median is a better measure of center for these data
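The numbers in this example can be checked with Python's standard `statistics` module. The sketch below uses the incomes from the slide. The variable names are chosen for this illustration.

```python
from statistics import mean, median

incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]
with_outlier = incomes + [40_000_000]   # Bill Gates moves to town

# Mean (rounded to the nearest dollar) and median, before and after.
print(round(mean(incomes)), median(incomes))            # 31833 32000.0
print(round(mean(with_outlier)), median(with_outlier))  # 5741571 35000
```

One extra observation moves the mean by millions of dollars, while the median moves by only $3,000.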


Is a central measure enough?
  A warm, stable climate greatly affects some individuals’ health. Atlanta and San Diego have about equal average temperatures (62° vs. 64°). If a person’s health requires a stable climate, in which city would you recommend they live?

Measures of spread
  Range:
    subtract the smallest value from the largest
  Inter-quartile range:
    subtract the 25th percentile from the 75th percentile
  We will focus on and use the Standard Deviation (SD)

Standard Deviation
  The standard deviation looks at how far observations are from their mean
  It is the square root of the average squared deviations from the mean
    Compute distance of each value from mean
    Square each of these distances
    Take the average of these squares (the formula below divides by n − 1 rather than n) and square root

    $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
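The steps above translate almost line by line into Python. The sketch below follows the slide's formula, so it divides by n − 1. The function name and the sample numbers are invented for illustration.

```python
import math

def sample_sd(values):
    """Standard deviation with the n - 1 divisor, as in the formula above."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    # Distance of each value from the mean, squared.
    squared_deviations = [(x - mean) ** 2 for x in values]
    # "Average" of the squares (dividing by n - 1), then the square root.
    return math.sqrt(sum(squared_deviations) / (n - 1))

# Made-up numbers for illustration: the mean is 5, the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_sd(data))
```

Python's built-in `statistics.stdev` uses the same n − 1 divisor, so it should agree with this function.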

Example

Standard deviation
  Order these histograms by the SD of the numbers they portray. Go from smallest to largest

Histograms on same scale


Problem from text (p. 74, #2)
  Which of the following sets of numbers has the smaller SD?
    a) 50, 40, 60, 30, 70, 25, 75
    b) 50, 40, 60, 30, 70, 25, 75, 50, 50, 50
  Repeat for these two sets
    c) 50, 40, 60, 30, 70, 25, 75
    d) 50, 40, 60, 30, 70, 25, 75, 99, 1

Properties of SD
  Measures the “average” distance from the MEAN (the square root of the average squared distance)
  Let s denote the sample standard deviation
    Then s ≥ 0
    When is s = 0?
  Has the same unit of measurement as the original observations
  Inflated by outliers

Mean and SD
  What happens to the mean if you add 5 to every number in a list?
  What happens to the SD?

    $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

    $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
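One way to check your answer to these two questions is to try it on a list. The Python sketch below uses set (a) from the text problem earlier and the standard `statistics` module. The variable names are chosen for this illustration.

```python
from statistics import mean, stdev

numbers = [50, 40, 60, 30, 70, 25, 75]   # set (a) from the text problem
shifted = [x + 5 for x in numbers]        # add 5 to every number

print(mean(numbers), mean(shifted))       # the center moves
print(stdev(numbers), stdev(shifted))     # the spread does not
```

Every deviation x_i − x̄ is unchanged by the shift, because both x_i and x̄ go up by 5. That is why the SD stays the same.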

Standard deviation
  SDs are like measurement units on a ruler
  Any quantitative variable can be converted into SD units
    These are often called z-scores
  Important formula

    $z\text{-score} = \frac{\text{value} - \text{mean}}{\text{SD}}$

  Example
    ACT versus SAT scores
    Which is more impressive
      A 1340 on the SAT, or a 32 on the ACT?
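The z-score formula is a one-line function. The Python sketch below is an illustration only. It converts set (a) from the text problem into SD units, because the slide text does not give the SAT and ACT means and SDs needed for the comparison. `z_score` is a name made up for this example.

```python
from statistics import mean, stdev

def z_score(value, center, sd):
    """How many SDs a value lies above (+) or below (-) the mean."""
    return (value - center) / sd

numbers = [50, 40, 60, 30, 70, 25, 75]   # set (a) from the text problem
m, s = mean(numbers), stdev(numbers)

z = [z_score(x, m, s) for x in numbers]
for x, zx in zip(numbers, z):
    print(f"{x:3d} -> {zx:+.2f} SD units")
```

After the conversion the list has mean 0 and SD 1. That is what lets scores on different scales be compared on the same ruler.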

The normal curve
  When histogram looks like a bell-shaped curve, SD units are associated with percentages
  The percentage of the data in between two different values of SD units equals the area under the normal curve in between the two values of SD units
  A bit of notation here.
    We will use the Greek letter µ (pronounced “mew”) to denote a normal curve’s mean
    We will use the Greek letter σ (pronounced “sigma”) to denote a normal curve’s SD
    N(µ, σ) is shorthand for writing normal curve with mean µ and standard deviation σ

Normal curves


Normal curves

Properties of normal curve
  In the Normal distribution with mean µ and standard deviation σ:
    About 68% of the observations fall within 1 σ of µ
    About 95% of the observations fall within 2 σ of µ
    About 99.7% of the observations fall within 3 σ of µ
  By remembering these numbers, you can think about Normal curves without constantly making detailed calculations
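These three percentages are areas under the normal curve, so they can be checked numerically. The Python sketch below uses `NormalDist` from the standard `statistics` module. `share_within` is a helper name made up for this illustration.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)   # the normal curve in SD units

def share_within(k):
    """Area under the normal curve between -k and +k SD units."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.4f}")
```

The printed areas are close to 0.68, 0.95 and 0.997. The rule rounds them to numbers that are easy to remember.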

Properties of normal curves

IQ
  A person is considered to have mental retardation when
    1. IQ is below 70
    2. Significant limitations exist in two or more adaptive skill areas
    3. Condition is present from childhood
  What percentage of people have an IQ that meets the first criterion of mental retardation?

IQ
  A histogram of all people’s IQ scores has a mean of 100 and an SD of 16
  How do we get the % of people with IQ < 70?
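One possible route is to convert 70 into SD units and then look up the area under the normal curve to the left of that value. The Python sketch below assumes the IQ histogram follows a normal curve with mean 100 and SD 16, and it uses `NormalDist` from the standard `statistics` module in place of a printed normal table.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)   # assumes IQ scores follow a normal curve

z = (70 - 100) / 16                 # 70 in SD units: -1.875
share_below_70 = iq.cdf(70)         # area under the curve to the left of 70

print(f"z = {z}")
print(f"share with IQ < 70: {share_below_70:.3f}")
```

Under these assumptions the share works out to roughly 3% of people.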

More IQ
  Reggie Jackson, one of the greatest baseball players ever, has an IQ of 140. What percentage of people have bigger IQs than Reggie?
  Marilyn vos Savant, self-proclaimed smartest person in the world, has a reported IQ of 205. What percentage of people have IQ scores smaller than Marilyn’s score?
  Mensa is a society for “intelligent people.” To qualify for Mensa, one needs to be in at least the upper 2% of the population in IQ score. What is the score needed to qualify for Mensa?
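These three questions use the normal curve in three ways: an area to the right, an area to the left, and a score found from a percentage. The Python sketch below is one way to check your answers. It assumes IQ scores follow a normal curve with mean 100 and SD 16, as on the earlier IQ slide, and the variable names are chosen for this illustration.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)   # assumes IQ scores follow a normal curve

above_140 = 1 - iq.cdf(140)         # area to the right of 140 (z = 2.5)
below_205 = iq.cdf(205)             # area to the left of 205 (z is about 6.56)
mensa_cutoff = iq.inv_cdf(0.98)     # score with 98% of people below it

print(f"share above 140: {above_140:.4f}")
print(f"share below 205: {below_205:.6f}")
print(f"score for the upper 2%: {mensa_cutoff:.1f}")
```

`inv_cdf` runs the table lookup backwards. It starts from a percentage and returns the score.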


Checking if data follow normal curve

 Look for symmetric histogram

 A different method is a normal probability plot. When normal curve is a good fit, points fall on a nearly straight line
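A normal probability plot pairs each sorted observation with the value a normal curve would put at the same rank. The Python sketch below computes those pairs without drawing anything. The plotting position (i − 0.5)/n is one common convention and is an assumption here, `normal_plot_points` is a made-up name, and the temperatures are invented for illustration.

```python
from statistics import NormalDist

def normal_plot_points(data):
    """Pair each sorted observation with the standard normal quantile for its rank."""
    ordered = sorted(data)
    n = len(ordered)
    standard = NormalDist()  # mean 0, SD 1
    # Plotting position (i - 0.5) / n is one common convention, assumed here.
    return [
        (standard.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]

# Hypothetical body temperatures, invented for illustration only.
temps = [97.8, 98.0, 98.2, 98.4, 98.6, 98.6, 98.8, 99.0, 99.3]
points = normal_plot_points(temps)

for z, x in points:
    print(f"{z:6.2f}  {x}")
```

Plotting the second column against the first gives the normal probability plot. If the points fall close to a straight line, the normal curve is a reasonable fit.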