
Page 1: Exploring and Understanding DataExploring and ...faculty.petra.ac.id/halim/index_files/Stat1/Chapter3.pdf · Categorical Data • Displaying Quantitative DataDisplaying Quantitative

Introduction to Statistics : Exploring and Understanding Data


Part II

Instructor : Siana Halim

-S. Halim -


TOPICS
• Introduction
• Data
• Displaying and Describing Categorical Data
• Displaying Quantitative Data
• Describing Distribution Numerically
• The Standard Deviation as a Ruler

References:
• De Veaux, Velleman, Bock, Stats: Data and Models, Pearson Addison Wesley, International Edition, 2005
• John A. Rice, Mathematical Statistics and Data Analysis, Duxbury Press, 1995


4. Displaying Quantitative Data

WHO Months
WHAT Changes in Enron’s stock price in dollars
How Difference of closing price on first day of each month minus the first day of previous month
When 1997 to 2002
Where New York Stock Exchange


Monthly stock price change in dollars of Enron stock for the period January 1997 to December 2001

Year    Jan.    Feb.    Mar.    Apr.     May    June    July    Aug.   Sept.    Oct.    Nov.    Dec.
1997   -1.44   -0.75   -0.69   -0.88    0.12    0.75    0.81   -1.75    0.69   -0.22   -0.16    0.34
1998    0.78    0.62    2.44   -0.28    2.22   -0.50    2.06   -0.88   -4.50    4.12    1.16   -0.50
1999    3.28   -1.22   -1.22    0.47    5.62   -1.59    4.31    1.47   -0.72   -0.38   -3.25    0.03
2000    5.72    4.50    4.50    4.56   -1.25   -1.19   -3.12    8.00    9.31    1.12   -3.19  -17.75
2001   14.38   21.06  -10.11  -12.11    5.84   -9.37   -4.74   -2.69  -10.61   -5.85  -17.16  -11.59


The Distribution of Price Changes

How can we display a quantitative variable ?

• We slice up the entire span of values covered by the quantitative variable into equal-width piles called bins.

• Then we count the number of values that fall into each bin. The bins and the counts in each bin give the distribution of the quantitative variable.

• We can display the bin counts in a display called a histogram. A histogram plots the bin counts as the heights of bars.

No Gaps !
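The binning step can be sketched in a few lines of Python. This is only an illustration: the function name `bin_counts`, the starting point of -2 and the bin width of 1 dollar are assumptions chosen for the example, and only the 1997 row of the Enron table is used.

```python
import math

def bin_counts(values, start, width, n_bins):
    """Count how many values fall into each equal-width bin.

    Bin k covers [start + k*width, start + (k+1)*width). Adjacent bins
    share an edge, so there are no gaps between the bars.
    """
    counts = [0] * n_bins
    for v in values:
        k = math.floor((v - start) / width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Enron monthly price changes for 1997, copied from the table above.
changes_1997 = [-1.44, -0.75, -0.69, -0.88, 0.12, 0.75,
                0.81, -1.75, 0.69, -0.22, -0.16, 0.34]

# Four bins of width 1 dollar covering [-2, 2); an illustrative choice.
counts = bin_counts(changes_1997, start=-2.0, width=1.0, n_bins=4)
print(counts)
```

The bar heights of the histogram are exactly these counts, one bar per bin.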


Histogram

• It displays the shape of the distribution of data values.

• The range of the data is divided into intervals, or bins, and the number or proportion of the observations falling in each bin is plotted.

• A procedure that is often recommended is to plot the proportion of observations falling in the bin divided by the bin width; if this procedure is used, the area under the histogram is 1.

• If the bin width is too small, the histogram is too ragged.

• If the bin width is too wide, the shape is oversmoothed and obscured.

• The choice of bin width is usually made subjectively in an attempt to balance between a histogram that is too ragged and one that oversmooths.

Histogram of melting points of beeswax: (a) bin width = 0.1; (b) bin width = 0.2; (c) bin width = 0.5
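A small Python sketch of the "proportion divided by bin width" scaling is given below. The sample values are hypothetical (they are not the beeswax data), and `density_histogram` is a name invented for this example. Whatever bin width is chosen, the bar areas add up to 1.

```python
import math

def density_histogram(values, start, width, n_bins):
    """Return bar heights scaled as (proportion in bin) / (bin width)."""
    counts = [0] * n_bins
    for v in values:
        k = math.floor((v - start) / width)
        if 0 <= k < n_bins:
            counts[k] += 1
    n = len(values)
    return [c / n / width for c in counts]

# Hypothetical measurements, not taken from the beeswax data.
sample = [63.1, 63.3, 63.4, 63.4, 63.6, 63.7, 63.9, 64.2]

areas = {}
for width, n_bins in [(0.5, 3), (0.25, 6)]:
    heights = density_histogram(sample, start=63.0, width=width, n_bins=n_bins)
    # Area of each bar is height * width; the total should be 1.
    areas[width] = sum(h * width for h in heights)
print(areas)
```

The narrower bins give a more ragged set of heights, but the total area is unchanged, which is what makes histograms with different bin widths comparable.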


• Histograms are frequently used to display data for which there is no assumption of any stochastic model.

• If the data are modeled as a random sample from some continuous distribution, the histogram may be viewed as an estimate of the probability density. A smooth estimate of the same kind is

$$f_h(x) = \frac{1}{n}\sum_{i=1}^{n} W_h(x - x_i)$$

where

$$W_h(x) = \frac{1}{h}\,W\!\left(\frac{x}{h}\right), \qquad W(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}.$$
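The density estimate built from the weight function W can be evaluated directly. The sketch below assumes a small hypothetical sample and an arbitrary bandwidth h = 0.5; the function names are illustrative. It also checks numerically that the estimate integrates to about 1, as a density should.

```python
import math

def gaussian_kernel(u):
    """Standard normal density W(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """f_h(x) = (1/n) * sum_i W_h(x - x_i), with W_h(u) = W(u / h) / h."""
    return sum(gaussian_kernel((x - xi) / h) / h for xi in data) / len(data)

# Hypothetical sample; the bandwidth h is likewise an arbitrary choice.
data = [1.0, 1.5, 2.0, 4.0]
h = 0.5

# Riemann-sum check over [-5, 10] that the estimate has total area near 1.
step = 0.01
grid = [-5.0 + step * k for k in range(1501)]
area = sum(kde(x, data, h) for x in grid) * step
print(round(area, 3))
```

A larger h spreads each observation's weight more widely and gives a smoother curve, which mirrors the ragged versus oversmoothed trade-off for histogram bin widths.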


Stem and Leaf Plot (Tukey 1977)

Conveys information about shape while retaining the numerical information.

The procedure:

1. Split every observation into a stem and a leaf: the stem consists of the leading digit(s) of the observation, while the leaf consists of only a single digit.

2. List the stems vertically in ascending order from top to bottom, draw a straight line to the right of the stems, and add the leaves to the right of the line.

3. Put the leaves in ascending order to the right.

Example: 54 59 35 41 46 25 47 60 54 46 49 46 41 34 22

2 | 2 5
3 | 4 5
4 | 1 1 6 6 6 7 9
5 | 4 4 9
6 | 0

Stem-and-leaf displays contain all the information found in a histogram and, when carefully drawn, satisfy the area principle and show the distribution. In addition, stem-and-leaf displays preserve the individual data values.
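The three-step procedure can be sketched in Python for two-digit whole numbers, using the tens digit as the stem and the units digit as the leaf. The function name `stem_and_leaf` is an assumption made for this example. Since 46 occurs three times in the example data, the row for stem 4 carries three 6s.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines for two-digit integers (tens = stem, units = leaf)."""
    leaves = defaultdict(list)
    for v in sorted(values):              # sorting puts the leaves in ascending order
        leaves[v // 10].append(v % 10)
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):   # stems in ascending order
        lines.append(f"{stem} | " + " ".join(str(d) for d in leaves[stem]))
    return lines

data = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
for line in stem_and_leaf(data):
    print(line)
```

Every original value can be read back from the display, which is the sense in which the plot preserves the individual data values.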


N = 59   Median = 63.5   Quartiles = 63.34, 63.83

Decimal point is 1 place to the left of the colon

 1   1   628 : 5
 1   0   629 :
 4   3   630 : 3 5 8
 7   3   631 : 0 3 3
 9   2   632 : 4 7 7 7
18   9   633 : 0 0 1 4 4 6 6 6 9 9
23   5   634 : 0 1 3 3 5
    10   635 : 0 0 0 0 1 1 3 6 6 8
26   7   636 : 0 0 1 3 6 8
19   2   637 : 8 8
17   6   638 : 3 3 4 6 6 8
11   5   639 : 2 2 2 2 3
 6   0   640 :
 6   1   641 : 2
 6   1   642 : 1
 2   0   643 :
 2   2   644 : 0 2

The first column gives the order (the running count from the nearer end) used for finding the median, quartiles, and so on; the second column gives the frequency of each stem.

The Advantages

1. It shows where the distribution is located together with its numerical values.

2. Whether the distribution is symmetric, bimodal, etc. can be seen directly.

3. The existence of gaps in the data can be seen directly.

The Disadvantage

Straightforward stem-and-leaf plots do not work well for data that range over several orders of magnitude. (Use the logarithms of the data, or a histogram.)


Shape, Center and Spread

What Is the Shape of the Distribution ?

1. Does the histogram have a single, central hump or several separated bumps ?

A histogram with one main peak is called unimodal; histograms with two peaks are bimodal, and those with three or more are called multimodal.

A histogram that doesn’t appear to have any mode and in which all the bars are approximately the same height is called uniform.

The mode is sometimes defined as the single value that appears most often.


2. Is the histogram symmetric ? Can you fold it along a vertical line through the middle and have the edges match pretty closely, or are more of the values on one side ?

The (usually) thinner ends of a distribution are called the tails. If one tail stretches out farther than the other, the histogram is said to be skewed to the side of the longer tail.


3. Do any unusual features stick out ? Often such features tell us something interesting or exciting about the data. You should always mention any outliers that stand off away from the body of the distribution.

• An outlier can be the most informative part of your data. (Or it might just be an error.) But don’t throw it away without comment. Treat it specially and discuss it when you tell about your data. (Or find the error and fix it if you can.)

• Generally, we look at the main body of the data. If it seems roughly symmetric, then stragglers are best regarded as outliers. If the main part of the data is skewed, then the long tail that continues that skewness is part of the overall pattern and probably not full of outliers.


Where Is the Center of the Distribution ?

• The center is an easy description of a typical value and a concise summary of the whole batch of numbers.

• When a histogram is unimodal and symmetric, it’s easy to find its center. It’s right in the middle.

• If the histogram is skewed, defining the center is more of a challenge. And if the histogram has more than one mode, the center might not even be a useful concept.

How Spread Out Is the Distribution ?

The center gives a typical value, but not everyone is typical. Variation matters. Statistics is about variation, but how can we see it ? We can look to see whether all the values of the distribution are tightly clustered around the center or spread out.


Order, Please !

When data are collected in a specific order, like Enron’s data, you should check to see if they have a pattern when plotted in that order.


Re-expressing Skewed Data to Improve Symmetry

Skewed distributions are hard to summarize. It's hard to know what we mean by the "center" of a skewed distribution, so it's hard to pick a typical value to summarize the distribution.

One way to make a skewed distribution more symmetric is to re-express, or transform, the data by applying a simple function. For example, we could take the square root or logarithm of each data value.
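As a rough illustration of re-expression, the Python sketch below takes base-2 logarithms of a deliberately right-skewed, hypothetical batch and compares the gap between mean and median before and after. The gap is only a crude symmetry check chosen for this example, not a method prescribed by the slides.

```python
import math
import statistics

# Hypothetical right-skewed data: each value is double the previous one.
raw = [1, 2, 4, 8, 16, 32, 64, 128, 256]
logged = [math.log2(v) for v in raw]

def mean_minus_median(values):
    """A crude symmetry check: close to zero for a symmetric batch."""
    return statistics.mean(values) - statistics.median(values)

print(mean_minus_median(raw), mean_minus_median(logged))
```

On the raw scale the long right tail pulls the mean far above the median; after taking logs the values are evenly spaced and the two agree.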


What Can Go Wrong ?

• Don’t make a histogram of a categorical variable. A histogram or stem-and-leaf display of a categorical variable makes no sense. A bar chart or pie chart may do better.

• Choose a scale appropriate to the data.

• Avoid inconsistent scales.


Label clearly. Variables should be identified clearly and axes labeled so the reader knows what the plot displays.

What's wrong ?

The horizontal axes are inconsistent. Both lines show trends over time, BUT the tuition sequence starts in 1965 and the ranking graph starts in 1989 !

The vertical axis isn't labeled. That hides the fact that it's inconsistent. Does it graph dollars (of tuition) or ranking (of Cornell University) ?


5. Describing Distribution Numerically

WHO 191 countries (not WHO !)
WHAT DALE (disability adjusted life expectancy)
Unit Year
When Data are for babies born in 1999
Where Earth
Why Annual report by World Health Organization


When distributions are mixed together, the result is often a multimodal and/or skewed distribution.

Center : Finding the Median

When we think of a typical value, we usually look for the center of the distribution.

The Median

• If the sample size is an odd number, the median is defined to be the middle value.

• If the sample size is an even number, the median is defined to be the average of the two middle values.


Spread : Home on the Range

• The more the data vary, however, the less the median alone can tell us. So we need to measure how much the data values vary around the center.

• The range of the data is defined as the difference between the maximum and minimum values:

Range = max - min

The Interquartile Range

A better way to describe the spread of a variable might be to ignore the extremes and concentrate on the middle of the data.


Split the sorted data at the median and then find the medians of each half. These values are the quartiles, and they border the middle half of the data.

The difference between the quartiles is the interquartile range, IQR:

IQR = upper quartile - lower quartile

(The upper quartile is the 75th percentile and the lower quartile is the 25th percentile.)

Five-Number Summary

The five-number summary of a distribution reports its median, quartiles and extremes.

The five-number summary for the DALE data:

Max 73.8 years
Q3 64.6
Median 58.5
Q1 46.9
Min 29.5
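For the DALE summary, the IQR works out to 64.6 - 46.9 = 17.7 years. The Python sketch below computes a five-number summary with the "medians of the two halves" rule on a small hypothetical batch (not the DALE data). It assumes one particular convention, leaving the middle value out of both halves when n is odd; other conventions give slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), quartiles as medians of the halves.

    Assumed convention: when n is odd, the middle value is left out of
    both halves.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

# Hypothetical ages, not the DALE data.
ages = [29, 35, 41, 47, 52, 58, 60, 64, 66, 70, 74]
low, q1, med, q3, high = five_number_summary(ages)
iqr = q3 - q1
print(low, q1, med, q3, high, iqr)
```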


Making Boxplots (Tukey)

Whenever we have a five-number summary of a variable, we can put the information together in one graphical display called a boxplot.

The construction of Boxplot.

1. Draw a single vertical axis spanning the range of the data. Draw short horizontal lines at the lower and upper quartiles and at the median. Then connect them with vertical lines to form a box. The box can have any width that looks OK.

Boxplot of data Platinum


The construction of Boxplot (continued)

2. To help us construct the boxplot, we make “fences” around the main part of the data. We place the upper fence 1.5 IQRs above the upper quartile and the lower fence 1.5 IQRs below the lower quartile. (The fences are just for construction and are not part of the display.)

3. We use the fences to grow “whiskers”. Draw lines from the ends of the box up and down to the most extreme data values found within the fences. If a data value falls outside one of the fences, we do not connect it with a whisker.

4. Finally, we add the outliers by displaying any data values beyond the fences with special symbols.
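The fence, whisker and outlier steps can be sketched in Python as follows. The example uses the platinum measurements that appear later in these slides; the function name `boxplot_parts` and the median-of-halves quartile rule are assumptions made for this illustration, and statistics packages may use other quartile definitions.

```python
import statistics

def boxplot_parts(values):
    """Quartiles, fences, whisker ends and outliers for a boxplot."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    q1 = statistics.median(xs[:half])
    q3 = statistics.median(xs[half + 1:] if n % 2 else xs[half:])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr          # fences are construction aids only
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    outliers = [x for x in xs if x < lower_fence or x > upper_fence]
    return {"q1": q1, "q3": q3,
            "fences": (lower_fence, upper_fence),
            "whiskers": (inside[0], inside[-1]),   # most extreme values within the fences
            "outliers": outliers}

platinum = [136.3, 136.6, 135.8, 135.4, 134.7, 135.0, 134.1, 143.3,
            147.8, 148.8, 134.8, 135.2, 134.9, 146.5, 141.2, 135.4,
            134.8, 135.8, 135.0, 133.7, 134.4, 134.9, 134.8, 134.5,
            134.3, 135.2]
parts = boxplot_parts(platinum)
print(parts)
```

Under these assumptions the five values beyond the upper fence are the same five observations that the stem-and-leaf plot of the platinum data flags as "High".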


Comparing Groups with Boxplots


Methods Based on the Cumulative Distribution Function

The Empirical Cumulative Distribution Function

Suppose that $x_1, \dots, x_n$ is a batch of numbers (a sample). The empirical cumulative distribution function (ecdf) is defined as

$$F_n(x) = \frac{1}{n}\,\#(x_i \le x)$$

• $F_n(x)$ is the proportion of the data that is less than or equal to $x$.

• $F_n$ is a step function with a jump of height $1/n$ at each $x_i$; if there are $r$ observations with the same value $x$, $F_n$ has a jump of height $r/n$ at $x$.

Natural variability in melting points: 90% < 64.2°C, 12% < 63.2°C.

An addition of 5% microcrystalline wax might be difficult to detect, but an addition of 10% would be detectable.
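The definition translates almost word for word into Python. The batch below is hypothetical and contains a tie at 3, so the jump there is 2/5 rather than 1/5; the function name `ecdf` is chosen for the example.

```python
def ecdf(data):
    """Return F_n, where F_n(x) is the proportion of data values <= x."""
    xs = list(data)
    n = len(xs)

    def F(x):
        return sum(1 for v in xs if v <= x) / n

    return F

# Hypothetical batch with two observations equal to 3.
batch = [1, 3, 3, 4, 7]
F = ecdf(batch)
print(F(0), F(1), F(2.9), F(3), F(7))
```

Evaluating F just below and at a data value shows the step: the function is flat between observations and jumps at each one.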


The Survival Function

The survival function is equivalent to the cumulative distribution function and is defined as:

$$S_n(t) = 1 - F_n(t)$$

• The data consist of times until failure or death and are nonnegative.

• Data of this type occur in medical and reliability studies.

• In these cases, S(t) is simply the probability that the lifetime will be longer than t.

The hazard function h(x) is obtained from the logarithm of the survival function:

$$h(x) = -\frac{d}{dx}\log S(x)$$
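A minimal Python sketch of the empirical survival function follows, using hypothetical failure times. Counting the lifetimes strictly greater than t gives the same result as computing 1 - F_n(t).

```python
def empirical_survival(times):
    """Return S_n, where S_n(t) = 1 - F_n(t) is the proportion of lifetimes > t."""
    xs = list(times)
    n = len(xs)

    def S(t):
        return sum(1 for v in xs if v > t) / n

    return S

# Hypothetical failure times in hours.
lifetimes = [2, 5, 5, 8, 13, 21]
S = empirical_survival(lifetimes)
print(S(0), S(5), S(21))
```

S starts at 1 (everything survives past time 0), never increases, and reaches 0 at the largest observed lifetime.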


Summarizing Symmetric Distributions

The Arithmetic Mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The mean is the point at which the histogram would balance.

Mean or Median ?

For skewed data, it’s better to report the median than the mean as a measure of center.

$$x_i = \mu + \beta + \varepsilon_i$$

• $x_i$ is the value of the ith measurement

• $\mu$ is the true value

• $\beta$ represents bias in the measurement procedure

• $\varepsilon_i$ is the random error, usually assumed to be independent and identically distributed random variables.


Example


Platinum <- c(136.3, 136.6, 135.8, 135.4, 134.7, 135.0, 134.1, 143.3,
              147.8, 148.8, 134.8, 135.2, 134.9, 146.5, 141.2, 135.4,
              134.8, 135.8, 135.0, 133.7, 134.4, 134.9, 134.8, 134.5,
              134.3, 135.2)

 1   1   133 : 7
 4   3   134 : 1 3 4
11   7   134 : 5 7 8 8 8 9 9
     6   135 : 0 0 2 2 4 4
 9   2   135 : 8 8
 7   1   136 : 3
 6   1   136 : 6

High : 141.2 143.3 146.5 147.8 148.8 (outliers)

On this stem-and-leaf plot, the outlying observations have been isolated and flagged as high.


Example (continued)

• In their analysis, Hampson and Walker set aside the seven largest observations and the smallest observation and found the average of the remaining observations to be 134.9. (The arithmetic mean is 137.05.)

• Note that the arithmetic mean lies outside the stem-and-leaf display and is larger than the bulk of the data; it is clearly not a good descriptive measure of the “center” of the data.

• Median = 135.1
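The three figures quoted above can be reproduced with a short Python check on the platinum values (transcribed here from the R vector on the previous slide). Only the standard `statistics` module is used.

```python
import statistics

platinum = [136.3, 136.6, 135.8, 135.4, 134.7, 135.0, 134.1, 143.3,
            147.8, 148.8, 134.8, 135.2, 134.9, 146.5, 141.2, 135.4,
            134.8, 135.8, 135.0, 133.7, 134.4, 134.9, 134.8, 134.5,
            134.3, 135.2]

mean_all = statistics.mean(platinum)
median_all = statistics.median(platinum)

# Set aside the smallest and the seven largest observations, as described above.
kept = sorted(platinum)[1:-7]
mean_kept = statistics.mean(kept)

print(round(mean_all, 2), round(median_all, 1), round(mean_kept, 1))
```

The five high values pull the mean up by about two units, while the median barely notices them.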


The Trimmed Mean

• The 100α% trimmed mean is calculated as follows:

• Order the data

• Discard the lowest 100α% and the highest 100α%

• Take the arithmetic mean of the remaining data.

• It is generally recommended that the value chosen for α be from 0.1 to 0.2.

• Formally, we may write the trimmed mean as

$$\bar{x}_\alpha = \frac{x_{([n\alpha]+1)} + \cdots + x_{(n-[n\alpha])}}{n - 2[n\alpha]}$$

• where $[n\alpha]$ denotes the greatest integer less than or equal to $n\alpha$

• The median can be regarded as a 50% trimmed mean
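A direct Python transcription of the trimmed-mean formula is sketched below and applied to the platinum data. The function name is illustrative, and the sketch assumes α < 0.5 so that at least one observation remains after trimming.

```python
import math

def trimmed_mean(values, alpha):
    """100*alpha% trimmed mean: drop [n*alpha] values from each end, then average.

    Assumes 0 <= alpha < 0.5 so that some observations remain.
    """
    xs = sorted(values)
    n = len(xs)
    k = math.floor(n * alpha)        # [n*alpha], the greatest integer <= n*alpha
    kept = xs[k:n - k]
    return sum(kept) / len(kept)

platinum = [136.3, 136.6, 135.8, 135.4, 134.7, 135.0, 134.1, 143.3,
            147.8, 148.8, 134.8, 135.2, 134.9, 146.5, 141.2, 135.4,
            134.8, 135.8, 135.0, 133.7, 134.4, 134.9, 134.8, 134.5,
            134.3, 135.2]

for alpha in (0.0, 0.1, 0.2):
    print(alpha, round(trimmed_mean(platinum, alpha), 2))
```

With n = 26 and α = 0.2, [nα] = 5, so five values are dropped from each end and the remaining 16 are averaged. As α grows, the result moves from the ordinary mean toward the median.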


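What About Spread ? The Standard Deviation

A measure of dispersion, or scale, gives a numerical indication of the scatteredness of a batch of numbers.

Variance (computed about the mean $\bar{x}$):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The standard deviation $s$ is the square root of the variance.

Trimmed Variance: the analogous quantity computed about the trimmed mean $\bar{x}_\alpha$, using the $(1 - 2\alpha)\,n$ observations that remain after trimming.

Median Absolute Deviation (computed about the median $\tilde{x}$):

$$\mathrm{MAD} = \text{median of } |x_i - \tilde{x}|$$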
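To see how differently these measures react to outliers, the Python sketch below computes the standard deviation and the median absolute deviation for the platinum data. `statistics.stdev` divides by n - 1, matching the variance formula on this slide.

```python
import statistics

platinum = [136.3, 136.6, 135.8, 135.4, 134.7, 135.0, 134.1, 143.3,
            147.8, 148.8, 134.8, 135.2, 134.9, 146.5, 141.2, 135.4,
            134.8, 135.8, 135.0, 133.7, 134.4, 134.9, 134.8, 134.5,
            134.3, 135.2]

s = statistics.stdev(platinum)                  # square root of the n - 1 variance
center = statistics.median(platinum)
mad = statistics.median([abs(x - center) for x in platinum])

print(round(s, 2), round(mad, 2))
```

The standard deviation is dominated by the five high values, whereas the MAD reflects the spread of the bulk of the data.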


Thinking About Variation

Measures of spread help us to be precise about what we don’t know.

If many data values are scattered far from the center, the IQR and the standard deviation will be larger.

If the data values are close to the center, then these measures of spread will be small.

Measures of spread tell how well other summaries describe the data.


Shape, Center and Spread

•If the shape is skewed, report the median and IQR

•If the shape is symmetric, report the mean and standard deviation

•If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed.


What Can Go Wrong ?

• Do a reality check. Don’t let the computer (or calculator) do your thinking for you. Make sure the calculated summaries make sense.

• Don’t forget to sort the values before finding the median or percentiles.

• Don’t compute numerical summaries of a categorical variable. The mean zip code or the standard deviation of social security numbers is not meaningful. If the variable is categorical, you should instead report summaries such as percentages.

• Watch out for multiple modes.

• Be aware of slightly different methods. There are at least six reasonable definitions of quartile alone. If you compare different statistics packages or calculators, you may find that they give slightly different answers for the same data. But this is OK !


• Beware of outliers. If the data have outliers but are otherwise unimodal, consider holding the outliers out of the further calculations and reporting them individually. If you can find a simple reason for the outlier (for instance, a data transcription error) you should remove or correct it. If you cannot do either of these, then choose the median and IQR to summarize the center and spread.

• Make a picture (make a picture, make a picture). Summarizing a variable with its mean and standard deviation when you have not looked at a histogram or dot plot to check for outliers invites disaster. You may find yourself drawing absurd or dangerously wrong conclusions about the data. Don’t accept a mean and standard deviation blindly without some evidence that the variable they summarize has no outliers or severe skewness.


Be careful when comparing groups that have very different spreads

WHO Smokers, nonsmokers, and passive smokers
WHAT Blood cotinine levels
UNITS Nanograms per milliliter (ng/ml)

Taking logs


6. The Standard Deviation as a Ruler

The women’s heptathlon in the Olympics consists of seven track and field events : the 200-m and 800-m runs, 100-m high hurdles, shot put, javelin, high jump, and long jump. Somehow the performances in such different events have to be combined into one score. How can performances in such different events be compared ?


The Standard Deviation as a Ruler

The trick in comparing very different-looking values is to use the standard deviations as our rulers. The standard deviation is the most common measure of variation.

Important questions : “How far is this value from the mean ?” or “How different are these two statistics ?”. The answer in every case will be to measure the distance or difference in standard deviations.


Standardizing

We compared the heptathlon results by asking how many standard deviations they were from the event means. We can write this as

$$z = \frac{x - \bar{x}}{s}$$

We call the resulting values standardized values, and denote them with the letter z. Usually we just call them the z-scores.

Standardized values have no units. That makes all z-scores comparable regardless of the original units of the data.
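As an illustration of comparing results measured in different units, the Python sketch below standardizes a long jump (in metres) and an 800-m run (in seconds). All means, standard deviations and performances are hypothetical figures invented for the example; none come from the slides.

```python
def z_score(value, mean, sd):
    """Distance of a value from the mean, measured in standard deviations."""
    return (value - mean) / sd

# Hypothetical event summaries and performances (not from the slides).
long_jump_z = z_score(6.62, mean=6.16, sd=0.23)     # metres
run_800_z = z_score(129.0, mean=137.0, sd=5.0)      # seconds

print(round(long_jump_z, 2), round(run_800_z, 2))
```

Both results are now unit-free and can be put side by side. Note that for a timed event a negative z-score is the better performance, because a smaller time is faster.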


z-scores measure the distance of each data value from the mean in standard deviations. A z-score of 2.0 indicates that a data value is 2 standard deviations above the mean.

We are using the standard deviation as a ruler to measure statistical distance from the mean.

Benefits of Standardizing

Standardizing data brings several benefits.

• Standardized values have been converted from their original units to the standard statistical unit of standard deviations from the mean.

• This makes it possible to compare values that are measured on different scales, with different units, or for different populations.


Shifting Data

WHO The 28 companies listed on the stock exchange of Trinidad & Tobago
WHAT Dividends paid in the year 2000
Unit Percent
When Calendar year 2000
Where Trinidad & Tobago
Why Annual report of stock exchange


When we shift data values by subtracting or adding the same number to all the data values, does the distribution change ?

Adding (or subtracting) a constant amount to each value just adds (or subtracts) the same constant to the mean. The same is true for the median and the quartiles.

Adding a constant increases all the data values equally, so the distribution just shifts. Its shape doesn’t change and neither does the spread. None of the measures of spread we’ve discussed (the range, the IQR, the standard deviation) changes.

Adding a constant to every data value adds the same constant to measures of center and percentiles, but leaves measures of spread unchanged.


Rescaling Data

When we divide or multiply all the data values by any constant value, both measures of location (such as the mean and median) and measures of spread (such as the range, the IQR, and the standard deviation) are divided or multiplied by that same value.
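The shifting and rescaling rules can be checked numerically. The Python sketch below uses a small hypothetical batch (not the Trinidad & Tobago dividend data), adds 3 to every value, and separately multiplies every value by 2; the helper name `summary` is an assumption for the example.

```python
import statistics

def summary(values):
    """Mean, median, range and standard deviation of a batch."""
    xs = sorted(values)
    return {"mean": statistics.mean(xs), "median": statistics.median(xs),
            "range": xs[-1] - xs[0], "sd": statistics.stdev(xs)}

# Hypothetical batch, chosen so the summaries are round numbers.
data = [2.0, 4.0, 4.0, 5.0, 10.0]
shifted = [x + 3.0 for x in data]     # shift: add a constant
scaled = [x * 2.0 for x in data]      # rescale: multiply by a constant

original, after_shift, after_scale = summary(data), summary(shifted), summary(scaled)
print(original)
print(after_shift)
print(after_scale)
```

Shifting moves the mean and median by 3 and leaves the range and standard deviation alone; rescaling doubles all four summaries.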


Back to z-scores

When we subtract the mean of the data from every data value, we shift the mean to zero. Such a shift doesn’t change the standard deviation.

When we divide each of these shifted values by s, however, the standard deviation should be divided by s as well. Since the standard deviation was s to start with, the new standard deviation becomes 1.

Consider the three aspects of a distribution : the shape, center and spread :

• Standardizing does not change the shape of the distribution of a variable

• Standardizing changes the center by making the mean 0

• Standardizing changes the spread by making the standard deviation 1
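A quick numerical check of these three statements, sketched in Python on a hypothetical batch: after standardizing, the mean is 0 and the standard deviation is 1 (up to rounding error), and the ordering of the values, and hence the shape, is unchanged.

```python
import statistics

def standardize(values):
    """Convert each value to its z-score using the batch mean and standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - m) / s for x in values]

# Hypothetical batch; any batch with at least two distinct values would do.
data = [2.0, 4.0, 4.0, 5.0, 10.0]
z = standardize(data)

print([round(v, 3) for v in z])
print(round(statistics.mean(z), 12), round(statistics.stdev(z), 12))
```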
