Summarising and presenting data


Transcript of Summarising and presenting data

Page 1: Summarising and presenting data

Summarising and presenting data

www.anu.edu.au/nceph/surfstat/

Page 2: Summarising and presenting data

Types of data

• Two broad types: qualitative and quantitative

• Qualitative data arise when the observations fall into separate distinct categories. Examples are:
– Colour of eyes: blue, green, brown, etc.
– Exam result: pass or fail
– Socio-economic status: low, middle or high

• Such data are discrete

Page 3: Summarising and presenting data

Quantitative Data

• Quantitative or numerical data arise when the observations are counts or measurements

• Discrete if the measurements are integers
– number of people in a household
– number of cigarettes smoked per day

• Continuous if the measurements can take any value (usually within some range)
– weight
– height
– time

Page 4: Summarising and presenting data

Variables and statistics

• Quantities such as sex and weight are called variables, because their values vary from one observation to another.

• Numbers calculated to describe important features of the data are called statistics. For example, the proportion of females and the average age of unemployed persons in a sample of residents of a town are statistics.

Page 5: Summarising and presenting data

Example: Commodore data

• Prices of n=38 second-hand cars

6000 6700 3800 7000 5800 9975 10500 5990

20000 11990 16500 10750 9500 12995 12500 8000

9900 18000 9500 9400 7250 15000 4500 8900

9850 9000 5800 29500* 15000 9000 4250 4990

11000 9990 2200 4000 13500 14500

• Continuous data, need to summarise

Page 6: Summarising and presenting data

Constructing a frequency distribution

• Calculate the range and divide it by the chosen number of intervals to get the approximate length of each interval.

• Usually use from 5 to 15 intervals.

• Define interval end points so that they don't overlap or leave gaps (i.e. they are mutually exclusive and exhaustive). This ensures that every observation belongs to exactly one interval.

• It is usually simpler to have all intervals of the same length.

• Count the number of values in each interval (the class frequency): go through the data once only and use tally marks to help with counting.

• Relative frequencies or percentages are usually helpful to show the distribution of the data.
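As a rough sketch of these steps in R, the Commodore prices can be grouped with cut() and counted with table(). The class length of 3000 and the starting point of 2000 are assumptions, chosen so that one class matches the interval 8,000 to 10,999 quoted for the grouped data; the variable names are illustrative.

```r
# Commodore prices (n = 38), asterisk on 29500 dropped
prices <- c(6000, 6700, 3800, 7000, 5800, 9975, 10500, 5990,
            20000, 11990, 16500, 10750, 9500, 12995, 12500, 8000,
            9900, 18000, 9500, 9400, 7250, 15000, 4500, 8900,
            9850, 9000, 5800, 29500, 15000, 9000, 4250, 4990,
            11000, 9990, 2200, 4000, 13500, 14500)

# Range divided by a chosen number of intervals gives an approximate length
approx_len <- diff(range(prices)) / 10
print(approx_len)

# Assumed end points: equal-length classes that do not overlap or leave gaps
breaks <- seq(2000, 32000, by = 3000)

# right = FALSE makes each class closed on the left: [2000, 5000), ...
classes <- cut(prices, breaks = breaks, right = FALSE, dig.lab = 5)
freq <- table(classes)

freq_table <- data.frame(interval  = names(freq),
                         frequency = as.vector(freq),
                         percent   = round(100 * as.vector(freq) / length(prices), 1))
print(freq_table)
```

With right = FALSE a price of exactly 8000 is counted in the class that starts at 8000, so every observation belongs to exactly one class.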

Page 7: Summarising and presenting data

Frequency distribution

Page 8: Summarising and presenting data

Histogram

• Area of rectangle = frequency (or relative frequency)

• But area = length × height

• So if all intervals are the same length, L, then height = frequency / L, i.e. the height of each rectangle is proportional to its frequency
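To make the area rule concrete, the short R sketch below starts from class counts for the 38 Commodore prices and converts them to bar heights so that each bar's area equals its relative frequency. The counts were obtained under an assumed class layout (length L = 3000, starting at 2000), which is not taken from the slides.

```r
# Class counts for the 38 Commodore prices in classes of length 3000
# starting at 2000 (an assumed class layout)
freq <- c(6, 7, 13, 5, 4, 1, 1, 0, 0, 1)
L <- 3000                          # common interval length

rel_freq <- freq / sum(freq)       # relative frequency of each class
height <- rel_freq / L             # because area = length x height
area <- L * height                 # recovers the relative frequencies

print(sum(area))                   # total area of the bars
print(height / height[1])          # heights relative to the first bar
```

Because L is the same for every class, the heights differ from the counts only by a constant factor, which is why equal-length intervals let the bar height be read directly as frequency.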

Page 9: Summarising and presenting data

Features of a histogram

Page 10: Summarising and presenting data

Mode

• The mode is the value or category which occurs most frequently.

• If several data values occur with the same maximal frequency, they are all modes.

• For example, in the Commodore data, using the grouped data, the class interval, [8,000 - 10,999], is the modal interval.

Page 11: Summarising and presenting data

Modality and Symmetry

• Modality: number of peaks
– e.g. one peak: unimodal

• Skewness: departure from symmetry
– positive skewness (skew to the right)
– negative skewness (skew to the left)

Page 12: Summarising and presenting data

Human histogram

Page 13: Summarising and presenting data

Human histogram explained

Page 14: Summarising and presenting data

Process control example

• Is the process in control?

• Why the gap?

• Deming

• 500 steel rods

• Ideal diameter = 1 cm

Page 15: Summarising and presenting data

MEASURES OF CENTRAL TENDENCY ("Averages")

• Mean (arithmetic mean): $\bar{x}$ (read as 'x bar')

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n}$$

• Notation: denote the data values by $x_1, x_2, \dots, x_n$

• n denotes the number of data points

Page 16: Summarising and presenting data

Mean for frequency distribution

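Denote the class midpoints by $x_1, x_2, \dots, x_k$

Denote the frequencies by $f_1, f_2, \dots, f_k$

Mean for grouped data:

$$\bar{x} = \frac{x_1 f_1 + x_2 f_2 + \dots + x_k f_k}{n}$$

i.e.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{k} x_i f_i$$

where k is the number of classes and $n = f_1 + f_2 + \dots + f_k$ is the total number of observations.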
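The R sketch below compares the mean of the raw Commodore prices with the mean computed from grouped data. The class boundaries (length 3000, starting at 2000) are an assumption, and the variable names are illustrative.

```r
prices <- c(6000, 6700, 3800, 7000, 5800, 9975, 10500, 5990,
            20000, 11990, 16500, 10750, 9500, 12995, 12500, 8000,
            9900, 18000, 9500, 9400, 7250, 15000, 4500, 8900,
            9850, 9000, 5800, 29500, 15000, 9000, 4250, 4990,
            11000, 9990, 2200, 4000, 13500, 14500)

# Raw mean: sum of the data values divided by n
n <- length(prices)
x_bar <- sum(prices) / n

# Grouped mean: class midpoints weighted by class frequencies
breaks <- seq(2000, 32000, by = 3000)        # assumed class boundaries
f <- as.vector(table(cut(prices, breaks = breaks, right = FALSE)))
midpoints <- head(breaks, -1) + 1500         # midpoint of each class
x_bar_grouped <- sum(midpoints * f) / sum(f)

print(c(raw = x_bar, grouped = x_bar_grouped))
```

The two results differ slightly (by roughly 50 here) because grouping replaces every price by the midpoint of its class.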

Page 17: Summarising and presenting data

Median

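• ‘Middle’ value of the data set

• A number which is greater than half the data values and less than the other half

• The (n+1)/2-th ordered observation

Data set: 6, 6.7, 3.8, 7, 5.8
Ordered: 3.8, 5.8, 6, 6.7, 7
Median: (5+1)/2 = 3rd ordered observation, i.e. 6

If n is even: 6, 6.7, 3.8, 7, 5.8, 9.975
Ordered: 3.8, 5.8, 6, 6.7, 7, 9.975
Median: (6+1)/2 = 3.5th ordered observation, i.e. halfway between the 3rd and 4th: (6 + 6.7)/2 = 6.35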
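A minimal R sketch of the position rule, assuming that a fractional position such as 3.5 means the average of the two neighbouring ordered observations. The function name median_by_position is illustrative; R's built-in median() is used only as a cross-check.

```r
# Median as the (n+1)/2-th ordered observation; when that position is not a
# whole number, average the observations on either side of it.
median_by_position <- function(x) {
  x_sorted <- sort(x)
  pos <- (length(x_sorted) + 1) / 2
  mean(x_sorted[c(floor(pos), ceiling(pos))])
}

odd_set  <- c(6, 6.7, 3.8, 7, 5.8)
even_set <- c(6, 6.7, 3.8, 7, 5.8, 9.975)

print(median_by_position(odd_set))    # 3rd ordered observation
print(median_by_position(even_set))   # average of 3rd and 4th ordered observations

# Cross-check against the built-in median()
print(all.equal(median_by_position(odd_set), median(odd_set)))
print(all.equal(median_by_position(even_set), median(even_set)))
```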

Page 18: Summarising and presenting data

Quartiles and percentiles

• Median: 50% below, 50% above

• 1st quartile: 25% below, 75% above

• Q1: (n+1)/4-th ordered observation

• Q3 (3rd quartile): 3(n+1)/4-th ordered observation

Data set: 6, 6.7, 3.8, 7, 5.8
Ordered: 3.8, 5.8, 6, 6.7, 7
Q1: (5+1)/4 = 1.5th ordered observation, i.e. (3.8 + 5.8)/2 = 4.8
Q3: 3(5+1)/4 = 4.5th ordered observation, i.e. (6.7 + 7)/2 = 6.85

• p-th percentile or quantile: p% below, (100-p)% above
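In R, one way to reproduce a position rule of the form p(n+1) is quantile() with type = 6, which interpolates linearly at the p(n+1)-th ordered observation. This is a sketch, not necessarily how the slides' values were computed; R's default (type = 7) uses a different position rule and can give different quartiles on small data sets.

```r
x <- c(6, 6.7, 3.8, 7, 5.8)

# type = 6 places the p-th quantile at the p * (n + 1)-th ordered observation
q <- quantile(x, probs = c(0.25, 0.5, 0.75), type = 6)
print(q)

# R's default rule (type = 7), shown for comparison
q_default <- quantile(x, probs = c(0.25, 0.5, 0.75))
print(q_default)
```

Other percentiles follow the same pattern, e.g. probs = 0.9 for the 90th percentile.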

Page 19: Summarising and presenting data

Stem and leaf plot

Finally order the leaves

Page 20: Summarising and presenting data

Percentiles via stem and leaf plot

Get the median: median = (n+1)/2-th ordered obs., i.e. the 10.5th ordered observation. It lies in the stem 7|, so median = (72+76)/2 = 74

Get the 1st quartile: Q1 = (n+1)/4-th ordered obs.

Get the 3rd quartile: Q3 = 3(n+1)/4-th ordered obs.

Page 21: Summarising and presenting data

Percentiles from a freq. distr.

• What are the median, 1st and 3rd quartiles?

• Actual values are 6700, 5900 and 10200

• You lose details in a frequency distribution

Page 22: Summarising and presenting data

Comparison of Mean and Median

Data set A: 2, 3, 3, 4, 5, 7, 8
Data set B: 2, 3, 3, 4, 5, 8, 20
Both have n = 7 values. Both have median 4, but the mean is 4.57 for A and 6.43 for B.

• The median is not affected by extreme values, but the mean is

• The median is useful for incomplete data

• E.g. consider an experiment to measure the average lifetime of a light bulb (n=6): 200, 400, 650, 700, 900, ..
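The comparison can be checked with a few lines of R. For the light bulb example the sketch assumes that the elided sixth value belongs to a bulb that had not yet failed, and uses Inf as a stand-in for "longer than any observed lifetime"; that reading is an assumption, not something the slide states.

```r
a <- c(2, 3, 3, 4, 5, 7, 8)
b <- c(2, 3, 3, 4, 5, 8, 20)

# Mean and median side by side: only the mean reacts to the extreme value in B
summary_ab <- rbind(A = c(mean = mean(a), median = median(a)),
                    B = c(mean = mean(b), median = median(b)))
print(round(summary_ab, 2))

# Incomplete data: the sixth lifetime is unknown, assumed to exceed the rest
lifetimes <- c(200, 400, 650, 700, 900, Inf)
print(median(lifetimes))   # depends only on the 3rd and 4th ordered values
print(mean(lifetimes))     # cannot be computed without the missing lifetime
```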

Page 23: Summarising and presenting data

Comparing Mean, Median and Mode

• If distribution is symmetric and unimodal, all three coincide

• If only symmetric, mean and median coincide

• If distribution is not symmetric, better to use median than mean

Page 24: Summarising and presenting data

MEASURES OF VARIABILITY

• Statistics which summarise how spread out the data values are. Also called measures of dispersion

• The range = max - min (used in quality control)

• The range is susceptible to extreme values

Page 25: Summarising and presenting data

IQR

• The interquartile range is defined as   IQR = Q3 - Q1

• IQR is less susceptible to outliers (like the median)

Page 26: Summarising and presenting data

Five number summary

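• Five number summary: minimum, Q1, median, Q3, maximum

• Boxplot (or box-and-whisker plot)

• Box contains the middle 50% of the data

• If an observation lies more than 3 times the IQR beyond the box (below Q1 or above Q3), it is an outlier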
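A minimal R sketch of the five number summary and an outlier check. It assumes quartiles at the p(n+1)-th ordered observation (quantile() with type = 6) and reads the outlier rule as "more than k times the IQR below Q1 or above Q3" with k = 3. The function names five_number and flag_outliers and the sample values are illustrative, not from the slides.

```r
# Minimum, Q1, median, Q3, maximum
five_number <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), type = 6)
  c(min = min(x), Q1 = q[[1]], median = q[[2]], Q3 = q[[3]], max = max(x))
}

# Values lying more than k * IQR outside the box (assumed reading of the rule)
flag_outliers <- function(x, k = 3) {
  s <- five_number(x)
  iqr <- s[["Q3"]] - s[["Q1"]]
  x[x < s[["Q1"]] - k * iqr | x > s[["Q3"]] + k * iqr]
}

x <- c(1, 2, 3, 4, 5, 6, 7, 100)   # made-up values for illustration
print(five_number(x))
print(flag_outliers(x))
```

R's own boxplot() draws the same kind of summary, but by default it uses a multiplier of 1.5 (its range argument) for the whiskers, so the multiplier is kept as a parameter here.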

Page 27: Summarising and presenting data

Boxplots are useful for comparing groups

Page 28: Summarising and presenting data

Deviations from the mean

Page 29: Summarising and presenting data

Summarising deviations from mean

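The deviation of each value $x_i$ from the mean is:

$$d_i = x_i - \bar{x}$$

The mean (or sum) of the deviations is not a good summary, because the deviations always sum to zero:

$$\sum_{i=1}^{n} d_i = 0$$

• Instead use a positive function such as $d_i^2$ or $|d_i|$

• Variance or mean square error: $\frac{1}{n}\sum_{i=1}^{n} d_i^2$

• Mean absolute deviation: $\frac{1}{n}\sum_{i=1}^{n} |d_i|$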

Page 30: Summarising and presenting data

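Variance and Standard Deviation

Usually n-1 instead of n is used in the denominator:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

is the sample variance.

Problem: squared distances have squared units, so take the square root:

$$s = \sqrt{s^2}$$

is the sample standard deviation.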

Page 31: Summarising and presenting data

Example: small data set

Data set A: {xi} = 2, 3, 3, 4, 5, 7, 8

There are n=7 observations and mean = 4.57. The deviations from the mean, di, are:

-2.57, -1.57, -1.57, -0.57, 0.43, 2.43, 3.43. So the sum of squared deviations is 29.71, giving s² = 29.71/6 = 4.95 and s = 2.23.
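The same calculation can be sketched in R, computing the deviations explicitly and then comparing with the built-in var() and sd(), which also divide by n - 1. Variable names are illustrative.

```r
a <- c(2, 3, 3, 4, 5, 7, 8)
n <- length(a)

d <- a - mean(a)                  # deviations from the mean
print(round(d, 2))
print(sum(d))                     # zero, up to floating-point rounding

s2 <- sum(d^2) / (n - 1)          # sample variance, denominator n - 1
s  <- sqrt(s2)                    # sample standard deviation
mean_abs_dev <- mean(abs(d))      # mean absolute deviation, denominator n
print(c(variance = s2, sd = s, mean_abs_dev = mean_abs_dev))

# Cross-check against the built-in functions
print(all.equal(s2, var(a)))
print(all.equal(s, sd(a)))
```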

Page 32: Summarising and presenting data

Shortcut formulae for variance
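A commonly used shortcut, given here as a sketch in the notation of the preceding slides and not necessarily in the form the slide itself showed, avoids computing each deviation by expanding the square, so that only the sum of the values and the sum of their squares are needed:

```latex
\begin{align*}
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \\
    &= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) \\
    &= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)^2\right)
\end{align*}
```

For data set A (2, 3, 3, 4, 5, 7, 8) the sum of the values is 32 and the sum of squares is 176, so s² = (176 - 32²/7)/6 = 4.95.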

Page 33: Summarising and presenting data

Bivariate methods

• We have (mostly) looked at univariate methods

• Most interesting problems are bivariate (or multivariate)

• Continuous variable vs. qualitative variable: comparative boxplot

• Continuous variable vs. continuous variable: scatterplot

Page 34: Summarising and presenting data

Presenting bivariate data

• Scatterplots are useful for illustrating the relationship between continuous variables (xi, yi), i = 1, …, n

• Indicates type of relationship

Page 35: Summarising and presenting data

Creating a scatterplot

Step 1: Create variables ht and wt

Step 2: plot(ht, wt, xlab="height", ylab="weight")

Page 36: Summarising and presenting data

Summarising a relationship

plot(temperature, ozone)

abline(lm(ozone ~ temperature, data=air))

Page 37: Summarising and presenting data

Summarising a nonlinear relationship

plot(E,NOx)

lines(supsmu(E,NOx))

•Use a smoother
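A self-contained sketch of both summaries, a fitted straight line and a smoother, using R's built-in airquality data frame (columns Temp and Ozone) as a stand-in, because the air, E and NOx objects used on the slides are not defined in this transcript. The choice of data set is an assumption made only so that the code runs as written.

```r
# Built-in data; keep only the rows where both variables are observed
ok <- complete.cases(airquality$Temp, airquality$Ozone)
temp  <- airquality$Temp[ok]
ozone <- airquality$Ozone[ok]

plot(temp, ozone, xlab = "temperature", ylab = "ozone")

# Straight-line summary: least-squares fit added to the scatterplot
fit <- lm(ozone ~ temp)
abline(fit)

# Smoother for a possibly nonlinear relationship (Friedman's super smoother)
sm <- supsmu(temp, ozone)
lines(sm, lty = 2)

print(coef(fit))
```

Drawing both on the same plot makes it easy to see where the straight line and the smoother disagree, which is a quick visual check for nonlinearity.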