# Statistics crash course


### Transcript of Statistics crash course

Probability and statistics crash course

Probability 1 (for dummies:-)

Stats 1 (averages and deviations)

Probability 2 (Trials and distributions)

Stats 2 (significance)

Stats 3 (errors)

Preliminaries

So what is statistics?

Applied branch of mathematics

Concerning data and its representation

Descriptive Statistics (today) are concerned with representing and summarising data

Analytical Statistics (in a few weeks) are concerned with drawing conclusions from data

"... probability theory enables us to find the consequences of a given ideal world, while statistical theory enables us to measure the extent to which our world is ideal." Skiena, 2001.

Descriptive statistics: Why?

Summarising data.

32 7 16 33

33 10 13 35

22 11 15 34

21 13 17 32

23 16 15 24

Max, Min, Mean(s), Median, Mode, Variance, Standard Deviation, Interquartile range, ...

All ways of presenting numerical data in such a way that we learn something of its spread and tendency and deviation.

What is an average?

Average originally meant "financial loss incurred through damage to goods in transit", from the Italian avaria, a word from 12c. Mediterranean maritime trade. Sometimes traced to Arabic arwariya "damaged merchandise", but this is less certain.

Later, the meaning of the word shifts to "equal sharing of such loss by the interested parties".

Measures of central tendency

Arithmetic Mean (often what we think of when we say the word "Average"). Add 'em all up and divide by the number there are.

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

An aside about samples and populations

Often we can't measure an entire population, and instead have to measure a subset (a sample). The mean on the previous slide, $\bar{x}$, is, strictly speaking, a sample mean. The population mean is usually referred to as $\mu$, and the size of the whole population as $N$.

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

The other two

Median = put them all in order, and choose the middle one. If there are an even number, then there are two middle ones, so use the number halfway between these.

Mode = choose the most frequent one.

Symmetricity/Skewness

I am just going to mention this in passing today, but...

Figure 1: A skewed dataset (plot title: "A fictitious but nastily skewed dataset"; axes: Number and Count)

This dataset has a mean of 21.8, a median of 12 and a mode of 12.

An aside about types of data

There are various types of data we can consider within statistics. Not all measures of central tendency apply to all of these.

| Data type | Description | Average |
| --- | --- | --- |
| Nominal | Categories or names | Mode |
| Ordinal | Orderings (e.g., First, Second, Third, ...) | Median |
| Interval and Ratio | Proper numbers | Mean (symmetrical), Median (skewed) |

To conclude the average bit

Arithmetic Mean; Median; Geometric median; Mode; Geometric Mean; Harmonic Mean; Quadratic Mean (or RMS); Generalised Mean (like quadratic mean but with different powers); Weighted Mean (some matter more than others); Truncated Mean (leave out the tricky outliers); Interquartile Mean (uses the interquartile range, of which more later); Midrange ((max + min)/2); Winsorized mean (like truncated but not quite); Annualization (to do with finance stuff).

All of these have their own Wikipedia page, so you know where to start!
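A few of these can be tried out in a handful of lines. The sketch below is only an illustration: it uses the third column of the example table from earlier, and `truncated_mean` is a made-up helper (not a library function) that simply drops the single smallest and largest value.

```python
import math
import statistics

# Third column of the example table shown earlier.
data = [16, 13, 15, 17, 15]

arithmetic = statistics.mean(data)
geometric = statistics.geometric_mean(data)  # needs Python 3.8 or later
harmonic = statistics.harmonic_mean(data)
quadratic = math.sqrt(sum(x * x for x in data) / len(data))  # a.k.a. RMS
midrange = (max(data) + min(data)) / 2


def truncated_mean(values, k=1):
    """Hypothetical helper: mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


trimmed = truncated_mean(data)

for name, value in [
    ("harmonic", harmonic),
    ("geometric", geometric),
    ("arithmetic", arithmetic),
    ("quadratic", quadratic),
    ("midrange", midrange),
    ("truncated", trimmed),
]:
    print(f"{name:>10}: {value:.3f}")
```

For positive data the harmonic, geometric, arithmetic and quadratic means always come out in that order (smallest to largest), which is a handy sanity check.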

Boring practical bit

32 7 16 33

33 10 13 35

22 11 15 34

21 13 17 32

23 16 15 24

Boring practical bit: answers

32 7 16 33

33 10 13 35

22 11 15 34

21 13 17 32

23 16 15 24

Mean 26.2 11.4 15.2 31.6

Median 23 11 15 33

Mode ? ? 15 ?
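If you would rather check these answers by machine, here is a minimal sketch using Python's standard `statistics` module. It treats each column of the table as one dataset, which is an assumption based on how the answers are laid out. `multimode` returns every value tied for most frequent, so a column with no repeats gives back all of its values (the "?" cases above).

```python
import statistics

# The practical data, one list per row of the slide.
rows = [
    [32, 7, 16, 33],
    [33, 10, 13, 35],
    [22, 11, 15, 34],
    [21, 13, 17, 32],
    [23, 16, 15, 24],
]

# Transpose so that each tuple is one column of the table.
columns = list(zip(*rows))

means = [round(statistics.mean(col), 1) for col in columns]
medians = [statistics.median(col) for col in columns]
modes = [statistics.multimode(col) for col in columns]

print("Mean  ", means)
print("Median", medians)
for i, m in enumerate(modes, start=1):
    # More than one "mode" means no single value is most frequent.
    print(f"Mode of column {i}:", m[0] if len(m) == 1 else "?")
```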

Deviation

As well as knowing some kind of average of a particular sample, you might want to know something of its spread.

Figure 2: Three datasets with the same mean but different spreads (plot title: "More fictitious data"; axes: Number and Count)

The really simple one

The range is the simplest way of describing the spread of data: find the max, find the min, subtract the min from the max, there you go.

Deviation

The deviation of a sample is measured with reference to some measure of central tendency: you want to know how much the sample deviates from something. With average deviation, variance, and standard deviation, this is the mean $\mu$ or the sample mean $\bar{x}$.

Measures of deviation

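$$\text{Average deviation} = \frac{\sum |x - \mu|}{N}$$

$$\text{Variance} = \sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

$$\text{Standard deviation} = \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$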

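For reasons you will now be familiar with, when considering samples, $\sigma$ becomes $s$, and $\mu$ becomes $\bar{x}$. To account for bias, the sum of squared deviations in the sample variance (and hence in the sample standard deviation) is divided by $n - 1$ rather than $n$.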
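To see what the $n - 1$ does in practice, here is a small sketch that computes both versions by hand and compares them with the standard library. The data is the third column of the earlier example table, chosen purely for illustration.

```python
import math
import statistics

# Third column of the earlier example table, used here only as an illustration.
data = [16, 13, 15, 17, 15]

n = len(data)
mean = sum(data) / n
sum_sq = sum((x - mean) ** 2 for x in data)  # sum of squared deviations

pop_var = sum_sq / n           # population variance: divide by N
sample_var = sum_sq / (n - 1)  # sample variance: divide by n - 1

pop_sd = math.sqrt(pop_var)
sample_sd = math.sqrt(sample_var)

print("population:", pop_var, pop_sd)
print("sample:    ", sample_var, sample_sd)

# The standard library offers both flavours under different names.
print(statistics.pvariance(data), statistics.variance(data))
```

The sample version is always a little larger, and the gap shrinks as $n$ grows.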

Worked example

This example (a) involves the rainfall in Liberia (b).

J F M A M J J A S O N D

1 2 4 6 18 37 31 16 28 24 9 4

The mean of this data is

$$\frac{1 + 2 + 4 + 6 + 18 + 37 + 31 + 16 + 28 + 24 + 9 + 4}{12} = 15$$

The range of this data is 36 (max - min, or 37 - 1).

(a) Taken from Sternstein's Statistics

(b) No, I've never been there either

Average deviation

The average deviation is

$$\frac{|1 - 15| + |2 - 15| + |4 - 15| + |6 - 15| + |18 - 15| + \dots}{12} = \frac{14 + 13 + 11 + 9 + 3 + 22 + 16 + 1 + 13 + 9 + 6 + 11}{12} \approx 10.7 \text{ inches}$$

Variance and standard deviation

The variance is

$$\sigma^2 = \frac{14^2 + 13^2 + 11^2 + 9^2 + 3^2 + 22^2 + 16^2 + 1^2 + 13^2 + 9^2 + 6^2 + 11^2}{12} \approx 143.7 \text{ inches squared}$$

AND the standard deviation is the square root of the variance, so...

$$\sigma = \sqrt{143.7} \approx 12.0$$

and the units of the standard deviation are... the same as the units of measurement.
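The whole worked example fits in a few lines of Python, should you want to check the arithmetic. This sketch uses the population formulas (dividing by 12), as the slides do; the variable names are just my own choices.

```python
import math

# Monthly rainfall from the worked example, January to December (inches).
rainfall = [1, 2, 4, 6, 18, 37, 31, 16, 28, 24, 9, 4]

n = len(rainfall)
mean = sum(rainfall) / n
data_range = max(rainfall) - min(rainfall)

# Average deviation: mean of the absolute distances from the mean.
avg_dev = sum(abs(x - mean) for x in rainfall) / n

# Variance: mean of the squared distances from the mean.
variance = sum((x - mean) ** 2 for x in rainfall) / n

# Standard deviation: square root of the variance, back in inches.
std_dev = math.sqrt(variance)

print(f"mean               = {mean:.1f}")
print(f"range              = {data_range}")
print(f"average deviation  = {avg_dev:.1f}")
print(f"variance           = {variance:.1f}")
print(f"standard deviation = {std_dev:.1f}")
```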

Interquartile range

One final measure of deviation is the interquartile range.

This is related to the median, and the first thing you do is place your data in order.

Discard the lowest and the highest 1/4 of your data, and use the range of what remains. This is much more robust to outliers.
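Taking that recipe literally, a minimal sketch might look like the following. It reuses the rainfall figures from the worked example, and the function is a hypothetical helper written for this illustration. Library routines usually define quartiles by interpolation, so they may give a slightly different number for the same data.

```python
def interquartile_range(values):
    """Literal version of the recipe: sort, drop a quarter from each end, take the range."""
    ordered = sorted(values)
    k = len(ordered) // 4  # how many values make up one quarter
    middle = ordered[k:len(ordered) - k]
    return middle[-1] - middle[0]


# Rainfall data from the worked example (inches).
rainfall = [1, 2, 4, 6, 18, 37, 31, 16, 28, 24, 9, 4]

# Same data with the wettest month replaced by an invented, absurd outlier.
with_outlier = [370 if x == 37 else x for x in rainfall]

iqr = interquartile_range(rainfall)
iqr_outlier = interquartile_range(with_outlier)

print("range:", max(rainfall) - min(rainfall), "->", max(with_outlier) - min(with_outlier))
print("IQR:  ", iqr, "->", iqr_outlier)
```

The plain range jumps when the outlier appears, while the interquartile range does not move, which is the robustness mentioned above.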

And to finish

If your data is normally distributed (of which more next week), knowing the standard deviation tells you all sorts of useful stuff.

Figure 3: Another graph stolen from wikipedia
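One example of the "useful stuff", assuming the data really is normally distributed: the fraction of values lying within one, two or three standard deviations of the mean is fixed. A short sketch using `statistics.NormalDist` from the standard library (`fraction_within` is a helper name invented here):

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.1%}")
```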

