Common Statistical Methods in Geoscience

42
Intro to Quantitative Geology www.helsinki.fi/yliopisto Introduction to Quantitative Geology Lecture 3 Common statistical methods in geoscience Lecturer: David Whipp david.whipp@helsinki.fi 16.3.2015 2

description

Statistics Geosciences

Transcript of Common Statistical Methods in Geoscience

Page 1: Common Statistical Methods in Geoscience

Intro to Quantitative Geology www.helsinki.fi/yliopisto

Introduction to Quantitative Geology Lecture 3

Common statistical methods in geoscienceLecturer: David [email protected]

16.3.2015

2

Page 2: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Goals of this lecture

• Address the many challenges of geological data samples

• Discuss uncertainty in Earth science data

• Review the basic mathematical representations of measurement uncertainty

3

Page 3: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Why are statistics important for us?

4

Bighorn Basin, WY, USA

Page 4: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology 5

Bighorn Basin, Wyoming, USA

Page 5: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology 6

Bighorn Mtns, Wyoming, USA

Page 6: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology 7

Bighorn Basin, Wyoming, USA

Page 7: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology 8

Bighorn Basin, Wyoming, USA

Page 8: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology 9

Bighorn Basin, Wyoming, USA

Page 9: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Why are statistics important for us?

10

Bighorn Basin, Wyoming, USA

Even here, bedrock exposures are limited and widespread

Page 10: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Why are statistics important for us?

11

Bighorn Basin, Wyoming, USA

• We can directly observe only a tiny fraction of most geological features

• Bedrock exposure is limited

• Many rock units are very large and/or widely dispersed

• Sometimes, even well-exposed units can be hard to access

Page 11: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Fully funded field opportunity - Bighorn Basin

12More information at https://rock.geosociety.org/eo/

Page 12: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

A few statistical definitions

• Population: The total number of occurrences of a particular thing present in a defined area

• We basically cannot access this geologically

• Subset: A collected portion of the population

• Sampling units: The collected material

13

Page 13: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

A representative sample

• Our goals, as Earth scientists is to collect data that forms a representative sample

• Representative sample: A sample that can be used to infer the characteristics of the population

• The best way to get a representative sample is to collect sampling units at random, without bias

14

Page 14: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

A representative sample

• Our goals, as Earth scientists is to collect data that forms a representative sample

• Representative sample: A sample that can be used to infer the characteristics of the population

• The best way to get a representative sample is to collect sampling units at random, without bias

• Why might this be difficult?

15

Page 15: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

The trouble with “Random” sampling

• Making our best effort to collect a “random” sample does not guarantee it is a representative sample

1. Samples taken from the same population may be very different from one another

16

Second, even if two populations are very different, samples from eachmay be similar simply by chance, and therefore give the misleadingimpression the populations are also similar (Figure 1.2).Finally, variation within samples may make it difficult to interpret any

effect of differences in location. There is often so much variation within asample (and a population) that differences in location may be difficult tointerpret. For example, imagine you are an environmental geologist work-ing to assess a landfill contaminated with lead. The lead content in a sample

Population

Sample 1

Sample 2

Figure 1.1 Even a random sample may not necessarily be a goodrepresentative of the population. Two samples have been taken at randomfrom a Devonian oil field in Ghawar. By chance, sample 1 contains a group ofrelatively large fossils, while those in sample 2 are relatively small, and thetypes of fossils in the two samples are also different.

1.1 Experimental design and statistics 3

Figure 1.1, McKillup and Dyar, 2010

Large fossils

Small fossils

Page 16: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

The trouble with “Random” sampling

• Making our best effort to collect a “random” sample does not guarantee it is a representative sample

2. Samples taken from very different populations may be quite similar

17

Figure 1.2, McKillup and Dyar, 2010

of ten cores from the oldest part of the landfill is 1000mg/kg Pb on average,and ranges from 100–9000mg/kg. In contrast, a sample of ten cores fromthe youngest part of the landfill contains 2000mg/kg Pb on average butranges from 100–7000mg/kg. Which of these two areas would you considerto be most contaminated?

Variability within samples can also obscure the effect of experimentaltreatments. For example, opaque brown topaz crystals may change to

Population 1

Population 2

Sample 1

Sample 2

Figure 1.2 Samples selected at random from very different populations maynot necessarily be different. Simply by chance the samples from populations1 and 2 are similar in size and composition.

4 Introduction

Page 17: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Using a sample to infer something about a population

• Collecting a representative sample is a clear challenge and needed to be able to make inferences about a population

• Every precaution must be taken to collect a representative sample

• Collecting a sufficient number of measurements

• Making measurements or collecting samples in the most logical locations

• Carefully preparing samples for analysis or measurement

18

Page 18: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology 19

Uncertainty

Page 19: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology 20

Deep water oil/gas exploration is expensiveDrill ship: ~350,000 € per day

Oil platform: ~300,000 € per day

Uncertainty

Page 20: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology 21

Deep water oil/gas exploration is expensiveDrill ship: ~350,000 € per day

Oil platform: ~300,000 € per day

Know your uncertainties!

Uncertainty

Page 21: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Uncertainty in Earth science

• In addition to challenges collecting a representative sample, all geological measurements have inherent uncertainty

• It is not possible to make an exact measurement

• We may use tools with very high precision, but measurements are still uncertain

• A scientist’s goal is to reduce uncertainty when taking measurements by making them as accurate and precise as possible

• Our goal should be to always include that uncertainty when comparing measured values to predictions

22

Page 22: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Precision versus accuracy

• Random error: Experimental uncertainty revealed by repeated measurements

• Small random error = Precise

• Systematic error: Error that cannot be revealed by repeated measurement

• Small systematic error = Accurate

• How can we detect systematic error?

23

Fig. 4.1, Taylor, 1997

Page 23: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Precision versus accuracy

• As it turns out, detecting systematic error is difficult

• Scatter might reveal large random errors, but without a reference for the true value, the systematic error is unclear

• This is one reason that blanks and standards are used for measuring isotope concentrations when measuring geochronometer ages, for example

24

Fig. 4.2, Taylor, 1997

Page 24: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Sources of uncertainty

25

• If we think about it, uncertainty is everywhere

Page 25: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Sources of uncertainty

26

• If we think about it, uncertainty is everywhere

• A typical field compass is graduated in 1°, so we can’t expect any higher precision

• Magnetic declination can also only be set to within 1°, and the uncertainty larger if not set correctly

• Outcrops generally don’t provide perfect exposure, and repeated measurements vary often by 3-5° (at least)

Page 26: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Reported measurements

• Ignoring the declination, the most precise measurement we can make with our compass is

• Imagine we take 5 measurements of the strike of a rock unit and find: 33°, 36°, 32°, 35°, 34°

• Based on these numbers alone, we could state the strike is 34±2°

• Including the uncertainty in reading the compass, we would say the strike is 34±2.5°

• Generally, any measurement should be reported in the form

27

xbest ± �x

xbest ± 0.5�

Page 27: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Mean and standard deviation

• For the compass measurements, it is quite natural to take the average value, or mean, for the best estimate of the strike

• The mean, , is simply the sum of the measured values divided by the number of measurementswhere !" is the "th measured value, # is the total number of measurements and $ represents sigma notation of a sum

28

X

i

xi =NX

i=1

xi = x1 + x2 + ...+ xN

x̄ =x1 + x2 + ...+ xN

N

=⌃xi

N

Page 28: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Mean and standard deviation

• For our example, calculating the mean is easy

• If we consider the mean as the best estimate of the quantity, a natural value to consider as well is the deviation (or residual) for each measurement

• If the deviation is small, the measurements are precise; if it is large, the measurements are not so precise

29

x̄ =33� + 36� + 32� + 35� + 34�

5= 34�

di = xi � x̄

Page 29: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Mean and standard deviation

• It might be tempting to average the deviation to determine how much measurements deviate from the mean on average

• What is the average deviation in our dataset?

30

Page 30: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Mean and standard deviation

• It might be tempting to average the deviation to determine how much measurements deviate from the mean on average

• What is the average deviation in our dataset?

• If we square the deviation, all values will be positive, and then we can average and take the square root of the result

31

x

=

r1

N � 1(d

i

)2 =

vuut 1

N � 1

NX

i=1

(xi

� x̄)2

Page 31: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Mean and standard deviation

• It might be tempting to average the deviation to determine how much measurements deviate from the mean on average

• What is the average deviation in our dataset?

• If we square the deviation, all values will be positive, and then we can average and take the square root of the result

• This is the sample standard deviation, %!

32

x

=

r1

N � 1(d

i

)2 =

vuut 1

N � 1

NX

i=1

(xi

� x̄)2

Page 32: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Mean and standard deviation

• If we plug in our values we see

33

Measured value

Mean Deviation Deviation squared33 34 -1 1

36 34 2 4

32 34 -2 4

35 34 1 1

34 34 0 0

�x

=

r1

N � 1(d

i

)2

=

r1

4(1 + 4 + 4 + 1 + 0)

= 1.6

Page 33: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

What does the standard deviation tell us?

• The 2% uncertainty is often reported for geological measurements

• For the measured values, ~95% of the measurements are within ±2% of the mean & (& is the population mean, is the sample mean)

34

7.3.3 The standard deviation of a population

The importance of the variance is apparent when you obtain the standarddeviation, which is symbolized for a population by σ and is just the square rootof the variance. For example, if the variance is 64, the standard deviation is 8.

The standard deviation is important because the mean of a normallydistributed population, plus or minus one standard deviation, includes68.27% of the values within that population.

Evenmore importantly, 95% of the values in the population will be within±1.96 standard deviations of the mean. This is especially useful becausethe remaining 5% of values will be outside this range and therefore furtheraway from the mean (Figure 7.3). Remember from Chapter 6 that 5% is thecommonly used significance level.

These two statistics are all you need to describe the location and shape ofa normal distribution and can also be used to determine the proportionof the population that is less than or more than a particular value (Box 7.1).

7.3.4 The Z statistic

The proportions of the normal distribution described in the previoussection can be expressed in a different andmore workable way. For a normaldistribution, the difference between any value and the mean, divided by thestandard deviation, gives a ratio called the Z statistic that is also normally

Frequency

(a)

µ

Frequency

(b)

µ

Figure 7.3 Illustration of the proportions of the values in a normallydistributed population. (a) 68.27% of values are within ±1 standard deviationfrom the mean and (b) 95% of values are within ±1.96 standard deviationsfrom the mean. These percentages correspond to the area of the distributionenclosed by the two vertical lines.

70 Working from samples: data, populations and statistics

~68% of measurements within ±1% ~95% of measurements within ±2%

±1% ±2%

Fig. 7.3, McKillup and Dyar, 2010

Page 34: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Normal distribution

• The normal distribution refers to a bell-shaped curve that mathematically describes the frequency that a measured value is observed

• This is also known as a Gaussian curve

• Many geological variables follow a normal distribution

• We will look at this in detail on Wednesday in lab

35

chosen at random and plot the frequency of these on the Y axis and angle onthe X axis, the distribution will look like a symmetrical bell, which has beencalled the normal distribution (Figure 7.1).The normal distribution has been found to apply to many types of

variables in natural phenomena (e.g. grain size distributions in rocks, theshell length of many species of marine snails, stellar masses, the distributionof minerals on beaches, etc.).The very useful thing about normally distributed variables is that two

descriptive statistics – themean and the standard deviation – can describethis distribution. From these, you can predict the proportion of data that willbe less than or greater than a particular value. Consequently, tests that use theproperties of the normal distribution are straightforward, powerful and easyto apply. To use them you have to be sure your data are reasonably “normal.”(There are methods to assess normality and these will be described later.)To understand parametric tests you need to be familiar with some

statistics used to describe the normal distribution and some of its properties.

7.3.1 The mean of a normally distributed population

First, the mean (the average) symbolized by the Greek μ describes thelocation of the center of the normal distribution. It is the sum of all the

0 70averageCinder cone angle (°)

Frequencyof eachangle

Figure 7.1 An example of a normally distributed population. The shape ofthe distribution is symmetrical about the average and the majority of valuesare close to the average, with an upper and lower “tail” of steeply and gentlysloping cinder cones, respectively.

7.3 The normal distribution 67

Fig. 7.1, McKillup and Dyar, 2010

Page 35: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

A silly example of why uncertainties matter

36

Regression line

Page 36: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

A silly example of why uncertainties matter

37

Question: Is this a good fit?

Regression line

Page 37: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

A silly example of why uncertainties matter

38

Question: Is this a good fit?

Answer: Cannot say.No uncertainties shown Regression line

Page 38: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

A silly example of why uncertainties matter

39

Trend line fits thedata

Conclusion:More treats = Happier cat

Page 39: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

A silly example of why uncertainties matter

40

Major breakthrough!

More treats = happier cat,but cat happiness is not linear

Page 40: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Reporting uncertainties

• The point is uncertainties matter

• What our data can say is a function of the uncertainty

• A model fit to data is meaningful only if the predictions lie within the uncertainty in the data

• A geological age of 1.88 Ga for an event might be interesting, but an age of 1.88±0.02 Ga is far more powerful

41

Page 41: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

Recap

• It is difficult to collect a representative sample and requires careful planning

• Uncertainty is everywhere in Earth science data

• Characterising and reporting uncertainty is as important as the measured values themselves

42

Page 42: Common Statistical Methods in Geoscience

www.helsinki.fi/yliopistoIntro to Quantitative Geology

References

McKillup, S., & Dyar, M. D. (2010). Geostatistics explained: an introductory guide for earth scientists. Cambridge University Press.

Taylor, J. R. (1997). An introduction to error analysis: the study of uncertainties in physical measurements. Sausalito, CA, USA: University science books.

43