4- Data Analysis and Presentation Statistics. CHAPTER 04: Opener.


Transcript of 4- Data Analysis and Presentation Statistics. CHAPTER 04: Opener.

Page 1:

4- Data Analysis and Presentation

Statistics

Page 2:

CHAPTER 04: Opener

Page 3:

What is in this chapter?

1) Uncertainty in measurements: large-N distributions and related parameters and concepts (Gaussian or normal distribution)
2) Approximations for smaller N (Student's t and related concepts)
3) Other methods: G, Q (FYI: F)
4) Excel examples (spreadsheet)

Page 4:

What do we want from measurements in chemical analysis?

• We want enough precision and accuracy to give us certain (not uncertain) answers to the specific questions we formulated at the beginning of each chemical analysis. We want small error and small uncertainty. Here we answer the question of how to measure it!

• READ: the red blood cell count example again!

Page 5:

Distribution of results from:

• Measurements of the same sample from different aliquots

• Measurements of similar samples (expected to be similar/the same because they come from the same process of generation)

• Measurements of samples on different instruments

• etc.

Page 6:

FYI: Formalism and mathematical description of distributions

• Counting black and white objects in a sample (how many times will n black balls show up in a sample?) → Binomial distribution

• For larger numbers of objects with low frequency (probability) in the sample → Poisson distribution

• And if the number of samples goes to infinity → Normal or Gaussian distribution

Page 7:

Normal or Gaussian distribution

• Unlimited, infinite number of measurements

• Large number of measurements

• Approximation: small number of measurements

Page 8:

CHAPTER 04: Figure 4.1

Page 9:

 

Data from many measurements of the same object, or many measurements of similar objects, show this type of distribution. This figure is the frequency of light-bulb lifetimes for a particular brand. Over four hundred bulbs were tested (sampled), and the mean bulb life is 845.2 hours. This is similar to, but not the same as, measuring one bulb many times under similar conditions! See also Fig. 4.2.

4-1 IMPORTANT: Normal or Gaussian distribution

The Gaussian (normal) distribution is

F(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2}

Find "sigma" (σ) and "mu" (μ) on the Gaussian distribution figure!!!

Page 10:

CHAPTER 04: Equation 4.3

Page 11:

IMPORTANT

Page 12:

CHAPTER 04: Figure 4.3

Page 13:

Here is a normal or Gaussian distribution determined by two parameters: μ (here μ = 0) and σ (here a) σ = 5, b) σ = 10, c) σ = 20). Wide distributions such as (c) are the result of poor or low precision. Distribution (a) has a narrow spread of values, so it is very precise.

Q: How do we quantify the width as a measure of precision?

A: "sigma" (σ) and "s", the standard deviation

Page 14:

Another example with data

Page 15:

Another way to get close to a Gaussian distribution is to measure a lot of data.

Page 16:

Properties of the Gaussian or Normal Distribution or Normal Error Curve

1. Maximum frequency (the number of measurements with the same value) occurs at zero error

2. Positive and negative errors occur at the same frequency (curve is symmetric)

3. Exponential decrease in frequency as the magnitude of the error increases.

Page 17:

The interpretation of the normal distribution, standard deviation, and probability: the area under the curve is proportional to the probability that you will find that value in your measurement. Clearly, we can see from our examples that the probability of measuring a value x from a certain range of values is proportional to the area under the normal curve for that range.

Range    Area under Gaussian distribution
µ ± 1σ   68.3%
µ ± 2σ   95.5%
µ ± 3σ   99.7%

The more times you measure a quantity, the more confident you can be that the average value of your measurements is close to the true population mean, µ. The standard deviation here is a parameter of the Gaussian curve.

The uncertainty decreases in proportion to 1/√n, where n is the number of measurements.
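The areas in the table above follow from the Gaussian itself: the area within ±kσ of the mean is erf(k/√2). As a quick sketch (not from the slides; note that erf gives 95.45% for k = 2, which the table rounds to 95.5%):

```python
# Areas under the Gaussian within +/- k sigma of the mean.
import math

areas = {k: math.erf(k / math.sqrt(2)) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"mu +/- {k} sigma: {100 * area:.2f}%")
```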

Page 18:

We can now say with certain confidence that the value we are measuring will be inside a certain range with some well-defined probability.

This is what can help us in quantitative analysis!

BUT, can we afford measurements of a large, almost infinite number of samples? Or repeat the measurement of one sample an almost infinite number of times???

Page 19:

As n gets smaller (≤ 5): µ → mean x̄, and σ → s

This is the world we are in, not an infinite number of measurements!!!!!!!

All our chemical analysis calculations start from these "approximations" of the Gaussian or normal distribution: the mean and the standard deviation.

We will introduce quantities that can be measured with a smaller number of samples: x̄ and s instead…….

Page 20:

Mean value and standard deviation

\bar{x} = \frac{\sum_i x_i}{n}

s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}

Examples: spreadsheet

Also interesting are: the median (same number of points above and below) and the range (or span, from the maximum to the minimum).

Page 21:

CHAPTER 04: Equation 4.1

Page 22:

CHAPTER 04: Equation 4.2

Page 23:

Example: For the following data set, calculate the mean and standard deviation. Replicate measurements from the calibration of a 10-mL pipette.

Trial   Volume delivered (mL)
1       9.990
2       9.993
3       9.973
4       9.980
5       9.982

Mean: \bar{x} = (9.990 + 9.993 + 9.973 + 9.980 + 9.982) / 5 = 9.984

Standard deviation:

s = \sqrt{\frac{(9.990-9.984)^2 + (9.993-9.984)^2 + (9.973-9.984)^2 + (9.980-9.984)^2 + (9.982-9.984)^2}{5-1}} = 8 \times 10^{-3}
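As a quick cross-check (a sketch, not from the book — the slides use a spreadsheet for this), the same numbers fall out of Python's statistics module, whose stdev uses the same n − 1 denominator:

```python
# Cross-check of the pipette example using the standard library.
from statistics import mean, stdev

volumes = [9.990, 9.993, 9.973, 9.980, 9.982]  # mL delivered per trial

x_bar = mean(volumes)   # sample mean
s = stdev(volumes)      # sample standard deviation (n - 1 in the denominator)

print(round(x_bar, 3))  # 9.984
print(round(s, 3))      # 0.008
```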

Page 24:

CHAPTER 04: Unnumbered Table 4.1

Page 25:

THE TRICK: Student's t (conversion to a small number of measurements, by fitting)

[Figure: curves for the t distribution and a normal distribution plotted against x.]

Student's t table.

Degrees of freedom = n − 1

Page 26:

Student's t table (see Table 4-2 in the book and handouts)

Degrees of freedom   90% CL   95% CL
1      6.314   12.706
2      2.920    4.303
3      2.353    3.182
4      2.132    2.776
5      2.015    2.571
6      1.943    2.447
7      1.895    2.365
8      1.860    2.306
9      1.833    2.262
10     1.812    2.228
15     1.753    2.131
20     1.725    2.086
25     1.708    2.068
30     1.697    2.042
40     1.684    2.021
60     1.671    2.000
120    1.658    1.980
∞      1.645    1.960

Page 27:

CHAPTER 04: Equation 4.4

Page 28:

CHAPTER 04: Figure 4.2

Page 29:

The square of the standard deviation is called the variance (s² or σ²).

σ² = 25
σ² = 100
σ² = 400

The standard deviation (or standard error) of the mean = s/√n

Link: Can we also use parameters similar to those of the normal distribution to characterize the certainties and uncertainties of our measurements?

Page 30:

  

The standard deviation, s, measures how closely the data are clustered about the mean. The smaller the standard deviation, the more closely the data are clustered about the mean .

 

 The degrees of freedom of a system are given by the quantity n–1.

Typically we use a small # of trials, so we never measure µ or σ.

Page 31:

THE TRICK: Student's t (conversion to a small number of measurements, by fitting)

[Figure: curves for the t distribution and a normal distribution plotted against x.]

The confidence interval is an expression stating that the true mean, µ, is likely to lie within a certain distance from the measured mean, x̄.

Confidence interval: µ = x̄ ± t·s/√n

where s is the measured standard deviation, n is the number of observations, and t is taken from the Student's t table.

Degrees of freedom = n − 1

Page 32:

4.2 Confidence interval

• Calculating CI

A CI for a range of values gives the probability, at a certain confidence level (say 90%), that the true value lies in that range.

Note: the CI is a statement about the true value.

Page 33:

Student's t table: identical to the table on Page 26 above (see Table 4-2 in the book and handouts).

Page 34:

Example: A chemist obtained the following data for the alcohol content in a sample of blood: percent ethanol = 0.084, 0.089, and 0.079. Calculate the 95% confidence interval for the mean.

\bar{x} = (0.084 + 0.089 + 0.079) / 3 = 0.084

s = \sqrt{\frac{(0.000)^2 + (0.005)^2 + (0.005)^2}{3-1}} = 0.005

From Table 4-2, t at the 95% confidence level with two degrees of freedom is 4.303.

So the 95% confidence interval = 0.084 ± (4.303)(0.005)/√3 = 0.084 ± 0.012

What is the CI at the 90% CL for this example?
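The arithmetic above can be sketched in a few lines (the t value 4.303 is simply read from the table; none of this code comes from the book):

```python
# 95% confidence interval for the blood-alcohol data.
from statistics import mean, stdev
from math import sqrt

data = [0.084, 0.089, 0.079]  # percent ethanol
t = 4.303                     # Student's t: 95% CL, n - 1 = 2 degrees of freedom

x_bar = mean(data)
s = stdev(data)
half_width = t * s / sqrt(len(data))

print(f"{x_bar:.3f} +/- {half_width:.3f}")  # 0.084 +/- 0.012
```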

Page 35:

Representation and the meaning of the confidence interval

The error bars include the target mean (10,000) more often for the 90% CL than for the 50% CL. Important information for a real process!!!

Page 36:

A control chart is prepared to detect problems when something is out of specification. As can be seen, when a point falls more than 3σ away at the 95% CL, there is a problem and the process should be examined.

Representation and the meaning of the confidence interval

Student's t values can aid us in the interpretation of results and help compare different analysis methods.

Page 37:

4-3 Comparison of Means , hypothesis

• Case 1:

• Case 2

• Case 3

The underlying question is: are the mean values from two different measurements significantly different?

Page 38:

Hypothesis about the TRUE VALUES and/or ESTABLISHED VALUES

We will say that two results differ from each other only if there is a > 95% chance that this conclusion is correct.

This statement about the comparison of values is the same as the concept of the "null hypothesis" in the language of statistics. The null hypothesis assumes that the two values being compared are, in fact, the same.

Thus, we can use the t test (for example) as a measure of whether the null hypothesis is valid or not.

There are three specific cases in which we can use the t test to question the null hypothesis.

Student's t values can aid us in the interpretation of results and help compare different analysis methods.

Page 39:

Answers to analytical chemistry questions:

Are the results certain, and do they indicate significant differences that could give different answers?

Page 40:

How to establish quantitative criteria?

Page 41:

Case #1: Comparing a Measured Result to a "Known Value"

Example: A new procedure for the rapid analysis of sulfur in kerosene was tested by analysis of a sample known, from its method of preparation, to contain 0.123% S. The results obtained were: %S = 0.112, 0.118, 0.115, and 0.119. Is this new method a valid procedure for determining sulfur in kerosene?

One way to answer this question is to test the new procedure on the known sulfur sample: if the 95% confidence interval of its result contains the known value, then the method should be acceptable.

\bar{x} = 0.116, s = 0.0033

95% confidence interval = 0.116 ± (3.182)(0.0033)/√4

\bar{x} = 0.116 ± 0.005

\bar{x} = 0.111 to 0.121, which does not contain the "known value 0.123% S". Because there is a < 5% probability that this discrepancy is due to chance, we conclude that this method is not a valid procedure for determining sulfur in kerosene.

Looks good, but…..

Page 42:

µ = x̄ ± t·s/√n  ⇒  t = |x̄ − µ|·√n / s

The statistical "t" value is found and compared to the tabulated "t" value. If t_found > t_table, we assume a difference at that CL (i.e. 50%, 95%, 99.9%).

Is the method acceptable at the 95% CL? dof = (n − 1) = 3, and at 95% t_table = 3.182 (from the Student's t table).

t_found = |x̄ − µ|·√n / s = |0.116 − 0.123| × √4 / 0.0033 = 4.24

t_found > t_table: 4.24 > 3.18, so there is a difference (thus the same conclusion).

…but this is the correct method to avoid problems.

**If you have than use it instead of mean

Page 43:

Another example, Case 1: Let's assume µ is known, and a new method is being developed.

If at 95% t_found > t_table, then there is a difference.

We have a new method of Cu determination and we have a NIST standard for Cu. The NIST value of Cu = 11.87 ppm. We do 5 trials and get x̄ = 10.80 ppm, s = 0.7 ppm. Is the method acceptable at the 95% CL? dof = (n − 1) = 4, and at 95% t_table = 2.776.

t_found = |x̄ − µ|·√n / s = |10.80 − 11.87| × √5 / 0.7 = 3.4

t_found > t_table: 3.4 > 2.8, so there is a difference present.
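A minimal sketch of the Case 1 criterion (the helper name t_found is ours, not the book's); it reproduces both worked examples:

```python
# t_found = |x_bar - mu| * sqrt(n) / s, compared against the tabulated t.
from math import sqrt

def t_found(x_bar, mu, s, n):
    return abs(x_bar - mu) * sqrt(n) / s

# Sulfur in kerosene: 4.24 > t_table = 3.182 -> difference (method rejected)
print(round(t_found(0.116, 0.123, 0.0033, 4), 2))
# NIST copper standard: 3.42 > t_table = 2.776 -> difference
print(round(t_found(10.80, 11.87, 0.7, 5), 2))
```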

Page 44:

Case #2: Comparing Replicate Measurements

(Comparing two independently obtained data sets using the "t" test. Note: the question is, "Are the means of two different data sets significantly different?" This could be used to decide whether two materials are the same, whether two independently performed analyses are essentially the same, or whether the precision of two analysts performing the analytical method is the same.)

For two sets of data consisting of n₁ and n₂ measurements with averages x̄₁ and x̄₂, we can calculate a value of t from:

t = \frac{|\bar{x}_1 - \bar{x}_2|}{s_{pooled}} \sqrt{\frac{n_1 n_2}{n_1 + n_2}}

where

s_{pooled} = \sqrt{\frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}}

Page 45:

Cont.

• The value of t is compared with the value of t in Table 4–2 for (n1 + n2 – 2) degrees of freedom. If the calculated value of t is greater than the t value at the 95% confidence level in Table 4–2, the two results are considered to be different.

• The CRITERION:

• If t_found > t_table, there is a difference!!

Page 46:

The Ti content (wt%) of two different ore samples was measured several times by the same method. Are the mean values significantly different at the 95% confidence level?

Sample     n    x̄        s
Sample 1   5    0.0134   4.0 × 10⁻⁴
Sample 2   5    0.0140   3.4 × 10⁻⁴

Page 47:

s_{pooled} = \sqrt{\frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}} = \sqrt{\frac{(4.0 \times 10^{-4})^2 (5-1) + (3.4 \times 10^{-4})^2 (5-1)}{5 + 5 - 2}} = 3.7 \times 10^{-4}

t = \frac{|\bar{x}_1 - \bar{x}_2|}{s_{pooled}} \sqrt{\frac{n_1 n_2}{n_1 + n_2}} = \frac{|0.0134 - 0.0140|}{3.7 \times 10^{-4}} \sqrt{\frac{(5)(5)}{10}} = 2.56

t from Table 4-2 at the 95% confidence level and 8 degrees of freedom is 2.306.

Since our calculated value (2.56) is larger than the tabulated value (2.306), we can say that the mean values for the two samples are significantly different.

If t_found > t_table, then a difference exists.
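The pooled calculation above, as a hedged sketch (the function name pooled_t is ours):

```python
# Case 2: pooled t test for two independent data sets.
from math import sqrt

def pooled_t(x1, s1, n1, x2, s2, n2):
    s_pooled = sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2))
    return abs(x1 - x2) / s_pooled * sqrt(n1 * n2 / (n1 + n2))

t = pooled_t(0.0134, 4.0e-4, 5, 0.0140, 3.4e-4, 5)
print(round(t, 2))  # 2.56 > 2.306 (8 dof, 95% CL) -> means differ
```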

Page 48:

Case #3: Comparing Individual Differences

(Here we use the t test with multiple samples, comparing the differences obtained by two methods on different samples, without duplicating samples. For example, it might be a reference method vs. a new method. This monitors the versatility of the method over a range of concentrations.)

This case applies when we use two different methods to make single measurements on several different samples.

t = \frac{\bar{d}}{s_d} \sqrt{n}

where

s_d = \sqrt{\frac{\sum_i (d_i - \bar{d})^2}{n - 1}}

and d is the difference in results between the two methods.

Page 49:

Sample   Composition by method 1 (old)   Composition by method 2 (new)   Difference d
A        0.0134                          0.0135                          +0.0001
B        0.0144                          0.0156                          +0.0012
C        0.0126                          0.0137                          +0.0011
D        0.0125                          0.0137                          +0.0012
E        0.0137                          0.0136                          −0.0001

Page 50:

s_d = \sqrt{\frac{\sum_i (d_i - \bar{d})^2}{n - 1}} = \sqrt{\frac{16.6 \times 10^{-7}}{4}} = 6.4 \times 10^{-4}

t = \frac{\bar{d}}{s_d} \sqrt{n} = \frac{0.00070}{0.00064} \sqrt{5} = 2.45

t from Table 4-2 at the 95% confidence level and 4 degrees of freedom is 2.776. Since our calculated value (2.45) is smaller than the tabulated value (2.776), we can say that the results of the two methods are not significantly different.

If t_found > t_table, then a difference exists. 2.45 > 2.776 is not true, so no difference exists!
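The paired-difference arithmetic can be sketched the same way; note the last difference is taken as −0.0001 (0.0136 − 0.0137), which is what makes the squared deviations sum to 16.6 × 10⁻⁷:

```python
# Case 3: t test on individual differences (method 2 minus method 1).
from statistics import mean, stdev
from math import sqrt

d = [0.0001, 0.0012, 0.0011, 0.0012, -0.0001]

t = mean(d) / stdev(d) * sqrt(len(d))
print(round(t, 2))  # 2.43 < 2.776 (4 dof, 95% CL) -> no significant difference
```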

Page 51:

[Figure: confidence intervals compared with a known true value; tabulated t values indicate rejection (C) or acceptance (A) of the hypothesis.]

Page 52:

What to do with outliers, points far from the rest? Keep them or not?

G method (Grubbs test)

FYI: also the Q method

(both should give similar estimates)

Page 53:

CHAPTER 04: Unnumbered Figure 4.4

Page 54:

CHAPTER 04: Equation 4.13

Page 55:

CHAPTER 04: Table 4.5

Page 56:

4-6) Test for Bad Data (Q test; Dixon's test for outliers)

Sometimes one datum appears to be inconsistent with the remaining data. When this happens, you are faced with the decision of whether to retain or discard the questionable data point. The Q test allows you to make that decision:

Q = gap / range

gap = difference between the questionable point and the nearest point

If Q (observed or calculated) > Q (tabulated), the questionable data point should be discarded.

Q table for the rejection of data values

# of observations 90% 95% 99%

       

3 0.941 0.97 0.994

4 0.765 0.829 0.926

5 0.642 0.71 0.821

6 0.56 0.625 0.74

7 0.507 0.568 0.68

8 0.468 0.526 0.634

9 0.437 0.493 0.598

10 0.412 0.466 0.568

range = spread of the data (maximum − minimum)

Page 57:

Example: For the 90% CL, can the value of 216 be rejected from the following set of results? Data: 192, 216, 202, 195 and 204

Q = gap / range

gap = 216 − 204 = 12; range = 216 − 192 = 24

Q = 12/24 = 0.50

Q(tabulated) = 0.64. Q(observed) is less than Q(tabulated) (0.50 < 0.64), so the data point cannot be rejected.
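The Q-test decision above in code (a sketch; the tabulated 0.642 for n = 5 at 90% is read from the Q table):

```python
# Dixon's Q test for the suspect value 216.
data = sorted([192, 216, 202, 195, 204])

gap = data[-1] - data[-2]   # suspect point (216) minus its nearest neighbor (204)
rng = data[-1] - data[0]    # full spread of the data
q = gap / rng

print(q)  # 0.5 < 0.642 (n = 5, 90% CL) -> keep the point
```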

Page 58:

Note:

You must deal with the result of the Q test. If it says the datum should be discarded, the simplest course is to throw it away. It is unethical to KEEP a datum the test rejects!!!

Page 59:

4-4 FYI: F Test

The F test provides a simple method for comparing the precision of two sets of identical measurements.

F = s_1^2 / s_2^2

where s₁ is the standard deviation of method 1 and s₂ is the standard deviation of method 2.

The F test may be used to provide insight into either of two questions: (1) Is method 1 more precise than method 2? (2) Is there a significant difference in the precision of the two methods?

For the first case, the variance of the supposedly more precise procedure is denoted s₂². For the second case, the larger variance is always denoted s₁².

Page 60:

Critical values for F at the five percent level

Denominator dof \ Numerator dof:
        2      3      4      5      6      12     20     ∞
2     19.00  19.16  19.25  19.30  19.33  19.41  19.45  19.50
3      9.55   9.28   9.12   9.01   8.94   8.74   8.66   8.53
4      6.94   6.59   6.39   6.26   6.16   5.91   5.80   5.63
5      5.79   5.41   5.19   5.05   4.95   4.68   4.56   4.36
6      5.14   4.76   4.53   4.39   4.28   4.00   3.87   3.67
12     3.89   3.49   3.26   3.11   3.00   2.69   2.54   2.30
20     3.49   3.10   2.87   2.71   2.60   2.28   2.12   1.84
∞      3.00   2.60   2.37   2.21   2.10   1.75   1.57   1.00

Page 61:

Example

The standard deviation of six data points obtained in one experiment was 0.12%, while the standard deviation of five data points obtained in a second experiment was 0.06%. Are the standard deviations statistically the same for these two experiments?

This example deals with question #2, so the larger standard deviation is placed in the numerator:

F = (0.12)² / (0.06)² = 4.00

Note: if F_calculated > F_table, then a difference exists.

F(tabulated) = 6.26 (5 and 4 degrees of freedom), so the difference between the standard deviations is statistically insignificant, i.e. no difference.
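The F-test arithmetic as a sketch; the critical value 6.26 is read from the table above (5 and 4 degrees of freedom):

```python
# F test: the larger variance goes in the numerator.
s1, s2 = 0.12, 0.06  # standard deviations, larger one first

F = s1**2 / s2**2
print(round(F, 2))  # 4.0 < 6.26 -> precisions are not significantly different
```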

Page 62:

Additional material (cont.)

Page 63:

Systematic error

• What if all measurements are off the true values?

Page 64:

FYI: SYSTEMATIC ERROR. (1) Back to types and origin(s) of uncertainties/errors. We must address errors when designing and evaluating any analytical method or performing an analysis/determination.

Systematic (determinate) errors: when they are detected, we must remove them, reduce them, or have them under control. The signature of a determinate error is that all results lie on one side of the true value.

Examples of systematic error:

Instrument errors: a thermometer constantly reads two degrees too high, so we can use a correction factor. A balance is out of calibration, so we must have it calibrated.

Method errors: species or reagents are not stable, or possible contamination. The relationship between analyte and signal is invalid (standard curve not linear). Limitations of equipment (tolerances, measurement errors of glassware, etc.). Failing to calibrate your glassware or instrumentation. Lamp light source not properly aligned.

Personal errors: color blindness, prejudice (you think the measurement is OK, or is bad). We make these a lot of the time! Not reading instruments correctly, not following instructions!

Suggested ways of eliminating systematic error: equipment calibration, self-discipline, care, analysis of known reference standards, variation of sample size, blank determinations, independent lab testing of the method or sample.

Page 65:

Random error is always present and always "symmetrical"!!! Random (indeterminate) errors cannot be reduced unless you change the instrument or method; they are always present and are distributed around some mean (true) value. Thus data readings will fluctuate between low and high values around a mean value. We often use statistics to characterize these errors.

Page 66:

A measurement

Examples: parallax error reading a buret, or instrumental noise such as the electrical voltage noise of the recorder, detector, etc.