Basic data analyses skills for science research
The Dos and Don’ts!!
Prepared by Law HL
Statistics
the practice or science of collecting and
analysing numerical data in large quantities,
especially for the purpose of inferring proportions
in a whole from those in a representative sample.
used to communicate research findings and to
support hypotheses and give credibility to research
methodology and conclusions.
Two Branches of Statistics
Example 1: Is the lipase concentration
significantly different among the various fruits?
Fruit sample   1st Sample   2nd Sample   3rd Sample   Average lipase concentration
Lime           0.564        0.585        0.606        0.585
Lemon          0.104        0.101        0.107        0.104
Grapefruit     0.182        0.183        0.181        0.182
Avocado        0.415        0.637        0.550        0.534
Peanut         0.182        0.328        0.405        0.367
[Bar chart: Average lipase concentration in various fruits. x-axis: Fruits (Lime, Lemon, Grapefruit, Avocado, Peanut); y-axis: Average lipase concentration (U/100uL), 0 to 0.8]
No observable
difference between the
average lipase
concentration of lime
and avocado
No significant
difference between the
average lipase
concentration of lime
and avocado!!
Student’s Conclusion:
Lime has a significantly higher ?? lipase
concentration than the other fruit
samples.
Error Bars
Overlap – no observable difference
Overlap – no significant difference if
inferential stats is used
No overlap – observable difference
No overlap – significant difference if
inferential stats is used
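The overlap check can be sketched in a few lines of Python. This is only an illustration using the Example 1 readings. The slides do not say what the error bars represent, so the sketch assumes mean ± 1 sample standard deviation; the function names are made up for this example.

```python
from statistics import mean, stdev

# Lipase readings from Example 1 (three samples per fruit).
readings = {
    "Lime": [0.564, 0.585, 0.606],
    "Lemon": [0.104, 0.101, 0.107],
    "Avocado": [0.415, 0.637, 0.550],
}

def error_bar(values):
    """Return (lower, upper) ends of a mean +/- 1 SD error bar (assumed convention)."""
    m, sd = mean(values), stdev(values)
    return (m - sd, m + sd)

def bars_overlap(a, b):
    """True when the two error bars share at least one point."""
    lo_a, hi_a = error_bar(a)
    lo_b, hi_b = error_bar(b)
    return lo_a <= hi_b and lo_b <= hi_a

lime_vs_avocado = bars_overlap(readings["Lime"], readings["Avocado"])
lime_vs_lemon = bars_overlap(readings["Lime"], readings["Lemon"])
print("Lime vs Avocado overlap:", lime_vs_avocado)
print("Lime vs Lemon overlap:", lime_vs_lemon)
```

Under this assumption the lime and avocado bars overlap while the lime and lemon bars do not. Overlap only describes what is observable; a claim of significance still needs an inferential test.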
Example 2: Is the average distance
travelled by the shuttlecock
significantly different among the
various shots?
Trials   Set 1   Set 2   Set 3   Set 4   Set 5   Set 6   Average
Shot 1   4.921   4.698   4.598   4.822   5.171   5.096   4.884
Shot 2   4.879   4.772   4.772   4.787   4.808   4.596   4.769
Shot 3   4.483   4.536   4.565   4.430   4.760   4.594   4.561
Shot 4   4.392   4.268   4.096   4.162   4.388   4.462   4.295
Shot 5   4.180   4.122   4.142   4.092   4.238   3.712   4.081
Shot 6   3.612   3.698   3.612   3.962   3.788   3.928   3.767
[Bar chart: Average distance travelled by the shuttlecock for each of the six shots. x-axis: Shots (Shot 1 to Shot 6); y-axis: Average distance (m), 0 to 6]
Student’s Conclusion:
There is a significant difference?? in the
average distance travelled by the
shuttlecock among the six shots.
Statistical Significance
Results are statistically significant when
they are likely due to REAL treatment
effects and NOT due to Chance.
The P-Value approach
P-Value – the probability of obtaining a
result at least as extreme as the one observed
if the hypothesis being tested (the null
hypothesis) is true.
The smaller the P-Value, the stronger the
evidence that the results are statistically significant.
So…what is the P-Value for a
statistically significant result?
Generally……
P < 0.05 (Results are statistically
significant)
P < 0.001 (Results are extremely
statistically significant)
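As a small illustration, the thresholds above can be written as a helper function. The function name and the wording of the returned labels are assumptions made for this sketch; the cut-offs (0.05 and 0.001) are the ones given on the slide.

```python
def interpret_p(p_value, alpha=0.05):
    """Label a P-value using the slide's cut-offs (alpha = 0.05 by default)."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a P-value must lie between 0 and 1")
    if p_value < 0.001:
        return "extremely statistically significant"
    if p_value < alpha:
        return "statistically significant"
    return "not statistically significant"

print(interpret_p(0.04))    # statistically significant
print(interpret_p(0.0005))  # extremely statistically significant
print(interpret_p(0.20))    # not statistically significant
```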
Example 3: Is there a significant
difference in the absorbance of the
reaction mixture of papain at various
concentrations?
P = 0.04
There is a significant difference in the average absorbance among the three concentrations of Papain.
Concentration of Papain (%)   Absorbance readings
2                             0.40   0.52   0.51   0.49   0.42
5                             0.35   0.42   0.44   0.53   0.31
10                            0.41   0.36   0.21   0.21   0.33
Various Statistical tools for
generating P-Values.
Statistical Analyses
Group comparisons
Establishing linear relationships
between variables
Group comparisons
2 groups
  Sample size n = 5 - 15: Mann-Whitney U-Test
  Sample size n > 15: T-Test
More than 2 groups
  Sample size n = 5 - 15: Kruskal-Wallis Test
  Sample size n > 15: ANOVA, then post hoc test (Multiple Comparisons)
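The selection rules above can be sketched as a small function. This is an illustration only: the function name is hypothetical, and using the smallest group to decide is an assumption for the case where group sizes differ. The n < 5 branch follows the rule given later in these slides.

```python
def choose_test(group_sizes):
    """Suggest a test from the number of groups and the per-group sample sizes."""
    if len(group_sizes) < 2:
        raise ValueError("need at least two groups to compare")
    smallest = min(group_sizes)  # assumption: the smallest group decides
    if smallest < 5:
        return "none (n < 5: do not use inferential statistics)"
    if len(group_sizes) == 2:
        return "Mann-Whitney U-Test" if smallest <= 15 else "T-Test"
    if smallest <= 15:
        return "Kruskal-Wallis Test"
    return "ANOVA, then post hoc multiple comparisons"

print(choose_test([9, 9]))         # Mann-Whitney U-Test
print(choose_test([20, 18]))       # T-Test
print(choose_test([5, 5, 5, 5]))   # Kruskal-Wallis Test
print(choose_test([16, 20, 30]))   # ANOVA, then post hoc multiple comparisons
```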
Various Statistical tools for
generating P-Values (I)
Example 4: An experiment was
conducted to find out if the survival of E. coli differed between cultures grown in brass pots and those grown in glass pots.
Since there are two groups to be compared and n = 9, use the Mann-Whitney Test.
Results: P > 0.05
There is no significant difference in the average number of bacterial colonies between the two samples.
Number of bacterial colonies in each pot
Brass pot   Glass pot
405         412
310         231
196         89
63          567
167         134
312         253
675         423
465         134
78          231
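For readers who want to see how the Mann-Whitney comparison works on these counts, here is a minimal sketch. It counts pairwise "wins" to get the U statistics and then uses a normal approximation for the two-sided P-value. The approximation (with no tie or continuity correction) is an assumption of this sketch and is rough for n = 9; statistical software or exact tables would normally be used.

```python
import math

# Colony counts from Example 4.
brass = [405, 310, 196, 63, 167, 312, 675, 465, 78]
glass = [412, 231, 89, 567, 134, 253, 423, 134, 231]

def mann_whitney_u(x, y):
    """Return (U_x, U_y): pairs won by each group, with ties counted as 0.5."""
    u_x = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    return u_x, len(x) * len(y) - u_x

def approx_two_sided_p(u, n1, n2):
    """Normal approximation to the two-sided P-value (no tie/continuity correction)."""
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = abs(u - mu) / sigma
    return math.erfc(z / math.sqrt(2))

u_brass, u_glass = mann_whitney_u(brass, glass)
p = approx_two_sided_p(min(u_brass, u_glass), len(brass), len(glass))
print("U (brass) =", u_brass, " U (glass) =", u_glass)
print("approximate P > 0.05:", p > 0.05)
```

The two U values (41 and 40) are almost equal, which is what we expect when neither group tends to have higher counts, and the approximate P-value is well above 0.05, in line with the result reported above.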
Example 5: Experiment to find out if temperature
readings differ among the various layers.
Since n < 5 for each group, none of our
statistical tools is appropriate for the
analysis.
Example 6: Experiment to find out if the mean concentration
of ethanol produced differed significantly between
the two methods.
If n > 15 for both groups, use T-Test set at α = 5%
Example 7: Experiment to find out if the mean
amount of ion adsorbed by mango
peels differed significantly among the
three groups.
3 T-Tests??
T-Test
P = 0.00003
T-Test
P = 0.00005
T-Test
P = 0.00379
Example 7:
For comparing more than 2 means with
n > 15 for each treatment group, use
ANOVA.
DO NOT USE MULTIPLE T-TESTS as
the error rate gets INFLATED!!
If ANOVA shows a significant difference
in the means among the groups, use
Tukey’s Multiple Comparisons to
determine where the difference lies.
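The inflation can be illustrated with a quick calculation. The sketch below assumes the tests are independent, which is a simplification; the function name is made up for this example.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

# Three pairwise T-Tests at alpha = 5%, as in Example 7.
rate = familywise_error_rate(0.05, 3)
print(f"{rate:.3f}")  # 0.143
```

So with three T-Tests at α = 5% each, the chance of at least one false "significant" result is about 14%, not 5%.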
Example 8: Experiment to determine if there is a significant
difference in the average acid concentration
among the four preparations.
Comparing averages among three or more
groups with 5 ≤ n ≤ 15 for each group:
use the Kruskal-Wallis Test.
Preparation A Preparation B Preparation C Preparation D
0.45 0.35 0.24 0.34
0.35 0.56 0.12 0.56
0.46 0.24 0.13 0.53
0.24 0.56 0.17 0.43
0.56 0.24 0.45 0.21
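A minimal sketch of the Kruskal-Wallis calculation on the Example 8 data is given below. It pools the readings, ranks them (tied values share the average rank), and computes the H statistic with the usual tie correction. Comparing H with the chi-square critical value 7.815 (3 degrees of freedom, α = 5%) is an approximation for groups this small; the function names are chosen for this illustration.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j as 1-based ranks
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """H statistic with tie correction (assumes the values are not all equal)."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    return h / (1 - ties / (n ** 3 - n))

# Acid concentrations from Example 8 (Preparations A to D).
preparations = [
    [0.45, 0.35, 0.46, 0.24, 0.56],
    [0.35, 0.56, 0.24, 0.56, 0.24],
    [0.24, 0.12, 0.13, 0.17, 0.45],
    [0.34, 0.56, 0.53, 0.43, 0.21],
]
h = kruskal_wallis_h(preparations)
CRITICAL_5_PERCENT_DF3 = 7.815  # chi-square critical value, 3 degrees of freedom
print("H =", round(h, 3), " exceeds critical value:", h > CRITICAL_5_PERCENT_DF3)
```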
Establishing linear relationships
between variables
Functional dependence of one variable on another
Simple linear regression
No functional dependence between variables
Simple linear correlation
Various Statistical tools for
generating P-Values (II)
Simple Linear Regression
Two variables
One variable (dependent/response variable)
depends on the other (independent/predictor
variable)
Represented by
scatterplots
Reported with
r² and P-value
r² and P-value in regression
analysis
r² – coefficient of determination
Measures how much of the variation in
the dependent variable is explained by the
independent variable.
0% ≤ r² ≤ 100%
P-Value – the probability of obtaining a
sample slope at least as far from zero as the
one observed if the actual (population)
slope is zero.
Always report
both r² and
P-value.
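To make the slope and the coefficient of determination concrete, here is a minimal least-squares sketch. The data points are hypothetical and chosen only for illustration. The P-value for the slope needs the t-distribution and is left to statistical software in this sketch.

```python
def simple_linear_regression(x, y):
    """Return (slope, intercept, r_squared) for the least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # share of variation in y explained by x
    return slope, intercept, r_squared

# Hypothetical readings: independent variable x, dependent variable y.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r_squared = simple_linear_regression(x, y)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r^2 = {r_squared:.1%}")
```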
r² and P-value in regression
analysis
[Figure: sample slope compared with population slope; n = 5, r² = 0.80]
Simple Linear Correlation
Two variables
Neither of the two is functionally
dependent on the other
Represented by scatterplots
r (Pearson correlation coefficient) –
measures the strength of the linear
relationship between two variables.
Always report both r and P-value
Guidelines for interpreting r
Strength of Association   Positive r    Negative r
Small                     0.1 to 0.3    -0.1 to -0.3
Medium                    0.3 to 0.5    -0.3 to -0.5
Large                     0.5 to 1.0    -0.5 to -1.0
Caution……………………..
It is not appropriate to analyze a non-linear
relationship using the Pearson
correlation coefficient
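The caution can be demonstrated with a short sketch. It computes Pearson's r for a perfectly curved relationship (y = x²) and for a straight line, then labels the strength using the guideline ranges. The data are hypothetical. Placing boundary values (0.3, 0.5) in the higher category and calling |r| below 0.1 "negligible" are assumptions of this sketch, since the guidelines do not cover them.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Label |r| using the guideline ranges (boundary handling is an assumption)."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "negligible"

x = [-2, -1, 0, 1, 2]
y_curve = [v ** 2 for v in x]     # perfect, but non-linear, relationship
y_line = [3 * v + 1 for v in x]   # perfect linear relationship

r_curve = pearson_r(x, y_curve)
r_line = pearson_r(x, y_line)
print("curve:", r_curve, strength(r_curve))
print("line: ", r_line, strength(r_line))
```

Here y depends completely on x in both cases, yet r is 0 for the curve. This is why a scatterplot should be checked before r is reported.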
Example 10: Experiment to find out if there is a significant
correlation between percentage of DPPH
reacted and concentration of fruit peel extract.
•P-Value?
•Scatterplot?
1. The Don’ts……………………
For n < 5, DO NOT analyze your data
with inferential statistics.
E.g. Trying to determine if the amount of
heavy metal ion removed differed
among the three methods
Method 1 Method 2 Method 3
0.421 0.324 0.534
0.521 0.512 0.342
0.654 0.526 0.523
[Chart; axis label: Concentration of heavy metal ion removed]
2. The Don’ts………………
When no statistical analysis is being
performed on the data sets, refrain from
using the word ‘Significant’!
You can however claim that ‘there is an
observable difference…’
3. The Don’ts…………………
Data analyses DO NOT PROVE
hypotheses.
The results either support or do not
support the hypotheses.
Refrain from using the words ‘Prove’ or
‘Discover’!!
4. The Don’ts…………………
Do not attempt to analyze too many variables at
the same time!
Analysing multiple variables at the same time requires
Multivariate Statistical Analyses!!
The Dos…………
Decide on the appropriate significance level before statistical analyses (e.g. 5%)
Always factor in the appropriate statistical tool for analyzing your data at the planning stage
Always report your significance level and P-value!
Consult your teachers or Mr Law if you have any queries