Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it .

Statistics for Biologists

Using Microsoft Excel

Luisa Cutillo

cutillo@tigem.it

http://bioinformatics.tigem.it

Class Topics:

Basic concepts and practice in excel

• Hypothesis Testing Methodology

• p-Value Approach to Hypothesis Testing

• Comparative Statistics examples (T-Test, Chi squared)

• Multiple Hypotesis Testing (FDR)

•Descriptive and association statistics

• A hypothesis is an assumption about the population parameter.– A parameter is a

characteristic of the population, like its mean or variance.

– The parameter must be identified before analysis.

I assume the mean AGE of this class is 50!!!

Am I correct? TEST IT!

What is a Hypothesis

Of a test?

• States the Assumption (numerical) to be tested

e.g. Our class mean age is 50 (H0: µ=50 )

• Begin with the assumption that the null hypothesis is TRUE.

(Similar to the notion of innocent until proven guilty)

The Null Hypothesis, H0

The Null Hypothesis may or may not be rejected,but our aim is to REJECT the null

hypothesis!

• Is the opposite of the null hypothesise.g. The average age of our

class is different from 50 (H1: µ ≠50)

• Is generally the hypothesis that is believed to be true by the researcher!

The Alternative Hypothesis, H1

• Steps:– State the Null Hypothesis– State its opposite, the Alternative

Hypothesis• Hypotheses are mutually exclusive & exhaustive

• Sometimes it is easier to form the alternative hypothesis first.

Identify the Problem

Assume thepopulationmean age is 50.(Null Hypothesis)

REJECT

The SampleMean Is 20

Population

SampleNull Hypothesis

Hypothesis Testing Process

No, not likely!

Sample Meanµ = 50

Sampling DistributionOur sample mean (20) falls in the tails!It’s not likely! Hypotyzed

population mean.

we reject the null

hypothesis that µ = 50.

Reason for Rejecting H0

Observed population mean

• Defines the Rejection region

• Typical value of a is 0.05. It Provides the Critical Value(s) of the Test

Level of Significance, α

Critical Value

Rejection Regions

“Area” of the Rejection region

Level of Significance, α and the Rejection Region

H1: < 0

H1: > 0

Critical Value(s)

Rejection Regions

One tail (left) test

One tail (right) test

Two tails test

• Type I Error– Reject Null Hypothesis when it is

True (“False Positive”)– Has Serious Consequences– Probability of Type I Error Is α

• Called Level of Significance

• Type II Error– Do Not Reject Null Hypothesis

when it is False (“False Negative”)– Probability of Type II Error Is β

( Power 1- β )

Errors in Making Decisions

Reduce probability of one error and the other one goes up.

& Have an Inverse Relationship

One possibility: Increase the sample size!!!!

• The p-value is the Probability of Obtaining a Test Statistic (under H0) more Extreme or ) than the observed Sample Value

• Used to Make Rejection Decision

– If p value < Reject H0 SUCCESS

– If p value Do Not Reject H0 FAILURE

What is the p Value and how to use it in a Test?

One tail testObservedSample Value

Random variables: am I observing continuous or discrete data???

Roughly speaking

a “random” variable is a quantity whose values are “random” and to which a probability distribution is assigned

(e.g. a fair dice outcomes have same chance of coming up at each throw ) ;

THE DIFFERENCE BETWEEN CONTINUOUS AND DISCRETE VARIBLES IS

FUNDAMENTAL IN CHOOSING THE KIND OF TEST STATISTICS!

Discrete R.V.

If the r.v. X values belongs to a finite set {x1 ,x2,…, xn}

then X is called DISCRETE (usually counts)

As example the flipping of a coin, the number of red cells counted in an image, the number of success in 100 trials…are

observations of a discrete variable!

Continuous R.V.A continuous random variable is a r.v. which takes an infinite number of possible

values.

Continuous random variables are usually measurements. Examples include height, weight, the amount of sugar in an orange,

the time required to run a mile,the fluorescence intensity in a microarray,

(A continuous random variable is not defined at specific values

bat over intervals of values)

Which test to use?

First of all you should choose a summary SAMPLE STATISTIC!

As. Example:

SAMPLE MEAN

SAMPLE VARIANCE

SAMPLE COVARIANCE

SAMPLE CORRELATION

T-test!

• Assumptions– Population is normally distributed– If not normal, only slightly skewed &

a large sample taken (Central limit theorem applies)

• Parametric test procedure (sample stat. is the sample mean!)

• t test statistic, with n-1 degrees of freedom

paired t-Test: s Unknown

(rigth and left eye)

Reject H0

H0: ≠ H1: < 0

H0: ≠0 H1: > 0

Must Be Significantly below

Small values don’t contradict H0 Don’t

Reject H0!

Rejection Region

(one tail)

Unpaired T-test

•The two sample observations are not coupled•Not necessary equal sample numbers•You may distinguish between equal and unequal sample variance

In few words the other tests:• If you want to compare more then two

populations means when you observe 1 characteristic: one way Anova Test

• If you want to compare more then two populations means when you observe 2 characteristic: two way Anova Test

• If you want to compare two populations variance: F-test

• If you want to compare two populations proportions: Chi-square test

Remark

• If you have counts…or few data YOU ARE NOT ALLOWED TO USE T-TEST!!!

• Any test is build upon conjecture about the shape of the null distributions…again if you have few data or any doubt…please contact us!

• If you just want to have a summary about your data, then use the descriptive statistic excel sheet

PRACTICALS:Handy Guide

In Excel

Luisa Cutillo

cutillo@tigem.it

http://bioinformatics.tigem.it

HOW IN EXCEL….

• DESCRIPTIVE STATISTICS

• ASSOCIATION STATISTICS

• COMPARATIVE STATISTICS

• STATISTICS FOR FREQUENCY DATA

• FDR (for Multiple Hypothesis Testing)

REMARK

• Statistical formulae and tables can look mysterious and confusing

• You don’t really need to make calculations yourself

• Excel has most of the common statistical tests built in

• EXCEL HOWEVER IS NOT A STATISTICAL SOFTWARE! But it can be used for a basic analysis level.

DESCRIPTIVE STATISTICS How to summarise the collected measurements?

(time, length, temperature, expression level..)Excel provides 3 measures of the centre of a distribution of replicates:

Aritmetic mean: =AVERAGE(range) most appropriate for normal approximation!

Median (Pr(X> or < median)= 0.5): =MEDIAN(range)

Mode (most frequent value) : =MODE(range)

DESCRIPTIVE STATISTICS:DESCRIPTIVE_STAT_toy.xls

The mean has no meanig without some measure of spread or variation:Aritmetic mean: AVERAGE(range) most appropriate for normal approximation!The range:MAX(range)-MIN(range)The variance: VAR(range)Standard deviation: STDV(range)Standard error MEAN:

STDV(range)/SQRT(COUNT(range))Confidence interval:

=CONFIDENCE(0.05,STDV(range),COUNT(range))

ASSOCIATION STATISTICS Task: investigate an association between two variables

(ex. Two genes expression values).

Correlation: to see if two variables vary together i.e. One goes up, the other goes up (or goes down) [excel]

Regression:to see how one variable affects another [contact us!]

The most common tests for correlation are:Pearson coefficient for nomally distributed data

(parametric): to see if two variables vary together i.e. One goes up, the other goes up (or goes down) [excel]

CORREL(range 1, range 2)or

PEARSON(range 1, range 2)

Spearman rank-order correlation coefficient (non parametric) [contact us!]

Both vary from +1 (perfect correlation) through 0 (no correlation) to -1 (anti correlation)

Two types of correlation coefficient. The data are the lengths of a leg bone (in mm) in penguin mating pairs. The Pearson coefficient r can be calculated directly from the data, but the Spearman coefficient rs must be calculated from the ranks of the data. The ranks can either be entered by hand or calculated using Excel’s =RANK formula.

Ex1.xls

correlation_covar_toy.xls

COMPARATIVE STATISTICS: test_toy.xls Task: Compare two or more sets of data do determine

whether they are basically the same or they are significantly different.

Final result: probability P that the null hypothesis of no difference is true.

In Biology usually: we say that there is a significant difference if P<5%. The most common test for normally distributed data is the T-TEST;

=TTEST((range1,range2,tails,type) which returns directly the P value.tails: 1 for one tailed test

2 for two tailed test (most used in biology, test for differences reguardless of the sign)type: 1 for paired data (one sample, dependent data) 2 for unpaired data (two samples, equal variance) 3 for two sample unequal variance (Never use it!)

MICROARRAY Hypothesis Testing

We want to compare two biologically different samples (ex. Wild Type vs Mutant) through the identification of

differentially expressed genes

We have to simultaneously test, for each gene, the null hypothesis: gene expression has not changed.

For each gene j the test is expressed in term of a Statistic and a p-value

Null Hypothesis Ho: j(WT)=j(KO)

Which is the test to use in this case?

For each gene the test is expressed in term of a Statistic and a p-value

Null Hypothesis

Ho: j(WT)=j(KO)

T-statistic on gene j --> p-value

p-value Is true

( α )

Reminder:The p-value is the probability of finding a false positive (probability of type I error) that is the probability of finding out a differentially expressed gene that actually is not!!!

Ex. If α=0.01 and p<α, then 0.01 represent the probability that the gene detected is a false positive.

Problems in controlling the errors…

Assume that a chip experiment reveals the expression level of m = 20.000 genes relatively to two different biological conditions.

We want to test, simultaneously for each gene, the null hypothesis that the gene is not differentially expressed against the alternative that it is.

If we test each of the m hypothesis at level p<α=0.01, we would expect about 200 false positive!!!

Multiple error controlling procedures:Bonferroni

BonferroniCorrection(FWER)

In practice for each gene you have to compute a new p-value pj<Tr=α/m ----> pj*m<α ---> Pbonf<α

and you should retrieve all the genes for which Pbonf=pj*m <α

Multiple error controlling procedures:Benjamini - HockbergConsider the p-values sorted in ascending order:

p(1)<p(2)<... <p(m)

For the j-st gene the new pBH is p(j)*m/j

So you have to detect all the genes whos sorted p-value is s.t.

p(j) m/j< α

In practice for the j-st gene you have to compute a new p-value

Pcorrect(j)=p(j)*m/jand you should retreive all the genes for wich Pcorrect<α

MicroarrayFdr.xls

Statistics for frequency data

Sometimes in biology results are not measurements but counts (or frequencies)!e.g. counts of different phenotypes, counts of cell types ...

Task: Compare frequency data in different categories with some expected data

You are NOT ALLOWED to perform a t test! Instead you do a Chi-squared test;

=CHITEST(observed range,expected range) which returns directly the P value ( probability that the null hypothesis of no difference between the observed and the expected is true).

Statistics for frequency data Three different uses:

Expected calculated from theory: you test if your observed data agree with the theory. E.g. Mendel theory can be used to predict frequencies of different phenotypes: we expect a genetic cross to be 3:1 ratio of red and white flowers.(P>5% data agree with theory)

Expected calculated assuming that the counts in all the categories should be the same: you test whether there is a difference between the observed sets. (P<5% data significantly different from each other)

Investigate association between frequency data in two separate groups. Expected calculated assuming counts in one group are not affected by counts in the other. (P<5% there is a significant association). Data are set in a contingency table. For each cell the expected data is:

E=(column total x row total)/grand total

Statistics for frequency data Two kinds of chi-squared test.Top: expected values from theory, calculated assuming 3/4 of the flowers should be red and 1/4 should be white.

Bottom: expected values assuming equal distribution.

Ex2.xls

Statistics for frequency data

The chi squared test for association.The observed data were entered in the upper table, and the expected data in the lower table were calculated from the sums for each column and row. Only some examples of the formulae used are shown.

Ex2.xls

References:

• Biology statistics made simple using Excel, Millar

Now ...”test” your lunch!!!

Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it .

Documents

Transcript of Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it .

Famous Biologists

FLI biologists have been researching

Biologists’ Tools & Technology Technology continually changes the way biologists work. 1.

Pragmatic R for Biologists

for Biologists

AVIAN NECROPSY MANUAL FOR BIOLOGISTS IN … bird manual...AVIAN NECROPSY MANUAL FOR BIOLOGISTS IN REMOTE ... This manual is for biologists in remote refuges who have little to no background

Atomic Force Microscopy for Biologists

AAAS talk: Bioinformatics for Biologists

Mariarosa Cutillo Presentazione con la collaborazione di Clara Castellucci Foto di Claudio Fagioli.

Etool.Me Platform for Biologists

Zoologists and Wildlife Biologists

REGIONAL BIOLOGISTS & COORDINATORS

Bioinformatics for Human Biologists

Perl Scripting for Biologists -

Relational Databases for Biologists: Efficiently Managing ...barc.wi.mit.edu/education/bioinfo2006/db4bio/Databases_1_color.pdf · Relational Databases for Biologists © Whitehead

Computer Programming for Biologists

Biologists’ Tools and Technology

Programming for biologists

Notable biologists

Perl for Biologists