Experimental Design Principles for Big Data Bioinformatics Analysis

Bruce A Craig
Department of Statistics, Purdue University

My Background

• Professor of Statistics (at Purdue 20+ years)
• Director of Purdue’s Statistical Consulting Service (SCS) since 2004
  – Free service available to all associated with Purdue
  – Balances research and teaching
    • Education and training of student consultants (and clients)
    • Improving the quality and productivity of research
• Had been involved in the design and analysis of spotted microarrays and QTL analysis in polyploids
• Now just try to keep up with the ever-changing technologies within my SCS role

Importance of experimentation

• “There are three principal means of acquiring knowledge… observation of nature, reflection, and experimentation. Observation collects facts; reflection combines them; experimentation verifies the result of that combination.” – Denis Diderot
• “The proper method for inquiring after the properties of things is to deduce them from experiments.” – Isaac Newton

Importance of experimental design

• “You can't fix by analysis what you bungled by design.” – Light, Singer and Willett
• “To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.” – Sir Ronald Aylmer Fisher

Importance of experimental design

• “All experiments are designed experiments, it is just that some are poorly designed and some are well-designed.”
• “If your experiment needs statistics, you ought to have done a better experiment.” – Lord Ernest Rutherford
• “No experiment is ever a complete failure. It can always be used as a bad example.” – Paul Dickson

Experimental Design

• Does NOT simply involve statistics…
• Requires a combination of
  – Scientific/biologic insight
  – Scientific logic
  – Common sense
  – Planning
  – Statistics
  – Communication skills
• https://www.youtube.com/watch?v=PbODigCZqL8

Not a one-person task!!!

Experimental Design

• Basic design principles do not change with number of outcomes (e.g., genes, proteins), advancement in technology, or size of the data
• “Keep it simple, stupid” (KISS) principle often plays even more of a role in these bioinformatics studies due to
  – Costs involved
  – Complexity/size of data obtained

Basic Design Principles

• Randomization
  – “Balancing out” effects of lurking/hidden variables
• Replication
  – Improving precision
• Blocking
  – Control the effects of nuisance factors that otherwise increase the noise in the experiment

Experimental Strategies

• One-factor-at-a-time (OFAT)
  – Favored when data are cheap and abundant
  – Often very inefficient
  – Cannot investigate interaction
• Factorial treatment structure
  – Hidden replication advantage
  – Can assess interaction

Either strategy may also involve a blocking factor

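Factorial Treatment Structure

One-factor-at-a-time layout ($n = 6$):

  Factor levels    B1    B2
  A1               xx    xx
  A2               xx

Factorial layout ($n = 4$):

  Factor levels    B1    B2
  A1               x     x
  A2               x     x

Assume the variance of each observation is $\sigma^2$.

• One-factor-at-a-time: Effect of A $= \bar{y}_{A_2B_1} - \bar{y}_{A_1B_1}$, with Var(Effect of A) $= \frac{2\sigma^2}{2} = \sigma^2$
• Factorial: Effect of A $= \bar{y}_{A_2} - \bar{y}_{A_1}$, with Var(Effect of A) $= \frac{2\sigma^2}{2} = \sigma^2$

Both layouts estimate the effect of A with the same variance, but the factorial does so with 4 observations instead of 6, because every observation is used in the comparison.

[Figure: diagrams of the two layouts; surviving labels A1, A2, B1, B2]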
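The hidden-replication comparison can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper name `var_of_effect` and the value `sigma2 = 1.0` are assumptions, and the calculation assumes independent observations with a common variance, as on the slide.

```python
def var_of_effect(sigma2, n_per_mean):
    """Variance of a difference between two independent means,
    each averaging n_per_mean observations with variance sigma2."""
    return sigma2 / n_per_mean + sigma2 / n_per_mean


sigma2 = 1.0  # assumed value, purely illustrative

# One-factor-at-a-time: 6 runs; the effect of A compares the two B1 cells (2 obs each).
ofat_var = var_of_effect(sigma2, n_per_mean=2)

# Factorial: 4 runs; the effect of A compares the A2 and A1 row means
# (2 obs each, averaged over B1 and B2).
factorial_var = var_of_effect(sigma2, n_per_mean=2)

print(f"OFAT      (n=6): Var(effect of A) = {ofat_var:.2f}")
print(f"Factorial (n=4): Var(effect of A) = {factorial_var:.2f}")
```

The two variances agree even though the factorial layout uses two fewer runs.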

Nuisance Factors

• Imperative to consider all possible factors that could obscure or alter your results
• Often rely on the sound judgement of the center generating the bioinformatics data (e.g., temperature, humidity)
  – Block on day or technician(?)
• Controlling these factors allows results to be as generalizable as possible
  – Sample split in half: Trt 1 to one half, Trt 2 to the other
• Randomization provides protection against the unknown… avoid possible biases
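Blocking on day with randomization inside each block could be laid out as below. This is a minimal sketch, not a prescribed procedure: the function name, the block labels, and the seed are all hypothetical.

```python
import random


def randomized_block_design(blocks, treatments, seed=0):
    """Return {block: treatment order}, with every treatment appearing
    once per block in an independently shuffled order."""
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    layout = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)  # randomize run order within this block only
        layout[block] = order
    return layout


layout = randomized_block_design(
    blocks=["Day 1", "Day 2", "Day 3"],  # hypothetical blocking factor
    treatments=["Trt 1", "Trt 2"],
)
for block, order in layout.items():
    print(block, order)
```

Because each day contains both treatments, day-to-day differences cancel out of the treatment comparison.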

Replication

• Biological versus technical replicates
  – Technical replicate: measuring the same biological sample multiple times
  – Technical replicates assess non-biological variation
  – May be beneficial if this variation is expected to be large
  – Biological replicates typically improve precision more and allow conclusions to be more generalizable
• Number of replicates depends on numerous factors
  – Cost and availability of resources
  – Desired precision / power
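One way to see why biological replicates usually help more is a simple nested variance model, where the variance of a treatment mean is the biological variance divided by the number of biological replicates, plus the technical variance divided by the total number of measurements. The sketch below assumes that model; the variance components are made-up values, not estimates from any study.

```python
def var_of_treatment_mean(sigma2_bio, sigma2_tech, n_bio, n_tech):
    """Variance of a treatment mean with n_bio biological replicates,
    each measured n_tech times (assumed nested model)."""
    return sigma2_bio / n_bio + sigma2_tech / (n_bio * n_tech)


sigma2_bio, sigma2_tech = 0.4, 0.1  # assumed variance components

# Two ways to spend six measurements per treatment:
more_tech = var_of_treatment_mean(sigma2_bio, sigma2_tech, n_bio=3, n_tech=2)
more_bio = var_of_treatment_mean(sigma2_bio, sigma2_tech, n_bio=6, n_tech=1)

print(f"3 biological x 2 technical: variance = {more_tech:.3f}")
print(f"6 biological x 1 technical: variance = {more_bio:.3f}")
```

Under this model technical replicates only shrink the technical term, so they pay off mainly when that term is large.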

Replication

• Pooling
  – Used when sample material is scarce
  – Larger sample gives better precision of measurement
  – Danger in pooling
    • Measurement obtained from a pooled sample can be different from the average of the individual measurements
  – Better to avoid pooling if possible
    • More precision from multiple biological replicates
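The pooling danger can be illustrated numerically. The sketch assumes that a pooled sample behaves like the average of the raw expression values (equal contributions from each animal) and that analysis is done on the log2 scale; the two expression values are hypothetical.

```python
import math

expression = [100.0, 400.0]  # hypothetical raw expression for two animals

# Average of the individual log2 measurements
mean_of_logs = sum(math.log2(x) for x in expression) / len(expression)

# log2 measurement of a pooled sample (assumed to average the raw values)
log_of_pool = math.log2(sum(expression) / len(expression))

print(f"Mean of individual log2 values: {mean_of_logs:.3f}")
print(f"log2 of pooled sample:          {log_of_pool:.3f}")
```

Under these assumptions the pooled value is at least as large as the average of the individual log values, and the gap widens as the animals differ more.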

Calculating Power

• Numerous calculators available in software and online
• Be wary… you will always get numbers, but whether they’re meaningful depends on the quality of the inputted values
  – Trusting the quality of the inputs requires a basic understanding of the process

Why Power Analysis?

• Research is expensive… wouldn’t want to conduct an experiment with far…
  – too few experimental units (EUs)
    • Project won’t find important differences that exist
    • Not worth the time and money
  – too many experimental units (EUs)
    • Project is unnecessarily expensive
• Typical funding agency requirement
  – Demonstrates thinking about plan and organization

A Simple Experiment

• Study the effect of cold on a fat gene in rats
• Use a Completely Randomized Design (CRD):
  – 6 rats are randomly assigned to one of two different environments
    • Trt 1: Normal environment (20°C)… n=3
    • Trt 2: Cold environment (5°C)… n=3
• Investigator expects lower expression of the gene under Trt 2
• Is n=3 per trt enough to detect this difference?
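The random assignment in a completely randomized design can be sketched as follows. The rat labels, function name, and seed are hypothetical; the only design facts taken from the slide are six rats, two treatments, and three rats per treatment.

```python
import random


def completely_randomized_design(units, treatments, seed=0):
    """Randomly split units into equal-sized treatment groups."""
    if len(units) % len(treatments) != 0:
        raise ValueError("units must divide evenly among treatments")
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    shuffled = list(units)
    rng.shuffle(shuffled)
    per_trt = len(shuffled) // len(treatments)
    return {
        trt: sorted(shuffled[i * per_trt:(i + 1) * per_trt])
        for i, trt in enumerate(treatments)
    }


rats = [f"rat{i}" for i in range(1, 7)]  # hypothetical labels
assignment = completely_randomized_design(rats, ["Trt 1 (20C)", "Trt 2 (5C)"])
for trt, group in assignment.items():
    print(trt, group)
```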

Statistical Analysis

• Two competing hypotheses:
  Ho: log2(µ1/µ2) = log2(µ1) – log2(µ2) = 0
  H1: log2(µ1) – log2(µ2) > 0 (i.e., one-tailed test, for now)
• Basis for choosing between the two
  – P-value of test relative to a declared significance level (α)… typically α = 0.05
    • P ≤ α → reject Ho and conclude the mean is larger in Trt 1
    • P > α → fail to reject Ho; not enough evidence to conclude H1
• There are two possible incorrect conclusions based on the analysis of the data

Type I and Type II errors

                    What the data indicate:
True state          Fail to reject Ho (P > α)     Reject Ho (P ≤ α)
Ho is true          No error                      Type I error (Prob is α)
H1 is true          Type II error (Prob = β)      No error

So is n = 3 rats large enough?

• Rephrase: Do we have enough statistical power?
• Need to “know” several things
  – How large is the true mean difference (δ = log2(µ1/µ2))?
    1) What do you anticipate?
    2) What would be scientifically/practically important?
  – Suppose researchers believe that δ = 1 (a two-fold change on the original scale)
  – How much variability (σ) exists between rats within a group?
    • Some prior information potentially available from previously published studies or a small pilot study. Variability is due to rat and method
    • May also have to guess
  – Suppose researchers believe that σ = 0.7
  – Effect size (δ/σ) often used when the scale is arbitrary
• Power analysis involves “educated guessing”

One way to elicit values for σ

• Use an empirical rule: consider the range of responses to be equal to 4σ or 6σ
• Question: What would be the likely range (max − min) of log expression levels for rats within the same trt?
  – Suppose the answer was 2.8
• Using R ≈ 4σ: R = 2.8 → σ = 0.7

Can often find similar published studies with estimates of σ.

Always round up to be a little conservative.
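The range rule and the resulting effect size amount to simple arithmetic, sketched here. The function name is hypothetical; the inputs (R = 2.8, δ = 1) are the values used in the rat example.

```python
def sigma_from_range(likely_range, divisor=4):
    """Rough sigma from an elicited range, using range ~ divisor * sigma."""
    return likely_range / divisor


sigma_4 = sigma_from_range(2.8, divisor=4)  # range treated as 4 sigma
sigma_6 = sigma_from_range(2.8, divisor=6)  # range treated as 6 sigma

delta = 1.0                      # anticipated log2 difference
effect_size = delta / sigma_4    # signal relative to noise

print(f"sigma (R = 4 sigma): {sigma_4:.2f}")
print(f"sigma (R = 6 sigma): {sigma_6:.2f}")
print(f"effect size delta/sigma: {effect_size:.2f}")
```

The 4σ rule gives the larger, more conservative σ of the two.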

Two competing hypotheses:

• Under Ho: $\bar{y}_1 - \bar{y}_2 \sim N\left(0, \frac{2\sigma^2}{n}\right)$
• Under H1: $\bar{y}_1 - \bar{y}_2 \sim N\left(\delta, \frac{2\sigma^2}{n}\right)$
• Conduct one-tailed t-test for a certain α. Reject Ho if
  $t_0 = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{2s^2/n}} > t_\alpha$

Assuming y are the log2 expression levels. The two distributions differ only in their mean.

Need to study the distribution of t0 under Ho and H1.

Distributions of t0

[Figure: central (Ho) and non-central (H1) t distributions of t0, marking the critical value tα, the tail area α under Ho, and 1 − β = Power under H1]

Yikes!! Power is 42%
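The power for the rat example (δ = 1, σ = 0.7, n = 3 per treatment, one-tailed α = 0.05) can be approximated by simulation. This is a sketch rather than the calculation behind the slide: it swaps the non-central t distribution for a Monte Carlo estimate, the function name, seed, and number of simulations are arbitrary choices, and the critical value 2.132 is the tabled upper 5% point of a t distribution with 4 degrees of freedom.

```python
import math
import random


def simulate_power(delta, sigma, n, t_crit, n_sim=20000, seed=1):
    """Monte Carlo estimate of one-tailed two-sample t-test power."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        y1 = [rng.gauss(delta, sigma) for _ in range(n)]  # Trt 1, mean shifted by delta
        y2 = [rng.gauss(0.0, sigma) for _ in range(n)]    # Trt 2
        m1, m2 = sum(y1) / n, sum(y2) / n
        ss1 = sum((y - m1) ** 2 for y in y1)
        ss2 = sum((y - m2) ** 2 for y in y2)
        s2 = (ss1 + ss2) / (2 * n - 2)                    # pooled variance estimate
        t0 = (m1 - m2) / math.sqrt(2 * s2 / n)
        if t0 > t_crit:
            rejections += 1
    return rejections / n_sim


# t_crit = 2.132 is the tabled one-tailed 5% critical value for 2n - 2 = 4 df.
power = simulate_power(delta=1.0, sigma=0.7, n=3, t_crit=2.132)
print(f"Estimated power: {power:.2f}")
```

The estimate should land near the 42% quoted on the slide, up to simulation error.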

Power Calculators

• Calculators available in software such as Minitab, SAS, JMP, and R
  – Be wary of calculators (such as pwr in R) that ask just for an effect size
  – Effect size is essentially a signal-to-noise ratio
  – Noise may be more than just biological variation
• Many calculators also take into account the issue of multiple comparisons
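To see how the signal (δ) and noise (σ) inputs drive a calculator's answer, a rough sample-size formula based on the normal approximation can be sketched. This is not what any of the named packages compute internally. The function name and the 80% power target are assumptions, and the approximation tends to be slightly optimistic for small n compared with a t-based calculation.

```python
import math
from statistics import NormalDist


def n_per_group_normal_approx(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group for a one-tailed two-sample comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)


n_base = n_per_group_normal_approx(delta=1.0, sigma=0.7)
n_noisier = n_per_group_normal_approx(delta=1.0, sigma=1.4)  # noise doubled

print(f"sigma = 0.7: about {n_base} rats per treatment")
print(f"sigma = 1.4: about {n_noisier} rats per treatment")
```

Doubling the assumed noise roughly quadruples the required sample size, which is why underestimating σ is costly.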

Multiple Comparisons

• Testing changes in expression for thousands of features across several treatments

[Figure: probability of at least one false positive (y-axis, 0.2 to 1.0) versus M = # of independent tests (x-axis, 0 to 100)]

Probability of at least one false positive quickly rises
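For M independent tests each run at level α, the probability of at least one false positive when all nulls are true is 1 − (1 − α)^M. The short sketch below tabulates this; the function name and the chosen values of M are illustrative.

```python
def prob_at_least_one_false_positive(alpha, m):
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


for m in (1, 10, 20, 50, 100):
    p = prob_at_least_one_false_positive(0.05, m)
    print(f"M = {m:3d}: {p:.3f}")
```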

Multiple Comparisons

• Trade-off between control of false positive and false negative (power) rates
• Two common types of control on false positives
  – Familywise error rate: P(at least one false positive) ≤ α
    • Bonferroni (compare P-value to α/M)
  – False discovery rate: expected proportion of rejected hypotheses that are rejected incorrectly

False Discovery Rate

• Proposed by Benjamini and Hochberg (1995)
• FWER controls V: P(V ≥ 1) ≤ α
• FDR = E(V/R | R > 0) Pr(R > 0)

                Not Rejected    Rejected    Total
True Null       U               V           M0
False Null      T               S           M − M0
Total           M − R           R           M
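The difference between the two kinds of control can be seen on a small set of P-values. The sketch implements the Benjamini-Hochberg step-up rule and the Bonferroni rule in plain Python. The P-values are made up for illustration and the function names are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when P <= alpha / M (familywise error rate control)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Step-up rule: find the largest rank k with P(k) <= k * q / M,
    then reject the k smallest P-values (false discovery rate control)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            largest_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_k:
            rejected[idx] = True
    return rejected


# Hypothetical P-values for M = 10 features
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

n_bonf = sum(bonferroni(p_values))
n_bh = sum(benjamini_hochberg(p_values))
print(f"Bonferroni rejections:         {n_bonf}")
print(f"Benjamini-Hochberg rejections: {n_bh}")
```

On this example the false discovery rate rule rejects one more hypothesis than Bonferroni, reflecting the trade-off between false positives and power.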

Microarray Myths and Truths

Appeared in: The Scientist, Vol. 16, No. 17, p. 22

• Myths:
  – “That complex classification algorithms such as neural networks perform better than simpler methods for class prediction
  – That multiple testing issues can be ignored without filling the literature with spurious results
  – That prepackaged analysis tools are a good substitute for collaboration with statistical scientists in complex problems.”
• Truths:
  – “The greatest challenge is organizing and training for a more multidisciplinary approach to systems biology. The greatest specific challenge is good practice in design and analysis of microarray-based experiments.
  – Comparing expression in two RNA samples tells you only about those samples and may relate more to sample handling and assay artifacts than to biology. Robust knowledge requires multiple samples that reflect biological variation.
  – Biologists need both good analysis tools and good statistical collaborators. Both are in short supply.”