Statistical Design of Experiments

Monday, Aug 13, 2007

Dr. Gary Blau, Sean Han

SECTION II

REVIEW OF STATISTICS

INTRODUCTION

• Difference between statistics and probability
• Statistical Inference
– Samples and populations
– Intro to JMP software package
– Central limit theorem
– Confidence intervals
– Hypothesis testing
• Regression and modeling fundamentals
– Introduction to Model Building
– Simple linear regression
– Multiple linear regression
– Model Building

PROBABILITY VS STATISTICS

Problem: Dealing with sources of variability
Approach: Probability is the language used to characterize quantitative variability in random experiments

Problem: Understanding the behavior of a process from random experiments on the process
Approach: Statistics allows us to infer process behavior from a small number of experiments or trials

POPULATION VS SAMPLE

Samples drawn from the population are used to infer things about the population

[Figure: a population from which Sample 1, Sample 2, and Sample 3 are drawn]

BATCH REACTOR OPTIMIZATION EXAMPLE

A new small molecule API, designated simply C, is being produced in a batch reactor in a pilot plant. Two liquid raw materials A and B are added to the reactor and the reaction A + B → C takes place. (K1 is the reaction rate constant.)

BATCH REACTOR OPTIMIZATION EXAMPLE

• There are various controllable factors for the reactor, some of which are:
– Temperature
– Agitation rate
– A/B feed ratio
– ...

• Adjusting the values, or levels, of these factors may change the yield of C

• We would like to find some combination of these levels that will maximize the yield of C

STATISTICAL INFERENCE

Suppose 10 different batches are run and the yield of C at the end of the reaction measured. The properties of the population (i.e. all future batches) can be estimated from the properties of this sample of 10 batch runs.

Specifically, it is possible to estimate the parameters:
– Central tendency: mean, median, mode
– Scatter or variability: variance, standard deviation (skewness, kurtosis)

RANDOM SAMPLE

Each member of the population has an equal chance of being selected for the sample. (In the example, it means that each batch of material is made under the same processing condition and is different only in the time at which it was run.)

MEAN OF A SAMPLE

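The average value of the n batches in the sample is called the sample mean:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

where $y_i$ is the yield of the ith batch and n is the sample size.

It can be used to estimate the central tendency of the population, i.e. the population mean $\mu$.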

VARIANCE OF A SAMPLE

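• Variance of a sample of size n is

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$

where $\bar{y}$ is the sample mean.

• The population variance, $\sigma^2$, can be inferred from $s^2$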

INTRODUCTION TO JMP

• Background
– JMP is a statistical design and analysis package. JMP helps you explore data, fit models, and discover patterns.
– JMP is from the SAS Institute, a large private research institute specializing in data analysis software.

• Features
– The emphasis in JMP is to interactively work with data.
– Simple and informative graphics and plots are often automatically shown to facilitate discovery of behavioral patterns.

INTRODUCTION TO JMP

• Limitations of JMP
– Large jobs: JMP is not suitable for problems with large data sets. JMP data tables must fit in the main memory of your PC. JMP graphs everything. Sometimes graphs get expensive and more cluttered when they have many thousands of points.
– Specialized statistics: JMP does only conventional data analysis. Consider another package for performing more complicated analysis (e.g. SAS, R and S-Plus).

PROBABILITY DISTRIBUTION USING JMP (EXAMPLE 1)

• The yield measurements from a granulator are given below:
79, 91, 83, 78, 90, 84, 93, 83, 83, 80 %

• Using the statistical software package JMP, calculate the mean, variance, and standard deviation of the data. Also, plot a distribution of the data.

RESULTS FOR EXAMPLE 1
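The JMP output for this slide is not reproduced in the transcript. As a cross-check outside JMP, the same summary statistics can be computed with a short Python sketch. Python and its `statistics` module are a substitute chosen here for illustration and are not part of the original course material.

```python
import statistics

# Yield measurements (%) from the granulator in Example 1.
yields = [79, 91, 83, 78, 90, 84, 93, 83, 83, 80]

mean = statistics.mean(yields)          # sample mean
variance = statistics.variance(yields)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(yields)      # sample standard deviation

print(f"mean = {mean:.2f}, variance = {variance:.2f}, std dev = {std_dev:.2f}")
# prints: mean = 84.40, variance = 27.16, std dev = 5.21
```

The sample variance divides by n - 1 rather than n, which is the usual convention for sample statistics.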

NORMAL DISTRIBUTION

• The outcomes from many physical phenomena frequently follow a single type of distribution, the Normal Distribution. (See Section I)

• If several samples are taken from a population, the distribution of sample means begins to look like a normal distribution regardless of the distribution of the event generating the sample

CENTRAL LIMIT THEOREM

If random samples of n observations are drawn from any population with finite mean $\mu$ and variance $\sigma^2$, then, when n is large, the sampling distribution of the sample mean $\bar{X}$ is approximately normally distributed with mean and standard deviation:

$$E(\bar{X}) = \mu, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

EFFECTS OF SAMPLE SIZE

As the sample size, n, increases, the variance of the sample mean decreases.

[Figure: sampling distributions of the sample mean for n = 30 and n = 50]

SAMPLE SIZE EFFECTS (EXAMPLE 2)

• Take 5 measurements for the yield from a granulator and calculate the mean. Repeat this process 50 times and generate a distribution of mean values. The results are in the JMP data table S2E2.

• It can be shown that using 10 or 20 measurements in the first step will give greater accuracy and less variability.

• Note the change in the shape of the distributions with an increase in the individual sample size, n.
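The data table S2E2 is not included in the transcript, so the effect can only be illustrated by simulation. The sketch below assumes a hypothetical population in which yields are uniformly distributed between 78% and 93%; that range, the seed, and the function name are illustrative assumptions, not values from the course.

```python
import random
import statistics

def spread_of_sample_means(n, repeats, rng):
    """Return the standard deviation of `repeats` sample means, each from n draws."""
    means = []
    for _ in range(repeats):
        # Hypothetical population: yields uniform between 78% and 93%.
        sample = [rng.uniform(78.0, 93.0) for _ in range(n)]
        means.append(statistics.mean(sample))
    return statistics.stdev(means)

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (5, 10, 20):
    # 50 repeats, as in Example 2; the spread should shrink as n grows.
    print(n, round(spread_of_sample_means(n, repeats=50, rng=rng), 3))
```

With only 50 repeats the estimated spread is itself noisy; with many repeats it settles close to σ/√n, as the central limit theorem predicts.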

CONFIDENCE LIMITS

Confidence limits are used to express the validity of statements about the value of population parameters. For instance:
– The yield C of the reactor in example 1 is 90% at a temperature of 250º F
– The yield C of the reactor is not significantly changed when the temperature increases from 242º to 246º F
– There is no significant difference between the variance of the output of C at 250º and 260º F

CONFIDENCE LIMITS

The bounds on the population parameters θ take the form:

$$\theta_{\text{lower}} \le \theta \le \theta_{\text{upper}}$$

CONFIDENCE LIMITS

The bounds are based on
– The size of the sample, n
– The confidence level, (1 − α)

% confidence = (100)(1 − α)

i.e., α = 0.1 means that if we generated 100 such intervals, about 90 of them would be expected to contain the true (population) parameter

– These are not Bayesian intervals (those will be discussed in the second module)

Z STATISTIC

• The Z statistic can be used to place confidence limits on the population mean when the population variance is known.

• Z is a normally distributed random variable with $\mu = 0$ and $\sigma^2 = 1$:

$$Z \sim N(0,1)$$

Z STATISTIC

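From the central limit theorem, if n is large:

$$\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)$$

regardless of the population distribution, so that

$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$$

[Figure: standard normal distribution with the critical values −Zα/2 and Zα/2 marked]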

CONFIDENCE LIMITS ON THE POPULATION MEAN (POPULATION VARIANCE KNOWN)

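• Two sided confidence interval

$$\bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

• One sided confidence intervals

$$\mu \le \bar{x} + Z_{\alpha}\frac{\sigma}{\sqrt{n}} \qquad \text{or} \qquad \mu \ge \bar{x} - Z_{\alpha}\frac{\sigma}{\sqrt{n}}$$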
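As a minimal sketch of the two sided interval with known population variance, the Python code below uses the standard library's `NormalDist` to look up Zα/2. The sample mean of 84.4 comes from Example 1; the population standard deviation of 5.0 is an assumed value for illustration only, since the slides do not give one.

```python
from statistics import NormalDist

def z_interval(x_bar, sigma, n, alpha=0.05):
    """Two sided (1 - alpha) confidence interval for the mean, sigma known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value Z_{alpha/2}
    half_width = z * sigma / n ** 0.5
    return x_bar - half_width, x_bar + half_width

# sigma = 5.0 is a hypothetical "known" population standard deviation.
low, high = z_interval(x_bar=84.4, sigma=5.0, n=10)
print(f"95% interval: {low:.2f} to {high:.2f}")
```

Replacing `1 - alpha / 2` with `1 - alpha` and keeping only one of the two bounds gives the corresponding one sided limit.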

t STATISTIC

• The t statistic is used to determine confidence limits when the population variance $\sigma^2$ is unknown and must be estimated from the sample variance $s^2$, i.e.

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1),$$

a t distribution with n − 1 degrees of freedom (df).

COMPARISON OF Z AND t

[Figure: density curves of the Z distribution and of t distributions with df = 3, df = 2, and df = 1]

CONFIDENCE LIMITS ON THE POPULATION MEAN (POPULATION VARIANCE UNKNOWN)

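• Two sided confidence interval

$$\bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$$

• One sided confidence intervals

$$\mu \le \bar{x} + t_{\alpha,\,n-1}\frac{s}{\sqrt{n}} \qquad \text{or} \qquad \mu \ge \bar{x} - t_{\alpha,\,n-1}\frac{s}{\sqrt{n}}$$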
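To make the two sided interval concrete, the sketch below applies it to the ten granulator yields from Example 1. The Python standard library has no t quantile function, so the critical value for α = 0.05 and 9 degrees of freedom is hard-coded from a standard t table; choosing α = 0.05 here is an assumption made for illustration.

```python
import statistics

# Granulator yields (%) from Example 1.
yields = [79, 91, 83, 78, 90, 84, 93, 83, 83, 80]
n = len(yields)
x_bar = statistics.mean(yields)
s = statistics.stdev(yields)

# Tabulated critical value t_{alpha/2, n-1} for alpha = 0.05 and 9 df.
T_CRIT = 2.262

half_width = T_CRIT * s / n ** 0.5
low, high = x_bar - half_width, x_bar + half_width
print(f"95% interval for the mean yield: {low:.2f} to {high:.2f}")
```

Because the t critical value (2.262) is larger than the corresponding Z value (about 1.96), the interval is wider than it would be if the variance were known.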

CONFIDENCE LIMITS ON THE DIFFERENCE OF TWO MEANS

To get confidence limits on the difference of the means of two different populations, $\mu_1 - \mu_2$, we sample from the two populations and calculate the sample means $\bar{x}_1$, $\bar{x}_2$ and sample variances $S_1^2$, $S_2^2$ respectively.

If we assume the populations have the same variance ($\sigma^2 = \sigma_1^2 = \sigma_2^2$), the sample variances of the two samples can be pooled to express a single estimate of variance, $S_p^2$. The pooled variance $S_p^2$ is calculated by:

$$S_p^2 = \frac{(n-1)S_1^2 + (m-1)S_2^2}{n + m - 2}$$

where n and m are the sample sizes of the two samples from the different populations.

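Known population variance: (Z - Distribution)

In the formulas below, $n_1$ and $n_2$ denote the sizes of the two samples.

Equal variances ($\sigma_1^2 = \sigma_2^2 = \sigma^2$)

$$(\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2}\,\sigma\left(\frac{1}{n_1} + \frac{1}{n_2}\right)^{1/2} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2}\,\sigma\left(\frac{1}{n_1} + \frac{1}{n_2}\right)^{1/2}$$

Unequal variances

$$(\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2}\left(\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)^{1/2} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2}\left(\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)^{1/2}$$

Unknown population variance: (t - Distribution)

Equal variances, but unknown

$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,\,n_1+n_2-2}\; s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,\,n_1+n_2-2}\; s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Unequal variances

$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,\,\nu}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,\,\nu}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$

where the degrees of freedom are

$$\nu = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$$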
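The two quantities that distinguish the unknown-variance cases, the pooled variance and the approximate degrees of freedom for unequal variances, are simple arithmetic on the sample variances. The sketch below computes both; the function names and the sample values are hypothetical and chosen only for illustration.

```python
import statistics

def pooled_variance(sample_1, sample_2):
    """Pooled variance: ((n-1) S1^2 + (m-1) S2^2) / (n + m - 2)."""
    n, m = len(sample_1), len(sample_2)
    return ((n - 1) * statistics.variance(sample_1)
            + (m - 1) * statistics.variance(sample_2)) / (n + m - 2)

def unequal_variance_df(sample_1, sample_2):
    """Approximate degrees of freedom for the unequal-variance t interval."""
    a = statistics.variance(sample_1) / len(sample_1)   # s1^2 / n1
    b = statistics.variance(sample_2) / len(sample_2)   # s2^2 / n2
    return (a + b) ** 2 / (a ** 2 / (len(sample_1) - 1)
                           + b ** 2 / (len(sample_2) - 1))

# Hypothetical measurements, invented for illustration only.
sample_1 = [71.2, 69.8, 70.5, 72.1, 68.9]
sample_2 = [73.9, 74.5, 72.8, 75.1, 73.2, 74.0]
print("pooled variance:", round(pooled_variance(sample_1, sample_2), 3))
print("approximate df:", round(unequal_variance_df(sample_1, sample_2), 2))
```

The approximate degrees of freedom are generally not an integer; when reading a t table it is common to round down, which gives a slightly wider interval.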

EXAMPLE 3

Two samples, each of size 10, are taken from a dissolution apparatus. The first one is taken at a temperature of 35ºC and the second at a temperature of 37ºC. The results of these experiments are in the JMP data table S2E3&4.

Using JMP, calculate the mean of each sample and use confidence limits to determine if there is a significant difference between the means of the two samples at the 95% confidence level (α = 0.05).

There is a significant difference between the means of the two samples at the 95% confidence level (α = 0.05).

MODEL BUILDING

• Building a multiple linear regression model
– Stepwise: Add and remove variables over several steps
– Forward: Add variables sequentially
– Backward: Remove variables sequentially

• JMP provides criteria for model selection like R², Cp and MSE.

HYPOTHESIS TESTING

• Although confidence limits can be used to infer the quality of the population parameters from samples drawn from the population, an alternative and more convenient approach for model building is to use hypothesis testing.

• Whenever a decision is to be made about a population characteristic, make a hypothesis about the population parameter and test it with data from samples.

• Generally, a statistical test tests the null hypothesis H0 against the alternative hypothesis Ha.

• In Example 3, H0 is that there is no difference between the two experiments. Ha is that there is a significant difference between the two experiments.

GENERAL PROCEDURE FOR HYPOTHESIS TESTING

1. Specify H0 and Ha to test. This typically has to be a hypothesis that makes a specific prediction.

2. Declare an alpha level

3. Specify the test statistic against which the observed statistic will be compared.

4. Collect the data and calculate the observed t statistic.

5. Draw a conclusion. For a two sided test, reject the null hypothesis if and only if the magnitude of the observed t statistic is larger than the critical value.
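The five steps can be sketched in Python for a two sample comparison like Example 3. The dissolution values below are hypothetical, since the data table S2E3&4 is not part of the transcript, and the equal-variance (pooled) form of the t statistic is an assumption. The critical value for α = 0.05 and 18 degrees of freedom is taken from a standard t table.

```python
import statistics

def pooled_t_statistic(sample_1, sample_2):
    """Observed t statistic for H0: mu1 = mu2, assuming equal variances."""
    n, m = len(sample_1), len(sample_2)
    pooled_var = ((n - 1) * statistics.variance(sample_1)
                  + (m - 1) * statistics.variance(sample_2)) / (n + m - 2)
    std_error = (pooled_var * (1 / n + 1 / m)) ** 0.5
    return (statistics.mean(sample_1) - statistics.mean(sample_2)) / std_error

# Steps 1 and 2: H0 is mu1 = mu2, Ha is mu1 != mu2, alpha = 0.05.
# Step 3: tabulated critical value t_{0.025, 18} (two samples of size 10).
T_CRITICAL = 2.101

# Step 4: hypothetical dissolution results (%), invented for illustration.
at_35c = [71.2, 69.8, 70.5, 72.1, 68.9, 70.0, 71.5, 69.4, 70.8, 71.0]
at_37c = [73.9, 74.5, 72.8, 75.1, 73.2, 74.0, 74.8, 73.5, 72.9, 74.3]
t_observed = pooled_t_statistic(at_35c, at_37c)

# Step 5: two sided decision rule.
reject_h0 = abs(t_observed) > T_CRITICAL
print(f"t = {t_observed:.2f}, reject H0: {reject_h0}")
```

The comparison uses the absolute value of the statistic because the alternative hypothesis is two sided.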

TYPE I AND TYPE II ERROR

Comparing the state of nature and decision, we have four situations.

State of nature              Decision
• Null hypothesis true       Fail to reject null
• Null hypothesis true       Reject null
• Null hypothesis false      Fail to reject null
• Null hypothesis false      Reject null

TYPE I AND TYPE II ERROR

• Type I (α) error
– False positive
– We are observing a difference that does not exist

• Type II (β) error
– False negative
– We fail to observe a difference that does exist

                  Null True        Null False
Reject            Type I Error     Correct
Fail to Reject    Correct          Type II Error

P - VALUE

• The specific value of α when the population parameter and one of the confidence limits coincide
– The observed level of significance

• A more technical definition:
– The probability (under the null hypothesis) of observing a test statistic that is at least as extreme as the one that is actually observed
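The technical definition translates directly into a tail probability. As a minimal sketch, the code below computes a two sided p value for an observed Z statistic using the standard normal distribution; the function name is hypothetical, and a t-based p value would need a t distribution from a package outside the Python standard library.

```python
from statistics import NormalDist

def two_sided_p_value(z_observed):
    """P(|Z| >= |z_observed|) when Z ~ N(0, 1) under the null hypothesis."""
    return 2 * (1 - NormalDist().cdf(abs(z_observed)))

print(round(two_sided_p_value(1.96), 3))  # prints 0.05
```

An observed statistic of 1.96 sits right at the two sided 5% critical value, which is why its p value is about 0.05: the p value is the α at which the test would just reject.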

INFERENCE SUMMARY

• Population properties are inferred from sample properties via the central limit theorem

• Confidence intervals tell us something about how well we understand a parameter… but give no guarantees (Type II error)

• P values give us a quick number to check to see how significant a test is.

MODEL BUILDING

• “All models are wrong, but some are useful.”
– George Box

• “A model should be as simple as possible, but no simpler.”
– Albert Einstein

REGRESSION MODEL

Regression analysis creates empirical mathematical models which determine which factors are important and quantify their effects on the process but do not explain the underlying phenomena.

Outputs = f (inputs, process conditions, coefficients) + error

The coefficients are often called model parameters.

[Figure: a process block with inputs entering, outputs leaving, and process conditions acting on the process]

SIMPLE LINEAR REGRESSION

Simple linear regression model (one independent factor or variable):

Y = β0 + β1X + e

where
– e is a measure of experimental and modeling error
– β0, β1 are regression coefficients
– Y is the response
– X is the factor

These models assume that we can measure X perfectly and all error or variability is in Y.

SIMPLE LINEAR REGRESSION

For a one factor model, we obtain Y as a function of X

[Figure: scatter plot of the response Y against the factor X]

CORRELATION COEFFICIENTS

The correlation between the factor and the response is indicated by the regression coefficient β1, which may be:

– Zero

– Positive

– Negative

LACK OF CORRELATION

If β1 = 0 the response does not depend on the factor

[Figure: scatter plots of Y against X showing no trend]

POSITIVE CORRELATION COEFFICIENTS

If β1 > 0 the response and factor are positively correlated

[Figure: scatter plot of Y against X with an increasing trend]

NEGATIVE CORRELATION COEFFICIENTS

If β1 < 0 the response and factor are negatively correlated

[Figure: scatter plot of Y against X with a decreasing trend]

LEAST SQUARES

The coefficients are usually estimated using the method of least squares (or the method of maximum likelihood).

This method minimizes the sum of the squares of the differences between the value predicted by the model at the ith data point, $\hat{Y}_i$, and the observed value $Y_i$ at the same value of $X_i$:

$$\min_{\beta_0,\,\beta_1}\; \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$

[Figure: observed values scattered about the estimated regression line]
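For the simple linear regression model the least squares minimization has a closed-form solution: the slope is the ratio of the cross-deviation sum to the squared-deviation sum of X, and the intercept follows from the two means. The sketch below implements that solution; the function name and the temperature and yield values are hypothetical, invented for illustration.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for the model Y = b0 + b1 * X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1

# Hypothetical temperatures (ºC) and yields (%), for illustration only.
temperature = [35, 35, 35, 37, 37, 37]
yield_pct = [70.1, 71.0, 69.8, 74.2, 73.5, 74.0]
b0, b1 = fit_line(temperature, yield_pct)
print(f"estimated line: yield = {b0:.2f} + {b1:.2f} * temperature")
```

The fitted line always passes through the point of means, which is what the intercept formula expresses.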

EXAMPLE 4

Use the previous yield data (T2E3&4) from different dissolution temperatures. Make a model that describes the effect of temperature on the yield. Note that here, temperature is the factor and the yield is the response.

MULTIPLE LINEAR REGRESSION – ONE FACTOR

• If a simple linear regression equation does not adequately describe a set of data then multiple linear regression models may be used.

• Multiple linear regression equation for response variable Y and a single factor X takes the form of a polynomial:

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \dots + \beta_m X^m$$

EXAMPLE 5

Three samples of size 10 are taken from an API (Active Pharmaceutical Ingredient) plant. The first one was taken at a batch reactor pressure of 3 bar, the second at 3.5 bar, and the final at 4 bar. The data table is T2E5. Use regression analysis to build a model describing the effect of pressure on the yield of the API, using a squared term if necessary.
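The data table T2E5 is not included in the transcript, so the following is only a sketch of how a model with a squared term could be fitted outside JMP. It solves the least squares normal equations with plain Gaussian elimination; the function name and the pressure and yield values are hypothetical.

```python
def fit_polynomial(x, y, degree):
    """Least squares coefficients [b0, b1, ..., b_degree] via normal equations."""
    k = degree + 1
    # Augmented normal-equation matrix [X'X | X'y].
    aug = [[sum(xi ** (r + c) for xi in x) for c in range(k)]
           + [sum(yi * xi ** r for xi, yi in zip(x, y))]
           for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

# Hypothetical pressures (bar) and yields (%), for illustration only.
pressure = [3.0, 3.0, 3.5, 3.5, 4.0, 4.0]
yield_pct = [81.0, 80.2, 86.1, 85.4, 83.9, 84.6]
b0, b1, b2 = fit_polynomial(pressure, yield_pct, degree=2)
print(f"yield = {b0:.2f} + {b1:.2f} * P + {b2:.2f} * P^2")
```

The model is still linear in the coefficients even though it is quadratic in pressure, which is why ordinary least squares applies. The normal equations become poorly conditioned for high-degree polynomials, so this approach is suitable only for low degrees.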

MULTIPLE LINEAR REGRESSION – MORE THAN ONE FACTOR

• If more than one regressor is needed in the model, multiple linear regression models may be used to find the relationship between Y and a combination of factors X1, X2, …, Xp.

• The multiple linear regression equation for one response variable Y and factors X1, X2, …, Xp takes the form of a polynomial.

EXAMPLE OF MULTILINEAR REGRESSION MODEL

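$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_m X_m + \varepsilon$$

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1 X_2 + \beta_3 X_1 X_3 + \varepsilon$$

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_1^2 X_2^4 + \varepsilon$$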

$$Y = \beta_0 + \beta_1 X_1 X_2^{35} + \beta_2 X_3^3 + \varepsilon$$

NONLINEAR MODELS

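• A model is said to be nonlinear if the derivative of the response with respect to a parameter is itself a function of the parameters:

$$\frac{\partial y}{\partial \beta_i} = g(\beta_j) \quad \text{for any } i$$

• Examples

$$Y = \beta_0 \exp(-\beta_1 X_1) + \varepsilon$$

$$Y = \beta_0 X_1^{\beta_1} + \varepsilon$$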

EVALUATING REGRESSION MODELS

• To determine if a model is adequate to describe the observed data, the analysis of variance may be performed

• Calculate the deviation between the data points and the values predicted by the model, called the error sum of squares (SSE)

$$SS_E = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

SUM OF SQUARES

• Calculate the total variation in the data, called the total sum of squares (SST)

$$SS_T = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$

• The amount of the total variation explained by the model is called the regression sum of squares (SSR)

$$SS_R = \sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2$$

• It may be shown that:

SST = SSR + SSE

SOURCES OF VARIABILITY

MEAN SQUARE

The mean square is the sum of squares divided by the associated degrees of freedom (DOF)

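MSR = SSR / p        MSE = SSE / (n − p − 1)

where p is the number of parameters in the model, not counting the intercept β0. For simple linear regression p = 1, so MSE = SSE / (n − 2).

Total DOF = DOF for Regression + DOF for Error

n − 1 = p + (n − p − 1)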

F TEST

In multiple linear regression, the F statistic can be used in hypothesis testing.

H0 in this hypothesis test is that all the β’s except β0 are 0.

F TEST AND R²

• The mean squares are used to perform an F test since they estimate specific population variances

F = MSR / MSE

• The sums of squares are used to calculate the R² criterion

R² = SSR/SST = 1 − SSE/SST
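The sums of squares, mean squares, F statistic and R² can be computed together once a model has been fitted. The sketch below does this for a simple linear regression (one factor, so p = 1 and the error degrees of freedom are n − 2); the function name and the data are hypothetical, chosen only for illustration.

```python
def regression_anova(x, y):
    """Sums of squares, R^2 and F for a least squares line fitted to (x, y)."""
    n, p = len(x), 1                      # p = 1 factor in the model
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]    # values predicted by the model
    ss_t = sum((yi - y_bar) ** 2 for yi in y)
    ss_e = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_r = sum((yh - y_bar) ** 2 for yh in y_hat)
    ms_r = ss_r / p
    ms_e = ss_e / (n - p - 1)
    return {"SST": ss_t, "SSR": ss_r, "SSE": ss_e,
            "R2": ss_r / ss_t, "F": ms_r / ms_e}

# Hypothetical factor and response values, for illustration only.
result = regression_anova([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print({name: round(value, 3) for name, value in result.items()})
```

A large F together with an R² close to 1 indicates that the model explains most of the variation; the p value for F still has to be read from an F table or a statistics package.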

EXAMPLE 6

Examine the variance of the model created to describe the effect of pressure on the yield of API (in Example 5).

Since the p value for the F test is < 0.0001, which is significant at the 0.05 level, the overall model is significant.

EXAMPLE 7

Build a model using JMP data table T2E7 with potential factors temperature, A/B feed ratio, and termination time, and the response variable yield. Determine which terms are significant. Build the model using forward and backward selection techniques.

The temperature and termination time are significant at the 0.05 level.
