Statistical Design of Experiments

Monday, Aug 13, 2007 Dr. Gary Blau, Sean Han Statistical Design of Experiments SECTION II REVIEW OF STATISTICS


Page 1: Statistical Design of Experiments


Statistical Design of Experiments

SECTION II

REVIEW OF STATISTICS

Page 2: Statistical Design of Experiments


INTRODUCTION

• Difference between statistics and probability
• Statistical inference
– Samples and populations
– Intro to JMP software package
– Central limit theorem
– Confidence intervals
– Hypothesis testing
• Regression and modeling fundamentals
– Introduction to model building
– Simple linear regression
– Multiple linear regression
– Model building

Page 3: Statistical Design of Experiments


PROBABILITY VS STATISTICS

Problem: Dealing with sources of variability
Approach: Probability is the language used to characterize quantitative variability in random experiments

Problem: Understanding the behavior of a process from random experiments on the process
Approach: Statistics allows us to infer process behavior from a small number of experiments or trials

Page 4: Statistical Design of Experiments


POPULATION VS SAMPLE

Samples drawn from the population are used to infer things about the population

[Figure: a population from which Sample 1, Sample 2, and Sample 3 are drawn]

Page 5: Statistical Design of Experiments


BATCH REACTOR OPTIMIZATION EXAMPLE

A new small molecule API, designated simply C, is being produced in a batch reactor in a pilot plant. Two liquid raw materials A and B are added to the reactor and the reaction $A + B \xrightarrow{K_1} C$ takes place. ($K_1$ is the reaction rate constant.)

Page 6: Statistical Design of Experiments


BATCH REACTOR OPTIMIZATION EXAMPLE

• There are various controllable factors for the reactor, some of which are:
– Temperature
– Agitation rate
– A/B feed ratio
– …

• Adjusting the values, or levels, of these factors may change the yield of C

• We would like to find some combination of these levels that will maximize the yield of C

Page 7: Statistical Design of Experiments


STATISTICAL INFERENCE

Suppose 10 different batches are run and the yield of C at the end of the reaction measured. The properties of the population (i.e. all future batches) can be estimated from the properties of this sample of 10 batch runs.

Specifically, it is possible to estimate the parameters:
– Central tendency: mean, median, mode
– Scatter or variability: variance, standard deviation (also skewness, kurtosis)

Page 8: Statistical Design of Experiments


RANDOM SAMPLE

Each member of the population has an equal chance of being selected for the sample. (In the example, it means that each batch of material is made under the same processing condition and is different only in the time at which it was run.)

Page 9: Statistical Design of Experiments


MEAN OF A SAMPLE

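The average value of the n batches in the sample is called the sample mean:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

where $X_i$ is the yield of the ith batch and $n$ is the sample size.

It can be used to estimate the central tendency of the population, i.e. the population mean $\mu$.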

Page 10: Statistical Design of Experiments


VARIANCE OF A SAMPLE

• The variance of a sample of size n is

$$s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$$

• The population variance, $\sigma^2$, can be inferred from $s^2$

Page 11: Statistical Design of Experiments


INTRODUCTION TO JMP

• Background
– JMP is a statistical design and analysis package. JMP helps you explore data, fit models, and discover patterns.
– JMP is from the SAS Institute, a large private software company specializing in data analysis software.

• Features
– The emphasis in JMP is on working interactively with data.
– Simple and informative graphics and plots are often shown automatically to facilitate discovery of behavioral patterns.

Page 12: Statistical Design of Experiments


INTRODUCTION TO JMP

• Limitations of JMP
– Large jobs: JMP is not suitable for problems with large data sets. JMP data tables must fit in the main memory of your PC. JMP graphs everything, and graphs become expensive and cluttered when they have many thousands of points.
– Specialized statistics: JMP does only conventional data analysis. Consider another package for more complicated analyses (e.g. SAS, R, and S-Plus).

Page 13: Statistical Design of Experiments


PROBABILITY DISTRIBUTION USING JMP (EXAMPLE 1)

• The yield measurements (%) from a granulator are given below: 79, 91, 83, 78, 90, 84, 93, 83, 83, 80

• Using the statistical software package JMP, calculate the mean, variance, and standard deviation of the data. Also, plot a distribution of the data.
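For readers without JMP, the summary statistics requested in Example 1 can be cross-checked with a short script. This is only a sketch using Python's standard `statistics` module (not part of the original course material); it uses the n − 1 definition of sample variance given earlier and does not reproduce the JMP distribution plot.

```python
import statistics

# Yield measurements (%) from Example 1
yields = [79, 91, 83, 78, 90, 84, 93, 83, 83, 80]

mean = statistics.mean(yields)          # sample mean
variance = statistics.variance(yields)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(yields)      # sample standard deviation

print(f"mean               = {mean:.2f}")
print(f"variance           = {variance:.2f}")
print(f"standard deviation = {std_dev:.2f}")
```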

Page 14: Statistical Design of Experiments


RESULTS FOR EXAMPLE 1

Page 15: Statistical Design of Experiments


NORMAL DISTRIBUTION

• The outcomes from many physical phenomena frequently follow a single type of distribution, the Normal Distribution. (See Section I)

• If several samples are taken from a population, the distribution of sample means begins to look like a normal distribution regardless of the distribution of the event generating the sample

Page 16: Statistical Design of Experiments


CENTRAL LIMIT THEOREM

If random samples of n observations are drawn from any population with finite mean $\mu$ and variance $\sigma^2$, then, when n is large, the sampling distribution of the sample mean is approximately normally distributed with mean and standard deviation:

$$\mu_{\bar{x}} = E(X) = \mu, \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

Page 17: Statistical Design of Experiments


EFFECTS OF SAMPLE SIZE

As the sample size, n, increases, the variance of the sample mean decreases.

[Figure: sampling distributions of $\bar{x}$ for n = 30 and n = 50]

Page 18: Statistical Design of Experiments


SAMPLE SIZE EFFECTS (EXAMPLE 2)

• Take 5 measurements of the yield from a granulator and calculate the mean. Repeat this process 50 times and generate a distribution of mean values. The results are in the JMP data table S2E2.

• It can be shown that using 10 or 20 measurements in the first step will give greater accuracy and less variability.

• Note the change in the shape of the distributions with an increase in the individual sample size, n.
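The effect described in Example 2 can be imitated with simulated data. The sketch below does not use the S2E2 table; it assumes a hypothetical granulator whose yields are normally distributed with mean 84 and standard deviation 5 (both values are illustrative assumptions), and compares the observed spread of 50 sample means with the $\sigma/\sqrt{n}$ value from the central limit theorem.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical granulator: normally distributed yields (assumed values)
POP_MEAN, POP_SD = 84.0, 5.0
REPEATS = 50  # number of sample means per distribution, as in Example 2

def sd_of_sample_means(n):
    """Draw REPEATS samples of size n; return the std. dev. of their means."""
    means = []
    for _ in range(REPEATS):
        sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(n)]
        means.append(statistics.mean(sample))
    return statistics.stdev(means)

results = {n: sd_of_sample_means(n) for n in (5, 10, 20)}
for n, observed in results.items():
    predicted = POP_SD / n ** 0.5  # sigma / sqrt(n)
    print(f"n = {n:2d}: sd of means = {observed:.2f}, sigma/sqrt(n) = {predicted:.2f}")
```

With only 50 repeats the observed spread fluctuates around the predicted value, so the two columns should be close but not identical.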

Page 19: Statistical Design of Experiments


RESULTS FOR EXAMPLE 2

Page 20: Statistical Design of Experiments


CONFIDENCE LIMITS

Confidence limits are used to express the validity of statements about the value of population parameters. For instance:
– The yield C of the reactor in example 1 is 90% at a temperature of 250 ºF
– The yield C of the reactor is not significantly changed when the temperature increases from 242 ºF to 246 ºF
– There is no significant difference between the variance of the output of C at 250 ºF and 260 ºF

Page 21: Statistical Design of Experiments


CONFIDENCE LIMITS

The bounds on the population parameter $\theta$ take the form:

$$\theta_l \le \theta \le \theta_u$$

Page 22: Statistical Design of Experiments


CONFIDENCE LIMITS

The bounds are based on
– The size of the sample, n
– The confidence level, $(1-\alpha)$: % confidence = $(100)(1-\alpha)$
  i.e., $\alpha = 0.1$ means that if we generated 100 such intervals, about 90 of them would be expected to contain the true (population) parameter
– These are not Bayesian intervals (those will be discussed in the second module)

Page 23: Statistical Design of Experiments


Z STATISTIC

• The Z statistic can be used to place confidence limits on the population mean when the population variance is known.

• The Z distribution is that of a normally distributed random variable with $\mu = 0$ and $\sigma^2 = 1$, i.e. $Z \sim N(0,1)$:

$$p(Z) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{Z^2}{2}\right)$$

Page 24: Statistical Design of Experiments


Z STATISTIC

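From the central limit theorem, if n is large:

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

regardless of the population distribution, so that

$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$$

[Figure: Z distribution with the critical values $-Z_{\alpha/2}$ and $Z_{\alpha/2}$ marked]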

Page 25: Statistical Design of Experiments


CONFIDENCE LIMITS ON THE POPULATION MEAN (POPULATION VARIANCE KNOWN)

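• Two-sided confidence interval

$$\bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

• One-sided confidence intervals

$$\mu \le \bar{x} + Z_{\alpha}\frac{\sigma}{\sqrt{n}} \qquad \text{or} \qquad \mu \ge \bar{x} - Z_{\alpha}\frac{\sigma}{\sqrt{n}}$$

[Figure: Z distribution with $-Z_{\alpha/2}$ and $Z_{\alpha/2}$ marked for the two-sided case, and $-Z_{\alpha}$ for the one-sided case]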
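As a rough illustration of the two-sided interval with known variance, the sketch below reuses the Example 1 yields and assumes a known population standard deviation of 5 (a hypothetical value chosen only for the illustration; the slides do not give one). The critical value $Z_{\alpha/2}$ comes from `statistics.NormalDist` in the Python standard library.

```python
from math import sqrt
from statistics import NormalDist, mean

yields = [79, 91, 83, 78, 90, 84, 93, 83, 83, 80]  # Example 1 data (%)
sigma = 5.0    # ASSUMED known population standard deviation (hypothetical)
alpha = 0.05   # 95% confidence

n = len(yields)
x_bar = mean(yields)
z_half = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{alpha/2} for N(0, 1)
half_width = z_half * sigma / sqrt(n)

lower, upper = x_bar - half_width, x_bar + half_width
print(f"{100 * (1 - alpha):.0f}% CI for the mean: ({lower:.2f}, {upper:.2f})")
```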

Page 26: Statistical Design of Experiments


t STATISTIC

• The t statistic is used to determine confidence limits when the population variance $\sigma^2$ is unknown and must be estimated from the sample variance $s^2$, i.e.

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{(n-1)}$$

a t distribution with n − 1 degrees of freedom (df).

Page 27: Statistical Design of Experiments


COMPARISON OF Z AND t

[Figure: density curves of the Z distribution and of t distributions with df = 3, df = 2, and df = 1]

Page 28: Statistical Design of Experiments


CONFIDENCE LIMITS ON THE POPULATION MEAN (POPULATION VARIANCE UNKNOWN)

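• Two-sided confidence interval

$$\bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$$

• One-sided confidence intervals

$$\mu \le \bar{x} + t_{\alpha,\,n-1}\frac{s}{\sqrt{n}} \qquad \text{or} \qquad \mu \ge \bar{x} - t_{\alpha,\,n-1}\frac{s}{\sqrt{n}}$$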
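A minimal sketch of the two-sided t interval, again using the Example 1 yields. The Python standard library has no t distribution, so the critical value $t_{0.025,\,9} = 2.262$ is hard-coded from a standard t table; with a different sample size or confidence level that constant would have to be looked up again.

```python
from math import sqrt
from statistics import mean, stdev

yields = [79, 91, 83, 78, 90, 84, 93, 83, 83, 80]  # Example 1 data (%)

n = len(yields)
x_bar = mean(yields)
s = stdev(yields)  # sample standard deviation (n - 1 in the denominator)

T_CRIT = 2.262     # tabulated t_{0.025, 9}; valid only for n = 10, alpha = 0.05
half_width = T_CRIT * s / sqrt(n)

lower, upper = x_bar - half_width, x_bar + half_width
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

The interval is wider than a Z interval with the same standard deviation would be, which reflects the extra uncertainty from estimating the variance.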

Page 29: Statistical Design of Experiments


CONFIDENCE LIMITS ON THE DIFFERENCE OF TWO MEANS

To get confidence limits on the difference of the means of two different populations, $\mu_1 - \mu_2$, we sample from the two populations and calculate the sample means $\bar{X}_1$, $\bar{X}_2$ and sample variances $S_1^2$, $S_2^2$ respectively.

If we assume the populations have the same variance ($\sigma^2 = \sigma_1^2 = \sigma_2^2$), the sample variances of the two samples can be pooled to express a single estimate of variance $S_p^2$. The pooled variance $S_p^2$ is calculated by:

$$S_p^2 = \frac{(n-1)S_1^2 + (m-1)S_2^2}{n+m-2}$$

where n and m are the sample sizes of the two samples from the different populations.

Page 30: Statistical Design of Experiments


CONFIDENCE LIMITS ON THE DIFFERENCE OF TWO MEANS

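Known population variance: (Z - Distribution)

(i) Equal variances, $\sigma_1^2 = \sigma_2^2 = \sigma^2$

$$(\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2}\,\sigma\left(\frac{1}{n_1} + \frac{1}{n_2}\right)^{1/2} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2}\,\sigma\left(\frac{1}{n_1} + \frac{1}{n_2}\right)^{1/2}$$

(ii) Unequal variances

$$(\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2}\left(\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)^{1/2} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2}\left(\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)^{1/2}$$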

Page 31: Statistical Design of Experiments


CONFIDENCE LIMITS ON THE DIFFERENCE OF TWO MEANS

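Unknown population variance: (t - Distribution)

(i) Equal variances, $\sigma_1^2 = \sigma_2^2 = \sigma^2$, but unknown

$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,\,n_1+n_2-2}\; s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,\,n_1+n_2-2}\; s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

(ii) Unequal variances

$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,\,\nu}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,\,\nu}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$

where the degrees of freedom are

$$\nu = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1-1) + (s_2^2/n_2)^2/(n_2-1)}$$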
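The unknown-variance case can be sketched numerically as follows. The two samples are invented for illustration and are not the S2E3&4 data; the critical value $t_{0.025,\,18} = 2.101$ is taken from a standard t table for two samples of size 10. The script computes the pooled variance, the pooled-variance interval for case (i), and the effective degrees of freedom $\nu$ for case (ii).

```python
from math import sqrt
from statistics import mean, variance

# HYPOTHETICAL dissolution results (%); not the course data table
sample_1 = [70.1, 71.3, 69.8, 70.6, 71.0, 70.2, 69.9, 70.8, 71.1, 70.4]
sample_2 = [72.0, 73.1, 72.4, 71.8, 72.9, 73.3, 72.2, 72.6, 73.0, 72.5]

n1, n2 = len(sample_1), len(sample_2)
v1, v2 = variance(sample_1), variance(sample_2)  # sample variances
diff = mean(sample_1) - mean(sample_2)

# (i) Equal but unknown variances: pooled estimate with n1 + n2 - 2 df
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
T_CRIT_18 = 2.101  # tabulated t_{0.025, 18}
half = T_CRIT_18 * sqrt(sp2) * sqrt(1 / n1 + 1 / n2)
ci_pooled = (diff - half, diff + half)

# (ii) Unequal variances: effective degrees of freedom nu
a, b = v1 / n1, v2 / n2
nu = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

print(f"pooled variance = {sp2:.3f}")
print(f"95% CI (pooled) for mu1 - mu2: ({ci_pooled[0]:.2f}, {ci_pooled[1]:.2f})")
print(f"effective df for unequal variances = {nu:.1f}")
```

An interval that excludes zero indicates a significant difference between the two means at the chosen confidence level.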

Page 32: Statistical Design of Experiments


EXAMPLE 3

Two samples, each of size 10, are taken from a dissolution apparatus. The first one is taken at a temperature of 35 ºC and the second at a temperature of 37 ºC. The results of these experiments are in the JMP data table S2E3&4.

Using JMP, calculate the mean of each sample and use confidence limits to determine if there is a significant difference between the means of the two samples at the 95% confidence level (α = 0.05).

Page 33: Statistical Design of Experiments


RESULTS FOR EXAMPLE 3

There is a significant difference between the means of the two samples at the 95% confidence level (α = 0.05).

Page 34: Statistical Design of Experiments


MODEL BUILDING

• Building a multiple linear regression model
– Stepwise: Add and remove variables over several steps
– Forward: Add variables sequentially
– Backward: Remove variables sequentially

• JMP provides criteria for model selection such as $R^2$, $C_p$, and MSE.

Page 35: Statistical Design of Experiments


HYPOTHESIS TESTING

• Although confidence limits can be used to infer the quality of the population parameters from samples drawn from the population, an alternative and more convenient approach for model building is to use hypothesis testing.

• Whenever a decision is to be made about a population characteristic, make a hypothesis about the population parameter and test it with data from samples.  

• Generally, a statistical test tests the null hypothesis H0 against the alternative hypothesis Ha.

• In Example 3, H0 is that there is no difference between the means of the two experiments. Ha is that there is a significant difference between the means of the two experiments.

Page 36: Statistical Design of Experiments


GENERAL PROCEDURE FOR HYPOTHESIS TESTING

1. Specify H0 and Ha to test. This typically has to be a hypothesis that makes a specific prediction.

2. Declare an alpha level

3. Specify the test statistic against which the observed statistic will be compared.

4. Collect the data and calculate the observed t statistic.

5. Make a conclusion. Reject the null hypothesis if and only if the observed t statistic is more extreme than the critical value (for a two-sided test, if its magnitude is larger than the critical value).
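The five steps can be walked through with a small one-sample example. The sketch below tests whether the mean of the Example 1 yields differs from 80%, where 80% is a hypothetical target value introduced only for illustration; the critical value $t_{0.025,\,9} = 2.262$ is again taken from a standard t table.

```python
from math import sqrt
from statistics import mean, stdev

yields = [79, 91, 83, 78, 90, 84, 93, 83, 83, 80]  # Example 1 data (%)

# Step 1: H0: mu = 80, Ha: mu != 80 (80 is a HYPOTHETICAL target yield)
MU_0 = 80.0

# Step 2: declare the alpha level
ALPHA = 0.05

# Step 3: test statistic is t with n - 1 = 9 df; tabulated t_{ALPHA/2, 9}
T_CRIT = 2.262

# Step 4: calculate the observed t statistic from the data
n = len(yields)
t_obs = (mean(yields) - MU_0) / (stdev(yields) / sqrt(n))

# Step 5: two-sided decision rule
reject_h0 = abs(t_obs) > T_CRIT
print(f"observed t = {t_obs:.2f}, critical t = {T_CRIT}, reject H0: {reject_h0}")
```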

Page 37: Statistical Design of Experiments


TYPE I AND TYPE II ERROR

Comparing the state of nature and decision, we have four situations.

State of nature          Decision
• Null hypothesis true   Fail to reject null
• Null hypothesis true   Reject null
• Null hypothesis false  Fail to reject null
• Null hypothesis false  Reject null

Page 38: Statistical Design of Experiments


TYPE I AND TYPE II ERROR

• Type I ($\alpha$) error
– False positive
– We are observing a difference that does not exist

• Type II ($\beta$) error
– False negative
– We fail to observe a difference that does exist

                 Null true      Null false
Reject           Type I error   Correct
Fail to reject   Correct        Type II error

Page 39: Statistical Design of Experiments


P - VALUE

• The specific value of $\alpha$ at which the population parameter and one of the confidence limits coincide
– The observed level of significance

• A more technical definition:
– The probability (under the null hypothesis) of observing a test statistic that is at least as extreme as the one that is actually observed
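The technical definition translates directly into a tail-area calculation. As a minimal sketch for a Z statistic, the helper below (a hypothetical function name, not something from JMP) returns the two-sided p-value, i.e. the probability under $N(0,1)$ of a value at least as extreme as the one observed. The observed value of 2.0 is an arbitrary illustration.

```python
from statistics import NormalDist

def two_sided_p_value(z_obs):
    """P(|Z| >= |z_obs|) under the null hypothesis, with Z ~ N(0, 1)."""
    return 2 * (1 - NormalDist().cdf(abs(z_obs)))

z_obs = 2.0  # HYPOTHETICAL observed Z statistic
p = two_sided_p_value(z_obs)
print(f"z = {z_obs}, two-sided p-value = {p:.4f}")
```

If the p-value is smaller than the declared $\alpha$, the null hypothesis is rejected at that level.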

Page 40: Statistical Design of Experiments


INFERENCE SUMMARY

• Population properties are inferred from sample properties via the central limit theorem

• Confidence intervals tell us something about how well we understand a parameter… but give no guarantees (Type II error)

• P values give us a quick number to check to see how significant a test is.

Page 41: Statistical Design of Experiments


MODEL BUILDING

• “All models are wrong, but some are useful.” – George Box

• “A model should be as simple as possible, but no simpler.” – Albert Einstein

Page 42: Statistical Design of Experiments


REGRESSION MODEL

Regression analysis creates empirical mathematical models that determine which factors are important and quantify their effects on the process, but do not explain the underlying phenomena.

Outputs = f (inputs, process conditions, coefficients) + error

The coefficients are often called model parameters.

[Figure: block diagram of a process with inputs, process conditions, and outputs]

Page 43: Statistical Design of Experiments


SIMPLE LINEAR REGRESSION

Simple linear regression model (one independent factor or variable):

Y = β0 + β1X + e

where
– e is a measure of experimental and modeling error
– β0, β1 are regression coefficients
– Y is the response
– X is the factor

These models assume that we can measure X perfectly and all error or variability is in Y.

Page 44: Statistical Design of Experiments


SIMPLE LINEAR REGRESSION

For a one factor model, we obtain Y as a function of X

[Figure: scatter plot of the response Y against the factor X]

Page 45: Statistical Design of Experiments


CORRELATION COEFFICIENTS

The correlation between the factor and the response is indicated by the sign of the regression coefficient β1, which may be:

– Zero

– Positive

– Negative

Page 46: Statistical Design of Experiments


LACK OF CORRELATION

If β1 = 0, the response does not depend on the factor

[Figure: two scatter plots of Y against X showing no trend]

Page 47: Statistical Design of Experiments


POSITIVE CORRELATION COEFFICIENTS

If β1 > 0, the response and factor are positively correlated

[Figure: scatter plot of Y against X with an increasing trend]

Page 48: Statistical Design of Experiments


NEGATIVE CORRELATION COEFFICIENTS

If β1 < 0, the response and factor are negatively correlated

[Figure: scatter plot of Y against X with a decreasing trend]

Page 49: Statistical Design of Experiments


LEAST SQUARES

The coefficients are usually estimated using the method of least squares (or Method of Maximum Likelihood)

This method minimizes the sum of the squares of the differences between the value $\hat{y}_i$ predicted by the model at the ith data point and the observed value $Y_i$ at the same value of $X_i$

[Figure: scatter plot of Y against X with the estimated regression line, showing the observed value $Y_i$ and the predicted value $\hat{y}_i$ at $X_i$]
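For the simple linear model Y = β0 + β1X + e, the least-squares estimates have a well-known closed form: the slope is the ratio of the cross-product sum to the sum of squares of X, and the intercept follows from the means. The sketch below implements this; `fit_line` is a hypothetical helper name, and the temperature and yield values are invented for illustration rather than taken from the course data tables.

```python
def fit_line(xs, ys):
    """Return (b0, b1) minimizing the sum of (y_i - (b0 + b1 * x_i))**2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1

# HYPOTHETICAL data: temperature (factor) and yield (response)
temps = [35, 35, 36, 36, 37, 37]
yields = [70.2, 70.6, 71.1, 71.5, 72.4, 72.0]

b0, b1 = fit_line(temps, yields)
predicted = [b0 + b1 * x for x in temps]
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(yields, predicted))
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse:.3f}")
```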

Page 50: Statistical Design of Experiments


EXAMPLE 4

Use the previous yield data (T2E3&4) from different dissolution temperatures. Make a model that describes the effect of temperature on the yield. Note that here, temperature is the factor and the yield is the response.

Page 51: Statistical Design of Experiments


RESULTS FOR EXAMPLE 4

Page 52: Statistical Design of Experiments


MULTIPLE LINEAR REGRESSION – ONE FACTOR

• If a simple linear regression equation does not adequately describe a set of data then multiple linear regression models may be used.

• Multiple linear regression equation for response variable Y and a single factor X takes the form of a polynomial:

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \cdots + \beta_m X^m$$

Page 53: Statistical Design of Experiments


EXAMPLE 5

Three samples of size 10 are taken from an API (Active Pharmaceutical Ingredient) plant. The first one was taken at a batch reactor pressure of 3 bar, the second at 3.5 bar, and the final at 4 bar. The data table is T2E5. Use regression analysis to build a model describing the effect of pressure on the yield of the API, using a squared term if necessary.

Page 54: Statistical Design of Experiments


RESULTS FOR EXAMPLE 5

Page 55: Statistical Design of Experiments


MULTIPLE LINEAR REGRESSION – MORE THAN ONE FACTOR

• If more than one regressor is needed in the model, multiple linear regression models may be used to find the relationship between Y and a combination of the factors X1, X2, …, Xp.

• The multiple linear regression equation for one response variable Y and factors X1, X2, …, Xp takes the form of a polynomial.

Page 56: Statistical Design of Experiments


EXAMPLE OF MULTILINEAR REGRESSION MODEL

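$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_m X_m + \varepsilon$$

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1 X_2 + \beta_3 X_1 X_3 + \varepsilon$$

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_1^2 X_2^4 + \varepsilon$$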

$$Y = \beta_0 + \beta_1 X_1 X_2^{35} + \beta_2 X_3^3 + \varepsilon$$

Page 57: Statistical Design of Experiments


NONLINEAR MODELS

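• A model is said to be nonlinear if

$$\frac{\partial y}{\partial \beta_i} = g(\beta_j) \quad \text{for any } i$$

i.e. the derivative of the response with respect to at least one coefficient still depends on the coefficients.

• Examples

$$Y = \beta_0 \exp(-\beta_1 X_1) + \varepsilon$$

$$Y = \beta_0 X_1^{\beta_1} + \varepsilon$$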

Page 58: Statistical Design of Experiments


EVALUATING REGRESSION MODELS

• To determine if a model is adequate to describe the observed data, the analysis of variance may be performed

• Calculate the deviation between the data points and the values predicted by the model, called the error sum of squares ($SS_E$)

$$SS_E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Page 59: Statistical Design of Experiments


SUM OF SQUARES

• Calculate the total variation in the data, called the total sum of squares ($SS_T$)

$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

• The amount of the total variation explained by the model is called the regression sum of squares ($SS_R$)

$$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

• It may be shown that:

$$SS_T = SS_R + SS_E$$

Page 60: Statistical Design of Experiments


SOURCES OF VARIABILITY

Page 61: Statistical Design of Experiments


MEAN SQUARE

The mean square is the sum of squares divided by the associated degrees of freedom (DOF)

MSR = SSR / p
MSE = SSE / (n − p − 1)

where p is the number of parameters in the model, not counting the intercept β0. For simple linear regression p = 1, so MSE = SSE / (n − 2).

Total DOF = DOF for Regression + DOF for Error

n − 1 = p + (n − p − 1)

Page 62: Statistical Design of Experiments


F TEST

In multiple linear regression, the F statistic can be used in hypothesis testing.

H0 in this hypothesis testing is that all the β’s except β0 are 0.

Page 63: Statistical Design of Experiments


F TEST AND R2

• The mean squares are used to perform an F test since they estimate specific population variances

F = MSR / MSE

• The sums of squares are used to calculate the $R^2$ criterion

$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}$$
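The sums of squares, mean squares, F statistic, and $R^2$ can be tied together in one short calculation. The sketch below fits a straight line (p = 1 regressor) to invented pressure and yield values, which are not the T2E5 data, and uses n − p − 1 error degrees of freedom. It stops at the F statistic; turning F into a p-value needs an F distribution, which the Python standard library does not provide.

```python
# HYPOTHETICAL data: reactor pressure (bar) and API yield (%)
xs = [3.0, 3.0, 3.5, 3.5, 4.0, 4.0]
ys = [61.0, 62.1, 66.2, 65.4, 69.8, 70.9]

n = len(xs)
p = 1  # number of regressors, not counting the intercept

# Least-squares fit of Y = b0 + b1 * X
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]

# Sums of squares
sst = sum((y - y_bar) ** 2 for y in ys)               # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # regression
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # error

# Mean squares, F statistic, and R^2
msr = ssr / p
mse = sse / (n - p - 1)
f_stat = msr / mse
r2 = ssr / sst

print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
print(f"F = {f_stat:.1f}, R^2 = {r2:.4f}")
```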

Page 64: Statistical Design of Experiments


EXAMPLE 6

Examine the variance of the model created to describe the effect of pressure on the yield of API (in Example 5).

Page 65: Statistical Design of Experiments


RESULTS FOR EXAMPLE 6

Since the p value for the F test is < .0001, which is below the .05 significance level, the overall model is significant.

Page 66: Statistical Design of Experiments


EXAMPLE 7

Build a model using JMP data table T2E7 with the potential factors temperature, A/B feed ratio, and termination time, and the response variable yield. Determine which terms are significant. Build the model using the forward and backward selection techniques.

Page 67: Statistical Design of Experiments


RESULTS FOR EXAMPLE 7

The temperature and termination time are significant at the .05 level.