Monday, Aug 13, 2007, Dr. Gary Blau, Sean Han
Statistical Design of Experiments
SECTION II
REVIEW OF STATISTICS
INTRODUCTION
• Difference between statistics and probability
• Statistical inference
– Samples and populations
– Intro to JMP software package
– Central limit theorem
– Confidence intervals
– Hypothesis testing
• Regression and modeling fundamentals
– Introduction to model building
– Simple linear regression
– Multiple linear regression
– Model building
PROBABILITY VS STATISTICS
Problem: Dealing with sources of variability
Approach: Probability is the language used to characterize quantitative variability in random experiments

Problem: Understanding the behavior of a process from random experiments on the process
Approach: Statistics allows us to infer process behavior from a small number of experiments or trials
POPULATION VS SAMPLE
Samples drawn from the population are used to infer things about the population
[Figure: Samples 1, 2, and 3 drawn from the population]
BATCH REACTOR OPTIMIZATION EXAMPLE
A new small molecule API, designated simply C, is being produced in a batch reactor in a pilot plant. Two liquid raw materials, A and B, are added to the reactor and the reaction $A + B \xrightarrow{K_1} C$ takes place. ($K_1$ is the reaction rate constant.)
BATCH REACTOR OPTIMIZATION EXAMPLE
• There are various controllable factors for the reactor, some of which are:
– Temperature
– Agitation rate
– A/B feed ratio
– ...
• Adjusting the values, or levels, of these factors may change the yield of C.
• We would like to find the combination of levels that maximizes the yield of C.
STATISTICAL INFERENCE
Suppose 10 different batches are run and the yield of C at the end of the reaction measured. The properties of the population (i.e. all future batches) can be estimated from the properties of this sample of 10 batch runs.
Specifically, it is possible to estimate the parameters:
– Central tendency: mean, median, mode
– Scatter or variability: variance, standard deviation (skewness, kurtosis)
RANDOM SAMPLE
Each member of the population has an equal chance of being selected for the sample. (In the example, it means that each batch of material is made under the same processing condition and is different only in the time at which it was run.)
MEAN OF A SAMPLE
The average value of the n batches in the sample is called the sample mean:
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
where $X_i$ is the yield of the ith batch and n is the sample size.
It can be used to estimate the central tendency of the population, i.e. the population mean.
VARIANCE OF A SAMPLE
• The variance of a sample of size n is
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
• The population variance, $\sigma^2$, can be inferred from $s^2$
INTRODUCTION TO JMP
• Background
– JMP is a statistical design and analysis package. JMP helps you explore data, fit models, and discover patterns.
– JMP is from the SAS Institute, a large privately held company specializing in data analysis software.
• Features
– The emphasis in JMP is on working with data interactively.
– Simple and informative graphics and plots are often shown automatically to facilitate discovery of behavioral patterns.
INTRODUCTION TO JMP
• Limitations of JMP
– Large jobs: JMP is not suitable for problems with large data sets. JMP data tables must fit in the main memory of your PC. JMP graphs everything, and graphs become expensive to draw and cluttered when they contain many thousands of points.
– Specialized statistics: JMP does only conventional data analysis. Consider another package for more complicated analysis (e.g. SAS, R, or S-Plus).
PROBABILITY DISTRIBUTION USING JMP (EXAMPLE 1)
• The yield measurements (%) from a granulator are given below: 79, 91, 83, 78, 90, 84, 93, 83, 83, 80
• Using the statistical software package JMP, calculate the mean, variance, and standard deviation of the data. Also, plot a distribution of the data.
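As a cross-check outside JMP, the summary statistics requested in Example 1 can be sketched in Python with only the standard library. The variable names are illustrative, and the variance uses the n − 1 divisor of the sample variance.

```python
import statistics

# Granulator yield measurements from Example 1 (%)
yields = [79, 91, 83, 78, 90, 84, 93, 83, 83, 80]

mean = statistics.mean(yields)          # sample mean
variance = statistics.variance(yields)  # sample variance, divisor n - 1
std_dev = statistics.stdev(yields)      # sample standard deviation

print(f"mean = {mean:.2f}, variance = {variance:.2f}, std dev = {std_dev:.2f}")
# prints: mean = 84.40, variance = 27.16, std dev = 5.21
```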
RESULTS FOR EXAMPLE 1
NORMAL DISTRIBUTION
• The outcomes of many physical phenomena frequently follow a single type of distribution, the Normal Distribution. (See Section I)
• If several samples are taken from a population, the distribution of the sample means begins to look like a normal distribution, regardless of the distribution of the event generating the samples.
CENTRAL LIMIT THEOREM
If random samples of n observations are drawn from any population with finite mean $\mu$ and variance $\sigma^2$, then, when n is large, the sampling distribution of the sample mean $\bar{X}$ is approximately normal with mean and standard deviation:
$$E(\bar{X}) = \mu, \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
EFFECTS OF SAMPLE SIZE
As the sample size, n, increases, the variance of the sample mean decreases.
[Figure: sampling distributions of the sample mean for n = 30 and n = 50]
SAMPLE SIZE EFFECTS (EXAMPLE 2)
• Take 5 measurements of the yield from a granulator and calculate the mean. Repeat this process 50 times and generate a distribution of the mean values. The results are in the JMP data table S2E2.
• It can be shown that using 10 or 20 measurements in the first step will give greater accuracy and less variability.
• Note the change in the shape of the distributions with an increase in the individual sample size, n.
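The effect of the individual sample size can also be explored by simulation. The following is a minimal sketch, not a reproduction of table S2E2: it assumes a hypothetical normal population of yields (mean 84 %, standard deviation 5 %) and repeats the "take n measurements, average them" step 50 times for n = 5, 10, and 20.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of granulator yields: normal, mean 84 %, sd 5 %
POP_MEAN, POP_SD = 84.0, 5.0

def sample_means(n, repeats=50):
    """Draw `repeats` samples of size n and return the mean of each."""
    return [
        statistics.mean(random.gauss(POP_MEAN, POP_SD) for _ in range(n))
        for _ in range(repeats)
    ]

spread = {}
for n in (5, 10, 20):
    means = sample_means(n)
    spread[n] = statistics.stdev(means)
    # The central limit theorem predicts a spread near sigma / sqrt(n)
    print(f"n = {n:2d}: sd of sample means = {spread[n]:.2f}, "
          f"theory = {POP_SD / n ** 0.5:.2f}")
```

The printed spread of the 50 means should shrink roughly in proportion to 1/√n as n grows.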
RESULTS FOR EXAMPLE 2
CONFIDENCE LIMITS
Confidence limits are used to express the validity of statements about the value of population parameters. For instance:
– The yield of C from the reactor in Example 1 is 90% at a temperature of 250 °F
– The yield of C from the reactor is not significantly changed when the temperature increases from 242 °F to 246 °F
– There is no significant difference between the variance of the output of C at 250° and 260 °C
CONFIDENCE LIMITS
The bounds on a population parameter $\theta$ take the form:
$$l \le \theta \le u$$
where l and u are the lower and upper confidence limits.
CONFIDENCE LIMITS
The bounds are based on
– The size of the sample, n
– The confidence level, $(1 - \alpha)$
% confidence = $100(1 - \alpha)$
i.e., $\alpha = 0.1$ means that if we generated 100 such intervals, we would expect about 90 of them to contain the true (population) parameter
– These are not Bayesian intervals (those will be discussed in the second module)
Z STATISTIC
• The Z statistic can be used to place confidence limits on the population mean when the population variance is known.
• Z is a normally distributed random variable with $\mu = 0$ and $\sigma^2 = 1$, i.e.
$$Z \sim N(0, 1), \qquad p(Z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{Z^2}{2}\right)$$
Z STATISTIC
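From the central limit theorem, if n is large,
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
regardless of the population distribution, so that
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1)$$
and, with probability $1 - \alpha$,
$$-Z_{\alpha/2} \le \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \le Z_{\alpha/2}$$
[Figure: standard normal distribution with the limits $-Z_{\alpha/2}$ and $Z_{\alpha/2}$ marked]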
CONFIDENCE LIMITS ON THE POPULATION MEAN (POPULATION VARIANCE KNOWN)
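• Two-sided confidence interval
$$\bar{x} - Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
• One-sided confidence intervals
$$\mu \le \bar{x} + Z_{\alpha} \frac{\sigma}{\sqrt{n}} \qquad \text{or} \qquad \mu \ge \bar{x} - Z_{\alpha} \frac{\sigma}{\sqrt{n}}$$
[Figure: normal distribution with the two-sided limits $\pm Z_{\alpha/2}$ and the one-sided limit $-Z_{\alpha}$ marked]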
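The two-sided interval for a known population variance can be sketched in Python using `statistics.NormalDist` from the standard library. The function name is illustrative, and the value σ = 5 is an assumption made only for this example; the slides do not give a known population standard deviation for the granulator.

```python
from statistics import NormalDist, mean

def z_confidence_interval(data, sigma, alpha=0.05):
    """Two-sided 100(1 - alpha)% limits on the mean, population sigma known."""
    n = len(data)
    x_bar = mean(data)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{alpha/2}
    half_width = z * sigma / n ** 0.5
    return x_bar - half_width, x_bar + half_width

yields = [79, 91, 83, 78, 90, 84, 93, 83, 83, 80]        # Example 1 data
lower, upper = z_confidence_interval(yields, sigma=5.0)  # sigma = 5 is assumed
print(f"95% limits on the mean: {lower:.2f} to {upper:.2f}")
```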
t STATISTIC
• The t statistic is used to determine confidence limits when the population variance $\sigma^2$ is unknown and must be estimated from the sample variance $s^2$, i.e.
$$T = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t(n - 1),$$
a t distribution with n − 1 degrees of freedom (df).
COMPARISON OF Z AND t
[Figure: density curves of the Z distribution and of t distributions with df = 1, 2, and 3]
CONFIDENCE LIMITS ON THE POPULATION MEAN (POPULATION VARIANCE UNKNOWN)
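• Two-sided confidence interval
$$\bar{x} - t_{\alpha/2,\, n-1} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\, n-1} \frac{s}{\sqrt{n}}$$
• One-sided confidence intervals
$$\mu \le \bar{x} + t_{\alpha,\, n-1} \frac{s}{\sqrt{n}} \qquad \text{or} \qquad \mu \ge \bar{x} - t_{\alpha,\, n-1} \frac{s}{\sqrt{n}}$$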
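As an illustration, the two-sided t interval can be applied to the Example 1 granulator data. This is a sketch using only the standard library, so the critical value $t_{0.025,\,9} = 2.262$ is copied from a standard t table rather than computed; the function name is hypothetical.

```python
import statistics

# Two-sided critical values t_{0.025, df} copied from a standard t table
T_CRIT_95 = {9: 2.262}

def t_confidence_interval(data, t_crit):
    """Two-sided limits on the mean when sigma is estimated by s."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)
    half_width = t_crit * s / n ** 0.5
    return x_bar - half_width, x_bar + half_width

yields = [79, 91, 83, 78, 90, 84, 93, 83, 83, 80]  # Example 1 data
lower, upper = t_confidence_interval(yields, T_CRIT_95[len(yields) - 1])
print(f"95% limits on the mean: {lower:.2f} to {upper:.2f}")
# prints: 95% limits on the mean: 80.67 to 88.13
```

Because the t critical value exceeds the corresponding Z value (1.96), the interval is wider than one based on a known σ of the same size.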
CONFIDENCE LIMITS ON THE DIFFERENCE OF TWO MEANS
To get confidence limits on the difference of the means of two different populations, $\mu_1 - \mu_2$, we sample from the two populations and calculate the sample means $\bar{X}_1$, $\bar{X}_2$ and sample variances $S_1^2$, $S_2^2$, respectively.
If we assume the populations have the same variance ($\sigma^2 = \sigma_1^2 = \sigma_2^2$), the sample variances of the two samples can be pooled into a single estimate of variance, $S_p^2$. The pooled variance $S_p^2$ is calculated by:
$$S_p^2 = \frac{(n - 1) S_1^2 + (m - 1) S_2^2}{n + m - 2}$$
where n and m are the sizes of the two samples from the different populations.
CONFIDENCE LIMITS ON THE DIFFERENCE OF TWO MEANS
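Known population variance: (Z - Distribution)
(i) Equal variances, $\sigma_1^2 = \sigma_2^2 = \sigma^2$
$$(\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2}\, \sigma \left(\frac{1}{n_1} + \frac{1}{n_2}\right)^{1/2} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2}\, \sigma \left(\frac{1}{n_1} + \frac{1}{n_2}\right)^{1/2}$$
(ii) Unequal variances
$$(\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2} \left(\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)^{1/2} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2} \left(\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)^{1/2}$$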
CONFIDENCE LIMITS ON THE DIFFERENCE OF TWO MEANS
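Unknown population variance: (t - Distribution)
(i) $\sigma_1^2 = \sigma_2^2 = \sigma^2$, but unknown
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,\, n_1 + n_2 - 2}\, s_p \left(\frac{1}{n_1} + \frac{1}{n_2}\right)^{1/2} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,\, n_1 + n_2 - 2}\, s_p \left(\frac{1}{n_1} + \frac{1}{n_2}\right)^{1/2}$$
(ii) Unequal variances
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,\, \nu} \left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^{1/2} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,\, \nu} \left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^{1/2}$$
where the degrees of freedom are
$$\nu = \frac{(s_1^2 / n_1 + s_2^2 / n_2)^2}{(s_1^2 / n_1)^2 / (n_1 - 1) + (s_2^2 / n_2)^2 / (n_2 - 1)}$$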
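The unknown-variance formulas can be sketched in Python as follows. The data are hypothetical dissolution results invented for illustration (they are not the course data table), the function names are illustrative, and the critical value $t_{0.025,\,18} = 2.101$ is copied from a standard t table.

```python
import statistics

def pooled_variance(x1, x2):
    """Pooled estimate S_p^2 of a common variance."""
    n, m = len(x1), len(x2)
    s1_sq, s2_sq = statistics.variance(x1), statistics.variance(x2)
    return ((n - 1) * s1_sq + (m - 1) * s2_sq) / (n + m - 2)

def welch_df(x1, x2):
    """Approximate degrees of freedom for the unequal-variance case."""
    a = statistics.variance(x1) / len(x1)
    b = statistics.variance(x2) / len(x2)
    return (a + b) ** 2 / (a ** 2 / (len(x1) - 1) + b ** 2 / (len(x2) - 1))

def pooled_interval(x1, x2, t_crit):
    """Limits on mu1 - mu2 assuming equal but unknown variances."""
    diff = statistics.mean(x1) - statistics.mean(x2)
    s_p = pooled_variance(x1, x2) ** 0.5
    half_width = t_crit * s_p * (1 / len(x1) + 1 / len(x2)) ** 0.5
    return diff - half_width, diff + half_width

# Hypothetical dissolution results (%), invented for this sketch
at_35 = [71.2, 69.8, 70.5, 72.1, 70.9, 71.6, 69.5, 70.2, 71.8, 70.4]
at_37 = [73.0, 74.2, 72.5, 73.8, 74.9, 73.3, 72.8, 74.0, 73.6, 74.4]

T_CRIT = 2.101  # t_{0.025, 18} from a standard t table
lower, upper = pooled_interval(at_35, at_37, T_CRIT)
print(f"95% limits on mu1 - mu2: {lower:.2f} to {upper:.2f}")
print("zero inside interval:", lower <= 0 <= upper)
print(f"df if variances are not pooled: {welch_df(at_35, at_37):.1f}")
```

If zero lies outside the interval, the two means differ significantly at the chosen confidence level.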
EXAMPLE 3
Two samples, each of size 10, are taken from a dissolution apparatus. The first is taken at a temperature of 35 °C and the second at a temperature of 37 °C. The results of these experiments are in the JMP data table S2E3&4.
Using JMP, calculate the mean of each sample and use confidence limits to determine if there is a significant difference between the means of the two samples at the 95% confidence level (α = 0.05).
RESULTS FOR EXAMPLE 3
There is a significant difference between the means of the two samples at the 95% confidence level (α = 0.05).
MODEL BUILDING
• Building a multiple linear regression model
– Stepwise: Add and remove variables over several steps
– Forward: Add variables sequentially
– Backward: Remove variables sequentially
• JMP provides criteria for model selection such as R², Cp, and MSE.
HYPOTHESIS TESTING
• Although confidence limits can be used to infer the quality of the population parameters from samples drawn from the population, an alternative and more convenient approach for model building is to use hypothesis testing.
• Whenever a decision is to be made about a population characteristic, make a hypothesis about the population parameter and test it with data from samples.
• Generally, a statistical test tests the null hypothesis H0 against the alternative hypothesis Ha.
• In Example 3, H0 is that there is no difference between the means of the two experiments. Ha is that there is a significant difference between the means of the two experiments.
GENERAL PROCEDURE FOR HYPOTHESIS TESTING
1. Specify H0 and Ha to test. This typically has to be a hypothesis that makes a specific prediction.
2. Declare an alpha level.
3. Specify the test statistic against which the observed statistic will be compared.
4. Collect the data and calculate the observed t statistic.
5. Draw a conclusion. Reject the null hypothesis if and only if the observed t statistic is larger in absolute value than the critical one (for a two-sided test).
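The five steps can be traced in a short Python sketch. It uses the Example 1 granulator data and tests a hypothetical target of 80 % yield, a value chosen only for illustration; the critical value is copied from a standard t table.

```python
import statistics

yields = [79, 91, 83, 78, 90, 84, 93, 83, 83, 80]  # Example 1 data

# Step 1: H0: mu = 80, Ha: mu != 80 (80 % is a hypothetical target)
mu_0 = 80.0
# Step 2: alpha level
alpha = 0.05
# Step 3: critical value t_{alpha/2, n-1} = t_{0.025, 9} from a t table
t_critical = 2.262
# Step 4: observed t statistic
n = len(yields)
t_observed = (statistics.mean(yields) - mu_0) / (statistics.stdev(yields) / n ** 0.5)
# Step 5: two-sided decision rule
reject_h0 = abs(t_observed) > t_critical
print(f"t observed = {t_observed:.2f}, reject H0: {reject_h0}")
# prints: t observed = 2.67, reject H0: True
```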
TYPE I AND TYPE II ERROR
Comparing the state of nature and decision, we have four situations.
State of nature            Decision
• Null hypothesis true     Fail to reject null
• Null hypothesis true     Reject null
• Null hypothesis false    Fail to reject null
• Null hypothesis false    Reject null
TYPE I AND TYPE II ERROR
• Type I ($\alpha$) error
– False positive
– We are observing a difference that does not exist
• Type II ($\beta$) error
– False negative
– We fail to observe a difference that does exist

                  Null true       Null false
Reject            Type I error    Correct
Fail to reject    Correct         Type II error
P - VALUE
• The specific value of $\alpha$ at which the hypothesized value of the population parameter coincides with one of the confidence limits
– The observed level of significance
• A more technical definition:
– The probability (under the null hypothesis) of observing a test statistic that is at least as extreme as the one that is actually observed
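For a Z statistic, the technical definition translates directly into a tail-area calculation. The following minimal sketch uses `statistics.NormalDist` from the standard library; the function name is illustrative.

```python
from statistics import NormalDist

def two_sided_p_value(z_observed):
    """P(|Z| >= |z_observed|) under H0, with Z ~ N(0, 1)."""
    return 2 * (1 - NormalDist().cdf(abs(z_observed)))

for z in (1.0, 1.96, 3.0):
    print(f"z = {z:.2f}: p = {two_sided_p_value(z):.4f}")
```

An observed z of 1.96 gives a p-value of about 0.05, which matches the critical value used for a two-sided 95% confidence interval.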
INFERENCE SUMMARY
• Population properties are inferred from sample properties via the central limit theorem.
• Confidence intervals tell us something about how well we understand a parameter… but give no guarantees (Type II error).
• P values give us a quick number to check how significant a test result is.
MODEL BUILDING
• “All models are wrong, but some are useful.”
– George Box
• “A model should be as simple as possible, but no simpler.”
– Albert Einstein
REGRESSION MODEL
Regression analysis creates empirical mathematical models that identify which factors are important and quantify their effects on the process, but do not explain the underlying phenomena.
Outputs = f (inputs, process conditions, coefficients) + error
The coefficients are often called model parameters.
[Figure: block diagram of a process with inputs, process conditions, and outputs]
SIMPLE LINEAR REGRESSION
Simple linear regression model (one independent factor or variable):
Y = β0 + β1X + e
where
– e is a measure of experimental and modeling error
– β0, β1 are regression coefficients
– Y is the response
– X is the factor
These models assume that we can measure X perfectly and all error or variability is in Y.
SIMPLE LINEAR REGRESSION
For a one factor model, we obtain Y as a function of X
[Figure: scatter plot of Y versus X]
CORRELATION COEFFICIENTS
The correlation between the factor and the response is indicated by the sign of the regression coefficient β1, which may be:
– Zero
– Positive
– Negative
LACK OF CORRELATION
If β1 = 0, the response does not depend on the factor.
[Figure: two scatter plots of Y versus X showing no trend]
POSITIVE CORRELATION COEFFICIENTS
If β1 > 0, the response and factor are positively correlated.
[Figure: scatter plot of Y versus X with an increasing trend]
NEGATIVE CORRELATION COEFFICIENTS
If β1 < 0, the response and factor are negatively correlated.
[Figure: scatter plot of Y versus X with a decreasing trend]
LEAST SQUARES
The coefficients are usually estimated using the method of least squares (or the method of maximum likelihood).
This method minimizes the sum of the squares of the differences between the value predicted by the model at the ith data point, $\hat{y}_i$, and the observed value $Y_i$ at the same value of $X_i$.
[Figure: estimated regression line through the data, showing the observed value $Y_i$ and the predicted value $\hat{y}_i$ at $X_i$]
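For the simple linear model Y = β0 + β1X + e, the least-squares estimates have a closed form. The sketch below implements it in plain Python; the function name and the data are hypothetical, chosen only to illustrate the calculation.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the model Y = b0 + b1*X + e."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1

# Hypothetical data for illustration only
temperature = [35, 35, 36, 36, 37, 37]
yield_pct = [70.1, 70.9, 72.0, 72.4, 73.5, 74.1]

b0, b1 = fit_line(temperature, yield_pct)
print(f"fitted line: yield = {b0:.2f} + {b1:.2f} * temperature")
```

A positive estimated slope corresponds to the positively correlated case described earlier.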
EXAMPLE 4
Use the previous yield data (T2E3&4) from different dissolution temperatures. Make a model that describes the effect of temperature on the yield. Note that here, temperature is the factor and the yield is the response.
RESULTS FOR EXAMPLE 4
MULTIPLE LINEAR REGRESSION – ONE FACTOR
• If a simple linear regression equation does not adequately describe a set of data then multiple linear regression models may be used.
• Multiple linear regression equation for response variable Y and a single factor X takes the form of a polynomial:
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \dots + \beta_m X^m$$
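Because the polynomial is linear in the coefficients, it can be fitted with the same least-squares machinery by treating 1, X, X², ... as separate columns. The following is a minimal pure-Python sketch that solves the normal equations directly; the helper names and the pressure data are hypothetical, and a statistics package such as JMP would normally do this step.

```python
def solve(a, b):
    """Solve a @ beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - tail) / m[r][r]
    return beta

def fit_polynomial(x, y, degree):
    """Least-squares coefficients [b0, b1, ..., bm] of a polynomial in x."""
    rows = [[xi ** k for k in range(degree + 1)] for xi in x]  # 1, X, X^2, ...
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(degree + 1)]
           for i in range(degree + 1)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(degree + 1)]
    return solve(xtx, xty)

# Hypothetical yields at three pressures (bar), for illustration only
pressure = [3.0, 3.0, 3.5, 3.5, 4.0, 4.0]
yield_pct = [80.2, 79.6, 85.1, 84.7, 83.0, 83.6]
b0, b1, b2 = fit_polynomial(pressure, yield_pct, degree=2)
print(f"yield = {b0:.1f} + {b1:.1f} * P + {b2:.1f} * P^2")
```

A negative coefficient on the squared term indicates that the fitted yield passes through a maximum within or near the tested range.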
EXAMPLE 5
Three samples of size 10 are taken from an API (Active Pharmaceutical Ingredient) plant. The first one was taken at a batch reactor pressure of 3 bar, the second at 3.5 bar, and the final at 4 bar. The data table is T2E5. Use regression analysis to build a model describing the effect of pressure on the yield of the API, using a squared term if necessary.
RESULTS FOR EXAMPLE 5
MULTIPLE LINEAR REGRESSION – MORE THAN ONE FACTOR
• If more than one regressor is needed in the model, multiple linear regression models may be used to find the relationship between Y and a combination of factors X1, X2, …, Xp.
• The multiple linear regression equation for one response variable Y and factors X1, X2, …, Xp takes the form of a polynomial.
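EXAMPLES OF MULTIPLE LINEAR REGRESSION MODELS
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_m X_m + \varepsilon$
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1 X_2 + \beta_3 X_1 X_3 + \varepsilon$
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_1^2 X_2^4 + \varepsilon$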
$Y = \beta_0 + \beta_1 X_1 X_2^{35} + \beta_2 X_3^3 + \varepsilon$
NONLINEAR MODELS
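• A model is said to be nonlinear if the derivative of the response with respect to at least one parameter depends on the parameters themselves:
$$\frac{\partial y}{\partial \beta_i} = g(\beta_j) \quad \text{for any } i$$
• Examples
$Y = \beta_0 \exp(-\beta_1 X_1) + \varepsilon$
$Y = \beta_0 X_1^{\beta_1} + \varepsilon$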
EVALUATING REGRESSION MODELS
• To determine if a model is adequate to describe the observed data, the analysis of variance may be performed
• Calculate the deviation between the data points and the values predicted by the model, called the error sum of squares (SSE)
$$SS_E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
SUM OF SQUARES
• Calculate the total variation in the data, called the total sum of squares (SST)
$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
• The amount of the total variation explained by the model is called the regression sum of squares (SSR)
$$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
• It may be shown that:
SST = SSR + SSE
SOURCES OF VARIABILITY
MEAN SQUARE
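The mean square is the sum of squares divided by the associated degrees of freedom (DOF):
MSR = SSR / p
MSE = SSE / (n − p − 1)
where p is the number of regressors in the model (not counting the intercept β0) and n is the number of observations. For simple linear regression, p = 1 and MSE = SSE / (n − 2).
Total DOF = DOF for Regression + DOF for Error
n − 1 = p + (n − p − 1)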
F TEST
In multiple linear regression, the F statistic can be used for hypothesis testing.
H0 in this test is that all the β’s except β0 are 0.
F TEST AND R²
• The mean squares are used to perform an F test since they estimate specific population variances
F = MSR / MSE
• The sums of squares are used to calculate the R² criterion
R² = SSR / SST = 1 − SSE / SST
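The sums of squares, R², and the F ratio can be computed by hand for a simple linear regression, as in the following Python sketch. The function name and the data are hypothetical; the F ratio would then be compared with a critical value (or converted to a p value) from an F table with 1 and n − 2 degrees of freedom.

```python
def anova_simple_regression(x, y):
    """Sums of squares, R^2 and F for the model Y = b0 + b1*X + e."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]

    ss_t = sum((yi - y_bar) ** 2 for yi in y)                # total
    ss_r = sum((yh - y_bar) ** 2 for yh in y_hat)            # regression
    ss_e = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error

    p = 1                      # one regressor besides the intercept
    ms_r = ss_r / p
    ms_e = ss_e / (n - p - 1)
    return {"SST": ss_t, "SSR": ss_r, "SSE": ss_e,
            "R2": ss_r / ss_t, "F": ms_r / ms_e}

# Hypothetical data for illustration only
temperature = [35, 35, 36, 36, 37, 37]
yield_pct = [70.1, 70.9, 72.0, 72.4, 73.5, 74.1]
result = anova_simple_regression(temperature, yield_pct)
print({name: round(value, 3) for name, value in result.items()})
```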
EXAMPLE 6
Examine the variance of the model created to describe the effect of pressure on the yield of API (in Example 5).
RESULTS FOR EXAMPLE 6
Since the p value for the F test is < 0.0001, which is significant at the 0.05 level, the overall model is significant.
EXAMPLE 7
Build a model using JMP data table T2E7 with the potential factors temperature, A/B feed ratio, and termination time, and the response variable yield. Determine which terms are significant. Build the model using the forward and backward selection techniques.
RESULTS FOR EXAMPLE 7
Temperature and termination time are significant at the 0.05 level.