
    HELSINKI UNIVERSITY OF TECHNOLOGY

Department of Engineering Physics and Mathematics, Systems Analysis Laboratory

    Mat-2.4108 Independent Research Projects in Applied Mathematics

Sample Size Requirement for Monte Carlo Simulations Using Latin Hypercube Sampling

    Anna Matala

    60968U

    20.5.2008


    Contents

1 Introduction
2 Monte Carlo Integration
3 Latin Hypercube Sampling (LHS)
3.1 Method
3.2 Variance analysis
3.3 Estimate of variance
4 Sample size
5 Examples
5.1 Example of sampling with lognormal distribution
5.2 Example of multinormal distribution
5.3 Example of binary variables
5.4 Discussion
6 Conclusions


    1 Introduction

The goal of probabilistic fire risk analysis is to determine the probabilities of the harmful consequences of fire. The varying initial and boundary conditions that lead to the growth of a fire are taken into account statistically. The consequences are represented by a limit state function, and the probability is calculated by integrating the limit state function over the multidimensional joint distribution. In practice, the integration is carried out with the Monte Carlo method, because the physical models of fire are non-linear and often very complex. In the Probabilistic Fire Simulator (PFS, developed by VTT) [1] the sampling can be done by Simple Random sampling (also known as Monte Carlo sampling) or by Latin Hypercube sampling.

The purpose of this work is to study the accuracy of Latin Hypercube sampling and to find a simple way to estimate the required sample size. Estimates and practical experience are gathered from the literature. Finally, the results are tested with simple simulations.

    2 Monte Carlo Integration

Random sampling has been used for solving numerical problems at least since 1777, when Comte de Buffon made his experiment of dropping a needle onto a board ruled with equally spaced parallel lines [2]. If the length of the needle is L and the distance between the lines is d (d > L), the probability that the needle intersects a line is

p = \frac{2L}{\pi d}.  (1)

Later, Laplace realized that this method could be used for estimating the value of \pi.

The modern Monte Carlo method (simple random sampling, SRS) was born in the 1940s, when Stanislaw Ulam, John von Neumann and others started to use random numbers to examine physics from a stochastic perspective. One of the most effective uses of the Monte Carlo method is to evaluate definite integrals that would be too difficult to find analytically [3]. If the problem is


y = \int_a^b f(x)\,dx,  (2)

the solution can be found by the hit-or-miss (acceptance-rejection) method: a box is drawn around the function from a to b and from 0 to y0, where y0 > f(x) on the whole interval. After this, N random points are drawn from a uniform distribution over the box and plotted (as in Figure 1). The number of points that fall below f(x) is counted and denoted by N0. The estimate of the integral is then

y_{est} = \frac{N_0}{N}\, y_0 (b - a).  (3)

The estimate becomes more and more precise as N is increased and converges to the correct answer as N \to \infty. In this example x had only one dimension for simplicity, but the method can be applied to multidimensional problems as well.

Figure 1: The hit-or-miss method for integrating the curve f(x).
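A minimal Python sketch of the hit-or-miss estimator of equation (3). The integrand sin(x) on [0, \pi] with bound y0 = 1 is an illustrative choice, not taken from the report; NumPy is assumed to be available.

    import numpy as np

    def hit_or_miss_integral(f, a, b, y0, n, rng=None):
        """Estimate the integral of f over [a, b] by the hit-or-miss method, eq. (3).

        y0 must bound f from above on [a, b]; n is the number of random points.
        """
        rng = np.random.default_rng() if rng is None else rng
        x = rng.uniform(a, b, n)        # horizontal coordinates of the random points
        y = rng.uniform(0.0, y0, n)     # vertical coordinates of the random points
        n0 = np.sum(y < f(x))           # N0: points falling below the curve
        return n0 / n * y0 * (b - a)    # eq. (3)

    # The integral of sin(x) over [0, pi] is exactly 2; the estimate approaches it as n grows.
    print(hit_or_miss_integral(np.sin, 0.0, np.pi, 1.0, 100_000))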


    3 Latin Hypercube Sampling (LHS)

    3.1 Method

Latin Hypercube Sampling [4]-[6] is a type of stratified Monte Carlo sampling in which all areas of the sample space are represented by input values. To generate a sample of size N from K variables x = [x1, x2, ..., xK]^T with probability density function f(x), the procedure is as follows. The range of each variable is partitioned into N non-overlapping intervals of equal probability 1/N (Figure 2). From each interval one value is selected randomly according to the probability density within that interval. The N values of x1 are paired in a random manner with the N values of x2 (Figure 3); these pairs are then paired similarly with the values of x3, and so on, until N samples of K variables are formed. The cube created in this way has N^K cells, which cover the sample space. Denote the input vectors by X = [x1, ..., xN].

Figure 2: Density function of variable x_i divided into N = 5 non-overlapping intervals.
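The stratification-and-pairing procedure of Section 3.1 can be sketched in Python as follows, assuming independent marginal distributions given as frozen scipy.stats objects; the names latin_hypercube and dists are illustrative, not taken from the report or from PFS.

    import numpy as np
    from scipy import stats

    def latin_hypercube(n, dists, rng=None):
        """Draw an n-by-K Latin Hypercube sample from independent marginals."""
        rng = np.random.default_rng() if rng is None else rng
        sample = np.empty((n, len(dists)))
        for j, dist in enumerate(dists):
            # one uniform draw inside each of the n equal-probability strata
            u = (np.arange(n) + rng.uniform(size=n)) / n
            rng.shuffle(u)                 # random pairing of strata across variables
            sample[:, j] = dist.ppf(u)     # inverse CDF maps strata to the target marginal
        return sample

    # Example: N = 5 samples of two standard normal variables (cf. Figures 2 and 3)
    print(latin_hypercube(5, [stats.norm(), stats.norm()]))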


    Figure 3: Pairing x1 with x2 when N=5.

    3.2 Variance analysis

Let h(x) be the transformation function of the input x [4],[7]. The estimate of the expectation E(h(x)) is

\bar h = \frac{1}{N} \sum_{j=1}^{N} h(x_j).  (4)

The variance of the estimate depends on the sampling method used. For Simple Random sampling the variance is

var(\bar h) = \frac{1}{N}\, var(h(x))  (5)

and for Latin Hypercube sampling

var(\bar h) = \frac{1}{N}\, var(h(x)) + \frac{N - 1}{N}\, cov(h(x_1), h(x_2)).  (6)

Thus LHS sampling decreases the variance if and only if cov(h(x_1), h(x_2)) is negative.


Consider the LHS samples {X_{jN}} (where j = 1,...,N and N = 1,2,...), and let g_k(x_k) denote the conditional expectation of h given the k-th input component. Stein's Theorem 1 shows that if E(h^2) < \infty and N \to \infty,

cov(h(X_{1N}), h(X_{2N})) = \frac{K}{N}\, E(h)^2 - \frac{1}{N} \sum_{k=1}^{K} \int g_k(x)^2\, dF_k(x) + o(N^{-1}).  (8)

By Jensen's inequality,

\int g_k(x)^2\, dF_k(x) \ge \left( \int g_k(x)\, dF_k(x) \right)^2 = (E h)^2,  (9)

so the highest-order term in the expansion of cov(h(x_1), h(x_2)) is non-positive. That means that

\lim_{N \to \infty} N\, cov(h(x_1), h(x_2)) \le 0.  (10)

This can be written more simply if we define

h_a(x) = \sum_{k=1}^{K} g_k(x_k) - (K - 1)\, E(h)  (11)

and

r(x) = h(x) - h_a(x).  (12)

The function h_a(x) is the best additive fit to h(x), which means that

\int r^2(x)\, dF(x) \le \int \left( h(x) - \sum_{k=1}^{K} h_k(x_k) \right)^2 dF(x)  (13)

for any univariate functions h_k. Now an approximation of the variance can be written, as N \to \infty,

var\left( \frac{1}{N} \sum_{j=1}^{N} h(X_{jN}) \right) = \frac{1}{N} \int r(x)^2\, dF(x) + o(N^{-1}).  (14)

This is relevant whenever N is much larger than K, that is, when the sample size is clearly greater than the number of variables.
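The variance reduction implied by equations (6) and (14) can be checked empirically. The sketch below estimates E(h(x)) many times with both SRS and a one-dimensional LHS and compares the variances of the two estimators; the monotone test function h(x) = e^x on a uniform input is an illustrative choice, not taken from the report.

    import numpy as np

    rng = np.random.default_rng(0)

    def h(x):                            # illustrative monotone transformation
        return np.exp(x)

    def lhs_uniform(n):                  # 1-D LHS on [0, 1]: one draw per stratum, shuffled
        u = (np.arange(n) + rng.uniform(size=n)) / n
        rng.shuffle(u)
        return u

    N, reps = 100, 2000
    srs_means = [h(rng.uniform(size=N)).mean() for _ in range(reps)]
    lhs_means = [h(lhs_uniform(N)).mean() for _ in range(reps)]

    # Empirically var(LHS mean) falls well below var(SRS mean) = var(h(x))/N, eqs. (5)-(6)
    print("SRS:", np.var(srs_means), " LHS:", np.var(lhs_means))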


    3.3 Estimate of variance

A simple way to get an estimate of the variance of the average is replicated Latin hypercube sampling [7]. The idea is to produce several independent LHS samples and then compare the variances between them. If the number of replications is \alpha and the size of each replicate is M, then N = \alpha M. If

\bar h_i = \frac{1}{M} \sum_{j=1}^{M} h(X_{ij}),  (15)

then an unbiased estimator for the variance is

var\left( \frac{1}{\alpha} \sum_{i=1}^{\alpha} \bar h_i \right) = \frac{1}{\alpha(\alpha - 1)} \left[ \sum_{i=1}^{\alpha} \bar h_i^2 - \frac{\left( \sum_{i=1}^{\alpha} \bar h_i \right)^2}{\alpha} \right].  (16)

This is exactly the same as the sample variance of the mean in ordinary sampling. If \alpha is too small, the estimate of the variance is not precise. On the other hand, increasing \alpha while N is fixed increases this estimator of variance. However, as long as the ratio M/K is large, the increase in variance is small.
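A minimal sketch of the replicated-LHS estimator of equations (15) and (16), reusing the one-dimensional LHS generator and test function of the previous sketch; the values \alpha = 5 and M = 200 follow Stein's example discussed in Section 4.

    import numpy as np

    rng = np.random.default_rng(1)

    def h(x):
        return np.exp(x)                          # illustrative transformation

    def lhs_uniform(m):                           # one LHS replicate of size m on [0, 1]
        u = (np.arange(m) + rng.uniform(size=m)) / m
        rng.shuffle(u)
        return u

    alpha, M = 5, 200                             # alpha replicates of size M, N = alpha * M
    h_bar = np.array([h(lhs_uniform(M)).mean() for _ in range(alpha)])   # eq. (15)

    # unbiased estimate of the variance of the overall mean, eq. (16)
    var_est = (np.sum(h_bar**2) - np.sum(h_bar)**2 / alpha) / (alpha * (alpha - 1))
    print(h_bar.mean(), var_est)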

    4 Sample size

In the literature, almost no recommendations or estimates for the sample size of LHS sampling have been proposed. However, it is possible to study the published examples and apply them to Monte Carlo fire simulation. In fire simulations, the most limiting factor is time. Often the samples are simulated with the Fire Dynamics Simulator (FDS), which makes the Monte Carlo simulation quite slow. As an indication of the order of magnitude, 1000 samples can already require several months of computation time. Typically the number of variables is less than 10. All the following examples concern LHS sampling unless otherwise mentioned.

Stein [7] gives one example where the ratio M/K = 10. In his example N = 1000, M = 200, K = 20 and \alpha = 5. For fire simulation, this could mean 5 independent samples of size 100, giving a total sample size of 500. In his other example there are 6 variables and the estimator of variance is calculated from 100 independent replications of samples of size 100; now M/K would be nearly 17. This is an unnecessarily high value and clearly impossible for fire simulation. However, according to the first


example, it is possible to get reliable results using \alpha = 5 and M/K > 10.

Jianfang Jia et al. [8] give an example of an NF-\kappa B signalling pathway model. They performed a global sensitivity analysis using 64 independent variables and 1000 samples, so the ratio of samples to variables is 15.6. Applied to 10 variables, this example would suggest a sample size of 156.

As mentioned before, Latin hypercube sampling converges at least as fast as simple random sampling if N is large enough. Swidzinski et al. [9] go further, stating that LHS can be nearly five times more effective in yield estimation than traditional sampling methods. They justify this with their study at Texas A&M. In the tests, they generated 10 iterations of samples of size 200 using both LHS and simple random sampling. After that, a sample of size 1000 was generated 10 times using simple random sampling. The results showed that the variation in the yield estimate over the 10 iterations was similar when the LHS sample size was 200 and the simple random sample size was 1000. The theory of simple random sampling is better known, so this gives a suggestive size for LHS samples.

Eduard Hofer [10] studied computationally intensive models, where the sample sizes were at most 100 and the samples were generated by Simple Random sampling. As the distributions of the parameters are usually not known, the 5% and 95% fractiles cannot be obtained directly, and hence Monte Carlo simulations are used. The resulting distributions are often long-tailed, so the fractile estimates depend strongly on the sample size N. Therefore, the possible impact of the sampling error needs to be quantified. That can be done by calculating statistical tolerance limits (u %, v %), where u is the percentage of the combined influence of the quantified uncertainties and v is the associated confidence level. In other words, with confidence level v % we can say that u % of the samples lie inside the tolerance limits. The tolerance interval is chosen to run from the smallest to the largest observation in the sample.

Hofer gives an example: two-sided statistical tolerance limits (95%, 95%) for a Simple Random sample require only 93 model runs. This is a result of non-parametric statistics and depends only on the percentages u and v, not on the number of uncertainties taken into account. Moreover, it does not assume any particular type of underlying distribution. This result is general and very promising from the point of view of fire simulation, especially because LHS


Table 1: Sample sizes for two-sided nonparametric tolerance limits. 1 - \alpha is the confidence coefficient and q is the fraction of the population that lies between the smallest and the largest observation of the sample.

1-\alpha \ q   0.5   0.7   0.75   0.8   0.85   0.9   0.95   0.975   0.98   0.99
0.5             3     6     7      9     11     17    34     67      84     168
0.7             5     8     10     12    16     24    49     97      122    244
0.75            5     9     10     13    18     27    53     107     134    269
0.8             5     9     11     14    19     29    59     119     149    299
0.85            6     10    13     16    22     33    67     134     168    337
0.9             7     12    15     18    25     38    77     155     194    388
0.95            8     14    18     22    30     46    93     188     236    473
0.975           9     17    20     26    35     54    110    221     277    555
0.98            9     17    21     27    37     56    115    231     290    581
0.99            11    20    24     31    42     64    130    263     330    662
0.995           12    22    27     34    47     72    146    294     369    740
0.999           14    27    33     42    58     89    181    366     458    920

sampling can be expected to show less variability than simple random sampling. The solution for this example can be found in Table 1, originally extracted from Conover's book ([11], Appendix A6). The table shows the sample sizes required for various confidence coefficients and population fractions. If Table 1 does not offer a direct answer, an approximation can be made:

N \approx \frac{1}{4}\, \chi^2_{1-\alpha}\, \frac{1 + q}{1 - q} + \frac{1}{2}(r + m - 1),  (17)

where r is the order number of the observation used as the lower tolerance bound and N + 1 - m that of the upper bound; the sample must be sorted into ascending order. The choice r = m = 1 is commonly used, so the lower bound is the smallest observation and the upper bound is the largest. \chi^2_{1-\alpha} is the 1 - \alpha quantile of a chi-square random variable with 2(r + m) degrees of freedom. Repeating the earlier example gives the same result:

N \approx \frac{1}{4} \cdot 9.488 \cdot \frac{1 + 0.95}{1 - 0.95} + \frac{1}{2}(1 + 1 - 1) = 93.008.
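Equation (17) is easy to evaluate with a chi-square quantile function. The Python sketch below reproduces the worked example, assuming SciPy is available; the function name is illustrative.

    from scipy import stats

    def tolerance_sample_size(q, conf, r=1, m=1):
        """Approximate N for two-sided nonparametric tolerance limits, eq. (17)."""
        chi2 = stats.chi2.ppf(conf, df=2 * (r + m))     # the (1 - alpha) quantile
        return 0.25 * chi2 * (1 + q) / (1 - q) + 0.5 * (r + m - 1)

    # Hofer's example: (95 %, 95 %) limits from the sample extremes
    print(tolerance_sample_size(q=0.95, conf=0.95))     # about 93.0, as in the text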

However, statistical tolerance limits cannot be computed from an LHS sample. The reason is that the sample estimate of the 95% fractile can be significantly


below the true 95% fractile for long-tailed distributions. Without statistical tolerance limits it is not possible to know how likely this shortcoming is, and therefore the analysis has to be made using simple random sampling. Of course, once the sample size N has been found this way, it can be considered an upper limit for the LHS sample size actually needed.


    5 Examples

    5.1 Example of sampling with lognormal distribution

The following example concerns samples from a lognormal distribution. The lognormal distribution is chosen because it is important in fire simulations.

A lognormal variable can be written as y = e^x, where x is normally distributed with mean \mu and variance \sigma^2. The density function of the lognormal distribution [12] is

f(y) = \frac{1}{y \sigma \sqrt{2\pi}}\, e^{-\frac{(\ln(y) - \mu)^2}{2\sigma^2}}  (18)

with mean E(y) = e^{\mu + \sigma^2/2} and variance Var(y) = e^{2\mu + \sigma^2}(e^{\sigma^2} - 1). The mean and variance of ln(y) (\mu and \sigma^2, respectively) can be calculated backwards from the mean and variance of y using the following equations:

\mu = \ln(E(y)) - \frac{1}{2} \ln\left( \frac{Var(y)}{E(y)^2} + 1 \right)  (19)

and

\sigma^2 = \ln\left( \frac{Var(y)}{E(y)^2} + 1 \right).  (20)

Monte Carlo simulations were made with sample sizes N = 50, 100, 500, 1000, 5000 and 10000, using both Simple Random sampling (SRS) and Latin Hypercube sampling (LHS). The idea was to compare the means and variances of the samples. The lognormal distribution used had mean 27.4 and variance 16, which give (by equations (19) and (20)) \mu = 3.3000 and \sigma^2 = 0.0211. The sample means and variances are calculated using the equations [13]

\bar x = \frac{1}{N} \sum_{i=1}^{N} x_i  (21)

and

s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar x)^2.  (22)

This is valid for both SRS and LHS sampling [4].
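The lognormal experiment can be reproduced along the following lines, assuming SciPy's lognorm parameterization (s = \sigma, scale = e^\mu); the exact numbers will differ from Table 2 because the random draws differ.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    m, v = 27.4, 16.0                         # lognormal mean and variance from the text
    sigma2 = np.log(v / m**2 + 1)             # eq. (20) -> 0.0211
    mu = np.log(m) - 0.5 * sigma2             # eq. (19) -> 3.3000
    dist = stats.lognorm(s=np.sqrt(sigma2), scale=np.exp(mu))

    def lhs_sample(dist, n):                  # 1-D LHS: one draw per probability stratum
        u = (np.arange(n) + rng.uniform(size=n)) / n
        rng.shuffle(u)
        return dist.ppf(u)

    for n in (50, 100, 500, 1000, 5000, 10000):
        srs = dist.rvs(size=n, random_state=rng)
        lhs = lhs_sample(dist, n)
        # sample mean and variance, equations (21) and (22)
        print(n, srs.mean(), srs.var(ddof=1), lhs.mean(), lhs.var(ddof=1))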


Figure 4: Sample means as a function of sample size N.

Figures 4 and 5 show the sample means and variances for SRS and LHS sampling. In this simple case of only one variable, LHS sampling is better for every N. The differences between the correct and estimated mean and variance values are listed in Table 2.

Table 2: Difference of the estimated mean and variance from the true values.

N       \mu (SRS)      \sigma^2 (SRS)   \mu (LHS)      \sigma^2 (LHS)
50      0.169231071    3.549920858      0.039925728    0.068954216
100     0.333796062    2.874371726      0.007375298    0.155918985
500     0.235102161    0.102067908      0.000254559    0.032888354
1000    0.071753115    0.424214694      0.000419834    0.003795222
5000    0.012967127    0.153119925      0.000180192    0.004282639
10000   0.023052468    0.054337911      0.0000689647   0.00197263


Figure 5: Sample variances as a function of sample size N.


    5.2 Example of multinormal distribution

Consider next a situation where the variable x depends on five independent, normally distributed variables (with parameters \mu_i, \sigma_i). The variable x is defined as

x = \sum_{i=1}^{5} x_i,  (23)

where x_i ~ N(\mu_i, \sigma_i). The parameters used in this example are listed in Table 3, and the simulations were made with N = 50, 100, 500, 1000 and 5000. A sum of normally distributed variables is also normally distributed, with parameters \mu = \sum \mu_i and \sigma = \sqrt{\sum \sigma_i^2}. In this example, the mean is 16.5 and the standard deviation \sigma = 5.7879 (variance 33.4998). Figures 6 and 7 show the sample mean and variance for different sample sizes. It seems that LHS is closer to the true parameter values than SRS for every N.

Table 3: Parameters of the variables.

i    \mu_i    \sigma_i
1    1        0.5
2    2.5      2
3    -4       5
4    10       2
5    7        0.5

Let us now study the distributions of the samples using Pearson's chi-square test, which compares the sample distribution F(x) to some known distribution G(x). The null hypothesis is H0: F(x) = G(x). To calculate the p-value, the sample has to be divided into m categories. The test statistic is

\chi^2 = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i},  (24)

where O_i is the observed frequency in category i and E_i = N P_i is the expected frequency of category i (calculated from the known distribution). The number of degrees of freedom is df = m - l - 1, where l is the number of estimated parameters of the known distribution. The prerequisites are N > 50, E_i > 2 for all i, and at most 20 % of the categories may have E_i < 5.


Figure 6: Sample mean as a function of sample size N.

The null hypothesis is now H0: F(x) = N(16.5, 5.7879). The distribution is divided into 8 classes, listed in Table 4. The probabilities come from the standard normal distribution table, where z = (x - \mu)/\sigma. It can be seen that both prerequisites are fulfilled, because when N \ge 50 the expected frequency is > 5 in every category.

The \chi^2-values for the different sample sizes are presented in Table 5. Now df = 8 - 2 - 1 = 5. From the \chi^2-table, the maximum value accepted at the 5 % risk level is 11.07. In this example, all the \chi^2-values are in the acceptance region, so all the samples are normally distributed with parameters \mu = 16.5 and \sigma = 5.7879. In this case, with five independent variables, a sample size of 50 was enough for both SRS and LHS sampling.
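For the multinormal example, the \chi^2 statistic of equation (24) can be computed as below; SRS is used for brevity (an LHS version would replace the normal draws with the stratified sampling of Section 3.1), and N = 1000 is an illustrative choice.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)

    mus = np.array([1.0, 2.5, -4.0, 10.0, 7.0])          # Table 3
    sigmas = np.array([0.5, 2.0, 5.0, 2.0, 0.5])
    mu, sigma = mus.sum(), np.sqrt(np.sum(sigmas**2))    # 16.5 and 5.7879

    edges = np.array([-np.inf, 9.09, 12, 14, 16, 18, 20, 23, np.inf])   # Table 4
    probs = np.diff(stats.norm.cdf(edges, loc=mu, scale=sigma))         # category probabilities

    N = 1000
    x = rng.normal(mus, sigmas, size=(N, 5)).sum(axis=1)   # SRS sample of x, eq. (23)

    cats = np.searchsorted(edges[1:-1], x, side="right")   # category index 0..7 per sample
    observed = np.bincount(cats, minlength=8)
    expected = N * probs
    chi2 = np.sum((observed - expected) ** 2 / expected)   # eq. (24)
    print(chi2)   # compare with the 5 % critical value 11.07 (df = 8 - 2 - 1 = 5)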


Figure 7: Sample variance as a function of sample size N.

Category   x               P
I          [-\infty, 9.09)  0.1003
II         [9.09, 12)       0.1174
III        [12, 14)         0.1159
IV         [14, 16)         0.1305
V          [16, 18)         0.1385
VI         [18, 20)         0.1231
VII        [20, 23)         0.1429
VIII       [23, \infty)     0.1314

Table 4: Categories of the normal distribution.


Table 5: \chi^2 values.

N       \chi^2 (SRS)   \chi^2 (LHS)
50      10.37789289    9.148392622
100     10.21332038    2.156814332
500     3.388478928    4.64044467
1000    6.533513491    5.451614406
5000    3.706511017    6.611654417
10000   2.131056413    0.857279318


    5.3 Example of binary variables

Consider a nuclear power plant where, in case of fire, the failure time t1 of some important component is lognormally distributed with parameters \mu_1 and \sigma_1. The failure can be prevented by the fire service if they arrive before the failure happens. The arrival time of the fire service, t2, is lognormally distributed with parameters \mu_2 and \sigma_2. The interesting quantity is the probability that the failure can be prevented, i.e. that t2 < t1. This is the probability p(A) of a Bernoulli distribution, where X = 1 if A happens and X = 0 if not. In the lognormal distribution negative values are not allowed, so the probability P(t2 < t1)


Figure 8: Lognormal distributions f1(\mu_1, \sigma_1) and f2(\mu_2, \sigma_2).

The idea is now to compare the confidence intervals of the Monte Carlo (SRS) samples to similarly calculated intervals for the Latin Hypercube samples. These intervals are not valid confidence intervals for an LHS sample, but comparing them can still give useful information. Choose a 95 % confidence level, so z_{0.025} = 1.96. The confidence intervals were calculated for each simulation, and the number of times the real value of p fell inside the confidence interval was counted over 14 repetitions. The results are listed in Table 6. It would be interesting to test the null hypothesis H0: P(the real value of p is inside the 95 % confidence interval in LHS sampling) = 0.95, but that is not possible for two reasons. First, to make that test, m should be large. The rule is that

mp \ge 10 and m(1 - p) \ge 10.

The estimates are near 0.95, so the minimum m should be 200, which is too much considering that in PFS (Probabilistic Fire Simulator) the maximum number of variables is 15 and two variables are needed for each p. This would lead to nearly 30 simulations for each N. However, with a smaller number of repetitions it is possible to get an idea of the result of the test. Another problem is that the test statistic is


N       n (SRS)   n (LHS)
25      13        14
50      12        14
100     12        14
500     14        14
1000    12        14
5000    13        14
10000   13        14

Table 6: n is the number of times that p falls inside the confidence interval in 14 repetitions.

z = \frac{\hat p - p}{\sqrt{\hat p (1 - \hat p)/m}},  (28)

which goes to infinity if \hat p = 1. In that case the null hypothesis would always be rejected.

After 14 repetitions the estimated probability is \hat p = 1 for every N in LHS sampling. The test statistic is then infinite and the null hypothesis is rejected. For LHS sampling, the 95 % confidence intervals are significantly smaller than for SRS sampling, so for every N the LHS sampling gives at least as good results as SRS.
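A sketch of how the binary-variable experiment could be set up with SRS; the lognormal parameters below are placeholders, since the report does not state the values used, and the interval is the normal-approximation interval with z_{0.025} = 1.96 referred to in the text.

    import numpy as np

    rng = np.random.default_rng(4)

    # Placeholder parameters for the failure time t1 and the fire-service arrival time t2
    mu1, s1 = 3.0, 0.3
    mu2, s2 = 2.5, 0.4

    N, z = 1000, 1.96
    t1 = rng.lognormal(mu1, s1, N)
    t2 = rng.lognormal(mu2, s2, N)

    x = (t2 < t1).astype(float)              # Bernoulli variable: 1 if the failure is prevented
    p_hat = x.mean()                         # estimate of p = P(t2 < t1)
    half = z * np.sqrt(p_hat * (1 - p_hat) / N)
    print(p_hat, (p_hat - half, p_hat + half))   # estimate and 95 % comparison interval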


    Figure 9: Averages of 14 independent repetitions.


    5.4 Discussion

Determining the sample size is not a simple task. Many guides focus on Simple Random sampling, but only one common rule for Latin Hypercube samples was found in the literature: it gives at least as good results as Simple Random sampling when the sample size is large enough. The required sample size may depend on the distribution, the number of variables, or other factors. The only way to see whether the sample has converged is to make more simulations and check whether the result stays similar. Even this does not work well for Latin hypercube sampling, where the sample size should be known beforehand.

In all three examples the Latin hypercube samples were closer to the real value for every N. For five independent, normally distributed variables, a sample size of 50 was enough to conclude that the sample distribution did not differ significantly from the original distribution. In the last example it was seen that the samples generated with LHS had smaller confidence intervals, so the results of LHS sampling are more accurate.

    6 Conclusions

The aim of this work was to study the accuracy of Latin Hypercube sampling and to find some guidance on how to choose the sample size N. Very little guidance was found in the literature, and in the tests it became clear that general rules are difficult to find. However, both the literature and the examples suggest that Latin Hypercube samples converge with a smaller sample size than simple random samples. The sample sizes needed in simple random sampling can therefore be used as upper limits for the Latin Hypercube sample size.

    References

[1] Hostikka, S., Keski-Rahkonen, O. Probabilistic simulation of fire scenarios. Nuclear Engineering and Design 224 (2003) 301-311.

[2] Veach, E. Robust Monte Carlo Methods for Light Transport Simulation (1997).


[3] Landau, D., Binder, K. A Guide to Monte Carlo Simulations in Statistical Physics (2000).

[4] McKay, M., Beckman, R., Conover, W. A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics, vol. 21, no. 2, May 1979.

[5] Keramat, M., Kielbasa, R. Modified Latin Hypercube Sampling Monte Carlo (MLHSMC) Estimation for Average Quality Index. Analog Integrated Circuits and Signal Processing, vol. 19, no. 1, April 1999.

[6] Wyss, G., Jorgensen, K. A User's Guide to LHS: Sandia's Latin Hypercube Sampling Software (1998).

[7] Stein, M. Large Sample Properties of Simulations Using Latin Hypercube Sampling. Technometrics, vol. 29, no. 2, May 1987.

[8] Jia, J., Yue, H., Liu, T., Wang, H. Global Sensitivity Analysis of Cell Signalling Transduction Networks Based on Latin Hypercube Sampling Method. Bioinformatics and Biomedical Engineering, 2007.

[9] Swidzinski, J., Chang, K. A novel nonlinear statistical modeling technique for microwave devices. Microwave Symposium Digest, 2000 IEEE MTT-S International, June 2000, vol. 2, p. 887.

[10] Hofer, E. Sensitivity analysis in the context of uncertainty analysis for computationally intensive models. Computer Physics Communications 117 (1999) 21-34.

[11] Conover, W. Practical Nonparametric Statistics, 2nd ed. Wiley, New York, 1980.

[12] Råde, L., Westergren, B. Mathematics Handbook for Science and Engineering, 4th edition, 1998.

[13] Laininen, P. Sovelletun todennäköisyyslaskun kaavoja ja taulukoita [Formulas and tables of applied probability].
