Resampling Statistics · statistical design. Resampling via bootstrapping is a powerful tool in...


Resampling Statistics

- Introduction to Resampling
- Probability Modeling
- Resample add-in
- Bootstrapping values, vectors, matrices
- R boot package
- Conclusions

Conventional Statistics

Assumptions of “conventional” statistics:
- Variables are randomly sampled
- Follow a normal distribution (Gaussian)

Thus, the basis of “conventional” inference is that samples are drawn at random from a larger population and the observations in the sample are then presumed to reflect the population (e.g., mean & variance).

Resampling Statistics

In resampling statistics, statistical estimates are formed by taking random samples directly from the data at hand.

In other words, you randomly sample your random sample!


Resampling Statistics - Key Features -

1. For small data sets, resampling procedures probably provide more accurate statistical answers than conventional statistics.

2. For large data sets, resampling answers and conventional answers usually agree.

3. Resampling can handle virtually any statistic, not just those for which a distribution is known.

4. Resampling typically generates accurate 95% confidence intervals (CIs).

Resampling Statistics - Terminology -

Resampling is a “generic term” that refers to a whole array of computer-intensive methods for testing hypotheses based on Monte Carlo and resampling simulations.

Bootstrapping and jackknifing represent the two most common forms applied to “conventional statistical designs.”

This lecture will focus primarily on bootstrapping procedures.

Resampling Statistics - References -

These procedures have been around for a long time but have really only begun to be applied recently because of enhanced computer technology.

Selected References:

Efron, B. 1982. The jackknife, the bootstrap, and other resampling plans. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA.

Simon, J.L. 1997. Resampling: The new statistics, 2nd ed. (online) http://www.resample.com/content/text/index.shtml

Good, P.I. 2005. Introduction to statistics through resampling methods and R/S-Plus. Wiley Interscience, New York, NY.


Probability Modeling

Direct modeling of probabilities is the primary point of resampling statistics.

Consider a simple coin flip example.

A coin flip has two possible outcomes: heads (1) or tails (0).

If you flip 100 times, the expectation is 50:50, or half 1s and half 0s.

Probability Modeling

Consider a less trivial & more biological case of probabilities:

In clutch sizes of 8, how often would you expect to see 3 males and 5 females (i.e., 3:5 ratio)?

This can be modeled using a coin flip algorithm. Assume the probability of male vs. female is equal and independent of previous clutches.

One can flip 8 coins, count the heads (males), and repeat this procedure many times.
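The coin-flip procedure just described can be sketched in a few lines of R. This is a minimal illustration, not the lecture's own script: the number of repetitions (`n_clutches`), the seed, and the variable names are assumptions chosen for the example.

```r
# Simulate clutches of 8 offspring; each is male (1) or female (0) with p = 0.5.
set.seed(1)            # arbitrary seed, only so the sketch is reproducible
n_clutches <- 10000    # the "many times" part (assumed value)

# For each clutch: flip 8 coins and count the heads (males).
males <- replicate(n_clutches,
                   sum(sample(c(0, 1), size = 8, replace = TRUE)))

# Proportion of simulated clutches with exactly 3 males (a 3:5 ratio)
p_three <- mean(males == 3)
p_three

# Full simulated distribution of male counts per clutch
table(males) / n_clutches
```

Increasing `n_clutches` makes the simulated proportion more stable from run to run.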


Probability Modeling

The only possible logistical difficulty in this is the “many times” part.

Resampling statistical software is available in a variety of forms. A simple Excel add-in is available for $99 (academic pricing) or calculations can be done various ways in R.

Let's first look at a simple example using the Excel add-in and our clutch size data to get the general idea. We can mathematically flip a coin 8 times, determine how many males there are, and do this many, many times:

Select Resample, input range A1:A2, place data in D1 in a group of 8

Resampling Software


The result is 8 values of 0 or 1 placed in column D.

Cell D9 contains the column sum (5 males for this one case of 8 flips).

We need to do this 999 more times!


Resampling Software

Click OK, then double-click on this cell (it will turn red when selected), then double-click on any empty cell; one score is recorded.

Resampling Software

Next, click on RS (Repeat and Score), enter 1000 trials, click OK, go to output tab…

Data are sorted high to low

The sums (males) of 1000 groups of 8 flips are placed in column A on the output sheet


Now, using the stats add-in from Excel, construct a histogram of the 1000 resamples.

Exactly 3 males occur in 210 of 1000 clutches (a proportion of 0.210), or ca. 1 in 5 clutches.
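Because this particular example is a simple binomial process, the resampled estimate can be checked against the exact probability. The short R sketch below is an added illustration (the variable names are assumptions); it relies only on the equal-probability and independence assumptions stated earlier.

```r
# Exact probability of 3 males in a clutch of 8 when P(male) = 0.5:
# choose(8, 3) / 2^8 = 56 / 256
p_exact <- dbinom(3, size = 8, prob = 0.5)
p_exact

# Expected number of such clutches out of 1000
expected_count <- 1000 * p_exact
expected_count
```

The exact value, 0.21875, is close to the 0.210 obtained from 1000 resamples, which is the kind of agreement one hopes to see.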

Resampling Software

Boot Package
v. 1.2-43
25-SEP-11

http://cran.r-project.org/web/packages/boot/boot.pdf

The boot package is designed to provide extensive facilities for all forms of bootstrapping and resampling.

One can bootstrap a simple statistic (e.g., median), a vector (e.g., regression weights), or an entire matrix.

The main bootstrapping function is boot() and has the following format:

bootobject <- boot(data= , statistic= , R= , ...)

where,

data = a vector, matrix, or dataframe

statistic = a function that produces the k statistics to be bootstrapped (k=1 if bootstrapping a single statistic). The function should include an “indices” parameter that the boot() function can use to select cases for each replication.

R = the number of bootstrap replicates

… = additional parameters, passed on to the statistic function
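As a minimal sketch of this calling format, the example below bootstraps a single statistic (the median) of a numeric vector. The data are made up for illustration, and the names `med_fun` and `x` are hypothetical; only `boot()` and its `data`, `statistic`, and `R` arguments come from the package itself.

```r
library(boot)

# Statistic function: boot() supplies the original data plus a vector of
# resampled indices; the function selects those cases and returns the statistic.
med_fun <- function(data, indices) {
  median(data[indices])
}

set.seed(42)                          # arbitrary seed for reproducibility
x <- rnorm(30, mean = 10, sd = 2)     # made-up illustrative data

bootobject <- boot(data = x, statistic = med_fun, R = 1000)
bootobject                            # prints original value, bias, std. error
```

The second argument of the statistic function is what lets `boot()` choose a different resample on each of the `R` calls.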


boot() calls the statistic function R times.

Each time, it generates a set of random indices, with replacement. (Just like the resample Excel add-in.)

These indices are used within the statistic function to select a sample.

The statistics are calculated on the sample and the results accumulated in bootobject.
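To make these steps concrete, the loop below imitates them in base R without the boot package. It is a simplified sketch of the idea, not the package's actual implementation; the data values and the names `stat`, `t0`, and `t_star` are assumptions for illustration.

```r
set.seed(7)                                      # arbitrary seed
x <- c(4.1, 5.3, 2.8, 6.0, 3.9, 5.5, 4.7, 3.2)   # made-up data
R <- 1000                                        # number of bootstrap replicates

# Statistic function with an indices argument, as boot() expects
stat <- function(data, indices) mean(data[indices])

t0 <- stat(x, seq_along(x))       # statistic applied to the original data
t_star <- numeric(R)              # storage for the R bootstrap replicates

for (r in seq_len(R)) {
  idx <- sample(seq_along(x), replace = TRUE)   # random indices, with replacement
  t_star[r] <- stat(x, idx)                     # statistic on the resample
}

sd(t_star)           # bootstrap estimate of the standard error
mean(t_star) - t0    # bootstrap estimate of bias
```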

The bootobject structure includes:

t0 = The observed values of k statistics applied to the original data

t = An R x k matrix where each row is a bootstrap replicate of the k statistics.

You can access these as bootobject$t0 and bootobject$t

Once the bootstrap samples have been generated, use print(bootobject) and plot(bootobject) to examine the results.

boot.ci() can be used to obtain confidence intervals for the statistic(s).
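The sketch below runs through that workflow on a small case: bootstrapping the correlation between the two columns of R's built-in `faithful` data set, then requesting a percentile interval. The choice of data set, the function name `cor_fun`, and the use of `type = "perc"` are assumptions made for this illustration.

```r
library(boot)

# Statistic: correlation between the two columns, computed on resampled rows
cor_fun <- function(data, indices) {
  d <- data[indices, ]                 # select the resampled cases (rows)
  cor(d$eruptions, d$waiting)
}

set.seed(123)                          # arbitrary seed
bootobject <- boot(data = faithful, statistic = cor_fun, R = 1000)

print(bootobject)                      # original estimate, bias, std. error
# plot(bootobject)                     # histogram and Q-Q plot of the replicates

ci <- boot.ci(bootobject, type = "perc")
ci$percent[4:5]                        # lower and upper 95% percentile limits
```

Other interval types (for example `"bca"`) are requested through the same `type` argument.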

Let's load the library boot and use one of its datasets:

...


We can try a standard linear model of mpg as a function of weight and displacement:

> summary(reg)

Call:
lm(formula = mpg ~ wt + disp)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4087 -2.3243 -0.7683  1.7721  6.3484

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.96055    2.16454  16.151 4.91e-16 ***
wt          -3.35082    1.16413  -2.878  0.00743 **
disp        -0.01773    0.00919  -1.929  0.06362 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.917 on 29 degrees of freedom
Multiple R-squared:  0.7809, Adjusted R-squared:  0.7658
F-statistic: 51.69 on 2 and 29 DF,  p-value: 2.744e-10
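The command that created `reg` is not shown in the transcript. A fit consistent with the output above could look like the following sketch. The `Call:` line in the output has no `data` argument, which suggests the original script attached `mtcars` first; passing `data = mtcars` explicitly, as done here, is an assumption that should give the same estimates.

```r
# Linear model of fuel economy on weight and displacement (mtcars ships with R)
reg <- lm(mpg ~ wt + disp, data = mtcars)
summary(reg)

# The pieces used later in the bootstrap examples
coef(reg)                  # intercept and slopes
summary(reg)$r.squared     # multiple R-squared
```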


> results

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~ wt + disp)

Bootstrap Statistics :
     original      bias    std. error
t1* 0.7809306 0.009334923  0.04890951

> quartz(height=4,width=7)
> plot(results)

> boot.ci(results, type="bca")

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = "bca")

Intervals :
Level       BCa
95%   ( 0.6314,  0.8525 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
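The definition of `rsq` does not appear in the transcript. Judging from the `boot()` call shown above, it takes a formula, the data, and an indices vector and returns the model's R-squared. The version below is a reconstruction under that assumption, not necessarily the original function; because the seed is arbitrary, the bias and standard error will differ slightly from the printed output.

```r
library(boot)

# Assumed definition: R-squared of the model refitted on the resampled rows
rsq <- function(formula, data, indices) {
  d <- data[indices, ]               # allows boot to select the sample
  fit <- lm(formula, data = d)
  summary(fit)$r.squared
}

set.seed(1234)                       # arbitrary seed
results <- boot(data = mtcars, statistic = rsq,
                R = 1000, formula = mpg ~ wt + disp)
results
```

The `formula` argument is not used by `boot()` itself; it is passed through to `rsq` on every replicate.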


We can extend the single-value bootstrap to an entire vector. Continuing with the same example, this time we bootstrap the model's regression coefficients:

> bsmodel <- function(formula, data, indices) {
+   d <- data[indices,]   # allows boot to select sample
+   fit <- lm(formula, data=d)
+   return(coef(fit))
+ }

> results <- boot(data=mtcars,
+                 statistic=bsmodel,
+                 R=1000, formula=mpg~wt+disp)

> results

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = mtcars, statistic = bsmodel, R = 1000, formula = mpg ~ wt + disp)

Bootstrap Statistics :
       original         bias    std. error
t1* 34.96055404  9.262732e-02 2.493484690
t2* -3.35082533 -5.329619e-02 1.180377872
t3* -0.01772474  3.939446e-05 0.008735869

> results$t
          [,1]        [,2]          [,3]
 [1,] 31.65568 -2.06400409 -2.212067e-02
 [2,] 34.12020 -2.88466428 -1.819257e-02
 [3,] 38.02991 -4.35540788 -1.735722e-02
 [4,] 33.95197 -3.77649064 -9.752654e-03
 [5,] 34.43601 -3.16552898 -1.873982e-02
 [6,] 34.47165 -2.89633129 -2.302154e-02
 [7,] 35.48928 -3.69683419 -1.510129e-02
 [8,] 35.47456 -3.11758947 -2.271243e-02
 [9,] 33.57981 -2.30608721 -2.730837e-02
[10,] 36.10200 -4.51600675 -4.876640e-03
[11,] 31.67622 -2.60958056 -1.730342e-02
. . .

> results$t0
(Intercept)          wt        disp
34.96055404 -3.35082533 -0.01772474


> boot.ci(results, type="bca", index=1) # intercept

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = "bca", index = 1)

Intervals :
Level       BCa
95%   (29.83, 39.96 )
Calculations and Intervals on Original Scale

> boot.ci(results, type="bca", index=2) # wt
> boot.ci(results, type="bca", index=3) # disp
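Rather than calling `boot.ci()` once per coefficient, the three intervals can be collected into a single table. The sketch below is one possible way to do this; the seed and the name `ci_table` are assumptions, and the limits will vary somewhat from run to run and from the values shown above.

```r
library(boot)

# Statistic function as defined in the lecture: returns the fitted coefficients
bsmodel <- function(formula, data, indices) {
  d <- data[indices, ]               # allows boot to select the sample
  fit <- lm(formula, data = d)
  coef(fit)
}

set.seed(2011)                       # arbitrary seed
results <- boot(data = mtcars, statistic = bsmodel,
                R = 1000, formula = mpg ~ wt + disp)

# One BCa interval per coefficient; elements 4 and 5 of $bca are the limits
ci_table <- t(sapply(seq_along(results$t0), function(i) {
  boot.ci(results, type = "bca", index = i)$bca[4:5]
}))
dimnames(ci_table) <- list(names(results$t0), c("lower", "upper"))
ci_table
```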

CarBoot.R Script File

Resampling - Conclusions -

Hopefully, by now, you can see that there is a very general principle here that can be applied to virtually any statistical design.

Resampling via bootstrapping is a powerful tool in many statistical situations (cf. Chpt. 19 in W&S textbook).

A nice overview of the concepts examined here can be found in:

Diaconis, P. and B. Efron. 1983. Computer-intensive methods in statistics. Scientific American 248(5): 116-130.
