Page 1:

Presenter: Yanlin Wu

Advisor: Professor Geman

Date: 10/17/2006

Page 2:

Is cross-validation valid for small-sample classification?

Ulisses M. Braga-Neto and Edward R. Dougherty

Page 3:

Background

- What is the classification problem?
- How do we evaluate the accuracy of a classifier, i.e., measure the error of a classification model?
- Different error-measurement methods
- Things to pay attention to…

Page 4:

Classification problem

In statistical pattern recognition, we have a feature vector $X \in R^d$ and a label $Y \in R$, which takes on numerical values representing the different classes; for a two-class problem, $Y \in \{0, 1\}$.

A classifier is a function $g: R^d \to \{0, 1\}$. The error rate of $g$ is $\varepsilon[g] = P[g(X) \neq Y] = E[|Y - g(X)|]$. The Bayes classifier is defined by $g_{BAY}(x) = 1$ if $P(Y = 1 \mid X = x) > 1/2$ and $g_{BAY}(x) = 0$ otherwise. For any $g$, $\varepsilon[g_{BAY}] \leq \varepsilon[g]$, so $g_{BAY}$ is the optimal classifier.
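To make the Bayes rule concrete, here is a minimal Python sketch for an assumed two-class model (the Gaussian class-conditional densities, the equal priors, and the Monte Carlo check are my own illustrative choices, not from the slides):

```python
import numpy as np
from scipy.stats import norm

# Assumed toy model: X | Y=0 ~ N(0, 1), X | Y=1 ~ N(2, 1), P(Y=1) = 0.5.
p1 = 0.5
f0, f1 = norm(0, 1), norm(2, 1)

def g_bayes(x):
    # g_BAY(x) = 1 iff P(Y=1 | X=x) > 1/2, by Bayes' theorem.
    post1 = p1 * f1.pdf(x) / (p1 * f1.pdf(x) + (1 - p1) * f0.pdf(x))
    return (post1 > 0.5).astype(int)

# Estimate the error rate P[g(X) != Y] by Monte Carlo.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100_000)
x = np.where(y == 1, rng.normal(2, 1, y.size), rng.normal(0, 1, y.size))
print("Bayes error ~", np.mean(g_bayes(x) != y))  # ~ Phi(-1) ~ 0.159
```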

Page 5:

Training data

The feature-label distribution F is unknown, so we use training data $S_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ to design a classifier.

A classification rule is a mapping $g: \{R^d \times \{0, 1\}\}^n \times R^d \to \{0, 1\}$; it maps the training data $S_n$ into the designed classifier $g(S_n, \cdot)$. The true error of a designed classifier is its error rate given a fixed training dataset:

$\varepsilon_n[g(S_n)] = E_F[|Y - g(S_n, X)|]$

where $E_F$ indicates expectation with respect to F.

Page 6:

Training data (continued)

The expected error rate over the data is given by

$E[\varepsilon_n] = E_{F_n} E_F[|Y - g(S_n, X)|]$

where $F_n$ is the joint distribution of the data $S_n$. It is also called the unconditional error of the classification rule.

Page 7:

Question: How can we measure the true error of a model when we don't have access to the universe of observations to test it on, i.e., when we don't know F?

Answer: Error estimation methods have been developed.

Page 8:

Error estimation techniques

For all methods, the final model M is built based on all n observations, and then these same n observations are used again to estimate the error of the model.

Types:
- Resubstitution
- Holdout method
- Cross-validation
- Bootstrap

Page 9:

Resubstitution

Reuse the same training sample to measure error:

$\hat{\varepsilon}_{resub} = \frac{1}{n} \sum_{i=1}^{n} |y_i - g(S_n, x_i)|$

This error tends to be biased low, and it can be made arbitrarily close to zero by overfitting the model and reusing the same data to measure error.
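A minimal sketch of the resubstitution estimator, assuming a synthetic two-class sample, a 3NN rule, and scikit-learn (none of which are prescribed by the slides):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy two-class sample (an assumed synthetic model, n = 20).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(1.5, 1, (10, 2))])
y = np.repeat([0, 1], 10)

# Resubstitution: design g(S_n) on S_n, then test it on the same S_n.
g = KNeighborsClassifier(n_neighbors=3).fit(X, y)
eps_resub = np.mean(g.predict(X) != y)  # optimistically (low) biased
print(eps_resub)
```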

Page 10:

Holdout Method

For large samples, randomly choose a subset $S_{n_t} \subset S_n$ of size $n_t$ as test data, design the classifier on $S_n \setminus S_{n_t}$, and estimate its error by applying the classifier to $S_{n_t}$.

The holdout estimator is an unbiased estimator of $E[\varepsilon_{n - n_t}]$, with respect to expectation over $S_n$.

Page 11:

Holdout Method Comments

- This error can be slightly biased high because not all n observations are used to build the classifier. The bias tends to decrease as n increases.
- The choice of what percentage of the n observations go into $S_{n_t}$ is important, and it is affected by n.
- The holdout method can be run multiple times, with the accuracy estimates from all the runs averaged.
- Impractical with small samples.
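A matching sketch of the holdout estimator under the same illustrative assumptions (the 30% test fraction is an arbitrary choice):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Assumed synthetic two-class data, as before.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(1.5, 1, (50, 2))])
y = np.repeat([0, 1], 50)

# Hold out a random 30% of S_n as the test set S_{n_t}.
idx = rng.permutation(len(y))
n_t = int(0.3 * len(y))
test, train = idx[:n_t], idx[n_t:]

g = KNeighborsClassifier(n_neighbors=3).fit(X[train], y[train])
eps_holdout = np.mean(g.predict(X[test]) != y[test])
print(eps_holdout)
```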

Page 12:

Cross-Validation

Algorithm: split the data into k mutually exclusive subsets $S_{(i)}$, $i = 1, \ldots, k$, of size n/k each; build the model on k − 1 of them and measure error on the remaining one. Each subset acts as the test set exactly once, and the error estimate is the average of these k error measures:

$\hat{\varepsilon}_{cvk} = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n/k} |y_j^{(i)} - g(S_n \setminus S_{(i)}, x_j^{(i)})|$

When k = n, this is called "leave-one-out cross-validation".
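A sketch of k-fold cross-validation written out explicitly, under the same illustrative assumptions (3NN via scikit-learn, synthetic data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cv_error(X, y, k, rng):
    """k-fold cross-validation error estimate; folds are a random
    partition of the indices into k subsets S_(i)."""
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = 0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        g = KNeighborsClassifier(n_neighbors=3).fit(X[train], y[train])
        errors += np.sum(g.predict(X[test]) != y[test])
    return errors / len(y)  # average over all n held-out predictions

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(1.5, 1, (10, 2))])
y = np.repeat([0, 1], 10)
print(cv_error(X, y, k=5, rng=rng))       # 5-fold
print(cv_error(X, y, k=len(y), rng=rng))  # leave-one-out
```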

Page 13:

Cross-Validation (continued)

- Stratified cross-validation: the classes are represented in each fold in the same proportion as in the original data. There is evidence that this improves the estimator.
- The k-fold cross-validation estimator is unbiased as an estimator of $E[\varepsilon_{n - n/k}]$.
- The leave-one-out estimator is nearly unbiased as an estimator of $E[\varepsilon_n]$.

Page 14:

Cross-Validation Comments

- May be biased high, for the same reason as the holdout method.
- Often used when n is small, in which case the holdout method may become more biased.
- Very computationally intensive, especially for large k and n.

Page 15:

Bootstrap Method

- Based on the notion of an "empirical distribution" F*, which puts mass 1/n on each of the n data points.
- A bootstrap sample $S_n^*$ from F* consists of n equally likely draws with replacement from the original data $S_n$.
- The probability that any given data point will not appear in $S_n^*$ is $(1 - 1/n)^n \approx e^{-1}$ when n ≫ 1.
- A bootstrap sample of size n therefore contains on average $(1 - e^{-1})\,n \approx 0.632\,n$ of the original data points.
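A quick simulation check of the 0.632 fact (the sample size n = 100 and the replication count are arbitrary illustrative choices):

```python
import numpy as np

# Expected fraction of original points appearing in a bootstrap sample:
# 1 - (1 - 1/n)^n, which tends to 1 - 1/e ~ 0.632.
rng = np.random.default_rng(4)
n, reps = 100, 10_000
fractions = [
    len(np.unique(rng.integers(0, n, n))) / n  # unique draws w/ replacement
    for _ in range(reps)
]
print(np.mean(fractions))        # ~ 0.634 for n = 100
print(1 - (1 - 1 / n) ** n)      # exact value, -> 1 - 1/e as n grows
```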

Page 16:

Bootstrap Method (continued)

Bootstrap zero estimator:

$\hat{\varepsilon}_0 = E_{F^*}[\,|Y - g(S_n^*, X)| : (X, Y) \in S_n \setminus S_n^*\,]$

In practice, $E_{F^*}$ has to be approximated by a sample mean based on independent replicates $S_n^{*b}$, for b = 1, …, B, where B is recommended to be between 25 and 200:

$\hat{\varepsilon}_0 = \frac{\sum_{b=1}^{B} \sum_{i=1}^{n} |y_i - g(S_n^{*b}, x_i)| \, I_{P_i^{*b} = 0}}{\sum_{b=1}^{B} \sum_{i=1}^{n} I_{P_i^{*b} = 0}}$

where $P_i^{*b}$ is the actual proportion of times the data point $(x_i, y_i)$ appears in $S_n^{*b}$.

Page 17:

Bootstrap Method (continued)

The bootstrap zero estimator tends to be a high-biased estimator of $E[\varepsilon_n]$. The 0.632 bootstrap estimator tries to correct this bias:

$\hat{\varepsilon}_{b632} = (1 - 0.632)\,\hat{\varepsilon}_{resub} + 0.632\,\hat{\varepsilon}_0$

Bias-corrected bootstrap estimator:

$\hat{\varepsilon}_{bbc} = \hat{\varepsilon}_{resub} + \frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{n} \left( \frac{1}{n} - P_i^{*b} \right) |y_i - g(S_n^{*b}, x_i)|$
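A sketch combining the bootstrap zero and 0.632 estimators, under the same illustrative assumptions as before (3NN via scikit-learn, synthetic data, B = 100):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def bootstrap_estimators(X, y, B=100, rng=None):
    """Sketch of the bootstrap zero and 0.632 estimators (3NN assumed)."""
    if rng is None:
        rng = np.random.default_rng(5)
    n = len(y)
    g = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    eps_resub = np.mean(g.predict(X) != y)

    err_sum, count = 0.0, 0
    for _ in range(B):
        draw = rng.integers(0, n, n)            # bootstrap sample S_n^{*b}
        out = np.setdiff1d(np.arange(n), draw)  # points with P_i^{*b} = 0
        if len(out) == 0:
            continue
        gb = KNeighborsClassifier(n_neighbors=3).fit(X[draw], y[draw])
        err_sum += np.sum(gb.predict(X[out]) != y[out])
        count += len(out)
    eps_zero = err_sum / count
    eps_b632 = (1 - 0.632) * eps_resub + 0.632 * eps_zero
    return eps_zero, eps_b632

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(1.5, 1, (10, 2))])
y = np.repeat([0, 1], 10)
print(bootstrap_estimators(X, y, B=100, rng=rng))
```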

Page 18:

Bootstrap Comments

- Computation intensive.
- The choice of B is important.
- Tends to be slightly more accurate than cross-validation in some situations, though often with increased bias.

Page 19:

Classification procedure

- Assess gene expressions with microarrays
- Determine genes whose expression levels can be used as classifier variables
- Apply a rule to design the classifier from the sample data
- Apply an error estimation procedure

Page 20:

Error estimation challenges

- What if the number of training samples is remarkably small? Error estimation is greatly impacted by small samples.
- A dilemma: unbiasedness or small variance?
- Prefer small variance: an unbiased estimator with large variance is of little use.

Page 21:

Error estimators under small samples

- Holdout: impractical with small samples.
- Resubstitution: always low-biased.
- Cross-validation: higher variance than resubstitution or bootstrap. The variance problem of cross-validation makes its use questionable for these kinds of very small samples.

Page 22:

Variability affecting error estimation

- Two sources: the internal variance $Var[\hat{\varepsilon} \mid S_n]$ and the variability due to the random training sample. The latter is much larger than the internal variance.
- Error-counting estimates, such as resubstitution and cross-validation, can only change in increments of 1/n.

Page 23:

Variability (continued)

- In cross-validation, the test samples are not independent. This adds variance to the estimate.
- Surrogate problem: the originally designed classifier is assessed in terms of surrogate classifiers, designed by applying the classification rule to reduced data. If these surrogate classifiers differ from the original classifier too much, too often, the estimate may be far from the true error rate.

Page 24:

Experimental Setup

Classification rules:
- linear discriminant analysis (LDA)
- 3-nearest-neighbor (3NN)
- decision trees (CART)

Error estimators:
- resubstitution (resub)
- cross-validation: leave-one-out (loo), 5-fold c-v (cv5), 10-fold c-v (cv10), and repeated 10-fold c-v (cv10r)
- bootstrap: 0.632 bootstrap (b632) and the bias-corrected bootstrap (bbc)

Page 25:

Study terms of error estimators

Study the performance of an error estimator $\hat{\varepsilon}_n$ via the distribution of $\hat{\varepsilon}_n - \varepsilon_n$, the deviation distribution of the error estimator:

- Estimator bias: $E[\hat{\varepsilon}_n - \varepsilon_n]$
- Deviation variance: $Var[\hat{\varepsilon}_n - \varepsilon_n]$, which determines the confidence we can have in estimates from actual samples
- Root-mean-square (RMS) error: $\sqrt{E[(\hat{\varepsilon}_n - \varepsilon_n)^2]}$
- Quartiles of the deviation distribution: less affected by outliers than the mean

Page 26:

Linear Discriminant Analysis (LDA)

We need the class posteriors $\Pr(G \mid X)$ for optimal classification. Suppose $f_k(x)$ is the class-conditional density of X in class G = k, and let $\pi_k$ be the prior probability of class k, with $\sum_{k=1}^{K} \pi_k = 1$. A simple application of Bayes' theorem gives us

$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$

Suppose we model each class density as multivariate Gaussian:

$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}$

Page 27:

LDA (continued)

Assume the classes have a common covariance matrix $\Sigma_k = \Sigma$ for all k; this gives LDA. In comparing two classes k and l, it is sufficient to look at the log-ratio

$\log \frac{P(G = k \mid X = x)}{P(G = l \mid X = x)} = \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l} = \log \frac{\pi_k}{\pi_l} - \frac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l)$

The linear discriminant function

$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$

is an equivalent description of the decision rule, with $G(x) = \arg\max_k \delta_k(x)$.

Page 28:

LDA (continued)

Estimate the parameters of the Gaussian distributions:

1. $\hat{\pi}_k = N_k / N$, where $N_k$ is the number of class-k observations
2. $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$
3. $\hat{\Sigma} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$

Figure 1: three Gaussian distributions with the same covariance and different means. Included are the contours of constant density enclosing 95% of the probability of each class (Bayes decision boundaries).

Figure 2: 30 samples drawn from each Gaussian distribution, and the fitted LDA decision boundaries.
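A sketch of LDA using exactly these plug-in estimates; the toy data are an assumed two-class Gaussian sample:

```python
import numpy as np

def lda_fit(X, y):
    """LDA with the plug-in estimates above (pooled covariance)."""
    classes, N = np.unique(y), len(y)
    pi = {k: np.mean(y == k) for k in classes}         # pi_hat_k = N_k / N
    mu = {k: X[y == k].mean(axis=0) for k in classes}  # mu_hat_k
    # Pooled covariance: within-class scatter summed over k, divided by N - K.
    Sigma = sum(
        (X[y == k] - mu[k]).T @ (X[y == k] - mu[k]) for k in classes
    ) / (N - len(classes))
    return classes, pi, mu, np.linalg.inv(Sigma)

def lda_predict(X, model):
    classes, pi, mu, Sinv = model
    # delta_k(x) = x^T Sinv mu_k - 0.5 mu_k^T Sinv mu_k + log pi_k
    deltas = np.column_stack([
        X @ Sinv @ mu[k] - 0.5 * mu[k] @ Sinv @ mu[k] + np.log(pi[k])
        for k in classes
    ])
    return classes[np.argmax(deltas, axis=1)]

# Toy usage with assumed Gaussian data.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.repeat([0, 1], 30)
model = lda_fit(X, y)
print(np.mean(lda_predict(X, model) == y))  # training accuracy
```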

Page 29:

KNN: Nearest-Neighbor Methods

Nearest-neighbor methods use those observations in the training set T closest in input space to x to form $\hat{Y}$. Specifically, the k-nearest-neighbor fit for $\hat{Y}$ is defined as

$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$

where $N_k(x)$ is the neighborhood of x defined by the k closest points $x_i$ in the training sample. Closeness implies a metric, which for the moment we assume is Euclidean distance (other distances can also be defined). In words: we find the k observations with $x_i$ closest to x in input space and average their responses.
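A sketch of the k-nearest-neighbor fit for a two-class problem, with the average of neighbor labels thresholded at 1/2 (the data and the choice k = 3 are illustrative assumptions):

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """k-nearest-neighbor classification with Euclidean distance:
    Y_hat(x) averages y_i over N_k(x), then thresholds at 1/2
    for a two-class {0, 1} problem."""
    preds = []
    for x in X_new:
        dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean
        neighbors = np.argsort(dist)[:k]                  # N_k(x)
        preds.append(int(y_train[neighbors].mean() > 0.5))
    return np.array(preds)

# Toy usage with assumed data.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.repeat([0, 1], 20)
print(np.mean(knn_predict(X, y, X, k=3) == y))  # resubstitution accuracy
```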

Page 30:

Figure 1: 15-nearest-neighbor classifier. Figure 2: 1-nearest-neighbor classifier. Figure 3: 7-nearest-neighbor classifier. Figure 4: misclassification curves (training size = 20, test size = 10,000).

Page 31:

Decision tree (CART)

- Decide how to split (conditional Gini or conditional entropy)
- Decide when to stop splitting
- Decide how to prune the tree:
  - using the training sample: pessimistic pruning, minimal-error pruning, error-based pruning, cost-complexity pruning
  - using a separate pruning sample: reduced-error pruning

Page 32:

Simulation (synthetic data)

- Six sample sizes: 20 to 120 in increments of 20.
- Total experimental conditions: 3 × 6 × 6 = 108.
- For each experimental condition and sample size, compute the empirical deviation distribution using 1000 replications, with different sample data drawn from an underlying model.
- True error: computed exactly for LDA, and by Monte Carlo computation for 3NN and CART.

Page 33:

Simulation (synthetic data)

Empirical deviation distributions for selected simulations (synthetic data); beta fits, n = 20.

Page 34:

Simulation (synthetic data)

Empirical deviation distributions for selected simulations: variance as a function of sample size.

Page 35:

Simulation (synthetic data)

- Cross-validation: slightly high-biased; the main drawback is high variability. It also tends to produce large outliers.
- Resubstitution is low-biased but shows smaller variance than cross-validation.
- The 0.632 bootstrap proved to be the best.
- Computational cost also needs to be considered.

Page 36:

Simulation (patient data)

- Microarrays from breast tumor samples from 295 patients: 115 good-prognosis, 180 poor-prognosis.
- Use log-ratio gene expression values associated with the top p = 2 and top p = 5 genes. In each case, 1000 observations of size n = 20 and n = 40 were drawn independently from the pool of 295 microarrays. Sampling was stratified.
- True error for each observation of size n: holdout estimator, with the 295 − n sample points not drawn used as the test set (a good approximation, given the large test sample).

Page 37:

Simulation (patient data)

Empirical deviation distributions for selected simulations (patient data); beta fits, n = 20.

Page 38:

Simulation (patient data)

- The observations are not independent, but they are only weakly dependent.
- The results obtained with the patient data confirm the general conclusions obtained with the synthetic data.

Page 39:

Conclusion

Cross-validation error estimation is much less biased than resubstitution, but it has excessive variance. Bootstrap methods provide improved performance with respect to variance, but at a high computational cost and often with increased bias (though much less bias than resubstitution).

Page 40:

My own opinion

Since the universal distribution of the training sample is unknown, the true error can only be defined on the training sample. If the number of training samples is very small, or the sampling method used to obtain them is not carried out correctly, the training samples will fail to represent the universal samples; in that case, classifiers and error estimates based on such a small number of samples cannot provide useful information about the universal classification problem.

Page 41:

Outlier-sum statistic method

Robert Tibshirani and Trevor Hastie

Page 42:

Background

- What is an outlier?
- Common methods to detect outliers
- Outliers in cancer gene studies
- The t-statistic in outlier studies
- COPA (Cancer Outlier Profile Analysis)

Page 43:

What Is an Outlier?

Definition: an outlier is an unusual value in a dataset, one that does not fit the typical pattern of the data.

Sources of outliers:
- recording or measurement errors
- natural variation of the data (valid data)

Page 44:

Outlier analysis

Issues:
- If an outlier is a true error and is not dealt with, results can be severely biased.
- If an outlier is valid data and is removed, valuable information regarding important patterns in the data is lost.

Objective: identify outliers, then decide how to deal with them.

Page 45:

Outlier detection

- Visual inspection of the data: not applicable for large, complex datasets.
- Automated methods: normal distribution-based method, median absolute deviation, distance-based method, …

Page 46:

Normal distribution-based method

Works on one variable at a time, $X_k$, k = 1, …, p, and assumes a normal distribution for each variable.

Algorithm:
- Let $x_{ik}$ be the i-th observation's value for variable $X_k$ (i = 1, …, n).
- Sample mean for $X_k$: $\bar{X}_k = \frac{1}{n} \sum_i x_{ik}$
- Sample standard deviation for $X_k$: $S_k = \sqrt{\frac{\sum_i (x_{ik} - \bar{X}_k)^2}{n - 1}}$
- Calculate $z_{ik} = \frac{x_{ik} - \bar{X}_k}{S_k}$ for each i = 1, …, n.
- Label $x_{ik}$ an outlier if $|z_{ik}| > 3$; about 0.25% of values will be labeled if the normality assumption is correct.
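A sketch of this rule for a single variable (the planted outlier is an illustrative assumption):

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Normal distribution-based rule for one variable X_k:
    label x_ik an outlier if |z_ik| > 3."""
    z = (x - x.mean()) / x.std(ddof=1)  # ddof=1: the n-1 sample std dev
    return np.abs(z) > threshold

# Toy usage: normal data with one planted outlier.
rng = np.random.default_rng(8)
x = rng.normal(0, 1, 100)
x[0] = 8.0
print(np.where(zscore_outliers(x))[0])  # -> [0]
```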

Page 47:

Normal distribution-based method

- Very dependent on the assumption of normality.
- $\bar{X}_k$ and $S_k$ are themselves not robust to outliers: many positive outliers inflate $\bar{X}_k$ and $S_k$, so the $z_{ik}$ values come out small when there are real outliers in the data, and fewer outliers are detected.
- Handles only numeric-valued variables (the same holds for the other methods).

Page 48:

Robust Normal Method

Deals with the robustness problem. Same as the normal distribution method, but:
- use a trimmed mean or median $\bar{X}_k^R$ instead of $\bar{X}_k$;
- use a trimmed standard deviation $S_k^R$ instead of $S_k$;
- calculate $z_{ik} = \frac{x_{ik} - \bar{X}_k^R}{S_k^R}$ and still use the $|z_{ik}| > 3$ labeling rule (the R superscript denotes the robust versions of the mean and standard deviation).

Page 49:

Median Absolute Deviation (MAD)

Another method for dealing with the robustness problem: use the median as a robust estimate of the mean, and the MAD as a robust estimate of the standard deviation.

Algorithm:
- Calculate $D_{ik} = |x_{ik} - \text{median}(X_k)|$ for i = 1, …, n.
- Calculate $MAD_k = \text{median}(D_{1k}, \ldots, D_{nk})$.
- Calculate the modified z value: $z_{ik} = \frac{x_{ik} - \text{median}(X_k)}{1.4826 \, MAD_k}$
- Label $x_{ik}$ as an outlier if $|z_{ik}| > 3.5$.

Note: the factor 1.4826 is used because $E(1.4826 \, MAD) = \sigma$ for normal data.
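A sketch of the MAD rule on the same kind of toy data; unlike the z-score rule, the planted outlier does not inflate the scale estimate:

```python
import numpy as np

def mad_outliers(x, threshold=3.5):
    """MAD rule: modified z with 1.4826 * MAD in place of the standard
    deviation, labeling |z_ik| > 3.5 as outliers."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    z = (x - med) / (1.4826 * mad)
    return np.abs(z) > threshold

# Toy usage: normal data with one planted outlier.
rng = np.random.default_rng(9)
x = rng.normal(0, 1, 100)
x[0] = 8.0
print(np.where(mad_outliers(x))[0])
```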

Page 50:

Distance Based Method

- Non-parametric (no assumption of normality).
- Multidimensional: detects outliers across all attributes at once (instead of one attribute at a time).

Algorithm:
- Calculate the distance between all pairs of observations; the Euclidean distance from observation i to observation j is
$d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$
- Label observation i an outlier if fewer than r% of the total observations fall within distance d of i (r and d are parameters).
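A sketch of the distance-based rule; the values of d and r are dataset-specific tuning parameters and are assumed here purely for illustration:

```python
import numpy as np

def distance_outliers(X, d, r):
    """Distance-based rule: observation i is an outlier if fewer than r%
    of all observations lie within distance d of it."""
    # Pairwise Euclidean distances d_ij via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    frac_within = (dist <= d).mean(axis=1)  # includes i itself
    return frac_within < r / 100.0

# Toy usage with one planted far-away point; d and r are assumed values.
rng = np.random.default_rng(10)
X = rng.normal(0, 1, (100, 3))
X[0] = [10, 10, 10]
print(np.where(distance_outliers(X, d=2.0, r=5))[0])
```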

Page 51:

Distance Based Method

- Computationally intensive, particularly for large samples.
- The time required for each distance calculation grows with the number of attributes.
- The choice of d and r is not obvious, and trying different values for a particular dataset further increases the computation.

Page 52:

Outliers in cancer studies

- In cancer studies, mutations can often amplify or turn off gene expression in only a minority of samples, i.e., produce outliers.
- The t-statistic may yield a high false discovery rate (FDR) when trying to detect changes that occur in a small number of samples.
- COPA and PPST have been developed. Is there a better method?

Page 53:

t-statistic method

For two normal distributions $N(\theta_1, \sigma^2)$ and $N(\theta_2, \sigma^2)$, the standardized estimate of $\theta_2 - \theta_1$ follows a t distribution.

Algorithm:
- Let $x_{ij}$ be the expression value for gene i and sample j, with two groups: 1 (normal) and 2 (disease). Compute a two-sample t-statistic for each gene:
$T_i = \frac{\bar{x}_{i2} - \bar{x}_{i1}}{s_i}$
where $\bar{x}_{ik}$ is the mean of gene i in group k and $s_i$ is the pooled within-group standard deviation of gene i.
- Call a gene significant if $|T_i|$ exceeds some threshold c.
- Use permutations of the sample labels to estimate the false discovery rate (FDR) for different values of c.
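A sketch of the per-gene two-sample t-statistic; including the $\sqrt{1/n_1 + 1/n_2}$ factor in $s_i$ follows the standard pooled two-sample t and is my assumption:

```python
import numpy as np

def two_sample_t(X, groups):
    """Pooled two-sample t-statistic T_i per gene (rows = genes).
    `groups` is a boolean mask, True for the disease group."""
    x1, x2 = X[:, ~groups], X[:, groups]
    n1, n2 = x1.shape[1], x2.shape[1]
    # Pooled within-group variance per gene.
    sp2 = ((n1 - 1) * x1.var(axis=1, ddof=1) +
           (n2 - 1) * x2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    s = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (x2.mean(axis=1) - x1.mean(axis=1)) / s

# Toy usage: 1000 genes, 30 samples, 15 per group (assumed setup).
rng = np.random.default_rng(11)
X = rng.normal(0, 1, (1000, 30))
groups = np.repeat([False, True], 15)
X[0, groups] += 2  # gene 1 differentially expressed in all of group 2
T = two_sample_t(X, groups)
print(np.argmax(np.abs(T)))  # likely gene 0
```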

Page 54:

t-statistic method

- The t-statistic method is a normal distribution-based method, so it is strongly affected by outliers.
- The t-statistic has no procedure for dealing with outliers.
- It is not well suited to cancer studies in which mutations occur in only a minority of samples.

Page 55:

COPA

Cancer Outlier Profile Analysis. Algorithm:
1. Gene expression values are median-centered, setting each gene's median expression value to zero.
2. The median absolute deviation (MAD) is calculated and scaled to 1 by dividing each gene expression value by its MAD.
3. The 75th, 90th, and 95th percentiles of the transformed expression values are tabulated for each gene, and the genes are then rank-ordered by their percentile scores, providing a prioritized list of outlier profiles.
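A sketch of the COPA transform and percentile score (rows are genes, columns are samples; the planted outlier profile and the choice of the 90th percentile are illustrative assumptions):

```python
import numpy as np

def copa_scores(X, percentile=90):
    """COPA: median-center each gene, scale by its MAD, then score each
    gene by a chosen percentile (75, 90, or 95) of its transformed values."""
    med = np.median(X, axis=1, keepdims=True)
    mad = np.median(np.abs(X - med), axis=1, keepdims=True)
    X_t = (X - med) / mad
    return np.percentile(X_t, percentile, axis=1)

# Toy usage: rank genes by COPA score at the 90th percentile.
rng = np.random.default_rng(12)
X = rng.normal(0, 1, (1000, 30))
X[0, 25:] += 4  # outlier profile in a few samples of gene 1
scores = copa_scores(X, percentile=90)
print(np.argsort(scores)[::-1][:5])  # top-ranked genes, gene 0 likely first
```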

Page 56:

COPA

The median and MAD are used for the transformation, as opposed to the mean and standard deviation, so that outlier expression values do not unduly influence the distribution estimates and are thus preserved post-normalization.

Page 57:

Outlier-sum statistic

- Idea: improve performance when "abnormal" gene expression occurs in only a small number of samples.
- Propose another method besides COPA.
- Compare it with COPA.

Page 58:

Outlier-sum statistic

Algorithm:
- Let $med_i$ and $mad_i$ be the median and median absolute deviation of the values for gene i.
- Standardize each gene: $x'_{ij} = (x_{ij} - med_i) / mad_i$
- Let $q_i(r)$ be the r-th percentile of the $x'_{ij}$ values for gene i, and define the interquartile range $IQR(i) = q_i(75) - q_i(25)$.
- Values greater than the limit $q_i(75) + IQR(i)$ are defined to be outliers.

Page 59:

The outlier-sum statistic is defined to be the sum of the standardized values in the disease group $C_2$ that lie beyond this limit:

$W_i = \sum_{j \in C_2} x'_{ij} \, I[x'_{ij} > q_i(75) + IQR(i)]$

In real applications, one might expect negative as well as positive outliers; hence also define

$W'_i = \sum_{j \in C_2} x'_{ij} \, I[x'_{ij} < q_i(25) - IQR(i)]$

Set the outlier-sum statistic to the larger of $W_i$ and $W'_i$ in absolute value. This is called the "two-sided outlier-sum statistic".
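A sketch of the two-sided outlier-sum statistic as defined above (the toy setup mirrors the paper's simulation, but the specific seed and k are my assumptions):

```python
import numpy as np

def outlier_sum(X, disease):
    """Two-sided outlier-sum statistic (rows = genes).
    `disease` is a boolean mask selecting the disease group C2."""
    med = np.median(X, axis=1, keepdims=True)
    mad = np.median(np.abs(X - med), axis=1, keepdims=True)
    Xs = (X - med) / mad                        # x'_ij
    q25, q75 = np.percentile(Xs, [25, 75], axis=1)
    iqr = q75 - q25
    Xd = Xs[:, disease]
    hi = Xd > (q75 + iqr)[:, None]              # positive outliers
    lo = Xd < (q25 - iqr)[:, None]              # negative outliers
    W = np.where(hi, Xd, 0).sum(axis=1)
    W2 = np.where(lo, Xd, 0).sum(axis=1)
    return np.where(np.abs(W) >= np.abs(W2), W, W2)

# Toy usage: 1000 genes, 30 samples, 15 per group.
rng = np.random.default_rng(13)
X = rng.normal(0, 1, (1000, 30))
disease = np.repeat([False, True], 15)
X[0, 25:] += 2  # gene 1 shifted in only k = 5 disease samples
print(np.argmax(np.abs(outlier_sum(X, disease))))  # likely gene 0
```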

Page 60:

Simulation study

- Generate 1000 genes and 30 samples, with all values drawn from a standard normal distribution.
- Add 2 units to gene 1 for k of the samples in the second group.
- Compute the p-values and compare their median, mean, and standard deviation across the different methods.

Page 61:

Simulation result

- For k = 15 (all samples in group 2 differentially expressed), the t-statistic performs best; this continues down to k = 8.
- For smaller values of k, the outlier-sum statistic yields better results than COPA and the t-statistic.

Page 62:

Application to the skin data

- 12,625 genes and 58 cancer patients: 14 with radiation sensitivity and 44 without.
- The group of 44 is used as the normal class.
- The outlier-sum statistic is applied within the SAM (Significance Analysis of Microarrays) approach.

Page 63:

Experiment result

The outlier-sum statistic has a lower false discovery rate (FDR) near the right of the plot, but the FDR there may be too high for it to be useful in practice.

Page 64:

Top 12 genes called by the outlier-sum statistic

Page 65:

Conclusion

The outlier-sum statistic performs better than simple t-statistic thresholding and COPA when some gene expression values are unusually high in some, but not all, samples. Otherwise, the t-statistic performs well.

Page 66:

My point of view

- More test examples are needed to test the theory and see how far this method will go.
- In the simulation study, the values are drawn from a standard normal distribution. Will the variance of this distribution affect the simulation result (since exactly 2 units were added to simulate the abnormal gene expression)?
- In the simulation, only one gene was made to contain outliers. What if there were more than one? In other words, if another gene also exhibits unusually high expression in some samples but is irrelevant to the classification problem, will it affect the outlier-sum statistic?

Page 67:

Reference

- U. M. Braga-Neto and E. R. Dougherty, Bioinformatics 20, 374–380.
- S. A. Tomlins et al., Science 310, 644–648.
- Robert Tibshirani and Trevor Hastie, Biostatistics Advance Access, May 15, 2006.
- The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, Springer.
- Class notes from 'Data Mining', by Paul Maiste.
- Introduction to Data Mining, by Pang-Ning Tan et al., Addison-Wesley.
- Class notes from 'Machine Learning', by Donald Geman.

Page 68: