LEARNING AND APPLICATIONS
REGULARIZATION METHODS FOR HIGH DIMENSIONAL LEARNING
Francesca Odone and Lorenzo Rosasco
[email protected] - [email protected]
Regularization Methods for High Dimensional Learning - Learning and applications
PLAN
Learning and engineering applications: why?
Examples of in-house applications
- Face and object detection
- Medical image analysis
- Microarray data analysis
LET’S GO BACK TO THE BEGINNING
The goal is not to memorize but to generalize (or to predict)
Given a set of data
(x1, y1), . . . , (xn, yn)
find a function f which is a good predictor of y for a future input x

f(x) = y
WHAT IS IT USEFUL FOR?
The learning paradigm is useful whenever the underlying process is
- partially unknown,
- too complex, or
- too noisy
to be modeled as a sequence of instructions.
THE APPLICATIONS WE DEAL WITH
Computer vision
- Face detection and recognition
- Object detection
- Image annotation
- Dynamic events and actions analysis
Medical Image Analysis
- Automatic MR annotation
- Dictionary learning
Computational biology
- Gene selection
PLAN
Learning and engineering applications: why?
Examples of in-house applications
- Face and object detection
- Medical image analysis
- Microarray data analysis
LEARNING FROM IMAGES
Object detection, image categorization and, more in general, image understanding are difficult problems.
Learning from examples has been accepted as a viable way to deal with such problems, addressing noise and intra-class variability by collecting appropriate data and finding suitable descriptions.
Images are relatively easy to gather
IMAGE DESCRIPTIONS
WITH OVERCOMPLETE FEATURE SETS
Overcomplete, general-purpose sets of features are effective for modeling visual information.
Many object classes have peculiar intrinsic structures that can be better appreciated if one looks for symmetries or local geometries.
Examples of features: wavelets, ranklets, chirplets, rectangle features, ...
Examples of problems: face detection [Heisele et al., Viola & Jones, Destrero et al.], pedestrian detection [Oren et al.], car detection [Papageorgiou & Poggio]
The approach is inspired by biological systems. See, for instance, B. A. Olshausen and D. J. Field, "Sparse coding with an over-complete basis set: a strategy employed by V1?", 1997.
FACE DETECTION
DESTRERO ET AL., 2009
THE CLASSIFICATION PROBLEM
It is a (binary) classification problem:
→ each image region can either be a face or not
We start from a training set of face and non-face images:
{(x1, y1), . . . , (xn, yn)}
xi is a raw vector encoding the gray levels of image Ii, and yi ∈ {−1, 1} according to whether the image is a face or not
IMAGE REPRESENTATION
We represent images as rectangle feature vectors:
xi → (φ1(xi), . . . , φp(xi))
FACE DETECTION
ASSUMPTION
We assume

Φβ = Y

where Φ = {Φij} is the data matrix, β = (β1, ..., βp)^T is the vector of unknown weights to be estimated, and Y = (y1, ..., yn)^T the vector of output labels
Usually p is large; existence of a solution is ensured, but uniqueness is not
The overcomplete set contains many correlated features
Thus, the problem is ill-posed. We resort to regularization.
SELECT FACE FEATURES
L1 regularization allows us to select a sparse subset of meaningful features for the problem, with the aim of discarding correlated ones:

min_{β ∈ IR^p} ‖Y − Φβ‖² + λ‖β‖₁
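As a minimal sketch (not the authors' implementation: scikit-learn's Lasso stands in for their solver, and the data are synthetic placeholders for the rectangle-feature matrix), L1-based selection keeps the features with nonzero coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 1000                          # n examples, p overcomplete features
Phi = rng.standard_normal((n, p))         # stand-in for the feature matrix
beta_true = np.zeros(p)
beta_true[:5] = 1.0                       # only 5 truly relevant features
Y = np.sign(Phi @ beta_true + 0.1 * rng.standard_normal(n))  # labels in {-1, +1}

# min_beta ||Y - Phi beta||^2 + lambda ||beta||_1
lasso = Lasso(alpha=0.1).fit(Phi, Y)
selected = np.flatnonzero(lasso.coef_)    # indices of the selected features
print(f"{len(selected)} features selected out of {p}")
```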
A SAMPLED VERSION OF THE ALGORITHM
Applying the algorithm starting from the entire set of features is not computationally feasible (Φ: 4000x64000 ≃ 1 GB)
We create many subsets of features, randomly sampled with repetition
We run the algorithm separately on each subset
We keep only features selected in every run in which they were present
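The subset-based procedure can be sketched as follows (a simplified stand-in: scikit-learn's Lasso replaces the thresholded Landweber solver, and the subset counts are scaled down from the 200 subsets of the slide):

```python
import numpy as np
from sklearn.linear_model import Lasso

def subsampled_selection(Phi, Y, n_subsets=20, frac=0.1, alpha=0.1, seed=0):
    """Run L1 selection on random feature subsets and keep a feature only
    if it was selected in every run in which it appeared."""
    rng = np.random.default_rng(seed)
    p = Phi.shape[1]
    appeared = np.zeros(p, dtype=int)     # runs in which each feature appeared
    kept = np.zeros(p, dtype=int)         # runs in which it was selected
    for _ in range(n_subsets):
        idx = rng.choice(p, size=int(frac * p), replace=True)  # with repetition
        idx = np.unique(idx)
        coef = Lasso(alpha=alpha).fit(Phi[:, idx], Y).coef_
        appeared[idx] += 1
        kept[idx[coef != 0]] += 1
    return np.flatnonzero((appeared > 0) & (kept == appeared))

rng = np.random.default_rng(1)
Phi = rng.standard_normal((100, 400))
Y = Phi[:, :3].sum(axis=1) + 0.1 * rng.standard_normal(100)
stable = subsampled_selection(Phi, Y, n_subsets=10, frac=0.2)
```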
[Diagram: from the initial feature set S0, 200 subsets of 10% of the features are randomly extracted with repetition; thresholded Landweber selection is run on each subset; the final set S1 keeps only the features selected in every run in which they were present.]
THE FINAL SET OF FACE FEATURES
Positive and negative samples from the training set
Notice how vertical symmetries are not captured by selected features
THE SOLUTION DEPENDS ON THE TRAINING DATA
In the MIT+CMU training set all images are registered and well cropped
Vertical symmetries are captured by selected features
FACE DETECTION
FACE CLASSIFICATION
Elastic net regularization embeds both feature selection and prediction functionalities.
As suggested in (Candes & Tao, 2007), in order to improve the classification performance one could use L2 regularization on the reduced data representation.
Since a main requirement of our application is real-time performance, we adopt a linear SVM for classification:
L1 + SVM gives us sparsity both on the representation and on the dataset, and thus fewer computations
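A minimal sketch of this two-stage L1 + SVM pipeline (with synthetic data in place of the face features; the solvers and parameters are illustrative, not those of the paper):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, p = 300, 500
X = rng.standard_normal((n, p))
w = np.zeros(p)
w[:10] = 1.0
y = np.sign(X @ w + 0.2 * rng.standard_normal(n))

# Stage 1: L1 regularization gives a sparse representation
sel = np.flatnonzero(Lasso(alpha=0.05).fit(X, y).coef_)

# Stage 2: a linear SVM on the reduced representation (fast at test time)
clf = LinearSVC(max_iter=10000).fit(X[:, sel], y)
print(f"{len(sel)} features kept, train accuracy {clf.score(X[:, sel], y):.2f}")
```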
FACE CLASSIFICATION RESULTS
[ROC curves, hit rate 0.7 to 1 vs false positive rate 0 to 0.02, comparing: 2-stage feature selection; 2-stage feature selection + correlation; Viola & Jones feature selection using our same data; Viola & Jones cascade performance.]
Our strategy for feature selection outperforms the one by Viola and Jones using the same dataset
AdaBoost seems to need a large number of examples to be trained effectively (we used just 4000 examples)
FROM FACE CLASSIFICATION TO FACE DETECTION
WHY IS IT DIFFICULT?
Any given region of a real image is very unlikely to contain a face
→ even a small per-window error rate yields a high number of false positives
Image dimensions: 384x222 px
∼ 6.5 · 10^5 tests in a multi-scale search with a base window of 19x19 px
Only 11 faces!
FACE DETECTION: A CASCADE OF CLASSIFIERS
For each image we have many tests to do
→ few positive examples and many negative examples
We build a coarse-to-fine classification architecture:
→ Simpler classifiers are used to reject the majority of sub-windows
→ More complex classifiers allow us to achieve low false positive rates
FACE DETECTION: A CASCADE OF CLASSIFIERS
1. Start from the set S of selected features
2. Choose at least 3 mutually distant features
3. Train a linear SVM classifier using those features and test it on a validation set
4. Do we reach target performance (h = 99.5%; f = 50%)?
YES: finalize the classifier, remove the used features from S, and go to (2).
NO: add a feature from S and go to (3).

F = ∏_{i=1}^{K} f_i and H = ∏_{i=1}^{K} h_i
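To see how the per-stage rates compose, here is a small numeric check with a hypothetical number of stages K (the per-stage targets h and f are those from the slide; K = 10 is an assumption for illustration):

```python
# Overall cascade rates: F = product of the f_i, H = product of the h_i
K = 10                      # hypothetical number of cascade stages
h, f = 0.995, 0.5           # per-stage hit rate and false positive rate
H = h ** K                  # overall hit rate
F = f ** K                  # overall false positive rate
print(f"H = {H:.4f}, F = {F:.6f}")  # H ≈ 0.9511, F ≈ 0.000977
```

Rejecting half of the sub-windows at each stage drives the overall false positive rate down geometrically, while the per-stage hit rate must stay very high for the overall detection rate to remain usable.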
A PIPELINE FOR FACE AUTHENTICATION
DESTRERO ET AL., 2009
RESULTS
PLAN
Learning and engineering applications: why?
Examples of in-house applications
- Face and object detection
- Medical image analysis
- Microarray data analysis
AUTOMATIC ANNOTATION OF MR IMAGES: SYNOVITIS ASSESSMENT
BASSO ET AL., 2010
Setting: children under 16 affected by Juvenile Idiopathic Arthritis
Goal: to measure the volume of the inflamed synovia in 3D MR images
Our problem: to classify each voxel of the MR volume
The approach is supervised; for training we use the manual annotations performed by experts
VOXEL-BASED IMAGE DESCRIPTION
Each voxel is represented with a set of cues chosen among the ones commonly used for voxel classification
They include the intensity of the voxel and its neighbors, the position of the voxel, the multiscale 2-jets, and the vesselness measures

x → φ(x) = (ϕ1, . . . , ϕk)
MULTI-CUE VOXEL CLASSIFIER
THE DISCRIMINANT FUNCTION
We look for a more flexible discriminant function
f(φ) = Σ_{(i,j)∈I} α_i^j K_i^j(φ) + b

ASSUMPTION

The k × n basis functions

K_i^j(φ) = exp( −‖ϕ^j − ϕ_i^j‖² / (2σ²) )

measure the similarity of φ with an example voxel i with respect to a specific cue j
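A sketch of these per-cue Gaussian basis functions (the cue values are synthetic and each cue is a scalar for simplicity; the names are illustrative):

```python
import numpy as np

def multicue_basis(phi, train_cues, sigma=1.0):
    """K[i, j] = exp(-||phi^j - phi_i^j||^2 / (2 sigma^2)): similarity of the
    voxel description phi (shape (k,)) to training voxel i with respect to
    cue j; train_cues has shape (n, k), so K has one entry per (i, j)."""
    d2 = (train_cues - phi) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

phi = np.array([0.2, 1.0, -0.5])      # one voxel described by k = 3 cues
train_cues = np.zeros((4, 3))         # n = 4 training voxels
K = multicue_basis(phi, train_cues)   # shape (4, 3)
```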
MULTI-CUE VOXEL CLASSIFIER
MODEL SELECTION
The optimal subset I of basis functions, on which f depends, may be inferred from the data by means of feature selection.
Starting from a manually annotated training set of n voxels we compute the n × kn matrix K

K = (K^1, . . . , K^k)

and look for a sparse vector α so that

y = Kα
MULTI-CUE VOXEL CLASSIFIER
LEARNING ALGORITHM
The goal of learning is to find the optimal affine combination defined by the coefficients α_i^j and b. This is achieved with L2 regularization on the restricted matrix K̂
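The two steps (sparse selection of basis functions via y = Kα, then L2-regularized refitting on the restricted matrix K̂) can be sketched as below; Lasso and Ridge are stand-ins for the specific solvers used in the work, and the kernel matrix is synthetic:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, kn = 150, 400                          # n voxels, k*n basis functions
K = rng.standard_normal((n, kn))          # stand-in for the multi-cue matrix
alpha_true = np.zeros(kn)
alpha_true[:6] = 1.0
y = K @ alpha_true + 0.1 * rng.standard_normal(n)

# Step 1: feature selection -> sparse alpha in y = K alpha
support = np.flatnonzero(Lasso(alpha=0.1).fit(K, y).coef_)

# Step 2: L2 regularization on the restricted matrix K_hat
K_hat = K[:, support]
refit = Ridge(alpha=1.0).fit(K_hat, y)    # coefficients alpha and offset b
```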
RESULTS
The multi-cue classifier is 15 times sparser than SVM and approximately 40 times faster.
PLAN
Learning and engineering applications: why?
Examples of in-house applications
- Face and object detection
- Medical image analysis
- Microarray data analysis
MACHINE LEARNING AND THE ANALYSIS OF MICROARRAYS
GOALS
Design methods able to identify a gene signature, i.e., a panel of genes potentially interesting for further screening
Learn the gene signatures, i.e., select the most discriminant subset of genes on the available data
MACHINE LEARNING AND THE ANALYSIS OF MICROARRAYS
A TYPICAL "-OMICS" SCENARIO
High dimensional data - few samples per class
tens of samples - tens of thousands of genes
→ Variable selection
High risk of selection bias:
data distortion arising from the way the data are collected, due to the small amount of data available
→ Model assessment needed
Possibly find ways to incorporate prior knowledge
Deal with data visualization
GENE SELECTION
THE PROBLEM
Select a small subset of input variables (genes) which are used for building classifiers

ADVANTAGES:
- it is cheaper to measure fewer variables
- the resulting classifier is simpler and potentially faster
- prediction accuracy may improve by discarding irrelevant variables
- identifying relevant variables gives useful information about the nature of the corresponding classification problem (biomarker detection)
VARIABLE SELECTION IN BIOINFORMATICS
MOTIVATIONS
Ease computational burden:
discard the (apparently) less significant features and train in a simplified space: alleviate the curse of dimensionality
Enhance information:
highlight (and rank) the most important features and improve the knowledge of the underlying process.
COMMONLY ADOPTED METHODS
Statistical filters (t-test, S/N ratio, ...)
Learning techniques (embedded methods, wrapper methods, stepwise feature elimination, ...)
Mapping methods ("metagenes": simplified models for pathways, even though biological suggestions require caution)
STATISTICAL FILTERS
These approaches are well established in the gene selection literature. One considers the various measurements associated to each gene (a column of the data matrix X)
T TEST
For each column of X we compute

t = (μ1 − μ2) / √(σ1²/n1 + σ2²/n2)

where subscripts 1 and 2 stand for positive and negative examples
Genes are ranked with respect to the t value
A threshold is set to perform gene selection
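A minimal numpy implementation of this filter (the data are synthetic; one gene is given an artificial mean shift so the ranking has something to find):

```python
import numpy as np

def t_scores(X, y):
    """Per-gene t statistic: (mu1 - mu2) / sqrt(s1^2/n1 + s2^2/n2),
    where 1 and 2 index the positive and negative samples."""
    pos, neg = X[y == 1], X[y == -1]
    num = pos.mean(axis=0) - neg.mean(axis=0)
    den = np.sqrt(pos.var(axis=0, ddof=1) / len(pos)
                  + neg.var(axis=0, ddof=1) / len(neg))
    return num / den

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 100))        # 40 samples, 100 genes
y = np.repeat([1, -1], 20)
X[y == 1, 0] += 3.0                       # gene 0 is made discriminant
ranking = np.argsort(-np.abs(t_scores(X, y)))   # genes ranked by |t|
```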
GENE SELECTION WITH L1-L2 REGULARIZATION
MOSCI ET AL., 2008
min_{β ∈ IR^p} ‖Y − Xβ‖² + τ(‖β‖₁ + ε‖β‖₂²)
Consistency guaranteed - the more samples available, the better the estimator
Multivariate - it takes into account many genes at once
OUTPUT
a one-parameter (ε) family of nested lists with equivalent prediction ability and increasing correlation among genes
ε → 0: minimal list of prototype genes
ε1 < ε2 < ε3 < . . .: longer lists including correlated genes
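The ε-controlled trade-off can be sketched with scikit-learn's ElasticNet (its (alpha, l1_ratio) parametrization differs from the (τ, ε) above, so this only illustrates the grouping behavior, on synthetic data with one pair of highly correlated genes):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 50, 300                            # few samples, many genes
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(n)   # gene 1 ~ gene 0
y = X[:, 0] + 0.1 * rng.standard_normal(n)

# A stronger l2 term (lower l1_ratio) plays the role of a larger epsilon:
# it tends to bring in correlated genes rather than a single prototype.
counts = {}
for l1_ratio in (0.99, 0.5):
    coef = ElasticNet(alpha=0.1, l1_ratio=l1_ratio, max_iter=10000).fit(X, y).coef_
    counts[l1_ratio] = int(np.count_nonzero(coef))
print(counts)
```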
DOUBLE OPTIMIZATION APPROACH
MOSCI ET AL., 2008
VARIABLE SELECTION + CLASSIFICATION:
Variable selection step (L1-L2)
min_{β ∈ IR^p} ‖Y − Xβ‖² + τ(‖β‖₁ + ε‖β‖₂²)

Classification step (RLS)

min_{β ∈ IR^p} ‖Y − Xβ‖₂² + λ‖β‖₂²

for each ε we have to choose λ and τ
A SELECTION BIAS AWARE FRAMEWORK
BARLA ET AL., 2008
λ → (λ1, . . . , λA)
τ → (τ1, . . . , τB)
the optimal pair (λ∗, τ∗) is one of the possible A · B pairs (λ, τ)
ALGORITHMIC AND COMPUTATIONAL ISSUES
FROM MANY LISTS TO ONE FINAL LIST
Criterion based on frequency, i.e., occurrences of a gene across all the lists
Since we have a correlation parameter, we can tune and vary the list length
FROM 1 WEEK COMPUTATION TO...?
Computational time for LOO (for one task):
time_1-optim = 2.5 s to 25 s, depending on the correlation parameter
total time = A · B · N_samples · time_1-optim ∼ 20 · 20 · 30 · time_1-optim ∼ 3 · 10^4 s to 3 · 10^5 s
6 tasks → 1 week!!
COMPUTATION OVER A GRID
Grid middleware: OurGrid, a multiplatform grid that can deal with hosts not directly connected to the Internet.
Used by the ShareGrid project, which involves several universities in Northern Italy.
Cheap solution: 60 PCs (student labs)
GENE SELECTION WITH L1-L2 REGULARIZATION
DE MOL, MOSCI, TRASKINE, VERRI, 2008
FINDING STRUCTURED GENE SIGNATURES
How do we estimate groups of correlated genes?
We may rely on the nested structure obtained by varying the correlation parameter
We consider the minimal list list0 as a starting point of an agglomerative clustering technique, based on the Pearson distance:

d(Xi, Xj) = corr(Xi, Xj) / √(var(Xi) var(Xj))

evaluating the normalized correlation between two columns Xi and Xj of the data matrix X
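A sketch of this step using scipy's hierarchical clustering, where the built-in 'correlation' metric (1 − Pearson correlation) serves as the Pearson-style distance; the gene matrix is synthetic, with pairs of correlated columns:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 30
base = rng.standard_normal((n, 3))              # 3 prototype expression profiles
# 6 genes: each prototype duplicated with small noise -> correlated pairs
X = np.hstack([base + 0.05 * rng.standard_normal((n, 3)) for _ in range(2)])

D = pdist(X.T, metric="correlation")            # 1 - corr between columns
Z = linkage(D, method="average")                # agglomerative clustering
labels = fcluster(Z, t=0.5, criterion="distance")
```

Columns built from the same prototype end up in the same cluster, which is the behavior the slide exploits to grow a minimal list into structured groups of correlated genes.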
AN EXAMPLE APPLICATION
IDENTIFYING THE HYPOXIA SIGNATURE OF NEUROBLASTOMA VIA REGULARIZATION
joint research with IGG Molecular Biology lab
Dataset: 9 neuroblastoma (NB) cell lines cultured under normoxic and hypoxic conditions. Technology: Affymetrix GeneChip U133 plus 2.0.
t-test: no genes selected!
l1l2 protocol: 11 genes for the minimal list (frequency > 30%)
REFERENCES
A. Destrero, C. De Mol, F. Odone, A. Verri. "A Regularized Framework for Feature Selection in Face Detection and Authentication". IJCV (2009).
A. Destrero, C. De Mol, F. Odone, A. Verri. "A Sparsity-Enforcing Method for Learning Face Features". IEEE Transactions on Image Processing 18 (2009): 188-201.
C. Basso, M. Santoro, A. Verri, M. Esposito. "Segmentation of Inflamed Synovia in Multi-Modal MRI". In Proc. of IEEE ISBI 2009, June 28 - July 1, 2009.
P. Fardin, A. Cornero, A. Barla, S. Mosci, M. Acquaviva, L. Rosasco, C. Gambini, A. Verri, L. Varesio. "Identification of multiple hypoxia signatures in neuroblastoma cell lines by l1-l2 regularization and data reduction". Journal of Biomedicine and Biotechnology (2010).
A. Barla, S. Mosci, L. Rosasco, A. Verri. "A Method for Robust Variable Selection with Significance Assessment". In Proc. of ESANN, European Symposium on Artificial Neural Networks (2008).
C. De Mol, S. Mosci, M. Traskine, A. Verri. "A Regularized Method for Selecting Nested Groups of Relevant Genes from Microarray Data". Journal of Computational Biology (2008).