Using multi-block analysis to select informative variables


Page 1: Using multi-block analysis to select informative variables (iml.univ-mrs.fr/sta/SMPGD2010/slides/SMPGD10-Rutledge-slide.pdf)

Douglas N. Rutledge

Laboratoire de Chimie Analytique, AgroParisTech

16, rue Claude Bernard, 75005 Paris, France

[email protected]


Using multi-block analysis

to select informative variables

Page 2

Starting point

• Huge data sets
• Variable selection
• Multiple data sets

Page 3

Variable Selection

[1] V. Centner, D. L. Massart, O. E. de Noord, S. de Jong, B. M. Vandeginste, C. Sterna, Elimination of uninformative variables for multivariate calibration. Analytical Chemistry 1996, 68, 3851-3858.
[2] A.S. Bangalore, R.E. Shaffer, G.W. Small, M.A. Arnold, Genetic algorithm-based method for selecting wavelengths and model size for use with partial least-squares regression: Application to near-infrared spectroscopy. Analytical Chemistry 1996, 68, 4200-4212.
[3] L. Norgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen, Interval partial least-squares regression (iPLS): A comparative chemometric study with an example from near-infrared spectroscopy. Applied Spectroscopy 2000, 54, 413-419.

The quality of multivariate predictive models is increased by eliminating uninformative variables.

For discriminant models, pp-ANOVA is often used:
- test each variable separately
- does it vary more between groups than within groups?

For regression analysis, there are many methods:
- Uninformative Variable Elimination-PLS [1]
- Genetic Algorithm-PLS [2]
- iPLS [3] ...
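As a minimal sketch of the variable-by-variable test described for discriminant models, the following computes a one-way ANOVA F statistic separately for each column. The function name `anova_f_per_variable` and the simulated data are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def anova_f_per_variable(X, groups):
    """One-way ANOVA F statistic for each column of X, each tested on its own."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    n, k = X.shape[0], len(labels)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for g in labels:
        Xg = X[groups == g]
        mean_g = Xg.mean(axis=0)
        # Between-group variation: group means around the grand mean
        ss_between += len(Xg) * (mean_g - grand_mean) ** 2
        # Within-group variation: samples around their own group mean
        ss_within += ((Xg - mean_g) ** 2).sum(axis=0)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Simulated example (hypothetical data): only variable 2 differs between groups
rng = np.random.default_rng(0)
groups = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 5))
X[:, 2] += 3.0 * groups
F = anova_f_per_variable(X, groups)
print(F.round(2))
```

A large F value flags a variable that varies more between groups than within groups; each variable is judged in isolation, which is what makes the approach univariate.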

Page 4

pp-ANOVA and iPLS

The most commonly used methods :

• pp-ANOVA is intrinsically UNIVARIATE

• iPLS applies PLS regression to ISOLATED BLOCKS

• And then there is the multiple-testing problem !

SO - better to use an intrinsically MULTIVARIATE, MULTIBLOCK method
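A rough worked example of the multiple-testing problem, assuming 200 independent univariate tests at a 5% significance level. Both numbers and the independence assumption are illustrative and do not come from the slides.

```python
# Illustrative numbers only (assumptions, not taken from the slides)
n_tests = 200   # one test per variable
alpha = 0.05    # significance level of each individual test

# Expected number of variables flagged by chance alone when none is informative
expected_false_positives = n_tests * alpha

# Probability of at least one false positive, assuming independent tests
p_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni correction: per-test threshold that keeps the overall rate near alpha
bonferroni_alpha = alpha / n_tests

print(expected_false_positives, p_at_least_one, bonferroni_alpha)
```

With these assumptions, about ten variables would look "significant" purely by chance, which is one motivation for a method that treats the variables jointly.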

Page 5

Multi-block analysis

[4] E. Qannari, I. Wakeling, P. Courcoux, H.J.H. MacFie, Defining the underlying sensory dimensions. Food Quality and Preference 2000, 11, 151-154.

"Common Components and Specific Weights Analysis" - CCSWA [4]

Simultaneously study several matrices
- with different variables describing the same samples

Describe m data tables observed for the same n samples:
- a set of m data matrices (X), each with n rows,
- but not necessarily the same number of columns

Determine a common space for all m data tables:
- each matrix has a specific contribution ("salience") to the definition of each dimension of this common space

Page 6

Multi-block analysis

Start with p matrices Xi of size n × ki (i = 1 to p)

Each Xi is column-centered and scaled by dividing by its matrix norm, giving Xsi

For each Xsi, an n × n scalar product matrix Wi can be computed as:

$W_i = X_{si} X_{si}^{T}$

Wi reflects the dispersion of the samples in the space of that table

The common dimensions of all the tables are computed iteratively. At each iteration, a weighted sum of the p Wi matrices is computed, resulting in a global WG matrix
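The preprocessing and scalar product step can be sketched as follows. The sketch assumes that "matrix norm" means the Frobenius norm; the function name and the random blocks are hypothetical.

```python
import numpy as np

def scalar_product_matrix(X):
    """Column-centre X, scale it to unit norm, and return W = Xs Xs^T (n x n)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)          # column centring
    Xs = Xc / np.linalg.norm(Xc)     # Frobenius norm (assumption)
    return Xs @ Xs.T

# Two hypothetical blocks with the same 6 samples but different widths and scales
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(6, 3)), 100 * rng.normal(size=(6, 10))]
W = [scalar_product_matrix(X) for X in blocks]

# Global matrix at the first iteration, with all weights set to 1
weights = np.ones(len(W))
WG = sum(lam * Wi for lam, Wi in zip(weights, W))
print(WG.shape, np.trace(W[0]), np.trace(W[1]))
```

After norm scaling every Wi has trace 1, so a wide or high-variance table does not dominate the weighted sum simply because of its size.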

Page 7

Multi-block analysis

For each successive Common Dimension, calculate a scores vector q (the coordinates of the n samples along the common dimension):

$$W_i = \sum_{j=1}^{n} \lambda_j^{(i)} q_j q_j^{T}$$

$\lambda_j^{(i)}$ is the specific weight ("salience") associated with the ith table for the jth Common Dimension generated by $q_j$

Differences in the values of the specific weights for a dimension:
- information present in some tables but not others

Subsequent components are calculated after deflating the data tables
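The iterative procedure can be sketched as below. This is a minimal illustration under stated assumptions (weights initialised to 1, q taken as the leading eigenvector of WG, saliences updated as q^T Wi q, deflation by projecting q out of each table); it is not the SAISIR implementation, and the function name `comdim` and the simulated blocks are hypothetical.

```python
import numpy as np

def comdim(blocks, n_dims=2, tol=1e-10, max_iter=500):
    """Sketch of common-dimension extraction: returns scores Q and saliences."""
    Xs = []
    for X in blocks:
        Xc = np.asarray(X, dtype=float)
        Xc = Xc - Xc.mean(axis=0)
        Xs.append(Xc / np.linalg.norm(Xc))       # unit Frobenius norm
    n = Xs[0].shape[0]
    Q = np.zeros((n, n_dims))
    saliences = np.zeros((len(Xs), n_dims))
    for j in range(n_dims):
        W = [X @ X.T for X in Xs]                # scalar product matrices
        lam = np.ones(len(W))                    # initial weights
        for _ in range(max_iter):
            WG = sum(l * Wi for l, Wi in zip(lam, W))
            _, eigvecs = np.linalg.eigh(WG)      # eigenvalues in ascending order
            q = eigvecs[:, -1]                   # leading unit-norm eigenvector
            new_lam = np.array([q @ Wi @ q for Wi in W])
            converged = np.linalg.norm(new_lam - lam) < tol
            lam = new_lam
            if converged:
                break
        Q[:, j] = q
        saliences[:, j] = lam
        # Deflation: remove the direction q from every table
        Xs = [X - np.outer(q, q @ X) for X in Xs]
    return Q, saliences

# Hypothetical data: blocks 0 and 1 share one latent variable, block 2 is noise
rng = np.random.default_rng(0)
t = rng.normal(size=30)
blocks = [
    np.outer(t, np.ones(5)) + 0.1 * rng.normal(size=(30, 5)),
    np.outer(t, np.ones(8)) + 0.1 * rng.normal(size=(30, 8)),
    rng.normal(size=(30, 6)),
]
Q, saliences = comdim(blocks, n_dims=2)
print(saliences.round(3))
```

In this simulated case the first common dimension should carry high saliences for the two related blocks and a low salience for the noise block, which is the pattern the slides use to locate informative blocks.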

Page 8

Classical ComDim

"ComDim", the implementation of CCSWA used here, is part of the SAISIR toolbox.

SAISIR (2008): Statistics Applied to the Interpretation of Spectra in the InfraRed

Dominique Bertrand ([email protected])

Page 9

PLS-ComDim

Page 10

1) NIR on apples

Samples

• 2 Varieties: Cox, Jonagold

• 2 Faces: Red, Green

• 3 Maturity levels: fresh, ripe, over-ripe

• 8 different apples

Spectra

• 94 x 200 points

Tables

• 50 blocks of 4 variables

• 6 Common Dimensions
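One way to build such blocks, sketched with a placeholder matrix of the stated size (94 x 200). The slides give only the block count and width, so the use of contiguous, equal-width blocks is an assumption here.

```python
import numpy as np

# Placeholder for the 94 x 200 spectral matrix; the values are just column-indexed
# integers so that the block boundaries are easy to check.
spectra = np.arange(94 * 200).reshape(94, 200)

# Split the variables into 50 contiguous blocks of 4 columns each (assumption)
blocks = np.split(spectra, 50, axis=1)

print(len(blocks), blocks[0].shape)
# 0-based column indices covered by the 8th block (first row holds the indices)
print(blocks[7][0])
```

Each block can then be passed to a multi-block method as a separate table describing the same 94 spectra.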

Page 11

NIR Spectra

[Figure panels: Spectra; ComDim Saliences]

Page 12

Correlation between ComDim Scores and "Face"

Page 13

Correlation between PLS-ComDim Scores and "Face"

Page 14

i-PLS between NIR and "Face"

[Figure: RMSECV versus interval number (1 to 50). Dotted line is RMSECV (4 LVs) for global model; italic numbers are optimal LVs in interval model.]

Optimal LVs per interval (1 to 50): 3 2 3 4 2 2 3 3 4 4 4 1 4 2 1 1 2 1 2 2 2 4 2 4 2 3 2 3 2 1 3 1 1 2 3 1 2 2 3 2 2 1 1 2 1 1 1 1 1 1

- Blocks = 50
- Mean centred
- Max LVs = 4
- CV = Full

[Figure: Scores on LV1 for Block 8]
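The interval-wise idea can be sketched as follows: one model per block of variables, each scored by its leave-one-out RMSECV. To keep the sketch self-contained, ordinary least squares is used in place of PLS, which corresponds only to the case where every latent variable of a small block is kept. The function name and the simulated data are hypothetical.

```python
import numpy as np

def interval_rmsecv(X, y, n_intervals):
    """Leave-one-out RMSECV of a separate least-squares model for each interval."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    results = []
    for cols in np.array_split(np.arange(X.shape[1]), n_intervals):
        sq_err = 0.0
        for i in range(n):
            train = np.arange(n) != i                  # leave sample i out
            Xt, yt = X[train][:, cols], y[train]
            x_mean, y_mean = Xt.mean(axis=0), yt.mean()  # mean centring
            coef, *_ = np.linalg.lstsq(Xt - x_mean, yt - y_mean, rcond=None)
            pred = (X[i, cols] - x_mean) @ coef + y_mean
            sq_err += (y[i] - pred) ** 2
        results.append(np.sqrt(sq_err / n))
    return np.array(results)

# Hypothetical data: the response depends only on column 9 (third interval of five)
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 20))
y = X[:, 9] + 0.05 * rng.normal(size=30)
rmsecv = interval_rmsecv(X, y, n_intervals=5)
print(rmsecv.round(3))
```

The interval with the lowest RMSECV is the one reported as most informative; each interval is modelled in isolation, without reference to the others.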

Page 15

Correlation between ComDim Scores and "Maturity"

Page 16

Correlation between PLS-ComDim Scores and "Maturity"

Page 17

i-PLS between NIR and "Maturity"

[Figure: RMSECV versus interval number (1 to 50). Dotted line is RMSECV (4 LVs) for global model; italic numbers are optimal LVs in interval model.]

Optimal LVs per interval (1 to 50): 1 1 3 2 3 2 2 3 3 3 3 3 3 2 3 1 1 3 2 2 3 4 1 1 2 2 3 1 1 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1

- Blocks = 50
- Mean centred
- Max LVs = 4
- CV = Full

[Figure: Scores on LV1 for Block 10]

Page 18

Correlation between ComDim Scores and "Variety"

Page 19

Correlation between PLS-ComDim Scores and "Variety"

Page 20

i-PLS between NIR and "Variety"

[Figure: RMSECV versus interval number (1 to 50). Dotted line is RMSECV (4 LVs) for global model; italic numbers are optimal LVs in interval model.]

Optimal LVs per interval (1 to 50): 3 2 2 2 3 3 4 3 1 3 2 2 4 4 4 3 3 4 4 2 4 1 4 3 3 3 3 1 2 3 2 2 3 3 3 3 2 3 2 2 2 4 3 3 1 2 2 4 2 2

- Blocks = 50
- Mean centred
- Max LVs = 4
- CV = Full

[Figure: Scores on LV1 for Block 33]

Page 21

Genomics data

[Figure: one panel per chromosome, from Chromosome 1 (x-axis up to 4.5 x 10^4) to Chromosome 22 (x-axis up to 9000)]

Page 22

Genomics data

Samples

• 940 healthy individuals

• 7 regions

• 53 ethnic groups

Variables

• 644,138 Single Nucleotide Polymorphisms (SNPs) / 22 chromosomes

Pretreatment

• PCT on each chromosome = 22 x (940 x 940) → 22 x (940 x 100)

Analysis

• ComDim on the set of 22 PCT blocks

[5] H. M. Cann, C. de Toma, L. Cazes, M-F. Legrand, V. Morel, L. Piouffre, J. Bodmer, et al. (2002), A human genome diversity cell line panel. Science 296, 261-262

[6] E. Génin - Inserm UMR 946 - Univ Paris Diderot
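The pretreatment step can be sketched as below, assuming that PCT denotes a principal component transform that replaces a wide block by its leading score vectors. Working from the n x n scalar product matrix avoids decomposing the full n x k table when k is much larger than n. The function name and the random block are hypothetical.

```python
import numpy as np

def pct_scores(X, n_components):
    """PCA scores of a wide block, computed from its n x n Gram matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    G = Xc @ Xc.T                                   # n x n, cheap when k >> n
    eigvals, eigvecs = np.linalg.eigh(G)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    vals = np.clip(eigvals[order], 0.0, None)       # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(vals)        # scores T = U * S

# Hypothetical wide block: 20 samples, 500 variables, reduced to 10 score columns
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))
T = pct_scores(X, n_components=10)
print(T.shape)
```

If all components are kept, the scores reproduce the sample-to-sample scalar products of the original block exactly, so nothing is lost as far as methods based on those scalar products are concerned.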

Page 23

Segmented PCT on Obese data sets

[Diagram: X is split into segments X1, X2, ..., Xq. PCA on each segment gives scores T1, T2, ..., Tq and loadings P1, P2, ..., Pq. The concatenated scores [T1 T2 ... Tq] are analysed by PCA/PLS..., giving TPCT and PPCT, from which the loadings PX1, PX2, ..., PXq of the original segments are obtained.]

$T_X = T_{PCT}$

$P_{X_1}^{T} = (T_{PCT}^{T} T_{PCT})^{-1} T_{PCT}^{T} X_1$
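A minimal sketch of the segmented scheme, assuming equal-width segments and keeping all components by default. The loadings are recovered with a least-squares solve rather than an explicit inverse, because the final score matrix can be rank deficient. The function names and the data are hypothetical.

```python
import numpy as np

def pca_scores(X, n_components=None):
    """PCA scores T = U * S from a thin SVD of X (X assumed already centred)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    T = U * s
    return T if n_components is None else T[:, :n_components]

def segmented_pct(X, segment_width, n_components=None):
    """PCA per segment, then PCA on the concatenated segment scores."""
    Xc = np.asarray(X, dtype=float)
    Xc = Xc - Xc.mean(axis=0)
    segments = [Xc[:, i:i + segment_width]
                for i in range(0, Xc.shape[1], segment_width)]
    T_concat = np.hstack([pca_scores(S, n_components) for S in segments])
    T_pct = pca_scores(T_concat)
    # Loadings in the original variable space: solve T_pct @ P_t ~= Xc
    P_t, *_ = np.linalg.lstsq(T_pct, Xc, rcond=None)
    return T_pct, P_t

# Hypothetical data: 15 samples, 40 variables, segments of 10 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(15, 40))
T_pct, P_t = segmented_pct(X, segment_width=10)
print(T_pct.shape, P_t.shape)
```

When no components are discarded at the segment level, the concatenated scores have the same sample-to-sample scalar products as the original data, so the final scores match those of a direct PCA up to sign.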

Page 24

Combined PCs

[Diagram: spaces X1, X2 and X3, each with its own component (t1 = X1u1, t2 = X2u2, t3 = X3u3), together with the SegPCT component tPCT]

Page 25

SegPCT-PCA versus PCA

PCA and SegPCT-PCA (SW=2048, 10 PCT-PCs) memory allocation profiles

- SegPCT-PCA: 21 Mbytes, 1207 s (~20 min.)
- PCA: 230 Mbytes, 14819 s (~250 min.)

With PCT: faster and requires less memory

Page 26

PCA & SegPCT-PCA Loadings

[Figure: loadings plot; y-axis from -0.002 to 0.018, x-axis from 500 to 5540]

With PCT: identical results

Page 27

Multivariate Analysis of all SNPs

[7] J. Li et al., Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation, Science, 319(5866), 1100-1104

Page 28

PCT-ComDim

Saliences of chromosome blocks

Page 29

PCT-ComDim

CC1 Scores

[Score plot; group labels: Africa, Mid-East, Europe, CS_Asia, America, E_Asia, Oceania]

Page 30

PCT-ComDim

CC2 Scores

[Score plot; group labels: Africa, Mid-East, Europe, CS_Asia, America, E_Asia, Oceania]

Page 31

PCT-ComDim

CC3 Scores

[Score plot; group labels: America, E_Asia, Oceania]

Page 32

PCT-ComDim

CC4 Scores

[Score plot; group label: Oceania]

Page 33

PCT-ComDim

CC5 Scores

[Score plot; population labels: Bantu_S, BiakaPygmies, Mandenka, MbutiPygmies, San, Yoruba, Bantu_N]

Page 34

Correlation between ComDim Scores and European populations

Page 35

PCT-ComDim

Saliences of chromosome blocks

Page 36

PCT-ComDim

[Score plot of European populations; labels: French, Tuscan & Italian, Sardinian, Orcadian, Adygei, FrenchBasque, Russian; directions annotated East, West, South, North]

Page 37

Good & Bad news

[Figure panels: PCT-ComDim; PCA. Population labels: French, Tuscan & Italian, Sardinian, Orcadian, Adygei, FrenchBasque, Russian]

[7] J. Li et al., Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation, Science, 319(5866), 1100-1104

Page 38

With segmented PCT :

• analyse data sets of any width

• quickly

• using less memory

With multi-block analysis :

• detect groups of interesting variables ("salience")

• visualise relations among samples ("scores")

Page 39

First African-European Conference on

Chemometrics

Mining School of Rabat

Morocco, 20th to 24th of September 2010

www.afrodata.org