
Introductory Three-Mode Analysis

Pieter M. Kroonenberg

Last Revised May 11, 2010

In this short paper an overview is presented of the major component models for three-mode analysis. The format of the paper is aimed at presenting the content in courses on three-mode analysis, hence the large typeface and the limited amount of commentary, which is supposed to be delivered by the speaker.


THE TUCKER3 MODEL

Tucker3 – Sum notation

$$x_{ijk} = \sum_{p=1}^{P}\sum_{q=1}^{Q}\sum_{r=1}^{R} g_{pqr}\,(a_{ip}b_{jq}c_{kr}) + e_{ijk} = \hat{x}_{ijk} + e_{ijk}, \qquad i = 1,\ldots,I;\; j = 1,\ldots,J;\; k = 1,\ldots,K$$

Figure 1: Tucker3 Model
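As a small numerical illustration (not part of the original paper), the triple sum above can be evaluated directly; all array names and sizes below are made up for the example.

```python
# Minimal sketch of the Tucker3 structural part:
#   x_hat[i,j,k] = sum_{p,q,r} g[p,q,r] * a[i,p] * b[j,q] * c[k,r]
import numpy as np

I, J, K = 5, 4, 3                # sizes of the three modes (illustrative)
P, Q, R = 2, 2, 2                # numbers of components (illustrative)

rng = np.random.default_rng(0)
A = rng.normal(size=(I, P))      # component matrix of the first mode
B = rng.normal(size=(J, Q))      # component matrix of the second mode
C = rng.normal(size=(K, R))      # component matrix of the third mode
G = rng.normal(size=(P, Q, R))   # core array

# The triple sum in a single einsum call
X_hat = np.einsum('ip,jq,kr,pqr->ijk', A, B, C, G)

# A data array would then be X = X_hat + E, with E holding the residuals e_ijk
```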


Tucker3 - Matrix formulations

Definition. (Right) Kronecker product: $\mathbf{A} \otimes \mathbf{B} = \{a_{ip}\mathbf{B}\}$.

$$(\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_K) = \mathbf{A}\,(\mathbf{G}_1,\ldots,\mathbf{G}_R)\,(\mathbf{C}'\otimes\mathbf{B}') + (\mathbf{E}_1,\mathbf{E}_2,\ldots,\mathbf{E}_K)$$
$$= \mathbf{A}\,(\mathbf{G}_1,\ldots,\mathbf{G}_R)\left(\begin{pmatrix} c_{11} & \cdots & c_{K1} \\ \vdots & \ddots & \vdots \\ c_{1R} & \cdots & c_{KR} \end{pmatrix}\otimes\mathbf{B}'\right) + (\mathbf{E}_1,\mathbf{E}_2,\ldots,\mathbf{E}_K)$$
$$= \mathbf{A}\,(\mathbf{G}_1,\ldots,\mathbf{G}_R)\begin{pmatrix} c_{11}\mathbf{B}' & \cdots & c_{K1}\mathbf{B}' \\ \vdots & \ddots & \vdots \\ c_{1R}\mathbf{B}' & \cdots & c_{KR}\mathbf{B}' \end{pmatrix} + (\mathbf{E}_1,\mathbf{E}_2,\ldots,\mathbf{E}_K)$$

Matrix notation

$$\mathbf{X}_k = \mathbf{A}\left\{\sum_{r=1}^{R} c_{kr}\,\mathbf{G}_r\right\}\mathbf{B}' + \mathbf{E}_k \qquad (k = 1,\ldots,K)$$

Combination-mode notation

$$\mathbf{X} = \mathbf{A}\,\mathbf{G}\,(\mathbf{C}'\otimes\mathbf{B}') + \mathbf{E}$$

X and E are two-way arrays of order $I \times JK$, where the second way with dimension $JK$ is a combination way (or mode) in which the index $j$ moves faster than (is nested within) the index $k$. Similarly, G is a two-way array of order $P \times QR$.

Tensor notation

$$\mathbf{X} = \sum_{p=1}^{P}\sum_{q=1}^{Q}\sum_{r=1}^{R} g_{pqr}\,(\mathbf{a}_p \otimes \mathbf{b}_q \otimes \mathbf{c}_r) + \mathbf{E}$$

X is a three-way array which is approximated by a sum of $PQR$ three-way arrays constructed as $(\mathbf{a}_p \otimes \mathbf{b}_q \otimes \mathbf{c}_r)$, each weighted by its $g_{pqr}$.
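A brief numerical check (not from the paper) that the combination-mode formula and the triple sum agree; the unfolding convention (j nested within k, q nested within r) follows the text, and all names are illustrative.

```python
# Minimal sketch: the combination-mode formulation X = A G (C' (x) B')
import numpy as np

I, J, K, P, Q, R = 5, 4, 3, 2, 2, 2                    # illustrative sizes
rng = np.random.default_rng(0)
A, B, C = (rng.normal(size=s) for s in [(I, P), (J, Q), (K, R)])
G = rng.normal(size=(P, Q, R))

X_hat = np.einsum('ip,jq,kr,pqr->ijk', A, B, C, G)     # structural part (triple sum)

# Unfold X_hat to I x JK with j running fastest (nested within k),
# and matricise the core to P x QR with q running fastest (nested within r)
X_unf = X_hat.transpose(0, 2, 1).reshape(I, J * K)
G_unf = G.transpose(0, 2, 1).reshape(P, Q * R)

# The (right) Kronecker product C' (x) B' is numpy.kron(C.T, B.T)
print(np.allclose(X_unf, A @ G_unf @ np.kron(C.T, B.T)))   # True
```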


Parameter estimation

Loss function. Define a loss function (analogous to regression)

$$\mathcal{F}(\mathbf{A},\mathbf{B},\mathbf{C},\mathbf{G}) = \sum_{i,j,k}\,(x_{ijk} - \hat{x}_{ijk})^2 = \sum_{i,j,k} e_{ijk}^2$$

and search for those A, B, C, and G which minimise $\mathcal{F}$.

Restrictions. To solve this we have to impose (temporarily) restrictions on the component matrices A, B, and C, and the core array G.

1. The most common restriction is that A, B, and C are orthonormal.

2. Restrictions on the core array G in combination with orthonormality of two of the three component matrices A, B, C are also possible.

These restrictions can be imposed without loss of generality.

Sum of squares partitioning. Just as for other least squares problems we may write

$$\sum_{i,j,k} e_{ijk}^2 = \sum_{i,j,k}(x_{ijk} - \hat{x}_{ijk})^2 = \sum_{i,j,k} x_{ijk}^2 - \sum_{i,j,k} \hat{x}_{ijk}^2$$

or SS(Res) = SS(Tot) − SS(Fit), or SS(Tot) = SS(Fit) + SS(Res).

Explained variability. When scores are in deviation from an appropriate mean, then

$$R^2(\text{data},\text{implied data}) \equiv \frac{\mathrm{SS(Fit)}}{\mathrm{SS(Tot)}}$$

= explained sum of squares, explained variability, standardised SS(Fit).
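A small numerical sketch (not from the paper) of these fit quantities; X and X_hat are illustrative arrays, and the identity SS(Tot) = SS(Fit) + SS(Res) holds exactly only for a least-squares solution.

```python
# Minimal sketch of SS(Tot), SS(Fit), SS(Res) and the explained variability
import numpy as np

rng = np.random.default_rng(1)
A, B, C = rng.normal(size=(5, 2)), rng.normal(size=(4, 2)), rng.normal(size=(3, 2))
G = rng.normal(size=(2, 2, 2))
X_hat = np.einsum('ip,jq,kr,pqr->ijk', A, B, C, G)     # fitted (structural) part
X = X_hat + 0.1 * rng.normal(size=X_hat.shape)         # data = fit + residuals

ss_tot = np.sum(X ** 2)
ss_fit = np.sum(X_hat ** 2)
ss_res = np.sum((X - X_hat) ** 2)

r_squared = ss_fit / ss_tot     # standardised SS(Fit), the explained variability
print(ss_tot, ss_fit, ss_res, r_squared)
```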


Algorithms - 1

Estimating A, B, C and G

Via alternating least squares algorithms in which one of the parameter matrices is estimated while the other ones are held fixed, and each parameter matrix is taken in turn. This process is repeated until convergence.

There are at least five alternating least squares methods for the Tucker models, in two families: the gepcam family and the tuckals family.

• gepcam family

1. Solve the regression equations for $\mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{G}_1, \mathbf{G}_2, \ldots, \mathbf{G}_R$ in turn, holding the other parameter sets fixed; when each set has had a turn, start again until convergence (Weesie & Van Houwelingen, 1983).

• tuckals family

1. Basic tuckals algorithm - Kroonenberg & De Leeuw (1980)

2. Algorithmic improvement using Gram-Schmidt - Kroonenberg, Ten Berge, Kiers, & Brouwer (1989)

3. Algorithmic improvement using block relaxation - Kroonenberg (unpublished)

4. tuckals algorithm for multimode matrices with restrictions on the core array - Murakami (1983; quasi three-mode principal component analysis)

5. tuckals algorithm for multimode matrices and large numbers of observations - Kiers, Kroonenberg, & Ten Berge (1992)


Algorithms - 2

Basic TUCKALS algorithm

• Express G in terms of A, B, and C:

$$\mathbf{G} = \mathbf{A}'\mathbf{X}(\mathbf{C}\otimes\mathbf{B}) \qquad \text{or} \qquad g_{pqr} = \sum_{i,j,k}(a_{ip}b_{jq}c_{kr})\,x_{ijk}$$

• Substitute G into the loss function:

$$\mathcal{F}(\mathbf{A},\mathbf{B},\mathbf{C}) = \sum_{i,j,k} x_{ijk}^2 - \sum_{i,j,k}\sum_{i',j',k'}\sum_{p,q,r}[(a_{ip}b_{jq}c_{kr})\,x_{ijk}][(a_{i'p}b_{j'q}c_{k'r})\,x_{i'j'k'}]$$

or

$$\mathcal{F} = \|\mathbf{X} - \underbrace{\mathbf{A}\mathbf{A}'\mathbf{X}(\mathbf{C}\mathbf{C}'\otimes\mathbf{B}\mathbf{B}')}_{\hat{\mathbf{X}}}\|^2$$

• Minimise $\mathcal{F}$ by maximising its second part:

$$\varphi(\mathbf{A},\mathbf{B},\mathbf{C}) = \sum_{i,j,k} \hat{x}_{ijk}^2$$

• First, we maximise φ with respect to A fixing B and C, then maximise with respect to B fixing C and A, followed by maximising with respect to C fixing A and B, etc., until convergence.
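The sketch below is not the paper's tuckals or gepcam software, only a NumPy illustration of this alternating scheme in its simplest form: each component matrix is updated from the leading singular vectors of the array projected on the other two component matrices, starting from Tucker Method I values (SVD of each unfolding, see the next section) and stopping when φ barely changes. All names and the tolerance are illustrative.

```python
# Minimal alternating least squares sketch for the Tucker3 model
import numpy as np

def tucker3_als(X, P, Q, R, max_iter=100, tol=1e-10):
    I, J, K = X.shape

    def leading_left_sv(M, n):
        # first n left singular vectors of M (orthonormal columns)
        U, _, _ = np.linalg.svd(M, full_matrices=False)
        return U[:, :n]

    # Tucker Method I style start: SVD of each stringed-out (unfolded) array
    A = leading_left_sv(X.reshape(I, J * K), P)
    B = leading_left_sv(X.transpose(1, 0, 2).reshape(J, I * K), Q)
    C = leading_left_sv(X.transpose(2, 0, 1).reshape(K, I * J), R)

    phi_old = -np.inf
    for _ in range(max_iter):
        # maximise phi over one component matrix with the other two held fixed
        A = leading_left_sv(np.einsum('ijk,jq,kr->iqr', X, B, C).reshape(I, Q * R), P)
        B = leading_left_sv(np.einsum('ijk,ip,kr->jpr', X, A, C).reshape(J, P * R), Q)
        C = leading_left_sv(np.einsum('ijk,ip,jq->kpq', X, A, B).reshape(K, P * Q), R)

        G = np.einsum('ijk,ip,jq,kr->pqr', X, A, B, C)   # core: G = A'X(C ⊗ B)
        phi = np.sum(G ** 2)                             # fitted sum of squares
        if phi - phi_old < tol * np.sum(X ** 2):         # simple convergence check
            break
        phi_old = phi
    return A, B, C, G
```

A call such as tucker3_als(X, 2, 2, 2) would return orthonormal component matrices and the corresponding core array; the function name and interface are of course only illustrative.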


Starting values

Figure 2: Starting value for A, B, and C

Tucker Method I provides the start of the algorithm.

First string out the three-way array in three ways and compute the singular value decomposition. Use the right singular vectors as starting values. Do this for each mode.

Convergence

In order to determine whether the algorithm has converged after $\alpha + 1$ iterations, inspect whether

$$\|\mathbf{A}^{\alpha+1} - \mathbf{A}^{\alpha}\| \quad\text{and}\quad \|\mathbf{B}^{\alpha+1} - \mathbf{B}^{\alpha}\| \quad\text{and}\quad \|\mathbf{C}^{\alpha+1} - \mathbf{C}^{\alpha}\| \quad\text{and}\quad |\varphi^{\alpha+1} - \varphi^{\alpha}|$$

are small enough.

Simultaneous solution

By comparing the separate solutions for the three modes, which are used as rational starting solutions for the Tucker3 model, with the fit of the overall or simultaneous solution, one can evaluate to what extent this solution captures the fit of each of the modes.

Figure 3: Fit of the simultaneous solution is the intersection of the fit of the three (Tucker) starting solutions for the modes A, B, and C.


INPUT

Analysing profile data

Profile data.

Profile data are the bread-and-butter data of the social and behavioural sciences (subjects by variables by conditions), and in the case of longitudinal data: subjects by variables by time points.

Preprocessing

Analysis of raw data.

It is generally not a good idea to analyse an arbitrary set of raw (unprocessed) profile data with a three-mode model. The overall solution will generally have a very large fit, and the first components for all of the modes will dominate the other ones. They tend to reflect the differences in means between the levels rather than variances.

Preprocessing in two-mode PCA.

In two-mode PCA, raw data are always standardised, i.e. centred and normalised:

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$

Many means in three-way data

In three-way data there are many more means than in two-way data: one-way marginal means, two-way marginal means, and the overall mean; not all necessarily interpretable.


Preprocessing

Centring of profile data

In general we remove the column means in profile data, i.e. the mean of each variable-time point combination, $\bar{x}_{\cdot jk} = \frac{1}{I}\sum_{i=1}^{I} x_{ijk}$. The process of removing means is called centring.

Figure 4: Preprocessing profile data. (a) Centring per column: the means $\bar{x}_{\cdot jk}$ of the $JK$ columns; (b) Normalisation per slice: the scales $s_j$ of the $J$ variable slices.

Analysis of variance

The means can be further analysed with analysis of variance. A complete model including both additive terms (means) and multiplicative terms (three-mode model) looks like
$$x_{ijk} = \mu_j + \gamma_{jk} + \sum_{p,q,r} g_{pqr}(a_{ip}b_{jq}c_{kr}) + e_{ijk},$$
where $\mu_j$ is the mean of variable $j$ and $\gamma_{jk}$ the $k$th time-point effect for the $j$th variable.

Normalisation of profile data

When the variables have different units of measurement, say kg, mm, cm, millibar, etc., then the variables have to be normalised (or scaled) as well. For profile data this means that the deviation scores $\tilde{x}_{ijk} = x_{ijk} - \bar{x}_{\cdot jk}$ have to be normalised as well. The standard procedure is to divide by the scale of a variable, thus by the root mean square of all observed (centred) values of that variable,
$$s_j = \sqrt{\frac{1}{IK}\sum_{i=1}^{I}\sum_{k=1}^{K} \tilde{x}_{ijk}^2}.$$

Therefore the preprocessed score is equal to
$$z_{ijk} = \frac{x_{ijk} - \bar{x}_{\cdot jk}}{s_j}.$$
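A minimal sketch (not from the paper) of this preprocessing in NumPy; the array X and its sizes are illustrative, with subjects in the first mode, variables in the second, and occasions in the third.

```python
# Minimal sketch: centring per variable-occasion column, normalising per variable slice
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=10.0, scale=2.0, size=(20, 4, 3))   # I subjects x J variables x K occasions

# Centring per column: remove the mean over subjects of every (j, k) combination
X_centred = X - X.mean(axis=0, keepdims=True)

# Normalisation per variable slice: divide by s_j, the root mean square of the
# centred scores of variable j over all subjects and occasions
s_j = np.sqrt((X_centred ** 2).mean(axis=(0, 2), keepdims=True))
Z = X_centred / s_j
```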


Results: Evaluation of solutions

There are several quantities available to assess how well a particular three-mode solution fits the data.

1. Model fit. The most basic question is how well a particular model with $p$, $q$, and $r$ components fits. This information is contained in the standardised fitted sum of squares, SS(Fit)/SS(Tot), which, given the recommended centring, is equal to the explained variability, $R^2$.

2. Comparison between models. This is done with a deviance-df plot or a deviance-number of components plot (see examples later on).

3. Component fit. Due to the orthogonality of the components, the SS(Fit) can be partitioned into separate contributions of the components. This can be done independently for each mode: $\mathrm{SS(Fit)} = \sum_p \mathrm{SS(Fit)}_p = \sum_q \mathrm{SS(Fit)}_q = \sum_r \mathrm{SS(Fit)}_r$.

4. Fit of combinations of components. As there are three sets of components, one is also interested in the contributions of combinations of components to the fit: what is the contribution of the combination of the $p$th component of the first mode, the $q$th component of the second mode, and the $r$th component of the third mode? This information is contained in the core array, $\mathbf{G} = (g_{pqr})$. In particular, $g_{pqr}^2/\mathrm{SS(Total)}$ provides the required information if all component matrices are orthonormal.

5. Level fit. After convergence of the algorithm, again due to the orthogonality in the model, the fit of each level of a mode can be assessed. Thus

$$\mathrm{SS(Total)}_{\mathrm{subject}} = \mathrm{SS(Fit)}_{\mathrm{subject}} + \mathrm{SS(Residual)}_{\mathrm{subject}},$$

so that the relative contribution of a subject (variable, or point in time) can be expressed as a proportion of the total variability of his own data: $\mathrm{SS(Fit)}_{\mathrm{subject}}/\mathrm{SS(Total)}_{\mathrm{subject}}$ (a small computational sketch follows this list).

6. Data point fit. Via a residual analysis, it can be established how well each individualdata point is fitted by the model.
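The level-fit computation of item 5 can be sketched as follows (illustrative names, not the paper's software); X is the preprocessed data array, X_hat the fitted part, and subjects are assumed to be the first mode.

```python
# Minimal sketch of the per-subject (level) fit
import numpy as np

def subject_fit(X, X_hat):
    ss_tot = np.sum(X ** 2, axis=(1, 2))             # SS(Total) per subject
    ss_res = np.sum((X - X_hat) ** 2, axis=(1, 2))   # SS(Residual) per subject
    ss_fit = ss_tot - ss_res                         # SS(Fit) per subject
    return ss_fit / ss_tot                           # relative fit of each subject
```

The same computation with axis=(0, 2) or axis=(0, 1) would give the level fit of the variables or of the occasions.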


Results: Components

Scaling of components.

Components of both two-mode and three-mode PCA can be scaled in two ways.

Normalised co-ordinates.

Normalised co-ordinates have lengths one and are as a rule orthogonal.

$$\underbrace{\sum_{i=1}^{I} a_{ip}^2 = 1}_{\text{column length 1}} \qquad\text{and}\qquad \underbrace{\sum_{i=1}^{I} a_{ip}a_{ip'} = 0 \;\;(p \neq p')}_{\text{components orthogonal}}$$

Principal co-ordinates.

Principal co-ordinates are scaled with their variances.

$$\underbrace{\sum_{j=1}^{J} b_{jq}^2 = \lambda_q}_{\text{column length eigenvalue}} \qquad\text{and}\qquad \sum_{q=1}^{Q} \lambda_q = J,$$

with $J$ the number of levels in mode B.

In two-mode PCA as used in the social and behavioural sciences, the subjects have normalised co-ordinates and the variables are in principal co-ordinates. However, in many other sciences, the subjects have principal co-ordinates and the variables have normalised co-ordinates. One should always check what is done and why.

Three-mode PCA.

In three-mode PCA, the basic model is written so that all components of all modes are in normalised co-ordinates. The centring presented above only ensures that one mode, in principle the subject mode (A), is standardised, i.e. the component matrix A is centred and normalised. For plotting, however, components should be in principal co-ordinates to ensure a proper metric.


Results: Interpretation of components and their combinations

Interpretation via rotations and the like

• Substantive considerations. Ultimately it is the content of the data and the theory that determine interpretation.

• Oblique rotations. Within the context of the model, it would be nice if only orthogonal rotations were useful, but typically in the social and behavioural sciences components (and even more so factors) are correlated. Typically the interpretation of variable components will benefit from (nonsingular) transformations. For time modes, orthogonal polynomials and dedicated target rotations will be most beneficial.

• Rotations. Procedures have been developed for separate component rotations to simple structure and for joint core-and-component (mainly orthogonal) rotations.

Interpretation of combinations of components

• Via core array. The core array contains the weights associated with combinations of components. Thus the elements of the core indicate the relative importance. Typically, the weight $g_{111}$ is the largest in unrotated solutions. Interpretation via the core array depends heavily on the interpretability of the components themselves.


Results: Interpretation via joint biplots

Joint biplots

For each component of one mode, called the reference mode, a biplot is made which depicts the markers of the other two modes. There are three such biplots, depending on which mode is taken as the reference mode.

Construction.

Starting with the matrix notation and modelling the three-way array via the frontal slices or matrices $\mathbf{X}_k$, we can write the Tucker3 model:

$$\mathbf{X}_k = \sum_{r=1}^{R} c_{kr}\,\{\mathbf{A}\mathbf{G}_r\mathbf{B}'\} + \mathbf{E}_k = \sum_{r=1}^{R} c_{kr}\,\mathbf{D}_r + \mathbf{E}_k \qquad (k = 1,\ldots,K).$$

We see that for each component $r$ we have a relationship between the rows of A and those of B expressed through the matrix $\mathbf{D}_r$. Each level $k$ of the third mode (say, time) weights $\mathbf{D}_r$ with a weight $c_{kr}$. With respect to the $r$th time component, for each time point $k$ the relationship between the first and the second mode is the same, but for each level $k$ this relationship is weighted differently, i.e. by a weight $c_{kr}$.

The biplot is constructed via the singular value decomposition $\mathbf{G}_r = \mathbf{U}_r\boldsymbol{\Phi}_r\mathbf{V}_r'$. The co-ordinates of the first mode become $\mathbf{A}^*_r = \theta\,\mathbf{A}\mathbf{U}_r\boldsymbol{\Phi}_r^{1/2}$ and those of the second mode become $\mathbf{B}^*_r = \theta^{-1}\,\mathbf{B}\mathbf{V}_r\boldsymbol{\Phi}_r^{1/2}$, where the scale factor $\theta = (J/I)^{1/4}$ serves to make the two configurations take up the same space, as much as is possible.
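A minimal sketch (not from the paper's software) of this construction under the assumptions above; A and B are orthonormal component matrices and G_r one frontal slice of the core, all illustrative names.

```python
# Minimal sketch: joint biplot co-ordinates for core slice G_r
import numpy as np

def joint_biplot_coords(A, B, G_r):
    I, J = A.shape[0], B.shape[0]
    U, phi, Vt = np.linalg.svd(G_r, full_matrices=False)      # G_r = U Phi V'
    theta = (J / I) ** 0.25                                    # scale factor from the text
    A_star = theta * A @ U @ np.diag(np.sqrt(phi))             # row markers
    B_star = (1.0 / theta) * B @ Vt.T @ np.diag(np.sqrt(phi))  # column markers
    return A_star, B_star
```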

Figure 5: Biplot for matrices A and B


Some interpretational rules for biplots

• Points and vectors. Usually, the levels of one mode are represented by points and the other by vectors. Variables are often arrows. Both are in fact vectors.

• Inner products. The size of the inner products indicates the value of $d_{(ij)r}$. Large positive (negative) inner products indicate a high positive (negative) association between row and column marker.

• Origin. The origin represents the point of orthogonality of the row and column markers.

• Row-column relationship. In joint biplots, due to their construction, no proper metric exists for the row markers or the column markers.


THE TUCKER2 MODEL

Tucker2 – Sum notation

$$x_{ijk} = \sum_{p=1}^{P}\sum_{q=1}^{Q} h_{pqk}\,(a_{ip}b_{jq}) + e_{ijk} = \hat{x}_{ijk} + e_{ijk}, \qquad i = 1,\ldots,I;\; j = 1,\ldots,J;\; k = 1,\ldots,K$$

Figure 6: Tucker2 Model


Tucker2 - Matrix formulations

$$(\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_K) = \mathbf{A}\,(\mathbf{H}_1,\mathbf{H}_2,\ldots,\mathbf{H}_K)\,\mathbf{B}' + (\mathbf{E}_1,\mathbf{E}_2,\ldots,\mathbf{E}_K)$$
$$= (\mathbf{A}\mathbf{H}_1\mathbf{B}',\,\mathbf{A}\mathbf{H}_2\mathbf{B}',\ldots,\mathbf{A}\mathbf{H}_K\mathbf{B}') + (\mathbf{E}_1,\mathbf{E}_2,\ldots,\mathbf{E}_K)$$

$\mathbf{H} = (\mathbf{H}_1,\mathbf{H}_2,\ldots,\mathbf{H}_K)$ is called the extended core array.

Matrix notation

$$\mathbf{X}_k = \underbrace{\mathbf{A}\mathbf{H}_k\mathbf{B}'}_{\hat{\mathbf{X}}_k} + \mathbf{E}_k \qquad (k = 1,\ldots,K)$$
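A small numerical illustration (not from the paper) of this slice-wise formulation; the array names and sizes are made up.

```python
# Minimal sketch: Tucker2 structural part, slice by slice
import numpy as np

I, J, K, P, Q = 5, 4, 3, 2, 2
rng = np.random.default_rng(2)
A, B = rng.normal(size=(I, P)), rng.normal(size=(J, Q))
H = rng.normal(size=(P, Q, K))                         # extended core array (H_1, ..., H_K)

X_hat = np.einsum('ip,pqk,jq->ijk', A, H, B)           # X_k = A H_k B' for every k
print(np.allclose(X_hat[:, :, 0], A @ H[:, :, 0] @ B.T))   # True
```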

Generality of Tucker models.

When comparing with the Tucker3 model

$$\mathbf{X}_k = \mathbf{A}\left\{\sum_{r=1}^{R} c_{kr}\,\mathbf{G}_r\right\}\mathbf{B}' + \mathbf{E}_k \qquad (k = 1,\ldots,K)$$

we see that the Tucker2 model is a more general model, because there are no restrictions on the core array H; in the Tucker3 model $\mathbf{H}_k = \sum_{r=1}^{R} c_{kr}\mathbf{G}_r$. The $\mathbf{H}_k$ are called the individual characteristic matrices.

Estimation

The parameters of the Tucker2 model are estimated in the same way as in the Tucker3 model via the same loss function, but with C set at the identity matrix and R, the number of components of the third mode, set equal to K. Instead of iterating over three component matrices in the standard tuckals algorithm, iterating over two of them is sufficient.


Parafac: PARAllel FACtor analysis

Tucker3 model as starting point

$$x_{ijk} = \sum_{p=1}^{P}\sum_{q=1}^{Q}\sum_{r=1}^{R} g_{pqr}\,(a_{ip}b_{jq}c_{kr}) + e_{ijk}$$

Equal number of components.

If $P = Q = R = S$, thus if all modes have an equal number of components, the $S \times S \times S$ core array G is a cube.

Superdiagonality.

If the core array is a cube, it is superdiagonal (or body-diagonal) if all elements $g_{pqr} = 0$ unless $p = q = r$.

Figure 7: Superdiagonal core cube

Parafac model in sum notation.

$$x_{ijk} = \sum_{s=1}^{S} g_{sss}\,(a_{is}b_{js}c_{ks}) + e_{ijk}$$

In most presentations the $g_{sss}$ are absorbed in the component matrices A, B, and/or C. In the above formulation the columns of A, B, and C have lengths 1 and the size of the data is contained in the $g_{sss}$, the three-way analogues of the singular values.

Tensor formulation

$$\mathbf{X} = \sum_{s=1}^{S} g_{sss}\,(\mathbf{a}_s \otimes \mathbf{b}_s \otimes \mathbf{c}_s) + \mathbf{E}$$

X is a three-way array which is approximated by a sum of $S$ three-way arrays constructed as $\mathbf{X}_s = (\mathbf{a}_s \otimes \mathbf{b}_s \otimes \mathbf{c}_s)$, each weighted by its $g_{sss}$.
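A minimal numerical sketch (not part of the paper) of this formulation, with unit-length columns and the weights $g_{sss}$ carrying the size of the data; all names are illustrative.

```python
# Minimal sketch: Parafac structural part with unit-length component columns
import numpy as np

I, J, K, S = 5, 4, 3, 2
rng = np.random.default_rng(4)
A, B, C = (rng.normal(size=sz) for sz in [(I, S), (J, S), (K, S)])
A, B, C = (M / np.linalg.norm(M, axis=0) for M in (A, B, C))   # columns of length 1
g = rng.normal(size=S)                                         # the weights g_sss

# x_hat[i,j,k] = sum_s g_s * a[i,s] * b[j,s] * c[k,s]
X_hat = np.einsum('s,is,js,ks->ijk', g, A, B, C)
```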


Parafac: PARAllel FACtor analysis

Tucker2 model as starting point

$$x_{ijk} = \sum_{p=1}^{P}\sum_{q=1}^{Q} h_{pqk}\,(a_{ip}b_{jq}) + e_{ijk}$$

with $h_{pqk}$ an element of the extended core array $\mathbf{H}$.

Equal number of components.

If $P = Q = S$, thus if the first two modes have an equal number of components, the $S \times S \times K$ extended core array has square frontal slices.

Slice-diagonality.

If the extended core array has square frontal slices, it is slice-diagonal if all elements $h_{pqk} = 0$ for $p \neq q$.

Figure 8: Slice-diagonal extended core array

Parafac model in sum notation.

$$x_{ijk} = \sum_{s=1}^{S} h_{ssk}\,(a_{is}b_{js}) + e_{ijk}$$

Diagonal slice is C

If we look at the set of all $h_{ssk}$, then we see that they form a $K \times S$ matrix, and if we call this array C, we have the Parafac model back.

$$x_{ijk} = \sum_{s=1}^{S} a_{is}b_{js}c_{ks} + e_{ijk}$$


Superdiagonality

The superdiagonality of the core array has important consequences:

• One-to-one relationship between components. The $s$th component of each mode has an exclusive relationship/interaction with the $s$th components of the other modes, just as in the two-mode singular value decomposition.

• Constraints on the model. Superdiagonality places severe restrictions on the model parameters. In particular, there is no rotational/transformational freedom, unlike the Tucker models, which “suffer” from rotational indeterminacy: any nonsingular transformation of any mode, coupled with the inverse transformation of the core array, will leave the fit unaffected.

• ‘Fixed’ orientation. The orientation of the axes in the solution of a two-mode array is arbitrary, as are their angles. No theoretical implications can be drawn from the orientation. The orientation of the Parafac axes is fixed due to the restrictions on the model. If the model can be theoretically justified (as for instance in analytical chemistry), then the axes have an immediate interpretation and theoretical relevance. In that case the Parafac model is pure parameter estimation and not data reduction.

• Correlated components. In general, A, B, and C are not orthogonal. The components are correlated. In the Tucker models orthogonality can be imposed without loss of generality.

• Not a decomposition. Unlike the Tucker models, the Parafac model is not a decomposition. Some data sets cannot be fitted by the model. The data have to exhibit system variation or parallel proportional profiles (see below).


Criteria for axes orientation

• Thurstone: Simple structure. Thurstone's idea of simple structure was that variables should as much as possible have a high value on one axis and no or small values on the other axes. If the axes are seen as explanatory factors, then the observed variables are primarily influenced by single factors. ⟹ Leads to confirmatory factor analysis.

Table 1: Simple Structure

Factor A   Factor B
   x          ·
   x          ·
   x          ·
   ·          x
   ·          x

• Cattell: Parallel proportional profiles. One single (oblique) orientation exists for the axes for all time points. The only thing the time points do differently is weight the common space (a small numerical sketch follows Figure 9). ⟹ Leads to three-mode analysis.

Table 2: Parallel Proportional Profiles

 c1     c2    |  ½c1    2c2   |  2c1    ½c2
 0.8    0.1   |  0.40   0.2   |  1.6    0.05
 0.6    0.9   |  0.30   1.8   |  1.2    0.45
 0.4    0.5   |  0.20   1.0   |  0.8    0.25
 0.2    1.0   |  0.10   2.0   |  0.4    0.50
 0.1    0.3   |  0.05   0.6   |  0.2    0.15

$$\mathbf{D}_1 = \begin{pmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{pmatrix}, \qquad \mathbf{D}_2 = \begin{pmatrix} 0.5 & 0.0 \\ 0.0 & 2.0 \end{pmatrix}, \qquad \mathbf{D}_3 = \begin{pmatrix} 2.0 & 0.0 \\ 0.0 & 0.5 \end{pmatrix}$$

Figure 9: Core slices D1, D2, D3
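To make Cattell's idea concrete, the sketch below (not from the paper's software) reproduces the blocks of Table 2: every occasion shows the same two columns c1 and c2, only weighted by the diagonal core slices D1, D2, D3.

```python
# Minimal sketch: parallel proportional profiles as C0 @ D_k
import numpy as np

C0 = np.array([[0.8, 0.1],
               [0.6, 0.9],
               [0.4, 0.5],
               [0.2, 1.0],
               [0.1, 0.3]])                        # the common columns c1, c2

D = [np.diag([1.0, 1.0]), np.diag([0.5, 2.0]), np.diag([2.0, 0.5])]
occasions = [C0 @ Dk for Dk in D]                  # the three blocks of Table 2

print(occasions[1])                                # middle block: ½·c1 and 2·c2
```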


Parafac as parallel factor model

$$x_{ijk} = \sum_{s=1}^{S} a_{is}b_{js}\,(c_{ks}) + e_{ijk}$$

• Weights. The component $\mathbf{a}_s$ is weighted at time point $k$ with the same weight for all subjects $i$, i.e. $c_{ks}$, but with a different weight at another time point $k'$, i.e. $c_{k's}$.

• Parallel in all modes. This argument is true for all three modes, because the model is symmetric. For description one usually concentrates on the parallelism in one mode: “The subject scores $a_{is}$ are multiplied by the loadings $b_{js}$ of a variable, and weighted by the proportionally constant weight $c_{ks}$.”

• Constant correlations. One can show that in the Parafac model the components of one mode have the same correlation for each time point.

• No rotational indeterminacy. It turns out that the proportionality of the components imposes sufficient restrictions for identifiability or a unique orientation of the axes. There is no rotational indeterminacy as in two-mode PCA. Each transformation of the components leads to a worse fit than the obtained solution.

• System variation. The Parafac model is based on system variation. Components as a whole may increase or decrease in size (with constant correlations between components), but the relative values within a component should stay the same.

Chemistry: Estimation not exploration

In (analytical) chemistry explicit models based on chemical ‘laws’ exist that have the Parafac form. Therefore, the basic use of three-mode (and multimode) models is not an exploration of structure in a data set, but an estimation of the parameters of an a priori known model. Examples are the data from several samples that are collected with hyphenated experiments, which couple two measurement procedures.

Pieter Kroonenberg occupies an endowed chair in "Multivariate analysis, with an emphasis on three-mode analysis", at the Institute of Education and Child Studies, Leiden University, Leiden, The Netherlands. His major interest is in three-mode analysis in all its facets, but he is also interested in other multivariate data-analytic methods and in applying such methods to data from a wide variety of fields. He has recently written a book on the practice of three-mode analysis: Applied Multiway Data Analysis. Hoboken, NJ: Wiley, 2008. Address: Institute of Education and Child Studies, Leiden University, Wassenaarseweg 52, 2333 AK Leiden, The Netherlands. E-mail: [email protected].
