Advances in Data Analysis using PARAFAC2 and Three-way DEDICOM

Page 1

Brett W. Bader, Sandia National Laboratories

TRICAP 2009, June 15, 2009

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Page 2

Acknowledgements

• Peter Chew, Sandia

• Tammy Kolda, Sandia

• Daniel Dunlavy, Sandia

• Evrim Acar, Sandia

• Ahmed Abdelali, New Mexico State University

• Alla Rozovskaya, University of Illinois, Urbana-Champaign

Page 3

Multi-way Analysis

3-way DEDICOM

Tensor or N-way array

PARAFAC2

Interested in algorithms for analysis of large data sets

Page 4

DEDICOM

• DEcomposition into DIrectional COMponents

• Introduced in 1978 by Harshman

• Past applications
- Study asymmetries in telephone calls among cities
- Marketing research
  • car switching: car owners and what they buy next
  • free associations of words for advertising
- Asymmetric measures of world trade (import/export)

• Variations
- Two-way DEDICOM
- Three-way DEDICOM

Page 5

Two-way DEDICOM

Single domain model: $X \approx A R A^T$

$$\min_{A,R} \left\| X - A R A^T \right\|_F^2 \quad \text{s.t. } A \text{ orthogonal}$$

• A (N x P) is an orthogonal matrix of loadings or weights
• R (P x P) is a dense matrix that captures asymmetric relationships

• Decomposition is not unique
- A can be transformed with no loss of fit to the data
- Nonsingular transformation Q: $A R A^T = (AQ)(Q^{-1} R Q^{-T})(AQ)^T$
- Usually "fix" A with some standard rotation (e.g., VARIMAX)

[Diagram: X (N x N) ≈ A (N x P) · R (P x P) · A^T (P x N)]
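The non-uniqueness noted on this slide can be checked numerically. The following is a minimal MATLAB sketch, not code from the talk: the matrices A, R, and Q are made-up illustrative values, and A is not orthogonalized here. It confirms that replacing A by AQ and R by Q^{-1} R Q^{-T} leaves the model A R A^T unchanged.

```matlab
% Illustrative values only; sizes and entries are assumptions, not data from the talk.
A = [1 0; 0 1; 1 1; 2 -1; 0.5 0.5; -1 3];   % N x P loadings (N = 6, P = 2)
R = [1 2; 0.5 3];                            % P x P asymmetric matrix
Q = [2 1; 0 1];                              % nonsingular P x P transformation

A2 = A * Q;            % transformed loadings A*Q
R2 = (Q \ R) / Q';     % Q^{-1} * R * Q^{-T}

X1 = A  * R  * A';     % original model
X2 = A2 * R2 * A2';    % transformed model

% Relative difference between the two models (should be at rounding level)
relDiff = norm(X1 - X2, 'fro') / norm(X1, 'fro');
fprintf('relative difference: %.2e\n', relDiff);
```

Because the two models coincide, the fit alone cannot distinguish (A, R) from (AQ, Q^{-1} R Q^{-T}), which is why a standard rotation is used to fix A.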

Page 6

Three-way DEDICOM

$X_k \approx A D_k R D_k A^T$ for $k = 1, \dots, K$

$$\min_{A,R,D} \sum_k \left\| X_k - A D_k R D_k A^T \right\|_F^2$$

• A (N x P) is a matrix of loadings or weights (not necessarily orthogonal)
• R (P x P) is a dense matrix that captures asymmetric relationships
• D (P x P x K) is a tensor with diagonal frontal slices giving the weights of the columns of A for each slice in third mode

• *Unique* solution with enough slices of X with sufficient variation
- i.e., no rotation of A possible
- greater confidence in interpretation of results

[Diagram: X (N x N x K) = A · D · R · D · A^T, with A of size N x P]

Page 7

PARAFAC2

$X_k \approx A D_k H D_k A^T$ for $k = 1, \dots, K$

$$\min_{A,H,D} \sum_k \left\| X_k - A D_k H D_k A^T \right\|_F^2$$

• A (N x P) is a matrix of loadings or weights (not necessarily orthogonal)
• H (P x P) is a dense, symmetric matrix (usually positive definite)
• D (P x P x K) is a tensor with diagonal frontal slices giving the weights of the columns of A for each slice in third mode

• *Unique* solution with enough slices of X with sufficient variation
- i.e., no rotation of A possible
- greater confidence in interpretation of results

[Diagram: X (N x N x K) = A · D · H · D · A^T, with A of size N x P]

Page 8

DEDICOM Models & Algorithms

2-way DEDICOM: $X = A R A^T$
• Generalized Takane method (Takane, 1985; Kiers et al., 1990)
• ASALSAN variant (Bader, Harshman, Kolda, 2007)
• All-at-once optimization

3-way DEDICOM: $X_k = A D_k R D_k A^T$
• Kiers' method (Kiers, 1993)
• ASALSAN (Bader, Harshman, Kolda, 2007)
• All-at-once optimization

Page 9

Kiers' Algorithm (Kiers, 1993)

minimize:

$$\min_{A,R,D} \sum_{i=1}^{m} \left\| X_i - A D_i R D_i A^T \right\|_F^2$$

Alternating Least Squares

1) ALS over columns in A (involves EVD of dense n x n matrix)

2) Least-squares problem for R

$$f(R) = \left\| \begin{pmatrix} \mathrm{Vec}(X_1) \\ \vdots \\ \mathrm{Vec}(X_m) \end{pmatrix} - \begin{pmatrix} A D_1 \otimes A D_1 \\ \vdots \\ A D_m \otimes A D_m \end{pmatrix} \mathrm{Vec}(R) \right\|$$

$$\mathrm{Vec}(R) = \left( \sum_{i=1}^{m} (D_i A^T A D_i) \otimes (D_i A^T A D_i) \right)^{-1} \sum_{i=1}^{m} \mathrm{Vec}(D_i A^T X_i A D_i)$$

3) ALS over elements in D

Accurate algorithm but not suitable for large-scale data: $O(pn^3)$

Page 10

ASALSAN

"Alternating Simultaneous Approximation, Least Squares, and Newton" (Bader, Harshman, Kolda, 2007)

$$\min_{A,R,D} \sum_{i=1}^{m} \left\| X_i - A D_i R D_i A^T \right\|_F^2$$

Solving for A:

$$\begin{pmatrix} X_1 & X_1^T & \cdots & X_m & X_m^T \end{pmatrix} = A \begin{pmatrix} D_1 R D_1 & D_1 R^T D_1 & \cdots & D_m R D_m & D_m R^T D_m \end{pmatrix} \left( I_{2m} \otimes A^T \right)$$

Writing this as $Y = A Z^T$ gives $A = Y Z (Z^T Z)^{-1}$, i.e.,

$$A = \left[ \sum_{i=1}^{m} \left( X_i A D_i R^T D_i + X_i^T A D_i R D_i \right) \right] \left[ \sum_{i=1}^{m} (B_i + C_i) \right]^{-1}$$

where

$$B_i \equiv D_i R D_i (A^T A) D_i R^T D_i, \qquad C_i \equiv D_i R^T D_i (A^T A) D_i R D_i.$$
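The A update above can be written in a few lines of MATLAB. This is a sketch under stated assumptions rather than the ASALSAN implementation itself: the data are small, noise-free, and made up, the diagonals of the D_i are stored as rows of a matrix d (a storage choice assumed here), and the right-hand side uses the current A. On exact data, starting from the true A, one update should return A unchanged, which gives a simple sanity check.

```matlab
% Illustrative sizes and values (assumptions, not data from the talk)
n = 5; p = 2; m = 3;
A0 = [1 0; 0 1; 1 1; 2 -1; 0.5 0.5];   % true loadings
R  = [1 2; 0.5 3];                     % asymmetric matrix
d  = [1 2; 0.5 1; 2 0.3];              % row i holds diag(D_i)

% Exact slices X_i = A0*D_i*R*D_i*A0'
X = zeros(n, n, m);
for i = 1:m
    Di = diag(d(i,:));
    X(:,:,i) = A0 * Di * R * Di * A0';
end

% One update of A, holding R and D fixed
A   = A0;               % current iterate (here: the true A)
AtA = A' * A;
num = zeros(n, p);      % sum_i (X_i*A*D_i*R'*D_i + X_i'*A*D_i*R*D_i)
den = zeros(p, p);      % sum_i (B_i + C_i)
for i = 1:m
    Di   = diag(d(i,:));
    DRD  = Di * R  * Di;
    DRtD = Di * R' * Di;
    num  = num + X(:,:,i) * A * DRtD + X(:,:,i)' * A * DRD;
    den  = den + DRD * AtA * DRtD + DRtD * AtA * DRD;
end
Anew = num / den;       % num * inv(den)

updateErr = norm(Anew - A0, 'fro');
fprintf('change in A at the true solution: %.2e\n', updateErr);
```

Only the p x p matrix `den` is inverted, and the products with X_i are the only operations that touch the n x n data.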

Page 11

ASALSAN

Solving for R:

$$\min_{R} \sum_{i=1}^{m} \left\| X_i - A D_i R D_i A^T \right\|_F^2$$

Use the approach in (Kiers, 1993)

minimize:

$$f(R) = \left\| \begin{pmatrix} \mathrm{Vec}(X_1) \\ \vdots \\ \mathrm{Vec}(X_m) \end{pmatrix} - \begin{pmatrix} A D_1 \otimes A D_1 \\ \vdots \\ A D_m \otimes A D_m \end{pmatrix} \mathrm{Vec}(R) \right\|$$

$$\mathrm{Vec}(R) = \left( \sum_{i=1}^{m} (D_i A^T A D_i) \otimes (D_i A^T A D_i) \right)^{-1} \sum_{i=1}^{m} \mathrm{Vec}(D_i A^T X_i A D_i)$$
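The closed-form solution for Vec(R) translates directly into MATLAB using `kron` and column-major vectorization. The sketch below uses made-up, noise-free data and assumes the diagonals of the D_i are stored as rows of a matrix d; it is meant only to illustrate the formula, and it checks that the R used to generate the data is recovered.

```matlab
% Illustrative sizes and values (assumptions, not data from the talk)
n = 5; p = 2; m = 3;
A  = [1 0; 0 1; 1 1; 2 -1; 0.5 0.5];
R0 = [1 2; 0.5 3];                     % R used to generate the data
d  = [1 2; 0.5 1; 2 0.3];              % row i holds diag(D_i)

X = zeros(n, n, m);
for i = 1:m
    Di = diag(d(i,:));
    X(:,:,i) = A * Di * R0 * Di * A';
end

% Accumulate the p^2 x p^2 system matrix and right-hand side
AtA = A' * A;
S = zeros(p^2, p^2);
t = zeros(p^2, 1);
for i = 1:m
    Di = diag(d(i,:));
    G  = Di * AtA * Di;                 % D_i*A'*A*D_i
    S  = S + kron(G, G);
    M  = Di * A' * X(:,:,i) * A * Di;   % D_i*A'*X_i*A*D_i
    t  = t + M(:);                      % Vec(.) stacks columns
end

Rhat = reshape(S \ t, p, p);            % solve the system, un-vectorize
recoverErr = norm(Rhat - R0, 'fro');
fprintf('error in recovered R: %.2e\n', recoverErr);
```

The linear system has only p^2 unknowns, so this step is cheap whenever p is small.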

Page 12

ASALSAN

Solving for D:

$$\min_{D_i} \left\| X_i - A D_i R D_i A^T \right\|_F^2$$

Use compression. QR factorization:

$$A = Q \tilde{A}$$

Smaller problem (p x p):

$$\min_{D_i} \left\| Q^T X_i Q - \tilde{A} D_i R D_i \tilde{A}^T \right\|_F^2$$

Use Newton's method to solve the optimization problem for $d = \mathrm{diag}(D_i)$:

$$d_{new} = d - H^{-1} g$$

Gradient:

$$g_k = -\sum_{i,j} \left[ 2 \left( X - A D R D A^T \right) \ast \left( A D r_k a_k^T + a_k r_{k,:} D A^T \right) \right]_{i,j}$$

Hessian:

$$h_{st} = -2 \sum_{i,j} \left[ \left( X - A D R D A^T \right) \ast \left( a_s r_{st} a_t^T + a_t r_{ts} a_s^T \right) - \left( A D r_s a_s^T + a_s r_{s,:} D A^T \right) \ast \left( A D r_t a_t^T + a_t r_{t,:} D A^T \right) \right]_{i,j}$$

where $\ast$ is the elementwise product.
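The compression step can be illustrated with MATLAB's economy-size QR. In the sketch below the data matrix and factors are arbitrary made-up values. Since the full and compressed objectives are stated to have the same minimizer over D, they should differ only by a constant that does not depend on D, so the code evaluates both objectives at two different choices of d and compares the gaps.

```matlab
% Illustrative values (assumptions, not data from the talk)
n = 5; p = 2;
X = magic(n);                          % arbitrary n x n data slice
A = [1 0; 0 1; 1 1; 2 -1; 0.5 0.5];    % n x p loadings
R = [1 2; 0.5 3];

[Q, At] = qr(A, 0);    % economy-size QR: A = Q*At, At is p x p upper triangular
Xc = Q' * X * Q;       % compressed p x p data

% Full and compressed objectives as functions of d = diag(D)
f  = @(d) norm(X  - A  * diag(d) * R * diag(d) * A',  'fro')^2;
fc = @(d) norm(Xc - At * diag(d) * R * diag(d) * At', 'fro')^2;

% The gap f(d) - fc(d) should be the same for any d
gap1 = f([1; 2])    - fc([1; 2]);
gap2 = f([0.3; -1]) - fc([0.3; -1]);
fprintf('gap at d1: %.6f, gap at d2: %.6f\n', gap1, gap2);
```

After compression, each gradient and Hessian evaluation in the Newton iteration works with p x p matrices instead of n x n ones.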

Page 13

Algorithm Costs

Dominant costs:
- $A^T A$
- QR factorization of A
- $X_i A R^T$
- $X_i^T A R$
- $Q^T X_i Q$

These steps cost $O(p^2 n)$ and/or are linear in nnz of $X_i$

For small p, updating A is most expensive part

ASALSAN is capable of handling large data sets

Page 14

Timings

Time in seconds per iteration (avg iterations) (Bader, Harshman, Kolda, 2007)

Data sets: World trade, 18x18x10, 3060 nonzeros; Enron, 184x184x44, 9838 nonzeros

working in this space. Specifically, we find an orthonormal basis $Q \in \mathbb{R}^{n \times p}$ of matrix A using, e.g., a compact QR decomposition,

$$A = Q \tilde{A}, \qquad (11)$$

where $\tilde{A}$ is upper triangular. Then we use Q to project X onto the basis of A. By the orthogonality of Q, the minimization problem of (10) is the same as

$$\min_{D_k} \left\| Q^T X_k Q - \tilde{A} D_k R D_k \tilde{A}^T \right\|_F^2, \qquad (12)$$

except that $Q^T X_k Q$ and $\tilde{A}$ are both of size p × p. We use these smaller matrices in place of $X_k$ and A, respectively, in the updates of both R and D in (9) and (10) above.

The dominant costs of ASALSAN per iteration are linear in the number of nonzeros of $X_k$ and/or $O(p^2 n)$ and come from the following steps: $A^T A$, QR factorization of A, $X_k A R^T$, $X_k^T A R$, and $Q^T X_k Q$. In contrast, the dominant costs in Kiers' ALS algorithm [23] come from updating A with p diagonalizations of a dense n × n matrix, costing $O(pn^3)$.

3.2 Nonnegative ASALSAN

Because we often deal with nonnegative data in X, it sometimes helps to examine decompositions that retain the nonnegative characteristics of the original data. So we have modified ASALSAN to compute a three-way DEDICOM model with non-negativity constraints on A, R, and D. We call this algorithm NN-ASALSAN, for "nonnegative" ASALSAN. Modifications to the updates of both A and R are made as follows: we replace the least squares solution with the multiplicative update introduced in [29] as implemented in [2]. Specifically, we modify the step to solve for the A appearing on the left in (5):

$$a_{ic} \leftarrow a_{ic} \, \frac{\left[ \sum_{k=1}^{m} \left( X_k A D_k R^T D_k + X_k^T A D_k R D_k \right) \right]_{ic}}{\left[ A \sum_{k=1}^{m} (B_k + C_k) \right]_{ic} + \epsilon},$$

where $B_k$ and $C_k$ are the same as in (7)-(8) above and $\epsilon$ is a small number like $10^{-9}$. The solution for R is given by:

$$\mathrm{Vec}(R)_i \leftarrow \mathrm{Vec}(R)_i \, \frac{\left[ \sum_{k=1}^{m} \mathrm{Vec}(D_k A^T X_k A D_k) \right]_i}{\left[ \sum_{k=1}^{m} (D_k A^T A D_k) \otimes (D_k A^T A D_k) \, \mathrm{Vec}(R) \right]_i + \epsilon}.$$

We used the procedure for updating D as above, using the same non-negativity constraints.

An algorithm for a nonnegative two-way DEDICOM model follows directly from NN-ASALSAN when one considers a matrix $X$ as an array $\mathcal{X}$ having a single slice (m = 1) and the D array is just the identity matrix.

Table 1. Time in seconds per iteration (average number of iterations) on both data sets.

Algorithm     World trade    Enron
ASALSAN       0.069 (50)     0.85 (184)
NN-ASALSAN    0.083 (47)     1.0 (74)
Kiers [23]    0.022 (67)     22.3 (400+)

4 Experimental results

We consider two applications: a small example using the international trade data used previously in [20] and the larger email graph of the Enron corporation that was made public during the federal investigation.

ASALSAN was written in MATLAB, using the Tensor Toolbox [5, 6, 7], and Kiers' algorithm [23] was compiled Pascal code obtained from the author. All tests were performed on a dual 3GHz Pentium Xeon desktop computer with 2GB of RAM.

Table 1 shows the timings per iteration and average number of iterations to satisfy a tolerance of $10^{-5}$ (World trade) or $10^{-7}$ (Enron) in the change of fit for the three algorithms (using the same stopping criteria). We suspect the performance gap on the world trade example is due to more overhead in our MATLAB code relative to Kiers' compiled executable. Due to the poor asymptotic scalability of Kiers' algorithm on the larger Enron data, its running time is much slower than ASALSAN. Processing larger data sets, the discrepancy will grow even larger.

A practice in some applications of DEDICOM is to ignore the diagonal entries of each $X_k$ in the minimization of (3). For both applications, this makes sense because we wish to ignore self-loops (i.e., no self-trade or sending email to yourself). We use an imputation technique of estimating the diagonal values from the current approximation $A D_k R D_k A^T$ at each iteration and including them in $X_k$.

4.1 World trade

For a simple algorithmic comparison, we tested the international trade data of [20]. The data consists of import/export data among 18 nations in Europe, North America, and the Pacific Rim from 1981 to 1990. A semantic graph of this data corresponds to a dense adjacency array $\mathcal{X}$ of size 18 × 18 × 10.

We computed a three component (p = 3) model using ASALSAN and Kiers' algorithm, and the same minimizers may be found among the results of both algorithms. We also used NN-ASALSAN to compute a new fully-nonnegative version of the DEDICOM model ($A, R, D \ge 0$). Because the nonnegative results are more easily interpreted and are new, we report just these results.

Page 15

New Approach: All-at-once Optimization

Minimization problem: $\min_x f(x)$

• Need a flexible method to handle
- constraints (e.g., non-negativity)
- regularization (e.g., sparsity)
- missing data
- large-scale data sets

• Want
- speed comparable to ASALSAN
- improved accuracy, if possible

• Follow what has been done with CANDECOMP/PARAFAC (see Evrim Acar's presentation)

• Use Poblano Optimization Toolbox in MATLAB (Dunlavy, et al. 2009)
- Vectorize factor variables and derivatives

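Page 16

Optimization Theory

Minimization problem: $\min_x f(x)$

Taylor series approx.:

$$f(x + s) \approx f(x) + \nabla f(x)^T s + \tfrac{1}{2} s^T \nabla^2 f(x) \, s$$

Gradient:

$$\nabla f(x) = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}$$

Hessian:

$$\nabla^2 f(x) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$$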

Page 17

Newton's Method

Taylor series approx.:

$$f(x + s) \approx f(x) + \nabla f(x)^T s + \tfrac{1}{2} s^T \nabla^2 f(x) \, s$$

Local model:

$$m(x_k + s) = f(x_k) + \nabla f(x_k)^T s + \tfrac{1}{2} s^T \nabla^2 f(x_k) \, s$$

Newton's method: solve for $s_k$

$$\nabla^2 f(x_k) \, s_k = -\nabla f(x_k), \qquad x_{k+1} = x_k + s_k$$

[Figure: Newton step $s_k$ from $x_k$ in the $(x_1, x_2)$ plane]
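The Newton iteration above takes only a few lines of MATLAB. The test function below, f(x) = exp(x1) - x1 + x2^2 with minimizer at the origin, is an assumption chosen purely for illustration, as are the starting point and stopping tolerance.

```matlab
% Illustrative objective: f(x) = exp(x(1)) - x(1) + x(2)^2, minimized at [0; 0]
gradf = @(x) [exp(x(1)) - 1; 2*x(2)];    % gradient of f
hessf = @(x) [exp(x(1)) 0; 0 2];         % Hessian of f

x = [1; 1];                              % starting point (assumed)
for k = 1:20
    g = gradf(x);
    if norm(g) < 1e-12                   % stop when the gradient vanishes
        break;
    end
    s = -(hessf(x) \ g);                 % solve Hessian * s = -gradient
    x = x + s;                           % Newton step
end
fprintf('stopped at k = %d, x = [%.3e, %.3e]\n', k, x(1), x(2));
```

Each step requires forming and solving with the full Hessian, which is the cost that motivates approximating the Hessian instead for large problems.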

Page 18

Quasi-Newton Methods

Approximation to Hessian: $H_k \approx \nabla^2 f(x_k)$

New linear system: $H_k s_k = -\nabla f(x_k)$

Secant method: use gradients at past points to approximate Hessian

$$B_k = H_k^{-1}, \qquad s_k = -B_k \nabla f(x_k), \qquad y_k = \nabla f(x_k) - \nabla f(x_{k-1})$$

$$\min_{B_k} \left\| B_k - B_{k-1} \right\|_F \quad \text{subject to } B_k = B_k^T, \; B_k y_k = s_{k-1}$$

BFGS update:

$$B_k = B_{k-1} + \frac{s_{k-1} s_{k-1}^T}{s_{k-1}^T y_k} - \frac{B_{k-1} y_k y_k^T B_{k-1}}{y_k^T B_{k-1} y_k}.$$

Page 19

Large-scale Methods

BFGS update:

$$B_k = B_{k-1} + \frac{s_{k-1} s_{k-1}^T}{s_{k-1}^T y_k} - \frac{B_{k-1} y_k y_k^T B_{k-1}}{y_k^T B_{k-1} y_k}.$$

Limited memory BFGS (L-BFGS) keeps only m past update vectors $s_k$ and $y_k$

Nonlinear Conjugate Gradient (NCG)

Steepest descent then move in subsequent conjugate directions

These methods are implemented in the Poblano Optimization Toolbox for MATLAB (Dunlavy et al. 2009)

[Figure: from $x_k$ in the $(x_1, x_2)$ plane, the steepest-descent direction $-\nabla f(x_k)$, the point $x_k^{CP}$, and the Newton direction $-H_k^{-1} \nabla f(x_k)$]

Page 20

Derivatives for 2-way DEDICOM

Error: $E \equiv Z - A R A^T$

Objective function:

$$f(A, R) \equiv \tfrac{1}{2} \| E \|_F^2 = \tfrac{1}{2} \left\| Z - A R A^T \right\|_F^2$$

Gradient:

$$\frac{\partial f}{\partial A_{ij}} = \left[ -E^T A R - E A R^T \right]_{ij}, \qquad \frac{\partial f}{\partial R_{ij}} = \left[ -A^T E A \right]_{ij}$$

MATLAB:

    AR = A*R;
    ARt = A*R';
    AtA = A'*A;
    G{1} = -Z'*AR - Z*ARt + ARt*(AtA*R) + AR*(AtA*R');
    G{2} = -A'*Z*A + AtA*R*AtA;
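A standard way to validate analytic derivatives such as these is a finite-difference check. The sketch below is an illustration, not part of the talk: it evaluates the slide's gradient expressions on small made-up matrices and compares them with central differences of the objective. The step size h and all data values are assumptions.

```matlab
% Illustrative data (assumptions)
n = 4; p = 2;
Z = magic(n);                            % arbitrary n x n data matrix
A = [1 0; 0.5 1; -1 2; 0.3 0.7];
R = [1 2; 0.5 3];
f = @(A, R) 0.5 * norm(Z - A*R*A', 'fro')^2;

% Analytic gradients, as on the slide
AR = A*R; ARt = A*R'; AtA = A'*A;
GA = -Z'*AR - Z*ARt + ARt*(AtA*R) + AR*(AtA*R');
GR = -A'*Z*A + AtA*R*AtA;

% Central finite differences, one entry at a time
h = 1e-6;
FA = zeros(n, p);
for i = 1:n
    for j = 1:p
        dA = zeros(n, p); dA(i,j) = h;
        FA(i,j) = (f(A + dA, R) - f(A - dA, R)) / (2*h);
    end
end
FR = zeros(p, p);
for i = 1:p
    for j = 1:p
        dR = zeros(p, p); dR(i,j) = h;
        FR(i,j) = (f(A, R + dR) - f(A, R - dR)) / (2*h);
    end
end

errA = norm(GA - FA, 'fro') / norm(GA, 'fro');
errR = norm(GR - FR, 'fro') / norm(GR, 'fro');
fprintf('relative gradient errors: A %.2e, R %.2e\n', errA, errR);
```

Small relative errors for both factors indicate that the vectorized gradients passed to an optimizer are consistent with the objective.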

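Page 21

Derivatives for 3-way DEDICOM

Error: $E_k \equiv Z_k - A D_k R D_k A^T$

Objective function:

$$f(A, R, D) \equiv \tfrac{1}{2} \sum_{k=1}^{K} \| E_k \|_F^2 = \tfrac{1}{2} \sum_{k=1}^{K} \left\| Z_k - A D_k R D_k A^T \right\|_F^2$$

Gradient:

$$\frac{\partial f}{\partial A_{ij}} = \left[ -\sum_k \left( E_k^T A D_k R D_k + E_k A D_k R^T D_k \right) \right]_{ij}$$

$$\frac{\partial f}{\partial R_{ij}} = \left[ -\sum_k D_k A^T E_k A D_k \right]_{ij}$$

$$\frac{\partial f}{\partial D_{kj}} = -\left[ (A D_k r_j)^T E_k a_j + a_j^T E_k \left( r_{j,:} D_k A^T \right)^T \right]$$

where $a_j$ is the j-th column of A, $r_j$ is the j-th column of R, $r_{j,:}$ is the j-th row of R, and $D_{kj}$ is the j-th diagonal entry of $D_k$.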

Page 22

Derivatives for 3-way DEDICOM

MATLAB:

    % D is stored as a K x P matrix; row k holds the diagonal of D_k.
    % Note that E here is the model minus the data, so no minus signs appear.
    for k = 1:size(D,1)
        DktDk = D(k,:)'*D(k,:);
        Rk = R.*DktDk;                    % equals D_k*R*D_k
        AR = A*Rk;
        ARt = A*Rk';
        E = AR*A' - Z(:,:,k);
        G{1} = G{1} + E'*AR + E*ARt;      % gradient w.r.t. A
        G{2} = G{2} + (A'*E*A).*DktDk;    % gradient w.r.t. R

        ADR = A*(diag(D(k,:))*R);
        RDA = (R*diag(D(k,:)))*A';
        for j = 1:size(D,2)
            % gradient w.r.t. the j-th diagonal entry of D_k
            G{3}(k,j) = ADR(:,j)'*E*A(:,j) + A(:,j)'*E*RDA(j,:)';
        end
    end

Page 23

Results

• Synthetic data
• 20x20x9
• Random initialization (5% error)
• No noise in X
• p = 2

[Figure: relative residual error (log scale, 10^-12 to 10^0) vs. time (sec, 0 to 14) for ASALSAN, NCG, and L-BFGS]

Page 24

Results

• Synthetic data
• 20x20x9
• Random initialization (5% error)
• 0.1% noise in X
• p = 2

[Figure: relative residual error (log scale, 10^-3 to 10^0) vs. time (sec, 0 to 6) for ASALSAN, NCG, and L-BFGS]

Page 25

Results

• Synthetic data
• 50x50x35
• Random initialization (5% error)
• 0.1% noise in X
• Oblique factors A
• p = 2

[Figure: relative residual error (log scale, 10^0 to 10^2) vs. time (sec, 0 to 100) for ASALSAN, NCG, and L-BFGS]

Page 26

Results

• Enron email data
• 184x184x44
• 9838 nnz
• p = 4
• Random initialization

[Figure: relative residual error (0.9 to 0.99) vs. time (sec, 0 to 80) for ASALSAN, NCG, and L-BFGS]

Page 27

Derivatives for PARAFAC2

Error: $E_k \equiv Z_k - A D_k H D_k A^T$

Objective function:

$$f(A, H, D) \equiv \tfrac{1}{2} \sum_{k=1}^{K} \| E_k \|_F^2 = \tfrac{1}{2} \sum_{k=1}^{K} \left\| Z_k - A D_k H D_k A^T \right\|_F^2$$

Gradient:

$$\frac{\partial f}{\partial A_{ij}} = \left[ -2 \sum_k E_k^T A D_k H D_k \right]_{ij}$$

$$\frac{\partial f}{\partial H_{ij}} = \left[ -\sum_k D_k A^T E_k A D_k \right]_{ij}$$

$$\frac{\partial f}{\partial D_{kj}} = -2 \, (A D_k h_j)^T E_k a_j$$

where $a_j$ and $h_j$ are the j-th columns of A and H.

Page 28

Discussion

• All-at-once optimization is competitive but not the best. ASALSAN is fastest on real data.
- DEDICOM and PARAFAC2 are difficult nonlinear optimization problems with multiple minima
- ASALSAN includes some second order information

• Optimization approach convenient for:
- Different loss functions (other than least squares)
- General constraints (e.g., non-negativity)
- Regularization (e.g., sparsity)
- Missing data

• Future research will explore these benefits and aim for better performance

Page 29

Applications

DEDICOM

Tensor or N-way array

PARAFAC2

• Temporal social network analysis (Enron)
• Community finding
• Part-of-speech tagging
• Multilingual document analysis

Page 30

Using PARAFAC2 for Multilingual Document Clustering

Joint work with Peter Chew

Page 31

Basics: Graphs and Matrices

Term-by-doc (adjacency) matrix

Example from (Berry, Drmac, Jessup, 1999), "Matrices, Vector Spaces, and Information Retrieval", p. 341:

The t = 6 terms:

T1: bak(e,ing)
T2: recipes
T3: bread
T4: cake
T5: pastr(y,ies)
T6: pie

The d = 5 document titles:

D1: How to Bake Bread Without Recipes
D2: The Classic Art of Viennese Pastry
D3: Numerical Recipes: The Art of Scientific Computing
D4: Breads, Pastries, Pies and Cakes: Quantity Baking Recipes
D5: Pastry: A Book of Best French Recipes

The 6 × 5 term-by-document matrix before normalization, where the element $a_{ij}$ is the number of times term i appears in document title j:

$$A = \begin{pmatrix} 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$

The 6 × 5 term-by-document matrix with unit columns:

$$A = \begin{pmatrix} 0.5774 & 0 & 0 & 0.4082 & 0 \\ 0.5774 & 0 & 1.0000 & 0.4082 & 0.7071 \\ 0.5774 & 0 & 0 & 0.4082 & 0 \\ 0 & 0 & 0 & 0.4082 & 0 \\ 0 & 1.0000 & 0 & 0.4082 & 0.7071 \\ 0 & 0 & 0 & 0.4082 & 0 \end{pmatrix}$$

Fig. 2.2 The construction of a term-by-document matrix A.

are returned as relevant. The second, third, and fifth documents, which concern neither of these topics, are correctly ignored.

If the user had simply requested books about baking, however, the results would have been markedly different. In this case, the query vector is given by

$$q^{(2)} = ( 1 \; 0 \; 0 \; 0 \; 0 \; 0 )^T,$$

and the cosines of the angles between the query and five document vectors are, in

[Diagram: bipartite graph linking Terms T1 to T6 with Documents D1 to D5]

Concepts

• Bag of words
• Vector space model
• Stemming
• Stoplists
• Scaling for information content
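The Berry, Drmac, and Jessup example on this slide can be reproduced in a few lines. The following MATLAB sketch normalizes the columns of the count matrix to unit length and scores the "baking" query vector against each document by cosine similarity; the variable names are illustrative choices, not taken from the source.

```matlab
% Term-by-document counts from the example (rows T1..T6, columns D1..D5)
A = [1 0 0 1 0;
     1 0 1 1 1;
     1 0 0 1 0;
     0 0 0 1 0;
     0 1 0 1 1;
     0 0 0 1 0];

% Scale every column to unit Euclidean length
colNorms = sqrt(sum(A.^2, 1));
An = A ./ repmat(colNorms, size(A, 1), 1);

% Query on the single term T1 ("bake"/"baking")
q = [1 0 0 0 0 0]';

% Cosine of the angle between q and each (unit-length) document column
cosines = (An' * q) / norm(q);
disp(An);
disp(cosines');
```

Only D1 and D4 contain the term T1, so they are the only documents with a nonzero cosine for this query.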

Page 32

Cross-language Information Retrieval (CLIR)

Web documents could be in any language

[Diagram labels: English, French, Arabic, Spanish]

Languages on the web: English, German, Japanese, French, Chinese Simplified, Spanish, Russian, Dutch, Korean, Polish, Portuguese, Chinese Traditional, Swedish, Czech, Norwegian, Italian, Danish, Hungarian, Finnish, Hebrew, Arabic, Turkish, Slovak, Indonesian, Bulgarian, Croatian, Catalan, Slovenian, Greek, Romanian, Serbian, Estonian, Icelandic, Lithuanian, Latvian

Goal: Cluster documents by topic regardless of language

(P. Chew, B. Bader, A. Abdelali (NMSU), T. Kolda, P. Kegelmeyer)

Page 33

Bible as a 'Rosetta Stone'

• The Bible has been translated carefully and widely
- 451 complete & 2479 partial translations
• Verse aligned

Afrikaans Estonian Norwegian

Albanian Finnish Persian (Farsi)

Amharic French Polish

Arabic German Portuguese

Aramaic Greek (New Testament) Romani

Armenian Eastern Greek (Modern) Romanian

Armenian Western Hebrew (Old Testament) Russian

Basque Hebrew (Modern) Scots Gaelic

Breton Hungarian Spanish

Chamorro Indonesian Swahili

Chinese (Simplified) Italian Swedish

Chinese (Traditional) Japanese Tagalog

Croatian Korean Thai

Czech Latin Turkish

Danish Latvian Ukrainian

Dutch Lithuanian Vietnamese

English Manx Gaelic Wolof

Esperanto Maori Xhosa

Sandiaʼs database: 54 languages: 99.76 % coverage of web

Page 34

Bible as Parallel Corpus

5 languages for training and testing

Translation Terms Total Words

English (King James) 12,335 789,744

Spanish (Reina Valera 1909) 28,456 704,004

Russian (Synodal 1876) 47,226 560,524

Arabic (Smith Van Dyke) 55,300 440,435

French (Darby) 20,428 812,947

• Languages convey information in different number of words

Isolating language Synthetic language

Page 35: Advances in Data Analysis using PARAFAC2 and ... - cid.csic.es · NN-ASALSAN 0.083 (47) 1.0 (74) Kiers [23] 0.022 (67) 22.3 (400+) 4 Experimental results We consider two applications:

Example of Statistical Differences

!""#$%&'()*+)()*,()-+&".+)%/&"(/0()*,(1%2.,(%&(/#-(3+-+..,.(4/-3#"(

+-,(+44#-+),(+&5(4/$3.,),6()*%"()+2.,(7/#.5(+33,+-()/("#'',")()*+)(

!-+2%4( )+8,"( 9#")( /:,-( *+.0( )*,( &#$2,-( /0( ),-$"( )/( ,;3-,""( )*,(

"+$,(+$/#&)(/0( %&0/-$+)%/&(+"(<&'.%"*6( )*+)(<&'.%"*(+&5(=-,&4*(

)+8,("%$%.+-(&#$2,-"(/0( ),-$"6(+&5("/(/&>( ?&)#%)%:,.@6( )*%"(",,$"(

-%'*)( '%:,&( )*+)( !-+2%4( +&5( A#""%+&( -,.@( $#4*( $/-,( )*+&( )*+&(

<&'.%"*6(=-,&4*(+&5(B3+&%"*(/&()*,(#",(/0($/-3*/./'@(C,&5%&'"6(

+&5("/(/&D()/(+55()/(/-($/5%0@()*,($,+&%&'"(/0(7/-5">(E*,("+$,(

3*,&/$,&/&( 4+&( +."/( 2,( %..#")-+),5( 7,..( /&( +( "$+..( "4+.,( 2@(

4/&"%5,-%&'( )*,( 0%-")( F5/4#$,&)G( %&( )*,( 3+-+..,.( 4/-3#"( C)*,( 0%-")(

:,-",D6("*/7&(%&(E+2.,(H>(?)(%"(.%8,.@()/(2,(&/(4/%&4%5,&4,()*+)()*,(

2,")(4-/""I.+&'#+',(3-,5%4)%/&(-,"#.)"(7,(+4*%,:,5(7,-,(0/-(3+%-"(

/0( .+&'#+',"( 7%)*( "%$%.+-( ")+)%")%4"( C0/-( ,;+$3.,6( <&'.%"*( +&5(

=-,&4*D6( +&5( )*,( 7/-")( -,"#.)"( 7,-,( 0/-( )*/",( 7%)*( 5%""%$%.+-(

")+)%")%4"(C"#4*(+"(!-+2%4(+&5(<&'.%"*D(C",,(E+2.,(J(+&5(E+2.,(KD>(

!"#$%&'(&)$$*+,-",./0&/1&+,",.+,.2"$&3.11%-%02%+&#%,4%%0&

$"05*"5%+&&

& !%6,& 4/-3&

2/*0,&

7&/1&

,/,"$&

!A( !"#$%(&$'()*$(+$(,-.(/01*$(23>((

L( MK(

<N( ?&( )*,( 2,'%&&%&'( O/5( 4-,+),5( )*,(

*,+:,&"(+&5()*,(,+-)*>(

MP( QK(

=A( !#( 4/$$,&4,$,&)( R%,#( 4-S+( .,"(

4%,#;(,)(.+(),--,>(

T( QM(

AU( 4( 567689( :;<=;>?8( @;A( 59B;( ?(C9D8E>(

V( MV(

<B( <&(,.(3-%&4%3%/(4-%W(R%/"(./"(4%,./"(@(

.+()%,--+>(

MP( QK(

&

!8!9:&

&

;<&

&

=>>&

(

E*,( ")+)%")%4+.( 5%00,-,&4,"( 4+&6( %&( 0+4)6( 2,( "*/7&( )/( *+:,( +(

5,)-%$,&)+.(,00,4)(/&(XB!(Y(&/)(9#")(,$3%-%4+..@6(2#)()*,/-,)%4+..@(

+"( 7,..>( U&5,-( )*,( ")+&5+-5( ./'I,&)-/3@( 7,%'*)%&'( "4*,$,6( 7,(

4+&(:,-%0@(7*,)*,-6(+44/-5%&'()/()*%"("4*,$,6()*,(4/&)-%2#)%/&(/0(

,+4*(/0()*,(Z(.+&'#+',"(%&(/#-($#.)%.%&'#+.(+.%'&,5(3+-+..,.(4/-3#"(

%"(,[#+.(Y(7*%4*(%)("*/#.5(2,6(%0()*,()-+&".+)%/&"(+-,(4/$3.,),(+&5(

+44#-+),>( E*,( 4/$3#),5( ,&)-/3@( %"( +( $,+"#-,( /0( %&0/-$+)%/&(

4/&),&)\( 2,4+#",( )*,( "+$,( %&0/-$+)%/&( %"( 2,%&'( 4/&:,@,5( %&( )*,(

)-+&".+)%/&"(/0(+&@('%:,&(5/4#$,&)(%&()*,()-+%&%&'(4/-3#"6()*,()/)+.(

,&)-/3@(3,-( .+&'#+',(C)*,("#$(/0( ),-$(,&)-/3%,"(/0( ),-$"( %&()*+)(

.+&'#+',D("*/#.5(2,(4/&")+&)(0/-(+&@('%:,&(5/4#$,&)>(

U3/&( ,;+$%&+)%/&6(7,( 0/#&5( )*+)(7%)*( )*,( ")+&5+-5( ./'I,&)-/3@(

7,%'*)%&'( )*,( 4/$3#),5( %&0/-$+)%/&( 4/&),&)( :+-%,"( [#%),(7%5,.@(

2@( .+&'#+',6( 7*%4*( %"( 3,-*+3"( #&"#-3-%"%&'>( ?0( %)( )+8,"( ZLP6ZQK(

A#""%+&( 7/-5"( )/( ,;3-,""( 7*+)( <&'.%"*( "+@"( %&( VHT6VKK( 7/-5"6(

)*,&(/&(+:,-+',(A#""%+&(7/-5"($#")(4/&)+%&($/-,(%&0/-$+)%/&(C/-(

$,+&%&'D()*+&(<&'.%"*(7/-5"(C+'+%&6(+(&/)%/&(7*%4*(%"(4/&"%"),&)(

7%)*(7*+)(7,(8&/7(+2/#)( )*,(7+@(7/-5"(+-,( 0/-$,5( %&(A#""%+&(

+&5(<&'.%"*D>(]/7,:,-6("%&4,(,&)-/3@(C/-(%&0/-$+)%/&(4/&),&)D(%&(

the log-entropy scheme is simply the entropy of a particular term across all documents, and since the scheme takes no account of the specific properties of different languages, there is no guarantee that the contributions of different languages in the parallel corpus will be equal as they should be. In fact, in our parallel corpus, where under LSA all languages are "mixed together" in the bag-of-words approach, languages which have more terms overall (such as English and French) generally account for a higher percentage of the "information" in each document. This points to a flaw in standard multilingual LSA, or at least in the log-entropy weighting scheme as applied within that approach.

One other point to note is the difference between the total of 163,745 shown in Table 7 above, and the figure of 160,396 mentioned in section 4.1. The difference of 3,349 represents those terms that occur in more than one language, such as "de" ("of" in French and Spanish), English "coin" versus French "coin" ("corner"). The relatively small number of such terms is unlikely to affect the cross-language precision results significantly, but it is worth pointing out that standard LSA has no way to distinguish between homographs from different languages, and in some cases this could be problematic, especially when the homographs have very different meanings in the different languages.3

Given all this, the statistical explanation seems to be a reasonable one for why, when we attempted to map the documents graphically such that similar documents were close to one another, the documents clustered by language rather than by topic. Precisely the same issue has been identified elsewhere in the literature: Mathieu et al [18] report that "even if the cross-lingual similarity measure is designed to behave the same when comparing documents written in the same language and documents written in different ones, our evaluation shows that it still tends to gather in a cluster documents of same language prior to different language ones".

As discussed in the previous section, it is a drawback of the standard approach to LSA that there is no delineation between different languages in the training data. All languages are concatenated together in training, so that each "document" is multilingual. Within the LSA framework, however, this is unavoidable, since without the concatenation, LSA is unable to make the associations between words in different languages when they co-occur.

To overcome this problem, therefore, we propose a novel application of PARAFAC2 [13] as an alternative to LSA. PARAFAC2 is a variant of PARAFAC [12], a multi-way generalization of the SVD. The PARAFAC model is based on a "parallel proportional profile principle" that applies the same factors across a parallel set of matrices to minimize a least-squares objective. Let the $M \times N$ matrix $X_k$, $k = 1, \ldots, K$, denote the $k$th slice of a three-way data array $X$, and let $R$ be the number of dimensions of the LSA conceptual space. Then the standard PARAFAC model is

$$X_k = U S_k V^T \qquad (1)$$

3 The difference between the total of 3,307,654 and the 2,684,938 nonzeros mentioned in section 4.1 can also be explained: the figure in Table 6 is higher because some terms occur more than once in the same verse. For example, if "the" occurs three times in a particular verse, this would account for three tokens, but only one nonzero entry in the term-by-document matrix.


Page 36

Language Morphology

• Isolating language: One morpheme per word
- e.g., "He travelled by hovercraft on the sea." Largely isolating, but travelled and hovercraft each have two morphemes per word. (Wikipedia)

• Synthetic language: High morpheme-per-word ratio
- German: Aufsichtsratsmitgliederversammlung => "On-view-council-with-limbs-gathering" meaning "meeting of members of the supervisory board". (Wikipedia)
- Chulym: Aalychtypiskem => "I went out moose hunting"
- Yupʼik Eskimo: tuntussuqatarniksaitengqiggtuq => “He had not yet said again that he was going to hunt reindeer.” (Payne, 1997)

Translation Terms Total Words

English (King James) 12,335 789,744

Arabic (Smith Van Dyke) 55,300 440,435

Isolating language: Chinese

Synthetic language: Quechua, Inuit (Eskimo)

Languages convey information in different numbers of words
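As a rough illustration of the point above, the two rows of the table can be turned into a tokens-per-term ratio. This is only a sketch using the figures quoted on this slide; the dictionary layout and the function name `tokens_per_term` are hypothetical choices made for the example.

```python
# Figures copied from the table on this slide:
# "terms" = distinct word types, "total_words" = running word count.
translations = {
    "English (King James)": {"terms": 12_335, "total_words": 789_744},
    "Arabic (Smith Van Dyke)": {"terms": 55_300, "total_words": 440_435},
}

def tokens_per_term(stats):
    """Average number of times each distinct term is used."""
    return stats["total_words"] / stats["terms"]

ratios = {name: tokens_per_term(s) for name, s in translations.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} tokens per distinct term")
```

The English translation reuses each term far more often than the Arabic one, which is consistent with Arabic packing more morphemes into each word.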

Page 37

Term-Doc Matrix

[Figure: term-by-verse matrix for all languages, 163,745 x 31,230. Rows: terms (English, Spanish, Russian, Arabic, French); columns: Bible verses.]

Look for co-occurrence of terms in the same verses and across languages to capture latent concepts

• Approach is not new
- pairs of languages in LSA suggested by (Berry et al., 1994)
- multi-parallel corpus is new

Page 38

Latent Semantic Indexing

[Figure: term-by-verse matrix for all languages (rows: terms for English, Spanish, Russian, Arabic, French; columns: Bible verses), factored by truncated SVD as $U \Sigma V^T$, where $U$ is the term x concept matrix.]

Project new documents of interest into subspace of $U\Sigma^{-1}$

• cosine similarities for clustering
• machine learning applications

[Figure: a new document is projected to a document feature vector.]

Document feature vector (dimensions 1 to 20): 0.1375, 0.1052, 0.0341, 0.0441, -0.0087, 0.0410, 0.1011, 0.0020, 0.0518, 0.0822, -0.0101, -0.1154, -0.0990, 0.0228, -0.0520, 0.1096, 0.0294, 0.0495, 0.0553, 0.1598

$$A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T$$
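The truncated SVD and the projection step on this slide can be sketched with NumPy. This is a minimal illustration on a small random matrix, not the Bible data; the matrix `A`, the choice `k = 3`, and the helper names `project` and `cosine` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((12, 8))          # toy term-by-document matrix (12 terms, 8 documents)
k = 3                            # number of latent dimensions kept

# Thin SVD, then keep the k largest singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Two equivalent ways of writing the rank-k approximation A_k.
A_k = U_k @ np.diag(s_k) @ Vt_k
A_k_sum = sum(s_k[i] * np.outer(U_k[:, i], Vt_k[i, :]) for i in range(k))

def project(doc_vector):
    """Map a term-space vector d to concept space: Sigma_k^{-1} U_k^T d."""
    return (U_k.T @ doc_vector) / s_k

def cosine(a, b):
    """Cosine similarity between two concept-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A hypothetical "new" document: a slightly perturbed copy of document 0.
new_doc = A[:, 0] + 0.01 * rng.random(12)
print(cosine(project(new_doc), project(A[:, 0])))
```

Projecting a column of `A` itself returns the corresponding column of `Vt_k`, which is why new documents projected this way can be compared directly with the training documents by cosine similarity.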

Page 39

Quran as Test Set

• Quran is translated into many languages, just like the Bible

• 114 suras (or chapters)

• More variation across translations => harder IR task

Page 40

Performance Metrics

• Precision at 1 document (P1)
- Equals 100% if the translation of the query is ranked highest, 0% otherwise
- Calculated as an average over all queries for each language pair or as a total average
- Essentially, P1 measures success in retrieving documents when the source and target languages are specified

• Average multilingual precision at 5 (or n) documents (MP5)
- The average percentage of the top 5 documents that are translations of the query document
- Calculated as an average for all queries & all languages
- Essentially, MP5 measures success in multilingual clustering

• MP5 is a stricter measure than P1 because the retrieval task is harder

[Figure: query documents matched against unknown documents in Lang 1 and in Lang 1 and Lang 2.]
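The two metrics defined on this slide can be sketched as follows. This is an illustrative reading of the definitions, not the authors' evaluation code: the function names are hypothetical, and the sketch assumes that the query itself stays in the ranked pool for MPn (it is one of the n language versions of its own topic). The toy vectors are constructed so that language dominates similarity.

```python
import numpy as np

def cosine_matrix(docs):
    """Pairwise cosine similarities between the rows of docs."""
    unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return unit @ unit.T

def precision_at_1(sim, topic, lang, src, tgt):
    """Share of src-language queries whose best tgt-language match is their translation."""
    queries = np.flatnonzero(lang == src)
    targets = np.flatnonzero(lang == tgt)
    hits = 0
    for q in queries:
        best = targets[np.argmax(sim[q, targets])]
        hits += int(topic[best] == topic[q])
    return hits / len(queries)

def multilingual_precision(sim, topic, n=5):
    """Average share of the top-n documents (any language) that translate the query."""
    per_query = []
    for q in range(sim.shape[0]):
        top = np.argsort(-sim[q])[:n]
        per_query.append(np.mean(topic[top] == topic[q]))
    return float(np.mean(per_query))

# Toy data: 2 topics x 2 languages; the first coordinate encodes language
# (large), the second encodes topic (small).
topic = np.array([0, 0, 1, 1])
lang = np.array(["en", "fr", "en", "fr"])
docs = np.array([[1.0, 0.2], [-1.0, 0.2], [1.0, -0.2], [-1.0, -0.2]])

sim = cosine_matrix(docs)
p1 = precision_at_1(sim, topic, lang, "en", "fr")
mp2 = multilingual_precision(sim, topic, n=2)
print(p1, mp2)
```

In this toy case P1 is perfect while the multilingual precision is only one half, because each document's nearest neighbour is the other document in the same language. That is the sense in which the multilingual measure is the stricter of the two.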

Page 41

LSA Results

Documents tend to cluster more by language than by topic

5 languages, 240 latent dimensions

(Chew, Bader, Kolda, Abdelali, 2007)

Method Average P1 Average MP5

SVD/LSA 76.0% 26.1%

Page 42

LSA Results

Method Average P1 Average MP5

SVD/LSA (α=1) 76.0% 26.1%

SVD/LSA (α=1.8) 87.6% 65.5%

5 languages, 240 latent dimensions

Page 43

New Approach: Multi-matrix Array

[Figure: stack of term-by-verse matrices X1, ..., X5, one for each language (English, Spanish, Russian, Arabic, French).]

(Chew, Bader, Kolda, Abdelali, 2007)

Array size: 55,300 x 31,230 x 5 with 2,765,719 nonzeros
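The difference between the single stacked matrix used for LSA and the multi-matrix array can be sketched on a tiny parallel corpus. The two-verse, two-language corpus and the helper `term_by_verse` below are invented for illustration only.

```python
import numpy as np

# Hypothetical miniature parallel corpus: the same two "verses" in two languages.
verses = {
    "English": ["in the beginning", "the light was good"],
    "French": ["au commencement", "la lumière était bonne"],
}

def term_by_verse(texts):
    """Return (vocabulary, count matrix) with one row per term and one column per verse."""
    vocab = sorted({word for text in texts for word in text.split()})
    index = {word: i for i, word in enumerate(vocab)}
    X = np.zeros((len(vocab), len(texts)))
    for j, text in enumerate(texts):
        for word in text.split():
            X[index[word], j] += 1
    return vocab, X

# Multi-matrix view: one term-by-verse matrix X_k per language.
per_language = {name: term_by_verse(texts) for name, texts in verses.items()}

# LSA view: all languages concatenated along the term dimension.
stacked = np.vstack([X for _, X in per_language.values()])

for name, (vocab, X) in per_language.items():
    print(name, X.shape, int(np.count_nonzero(X)), "nonzeros")
print("stacked:", stacked.shape)
```

The matrices share the verse (column) dimension but may have different numbers of terms. The array size quoted above uses 55,300 rows, which matches the Arabic term count given earlier in these slides, so the smaller languages were presumably padded to that size.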

Page 44

Tucker1

[Figure: Tucker and Tucker1 decompositions of the matrices X1, X2, X3, with factors U1, U2, U3, S1, and V^T.]

Page 45

Tucker1 Results

Only minor improvement because each Uk is not orthogonal

5 languages, 240 latent dimensions

Method Average P1 Average MP5

SVD/LSA (α=1) 76.0% 26.1%

SVD/LSA (α=1.8) 87.6% 65.5%

Tucker1 89.5% 71.3%

Page 46

PARAFAC2

$$X_k \approx U_k H S_k V^T$$

where each $U_k$ is orthonormal and $S_k$ is diagonal

(Harshman, 1972)
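The structure of the model above can be checked numerically on synthetic factors. The sketch below only builds slices that satisfy the model exactly and verifies one consequence of the constraints; it does not fit PARAFAC2 to data, and the sizes, seed, and helper name `random_orthonormal` are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
R, N = 3, 6                 # latent dimensions, number of verses (columns)
rows = [8, 5, 7]            # terms per language; the slices may differ in row count

H = rng.standard_normal((R, R))     # shared R x R matrix
V = rng.standard_normal((N, R))     # shared verse-by-concept factor

def random_orthonormal(m, r):
    """m x r matrix with orthonormal columns (requires m >= r)."""
    q, _ = np.linalg.qr(rng.standard_normal((m, r)))
    return q

slices = []
for m in rows:
    U_k = random_orthonormal(m, R)
    S_k = np.diag(rng.uniform(0.5, 2.0, size=R))
    X_k = U_k @ H @ S_k @ V.T       # the model, with equality for synthetic data
    slices.append((U_k, S_k, X_k))

# Because U_k^T U_k = I, the cross-product (U_k H)^T (U_k H) equals H^T H
# for every slice, whatever the number of rows in that slice.
grams = [(U_k @ H).T @ (U_k @ H) for U_k, _, _ in slices]
print(all(np.allclose(G, H.T @ H) for G in grams))
```

The term factor `U_k @ H` is free to differ from language to language, but its cross-product is the same for every language. This is what allows each language its own vocabulary while the verse factor `V` is shared.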

Page 47

PARAFAC2 Results

Modest improvement over LSA

5 languages, 240 latent dimensions

Method Average P1 Average MP5

SVD/LSA (α=1) 76.0% 26.1%

SVD/LSA (α=1.8) 87.6% 65.5%

Tucker1 89.5% 71.3%

PARAFAC2 89.8% 78.5%

Page 48

Bible, color-coded according to language, are represented in the vector space.18 Note that the books cluster first to their counterparts in other languages, and then into larger clusters containing related books; this kind of visualization is possible only when MP6 is satisfactorily high. In summary, these techniques have effectively allowed us to factor language out, focusing only on topic, just as we had hoped.

Figure 6. Visualization of multilingual Bible books in vector space

18 To create this visualization, we used Tamale 1.2. The graph layout is created using VxOrd. Both are software created in-house at Sandia National Laboratories. For further details on VxOrd, see Boyack et al. (2005).

Bible Clustering with LMSATA

• Books of the Bible color coded by language
• MP5 about 90%
• Books cluster first with their counterparts in other languages, then in larger clusters by topic
• Visualization software: Tamale 1.2
• Graph layout: VxOrd (Boyack et al., 2005)

Page 49

Graph Layout (close-up)

Figure 7. Partial visualization of multilingual Bible books in vector space

7 Discussion

In this paper, we have offered what we believe to be clear evidence that computational linguistics provides fertile ground for rethinking many of the standard techniques of information retrieval. This is true because of the strong basis CL has in information theory. All of our proposed modifications (weighting term-document pairs according to mutual information; redefining the ‘term’ as the morpheme, or minimal unit of meaning; not excluding high-entropy terms; and including term alignments of the sort used in SMT) have obvious connections to IT. What is more, if included in an IR framework, they clarify the relationship between IT and IR, theoretically strengthening the latter without necessarily getting in the way of speed and efficiency.

We have examined four basic types of modification, but we believe that there are further opportunities for improvement. An obvious example is that instead of limiting the morphological component to stem-suffix tokenization, we could also take prefixes and word-internal suffixes into account. Similarly, the performance of our CLIR model with respect to Arabic and other Semitic languages could be improved if the morphological component were able to identify non-concatenative roots and patterns.

Following this line of reasoning further, one could conceive of building a phonological and/or morphophonological component into the framework without compromising its basis in unsupervised learning. If morphological regularities can be learned in an unsupervised manner, there is no reason in principle why morphophonological regularities of the sort described in Chomsky and Halle (1968) cannot also be learned in the same way. Where morphophonological or phonological alternations are reflected in orthography, as in English divid+e/divis+ion or Russian !"#$%&+'/!"#$%(+#), one can

• John and Acts have tight clusters
• Some mixing with Matthew, Mark, Luke (synoptic gospels - share a similar perspective)

Page 50

Using DEDICOM for Completely Unsupervised Part-of-Speech Tagging

Joint work with Peter Chew

Page 51

Approaches to POS tagging (1)

• Supervised

– Rule-based (e.g. Harris 1962)
  • Dictionary + manually developed rules
  • Brittle – approach doesn’t port to new domains

– Stochastic (e.g. Stolz et al. 1965, Church 1988)
  • Examples: HMMs, CRFs
  • Relies on estimation of emission and transition probabilities from a tagged training corpus
  • Again, difficulty in porting to new domains

Page 52

Approaches to POS tagging (2)

• Unsupervised
– All approaches exploit distributional patterns
– Singular Value Decomposition (SVD) of term-adjacency matrix (Schütze 1993, 1995)
– Graph clustering (Biemann 2006)
– Our approach: DEDICOM of term-adjacency matrix
– Most similar to Schütze (1993, 1995)

Advantages:
• can be reconciled to stochastic approaches
• completely unsupervised, like SVD and graph clustering
• initial results appear promising

Page 53

DEDICOM – application to POS tagging

• The assumption that terms are a “single set of objects”, whether they precede or follow, sets DEDICOM apart from SVD and other unsupervised approaches

• This assumption models the fact that tokens play the same syntactic role whether we view them as the first or second element in a bigram

[Figure: the term adjacency matrix decomposed into the ‘A’ matrix and the ‘R’ matrix.]

Page 54

Comparing DEDICOM output to HMM input

• The output of DEDICOM is essentially a transition and emission probability matrix

• DEDICOM offers the possibility of getting the familiar transition and emission probabilities without tagged training data

Output of DEDICOM: ‘R’ matrix, ‘A’ matrix

Input to HMM (after normalization of counts): Transition prob. matrix, Emission prob. matrix

Page 55

Validation: method 1 (theoretical)

• Hypothetical example - suppose tagged training corpus exists

Corpus: The man walked the big dog
Tags: DT NN VBD DT JJ NN

X: sparse matrix of bigram counts
A*: term-tag counts
R*: tag-adjacency counts

• By definition (subject to difference of 1 for final token):
– row sums of X = col sums of X = row sums of A*
– col sums of A* = row sums of R* = col sums of R*

Page 56

Validation: method 1 (theoretical)

• To turn A* and R* into transition and emission probability matrices, we simply multiply each by a diagonal matrix D where the entries are the inverses of the row-sum vector

• If the DEDICOM model is a good one, we should be able to multiply A* D R* D (A*)^T to approximate the original matrix X

• In our example, A* D R* D (A*)^T = [matrix shown as a figure on the original slide]

• This approximates X, and it also captures some syntactic regularities which aren’t instantiated in the corpus (this is one reason HMM-based POS tagging is successful)
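The hypothetical example can be worked through numerically. The sketch below tabulates X, A* and R* from the six-token corpus and forms the product A* D R* D (A*)^T. It assumes that D holds the inverse tag counts (the column sums of A*) and that tokens are lower-cased; both choices, and all variable names, are assumptions made for this illustration.

```python
import numpy as np

tokens = ["the", "man", "walked", "the", "big", "dog"]
tags = ["DT", "NN", "VBD", "DT", "JJ", "NN"]

terms = sorted(set(tokens))
tagset = sorted(set(tags))
t_idx = {t: i for i, t in enumerate(terms)}
g_idx = {g: i for i, g in enumerate(tagset)}

X = np.zeros((len(terms), len(terms)))            # bigram counts
A_star = np.zeros((len(terms), len(tagset)))      # term-tag counts
R_star = np.zeros((len(tagset), len(tagset)))     # tag-adjacency counts

for w, g in zip(tokens, tags):
    A_star[t_idx[w], g_idx[g]] += 1

tagged = list(zip(tokens, tags))
for (w1, g1), (w2, g2) in zip(tagged, tagged[1:]):
    X[t_idx[w1], t_idx[w2]] += 1
    R_star[g_idx[g1], g_idx[g2]] += 1

# Assumed normalization: D = diag(1 / tag counts), tag counts = column sums of A*.
D = np.diag(1.0 / A_star.sum(axis=0))
X_hat = A_star @ D @ R_star @ D @ A_star.T

def show(w1, w2):
    print(f"{w1} {w2}: observed {X[t_idx[w1], t_idx[w2]]}, "
          f"reconstructed {X_hat[t_idx[w1], t_idx[w2]]}")

show("walked", "the")   # bigram present in the corpus
show("the", "dog")      # bigram absent from the corpus
```

Under these assumptions the reconstruction keeps the total bigram mass but spreads part of it over unseen yet syntactically plausible bigrams such as "the dog", which illustrates the last bullet above.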

Page 57

Validation: method 2 (empirical)

• Use a tagged corpus (CONLL 2000)
– CONLL 2000 has 19,440 distinct terms
– There are 44 distinct tags in the tagset (our “gold standard”)

• Tabulate X matrix (solely from bigram frequencies, blind to tags)

• Fit DEDICOM model to X (using k = 44) to ‘learn’ emission and transition probability matrices

• Use these as input to an HMM; tag each token with a numerical index (one of the DEDICOM ‘dimensions’)

• Evaluate by looking at correlation of induced tags with the 44 “gold standard” tags in a confusion matrix

Page 58

Validation: method 2 (empirical)

Confusion matrix: correlation with ‘ideal’ diagonal matrix = 0.494

Assuming the ‘gold standard’ is the optimal tagging scheme, the ideal confusion matrix would have one DEDICOM class per ‘gold standard’ tag (either a diagonal matrix or some permutation thereof).

[Figure: confusion matrix of DEDICOM tags against gold standard tags.]

Page 59

Validation: method 2 (empirical)

• Examples of DEDICOM dimensions or clusters:

Soft clustering: U.S. appears under tag 2 (nouns) and tag 8 (adjectives)

[Figure: example DEDICOM clusters, with panels labeled “NN”, “NN”, and “DT”.]

Tag 3: Grouping of ‘new’ with ‘the’ and ‘a’ explained by distributional similarities (all precede nouns). This is also in accordance with a traditional English grammar, which held that determiners are a type of adjective.

Page 60

Future Work in POS Tagging

• Constrained DEDICOM
- Tags for common terms (e.g., determiners, prepositions)
- Semi-supervised learning

• Analyze multiple corpora via 3-way DEDICOM

• Constraints on R matrix in DEDICOM

Page 61

Discussion and Questions

• Challenges with global optimization
- Ways to reduce the search space and find the global minimizer
- Deterministic algorithm that returns a good approximation to the global solution

• Fast algorithms for large-scale problems

• Other data analysis models

Page 62

Selected Publications

• Bader, Harshman, and Kolda. 2007. Temporal analysis of semantic graphs using ASALSAN. Proceedings of IEEE International Conference on Data Mining (ICDM), 2007.

• Chew, Bader, Kolda and Abdelali. 2007. Cross-language information retrieval using PARAFAC2. Proceedings of KDD 2007.

• Chew, Kegelmeyer, Bader and Abdelali. 2008. The Knowledge of Good and Evil: Multilingual Ideology Classification with PARAFAC2 and Machine Learning. Language Forum (34), 37-52.

• Chew, Bader, and Rozovskaya. 2009. Using DEDICOM for Completely Unsupervised Part-of-Speech Tagging. Proceedings of NAACL-HLT, Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics.

• B. W. Bader. 2009. Constrained and Unconstrained Optimization. Brown, Tauler, Walczak (eds.) Comprehensive Chemometrics, volume 1, pp. 507-545. Oxford: Elsevier.

Brett Bader