
Multilinear Algebra for Analyzing Data with Multiple Linkages

Tamara G. Kolda

plus: Brett Bader, Danny Dunlavy, Philip Kegelmeyer

Sandia National Labs

TRICAP 2006, Chania, Greece, June 4-9, 2006


Tamara G. Kolda – TRICAP – June 6, 2006 - p.2

Linear Algebra for Data with Linkages

Circle-Square Matrix

Circle-Circle Co-Link Matrix

Square-Square Co-Link Matrix

SVD Rank-k Approximation (k=2)
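As a rough illustration of the three matrices named on this slide, the sketch below builds co-link matrices from a circle-by-square link matrix and truncates its SVD. The matrix entries and the helper name `rank_k_approx` are made up for the example; the slide's actual values are not recoverable from the transcript.

```python
import numpy as np

# Hypothetical circle-by-square link matrix: entry (i, j) is 1 when
# circle i is linked to square j. Values are illustrative only.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

circle_colinks = X @ X.T   # circle-circle: number of squares two circles share
square_colinks = X.T @ X   # square-square: number of circles two squares share

def rank_k_approx(M, k):
    """Best rank-k approximation in the Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

X2 = rank_k_approx(X, k=2)
print(np.round(X2, 2))
```

The truncation keeps the two strongest link patterns, and the approximation error equals the discarded singular value.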


Latent Semantic Indexing (LSI) for Text Retrieval

• S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, and R. Harshman. Using latent semantic analysis to improve access to textual information. In CHI '88, pp. 281–285, 1988

• S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci., 41(6):391–407, 1990

• M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Rev., 37(4):573–595, 1995

Term-Document Matrix

Terms: car, repair, service, military

Documents: d1, d2, d3

“Car Service” Query

SMART Retrieval System: G. Salton (1971)

LSI: S. Dumais et al. (1988)
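The contrast between plain vector-space retrieval and LSI for the “car service” query can be sketched as follows. The term counts are hypothetical (the slide's matrix entries did not survive extraction), and `cosine` is a helper introduced only for this example.

```python
import numpy as np

terms = ["car", "repair", "service", "military"]
# Hypothetical counts (rows = terms, columns = d1, d2, d3); not from the slide.
X = np.array([
    [1, 0, 0],   # car
    [1, 1, 0],   # repair
    [0, 1, 1],   # service
    [0, 0, 1],   # military
], dtype=float)

def cosine(a, b):
    """Cosine of the angle between two vectors (0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Query vector for "car service".
q = np.array([1.0 if t in ("car", "service") else 0.0 for t in terms])

# Vector-space scoring: compare the query with each raw document column.
vsm_scores = [cosine(q, X[:, j]) for j in range(X.shape[1])]

# LSI scoring: project query and documents onto the top-k left singular vectors.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk = U[:, :k]
q_k = Uk.T @ q
docs_k = Uk.T @ X            # same as diag(s[:k]) @ Vt[:k, :]
lsi_scores = [cosine(q_k, docs_k[:, j]) for j in range(X.shape[1])]

print("vector space:", np.round(vsm_scores, 3))
print("LSI (k=2):   ", np.round(lsi_scores, 3))
```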


Applications of LSI

Terms: car, repair, service, military

Documents: d1, d2, d3

Graph the Results using U2 and V2

[Figure: terms (car, service, military, repair) and documents (d1, d2, d3) plotted in two dimensions]

Term-Document Similarities

Term-Term

Document-Document

[Figures: similarity matrices over the terms and documents; values not recoverable]


Caveats for LSI

• How to use

• Term-document matrix weighting is critical!

Local Weight: Log (fij = frequency of term i in document j)

Global Term Weight: Inverse Document Frequency (N = total docs, ni = # docs with term i)

Normalization Factor: “Cosine”
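The three weighting ingredients can be combined as in the sketch below. The slide's formulas did not survive extraction, so the specific forms used here, log(1 + fij) for the local weight and log(N / ni) for the global weight, are common textbook choices and should be read as assumptions. The function name and the toy counts are also hypothetical.

```python
import math

def weight_term_document(counts):
    """counts[i][j] = raw frequency f_ij of term i in document j."""
    n_terms, n_docs = len(counts), len(counts[0])
    # Local weight (assumed form): log(1 + f_ij) damps very frequent terms.
    local = [[math.log(1.0 + f) for f in row] for row in counts]
    # Global weight (assumed form): log(N / n_i), with N = total docs and
    # n_i = number of docs containing term i.
    idf = []
    for row in counts:
        n_i = sum(1 for f in row if f > 0)
        idf.append(math.log(n_docs / n_i) if n_i > 0 else 0.0)
    weighted = [[idf[i] * local[i][j] for j in range(n_docs)]
                for i in range(n_terms)]
    # Cosine normalization: scale each document column to unit 2-norm.
    for j in range(n_docs):
        norm = math.sqrt(sum(weighted[i][j] ** 2 for i in range(n_terms)))
        if norm > 0:
            for i in range(n_terms):
                weighted[i][j] /= norm
    return weighted

# Toy counts: 4 terms by 3 documents.
counts = [[2, 0, 0], [1, 1, 0], [0, 3, 1], [0, 0, 4]]
W = weight_term_document(counts)
```

A term that occurs in every document receives global weight zero under this choice, which is one reason the weighting can change LSI results substantially.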


Citation/Link Analysis (Same Nodes)

[Figure: link graph on four nodes, numbered 1 to 4]

Link Matrix

Co-Citation Matrix

Co-Reference Matrix

Hub Scores

Authority Scores

Doc 3 is the most important authority!

Doc 1 is the most important hub!

J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.

Examples: citation data, web links
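A small sketch of how hub and authority scores follow from a link matrix. The four-node graph here is hypothetical; it is chosen so that doc 3 comes out as the top authority and doc 1 as the top hub, but it is not claimed to be the graph on the slide.

```python
import numpy as np

# Hypothetical link matrix: L[i, j] = 1 when doc i+1 links to (cites) doc j+1.
L = np.array([
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
], dtype=float)

co_citation = L.T @ L    # (j, k): number of docs that cite both j and k
co_reference = L @ L.T   # (i, k): number of docs cited by both i and k

# Hub and authority vectors are the principal eigenvectors of L L^T and
# L^T L, i.e. the leading left and right singular vectors of L.
U, s, Vt = np.linalg.svd(L)
hubs = np.abs(U[:, 0])           # abs() removes the SVD sign ambiguity
authorities = np.abs(Vt[0, :])

print("top hub: doc", int(np.argmax(hubs)) + 1)                # doc 1
print("top authority: doc", int(np.argmax(authorities)) + 1)   # doc 3
```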


Multiple Links?

Suppose the connections between nodes are “labeled” in some fashion.

In other words, we have meta-data on the connections.

Can we somehow use multilinear algebra for link analysis?


PARAFAC

• PARAFAC = Parallel Factors

• aka. CANDECOMP = Canonical Decomposition

• Higher-order analogue of the SVD

• Columns of A, B, and C are not orthonormal

• If R is minimal, then R is called the rank of the tensor (Kruskal 1977)

• Can have rank(X) > min{I,J,K}

• Often guaranteed to be a unique rank decomposition!

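$$\mathcal{X} = \mathcal{I} \times_1 A \times_2 B \times_3 C = a_1 \circ b_1 \circ c_1 + a_2 \circ b_2 \circ c_2 + \cdots + a_R \circ b_R \circ c_R$$

where $\mathcal{X}$ is I x J x K, A is I x R, B is J x R, C is K x R, and the core $\mathcal{I}$ is the R x R x R identity tensor.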

• R. A. Harshman. Foundations of the PARAFAC procedure: models and conditions for an “explanatory” multi-modal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.

• J. D. Carroll and J. J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition. Psychometrika, 35:283–319, 1970.
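To make the PARAFAC model concrete, the sketch below assembles a three-way array from factor matrices in three equivalent ways: element-wise, as a sum of rank-one outer products, and as an identity core multiplied by A, B, and C. The sizes and the random factors are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 4, 3, 2, 2          # illustrative sizes, not from the talk
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

# Element-wise form: x_ijk = sum_r a_ir * b_jr * c_kr
X = np.einsum("ir,jr,kr->ijk", A, B, C)

# The same tensor as an explicit sum of R rank-one outer products.
X_sum = sum(
    np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
    for r in range(R)
)

# The same tensor again, written with an R x R x R identity core.
core = np.zeros((R, R, R))
for r in range(R):
    core[r, r, r] = 1.0
X_core = np.einsum("pqs,ip,jq,ks->ijk", core, A, B, C)
```

Unlike the SVD, nothing here requires the columns of A, B, or C to be orthonormal.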


Many ways to write PARAFAC

J. B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Appl., 18(2):95–138, 1977.

“Kruskal Operator”

Easy to write N-way case:

“Tucker Operator”
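The formulas on this slide did not survive extraction. As a reminder of the standard definitions (a reconstruction from the general literature, not a copy of the slide), the Kruskal and Tucker operators for three-way and N-way arrays can be written as:

```latex
% Kruskal operator (three-way case): a sum of R rank-one terms
\mathcal{X} = [\![ A, B, C ]\!] = \sum_{r=1}^{R} a_r \circ b_r \circ c_r,
\qquad x_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}

% N-way case
\mathcal{X} = [\![ A^{(1)}, A^{(2)}, \dots, A^{(N)} ]\!]
            = \sum_{r=1}^{R} a^{(1)}_r \circ a^{(2)}_r \circ \cdots \circ a^{(N)}_r

% Tucker operator; PARAFAC is the special case with an identity core
[\![ \mathcal{G}; A, B, C ]\!] = \mathcal{G} \times_1 A \times_2 B \times_3 C,
\qquad [\![ A, B, C ]\!] = [\![ \mathcal{I}; A, B, C ]\!]
```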


Properties of the Kruskal Operator

PARAFAC core for a Tucker decomposition:

Matricize (arbitrary map of indices to rows and columns):

Mode-n matricize:

Norm of a PARAFAC decomposition:
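The identities behind these headings can be checked numerically. The sketch below uses standard forms of the mode-n matricization, norm, and PARAFAC-core properties; since the slide's own equations are missing from the transcript, these forms and the helper names `khatri_rao` and `unfold` are assumptions of this example.

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product: column r is kron(P[:, r], Q[:, r])."""
    R = P.shape[1]
    return np.einsum("pr,qr->pqr", P, Q).reshape(-1, R)

def unfold(X, mode):
    """Mode-n matricization; remaining indices vary first-index-fastest."""
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order="F")

rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((n, 3)) for n in (5, 4, 3))
X = np.einsum("ir,jr,kr->ijk", A, B, C)      # X = [[A, B, C]]

# Mode-n matricization of a Kruskal-form tensor.
ok_mode1 = np.allclose(unfold(X, 0), A @ khatri_rao(C, B).T)
ok_mode2 = np.allclose(unfold(X, 1), B @ khatri_rao(C, A).T)
ok_mode3 = np.allclose(unfold(X, 2), C @ khatri_rao(B, A).T)

# Norm computed from R x R Gram matrices only (no need to form X).
gram = (A.T @ A) * (B.T @ B) * (C.T @ C)
ok_norm = np.isclose(np.linalg.norm(X) ** 2, gram.sum())

# PARAFAC core inside a Tucker form: [[ [[U,V,W]] ; A,B,C ]] = [[AU, BV, CW]]
U, V, W = (rng.standard_normal((3, 2)) for _ in range(3))
G = np.einsum("pr,qr,sr->pqs", U, V, W)
lhs = np.einsum("pqs,ip,jq,ks->ijk", G, A, B, C)
rhs = np.einsum("ir,jr,kr->ijk", A @ U, B @ V, C @ W)
ok_core = np.allclose(lhs, rhs)

print(ok_mode1, ok_mode2, ok_mode3, ok_norm, ok_core)
```

The norm identity matters for sparse problems because it avoids materializing the dense reconstruction.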


PARAFAC for sparse data & approximations

• Our interest in the mathematical operations is motivated on two fronts

(1) Sparse computations

(2) Using tensor decompositions for approximation

• Ex: Considering how to efficiently implement PARAFAC-ALS for sparse data

• Can PARAFAC be used for the best rank-k approximation, rather than finding an exact decomposition (excepting noise)?

What does it even mean in this case??
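One way to keep PARAFAC-ALS sparse-friendly is to compute the "matricized tensor times Khatri-Rao product" directly from the nonzeros. The sketch below is a minimal illustration under that assumption and is not the authors' implementation. The function names, the coordinate-format inputs (`subs`, `vals`), and the toy data are all hypothetical.

```python
import numpy as np

def sparse_mttkrp(subs, vals, factors, mode):
    """Matricized tensor times Khatri-Rao product, touching only nonzeros.

    subs: (nnz, N) integer indices, vals: (nnz,) values.
    """
    rows, R = factors[mode].shape
    prod = np.repeat(vals[:, None], R, axis=1)   # one length-R row per nonzero
    for m, F in enumerate(factors):
        if m != mode:
            prod *= F[subs[:, m], :]
    out = np.zeros((rows, R))
    np.add.at(out, subs[:, mode], prod)          # sum rows sharing an index
    return out

def parafac_als(subs, vals, shape, rank, iters=50, seed=0):
    """Alternating least squares for a rank-`rank` PARAFAC model."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((n, rank)) for n in shape]
    for _ in range(iters):
        for mode in range(len(shape)):
            # Hadamard product of the Gram matrices of the other factors.
            gram = np.ones((rank, rank))
            for m, F in enumerate(factors):
                if m != mode:
                    gram *= F.T @ F
            rhs = sparse_mttkrp(subs, vals, factors, mode)
            factors[mode] = rhs @ np.linalg.pinv(gram)
    return factors

# Toy sparse tensor in coordinate format (hypothetical data).
shape = (4, 3, 2)
subs = np.array([[0, 0, 0], [1, 2, 1], [3, 1, 0], [2, 2, 1]])
vals = np.array([1.0, 2.0, 3.0, 4.0])
factors = parafac_als(subs, vals, shape, rank=2)
```

Each sweep costs time proportional to the number of nonzeros times the rank, plus small R x R solves, so the dense I x J x K array is never formed.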


Multilink Analysis using PARAFAC

• Quick Review: Tensors for Web Link Analysis

page x page x anchor text (TOPHITS)

• New work: Tensors for Publication Data Analysis

Case 1: doc x doc x similarity

Case 2: term x doc x author (HO-LSA??)


TOPHITS: PARAFAC for Web Link Analysis

A set of four hyperlinked web pages

Graph representation shows basic connectivity

Labeled edges capture context
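A page x page x anchor text tensor can be assembled directly from labeled edges, as in this sketch. The page names, anchor terms, and edges are invented for illustration and do not correspond to the four pages in the slide's figure.

```python
import numpy as np

# Hypothetical labeled edges: (source page, target page, anchor-text term).
edges = [
    ("page1", "page2", "tensors"),
    ("page1", "page3", "algebra"),
    ("page2", "page3", "algebra"),
    ("page4", "page3", "algebra"),
    ("page3", "page1", "tensors"),
]

pages = sorted({p for e in edges for p in e[:2]})
terms = sorted({e[2] for e in edges})
p_idx = {p: i for i, p in enumerate(pages)}
t_idx = {t: i for i, t in enumerate(terms)}

# X[i, j, k] counts links from page i to page j whose anchor text is term k.
X = np.zeros((len(pages), len(pages), len(terms)))
for src, dst, term in edges:
    X[p_idx[src], p_idx[dst], t_idx[term]] += 1.0

# Collapsing the anchor-text mode gives the plain adjacency matrix,
# i.e. the unlabeled graph that ordinary link analysis would see.
adjacency = X.sum(axis=2)
```

A PARAFAC decomposition of `X` then yields, per component, a hub vector, an authority vector, and a vector of anchor-text terms that labels the topic.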


Analyzing Publication Data: Doc x Doc x Similarity Representation


Computing Different Doc-Doc Similarities

Computing term-based similarities (k=1,2,3)

Computing author similarities (k=4)

Enforces sparseness!

• 5022 papers

• 16617 unique terms (ignoring stop words, words with length less than 3 or greater than 30 characters, and words that appear fewer than 2 times)

Titles: 5164

Abstracts: 15752

Keywords: 5248

• 6891 authors

• 2659 citations
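The similarity formulas on this slide are not recoverable from the transcript. The sketch below shows one plausible construction: cosine similarities between documents, with small values zeroed to keep each slice sparse, stacked into a doc x doc x similarity array. The thresholding rule, the threshold value, and the random stand-in data are assumptions for illustration.

```python
import numpy as np

def cosine_similarity_slice(M, threshold):
    """Doc-doc cosine similarities from a feature-by-doc matrix.

    Entries below `threshold` (a hypothetical cutoff) are set to zero so
    that the resulting slice stays sparse.
    """
    norms = np.linalg.norm(M, axis=0)
    norms[norms == 0] = 1.0          # leave all-zero documents at zero
    Mn = M / norms
    S = Mn.T @ Mn
    S[S < threshold] = 0.0
    return S

rng = np.random.default_rng(0)
n_docs = 6
# Random stand-ins for three weighted term-by-doc matrices (k = 1, 2, 3)
# and one author-by-doc incidence matrix (k = 4).
feature_matrices = [
    rng.random((8, n_docs)) * (rng.random((8, n_docs)) < 0.4)
    for _ in range(3)
]
feature_matrices.append((rng.random((5, n_docs)) < 0.3).astype(float))

# Stack the slices along the third mode: doc x doc x similarity type.
X = np.stack(
    [cosine_similarity_slice(M, threshold=0.2) for M in feature_matrices],
    axis=2,
)
```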


PARAFAC for Doc x Doc x Similarity

• H = “hubs”

• A = “authorities”

• C = “connections”

• Rank-30 decomposition

Central idea: Each triplet provides a core “grouping” of the data, i.e., a specific topic.


Sample: Grouping 1


Sample: Grouping 10


Applications of the [H,A,C] Decomposition

• Latent document similarities

Calculate S = ½ HH^T + ½ AA^T

• Analyzing a body of work

ch = hub centroid, ca = authority centroid

s = ½ H ch + ½ A ca

• Disambiguation (EXAMPLE)

Calculate centroids using A (could also use H or A+H)

Calculate similarities of centroids

• Journal prediction

Use matrix A as features for input to a decision tree ensemble classifier
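The first three applications reduce to a few matrix products once H and A are available. In the sketch below the factors are random stand-ins, the document index sets are hypothetical, and a "centroid" is taken to be the mean of the selected rows, which is an assumption since the slide does not define it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, R = 8, 3
H = rng.random((n_docs, R))   # stand-in hub factors
A = rng.random((n_docs, R))   # stand-in authority factors

# Latent document similarities: S = 1/2 H H^T + 1/2 A A^T
S = 0.5 * H @ H.T + 0.5 * A @ A.T

# Analyzing a body of work: score every document against the centroids
# of a chosen set of documents (indices are hypothetical).
body = [0, 2, 5]
c_h = H[body].mean(axis=0)            # hub centroid (assumed: row mean)
c_a = A[body].mean(axis=0)            # authority centroid
s = 0.5 * H @ c_h + 0.5 * A @ c_a     # one score per document
ranking = np.argsort(-s)              # most related documents first

# Disambiguation: compare centroids (using A) of the documents attributed
# to two candidate author names.
def centroid(F, docs):
    return F[docs].mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine(centroid(A, [0, 1]), centroid(A, [2, 3, 4]))
```

A high centroid similarity suggests the two names refer to the same person; the decision rule for "high" is left open here.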


Example of Disambiguation

Results

Two authors with missing middle initials.

3 possible matches

Matrix of Similarities


Analyzing Publication Data: Term x Doc x Author Representation

[Figure: tensor with term, doc, and author modes]

Form tensor X as:

Element (i,j,k) is nonzero only if author k wrote document j using term i.

767 documents

2251 terms

1072 authors

59738 nonzeros

Terms must appear in at least 3 documents and no more than 10% of all documents. Moreover, each term must have at least 2 characters and no more than 30.
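A term x doc x author tensor with the stated sparsity pattern can be stored as a dictionary of nonzeros. The sketch below is illustrative: the corpus is invented, the stored value is a raw term count (the slide's actual formula for the entries is missing), and the filter thresholds are relaxed from the slide's "3 documents / 10%" because the toy corpus has only four documents.

```python
from collections import Counter

# Hypothetical corpus; author names and texts are made up.
docs = [
    {"authors": ["a", "b"], "text": "tensor decomposition of link data"},
    {"authors": ["a"], "text": "tensor methods for text data"},
    {"authors": ["c"], "text": "sparse tensor decomposition"},
    {"authors": ["b", "c"], "text": "link analysis of web data"},
]

def build_tensor(docs, min_docs, max_frac, min_len=2, max_len=30):
    """Return {(term, doc, author): value} plus the index maps."""
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d["text"].split()))
    limit = max_frac * len(docs)
    # Keep terms that pass the document-frequency and length filters.
    vocab = sorted(
        t for t, n in doc_freq.items()
        if min_docs <= n <= limit and min_len <= len(t) <= max_len
    )
    term_id = {t: i for i, t in enumerate(vocab)}
    author_id = {}
    entries = {}
    for j, d in enumerate(docs):
        counts = Counter(t for t in d["text"].split() if t in term_id)
        for name in d["authors"]:
            k = author_id.setdefault(name, len(author_id))
            for t, f in counts.items():
                # Nonzero only if author k wrote document j using term i.
                entries[(term_id[t], j, k)] = float(f)   # assumed: raw count
    return entries, term_id, author_id

entries, term_id, author_id = build_tensor(docs, min_docs=2, max_frac=0.75)
print(len(entries), "nonzeros")
```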


Different Graph Interpretations for Term x Doc x Author

• term-doc with author links

• term-author with doc links

• author-doc with term links

• term-doc-author with links

[Figure: graph of Term and Doc nodes; different author links represented by different colors]


Author Data is Too Sparse

[Figure: tensor with term, doc, and author modes]

Result: The tensor has just a few nonzero columns in each lateral slice.

Experimentally, PARAFAC seems to overfit such data and not do a good job of “mixing” different authors.


Idea: Use Tucker Transformation to Compress

We transform the tensor to a smaller tensor as follows:

or, equivalently

(rank 75) (rank 50)

This transformation forces the authors to be mixed and produces a dense result.

Main problem: How to transform sparse tensor without creating dense intermediate results?

Compute rank-25 PARAFAC on compressed tensor and transform.


Tucker & PARAFAC

• Want PARAFAC for X in term x doc x author space

• First, apply dimensionality reduction to X to obtain Y

Y in “conceptual” space

• Next, compute PARAFAC on Y

• Finally, reassemble results to yield PARAFAC for X
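The three steps above can be sketched as follows. This is a dense toy version under several assumptions: the first two modes are the ones compressed, the compression bases come from SVDs of the unfoldings, and the PARAFAC step is a plain ALS loop. The talk does not specify these details in the surviving text, and all function names are hypothetical.

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization of a three-way array."""
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order="F")

def leading_basis(X, mode, k):
    """Orthonormal basis for one mode from the SVD of its unfolding."""
    U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
    return U[:, :k]

def parafac_als(Y, rank, iters=100, seed=0):
    """Plain dense PARAFAC-ALS for a three-way array."""
    rng = np.random.default_rng(seed)
    F = [rng.standard_normal((n, rank)) for n in Y.shape]
    specs = ["ijk,jr,kr->ir", "ijk,ir,kr->jr", "ijk,ir,jr->kr"]
    for _ in range(iters):
        for m in range(3):
            others = [F[n] for n in range(3) if n != m]
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            F[m] = np.einsum(specs[m], Y, *others) @ np.linalg.pinv(gram)
    return F

def compressed_parafac(X, k1, k2, rank):
    """Compress modes 1 and 2, fit PARAFAC to the small tensor, reassemble."""
    U = leading_basis(X, 0, k1)
    V = leading_basis(X, 1, k2)
    Y = np.einsum("ijk,ip,jq->pqk", X, U, V)   # Y = X x1 U^T x2 V^T
    P, Q, C = parafac_als(Y, rank)             # Y ~ [[P, Q, C]]
    return U @ P, V @ Q, C                     # X ~ [[U P, V Q, C]]

# Toy usage with random data (sizes are illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 5, 4))
A, B, C = compressed_parafac(X, k1=3, k2=3, rank=2)
```

The reassembly step relies on the property that a Tucker form with a PARAFAC core collapses to a single PARAFAC model whose factors are the products U P and V Q. This version forms dense intermediates, so it does not address the sparse-transformation problem raised in the talk.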


Three-Way Fingerprints

• Each of the Terms, Docs, and Authors has a rank-k (k=25) fingerprint from the PARAFAC approximation

• All items can be directly compared in “concept space”

• Thus, we can compare any of the following

Term-Term

Doc-Doc

Term-Doc

Author-Author

Author-Term

Author-Doc

• The fingerprints can be used as inputs for clustering, classification, etc.
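Because every term, document, and author has a length-k row in its factor matrix, any pair of items can be compared with the same routine. The sketch below uses cosine similarity on random stand-in factors; the similarity measure, the matrix sizes, and the helper names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 25
# Stand-in factor matrices; each row is a rank-k fingerprint.
T = rng.standard_normal((40, k))    # terms
D = rng.standard_normal((30, k))    # docs
Au = rng.standard_normal((20, k))   # authors

def normalize_rows(F):
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    return F / np.where(norms == 0, 1.0, norms)

def cross_similarity(F, G):
    """Cosine similarity between every row of F and every row of G."""
    return normalize_rows(F) @ normalize_rows(G).T

author_term = cross_similarity(Au, T)                  # Author-Term
top_terms_for_author0 = np.argsort(-author_term[0])[:5]
doc_doc = cross_similarity(D, D)                       # Doc-Doc
```

The same normalized rows could be passed unchanged to a clustering or classification routine.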


Sample Results: Term


Sample Results: Term


Sample Results: Author


Summary & Future Work

• PARAFAC provides a technique for analyzing semantic graphs

Third dimension captures different connection types

Or may consider it as the interconnection of 3 different node types

• Analyzed journal articles using different tensor representations

Doc x Doc x Connection: Need to make definitive case of why 3D is better than 2D

Term x Doc x Author: Too sparse?

• Still working towards large-scale, sparse problems

Need implicit compression for PARAFAC

~5M nonzeros

• Other decompositions?

Other hybrids

Symmetry


Acknowledgments & More Information

• Thanks to…

Brett Bader, Danny Dunlavy, Philip Kegelmeyer

Web data: Joe Kenny, Travis Bauer et al., Ken Kolda

Journal data: Kevin Boyack

Graph viz: Ann Yoshimura

• Related papers

Algorithm xxx: MATLAB Tensor Classes for Fast Algorithm Prototyping (with B.W. Bader), ACM TOMS, to appear.

Multilinear algebra for analyzing data with multiple linkages (with D. Dunlavy and W. P. Kegelmeyer), Technical Report SAND2006-2079, Apr. 2006.

Temporal analysis of social networks using three-way DEDICOM (with B.W. Bader and R. Harshman), Technical Report SAND2006-2161, Apr. 2006.

Multilinear operators for higher-order decompositions. Technical Report SAND2006-2081, Apr. 2006.

The TOPHITS model for higher-order web link analysis (with B. Bader), in Proc. Workshop on Link Analysis, Counterterrorism and Security, SDM06, Apr. 2006.

Higher-order web link analysis using multilinear algebra (with B.W. Bader), ICDM 2005, pp. 242–249, Nov. 2005.

• Contact Info: [email protected] http://csmr.ca.sandia.gov/~tgkolda/

Thank You!