Dexin Zhou – Bard College (Presenter) Ralph Abbey – North Carolina State U. Jeremy Diepenbrock...

Dexin Zhou – Bard College (Presenter)
Ralph Abbey – North Carolina State U.
Jeremy Diepenbrock – Washington U. at St. Louis
Project Advisor: Dr. Carl Meyer
Additional Advising: Dr. Amy Langville
Graduate Assistant: Shaina Race

What is Data Clustering?

• Clustering is the partitioning of a data set into subsets (clusters).
• We are interested in creating good clusters that allow us to reorganize disordered data into a block structure so that useful information can be extracted.

A Visible Example

[Figure: the same data shown Before Clustering and After Clustering]

What are we clustering?

• An 86 mini-document set that we created, with 13 topics
• A 185-document set used in Daniel Boley’s paper, with 10 topics
• The SAS grocery store dataset

Preparing the data

• Each entry Aij of the term-by-document matrix is built from three terms:
• The g term is a function of term i; it downplays terms that appear frequently across the whole collection.
• The l term is a function of the raw frequency of a given term in document j (e.g., a logarithm).
• The d term is a normalization factor.
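The weighting described above can be sketched in a few lines. The slide's own formula is not preserved in this transcript, so the sketch assumes the common product form Aij = l(i,j) · g(i) · d(j), with a logarithmic local weight, an IDF-style global weight, and unit-length column normalization. These three choices, and the name `weight_matrix`, are illustrative assumptions rather than the presenters' exact scheme.

```python
import math

def weight_matrix(counts):
    """counts[i][j] is the raw frequency of term i in document j.

    Returns a weighted matrix under an ASSUMED scheme:
    local = log(1 + f), global = log(n_docs / doc_freq), d = 1 / column norm.
    """
    n_terms = len(counts)
    n_docs = len(counts[0])

    # l term: dampen raw frequencies with a logarithm.
    local = [[math.log(1.0 + f) for f in row] for row in counts]

    # g term: terms that occur in many documents get a small weight.
    glob = []
    for row in counts:
        doc_freq = sum(1 for f in row if f > 0)
        glob.append(math.log(n_docs / doc_freq) if doc_freq else 0.0)

    weighted = [[local[i][j] * glob[i] for j in range(n_docs)]
                for i in range(n_terms)]

    # d term: scale every document (column) to unit Euclidean length.
    for j in range(n_docs):
        norm = math.sqrt(sum(weighted[i][j] ** 2 for i in range(n_terms)))
        if norm > 0:
            for i in range(n_terms):
                weighted[i][j] /= norm
    return weighted
```

With this choice of g, a term that occurs in every document receives weight zero, which is one way of downplaying globally frequent terms.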

How?

• Principal Direction Divisive Partitioning
• Principal Direction Gap Partitioning
• Non-Negative Matrix Factorization
• Clustering Aggregation

Singular Value Decomposition

Principal Direction Divisive Partitioning

PDDP

PDDP
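The PDDP slides survive only as titles here, so the following is a minimal sketch of a single PDDP split as the method is usually described: center the documents at their centroid, find the principal direction of the centered data, and divide the documents by the sign of their projection onto it. The power iteration, the function names, and the convention that a projection of exactly zero goes to the left cluster are assumptions made for illustration.

```python
import math
import random

def principal_direction(vectors, iters=200, seed=0):
    """Leading principal direction of `vectors`, found by power iteration."""
    dim = len(vectors[0])
    rng = random.Random(seed)
    u = [rng.random() for _ in range(dim)]  # random start avoids symmetric traps
    for _ in range(iters):
        # One step of u <- C C^T u, where the columns of C are the vectors.
        proj = [sum(v[i] * u[i] for i in range(dim)) for v in vectors]
        new_u = [sum(p * v[i] for p, v in zip(proj, vectors)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in new_u))
        if norm == 0.0:
            break
        u = [x / norm for x in new_u]
    return u

def pddp_split(docs):
    """Split documents (lists of term weights) into two index lists."""
    n, dim = len(docs), len(docs[0])
    centroid = [sum(d[i] for d in docs) / n for i in range(dim)]
    centered = [[d[i] - centroid[i] for i in range(dim)] for d in docs]
    u = principal_direction(centered)
    scores = [sum(c[i] * u[i] for i in range(dim)) for c in centered]
    left = [j for j, s in enumerate(scores) if s <= 0]
    right = [j for j, s in enumerate(scores) if s > 0]
    return left, right
```

Applying `pddp_split` recursively to the resulting clusters gives the divisive hierarchy. The sign of the principal direction is arbitrary, so which group ends up labeled left or right carries no meaning.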

Principal Direction Gap Partitioning

[Figure: two plots, "Plot of the First Right Singular Vector" and "Plot of the Second Right Singular Vector"; x-axis: Sorted Indices, y-axis: Sorted Value]
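The plots of sorted singular-vector entries suggest how a gap-based split might work. The sketch below is one plausible reading, not the presenters' confirmed algorithm: sort the entries of a right singular vector (one entry per document) and cut at the largest gap between consecutive sorted values. The function name `largest_gap_split` and the input format are hypothetical.

```python
def largest_gap_split(values):
    """Split indices at the largest gap between consecutive sorted values.

    `values` holds one singular-vector entry per document and needs at
    least two entries. Returns (low_indices, high_indices).
    """
    order = sorted(range(len(values)), key=lambda j: values[j])
    sorted_vals = [values[j] for j in order]
    gaps = [sorted_vals[k + 1] - sorted_vals[k]
            for k in range(len(sorted_vals) - 1)]
    cut = max(range(len(gaps)), key=lambda k: gaps[k])  # position of widest gap
    return order[:cut + 1], order[cut + 1:]
```

Under this reading, the cut point follows the data rather than being fixed at zero, which may matter when the clusters are of very different sizes.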

A Comparison of PDGP w/ PDDP

Centering Vs. Non-Centering

Non-Negative Matrix Factorization

NMF Clustering
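The slide content is not preserved in this transcript, so the following is only a rough sketch of how NMF is commonly used for clustering: factor the nonnegative term-by-document matrix A into W·H with k columns in W, then assign document j to the row of H with the largest entry in column j. The multiplicative update rule (Lee and Seung), the random initialization, and all function names are assumptions; the presenters may have used a different NMF algorithm.

```python
import random

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frob_error(A, W, H):
    """Squared Frobenius norm of A - W H."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))

def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Approximate A (m x n, nonnegative) as W (m x k) times H (k x n)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # W <- W * (A H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return W, H

def nmf_clusters(H):
    """Assign each document (column of H) to its dominant factor."""
    return [max(range(len(H)), key=lambda i: H[i][j])
            for j in range(len(H[0]))]
```

Because the updates only multiply by nonnegative ratios, W and H stay nonnegative throughout. The result depends on the random start, so different seeds can give different clusterings.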

Cluster Aggregation

Cluster Aggregation

Metrics

• Entropy Method: a standard measurement based on our prior knowledge of the data file.
• Density Method: does not require prior knowledge of the data file, but is less accurate.
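To make the entropy idea concrete, here is a small sketch of one common definition, assuming the "prior knowledge" is a true topic label for each document: compute the entropy of the topic distribution inside each cluster and average over clusters, weighted by cluster size. The slide's exact formula is not preserved here, so the base-2 logarithm, the lack of further normalization, and the name `clustering_entropy` are assumptions.

```python
import math
from collections import Counter

def clustering_entropy(cluster_ids, true_labels):
    """Size-weighted average entropy of true labels within each cluster.

    0.0 means every cluster contains a single topic; larger is worse.
    """
    n = len(cluster_ids)
    members = {}
    for c, t in zip(cluster_ids, true_labels):
        members.setdefault(c, []).append(t)

    total = 0.0
    for labels in members.values():
        size = len(labels)
        h = -sum((cnt / size) * math.log2(cnt / size)
                 for cnt in Counter(labels).values())
        total += (size / n) * h
    return total
```

For example, `clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])` is 0.0, while mixing the two topics evenly in both clusters gives 1.0 bit.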

Mini-document dataset

Mini-document dataset Result

Boley’s J1 Dataset

Boley’s Dataset Result

SAS Grocery Dataset

SAS Grocery Dataset Results

SAS Grocery Dataset Result

Conclusion

For Additional Information, Please Visit

http://meyer.math.ncsu.edu/Meyer/REU/REU.html