THIC MedIX Summer 2015 Poster



Thresholded Hierarchical Itemset Clustering for Expert Explorations

Diana Zajac, Thomas Lux, Dr. Jacob Furst, Dr. Daniela Raicu

College of Computing and Digital Media, DePaul University

Summer 2015

Introduction

Traditional machine learning (ML) techniques can cluster datasets, yet the clusters they produce are often difficult to interpret. Noise, high dimensionality, and complexity in the data can make clustering difficult and produce undesirable results. In addition, most clustering algorithms produce clusters without any explanation of what patterns were found between data points or which patterns the clusters were based on. In an attempt to solve the problem of clustering high-dimensional, complex, and noisy datasets while producing interpretable results, we created an interactive user interface called THIC. THIC stands for Thresholded Hierarchical Itemset Clustering, a name that describes the method by which it clusters data. What makes THIC innovative is its ability to modify the clustering algorithm with ‘expert’ feedback. Here an ‘expert’ is some outside source of information that can provide intuitive guidance as to which features the algorithm should cluster on.

Figure 1 is part of the 2012 City Livability dataset, obtained with permission of the Economist Intelligence Unit (EIU) from their collaboration with BuzzData.

As another example of expert feedback on the City Livability dataset, an ‘expert’ who is well-traveled could instruct THIC to group the “most homey” cities under one cluster, the “most beautiful” cities under another, and so on. THIC will cluster the cities based on the expert’s guidance, but it will also predict which clusters the cities the expert has not yet visited may fit into, and then explain which city features are most important in determining the clusters.

Other datasets we worked with included a large text corpus, lung cancer data, and Chronic Fatigue Syndrome data.

Clustering Algorithms

K-Means:

K-Means clustering is an algorithm that forms k clusters based on the distance of each data point from the cluster centers. It begins by plotting each data point (in the case of City Livability, each city is a point) with the features as dimensions. For n features there are n dimensions, so each point has a coordinate (x1, x2, …, xn) based on its features. K-Means chooses initial cluster centers and then iteratively moves them until the distances from the points to their assigned centers are minimal and the clusters are separated as well as possible.
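The K-Means description above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used for the poster: the function name `kmeans`, its parameters (`init`, `max_iter`, `seed`), and the toy data are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Group points (equal-length lists of numbers) into k clusters."""
    if init is None:
        # Assumption: pick k distinct data points as the starting centers.
        init = random.Random(seed).sample(points, k)
    centers = [list(c) for c in init]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the centers are stable
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy two-feature data (invented for illustration, not from the poster).
cities = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(cities, k=2, init=[cities[0], cities[3]])
print(labels)
```

Each inner list plays the role of one city, and each position in the list is one feature, so a dataset with n features simply uses lists of length n.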

K-Means with Feature Selection (KMFS):

KMFS uses feature selection algorithms to aid k-means clustering. Feature selection is usually used to strip a dataset of irrelevant, corrupted, or redundant features, thereby improving any analysis based on the remaining features. KMFS selects features one by one, starting with those that create the ‘best’ (most defined and separate) clusters, and continues to add features until the clusters become ‘bad’ (overlapping and spread out). Incorporating feature selection into k-means clustering allows k-means to cluster the data and return to the user the most relevant features used. KMFS gives the user an idea of what each cluster is based on (which features ‘trend’ in each cluster), but it describes cluster features in terms of probabilities rather than with 100% accuracy, and it provides no user control.
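The greedy, one-feature-at-a-time selection described above might look roughly like the sketch below. The poster does not say how KMFS scores clusters or when exactly it stops, so the `separation` score (smallest gap between centers divided by average spread), the `min_score` cutoff, and all function names are assumptions chosen only to make the idea concrete. The sketch requires k of at least 2 and assumes the features are on comparable scales.

```python
import math

def kmeans(points, k, max_iter=100):
    """Tiny k-means; the first k points serve as the initial centers."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda c: math.dist(p, centers[c]))
               for p in points]
        if new == labels:
            break
        labels = new
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def separation(points, labels, centers):
    """Assumed quality score: smallest center gap over mean point spread."""
    spread = sum(math.dist(p, centers[lab])
                 for p, lab in zip(points, labels)) / len(points)
    gap = min(math.dist(a, b)
              for i, a in enumerate(centers) for b in centers[i + 1:])
    return gap / (spread + 1e-9)

def project(rows, features):
    """Keep only the chosen feature columns of every row."""
    return [[row[f] for f in features] for row in rows]

def kmfs(rows, k, min_score=2.0):
    """Add features greedily while the clusters stay well separated."""
    selected, remaining = [], list(range(len(rows[0])))
    while remaining:
        scored = []
        for f in remaining:
            pts = project(rows, selected + [f])
            labels, centers = kmeans(pts, k)
            scored.append((separation(pts, labels, centers), f))
        best_score, best_f = max(scored)
        if selected and best_score < min_score:
            break  # the clusters have become 'bad', so stop adding features
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Invented data: feature 0 separates two groups, feature 1 is noise.
rows = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0],
        [10.0, 1.5], [10.1, 8.0], [10.2, 4.0]]
print(kmfs(rows, k=2, min_score=5.0))
```

The returned list holds the indices of the features that were kept, which is the kind of "most relevant features" report the paragraph above refers to.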

Why THIC is better:

Expert-guided clustering

Better data interpretability

Many different possibilities (for results)

Provides a controllable tradeoff between optimal results and meaningful results

Doesn’t lose data dimensionality (no important information lost in feature selection)

THIC’s philosophy is focused on aiding a user in understanding and exploring datasets, finding unseen patterns and correlations in datasets, and creating unconventional clusterings of data.

Group 1: High: Green Space, Sprawl

Group 2: Low: Sprawl, Culture and Environment, Infrastructure

Group 3: Low: Infrastructure; High: Green Space

Group 4: Low: Sprawl, Culture and Environment

Group 5: Low: Green Space, Sprawl

Group 6: Low: Green Space; High: Sprawl

Group 7: Low: Sprawl; High: Green Space

Datasets

The City Livability dataset (Figure 1) is one of the datasets we used in testing THIC. This dataset is particularly interesting because of the opportunity for ‘expert feedback.’ For example, an expert may want to know what European cities have in common: the expert would instruct THIC to group European cities under one cluster, and THIC would produce results explaining which features all European cities share.

THIC

THIC is an interactive interface that allows users to import a numerical dataset and cluster the data based on their own preferences, such as:

Which features should be included/excluded

Which features should be given higher priority (more weight)

Sizes of groups

Making subgroups

Number of groups

Define groups using features

Control between optimal clustering and clustering meaningful to user
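One way to picture the include/exclude and priority preferences in the list above is a per-feature weight inside the distance between two cases. The poster does not state how THIC applies these preferences internally, so the function name `weighted_distance` and the weighting scheme below are assumptions used purely for illustration.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance where each feature's squared gap is scaled by a weight.

    A weight of 0 excludes a feature; a weight above 1 gives it more priority.
    """
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

city_a, city_b = [0, 0], [3, 4]                   # invented feature values
print(weighted_distance(city_a, city_b, [1, 1]))  # both features count equally
print(weighted_distance(city_a, city_b, [1, 0]))  # second feature excluded
print(weighted_distance(city_a, city_b, [4, 0]))  # first feature prioritized
```

Under this reading, changing the weights changes which cases look similar, which is one reason different preferences can lead to different groupings.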

Acknowledgments

Dr. Jacob Furst, PhD, 1998, professor, DePaul University, CDM

Dr. Daniela Raicu, PhD, 2002, professor, DePaul University, CDM

College of Computing and Digital Media, DePaul University

Science Research Fellows

DePauw University

Future Work

Although we completed THIC’s preliminary phase, there is still much to improve on. The current THIC implementation focuses on single-item itemsets, because increasing itemset size increases both the computation time and the amount of overlap between groups. Another interest is developing better ‘stopping criteria’ for the algorithm, which at the moment are based on group overlap and minimal coverage. With better stopping criteria, expanding to multi-item itemsets would be more feasible without contradicting the philosophy of THIC.
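To make the terms above concrete, here is one possible reading of single-item itemsets with overlap and coverage checks. The poster does not spell out THIC's algorithm, so everything in this sketch is an assumption: the fixed `low_cut` and `high_cut` thresholds on features scaled to [0, 1], the `min_coverage` and `max_overlap` parameters, the greedy selection order, and the toy data.

```python
def single_item_groups(rows, names, low_cut=0.3, high_cut=0.7,
                       min_coverage=2, max_overlap=0.5):
    """Build groups of cases that share one thresholded item.

    rows maps a case name to its feature values (assumed scaled to [0, 1]).
    An item is a (feature name, level) pair such as ('Sprawl', 'High').
    """
    candidates = []
    for f, name in enumerate(names):
        low = {case for case, row in rows.items() if row[f] <= low_cut}
        high = {case for case, row in rows.items() if row[f] >= high_cut}
        candidates.append(((name, "Low"), low))
        candidates.append(((name, "High"), high))
    # Consider the largest groups first; ties are broken by item name.
    candidates.sort(key=lambda c: (-len(c[1]), c[0]))
    chosen, covered = [], set()
    for item, cases in candidates:
        if len(cases) < min_coverage:
            continue  # covers too few cases to be a useful group
        overlap = len(cases & covered) / len(cases)
        if overlap > max_overlap:
            continue  # mostly repeats cases that earlier groups already hold
        chosen.append((item, cases))
        covered |= cases
    return chosen

# Invented values for illustration only (not from the City Livability data).
rows = {"A": [0.9, 0.8], "B": [0.8, 0.1], "C": [0.2, 0.9],
        "D": [0.1, 0.2], "E": [0.5, 0.5]}
for item, cases in single_item_groups(rows, ["Green Space", "Sprawl"]):
    print(item, sorted(cases))
```

A multi-item version would combine several such items per group, which multiplies the number of candidates and the chances of overlap; that is the cost the paragraph above mentions.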

When completed, THIC will be able to provide meaningful information in multiple domains, including but not limited to economics, medical sciences, and statistical analysis.

THIC produces diverse results depending on the user preferences listed above. The focus of THIC is therefore not necessarily on finding the ‘best’ clusters or groupings, but on producing results that can aid in understanding a dataset, such as:

Finding certain patterns that may not be evident without THIC (due to size of dataset or complexity)

Producing results by defining ‘known’ clusters, and matching the rest of the cases to those

Describing relationships between different features, as well as between different cases (in City Livability, cases are the cities, and features are qualities such as pollution and quality of education)
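The second item in the list above, defining ‘known’ clusters and matching the remaining cases to them, can be illustrated with a nearest-centroid sketch. This is not THIC's actual method, which the poster does not detail: the function names, the centroid-based matching, the feature ranking by centroid spread, and the toy values are all assumptions.

```python
import math

def match_to_known_groups(rows, known):
    """rows: case -> feature list; known: case -> expert-assigned group name."""
    # Compute the centroid of each expert-defined group.
    members = {}
    for case, group in known.items():
        members.setdefault(group, []).append(rows[case])
    centroids = {g: [sum(col) / len(pts) for col in zip(*pts)]
                 for g, pts in members.items()}
    # Assign every case the expert did not label to its nearest centroid.
    predicted = {
        case: min(centroids, key=lambda g: math.dist(row, centroids[g]))
        for case, row in rows.items() if case not in known
    }
    return centroids, predicted

def rank_features(centroids, names):
    """Order features by how far apart the group centroids sit on each one."""
    spans = []
    for f, name in enumerate(names):
        values = [c[f] for c in centroids.values()]
        spans.append((max(values) - min(values), name))
    return [name for _, name in sorted(spans, reverse=True)]

# Invented values; the group names echo the poster's well-traveled expert.
rows = {"A": [0.9, 0.5], "B": [0.8, 0.4], "C": [0.1, 0.5],
        "D": [0.2, 0.6], "E": [0.85, 0.55], "F": [0.15, 0.45]}
known = {"A": "homey", "B": "homey", "C": "beautiful", "D": "beautiful"}
centroids, predicted = match_to_known_groups(rows, known)
print(predicted)
print(rank_features(centroids, ["Green Space", "Sprawl"]))
```

The ranking gives a rough answer to "which features matter most for these groups", in the spirit of the explanations the poster says THIC provides.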