Posted on 20-May-2020
By
Hanaa Ismail Elshazly
PhD Student
Faculty of Computers and Information
Cairo University
Intelligent Visualization of Multidimensional Data Sets
Faculty of Computers and Information - Cairo University
Department of Computer Sciences
Supervisors
Prof. Dr. Aboul Ella Hassanien & Prof. Dr. Abeer Mohamed El Korany
Prof. Dr. Mostafa Reda Eltantawi
Contents
1. Introduction
2. Related Work
3. Proposed Model
4. Experimental Results
5. Conclusion
6. Future Work
Highlights
We introduce an automatic system to visualize multidimensional rules.
◦ The dimensions of the input data sets are reduced by feature selection techniques.
◦ A newly emerging problem is the number of generated rules.
◦ Rules were refined using Genetic Algorithms before being visualized.
◦ Refined rules were interactively visualized using nodes and edges.
Introduction
Multidimensional data → Reduction → Visualization
Intelligent Visualization of Multidimensional Data Sets
Dimensions: a dimension is a key descriptor, an index by which you can access facts according to the value (or values) you want.
Information visualization is the study of (interactive) visual representations of abstract data to reinforce human cognition. Abstract data include both numerical and non-numerical data, such as text and geographic information.
Introduction General
• Massive and complex data are generated every day in many fields due to advances in hardware and software technology.
• The curse of dimensionality is a major obstacle in machine learning and data mining.
• Clinical data referring to patients’ investigations contain irrelevant attributes that degrade classification performance.
• Visualization is important when analyzing multidimensional datasets, since it can help humans discover and understand complex relationships in data.
Introduction Data Problems
• Data quality
• Integrating redundant data from different sources
• Mining information from heterogeneous databases
• Difficulty in building training sets
• Dynamic databases
• Dimensionality
Introduction Dimensionality reduction
In machine learning and statistics, dimensionality reduction (or dimension reduction) is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
Popular search methods that are manageable in low-dimensional spaces can become totally unmanageable in high-dimensional spaces.
The curse of dimensionality is a major obstacle in machine learning and data mining.
Reducing the dimensionality of the feature space leads to more successful classification.
Selecting the optimal feature subset can substantially improve classification performance.
FS Techniques: Filter, Wrapper, Embedded
Benefits of feature selection:
• Improves the comprehensibility of the induced concepts
• Decreases dataset complexity
• Improves classification performance
• Saves resources
• Enables visualization
• Gives a better understanding of the extracted knowledge
• Reduces computation requirements
• Reduces the effect of the curse of dimensionality
FS techniques turn massive data (microarray gene expression, medical images, huge databases, finance data, sensor arrays, and web documents) into reduced data.
Visualization Techniques for Rules Mining
Table: most common and simple [Romero, C., Luna, J.M., Romero, J.R., Ventura, S., 2011]
Scatter Plot: represents rules as points in the coordinate plane by their interestingness measures [Hahsler and Chelluboina, S., 2011]
Parallel Coordinates: represent rules as polygonal lines that intersect multiple vertical axes representing the associated items [Usman and Usman, 2016]
Directed Graph: uses nodes to represent items and directed edges for the antecedents and consequents [Sekhavat, Y.A. and Hoeber, O., 2013]
Matrix: prevalent approach to represent the antecedents and consequents [Lei et al., 2016]
Motivation
• Information has become a very valuable commodity, yet many features that seem useful actually increase computational cost and storage requirements while decreasing accuracy.
• Many tasks depend on dimensionality: text categorization, genomics, econometrics, and computer vision.
Introduction
Problem Definition
• Dimensionality reduction is crucial in order to remove noise and improve accuracy.
• The cardinality of the extracted rules is a mainstay of the rule visualization process; too many rules may hinder their benefit.
• Limited work has been done on how users understand and use these rules.
• There is a need for trust in, and better insight into, those rules.
Introduction
Objectives
• Develop and implement an automatic system capable of reducing multidimensional data sets as well as the number of generated rules.
• Provide a dynamic decision-rule visualization facility that leads to better insight into the mined rules.
• Provide different visualized trust levels for the extracted rules.
Introduction
Related Work
Feature Selection

(Liu, Q. et al., 2013) — Dual-process sample selection using Support Vector Machine:
  Prostate Cancer: 12600 → 1000 features
  Colon Cancer: 2000 → 500
  Leukemia: 7129 → 100
  Myeloma: 7129 → 100

(Azar, A. and Hassanien, 2015) — Linguistic hedges neuro-fuzzy classifier (LHNFC), breast cancer data sets:
  Breast Cancer Diagnosis: 30 → 12
  Breast Cancer Prognosis: 32 → 13
  Erythemato-squamous diseases: 34 → 18

(Fei Ye, 2016) — Genetic/Particle Swarm (GPSO) and Genetic/Fruit Fly (GFOA):
  Wisconsin Breast Cancer (1992): 10 → 7 and 10 → 5
  Wisconsin Breast Cancer (1995): 30 → 25 and 30 → 20
Related Work
Traditional Classifiers

Authors | Classifier | Data Set | Accuracy %
(Saritas, I., Ozkan, I.A. and Sert, I.U., 2010) | Artificial Neural Networks (ANN) | Prostate | 94%
(Santos, V., Datia, N. and Pato, M.P.M., 2014) | Naive Bayes + Feature Ranking | Breast Cancer | 95.5%
(Santos, V., Datia, N. and Pato, M.P.M., 2014) | Random Forest + Feature Ranking | Breast Cancer | 93.1%
(Santos, V., Datia, N. and Pato, M.P.M., 2014) | SVM + Feature Ranking | Breast Cancer | 93%
(Azar, A.T. et al., 2016) | Dominance-based rough set (DRSA) | Breast Cancer | 96.45%
(Azar, A.T. et al., 2016) | Dominance-based rough set (DRSA) | Heart Valve | 90.6%
(Azar, A.T. et al., 2016) | Dominance-based rough set (DRSA) | Heart Disease | 82.8%
(Azar, A.T. et al., 2016) | Dominance-based rough set (DRSA) | Dermatology | 97.9%
Related Work
Ensemble Classifiers

Authors | Classifier | Data Set | Accuracy %
(Abellán, J. and Masegosa, A.R., 2012) | Bagging credal decision trees (B-CDT) | Lymphography | 79.96%
(De Falco, I., 2013) | Differential Evolution | Breast | 96.4%
(De Falco, I., 2013) | Differential Evolution | Liver | 64.7%
(De Falco, I., 2013) | Differential Evolution | Lymphography | 80.7%
(Santos, V., Datia, N. and Pato, M.P.M., 2014) | Random Forest + Feature Ranking | Breast | 93%
(Santos, V., Datia, N. and Pato, M.P.M., 2014) | Bagging + Feature Ranking | Breast | 92%
(Indrajit Mandal, 2015) | RF + PCA | Spine Diagnosis | 94.8%
(Indrajit Mandal, 2015) | Bagging | Spine Diagnosis | 94.8%
Related Work
Enhanced visual data mining process for dynamic decision-making
(Ltifi, H., Benmohamed, E., Kolski, C. and Ben Ayed, M., 2016)
The research presented the design of a visual data mining method to support dynamic decision-making.
Aim: assist physicians in fighting nosocomial infections in the intensive care unit of the Habib Bourguiba Hospital of Sfax, Tunisia.
Steps: temporal data manipulation; temporal visualization; discovered knowledge management.
Limitations:
• The physician reduces the rules himself according to the frequency of the act.
• The physician adds a weight to each rule to reflect the highest score.
• Identifying the principal reason for the infection depends on the physician's expertise.
• Dimensionality is handled manually through the physician's filtering.
Related Work
Construction and evaluation of structured association map for visual exploration of association rules
(Kim, J.W., 2017)
The research proposed a novel visualization method, a variant of the cluster heat map, for representing association rules.
Aim: assist analysts in selecting relevant items to be used in many-to-many association rule mining.
Steps: classify items into factor items and response items (done by the analysts); generate a factor dendrogram and a response dendrogram by applying a hierarchical clustering algorithm and distance measures; generate a matrix reflecting the interestingness measure of each rule.
Limitations:
• In high-dimensional datasets, it is difficult to specify factor and response items.
• Sorting items according to their position in the associated dendrograms is an additional burden.
The Proposed General Model
• Pre-processing phase
• Feature reduction phase
• Refinement phase
• Classification phase
• Visualization phase
Experimental Data Sets

Data Set | Source | Features | Instances | Classes
Wisconsin Breast Cancer – Diagnosis | UCI Machine Learning Repository | 32 | 569 | 2
Wisconsin Breast Cancer – Prognosis | UCI Machine Learning Repository | 32 | 198 | 2
SPECTF Heart Dataset | UCI Machine Learning Repository | 45 | 267 | 2
Lymphography | University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia | 18 | 148 | 4
Indian Liver Patient Dataset | UCI Machine Learning Repository | 11 | 583 | 2
Prostate | UCI Machine Learning Repository | 12600 | 102 | 2
Proposed Model Features
• It should be reliable and robust enough to cope with
different data types.
• The proposed model addresses the tedious tasks encountered by the physician or decision maker during the exploration of the classification outcomes.
• It provides the decision-maker the chance to make proactive, knowledge-driven decisions and to be a part of the mining process by harnessing the perceptual capabilities of the human visual system.
Pre-processing Phase
Discretization
Aim: reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals and replacing low-level concepts with higher-level concepts.
Techniques:
• Equal binning: transform numerical variables into categorical counterparts.
• Simplification: rescale data into the range [1, 3].
Pre-processing Phase: Equal Binning Algorithm

foreach feature V in data D:
    divide the domain of V into k intervals of equal size
    the width of the intervals is:
        w = (max(V) - min(V)) / k
    and the interval boundaries are:
        min(V) + w, min(V) + 2w, ..., min(V) + (k-1)w
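The binning algorithm above can be sketched in a few lines of Python; this is a minimal illustration (the function name and the clamping of the maximum value into the last bin are implementation choices, not taken from the thesis code):

```python
# Equal-width binning sketch following the algorithm above.
# Assumption: each feature is a list of numeric values; k is chosen by the user.

def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals, labelled 1..k."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k                           # interval width
    bins = []
    for v in values:
        # index of the interval containing v; clamp max(values) into bin k
        idx = min(int((v - lo) / w) + 1, k) if w > 0 else 1
        bins.append(idx)
    return bins

print(equal_width_bins([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3))
# → [1, 1, 2, 2, 3, 3]
```

With k = 3 the interval width is (6 − 1) / 3 ≈ 1.67, so the boundaries fall at roughly 2.67 and 4.33, matching the printed bin labels.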
Hanaa Ismail Elshazly et al., “Rough Sets and Genetic Algorithms: A hybrid approach to breast cancer classification”, Proceedings of the World Congress on Information and Communication Technologies (WICT), IEEE, ISBN: 978-1-4673-4806-5, pp. 260-265, 2012.
How discretization techniques influence the classification of breast cancer data

Classifier | Entropy % | Binning % | Bool. Reas. %
Naïve Bayes | 77.2 | 92.9 | 91
Decision Rules | 91.4 | 95.3 | 95.3
KNN | 76.1 | 94.7 | 94
Feature Selection Phase
• The goal of this step is to reduce the number of features, which severely hinders the applicability of popular search methods: they are designed for low-dimensional spaces and become totally unmanageable in high-dimensional spaces.
Techniques:
• PCA: a statistical technique useful in machine learning applications for data compression and for reducing the dimensions of massive data.
• Rough Set: offers a mathematical tool to discover patterns hidden in data; used for
  • Feature selection
  • Data reduction
  • Decision rule generation
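As one concrete illustration of the PCA option, here is a minimal NumPy sketch via SVD (an assumption on my part: the deck does not name an implementation, and e.g. scikit-learn's `PCA` would behave equivalently):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X (samples x features) onto its top n_components directions."""
    Xc = X - X.mean(axis=0)                     # center each feature
    # SVD of the centered data; rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T             # scores in the reduced space

X = np.random.rand(100, 30)                     # e.g. 100 patients, 30 features
Z = pca_reduce(X, 2)
print(Z.shape)                                  # (100, 2)
```

The projection keeps the directions of largest variance, which is what makes PCA useful for compressing massive clinical data before rule mining.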
Comparison of different selection techniques over three multidimensional data sets
Rough-based Feature Selection Technique
The rough set feature selection technique achieves the highest results on 2 of the 3 data sets.
What is Rough Set
Rough Set Concepts
• Information/Decision Systems (Tables)
• Indiscernibility
• Set Approximation
• Reducts and Core
[Diagram: a set X in the universe U, the partition U/R induced by a subset of attributes R, and the lower and upper approximations of X with respect to R]
Hanaa Ismail Elshazly, Ahmad Taher Azar, Abeer Mohamed El Korany, Aboul Ella Hassanien, “Hybrid System based on Rough Sets and Genetic Algorithms for Medical Data Classifications”, International Journal of Fuzzy System Applications (IJFSA), doi: 10.4018/ijfsa.2013100103, 3(4), 31-46, 2013.
Rough Sets Algorithm for Reduct Generation (Discernibility Matrix)

Let T = (U, C, D) be a decision table, with U = {u1, u2, ..., un}. By M(T) we mean the n × n discernibility matrix (m_ij), defined as:

(1)  m_ij = {c ∈ C : c(u_i) ≠ c(u_j)}   if d(u_i) ≠ d(u_j) for some d ∈ D
     m_ij = λ                            if d(u_i) = d(u_j) for all d ∈ D

For any u_i ∈ U, the discernibility function is:

(2)  f(u_i) = ∧ { ∨ m_ij : j ∈ {1, ..., n}, m_ij ≠ λ }

where ∨ m_ij is the disjunction of all variables a such that a ∈ m_ij, with the conventions:

(3)  t(m_ij) = false   if m_ij = ∅
     t(m_ij) = true    if m_ij = λ
Rough Set Rules Generation Algorithm

Let T = (U, C, D) be a decision table, with U = {u1, u2, ..., un}, and let M(T) be its n × n discernibility matrix (m_ij):

m_ij = {c ∈ C : c(u_i) ≠ c(u_j)}   if d(u_i) ≠ d(u_j) for some d ∈ D
m_ij = λ                            otherwise

Here m_ij is the set of all the condition attributes that classify the objects u_i and u_j into different classes. For any u_i ∈ U, decision rules are derived from the discernibility function f(u_i) = ∧ { ∨ m_ij : j ∈ {1, ..., n}, m_ij ≠ λ }, with ∨ m_ij the disjunction of all variables a such that a ∈ m_ij.
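The discernibility-matrix construction above can be illustrated on a toy decision table; this sketch is illustrative only (the table, attribute indices, and function name are invented for the example, not taken from the thesis data):

```python
# Build the discernibility matrix M(T) for a small decision table.
# Assumption: the table is a list of rows; C holds the indices of the
# condition attributes and d the index of the decision attribute.

def discernibility_matrix(rows, C, d):
    n = len(rows)
    # empty set plays the role of λ for pairs in the same decision class
    M = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if rows[i][d] != rows[j][d]:        # objects in different classes
                M[i][j] = {c for c in C if rows[i][c] != rows[j][c]}
    return M

# toy table: two condition attributes (indices 0, 1) and a decision (index 2)
table = [
    (1, 0, 'yes'),
    (1, 1, 'no'),
    (0, 1, 'yes'),
]
M = discernibility_matrix(table, C=[0, 1], d=2)
print(M[0][1])   # attributes discerning u1 and u2 → {1}
```

Entry M[0][1] = {1} says that only attribute 1 separates the first two objects, so any reduct distinguishing them must include it.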
Rules Refinement Phase
Aim: reduce the number of rules so they can be easily visualized and presented to an expert without decreasing the accuracy.
Techniques:
• Dependency calculation
• Rules generation
• Reduct evaluation using entropy
• GA using support and confidence as the fitness function
Dependency calculation
• The decision attribute is dependent on a feature set B to the degree γ(B, d) = |POS(B, d)| / |U|,
  where |U| and |POS(B, d)| denote the cardinality of the sets U and POS(B, d), respectively.
• A d-dispensable attribute is a condition attribute that can be removed without losing classification performance, since the indiscernibility relations are preserved; otherwise the condition attribute is d-indispensable.
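The dependency degree γ(B, d) = |POS(B, d)| / |U| can be computed directly from the B-indiscernibility classes; the sketch below uses an invented toy universe (object dicts and attribute names are illustrative, not from the thesis data):

```python
from collections import defaultdict

def dependency(U, B, d):
    """gamma(B, d): fraction of objects whose B-indiscernibility class
    is pure with respect to the decision attribute d (the positive region)."""
    classes = defaultdict(list)
    for obj in U:
        classes[tuple(obj[a] for a in B)].append(obj)   # partition U by B
    pos = sum(len(group) for group in classes.values()
              if len({obj[d] for obj in group}) == 1)    # decision-pure classes
    return pos / len(U)

U = [{'a': 1, 'b': 0, 'd': 'sick'},
     {'a': 1, 'b': 0, 'd': 'healthy'},   # clashes with the first object on {a, b}
     {'a': 0, 'b': 1, 'd': 'healthy'}]
print(dependency(U, ['a', 'b'], 'd'))    # one pure object out of three → 0.333...
```

The two clashing objects fall outside POS(B, d), so only one of the three objects contributes, giving γ = 1/3.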
Reduct Evaluation
Calculate the entropy of the target:

Entropy(T) = −Σ_{i=1}^{c} p_i log2 p_i, where c is the number of possible values of the target.

foreach reduct R_i in Reducts:
    foreach x in R_i:
        Entropy(T, x) = Σ_{v ∈ values(x)} P(v) E(v)

Choose the reduct R_i with the largest information gain, Gain(T, x) = Entropy(T) − Entropy(T, x).
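The entropy and conditional entropy used in reduct evaluation can be sketched as follows (the function names and the toy feature/target are illustrative, not taken from the thesis code):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(T) = -sum p_i * log2(p_i) over the target's value frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    """Entropy(T, X): entropy of the target weighted over each feature value."""
    n = len(labels)
    by_value = {}
    for f, y in zip(feature, labels):
        by_value.setdefault(f, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in by_value.values())

target  = ['yes', 'yes', 'no', 'no']
feature = ['a', 'a', 'b', 'b']           # perfectly separates the target
gain = entropy(target) - conditional_entropy(feature, target)
print(gain)                               # → 1.0 (one full bit of information)
```

A feature that perfectly separates the target attains the maximum gain of one bit, which is why the reduct with the largest gain is preferred.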
Genetic Algorithm Using Support and Confidence as Fitness Function
Body ==> Consequent [Support, Confidence]
Consequent: represents a discovered property of the examined data.
Support: represents the percentage of records satisfying both the body and the consequent.
Confidence: represents the percentage of records satisfying both the body and the consequent relative to those satisfying only the body.
The main advantage of using GAs is their robustness: once the problem is correctly modelled, the algorithm is able to explore the feasible region within the search space and exploit the best global solution.
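The support and confidence of a rule, and a fitness value combining them, can be sketched as below. This is an illustrative sketch: the predicates, the toy records, and especially the equal weighting between support and confidence are assumptions, since the deck does not state how the two measures are combined:

```python
# Support/confidence of a rule "Body ==> Consequent" over a record set.
# body and consequent are predicates over a single record.

def support_confidence(records, body, consequent):
    n = len(records)
    n_body = sum(1 for r in records if body(r))
    n_both = sum(1 for r in records if body(r) and consequent(r))
    support = n_both / n
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence

def fitness(records, body, consequent, w=0.5):
    """GA fitness signal; the weighting w is an assumed design choice."""
    s, c = support_confidence(records, body, consequent)
    return w * s + (1 - w) * c

data = [{'size': 'big', 'risk': 'high'},
        {'size': 'big', 'risk': 'high'},
        {'size': 'small', 'risk': 'low'},
        {'size': 'big', 'risk': 'low'}]
rule_body = lambda r: r['size'] == 'big'
rule_cons = lambda r: r['risk'] == 'high'
print(support_confidence(data, rule_body, rule_cons))   # support 0.5, confidence 2/3
```

A GA can then rank candidate rules by this fitness and keep only the strongest ones for visualization.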
Classification Phase
Aim: the learning algorithm, called the classifier, aims to return a set of decision rules together with a procedure that makes it possible to classify objects not found in the original decision table.
Technique: Rough Set rules generation using the discernibility matrix.
Pipeline: multidimensional data → final reducts → rule generation → generated rules → classification with decision rules → classified instances → testing → tested instances.
Visualization Phase
Components: graph nodes, edges, charts, and grids.
Pipeline: refined decision rules → measurement calculation for rule support → refined rules with trust levels → rendering of rules and reducts → visualization.
The expert can manage the induced rules through trust levels that enable fast trust decisions.
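The node-and-edge rendering of refined rules can be sketched by emitting Graphviz DOT text, with attribute items feeding into a rule node that points to its consequent. This is a minimal sketch, not the thesis implementation; the rule contents and trust labels below are invented for illustration:

```python
# Emit Graphviz DOT for a list of refined rules.
# Each rule: (antecedent_items, consequent, trust_level).

def rules_to_dot(rules):
    lines = ['digraph rules {']
    for i, (body, cons, trust) in enumerate(rules):
        rule_node = f'R{i}'
        # the rule node carries its trust level as a label
        lines.append(f'  {rule_node} [shape=circle, label="{trust}"];')
        for item in body:                             # item --> rule node
            lines.append(f'  "{item}" -> {rule_node};')
        lines.append(f'  {rule_node} -> "{cons}";')   # rule node --> consequent
    lines.append('}')
    return '\n'.join(lines)

dot = rules_to_dot([
    (['uniformity=3', 'mitoses=1'], 'benign', 'high'),
    (['clump=3'], 'malignant', 'medium'),
])
print(dot)
```

Rendering the resulting text with `dot -Tpng` (or any graph library such as networkx) gives the node/edge picture the phase describes, with edges from antecedent items into each rule and from the rule to its class.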
Experimental Results
Parameters Setting for Breast Cancer Experiment
Train : 70% Test : 30%
Encoding Strategy : Discretized values are ranged on scale (1-3).
Population Size : 400
Crossover Selection Parents : Random.
Crossover Probability : 0.1
Crossover position : Single point
Cut Position : Random
Fitness Threshold : 0-2
Termination Criteria : Set to the rules number specified by the physician.
Significance Level : 0 – 0.3 Less Significant
0.4 – 0.5 Medium Significant
> 0.5 Most Significant
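The significance-to-trust mapping above is a simple threshold rule; a sketch of it follows (the function name is illustrative, and values falling between the stated bands, e.g. 0.35, are assigned to the lower label here as an assumption):

```python
# Map a rule's significance level to the trust label used in the experiments:
#   0 - 0.3  → Less Significant
#   0.4 - 0.5 → Medium Significant
#   > 0.5    → Most Significant

def trust_level(significance):
    if significance > 0.5:
        return 'Most Significant'
    if significance >= 0.4:
        return 'Medium Significant'
    return 'Less Significant'

print(trust_level(0.2), trust_level(0.45), trust_level(0.9))
# → Less Significant Medium Significant Most Significant
```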
Visualization of Breast Cancer Reducts
Visualization of the features of the breast data set, ordered by their occurrence over all extracted reducts.
Experimental Results
Visualization of Breast Cancer Rules
Visualization of global and detailed nodes representing refined classification rules of the breast data.
212 R 400 R 87000 R
Experimental Results
Visualization of Breast Cancer refined
rules distributed over 2 levels of Trust
Visualization of Breast Cancer Rules
Visualization of Refined Breast Cancer Decision Rules According to Trusting Levels.
Experimental Results
Visualization of Breast Cancer Rules
Navigation through Refined Breast Cancer Decision Rules Details
According to Trusting Levels.
Experimental Results
Experimental Results Parameters Setting for Prostate Cancer Experiment
Train : 70% Test : 30%
Encoding Strategy : Discretized values are ranged on scale (1-3) .
Population Size : 117
Crossover Selection Parents : Random.
Crossover Probability : 0-2
Crossover position : Single point
Cut Position : Random
Fitness Threshold : 0-1
Termination Criteria : Set to the rules number specified by the physician.
Significance Level : 0 – 0.3 Less Significant
0.4 – 0.5 Medium Significant
> 0.5 Most Significant
Visualization of Prostate Cancer Reducts
Visualization of all reducts of the Prostate Cancer data set, with all features ordered by their occurrence in all extracted reducts.
Experimental Results
Visualization of Prostate Cancer Rules
Navigation through Refined Prostate Cancer Decision Rules According to
Trusting Levels.
71 R 117R 22000 R
Experimental Results
Hanaa Ismail Elshazly et al., “Weighted Reduct Selection Metaheuristic Based Approach for Rules Reduction and Visualization”, International Conference on Computing Communication and Automation (ICCCA 2016), IEEE, Buddh Nagar, Uttar Pradesh, India, 2016.
Visualization of Prostate Cancer Rules
Visualization of Refined Prostate Cancer Decision Rules According to Trusting
Levels.
Experimental Results
Visualization of Prostate Cancer Rules
Navigation through Refined Prostate Cancer Decision Rules According
to Trusting Levels.
Experimental Results
Performance analysis

Reduct Matching Approach: considers all features of the informative reduct.
Core Matching Approach: considers only the intersection of all reducts.

                    Breast Cancer (472)   Prostate (117)
                    Rules | Acc%          Rules | Acc%
Reduct Matching     400   | 98            71    | 99
Core Matching       212   | 95            60    | 98
Conclusions
• A model is proposed for knowledge-based classification and visualization of decision rules, which enhances the classification process and improves insight into the rules' knowledge.
• Physicians can detect a minimum number of rules with trust levels to reach an efficient diagnosis of diseases.
• An interactive visualization approach is presented.
• Users can freely explore large sets of rules by focusing their attention on limited subsets.
Future Work
• The promising results of the proposed model encourage applying it to other multidimensional data sets.
• Other dynamic visualization techniques can be applied to meet the different requirements of physicians.