HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior
Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior
8th International Conference on Hybrid Intelligent Systems
Núria Macià, Albert Orriols-Puig, Ester Bernadó-Mansilla
{nmacia,aorriols,esterb}@salle.url.edu
Grup de Recerca en Sistemes Intel·ligents
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
Motivation
[Diagram: Real-world problem, Data set, Learner, Model, Knowledge extraction, Knowledge + Prediction]
Necessity of synthetic data sets
  To evaluate real learners' performance under controlled scenarios
How to generate synthetic data sets?
  Data complexity (Ho & Basu, 2002)
  Length of the class boundary (Macià et al., 2008)
Objective: a set of benchmark problems to analyze learners' behavior
Slide 2
Overview and Future Research
Outline
1. Data complexity
2. Synthetic data sets
3. Design of GA
4. Experiments and results
5. Conclusions and further work
Slide 3
1. Data complexity
Length of the class boundary
  Build the minimum spanning tree (MST) connecting all the points regardless of class
  Count the number of edges joining opposite classes
Two cases of many points in the boundary:
  Very interleaved or random data
  Linearly separable problem with narrow margins
Slide 4
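The boundary measure described on slide 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance, builds the MST with Prim's algorithm, and uses the hypothetical names `mst_edges` and `boundary_length`. It returns the raw count of cross-class edges, as the slide states.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns n-1 edges (i, j)."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known link from the tree to each point
    parent = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u))
        # Relax the links from the newly added point.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    parent[v] = u
    return edges

def boundary_length(points, labels):
    """Number of MST edges whose two endpoints carry different class labels."""
    return sum(1 for i, j in mst_edges(points) if labels[i] != labels[j])

# Four points on a line: the MST is the chain 0-1-2-3.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
print(boundary_length(points, [0, 0, 1, 1]))  # 1: a single edge crosses classes
print(boundary_length(points, [0, 1, 0, 1]))  # 3: every edge crosses classes
```

The second call illustrates the "very interleaved" case: the same points yield the maximum boundary length once the labels alternate.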
2. Synthetic data sets
Generation procedure
  Set the number of instances n, the number of attributes m, and the length of the class boundary b.
  Generate n points distributed randomly and build the MST.
  Label the class of each instance.
Slide 5
2. Synthetic data sets
Exhaustive search
  Labelings grow exponentially with the number of instances
Heuristic search
  Demanded length of the class boundary is not always achieved
  No diverse solutions
Genetic algorithm
Slide 6
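To make the exponential growth behind the exhaustive search concrete, the sketch below enumerates every labeling of a tiny problem and keeps those that hit the demanded boundary length. It assumes the MST edges are already available as index pairs; the names `boundary` and `exhaustive_labelings` and the four-point chain are illustrative, not taken from the slides.

```python
from itertools import product

def boundary(edges, labels):
    """Count the MST edges that join instances of different classes."""
    return sum(1 for i, j in edges if labels[i] != labels[j])

def exhaustive_labelings(edges, n, k, b):
    """Return every k-ary labeling of n instances with exactly b cross-class edges."""
    return [labels for labels in product(range(k), repeat=n)
            if boundary(edges, labels) == b]

# Hypothetical MST over four points on a line.
chain = [(0, 1), (1, 2), (2, 3)]
matches = exhaustive_labelings(chain, n=4, k=2, b=1)
print(len(matches), "of", 2 ** 4, "labelings have boundary length 1")  # 6 of 16

# The search space is k**n, so enumeration stops being practical very quickly.
for n in (10, 20, 40):
    print(n, "instances ->", 2 ** n, "binary labelings")
```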
3. Design of GA
Knowledge representation
  k-ary string where position i stores the class label of the ith instance

Data set i
Att. 1   Att. 2   …   Att. N   Class
0.4      0.5          0.4      0
0.2      1.0          0.2      1
0.5      0.3          0.4      1
0.6      0.5          0.4      0
0.7      0.1          1.0      1
0.5      0.3          0.9      1

Individual i: 0 1 1 0 1 1
Slide 7
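The representation on slide 7 can be read as follows: the attribute values stay fixed, and an individual only supplies the class column. A minimal sketch, using the values shown on the slide (the three numbers per row are taken to be Att. 1, Att. 2 and Att. N) and a hypothetical helper `decode`:

```python
# Fixed attribute values of the data set (Att. 1, Att. 2, Att. N as shown on the slide).
attributes = [
    (0.4, 0.5, 0.4),
    (0.2, 1.0, 0.2),
    (0.5, 0.3, 0.4),
    (0.6, 0.5, 0.4),
    (0.7, 0.1, 1.0),
    (0.5, 0.3, 0.9),
]

# One individual: position i holds the class label of the ith instance.
individual = [0, 1, 1, 0, 1, 1]

def decode(attributes, individual):
    """Attach each label of the individual to the matching instance."""
    if len(attributes) != len(individual):
        raise ValueError("one label per instance is required")
    return [attrs + (label,) for attrs, label in zip(attributes, individual)]

for row in decode(attributes, individual):
    print(row)
```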
3. Design of GA
Genetic operators
  s-wise tournament selection
  Two-point crossover
  Bit-wise mutation
Fitness function
  fitness_i = |b_obj - b_i|
Slide 8
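The operators listed on slide 8 can be combined into a simple generational GA. The sketch below is an illustration under stated assumptions, not the authors' code: fitness is taken to be the absolute difference between the demanded and the achieved boundary length (lower is better), the MST edges are passed in as index pairs, mutation resamples a position uniformly among the k labels, and all parameter values (`pop_size`, `s`, `p_cross`, `p_mut`, `generations`) are hypothetical defaults.

```python
import random

def boundary(edges, labels):
    """Count the MST edges that join instances of different classes."""
    return sum(1 for i, j in edges if labels[i] != labels[j])

def fitness(edges, labels, b_obj):
    """Assumed fitness: distance to the demanded boundary length (0 is ideal)."""
    return abs(b_obj - boundary(edges, labels))

def tournament(pop, fits, s, rng):
    """s-wise tournament: draw s individuals, return the one with lowest fitness."""
    contenders = rng.sample(range(len(pop)), s)
    return pop[min(contenders, key=lambda i: fits[i])]

def two_point_crossover(a, b, rng):
    """Swap the segment between two random cut points."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def mutate(ind, k, p_mut, rng):
    """With probability p_mut per position, resample the label among the k classes."""
    return [rng.randrange(k) if rng.random() < p_mut else g for g in ind]

def run_ga(edges, n, k, b_obj, pop_size=30, s=3, p_cross=0.8, p_mut=0.05,
           generations=200, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        fits = [fitness(edges, ind, b_obj) for ind in pop]
        if min(fits) == 0:          # demanded boundary length reached
            break
        offspring = []
        while len(offspring) < pop_size:
            p1 = tournament(pop, fits, s, rng)
            p2 = tournament(pop, fits, s, rng)
            if rng.random() < p_cross:
                c1, c2 = two_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            offspring.append(mutate(c1, k, p_mut, rng))
            offspring.append(mutate(c2, k, p_mut, rng))
        pop = offspring[:pop_size]
    return min(pop, key=lambda ind: fitness(edges, ind, b_obj))

# Hypothetical MST: a chain over 10 instances; ask for 4 cross-class edges.
edges = [(i, i + 1) for i in range(9)]
best = run_ga(edges, n=10, k=2, b_obj=4)
print(best, boundary(edges, best))
```

Passing the edges in keeps the MST outside the loop: the points never move, so the tree only has to be built once per data set.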
4. Experiments and results (I)
Synthetic data set generation
  Number of different solutions < number of solutions
    Population converges to the same solution
    {0100, 1011} are equivalent individuals
  Intermediate complexities are obtained in early generations
Slide 9
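The equivalence of {0100, 1011} can be checked mechanically: swapping class names leaves every cross-class edge in place, so both strings describe the same partition and the same boundary length. The sketch below assumes that "equivalent" means "identical up to a renaming of the classes" and uses a hypothetical helper `canonical` that relabels classes in order of first appearance, which gives a way to count truly different solutions in a population.

```python
def canonical(labels):
    """Relabel classes in order of first appearance so renamed copies coincide."""
    mapping = {}
    out = []
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping)
        out.append(mapping[lab])
    return tuple(out)

def count_distinct(population):
    """Number of different solutions once class renamings are merged."""
    return len({canonical(ind) for ind in population})

population = [[0, 1, 0, 0], [1, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 0]]
print(canonical([0, 1, 0, 0]))     # (0, 1, 0, 0)
print(canonical([1, 0, 1, 1]))     # (0, 1, 0, 0)
print(count_distinct(population))  # 2
```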
4. Experiments and results (II)
Analysis of classifiers behavior
  Three different paradigms: C4.5, Naïve Bayes, and SMO
  Similar accuracy rates with noticeable variability
Slide 10
5. Conclusions
The GA allows us to generate data sets with the demanded length of the class boundary
Slide 11
6. Further work
Efficiency and scalability
  Move from simple GA to competent GA
Capacity of satisfying multiple criteria
  Multi-objective strategy
Achieve structure of real-world problems
Provide a set of benchmark problems
Slide 12