HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior. 8th International Conference on Hybrid Intelligent Systems. Núria Macià, Albert Orriols-Puig, Ester Bernadó-Mansilla, {nmacia,aorriols,esterb}@salle.url.edu. Grup de Recerca en Sistemes Intel·ligents, Enginyeria i Arquitectura La Salle, Universitat Ramon Llull.


Transcript of HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

Page 1: HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

8th International Conference on Hybrid Intelligent Systems

Núria Macià, Albert Orriols-Puig, Ester Bernadó-Mansilla

{nmacia,aorriols,esterb}@salle.url.edu

Grup de Recerca en Sistemes Intel·ligents

Enginyeria i Arquitectura La Salle

Universitat Ramon Llull

Page 2: HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

Motivation

[Diagram with labels: Real-world problem, Data Set, Learner, Model, Knowledge Extraction, Knowledge, Prediction]

Necessity of synthetic data sets
  To evaluate real learners' performance under controlled scenarios

How to generate synthetic data sets?
  Data complexity (Ho & Basu, 2002)
  Length of the class boundary (Macià et al., 2008)

Objective: Set of benchmark problems to analyze learners' behavior

Slide 2

Overview and Future Research

Page 3: HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

Outline

1. Data complexity
2. Synthetic data sets
3. Design of GA
4. Experiments and results
5. Conclusions and further work


Page 4: HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

1. Data complexity

Length of the class boundary
  Build minimum spanning tree (MST) connecting all the points regardless of class
  Count the number of edges joining opposite classes

Two cases of many points in boundary:
  Very interleaved or random data
  Linearly separable problem with narrow margins
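
The boundary-length measure described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance, builds the MST with a simple O(n^2) Prim's algorithm, and the function names `mst_edges` and `boundary_length` are hypothetical.

```python
import math

def mst_edges(points):
    """Return the edges (i, j) of a Euclidean MST using Prim's algorithm."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection to the tree
    parent = [-1] * n            # tree vertex providing that connection
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        # Update the connection cost of the remaining vertices.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    parent[v] = u
    return edges

def boundary_length(edges, labels):
    """Count MST edges whose endpoints belong to different classes."""
    return sum(1 for i, j in edges if labels[i] != labels[j])

# Toy example: four points on a line (assumed data, for illustration only).
points = [(0.0,), (1.0,), (2.0,), (3.0,)]
edges = mst_edges(points)
separated = boundary_length(edges, [0, 0, 1, 1])    # one boundary edge
interleaved = boundary_length(edges, [0, 1, 0, 1])  # every edge is a boundary edge
```

The two labelings show the extremes: well-separated classes give a short boundary, while interleaved classes put every MST edge on the boundary.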


Page 5: HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

2. Synthetic data sets

Generation procedure
  Set the number of instances n, the number of attributes m, and the length of the class boundary b.
  Generate n points distributed randomly and build the MST.
  Label the class of each instance.
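
The point-generation step of this procedure could look like the sketch below. The slide only says the points are "distributed randomly"; drawing each attribute uniformly from [0, 1) is an assumption, and `generate_points` and its `seed` parameter are hypothetical names introduced for illustration.

```python
import random

def generate_points(n, m, seed=None):
    """Draw n instances with m attributes each, uniformly in [0, 1)."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(m)) for _ in range(n)]

# Example parameters (assumed): 6 instances, 3 attributes.
points = generate_points(n=6, m=3, seed=42)
```

Building the MST over these points and choosing the labels are separate steps; only the labels are searched, since the points and the tree stay fixed.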


Page 6: HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

2. Synthetic data sets

Exhaustive search
  Labelings grow exponentially with the number of instances

Heuristic search
  Demanded length of the class boundary is not always achieved
  No diverse solutions

Genetic algorithm
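
To make the cost of exhaustive search concrete, the sketch below enumerates every labeling of a tiny tree and keeps those with the demanded boundary length. It is an illustration under assumptions: the tree is a hand-written path of four instances, and `exhaustive_labelings` is a hypothetical helper. With k classes and n instances the loop visits k^n labelings, which is why this approach does not scale.

```python
from itertools import product

def boundary_length(edges, labels):
    """Count tree edges whose endpoints belong to different classes."""
    return sum(1 for i, j in edges if labels[i] != labels[j])

def exhaustive_labelings(edges, n, b, k=2):
    """Yield every k-ary labeling of n instances with boundary length b."""
    for labels in product(range(k), repeat=n):
        if boundary_length(edges, labels) == b:
            yield labels

# Toy tree: a path over four instances (assumed, for illustration only).
edges = [(0, 1), (1, 2), (2, 3)]
solutions = list(exhaustive_labelings(edges, n=4, b=1))
total = 2 ** 4  # number of labelings visited for k = 2, n = 4
```

Here 6 of the 16 labelings have exactly one boundary edge; doubling n to 8 already raises the number of labelings to 256.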


Page 7: HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

3. Design of GA

Knowledge representation
  k-ary string where the bit i stores the class label of the ith instance

Data set i
  Att. 1   Att. 2   …   Att. N   Class
  0.4      0.5      …   0.4      0
  0.2      1.0      …   0.2      1
  0.5      0.3      …   0.4      1
  0.6      0.5      …   0.4      0
  0.7      0.1      …   1.0      1
  0.5      0.3      …   0.9      1

Individual i
  0 1 1 0 1 1
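
The representation on this slide can be read as follows: the attribute values are fixed, and an individual only supplies the class column. The sketch below decodes an individual into a labeled data set using the values shown on the slide; `decode` is a hypothetical helper name.

```python
def decode(points, individual):
    """Attach the i-th label of the individual to the i-th instance."""
    if len(points) != len(individual):
        raise ValueError("one label is needed per instance")
    return [tuple(p) + (label,) for p, label in zip(points, individual)]

# Attribute values and individual taken from the slide's example.
points = [
    (0.4, 0.5, 0.4),
    (0.2, 1.0, 0.2),
    (0.5, 0.3, 0.4),
    (0.6, 0.5, 0.4),
    (0.7, 0.1, 1.0),
    (0.5, 0.3, 0.9),
]
individual = [0, 1, 1, 0, 1, 1]
data_set = decode(points, individual)
```

Each row of `data_set` is an instance followed by its class, so the first row is `(0.4, 0.5, 0.4, 0)`.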


Page 8: HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

3. Design of GA

Genetic operators
  s-wise tournament selection
  Two-point crossover
  Bit-wise mutation

Fitness function
  $\mathit{fitness}_i = |b_{obj} - b_i|$
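
A simple GA built from the operators named on this slide might look like the sketch below. It is not the authors' implementation. Assumptions: fitness is the absolute difference between the demanded and the achieved boundary length (lower is better), the MST is passed in as a precomputed edge list, and all parameter values (`pop_size`, `p_cross`, `p_mut`, and so on) are illustrative defaults.

```python
import random

def boundary_length(edges, labels):
    """Count tree edges whose endpoints belong to different classes."""
    return sum(1 for i, j in edges if labels[i] != labels[j])

def fitness(edges, labels, b_obj):
    """Distance to the demanded boundary length; 0 means the target is met."""
    return abs(b_obj - boundary_length(edges, labels))

def tournament(pop, fits, s, rng):
    """s-wise tournament: sample s individuals, keep the fittest one."""
    contenders = rng.sample(range(len(pop)), s)
    return pop[min(contenders, key=lambda i: fits[i])]

def two_point_crossover(a, b, rng):
    """Exchange the segment between two randomly chosen cut points."""
    lo, hi = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:lo] + b[lo:hi] + a[hi:], b[:lo] + a[lo:hi] + b[hi:]

def mutate(ind, k, p_mut, rng):
    """With probability p_mut, redraw a label uniformly (it may keep its value)."""
    return [rng.randrange(k) if rng.random() < p_mut else g for g in ind]

def run_ga(edges, n, b_obj, k=2, pop_size=30, generations=100,
           s=3, p_cross=0.8, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    best = min(pop, key=lambda ind: fitness(edges, ind, b_obj))
    for _ in range(generations):
        if fitness(edges, best, b_obj) == 0:
            break  # demanded boundary length reached
        fits = [fitness(edges, ind, b_obj) for ind in pop]
        nxt = []
        while len(nxt) < pop_size:
            p1 = tournament(pop, fits, s, rng)
            p2 = tournament(pop, fits, s, rng)
            if rng.random() < p_cross:
                c1, c2 = two_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            nxt += [mutate(c1, k, p_mut, rng), mutate(c2, k, p_mut, rng)]
        pop = nxt[:pop_size]
        cand = min(pop, key=lambda ind: fitness(edges, ind, b_obj))
        if fitness(edges, cand, b_obj) < fitness(edges, best, b_obj):
            best = cand  # remember the best individual seen so far
    return best, fitness(edges, best, b_obj)

# Toy tree: a path over ten instances (assumed, for illustration only).
edges = [(i, i + 1) for i in range(9)]
best, best_fit = run_ga(edges, n=10, b_obj=4)
```

Because the points and the MST never change, each fitness evaluation only needs one pass over the edge list, so the tree is built once before the GA starts.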


Page 9: HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

4. Experiments and results (I)

Synthetic data set generation
  Number of different solutions < number of solutions
    The population converges to the same solution
    {0100,1011} are equivalent individuals
  Intermediate complexities are obtained in early generations
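
The equivalence of {0100,1011} follows from the fact that swapping the two class labels leaves every boundary edge in place. One way to count genuinely different solutions is to map each labeling and its complement to a single representative, as sketched below. This assumes two classes and string-encoded individuals; `canonical` and `count_distinct` are hypothetical names. With k classes, equivalence would cover every permutation of the labels.

```python
def canonical(individual):
    """Map a binary labeling and its complement to one representative.

    The representative is the version whose first label is '0'.
    """
    if individual[0] == "1":
        return "".join("1" if g == "0" else "0" for g in individual)
    return individual

def count_distinct(population):
    """Number of solutions that differ beyond a swap of the class labels."""
    return len({canonical(ind) for ind in population})

# Example population (assumed): four individuals, two of them repeated.
population = ["0100", "1011", "0110", "1011"]
distinct = count_distinct(population)
```

In this example the four individuals collapse to two distinct solutions, `0100` and `0110`.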


Page 10: HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

4. Experiments and results (II)

Analysis of classifiers behavior
  Three different paradigms: C4.5, Naïve Bayes, and SMO
  Similar accuracy rates with noticeable variability


Page 11: HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

5. Conclusions

The GA allows us to generate data sets with the demanded length of the class boundary


Page 12: HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

6. Further work

Efficiency and scalability
  Move from simple GA to competent GA

Capacity of satisfying multiple criteria
  Multi-objective strategy

Achieve structure of real-world problems
  Provide a set of benchmark problems

