CCIA'2008: On the dimensions of data complexity through synthetic data sets

On the dimensions of data complexity through synthetic data sets. Onzè Congrés Internacional de l’Associació Catalana d’Intel·ligència Artificial. Núria Macià, Ester Bernadó-Mansilla, Albert Orriols-Puig. {nmacia,esterb,aorriols}@salle.url.edu. Grup de Recerca en Sistemes Intel·ligents, Enginyeria i Arquitectura La Salle, Universitat Ramon Llull.


Transcript of CCIA'2008: On the dimensions of data complexity through synthetic data sets

Page 1: CCIA'2008: On the dimensions of data complexity through synthetic data sets

On the dimensions of data complexity through synthetic data sets

Onzè Congrés Internacional de l’Associació Catalana d’Intel·ligència Artificial

Núria Macià
Ester Bernadó-Mansilla
Albert Orriols-Puig
{nmacia,esterb,aorriols}@salle.url.edu

Grup de Recerca en Sistemes Intel·ligents
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull

Page 2: CCIA'2008: On the dimensions of data complexity through synthetic data sets

Motivation (I)

Competitive learners that extract accurate models from data have been developed
  Which is the best technique?
  My learner has outperformed another in 1000 data sets. Will it outperform it again in the 1001st data set?

Toward:
  Data characterization to know which method is the best for a given problem, and why


Page 3: CCIA'2008: On the dimensions of data complexity through synthetic data sets

Motivation (II)


Page 4: CCIA'2008: On the dimensions of data complexity through synthetic data sets

Motivation (III)

The purpose of this work is to:
  Analyze the impact of different data set characteristics
  Consider two frequent measures
    Number of attributes
    Number of instances
  Consider a measure about class geometry
    Length of the class boundary


Page 5: CCIA'2008: On the dimensions of data complexity through synthetic data sets

Outline

1. Approach
2. Complexity dimensions
3. Synthetic data sets
4. Design of experiments
5. Results
6. Conclusions
7. Further work


Page 6: CCIA'2008: On the dimensions of data complexity through synthetic data sets

1. Approach

Study how different dimensions influence classifier behavior
  Number of attributes
  Number of instances
  Length of the class boundary

Through synthetic data sets
  They provide a controlled scenario with known characteristics


Page 7: CCIA'2008: On the dimensions of data complexity through synthetic data sets

2. Complexity dimensions

Number of attributes and instances
  No relation is observed among these characteristics and classifiers’ accuracy


Page 8: CCIA'2008: On the dimensions of data complexity through synthetic data sets

2. Complexity dimensions

Length of the class boundary
  Linear correlation with classifiers’ accuracy


Page 9: CCIA'2008: On the dimensions of data complexity through synthetic data sets

2. Complexity dimensions

Length of the class boundary
  It estimates the density of points located near the class boundary
  To compute the measure:
    Build the minimum spanning tree (MST) connecting all the points regardless of class
    Count the number of edges joining opposite classes and divide this value by the total number of points
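The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes Euclidean distance between points, builds the MST with Prim's algorithm on the complete graph, and the names `mst_edges` and `boundary_length` are hypothetical.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph (assumed metric).

    Returns the n - 1 tree edges as (i, j) index pairs.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # distance from each outside point to the tree
    best_from = [-1] * n         # tree vertex realising that distance
    in_tree[0] = True
    for j in range(1, n):
        best_dist[j] = math.dist(points[0], points[j])
        best_from[j] = 0
    edges = []
    for _ in range(n - 1):
        # Attach the outside point that is closest to the current tree.
        k = min((j for j in range(n) if not in_tree[j]),
                key=lambda j: best_dist[j])
        in_tree[k] = True
        edges.append((best_from[k], k))
        for j in range(n):
            if not in_tree[j]:
                d = math.dist(points[k], points[j])
                if d < best_dist[j]:
                    best_dist[j] = d
                    best_from[j] = k
    return edges


def boundary_length(points, labels):
    """Edges joining opposite classes, divided by the number of points."""
    crossing = sum(1 for i, j in mst_edges(points) if labels[i] != labels[j])
    return crossing / len(points)


# Four collinear points, two per class: the MST is a chain with one
# class-crossing edge, so the measure is 1 / 4.
print(boundary_length([(0, 0), (1, 0), (2, 0), (3, 0)], [0, 0, 1, 1]))  # 0.25
```

The quadratic Prim variant keeps the sketch dependency-free; for large n a library MST routine would be the more practical choice.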


Page 10: CCIA'2008: On the dimensions of data complexity through synthetic data sets

3. Synthetic data sets

Real-world problems
  Do not cover all the complexity spectrum
  Are characterized by uncertainty, missing values, class imbalance…

Synthetic data sets
  Provide a controlled framework
  Permit varying different dimensions independently


Page 11: CCIA'2008: On the dimensions of data complexity through synthetic data sets

3. Synthetic data sets

Generation procedure
  Set the number of instances n, the number of attributes m, and the length of the class boundary b
  Generate n points that follow a uniform distribution
  Build the MST
  Label the points to obtain the required boundary

Attributes: 2

Instances: 17

Required boundary: 0.25
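The slide does not say how the points are labelled, so the following Python sketch fills that step with an assumption: it marks round(b * n) randomly chosen MST edges as class-crossing and then two-colours the tree so that exactly those edges join opposite classes. The function names, the `seed` parameter, the unit hypercube as sampling domain, and the Euclidean metric are all hypothetical choices made for illustration.

```python
import math
import random


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) pairs."""
    n = len(points)
    in_tree = [False] * n
    in_tree[0] = True
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        k = min(best, key=lambda j: best[j][0])   # closest outside point
        edges.append((best.pop(k)[1], k))
        in_tree[k] = True
        for j in best:
            d = math.dist(points[k], points[j])
            if d < best[j][0]:
                best[j] = (d, k)
    return edges


def generate_data_set(n, m, b, seed=0):
    """Sketch of the generation procedure under the assumptions stated above."""
    rng = random.Random(seed)
    # n points drawn uniformly from the unit hypercube [0, 1)^m.
    points = [tuple(rng.random() for _ in range(m)) for _ in range(n)]
    edges = mst_edges(points)
    # Assumed labelling rule: pick the edges that must cross the boundary.
    k = min(len(edges), round(b * n))
    crossing = set(rng.sample(edges, k))
    neighbours = {i: [] for i in range(n)}
    for i, j in edges:
        flip = (i, j) in crossing
        neighbours[i].append((j, flip))
        neighbours[j].append((i, flip))
    # Walk the tree from point 0, switching class on every chosen edge.
    labels = [None] * n
    labels[0] = 0
    stack = [0]
    while stack:
        u = stack.pop()
        for v, flip in neighbours[u]:
            if labels[v] is None:
                labels[v] = 1 - labels[u] if flip else labels[u]
                stack.append(v)
    return points, labels


if __name__ == "__main__":
    # The slide's example: 2 attributes, 17 instances, required boundary 0.25.
    points, labels = generate_data_set(n=17, m=2, b=0.25)
    achieved = sum(labels[i] != labels[j] for i, j in mst_edges(points)) / 17
    print(achieved)
```

Because the number of crossing edges must be an integer, the achieved boundary can only approximate the requested one: for n = 17 and b = 0.25 the sketch selects 4 edges, giving 4/17. Nothing in this rule controls class balance, which a real generator might also need to constrain.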


Page 12: CCIA'2008: On the dimensions of data complexity through synthetic data sets

4. Design of experiments

Three different paradigms of learners:
  Tree induction – C4.5
  Rule learning – PART
  Support vector machine – SMO

10-fold cross-validation
1000 synthetic data sets
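As a reminder of what the 10-fold protocol involves, here is a minimal Python sketch. It is not the authors' experimental code: the fold assignment, the `fit_predict` callback interface, and the `majority_class` stand-in learner are hypothetical, and the stand-in is not one of the three learners listed above.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cross_validated_accuracy(points, labels, fit_predict, k=10, seed=0):
    """Each fold is held out once; accuracy is pooled over all held-out points."""
    correct = 0
    for test_idx in k_fold_indices(len(points), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(points)) if i not in held_out]
        predictions = fit_predict(
            [points[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [points[i] for i in test_idx],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return correct / len(points)


def majority_class(train_x, train_y, test_x):
    """Hypothetical stand-in learner: predicts the most common training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_x)
```

Any learner can be plugged in through `fit_predict`, as long as it takes training points, training labels and test points and returns one predicted label per test point.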


Page 13: CCIA'2008: On the dimensions of data complexity through synthetic data sets

5. Results (I)

n = 200, b = 0.3
  Classifiers behave similarly
  Accuracy rates range in the interval [0.6, 0.8], around 1 - b = 0.7
  Variability indicates that other dimensions are required to describe the data complexity


Page 14: CCIA'2008: On the dimensions of data complexity through synthetic data sets

5. Results (II)

m = 2, b = 0.3
  Classifiers behave similarly
  Spread of accuracy rates decreases with increasing number of instances


Page 15: CCIA'2008: On the dimensions of data complexity through synthetic data sets

6. Conclusions

Information provided by the number of attributes and the number of instances can be embedded in the measure of length of the class boundary
Length of the class boundary is a relevant estimate of classifier accuracy, but it is not enough to describe the whole data set complexity
Using synthetic data sets throughout the experiments allows us to vary different dimensions independently and work under a controlled scenario


Page 16: CCIA'2008: On the dimensions of data complexity through synthetic data sets

7. Further work

Synthetic data sets do not contain similar structures to those of real-world problems
  Generate data sets that follow physical processes
  Study other complexity dimensions


Page 17: CCIA'2008: On the dimensions of data complexity through synthetic data sets

On the dimensions of data complexity through synthetic data sets

Onzè Congrés Internacional de l’Associació Catalana d’Intel·ligència Artificial

Núria Macià
Ester Bernadó-Mansilla
Albert Orriols-Puig
{nmacia,esterb,aorriols}@salle.url.edu

Grup de Recerca en Sistemes Intel·ligents
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull