SPRINT : A Scalable Parallel Classifier for Data Mining

John Shafer, Rakesh Agrawal, Manish Mehta


Page 1: SPRINT : A Scalable Parallel Classifier for Data Mining

SPRINT : A Scalable Parallel Classifier for Data Mining

John Shafer, Rakesh Agrawal, Manish Mehta

Page 2: SPRINT : A Scalable Parallel Classifier for Data Mining

PATHWAY

Terms
Partition Algorithm
Data Structures
Performing Split
Serial SPRINT
Parallel SPRINT
Results

Page 3: SPRINT : A Scalable Parallel Classifier for Data Mining

Terms

Training Data Set

Attributes : Categorical and Continuous

Class Label

Page 4: SPRINT : A Scalable Parallel Classifier for Data Mining

Partition Algorithm

Partition( Data S ) {
    if all points in S are in the same class
        return
    for each attribute A
        evaluate split on attribute A
    find best split
    partition S into S1 and S2
    call Partition( S1 )
    call Partition( S2 )
}
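The recursion above can be sketched in Python. This is a minimal illustration, not the SPRINT implementation: it assumes numeric attributes only, uses the weighted gini measure from the later slides to pick the best split, and the names `gini`, `best_split` and `partition` as well as the dictionary tree layout are hypothetical. The data is the Age column of the example training set.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (score, attribute index, threshold) with the lowest weighted gini, or None."""
    best = None
    n = len(rows)
    for a in range(len(rows[0])):                      # for each attribute A
        values = sorted(set(r[a] for r in rows))
        for lo, hi in zip(values, values[1:]):         # evaluate split on attribute A
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[a] < t]
            right = [y for r, y in zip(rows, labels) if r[a] >= t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, a, t)
    return best

def partition(rows, labels):
    """Recursive Partition(S): returns a nested dict describing the tree."""
    if len(set(labels)) == 1:                          # all points in the same class
        return {"leaf": labels[0]}
    split = best_split(rows, labels)
    if split is None:                                  # no split possible: majority class
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, a, t = split
    s1 = [(r, y) for r, y in zip(rows, labels) if r[a] < t]
    s2 = [(r, y) for r, y in zip(rows, labels) if r[a] >= t]
    return {
        "attr": a,
        "threshold": t,
        "left": partition([r for r, _ in s1], [y for _, y in s1]),
        "right": partition([r for r, _ in s2], [y for _, y in s2]),
    }

rows = [(23,), (17,), (43,), (68,), (32,), (20,)]
labels = ["High", "High", "High", "Low", "Low", "High"]
tree = partition(rows, labels)
```

On this data the root split lands at Age < 27.5, the same threshold shown in the attribute-list example later in the deck.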

Page 5: SPRINT : A Scalable Parallel Classifier for Data Mining

Data Structures

Attribute Lists

Histograms : Continuous and Categorical

Page 6: SPRINT : A Scalable Parallel Classifier for Data Mining

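Finding Split Point

Gini(S) = 1 - Sum( Pj*Pj ), where Pj is the relative frequency of class j in S

Gini Index(S) = Gini(S1)*n1/n + Gini(S2)*n2/n, where the split divides the n records of S into S1 (n1 records) and S2 (n2 records)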
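The two formulas can be checked numerically with a short sketch. The function names `gini` and `gini_index` are illustrative choices, and the labels are taken from the six-record example used later in the deck (four High, two Low).

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - Sum(Pj * Pj), with Pj the relative frequency of class j."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_index(s1, s2):
    """Gini Index(S) = Gini(S1)*n1/n + Gini(S2)*n2/n for a binary split."""
    n1, n2 = len(s1), len(s2)
    n = n1 + n2
    return gini(s1) * n1 / n + gini(s2) * n2 / n

whole = ["High"] * 4 + ["Low"] * 2
below = ["High", "High", "High"]          # records with Age < 27.5
above = ["Low", "High", "Low"]            # records with Age >= 27.5

g_whole = gini(whole)                     # 1 - (16/36 + 4/36) = 4/9
g_split = gini_index(below, above)        # 0.5 * 0 + 0.5 * 4/9 = 2/9
```

A lower index means purer partitions, so the split with the smallest Gini Index is the one chosen.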

Page 7: SPRINT : A Scalable Parallel Classifier for Data Mining

Split on Continuous Attributes

Threshold value : Cabove and Cbelow

Sorted Once and Sequential Scan

Deallocation of Cabove and Cbelow
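A rough sketch of the sequential scan follows. It assumes the attribute list is already sorted by value, that candidate thresholds are the midpoints between consecutive distinct values, and that the two histograms are plain dictionaries; `best_continuous_split` and `gini_from_counts` are hypothetical names.

```python
def gini_from_counts(counts):
    """Gini impurity computed from a {class: count} histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_continuous_split(attribute_list):
    """Scan a sorted list of (value, class, rid) once; return (score, threshold)."""
    c_below = {}                       # classes of records already scanned
    c_above = {}                       # classes of records not yet scanned
    for _value, label, _rid in attribute_list:
        c_above[label] = c_above.get(label, 0) + 1
        c_below.setdefault(label, 0)
    n = len(attribute_list)
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label, _rid = attribute_list[i]
        c_below[label] += 1            # move the record from Cabove to Cbelow
        c_above[label] -= 1
        next_value = attribute_list[i + 1][0]
        if next_value == value:        # no threshold separates equal values
            continue
        n_below = i + 1
        score = (n_below / n) * gini_from_counts(c_below) \
            + ((n - n_below) / n) * gini_from_counts(c_above)
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best

age_list = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
score, threshold = best_continuous_split(age_list)
```

Because only the two histograms are updated per record, the scan needs a single pass and the histograms can be discarded once the best threshold is known.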

Page 8: SPRINT : A Scalable Parallel Classifier for Data Mining

Split on Categorical Attributes

Create Count-Matrix

All subsets of attribute values as possible split point

Compute Gini Index

Gini from Count Matrix only

Memory deallocation
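The subset search can be sketched as below, working from the count matrix alone. The exhaustive enumeration is an assumption that matches the "all subsets" bullet; it grows exponentially with the number of distinct values, so it is only practical for small value sets. The names `best_categorical_split` and `gini_from_counts` are illustrative, and the matrix is the CarT example from the later slide.

```python
from itertools import combinations

def gini_from_counts(counts):
    """Gini impurity computed from a {class: count} histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_categorical_split(count_matrix):
    """count_matrix maps attribute value -> {class: count}; returns (score, subset)."""
    values = sorted(count_matrix)
    classes = sorted({c for row in count_matrix.values() for c in row})
    total = {c: sum(count_matrix[v].get(c, 0) for v in values) for c in classes}
    n = sum(total.values())
    best = (float("inf"), None)
    # Every proper, non-empty subset of values is a candidate "A in subset" test.
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            inside = {c: sum(count_matrix[v].get(c, 0) for v in subset) for c in classes}
            outside = {c: total[c] - inside[c] for c in classes}
            n_in = sum(inside.values())
            score = n_in / n * gini_from_counts(inside) \
                + (n - n_in) / n * gini_from_counts(outside)
            if score < best[0]:
                best = (score, set(subset))
    return best

cart_matrix = {
    "family": {"High": 2, "Low": 1},
    "sport": {"High": 2, "Low": 0},
    "truck": {"High": 0, "Low": 1},
}
cat_score, cat_subset = best_categorical_split(cart_matrix)
```

A subset and its complement describe the same split, so half of the enumerated candidates are redundant; the sketch keeps them for simplicity.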

Page 9: SPRINT : A Scalable Parallel Classifier for Data Mining

Perform Split and Partitioning

Select splitting attribute and splitting value

Create two child nodes and divide data on RIDs

Optimization using Hashing <RID,child-ptr>

Optimization depending on number of RIDs

Partitioned Hashing for large hash-table

Create new histogram and count-matrix of children
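The hashing step can be illustrated with a small in-memory sketch. It assumes the whole hash table fits in memory (the partitioned-hashing case is not covered), and the function name `split_attribute_lists`, the `goes_left` predicate and the "left"/"right" child labels are hypothetical.

```python
def split_attribute_lists(attribute_lists, split_attr, goes_left):
    """Divide every attribute list between two children using a rid -> child hash table.

    attribute_lists maps attribute name -> list of (value, class, rid).
    goes_left is the test applied to values of the splitting attribute.
    """
    # Scan the splitting attribute's list once and record where each rid goes.
    rid_to_child = {}
    for value, _label, rid in attribute_lists[split_attr]:
        rid_to_child[rid] = "left" if goes_left(value) else "right"

    # Probe the hash table while scanning every list; entry order is preserved,
    # so a sorted continuous list stays sorted in both children.
    children = {"left": {}, "right": {}}
    for name, entries in attribute_lists.items():
        children["left"][name] = []
        children["right"][name] = []
        for entry in entries:
            children[rid_to_child[entry[2]]][name].append(entry)
    return children

lists = {
    "Age": [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
    "CarT": [("Family", "High", 0), ("Sport", "High", 1), ("Sport", "High", 2),
             ("Family", "Low", 3), ("Truck", "Low", 4), ("Family", "High", 5)],
}
children = split_attribute_lists(lists, "Age", lambda age: age < 27.5)
```

The non-splitting lists (CarT here) carry no Age values, which is why the rid lookup is needed to route their entries.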

Page 10: SPRINT : A Scalable Parallel Classifier for Data Mining

Parallel SPRINT

Environment : Shared nothing

Data placement and workload balancing

Parallel computation of categorical attribute lists

Page 11: SPRINT : A Scalable Parallel Classifier for Data Mining

Repartition of Continuous Attributes

Global Sort

Equal re-partitioning

Relation between Cabove and Cbelow and processor number

Parallel computation of split index
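One way to read the relation between the histograms and the processor number is sketched below as a serial simulation. It assumes that each processor holds one contiguous section of the globally sorted list, that Cbelow starts with the class counts of all lower-ranked sections, and that Cabove starts with the counts of the processor's own and all higher-ranked sections. The function names and the two-processor layout are illustrative, and no real message passing is shown.

```python
from collections import Counter

def gini(counts):
    """Gini impurity computed from a {class: count} histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def local_best_split(sections, rank):
    """Best (score, threshold) found by processor `rank` over its own section."""
    # Assumed initialisation: lower ranks feed Cbelow, own and higher ranks feed Cabove.
    c_below = Counter(label for s in sections[:rank] for _, label, _ in s)
    c_above = Counter(label for s in sections[rank:] for _, label, _ in s)
    n = sum(len(s) for s in sections)
    # Values held by higher-ranked processors, needed for the boundary candidate.
    following = [rec[0] for s in sections[rank + 1:] for rec in s]
    own = sections[rank]
    best = (float("inf"), None)
    for i, (value, label, _rid) in enumerate(own):
        c_below[label] += 1
        c_above[label] -= 1
        if i + 1 < len(own):
            next_value = own[i + 1][0]
        elif following:
            next_value = following[0]
        else:
            break                      # last record overall: nothing above it
        if next_value == value:
            continue
        n_below = sum(c_below.values())
        score = n_below / n * gini(c_below) + (n - n_below) / n * gini(c_above)
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best

age_list = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
sections = [age_list[:3], age_list[3:]]                # two simulated processors
local_results = [local_best_split(sections, r) for r in range(len(sections))]
global_best = min(local_results, key=lambda result: result[0])
```

Each processor scans only its own section, and the overall split is the best of the local results, which here matches the serial threshold of 27.5.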

Page 12: SPRINT : A Scalable Parallel Classifier for Data Mining

Split point for Categorical Attributes

Create global matrix at coordinator

Compute split-index
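Building the global matrix amounts to summing the local count matrices cell by cell. The sketch below simulates that on one machine; the function name `merge_count_matrices` and the way the six CarT records are divided between two processors are assumptions made for illustration.

```python
from collections import Counter

def merge_count_matrices(local_matrices):
    """Sum per-processor {value: {class: count}} matrices into one global matrix."""
    merged = {}
    for local in local_matrices:
        for value, counts in local.items():
            merged.setdefault(value, Counter()).update(counts)
    # Return plain dicts so the result is easy to compare and print.
    return {value: dict(counts) for value, counts in merged.items()}

# Hypothetical placement: processor 0 holds rids 0-2, processor 1 holds rids 3-5.
local_0 = {"family": {"High": 1}, "sport": {"High": 2}}
local_1 = {"family": {"High": 1, "Low": 1}, "truck": {"Low": 1}}
global_matrix = merge_count_matrices([local_0, local_1])
```

Once the coordinator holds the global matrix, the split index can be computed exactly as in the serial case, since the gini calculation needs nothing but the counts.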

Page 13: SPRINT : A Scalable Parallel Classifier for Data Mining

Partitioning

Collect RIDs of splitting attributes from processors

Exchange RIDs

Page 14: SPRINT : A Scalable Parallel Classifier for Data Mining

Attribute lists for node 0 (before the split)

Age   Class   Rid
17    High    1
20    High    5
23    High    0
32    Low     4
43    High    2
68    Low     3

CarT     Class   Rid
Family   High    0
Sport    High    1
Sport    High    2
Family   Low     3
Truck    Low     4
Family   High    5

Split at node 0: Age < 27.5

Attribute lists for node 1 (Age < 27.5)

Age   Class   Rid
17    High    1
20    High    5
23    High    0

CarT     Class   Rid
Family   High    0
Sport    High    1
Family   High    5

Attribute lists for node 2 (Age >= 27.5)

Age   Class   Rid
32    Low     4
43    High    2
68    Low     3

CarT     Class   Rid
Sport    High    2
Family   Low     3
Truck    Low     4

Page 15: SPRINT : A Scalable Parallel Classifier for Data Mining

Attribute List

Age   Class   Rid
17    High    1
20    High    5
23    High    0
32    Low     4
43    High    2
68    Low     3

Class histograms at two cursor positions (H = High, L = Low)

Position 0
          H   L
Cbelow    0   0
Cabove    4   2

Position 3
          H   L
Cbelow    3   0
Cabove    1   2

Page 16: SPRINT : A Scalable Parallel Classifier for Data Mining

Attribute List

CarT     Class   Rid
Family   High    0
Sport    High    1
Sport    High    2
Family   Low     3
Truck    Low     4
Family   High    5

Count Matrix

          H   L
family    2   1
sport     2   0
truck     0   1
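The count matrix is filled in a single pass over the attribute list, as the following sketch shows. The name `build_count_matrix` and the dictionary-of-dictionaries layout are illustrative choices, not taken from the slides.

```python
def build_count_matrix(attribute_list):
    """Count class labels per attribute value from a list of (value, class, rid)."""
    matrix = {}
    for value, label, _rid in attribute_list:
        row = matrix.setdefault(value, {})
        row[label] = row.get(label, 0) + 1
    return matrix

cart_list = [("Family", "High", 0), ("Sport", "High", 1), ("Sport", "High", 2),
             ("Family", "Low", 3), ("Truck", "Low", 4), ("Family", "High", 5)]
count_matrix = build_count_matrix(cart_list)
```

Classes that never occur for a value are simply absent from its row, which corresponds to the zero cells in the matrix above.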

Page 17: SPRINT : A Scalable Parallel Classifier for Data Mining

Breakdown of Response Time

Page 18: SPRINT : A Scalable Parallel Classifier for Data Mining

Scaleup of SPRINT

Page 19: SPRINT : A Scalable Parallel Classifier for Data Mining

Speedup of SPRINT

Page 20: SPRINT : A Scalable Parallel Classifier for Data Mining

Sizeup of SPRINT

Page 21: SPRINT : A Scalable Parallel Classifier for Data Mining

Example: Decision Tree

Training Set

Age   CarT     Risk
23    Family   High
17    Sports   High
43    Sports   High
68    Family   Low
32    Truck    Low
20    Family   High

Decision Tree

Age < 25
    yes: High
    no:  CarType = sports
             yes: High
             no:  Low
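The example tree can be written as a small function and checked against the six training records. The function name `classify` and the tuple layout of `training_set` are illustrative.

```python
training_set = [
    (23, "Family", "High"),
    (17, "Sports", "High"),
    (43, "Sports", "High"),
    (68, "Family", "Low"),
    (32, "Truck", "Low"),
    (20, "Family", "High"),
]

def classify(age, car_type):
    """Walk the example tree: first test Age, then CarType."""
    if age < 25:
        return "High"
    if car_type == "Sports":
        return "High"
    return "Low"

predictions = [classify(age, car_type) for age, car_type, _risk in training_set]
matches = [p == risk for p, (_age, _car, risk) in zip(predictions, training_set)]
```

Every record in the training set is classified correctly by this tree, so all entries of `matches` are true.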