Multiclassification with Decision Tree in Spark MLlib 1.3



  • References

Advanced Analytics with Spark (O'Reilly, 2015)
    Apache Spark MLlib API documentation
    UCI Machine Learning Repository

  • A simple Decision Tree

    Figure 4-1. Decision tree: Is it spoiled?

  • LabeledPoint

    The Spark MLlib abstraction for a feature vector is known as a LabeledPoint, which consists of a Spark MLlib Vector of features and a target value, here called the label.
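    A minimal sketch of constructing one (the numbers are illustrative, loosely taken from the first Covtype row shown below):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // label 0.0 paired with a three-element dense feature vector
    val point = LabeledPoint(0.0, Vectors.dense(2596.0, 51.0, 3.0))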

  • 1-of-n Encoding

    LabeledPoint can also be used with categorical features, given an appropriate encoding: an n-valued categorical feature becomes n binary 0/1 features, of which exactly one is 1.
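    For instance, a hypothetical 3-valued weather feature (sunny, cloudy, rainy) encoded 1-of-3:

    import org.apache.spark.mllib.linalg.Vectors

    // "cloudy" is the second of three categories, so only the second binary feature is set
    val weather = Vectors.dense(0.0, 1.0, 0.0)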

  • Covtype Dataset

    The dataset records the types of forest covering parcels of land in Colorado, USA.

    Name                                 Data Type     Measurement
    Elevation                            quantitative  meters
    Aspect                               quantitative  azimuth
    Slope                                quantitative  degrees
    Horizontal_Distance_To_Hydrology     quantitative  meters
    Vertical_Distance_To_Hydrology       quantitative  meters
    Horizontal_Distance_To_Roadways      quantitative  meters
    Hillshade_9am                        quantitative  0 to 255 index
    Hillshade_Noon                       quantitative  0 to 255 index
    Hillshade_3pm                        quantitative  0 to 255 index
    Horizontal_Distance_To_Fire_Points   quantitative  meters
    Wilderness_Area (4 binary columns)   qualitative   0 or 1
    Soil_Type (40 binary columns)        qualitative   0 or 1
    Cover_Type (7 types)                 integer       1 to 7

    Two sample rows (54 feature columns followed by the Cover_Type label):

    2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
    2590,56,2,212,-6,390,220,235,151,6225,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5

    https://archive.ics.uci.edu/ml/datasets/Covertype

  • A First Decision Tree (1)

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val rawData = sc.textFile("hdfs:///user/ds/covtype.data")

    val data = rawData.map { line =>
      val values = line.split(',').map(_.toDouble)
      // all columns but the last are features
      val featureVector = Vectors.dense(values.init)
      // for classification, labels must take values {0, 1, ..., numClasses-1};
      // Cover_Type runs 1 to 7, so shift it down by one
      val label = values.last - 1
      LabeledPoint(label, featureVector)
    }

    val Array(trainData, cvData, testData) =
      data.randomSplit(Array(0.8, 0.1, 0.1))
    trainData.cache(); cvData.cache(); testData.cache()
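    A quick sanity check (illustrative; the expected labels follow from Cover_Type's documented range of 1 to 7):

    // distinct labels should come back as 0.0 through 6.0
    data.map(_.label).distinct().collect().sorted.foreach(println)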

  • A First Decision Tree (2)

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.evaluation.MulticlassMetrics

    // numClasses = 7, no categorical features, gini impurity,
    // maxDepth = 4, maxBins = 100
    val model = DecisionTree.trainClassifier(
      trainData, 7, Map[Int,Int](), "gini", 4, 100)

    val predictionsAndLabels = cvData.map(example =>
      (model.predict(example.features), example.label)
    )

    val metrics = new MulticlassMetrics(predictionsAndLabels)

  • A First Decision Tree (3)

    println(model.toDebugString)
    DecisionTreeModel classifier of depth 4 with 31 nodes
      If (feature 0 ...

    metrics.precision
    = 0.6996101063190258

    metrics.confusionMatrix

  • Tuning Decision Trees (1)

    val evaluations: Array[((String, Int, Int), Double)] =
      for (impurity <- Array("gini", "entropy");
           depth    <- Array(1, 20);
           bins     <- Array(10, 300))
      yield {
        val model = DecisionTree.trainClassifier(
          trainData, 7, Map[Int,Int](), impurity, depth, bins)
        val predictionsAndLabels = cvData.map(example =>
          (model.predict(example.features), example.label)
        )
        val accuracy =
          new MulticlassMetrics(predictionsAndLabels).precision
        ((impurity, depth, bins), accuracy)
      }
  • Tuning Decision Trees (2)

    Sort by accuracy, descending, and print:

    evaluations.sortBy { case ((impurity, depth, bins), accuracy) =>
      accuracy
    }.reverse.foreach(println)

    ((entropy,20,300),0.9119046392195256)
    ((gini   ,20,300),0.9058758867075454)
    ((entropy,20,10 ),0.8968585218391989)
    ((gini   ,20,10 ),0.89050342659865)
    ((gini   ,1 ,10 ),0.6330018378248399)
    ((gini   ,1 ,300),0.6323319764346198)
    ((entropy,1 ,300),0.48406932206592124)
    ((entropy,1 ,10 ),0.48406932206592124)

  • Revising Categorical Features (1)

    With one 40-valued categorical feature, the decision tree can base a single decision on groups of categories, which may be more direct and optimal. Conversely, representing one 40-valued categorical feature as 40 numeric features increases memory usage and slows training down.

    val data = rawData.map { line =>
      val values = line.split(',').map(_.toDouble)
      // collapse the four 0/1 wilderness columns back into one categorical feature
      val wilderness = values.slice(10, 14).indexOf(1.0).toDouble
      // collapse the forty 0/1 soil columns back into one categorical feature
      val soil = values.slice(14, 54).indexOf(1.0).toDouble
      val featureVector =
        Vectors.dense(values.slice(0, 10) :+ wilderness :+ soil)
      val label = values.last - 1
      LabeledPoint(label, featureVector)
    }

    val Array(trainData, cvData, testData) =
      data.randomSplit(Array(0.8, 0.1, 0.1))

    How indexOf maps the 1-of-n columns to a category index (see the REPL example below):

    ..1000.. => 0
    ..0100.. => 1
    ..0010.. => 2
    ..0001.. => 3
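    The same mapping, illustrated in the REPL:

    scala> Array(0.0, 0.0, 1.0, 0.0).indexOf(1.0)
    res0: Int = 2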

  • Revising Categorical Features (2)

    The third argument to trainClassifier is a Map storing the arity of categorical features: an entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}. Here feature 10 (wilderness) has 4 categories and feature 11 (soil) has 40.

    val evaluations =
      for (impurity <- Array("gini", "entropy");
           depth    <- Array(10, 20, 30);
           bins     <- Array(40, 300))
      yield {
        val model = DecisionTree.trainClassifier(
          trainData, 7, Map(10 -> 4, 11 -> 40), impurity, depth, bins)
        val predictionsAndLabels = cvData.map(example =>
          (model.predict(example.features), example.label)
        )
        val accuracy =
          new MulticlassMetrics(predictionsAndLabels).precision
        ((impurity, depth, bins), accuracy)
      }

    evaluations.sortBy { case ((impurity, depth, bins), accuracy) =>
      accuracy
    }.reverse.foreach(println)

    ((entropy,30,300),0.9446513552658804)
    ((gini   ,30,300),0.9391509759293745)
    ((entropy,30,40 ),0.9389268225394855)
    ((gini   ,30,40 ),0.9355817642596042)

    vs. the tuned 1-of-n encoding DT: ((entropy,20,300),0.9119046392195256)

  • CV set vs. Test set

    If the purpose of the CV set was to evaluate the parameters fit to the training set, then the purpose of the test set is to evaluate the hyperparameters that were "fit" to the CV set. That is, the test set ensures an unbiased estimate of the accuracy of the final, chosen model and its hyperparameters.

    val model = DecisionTree.trainClassifier(
      trainData.union(cvData), 7, Map[Int,Int](), "entropy", 20, 300)
    val predictionsAndLabels = testData.map(example =>
      (model.predict(example.features), example.label)
    )
    val metrics = new MulticlassMetrics(predictionsAndLabels)

    metrics.precision = 0.9161946933031271

  • Random Decision Forests

    It would be great to have not one tree but many, each producing reasonable but different, independent estimates of the right target value. Their collective average prediction should fall closer to the true answer than any individual tree's does. It's the randomness in the building process that helps create this independence. This is the key to random decision forests.

    import org.apache.spark.mllib.tree.RandomForest

    // 20 trees; featureSubsetStrategy = "auto" lets the implementation
    // choose how many features to consider at each split
    val model = RandomForest.trainClassifier(
      trainData, 7, Map(10 -> 4, 11 -> 40), 20, "auto", "entropy", 30, 300)
    val predictionsAndLabels = cvData.map(example =>
      (model.predict(example.features), example.label)
    )
    val metrics = new MulticlassMetrics(predictionsAndLabels)

    metrics.precision = 0.9630068932322555

    vs. the categorical-features DT: ((entropy,30,300),0.9446513552658804)
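    As a rough sketch of the collective-prediction idea (a hypothetical helper; RandomForestModel already performs this voting internally for classification):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.tree.model.DecisionTreeModel

    // majority vote across a collection of trees for a single example
    def vote(trees: Seq[DecisionTreeModel], features: Vector): Double =
      trees.map(_.predict(features))  // each tree's predicted class
           .groupBy(identity)         // group identical predictions
           .maxBy(_._2.size)          // the most common prediction wins
           ._1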