Multiclassification with Decision Tree in Spark MLlib 1.3



  • References

Advanced Analytics with Spark (O'Reilly, 2015)
    Apache Spark MLlib API documentation
    UCI Machine Learning Repository

  • A simple Decision Tree

    Figure 4-1. Decision tree: Is it spoiled?

  • LabeledPoint

    The Spark MLlib abstraction for a feature vector is known as a LabeledPoint, which consists of a Spark MLlib Vector of features and a target value, here called the label.
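    A minimal sketch of constructing one (the numbers are illustrative, loosely taken from the first Covtype row shown below):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // label 0.0 paired with a three-element dense feature vector
    val point = LabeledPoint(0.0, Vectors.dense(2596.0, 51.0, 3.0))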

  • 1-of-n Encoding

    LabeledPoint can also be used with categorical features, given an appropriate encoding: an n-valued categorical feature becomes n binary 0/1 features, of which exactly one is 1.
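    For instance, a hypothetical 3-valued weather feature (sunny, cloudy, rainy) encoded 1-of-3:

    import org.apache.spark.mllib.linalg.Vectors

    // "cloudy" is the second of three categories, so only the second binary feature is set
    val weather = Vectors.dense(0.0, 1.0, 0.0)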

  • Covtype Dataset

    The dataset records the types of forest covering parcels of land in Colorado, USA.

    Name                                 Data Type     Measurement
    Elevation                            quantitative  meters
    Aspect                               quantitative  azimuth
    Slope                                quantitative  degrees
    Horizontal_Distance_To_Hydrology     quantitative  meters
    Vertical_Distance_To_Hydrology       quantitative  meters
    Horizontal_Distance_To_Roadways      quantitative  meters
    Hillshade_9am                        quantitative  0 to 255 index
    Hillshade_Noon                       quantitative  0 to 255 index
    Hillshade_3pm                        quantitative  0 to 255 index
    Horizontal_Distance_To_Fire_Points   quantitative  meters
    Wilderness_Area (4 binary columns)   qualitative   0 or 1
    Soil_Type (40 binary columns)        qualitative   0 or 1
    Cover_Type (7 types)                 integer       1 to 7

    Two sample rows (54 feature columns followed by the Cover_Type label):

    2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
    2590,56,2,212,-6,390,220,235,151,6225,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5

    https://archive.ics.uci.edu/ml/datasets/Covertype

  • A First Decision Tree (1)

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val rawData = sc.textFile("hdfs:///user/ds/covtype.data")

    val data = rawData.map { line =>
      val values = line.split(',').map(_.toDouble)
      // all columns but the last are features
      val featureVector = Vectors.dense(values.init)
      // for classification, labels must take values {0, 1, ..., numClasses-1};
      // Cover_Type runs 1 to 7, so shift it down by one
      val label = values.last - 1
      LabeledPoint(label, featureVector)
    }

    val Array(trainData, cvData, testData) =
      data.randomSplit(Array(0.8, 0.1, 0.1))
    trainData.cache(); cvData.cache(); testData.cache()
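    A quick sanity check (illustrative; the expected labels follow from Cover_Type's documented range of 1 to 7):

    // distinct labels should come back as 0.0 through 6.0
    data.map(_.label).distinct().collect().sorted.foreach(println)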

  • A First Decision Tree (2)

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.evaluation.MulticlassMetrics

    // numClasses = 7, no categorical features, gini impurity,
    // maxDepth = 4, maxBins = 100
    val model = DecisionTree.trainClassifier(
      trainData, 7, Map[Int,Int](), "gini", 4, 100)

    val predictionsAndLabels = cvData.map(example =>
      (model.predict(example.features), example.label)
    )

    val metrics = new MulticlassMetrics(predictionsAndLabels)

  • A First Decision Tree (3)

    println(model.toDebugString)
    DecisionTreeModel classifier of depth 4 with 31 nodes
      If (feature 0 ...

    metrics.precision
    = 0.6996101063190258

    metrics.confusionMatrix

  • Tuning Decision Trees (1)

    val evaluations: Array[((String, Int, Int), Double)] =
      for (impurity <- Array("gini", "entropy");
           depth    <- Array(1, 20);
           bins     <- Array(10, 300))
      yield {
        val model = DecisionTree.trainClassifier(
          trainData, 7, Map[Int,Int](), impurity, depth, bins)
        val predictionsAndLabels = cvData.map(example =>
          (model.predict(example.features), example.label)
        )
        val accuracy =
          new MulticlassMetrics(predictionsAndLabels).precision
        ((impurity, depth, bins), accuracy)
      }
  • Tuning Decision Trees (2)

    Sort by accuracy, descending, and print:

    evaluations.sortBy { case ((impurity, depth, bins), accuracy) =>
      accuracy
    }.reverse.foreach(println)

    ((entropy,20,300),0.9119046392195256)
    ((gini   ,20,300),0.9058758867075454)
    ((entropy,20,10 ),0.8968585218391989)
    ((gini   ,20,10 ),0.89050342659865)
    ((gini   ,1 ,10 ),0.6330018378248399)
    ((gini   ,1 ,300),0.6323319764346198)
    ((entropy,1 ,300),0.48406932206592124)
    ((entropy,1 ,10 ),0.48406932206592124)

  • Revising Categorical Features (1)

    With one 40-valued categorical feature, the decision tree can base a single decision on groups of categories, which may be more direct and optimal. Conversely, representing one 40-valued categorical feature as 40 numeric features increases memory usage and slows training down.

    val data = rawData.map { line =>
      val values = line.split(',').map(_.toDouble)
      // collapse the four 0/1 wilderness columns back into one categorical feature
      val wilderness = values.slice(10, 14).indexOf(1.0).toDouble
      // collapse the forty 0/1 soil columns back into one categorical feature
      val soil = values.slice(14, 54).indexOf(1.0).toDouble
      val featureVector =
        Vectors.dense(values.slice(0, 10) :+ wilderness :+ soil)
      val label = values.last - 1
      LabeledPoint(label, featureVector)
    }

    val Array(trainData, cvData, testData) =
      data.randomSplit(Array(0.8, 0.1, 0.1))

    How indexOf maps the 1-of-n columns to a category index (see the REPL example below):

    ..1000.. => 0
    ..0100.. => 1
    ..0010.. => 2
    ..0001.. => 3
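    The same mapping, illustrated in the REPL:

    scala> Array(0.0, 0.0, 1.0, 0.0).indexOf(1.0)
    res0: Int = 2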

  • Revising Categorical Features (2)

    The third argument to trainClassifier is a Map storing the arity of categorical features: an entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}. Here feature 10 (wilderness) has 4 categories and feature 11 (soil) has 40.

    val evaluations =
      for (impurity <- Array("gini", "entropy");
           depth    <- Array(10, 20, 30);
           bins     <- Array(40, 300))
      yield {
        val model = DecisionTree.trainClassifier(
          trainData, 7, Map(10 -> 4, 11 -> 40), impurity, depth, bins)
        val predictionsAndLabels = cvData.map(example =>
          (model.predict(example.features), example.label)
        )
        val accuracy =
          new MulticlassMetrics(predictionsAndLabels).precision
        ((impurity, depth, bins), accuracy)
      }

    evaluations.sortBy { case ((impurity, depth, bins), accuracy) =>
      accuracy
    }.reverse.foreach(println)

    ((entropy,30,300),0.9446513552658804)
    ((gini   ,30,300),0.9391509759293745)
    ((entropy,30,40 ),0.9389268225394855)
    ((gini   ,30,40 ),0.9355817642596042)

    vs. the tuned 1-of-n encoding DT: ((entropy,20,300),0.9119046392195256)

  • CV set vs. Test set

    If the purpose of the CV set was to evaluate the parameters fit to the training set, then the purpose of the test set is to evaluate the hyperparameters that were "fit" to the CV set. That is, the test set ensures an unbiased estimate of the accuracy of the final, chosen model and its hyperparameters.

    val model = DecisionTree.trainClassifier(
      trainData.union(cvData), 7, Map[Int,Int](), "entropy", 20, 300)
    val predictionsAndLabels = testData.map(example =>
      (model.predict(example.features), example.label)
    )
    val metrics = new MulticlassMetrics(predictionsAndLabels)

    metrics.precision = 0.9161946933031271

  • Random Decision Forests

    It would be great to have not one tree but many, each producing reasonable but different, independent estimates of the right target value. Their collective average prediction should fall closer to the true answer than any individual tree's does. It's the randomness in the building process that helps create this independence. This is the key to random decision forests.

    import org.apache.spark.mllib.tree.RandomForest

    // 20 trees; featureSubsetStrategy = "auto" lets the implementation
    // choose how many features to consider at each split
    val model = RandomForest.trainClassifier(
      trainData, 7, Map(10 -> 4, 11 -> 40), 20, "auto", "entropy", 30, 300)
    val predictionsAndLabels = cvData.map(example =>
      (model.predict(example.features), example.label)
    )
    val metrics = new MulticlassMetrics(predictionsAndLabels)

    metrics.precision = 0.9630068932322555

    vs. the categorical-features DT: ((entropy,30,300),0.9446513552658804)
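    As a rough sketch of the collective-prediction idea (a hypothetical helper; RandomForestModel already performs this voting internally for classification):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.tree.model.DecisionTreeModel

    // majority vote across a collection of trees for a single example
    def vote(trees: Seq[DecisionTreeModel], features: Vector): Double =
      trees.map(_.predict(features))  // each tree's predicted class
           .groupBy(identity)         // group identical predictions
           .maxBy(_._2.size)          // the most common prediction wins
           ._1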