Global Tree Optimization: A Non-greedy Decision Tree Algorithm

Kristin P. Bennett
Email: [email protected]
Department of Mathematical Sciences
Rensselaer Polytechnic Institute
Troy, NY 12180

This material is based on research supported by National Science Foundation Grant 949427. This paper appeared in Computing Science and Statistics, 26, pp. 156-160, 1994.

Abstract

A non-greedy approach for constructing globally optimal multivariate decision trees with fixed structure is proposed. Previous greedy tree construction algorithms are locally optimal in that they optimize some splitting criterion at each decision node, typically one node at a time. In contrast, global tree optimization explicitly considers all decisions in the tree concurrently. An iterative linear programming algorithm is used to minimize the classification error of the entire tree. Global tree optimization can be used both to construct decision trees initially and to update existing decision trees. Encouraging computational experience is reported.

1 Introduction

Global Tree Optimization (GTO) is a new approach for constructing decision trees that classify two or more sets of n-dimensional points. The essential difference between this work and prior decision tree algorithms (e.g. CART [5] and ID3 [10]) is that GTO is non-greedy. For greedy algorithms, the "best" decision at each node is found by optimizing some splitting criterion. This process is started at the root and repeated recursively until all or almost all of the points are correctly classified. When the sets to be classified are disjoint, almost any greedy decision tree algorithm can construct a tree consistent with all the points, given a sufficient number of decision nodes. However, these trees may not generalize well (i.e., correctly classify future not-previously-seen points) due to over-fitting or over-parameterizing the problem. In practice decision nodes are pruned from the tree. Typically, the pruning process does not allow the remaining decision nodes to be adjusted, so the tree may still be over-parameterized.

The strength of the greedy algorithm is that by growing the tree and pruning it, the greedy algorithm determines the structure of the tree, the class at each of the leaves, and the decision at each non-leaf node. The limitations of greedy approaches are that locally "good" decisions may result in a bad overall tree, and that existing trees are difficult to update and modify. GTO overcomes these limitations by treating the decision tree as a function and optimizing the classification error of the entire tree. The function is similar to the one proposed for MARS [8]; however, MARS is still a greedy algorithm. Greedy algorithms optimize one node at a time and then fix the resulting decisions. GTO starts from an existing tree. The structure of the starting tree (i.e. the number of decisions, the depth of the tree, and the classification of the leaves) determines the classification error function. GTO minimizes the classification error by changing all the decisions concurrently while keeping the underlying structure of the tree fixed. The advantages of this approach over greedy methods are that fixing the structure helps prevent overfitting or overparameterizing the problem, locally bad but globally good decisions can be made, existing trees can be re-optimized with additional data, and domain knowledge can be more readily applied. Since GTO requires the structure of the tree as input, it complements (not replaces) existing greedy decision tree methods. By complementing greedy algorithms, GTO offers the promise of making decision trees a more powerful, flexible, accurate, and widely accepted paradigm.

Minimizing the global error of a decision tree with fixed structure is a non-convex optimization problem. The problem of constructing a decision tree with a fixed number of decisions to correctly classify two or more sets is a special case of the NP-complete polyhedral separability problem [9]. Consider this seemingly simple but NP-complete problem [9]: Can a tree with just two decision nodes correctly classify two disjoint point sets? In [4],

this problem was formulated as a bilinear program. We now extend this work to general decision trees, resulting in a multilinear program that can be solved using the Frank-Wolfe algorithm proposed for the bilinear case.

This paper is organized as follows. We begin with a brief review of the well-known case of optimizing a tree consisting of a single decision. The tree is represented as a system of linear inequalities, and the system is solved using linear programming. In Section 3 we show how more general decision trees can be expressed as a system of disjunctive linear inequalities and formulated as a multilinear programming problem. Section 4 explains the iterative linear programming algorithm for optimizing the resulting problem. Computational results and conclusions are given in Section 5.

GTO applies to binary trees with a multivariate decision at each node of the following form: if x is a point being classified, then at decision node d, if x w_d > γ_d the point follows the right branch, and if x w_d ≤ γ_d the point follows the left branch. The choice of which branch the point follows at equality is arbitrary. This type of decision has been used in greedy algorithms [6, 1]. The univariate decisions found by CART [5] for continuous variables can be considered special cases of this type of decision with only one nonzero component of w. A point is classified by following the path of the point through the tree until it reaches a leaf node. A point is strictly classified by the tree if it reaches a leaf of the correct class and equality does not hold at any decision along the path to the leaf (i.e. x w_d ≠ γ_d for any decision d in the path). Although GTO is applicable to problems with many classes, for simplicity we limit discussion to the problem of classifying the two sets A and B. A sample of such a tree is given in Figure 1. Let A consist of m points contained in R^n and B consist of k points contained in R^n. Let A_j denote the jth point in A.

[Figure 1: A typical two-class decision tree. Four decision nodes of the form x w_d > γ_d lead to five leaves, each labelled Class A or Class B.]
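To make the decision rule concrete, the following minimal sketch (illustrative only; the class name Node and the toy tree are assumptions of this sketch, not part of the paper) classifies a point by following its path through a fixed-structure tree, going right at decision d when x w_d > γ_d and left otherwise:

```python
import numpy as np

class Node:
    """Decision node (w, gamma, children) or leaf (label only)."""
    def __init__(self, w=None, gamma=None, left=None, right=None, label=None):
        self.w, self.gamma = w, gamma        # decision: x . w > gamma ?
        self.left, self.right = left, right  # subtrees
        self.label = label                   # class label at a leaf

def classify(node, x):
    """Follow the path of point x through the tree until a leaf is reached."""
    while node.label is None:
        node = node.right if x @ node.w > node.gamma else node.left
    return node.label

# A two-decision example in R^2 (made-up numbers, not from the paper):
leaf_A, leaf_B = Node(label="A"), Node(label="B")
root = Node(w=np.array([1.0, 0.0]), gamma=0.5,
            left=Node(w=np.array([0.0, 1.0]), gamma=0.5,
                      left=leaf_A, right=leaf_B),
            right=leaf_B)
print(classify(root, np.array([0.2, 0.8])))   # -> "B"
```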

[Figure 2: Geometric depiction of decision trees. (a) Tree found by the greedy LP algorithm; (b) tree found by GTO.]

2 Optimizing a Single Decision

Many methods exist for minimizing the error of a tree consisting of a single decision node. We briefly review one approach which formulates the problem as a set of linear inequalities and then uses linear programming to minimize the errors in the inequalities [3]. The reader is referred to [3] for full details of the practical and theoretical benefits of this approach.

Let x w = γ be the plane formed by the decision. For any point x, if x w < γ then the point is classified in class A, and if x w > γ then the point is classified in class B. If x w = γ the class can be chosen arbitrarily. All the points in A and B are strictly classified if there exist w and γ

such that

    A_j w - γ < 0,  j = 1,...,m
    B_i w - γ > 0,  i = 1,...,k                                   (1)

or equivalently

    A_j w - γ + 1 ≤ 0,  j = 1,...,m
    B_i w - γ - 1 ≥ 0,  i = 1,...,k                               (2)

Note that Equations (1) and (2) are alternative definitions of linear separability. The choice of the constant 1 is arbitrary; any positive constant may be used.

If A and B are linearly separable then Equation (2) is feasible, and the linear program (LP) (3) will have a zero minimum. The resulting (w, γ) forms a decision that strictly separates A and B. If Equation (2) is not feasible, then LP (3) minimizes the average misclassification error within each class.

    min_{w,γ,y,z}  (1/m) Σ_{j=1}^m y_j + (1/k) Σ_{i=1}^k z_i
    s.t.  y_j ≥ A_j w - γ + 1,   y_j ≥ 0,  j = 1,...,m
          z_i ≥ -B_i w + γ + 1,  z_i ≥ 0,  i = 1,...,k            (3)

LP (3) has been used recursively in a greedy decision tree algorithm called Multisurface Method-Tree (MSMT) [1]. While it compares favorably with other greedy decision tree algorithms, it also suffers the problem of all greedy approaches. Locally good but globally poor decisions near the root of the tree can result in overly large trees with poor generalization. Figure 2 shows an example of a case where this phenomenon occurs. Figure 2a depicts the 11 planes used by MSMT to completely classify all the points. The decisions chosen near the root of the tree are largely redundant. As a result, the decisions near the leaves of the tree are based on an unnecessarily small number of points. MSMT constructed an excessively large tree that does not reflect the underlying structure of the problem. In contrast, GTO was able to completely classify all the points using only three decisions (Figure 2b).
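As a concrete illustration, LP (3) can be assembled and solved with an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog; the solver choice, the function name single_decision_lp, and the toy data are assumptions of this sketch, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def single_decision_lp(A, B):
    """Solve LP (3): find (w, gamma) minimizing the average misclassification
    error of the single decision x.w > gamma separating point sets A and B."""
    m, n = A.shape
    k, _ = B.shape
    # Variable order: w (n), gamma (1), y (m), z (k).
    c = np.concatenate([np.zeros(n + 1), np.full(m, 1.0 / m), np.full(k, 1.0 / k)])
    # y_j >= A_j w - gamma + 1   <=>   A_j w - gamma - y_j <= -1
    # z_i >= -B_i w + gamma + 1  <=>  -B_i w + gamma - z_i <= -1
    rows_A = np.hstack([A, -np.ones((m, 1)), -np.eye(m), np.zeros((m, k))])
    rows_B = np.hstack([-B, np.ones((k, 1)), np.zeros((k, m)), -np.eye(k)])
    A_ub = np.vstack([rows_A, rows_B])
    b_ub = -np.ones(m + k)
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + k)  # w, gamma free; y, z >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w, gamma = res.x[:n], res.x[n]
    return w, gamma, res.fun

# Tiny made-up example: two linearly separable clusters in the plane.
A = np.array([[0.0, 0.0], [0.2, 0.1]])
B = np.array([[1.0, 1.0], [0.9, 1.2]])
w, gamma, err = single_decision_lp(A, B)
print(w, gamma, err)   # err is (near) zero for linearly separable sets
```

As discussed for Equations (2) and (3), the returned objective value is zero exactly when A and B are linearly separable; otherwise it is the average misclassification error within each class.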

3 Problem Formulation

For general decision trees, the tree can be represented as a set of disjunctive inequalities. A multilinear program is used to minimize the error of the disjunctive linear inequalities. We now consider the problem of optimizing a tree with the structure given in Figure 1, and then briefly consider the problem for more general trees.

Recall that a point is strictly classified by the tree in Figure 1 if the point reaches a leaf of the correct classification and equality does not hold for any of the decisions along the path to the leaf. A point A_j ∈ A is strictly classified if it follows the path through the tree to the first or fourth leaf node, i.e. if

    A_j w_1 - γ_1 + 1 ≤ 0
    A_j w_2 - γ_2 + 1 ≤ 0
  or
    -A_j w_1 + γ_1 + 1 ≤ 0
    A_j w_3 - γ_3 + 1 ≤ 0
    -A_j w_4 + γ_4 + 1 ≤ 0                                        (4)

or equivalently

    (A_j w_1 - γ_1 + 1)_+ + (A_j w_2 - γ_2 + 1)_+ = 0
  or
    (-A_j w_1 + γ_1 + 1)_+ + (A_j w_3 - γ_3 + 1)_+ + (-A_j w_4 + γ_4 + 1)_+ = 0      (5)

where (·)_+ := max{·, 0}.

Similarly, a point B_i ∈ B is strictly classified if it follows the path through the tree to the second, third, or fifth leaf node, i.e. if

    B_i w_1 - γ_1 + 1 ≤ 0
    -B_i w_2 + γ_2 + 1 ≤ 0
  or
    -B_i w_1 + γ_1 + 1 ≤ 0
    B_i w_3 - γ_3 + 1 ≤ 0
    B_i w_4 - γ_4 + 1 ≤ 0
  or
    -B_i w_1 + γ_1 + 1 ≤ 0
    -B_i w_3 + γ_3 + 1 ≤ 0                                        (6)

or equivalently

    (B_i w_1 - γ_1 + 1)_+ + (-B_i w_2 + γ_2 + 1)_+ = 0
  or
    (-B_i w_1 + γ_1 + 1)_+ + (B_i w_3 - γ_3 + 1)_+ + (B_i w_4 - γ_4 + 1)_+ = 0
  or
    (-B_i w_1 + γ_1 + 1)_+ + (-B_i w_3 + γ_3 + 1)_+ = 0           (7)

A decision tree exists that strictly classifies all the points in sets A and B if and only if the following equation has a feasible solution:

    Σ_{j=1}^m (y_j^1 + y_j^2)(z_j^1 + y_j^3 + z_j^4)
      + Σ_{i=1}^k (u_i^1 + v_i^2)(v_i^1 + u_i^3 + u_i^4)(v_i^1 + v_i^3) = 0

    where  y_j^d = (A_j w_d - γ_d + 1)_+,   z_j^d = (-A_j w_d + γ_d + 1)_+,  j = 1,...,m
           u_i^d = (B_i w_d - γ_d + 1)_+,   v_i^d = (-B_i w_d + γ_d + 1)_+,  i = 1,...,k
           for d = 1,...,D, and D = number of decisions in the tree.                 (8)
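As a sketch of how the error measure in (5), (7) and (8) can be evaluated for the tree of Figure 1 (my own code; the function names and the use of the 1/m, 1/k scaling of program (9) below are assumptions), the plus-function errors of every point at every decision can be computed with a few matrix operations, and each point contributes the product, over the leaves of its class, of the summed errors along the path to that leaf:

```python
import numpy as np

def plus(t):
    """Plus function (t)_+ = max(t, 0), applied elementwise."""
    return np.maximum(t, 0.0)

def tree_error(A, B, W, gamma):
    """Error of the four-decision tree of Figure 1, scaled by 1/m and 1/k as in (9).
    W is a 4 x n array of decision weights, gamma a length-4 vector of thresholds;
    decision d sends a point x right if x.W[d] > gamma[d], left otherwise."""
    # Side errors of each point at each decision d:
    #   y/u: error of being on the LEFT  of decision d,  ( x.w_d - gamma_d + 1)_+
    #   z/v: error of being on the RIGHT of decision d,  (-x.w_d + gamma_d + 1)_+
    y = plus(A @ W.T - gamma + 1.0)     # shape (m, 4)
    z = plus(-A @ W.T + gamma + 1.0)
    u = plus(B @ W.T - gamma + 1.0)     # shape (k, 4)
    v = plus(-B @ W.T + gamma + 1.0)
    # Class-A leaves: leaf 1 (left of d1, left of d2) and leaf 4 (right of d1,
    # left of d3, right of d4), as in (5).
    err_A = (y[:, 0] + y[:, 1]) * (z[:, 0] + y[:, 2] + z[:, 3])
    # Class-B leaves: the three paths of (7).
    err_B = (u[:, 0] + v[:, 1]) * (v[:, 0] + u[:, 2] + u[:, 3]) * (v[:, 0] + v[:, 2])
    return err_A.mean() + err_B.mean()
```

Since every term is nonnegative, tree_error returning zero certifies that every training point is strictly classified, which is exactly the feasibility condition (8).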

Furthermore, (w_d, γ_d), d = 1,...,D, satisfying (8) form the decisions of a tree that strictly classifies all the points in the sets A and B.

Equivalently, there exists a decision tree with the given structure that correctly classifies the points in sets A and B if and only if the following multilinear program has a zero minimum:

    min_{w,γ,y,z,u,v}  (1/m) Σ_{j=1}^m (y_j^1 + y_j^2)(z_j^1 + y_j^3 + z_j^4)
                     + (1/k) Σ_{i=1}^k (u_i^1 + v_i^2)(v_i^1 + u_i^3 + u_i^4)(v_i^1 + v_i^3)
    s.t.  y_j^d ≥ A_j w_d - γ_d + 1,   z_j^d ≥ -A_j w_d + γ_d + 1,   j = 1,...,m
          u_i^d ≥ B_i w_d - γ_d + 1,   v_i^d ≥ -B_i w_d + γ_d + 1,   i = 1,...,k
          for d = 1,...,D
          y, z, u, v ≥ 0                                              (9)

The coefficients 1/m and 1/k were chosen so that (9) is identical to the LP (3) for the single decision case, thus guaranteeing that w = 0 is never the unique solution for that case [3]. These coefficients also help to make the method more numerically stable for large training set sizes.

This general approach is applicable to any multivariate binary decision tree used to classify two or more sets. There is an error term for each point in the training set. The error for that point is the product of the errors at each of the leaves of that point's class. The error at each leaf is the sum of the errors in the decisions along the path to that leaf. If a point is correctly classified at one leaf, the error along the path will be zero, and the product of the leaf errors will be zero. Space does not permit discussion of the general formulation in this paper, so we refer the reader to [2] for more details.

4 Multilinear Programming

The multilinear program (9) and its more general formulation can be optimized using the iterative linear programming Frank-Wolfe type method proposed in [4]. We outline the method here, and refer the reader to [2] for the mathematical properties of the algorithm.

Consider the problem min_x f(x) subject to x ∈ X, where f : R^n → R, X is a polyhedral set in R^n containing the constraint x ≥ 0, f has continuous first partial derivatives, and f is bounded below. The Frank-Wolfe algorithm for this problem is the following:

Algorithm 4.1 (Frank-Wolfe algorithm [7, 4])
Start with any x^0 ∈ X. Compute x^{i+1} from x^i as follows.

(i) v^i ∈ arg vertex min_{x ∈ X} ∇f(x^i) x

(ii) Stop if ∇f(x^i) v^i = ∇f(x^i) x^i

(iii) x^{i+1} = (1 - λ_i) x^i + λ_i v^i, where λ_i ∈ arg min_{0 ≤ λ ≤ 1} f((1 - λ) x^i + λ v^i)

In the above algorithm, "arg vertex min" denotes a vertex solution set of the indicated linear program. The algorithm either terminates at some x^j that satisfies the minimum principle necessary optimality condition, ∇f(x^j)(x - x^j) ≥ 0 for all x ∈ X, or each accumulation point x̄ of the sequence {x^i} satisfies the minimum principle [4].

The gradient calculation for the GTO function is straightforward. For example, when Algorithm 4.1 is applied to Problem (9), the following linear subproblem is solved in step (i), where bars denote the values of the variables fixed at the current iterate (w̄, γ̄, ȳ, z̄, ū, v̄) = x^i:

    min_{w,γ,y,z,u,v}
      (1/m) Σ_{j=1}^m [ (y_j^1 + y_j^2)(z̄_j^1 + ȳ_j^3 + z̄_j^4)
                      + (ȳ_j^1 + ȳ_j^2)(z_j^1 + y_j^3 + z_j^4) ]
    + (1/k) Σ_{i=1}^k [ (u_i^1 + v_i^2)(v̄_i^1 + ū_i^3 + ū_i^4)(v̄_i^1 + v̄_i^3)
                      + (ū_i^1 + v̄_i^2)(v_i^1 + u_i^3 + u_i^4)(v̄_i^1 + v̄_i^3)
                      + (ū_i^1 + v̄_i^2)(v̄_i^1 + ū_i^3 + ū_i^4)(v_i^1 + v_i^3) ]
    s.t.  y_j^d ≥ A_j w_d - γ_d + 1,   z_j^d ≥ -A_j w_d + γ_d + 1,   j = 1,...,m
          u_i^d ≥ B_i w_d - γ_d + 1,   v_i^d ≥ -B_i w_d + γ_d + 1,   i = 1,...,k
          for d = 1,...,D
          y, z, u, v ≥ 0

Each term of the objective is the original product with all but one factor held fixed at its current value, i.e. the linearization of the objective of (9) at x^i.
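A compact sketch of Algorithm 4.1 itself (my own rendering, not the paper's implementation): it assumes the feasible polyhedron is given as {x ≥ 0 : A_ub x ≤ b_ub} and is bounded, so that each linear subproblem has a vertex solution, and it uses scipy both for the LP in step (i) and for the line search in step (iii).

```python
import numpy as np
from scipy.optimize import linprog, minimize_scalar

def frank_wolfe(f, grad, A_ub, b_ub, x0, max_iter=50, tol=1e-8):
    """Frank-Wolfe iteration on X = {x >= 0 : A_ub x <= b_ub} (assumed bounded).
    Each pass solves the LP  min_{x in X} grad(x_i).x  and line-searches toward
    the vertex solution."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        # (i) vertex solution of the linearized problem
        res = linprog(g, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
        v = res.x
        # (ii) stop if the linearization cannot be decreased further
        if g @ v >= g @ x - tol:
            break
        # (iii) line search on the segment between x and v
        ls = minimize_scalar(lambda lam: f((1 - lam) * x + lam * v),
                             bounds=(0.0, 1.0), method="bounded")
        x = (1 - ls.x) * x + ls.x * v
    return x
```

For GTO, f and grad would be the multilinear objective of (9) and its gradient, and the LP in step (i) specializes to the linearized subproblem displayed above.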

5 Results and Conclusions

GTO was implemented for general decision trees with fixed structure. In order to test the effectiveness of the optimization algorithm, random problems with known solutions were generated. For a given dimension, a tree with 3 to 7 decision nodes was randomly generated to classify points in the unit cube. Points in the unit cube were randomly generated, classified, and grouped into a training set (500 to 1000 points) and a testing set (5000 points). MSMT, the greedy algorithm discussed in Section 2, was used to generate a greedy tree that correctly classified the training set. The MSMT tree was then pruned to the known structure (i.e. the number of decision nodes) of the tree. The pruned tree was used as a starting point for GTO. The training and testing set error of the MSMT tree, the pruned tree (denoted MSMT-P), and the GTO tree were measured, as was the training time. This experiment was repeated for trees ranging from 3 to 7 nodes in 2 to 25 dimensions. The results were averaged over 10 trials.

We summarize the test results and refer the reader to [2] for more details. Figure 3 presents the average results for randomly generated trees with three decision nodes. These results are typical of those observed in the other experiments. MSMT achieved 100% correctness on the training set but used an excessive number of decisions. The training and testing set accuracy of the pruned trees dropped considerably. The trees optimized by GTO were significantly better in terms of testing set accuracy than both the unpruned and the pruned MSMT trees.

[Figure 3: Average results over 10 trials for randomly generated decision trees with 3 decision nodes. Three panels compare MSMT, MSMT-P, and GTO at dimensions 5, 10, and 25: training set percent errors (500 points), testing set percent errors (5000 points), and training time in CPU seconds.]

The computational results are promising. The Frank-Wolfe algorithm converges in relatively few iterations to an improved solution. However, GTO did not always find the global minimum. We expect the problem to have many local minima since it is NP-complete. We plan to investigate using global optimization techniques to avoid local minima. The overall execution time of GTO tends to grow as the problem size increases. Parallel computation can be used to improve the execution time of the expensive LP subproblems. The LP subproblems (e.g. the linearization of Problem (9)) have a block-separable structure and can be divided into independent LPs solvable in parallel.

We have introduced a non-greedy approach for optimizing decision trees. The GTO algorithm starts with an existing decision tree, fixes the structure of the tree, formulates the error of the tree, and then optimizes that error. An iterative linear programming algorithm performs well on this NP-complete problem. GTO optimizes all the decisions in the tree, and thus has many potential applications, such as decreasing the greediness of constructive algorithms, reoptimizing existing trees when additional data is available, pruning greedy decision trees, and incorporating domain knowledge into the decision tree.

References

[1] K. P. Bennett. Decision tree construction via linear programming. In M. Evans, editor, Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society Conference, pages 97-101, Utica, Illinois, 1992.

[2] K. P. Bennett. Optimal decision trees through multilinear programming. R.P.I. Math Report No. 214, Rensselaer Polytechnic Institute, Troy, NY, 1994.

[3] K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23-34, 1992.

[4] K. P. Bennett and O. L. Mangasarian. Bilinear separation of two sets in n-space. Computational Optimization and Applications, 2:207-227, 1993.

[5] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International, California, 1984.

[6] C. E. Brodley and P. E. Utgoff. Multivariate decision trees. COINS Technical Report 92-83, University of Massachusetts, Amherst, Massachusetts, 1992. To appear in Machine Learning.

[7] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95-110, 1956.

[8] J. H. Friedman. Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19:1-141, 1991.

[9] N. Megiddo. On the complexity of polyhedral separability. Discrete and Computational Geometry, 3:325-337, 1988.

[10] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.