Decision Trees for Functional Variables
Suhrid Balakrishnan
Department of Computer Science
Rutgers University, Piscataway, NJ 08854, USA

David Madigan
Department of Statistics
Rutgers University, Piscataway, NJ 08854, USA
Abstract
Classification problems with functionally structured input variables arise naturally in many applications. In a clinical domain, for example, input variables could include a time series of blood pressure measurements. In a financial setting, different time series of stock returns might serve as predictors. In an archeological application, the 2-D profile of an artifact may serve as a key input variable. In such domains, accuracy of the classifier is not the only reasonable goal to strive for; classifiers that provide easily interpretable results are also of value. In this work, we present an intuitive scheme for extending decision trees to handle functional input variables. Our results show that such decision trees are both accurate and readily interpretable.
1 INTRODUCTION
We present an extension to standard decision trees (for example, CART, Breiman et al. 1984, or C4.5, Quinlan 1993) that enables them to be applied to classification problems with functional data as inputs. In so doing, we aim to leverage the interpretability of decision trees as well as their other important benefits, like reasonable classification performance and efficient associated learning procedures.
The application that motivated this work concerns vaccine efficacy. The study in question followed thirty vaccinated non-human primates (NHPs) for a year and then "challenged" the animals. Of the 30 NHPs, 21 survived the challenge and 9 died. Repeated measurements during the year assessed over a dozen aspects of the putative immune response. These measurements include an immunoglobulin G enzyme-linked immunosorbent assay (IgG), various interleukin measures (IL2, IL4, IL6), and a so-called "stimulation index" (SI), to name a few, with the number of measurements varying somewhat from animal to animal. The goal of the study is to understand the predictive value of the various assays with respect to survival.
[Figure: two panels, IgG (top) and IL6 (bottom), plotting assay value against week (0-50).]
Figure 1: IgG and IL6 measurements for all 30 NHPs. Green (solid) represents animals that survived; red (dashed) represents animals that died. The thick curves represent sample means.
Our initial approach to this problem used a logistic regression model (Genkin et al., 2004) and treated each assay-timepoint combination as a separate input variable. While this provided good predictive performance, it selected a biologically meaningless set of predictors such as IgG at week 46, SI at week 38, and IL6 at week 12. The study immunologists instead sought insights such as "IgG trajectories that rise more rapidly after the 4-week booster shot and fall more slowly after the 26-week booster lead to higher survival probability." In other words, the underlying biology suggests that the shape of the assay curves should be predictive of survival rather than measurements at specific timepoints. Figure 1, for example, shows IgG and IL6 trajectories. For IgG, comparison of the thick green curve with the thick red curve shows that higher values are beneficial at the beginning, then lower values, and then higher values again towards the end. For IL6, it appears higher values of the curve in general are beneficial. We shall return to the motivating example in later sections.
More generally, we consider the following multi-class classification problem: the training dataset comprises n labeled training examples, D = {(x_i, y_i)}_{i=1}^n, where x_i = [x_{i1}, ..., x_{id}] is a list of d features and y_i ∈ {1, ..., c}, the c labels. We account for functionally structured input data by allowing the elements of x_i to be vectors/lists themselves, each representing an instance of a function of some independent variable. For example, a time series classification problem with only one time-series (or functional) variable of fixed length T, say, as input, would be represented as x_i = x_{i1} in our setup, with the single feature x_{i1} = [x_{i1}^{(1)}, x_{i1}^{(2)}, ..., x_{i1}^{(T)}]. Here time would be the independent variable.
We allow the inputs to be multivariate, meaning there may be more than one functional variable (i.e., more than one vector/list element of x_i), and also allow for standard (non-functional) discrete/continuous/categorical variables (in this case the relevant components of x_i will be the corresponding scalars/nominal values). Thus, standard decision trees can be considered a special case of the above formulation where all inputs are restricted to length 1.
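To make the representation concrete, here is a minimal, hypothetical encoding of a single training example in this setup, with one functional feature and two standard features; the variable names and values are purely illustrative, not from any particular implementation.

```python
# One training example (x_i, y_i) with d = 3 features:
# a functional variable (a length-5 time series), a continuous
# variable, and a categorical variable, plus a class label.
x_i = [
    [0.2, 0.9, 1.4, 1.1, 0.7],  # functional feature x_i1 (time series)
    37.5,                        # standard continuous feature x_i2
    "male",                      # standard categorical feature x_i3
]
y_i = 1

# A standard decision-tree dataset is the special case where every
# functional component has length 1, i.e., every feature is a scalar.
d = len(x_i)
print(d)  # → 3
```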
2 PREVIOUS STUDIES
Past approaches to this problem that apply standard machine learning algorithms have typically relied on some sort of ad hoc, domain-specific preprocessing to extract predictive features. A few previous studies look for interesting "events" in the training instances of the functional variables and then construct auxiliary variables based on them. These auxiliary variables are then used by a particular classifier (decision trees, regression trees and 1-nearest neighbor, Geurts 2001, which retains interpretability), by generic classifiers (Kadous and Sammut, 2005) (with interpretable results available if a rule learning classifier is applied), or in literals as base classifiers combined via boosting (Gonzalez and Diez, 2004). Another approach is the scheme of Kudo et al. (1999), which constrains the functional variables to pass through certain regions in (discretized) space and disallows other regions. There are also techniques that create features via specialized functional PCA techniques, designed to deal with large data applications (EEG data) where the functional variables are lengthy and numerous. Finally, there are support vector machine based approaches that are typically applied by defining an appropriate kernel function for the problem domain (Shimodaira et al., 2001).
3 CANDIDATE SPLITS FOR FUNCTIONAL VARIABLES
In order for a decision tree to be able to process functional variables, we first need to define candidate splits for such variables. In other words, we need to define a procedure that results in a partition of the space of possible function instances (in a manner similar to partitions for discrete/continuous/categorical variables). We describe the idea using the time series example. Consider a binary classification problem, i.e., y_i ∈ {1, 2}, and a functional input variable, a time series of fixed length T, x_{i1} = [x_{i1}^{(1)}, x_{i1}^{(2)}, ..., x_{i1}^{(T)}].
While many splitting rules can be imagined, we propose the following: consider two representative curves, x_r and x_l, where x_r = [x_r^{(1)}, x_r^{(2)}, ..., x_r^{(T)}] (and x_l is similarly defined). The (binary) split is defined by the set of function instances that are closer (in terms of some distance) to one representative curve than the other.
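The split rule itself is straightforward to state in code. The sketch below assumes Euclidean distance and fixed-length instances; the function names are ours, for illustration only.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length function instances."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def split_branch(x, x_l, x_r, dist=euclid):
    """Binary functional split: route instance x down the branch of
    whichever representative curve is closer."""
    return "l" if dist(x, x_l) <= dist(x, x_r) else "r"

# Toy representatives: a flat curve and a rising curve.
x_l = [0.0, 0.0, 0.0, 0.0]
x_r = [0.0, 1.0, 2.0, 3.0]
print(split_branch([0.1, 0.2, 0.1, 0.0], x_l, x_r))  # → l
```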
Note that this definition allows for the construction of very flexible partitions. Multi-way splits are an immediate extension and are quite simply defined by considering more than two representative curves. The distance function used can be application specific. In our experiments, we primarily focus on binary splits and two kinds of distance measures: Euclidean distance and dynamic time warping (DTW) distance. The use of DTW distance further exemplifies how flexible this kind of split is, enabling function instances of different lengths to be compared.
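For reference, here is a minimal implementation of the classic DTW dynamic-programming recurrence (unconstrained, with absolute-difference local cost). The paper does not specify its exact DTW variant, so treat this as one reasonable choice.

```python
def dtw(a, b):
    """Classic dynamic-programming DTW distance; unlike plain Euclidean
    distance, it accepts sequences of different lengths."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible warping moves
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# A time-shifted copy of a curve is close under DTW even though the
# two sequences have different lengths.
print(dtw([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]))  # → 0.0
```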
Classification with such splits in trees proceeds exactly as with standard decision trees, and the input test function instance follows the branch corresponding to the closest representative curve (with leaf nodes, as usual, holding predicted class labels).
Note that for a given distance measure, different choices for x_r and x_l lead to different candidate splits. Each candidate split corresponds to a partition of "function space" into two regions, functions within one region being more similar to each other than to functions in the other region. Our proposed approach rests on two basic assumptions. First, the partition should be interpretable. That is, the choice of x_r and x_l and the particular distance measure should lead to sets of functions that correspond to recognizably homogeneous sets of curves. Second, the true classification rule needs to be smooth with respect to the chosen distance measure. That is, functional inputs that are close together according to the distance measure should generally belong to the same output class. What we attempt to show in later sections is that straightforward choices for x_r, x_l, and the distance measure lead to functional-split-enabled decision trees with good classification accuracy and tree structures that provide valuable insights.
3.1 FINDING REPRESENTATIVE CURVES
Having provided the intuition for defining the candidate test on the basis of proximity of the function instances to representative curves x_r and x_l (quantified in some manner), we now focus on how these curves can be automatically obtained from the data.
Note that the regularity assumption we make is equivalent to assuming that the instances (or curves) cluster in some manner in the input domain. Consequently, a simple idea is to perform a functional variable-specific clustering, and then use cluster representatives in the candidate tests. The choice of different clustering procedures may lead to different decision trees, and in general this choice will require some application-specific consideration.
In our experiments we used two clustering procedures: standard k-means clustering (k = 2 for binary partitions) with the Euclidean distance function between two instances of the same length, and a clustering procedure using DTW distances between instances. With DTW distance, the mean of the curves is not a particularly well-motivated representative of a cluster (instances look more like warped/time-shifted versions of each other than the mean). We instead use an instance as representative of the cluster. In particular, we perform the following EM-like iterations to find the representatives: a) Set the cluster representative to be the instance which is closest (has smallest combined DTW distance) to all the other instances in the cluster. b) Reassign instances to clusters based on their distance to the representatives.¹ We will refer to these procedures as "Euclidean clustering" and "DTW clustering" in the remainder of the paper.
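The EM-like iterations above amount to a DTW medoid clustering. The sketch below is our illustrative reading of steps (a) and (b) — the function names, random initialization, and convergence test are assumptions, not the authors' code — with a plain dynamic-programming DTW as the distance.

```python
import random

def dtw(a, b):
    """Classic DTW distance between two (possibly unequal-length) curves."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def dtw_medoid_cluster(curves, k=2, iters=10, seed=0):
    """EM-like medoid clustering under DTW: (b) assign each curve to its
    nearest representative instance, then (a) re-pick each cluster's
    representative as the member with smallest summed DTW distance to
    the other members; repeat until the representatives stabilize."""
    rng = random.Random(seed)
    rep_idx = rng.sample(range(len(curves)), k)
    for _ in range(iters):
        # step (b): assign every curve to its closest representative
        assign = [min(range(k), key=lambda c: dtw(x, curves[rep_idx[c]]))
                  for x in curves]
        # step (a): re-pick each cluster's representative as its medoid
        new_idx = []
        for c in range(k):
            members = [i for i, a in enumerate(assign) if a == c] or [rep_idx[c]]
            new_idx.append(min(members, key=lambda i: sum(
                dtw(curves[i], curves[j]) for j in members)))
        if new_idx == rep_idx:
            break
        rep_idx = new_idx
    # final assignment under the converged representatives
    assign = [min(range(k), key=lambda c: dtw(x, curves[rep_idx[c]]))
              for x in curves]
    return rep_idx, assign

# Toy data: three near-flat curves and three near-constant-5 curves.
curves = [[0, 0, 0], [0.1, 0, 0.1], [0, 0.1, 0],
          [5, 5, 5], [5.1, 5, 5.2], [5, 5.1, 5]]
rep_idx, assign = dtw_medoid_cluster(curves)
print(sorted(set(assign)))  # → [0, 1]
```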
3.2 CHOOSING GOOD SPLITS
While reasonable clustering procedures can provide reasonable representative curves, most standard clustering algorithms are only guaranteed to converge to locally optimal solutions. The standard approach to alleviate this problem is to do multiple restarts (multiplicity m), initialized randomly, and pick the tightest clusters found.

¹Note that this procedure bears resemblance to complete-link hierarchical clustering.
Recall that in our application, however, the criterion for "goodness" of a candidate test is not how tight the found clusters are, but rather how well the representative curves partition the data class labels. Typical measures of partition purity include entropy, information gain and the Gini diversity index. In our experiments we use the Gini index.
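As a concrete reference, the Gini diversity index of a set of labels and the size-weighted impurity of a binary partition (lower is purer) can be computed as follows; this is the standard textbook definition, with illustrative function names of our choosing.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini diversity index: 1 - sum_c p_c^2 (0 for a pure node)."""
    n = len(labels)
    return 1.0 - sum((cnt / n) ** 2 for cnt in Counter(labels).values())

def split_impurity(left_labels, right_labels):
    """Size-weighted impurity of a binary partition; a good candidate
    split minimizes this score."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_impurity(left_labels) + \
           (len(right_labels) / n) * gini_impurity(right_labels)

print(gini_impurity([1, 1, 2, 2]))     # → 0.5
print(split_impurity([1, 1], [2, 2]))  # → 0.0 (perfect split)
```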
Summing up, in order to find good (high purity) splits for a functional variable, we perform multiple restarts of clustering with random initializations (setting m = 1000, unless noted otherwise) and pick the partition (the representative curves summarize/define this partition) that scores best on the Gini index. This search procedure can easily be plugged in to standard divide-and-conquer decision tree building algorithms like C4.5 and CART, and Algorithm 1 provides an outline in high-level pseudo-code.
Algorithm 1: Search procedure for a functional variable split
Data: subset of the training data D.
Result: Representative curves x_r, x_l that partition the input data into D_l, D_r (D = D_l ∪ D_r); score of the partition.

Initialize best = [0, φ, φ] (stores [score, D_l, D_r]); x_r, x_l = φ.
for j = 1, 2, ..., m do
    Run the clustering procedure with random initialization; obtain candidate representative curves and partition.
    Compute score (e.g., Gini index) using the candidate partition.
    if score is better than the current best score then
        Update best, x_r, x_l.
    end
end
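Algorithm 1 can be sketched compactly in Python, here using k-means (k = 2) with Euclidean distance as the clustering procedure and weighted Gini impurity (lower is better) as the purity score. All names, the tie handling, and the small default m are our assumptions for illustration.

```python
import math
import random
from collections import Counter

def euclid(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def two_means(curves, rng, iters=20):
    """One randomly initialized run of k-means (k = 2) on fixed-length
    curves; returns two mean curves as the candidate representatives."""
    reps = [list(c) for c in rng.sample(curves, 2)]
    for _ in range(iters):
        groups = [[], []]
        for c in curves:
            groups[0 if euclid(c, reps[0]) <= euclid(c, reps[1]) else 1].append(c)
        for g in (0, 1):
            if groups[g]:
                T = len(groups[g][0])
                reps[g] = [sum(c[t] for c in groups[g]) / len(groups[g])
                           for t in range(T)]
    return reps

def best_functional_split(curves, labels, m=50, seed=0):
    """Algorithm 1 sketch: m random restarts of clustering; keep the pair
    of representatives whose induced partition of the labels is purest."""
    rng = random.Random(seed)
    best_score, best_reps = float("inf"), None
    for _ in range(m):
        xl, xr = two_means(curves, rng)
        left = [y for c, y in zip(curves, labels) if euclid(c, xl) <= euclid(c, xr)]
        right = [y for c, y in zip(curves, labels) if euclid(c, xl) > euclid(c, xr)]
        if not left or not right:
            continue  # degenerate partition; try another restart
        n = len(labels)
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_score, best_reps = score, (xl, xr)
    return best_reps, best_score

# Toy data: flat curves (class 1) vs. rising curves (class 2).
flat = [[0.0, 0.0, 0.0], [0.1, 0.1, 0.0], [0.0, 0.1, 0.1]]
rising = [[0.0, 2.0, 4.0], [0.1, 2.0, 4.1], [0.0, 2.1, 3.9]]
reps, score = best_functional_split(flat + rising, [1, 1, 1, 2, 2, 2])
print(score)  # a pure split separates the two shapes
```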
4 APPLICATIONS
We now describe applications of our algorithm to both simulated and real datasets (see Table 1 for details). We will first examine three simulated datasets: the cylinder, bell, funnel dataset (CBF), an extension of it we created (CBF-2), and the control chart (CC) dataset. Table 2 provides some details about the predictive performance experiments. For each dataset,
Table 1: Dataset Descriptions

Data  | Src. | n   | df | c | Tr   | Protocol
CBF   | *    | 798 | 1  | 3 | 128  | 10-fold CV
CBF-2 | *    | 798 | 2  | 2 | 128  | 10-fold CV
CC    | 1    | 600 | 1  | 6 | 60   | 10-fold CV
JV    | 1    | 370 | 12 | 9 | 7-29 | test 270
Bone  | 2    | 96  | 1  | 2 | 100  | leave one out
NHP   | *    | 30  | 7  | 2 | 7-10 | leave one out

*: Own/Simulated, 1: UCI KDD Archive (Hettich and Bay, 1999), 2: Ramsay and Silverman (2002). n: number of observations, df: number of functional variables, c: number of classes, Tr: length of the functional instances.
we compare predictive error rates to one or more of the following:
1. LR: an L1-constrained logistic regression classifier, using BBR² (Genkin et al., 2004) for binary classification problems and BMR³ for multi-class problems, trained with 10-fold CV used to pick the hyperparameter from amongst the set {0.001, 0.01, 0.1, 1, 10, 100, 1000}. This is a reasonably fair baseline that represents state-of-the-art classifier performance (see Genkin et al. 2004 for how BBR compares to SVMs, etc.).
2. Seg. (segmented inputs): the best previously published results from Geurts (2001) on various classifiers that use segmented auxiliary variables as input.
3. Fnl. (functional): the best previously published results of Geurts (2001) using comparably interpretable classifiers constructed by combining functional patterns and decision trees.
4. Best: the best published results otherwise known (not in any of the other three categories).
For datasets where 10-fold CV was used to estimate error rates (CBF, CBF-2 and CC), the functional decision trees (FDT) were pruned by training on 8 out of the 10 folds and picking the sub-tree of the full tree that gave the smallest error on the 9th fold (the pruning set). Finally, prediction errors were counted on the remaining 10th fold. For leave-one-out protocol datasets, no pruning was done.
Also, in order to display the functional splits, we will use the following conventions in displayed trees throughout the paper: a functional split will be displayed by showing the functional variable name, a < symbol, followed by a unique integer for the split. For example, x1 < 1 represents a split on functional variable 1 (x_{i1}), and the index 1 indicates this is the root split. Further, for any split, the left branch representative x_l will be shown in plots by a solid line and the right branch representative x_r will be shown by a dashed line.
²http://www.stat.rutgers.edu/∼madigan/BBR
³http://www.stat.rutgers.edu/∼madigan/BMR
4.1 CBF
[Figure: four panels, Bell (b), Cylinder (c), Funnel (f), and class means, each plotting value against time (0-100).]
Figure 2: Cylinder, bell, funnel dataset.
The cylinder, bell, funnel dataset proposed by Saito (1994) is a three-class problem, y_i ∈ {b, c, f}, with one time series (functional) variable of fixed length. The equations describing the functional attribute for each class have both random noise as well as random start and end points for the generating events of each class, making for quite a lot of variability in the instances. As in past studies, we simulated 266 instances of each class to construct the dataset. Instances of each class are shown in Figure 2, with a particular instance of each class in bold and the computed class means in the bottom-right panel. Although this
[Figure: the two functional splits x1 < 1 and x1 < 2 (representative curves plotted against time) and the pruned tree with leaves c, f, and b.]
Figure 3: CBF results: pruned tree, splits.
is essentially an easy classification problem (reported accuracies are in the high 90's), it is often used as a sanity check when testing algorithms that perform functional variable classification. As far as predictive performance goes, our procedure performs on par with some of the best known methods on this dataset (see Table 2). The functional decision tree provides a highly interpretable representation. Shown in Figure 3 is a pruned decision tree constructed on the whole dataset using Euclidean distance.⁴ The leaf splits are, as expected, very representative of the class means (cf. Figure 2).
4.2 CBF-2
This is an artificial dataset we created that extends the CBF dataset by adding another independent functional variable. This second functional variable we also set to be a cylinder, bell or funnel instance. Finally, we create a binary classification problem with these inputs by assigning patterns to class 1 if and only if the first variable instance x_{i1} is a cylinder and the second variable instance x_{i2} is a bell (again, we simulated 266 instances of each class for each variable). The classification problem is of much the same hardness as the original CBF problem (see the predictive results table). We chose to work with this dataset because it is a perfect example to apply functional decision trees to: a combination of both input functional instances carries the entire predictive signal. The interpretability of the learned tree highlights how effective functional decision trees can be; see Figure 4. The splits, one on each variable, correspond exactly to the true data generating mechanism, and this mechanism is evident in the plots of the functional splits.
[Figure: the two functional splits x2 < 1 and x1 < 2 (representative curves plotted against time) and the learnt tree with leaves 0, 0, and 1.]
Figure 4: CBF-2 results: learnt tree, splits. Branches predicting class 1 are shown in red.
⁴Note that the DTW results are reported in Table 2. Euclidean clustering results are slightly worse.
4.3 Control Chart
This is an artificial dataset in the UCI KDD Archive⁵ consisting of 100 objects from each of six classes. The instances of each class ∈ {normal, cyclic, up, down, increasing, decreasing} are defined by 60 time points, and the label broadly describes the behavior of the function with time; see Figure 5 (up/down denote a sudden jump in the series up/down, respectively). Euclidean clustering
[Figure: four panels, Cyclic/Up, Down/Normal, Increasing/Decreasing, and class means, each plotting value against time (0-60).]
Figure 5: Control chart dataset.
performs best on this dataset, with predictive accuracy competitive with other techniques (Table 2). Figure 6 shows the complete functional decision tree along with select functional splits. Notice that the highest level split (x1 < 1) corresponds broadly to partitioning the instances into those that generally increase and those that generally decrease (cyclic and normal are arbitrarily assigned to one or the other). Appealingly, other internal splits display similar sorts of intuitive behavior (see x1 < 4, for example) and leaf splits display strong representative instances of the component classes (see x1 < 9 and x1 < 10).
4.4 Japanese Vowels
Another dataset in the UCI KDD Archive, the Japanese Vowels dataset was first used in Kudo et al. (1999). The classification problem is a speaker identification task, and the dataset consists of utterances of the Japanese vowels "a" and "e" by nine male speakers (y_i ∈ {1, ..., 9}). Each utterance is described by 12 temporal attributes, which are 12 time-varying LPC spectrum coefficients (for details of how these attributes were obtained, see Kudo et al. 1999). An important challenge this dataset poses to many standard
⁵Hettich and Bay (1999), http://kdd.ics.uci.edu
[Figure: the learnt tree (root x1 < 1; internal splits x1 < 2 through x1 < 14; leaves labeled up, incr, cyclic, normal, decr, down) and the functional splits x1 < 1, x1 < 4, x1 < 9, and x1 < 10 plotted against time.]
Figure 6: Control chart: learnt tree and some splits.
classification techniques is that the input sequences are of variable length (owing to the variable length of each utterance). For this reason, our results are obtained by applying the DTW clustering procedure, for which variable lengths are not a problem.
For this problem, since the number of classes is fairly large compared to the number of examples of each class, we constructed a 15-bit error-correcting output code classifier (Dietterich and Bakiri, 1995) from the functional decision trees. Although our prediction accuracy isn't quite on par with the best methods for this problem (see Table 2), it is better than that of the other functional classifier that produces interpretable outputs, Geurts (2001).
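For readers unfamiliar with error-correcting output codes, the decoding step reduces to a nearest-codeword lookup under Hamming distance over the per-bit classifier outputs. The codebook below is a toy 3-class, 5-bit example of our own construction (the paper's classifier uses 9 classes and 15 bits).

```python
def ecoc_predict(bit_predictions, codebook):
    """ECOC decoding (Dietterich and Bakiri, 1995): choose the class
    whose codeword is closest in Hamming distance to the vector of
    per-bit binary classifier outputs."""
    def hamming(a, b):
        return sum(u != v for u, v in zip(a, b))
    return min(codebook, key=lambda cls: hamming(bit_predictions, codebook[cls]))

# Toy codebook (hypothetical values): pairwise Hamming distance >= 3,
# so any single flipped bit is still decoded to the correct class.
codebook = {1: [0, 0, 1, 1, 0], 2: [1, 0, 0, 1, 1], 3: [0, 1, 1, 0, 1]}
print(ecoc_predict([0, 0, 1, 0, 0], codebook))  # → 1 (one bit flipped from class 1's code)
```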
More interesting for this dataset are the learnt decision trees and their comparison with Kudo et al.'s analysis of the problem. Shown in Figure 7 is the tree learnt for speaker 1 (a 1-vs.-all binary classification problem), select splits from the tree, and previously published results based on Kudo et al.'s analysis (who also construct individual speaker classifiers). Kudo et al.'s procedure constructs regions in space through which the speaker's curves mostly should (solid black rectangles) and mostly should not (dashed purple rectangles) pass. The qualitative correspondence between these regions and the functional splits we obtain is quite striking.
4.5 Bone
This second real dataset records the planar cross-sectional outline of the intercondylar notch from the knee joint (on the femur) of adults. The data has been collected from exhumed skeletons and includes concomitant information such as the sex of the individual. We obtained the data from the data section of Ramsay and Silverman's book (Ramsay and Silverman, 2002).⁶ The raw data (outline of the coordinates of the bone) is first parameterized by arc length and sampled; see Ramsay and Silverman (2002) for details of this standard pre-processing. This arc-length-parameterized data is what we set (both x and y coordinates) as our single two-dimensional functional variable (the independent variable being arc length). The binary classification problem involves predicting arthritis, and past analyses of this problem have shown the notch shape to be an important predictor. Besides obtaining competitive predictive accuracy on this dataset, the learnt tree displays interesting splits. First, the tree contains both non-functional splits (on the sex of the individual) and functional splits. Second, the functional splits themselves are instructive about bone shape and its arthritis implications; see Figure 8 (shown with a comparable plot from the literature). Corroborating the conclusions James and other authors reached by independent analyses of the same dataset (James and Silverman, 2005; Ramsay and Silverman, 2002), the functional splits we learn show that both variability in the y-direction (shrunken bones) and bending to the right are predictive of a greater risk of arthritis.
4.6 NHP study
As described in the introduction, this dataset involves time series measurements (of different lengths) related to the state of the immune system of vaccinated NHPs. Once again, the tree we learn is competitive in predictive accuracy. Figure 9 shows the learnt tree and functional splits (as in the introduction, red corresponds to branches predicting death, y_i = d, while green corresponds to NHPs that survive, y_i = a). The splits are biologically sensible: an increasing IL6 trajectory is predictive of survival; a rapid early rise in SI is predictive of death; an early rise in TNF (tumor necrosis factor) is predictive of survival.
⁶http://www.stats.ox.ac.uk/∼silverma/fdacasebook/
Table 2: Predictive Performance (Error Rates)

Data  | Best  | LR    | Seg. | Fnl. | FDT
CBF   | 0¹    | 2.77  | 0.5  | 1.17 | 0.13 D
CBF-2 | -     | 5.29  | -    | -    | 0.25 D
CC    | 0.33² | 10.33 | 0.5  | 2.33 | 2.0 E
JV    | 3.8³  | -     | 3.5¹ | 19.4 | 9.46 D
Bone  | -     | 21.86 | -    | -    | 19.79 E
NHP   | -     | 33.33 | -    | -    | 26.67 E

1: Kadous and Sammut (2005), 2: Geurts and Wehenkel (2005), 3: Kudo et al. (1999). D: DTW clustering, E: Euclidean clustering.
5 DISCUSSION, FUTURE WORK
In this paper, we presented a simple and effective extension to decision trees that allows them to operate on functional input variables. We presented results showing that these functional decision trees are accurate and produce interpretable classifiers.
Many extensions to the basic idea presented here suggest themselves; we describe a few. The representative curves can be generated by more sophisticated clustering algorithms; of particular interest would be ones designed for clustering functional curves, for example the one proposed by James and Sugar (2003). Also, a range of algorithms from model-based clustering (e.g., HMM based) to non-parametric clustering (e.g., Gaussian process based clustering methods) might be applied.
Further, one is not limited to decision trees as the base classifier either. An alternative way to view a single functional split is that it defines an auxiliary variable that may be used in addition to standard features in any classification algorithm. Multi-way splits, for example, might be particularly powerful features in multi-class problems. Finally, predictive accuracy can likely be improved by boosting these functional decision trees, a topic we are currently investigating.
Acknowledgements
We thank the US National Science Foundation for financial support, as well as the KD-D group, who supported this work through NSF grant EIA-0087022.
References
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.

A. Genkin, D. D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization, 2004. URL http://www.stat.rutgers.edu/∼madigan/PAPERS/.

P. Geurts. Pattern extraction for time-series classification. In L. de Raedt and A. Siebes, editors, PKDD, LNAI 2168, pages 115-127, Freiburg, September 2001.

P. Geurts and L. Wehenkel. Segment and combine approach for non-parametric time-series classification. In PKDD, October 2005.

C. J. A. Gonzalez and J. J. R. Diez. Boosting interval-based literals: Variable length and early classification. In M. Last, A. Kandel, and H. Bunke, editors, Data Mining in Time Series Databases. World Scientific, 2004.

S. Hettich and S. D. Bay. The UCI KDD archive, 1999. URL http://kdd.ics.uci.edu.

G. James and B. Silverman. Functional adaptive model estimation. Journal of the American Statistical Association, 100:565-576, 2005.

G. James and C. Sugar. Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 98:397-408, 2003.

M. W. Kadous and C. Sammut. Classification of multivariate time series and structured data using constructive induction. Machine Learning Journal, 58:179-216, 2005.

M. Kudo, J. Toyama, and M. Shimbo. Multidimensional curve classification using passing-through regions. Pattern Recognition Letters, 20(11-13):1103-1111, 1999.

J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California, 1993.

J. O. Ramsay and B. W. Silverman. Applied Functional Data Analysis: Methods and Case Studies. Springer-Verlag, New York, 2002.

N. Saito. Local feature extraction and its application using a library of bases. PhD thesis, Yale University, 1994.

H. Shimodaira, K.-I. Noma, M. Nakai, and S. Sagayama. Dynamic time-alignment kernel in support vector machine. In NIPS, volume 2, pages 921-928, 2001.
[Figure: the learnt tree for speaker 1 (root x11 < 1; internal splits x4 < 2, x10 < 3, x3 < 4, x1 < 5, x2 < 7, x3 < 8; leaves 0 and 1), the functional splits x1 < 5, x2 < 7, x3 < 4, and x4 < 2 plotted against time, and the corresponding figure from Kudo et al.]
Figure 7: Japanese Vowels: Functional splits corresponding to reported results in Kudo et al. (1999). Branches corresponding to speaker 1 have been colored red for ease of comparison. Note: the bottom figure is a capture of a figure in Kudo et al. (1999).
[Figure: the functional splits x1 < 1 and x1 < 2 plotted in the x-y plane, alongside the comparison plot from the literature.]
Figure 8: Bone data comparative results. Top row figures: captures from a figure in James and Silverman (2005). The plot shows, in blue, the first two principal components of the predictive model they propose; the '-' sign curve is for arthritic bones and the '+' curve is for healthy bone shapes (mean bone shape in red). Lower row figures: root and next-level split of the learnt FDT (tree not shown). Branches of the functional splits predicting arthritis are colored red.
[Figure: the functional splits x7 < 1 (IL6), x3 < 2 (SI), and x9 < 3 (TNF) plotted against weeks (0-50), and the learnt tree with leaves a, a, a, and d.]
Figure 9: NHP learnt tree, functional splits. Branches of the functional splits predicting death are colored red.