MLT Document Format

  • 8/10/2019 MLT Document Format


Part of the book is available to preview online via Google Books.

Table 1: A small training set for SpamAssassin

The columns marked x1 and x2 indicate the results of two tests on four different e-mails. The fourth column indicates which of the e-mails are spam. The right-most column demonstrates that by thresholding the function 4x1 + 4x2 at 5, we can separate spam from ham.

Figure 1: Linear classification in two dimensions

The straight line separates the positives from the negatives. It is defined by w · xi = t, where w is a vector perpendicular to the decision boundary and pointing in the direction of the positives, t is the decision threshold, and xi points to a point on the decision boundary. In particular, x0 points in the same direction as w, from which it follows that w · x0 = ||w|| ||x0|| = t (||x|| denotes the length of the vector x).
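The decision rule in the figure is just a thresholded dot product. A minimal sketch, reusing the 4x1 + 4x2 score thresholded at 5 from the SpamAssassin table above (the test points are made up):

```python
import numpy as np

# Weight vector and threshold taken from the SpamAssassin example above:
# score an e-mail as 4*x1 + 4*x2 and compare against t = 5.
w = np.array([4.0, 4.0])
t = 5.0

def classify(x):
    """Predict +1 (spam) if w . x exceeds the threshold t, else -1 (ham)."""
    return 1 if np.dot(w, x) > t else -1

# An e-mail firing both tests lands on the positive (spam) side ...
assert classify(np.array([1.0, 1.0])) == 1    # 4 + 4 = 8 > 5
# ... while one firing a single test stays on the ham side.
assert classify(np.array([0.0, 1.0])) == -1   # 4 <= 5
```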


Figure 2: Machine learning for spam filtering

At the top we see how SpamAssassin approaches the spam e-mail classification task: the text of each e-mail is converted into a data point by means of SpamAssassin's built-in tests, and a linear classifier is applied to obtain a 'spam or ham' decision. At the bottom (in blue) we see the bit that is done by machine learning.

Figure 3: How machine learning helps to solve a task

An overview of how machine learning is used to address a given task. A task (red box) requires an appropriate mapping (a model) from data described by features to outputs. Obtaining such a mapping from training data is what constitutes a learning problem (blue box).

The ingredients of machine learning
Tasks: the problems that can be solved with machine learning
Models: the output of machine learning
Features: the workhorses of machine learning



Tasks for machine learning
• Binary and multi-class classification: categorical target
• Regression: numerical target
• Clustering: hidden target
• Finding underlying structure

Looking for Structure I

Consider the following matrix:

Imagine these represent ratings by six different people (in rows), on a scale of 0 to 3, of four different films, say The Shawshank Redemption, The Usual Suspects, The Godfather, The Big Lebowski (in columns, from left to right). The Godfather seems to be the most popular of the four with an average rating of 1.5, and The Shawshank Redemption is the least appreciated with an average rating of 0.5. Can you see any structure in this matrix?

Looking for Structure II

The right-most matrix associates films (in columns) with genres (in rows): The Shawshank Redemption and The Usual Suspects belong to two different genres, say drama and crime, The Godfather belongs to both, and The Big Lebowski is a crime film and also introduces a new genre (say comedy). The tall, 6-by-3 matrix then expresses people's preferences in terms of genres. Finally, the middle matrix states that the crime genre is twice as important as the other two genres in terms of determining people's preferences.
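The three-factor decomposition described above can be sketched directly. The film-genre matrix and the diagonal importance matrix follow the text; the 6-by-3 matrix of person-genre preferences is made up for illustration:

```python
import numpy as np

# Films (columns): Shawshank, Usual Suspects, Godfather, Big Lebowski.
# Genres (rows): drama, crime, comedy -- memberships as described above.
genre_film = np.array([[1, 0, 1, 0],   # drama
                       [0, 1, 1, 1],   # crime
                       [0, 0, 0, 1]])  # comedy

# The middle matrix: crime counts twice as much as the other two genres.
importance = np.diag([1, 2, 1])

# Hypothetical 6-by-3 person-genre preferences (one row per person).
person_genre = np.array([[1, 0, 0],
                         [1, 1, 0],
                         [0, 1, 1],
                         [0, 1, 0],
                         [1, 0, 1],
                         [0, 0, 1]])

# Multiplying the three factors reconstructs the 6-by-4 ratings matrix.
ratings = person_genre @ importance @ genre_film
print(ratings)
```

With these made-up preferences the Godfather column averages 1.5 and the Shawshank column 0.5, matching the averages quoted above.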

Machine learning settings



The rows refer to whether the training data is labelled with a target variable, while the columns indicate whether the models learned are used to predict a target variable or rather describe the given data.

Machine learning models

Machine learning models can be distinguished according to their main intuition:
• Geometric models use intuitions from geometry such as separating (hyper-)planes, linear transformations and distance metrics.
• Probabilistic models view learning as a process of reducing uncertainty.
• Logical models are defined in terms of logical expressions.
Alternatively, they can be characterised by their modus operandi:
• Grouping models divide the instance space into segments; in each segment a very simple (e.g., constant) model is learned.
• Grading models learn a single, global model over the instance space.

Figure 1.1: Basic linear classifier

The basic linear classifier constructs a decision boundary by half-way intersecting the line between the positive and negative centres of mass. It is described by the equation w · x = t, with w = p - n; the decision threshold can be found by noting that (p + n)/2 is on the decision boundary, and hence t = (p - n) · (p + n)/2 = (||p||² - ||n||²)/2, where ||x|| denotes the length of vector x.
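A sketch of this construction on a tiny made-up data set, checking that the midpoint between the two centres of mass lies on the learned boundary:

```python
import numpy as np

def basic_linear_classifier(pos, neg):
    """Learn w and t from the centres of mass of positives and negatives."""
    p, n = pos.mean(axis=0), neg.mean(axis=0)
    w = p - n                              # w = p - n
    t = (np.dot(p, p) - np.dot(n, n)) / 2  # t = (||p||^2 - ||n||^2) / 2
    return w, t

# Two positives and two negatives (values are illustrative only).
pos = np.array([[2.0, 2.0], [3.0, 3.0]])
neg = np.array([[0.0, 0.0], [1.0, 1.0]])
w, t = basic_linear_classifier(pos, neg)

# (p + n)/2 lies exactly on the decision boundary w . x = t.
mid = (pos.mean(axis=0) + neg.mean(axis=0)) / 2
assert np.isclose(np.dot(w, mid), t)
```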

Figure 1.2: Support vector machine

The decision boundary learned by a support vector machine from the linearly separable data of Figure 1.1. The decision boundary maximises the margin, which is indicated by the dotted lines. The circled data points are the support vectors.



Table 1.2: A simple probabilistic model

'Viagra' and 'lottery' are two Boolean features; Y is the class variable, with values 'spam' and 'ham'. In each row the most likely class is indicated in bold.

Example 1.3: Posterior odds

Using a MAP decision rule we predict ham in the top two cases and spam in the bottom two. Given that the full posterior distribution is all there is to know about the domain in a statistical sense, these predictions are the best we can do: they are Bayes-optimal.

Table 1.3: Example marginal likelihoods

Example 1.4: Using marginal likelihoods
Using the marginal likelihoods from Table 1.3, we can approximate the likelihood ratios (the previously calculated odds from the full posterior distribution are shown in brackets):



We see that, using a maximum likelihood decision rule, our very simple model arrives at the Bayes-optimal prediction in the first three cases, but not in the fourth ('Viagra' and 'lottery' both present), where the marginal likelihoods are actually very misleading.
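The maximum likelihood decision rule above can be sketched as a naive-Bayes-style product of per-feature likelihood ratios. The marginal likelihoods below are hypothetical placeholders, since Table 1.3 itself is not reproduced in this document:

```python
# Hypothetical marginal likelihoods P(feature = 1 | class); these are NOT
# the values from Table 1.3, which is not reproduced here.
lik_spam = {'Viagra': 0.4, 'lottery': 0.2}
lik_ham  = {'Viagra': 0.1, 'lottery': 0.1}

def likelihood_ratio(viagra, lottery):
    """Product of per-feature likelihood ratios for spam versus ham."""
    ratio = 1.0
    for feat, present in (('Viagra', viagra), ('lottery', lottery)):
        ps, ph = lik_spam[feat], lik_ham[feat]
        ratio *= (ps / ph) if present else ((1 - ps) / (1 - ph))
    return ratio

# Maximum likelihood decision rule: predict spam iff the ratio exceeds 1.
for viagra in (0, 1):
    for lottery in (0, 1):
        r = likelihood_ratio(viagra, lottery)
        print(viagra, lottery, round(r, 3), 'spam' if r > 1 else 'ham')
```

Because the per-feature ratios are simply multiplied, such a model can mislead when the features are not independent, which is the kind of failure seen in the fourth case above.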

Figure 1.3: The Scottish classifier


(top) Visualisation of two marginal likelihoods as estimated from a small data set. The colours indicate whether the likelihood points to spam or ham. (bottom) Combining the two marginal likelihoods gives a pattern not unlike that of a Scottish tartan.

Figure 1.4: A feature tree

(left) A feature tree combining two Boolean features. Each internal node or split is labelled with a feature, and each edge emanating from a split is labelled with a feature value. Each leaf therefore corresponds to a unique combination of feature values. Also indicated in each leaf is the class distribution derived from the training set. (right) A feature tree partitions the instance space into rectangular regions, one for each leaf. We can clearly see that the majority of ham lives in the lower left-hand corner.

Example 1.5: Labelling a feature tree

The leaves of the tree in Figure 1.4 could be labelled, from left to right, as ham - spam - spam, employing a simple decision rule called majority class.
• Alternatively, we could label them with the proportion of spam e-mail occurring in each leaf: from left to right, 1/3, 2/3, and 4/5.
• Or, if our task was a regression task, we could label the leaves with predicted real values or even linear functions of some other, real-valued features.
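The first two labelling options can be sketched in a few lines; the per-leaf (spam, ham) counts below are hypothetical but chosen to match the proportions 1/3, 2/3 and 4/5 quoted above:

```python
from fractions import Fraction

# (spam, ham) counts in each leaf, left to right; counts are hypothetical
# but consistent with the quoted spam proportions 1/3, 2/3 and 4/5.
leaves = [(1, 2), (2, 1), (4, 1)]

# Majority class decision rule: label each leaf with its most common class.
labels = ['spam' if spam > ham else 'ham' for spam, ham in leaves]

# Alternatively, label each leaf with its empirical spam proportion.
proportions = [Fraction(spam, spam + ham) for spam, ham in leaves]

print(labels)                          # ['ham', 'spam', 'spam']
print([str(p) for p in proportions])   # ['1/3', '2/3', '4/5']
```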

Figure 1.7: Mapping machine learning models

A 'map' of some of the models that will be considered in this book. Models that share characteristics are plotted closer together: logical models to the right, geometric models on the top left and probabilistic models on the bottom left. The horizontal dimension roughly ranges from grading models on the left to grouping models on the right.

Figure 1.8: ML taxonomy



A taxonomy describing machine learning methods in terms of the extent to which they are grading or grouping models, logical, geometric or a combination, and supervised or unsupervised. The colours indicate the type of model, from left to right: logical (red), probabilistic (orange) and geometric (purple).

Features: the workhorses of machine learning

Example 1.7: The MLM data set
Suppose we have a number of learning models that we want to describe in terms of a number of properties:
• the extent to which the models are geometric, probabilistic or logical;
• whether they are grouping or grading models;
• the extent to which they can handle discrete and/or real-valued features;
• whether they are used in supervised or unsupervised learning; and
• the extent to which they can handle multi-class problems.
The first two properties could be expressed by discrete features with three and two values, respectively; or, if the distinctions are more gradual, each aspect could be rated on some numerical scale. A simple approach would be to measure each property on an integer scale from 0 to 3, as in Table 1.4. This table establishes a data set in which each row represents an instance and each column a feature.

Table 1.4: The MLM data set



The MLM data set describing properties of machine learning models. Both Figure 1.7 and Figure 1.8 were generated from this data.

Figure 1.9: A small regression tree

(left) A regression tree combining a one-split feature tree with linear regression models in the leaves. Notice how x is used as both a splitting feature and a regression variable. (right) The function y = cos πx on the interval -1 ≤ x ≤ 1, and the piecewise linear approximation achieved by the regression tree.

Figure 1.10: Class-sensitive discretisation



(left) Artificial data depicting a histogram of body weight measurements of people with (blue) and without (red) diabetes, with eleven fixed intervals of 10 kilograms width each. (right) By joining the first and second, third and fourth, fifth and sixth, and the eighth, ninth and tenth intervals, we obtain a discretisation such that the proportion of diabetes cases increases from left to right. This discretisation makes the feature more useful in predicting diabetes.

Example 1.9: The kernel trick

That is, by squaring the dot product in the original space we obtain the dot product in the new space without actually constructing the feature vectors! A function that calculates the dot product in feature space directly from the vectors in the original space is called a kernel; here the kernel is κ(x1, x2) = (x1 · x2)².
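The claim is easy to verify numerically. For this quadratic kernel the matching feature map takes (x, y) to (x², y², √2·xy); the √2 cross-term is what makes the squared dot product come out exactly right:

```python
import numpy as np

def phi(v):
    """Map (x, y) into the feature space (x^2, y^2, sqrt(2)*x*y)."""
    x, y = v
    return np.array([x * x, y * y, np.sqrt(2) * x * y])

def kernel(v1, v2):
    """The quadratic kernel: the squared dot product in the original space."""
    return np.dot(v1, v2) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, 4.0])

# The kernel equals the feature-space dot product without ever
# constructing the feature vectors explicitly.
assert np.isclose(kernel(a, b), np.dot(phi(a), phi(b)))
```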

Figure 1.11: Non-linearly separable data

(left) A linear classifier would perform poorly on this data. (right) By transforming the original (x, y) data into (x', y') = (x², y²), the data becomes more 'linear', and a linear decision boundary x' + y' = 3 separates the data fairly well. In the original space this corresponds to a circle with radius √3 around the origin.

Binary classification and related tasks
Classification
Scoring and ranking
Class probability estimation

Table 2.1: Predictive machine learning scenarios


(left) A feature tree with training set class distribution in the leaves. (right) A decision tree obtained using the majority class decision rule.

Table 2.2: Contingency table



(left) A two-class contingency table or confusion matrix depicting the performance of the decision tree in Figure 2.1. Numbers on the descending diagonal indicate correct predictions, while the ascending diagonal concerns prediction errors. (right) A contingency table with the same marginals but independent rows and columns.

Example 2.1: Accuracy as a weighted average
Suppose a classifier's predictions on a test set are as in the following table:

From this table, we see that the true positive rate is tpr = 60/75 = 0.80 and the true negative rate is tnr = 15/25 = 0.60. The overall accuracy is acc = (60 + 15)/100 = 0.75, which is no longer the average of the true positive and negative rates. However, taking into account the proportion of positives pos = 0.75 and the proportion of negatives neg = 1 - pos = 0.25, we see that acc = pos · tpr + neg · tnr.
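The arithmetic of this example can be checked directly; the counts below are recovered from the stated rates (75 positives, 25 negatives):

```python
# Contingency counts implied by the example.
TP, FN = 60, 15   # tpr = 60/75 = 0.80
TN, FP = 15, 10   # tnr = 15/25 = 0.60

pos = (TP + FN) / 100   # proportion of positives, 0.75
neg = 1 - pos           # proportion of negatives, 0.25
tpr = TP / (TP + FN)
tnr = TN / (TN + FP)
acc = (TP + TN) / 100

# Accuracy is the class-weighted average of the per-class rates.
assert acc == 0.75
assert abs(acc - (pos * tpr + neg * tnr)) < 1e-12
```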

Figure 2.2: A coverage plot

(left) A coverage plot depicting the two contingency tables in Table 2.2. The plot is square because the class distribution is uniform. (right) Coverage plot for Example 2.1, with a class ratio clr = 3.

Figure 2.3: An ROC plot



(left) C1 and C3 both dominate C2, but neither dominates the other. The diagonal line indicates that C1 and C3 achieve equal accuracy. (right) The same plot with normalised axes. We can interpret this plot as a merger of the two coverage plots in Figure 2.2, employing normalisation to deal with the different class distributions. The diagonal line now indicates that C1 and C3 have the same average recall.

Figure 2.4: Comparing coverage and ROC plots

(left) In a coverage plot, accuracy isometrics have a slope of 1, and average recall isometrics are parallel to the ascending diagonal. (right) In the corresponding ROC plot, average recall isometrics have a slope of 1; the accuracy isometric here has a slope of 1/3, corresponding to the ratio of negatives to positives in the data set.

    Scoring and ranking

Figure 2.5: A scoring tree



(left) A feature tree with training set class distribution in the leaves. (right) A scoring tree using the logarithm of the class ratio as scores; spam is taken as the positive class.

Example 2.2: Ranking example
The scoring tree in Figure 2.5 produces the following ranking:




(left) Each cell in the grid denotes a unique pair of one positive and one negative example: the green cells indicate pairs that are correctly ranked by the classifier, the red cells represent ranking errors, and the orange cells are half-errors due to ties. (right) The coverage curve of a tree-based scoring classifier has one line segment for each leaf of the tree, and one (FP, TP) pair for each possible threshold on the score.

Example 2.4: Class imbalance
Suppose we feed the scoring tree in Figure 2.5 an extended test set, with an additional batch of 50 negatives.


• The added negatives happen to be identical to the original ones, so the net effect is that the number of negatives in each leaf doubles.
• As a result the coverage curve changes (because the class ratio changes), but the ROC curve stays the same (Figure 2.8).
• Note that the AUC stays the same as well: while the classifier makes twice as many ranking errors, there are also twice as many positive-negative pairs, so the ranking error rate doesn't change.
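The AUC invariance can be checked with a small pairwise-ranking sketch; the scores below are made up for illustration:

```python
def auc(pos_scores, neg_scores):
    """Fraction of positive-negative pairs ranked correctly (ties count 1/2)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scores produced by some scoring classifier.
pos = [0.9, 0.8, 0.6, 0.4]
neg = [0.7, 0.5, 0.3]

# Doubling the negatives doubles both the ranking errors and the number
# of pairs, so the AUC -- the ranking error *rate* -- is unchanged.
assert auc(pos, neg) == auc(pos, neg + neg)
```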

Figure 2.8: Class imbalance

(left) A coverage curve obtained from a test set with class ratio clr = 1/2. (right) The corresponding ROC curve is the same as the one corresponding to the coverage curve in Figure 2.7 (right).



Figure 2.9: Rankings from grading classifiers

(left) A linear classifier induces a ranking by taking the signed distance to the decision boundary as the score. This ranking only depends on the orientation of the decision boundary: the three lines result in exactly the same ranking. (right) The grid of correctly ranked positive-negative pairs (in green) and ranking errors (in red).

Figure 2.10: Coverage curve of grading classifier

The coverage curve of the linear classifier in Figure 2.9. The points labelled A, B and C indicate the classification performance of the corresponding decision boundaries. The dotted lines indicate the improvement that can be obtained by turning the grading classifier into a grouping classifier with four segments.


Figure 2.11: Finding the optimal point



Selecting the optimal point on an ROC curve. The top dotted line is the accuracy isometric, with a slope of 2/3. The lower isometric doubles the value (or prevalence) of negatives, and allows a choice of thresholds. By intersecting the isometrics with the descending diagonal we can read off the achieved accuracy on the y-axis.

    Class probability estimation

Figure 2.12: Probability estimation tree

A probability estimation tree derived from the feature tree in Figure 1.4.

Example 2.6: Squared error
Suppose one model predicts (0.70, 0.10, 0.20) for a particular example x in a three-class task, while another appears much more certain by predicting (0.99, 0, 0.01).
• If the first class is the actual class, the second prediction is clearly better than the first: the SE of the first prediction is ((0.70-1)² + (0.10-0)² + (0.20-0)²)/2 = 0.07, while for the second prediction it is ((0.99-1)² + (0-0)² + (0.01-0)²)/2 = 0.0001. The first model gets punished more because, although mostly right, it isn't quite sure of it.
• However, if the third class is the actual class, the situation is reversed: now the SE of the first prediction is ((0.70-0)² + (0.10-0)² + (0.20-1)²)/2 = 0.57, and of the second ((0.99-0)² + (0-0)² + (0.01-1)²)/2 = 0.98. The second model gets punished more for not just being wrong, but being presumptuous.
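These squared-error calculations (with the factor 1/2 used above) can be reproduced directly:

```python
def squared_error(pred, actual):
    """SE = 1/2 * sum_i (p_i - I[i == actual])^2, as in the example above."""
    return sum((p - (1 if i == actual else 0)) ** 2
               for i, p in enumerate(pred)) / 2

cautious  = (0.70, 0.10, 0.20)
confident = (0.99, 0.00, 0.01)

# If class 0 is the actual class, the confident model wins ...
assert round(squared_error(cautious, 0), 2) == 0.07
assert round(squared_error(confident, 0), 4) == 0.0001
# ... but if class 2 is actual, it is punished for being presumptuous.
assert round(squared_error(cautious, 2), 2) == 0.57
assert round(squared_error(confident, 2), 2) == 0.98
```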

Figure 2.13: ROC convex hull

(left) The solid red line is the convex hull of the dotted ROC curve. (right) The corresponding calibration map in red: the plateaus correspond to several examples being mapped to the same segment of the convex hull, and linear interpolation between example scores occurs when we transition from one convex hull segment to the next. A Laplace-corrected calibration map is indicated by the dashed line in blue; Laplace smoothing compresses the range of calibrated probabilities but can sometimes affect the ranking.

Beyond binary classification
Handling more than two classes
Regression
Unsupervised and descriptive learning

Handling more than two classes

Example 3.1: Performance of multi-class classifiers I

The accuracy of this classifier is (15 + 15 + 45)/100 = 0.75.
• We can calculate per-class precision and recall: for the first class this is 15/24 = 0.63 and 15/20 = 0.75 respectively, for the second class 15/20 = 0.75 and 15/30 = 0.50, and for the third class 45/56 = 0.80 and 45/50 = 0.90.

Example 3.1: Performance of multi-class classifiers II
• We could average these numbers to obtain single precision and recall numbers for the whole classifier, or we could take a weighted average, taking the proportion of each class into account. For instance, the weighted average precision is 0.2 · 0.63 + 0.3 · 0.75 + 0.5 · 0.80 = 0.75.

• Another possibility is to perform a more detailed analysis by looking at precision and recall numbers for each pair of classes: for instance, when distinguishing the first class from the third, precision is 15/17 = 0.88 and recall is 15/18 = 0.83, while distinguishing the third class from the first these numbers are 45/48 = 0.94 and 45/47 = 0.96 (can you explain why these numbers are much higher in the latter direction?).
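All of these numbers can be verified from the underlying three-class confusion matrix, which is fully determined by the totals quoted above (rows are actual classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix reconstructed from the quoted per-class and pairwise
# numbers (row sums 20, 30, 50; column sums 24, 20, 56).
C = np.array([[15,  2,  3],
              [ 7, 15,  8],
              [ 2,  3, 45]])

accuracy  = np.trace(C) / C.sum()        # (15+15+45)/100 = 0.75
precision = np.diag(C) / C.sum(axis=0)   # 15/24, 15/20, 45/56
recall    = np.diag(C) / C.sum(axis=1)   # 15/20, 15/30, 45/50

# Weighted average precision, weighting each class by its proportion.
weights = C.sum(axis=1) / C.sum()        # 0.2, 0.3, 0.5
weighted_precision = weights @ precision

assert accuracy == 0.75
assert np.allclose(recall, [0.75, 0.50, 0.90])
assert round(float(weighted_precision), 2) == 0.75
```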

Example 3.5: Multi-class AUC I
Assume we have a multi-class scoring classifier that produces a k-vector of scores ŝ(x) = (ŝ1(x), . . . , ŝk(x)) for each test instance x.
• By restricting attention to ŝi(x) we obtain a scoring classifier for class Ci against the other classes, and we can calculate the one-versus-rest AUC for Ci in the normal way.
• By way of example, suppose we have three classes, and the one-versus-rest AUCs come out as 1 for the first class, 0.8 for the second class and 0.6 for the third class. Thus, for instance, all instances of class 1 receive a higher first entry in their score vectors than any of the instances of the other two classes.
• The average of these three AUCs is 0.8, which reflects the fact that, if we uniformly choose an index i, select an instance x uniformly among class Ci and another instance x' uniformly among all instances not from Ci, then the expectation that ŝi(x) > ŝi(x') is 0.8.

Example 3.5: Multi-class AUC II
Suppose now C1 has 10 instances, C2 has 20 and C3 70.
• The weighted average of the one-versus-rest AUCs is then 0.68: that is, if we uniformly choose x without reference to the class, and then choose x' uniformly from among all instances not of the same class as x, the expectation that ŝi(x) > ŝi(x') is 0.68 (where Ci is the class of x).
• This is lower than before, because it is now more likely that a random x comes from class C3, whose scores do a worse ranking job.
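The unweighted and weighted averages of the one-versus-rest AUCs can be checked in a few lines:

```python
# One-versus-rest AUCs from the example above.
aucs = [1.0, 0.8, 0.6]

def weighted_avg_auc(aucs, sizes):
    """Average the per-class AUCs, weighting each class by its size."""
    total = sum(sizes)
    return sum(a * s / total for a, s in zip(aucs, sizes))

# Equal class sizes reproduce the unweighted average of 0.8 ...
assert abs(weighted_avg_auc(aucs, [1, 1, 1]) - 0.8) < 1e-9
# ... while the skewed sizes 10, 20, 70 pull it down to 0.68.
assert abs(weighted_avg_auc(aucs, [10, 20, 70]) - 0.68) < 1e-9
```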

One-versus-one AUCs
We can obtain similar averages from one-versus-one AUCs.
• For instance, we can define AUCij as the AUC obtained using scores ŝi to rank instances from classes Ci and Cj. Notice that ŝj may rank these instances differently, and so AUCji ≠ AUCij.
• Taking an unweighted average over all i ≠ j estimates the probability that, for uniformly chosen classes i and j ≠ i and uniformly chosen x ∈ Ci and x' ∈ Cj, we have ŝi(x) > ŝi(x').
• The weighted version of this estimates the probability that the instances are correctly ranked if we don't pre-select the class.


Example 3.8: Line fitting example
Consider the following set of five points:

We want to estimate y by means of a polynomial in x. Figure 3.2 (left) shows the result for degrees 1 to 5 using linear regression, which will be explained in Chapter 7. The top two degrees fit the given points exactly (in general, any set of n points can be fitted by a polynomial of degree no more than n-1), but they differ considerably at the extreme ends: e.g., the polynomial of degree 4 leads to a decreasing trend from x = 0 to x = 1, which is not really justified by the data.
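The exact-fit claim is easy to demonstrate with least-squares polynomial fitting; the five (x, y) points below are made up, since the original five are not reproduced here:

```python
import numpy as np

# Five hypothetical data points.
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.array([1.0, 0.6, 0.8, 0.3, 0.4])

# Fit polynomials of increasing degree and report the worst residual.
for degree in (1, 2, 3, 4):
    coeffs = np.polyfit(x, y, degree)
    residual = np.abs(np.polyval(coeffs, x) - y).max()
    print(degree, residual)

# n = 5 points can always be fitted exactly by a degree n-1 = 4 polynomial.
exact = np.polyfit(x, y, 4)
assert np.allclose(np.polyval(exact, x), y)
```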

Figure 3.2: Fitting polynomials to data

(left) Polynomials of different degree fitted to a set of five points. From bottom to top in the top right-hand corner: degree 1 (straight line), degree 2 (parabola), degree 3, degree 4 (which is the lowest degree able to fit the points exactly), degree 5. (right) A piecewise constant function learned by a grouping model; the dotted reference line is the linear function from the left figure.

Figure 3.3: Bias and variance



A dartboard metaphor illustrating the concepts of bias and variance. Each dartboard corresponds to a different learning algorithm, and each dart signifies a different training sample. The learning algorithms in the top row exhibit low bias, staying close to the bull's eye (the true function value for a particular x) on average, while the ones on the bottom row have high bias. The left column shows low variance and the right column high variance.