MLT Document Format

8/10/2019 MLT Document Format
Part of the book is available online at http://books.google.co.in/books?id=Ofp4h_oXsZ4C&pg=PA252&lpg=PA252&dq=silhouette+in+distance+based+modeling&source=bl&ots=XJpYpn8oKO&sig=F1mm3CLXtMRGBvelRJGtoFVTcOE&hl=en&sa=X&ei=dihkVOuuB9efugSu_4GoDw&ved=0CDYQ6AEwAw#v=onepage&q=silhouette%20in%20distance%20based%20modeling&f=true
Table 1.1: A small training set for SpamAssassin
The columns marked x1 and x2 indicate the results of two tests on four different e-mails. The fourth column indicates which of the e-mails are spam. The right-most column demonstrates that by thresholding the function 4x1 + 4x2 at 5 we can separate spam from ham.
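The thresholding rule can be sketched directly; the per-e-mail test outcomes below are illustrative stand-ins, since Table 1.1 itself is not reproduced in this transcript:

```python
# Sketch of the SpamAssassin-style decision rule from the example: an e-mail
# is classified as spam when 4*x1 + 4*x2 exceeds the threshold 5.
# The (x1, x2) test outcomes below are hypothetical, not the rows of Table 1.1.

def classify(x1, x2, threshold=5):
    """Return 'spam' iff the weighted sum of the two test results exceeds 5."""
    return "spam" if 4 * x1 + 4 * x2 > threshold else "ham"

emails = [(1, 1), (1, 0), (0, 1), (0, 0)]            # hypothetical test results
decisions = [classify(x1, x2) for x1, x2 in emails]  # ['spam', 'ham', 'ham', 'ham']
```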
Figure 1: Linear classification in two dimensions
The straight line separates the positives from the negatives. It is defined by w·xi = t, where w is a vector perpendicular to the decision boundary and pointing in the direction of the positives, t is the decision threshold, and xi points to a point on the decision boundary. In particular, x0 points in the same direction as w, from which it follows that w·x0 = ||w|| ||x0|| = t (||x|| denotes the length of the vector x).
Figure 2: Machine learning for spam filtering
At the top we see how SpamAssassin approaches the spam e-mail classification task: the text of each e-mail is converted into a data point by means of SpamAssassin's built-in tests, and a linear classifier is applied to obtain a 'spam or ham' decision. At the bottom (in blue) we see the bit that is done by machine learning.
Figure 3: How machine learning helps to solve a task
An overview of how machine learning is used to address a given task. A task (red box) requires an appropriate mapping (a model) from data described by features to outputs. Obtaining such a mapping from training data is what constitutes a learning problem (blue box).

The ingredients of machine learning
- Tasks: the problems that can be solved with machine learning
- Models: the output of machine learning
- Features: the workhorses of machine learning

Tasks for machine learning
- Binary and multi-class classification: categorical target
- Regression: numerical target
- Clustering: hidden target
- Finding underlying structure

Looking for Structure I
Consider the following matrix:
Imagine these represent ratings by six different people (in rows), on a scale of 0 to 3, of four different films, say The Shawshank Redemption, The Usual Suspects, The Godfather, The Big Lebowski (in columns, from left to right). The Godfather seems to be the most popular of the four with an average rating of 1.5, and The Shawshank Redemption the least appreciated with an average rating of 0.5. Can you see any structure in this matrix?

Looking for Structure II
The right-most matrix associates films (in columns) with genres (in rows): The Shawshank Redemption and The Usual Suspects belong to two different genres, say drama and crime, The Godfather belongs to both, and The Big Lebowski is a crime film and also introduces a new genre (say comedy). The tall, 6-by-3 matrix then expresses people's preferences in terms of genres. Finally, the middle matrix states that the crime genre is twice as important as the other two genres in terms of determining people's preferences.
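That decomposition can be checked with a few lines of code. The three factors below follow the genre structure just described, but their exact entries are illustrative, since the matrices in the figure are not reproduced in this transcript:

```python
# Ratings ~ (people x genres) x (genre importance) x (genres x films).
# Genre rows/columns: drama, crime, comedy; film columns: The Shawshank
# Redemption, The Usual Suspects, The Godfather, The Big Lebowski.
# Entries are illustrative stand-ins for the figure.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

people_by_genre = [          # 6 people x 3 genres
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [1, 1, 0], [0, 1, 1], [1, 0, 1],
]
importance = [[1, 0, 0], [0, 2, 0], [0, 0, 1]]   # crime counts twice
genre_by_film = [            # 3 genres x 4 films
    [1, 0, 1, 0],            # drama:  Shawshank and The Godfather
    [0, 1, 1, 1],            # crime:  Usual Suspects, Godfather, Lebowski
    [0, 0, 0, 1],            # comedy: The Big Lebowski
]

ratings = matmul(matmul(people_by_genre, importance), genre_by_film)
```

The pure-crime fan in row 2 ends up rating every crime film 2, reflecting the doubled weight of that genre.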
Machine learning settings

The rows refer to whether the training data is labelled with a target variable, while the columns indicate whether the models learned are used to predict a target variable or rather describe the given data.

Machine learning models
Machine learning models can be distinguished according to their main intuition:
- Geometric models use intuitions from geometry such as separating (hyper-)planes, linear transformations and distance metrics.
- Probabilistic models view learning as a process of reducing uncertainty.
- Logical models are defined in terms of logical expressions.
Alternatively, they can be characterised by their modus operandi:
- Grouping models divide the instance space into segments; in each segment a very simple (e.g., constant) model is learned.
- Grading models learn a single, global model over the instance space.
Figure 1.1: Basic linear classifier
The basic linear classifier constructs a decision boundary by half-way intersecting the line between the positive and negative centres of mass. It is described by the equation w·x = t, with w = p − n; the decision threshold can be found by noting that (p + n)/2 is on the decision boundary, and hence t = (p − n)·(p + n)/2 = (||p||² − ||n||²)/2, where ||x|| denotes the length of vector x.
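A minimal sketch of this construction, on hypothetical two-dimensional data (the figure's actual data is not reproduced here):

```python
# Basic linear classifier: w = p - n (difference of the class means), with
# threshold t = (||p||^2 - ||n||^2) / 2, so that the midpoint (p + n)/2 lies
# exactly on the decision boundary. The training points are hypothetical.

def mean(points):
    """Component-wise mean of a list of points."""
    return [sum(xs) / len(points) for xs in zip(*points)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train(pos, neg):
    p, n = mean(pos), mean(neg)
    w = [pi - ni for pi, ni in zip(p, n)]
    t = (dot(p, p) - dot(n, n)) / 2
    return w, t

def predict(w, t, x):
    return "+" if dot(w, x) > t else "-"

pos = [(3.0, 3.0), (4.0, 4.0)]   # hypothetical positives, mean p = (3.5, 3.5)
neg = [(1.0, 1.0), (2.0, 2.0)]   # hypothetical negatives, mean n = (1.5, 1.5)
w, t = train(pos, neg)           # w = [2.0, 2.0], t = 10.0
```

The midpoint (2.5, 2.5) satisfies w·x = t exactly, confirming that the boundary passes half-way between the two centres of mass.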
Figure 1.2: Support vector machine
The decision boundary learned by a support vector machine from the linearly separable data of Figure 1.1. The decision boundary maximises the margin, which is indicated by the dotted lines. The circled data points are the support vectors.

Table 1.2: A simple probabilistic model
'Viagra' and 'lottery' are two Boolean features; Y is the class variable, with values 'spam' and 'ham'. In each row the most likely class is indicated in bold.

Example 1.3: Posterior odds
Using a MAP decision rule we predict ham in the top two cases and spam in the bottom two. Given that the full posterior distribution is all there is to know about the domain in a statistical sense, these predictions are the best we can do: they are Bayes-optimal.

Table 1.3: Example marginal likelihoods

Example 1.4: Using marginal likelihoods
Using the marginal likelihoods from Table 1.3, we can approximate the likelihood ratios (the previously calculated odds from the full posterior distribution are shown in brackets):
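The approximation itself is easy to reproduce. The marginal likelihoods below are hypothetical, since Table 1.3 is not legible in this transcript; the point is the naive multiplication of per-feature ratios:

```python
# Hypothetical marginal likelihoods P(feature = 1 | class); the real values
# live in Table 1.3, which is not reproduced in this transcript.
p_viagra = {"spam": 0.40, "ham": 0.12}    # P(Viagra = 1 | class)
p_lottery = {"spam": 0.21, "ham": 0.13}   # P(lottery = 1 | class)

def likelihood_ratio(viagra, lottery):
    """Approximate P(x|spam)/P(x|ham) as a product of marginal ratios."""
    ratio = 1.0
    for present, table in ((viagra, p_viagra), (lottery, p_lottery)):
        ps, ph = table["spam"], table["ham"]
        ratio *= ps / ph if present else (1 - ps) / (1 - ph)
    return ratio

# Maximum likelihood decision rule: predict spam iff the ratio exceeds 1.
```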

We see that, using a maximum likelihood decision rule, our very simple model arrives at the Bayes-optimal prediction in the first three cases, but not in the fourth ('Viagra' and 'lottery' both present), where the marginal likelihoods are actually very misleading.

Figure 1.3: The Scottish classifier
(top) Visualisation of two marginal likelihoods as estimated from a small data set. The colours indicate whether the likelihood points to spam or ham. (bottom) Combining the two marginal likelihoods gives a pattern not unlike that of a Scottish tartan.

Figure 1.4: A feature tree
(left) A feature tree combining two Boolean features. Each internal node or split is labelled with a feature,

and each edge emanating from a split is labelled with a feature value. Each leaf therefore corresponds to a unique combination of feature values. Also indicated in each leaf is the class distribution derived from the training set. (right) A feature tree partitions the instance space into rectangular regions, one for each leaf. We can clearly see that the majority of ham lives in the lower left-hand corner.

Example 1.5: Labelling a feature tree
- The leaves of the tree in Figure 1.4 could be labelled, from left to right, as ham, spam, spam, employing a simple decision rule called majority class.
- Alternatively, we could label them with the proportion of spam e-mail occurring in each leaf: from left to right, 1/3, 2/3, and 4/5.
- Or, if our task was a regression task, we could label the leaves with predicted real values or even linear functions of some other, real-valued features.
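The first two labelling options can be sketched as follows; the per-leaf class counts are chosen to match the proportions quoted above (1/3, 2/3 and 4/5), since the tree in Figure 1.4 is not reproduced here:

```python
# Leaf class counts (spam, ham), left to right; chosen so that the spam
# proportions come out as 1/3, 2/3 and 4/5 as quoted in the example.
leaves = [(20, 40), (10, 5), (20, 5)]

majority = ["spam" if s > h else "ham" for s, h in leaves]   # majority class rule
proportions = [s / (s + h) for s, h in leaves]               # probability labels
```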
Figure 1.7: Mapping machine learning models
A 'map' of some of the models that will be considered in this book. Models that share characteristics are plotted closer together: logical models to the right, geometric models on the top left and probabilistic models on the bottom left. The horizontal dimension roughly ranges from grading models on the left to grouping models on the right.

Figure 1.8: ML taxonomy

A taxonomy describing machine learning methods in terms of the extent to which they are grading or grouping models, logical, geometric or a combination, and supervised or unsupervised. The colours indicate the type of model, from left to right: logical (red), probabilistic (orange) and geometric (purple).

Features: the workhorses of machine learning
Example 1.7: The MLM data set
Suppose we have a number of learning models that we want to describe in terms of a number of properties:
- the extent to which the models are geometric, probabilistic or logical;
- whether they are grouping or grading models;
- the extent to which they can handle discrete and/or real-valued features;
- whether they are used in supervised or unsupervised learning; and
- the extent to which they can handle multi-class problems.
The first two properties could be expressed by discrete features with three and two values, respectively; or, if the distinctions are more gradual, each aspect could be rated on some numerical scale. A simple approach would be to measure each property on an integer scale from 0 to 3, as in Table 1.4. This table establishes a data set in which each row represents an instance and each column a feature.

Table 1.4: The MLM data set

The MLM data set describing properties of machine learning models. Both Figure 1.7 and Figure 1.8 were generated from this data.

Figure 1.9: A small regression tree
(left) A regression tree combining a one-split feature tree with linear regression models in the leaves. Notice how x is used as both a splitting feature and a regression variable. (right) The function y = cos πx on the interval −1 ≤ x ≤ 1, and the piecewise linear approximation achieved by the regression tree.

Figure 1.10: Class-sensitive discretisation

(left) Artificial data depicting a histogram of body weight measurements of people with (blue) and without (red) diabetes, with eleven fixed intervals of 10 kilograms width each. (right) By joining the first and second, third and fourth, fifth and sixth, and the eighth, ninth and tenth intervals, we obtain a discretisation such that the proportion of diabetes cases increases from left to right. This discretisation makes the feature more useful in predicting diabetes.

Example 1.9: The kernel trick
That is, by squaring the dot product in the original space we obtain the dot product in the new space without actually constructing the feature vectors! A function that calculates the dot product in feature space directly from the vectors in the original space is called a kernel; here the kernel is κ(x1, x2) = (x1 · x2)².
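For this quadratic kernel the claim is easy to verify in two dimensions: (x1 · x2)² equals an ordinary dot product after the feature map φ(x, y) = (x², y², √2·xy), which never has to be constructed when the kernel is used:

```python
import math

# Quadratic kernel k(a, b) = (a . b)^2 versus the explicit feature map
# phi(x, y) = (x^2, y^2, sqrt(2)*x*y): both give the same number, but the
# kernel never builds the 3-D feature vectors. Test points are arbitrary.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kernel(a, b):
    return dot(a, b) ** 2

def phi(p):
    x, y = p
    return (x * x, y * y, math.sqrt(2) * x * y)

a, b = (1.0, 2.0), (3.0, -1.0)
lhs = kernel(a, b)          # computed in the original 2-D space
rhs = dot(phi(a), phi(b))   # computed in the explicit 3-D feature space
```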
Figure 1.11: Non-linearly separable data
(left) A linear classifier would perform poorly on this data. (right) By transforming the original (x, y) data into (x′, y′) = (x², y²), the data becomes more 'linear', and a linear decision boundary x′ + y′ = 3 separates the data fairly well. In the original space this corresponds to a circle with radius √3 around the origin.

Binary classification and related tasks
- Classification

- Scoring and ranking
- Class probability estimation

Table 2.1: Predictive machine learning scenarios

Classification
(left) A feature tree with training set class distribution in the leaves. (right) A decision tree obtained using the majority class decision rule.

Table 2.2: Contingency table

(left) A two-class contingency table or confusion matrix depicting the performance of the decision tree in Figure 2.1. Numbers on the descending diagonal indicate correct predictions, while the ascending diagonal concerns prediction errors. (right) A contingency table with the same marginals but independent rows and columns.

Example 2.1: Accuracy as a weighted average
Suppose a classifier's predictions on a test set are as in the following table:
From this table, we see that the true positive rate is tpr = 60/75 = 0.80 and the true negative rate is tnr = 15/25 = 0.60. The overall accuracy is acc = (60 + 15)/100 = 0.75, which is no longer the average of true positive and negative rates. However, taking into account the proportion of positives pos = 0.75 and the proportion of negatives neg = 1 − pos = 0.25, we see that acc = pos · tpr + neg · tnr.
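The arithmetic of this example can be checked directly, using the contingency-table entries implied by the quoted rates:

```python
# Example 2.1's contingency table, reconstructed from the quoted rates:
# tpr = 60/75 and tnr = 15/25 imply TP = 60, FN = 15, TN = 15, FP = 10.
TP, FN, TN, FP = 60, 15, 15, 10
total = TP + FN + TN + FP            # 100 test instances

tpr = TP / (TP + FN)                 # 0.80
tnr = TN / (TN + FP)                 # 0.60
acc = (TP + TN) / total              # 0.75
pos = (TP + FN) / total              # 0.75
neg = 1 - pos                        # 0.25

weighted = pos * tpr + neg * tnr     # equals acc
```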
Figure 2.2: A coverage plot
(left) A coverage plot depicting the two contingency tables in Table 2.2. The plot is square because the class distribution is uniform. (right) Coverage plot for Example 2.1, with a class ratio clr = 3.

Figure 2.3: An ROC plot

(left) C1 and C3 both dominate C2, but neither dominates the other. The diagonal line indicates that C1 and C3 achieve equal accuracy. (right) The same plot with normalised axes. We can interpret this plot as a merger of the two coverage plots in Figure 2.2, employing normalisation to deal with the different class distributions. The diagonal line now indicates that C1 and C3 have the same average recall.

Figure 2.4: Comparing coverage and ROC plots
(left) In a coverage plot, accuracy isometrics have a slope of 1, and average recall isometrics are parallel to the ascending diagonal. (right) In the corresponding ROC plot, average recall isometrics have a slope of 1; the accuracy isometric here has a slope of 3, corresponding to the ratio of negatives to positives in the data set.
Scoring and ranking
Figure 2.5: A scoring tree

(left) A feature tree with training set class distribution in the leaves. (right) A scoring tree using the logarithm of the class ratio as scores; spam is taken as the positive class.

Example 2.2: Ranking example
The scoring tree in Figure 2.5 produces the following ranking:

(left) Each cell in the grid denotes a unique pair of one positive and one negative example: the green cells indicate pairs that are correctly ranked by the classifier, the red cells represent ranking errors, and the orange cells are half-errors due to ties. (right) The coverage curve of a tree-based scoring classifier has one line segment for each leaf of the tree, and one (FP, TP) pair for each possible threshold on the score.

Example 2.4: Class imbalance
Suppose we feed the scoring tree in Figure 2.5 an extended test set, with an additional batch of 50 negatives.
- The added negatives happen to be identical to the original ones, so the net effect is that the number of negatives in each leaf doubles.
- As a result the coverage curve changes (because the class ratio changes), but the ROC curve stays the same (Figure 2.8).
- Note that the AUC stays the same as well: while the classifier makes twice as many ranking errors, there are also twice as many positive-negative pairs, so the ranking error rate doesn't change.
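The invariance claimed in the last point can be sketched by counting ranked pairs directly; the scores below are hypothetical:

```python
# AUC as the fraction of correctly ranked positive-negative pairs, with ties
# counting as half-errors. Doubling the negatives doubles both the ranking
# errors and the number of pairs, so the AUC is unchanged. Scores are
# illustrative, not the ones produced by the scoring tree in the text.

def auc(pos_scores, neg_scores):
    total = len(pos_scores) * len(neg_scores)
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / total

pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3]
before = auc(pos, neg)
after = auc(pos, neg * 2)   # an identical batch of negatives added
```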
Figure 2.8: Class imbalance
(left) A coverage curve obtained from a test set with class ratio clr = 1/2. (right) The corresponding ROC curve is the same as the one corresponding to the coverage curve in Figure 2.7 (right).

Figure 2.9: Rankings from grading classifiers
(left) A linear classifier induces a ranking by taking the signed distance to the decision boundary as the score. This ranking only depends on the orientation of the decision boundary: the three lines result in exactly the same ranking. (right) The grid of correctly ranked positive-negative pairs (in green) and ranking errors (in red).

Figure 2.10: Coverage curve of a grading classifier
The coverage curve of the linear classifier in Figure 2.9. The points labelled A, B and C indicate the classification performance of the corresponding decision boundaries. The dotted lines indicate the improvement that can be obtained by turning the grading classifier into a grouping classifier with four segments.

Figure 2.11: Finding the optimal point

Selecting the optimal point on an ROC curve. The top dotted line is the accuracy isometric, with a slope of 2/3. The lower isometric doubles the value (or prevalence) of negatives, and allows a choice of thresholds. By intersecting the isometrics with the descending diagonal we can read off the achieved accuracy on the y-axis.

Class probability estimation
Figure 2.12: Probability estimation tree
A probability estimation tree derived from the feature tree in Figure 1.4.

Example 2.6: Squared error
Suppose one model predicts (0.70, 0.10, 0.20) for a particular example x in a three-class task, while another appears much more certain by predicting (0.99, 0, 0.01).
- If the first class is the actual class, the second prediction is clearly better than the first: the SE of the first prediction is ((0.70 − 1)² + (0.10 − 0)² + (0.20 − 0)²)/2 = 0.07, while for the second prediction it is ((0.99 − 1)² + (0 − 0)² + (0.01 − 0)²)/2 = 0.0001. The first model gets punished more because, although mostly right, it isn't quite sure of it.
- However, if the third class is the actual class, the situation is reversed: now the SE of the first prediction is ((0.70 − 0)² + (0.10 − 0)² + (0.20 − 1)²)/2 = 0.57, and of the second ((0.99 − 0)² + (0 − 0)² + (0.01 − 1)²)/2 = 0.98. The second model gets punished more for not just being wrong, but being presumptuous.
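These squared errors are easy to recompute:

```python
# Recomputing Example 2.6: SE = (1/2) * sum_i (p_i - I[i == true class])^2.

def squared_error(pred, true_idx):
    return sum((p - (1.0 if i == true_idx else 0.0)) ** 2
               for i, p in enumerate(pred)) / 2

cautious = (0.70, 0.10, 0.20)     # the first model's prediction
confident = (0.99, 0.00, 0.01)    # the second, much more certain model

se_first_true1 = squared_error(cautious, 0)     # 0.07
se_second_true1 = squared_error(confident, 0)   # 0.0001
se_first_true3 = squared_error(cautious, 2)     # 0.57
se_second_true3 = squared_error(confident, 2)   # 0.9801
```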
Figure 2.13: ROC convex hull
(left) The solid red line is the convex hull of the dotted ROC curve. (right) The corresponding calibration map in red: the plateaus correspond to several examples being mapped to the same segment of the convex hull, and linear interpolation between example scores occurs when we transition from one convex hull segment to the next. A Laplace-corrected calibration map is indicated by the dashed line in blue: Laplace smoothing compresses the range of calibrated probabilities but can sometimes affect the ranking.
Beyond binary classification
- Handling more than two classes
- Regression
- Unsupervised and descriptive learning

Handling more than two classes
Example 3.1: Performance of multi-class classifiers I
The accuracy of this classifier is (15 + 15 + 45)/100 = 0.75.
- We can calculate per-class precision and recall: for the first class this is 15/24 = 0.63 and 15/20 = 0.75 respectively, for the second class 15/20 = 0.75 and 15/30 = 0.50, and for the third class 45/56 = 0.80 and 45/50 = 0.90.
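These numbers follow from the diagonal and the marginals of the three-class contingency table, all of which are quoted above:

```python
# Recomputing Example 3.1 from the quantities quoted in the text: the diagonal
# of the confusion matrix and its row/column marginals (true and predicted
# counts per class).
diagonal = [15, 15, 45]     # correct predictions per class
actual = [20, 30, 50]       # row sums: true class counts
predicted = [24, 20, 56]    # column sums: predicted class counts

accuracy = sum(diagonal) / sum(actual)                    # 0.75
precision = [d / p for d, p in zip(diagonal, predicted)]  # per-class precision
recall = [d / a for d, a in zip(diagonal, actual)]        # per-class recall
```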
Example 3.1: Performance of multi-class classifiers II
- We could average these numbers to obtain single precision and recall numbers for the whole classifier, or we could take a weighted average taking the proportion of each class into account. For instance, the weighted average precision is 0.20 · 0.63 + 0.30 · 0.75 + 0.50 · 0.80 = 0.75.
- Another possibility is to perform a more detailed analysis by looking at precision and recall numbers for each pair of classes: for instance, when distinguishing the first class from the third, precision is 15/17 = 0.88 and recall is 15/18 = 0.83, while distinguishing the third class from the first these numbers are 45/48 = 0.94 and 45/47 = 0.96 (can you explain why these numbers are much higher in the latter direction?).
Example 3.5: Multi-class AUC I
Assume we have a multi-class scoring classifier that produces a k-vector of scores ŝ(x) = (ŝ1(x), . . . , ŝk(x)) for each test instance x.
- By restricting attention to ŝi(x) we obtain a scoring classifier for class Ci against the other classes, and we can calculate the one-versus-rest AUC for Ci in the normal way.
- By way of example, suppose we have three classes, and the one-versus-rest AUCs come out as 1 for the first class, 0.8 for the second class and 0.6 for the third class. Thus, for instance, all instances of class 1 receive a higher first entry in their score vectors than any of the instances of the other two classes.
- The average of these three AUCs is 0.8, which reflects the fact that, if we uniformly choose an index i, and we select an instance x uniformly among class Ci and another instance x′ uniformly among all instances not from Ci, then the expectation that ŝi(x) > ŝi(x′) is 0.8.

Example 3.5: Multi-class AUC II
Suppose now C1 has 10 instances, C2 has 20 and C3 has 70.
- The weighted average of the one-versus-rest AUCs is then 0.68: that is, if we uniformly choose an instance x without reference to the class, and then choose x′ uniformly from among all instances not of the same class as x, the expectation that ŝi(x) > ŝi(x′) is 0.68.
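The two averages can be recomputed in a couple of lines:

```python
# Unweighted vs weighted one-versus-rest AUC, with the per-class AUCs and
# instance counts from Example 3.5.
aucs = {"C1": 1.0, "C2": 0.8, "C3": 0.6}
counts = {"C1": 10, "C2": 20, "C3": 70}

unweighted = sum(aucs.values()) / len(aucs)                # 0.8
total = sum(counts.values())
weighted = sum(aucs[c] * counts[c] / total for c in aucs)  # 0.68
```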

- This is lower than before, because it is now more likely that a random x comes from class C3, whose scores do a worse ranking job.

One-versus-one AUCs
We can obtain similar averages from one-versus-one AUCs.
- For instance, we can define AUCij as the AUC obtained using scores ŝi to rank instances from classes Ci and Cj. Notice that ŝj may rank these instances differently, and so AUCji may differ from AUCij.
- Taking an unweighted average over all i ≠ j estimates the probability that, for uniformly chosen classes i and j ≠ i and uniformly chosen instances x ∈ Ci and x′ ∈ Cj, we have ŝi(x) > ŝi(x′).
- The weighted version of this estimates the probability that the instances are correctly ranked if we don't pre-select the class.
Regression
Example 3.8: Line fitting example
Consider the following set of five points:
We want to estimate y by means of a polynomial in x. Figure 3.2 (left) shows the result for degrees 1 to 5 using linear regression, which will be explained in Chapter 7. The top two degrees fit the given points exactly (in general, any set of n points can be fitted by a polynomial of degree no more than n − 1), but they differ considerably at the extreme ends: e.g., the polynomial of degree 4 leads to a decreasing trend from x = 0 to x = 1, which is not really justified by the data.
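The parenthetical claim, that n points determine a polynomial of degree at most n − 1, can be illustrated with Lagrange interpolation; the five points below are hypothetical, since the example's table is not reproduced in this transcript:

```python
# Lagrange interpolation: the unique polynomial of degree <= n-1 through n
# points passes through every one of them exactly. The points are hypothetical.

def lagrange(points, x):
    """Evaluate at x the interpolating polynomial through the given points."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

points = [(0.0, 0.5), (0.25, 0.3), (0.5, 0.7), (0.75, 0.6), (1.0, 0.2)]
fits_exactly = all(abs(lagrange(points, xi) - yi) < 1e-9 for xi, yi in points)
```

Between the nodes, however, a degree-4 interpolant is free to oscillate, which is exactly the behaviour the example warns about at the extreme ends.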
Figure 3.2: Fitting polynomials to data
(left) Polynomials of different degree fitted to a set of five points. From bottom to top in the top right-hand corner: degree 1 (straight line), degree 2 (parabola), degree 3, degree 4 (which is the lowest degree able to fit the points exactly), degree 5. (right) A piecewise constant function learned by a grouping model; the dotted reference line is the linear function from the left figure.

Figure 3.3: Bias and variance

A dartboard metaphor illustrating the concepts of bias and variance. Each dartboard corresponds to a different learning algorithm, and each dart signifies a different training sample. The top row learning algorithms exhibit low bias, staying close to the bull's-eye (the true function value for a particular x) on average, while the ones on the bottom row have high bias. The left column shows low variance and the right column high variance.