
Machine Learning Part I

• Machine Learning I
– Machine Learning Introduction
– Induction of Decision Trees
– Perceptron and Simple Neural Network Learning
– Naïve Bayesian Learning
– Instance Based Learning

• Textbook: Data Mining by Witten and Frank


Learning: Mature Technology

• Many Applications
– Detect fraudulent credit card transactions

– Information filtering systems that learn user preferences

– Autonomous vehicles that drive public highways (ALVINN)

– Decision trees for diagnosing heart attacks

– Speech recognition and synthesis (correct pronunciation) (NETtalk)

• Data Mining: huge datasets, scaling issues, visualization issues, software engineering issues


Defining a Learning Problem (Chapters 1 & 2)

• Experience: Training Examples

• Learning Task: target of learning

• Performance Measure: accuracy? faster?

A program is said to learn from experience E with respect to task T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

• Representation of the Target Function Approximation
• Learning Algorithm


Example: “The Weather Problem” (page 9)

• Attributes of instances
– Wind
– Temperature
– Humidity
– Outlook

• Feature = attribute with one value
– E.g. outlook = sunny

• Sample instance = record = example = …
– wind=weak, temp=hot, humidity=high, outlook=sunny, play=yes


Experience: “Good day for tennis”

• MS Excel File
• Rules that might be learned:
– If outlook = sunny and humidity = high then play = no
– If outlook = rainy and windy = true then play = no
– If outlook = overcast then play = yes
– If humidity = normal then play = yes
– If none of the above then play = yes

• These rules are called a “decision list”
• Question: is the list correct for all examples?
• Question: is each individual rule correct, on its own, for all examples?
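To make the idea concrete, here is a minimal Python sketch of how such a decision list would be applied: rules are tried in order and the first matching rule fires (the attribute names and the dictionary representation are illustrative, not from the textbook).

def classify(instance):
    # Apply the decision-list rules in order; the first matching rule decides.
    if instance["outlook"] == "sunny" and instance["humidity"] == "high":
        return "no"
    if instance["outlook"] == "rainy" and instance["windy"]:
        return "no"
    if instance["outlook"] == "overcast":
        return "yes"
    if instance["humidity"] == "normal":
        return "yes"
    return "yes"  # default rule

print(classify({"outlook": "sunny", "humidity": "high", "windy": False}))  # -> no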


Numeric Attributes?

• If outlook = sunny and humidity > 83 then play = no


Classification vs Association

• In the weather data:
– If temperature = cool then humidity = normal
– If outlook = sunny then humidity = high

• Association: statistical correlation
• Question: are the rules correct?
– For a rule “If A then B”, ask: P(B|A) = ?  P(A and B) = ?  P(A) = ?


Terminology

• Concepts and hypotheses (page 38)
– The thing to be learned
– Examples: decision lists, trees, clusters, associations, Bayes nets

• Supervised:
– Training examples: outcomes (e.g. play or not) are provided by a teacher
– Often used in classification and prediction
– Yes/No: the label or class

• Unsupervised: training examples are not labeled or classified by a teacher
– Clustering

• Concept space: the space of all concepts
– Sometimes ordered from specific to general


Terminology (p 46)

• Input to a machine learning algorithm
– A set of instances or examples or samples

• Positive and negative examples
• Closed World Assumption sometimes used
– Provide only positive examples
– Assume that the rest are negative examples

• Relations:
– >, <, =

• Attributes, features, columns
• Values:
– Nominal, categorical: a finite set of discrete values, e.g. sunny, overcast, rainy

– Ordinal: ordering: hot > mid > cool


Overfitting

• The learned hypothesis or concept works well on the training data but does poorly on the testing data

• This can occur if we search for complex concepts first
• Example
• Overfitting-Avoidance Bias (page 32)

– Simple-concept first in searching for a learned concept

– Sometimes called forward pruning


Bias (page 29)

• Kinds of bias
– Language bias
• Do we allow conjunctions? Universal quantifiers (for-alls)? Etc.
• Shrinks the size of the hypothesis space

– Search bias
• Ordering over hypotheses: specific-to-general, or something else?
• Use greedy search?

– Overfitting-avoidance bias


Weka Input Format

• Arff format (page 49)

• Missing Values
– Indicated by out-of-range values such as –1 or '-'
– Sometimes a missing value itself carries significant information
– Other times it can be replaced by the mean value
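For reference, a minimal ARFF file for the weather data might look like the sketch below (the attribute names follow the textbook's weather example; the numeric values and the '?' marking a missing value are only illustrative):

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, ?, false, yes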


Simple Rule Learning (p78)

• For classification (supervised)
– Thus, data is divided into attributes and a class (i.e. play)

• Idea:
– Test a single attribute and branch according to the different values of that attribute
– For each value: assign the majority class
– Select the rule (attribute) with the smallest total error rate

• Error Rate: with C = number of correctly classified instances and N = total number of instances,

E = (N - C) / N


Algorithm 1R (p79)

• For each attribute
– For each value of that attribute, make a rule:

• Count how often each class appears

• Find the most frequent class

• Make the rule assign that class to this attribute value

– Calculate the error rate of the rules

• Choose rules with the smallest error rate
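A compact sketch of 1R in Python (it assumes the data is a list of dictionaries with a "play" class attribute; the function and variable names are illustrative, not the textbook's code):

from collections import Counter, defaultdict

def one_r(instances, class_attr="play"):
    """Return (best_attribute, rules, error_count) for a 1R classifier."""
    best = None
    attributes = [a for a in instances[0] if a != class_attr]
    for attr in attributes:
        # Count how often each class appears for each value of this attribute.
        counts = defaultdict(Counter)
        for inst in instances:
            counts[inst[attr]][inst[class_attr]] += 1
        # Rule: each value predicts its majority class.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in counts.items())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best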


Example (Table 4.1)

• Outlook (total errors 4/14)
– Sunny: no (2/5)
– Overcast: yes (0/4)
– Rainy: yes (2/5)

• Temperature (5/14)

• Humidity (4/14)

• Windy (5/14)


Missing Values and Numerical (p 80)

• In 1R: generate a separate attribute value called “missing”

• Numerical attributes
– Discretization: divide temperature into low, mid, high
– Where to draw the lines?
– Answer: based on how the classes are distributed
– Aim: the predicted class in each interval is the majority class!


Training and Testing

• Question: is it possible to learn a rule that gets no example outside the training set correct?

• Validation: divide the data into two sets
– Training set: used to build the model
• Example: train a 1R classification rule
– Testing set: used to evaluate the error rate
• Example: apply the 1R rule to all instances in the testing set
– Overfitting: small error rate on training, big on testing
• Example: two attributes: StudentID, Play
– Generally, overfitting in 1R occurs when…
• Hint: the cause of overfitting is usually not enough training examples for each branch
• Thus, if a rule has … then each branch has few examples.


More on Model Evaluation (Chap 5)

• Error rate alone may not always be useful
– How do you distinguish between testdata1 with 1000 instances and testdata2 with 100 instances, both with a 25% error rate?

• Answer: confidence interval

P(-z ≤ X ≤ z) = c

[Figure: distribution of the number of errors on future tests; a 75% confidence interval spans roughly 2 to 10 errors around the observed 6.]

Example: 100 test instances, 6 classified wrong, so the observed error rate is 6%. With 75% confidence, the actual number of errors on future tests of the same size lies between 2 and 10.


More on confidence interval

• P = a function of
– C: the confidence level
– N: the number of test instances
– F: the observed (mean) error rate

• Example:
– F = 75%, N = 1000, C = 95%: [0.733, 0.768]
– F = 75%, N = 100, C = 95%: [0.70, 0.81]
– N = 10: [0.65, 1.02]


Cross Validation (p 126)

• Holdout procedure:
– Hold out a certain amount of the data for testing and use the remainder for training

• Stratified holdout:
– The class distribution in the test data is representative of that in the training data

• Repeated holdout:
– Randomly select the training data, repeat several times, average the error rates

• N-fold cross-validation:
– Divide all data into n partitions ("folds")
– Use (n-1) folds for training and one fold for testing
– Repeat n times (see the sketch after this list)

• Best results in practice: 10-fold
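A minimal sketch of n-fold cross-validation in Python (it assumes instances is a list of labeled examples and that train and error_rate functions are supplied by the caller; all names are illustrative):

import random

def cross_validation_error(instances, n_folds, train, error_rate):
    """Average test error over n folds. `train` builds a model from a list of
    instances; `error_rate` evaluates a model on a list of instances."""
    data = instances[:]
    random.shuffle(data)                       # (stratification not shown)
    folds = [data[i::n_folds] for i in range(n_folds)]
    errors = []
    for i in range(n_folds):
        test_fold = folds[i]
        train_folds = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(train_folds)             # e.g. a 1R rule
        errors.append(error_rate(model, test_fold))
    return sum(errors) / n_folds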


More on validation (p 127)

• Leave-one-out
– Data set size = N
– Leave one instance out at a time as the test set
– Use the remaining N-1 instances for training; repeat N times

• Why it is good:
– No randomness
– Maximum amount of data for training

• Why it is bad:
– Testing is not stratified
– Example:


Bootstrap

• Build a training set of size n from the original data of the same size by randomly selecting instances with replacement
– Instances not picked are used for testing

• 0.632 bootstrap
– Uniform selection probability 1/n for data of size n
– An instance is never picked with probability (1 - 1/n)^n, which is about e^(-1) = 0.368
– So the training set of size n contains about 63.2% of the distinct original instances
– Those never picked form the test set (about 36.8% of the original data)
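A small sketch of drawing one bootstrap sample (Python; the function and variable names are illustrative):

import random

def bootstrap_split(instances):
    """Sample n instances with replacement as training data; instances never
    drawn become the test set (about 36.8% of the data on average)."""
    n = len(instances)
    picked = [random.randrange(n) for _ in range(n)]
    picked_set = set(picked)
    train = [instances[i] for i in picked]
    test = [inst for i, inst in enumerate(instances) if i not in picked_set]
    return train, test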


Naïve Bayesian Learning

• Chapter 4.2 (page 82)
• Instead of computing rules, use training data statistics as the learned model
– Assumption: attributes are independent of each other given the hypothesis (e.g. play = yes)

• Example: test instance x = <outlook = sunny>
– Likelihood of “yes” = P(outlook=sunny | play=yes) * P(play=yes) = (2/9) * (9/14) = 2/14
– Likelihood of “no” = P(outlook=sunny | play=no) * P(play=no) = (3/5) * (5/14) = 3/14
– Thus, choose “no” as the answer
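The same computation as a short Python sketch, using the counts quoted on the slide (the helper name is illustrative):

def naive_bayes_likelihood(prior, cond_probs):
    """Likelihood of a class: P(class) times the product of P(attribute=value | class)."""
    p = prior
    for cp in cond_probs:
        p *= cp
    return p

# Test instance <outlook=sunny>, counts from the weather data (9 yes / 5 no).
like_yes = naive_bayes_likelihood(9 / 14, [2 / 9])   # = 2/14
like_no = naive_bayes_likelihood(5 / 14, [3 / 5])    # = 3/14
print("yes" if like_yes > like_no else "no")         # -> no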


Naïve Bayesian Learning

• In general, we want to know the value of a hypothesis H (e.g., H = (play = yes))

• Evidence = attributes = E_i, where i goes from 1 to n
• Bayes rule + independence assumption:

P(H|E) = P(E|H) * P(H) / P(E)

P(E|H) = Π_i P(E_i | H)

• However, P(E) is not known; but it is the same for every hypothesis, so it can be dropped when comparing them


Laplace Estimator

• It may happen that a particular attribute value (outlook = cloudy) does not appear in the training set with every class value (play = no).

• Then P(outlook=cloudy | play=no) = 0!
• Use the Laplace estimator (outlook ∈ {sunny, overcast, cloudy}):

Without smoothing: P(outlook | yes) = 2/9, 4/9, 3/9 for the three outlook values

With the Laplace estimator (add 1 to each count and 3 to the total): P(outlook | yes) = (2+1)/(9+3), (4+1)/(9+3), (3+1)/(9+3)
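A one-line sketch of the estimator in Python (the simple add-one form shown above is assumed):

def laplace_estimate(count, total, n_values):
    # Add 1 to the count for this value and n_values to the total count.
    return (count + 1) / (total + n_values)

print(laplace_estimate(0, 5, 3))  # P(outlook=cloudy | play=no) is no longer zero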


Weka Java Package Demo

• Confusion Matrix

• Stratified Cross-Validation

• Java Classpath, Jar files, etc.
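As one concrete illustration (the jar and data file names are placeholders, and this assumes weka.jar sits in the current directory), a classifier can be run from the command line; by default Weka evaluates it with stratified 10-fold cross-validation and prints a confusion matrix:

java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t weather.arff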


Continuous Values (Temperature)

• Use probability density function

p(temperature = 66 | yes) = (1 / (sqrt(2π) σ)) * e^( -(x - μ)² / (2σ²) ) evaluated at x = 66,
where μ and σ are estimated from the training examples of class "yes"

[Figure: the normal (Gaussian) probability density function f(x).]
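A sketch of the density computation in Python (the mean and standard deviation would be estimated from the training instances of the given class; in the weather data, temperature given play = yes has mean 73 and standard deviation of roughly 6.2):

import math

def gaussian_density(x, mu, sigma):
    """Normal probability density, used in place of a discrete count ratio."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(gaussian_density(66, 73, 6.2))   # e.g. p(temperature = 66 | yes)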


An Ecommerce Example

• In recommendation systems for movies, we have
– Properties of movies
– A log file of which movie is watched by each movie-goer
– A rating of each movie by each movie-goer

• Want to know:
– What other movies would a movie-goer like to see?


A Naïve Baysian Classifier for Texts

• Step 1: extract keywords

• Step 2: build a relational table for training

• Step 3: apply to testing examples


Text Ranking

• Question: given a set of documents and a keyword, how important is the keyword in document search?

• For a given document d and term (keyword) k:
– Term frequency tf(k, d): the number of times the term appears in the document, normalized over all words in the document
– Document frequency df(k, D), where D is the set of documents under consideration: the number of documents containing the term
– Inverse document frequency idf(k, D)

tf(k, d) = count(k, d) / sqrt( Σ_{w in d} count(w, d)² )

idf(k, D) = log( |D| / df(k, D) )


Given a query, rank the documents

• Given query keyword k, a document d in a collection of documents D, what is the score of the document d for the keyword and D?

• Keyword selection: find the set of keywords for which idf(k,D) is large

score(k, d, D) = tf(k, d) * idf(k, D)
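A small tf-idf sketch in Python, using the squared-count normalization shown above (tokenization is naive whitespace splitting; the function names are illustrative):

import math
from collections import Counter

def tf(term, doc_tokens):
    counts = Counter(doc_tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))   # normalize over all words
    return counts[term] / norm if norm else 0.0

def idf(term, docs_tokens):
    df = sum(1 for tokens in docs_tokens if term in tokens)
    return math.log(len(docs_tokens) / df) if df else 0.0

def score(term, doc_tokens, docs_tokens):
    return tf(term, doc_tokens) * idf(term, docs_tokens)

docs = [d.split() for d in ["grand slam tennis", "tennis racket", "stock market report"]]
print(score("tennis", docs[0], docs))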


Instance-Based Learning / CBR (page 193)

• K-nearest-neighbor learning
– Assume all training instances are of the form <a1(x), a2(x), …, an(x)>, where the ai are attributes

• Each attribute may have a weight w(ai) indicating how important it is

• Given a test instance y, its distance from a training instance x is:

d(x, y) = sqrt( Σ_{i=1..n} w(ai)² * (ai(x) - ai(y))² )

• Retrieve the k nearest neighbors as candidate hypotheses (classify by majority vote)

• K nearest neighbors: the k training instances with the shortest distance to y
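A compact k-NN sketch in Python (attribute weights default to 1; the data at the end and all names are illustrative):

import math
from collections import Counter

def distance(x, y, weights):
    return math.sqrt(sum((w * (xa - ya)) ** 2 for xa, ya, w in zip(x, y, weights)))

def knn_classify(train, y, k, weights):
    """train: list of (attribute_vector, label) pairs. Majority vote over the k nearest."""
    nearest = sorted(train, key=lambda item: distance(item[0], y, weights))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1.0, 2.0), "yes"), ((1.5, 1.8), "yes"), ((5.0, 8.0), "no")]
print(knn_classify(train, (1.2, 1.9), k=3, weights=(1.0, 1.0)))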


Neural Networks: Perceptrons (Russell and Norvig, Chap. 19)

[Figure: a perceptron. Inputs x1, x2, …, xn enter with weights w1, w2, …, wn; the unit computes Sum = Σ wi*xi and outputs 1 if Sum > 0, and -1 otherwise. A second panel shows + and - examples in the plane separated by a line: the target function.]


How to update the weights

• Let Input[j] be the jth input

• Let Err be the error at the output:
– Err = T – O = (target – output)

• Update the jth weight:
– Wj = Wj + α * Input[j] * Err
– where α is the learning rate
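A minimal perceptron-training sketch in Python (the learning rate α and the bipolar output follow the slides; the OR data at the end is just an illustration):

def predict(weights, x):
    # Output 1 if the weighted sum is positive, -1 otherwise.
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else -1

def train_perceptron(examples, n_inputs, alpha=0.1, epochs=20):
    weights = [0.0] * n_inputs
    for _ in range(epochs):
        for x, target in examples:
            err = target - predict(weights, x)
            weights = [w + alpha * xi * err for w, xi in zip(weights, x)]
    return weights

# OR function, with a constant bias input of 1 as the first component.
examples = [((1, 0, 0), -1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = train_perceptron(examples, 3)
print([predict(w, x) for x, _ in examples])   # -> [-1, 1, 1, 1]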


Step function used as logic gates

[Figure: perceptron units with a step activation implementing logic gates.
AND: weights w = 1, 1 and threshold t = 1.5
OR: weights w = 1, 1 and threshold t = 0.5
NOT: weight w = -1 and threshold t = -0.5]

Step_t(x) = 1 if x >= t, 0 if x < t


NEURAL NETWORKS

• A neural network is composed of:
– A number of nodes, or units
– Links which connect the nodes
– A numeric weight associated with each link
– Units: input units, output units, hidden units

• How a neural network classifies:
– After initialization, the weights and biases are modified to improve performance on input/output pairs, so as to achieve the best classification of the given training data.


Neural network composition

[Figure: a feed-forward network. The input vector I_k (nodes I_1, I_2, …) feeds hidden nodes a_j through weights W_kj; the hidden nodes feed output nodes O_i (O_1, O_2, O_3) through weights W_ji, producing the output vector.]


A Neuron

• Computation: input signals -> input function (linear) -> activation function (nonlinear) -> output signal

[Figure: a single neuron j. Input links carry activations a_k with weights W_kj; the input function computes in_j, and the activation function g produces the output a_j on the output links.]

in_j = Σ_k W_kj * I_k,   a_j = g(in_j)


Three different activation functions

[Figure: three activation functions g(in_i): the step function, the sign function, and the sigmoid function.]

Step_t(x) = 1 if x >= t, 0 if x < t
Sign(x) = 1 if x >= 0, -1 if x < 0
Sigmoid(x) = 1 / (1 + e^(-x))


Multi Layer Perceptron

• Number of output nodes: the number of classes
• Number of input nodes: the dimensionality of the input tuples
• Number of hidden nodes: adjusted during training

[Figure: a multi-layer perceptron with input nodes X_0 … X_2, a hidden layer, output nodes Y_0 … Y_2, and a weight W_ij on the link between node i and node j.]


• The output at a hidden or output node j is defined by

• That is, Ij is a linear combination of node j’s inputs, and the output of node j is the sigmoid function of this combination.

I_j = Σ_k W_kj * O_k

O_j = g(I_j) = 1 / (1 + e^(-I_j))     (g is the sigmoid activation function)


Weight Update (pages 579, 580)

• For the link from hidden unit j to output unit i:

W_ji = W_ji + α * a_j * Err_i,   where Err_i = (T_i - O_i) * g'(in_i)
g'(x) = g(x) * (1 - g(x))

• For the link from input unit k to hidden unit j:

W_kj = W_kj + α * I_k * Err_j,   where Err_j = g'(in_j) * Σ_i W_ji * Err_i


Algorithm

Initialize all weights and biases in the network;
While the terminating condition is not met:
  for each training example X:
    /* Forward-propagate the inputs */
    for each hidden or output unit i (forward):
      compute its output O_i;
    /* Back-propagate the errors */
    for each output or hidden unit j (backward):
      compute its error Err_j;
      adjust each weight;
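A minimal sketch of these updates for a single-hidden-layer network in Python (the layer sizes, learning rate, omission of bias weights, and the OR-style data at the end are illustrative assumptions, not taken from the slides):

import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_backprop(examples, n_in, n_hidden, n_out, alpha=0.5, epochs=5000):
    rnd = random.Random(0)
    # W_kj: input k -> hidden j; W_ji: hidden j -> output i (no bias terms, for brevity).
    W_kj = [[rnd.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    W_ji = [[rnd.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hidden)]
    for _ in range(epochs):
        for x, target in examples:
            # Forward pass.
            a = [sigmoid(sum(x[k] * W_kj[k][j] for k in range(n_in))) for j in range(n_hidden)]
            o = [sigmoid(sum(a[j] * W_ji[j][i] for j in range(n_hidden))) for i in range(n_out)]
            # Backward pass: Err_i = (T_i - O_i) * g'(in_i), with g'(in) = o * (1 - o).
            err_out = [(target[i] - o[i]) * o[i] * (1 - o[i]) for i in range(n_out)]
            err_hid = [a[j] * (1 - a[j]) * sum(W_ji[j][i] * err_out[i] for i in range(n_out))
                       for j in range(n_hidden)]
            # Weight updates.
            for j in range(n_hidden):
                for i in range(n_out):
                    W_ji[j][i] += alpha * a[j] * err_out[i]
            for k in range(n_in):
                for j in range(n_hidden):
                    W_kj[k][j] += alpha * x[k] * err_hid[j]
    return W_kj, W_ji

# Example usage (the OR function, with a constant 1 as a bias input):
data = [((1, 0, 0), (0,)), ((1, 0, 1), (1,)), ((1, 1, 0), (1,)), ((1, 1, 1), (1,))]
W1, W2 = train_backprop(data, n_in=3, n_hidden=2, n_out=1)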


NEURAL NETWORKS

• Advantages
– Prediction accuracy is generally high
– Robust: works even when training examples contain errors
– The output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
– Fast evaluation of the learned target function

• Criticism
– Long training time
– Difficult to understand the learned function (the weights)
– Not easy to incorporate domain knowledge
– Does not provide probability distributions over the output values


Back-propagation Network

[Figure: a back-propagation network with an input layer (k) of units x1 … xn, a hidden layer (j) of units h1 … hb connected to the inputs by weights W1_ij, and an output layer (i) of units s1 … sc connected to the hidden layer by weights W2_ij.]


DECISION TREE (data mining book, page 89)

• An internal node represents a test on an attribute.
• A branch represents an outcome of the test, e.g., Color = red.
• A leaf node represents a class label or a class-label distribution.
• At each node, one attribute is chosen to split the training examples into classes as distinct as possible.
• A new case is classified by following the matching path to a leaf node.


Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Training Set


Example

Outlook = sunny: test humidity
  humidity = high: N
  humidity = normal: P
Outlook = overcast: P
Outlook = rain: test windy
  windy = true: N
  windy = false: P


Building Decision Trees

• Top-down tree construction
– At the start, all training examples are at the root.
– Partition the examples recursively by choosing one attribute at a time.

• Bottom-up tree pruning
– Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.


Simplest Tree

Day  Outlook  Temp  Humid  Wind  Play?
d1   s        h     h      w     n
d2   s        h     h      s     n
d3   o        h     h      w     y
d4   r        m     h      w     y
d5   r        c     n      w     y
d6   r        c     n      s     y
d7   o        c     n      s     y
d8   s        m     h      w     n
d9   s        c     n      w     y
d10  r        m     n      w     y
d11  s        m     n      s     y
d12  o        m     h      s     y
d13  o        h     n      w     y
d14  r        m     h      s     n

How good is the simplest tree, a single leaf that always predicts "yes"?
[10+, 4-] means: correct on 10 examples, incorrect on 4 examples.


Successors of "Yes"

[Figure: four candidate splits of the root "yes" leaf, one for each attribute: Outlook, Temp, Humid, Wind.]

Which attribute should we use to split?


To be decided:

• How to choose the best attribute?
– Information gain
– Entropy (disorder)

• When to stop growing tree?


Entropy (disorder) is bad; homogeneity is good

• Let S be a set of examples
• Entropy(S) = -P log2(P) - N log2(N)
– where P is the proportion of positive examples
– and N is the proportion of negative examples
– and 0 log 0 == 0

• Example: S has 9 positive and 5 negative examples
Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
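A quick check of this calculation in Python:

import math

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                      # 0 log 0 is taken to be 0
            result -= p * math.log2(p)
    return result

print(round(entropy(9, 5), 3))         # -> 0.94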


Entropy

[Figure: entropy as a function of the proportion P of positive examples; it is 0 at P = 0 and P = 1 and reaches its maximum of 1.0 at P = 0.5.]


Information Gain

• Measure of the expected reduction in entropy resulting from splitting on an attribute

Gain(S, A) = Entropy(S) - Σ_{v in Values(A)} (|Sv| / |S|) * Entropy(Sv)

where Entropy(S) = -P log2(P) - N log2(N)


Information Gain

• Consider a set of examples S and k classes Ci, i = 1, …, k, with freq(Ci, S) examples of class Ci in S.
• The entropy (or impurity) of the classes in S is defined as

E(S) = - Σ_{i=1..k} ( freq(Ci, S) / |S| ) * log2( freq(Ci, S) / |S| )

• E(S) is minimized if the classes in S are skewed or pure.
• The information gain of choosing attribute A to partition S into sets {S1, S2, …, Sv} is defined as the reduction in entropy:

gain(S, A) = E(S) - Σ_{i=1..v} ( |Si| / |S| ) * E(Si)

• The attribute that maximizes the information gain is chosen.


Gain of Splitting on Wind

Day  Wind  Tennis?
d1   weak  n
d2   s     n
d3   weak  yes
d4   weak  yes
d5   weak  yes
d6   s     yes
d7   s     yes
d8   weak  n
d9   weak  yes
d10  weak  yes
d11  s     yes
d12  s     yes
d13  weak  yes
d14  s     n

Values(wind) = {weak, strong};  S = [9+, 5-];  Sweak = [6+, 2-];  Ss = [3+, 3-]

Gain(S, wind) = Entropy(S) - Σ_{v in {weak, s}} (|Sv| / |S|) * Entropy(Sv)
             = Entropy(S) - (8/14) * Entropy(Sweak) - (6/14) * Entropy(Ss)
             = 0.940 - (8/14) * 0.811 - (6/14) * 1.00 = 0.048
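The same computation as a short Python check:

import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(p * math.log2(p) for p in (pos / total, neg / total) if p > 0)

s = entropy(9, 5)
gain_wind = s - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain_wind, 3))   # -> 0.048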


Evaluating Attributes

[Figure: the four candidate splits of the "yes" leaf, with their information gains:]

Gain(S, Humid) = 0.151
Gain(S, Outlook) = 0.246
Gain(S, Temp) = 0.029
Gain(S, Wind) = 0.048


Resulting Tree…

Good day for tennis?

Outlook = Sunny: non-leaf [2+, 3-]
Outlook = Overcast: leaf [4+]
Outlook = Rain: non-leaf [2+, 3-]


Recurse!

Examples reaching the branch Outlook = Sunny:

Day  Temp  Humid  Wind  Tennis?
d1   h     h      weak  n
d2   h     h      s     n
d8   m     h      weak  n
d9   c     n      weak  yes
d11  m     n      s     yes


One Step Later…

Outlook = Sunny: test Humidity
  Humidity = Normal: leaf [2+]
  Humidity = High: [3-]
Outlook = Overcast: leaf [4+]
Outlook = Rain: non-leaf [2+, 3-]


Example

• In the given sample data, attribute outlook is chosen to split at the root:
– gain(outlook) = 0.246
– gain(temperature) = 0.029
– gain(humidity) = 0.151
– gain(windy) = 0.048


Gain Ratio

• Information gain favors attributes with many values, because splitting the examples more finely (by having more values) always reduces entropy.
• The gain ratio (Quinlan '86) normalizes the information gain by this reduction:

SplitInfo(S, A) = - Σ_i ( |Si| / |S| ) * log2( |Si| / |S| )

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)
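A small Python sketch of the two quantities (partition sizes are passed in directly; the names are illustrative):

import math

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# Outlook splits the 14 examples into subsets of size 5, 4, and 5.
print(round(gain_ratio(0.246, [5, 4, 5]), 3))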


Stopping Criteria

• When all cases have the same class. The leaf node is labeled by this class.

• When there is no available attribute. The leaf node is labeled by the majority class.

• When the number of cases is less than a specified threshold. The leaf node is labeled by the majority class.


Prune Overfitting

• Overfitting: a fully-grown tree achieves 100% accuracy on the training set but performs poorly on the testing set.

• Two approaches to prevent overfitting:
– Stop earlier: stop growing the tree earlier.
– Post-prune: grow the tree and then prune subtrees.

• Stop earlier:
– Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized (SLIQ and SPRINT).

• Post-prune (C4.5):
– Tree pruning
– Rule pruning


Tree Pruning in C4.5

• Bottom-up pruning: at each non-leaf node v, if merging the subtree at v into a leaf node improves accuracy, perform the merging.

• Method 1: compute accuracy using examples not seen by the algorithm.

• Method 2: estimate accuracy using the training examples:
– Treat classifying E examples incorrectly out of N examples as observing E events in N trials of a binomial distribution. For a given confidence level CF, the upper limit on the error rate over the whole population is U_CF(E, N), with CF% confidence.


Example for Estimating Error

• Consider a subtree with 3 leaf nodes:
– education spending = n: democrat (6)
– education spending = y: democrat (9)
– education spending = u: republican (1)

• With U_0.25(0, 6) = 0.206, U_0.25(0, 9) = 0.143, and U_0.25(0, 1) = 0.750, the estimated error for this subtree is
6 * 0.206 + 9 * 0.143 + 1 * 0.750 = 3.273

• If the subtree is replaced with the leaf democrat, the estimated error is
16 * U_0.25(1, 16) = 16 * 0.157 = 2.512

• So the pruning is performed.


Continuous Values [Q93]

Temperature:  40  48  60  72  80  90
Play tennis:  No  No  Yes Yes Yes No

• Sort the examples according to the continuous attribute A
• Identify the cut point that maximizes the goodness measure
• The attribute then participates in the competition with the other attributes
• A continuous attribute can be selected again in a subtree
• How to avoid re-sorting at each node?


Missing Values [Q93]

• Assign missing attribute values either
– the most common value of A, or
– a probability for each possible value of A. C4.5 sends an example containing a missing value down each branch together with the corresponding probability.