
Machine Learning Part I

• Machine Learning I
– Machine Learning Introduction
– Induction of Decision Trees
– Perceptron and Simple Neural Network Learning
– Naïve Bayesian Learning
– Instance Based Learning

• Textbook: Data Mining by Witten and Frank


Learning: Mature Technology

• Many Applications
– Detect fraudulent credit card transactions

– Information filtering systems that learn user preferences

– Autonomous vehicles that drive public highways (ALVINN)

– Decision trees for diagnosing heart attacks

– Speech recognition and synthesis (correct pronunciation) (NETtalk)

• Data Mining: huge datasets, scaling issues, visualization issues, software engineering issues


Defining a Learning Problem (Chapters 1 & 2)

• Experience: Training Examples

• Learning Task: target of learning

• Performance Measure: accuracy? faster?

A program is said to learn from experience E with respect to task T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

• Representation of the Target Function Approximation
• Learning Algorithm


Example: “The Weather Problem” (page 9)

• Attributes of instances
– Wind
– Temperature
– Humidity
– Outlook

• Feature = attribute with one value
– E.g. outlook = sunny

• Sample instance = record = example = …
– wind=weak, temp=hot, humidity=high, outlook=sunny, play=yes


Experience: “Good day for tennis”

• MS Excel File
• Rules that might be learned:
– If outlook = sunny and humidity = high then play = no
– If outlook = rainy and windy = true then play = no
– If outlook = overcast then play = yes
– If humidity = normal then play = yes
– If none of the above then play = yes

• These rules are called a “decision list”
• Question: is the list correct for all examples?
• Question: is each individual rule correct, on its own, for all examples?
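To make the idea concrete, here is a minimal Python sketch of how such a decision list would be applied: rules are tried in order and the first matching rule fires (the attribute names and the dictionary representation are illustrative, not from the textbook).

def classify(instance):
    # Apply the decision-list rules in order; the first matching rule decides.
    if instance["outlook"] == "sunny" and instance["humidity"] == "high":
        return "no"
    if instance["outlook"] == "rainy" and instance["windy"]:
        return "no"
    if instance["outlook"] == "overcast":
        return "yes"
    if instance["humidity"] == "normal":
        return "yes"
    return "yes"  # default rule

print(classify({"outlook": "sunny", "humidity": "high", "windy": False}))  # -> no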


Numeric Attributes?

• If outlook = sunny and humidity > 83 then play = no


Classification vs Association

• In the weather data:
– If temperature = cool then humidity = normal
– If outlook = sunny then humidity = high

• Association: statistical correlation
• Question: are the rules correct?
– For a rule “If A then B”, ask: P(B|A) = ?  P(A and B) = ?  P(A) = ?


Terminology

• Concepts and hypotheses (page 38)
– The thing to be learned
– Examples: decision lists, trees, clusters, associations, Bayes nets

• Supervised:
– Training examples: outcomes (e.g. play or not) are provided by a teacher
– Often used in classification and prediction
– Yes/No: the label or class

• Unsupervised: training examples are not labeled or classified by a teacher
– Clustering

• Concept space: the space of all concepts
– Sometimes ordered from specific to general


Terminology (p 46)

• Input to a machine learning algorithm
– A set of instances or examples or samples

• Positive and negative examples
• Closed World Assumption sometimes used
– Provide only positive examples
– Assume that the rest are negative examples

• Relations:
– >, <, =

• Attributes, features, columns
• Values:
– Nominal, categorical: a finite set of discrete values, e.g. sunny, overcast, rainy

– Ordinal: ordering: hot > mid > cool


Overfitting

• The learned hypothesis or concept works well on the training data but does poorly on the testing data

• This can occur if we search for complex concepts first
• Example
• Overfitting-Avoidance Bias (page 32)

– Simple-concept first in searching for a learned concept

– Sometimes called forward pruning


Bias (page 29)

• Kinds of bias
– Language bias
• Do we allow conjunctions? Universal quantifiers (for-alls)? Etc.
• Shrinks the size of the hypothesis space

– Search bias
• Ordering over hypotheses: specific-to-general, or something else?
• Use greedy search?

– Overfitting-avoidance bias


Weka Input Format

• Arff format (page 49)

• Missing Values
– Indicated by out-of-range values such as –1 or '-'
– Sometimes a missing value itself carries significant information
– Other times it can be replaced by the mean value
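For reference, a minimal ARFF file for the weather data might look like the sketch below (the attribute names follow the textbook's weather example; the numeric values and the '?' marking a missing value are only illustrative):

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, ?, false, yes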


Simple Rule Learning (p78)

• For classification (supervised)
– Thus, data is divided into attributes and a class (i.e. play)

• Idea:
– Test a single attribute and branch according to the different values of that attribute
– For each value: assign the majority class
– Select the rule (attribute) with the smallest total error rate

• Error Rate: with C = number of correctly classified instances and N = total number of instances,

E = (N - C) / N


Algorithm 1R (p79)

• For each attribute
– For each value of that attribute, make a rule:

• Count how often each class appears

• Find the most frequent class

• Make the rule assign that class to this attribute value

– Calculate the error rate of the rules

• Choose rules with the smallest error rate
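A compact sketch of 1R in Python (it assumes the data is a list of dictionaries with a "play" class attribute; the function and variable names are illustrative, not the textbook's code):

from collections import Counter, defaultdict

def one_r(instances, class_attr="play"):
    """Return (best_attribute, rules, error_count) for a 1R classifier."""
    best = None
    attributes = [a for a in instances[0] if a != class_attr]
    for attr in attributes:
        # Count how often each class appears for each value of this attribute.
        counts = defaultdict(Counter)
        for inst in instances:
            counts[inst[attr]][inst[class_attr]] += 1
        # Rule: each value predicts its majority class.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in counts.items())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best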


Example (Table 4.1)

• Outlook (total errors 4/14)
– Sunny: no (2/5)
– Overcast: yes (0/4)
– Rainy: yes (2/5)

• Temperature (5/14)

• Humidity (4/14)

• Windy (5/14)


Missing Values and Numerical (p 80)

• In 1R: generate a separate attribute value called “missing”

• Numerical attributes
– Discretization: divide temperature into low, mid, high
– Where to draw the lines?
– Answer: based on how the classes are distributed
– Aim: the predicted class in each interval is the majority class!


Training and Testing

• Question: is it possible to learn a rule that gets no example outside the training set correct?

• Validation: divide the data into two sets
– Training set: used to build the model
• Example: train a 1R classification rule
– Testing set: used to evaluate the error rate
• Example: apply the 1R rule to all instances in the testing set
– Overfitting: small error rate on training, big on testing
• Example: two attributes: StudentID, Play
– Generally, overfitting in 1R occurs when…
• Hint: the cause of overfitting is usually not enough training examples for each branch
• Thus, if a rule has … then each branch has few examples.


More on Model Evaluation (Chap 5)

• Error rate alone may not always be useful
– How do you distinguish between testdata1 with 1000 instances and testdata2 with 100 instances, both with a 25% error rate?

• Answer: confidence interval

P(-z ≤ X ≤ z) = c

[Figure: distribution of the number of errors on future tests; a 75% confidence interval spans roughly 2 to 10 errors around the observed 6.]

Example: 100 test instances, 6 classified wrong, so the observed error rate is 6%. With 75% confidence, the actual number of errors on future tests of the same size lies between 2 and 10.


More on confidence interval

• P = a function of
– C: the confidence level
– N: the number of test instances
– F: the observed (mean) error rate

• Example:
– F = 75%, N = 1000, C = 95%: [0.733, 0.768]
– F = 75%, N = 100, C = 95%: [0.70, 0.81]
– N = 10: [0.65, 1.02]


Cross Validation (p 126)

• Holdout procedure:
– Hold out a certain amount of the data for testing and use the remainder for training

• Stratified holdout:
– The class distribution in the test data is representative of that in the training data

• Repeated holdout:
– Randomly select the training data, repeat several times, average the error rates

• N-fold cross-validation:
– Divide all data into n partitions ("folds")
– Use (n-1) folds for training and one fold for testing
– Repeat n times (see the sketch after this list)

• Best results in practice: 10-fold
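A minimal sketch of n-fold cross-validation in Python (it assumes instances is a list of labeled examples and that train and error_rate functions are supplied by the caller; all names are illustrative):

import random

def cross_validation_error(instances, n_folds, train, error_rate):
    """Average test error over n folds. `train` builds a model from a list of
    instances; `error_rate` evaluates a model on a list of instances."""
    data = instances[:]
    random.shuffle(data)                       # (stratification not shown)
    folds = [data[i::n_folds] for i in range(n_folds)]
    errors = []
    for i in range(n_folds):
        test_fold = folds[i]
        train_folds = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(train_folds)             # e.g. a 1R rule
        errors.append(error_rate(model, test_fold))
    return sum(errors) / n_folds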


More on validation (p 127)

• Leave-one-out
– Data set size = N
– Leave one instance out at a time as the test set
– Use the remaining N-1 instances for training; repeat N times

• Why it is good:
– No randomness
– Maximum amount of data for training

• Why it is bad:
– Testing is not stratified
– Example:


Bootstrap

• Build a training set of size n from the original data of the same size by randomly selecting instances with replacement
– Instances not picked are used for testing

• 0.632 bootstrap
– Uniform selection probability 1/n for data of size n
– An instance is never picked with probability (1 - 1/n)^n, which is about e^(-1) = 0.368
– So the training set of size n contains about 63.2% of the distinct original instances
– Those never picked form the test set (about 36.8% of the original data)
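A small sketch of drawing one bootstrap sample (Python; the function and variable names are illustrative):

import random

def bootstrap_split(instances):
    """Sample n instances with replacement as training data; instances never
    drawn become the test set (about 36.8% of the data on average)."""
    n = len(instances)
    picked = [random.randrange(n) for _ in range(n)]
    picked_set = set(picked)
    train = [instances[i] for i in picked]
    test = [inst for i, inst in enumerate(instances) if i not in picked_set]
    return train, test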


Naïve Bayesian Learning

• Chapter 4.2 (page 82)
• Instead of computing rules, use training data statistics as the learned model
– Assumption: attributes are independent of each other given the hypothesis (e.g. play = yes)

• Example: test instance x = <outlook = sunny>
– Likelihood of “yes” = P(outlook=sunny | play=yes) * P(play=yes) = (2/9) * (9/14) = 2/14
– Likelihood of “no” = P(outlook=sunny | play=no) * P(play=no) = (3/5) * (5/14) = 3/14
– Thus, choose “no” as the answer
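The same computation as a short Python sketch, using the counts quoted on the slide (the helper name is illustrative):

def naive_bayes_likelihood(prior, cond_probs):
    """Likelihood of a class: P(class) times the product of P(attribute=value | class)."""
    p = prior
    for cp in cond_probs:
        p *= cp
    return p

# Test instance <outlook=sunny>, counts from the weather data (9 yes / 5 no).
like_yes = naive_bayes_likelihood(9 / 14, [2 / 9])   # = 2/14
like_no = naive_bayes_likelihood(5 / 14, [3 / 5])    # = 3/14
print("yes" if like_yes > like_no else "no")         # -> no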


Naïve Bayesian Learning

• In general, we want to know the value of a hypothesis H (e.g., H = (play = yes))

• Evidence = attributes = E_i, where i goes from 1 to n
• Bayes rule + independence assumption:

P(H|E) = P(E|H) * P(H) / P(E)

P(E|H) = Π_i P(E_i | H)

• However, P(E) is not known; but it is the same for every hypothesis, so it can be dropped when comparing them


Laplace Estimator

• It may happen that a particular attribute value (outlook = cloudy) does not appear in the training set with every class value (play = no).

• Then P(outlook=cloudy | play=no) = 0!
• Use the Laplace estimator (outlook ∈ {sunny, overcast, cloudy}):

Without smoothing: P(outlook | yes) = 2/9, 4/9, 3/9 for the three outlook values

With the Laplace estimator (add 1 to each count and 3 to the total): P(outlook | yes) = (2+1)/(9+3), (4+1)/(9+3), (3+1)/(9+3)
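A one-line sketch of the estimator in Python (the simple add-one form shown above is assumed):

def laplace_estimate(count, total, n_values):
    # Add 1 to the count for this value and n_values to the total count.
    return (count + 1) / (total + n_values)

print(laplace_estimate(0, 5, 3))  # P(outlook=cloudy | play=no) is no longer zero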


Weka Java Package Demo

• Confusion Matrix

• Stratified Cross-Validation

• Java Classpath, Jar files, etc.
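As one concrete illustration (the jar and data file names are placeholders, and this assumes weka.jar sits in the current directory), a classifier can be run from the command line; by default Weka evaluates it with stratified 10-fold cross-validation and prints a confusion matrix:

java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t weather.arff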


Continuous Values (Temperature)

• Use probability density function

p(temperature = 66 | yes) = (1 / (sqrt(2π) σ)) * e^( -(x - μ)² / (2σ²) ) evaluated at x = 66,
where μ and σ are estimated from the training examples of class "yes"

[Figure: the normal (Gaussian) probability density function f(x).]
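A sketch of the density computation in Python (the mean and standard deviation would be estimated from the training instances of the given class; in the weather data, temperature given play = yes has mean 73 and standard deviation of roughly 6.2):

import math

def gaussian_density(x, mu, sigma):
    """Normal probability density, used in place of a discrete count ratio."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(gaussian_density(66, 73, 6.2))   # e.g. p(temperature = 66 | yes)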


An Ecommerce Example

• In recommendation systems for movies, we have
– Properties of movies
– A log file of which movie is watched by each movie-goer
– A rating of each movie by each movie-goer

• Want to know:
– What other movies would a movie-goer like to see?


A Naïve Baysian Classifier for Texts

• Step 1: extract keywords

• Step 2: build a relational table for training

• Step 3: apply to testing examples


Text Ranking

• Question: given a set of documents and a keyword, how important is the keyword in document search?

• For a given document d and term (keyword) k:
– Term frequency tf(k, d): the number of times the term appears in the document, normalized over all words in the document
– Document frequency df(k, D), where D is the set of documents under consideration: the number of documents containing the term
– Inverse document frequency idf(k, D)

tf(k, d) = count(k, d) / sqrt( Σ_{w in d} count(w, d)² )

idf(k, D) = log( |D| / df(k, D) )


Given a query, rank the documents

• Given query keyword k, a document d in a collection of documents D, what is the score of the document d for the keyword and D?

• Keyword selection: find the set of keywords for which idf(k,D) is large

score(k, d, D) = tf(k, d) * idf(k, D)
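A small tf-idf sketch in Python, using the squared-count normalization shown above (tokenization is naive whitespace splitting; the function names are illustrative):

import math
from collections import Counter

def tf(term, doc_tokens):
    counts = Counter(doc_tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))   # normalize over all words
    return counts[term] / norm if norm else 0.0

def idf(term, docs_tokens):
    df = sum(1 for tokens in docs_tokens if term in tokens)
    return math.log(len(docs_tokens) / df) if df else 0.0

def score(term, doc_tokens, docs_tokens):
    return tf(term, doc_tokens) * idf(term, docs_tokens)

docs = [d.split() for d in ["grand slam tennis", "tennis racket", "stock market report"]]
print(score("tennis", docs[0], docs))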


Instance-Based Learning / CBR (page 193)

• K-nearest-neighbor learning
– Assume all training instances are of the form <a1(x), a2(x), …, an(x)>, where the ai are attributes

• Each attribute may have a weight w(ai) indicating how important it is

• Given a test instance y, its distance from a training instance x is:

d(x, y) = sqrt( Σ_{i=1..n} w(ai)² * (ai(x) - ai(y))² )

• Retrieve the k nearest neighbors as candidate hypotheses (classify by majority vote)

• K nearest neighbors: the k training instances with the shortest distance to y
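A compact k-NN sketch in Python (attribute weights default to 1; the data at the end and all names are illustrative):

import math
from collections import Counter

def distance(x, y, weights):
    return math.sqrt(sum((w * (xa - ya)) ** 2 for xa, ya, w in zip(x, y, weights)))

def knn_classify(train, y, k, weights):
    """train: list of (attribute_vector, label) pairs. Majority vote over the k nearest."""
    nearest = sorted(train, key=lambda item: distance(item[0], y, weights))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1.0, 2.0), "yes"), ((1.5, 1.8), "yes"), ((5.0, 8.0), "no")]
print(knn_classify(train, (1.2, 1.9), k=3, weights=(1.0, 1.0)))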


Neural Networks: Perceptrons (Russell and Norvig, Chap. 19)

[Figure: a perceptron. Inputs x1, x2, …, xn enter with weights w1, w2, …, wn; the unit computes Sum = Σ wi*xi and outputs 1 if Sum > 0, and -1 otherwise. A second panel shows + and - examples in the plane separated by a line: the target function.]


How to update the weights

• Let Input[j] be the jth input

• Let Err be the error at the output:
– Err = T – O = (target – output)

• Update the jth weight:
– Wj = Wj + α * Input[j] * Err
– where α is the learning rate
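A minimal perceptron-training sketch in Python (the learning rate α and the bipolar output follow the slides; the OR data at the end is just an illustration):

def predict(weights, x):
    # Output 1 if the weighted sum is positive, -1 otherwise.
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else -1

def train_perceptron(examples, n_inputs, alpha=0.1, epochs=20):
    weights = [0.0] * n_inputs
    for _ in range(epochs):
        for x, target in examples:
            err = target - predict(weights, x)
            weights = [w + alpha * xi * err for w, xi in zip(weights, x)]
    return weights

# OR function, with a constant bias input of 1 as the first component.
examples = [((1, 0, 0), -1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = train_perceptron(examples, 3)
print([predict(w, x) for x, _ in examples])   # -> [-1, 1, 1, 1]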


Step function used as logic gates

[Figure: perceptron units with a step activation implementing logic gates.
AND: weights w = 1, 1 and threshold t = 1.5
OR: weights w = 1, 1 and threshold t = 0.5
NOT: weight w = -1 and threshold t = -0.5]

Step_t(x) = 1 if x >= t, 0 if x < t


NEURAL NETWORKS

• A neural network is composed of:
– A number of nodes, or units
– Links which connect the nodes
– A numeric weight associated with each link
– Units: input units, output units, hidden units

• How a neural network classifies:
– After initialization, the weights and biases are modified to improve performance on input/output pairs, so as to achieve the best classification of the given training data.


Neural network composition

[Figure: a feed-forward network. The input vector I_k (nodes I_1, I_2, …) feeds hidden nodes a_j through weights W_kj; the hidden nodes feed output nodes O_i (O_1, O_2, O_3) through weights W_ji, producing the output vector.]


A Neuron

• Computation: input signals -> input function (linear) -> activation function (nonlinear) -> output signal

[Figure: a single neuron j. Input links carry activations a_k with weights W_kj; the input function computes in_j, and the activation function g produces the output a_j on the output links.]

in_j = Σ_k W_kj * I_k,   a_j = g(in_j)


Three different activation functions

[Figure: three activation functions g(in_i): the step function, the sign function, and the sigmoid function.]

Step_t(x) = 1 if x >= t, 0 if x < t
Sign(x) = 1 if x >= 0, -1 if x < 0
Sigmoid(x) = 1 / (1 + e^(-x))


Multi Layer Perceptron

• Number of output nodes: the number of classes
• Number of input nodes: the dimensionality of the input tuples
• Number of hidden nodes: adjusted during training

[Figure: a multi-layer perceptron with input nodes X_0 … X_2, a hidden layer, output nodes Y_0 … Y_2, and a weight W_ij on the link between node i and node j.]


• The output at a hidden or output node j is defined by

• That is, Ij is a linear combination of node j’s inputs, and the output of node j is the sigmoid function of this combination.

I_j = Σ_k W_kj * O_k

O_j = g(I_j) = 1 / (1 + e^(-I_j))     (g is the sigmoid activation function)


Weight Update (pages 579, 580)

• For the link from hidden unit j to output unit i:

W_ji = W_ji + α * a_j * Err_i,   where Err_i = (T_i - O_i) * g'(in_i)
g'(x) = g(x) * (1 - g(x))

• For the link from input unit k to hidden unit j:

W_kj = W_kj + α * I_k * Err_j,   where Err_j = g'(in_j) * Σ_i W_ji * Err_i


Algorithm

Initialize all weights and biases in the network;
While the terminating condition is not met:
  for each training example X:
    /* Forward-propagate the inputs */
    for each hidden or output unit i (forward):
      compute its output O_i;
    /* Back-propagate the errors */
    for each output or hidden unit j (backward):
      compute its error Err_j;
      adjust each weight;
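A minimal sketch of these updates for a single-hidden-layer network in Python (the layer sizes, learning rate, omission of bias weights, and the OR-style data at the end are illustrative assumptions, not taken from the slides):

import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_backprop(examples, n_in, n_hidden, n_out, alpha=0.5, epochs=5000):
    rnd = random.Random(0)
    # W_kj: input k -> hidden j; W_ji: hidden j -> output i (no bias terms, for brevity).
    W_kj = [[rnd.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    W_ji = [[rnd.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hidden)]
    for _ in range(epochs):
        for x, target in examples:
            # Forward pass.
            a = [sigmoid(sum(x[k] * W_kj[k][j] for k in range(n_in))) for j in range(n_hidden)]
            o = [sigmoid(sum(a[j] * W_ji[j][i] for j in range(n_hidden))) for i in range(n_out)]
            # Backward pass: Err_i = (T_i - O_i) * g'(in_i), with g'(in) = o * (1 - o).
            err_out = [(target[i] - o[i]) * o[i] * (1 - o[i]) for i in range(n_out)]
            err_hid = [a[j] * (1 - a[j]) * sum(W_ji[j][i] * err_out[i] for i in range(n_out))
                       for j in range(n_hidden)]
            # Weight updates.
            for j in range(n_hidden):
                for i in range(n_out):
                    W_ji[j][i] += alpha * a[j] * err_out[i]
            for k in range(n_in):
                for j in range(n_hidden):
                    W_kj[k][j] += alpha * x[k] * err_hid[j]
    return W_kj, W_ji

# Example usage (the OR function, with a constant 1 as a bias input):
data = [((1, 0, 0), (0,)), ((1, 0, 1), (1,)), ((1, 1, 0), (1,)), ((1, 1, 1), (1,))]
W1, W2 = train_backprop(data, n_in=3, n_hidden=2, n_out=1)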


NEURAL NETWORKS

• Advantages
– Prediction accuracy is generally high
– Robust: works even when training examples contain errors
– The output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
– Fast evaluation of the learned target function

• Criticism
– Long training time
– Difficult to understand the learned function (the weights)
– Not easy to incorporate domain knowledge
– Does not provide probability distributions over the output values


Back-propagation Network

[Figure: a back-propagation network with an input layer (k) of units x1 … xn, a hidden layer (j) of units h1 … hb connected to the inputs by weights W1_ij, and an output layer (i) of units s1 … sc connected to the hidden layer by weights W2_ij.]


DECISION TREE (data mining book, page 89)

• An internal node represents a test on an attribute.
• A branch represents an outcome of the test, e.g., Color = red.
• A leaf node represents a class label or a class-label distribution.
• At each node, one attribute is chosen to split the training examples into classes as distinct as possible.
• A new case is classified by following the matching path to a leaf node.


Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Training Set


Example

Outlook = sunny: test humidity
  humidity = high: N
  humidity = normal: P
Outlook = overcast: P
Outlook = rain: test windy
  windy = true: N
  windy = false: P


Building Decision Trees

• Top-down tree construction
– At the start, all training examples are at the root.
– Partition the examples recursively by choosing one attribute at a time.

• Bottom-up tree pruning
– Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.


Simplest Tree

Day  Outlook  Temp  Humid  Wind  Play?
d1   s        h     h      w     n
d2   s        h     h      s     n
d3   o        h     h      w     y
d4   r        m     h      w     y
d5   r        c     n      w     y
d6   r        c     n      s     y
d7   o        c     n      s     y
d8   s        m     h      w     n
d9   s        c     n      w     y
d10  r        m     n      w     y
d11  s        m     n      s     y
d12  o        m     h      s     y
d13  o        h     n      w     y
d14  r        m     h      s     n

How good is the simplest tree, a single leaf that always predicts "yes"?
[10+, 4-] means: correct on 10 examples, incorrect on 4 examples.


Successors of "Yes"

[Figure: four candidate splits of the root "yes" leaf, one for each attribute: Outlook, Temp, Humid, Wind.]

Which attribute should we use to split?


To be decided:

• How to choose the best attribute?
– Information gain
– Entropy (disorder)

• When to stop growing tree?


Entropy (disorder) is bad; homogeneity is good

• Let S be a set of examples
• Entropy(S) = -P log2(P) - N log2(N)
– where P is the proportion of positive examples
– and N is the proportion of negative examples
– and 0 log 0 == 0

• Example: S has 9 positive and 5 negative examples
Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
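A quick check of this calculation in Python:

import math

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                      # 0 log 0 is taken to be 0
            result -= p * math.log2(p)
    return result

print(round(entropy(9, 5), 3))         # -> 0.94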


Entropy

[Figure: entropy as a function of the proportion P of positive examples; it is 0 at P = 0 and P = 1 and reaches its maximum of 1.0 at P = 0.5.]


Information Gain

• Measure of the expected reduction in entropy resulting from splitting on an attribute

Gain(S, A) = Entropy(S) - Σ_{v in Values(A)} (|Sv| / |S|) * Entropy(Sv)

where Entropy(S) = -P log2(P) - N log2(N)


Information Gain

• Consider a set of examples S and k classes Ci, i = 1, …, k, with freq(Ci, S) examples of class Ci in S.
• The entropy (or impurity) of the classes in S is defined as

E(S) = - Σ_{i=1..k} ( freq(Ci, S) / |S| ) * log2( freq(Ci, S) / |S| )

• E(S) is minimized if the classes in S are skewed or pure.
• The information gain of choosing attribute A to partition S into sets {S1, S2, …, Sv} is defined as the reduction in entropy:

gain(S, A) = E(S) - Σ_{i=1..v} ( |Si| / |S| ) * E(Si)

• The attribute that maximizes the information gain is chosen.


Gain of Splitting on Wind

Day  Wind  Tennis?
d1   weak  n
d2   s     n
d3   weak  yes
d4   weak  yes
d5   weak  yes
d6   s     yes
d7   s     yes
d8   weak  n
d9   weak  yes
d10  weak  yes
d11  s     yes
d12  s     yes
d13  weak  yes
d14  s     n

Values(wind) = {weak, strong};  S = [9+, 5-];  Sweak = [6+, 2-];  Ss = [3+, 3-]

Gain(S, wind) = Entropy(S) - Σ_{v in {weak, s}} (|Sv| / |S|) * Entropy(Sv)
             = Entropy(S) - (8/14) * Entropy(Sweak) - (6/14) * Entropy(Ss)
             = 0.940 - (8/14) * 0.811 - (6/14) * 1.00 = 0.048
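The same computation as a short Python check:

import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(p * math.log2(p) for p in (pos / total, neg / total) if p > 0)

s = entropy(9, 5)
gain_wind = s - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain_wind, 3))   # -> 0.048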


Evaluating Attributes

[Figure: the four candidate splits of the "yes" leaf, with their information gains:]

Gain(S, Humid) = 0.151
Gain(S, Outlook) = 0.246
Gain(S, Temp) = 0.029
Gain(S, Wind) = 0.048


Resulting Tree…

Good day for tennis?

Outlook = Sunny: non-leaf [2+, 3-]
Outlook = Overcast: leaf [4+]
Outlook = Rain: non-leaf [2+, 3-]


Recurse!

Examples reaching the branch Outlook = Sunny:

Day  Temp  Humid  Wind  Tennis?
d1   h     h      weak  n
d2   h     h      s     n
d8   m     h      weak  n
d9   c     n      weak  yes
d11  m     n      s     yes


One Step Later…

Outlook = Sunny: test Humidity
  Humidity = Normal: leaf [2+]
  Humidity = High: [3-]
Outlook = Overcast: leaf [4+]
Outlook = Rain: non-leaf [2+, 3-]


Example

• In the given sample data, attribute outlook is chosen to split at the root:
– gain(outlook) = 0.246
– gain(temperature) = 0.029
– gain(humidity) = 0.151
– gain(windy) = 0.048


Gain Ratio

• Information gain favors attributes with many values, because splitting the examples more finely (by having more values) always reduces entropy.
• The gain ratio (Quinlan '86) normalizes the information gain by this reduction:

SplitInfo(S, A) = - Σ_i ( |Si| / |S| ) * log2( |Si| / |S| )

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)
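A small Python sketch of the two quantities (partition sizes are passed in directly; the names are illustrative):

import math

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# Outlook splits the 14 examples into subsets of size 5, 4, and 5.
print(round(gain_ratio(0.246, [5, 4, 5]), 3))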


Stopping Criteria

• When all cases have the same class. The leaf node is labeled by this class.

• When there is no available attribute. The leaf node is labeled by the majority class.

• When the number of cases is less than a specified threshold. The leaf node is labeled by the majority class.


Prune Overfitting

• Overfitting: a fully-grown tree achieves 100% accuracy on the training set but performs poorly on the testing set.

• Two approaches to prevent overfitting:
– Stop earlier: stop growing the tree earlier.
– Post-prune: grow the tree and then prune subtrees.

• Stop earlier:
– Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized (SLIQ and SPRINT).

• Post-prune (C4.5):
– Tree pruning
– Rule pruning


Tree Pruning in C4.5

• Bottom-up pruning: at each non-leaf node v, if merging the subtree at v into a leaf node improves accuracy, perform the merging.

• Method 1: compute accuracy using examples not seen by the algorithm.

• Method 2: estimate accuracy using the training examples:
– Treat classifying E examples incorrectly out of N examples as observing E events in N trials of a binomial distribution. For a given confidence level CF, the upper limit on the error rate over the whole population is U_CF(E, N), with CF% confidence.


Example for Estimating Error

• Consider a subtree with 3 leaf nodes:
– education spending = n: democrat (6)
– education spending = y: democrat (9)
– education spending = u: republican (1)

• With U_0.25(0, 6) = 0.206, U_0.25(0, 9) = 0.143, and U_0.25(0, 1) = 0.750, the estimated error for this subtree is
6 * 0.206 + 9 * 0.143 + 1 * 0.750 = 3.273

• If the subtree is replaced with the leaf democrat, the estimated error is
16 * U_0.25(1, 16) = 16 * 0.157 = 2.512

• So the pruning is performed.


Continuous Values [Q93]

Temperature:  40  48  60  72  80  90
Play tennis:  No  No  Yes Yes Yes No

• Sort the examples according to the continuous attribute A
• Identify the cut point that maximizes the goodness measure
• The attribute then participates in the competition with the other attributes
• A continuous attribute can be selected again in a subtree
• How to avoid re-sorting at each node?


Missing Values [Q93]

• Assign missing attribute values either
– the most common value of A, or
– a probability for each possible value of A. C4.5 sends an example containing a missing value down each branch together with the corresponding probability.