
Transcript of Machine Learning Lecture 05 – Decision Tree Learning (2020-04-02)

Page 1:

In the name of Allah, the Most Gracious, the Most Merciful

Page 2:

Machine Learning
Lecture 05 – Decision Tree Learning

Dr. Rao Muhammad Adeel Nawab

Page 3:

Dr. Rao Muhammad Adeel Nawab 3

How to Work?

Do your work, and keep Allah with you.

Verse: "Iyyaka na'budu wa iyyaka nasta'in" (Surah Al-Fatiha) – You alone we worship, and You alone we ask for help.

Page 4:

Dr. Rao Muhammad Adeel Nawab 4

Along with effort, make du'a (prayer).

Du'a: "Ihdina as-sirat al-mustaqim, sirat alladhina an'amta 'alayhim" (Surah Al-Fatiha) – Guide us to the straight path, the path of those upon whom You have bestowed favor.

Power of Effort & Dua

Page 5:

Dr. Rao Muhammad Adeel Nawab 5

O Allah, choose for me and decide for me.

Glory be to You; we have no knowledge except what You have taught us. Indeed, You are the All-Knowing, the All-Wise. (Quran 2:32)

My Lord, expand for me my chest and ease for me my task, and untie the knot from my tongue so that they may understand my speech. (Quran 20:25-28)

Dua – Take Help from Allah Before Starting Any Task

Page 6:

Dr. Rao Muhammad Adeel Nawab 6

Mainly, get excellence in two things

Course Focus

Become a great Human Being

Become a great Machine Learning Engineer

Page 7:

Dr. Rao Muhammad Adeel Nawab 7

Do today's work today; do not leave it for tomorrow.

Page 8:

Dr. Rao Muhammad Adeel Nawab 8

How You Can Have A Photographic Memory

Revision – revise again and again.

Page 9:

Dr. Rao Muhammad Adeel Nawab 9

Lecture Outline

• Decision Trees
• ID3 - Basic Decision Tree Learning Algorithm
• Entropy and Information Gain
• Inductive Bias in Decision Tree Learning
• Refinements to Basic Decision Tree Learning

Reading: Chapter 3 of Mitchell; Sections 4.3 and 6.1 of Witten and Frank

Page 10: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Decision Trees

Page 11: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 11

Concept Learning - Summary

Input
Learner – takes two inputs
Training Examples (D)
Hypothesis Space (H)

Output
Learner – output
A hypothesis h from H, which best fits the training examples (D)

Page 12: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 12

FIND-S and Candidate Elimination Algorithms - Summary

Representation of Input and Output
Attribute-Value Pairs

Representation of Hypothesis
Conjunction (AND) of constraints on attributes

Inductive Bias
Training data is error free
Target concept is present in the Hypothesis Space (H)

Page 13: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 13

Decision Tree Learning

Representation of Input and Output
Attribute-Value Pairs

Representation of Hypothesis
Disjunction (OR) of Conjunctions (AND) of attribute tests (a.k.a. Decision Tree)

Note
The "hypothesis representation" in decision tree learning is more expressive and powerful than that of the FIND-S and Candidate Elimination Algorithms

Page 14: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 14

Hypothesis Representation

Height
  Short  → HairLength
             Short → Male
             Long  → Female
  Normal → Female
  Tall   → HeadCovered
             No  → Male
             Yes → Female

Page 15: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 15

Hypothesis Representation

Decision Tree (or hypothesis) is represented as

Disjunction (OR) of Conjunctions (AND) of attribute tests
In the given example, there are a total of 5 paths in the Decision Tree
Each "path" in this tree is a "Conjunction (AND) of attribute tests"

Page 16: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 16

Hypothesis Representation

The 5 paths in the decision tree are:

1. Height = Short ∧ HairLength = Short
2. Height = Short ∧ HairLength = Long
3. Height = Normal
4. Height = Tall ∧ HeadCovered = Yes
5. Height = Tall ∧ HeadCovered = No

Page 17: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 17

Hypothesis Representation

IF (Height = Short ∧ HairLength = Long) ∨ (Height = Normal) ∨ (Height = Tall ∧ HeadCovered = Yes)
THEN Gender = 1 (Female)
OTHERWISE Gender = 0 (Male)

Decision Tree = “Disjunction (OR) of Paths”
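As an aside (not on the slide), the same disjunction of paths can be written directly as code. The sketch below hard-codes the tree above as a Python function; attribute values are assumed to be plain strings.

```python
def classify_gender(height, hair_length, head_covered):
    """Hand-coded sketch of the decision tree above: a disjunction (OR)
    of conjunctions (AND) of attribute tests."""
    if ((height == "Short" and hair_length == "Long")
            or height == "Normal"
            or (height == "Tall" and head_covered == "Yes")):
        return "Female"   # Gender = 1
    return "Male"         # Gender = 0

print(classify_gender("Tall", "Short", "Yes"))   # Female
print(classify_gender("Short", "Short", "No"))   # Male
```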

Page 18: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 18

Decision Trees - Definition

Decision Trees

Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree

Such problems, in which the task is to classify examples into one of a discrete set of possible categories, are often referred to as “Classification Problems”

Page 19: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 19

Decision Trees - Definition

Learned trees can also be re-represented as sets of “if-then rules” to improve human readability

Among the most popular inductive learning algorithms

Successfully applied to a broad range of tasks

Page 20: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 20

Decision Tree for Gender Identification Learning Problem

Height
  Short  → HairLength
             Short → Male
             Long  → Female
  Normal → …
  Tall   → …

Each internal node tests an attribute

Each branch corresponds to an attribute value

Each leaf node assigns a classification

Page 21: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 21

Decision Trees

Decision trees are trees which classify instances by testing at each node some attribute of the instance
Testing starts at the root node and proceeds downwards to a leaf node, which indicates the classification of the instance
Each branch leading out of a node corresponds to a value of the attribute being tested at that node
As a complex rule, such a decision tree could be coded by hand
However, the challenge for machine learning is to propose algorithms for learning decision trees from examples

Page 22: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 22

Applications of Decision Tree Learning

Applications

Predicting Magnetic Properties of Crystals
Profiling Higher-Priced Houses
Detecting Advertisements on the Web
Controlling a Production Process
Diagnosing Hypothyroidism
Assessing Credit Risk
Sentiment Analysis
Author Profiling
Word Sense Disambiguation
Text Reuse and Plagiarism Detection

Page 23: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 23

Problems Appropriate for Decision Tree Learning

Decision tree learning is best for problems where

Instances are describable by attribute–value pairs

Usually nominal (categorical / enumerated / discrete) attributes with a small number of discrete values, but they can be numeric (ordinal / continuous)

The target function (f) is discrete valued

In the Gender Identification example the target function is Boolean-valued
Easy to extend to target functions with > 2 output values
Harder, but possible, to extend to numeric target functions

Page 24: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 24

Problems Appropriate for Decision Tree Learning

Decision tree learning is best for problems where

Disjunctive hypothesis may be required

Easy for decision trees to learn disjunctive concepts (note such concepts were outside the hypothesis space of the Candidate Elimination Algorithm)

Possibly noisy/incomplete training data

Robust to errors both in the classification of training examples and in the "attribute values" describing training examples
Can be trained on examples where, for some instances, some attribute values are unknown/missing

Page 25: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

ID3 - Basic Decision Tree Learning Algorithm

Page 26: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 26

ID3 - Basic Decision Tree Learning Algorithm

The ID3 algorithm learns decision trees by constructing them top-down, beginning with the question:

Which attribute should be tested at the root of the tree?

Each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples

Page 27: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 27

ID3 - Basic Decision Tree Learning Algorithm

The “best attribute” is selected and used as the test at the root node of the tree

A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example's value for this attribute)

Page 28: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 28

ID3 - Basic Decision Tree Learning Algorithm

The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree

This process continues for each new leaf node until either of two conditions is met:

Every attribute has already been included along this path through the tree, or
The training examples associated with this leaf node all have the same target attribute value (i.e., their entropy is zero)

Page 29: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 29

ID3 - Basic Decision Tree Learning Algorithm

This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices

Page 30: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 30

ID3 - Basic Decision Tree Learning Algorithm

ID3 Algorithm

ID3(Examples, Target_Attribute, Attributes)
  Create a Root node for the tree
  If all Examples are +ve, return 1-node tree Root with label = +
  If all Examples are -ve, return 1-node tree Root with label = -
  If Attributes = [], return 1-node tree Root with label = most common value of Target_Attribute in Examples
  Otherwise

Page 31: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 31

ID3 - Basic Decision Tree Learning Algorithm
Begin

  A ← the attribute in Attributes that best classifies Examples
  The decision attribute for Root ← A
  For each possible value vi of A
    Add a new branch below Root for the test A = vi
    Let Examples_vi = subset of Examples with value vi for A
    If Examples_vi = []
      Then below this new branch add a leaf node with label = most common value of Target_Attribute in Examples
      Else below this new branch add the subtree ID3(Examples_vi, Target_Attribute, Attributes – {A})
  End
  Return Root
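To make the pseudocode concrete, here is a short Python sketch (my own illustration, not the lecture's reference code). It assumes examples are plain dicts mapping attribute names to values, represents a leaf as a bare label string and an internal node as a nested dict, and takes the attribute-selection function as a parameter (`choose_attribute`); an information-gain based chooser is sketched later in these notes.

```python
from collections import Counter

def id3(examples, target, attributes, choose_attribute):
    """Recursive sketch of ID3. `examples` is a list of dicts, `target` the name of
    the target attribute, and `choose_attribute(examples, attributes, target)` returns
    the attribute that best classifies the examples (e.g. by information gain)."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                      # all examples +ve or all -ve
        return labels[0]
    if not attributes:                             # no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    best = choose_attribute(examples, attributes, target)
    tree = {"attribute": best, "branches": {}}
    for value in {ex[best] for ex in examples}:    # one branch per observed value of `best`
        # (the pseudocode also adds a majority-label leaf for possible values of `best`
        #  that do not occur in `examples`; iterating over observed values avoids that case)
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree["branches"][value] = id3(subset, target, remaining, choose_attribute)
    return tree
```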

Page 32: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 32

Which Attribute is the Best Classifier?

In the ID3 algorithm, choosing which attribute to test at the next node is a crucial step

We would like to choose the attribute which does best at separating the training examples according to their target classification

Page 33: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 33

Which Attribute is the Best Classifier?

An attribute which separates training examples into two sets each of which contains positive / negative examples of the target attribute in the same ratio as the initial set of examples has not helped us progress towards a classification.

Page 34: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 34

Which Attribute is the Best Classifier?

Suppose we have 14 training examples for the Gender Identification problem: 9 +ve and 5 -ve.

For each example we have information about the attributes HairLength and HeadCovered

Page 35: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 35

Training Examples

Example No. | Height | Weight | HairLength | HeadCovered | Gender
x1          | Short  | Light  | Short      | Yes         | Male
x2          | Short  | Light  | Short      | No          | Male
x3          | Normal | Light  | Short      | Yes         | Female
x4          | Tall   | Normal | Short      | Yes         | Female
x5          | Tall   | Heavy  | Long       | Yes         | Female
x6          | Tall   | Heavy  | Long       | No          | Male
x7          | Normal | Heavy  | Long       | Yes         | Female
x8          | Short  | Normal | Short      | Yes         | Male
x9          | Short  | Heavy  | Long       | Yes         | Female
x10         | Tall   | Normal | Long       | No          | Female
x11         | Short  | Normal | Long       | No          | Female
x12         | Normal | Normal | Short      | No          | Female
x13         | Normal | Light  | Long       | Yes         | Female
x14         | Tall   | Normal | Short      | No          | Male

Page 36: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 36

Which Attribute is the Best Classifier ?

Which attribute is the best classifier?

HairLength
  S: [9+, 5-]
  Short → [3+, 4-]
  Long  → [6+, 1-]

HeadCovered
  S: [9+, 5-]
  Yes → [6+, 2-]
  No  → [3+, 3-]

Page 37: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Entropy and Information Gain

Page 38: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 38

Entropy and Information Gain

Information Gain

A useful measure for picking the best classifier attribute is information gain.
Information gain measures how well a given attribute separates training examples with respect to their target classification.
Information gain is defined in terms of entropy, as used in information theory.

Page 39: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 39

Entropy

For our previous example (14 examples, 9 positive, 5 negative):

Entropy([9+, 5−]) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
                  = −(9/14) log2(9/14) − (5/14) log2(5/14)
                  = 0.940
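A minimal sketch of this calculation (my own helper, not from the slides), computing entropy directly from the class counts:

```python
from math import log2

def entropy(counts):
    """Entropy of a collection of examples given its class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))   # 0.94, i.e. the 0.940 computed above
```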

Page 40: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 40

Entropy

Think of Entropy(S) as expected number of bits needed to encode class (⊕ or ⊖) of randomly drawn member of S (under the optimal, shortest-length code)

Page 41: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 41

Entropy

If p⊕ = 1 (all instances are positive) then no message need be sent (receiver knows example will be positive) and Entropy = 0 ("pure sample")
If p⊕ = 0.5 then 1 bit need be sent to indicate whether instance negative or positive and Entropy = 1
If p⊕ = 0.8 then less than 1 bit need be sent on average – assign shorter codes to collections of positive examples and longer ones to negative ones

Example

Page 42: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 42

Entropy

Information theory: an optimal length code assigns −log2 p bits to a message having probability p

So, the expected number of bits needed to encode the class (⊕ or ⊖) of a random member of S is:

p⊕(−log2 p⊕) + p⊖(−log2 p⊖)

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

Page 43: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 43

Information Gain

Entropy gives a measure of purity/impurity of a set of examples

Define information gain as the expected reduction in entropy resulting from partitioning a set of examples on the basis of an attribute

Page 44: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 44

Information Gain

Formally, given a set of examples S and attribute A:

Gain(S, A) ≡ Entropy(S) − Σ (over v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)

where Values(A) is the set of values attribute A can take on, and Sv is the subset of S for which A has value v

Page 45: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 45

The first term in Gain(S, A) is the entropy of the original set

The second term is the expected entropy after partitioning on A = the sum of the entropies of each subset Sv, weighted by the ratio |Sv| / |S|

Information Gain

Page 46: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 46

Information Gain

Returning to the Example

Values(HeadCovered) = {Yes, No}
S = [9+, 5−], S_Yes = [6+, 2−], S_No = [3+, 3−]

Gain(S, HeadCovered) = Entropy(S) − Σ (over v ∈ {Yes, No}) (|Sv| / |S|) · Entropy(Sv)
                     = Entropy(S) − (6/14) · Entropy(S_No) − (8/14) · Entropy(S_Yes)
                     = 0.940 − (6/14) · 1.00 − (8/14) · 0.811
                     = 0.048

Page 47: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 47

Information Gain

Which attribute is the best classifier?

HairLength
  S: [9+, 5-]
  Short → [3+, 4-]  (E = 0.985)
  Long  → [6+, 1-]  (E = 0.592)
Gain(S, HairLength) = 0.940 − (7/14) · 0.985 − (7/14) · 0.592 = 0.151

HeadCovered
  S: [9+, 5-]
  Yes → [6+, 2-]  (E = 0.811)
  No  → [3+, 3-]  (E = 1.00)
Gain(S, HeadCovered) = 0.940 − (8/14) · 0.811 − (6/14) · 1.00 = 0.048

Page 48: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 48

Training Examples

Example No. | Height | Weight | HairLength | HeadCovered | Gender
x1          | Short  | Light  | Short      | Yes         | Male
x2          | Short  | Light  | Short      | No          | Male
x3          | Normal | Light  | Short      | Yes         | Female
x4          | Tall   | Normal | Short      | Yes         | Female
x5          | Tall   | Heavy  | Long       | Yes         | Female
x6          | Tall   | Heavy  | Long       | No          | Male
x7          | Normal | Heavy  | Long       | Yes         | Female
x8          | Short  | Normal | Short      | Yes         | Male
x9          | Short  | Heavy  | Long       | Yes         | Female
x10         | Tall   | Normal | Long       | No          | Female
x11         | Short  | Normal | Long       | No          | Female
x12         | Normal | Normal | Short      | No          | Female
x13         | Normal | Light  | Long       | Yes         | Female
x14         | Tall   | Normal | Short      | No          | Male

Page 49: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 49

First step: which attribute to test at the root ?

Gain(S, Height) = 0.246

Gain(S, Weight) = 0.029

Gain(S, HairLength) = 0.151

Gain(S, HeadCovered) = 0.048

Which attribute should be tested at the root?

Height provides the best prediction for the target
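To double-check these figures, the following self-contained sketch (my own code, with the 14 examples from the table re-typed as Python dicts) computes the information gain of each attribute. The printed values agree with the slide up to rounding of the intermediate entropies.

```python
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels).values()
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts)

def information_gain(examples, attr, target):
    """Entropy of the whole set minus the weighted entropies of the subsets
    obtained by partitioning on `attr`."""
    subsets = {}
    for ex in examples:
        subsets.setdefault(ex[attr], []).append(ex[target])
    all_labels = [ex[target] for ex in examples]
    return entropy(all_labels) - sum(
        (len(part) / len(examples)) * entropy(part) for part in subsets.values())

# The 14 training examples from the table, as (Height, Weight, HairLength, HeadCovered, Gender).
rows = [
    ("Short",  "Light",  "Short", "Yes", "Male"),   ("Short",  "Light",  "Short", "No",  "Male"),
    ("Normal", "Light",  "Short", "Yes", "Female"), ("Tall",   "Normal", "Short", "Yes", "Female"),
    ("Tall",   "Heavy",  "Long",  "Yes", "Female"), ("Tall",   "Heavy",  "Long",  "No",  "Male"),
    ("Normal", "Heavy",  "Long",  "Yes", "Female"), ("Short",  "Normal", "Short", "Yes", "Male"),
    ("Short",  "Heavy",  "Long",  "Yes", "Female"), ("Tall",   "Normal", "Long",  "No",  "Female"),
    ("Short",  "Normal", "Long",  "No",  "Female"), ("Normal", "Normal", "Short", "No",  "Female"),
    ("Normal", "Light",  "Long",  "Yes", "Female"), ("Tall",   "Normal", "Short", "No",  "Male"),
]
names = ["Height", "Weight", "HairLength", "HeadCovered", "Gender"]
examples = [dict(zip(names, row)) for row in rows]

for attr in names[:-1]:
    print(attr, round(information_gain(examples, attr, "Gender"), 3))
# Height 0.247, Weight 0.029, HairLength 0.152, HeadCovered 0.048
# Height has by far the largest gain, as on the slide; the small differences from
# 0.246 / 0.151 come from rounding the intermediate entropies on the slides.
```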

Page 50: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 50

Information Gain

Height
  [x1, x2, …, x14]  [9+, 5-]
  Short  → Sshort = [x1, x2, x8, x9, x11]  [2+, 3-]  → ?
  Normal → [x3, x7, x12, x13]  [4+, 0-]  → Female
  Tall   → [x4, x5, x6, x10, x14]  [3+, 2-]  → ?

Let's grow the tree

Add to the tree a successor for each possible value of Height

Partition the training samples according to the value of Height

Page 51: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 51

After first step

Sshort = {x1, x2, x8, x9, x11}

Gain(Sshort, HairLength)  = 0.970 − (3/5)(0.0) − (2/5)(0.0) = 0.970
Gain(Sshort, Weight)      = 0.970 − (2/5)(0.0) − (2/5)(1.0) − (1/5)(0.0) = 0.570
Gain(Sshort, HeadCovered) = 0.970 − (2/5)(1.0) − (3/5)(0.918) = 0.019

Which attribute should be tested here?

Page 52: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 52

Second step

Sshort = {x1, x2, x8, x9, x11}

Gain(Sshort, HairLength)  = 0.970 − (3/5)(0.0) − (2/5)(0.0) = 0.970
Gain(Sshort, Weight)      = 0.970 − (2/5)(0.0) − (2/5)(1.0) − (1/5)(0.0) = 0.570
Gain(Sshort, HeadCovered) = 0.970 − (2/5)(1.0) − (3/5)(0.918) = 0.019

Working on Height = Short node

HairLength provides the best prediction for the target

Page 53: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 53

Second and Third Steps

Let's grow the tree. The final tree for S is:

Height
  Short  → HairLength
             Short → Male    [x1, x2, x8]
             Long  → Female  [x9, x11]
  Normal → Female  [x3, x7, x12, x13]
  Tall   → HeadCovered
             No  → Male    [x6, x10, x14]
             Yes → Female  [x4, x5]

Page 54: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 54

Hypothesis Space Search by ID3

[Figure: ID3's hypothesis space search – partially grown decision trees over attributes A1, A2, A3, A4, explored from simple to complex]

Page 55: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 55

Hypothesis Space Search by ID3

Recall – Learning is a search problem
Given a set of training examples and a Hypothesis Space (H), find (search for) a hypothesis h which best fits the training examples

ID3 searches a Hypothesis Space (the set of possible decision trees) for one fitting the training data
The search is a simple-to-complex, hill-climbing search guided by the Information Gain evaluation function

Page 56: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 56

Hypothesis Space Search by ID3

Hypothesis Space of ID3 is complete space of finite, discrete-valued functions w.r.t available attributes

Contrast with incomplete hypothesis spaces, such as conjunctive hypothesis space

Page 57: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 57

Hypothesis Space Search by ID3

ID3 maintains only one hypothesis at any time, instead of, e.g., all hypotheses consistent with training examples seen so far

Contrast with CANDIDATE-ELIMINATION
Means can't determine how many alternative decision trees are consistent with data
Means can't ask questions to resolve competing alternatives

Page 58: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 58

Hypothesis Space Search by ID3

ID3 performs no backtracking – once an attribute is selected for testing at a given node, this choice is never reconsidered

So, susceptible to converging to locally optimal rather than globally optimal solutions

Page 59: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 59

Hypothesis Space Search by ID3

Uses all training examples at each step to make statistically-based decision about how to refine current hypothesis

Contrast with CANDIDATE-ELIMINATION or FIND-S – make decisions incrementally based on single training examples
Using statistically-based properties of all examples (information gain) means technique is robust in the face of errors in individual examples

Page 60: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Inductive Bias in Decision Tree Learning

Page 61: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 61

Inductive Bias in Decision Tree Learning

Inductive Bias

The set of assumptions needed, in addition to the training data, to deductively justify the learner's classifications

Given a set of training examples, there may be many decision trees consistent with them.

Inductive bias of ID3 is shown by which of these trees it chooses

Page 62: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 62

Inductive Bias in Decision Tree Learning

ID3's search strategy (simple-to-complex, hill-climbing):
Selects shorter trees over longer ones
Selects trees that place attributes with highest Information Gain closest to root

Page 63: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 63

Inductive Bias in Decision Tree Learning

Inductive bias of ID3

1. Shorter trees are preferred over longer trees.
2. Trees that place high information gain attributes close to the root are preferred to those that do not.

Note that one could produce a decision tree learning algorithm with the simpler bias of always preferring a shorter tree

Page 64: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 64

Inductive Bias in Decision Tree Learning

How does inductive bias of ID3 compare to that of version space CANDIDATE-ELIMINATION algorithm?

ID3 incompletely searches a complete hypothesis space
CANDIDATE-ELIMINATION completely searches an incomplete hypothesis space

Page 65: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 65

Inductive Bias in Decision Tree Learning

Can be put differently by saying

The inductive bias of ID3 follows from its search strategy (a preference bias or search bias)
The inductive bias of CANDIDATE-ELIMINATION follows from the definition of its search space (a restriction bias or language bias)

Page 66: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 66

Inductive Bias in Decision Tree Learning

Note that a preference bias only affects the order in which hypotheses are investigated; a restriction bias affects which hypotheses are investigated at all

It is generally better to choose an algorithm with a preference bias rather than a restriction bias – with a restriction bias the target function may not be contained in the hypothesis space

Page 67: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 67

Inductive Bias in Decision Tree Learning

Note that some algorithms may combine preference and restriction biases – e.g. the checkers learning program:

A linear weighted function of a fixed set of board features introduces a restriction bias (non-linear potential target functions are excluded)
Least mean squares parameter tuning introduces a preference bias into the search through the space of parameter values

Page 68: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 68

Inductive Bias in Decision Tree Learning

Is ID3's inductive bias sound? Why prefer shorter hypotheses/trees?

One Response
"Occam's Razor" – prefer the simplest hypothesis that fits the data.
This is a general assumption that many natural scientists make

Page 69: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 69

Occam’s Razor

Why prefer short hypotheses?

Argument in favor:
Fewer short hypotheses than long hypotheses
A short hypothesis that fits the data is unlikely to be a coincidence
A long hypothesis that fits the data might be a coincidence

Argument opposed:
There are many ways to define small sets of hypotheses, e.g. all trees with a prime number of nodes that use attributes beginning with "Z"
What is so special about small sets based on size of hypothesis?

Page 70: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Refinements to Basic Decision Tree Learning

Page 71: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 71

Issues in Decision Tree Learning

Practical issues in learning decision trees include:
Determining how deeply to grow the decision tree
Handling continuous attributes
Choosing an appropriate attribute selection measure
Handling training data with missing attribute values
Handling attributes with differing costs, and
Improving computational efficiency

Page 72: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 72

Refinements to Basic Decision Tree Learning

Overfitting Training Data + Tree Pruning

The simple ID3 algorithm can produce trees that overfit the training examples

In case of:
Noise in the data
Number of training examples is too small to produce a representative sample of the true target function

Page 73: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 73

Refinements to Basic Decision Tree Learning

Suppose that, in addition to the 14 examples for Gender Identification, we get a 15th example whose target classification is wrong:

‹Tall, Heavy, Long, No, Gender = Female›

Page 74: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 74

Refinements to Basic Decision Tree Learning

Height
  Short  → HairLength
             Short → Male
             Long  → Female
  Normal → Female
  Tall   → HeadCovered
             No  → Male
             Yes → Female

What impact will this have on our earlier tree?

Page 75: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 75

Refinements to Basic Decision Tree Learning

Since we previously had the correct examples:
‹Short, Heavy, Short, Yes, Gender = Male›
‹Short, Normal, Short, Yes, Gender = Male›

The tree will be elaborated below the right branch of HairLength

The result will be a tree that performs well on the (errorful) training examples, but less well on new unseen instances

Page 76: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 76

Refinements to Basic Decision Tree Learning

The addition of this incorrect example will now cause ID3 to construct a more complex tree. The new example will be sorted into the second leaf node from the left in the learned tree, along with the previous positive examples x9 and x11.

Because the new example is labeled as a negative example, ID3 will search for further refinements to the tree below this node.

The result will be a tree which performs well on the (errorful) training examples and less well on new unseen instances.

Page 77: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 77

Refinements: Overfitting Training Data

Overfitting can also occur when the number of training examples is too small to be representative of the true target function – coincidental regularities may be picked up during training

Adapting to noisy training data is one type of overfitting

Page 78: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 78

Refinements: Overfitting Training Data

Definition

More precisely: Given a hypothesis space H, a hypothesis h ∈ H overfits the training data if there is another hypothesis h′ ∈ H such that h has smaller error than h′ over the training data, but h′ has a smaller error over the entire distribution of instances

Page 79: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 79

Refinements: Overfitting Training Data

Overfitting is a real problem for decision tree learning – a 10% to 25% decrease in accuracy over a range of tasks in one empirical study
Overfitting is a problem for many other machine learning methods too

Page 80: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 80

Refinements: Overfitting Training Data (Example)

Example of ID3 learning which medical patients have a form of diabetes

Page 81: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 81

Refinements: Overfitting Training Data (Example)

Accuracy of tree over training examples increases monotonically as tree grows (to be expected)

Accuracy of tree over independent test examples increases till about 25 nodes, then decreases

Page 82: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 82

Refinements: Avoiding Overfitting

Overfitting Training Data + Tree Pruning

Two general approaches

Stop growing tree before perfectly fitting training data

e.g. when data split is not statistically significant

Grow full tree, then prune afterwards

In practice, second approach has been more successful

Page 83: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 83

Refinements: Avoiding Overfitting

For either approach, how can the optimal final tree size be decided?

Use a set of examples distinct from the training examples to evaluate the quality of the tree; or
Use all data for training but apply a statistical test to decide whether expanding/pruning a given node is likely to improve performance over the whole instance distribution; or
Measure the complexity of encoding the training examples + decision tree and stop growing the tree when this size is minimized – the minimum description length principle

Page 84: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 84

Refinements: Avoiding Overfitting

First approach most common – called training and validation set approach

Divide available instances into

Training set – commonly 2/3 of data

Validation set – commonly 1/3 of data

Hope is that random errors and coincidental regularities learned from training set will not be present in validation set

Page 85: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 85

Training and Validation Set Approach

Divide the available instances into:

Data (100%) = 100k
  Training (90%) = 90k
    Train (90%) = 81k
    Validation (10%) = 9k
  Testing (10%) = 10k
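A hedged sketch of such a split using scikit-learn's train_test_split; the 100k instances here are synthetic stand-ins, and the two-stage 90/10-then-90/10 split reproduces the 81k / 9k / 10k figures above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100_000).reshape(-1, 1)      # stand-in feature matrix (100k instances)
y = np.random.randint(0, 2, size=100_000)  # stand-in binary labels

# First hold out 10% of all data as the test set ...
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
# ... then hold out 10% of the remaining 90% as the validation set.
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.10, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 81000 9000 10000
```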

Page 86: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 86

Refinements: Reduced Error Pruning

Assumes data split into training and validation sets

Proceed as follows

Train the decision tree on the training set
Do until further pruning is harmful:
  For each decision node, evaluate the impact on the validation set of removing that node and those below it
  Remove the node whose removal most improves accuracy on the validation set

Page 87: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 87

Refinements: Reduced Error Pruning

How is the impact of removing a node evaluated?

When a decision node is removed, the subtree rooted at it is replaced with a leaf node whose classification is the most common classification of the examples beneath the decision node

Page 88: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 88

Refinements: Reduced Error Pruning

To assess the value of reduced error pruning, split the data into 3 distinct sets:

Training examples for learning the original tree
Validation examples for guiding tree pruning
Test examples to provide an estimate of accuracy over future unseen examples

Page 89: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 89

Refinements: Reduced Error Pruning

On previous example, reduced error pruning produces this effect

Page 90: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 90

Refinements: Reduced Error Pruning (cont…)

Drawback

Holding data back for a validation set reduces the data available for training

Page 91: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 91

Refinements: Rule Post-Pruning

Perhaps the most frequently used method (e.g., C4.5)

Proceed as follows:
Convert the tree to an equivalent set of rules
Prune each rule independently of the others
Sort the final rules into the desired sequence for use

Page 92: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 92

Refinements: Rule Post-Pruning

Convert tree to rules by making the conjunction of decision nodes along each branch the antecedent of a rule and each leaf the consequent

Page 93: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 93

Refinements: Rule Post-Pruning

Height
  Short  → HairLength
             Short → Male
             Long  → Female
  Normal → Female
  Tall   → HeadCovered
             No  → Male
             Yes → Female

IF (Height = Short) ∧ (HairLength = Long) THEN Gender = Female
IF (Height = Short) ∧ (HairLength = Short) THEN Gender = Male
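A small sketch of this conversion (my own code, reusing the nested-dict tree format from the ID3 sketch earlier): it walks each root-to-leaf path and prints it as an if-then rule, and the hand-coded `gender_tree` below encodes the tree shown above.

```python
def tree_to_rules(tree, path=()):
    """Return one (preconditions, classification) rule per root-to-leaf path."""
    if not isinstance(tree, dict):                      # leaf node
        return [(list(path), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, path + ((tree["attribute"], value),))
    return rules

gender_tree = {
    "attribute": "Height",
    "branches": {
        "Short":  {"attribute": "HairLength",
                   "branches": {"Short": "Male", "Long": "Female"}},
        "Normal": "Female",
        "Tall":   {"attribute": "HeadCovered",
                   "branches": {"No": "Male", "Yes": "Female"}},
    },
}

for conditions, label in tree_to_rules(gender_tree):
    tests = " ∧ ".join(f"({attr} = {value})" for attr, value in conditions)
    print(f"IF {tests} THEN Gender = {label}")
```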

Page 94: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 94

Refinements: Rule Post-Pruning

To prune rules, remove any precondition (conjunct in antecedent) of a rule whose removal does not worsen rule accuracy

Can estimate rule accuracy

By using a separate validation set
By using the training data, but assuming a statistically-based pessimistic estimate of rule accuracy (C4.5)

Page 95: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 95

Refinements: Rule Post-Pruning

Three advantages of converting trees to rules before pruning:

Converting to rules allows distinguishing the different contexts in which rules are used – each path through the tree is treated differently (contrast: removing a decision node removes all paths beneath it)
It removes the distinction between testing nodes near the root and those near the leaves – avoiding the need to rearrange the tree should higher nodes be removed
Rules are often easier for people to understand

Page 96: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 96

Refinements: Continuous-valued Attributes

The initial definition of ID3 is restricted to discrete-valued:
Target attributes
Decision node attributes

The second limitation can be overcome by dynamically defining new discrete-valued attributes that partition a continuous attribute value into a set of discrete intervals

Page 97: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 97

Refinements: Continuous-valued Attributes

So, for a continuous attribute A, dynamically create a new Boolean attribute Ac that is true if A > c and false otherwise

How do we pick c? → Pick the c that maximizes information gain

Page 98: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 98

Refinements: Continuous-valued Attributes

E.g. suppose for the Gender Identification example we want Weight to be a continuous attribute

Weight 40 48 60 72 80 90

Gender Female Female Male Male Male Female

Page 99: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 99

Refinements: Continuous-valued Attributes

Sort the examples by Weight and identify candidate thresholds midway between points where the target attribute changes: (48 + 60)/2 = 54 and (80 + 90)/2 = 85

Compute the information gain for the Boolean attributes Weight > 54 and Weight > 85 and select the one with the higher gain (Weight > 54)

This can be extended to split a continuous attribute into more than two intervals
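A sketch of this threshold search (my own code; the weights and labels are the six values from the table above). Candidate thresholds sit midway between adjacent sorted values where the class changes, and the one with the highest information gain is kept.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the Boolean split A > c with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy([lbl for _, lbl in pairs])
    best = None
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 != l2:                                   # class changes -> candidate threshold
            c = (v1 + v2) / 2
            left = [lbl for v, lbl in pairs if v <= c]
            right = [lbl for v, lbl in pairs if v > c]
            gain = (base - (len(left) / len(pairs)) * entropy(left)
                         - (len(right) / len(pairs)) * entropy(right))
            if best is None or gain > best[1]:
                best = (c, gain)
    return best

weights = [40, 48, 60, 72, 80, 90]
genders = ["Female", "Female", "Male", "Male", "Male", "Female"]
print(best_threshold(weights, genders))   # (54.0, 0.459...): Weight > 54 is selected, as above
```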

Page 100: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 100

Information gain measure favors attributes with many values over those with few values

Refinements: Alternative Attribute Selection Measures

Page 101: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 101

For Example

If we add a CNIC No attribute to the Gender example, it will have a distinct value for each training example and will therefore have the highest information gain

This is because CNIC No perfectly predicts the target attribute for all training examples
The result is a tree of depth 1 that perfectly classifies the training examples but fails on all other data

Refinements: Alternative Attribute Selection Measures

Page 102: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 102

Can avoid this by using other attribute selection measures

One alternative is gain ratio

Refinements: Alternative Attribute Selection Measures

Page 103: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 103

Refinements: Alternative Attribute Selection Measures

GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) ≡ − Σ (from i = 1 to c) (|Si| / |S|) log2(|Si| / |S|)

where Si is the subset of S for which the c-valued attribute A has value vi
(Note: Split Information is the entropy of S w.r.t. the values of A)

This has the effect of penalizing attributes with many, uniformly distributed values
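A minimal sketch of these two quantities (my own helpers; `information_gain` refers to the function sketched earlier in these notes, and the attribute is assumed to take more than one value among the examples):

```python
from collections import Counter
from math import log2

def split_information(examples, attr):
    """Entropy of the example set with respect to the values of `attr` itself."""
    n = len(examples)
    counts = Counter(ex[attr] for ex in examples).values()
    return -sum((c / n) * log2(c / n) for c in counts)

def gain_ratio(examples, attr, target):
    # information_gain is the helper defined in the earlier sketch;
    # split_information must be non-zero (attr takes more than one value).
    return information_gain(examples, attr, target) / split_information(examples, attr)
```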

Page 104: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 104

Experiments with variants of this and other attribute selection measures have been carried out and are reported in the machine learning literature

Refinements: Alternative Attribute Selection Measures

Page 105: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 105

Refinements: Missing/Unknown Attribute Values

What if a training example x is missing a value for attribute A?

Several alternatives have been explored

Page 106: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 106

Refinements: Missing/Unknown Attribute Values

At the decision node n where Gain(S, A) is computed:

Assign the most common value of A among the other examples sorted to node n, or
Assign the most common value of A among the other examples at n with the same target attribute value as x, or
Assign a probability pi to each possible value vi of A, estimated from the observed frequencies of the values of A for the examples sorted to node n

Page 107: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 107

Refinements: Missing/Unknown Attribute Values

At the decision node n where Gain(S, A) is computed (continued):

Assign a fraction pi of example x distributed down each branch in the tree below n (this technique is used in C4.5)
The last technique can also be used to classify new examples with missing attributes (i.e. after learning) in the same fashion
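As an illustration of the first strategy only (my own sketch, with a missing value represented as None):

```python
from collections import Counter

def impute_most_common(examples, attr):
    """Replace a missing value of `attr` (None) with the most common observed
    value of `attr` among the examples sorted to the current node."""
    observed = [ex[attr] for ex in examples if ex[attr] is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [{**ex, attr: ex[attr] if ex[attr] is not None else most_common}
            for ex in examples]

node_examples = [{"HairLength": "Long"}, {"HairLength": "Long"}, {"HairLength": None}]
print(impute_most_common(node_examples, "HairLength"))
# [{'HairLength': 'Long'}, {'HairLength': 'Long'}, {'HairLength': 'Long'}]
```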

Page 108: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 108

Refinements: Attributes with Differing Costs

Different attributes may have different costs associated with acquiring their values

For Example

In medical diagnosis, different tests, such as blood tests and brain scans, have different costs
In robotics, positioning a sensing device on a robot so as to take a different measurement requires differing amounts of time (= cost)

Page 109: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 109

Refinements: Attributes with Differing Costs

How to learn a consistent tree with low expected cost?

Page 110: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 110

Refinements: Attributes with Differing Costs

Various approaches have been explored in which the attribute selection measure is modified to include a cost term

For Example
Gain²(S, A) / Cost(A)
(2^Gain(S, A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the relative importance of cost

Page 111: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 111

Summary

Decision trees classify instances. Testing starts at the root and proceeds downwards

Non-leaf nodes test one attribute of the instance and the attribute value determines which branch is followed.
Leaf nodes are instance classifications.

Page 112: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 112

Summary (cont…)

Decision trees are appropriate for problems where:
Instances are describable by attribute–value pairs (typically, but not necessarily, nominal)
Target function is discrete valued (typically, but not necessarily)
Disjunctive hypotheses may be required
Training data may be noisy/incomplete.

Page 113: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 113

Summary (cont…)

Various algorithms have been proposed to learn decision trees – ID3 is the classic

Recursively grows the tree from the root, picking at each point the attribute which maximizes information gain with respect to the training examples sorted to the current node
Recursion stops when all examples down a branch fall into a single class or all attributes have been tested

Page 114: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 114

Summary (cont…)

ID3 carries out incomplete search of complete hypothesis space – contrast with CANDIDATE-ELIMINATION which carries out a complete search of an incomplete hypothesis space

Page 115: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 115

Summary (cont…)

Decision trees exhibit an inductive bias which prefers shorter trees with high information gain attributes closer to the root (at least where information gain is used as the attribute selection criterion, as in ID3)

ID3 searches a complete hypothesis space for discrete-valued functions, but searches the space incompletely, using the information gain heuristic

Page 116: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 116

Summary (cont…)

Overfitting the training data is an important issue in decision tree learning

Noise or coincidental regularities due to small samples may mean that while growing a tree beyond a certain size improves its performance on the training data, it worsens its performance on unseen instances

Page 117: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 117

Summary (cont…)

Overfitting can be addressed by post-pruning the decision tree in a variety of ways

Page 118: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · 2020-04-02 · Lecture Outline • Decision Trees • ID3 - Basic Decision Tree Learning Algorithm • Entropy and Information

Dr. Rao Muhammad Adeel Nawab 118

Summary (cont…)

Various other refinements of the basic ID3 algorithm address issues such as:

Handling real-valued attributes
Handling training/test instances with missing attribute values
Using attribute selection measures other than information gain
Allowing costs to be associated with attributes