Decision Tree Using the C4.5 Algorithm



Page 1: Decision tree Using c4.5 Algorithm

Mohd Noor Abdul Hamid, Ph.D, Universiti Utara Malaysia

[email protected]

Page 2: Decision tree Using c4.5 Algorithm

After this class, you should be able to :

• Explain the C4.5 Algorithm
• Use the algorithm to develop a Decision Tree


Page 3: Decision tree Using c4.5 Algorithm

Decision trees are constructed using only those attributes best able to differentiate the concepts to be learned. The main goal is to minimize the number of tree levels and tree nodes while maximizing data generalization.

Bear in mind!

Page 4: Decision tree Using c4.5 Algorithm

The C4.5 Algorithm

1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances contained in T.


Page 5: Decision tree Using c4.5 Algorithm

Let T be the set of training instances

Choose an attribute that best differentiates the instances contained in T.

The C4.5 Algorithm

Create a tree node whose value is the chosen attribute. Create child links from this node where each link represents a unique value for the chosen attribute. Use the child link values to further subdivide the instances into subclasses.

Page 6: Decision tree Using c4.5 Algorithm

Let T be the set of training instances

Choose an attribute that best differentiates the instances contained in T.

Create a tree node whose value is the chosen attribute. Create child links from this node where each link represents a unique value for the chosen attribute. Use the child link values to further subdivide the instances into subclasses.

The C4.5 Algorithm

Instances in the subclass satisfy the predefined criteria, OR the set of remaining attribute choices for the path is null.

Page 7: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T.

Create a tree node whose value is the chosen attribute. Create child links from this node where each link represents a unique value for the chosen attribute. Use the child link values to further subdivide the instances into subclasses.

Instances in the subclass satisfy the predefined criteria, OR the set of remaining attribute choices for the path is null?

Y: Specify the classification for new instances following this decision path.

The C4.5 Algorithm

Page 8: Decision tree Using c4.5 Algorithm

Create a tree node whose value is the chosen attribute. Create child links from this node where each link represents a unique value for the chosen attribute. Use the child link values to further subdivide the instances into subclasses.

Instances in the subclass satisfy the predefined criteria, OR the set of remaining attribute choices for the path is null?

Y: Specify the classification for new instances following this decision path.

N: Let T be the current set of subclass instances and return to step 2.

The C4.5 Algorithm

Page 9: Decision tree Using c4.5 Algorithm

Let T be the set of training instances

Choose an attribute that best differentiates the instances contained in T.

Create a tree node whose value is the chosen attribute. Create child links from this node where each link represents a unique value for the chosen attribute. Use the child link values to further subdivide the instances into subclasses.

The C4.5 Algorithm

END
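The loop on pages 4 to 9 can be sketched in Python. This is only an outline of the recursion, not full C4.5: `satisfies_criteria` and `best_attribute` are hypothetical callbacks standing in for the predefined stopping criteria and the attribute-selection measure covered later in the exercise.

```python
# Sketch of the C4.5 recursion described above (hypothetical helper names).
def build_tree(instances, attributes, target, satisfies_criteria, best_attribute):
    # Stop: predefined criteria met, or no attributes left to split on.
    if satisfies_criteria(instances, target) or not attributes:
        # Classify new instances on this path by the majority target value.
        values = [row[target] for row in instances]
        return max(set(values), key=values.count)
    attr = best_attribute(instances, attributes, target)   # step 2
    node = {"attribute": attr, "children": {}}             # step 3: tree node
    for value in {row[attr] for row in instances}:         # child links
        subset = [row for row in instances if row[attr] == value]
        remaining = [a for a in attributes if a != attr]
        node["children"][value] = build_tree(subset, remaining, target,
                                             satisfies_criteria, best_attribute)
    return node
```

Each child link carries one unique value of the chosen attribute, and the subset reaching it becomes the new T for the recursive call, exactly as in steps 2 to 4.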

Page 10: Decision tree Using c4.5 Algorithm

Exercise

Page 11: Decision tree Using c4.5 Algorithm

Exercise : The Scenario
• BIGG Credit Card company wishes to develop a predictive model to identify customers who are likely to take advantage of the life insurance promotion, so that it can mail the promotional item to those potential customers.

Page 12: Decision tree Using c4.5 Algorithm

Exercise : The Scenario
The model will be developed using the data stored in the credit card promotion database. The data contains information obtained about customers through their initial credit card application, as well as data about whether these individuals have accepted various promotional offerings sponsored by the company.

Dataset

Page 13: Decision tree Using c4.5 Algorithm

Let T be the set of training instances

Exercise
We follow our previous work with creditcardpromotion.xls. The dataset consists of 15 instances (observations), T, each having 10 attributes (variables). For our example, the input attributes are limited to 5. Why??

Step 1

Decision trees are constructed using only those attributes best able to differentiate the concepts to be learned.

Page 14: Decision tree Using c4.5 Algorithm

Let T be the set of training instances

Exercise, Step 1

Independent variables (inputs):
• Age : Interval : 19 – 55 years
• Sex : Nominal : Male, Female
• Income Range : Ordinal : 20 – 30K, 30 – 40K, 40 – 50K, 50 – 60K
• Credit Card Insurance : Binary : Yes, No

Dependent variable (target / output):
• Life Insurance Promotion : Binary : Yes, No

Page 15: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2
C4.5 uses a measure taken from information theory to help with the attribute selection process. The idea is: at any choice point in the tree, C4.5 selects the attribute that splits the data so as to show the largest gain in information. We need to choose the input attribute that best differentiates the instances in T. Our choices are among:
• Income Range
• Credit Card Insurance
• Sex
• Age

Page 16: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2
A Goodness Score for each attribute is calculated to determine which attribute best differentiates the training instances (T):

Goodness Score = (sum of the most frequently encountered class in each branch ÷ |T|) ÷ number of branches

We can develop a partial tree for each attribute in order to calculate its Goodness Score. The Goodness Score is, in effect, a measure of accuracy.
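The score defined above can be written as a small Python function. The representation of the branch counts as a list of class-to-count dicts is an assumption for illustration, not part of the original slides.

```python
# Goodness Score as defined on this slide: sum of each branch's
# majority-class count, divided by |T|, then by the number of branches.
def goodness_score(branch_counts, total):
    """branch_counts: one dict per branch, mapping class label -> count."""
    majority_sum = sum(max(counts.values()) for counts in branch_counts)
    return (majority_sum / total) / len(branch_counts)
```

For example, the Sex attribute from this exercise (Male: 3 Yes, 5 No; Female: 6 Yes, 1 No) gives `goodness_score([{"Yes": 3, "No": 5}, {"Yes": 6, "No": 1}], 15)`, i.e. ((5 + 6) ÷ 15) ÷ 2 ≈ 0.367.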

Page 17: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2a: Income Range

Page 18: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2a: Income Range

Income Range:
• 20 – 30K : 2 Yes, 2 No
• 30 – 40K : 4 Yes, 1 No
• 40 – 50K : 1 Yes, 3 No
• 50 – 60K : 2 Yes, 0 No

Page 19: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2a: Income Range

Goodness Score = (sum of the most frequently encountered class in each branch ÷ |T|) ÷ number of branches
= ((2 + 4 + 3 + 2) ÷ 15) ÷ 4
= 0.183
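The Income Range calculation above can be checked directly; the dict representation of the branch counts is an illustrative assumption.

```python
# Goodness Score for Income Range, using the branch counts from the
# previous slide (majority class per branch: 2, 4, 3, 2).
branches = [{"Yes": 2, "No": 2},   # 20-30K
            {"Yes": 4, "No": 1},   # 30-40K
            {"Yes": 1, "No": 3},   # 40-50K
            {"Yes": 2, "No": 0}]   # 50-60K
majority_sum = sum(max(b.values()) for b in branches)  # 2 + 4 + 3 + 2 = 11
score = (majority_sum / 15) / len(branches)            # (11 / 15) / 4
print(round(score, 3))  # 0.183
```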

Page 20: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2b: Credit Card Insurance

Page 21: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2b: Credit Card Insurance

Credit Card Insurance:
• No : 6 Yes, 6 No
• Yes : 3 Yes, 0 No

Page 22: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2b: Credit Card Insurance

Goodness Score = ((6 + 3) ÷ 15) ÷ 2
= 0.30

Page 23: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2c: Sex

Page 24: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2c: Sex

Sex:
• Male : 3 Yes, 5 No
• Female : 6 Yes, 1 No

Page 25: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2c: Sex

Goodness Score = ((6 + 5) ÷ 15) ÷ 2
= 0.367

Page 26: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2d: Age

Age is an interval (numeric) variable, so we need to determine the best split point among its values. For this example, we opt for a binary split. Why??

The main goal is to minimize the number of tree levels and tree nodes while maximizing data generalization.

Page 27: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2d: Age
Steps to determine the best split point for an interval (numerical) attribute:
1. Sort the Age values (paired with the target, Life Ins Promo)

Age 19 27 29 35 38 39 40 41 42 43 43 43 45 55 55
LIP  Y  N  Y  Y  Y  Y  Y  Y  N  Y  Y  N  N  N  N

Page 28: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2d: Age
2. A Goodness Score is computed for each possible split point.

Age 19 | 27 29 35 38 39 40 41 42 43 43 43 45 55 55
LIP  Y |  N  Y  Y  Y  Y  Y  Y  N  Y  Y  N  N  N  N

Split after 19: left branch = 1 Yes, 0 No; right branch = 8 Yes, 6 No
Goodness Score = ((1 + 8) ÷ 15) ÷ 2 = 0.30

Page 29: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2d: Age
2. A Goodness Score is computed for each possible split point.

Age 19 27 | 29 35 38 39 40 41 42 43 43 43 45 55 55
LIP  Y  N |  Y  Y  Y  Y  Y  Y  N  Y  Y  N  N  N  N

Split after 27: left branch = 1 Yes, 1 No; right branch = 8 Yes, 5 No
Goodness Score = ((1 + 8) ÷ 15) ÷ 2 = 0.30

Page 30: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2d: Age
2. A Goodness Score is computed for each possible split point.

This process continues until a score for the split between 45 and 55 is obtained. The split point with the highest Goodness Score, 43, is chosen.

Page 31: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2d: Age
2. A Goodness Score is computed for each possible split point.

Age 19 27 29 35 38 39 40 41 42 43 43 43 | 45 55 55
LIP  Y  N  Y  Y  Y  Y  Y  Y  N  Y  Y  N |  N  N  N

Split at 43: left branch (≤ 43) = 9 Yes, 3 No; right branch (> 43) = 0 Yes, 3 No
Goodness Score = ((9 + 3) ÷ 15) ÷ 2 = 0.40
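The split-point scoring described in step 2d can be sketched as a small function over the sorted (Age, LIP) pairs from the slides; looping it over each candidate cut between adjacent values reproduces the search.

```python
# Goodness Score for a binary split of the numeric attribute Age,
# using the sorted (Age, LIP) pairs from the slides above.
ages = [19, 27, 29, 35, 38, 39, 40, 41, 42, 43, 43, 43, 45, 55, 55]
lip  = ["Y","N","Y","Y","Y","Y","Y","Y","N","Y","Y","N","N","N","N"]

def split_score(cut):
    left  = [c for a, c in zip(ages, lip) if a <= cut]
    right = [c for a, c in zip(ages, lip) if a > cut]
    majority = lambda side: max(side.count("Y"), side.count("N"))
    return ((majority(left) + majority(right)) / len(ages)) / 2

print(round(split_score(43), 2))  # 0.4
```

`split_score(19)` and `split_score(27)` both give 0.30, matching the earlier slides, while the cut at 43 scores 0.40.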

Page 32: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2d: Age

Age:
• ≤ 43 : 9 Yes, 3 No
• > 43 : 0 Yes, 3 No

Page 33: Decision tree Using c4.5 Algorithm

Choose an attribute that best differentiates the instances contained in T

Exercise, Step 2

Overall Goodness Score for each input attribute:

Attribute               Goodness Score
Age                     0.400
Sex                     0.367
Credit Card Insurance   0.300
Income Range            0.183

Therefore the attribute Age is chosen as the top-level node.

Page 34: Decision tree Using c4.5 Algorithm

• Create a tree node whose value is the chosen attribute.

• Create child links from this node where each link represents a unique value for the chosen attribute.

Exercise, Step 3

Page 35: Decision tree Using c4.5 Algorithm

Exercise, Step 3

Age:
• ≤ 43 : 9 Yes, 3 No
• > 43 : 0 Yes, 3 No

Page 36: Decision tree Using c4.5 Algorithm

For each subclass:
a. If the instances in the subclass satisfy the predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
b. If the subclass does not satisfy the predefined criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.

Exercise, Step 3


Page 37: Decision tree Using c4.5 Algorithm

Exercise, Step 3

Age:
• ≤ 43 : 9 Yes, 3 No. Does not satisfy the predefined criteria. Subdivide!
• > 43 : 0 Yes, 3 No. Satisfies the predefined criteria. Classification: Life Insurance = No


Page 38: Decision tree Using c4.5 Algorithm

Exercise, Step 3

Age:
• > 43 : 0 Yes, 3 No. Life Insurance = No
• ≤ 43 : split on Sex:
  • Female : 6 Yes, 0 No. Life Insurance = Yes
  • Male : 3 Yes, 3 No. Subdivide

Page 39: Decision tree Using c4.5 Algorithm

Exercise, Step 3

Age:
• > 43 : 0 Yes, 3 No. Life Insurance = No
• ≤ 43 : split on Sex:
  • Female : 6 Yes, 0 No. Life Insurance = Yes
  • Male : split on Credit Card Insurance:
    • No : 1 Yes, 3 No. Life Insurance = No
    • Yes : 2 Yes, 0 No. Life Insurance = Yes

Page 40: Decision tree Using c4.5 Algorithm

The Decision Tree: Life Insurance Promo

Age:
• > 43 : 0 Yes, 3 No. Life Insurance = No
• ≤ 43 : split on Sex:
  • Female : 6 Yes, 0 No. Life Insurance = Yes
  • Male : split on Credit Card Insurance:
    • No : 1 Yes, 3 No. Life Insurance = No
    • Yes : 2 Yes, 0 No. Life Insurance = Yes
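The final tree on this slide can be written as a plain function. The parameter names and the string values ("Female", "Yes") are illustrative assumptions; the actual dataset's column names may differ.

```python
# The final Life Insurance Promotion tree from this slide as a function.
def classify(age, sex, cc_insurance):
    """Predict Life Insurance Promotion: 'Yes' or 'No'."""
    if age > 43:                 # right branch: 0 Yes, 3 No
        return "No"
    if sex == "Female":          # left branch, Female: 6 Yes, 0 No
        return "Yes"
    # Male, age <= 43: decide on Credit Card Insurance
    return "Yes" if cc_insurance == "Yes" else "No"

print(classify(30, "Female", "No"))  # Yes
```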

Page 41: Decision tree Using c4.5 Algorithm

Exercise
The Decision Tree:
1. Our Decision Tree is able to accurately classify 14 out of 15 training instances.
2. Therefore, the accuracy of our model is 93%.

Page 42: Decision tree Using c4.5 Algorithm

Assignment
• Based on the Decision Tree model for the Life Insurance Promotion, develop an application (program) using any tools you are familiar with.
• Submit your code and report next week!
