Decision Tree Algorithms
Rule Based
Suitable for automatic generation
McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved
Decision Trees
• Logical branching
• Historical:
  – ID3 – early rule-generating system
• Branches:
  – Different possible values
• Nodes:
  – From which branches emanate
Goal-Driven Data Mining
• Define goal
  – Identify fraudulent cases
• Develop rules identifying attributes attaining that goal
  – IF attorney = Smith, THEN better check
Tree Structure
• Sorts out data
  – IF-THEN rules
  – Loan variables:
    • Age: {young, middle, old}
    • Income: {low, average, high}
    • Risk: {low, average, high}
• Exhaustive tree enumerates all combinations
  – 27 combinations – classify all
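As a quick sanity check, the combination count of the exhaustive tree can be enumerated with the standard library (the list names below are just the slide's loan example):

```python
from itertools import product

# Loan example: three variables, three values each (from the slide)
ages = ["young", "middle", "old"]
incomes = ["low", "average", "high"]
risks = ["low", "average", "high"]

# One leaf of the exhaustive tree per (age, income, risk) combination
combos = list(product(ages, incomes, risks))
print(len(combos))  # 27
```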
Types of Trees
• Classification tree
  – Variable values are classes
  – Finite conditions
• Regression tree
  – Variable values are continuous numbers
  – Prediction or estimation
Rule Induction
• Automatically processes data
  – Classification (logical, easier)
  – Regression (estimation, messier)
• Searches through data for patterns & relationships
  – Pure knowledge discovery
• Assumes no prior hypothesis
• Disregards human judgment
Example
• Three variables:
  – Age
  – Income
  – Risk
• Outcomes:
  – On-time
  – Late
Combinations

Variable  Value    Cases  OT  Late  Pr(OT)
Age       Young    12     8   4     0.67
          Middle   5      4   1     0.80
          Old      3      3   0     1.00
Income    Low      5      3   2     0.60
          Average  9      7   2     0.78
          High     6      5   1     0.83
Risk      High     9      5   4     0.55
          Average  1      0   1     0.00
          Low      10     10  0     1.00
Basis for Classification
• If a category has all outcomes of a certain kind, that makes a good rule
  – IF Income = High, they always paid
• ENTROPY: measure of content
  – Actually a measure of randomness
Entropy Formula

Information = -[p/(p+n)] log2[p/(p+n)] - [n/(p+n)] log2[n/(p+n)]

The lower the measure, the greater the information content.
It can be used to automatically select the variable with the most productive rule potential.
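The formula can be sketched as a small Python function (the name `information` and the convention of treating 0·log2(0) as 0 are choices of this sketch, not from the slides):

```python
import math

def information(p, n):
    """Entropy of a node holding p on-time and n late cases.
    0.0 means a pure node (best rule potential); 1.0 means a 50/50 split."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:  # treat 0 * log2(0) as 0
            frac = count / total
            result -= frac * math.log2(frac)
    return result

print(round(information(8, 4), 3))  # Young: 8 on-time, 4 late -> 0.918
print(round(information(3, 0), 3))  # Old: pure node -> 0.0
```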
Entropy
• Young: [-(8/12) log2(8/12) - (4/12) log2(4/12)] × 12/20 = (0.390 + 0.528) × 0.6 = 0.551
• Middle: [-(4/5) log2(4/5) - (1/5) log2(1/5)] × 5/20 = (0.258 + 0.464) × 0.25 = 0.180
• Old: [-(3/3) log2(3/3) - 0] × 3/20 = 0.000

SUM (Age): 0.731
Income: 0.782
Risk: 0.446
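The weighted entropies for all three variables can be reproduced from the combinations table; the `splits` dict below simply transcribes the (on-time, late) counts for each value:

```python
import math

def information(p, n):
    """Entropy of a node with p on-time and n late cases."""
    total = p + n
    return -sum(c/total * math.log2(c/total) for c in (p, n) if c > 0)

# (on-time, late) counts per value, from the combinations table (20 cases)
splits = {
    "Age":    [(8, 4), (4, 1), (3, 0)],
    "Income": [(3, 2), (7, 2), (5, 1)],
    "Risk":   [(5, 4), (0, 1), (10, 0)],
}

weighted = {}
for var, groups in splits.items():
    total = sum(p + n for p, n in groups)
    # Entropy of each value, weighted by its share of the cases
    weighted[var] = sum((p + n)/total * information(p, n) for p, n in groups)
    print(var, round(weighted[var], 3))
```

Risk has the lowest weighted entropy (0.446), so it is the most productive variable to split on first.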
Rule
1. IF Risk = Low THEN OT
2. ELSE Late
All Rules
1. IF Risk = Low THEN OT
2. IF Risk NOT Low & Age = Middle THEN Late
3. IF Risk NOT Low & Age NOT Middle & Income = High THEN Late
4. ELSE OT
Sample Case
• Age 36 → Middle
• Income $70K/year → Average
• Risk:
  – Assets $42K
  – Debts $40K
  – Wants $5K → Average
• Rule 2 applies, says Late
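The four induced rules and the sample case can be sketched as a plain function (the string labels and the function name `classify` are choices of this sketch):

```python
def classify(age, income, risk):
    """Crisp rule set induced above; arguments are category labels."""
    if risk == "low":
        return "on-time"   # Rule 1
    if age == "middle":
        return "late"      # Rule 2
    if income == "high":
        return "late"      # Rule 3
    return "on-time"       # Rule 4 (ELSE)

# Sample case: age 36 (middle), $70K/year (average), average risk
print(classify("middle", "average", "average"))  # late (Rule 2)
```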
Fuzzy Decision Trees
• Have assumed distinct (crisp) outcomes
• Many data points are not that clear
• Fuzzy: membership function represents belief (between 0 and 1)
• Fuzzy relationships have been incorporated into decision tree algorithms
Fuzzy Example

Age:    Young 0.3   Middle 0.9   Old 0.2
Income: Low 0.0     Average 0.8  High 0.3
Risk:   Low 0.1     Average 0.8  High 0.3

• Definitions:
  – Sum will not necessarily equal 1.0
  – If ambiguous, select the alternative with the larger membership value
  – Aggregate with the mean
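Selecting the alternative with the larger membership value is a one-liner in Python; the `age` dict below carries the example's membership values:

```python
# Fuzzy memberships for the sample case's age (from the example)
age = {"young": 0.3, "middle": 0.9, "old": 0.2}

# Resolve ambiguity: pick the label with the largest membership value
label = max(age, key=age.get)
print(label)  # middle
```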
Fuzzy Model
• IF Risk = Low THEN OT
  – Membership function: 0.1
• IF Risk NOT Low & Age = Middle THEN Late
  – Risk NOT Low: MAX(0.8, 0.3) = 0.8
  – Age Middle: 0.9
  – Membership function: mean = 0.85
Fuzzy Model cont.
• IF Risk NOT Low & Age NOT Middle & Income = High THEN Late
  – Risk NOT Low: MAX(0.8, 0.3) = 0.8
  – Age NOT Middle: MAX(0.3, 0.2) = 0.3
  – Income High: 0.3
  – Membership function: mean = 0.467
Fuzzy Model cont.
• IF Risk NOT Low & Age NOT Middle & Income NOT High THEN OT
  – Risk NOT Low: MAX(0.8, 0.3) = 0.8
  – Age NOT Middle: MAX(0.3, 0.2) = 0.3
  – Income NOT High: MAX(0.0, 0.8) = 0.8
  – Membership function: mean = 0.633
Fuzzy Model cont.
• The largest membership value is 0.85, for Rule 2
• Conclusion: Late
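The per-rule membership arithmetic can be reproduced in a short sketch; the helper names `not_value` (the slide's MAX over the remaining alternatives for NOT) and `mean` (the slide's aggregation) are choices of this sketch:

```python
# Fuzzy membership values from the example
age    = {"young": 0.3, "middle": 0.9, "old": 0.2}
income = {"low": 0.0, "average": 0.8, "high": 0.3}
risk   = {"low": 0.1, "average": 0.8, "high": 0.3}

def not_value(memberships, excluded):
    """'NOT x' read as the largest membership among the other alternatives."""
    return max(v for k, v in memberships.items() if k != excluded)

def mean(*vals):
    return sum(vals) / len(vals)

# Antecedent strength of each induced rule for the sample case
strengths = {
    "Rule 1": risk["low"],
    "Rule 2": mean(not_value(risk, "low"), age["middle"]),
    "Rule 3": mean(not_value(risk, "low"), not_value(age, "middle"),
                   income["high"]),
    "Rule 4": mean(not_value(risk, "low"), not_value(age, "middle"),
                   not_value(income, "high")),
}
for rule, s in strengths.items():
    print(rule, round(s, 3))
```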
Applications
• Inventory Prediction
• Clinical Databases
• Software Development Quality
Inventory Prediction
• Groceries
  – Maybe over 100,000 SKUs
  – Barcode data input
• Data mining to discover patterns
  – Random sample of over 1.6 million records
  – 30 months
  – 95 outlets
  – Test sample of 400,000 records
• Rule induction more workable than regression
  – 28,000 rules
  – Very accurate – up to 27% improvement
Clinical Database
• Headache
  – Over 60 possible causes
• Exclusive reasoning uses negative rules
  – Used when a symptom is absent
• Inclusive reasoning uses positive rules
• Probabilistic rule induction expert system
  – Headache: training sample of over 50,000 cases, 45 classes, 147 attributes
  – Meningitis: 1,200 samples on 41 attributes, 4 outputs
Clinical Database cont.
• Used AQ15, C4.5
  – Average accuracy 82%
• Expert system
  – Average accuracy 92%
• Rough set rule system
  – Average accuracy 70%
• Using both positive & negative rules from rough sets
  – Average accuracy over 90%
Software Development Quality
• Telecommunications company
• Goal: find patterns in modules under development that are likely to contain faults discovered by customers
  – Typical module: several million lines of code
  – Probability of fault averaged 0.074
• Apply greater effort to those modules
  – Specification, testing, inspection
Software Quality
• Preprocessed data
• Reduced data
• Used CART (Classification & Regression Trees)
  – Could specify prior probabilities
• First model: 9 rules, 6 variables
  – Better at cross-validation
  – But variable values not available until late
• Second model: 4 rules, 2 variables
  – About the same accuracy, data available earlier
Decision Trees
• Very effective & useful
• Automatic machine learning
  – Thus unbiased (but omits judgment)
• Can handle very large data sets
  – Not much affected by missing data
• Lots of software available