Information Theory, Classification & Decision Trees
Ling 572: Advanced Statistical Methods in NLP
January 5, 2012
Information Theory
Entropy
An information-theoretic measure
Measures the information in a model
Conceptually, a lower bound on the number of bits needed to encode outcomes
Entropy H(X), where X is a random variable and p its probability function:
H(X) = -Σ_{x∈X} p(x) log2 p(x)
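The definition can be checked numerically; a minimal sketch (not from the slides; the example distributions are made up):

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin has maximal entropy for two outcomes: 1 bit.
print(entropy([0.5, 0.5]))   # 1.0
# A certain outcome carries no information: 0 bits.
print(entropy([1.0]))        # 0.0
```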
Cross-Entropy
Comparing models
The actual distribution p is unknown
Use a simplified model m to estimate it
A closer match will have lower cross-entropy
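A small sketch of the comparison (toy distributions of my choosing; `cross_entropy` is my helper name):

```python
import math

def cross_entropy(p, m):
    """H(p, m) = -sum_x p(x) log2 m(x): bits needed to code draws from p using model m."""
    return -sum(pi * math.log2(mi) for pi, mi in zip(p, m) if pi > 0)

p = [0.7, 0.2, 0.1]            # the (in practice unknown) true distribution
good = [0.6, 0.3, 0.1]         # a closer model
bad = [1/3, 1/3, 1/3]          # a uniform model
# The closer model has lower cross-entropy; both are >= H(p) = cross_entropy(p, p).
print(cross_entropy(p, good) < cross_entropy(p, bad))  # True
```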
Relative Entropy
Commonly known as Kullback-Leibler (KL) divergence
Expresses the difference between probability distributions
Not a proper distance metric: it is asymmetric, KL(p||q) != KL(q||p)
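The asymmetry can be seen numerically; a minimal sketch with made-up distributions:

```python
import math

def kl(p, q):
    """KL(p||q) = sum_x p(x) log2 (p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl(p, q), kl(q, p))   # the two directions differ: KL is asymmetric
print(kl(p, p))             # 0.0: zero when the distributions match
```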
Joint & Conditional Entropy
Joint entropy: H(X,Y) = -Σ_x Σ_y p(x,y) log2 p(x,y)
Conditional entropy: H(Y|X) = -Σ_x Σ_y p(x,y) log2 p(y|x) = H(X,Y) - H(X)
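The chain rule H(Y|X) = H(X,Y) - H(X) ties the two quantities together; a small sketch with a made-up joint table:

```python
import math

def H(probs):
    """Entropy of a distribution given as an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy joint distribution p(x, y) over X in {0, 1}, Y in {0, 1} (made-up numbers).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

H_XY = H(joint.values())   # joint entropy H(X,Y)
px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
# Conditional entropy via the chain rule: H(Y|X) = H(X,Y) - H(X)
H_Y_given_X = H_XY - H(px.values())
print(round(H_XY, 3), round(H_Y_given_X, 3))
```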
Perplexity and Entropy
Consider the perplexity equation:
PP(W) = P(W)^(-1/N) = 2^(H(L,P))
where H(L,P) is the entropy of the language L under model P
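A quick check of the identity, with toy numbers of my choosing:

```python
import math

# Perplexity is 2 raised to the per-word entropy: PP(W) = P(W)^(-1/N) = 2^H.
# Toy example (made up): a model assigns each of N words probability 1/4.
N = 10
P_W = (1 / 4) ** N
pp = P_W ** (-1 / N)
H = -math.log2(P_W) / N   # per-word cross-entropy in bits
print(pp, 2 ** H)         # both equal 4.0: perplexity equals 2^H
```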
Mutual Information
Measure of the information in common between two distributions:
I(X;Y) = Σ_x Σ_y p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]
Symmetric: I(X;Y) = I(Y;X)
I(X;Y) = KL(p(x,y) || p(x)p(y))
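A minimal sketch of the definition (toy joint distribution, made up; `mutual_information` is my helper name):

```python
import math

def mutual_information(joint, px, py):
    """I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]
    = KL( p(x,y) || p(x)p(y) )."""
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Toy joint (made up): X and Y are perfectly correlated binary variables.
joint = {(0, 0): 0.5, (1, 1): 0.5}
px = {0: 0.5, 1: 0.5}
py = {0: 0.5, 1: 0.5}
print(mutual_information(joint, px, py))   # 1.0 bit: knowing X fully determines Y
```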
Decision Trees
Classification Task
Task: C is a finite set of labels (aka categories, classes); given x, determine its category y in C
Instance: (x, y), where x is the thing to be labeled/classified and y is its label/class
Data: a set of instances; labeled data: y is known; unlabeled data: y is unknown
Training data, test data
Two Stages
Training: Learner: training data → classifier
Classifier: f(x) = y, where x is the input and y is in C
Testing: Decoder: test data + classifier → classification output
Also: preprocessing, postprocessing, evaluation
Roadmap
Decision Trees:
Sunburn example
Decision tree basics
From trees to rules
Key questions: training procedure? decoding procedure? overfitting? different feature types?
Analysis: pros & cons
Sunburn Example

| Name  | Hair   | Height  | Weight  | Lotion | Result |
|-------|--------|---------|---------|--------|--------|
| Sarah | Blonde | Average | Light   | No     | Burn   |
| Dana  | Blonde | Tall    | Average | Yes    | None   |
| Alex  | Brown  | Short   | Average | Yes    | None   |
| Annie | Blonde | Short   | Average | No     | Burn   |
| Emily | Red    | Average | Heavy   | No     | Burn   |
| Pete  | Brown  | Tall    | Heavy   | No     | None   |
| John  | Brown  | Average | Heavy   | No     | None   |
| Katie | Blonde | Short   | Light   | Yes    | None   |
Learning about Sunburn
Goal: train on labeled examples; predict Burn/None for new instances
Solution?? Exact match: same features, same output
Problem: 3·3·3·2 = 54 feature combinations; could be much worse
Or: same label as the 'most similar' example
Problem: what counts as close? Which features matter?
Many examples match on two features but differ on the result
Learning about Sunburn
Better solution: a decision tree
Training: divide examples into subsets based on feature tests; the sets of samples at the leaves define the classification
Prediction: route a NEW instance through the tree to a leaf based on the feature tests; assign the same value as the samples at that leaf
Sunburn Decision Tree

Hair Color?
├─ Blonde → Lotion Used?
│   ├─ No  → Sarah: Burn, Annie: Burn
│   └─ Yes → Dana: None, Katie: None
├─ Red    → Emily: Burn
└─ Brown  → Alex: None, John: None, Pete: None
Decision Tree Structure
Internal nodes: each node is a test
Generally tests a single feature, e.g. Hair == ?
Theoretically could test multiple features
Branches: each branch corresponds to an outcome of the test
E.g. Hair == Red; Hair != Blonde
Leaves: each leaf corresponds to a decision
Discrete class: classification/decision tree; real value: regression tree
From Trees to Rules
Tree: each branch from root to leaf = a sequence of tests => a classification
Tests = if-antecedents; leaf labels = consequents
All decision trees -> rules; not all rule sets can be expressed as trees
From ID Trees to Rules

Hair Color?
├─ Blonde → Lotion Used?
│   ├─ No  → Sarah: Burn, Annie: Burn
│   └─ Yes → Dana: None, Katie: None
├─ Red    → Emily: Burn
└─ Brown  → Alex: None, John: None, Pete: None

(if (equal haircolor blonde) (equal lotionused yes) (then None))
(if (equal haircolor blonde) (equal lotionused no) (then Burn))
(if (equal haircolor red) (then Burn))
(if (equal haircolor brown) (then None))
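Equivalently, the rule set reads directly as a conditional; a sketch in Python (function name mine):

```python
def predict_sunburn(hair_color, lotion_used):
    """The sunburn decision tree above, written as its equivalent rule set."""
    if hair_color == "blonde" and lotion_used == "yes":
        return "None"
    if hair_color == "blonde" and lotion_used == "no":
        return "Burn"
    if hair_color == "red":
        return "Burn"
    if hair_color == "brown":
        return "None"

print(predict_sunburn("blonde", "no"))   # Burn
```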
Which Tree?
Many possible decision trees for any problem
How can we select among them? What would be the 'best' tree?
Smallest? Shallowest? Most accurate on unseen data?
Simplicity
Occam's Razor: the simplest explanation that covers the data is best
Occam's Razor for decision trees: the smallest tree consistent with the samples will be the best predictor for new data
Problem: finding all trees and finding the smallest is expensive!
Solution: greedily build a small tree
Building Trees: Basic Algorithm
Goal: build a small tree such that all samples at the leaves have the same class
Greedy solution: at each node, pick a test using the 'best' feature
Split into subsets based on the outcomes of the feature test
Repeat the process until the stopping criterion is met, i.e. until all leaves have the same class
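The greedy loop can be sketched as a short recursive implementation (a minimal version of the generic algorithm, not the course's reference code; it scores each feature by the average entropy of the resulting split):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def build(rows, labels, features):
    """rows: list of dicts feature -> value. Returns a label or a nested
    (feature, {value: subtree}) tree."""
    if len(set(labels)) <= 1 or not features:   # stop: leaf is homogeneous
        return Counter(labels).most_common(1)[0][0]
    def avg_entropy(f):                         # average entropy after splitting on f
        total = 0.0
        for v in {r[f] for r in rows}:
            sub = [l for r, l in zip(rows, labels) if r[f] == v]
            total += len(sub) / len(labels) * entropy(sub)
        return total
    best = min(features, key=avg_entropy)       # greedy: pick the 'best' feature
    children = {}
    for v in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        children[v] = build([rows[i] for i in idx], [labels[i] for i in idx],
                            [f for f in features if f != best])
    return (best, children)

# The sunburn data from the example above (Name column dropped).
DATA = [
    ("Blonde", "Average", "Light",   "No",  "Burn"),   # Sarah
    ("Blonde", "Tall",    "Average", "Yes", "None"),   # Dana
    ("Brown",  "Short",   "Average", "Yes", "None"),   # Alex
    ("Blonde", "Short",   "Average", "No",  "Burn"),   # Annie
    ("Red",    "Average", "Heavy",   "No",  "Burn"),   # Emily
    ("Brown",  "Tall",    "Heavy",   "No",  "None"),   # Pete
    ("Brown",  "Average", "Heavy",   "No",  "None"),   # John
    ("Blonde", "Short",   "Light",   "Yes", "None"),   # Katie
]
rows = [dict(zip(("Hair", "Height", "Weight", "Lotion"), d[:4])) for d in DATA]
tree = build(rows, [d[4] for d in DATA], ["Hair", "Height", "Weight", "Lotion"])
print(tree[0])   # Hair is chosen first, as in the worked example
```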
Key Questions
Splitting: how do we select the 'best' feature?
Stopping: when do we stop splitting, to avoid overfitting?
Features: how do we split different types of features? Binary? Discrete? Continuous?
Building Decision Trees: I
Goal: build a small tree such that all samples at the leaves have the same class
Greedy solution: at each node, pick the test such that the branches are closest to having the same class
Split into subsets where most instances are in a uniform class
Picking a Test

Hair Color: Blonde → Sarah:B, Dana:N, Annie:B, Katie:N; Red → Emily:B; Brown → Alex:N, Pete:N, John:N
Height: Short → Alex:N, Annie:B, Katie:N; Average → Sarah:B, Emily:B, John:N; Tall → Dana:N, Pete:N
Weight: Light → Sarah:B, Katie:N; Average → Dana:N, Alex:N, Annie:B; Heavy → Emily:B, Pete:N, John:N
Lotion: No → Sarah:B, Annie:B, Emily:B, Pete:N, John:N; Yes → Dana:N, Alex:N, Katie:N
Picking a Test (within the Blonde branch)

Height: Short → Annie:B, Katie:N; Average → Sarah:B; Tall → Dana:N
Weight: Light → Sarah:B, Katie:N; Average → Dana:N, Annie:B
Lotion: No → Sarah:B, Annie:B; Yes → Dana:N, Katie:N
Measuring Disorder
Problem: in general, tests on large databases don't yield homogeneous subsets
Solution: a general information-theoretic measure of disorder
Desired features:
Homogeneous set: least disorder = 0
Even split: most disorder = 1
Measuring Entropy
If we split m objects into 2 bins of size m1 and m2, what is the entropy?

H = -(m1/m) log2(m1/m) - (m2/m) log2(m2/m) = -Σ_i (m_i/m) log2(m_i/m)

(Figure: disorder plotted against m1/m — 0 at m1/m = 0, peaking at 1 when m1/m = 0.5, and back to 0 at m1/m = 1.)
Measuring Disorder: Entropy
Let p_i = m_i / m, the probability of being in bin i, with Σ_i p_i = 1 and 0 <= p_i <= 1
Entropy (disorder) of a split: H = -Σ_i p_i log2 p_i
Assume 0 log2 0 = 0

| p1 | p2 | Entropy                                    |
|----|----|--------------------------------------------|
| ½  | ½  | -½ log2 ½ - ½ log2 ½ = ½ + ½ = 1           |
| ¼  | ¾  | -¼ log2 ¼ - ¾ log2 ¾ = 0.5 + 0.311 = 0.811 |
| 1  | 0  | -1 log2 1 - 0 log2 0 = 0 - 0 = 0           |
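The worked values can be checked directly (a small sketch; `split_entropy` is my helper name):

```python
import math

def split_entropy(p):
    """Entropy of a two-way split with bin probabilities p and 1 - p."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)   # 0 log 0 := 0

# Reproduces the table: an even split is maximally disordered, a pure bin has none.
print(round(split_entropy(0.5), 3))    # 1.0
print(round(split_entropy(0.25), 3))   # 0.811
print(round(split_entropy(1.0), 3))    # 0.0
```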
Information Gain
InfoGain(Y|X): how many bits can we save if we know X?
InfoGain(Y|X) = H(Y) - H(Y|X)
(equivalently written InfoGain(Y,X))
Information Gain
InfoGain(S,A): the expected reduction in entropy of S due to splitting on A
Select the A with the maximum InfoGain, i.e. resulting in the lowest average entropy
Computing Average Entropy
Average entropy of a split: AvgEnt = Σ_i (|S_i| / |S|) H(S_i)
|S_i| / |S|: the fraction of the |S| samples down branch i
H(S_i): the disorder of the class distribution on branch i
(Figure: |S| instances split into Branch 1 and Branch 2, each with counts of class-a and class-b samples.)
Entropy in Sunburn Example
S = [3B, 5N], so H(S) = -(3/8) log2(3/8) - (5/8) log2(5/8) = 0.954
InfoGain(Hair color) = 0.954 - (4/8·(-2/4 log2(2/4) - 2/4 log2(2/4)) + 1/8·0 + 3/8·0) = 0.954 - 0.5 = 0.454
InfoGain(Height) = 0.954 - 0.69 = 0.264
InfoGain(Weight) = 0.954 - 0.94 = 0.014
InfoGain(Lotion) = 0.954 - 0.61 = 0.344
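These gains can be verified from the data table (a sketch; note the slide's 0.344 for Lotion reflects rounding the remainder to 0.61 — unrounded the gain is ~0.348):

```python
import math
from collections import Counter

def H(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """InfoGain = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    n = len(labels)
    rem = sum(values.count(v) / n * H([l for x, l in zip(values, labels) if x == v])
              for v in set(values))
    return H(labels) - rem

# Sarah, Dana, Alex, Annie, Emily, Pete, John, Katie
labels = ["B", "N", "N", "B", "B", "N", "N", "N"]
hair   = ["Blonde", "Blonde", "Brown", "Blonde", "Red", "Brown", "Brown", "Blonde"]
lotion = ["No", "Yes", "Yes", "No", "No", "No", "No", "Yes"]
print(round(info_gain(hair, labels), 3))     # ~0.454, as on the slide
print(round(info_gain(lotion, labels), 3))   # ~0.348
```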
Entropy in Sunburn Example (Blonde branch)
S = [2B, 2N], so H(S) = 1
InfoGain(Height) = 1 - (2/4·(-1/2 log2(1/2) - 1/2 log2(1/2)) + 1/4·0 + 1/4·0) = 1 - 0.5 = 0.5
InfoGain(Weight) = 1 - (2/4·(-1/2 log2(1/2) - 1/2 log2(1/2)) + 2/4·(-1/2 log2(1/2) - 1/2 log2(1/2))) = 1 - 1 = 0
InfoGain(Lotion) = 1 - 0 = 1, so Lotion is selected
Building Decision Trees with Information Gain
Until there are no inhomogeneous leaves:
Select an inhomogeneous leaf node
Replace that leaf node by a test node creating the subsets that yield the highest information gain
Effectively creates a set of rectangular regions, repeatedly drawing decision boundaries along different axes
Alternate Measures
Issue with Information Gain: it favors features with more values
Option: Gain Ratio
GainRatio(S,A) = InfoGain(S,A) / SplitInfo(S,A), where SplitInfo(S,A) = -Σ_a (|S_a|/|S|) log2(|S_a|/|S|)
S_a: the elements of S with value A = a
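Gain Ratio divides information gain by the entropy of the split itself ("split information"), which is large for many-valued features; a sketch under that standard formulation (helper names mine, toy data made up):

```python
import math
from collections import Counter

def H(items):
    n = len(items)
    return -sum(c / n * math.log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    """GainRatio = InfoGain / SplitInfo; SplitInfo = entropy over branch sizes."""
    n = len(labels)
    remainder = sum(values.count(v) / n * H([l for x, l in zip(values, labels) if x == v])
                    for v in set(values))
    info_gain = H(labels) - remainder
    split_info = H(values)                 # large for many-valued features
    return info_gain / split_info if split_info else 0.0

# A unique-ID feature has maximal gain but also maximal split info,
# so gain ratio tempers the bias toward it (toy data).
labels = ["B", "B", "N", "N"]
ids    = ["a", "b", "c", "d"]              # one value per instance
coarse = ["x", "x", "y", "y"]              # two values, equally predictive
print(gain_ratio(ids, labels), gain_ratio(coarse, labels))
```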
Overfitting
Overfitting: the model fits the training data TOO well, fitting noise and irrelevant details
Why is this bad? It harms generalization: the model fits the training data well but fits new data badly
For a model m, consider training_error(m) and D_error(m), where D is all data
If m overfits, then for some other model m', training_error(m) < training_error(m') but D_error(m) > D_error(m')
Avoiding Overfitting
Strategies to avoid overfitting:
Early stopping: stop when InfoGain < threshold, when the number of instances < threshold, or when tree depth > threshold
Post-pruning: grow the full tree, then remove branches
Which is better? Unclear; both are used. For some applications, post-pruning is better
Post-Pruning
Divide the data into:
Training set: used to build the original tree
Validation set: used to perform pruning
Build the decision tree from the training data
Until pruning no longer improves validation-set performance:
Compute the performance of pruning each node (and its children)
Greedily remove nodes whose removal does not reduce validation-set performance
Yields a smaller tree with the best performance
Performance Measures
Compute accuracy on: a validation set, or via k-fold cross-validation
Weighted classification error cost: weight some types of errors more heavily
Minimum description length: favor good accuracy on compact models; MDL = error(tree) + model_size(tree)
Rule Post-Pruning
Convert the tree to rules
Prune the rules independently
Sort the final rule set
Probably the most widely used method (in toolkits)
Modeling Features
Different types of features need different tests
Binary: branch on true/false
Discrete: one branch for each discrete value
Continuous? Need to discretize: enumerating all values is not possible or desirable
Pick a value x; branches: value < x and value >= x
How can we pick the split points?
Picking Splits
Need useful, sufficient split points
What's a good strategy?
Approach:
Sort all values for the feature in the training data
Identify adjacent instances of different classes
Candidate split points fall between those instances
Select the candidate with the highest information gain
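The approach above can be sketched as (function name mine; the values and labels are made up):

```python
def candidate_splits(values, labels):
    """Midpoints between adjacent (sorted) values whose instances differ in class."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb and a != b]

# Made-up continuous feature: the class changes between 2.0/3.5 and 6.0/7.5,
# so those boundaries yield the only candidate split points.
vals = [1.0, 2.0, 3.5, 6.0, 7.5]
labs = ["N", "N", "B", "B", "N"]
print(candidate_splits(vals, labs))   # [2.75, 6.75]
```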
Features in Decision Trees: Pros
Feature selection: tests the features that yield low disorder
E.g. selects the features that are important! Ignores irrelevant features
Feature type handling: discrete type: 1 branch per value; continuous type: branch on >= value
Absent features: distribute uniformly
Features in Decision Trees: Cons
Features are assumed independent
If you want a group effect, you must model it explicitly, e.g. by making a new feature AorB
Feature tests are conjunctive
Decision Trees
Train: build the tree by forming subsets of least disorder
Predict: traverse the tree based on feature tests; assign the leaf node's sample label
Pros: robust to irrelevant features and some noise; fast prediction; perspicuous rule reading
Cons: poor at feature combination and dependency; building the optimal tree is intractable