DECISION TREES
Decision trees
One possible representation for hypotheses
Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
Which is a better choice? Patrons
Using information theory
Implement Choose-Attribute in the DTL algorithm based on information content, measured by entropy
Entropy is a measure of the uncertainty of a random variable: more uncertainty leads to higher entropy, and more knowledge leads to lower entropy.
Entropy
For a training set containing p positive examples and n negative examples:
$$I\!\left(\frac{p}{p+n},\ \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} \;-\; \frac{n}{p+n}\log_2\frac{n}{p+n}$$
Entropy Examples
Fair coin flip: $I(\tfrac{1}{2}, \tfrac{1}{2}) = 1$ bit
Biased coin flip: $I(p, 1-p) < 1$ bit for any $p \neq \tfrac{1}{2}$ (e.g., $I(0.99, 0.01) \approx 0.08$ bits)
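A minimal Python sketch of the entropy formula above (the function name and the coin examples are illustrative, not from the slides):

```python
import math

def entropy(p, n):
    """Entropy, in bits, of a set with p positive and n negative examples."""
    h = 0.0
    for count in (p, n):
        q = count / (p + n)
        if q > 0:                # 0 * log2(0) is taken to be 0
            h -= q * math.log2(q)
    return h

print(entropy(1, 1))    # fair coin: 1.0 bit
print(entropy(99, 1))   # heavily biased coin: ~0.08 bits
```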
Information Gain
Measures the reduction in entropy achieved by the split.
Choose the split that achieves the greatest reduction (maximizes information gain).
Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
$$\mathrm{GAIN}_{\mathrm{split}} = \mathrm{Entropy}(p) \;-\; \sum_{i=1}^{k}\frac{n_i}{n}\,\mathrm{Entropy}(i)$$
The parent node $p$ is split into $k$ partitions; $n_i$ is the number of records in partition $i$.
Information Gain Example
Consider the attributes Patrons and Type:
Patrons has the highest Information Gain of all attributes and so is chosen by the DTL algorithm as the root
$$\mathrm{Gain}(\mathit{Patrons}) = 1 - \left[\tfrac{2}{12}\,I(0,1) + \tfrac{4}{12}\,I(1,0) + \tfrac{6}{12}\,I\!\left(\tfrac{2}{6},\tfrac{4}{6}\right)\right] \approx 0.541 \text{ bits}$$
$$\mathrm{Gain}(\mathit{Type}) = 1 - \left[\tfrac{2}{12}\,I\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{2}{12}\,I\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{4}{12}\,I\!\left(\tfrac{2}{4},\tfrac{2}{4}\right) + \tfrac{4}{12}\,I\!\left(\tfrac{2}{4},\tfrac{2}{4}\right)\right] = 0 \text{ bits}$$
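A quick numeric check of these two computations, as a hedged Python sketch (the branch counts come from the 12-example restaurant data set; variable names are illustrative):

```python
import math

def entropy(p, n):
    return -sum(q * math.log2(q) for q in (p/(p+n), n/(p+n)) if q > 0)

# (positive, negative) counts per branch; the parent has p = n = 6,
# so its entropy is 1 bit.
patrons = [(0, 2), (4, 0), (2, 4)]          # None, Some, Full
rtype   = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

def gain(branches, total=12):
    remainder = sum((p + n) / total * entropy(p, n) for p, n in branches)
    return entropy(6, 6) - remainder

print(gain(patrons))  # ~0.541 bits
print(gain(rtype))    # 0.0 bits
```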
Learned Restaurant Tree
Decision tree learned from the 12 examples:
Substantially simpler than the full tree: Raining and Reservation were not necessary to classify all the data.
Stopping Criteria
Stop expanding a node when all the records belong to the same class
Stop expanding a node when all the records have similar attribute values
Overfitting
Overfitting results in decision trees that are more complex than necessary
Training error does not provide a good estimate of how well the tree will perform on previously unseen records (need a test set)
How to Address Overfitting 1… Pruning
Grow the decision tree to its entirety, then trim its nodes in a bottom-up fashion. If the generalization error is reduced after trimming, replace the sub-tree by a leaf node ($\chi^2$ test; see page 706). The class label of the leaf node is determined from the majority class of instances in the sub-tree.
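A minimal sketch of this bottom-up pruning under assumed data structures: a dict-based tree with a stored majority class, and a held-out validation set standing in for the generalization-error estimate (the $\chi^2$ test mentioned above is an alternative criterion):

```python
# Assumed tree representation for this sketch: an internal node is
# {"attr": name, "branches": {value: subtree}, "majority": label};
# a leaf is just a class label. val_set is a list of (example, label) pairs.

def classify(tree, example):
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["attr"]]]
    return tree

def errors(tree, val_set):
    return sum(classify(tree, x) != y for x, y in val_set)

def prune(tree, val_set):
    if not isinstance(tree, dict):
        return tree
    # Bottom-up: prune the children first.
    for v, sub in tree["branches"].items():
        subset = [(x, y) for x, y in val_set if x[tree["attr"]] == v]
        tree["branches"][v] = prune(sub, subset)
    # Replace the sub-tree by a leaf labelled with its majority class
    # if that does not increase the validation error.
    leaf = tree["majority"]
    if errors(leaf, val_set) <= errors(tree, val_set):
        return leaf
    return tree
```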
How to Address Overfitting 2… Early Stopping Rule
Stop the algorithm before it grows a fully-grown tree. Stopping conditions:
Stop if the number of instances is less than some user-specified threshold.
Stop if the class distribution of the instances is independent of the available features (e.g., using a $\chi^2$ test).
Stop if expanding the current node does not improve impurity measures (e.g., information gain).
How to Address Overfitting…
Is the early stopping rule strictly better than pruning (i.e., generating the full tree and then cutting it back)?
Remaining Challenges…
Continuous values: need to be split into discrete categories. Sort all values, then consider split points between two examples in sorted order that have different classifications.
Missing values: affect how an example is classified, the information gain calculations, and the test set error rate. Pretend that the example has all possible values for the missing attribute, weighted by their frequency among all the examples in the current node.
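A hedged sketch of the candidate-split selection just described (function and variable names are assumptions for illustration):

```python
def candidate_splits(values, labels):
    """Midpoints between consecutive sorted values whose classes differ."""
    pairs = sorted(zip(values, labels))
    splits = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            splits.append((v1 + v2) / 2)
    return splits

print(candidate_splits([85, 80, 72, 65, 70], ["no", "no", "yes", "yes", "no"]))
# [67.5, 71.0, 76.0]
```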
Summary
Advantages of decision trees: inexpensive to construct; extremely fast at classifying unknown records; easy to interpret for small-sized trees; accuracy is comparable to other classification techniques for many simple data sets.
Learning performance = prediction accuracy measured on test set
K-NEAREST NEIGHBORS
K-Nearest Neighbors
What value do we assign to the green sample?
K-Nearest Neighbors
1-NN: for a given query point $q$, assign the class of the nearest neighbour.
k-NN: compute the $k$ nearest neighbours and assign the class by majority vote.
(Figure: the query point classified with k = 1 and with k = 3.)
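A minimal k-NN classifier along these lines (a sketch; the data set and function names are assumptions):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (point, label) pairs; points are tuples of floats."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]   # majority vote among the k nearest

train = [((0, 0), "o"), ((0, 1), "o"), ((1, 0), "o"),
         ((5, 5), "+"), ((5, 6), "+")]
print(knn_classify(train, (4.5, 5.2), k=3))   # "+"
```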
Decision Regions for 1-NN
Effect of k
(Figure: decision regions for k = 1 and k = 5.)
K-Nearest Neighbors
Euclidean distance:
$$D(x, y) = \sqrt{\sum_{i=1}^{d}(x_i - y_i)^2}$$
Weighted Euclidean distance:
$$D(x, y) = \sqrt{\sum_{i=1}^{d} w_i\,(x_i - y_i)^2}$$
where $d$ is the dimensionality of the data.
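The two distances as a Python sketch (w is a per-dimension weight vector; names are illustrative):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def weighted_euclidean(x, y, w):
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

# Setting an irrelevant feature's weight to 0 removes it from the metric.
print(weighted_euclidean((1, 100), (2, -50), w=(1, 0)))   # 1.0
```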
Weighting the Distance to Remove Irrelevant Features
(Figure sequence: + and o training examples with a query point "?", shown as the weight on the irrelevant dimension is progressively reduced.)
Nearest Neighbors Search
Let $P$ be a set of $n$ training points. Given a query point $q$, find the nearest neighbour of $q$ in $P$.
Naïve approach: compute the distance from the query point to every other point in the database, keeping track of the "best so far". Running time is O(n).
Data structure approach: construct a data structure which makes this search more efficient.
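The naïve O(n) scan as a Python sketch:

```python
import math

def nearest_neighbor(P, q):
    """Check every training point, keeping the best seen so far."""
    best, best_d = None, float("inf")
    for p in P:
        d = math.dist(p, q)
        if d < best_d:
            best, best_d = p, d
    return best, best_d

print(nearest_neighbor([(0, 0), (3, 4), (1, 1)], (2, 2)))   # ((1, 1), 1.414...)
```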
Quadtree
A quadtree is a tree data structure in which each internal node has up to four children.
Every node in the quadtree corresponds to a square.
If a node $v$ has children, then their corresponding squares are the four quadrants of the square of $v$.
The leaves of a quadtree form a quadtree subdivision of the square of the root.
The children of a node are labelled NE, NW, SW, and SE to indicate to which quadrant they correspond.
Quadtree Construction
Input: point set P
while some cell C contains more than 1 point do
Split cell C
end
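A minimal point-quadtree sketch following this loop (class and field names are assumptions; it assumes distinct points, otherwise the splitting would not terminate):

```python
class QuadNode:
    def __init__(self, x0, y0, x1, y1, points):
        self.bounds = (x0, y0, x1, y1)   # half-open square [x0, x1) x [y0, y1)
        self.points = points
        self.children = []
        if len(points) > 1:              # split cell C
            xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
            for qx0, qy0, qx1, qy1 in [(xm, ym, x1, y1), (x0, ym, xm, y1),  # NE, NW
                                       (x0, y0, xm, ym), (xm, y0, x1, ym)]: # SW, SE
                inside = [(x, y) for x, y in points
                          if qx0 <= x < qx1 and qy0 <= y < qy1]
                self.children.append(QuadNode(qx0, qy0, qx1, qy1, inside))

root = QuadNode(0, 0, 100, 100, [(10, 10), (80, 80), (85, 90)])
```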
(Figure: worked example of quadtree construction on points a–l, showing the successive cell splits and the resulting tree.)
Nearest Neighbor Search
Quadtree Query
(Figure: a query descends by comparing against the split point (X1, Y1); each child covers one quadrant, labelled by the predicates P < X1 / P ≥ X1 and P < Y1 / P ≥ Y1.)
Quadtree Query
In many cases this works: descending to the query's quadrant finds the nearest neighbour.
(Figure: query and nearest neighbour in the same cell.)
Quadtree – Pitfall 1
In some cases it doesn't: there could be points in adjacent buckets that are closer.
(Figure: the nearest neighbour lies in a neighbouring quadrant, not in the query's cell.)
Quadtree – Pitfall 2
Could result in query time exponential in the number of dimensions.
Quadtree summary: simple data structure; versatile, easy to implement; often space and time inefficient.
kd-trees (k-dimensional trees)
Main ideas: one-dimensional splits; instead of splitting in the middle, choose the split "carefully" (many variations); nearest neighbour queries are the same as for quadtrees.
2-dimensional kd-trees
Algorithm: choose the x or y coordinate (alternate between them); choose the median of that coordinate, which defines a horizontal or vertical line; recurse on both sides until there is only one point left, which is stored as a leaf.
We get a binary tree: size O(n), construction time O(n log n), depth O(log n).
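A minimal 2-d kd-tree construction sketch following this algorithm (the dict-based representation is an assumption for illustration):

```python
def build_kdtree(points, depth=0):
    if len(points) <= 1:
        return {"leaf": points}          # one point stored per leaf
    axis = depth % 2                     # alternate between x and y
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "axis": axis,
        "split": points[mid][axis],      # the median defines the line
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid:], depth + 1),
    }

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
```

Note that re-sorting at every level as done here costs O(n log² n); the O(n log n) construction bound requires presorting each coordinate once.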
Nearest Neighbor with KD Trees
We traverse the tree looking for the nearest neighbor of the query point.
Examine nearby points first: explore the branch of the tree that is closest to the query point first.
When we reach a leaf node, compute the distance to each point in the leaf.
Then we can backtrack and try the other branch at each node visited.
Each time a new closest point is found, we can update the distance bound.
Using the distance bound and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor.
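A hedged sketch of this query over the dict-based tree from the construction sketch above: descend to the nearer branch first, then backtrack, pruning any branch whose splitting line is farther away than the best distance found so far.

```python
import math

def kd_nearest(node, q, best=None, best_d=float("inf")):
    if "leaf" in node:
        for p in node["leaf"]:                    # scan the points in the leaf
            d = math.dist(p, q)
            if d < best_d:
                best, best_d = p, d
        return best, best_d
    axis, split = node["axis"], node["split"]
    near, far = ((node["left"], node["right"]) if q[axis] < split
                 else (node["right"], node["left"]))
    best, best_d = kd_nearest(near, q, best, best_d)   # nearby branch first
    if abs(q[axis] - split) < best_d:             # prune if the line is too far
        best, best_d = kd_nearest(far, q, best, best_d)
    return best, best_d

# Usage, with build_kdtree from the construction sketch above:
tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(kd_nearest(tree, (8, 5)))   # ((9, 6), 1.414...)
```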
Summary of K-Nearest Neighbor
Stores all training data in memory, a large space requirement.
Can improve query time by representing the data within a k-d tree.
k-d trees are only efficient when there are many more examples than dimensions, preferably at least $2^d$ examples for $d$ dimensions.