ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 1: REVISION LECTURE
Gabriela Ochoa, Nadarajen Veerapen, Fabio Daolio
EXAM
• Date: Thursday 15 December, 14:00 – 15:30 (Room: 2A13); a 1.5-hour exam
• Attempt BOTH questions:
  - Q1: Search (25 marks)
  - Q2: Machine Learning (25 marks)
• The distribution of marks among the parts of each question is indicated.
SOLVING PROBLEMS BY SEARCHING
• Problem-solving agents decide what to do by finding sequences of actions that lead to desirable states.
• What is a problem and what is a solution?
  - Problem: a goal and a set of means to achieve it
  - Solution: a sequence of actions to achieve that goal
• Given a precise definition of a problem, it is possible to construct a search process for finding solutions.
EXAMPLE: ROMANIA
[Figure: Google Map of Romania showing cities and road connections]
PROBLEM FORMULATION
More formally, a problem is defined by these main components:
1. Initial state where the agent starts: e.g., "at Arad"
2. Actions available to the agent
   - e.g., Arad → Zerind, Arad → Sibiu, … etc.
3. Goal test, which determines whether a given state is a goal state
   - explicit, e.g., x = "at Bucharest"
   - implicit, e.g., Checkmate(x)
4. Path cost: an (additive) function that assigns a numeric cost to each path and reflects the agent's performance measure
   - e.g., sum of distances, number of actions executed, etc.
   - c(x, a, y) is the step cost, assumed to be ≥ 0
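To make the four components concrete, here is a minimal Python sketch of the route-finding formulation; the ROADS dictionary is only an illustrative fragment of the Romania map, not the complete map from the slide.

```python
# Minimal sketch of the Romania problem formulation; the road-map fragment below
# is an illustrative subset, not the full map shown in the lecture.
ROADS = {
    "Arad":           {"Zerind": 75, "Sibiu": 140, "Timisoara": 118},
    "Zerind":         {"Arad": 75, "Oradea": 71},
    "Oradea":         {"Zerind": 71},
    "Timisoara":      {"Arad": 118},
    "Sibiu":          {"Arad": 140, "Fagaras": 99, "Rimnicu Vilcea": 80},
    "Fagaras":        {"Sibiu": 99, "Bucharest": 211},
    "Rimnicu Vilcea": {"Sibiu": 80, "Pitesti": 97},
    "Pitesti":        {"Rimnicu Vilcea": 97, "Bucharest": 101},
    "Bucharest":      {"Fagaras": 211, "Pitesti": 101},
}

initial_state = "Arad"                 # 1. initial state

def actions(state):                    # 2. actions available to the agent
    return list(ROADS[state])

def goal_test(state):                  # 3. explicit goal test
    return state == "Bucharest"

def step_cost(x, a, y):                # 4. step cost c(x, a, y) >= 0;
    return ROADS[x][y]                 #    the path cost is the sum of step costs
```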
SEARCH ALGORITHMS
• Uninformed search strategies can find solutions to problems by systematically generating new states and testing them against the goal (e.g., BFS and DFS).
• Informed search strategies use some problem-specific knowledge.
• The knowledge is given by an evaluation function that returns a number describing the desirability (or lack thereof) of expanding a node. Examples: best-first search, greedy search, A*.
[Search-tree figure: shaded nodes are expanded nodes; outlined nodes are generated but not expanded]
BREADTH-FIRST SEARCH
• Expand the shallowest unexpanded node.
• Implementation: the frontier (fringe) is a FIFO queue, i.e., new successors go at the end.
[Figure: successive snapshots of BFS expanding a small search tree level by level]
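As a rough illustration of the FIFO-queue implementation (not code from the lecture), a breadth-first search over a road map such as the ROADS dictionary sketched earlier could look like this:

```python
# Minimal BFS sketch; `roads` is an adjacency mapping like the ROADS dictionary
# sketched after the problem formulation (an assumption, not the lecture's code).
from collections import deque

def breadth_first_search(roads, start, goal):
    """Return a path with the fewest actions from start to goal, or None."""
    frontier = deque([[start]])       # FIFO queue of paths: new successors go at the end
    reached = {start}
    while frontier:
        path = frontier.popleft()     # expand the shallowest unexpanded node
        state = path[-1]
        if state == goal:
            return path
        for successor in roads[state]:
            if successor not in reached:
                reached.add(successor)
                frontier.append(path + [successor])
    return None

# breadth_first_search(ROADS, "Arad", "Bucharest")
# -> ['Arad', 'Sibiu', 'Fagaras', 'Bucharest']
```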
OPTIMISATION PROBLEMS ARE EVERYWHERE!
• Logistics, transportation, supply chain management
• Manufacturing, production lines
• Timetabling
• Cutting & packing
• Computer networks and telecommunications
• Software engineering (search-based software engineering, SBSE)
HILL-CLIMBING SEARCH
Like climbing a mountain in thick fog with amnesia.
• Best improvement (greedy hill-climbing, the discrete analogue of gradient ascent/descent): choose the maximally improving neighbour.
• First improvement: choose the first improving move found.
• Local optimum: no other solution in the neighbourhood has better fitness.
• Problem: depending on the initial state, hill-climbing can get stuck in local maxima.
ITERATED LOCAL SEARCH
Procedure Iterated Local Search (ILS)
  s = initialise()
  s = hill_climbing(s)
  while NOT termination_criterion:
    r = s                          // remember the current solution
    s = perturbation(s)
    s = hill_climbing(s)
    if fitness(s) < fitness(r):    // acceptance criterion: revert if the new local optimum is worse
      s = r

• Key idea: use two stages
  - Local search, for reaching local optima (intensification)
  - Perturbation stage, for escaping local optima (diversification)
• Acceptance criterion: controls the balance between diversification and intensification.
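The pseudocode above can be turned into a small runnable sketch; the toy One-Max problem, the bit-flip neighbourhood and the perturbation strength below are illustrative choices, not part of the lecture.

```python
# Toy sketch of hill-climbing and ILS on One-Max (maximise the number of 1-bits);
# the neighbourhood, perturbation strength and acceptance rule are illustrative.
import random

def fitness(s):
    return sum(s)                                   # One-Max fitness

def neighbours(s):
    for i in range(len(s)):                         # flip each bit in turn
        n = s[:]
        n[i] = 1 - n[i]
        yield n

def hill_climbing(s):
    """Best-improvement hill-climbing: stop at a local optimum."""
    while True:
        best = max(neighbours(s), key=fitness)      # maximally improving neighbour
        if fitness(best) <= fitness(s):
            return s                                # no improvement: local optimum
        s = best

def perturbation(s, strength=3):
    """Flip a few random bits to escape the current local optimum."""
    s = s[:]
    for i in random.sample(range(len(s)), strength):
        s[i] = 1 - s[i]
    return s

def iterated_local_search(n_bits=20, iterations=50):
    s = hill_climbing([random.randint(0, 1) for _ in range(n_bits)])
    for _ in range(iterations):
        candidate = hill_climbing(perturbation(s))
        if fitness(candidate) > fitness(s):         # acceptance criterion: keep the better one
            s = candidate
    return s

print(fitness(iterated_local_search()))             # 20 on this easy problem
```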
Artificial Intelligence (CSC9YE) Revision: Machine Learning - Decision Trees
Fabio Daolio, [email protected]
Definition, from (T. Mitchell, 1997):
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Learning Paradigms: what kind of experience, what kind of tasks?
• Supervised learning: the program is presented with a series of input-output examples and learns a function that maps inputs to outputs.
  - regression
  - classification
• Unsupervised learning: the program is presented with a series of inputs and learns how they are organised.
  - clustering (or segmentation)
  - dimensionality reduction
• Reinforcement learning: the program learns to determine the ideal behaviour based on feedback from the environment, in the form of rewards or punishments.
  - game playing
  - on-line control
Supervised Learning Setting: outcome measurements and predictor measurements are available

Data: a list of observations of the form L = {<X, y>}, where

X: the n x p feature matrix (design matrix)
  - n samples / examples / instances (rows <x_i1, x_i2, ..., x_ip>)
  - p features / predictors / covariates (columns)
y: the n x 1 target vector (labels)
  - regression: continuous values
  - classification: a finite set of types

Problem: learn y = f(X)
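Purely as an illustration of this layout (the numbers below are made up, not data from the lecture), in Python/NumPy the data might look like:

```python
# Illustrative layout only (made-up numbers): n = 3 samples, p = 2 features.
import numpy as np

X = np.array([[0.1, 0.7],    # n x p feature / design matrix, one row per sample
              [0.4, 0.2],
              [0.9, 0.8]])
y = np.array([0, 0, 1])      # n x 1 target vector (class labels here)

print(X.shape, y.shape)      # (3, 2) (3,)
# Goal: learn f so that f(X[i]) approximates y[i] on unseen data.
```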
A Binary Classification Task
Two real-valued features <x1, x2>, labels y ∈ {class1, class2}, n = 30.
How do we automatically find a mapping f from (x1, x2) to y?
[Scatter plot of the 30 points in the (x1, x2) unit square, coloured by class]
Base Model: predict the majority class (minimise the misclassification error)
[Same scatter plot; the whole region is labelled with the majority class]
Root node: predict class1 (22 class1 and 8 class2 observations)
Divide and Conquer: recursively partition the feature space and assign a base model to each partition
[Scatter plot with horizontal split lines at x2 = 0.43 and x2 = 0.77]
Tree so far (node label: majority class, class1/class2 counts):
  root: class1 (22, 8)
    x2 < 0.43?   yes -> class1 (13, 0)
                 no  -> class1 (9, 8)
      x2 >= 0.77?  yes -> class1 (7, 0)
                   no  -> class2 (2, 8)
Divide and Conquer (continued): a further split on x1 refines the remaining impure region
[Scatter plot with split lines at x2 = 0.43, x2 = 0.77 and x1 = 0.7]
Final tree (node label: majority class, class1/class2 counts):
  root: class1 (22, 8)
    x2 < 0.43?   yes -> class1 (13, 0)
                 no  -> class1 (9, 8)
      x2 >= 0.77?  yes -> class1 (7, 0)
                   no  -> class2 (2, 8)
        x1 >= 0.7?  yes -> class1 (2, 0)
                    no  -> class2 (0, 8)
Decision Tree: recursive binary splitting - things to notice
• The target y is approximated by a piecewise-constant function.
• The feature space X is partitioned into disjoint regions.
• The goal is to find partitions that minimise the prediction error.
• It is computationally infeasible to consider all possible partitions.
• Recursive binary splitting is a top-down, greedy procedure:
  - splits are defined by a split variable and a split point
  - at any step, all possible splits in the data are tested
  - the split that yields the most "pure" nodes is chosen
• The splitting could continue until all nodes are "pure"...
Tree Building Algorithm (pseudocode from G. Louppe, 2014)

function BuildDecisionTree(L)
  Create node t
  if the stopping criterion is met for t then
    Assign a model (prediction) y_t to node t
  else
    Find the split s* on L that maximises the impurity decrease:
      s* = argmax_s [ i(t) - p_L i(t_sL) - p_R i(t_sR) ]
    Partition L into L_tL ∪ L_tR according to s*
    t_L = BuildDecisionTree(L_tL)
    t_R = BuildDecisionTree(L_tR)
  end if
  return t
end function
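A hedged Python sketch of this recursive procedure for a binary classification task is given below; the names (gini, best_split, build_tree) and the max_depth stopping criterion are illustrative choices, not Louppe's implementation.

```python
# Hedged sketch of recursive binary splitting for binary classification (0/1 labels).
import numpy as np

def gini(y):
    """Gini impurity of a set of 0/1 labels."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)                            # proportion of class 1
    return 2 * p * (1 - p)

def best_split(X, y):
    """Return (feature, threshold, impurity decrease) of the best split."""
    best = (None, None, 0.0)
    parent, n = gini(y), len(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] < t
            if left.sum() == 0 or left.sum() == n:
                continue                      # skip splits that leave a child empty
            decrease = parent - (left.sum() / n) * gini(y[left]) \
                              - ((~left).sum() / n) * gini(y[~left])
            if decrease > best[2]:
                best = (j, t, decrease)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Grow the tree recursively; leaves store the majority class."""
    j, t, decrease = best_split(X, y)
    if depth >= max_depth or decrease <= 0.0:          # stopping criterion
        return {"leaf": True, "prediction": int(round(np.mean(y)))}
    left = X[:, j] < t
    return {"leaf": False, "feature": j, "threshold": t,
            "left":  build_tree(X[left],  y[left],  depth + 1, max_depth),
            "right": build_tree(X[~left], y[~left], depth + 1, max_depth)}

def predict(tree, x):
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] < tree["threshold"] else tree["right"]
    return tree["prediction"]

# Tiny made-up example:
# X = np.array([[0.2, 0.9], [0.8, 0.1], [0.3, 0.8], [0.7, 0.2]]); y = np.array([1, 0, 1, 0])
# predict(build_tree(X, y), np.array([0.25, 0.85]))   # -> 1
```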
Measuring Node Impurity: for a binary classification task; figure from (Hastie et al. 2009)
[Figure 9.3 from Hastie et al. (2009): the three node-impurity measures for two-class classification - misclassification error, Gini index and cross-entropy - plotted as a function of the proportion p in class 2, with cross-entropy scaled to pass through (0.5, 0.5). All three are similar, but cross-entropy and the Gini index are differentiable and more sensitive to changes in the node probabilities than the misclassification rate.]
If p is the proportion of samples of the other class in node t:
  Misclassification rate:  i(t) = 1 - max(p, 1 - p)
  Gini index:              i(t) = 2 p (1 - p)
  Cross-entropy (scaled):  i(t) = [ -p log(p) - (1 - p) log(1 - p) ] / (2 log 2)
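As a quick check of these formulas (illustrative code, not from the lecture), all three measures can be evaluated in Python; they all equal 0.5 at p = 0.5 and vanish at p = 0 and p = 1.

```python
# Illustrative check of the three impurity measures for a two-class node.
import numpy as np

def misclassification(p):
    return 1 - np.maximum(p, 1 - p)

def gini(p):
    return 2 * p * (1 - p)

def cross_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)        # avoid log(0) at the endpoints
    # scaled by 2*log(2) so the curve passes through (0.5, 0.5), as in the figure
    return (-p * np.log(p) - (1 - p) * np.log(1 - p)) / (2 * np.log(2))

print(misclassification(0.5), gini(0.5), cross_entropy(0.5))             # 0.5 0.5 0.5
print(misclassification(0.0), gini(0.0), round(cross_entropy(0.0), 6))   # 0.0 0.0 0.0
```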
Classification And Regression Trees

By swapping the impurity function and the leaf model, decision trees can be used to solve classification and regression tasks:

Classification:
  - y is symbolic / discrete, e.g., Y = {class1, class2}
  - leaf model: y_hat = argmax_{c in Y} p(c | t), i.e. the majority class in node t
  - impurity: i(t) = entropy(t) or i(t) = gini(t)

Regression:
  - y is numeric / continuous
  - leaf model: y_hat = mean(y | t), i.e. the average of the target values in node t
  - impurity: i(t) = (1/n_t) * sum over (x, y) in L_t of (y - y_hat_t)^2, i.e. the mean squared error
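If scikit-learn is available (the lecture does not prescribe any particular library), the swap is a one-line change of estimator; a hedged sketch:

```python
# Hedged sketch with scikit-learn: the same tree machinery handles both tasks by
# swapping the impurity criterion and the leaf model.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf = DecisionTreeClassifier(criterion="gini", max_depth=3)           # or criterion="entropy"
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3)   # MSE at the leaves
# clf.fit(X, y_labels); reg.fit(X, y_values)
```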
A Simple Regression Tree
<x, y> continuous variables, n = 20.
[Scatter plot of the 20 points, x from 0 to 500, y roughly from 10 to 40, with the piecewise-constant fit]
Fitted tree (node label: mean of y, number of samples):
  root: 19.5, n=20
    x < 418?  yes -> 14.8, n=14
                     x >= 154?  yes -> 11.7, n=9
                                no  -> 20.5, n=5
              no  -> 30.5, n=6
                     x < 460?   yes -> 24.4, n=3
                                no  -> 36.5, n=3
Model Selection on tree parameters
[Four panels: the regression-tree fit (a piecewise-constant step function) overlaid on the same 20-point scatter plot for maximum depths 1, 2, 3 and 4; deeper trees produce more, smaller steps]
Stopping condition: e.g., maximum depth or minimum number of samples per node
[Fitted trees for maximum depths 1 to 4 (node label: mean of y, number of samples):]
  depth=1: root 19.5 (n=20) splits at x < 418 into leaves 14.8 (n=14) and 30.5 (n=6)
  depth=2: those leaves split further at x >= 154 (into 11.7, n=9 and 20.5, n=5) and x < 460 (into 24.4, n=3 and 36.5, n=3)
  depth=3: additional splits at x < 366, x >= 37.1 and x >= 444 produce leaves such as 10.6 (n=8), 20.5 (n=1), 18.4 (n=3), 23.7 (n=2), 22.3 (n=2), 28.5 (n=1)
  depth=4: splitting continues (e.g., at x < 21.9, x < 474, x >= 478) until many leaves contain only one or two samples
Recall: Underfitting and Overfitting
The goal of the model is to minimise the prediction error on unseen data.
[Plot: training-set and test-set MSE as a function of the tree's maximum depth]
• Overly complex trees are likely to overfit the training data:
  - to avoid this, tune the stopping criteria (or prune post hoc)
  - cross-validation can be used for model selection
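One possible way to do that model selection in practice is sketched below with scikit-learn; the synthetic data generation is invented for illustration and is not the lecture's dataset.

```python
# Sketch of choosing the maximum depth by cross-validation (scikit-learn, invented data).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 500, size=(200, 1))                  # one synthetic feature
y = 15 + 0.05 * X[:, 0] + rng.normal(0, 2, size=200)    # noisy target

for depth in (1, 2, 3, 4, 5):
    tree = DecisionTreeRegressor(max_depth=depth)
    mse = -cross_val_score(tree, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(depth, round(mse, 2))   # pick the depth with the lowest cross-validated MSE
```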
Recall: Bias and Variance
Models with low bias and low variance have lower expected prediction error.
[Illustration: scatter of predictions around a target for the four combinations of low/high bias and low/high variance]
Bias and Variance of a Regression Tree
• Decision trees have, in general, low bias but high variance:
  - to reduce variance, combine the predictions of several trees! (see bagging and ensembles of randomised trees)
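A hedged scikit-learn sketch of that idea; X and y stand for any regression dataset (for example the synthetic one from the cross-validation sketch above).

```python
# Hedged sketch: averaging many trees to reduce variance (scikit-learn).
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100)   # bagging
forest = RandomForestRegressor(n_estimators=100)                       # randomised trees
# bagged.fit(X, y); forest.fit(X, y)   # predictions are averaged over the 100 trees
```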
References

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer.

Louppe, G. (2014). Understanding Random Forests: From Theory to Practice. PhD thesis, Universite de Liege, Liege, Belgique.
Artificial Intelligence (CSC9YE) Revision: Machine Learning - Clustering
Unsupervised Learning
• Unsupervised learning: no labeled examples, no training set.
• We want to find interesting things about a set of data. Is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations?
• This means grouping and separating data points at the same time.
• We need a way to measure how (dis)similar the data points are, for example with the Euclidean distance.
• It is intrinsically more difficult than supervised learning because there is no gold standard (like an outcome variable) and no single objective (like test-set accuracy).
Two Clustering Methods
• In K-means clustering, we seek to partition the observations into a pre-specified number of clusters k.
• In hierarchical clustering, we do not know in advance how many clusters we want; in fact, we end up with a tree-like visual representation of the observations, called a dendrogram, that allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to n.
K-means: An Optimisation Problem
• Minimise the within-cluster variation.
• Algorithm (a minimal Python sketch follows this slide):
  1. Randomly select k points. These serve as the initial cluster centroids for the observations.
  2. Assign each observation to the cluster whose centroid is closest.
  3. Iterate until the cluster assignments stop changing:
     3.1 For each of the k clusters, compute the cluster centroid. The k-th cluster centroid is the vector of the p feature means for the observations in the k-th cluster.
     3.2 Assign each observation to the cluster whose centroid is closest.
• Properties:
  - The algorithm is guaranteed to decrease the value of the objective; however, it is not guaranteed to find the global minimum.
  - The algorithm may get stuck in a local optimum.
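A minimal sketch of this loop in Python/NumPy; the initialisation picks k observations at random, and empty clusters are not handled, for brevity.

```python
# Minimal k-means sketch following the steps above (illustrative, not the lecture's code).
import numpy as np

def k_means(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random initial centroids
    labels = None
    while True:
        # steps 2 / 3.2: assign each observation to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            return labels, centroids                            # assignments stopped changing
        labels = new_labels
        # step 3.1: recompute each centroid as the mean of its assigned observations
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

# labels, centroids = k_means(points, k=2)   # `points` would be an n x p array
```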
K-means Algorithm (k = 2): randomly chosen centroids in the initial step
[Scatter plot of the ten observations, numbered 1-10, on a 0-8 by 0-8 grid; the zero distances below indicate the initial centroids coincide with observations 6 and 10]
Distances to the two centroids:

  point   c1     c2
  1       6.08   5.39
  2       5.10   5.10
  3       4.24   3.16
  4       2.24   5.39
  5       1.00   6.40
  6       0.00   7.21
  7       7.28   6.08
  8       6.08   5.00
  9       8.06   3.61
  10      7.21   0.00
K-means Algorithm, last step: no change in centroids
[Scatter plot of the ten observations with the final centroids c1 and c2 marked]
Distances to the two centroids:

  point   c1     c2
  1       3.34   7.96
  2       2.34   7.30
  3       1.86   4.51
  4       0.69   5.94
  5       2.67   5.77
  6       2.91   6.77
  7       7.47   2.51
  8       6.07   1.68
  9       7.03   1.35
  10      5.01   3.58
Hierarchical Clustering
• Hierarchical clustering does not require that we commit to a particular choice of k.
• Bottom-up or agglomerative clustering: a dendrogram (a tree) is built starting from the leaves and combining clusters up to the trunk.
• Algorithm (a short library-based sketch follows this slide):
  1. Start with each point in its own cluster.
  2. Repeat until all points are in a single cluster:
     - identify the closest two clusters and merge them.
• Similarity between clusters: for single/complete/average linkage, compute all pairwise distances between the observations in cluster A and the observations in cluster B, and record the smallest/largest/average of these distances.
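In practice the whole procedure is available off the shelf; a hedged sketch with SciPy (the five 2-D points below are invented stand-ins, not the ten observations of the worked example that follows).

```python
# Hedged sketch of agglomerative clustering with SciPy (not required by the lecture).
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 2.0], [4.0, 1.5], [6.0, 2.0]])
Z = linkage(X, method="single")                    # also "complete" or "average" linkage
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
print(labels)
# dendrogram(Z)                                    # draws the tree of merges (needs matplotlib)
```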
Hierarchical Clustering: example using single linkage
[Scatter plot of the ten observations, numbered 1-10, on a 0-8 by 0-8 grid]
Hierarchical Clustering

Distance matrix (pairwise distances between the ten observations):

        1     2     3     4     5     6     7     8     9    10
  1    0.0
  2    1.0   0.0
  3    3.6   2.8   0.0
  4    4.0   3.0   2.2   0.0
  5    6.0   5.0   3.6   2.0   0.0
  6    6.1   5.1   4.2   2.2   1.0   0.0
  7   10.0   9.2   6.4   7.2   6.3   7.3   0.0
  8    8.6   7.8   5.0   5.8   5.1   6.1   1.4   0.0
  9    8.6   8.1   5.4   7.1   7.1   8.1   3.2   2.8   0.0
  10   5.4   5.1   3.2   5.4   6.4   7.2   6.1   5.0   3.6   0.0

Initial clusters: {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}
Hierarchical Clustering: single-linkage merging steps

  d     k    Clusters                                    Comment
  0.0   10   {1},{2},{3},{4},{5},{6},{7},{8},{9},{10}    Start with each observation as its own cluster.
  1.0   8    {1,2},{3},{4},{5,6},{7},{8},{9},{10}        Merge {1} and {2} as well as {5} and {6}, since they are the closest: d(1,2)=1.0 and d(5,6)=1.0.
  1.4   7    {1,2},{3},{4},{5,6},{7,8},{9},{10}          Merge {7} and {8}, since they are the closest: d(7,8)=1.4.
  2.0   6    {1,2},{3},{4,5,6},{7,8},{9},{10}            Merge {4} and {5,6}, since 4 and 5 are the closest: d(4,5)=2.0.
  2.2   5    {1,2},{3,4,5,6},{7,8},{9},{10}              Merge {3} and {4,5,6}, since 3 and 4 are the closest: d(3,4)=2.2.
  2.8   3    {1,2,3,4,5,6},{7,8,9},{10}                  Merge {1,2} with {3,4,5,6} as well as {7,8} with {9}, since 2 and 3 as well as 8 and 9 are the closest: d(2,3)=2.8 and d(8,9)=2.8.
  3.2   2    {1,2,3,4,5,6,10},{7,8,9}                    Merge {1,2,3,4,5,6} and {10}, since 3 and 10 are the closest: d(3,10)=3.2.
  3.6   1    {1,2,3,4,5,6,7,8,9,10}                      Merge the remaining two clusters: d(9,10)=3.6.
Hierarchical Clustering
[Scatter plot of the ten observations again, shown for reference alongside the dendrogram]
Hierarchical Clustering
[Single-linkage cluster dendrogram: leaves ordered 9, 7, 8, 10, 1, 2, 3, 4, 5, 6; Height axis running from 1.0 to 3.5]