— INF4820 — Algorithms for AI and NLP
Hierarchical Clustering
Erik Velldal & Stephan Oepen
Language Technology Group (LTG)
October 7, 2015
Agenda

Last week
  Evaluation of classifiers
  Machine learning for class discovery: clustering
    Unsupervised learning from unlabeled data.
    Automatically group similar objects together.
    No pre-defined classes: we only specify the similarity measure.
  Flat clustering, with k-means.

Today
  Hierarchical clustering
    Top-down / divisive
    Bottom-up / agglomerative
  Crash course on probability theory
  Language modeling
Agglomerative clustering

Initially: regards each object as its own singleton cluster.
Iteratively 'agglomerates' (merges) the groups in a bottom-up fashion.
Each merge defines a binary branch in the tree.
Terminates: when only one cluster remains (the root).

parameters: o1, o2, ..., on, sim
C = {{o1}, {o2}, ..., {on}}
T = []
for i = 1 to n − 1 do
    (cj, ck) ← argmax_{cj, ck ∈ C, j ≠ k} sim(cj, ck)
    C ← C \ {cj, ck}
    C ← C ∪ {cj ∪ ck}
    T[i] ← {cj, ck}

At each stage, we merge the pair of clusters that are most similar, as defined by some measure of inter-cluster similarity: sim.
Plugging in a different sim gives us a different sequence of merges T.
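The procedure above can be sketched directly in code. This is an illustrative translation, not the lecture's implementation: `sim` is any inter-cluster similarity function, and the toy single-link similarity at the end is my own example.

```python
def agglomerate(objects, sim):
    """Merge the two most similar clusters until one remains.

    Returns the merge history T: a list of (cluster, cluster) pairs,
    where each cluster is a frozenset of the original objects.
    """
    C = [frozenset([o]) for o in objects]  # each object starts as a singleton
    T = []
    while len(C) > 1:
        # find the most similar pair (cj, ck), j != k
        cj, ck = max(
            ((x, y) for i, x in enumerate(C) for y in C[i + 1:]),
            key=lambda pair: sim(pair[0], pair[1]),
        )
        C.remove(cj)
        C.remove(ck)
        C.append(cj | ck)          # the merge defines one binary branch
        T.append((cj, ck))
    return T

# toy 1-D demo: similarity = negated gap between closest members (single-link)
def single_link(ci, cj):
    return -min(abs(x - y) for x in ci for y in cj)

history = agglomerate([1, 2, 10, 11], single_link)
```

The history has n − 1 merges; the last one joins the two big groups and corresponds to the root of the dendrogram.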
Dendrograms

A hierarchical clustering is often visualized as a binary tree structure known as a dendrogram.
A merge is shown as a horizontal line connecting two clusters.
The y-axis coordinate of the line corresponds to the similarity of the merged clusters.
We here assume dot-products of normalized vectors (self-similarity = 1).
Definitions of inter-cluster similarity

So far we've looked at ways to define the similarity between
  pairs of objects;
  objects and a class.
Now we'll look at ways to define the similarity between collections.

In agglomerative clustering, a measure of cluster similarity sim(ci, cj) is usually referred to as a linkage criterion:
  Single-linkage
  Complete-linkage
  Average-linkage
  Centroid-linkage
It determines the pair of clusters to merge in each step.
Single-linkage

Merge the two clusters with the minimum distance between any two members.
'Nearest neighbors'.

Can be computed efficiently by taking advantage of the fact that it is best-merge persistent:
  Let the nearest neighbor of cluster ck be in either ci or cj. If we merge ci ∪ cj = cl, the nearest neighbor of ck will be in cl.
  The distance of the two closest members is a local property that is not affected by merging.

Undesirable chaining effect: a tendency to produce 'stretched' and 'straggly' clusters.
Complete-linkage

Merge the two clusters where the maximum distance between any two members is smallest.
'Farthest neighbors'.

Amounts to merging the two clusters whose merger has the smallest diameter.
Preference for compact clusters with small diameters.
Sensitive to outliers.
Not best-merge persistent: distance defined as the diameter of a merge is a non-local property that can change during merging.
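The four linkage criteria can be phrased as small similarity functions over clusters of vectors. This is an illustrative sketch under the slides' assumption of dot-product similarity; `dot` and `centroid` are my own helpers, not from the lecture.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def single_link(ci, cj):
    # 'nearest neighbors': the best pair across the two clusters
    return max(dot(x, y) for x in ci for y in cj)

def complete_link(ci, cj):
    # 'farthest neighbors': the worst pair across the two clusters
    return min(dot(x, y) for x in ci for y in cj)

def average_link(ci, cj):
    # mean pairwise similarity in the union, excluding self-similarities
    union = list(ci) + list(cj)
    n = len(union)
    total = (sum(dot(x, y) for x in union for y in union)
             - sum(dot(x, x) for x in union))
    return total / (n * (n - 1))

def centroid(c):
    n = len(c)
    dim = len(next(iter(c)))
    return tuple(sum(v[k] for v in c) / n for k in range(dim))

def centroid_link(ci, cj):
    return dot(centroid(ci), centroid(cj))
```

Plugging any of these in as `sim` changes which pair of clusters gets merged at each step.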
Average-linkage (1:2)

AKA group-average agglomerative clustering.
Merge the clusters with the highest average pairwise similarity in their union.

Aims to maximize coherency by considering all pairwise similarities between objects within the cluster to merge (excluding self-similarities).
A compromise between complete- and single-linkage.
Not best-merge persistent.
Commonly considered the best default clustering criterion.
Average-linkage (2:2)

Can be computed very efficiently if we assume (i) the dot-product as the similarity measure for (ii) normalized feature vectors.

Let ci ∪ cj = ck, and sim(ci, cj) = W(ci ∪ cj) = W(ck). Then

W(c_k) = \frac{1}{|c_k|(|c_k| - 1)} \sum_{\vec{x} \in c_k} \sum_{\vec{y} \neq \vec{x} \in c_k} \vec{x} \cdot \vec{y}
       = \frac{1}{|c_k|(|c_k| - 1)} \left( \Big\| \sum_{\vec{x} \in c_k} \vec{x} \Big\|^2 - |c_k| \right)

The sum of vector similarities is equal to the similarity of their sums.
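The identity behind this shortcut can be checked numerically. A sketch of my own: for length-normalized vectors, the sum of dot products over distinct pairs equals the squared length of the summed vector minus |ck| (the removed self-similarities, each equal to 1).

```python
import math
import random

def normalise(v):
    n = math.sqrt(sum(a * a for a in v))
    return tuple(a / n for a in v)

def pairwise_sum(vectors):
    # brute force: all ordered pairs x != y
    return sum(
        sum(a * b for a, b in zip(x, y))
        for i, x in enumerate(vectors)
        for j, y in enumerate(vectors)
        if i != j
    )

def fast_sum(vectors):
    # the slide's shortcut: square the summed vector, subtract the self-terms
    dim = len(vectors[0])
    s = tuple(sum(v[k] for v in vectors) for k in range(dim))
    return sum(a * a for a in s) - len(vectors)

random.seed(0)
vectors = [normalise((random.random(), random.random(), random.random()))
           for _ in range(6)]
```

The brute-force sum costs O(|ck|²) dot products; the shortcut needs only one summed vector, which is what makes group-average practical.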
Centroid-linkage

Similarity of clusters ci and cj defined as the similarity of their cluster centroids ~μi and ~μj.

Equivalent to the average pairwise similarity between objects from different clusters:

sim(c_i, c_j) = \vec{\mu}_i \cdot \vec{\mu}_j = \frac{1}{|c_i||c_j|} \sum_{\vec{x} \in c_i} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y}

Not best-merge persistent.
Not monotonic, subject to inversions: the combination similarity can increase during the clustering.
Monotonicity

A fundamental assumption in clustering: small clusters are more coherent than large ones.
We usually assume that a clustering is monotonic: similarity is decreasing from iteration to iteration.
This assumption holds true for all our clustering criteria except centroid-linkage.
Inversions – a problem with centroid-linkage

Centroid-linkage is non-monotonic.
We risk seeing so-called inversions: similarity can increase during the sequence of clustering steps.
This would show as crossing lines in the dendrogram: the horizontal merge bar is lower than the bar of a previous merge.
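A minimal construction (my own, not from the slides) that triggers an inversion: three points forming a near-equilateral triangle. The first merge happens at distance 2.0, but the centroid of that merge is then closer to the remaining point than 2.0, so the combination distance decreases.

```python
import math

a, b, c = (0.0, 0.0), (2.0, 0.0), (1.0, 1.74)

# step 1: a and b are the closest pair (|ab| = 2.0 < |ac| = |bc| ≈ 2.007)
first = math.dist(a, b)

# step 2: centroid of {a, b} against c
mu = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
second = math.dist(mu, c)     # ≈ 1.74 < 2.0: an inversion
```

In a dendrogram drawn by merge distance, the second bar would sit below the first, i.e. the bars would cross.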
Linkage criteria

[Four example clusterings of the same data: single-link, complete-link, average-link, centroid-link.]

All the linkage criteria can be computed on the basis of the object similarities; the input is typically a proximity matrix.
Cutting the tree

The tree actually represents several partitions: one for each level.
If we want to turn the nested partitions into a single flat partitioning, we must cut the tree.
A cutting criterion can be defined as a threshold on e.g. combination similarity, relative drop in similarity, number of root nodes, etc.
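One of those cutting criteria can be sketched directly: stop merging once the best inter-cluster similarity falls below a threshold t, and return whatever clusters remain as the flat partition. This is my own sketch, not the lecture's code; the 1-D single-link similarity at the end is illustrative.

```python
def cut_by_threshold(objects, sim, t):
    """Agglomerate, but cut the tree at combination similarity t."""
    C = [frozenset([o]) for o in objects]
    while len(C) > 1:
        cj, ck = max(
            ((x, y) for i, x in enumerate(C) for y in C[i + 1:]),
            key=lambda p: sim(p[0], p[1]),
        )
        if sim(cj, ck) < t:
            break                  # cutting the tree: keep current clusters
        C.remove(cj)
        C.remove(ck)
        C.append(cj | ck)
    return C

def gap(ci, cj):
    # single-link style similarity: negated smallest gap between members
    return -min(abs(x - y) for x in ci for y in cj)

parts = cut_by_threshold([1, 2, 10, 11], gap, t=-5)
```

With this threshold the two tight groups are merged internally, but the final root merge (similarity −8) is cut off, leaving a flat two-cluster partition.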
Divisive hierarchical clustering

Generates the nested partitions top-down:
  Start: all objects considered part of the same cluster (the root).
  Split the cluster using a flat clustering algorithm (e.g. by applying k-means with k = 2).
  Recursively split the clusters until only singleton clusters remain (or some specified number of levels is reached).

Flat methods are generally very efficient (e.g. k-means is linear in the number of objects).
Divisive methods are thereby also generally more efficient than agglomerative methods, which are at least quadratic (single-link).
They are also able to initially consider the global distribution of the data, while the agglomerative methods must commit to early decisions based on local patterns.
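The recursive split can be sketched with a naive 2-means as the flat algorithm. This is an assumption-laden illustration: the slides only name k-means with k = 2, and the seeding and stopping rule here (`min_size`) are my own simplifications.

```python
import math

def two_means(points, iters=20):
    """Very naive 2-means: seed on the first and last point, then iterate."""
    a, b = points[0], points[-1]
    for _ in range(iters):
        left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
        right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
        if not left or not right:
            break
        # recompute the two centroids
        a = tuple(sum(c) / len(left) for c in zip(*left))
        b = tuple(sum(c) / len(right) for c in zip(*right))
    return left, right

def divisive(points, min_size=1):
    """Top-down clustering: split recursively until clusters are small."""
    if len(points) <= min_size:
        return [points]
    left, right = two_means(points)
    if not left or not right:
        return [points]
    return divisive(left, min_size) + divisive(right, min_size)

clusters = divisive([(0, 0), (0, 1), (10, 10), (10, 11)], min_size=2)
```

Note that each split only sees the points handed to it, so early splits are made with the global distribution in view, exactly the property the slide highlights.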
University of Oslo : Department of Informatics
INF4820: Algorithms for
Artificial Intelligence and
Natural Language Processing
Basic Probability Theory & Language Models
Stephan Oepen & Erik Velldal
Language Technology Group (LTG)
October 7, 2015
Changing of the Guard

So far: Point-wise classification; geometric models.
Next: Structured classification; probabilistic models.
  sequences
  labelled sequences
  trees

[Photos: Kristian (December 10, 2014), Guro (March 16, 2015)]
By the End of the Semester . . .

. . . you should be able to determine
  which string is most likely: How to recognise speech vs. How to wreck a nice beach
  which category sequence is most likely for flies like an arrow: N V D N vs. V P D N
  which syntactic analysis is most likely:

[Two parse trees for I ate sushi with tuna, differing in where the PP with tuna attaches.]
Probability Basics (1/4)

Experiment (or trial): the process we are observing
Sample space (Ω): the set of all possible outcomes
Event(s): the subset(s) of Ω we are interested in
P(A): the probability of event A, a real number ∈ [0, 1]
Probability Basics (2/4)

Experiment (or trial): rolling a die
Sample space: Ω = {1, 2, 3, 4, 5, 6}
Events: A = rolling a six: {6}; B = getting an even number: {2, 4, 6}
P(A): the probability of event A, a real number ∈ [0, 1]
Probability Basics (3/4)

Experiment (or trial): flipping two coins
Sample space: Ω = {HH, HT, TH, TT}
Events: A = the same both times: {HH, TT}; B = at least one head: {HH, HT, TH}
P(A): the probability of event A, a real number ∈ [0, 1]
Probability Basics (4/4)

Experiment (or trial): rolling two dice
Sample space: Ω = {11, 12, 13, 14, 15, 16, 21, 22, 23, 24, . . . , 63, 64, 65, 66}
Events: A = results sum to 6: {15, 24, 33, 42, 51}; B = both results are even: {22, 24, 26, 42, 44, 46, 62, 64, 66}
P(A): the probability of event A, a real number ∈ [0, 1]
Joint Probability

P(A, B): probability that both A and B happen; also written P(A ∩ B).

What is the probability, when throwing two fair dice, that
  A: the results sum to 6?  P(A) = 5/36
  B: at least one result is a 1?  P(B) = 11/36
Conditional Probability

Often, we know something about a situation.

What is the probability P(A|B), when throwing two fair dice, that
  A: the results sum to 6, given
  B: at least one result is a 1?

P(A|B) = P(A ∩ B) / P(B)   (where P(B) > 0)
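The dice example can be checked by brute-force enumeration of the 36 equally likely outcomes (illustrative code, not from the slides):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes
B = [o for o in omega if 1 in o]               # at least one result is a 1
A_and_B = [o for o in B if sum(o) == 6]        # ...and the results sum to 6

# P(A|B) = P(A ∩ B) / P(B); with a uniform sample space this
# reduces to counting outcomes inside B
p = Fraction(len(A_and_B), len(B))
```

Only (1, 5) and (5, 1) satisfy both events, so conditioning on B shrinks the sample space to 11 outcomes and gives P(A|B) = 2/11.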
The Chain Rule

Joint probability is symmetric:

P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B)   (multiplication rule)

More generally, using the chain rule:

P(A_1 ∩ · · · ∩ A_n) = P(A_1) P(A_2|A_1) P(A_3|A_1 ∩ A_2) · · · P(A_n | ∩_{i=1}^{n-1} A_i)

The chain rule will be very useful to us through the semester:
  it allows us to break a complicated situation into parts;
  we can choose the breakdown that suits our problem.
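A small numeric check of the chain rule on three events over two dice; the events themselves are my own illustrative choices:

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))
A1 = {o for o in omega if o[0] % 2 == 0}   # first die is even
A2 = {o for o in omega if sum(o) > 6}      # sum is greater than 6
A3 = {o for o in omega if o[1] == 6}       # second die shows a 6

def P(E):
    return Fraction(len(E), len(omega))

# chain rule: P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1 ∩ A2)
lhs = P(A1 & A2 & A3)
rhs = (P(A1)
       * (P(A1 & A2) / P(A1))
       * (P(A1 & A2 & A3) / P(A1 & A2)))
```

The factors telescope, so the two sides agree exactly; with these events both equal 3/36 = 1/12.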
(Conditional) Independence

If knowing that event B is true has no effect on event A, we say that A and B are independent of each other.

If A and B are independent:
  P(A) = P(A|B)
  P(B) = P(B|A)
  P(A ∩ B) = P(A) P(B)
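Independence can also be checked by enumeration. The example events here are my own: for two fair dice, 'first die even' and 'second die even' are independent, while 'first die even' and 'sum greater than 7' are not.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))

def P(pred):
    return Fraction(sum(1 for o in omega if pred(o)), len(omega))

A = lambda o: o[0] % 2 == 0          # first die even
B = lambda o: o[1] % 2 == 0          # second die even
C = lambda o: sum(o) > 7             # sum greater than 7

# independence test: does P(X ∩ Y) = P(X) P(Y)?
independent_AB = P(lambda o: A(o) and B(o)) == P(A) * P(B)
independent_AC = P(lambda o: A(o) and C(o)) == P(A) * P(C)
```

Here P(A ∩ B) = 1/4 = P(A) P(B), but P(A ∩ C) = 9/36 while P(A) P(C) = 15/72, so knowing the first die is even does change the odds of a large sum.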
Intuition? (1/3)

Let's say we have a rare disease, and a pretty accurate test for detecting it. Yoda has taken the test, and the result is positive.

The numbers:
  disease prevalence: 1 in 1000 people
  test false negative rate: 1%
  test false positive rate: 2%

What is the probability that he has the disease?
Intuition? (2/3)

Given:
  event A: have disease
  event B: positive test

We know:
  P(A) = 0.001
  P(B|A) = 0.99
  P(B|¬A) = 0.02

We want:
  P(A|B) = ?
Intuition? (3/3)

           A         ¬A
  B      0.00099   0.01998   0.02097
  ¬B     0.00001   0.97902   0.97903
         0.001     0.999     1

P(A) = 0.001; P(B|A) = 0.99; P(B|¬A) = 0.02

P(A ∩ B) = P(B|A) P(A)

P(A|B) = P(A ∩ B) / P(B) = 0.00099 / 0.02097 ≈ 0.0472
Bayes' Theorem

From the two 'symmetric' sides of the joint probability equation:

P(A|B) = P(B|A) P(A) / P(B)

  reverses the order of dependence (which can be useful);
  in conjunction with the chain rule, allows us to determine the probabilities we want from the probabilities we know.

Other useful axioms:
  P(Ω) = 1
  P(A) = 1 − P(¬A)
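The disease example can be redone with Bayes' theorem directly, obtaining P(B) via the law of total probability, P(B) = P(B|A) P(A) + P(B|¬A) P(¬A):

```python
p_A = 0.001             # prevalence
p_B_given_A = 0.99      # 1 - false negative rate
p_B_given_notA = 0.02   # false positive rate

# total probability of a positive test
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
```

This recovers the table's numbers: P(B) = 0.02097 and P(A|B) ≈ 0.0472, so even after a positive test the disease is quite unlikely.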
Bonus: The Monty Hall Problem

On a game show, there are three doors.
Behind two doors, there is a goat.
Behind the third door, there is a car.

The contestant selects the door that she hopes has the car behind it.
Before she opens that door, the game show host opens one of the other doors to reveal a goat.
The contestant now has the choice of opening the door she originally chose, or switching to the other unopened door.

What should she do?
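The standard simulation (my addition, not from the slides) settles the question empirically: switching wins about 2/3 of the time, staying only about 1/3, so she should switch.

```python
import random

def play(switch, rng):
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # the host opens a goat door that is neither the pick nor the car
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(42)
trials = 10_000
wins_switch = sum(play(True, rng) for _ in range(trials)) / trials
wins_stay = sum(play(False, rng) for _ in range(trials)) / trials
```

The intuition matches Bayes: staying only wins when the first pick was right (probability 1/3), and switching wins in every other case.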
![Page 58: — INF4820 — Algorithms for AI and NLP...Complete-linkage Merge the two clusters where the maximum distance between any two members is smallest. ‘Farthest neighbors’. Amounts](https://reader035.fdocuments.net/reader035/viewer/2022071411/6107aa72137e1005093fbbe2/html5/thumbnails/58.jpg)
Coming up Next
Do you want to come to the movies and ?
Det var en ? (Norwegian: ‘It was a ?’)

Je ne parle pas ? (French: ‘I don’t speak ?’)
Natural language contains redundancy, hence can be predictable.
Previous context can constrain the next word:

semantically;

syntactically;

by frequency.
Recall: By the End of the Semester . . .

. . . you should be able to determine

which string is most likely: How to recognise speech vs. How to wreck a nice beach

which category sequence is most likely for flies like an arrow: N V D N vs. V P D N

which syntactic analysis is most likely for I ate sushi with tuna: [two alternative parse trees (S → NP VP, with ate as VBD and the PP with tuna attached differently); the figure is not fully recoverable from the text]
Language Models

A probabilistic (also known as stochastic) language model M assigns probabilities PM(x) to all strings x in language L.

L is the sample space; 0 ≤ PM(x) ≤ 1 and ∑x∈L PM(x) = 1

Language models are used in machine translation, speech recognition systems, spell checkers, input prediction, . . .
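The two requirements can be illustrated with a toy model over a three-string ‘language’ (the strings and probabilities here are invented for illustration):

```python
# A toy language model M over a tiny 'language' L of three strings.
M = {
    "how to recognise speech": 0.6,
    "how to wreck a nice beach": 0.3,
    "det var en fin dag": 0.1,
}

# Requirement 1: each probability lies in [0, 1].
assert all(0.0 <= p <= 1.0 for p in M.values())
# Requirement 2: the probabilities over the sample space L sum to 1.
assert abs(sum(M.values()) - 1.0) < 1e-12

print(max(M, key=M.get))  # the most likely string under M
```

Picking the most likely string under M is exactly the kind of decision a speech recogniser makes between competing transcriptions.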
We can calculate the probability of a string using the chain rule:

P(w1 . . . wn) = P(w1) P(w2|w1) P(w3|w1 ∩ w2) . . . P(wn | w1 ∩ . . . ∩ wn−1)

P(I want to go to the beach) = P(I) P(want|I) P(to|I want) P(go|I want to) P(to|I want to go) . . .
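In code, the chain rule is just a product of conditional probabilities (a sketch; the numbers below are made up for illustration — in practice each conditional would be estimated from a corpus):

```python
from math import prod

# One conditional probability per word of "I want to go to the beach":
# P(I), P(want|I), P(to|I want), P(go|I want to), ...  (hypothetical values).
cond_probs = [0.1, 0.2, 0.3, 0.2, 0.3, 0.4, 0.1]

# Chain rule: P(w1 ... wn) is the product of the conditional probabilities.
p_string = prod(cond_probs)
print(p_string)
```

Note how quickly the product shrinks: longer strings always get smaller probabilities, which is one reason log probabilities are used in practice.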
N-Grams
We simplify using the Markov assumption (limited history):
the last n − 1 elements can approximate the effect of the full sequence.
That is, instead of

P(beach | I want to go to the)

selecting n = 3, we use

P(beach | to the)
We call these short sequences of words n-grams:
bigrams: I want, want to, to go, go to, to the, the beach
trigrams: I want to, want to go, to go to, go to the
4-grams: I want to go, want to go to, to go to the
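Extracting the n-grams listed above can be sketched as follows (the function name is ours):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I want to go to the beach".split()
print(ngrams(tokens, 2))  # bigrams:  (I, want), (want, to), ..., (the, beach)
print(ngrams(tokens, 3))  # trigrams: (I, want, to), (want, to, go), ...
```

A sequence of m tokens yields m − n + 1 n-grams, so the bigram list is always one longer than the trigram list for the same string.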