Computing & Information Sciences, Kansas State University
Monday, 24 Mar 2008 — CIS 732 / 830: Machine Learning / Advanced Topics in AI
Lecture 24 of 42
Monday, 24 March 2008
William H. Hsu
Department of Computing and Information Sciences, KSU
KSOL course pages: http://snurl.com/1ydii / http://snipurl.com/1y5ih
Course web site: http://www.kddresearch.org/Courses/Spring-2008/CIS732
Instructor home page: http://www.cis.ksu.edu/~bhsu
Reading:
Today: Section 7.5, Han & Kamber 2e
After spring break: Sections 7.6 – 7.7, Han & Kamber 2e
Model-Based Clustering: Expectation-Maximization
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
• Organizing data into classes such that there is
• high intra-class similarity
• low inter-class similarity
• Finding the class labels and the number of classes directly from the data (in contrast to classification).
• More informally, finding natural groupings among objects.
Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing
What is Clustering?
Adapted from slides © 2003 Eamonn Keogh http://www.cs.ucr.edu/~eamonn
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Pedro (Portuguese): Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian alternative), Petr (Czech), Pyotr (Russian)
Cristovao (Portuguese): Christoph (German), Christophe (French), Cristobal (Spanish), Cristoforo (Italian), Kristoffer (Scandinavian), Krystof (Czech), Christopher (English)
Miguel (Portuguese): Michalis (Greek), Michael (English), Mick (Irish!)
Hierarchical Clustering: Names (using String Edit Distance)
[Dendrogram: the names above clustered by string edit distance; leaves include Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter, Peder, Peka, Peadar, Michalis, Michael, Miguel, Mick, Cristovao, Christopher, Christophe, Christoph, Crisdean, Cristobal, Cristoforo, Kristoffer, Krystof]
Adapted from slides © 2003 Eamonn Keogh http://www.cs.ucr.edu/~eamonn
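As an aside, the edit-distance computation that drives the dendrogram above can be illustrated with a minimal dynamic-programming sketch (the function name and the test pairs are illustrative only, not from the original slides):

import sys

def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning s into t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # distances for the previous row
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n]

print(edit_distance("Pedro", "Pietro"))    # small distance: same name, related spellings
print(edit_distance("Pedro", "Michalis"))  # much larger distance: unrelated names

Feeding all pairwise distances of the names above into any agglomerative clustering routine would reproduce groupings like the dendrogram on this slide.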
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
[Dendrogram: Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter, Peder, Peka, Peadar regrouped by linguistic similarity]
Pedro (Portuguese/Spanish): Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian alternative), Petr (Czech), Pyotr (Russian)
Hierarchical Clustering: Names by Linguistic Similarity
Adapted from slides © 2003 Eamonn Keogh http://www.cs.ucr.edu/~eamonn
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Nearest Neighbor Clustering (not to be confused with Nearest Neighbor Classification)
• Items are iteratively merged into the existing clusters that are closest.
• Incremental
• Threshold, t, used to determine if items are added to existing clusters or a new cluster is created.
Incremental Clustering [1]
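A minimal sketch of this incremental (leader-style) scheme, assuming Euclidean distance and running-mean centroid updates; the threshold value and the sample points are made up for illustration:

import math

def incremental_cluster(points, t):
    """Assign each point to the nearest existing cluster if its centroid is
    within threshold t; otherwise start a new cluster. Order-dependent."""
    centroids, counts, labels = [], [], []
    for p in points:
        if centroids:
            dists = [math.dist(p, c) for c in centroids]
            j = min(range(len(centroids)), key=lambda i: dists[i])
        if not centroids or dists[j] > t:
            centroids.append(list(p))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            counts[j] += 1
            # update the cluster center as a running mean
            centroids[j] = [ci + (pi - ci) / counts[j] for ci, pi in zip(centroids[j], p)]
            labels.append(j)
    return labels, centroids

pts = [(1, 1), (1.5, 1.2), (8, 8), (8.2, 7.9), (1.1, 0.9)]
print(incremental_cluster(pts, t=2.0))

Reordering pts or changing t changes the result, which is exactly the order-dependence and threshold-selection difficulty noted on the following slides.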
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
[Scatter plot on a 1–10 grid: two clusters formed so far, each drawn with a circle of radius equal to the threshold t around its center]
Incremental Clustering [2]
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
[Scatter plot: a new data point falls within threshold t of cluster 1]
New data point arrives…
It is within the threshold for cluster 1, so add it to the cluster and update the cluster center.
Incremental Clustering [3]
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
[Scatter plot: a new data point falls outside the threshold of every existing cluster, so a new cluster is created]
New data point arrives…
It is not within the threshold for cluster 1, so create a new cluster, and so on…
The algorithm is highly order-dependent…
It is difficult to determine t in advance…
Incremental Clustering [4]
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Similarity and clustering
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Motivation
Problem 1: A query word could be ambiguous: e.g., the query "Star" retrieves documents about astronomy, plants, animals, etc.
Solution: Visualisation
Clustering document responses to queries along lines of different topics.
Problem 2: Manual construction of topic hierarchies and taxonomies Solution:
Preliminary clustering of large samples of web documents.
Problem 3: Speeding up similarity search Solution:
Restrict the search for documents similar to a query to most representative cluster(s).
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Example
Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Clustering
Task: Evolve measures of similarity to cluster a collection of documents/terms into groups within which similarity within a cluster is larger than across clusters.
Cluster Hypothesis: Given a 'suitable' clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.
Collaborative filtering: clustering of two or more sets of objects that have a bipartite relationship.
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Clustering (contd)
Two important paradigms: Bottom-up agglomerative clustering Top-down partitioning
Visualisation techniques: Embedding of corpus in a low-dimensional space
Characterising the entities: Internally : Vector space model, probabilistic models Externally: Measure of similarity/dissimilarity between pairs
Learning: Supplement stock algorithms with experience with data
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Clustering: Parameters
Similarity measure s(d1, d2): e.g., cosine similarity
Distance measure δ(d1, d2): e.g., Euclidean distance
Number `k' of clusters
Issues:
Large number of noisy dimensions
Notion of noise is application dependent
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Clustering: Formal specification
Partitioning Approaches Bottom-up clustering Top-down clustering
Geometric Embedding Approaches Self-organization map Multidimensional scaling Latent semantic indexing
Generative models and probabilistic approaches Single topic per document Documents correspond to mixtures of multiple topics
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Partitioning Approaches
Partition the document collection into k clusters {D1, D2, …, Dk}
Choices:
Minimize intra-cluster distance: Σ_i Σ_{d1,d2 ∈ Di} δ(d1, d2)
Maximize intra-cluster semblance: Σ_i Σ_{d1,d2 ∈ Di} s(d1, d2)
If cluster representations D̄i are available:
Minimize Σ_i Σ_{d ∈ Di} δ(d, D̄i)
Maximize Σ_i Σ_{d ∈ Di} s(d, D̄i)
Soft clustering: d assigned to cluster Di with `confidence' z_{d,i}
Find z_{d,i} so as to minimize Σ_i Σ_d z_{d,i} δ(d, D̄i) or maximize Σ_i Σ_d z_{d,i} s(d, D̄i)
Two ways to get partitions: bottom-up clustering and top-down clustering
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Bottom-up clustering (HAC)
Initially G is a collection of singleton groups, each with one document
Repeat:
Find Γ, Δ in G with maximum similarity measure s(Γ ∪ Δ)
Merge group Γ with group Δ
For each Γ keep track of the best Δ
Use the above info to plot the hierarchical merging process (dendrogram)
To get the desired number of clusters: cut across any level of the dendrogram
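A small sketch of this bottom-up loop under the group-average criterion (a naive O(n³) version for clarity rather than the O(n² log n) priority-queue variant mentioned on the later Computation slide; cosine similarity and the toy vectors are assumptions):

import itertools
import numpy as np

def group_average_hac(vectors, target_k):
    """Repeatedly merge the pair of groups whose union has the highest
    average pairwise cosine similarity, until target_k groups remain."""
    vecs = [v / np.linalg.norm(v) for v in np.asarray(vectors, dtype=float)]
    groups = [[i] for i in range(len(vecs))]

    def self_sim(group):
        if len(group) < 2:
            return 1.0
        sims = [vecs[i] @ vecs[j] for i, j in itertools.combinations(group, 2)]
        return sum(sims) / len(sims)

    merges = []
    while len(groups) > target_k:
        i, j = max(itertools.combinations(range(len(groups)), 2),
                   key=lambda ij: self_sim(groups[ij[0]] + groups[ij[1]]))
        merges.append((groups[i], groups[j]))   # record the merge for a dendrogram
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups, merges

docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.2]]
print(group_average_hac(docs, target_k=2)[0])

The recorded merges are what a dendrogram plot would draw; cutting the merge sequence early corresponds to cutting the dendrogram at a chosen level.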
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Dendrogram
A dendrogram presents the progressive, hierarchy-forming merging process pictorially.
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Similarity measure
Typically s(Γ) decreases with increasing number of merges
Self-similarity: average pairwise similarity between documents in Γ:
s(Γ) = (1 / C(|Γ|, 2)) Σ_{d1,d2 ∈ Γ} s(d1, d2)
where s(d1, d2) = inter-document similarity measure (say cosine of TFIDF vectors)
Other criteria: maximum/minimum pairwise similarity between documents in the clusters
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Computation
Un-normalized group profile: p̂(Γ) = Σ_{d ∈ Γ} p̂(d)
Can show:
s(Γ) = (⟨p̂(Γ), p̂(Γ)⟩ − |Γ|) / (|Γ| (|Γ| − 1))
s(Γ ∪ Δ) = (⟨p̂(Γ) + p̂(Δ), p̂(Γ) + p̂(Δ)⟩ − (|Γ| + |Δ|)) / ((|Γ| + |Δ|) (|Γ| + |Δ| − 1))
⟨p̂(Γ) + p̂(Δ), p̂(Γ) + p̂(Δ)⟩ = ⟨p̂(Γ), p̂(Γ)⟩ + ⟨p̂(Δ), p̂(Δ)⟩ + 2 ⟨p̂(Γ), p̂(Δ)⟩
O(n² log n) algorithm with n² space
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Similarity
s(d1, d2) = ⟨p(d1), p(d2)⟩, where ⟨·,·⟩ is the inner product
Normalized document profile: p(d) = g(c(d)) / ||g(c(d))||
Profile for document group Γ: p(Γ) = (Σ_{d ∈ Γ} p(d)) / ||Σ_{d ∈ Γ} p(d)||
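A brief sketch of these profile and similarity definitions: a damped term-frequency vector per document, normalized to unit length, so the inner product of two profiles is their cosine similarity (the 1 + log damping and the toy documents are assumptions):

import math
from collections import Counter

def profile(doc_tokens):
    """Normalized, log-damped term-frequency profile p(d)."""
    counts = Counter(doc_tokens)
    damped = {w: 1.0 + math.log(f) for w, f in counts.items()}   # g = 1 + log
    norm = math.sqrt(sum(v * v for v in damped.values()))
    return {w: v / norm for w, v in damped.items()}

def sim(p1, p2):
    """Inner product of normalized profiles = cosine similarity."""
    return sum(v * p2.get(w, 0.0) for w, v in p1.items())

d1 = profile("car auto repair garage car".split())
d2 = profile("auto car insurance".split())
print(round(sim(d1, d2), 3))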
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Switch to top-down
Bottom-up: requires quadratic time and space
Top-down or move-to-nearest:
Internal representation for documents as well as clusters
Partition documents into `k' clusters
Two variants:
“Hard” (0/1) assignment of documents to clusters “soft” : documents belong to clusters, with fractional scores
Termination when assignment of documents to clusters ceases to change much OR When cluster centroids move negligibly over successive iterations
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Top-down clustering
Hard k-means: Repeat…
Choose k arbitrary 'centroids'
Assign each document to nearest centroid
Recompute centroids
Soft k-means:
Don't break close ties between document assignments to clusters
Don't make documents contribute to a single cluster which wins narrowly
Contribution for updating cluster centroid μ_c from document d is related to the current similarity between μ_c and d, e.g.
exp(−||d − μ_c||²) / Σ_{c'} exp(−||d − μ_{c'}||²)
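A compact sketch of that soft assignment rule: responsibilities proportional to exp(−||d − μ_c||²), normalized over clusters, followed by weighted centroid updates (random initialization, a fixed iteration count, and the toy data are assumptions):

import numpy as np

def soft_kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # arbitrary initial centroids
    for _ in range(iters):
        # responsibility of cluster c for document d: normalized exp(-||d - mu_c||^2)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-d2)
        w /= w.sum(axis=1, keepdims=True)
        # recompute centroids from the weighted contributions of all documents
        mu = (w.T @ X) / w.sum(axis=0)[:, None]
    return mu, w

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
centroids, resp = soft_kmeans(X, k=2)
print(centroids.round(2))

Setting the responsibilities to 0/1 (hard argmin assignment) instead of the soft weights recovers ordinary hard k-means.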
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Seeding `k' clusters
Randomly sample O(√(kn)) documents
Run the bottom-up group-average clustering algorithm to reduce them to k groups or clusters: O(kn log n) time
Iterate assign-to-nearest O(1) times:
Move each document to nearest cluster
Recompute cluster centroids
Total time taken is O(kn)
Non-deterministic behavior
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Choosing `k'
Mostly problem driven Could be ‘data driven’ only when either
Data is not sparse Measurement dimensions are not too noisy
Interactive Data analyst interprets results of structure discovery
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Choosing ‘k’: Approaches
Hypothesis testing: Null Hypothesis (H0): underlying density is a mixture of ‘k’ distributions
Require regularity conditions on the mixture likelihood function (Smith’85)
Bayesian Estimation Estimate posterior distribution on k, given data and prior on k. Difficulty: Computational complexity of integration Autoclass algorithm of (Cheeseman’98) uses approximations (Diebolt’94) suggests sampling techniques
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Choosing ‘k’: Approaches
Penalised likelihood
To account for the fact that Lk(D) is a non-decreasing function of k, penalise the number of parameters
Examples: Bayesian Information Criterion (BIC), Minimum Description Length (MDL), MML
Assumption: penalised criteria are asymptotically optimal (Titterington 1985)
Cross Validation Likelihood Find ML estimate on part of training data Choose k that maximises average of the M cross-validated average
likelihoods on held-out data Dtest
Cross Validation techniques: Monte Carlo Cross Validation (MCCV), v-fold cross validation (vCV)
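A hedged sketch of the penalised-likelihood idea for choosing k, using scikit-learn's GaussianMixture and its BIC score; the library, the synthetic three-blob data, and the candidate range of k are assumptions, not part of the original slides:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# synthetic data drawn from 3 Gaussian blobs
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

scores = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    scores[k] = gm.bic(X)   # lower BIC = better penalised likelihood

best_k = min(scores, key=scores.get)
print(scores, "-> chosen k =", best_k)

The cross-validation alternative on this slide would instead fit on part of the data and pick the k maximising average held-out log-likelihood.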
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Visualisation techniques
Goal: embedding of the corpus in a low-dimensional space
Hierarchical Agglomerative Clustering (HAC): lends itself easily to visualisation
Self-Organization Map (SOM): a close cousin of k-means
Multidimensional Scaling (MDS): minimize the distortion of interpoint distances in the low-dimensional embedding as compared to the dissimilarity given in the input data
Latent Semantic Indexing (LSI): linear transformations to reduce the number of dimensions
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Self-Organization Map (SOM)
Like soft k-means:
Determine association between clusters and documents
Associate a representative vector μ_c with each cluster c and iteratively refine
Unlike k-means:
Embed the clusters in a low-dimensional space right from the beginning
A large number of clusters can be initialised even if eventually many are to remain devoid of documents
Each cluster can be a slot in a square/hexagonal grid
The grid structure defines the neighborhood N(c) for each cluster c
Also involves a proximity function h(γ, c) between clusters γ and c
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
SOM: Update Rule
Like a neural network:
Data item d activates neuron c_d (the closest cluster) as well as the neighborhood neurons
E.g., Gaussian neighborhood function:
h(γ, c_d) = exp(−||γ − c_d||² / (2 σ²(t)))
Update rule for node γ under the influence of d:
μ_γ(t+1) = μ_γ(t) + η(t) h(γ, c_d) (d − μ_γ(t))
where σ(t) is the neighborhood width and η(t) is the learning rate parameter
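A small sketch of this update rule on a 1-D grid of clusters; the grid geometry, the learning-rate and width schedules, and the random data are illustrative assumptions:

import numpy as np

def train_som(X, n_nodes=10, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    mu = rng.normal(size=(n_nodes, dim))           # representative vector per grid node
    grid = np.arange(n_nodes, dtype=float)          # 1-D grid coordinates
    for t in range(epochs):
        eta = 0.5 * (1 - t / epochs)                # learning rate eta(t)
        sigma = max(0.5, n_nodes / 2 * (1 - t / epochs))   # neighborhood width sigma(t)
        for d in X[rng.permutation(len(X))]:
            c_d = np.argmin(((mu - d) ** 2).sum(axis=1))   # winning node (closest cluster)
            h = np.exp(-((grid - grid[c_d]) ** 2) / (2 * sigma ** 2))
            mu += eta * h[:, None] * (d - mu)       # mu(t+1) = mu(t) + eta * h * (d - mu(t))
    return mu

X = np.random.default_rng(1).normal(size=(200, 3))
print(train_som(X).shape)

Because neighboring grid nodes share each update, nearby slots end up with similar representative vectors, which is what makes the final map visualisable.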
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
SOM: Example I
SOM computed from over a million documents taken from 80 Usenet newsgroups. Light areas have a high density of documents.
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
SOM: Example II
Another example of SOM at work: the sites listed in the Open Directory have been organized within a map of Antarctica at http://antarcti.ca/.
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Multidimensional Scaling (MDS)
Goal: 'distance-preserving' low-dimensional embedding of documents
Symmetric inter-document distances d_ij
Given a priori or computed from the internal representation
Coarse-grained user feedback
User provides similarity d̂_ij between documents i and j
With increasing feedback, prior distances are overridden
Objective: minimize the stress of the embedding:
stress = Σ_{i,j} (d̂_ij − d_ij)² / Σ_{i,j} d_ij²
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
MDS: issues
Stress is not easy to optimize
Iterative hill climbing:
1. Points (documents) assigned random coordinates by an external heuristic
2. Points moved by a small distance in the direction of locally decreasing stress
For n documents:
Each point takes O(n) time to be moved
Totally O(n²) time per relaxation
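A short sketch of the stress function and a gradient-style relaxation pass. Random initial coordinates, a fixed step size, the tiny distance matrix, and normalizing by the given distances (one common variant) are all simplifying assumptions:

import numpy as np

def stress(D_target, Y):
    D_emb = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    return ((D_target - D_emb) ** 2).sum() / (D_target ** 2).sum()

def mds_relax(D_target, dim=2, steps=500, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(len(D_target), dim))    # random initial coordinates
    for _ in range(steps):
        diff = Y[:, None, :] - Y[None, :, :]
        D_emb = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(D_emb, 1.0)             # avoid division by zero on the diagonal
        # move each point a small distance in the direction of decreasing stress
        grad = ((D_emb - D_target) / D_emb)[:, :, None] * diff
        Y -= lr * grad.sum(axis=1)
    return Y

D = np.array([[0, 1, 4], [1, 0, 4], [4, 4, 0]], dtype=float)
Y = mds_relax(D)
print(round(stress(D, Y), 4))

Each relaxation sweep touches all n² pairs, which is the O(n²)-per-relaxation cost noted above.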
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Fast Map [Faloutsos '95]
No internal representation of documents available
Goal: find a projection from an 'n'-dimensional space to a space with a smaller number 'k' of dimensions
Iterative projection of documents along lines of maximum spread
Each 1D projection preserves distance information
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Best line
Pivots for a line: two points (a and b) that determine it
Avoid exhaustive checking by picking pivots that are far apart
First coordinate x1 of a point x on the 'best line' (a, b):
x1 = (d²_{a,x} + d²_{a,b} − d²_{b,x}) / (2 d_{a,b})
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Iterative projection
For i = 1 to k:
1. Find the next (i-th) "best" line
(a "best" line is one which gives maximum variance of the point-set in the direction of the line)
2. Project points on the line
3. Project points on the "hyperspace" orthogonal to the above line
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Projection
Purpose: to correct inter-point distances between points by taking into account the components already accounted for by the first pivot line:
d'²_{x',y'} = d²_{x,y} − (x1 − y1)²
where x1, y1 are the first-line coordinates of x and y
Project recursively up to 1-D space
Time: O(nk) total
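A condensed sketch of one FastMap round per output dimension: pick far-apart pivots, compute the coordinate with the pivot formula above, then carry the corrected (residual) distances into the next round. The simplified pivot heuristic and the toy distance matrix are assumptions:

import numpy as np

def fastmap(D, k):
    """Map points with pairwise distances D (n x n) into k dimensions."""
    D2 = np.asarray(D, dtype=float) ** 2     # work with squared distances
    n = len(D2)
    coords = np.zeros((n, k))
    for dim in range(k):
        # choose pivots a, b that are (approximately) far apart
        a = 0
        b = int(np.argmax(D2[a]))
        a = int(np.argmax(D2[b]))
        d_ab2 = D2[a, b]
        if d_ab2 == 0:
            break
        # x_i = (d_{a,i}^2 + d_{a,b}^2 - d_{b,i}^2) / (2 d_{a,b})
        x = (D2[a] + d_ab2 - D2[b]) / (2 * np.sqrt(d_ab2))
        coords[:, dim] = x
        # project onto the hyperplane orthogonal to the pivot line:
        # d'^2_{i,j} = d^2_{i,j} - (x_i - x_j)^2
        D2 = np.clip(D2 - (x[:, None] - x[None, :]) ** 2, 0.0, None)
    return coords

D = np.array([[0, 2, 5, 9], [2, 0, 4, 8], [5, 4, 0, 5], [9, 8, 5, 0]], dtype=float)
print(fastmap(D, k=2).round(2))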
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Issues
Detecting noise dimensions Bottom-up dimension composition too slow Definition of noise depends on application
Running time Distance computation dominates Random projections Sublinear time w/o losing small clusters
Integrating semi-structured information Hyperlinks, tags embed similarity clues A link is worth a ? words
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Expectation maximization (EM): Pick k arbitrary ‘distributions’ Repeat:
Find probability that document d is generated from distribution f for all d and f
Estimate distribution parameters from weighted contribution of documents
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Extended similarity
"Where can I fix my scooter?" vs. "A great garage to repair your 2-wheeler is at …"
'auto' and 'car' co-occur often, so documents having related words are related
Useful for search and clustering
Two basic approaches:
Hand-made thesaurus (WordNet)
Co-occurrence and associations
[Figure: documents containing 'car', 'auto', or both, illustrating the co-occurrence of the two terms]
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Latent semantic indexing
[Figure: SVD of the t × d term–document matrix A ≈ U D V^T, truncated to the top k singular values, giving a k-dimensional vector for each document; related terms such as 'car' and 'auto' end up close in the latent space]
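A brief sketch of LSI as a truncated SVD of the term–document matrix; the toy matrix, the choice k = 2, and the use of numpy.linalg.svd are assumptions:

import numpy as np

# rows = terms (car, auto, garage, star), columns = documents
A = np.array([[2, 0, 1, 0],
              [0, 2, 1, 0],
              [1, 1, 2, 0],
              [0, 0, 0, 3]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
# k-dimensional document vectors: rows of (S_k V_k^T)^T
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T
print(doc_vecs.round(2))
# the 'car' and 'auto' rows of U_k share a latent dimension because they co-occur
print(U[:2, :k].round(2))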
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Collaborative recommendation
[Figure: people × movies preference matrix — people: Lyle, Ellen, Jason, Fred, Dean, Karen; movies: Batman, Rambo, Andre, Hiver, Whispers, StarWars]
People = records, movies = features
People and features are both to be clustered
Mutual reinforcement of similarity
Need advanced models
From "Clustering methods in collaborative filtering", by Ungar and Foster
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
A model for collaboration
People and movies belong to unknown classes Pk = probability a random person is in class k
Pl = probability a random movie is in class l
Pkl = probability of a class-k person liking a class-l movie
Gibbs sampling: iterate Pick a person or movie at random and assign to a class with probability
proportional to Pk or Pl
Estimate new parameters
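A rough sketch of that Gibbs-style iteration, assuming a small binary like/dislike matrix, two person classes and two movie classes; the smoothing, the conditional used for reassignment, and the fact that only person classes are resampled here are simplifications of the idea rather than Ungar and Foster's exact sampler:

import numpy as np

rng = np.random.default_rng(0)
R = np.array([[1, 1, 0, 0],      # rows = people, cols = movies, 1 = liked
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])
K, L = 2, 2
pz = rng.integers(K, size=R.shape[0])    # person class assignments
mz = rng.integers(L, size=R.shape[1])    # movie class assignments

for _ in range(200):
    # estimate Pk and Pkl from the current assignments (clipped for stability)
    Pk = np.array([(pz == k).mean() for k in range(K)])
    Pkl = np.array([[(R[pz == k][:, mz == l].mean()
                      if (pz == k).any() and (mz == l).any() else 0.5)
                     for l in range(L)] for k in range(K)])
    Pkl = np.clip(Pkl, 0.05, 0.95)
    # pick a random person and reassign its class with probability
    # proportional to Pk times the likelihood of its ratings
    i = rng.integers(R.shape[0])
    logp = np.log(Pk + 1e-9) + np.array(
        [sum(np.log(Pkl[k, mz[j]] if R[i, j] else 1 - Pkl[k, mz[j]])
             for j in range(R.shape[1])) for k in range(K)])
    p = np.exp(logp - logp.max()); p /= p.sum()
    pz[i] = rng.choice(K, p=p)
    # (a full sampler would also resample movie classes symmetrically)

print(pz)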
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Aspect Model
Metric data vs dyadic data vs proximity data vs ranked preference data
Dyadic data: a domain with two finite sets of objects
Observations: dyads (x, y) drawn from X and Y
Unsupervised learning from dyadic data
Two sets of objects: X = {x1, …, xn}, Y = {y1, …, yn}
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Aspect Model (contd)
Two main tasks:
Probabilistic modeling: learning a joint or conditional probability model over X × Y
Structure discovery: identifying clusters and data hierarchies
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Aspect Model
Statistical models
Empirical co-occurrence frequencies: sufficient statistics
Data sparseness:
Empirical frequencies are either 0 or significantly corrupted by sampling noise
Solutions:
Smoothing
Back-off method [Katz '87]
Model interpolation with held-out data [JM '80, Jel '85]
Similarity-based smoothing techniques [ES '92]
Model-based statistical approach: a principled approach to deal with data sparseness
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Aspect Model
Model-based statistical approach: a principled approach to deal with data sparseness
Finite Mixture Models [TSM '85]
Latent class [And '97]
Specification of a joint probability distribution for latent and observable variables [Hofmann '98]
Unifies:
Statistical modeling: probabilistic modeling by marginalization
Structure detection (exploratory data analysis): posterior probabilities by Bayes' rule on the latent space of structures
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Aspect Model
Sample S = {(x_n, y_n)}, n = 1 … N, is a realisation of an underlying sequence of random variables {(X_n, Y_n)}, n = 1 … N
Two assumptions:
All co-occurrences in sample S are i.i.d.
X_n and Y_n are independent given the latent aspect A_n
P(a) are the mixture components
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Aspect Model: Latent classes
Latent-class structures with an increasing degree of restriction on the latent space:
Aspect model: A = {a1, …, aK}; one latent aspect per observation: {A_n(X_n, Y_n)}, n = 1 … N
One-sided clustering: C = {c1, …, cK}; one latent class per X-object: {C(X_n), Y_n}, n = 1 … N
Two-sided clustering: C = {c1, …, cK}, D = {d1, …, dL}; latent classes for both objects: {(C(X_n), D(Y_n))}, n = 1 … N
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Aspect Model
Symmetric parameterization:
P(S, a) = Π_{n=1..N} P(a_n) P(x_n | a_n) P(y_n | a_n)
P(S) = Π_{x ∈ X} Π_{y ∈ Y} [ Σ_{a ∈ A} P(a) P(x | a) P(y | a) ]^{n(x,y)}
Asymmetric parameterization:
P(S, a) = Π_{n=1..N} P(x_n) P(a_n | x_n) P(y_n | a_n)
P(S) = Π_{x ∈ X} Π_{y ∈ Y} [ P(x) Σ_{a ∈ A} P(a | x) P(y | a) ]^{n(x,y)}
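A compact sketch of EM for the symmetric aspect model on a co-occurrence count matrix n(x, y); the random initialization, fixed iteration count, and toy counts are assumptions:

import numpy as np

def aspect_model_em(n_xy, n_aspects, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    X, Y = n_xy.shape
    Pa = rng.random(n_aspects); Pa /= Pa.sum()
    Px_a = rng.random((n_aspects, X)); Px_a /= Px_a.sum(axis=1, keepdims=True)
    Py_a = rng.random((n_aspects, Y)); Py_a /= Py_a.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: P(a | x, y) proportional to P(a) P(x|a) P(y|a)
        joint = Pa[:, None, None] * Px_a[:, :, None] * Py_a[:, None, :]   # (A, X, Y)
        post = joint / joint.sum(axis=0, keepdims=True)
        # M-step: re-estimate parameters from n(x,y)-weighted posteriors
        weighted = post * n_xy[None, :, :]
        Pa = weighted.sum(axis=(1, 2)); Pa /= Pa.sum()
        Px_a = weighted.sum(axis=2); Px_a /= Px_a.sum(axis=1, keepdims=True)
        Py_a = weighted.sum(axis=1); Py_a /= Py_a.sum(axis=1, keepdims=True)
    return Pa, Px_a, Py_a

n = np.array([[4, 3, 0, 0], [3, 4, 0, 0], [0, 0, 5, 2], [0, 0, 2, 5]], dtype=float)
Pa, Px_a, Py_a = aspect_model_em(n, n_aspects=2)
print(Py_a.round(2))

The same E- and M-step pattern, with the posteriors restricted as described on the next slides, gives the clustering and hierarchical variants.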
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Clustering vs Aspect
Clustering model = constrained aspect model:
P(a | x, c) = P{A_n = a | X_n = x, C(x) = c}
For flat clustering: P(a | x, c) = δ_{a,c} (the aspect is fixed by the cluster of x)
For hierarchical clustering: a is restricted to the nodes above c, i.e. a ↑ c
Group structure is imposed on the object spaces, as against partitioning the observations
Notation: P(.) are the parameters; P{.} are posteriors
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Hierarchical Clustering model
One-sided clustering:
P(S) = Π_{x ∈ X} P(x)^{n(x)} Σ_{c ∈ C} P(c) Π_{y ∈ Y} [ P(y | c) ]^{n(x,y)}
Hierarchical clustering:
P(S) = Π_{x ∈ X} P(x)^{n(x)} Σ_{c ∈ C} P(c) Π_{y ∈ Y} [ Σ_{a ↑ c} P(a | x, c) P(y | a) ]^{n(x,y)}
(compare with the unconstrained asymmetric aspect model P(S) = Π_{x,y} [ P(x) Σ_a P(a | x) P(y | a) ]^{n(x,y)})
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Comparison of E's
Aspect model:
P{A_n = a | X_n = x, Y_n = y; θ} = P(a) P(x | a) P(y | a) / Σ_{a'} P(a') P(x | a') P(y | a')
One-sided aspect (clustering) model:
P{C(x) = c | S; θ} = P(c) Π_{y ∈ Y} [P(y | c)]^{n(x,y)} / Σ_{c'} P(c') Π_{y ∈ Y} [P(y | c')]^{n(x,y)}
Hierarchical aspect model:
P{C(x) = c | S; θ} = P(c) Π_{y ∈ Y} [Σ_a P(a | x, c) P(y | a)]^{n(x,y)} / Σ_{c'} P(c') Π_{y ∈ Y} [Σ_a P(a | x, c') P(y | a)]^{n(x,y)}
P{A_n = a | X_n = x, Y_n = y, C(x) = c; θ} = P(a | x, c) P(y | a) / Σ_{a'} P(a' | x, c) P(y | a')
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Tempered EM (TEM)
Additively (on the log scale) discount the likelihood part in Bayes' formula:
P{A_n = a | X_n = x, Y_n = y; θ} = [P(a) P(x | a) P(y | a)]^β / Σ_{a'} [P(a') P(x | a') P(y | a')]^β
1. Set β ← 1 and perform EM until the performance on held-out data deteriorates (early stopping).
2. Decrease β, e.g., by setting β ← η β with some rate parameter η < 1.
3. As long as the performance on held-out data improves, continue TEM iterations at this value of β.
4. Stop when decreasing β does not yield further improvements; otherwise go to step (2).
5. Perform some final iterations using both training and held-out data.
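A short sketch of that tempering loop wrapped around any EM step; the held-out scoring function, the rate parameter η, and the stopping logic are simplified assumptions:

def tempered_em(em_step, heldout_loglik, params, beta=1.0, eta=0.9, min_beta=0.1):
    """Run EM at decreasing inverse-temperature beta, keeping the best
    held-out score; em_step(params, beta) must apply one tempered E/M pass."""
    best_params, best_score = params, heldout_loglik(params)
    while beta >= min_beta:
        improved = True
        while improved:
            params = em_step(params, beta)      # tempered posterior raises [...] to the power beta
            score = heldout_loglik(params)
            improved = score > best_score
            if improved:
                best_params, best_score = params, score
        beta *= eta                             # decrease beta by the rate parameter eta
    return best_params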
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
M-Steps
1. Aspect model:
P(x | a) = Σ_y n(x, y) P{a | x, y; θ'} / Σ_{x', y} n(x', y) P{a | x', y; θ'}
P(y | a) = Σ_x n(x, y) P{a | x, y; θ'} / Σ_{x, y'} n(x, y') P{a | x, y'; θ'}
2. Asymmetric model:
P(x) = n(x) / N
P(a | x) = Σ_y n(x, y) P{a | x, y; θ'} / n(x)
P(y | a) as above
3. Hierarchical x-clustering:
P(x) = n(x) / N
P(y | a) = Σ_x n(x, y) P{a | x, y, c(x); θ'} / Σ_{x, y'} n(x, y') P{a | x, y', c(x); θ'}
4. One-sided x-clustering:
P(x) = n(x) / N
P(y | c) = Σ_x n(x, y) P{C(x) = c | S; θ'} / Σ_x n(x) P{C(x) = c | S; θ'}
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Example Model [Hofmann and Popat CIKM 2001]
Hierarchy of document categories
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Example Application
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Topic Hierarchies
To overcome the sparseness problem in topic hierarchies with a large number of classes
Sparseness problem: small number of positive examples
Topic hierarchies reduce variance in parameter estimation
Automatically differentiate
Make use of term distributions estimated for more general, coarser text aspects to provide better, smoothed estimates of the class-conditional term distributions
Convex combination of term distributions in a Hierarchical Mixture Model:
P(w | c) = Σ_{a ↑ c} P(a | c) P(w | a)
where a ↑ c refers to all inner nodes a above the terminal class node c
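A tiny sketch of this convex combination for one leaf class, assuming a path of nodes from the leaf to the root with term distributions P(w|a) and mixing weights P(a|c) that sum to one; the node names and numbers are illustrative only:

def smoothed_term_dist(path_nodes, p_w_given_a, p_a_given_c):
    """P(w|c) = sum over ancestors a (including the leaf) of P(a|c) * P(w|a)."""
    vocab = set().union(*(p_w_given_a[a] for a in path_nodes))
    return {w: sum(p_a_given_c[a] * p_w_given_a[a].get(w, 0.0) for a in path_nodes)
            for w in vocab}

# leaf 'rec.autos' under inner node 'rec' under 'root' (hypothetical hierarchy)
p_w_given_a = {"rec.autos": {"car": 0.6, "engine": 0.4},
               "rec":       {"car": 0.3, "game": 0.3, "engine": 0.2, "team": 0.2},
               "root":      {"car": 0.1, "game": 0.1, "the": 0.8}}
p_a_given_c = {"rec.autos": 0.6, "rec": 0.3, "root": 0.1}
print(smoothed_term_dist(["rec.autos", "rec", "root"], p_w_given_a, p_a_given_c))

Words the leaf has never seen still receive non-zero probability from the coarser ancestors, which is the smoothing effect described above.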
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Topic Hierarchies (Hierarchical X-clustering)
X = document, Y = word
E-steps:
P{C(x) = c | S; θ'} ∝ P(c) Π_y [ Σ_{a ↑ c} P(a | c) P(y | a) ]^{n(x,y)}
P{A = a | x, y, c(x); θ'} = P(a | c(x)) P(y | a) / Σ_{a' ↑ c(x)} P(a' | c(x)) P(y | a')
M-steps:
P(y | a) = Σ_c n(c, y) P{a | c, y; θ'} / Σ_{c, y'} n(c, y') P{a | c, y'; θ'}
P(a | c) = Σ_y n(c, y) P{a | c, y; θ'} / Σ_{a'} Σ_y n(c, y) P{a' | c, y; θ'}
P(x) = n(x) / N
where n(c, y) = Σ_{x: C(x) = c} n(x, y)
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Document Classification Exercise
Modification of Naïve Bayes:
P(w | c) = Σ_{a ↑ c} P(a | c) P(w | a)
P(c | x) = P(c) Π_{y_i ∈ x} P(y_i | c) / Σ_{c'} P(c') Π_{y_i ∈ x} P(y_i | c')
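And a matching sketch of that modified Naïve Bayes decision rule, reusing smoothed term distributions of the kind computed above; the priors, classes, and example document are illustrative assumptions:

import math

def classify(doc_tokens, class_priors, smoothed_dists, eps=1e-9):
    """P(c|x) proportional to P(c) * prod_i P(y_i|c), computed in log space."""
    scores = {c: math.log(prior) + sum(math.log(smoothed_dists[c].get(w, eps))
                                       for w in doc_tokens)
              for c, prior in class_priors.items()}
    z = max(scores.values())
    probs = {c: math.exp(s - z) for c, s in scores.items()}
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}

# two leaf classes with already-smoothed term distributions (illustrative values)
dists = {"rec.autos": {"car": 0.44, "engine": 0.3, "game": 0.04, "the": 0.2},
         "rec.sport": {"car": 0.05, "game": 0.4, "team": 0.3, "the": 0.2}}
print(classify(["car", "engine", "the"], {"rec.autos": 0.5, "rec.sport": 0.5}, dists))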
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Mixture vs Shrinkage
Shrinkage [McCallum Rosenfeld AAAI’98]: Interior nodes in the hierarchy represent coarser views of the data which are obtained by simple pooling scheme of term counts
Mixture : Interior nodes represent abstraction levels with their corresponding specific vocabulary Predefined hierarchy [Hofmann and Popat CIKM 2001] Creation of hierarchical model from unlabeled data [Hofmann IJCAI’99]
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Mixture Density Networks (MDN) [Bishop CM '94, Mixture Density Networks]
A broad and flexible class of distributions that are capable of modeling completely general continuous distributions
superimpose simple component densities with well known properties to generate or approximate more complex distributions
Two modules: Mixture models: Output has a distribution given as mixture of distributions Neural Network: Outputs determine parameters of the mixture model.
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
MDN: Example
A conditional mixture density network with Gaussian component densities
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
MDN
Parameter estimation: using the Generalized EM (GEM) algorithm to speed up
Inference Even for a linear mixture, closed form solution not possible Use of Monte Carlo Simulations as a substitute
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Document model
Vocabulary V, term w_i, document d represented by c(d) = (f(d, w_i))_{w_i ∈ V}
f(d, w_i) is the number of times w_i occurs in document d
Most f's are zeroes for a single document
Monotone component-wise damping function g, such as log or square-root:
g(c(d)) = (g(f(d, w_i)))_{w_i ∈ V}
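A one-function sketch of this representation, using square-root damping (the choice of g and the sample document are assumptions):

import math
from collections import Counter

def damped_vector(doc_tokens, g=math.sqrt):
    """c(d) = (f(d, w_i)) damped component-wise by g; absent terms stay implicit zeros."""
    return {w: g(f) for w, f in Counter(doc_tokens).items()}

print(damped_vector("car car auto garage".split()))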
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Terminology
Expectation-Maximization (EM) Algorithm Iterative refinement: repeat until convergence to a locally optimal label
Expectation step: estimate parameters with which to simulate data
Maximization step: use simulated (“fictitious”) data to update parameters
Unsupervised Learning and Clustering Constructive induction: using unsupervised learning for supervised learning
Feature construction: “front end” - construct new x values
Cluster definition: “back end” - use these to reformulate y
Clustering problems: formation, segmentation, labeling
Key criterion: distance metric (points closer intra-cluster than inter-cluster)
Algorithms
AutoClass: Bayesian clustering
Principal Components Analysis (PCA), factor analysis (FA)
Self-Organizing Maps (SOM): topology-preserving transform (dimensionality reduction) for competitive unsupervised learning
Computing & Information SciencesKansas State University
Monday, 24 Mar 2008CIS 732 / 830: Machine Learning / Advanced Topics in AI
Summary Points
Expectation-Maximization (EM) Algorithm
Unsupervised Learning and Clustering Types of unsupervised learning
Clustering, vector quantization
Feature extraction (typically, dimensionality reduction)
Constructive induction: unsupervised learning in support of supervised learning
Feature construction (aka feature extraction)
Cluster definition
Algorithms
EM: mixture parameter estimation (e.g., for AutoClass)
AutoClass: Bayesian clustering
Principal Components Analysis (PCA), factor analysis (FA)
Self-Organizing Maps (SOM): projection of data; competitive algorithm
Clustering problems: formation, segmentation, labeling
Next Lecture: Time Series Learning and Characterization