June, 2003 Structuring Interactive Cluster Analysis R.W. Oldford
Structuring Interactive Cluster Analysis
Wayne Oldford
University of Waterloo
This talk is about interactive cluster analysis, that is, about interactive tools for finding and identifying groups in data.
But more than that, it's about stepping back and understanding the structure of this process so that software tools can be organized to simplify and to aid the analysis.
Overview
• ill-defined problem
• high-interaction desirable
• explore partitions
• recast algorithms
The problem of 'cluster analysis', or of 'finding groups in data', is ill-defined. So there can be no universal solution; any claimed solution must necessarily solve some other, suitably constrained problem and not the more general one.
What we need instead are highly interactive tools which allow us to adapt to the peculiarities of the data and the problem at hand.
These tools are usefully organized and integrated if we step back and consider the problem as one of exploratory data analysis, except that now, in addition to the data itself, the exploration is to take place as well on the space of partitions of the data.
Existing algorithms need to be recast, and new ones developed, in terms of exploring the space of partitions. The algorithms can then be easily integrated with other interactive tools so that jointly they provide a broadly useful and easily adapted tool-set for finding and identifying groups in data.
Develop by example:
• problems
• resources
• interactive clustering
• partition moves
• implications
• prototype interface
Problem … geometric/visual structure
Visual system easily identifies groups
… algorithms are often motivated and/or understood via visual intuition and geometric structure
Problem
Context matters
… each point is a document located by each word’s frequency within the document
[Figure: scatterplot, axes Word 1 and Word 2]
… Consider visually grouping here:
Problem
… two similar documents of different lengths should be “closer”
… one of these has more text than the other.
[Figure: scatterplot, axes Word 1 and Word 2]
Problem
… green “closer” to orange than to red?
… “distance” measured by angle?
[Figure: scatterplot, axes Word 1 and Word 2]
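A minimal sketch of the question above, in Python (not part of the talk): two documents with the same word proportions but different lengths are far apart in Euclidean distance, yet the angle between them is essentially zero. The word counts and the function names `euclidean` and `angle` are illustrative assumptions.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two word-frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def angle(u, v):
    """Angle in radians between two non-zero word-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.acos(cosine)

# Hypothetical counts of (Word 1, Word 2) in three documents.
short_doc = (2.0, 1.0)    # short text
long_doc = (20.0, 10.0)   # same proportions, ten times the length
other_doc = (1.0, 3.0)    # short text with different proportions

print(euclidean(short_doc, long_doc), euclidean(short_doc, other_doc))
print(angle(short_doc, long_doc), angle(short_doc, other_doc))
```

By Euclidean distance the short document sits nearer the unrelated one; by angle it sits nearer its longer twin, which is the behaviour the slide asks for.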
Problem … structure in context
… image source
… segmentation in MRI
… groups are spatially contiguous in the plane of the image and nearby in the intensity.
… shape is not defined a priori
Problem … context specific structure
… image source
… aneurysm presents as intensity in blood vessels
… groups are spatially contiguous tubes of similar intensity
… shape is restricted a priori to be 3-d tubes
Problem … some specific some not
… image source
… same slice, five different measurements at each location
… spatial grouping as before, additional grouping possible across measurements
Problem … some specific some not
… image source
4-dimensional data from connected images:
… 2-d spatial with clear biological grouping, connected to
… 2-d intensity measures with abstract structure/grouping
Problem
• Find groups in data
– Similar objects are together
– Groups are separated
• Problem is ill-defined:
– What do you mean similar?
• E.g. what is contiguous structure?
– When are groups separate?
– Can we believe it?
Computational resources (and response)
1. Processing
• Gflops, Tflops, multiple processors
• problem constrained and optimized
• “computationally intensive” methods
2. Memory
• GBs, TBs, disk and RAM
• try to analyze huge data-sets
• data-sets larger than necessary?
3. Display
• high resolution, large
• graphics processors, digital video
• more data, more visual detail
Exploit no one resource exclusively
Balance and integrate
High interaction (much overlooked by researchers)
• integrate computational resources
• assume multiple displays
• challenge is to design software to be simple, understandable, integrated and extensible
Example: image analysis … find groups via intensity (contours and two small unusual structures revealed)
Example: image analysis … other measurements may contain interesting structure
Example: image analysis … identify new structure location in the original image
Example: image analysis … mark new groups by colour (hue, preserving lightness in original image)
Example: image analysis … explore relation between old and new groups via contours in the image itself
Example: 8 dimensions from teeth measurements on species (+ sex)
[Figure: point cloud with labelled groups: gorillas, orangutans; humans; hominids; chimps; Proconsul africanus]
Example: apes, hominids, modern humans
• context helps
- knowing the species encourages grouping
- grouping based on context + the visual information
• multiple and very different views
- 3-d point clouds (of first 3 discriminant co-ordinates)
- cases identified in a list
- each point represented as a smooth curve by projecting it on a direction vector smoothly moving around the surface of an 8-d sphere
- all linked via colour by cases being displayed
• grouping is confirmed across different kinds of display
Example: mutual support and shapes
a 3-d projection
Shape from all dimensions
How many groups?
Example: mutual support and shapes
Groups found here
Same in all dimensions?
How many groups?
Example: mutual support and shapes
Observe effect here
Split black group by shape
How many groups?
Example: mutual support and shapes
Get new 3-d projection
Coloured by shape
Five groups corroborated
Example: exploratory data analysis
How many groups?
Example: exploratory data analysis
Choose data to cut away
Explore the rest
Distinguish groups
Example: exploratory data analysis
Bring data back
Explore all together
Some black with red?
Focus on centre
Example: exploratory data analysis
Explore separately
Mark group
Discard new view
Explore all together
Two groups
Interactive clustering
• visual grouping
– location, motion, shape, texture, ...
– linking across displays
• manual
– selection
• cases, variates, groups, ...
– colouring
– focus
• immediate and incremental
– context can be used to form groups
• multiple partitions
Automated clustering: typical software
• resources dedicated to numerical computation
– teletype interaction
– runs to completion
– graphical “output”
• don’t always work so well (no universal solution)
• confirm via exploratory data analysis
Must be integrated with interactive methods
Example: K-means clustering
K = 2 groups
Starting groups as shown have centre ball in one group
K-means moves one point at a time to “improve” 2 groups
Example: K-means clustering
K = 2 groups
Final groups shown maximize F-like statistic (between/within)
Central ball is lost
K-means poor for this data configuration
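A small sketch of why the central ball is lost, assuming the "F-like statistic" is the usual ratio of between-group to within-group mean squares (the talk does not give the exact form). The toy "ball inside a shell" coordinates and the names `f_like`, `centroid` and `sq_dist` are illustrative, not taken from the talk.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def f_like(groups):
    """(between SS / (k - 1)) / (within SS / (n - k)); assumed form of the statistic."""
    all_points = [p for g in groups for p in g]
    n, k = len(all_points), len(groups)
    grand = centroid(all_points)
    between = sum(len(g) * sq_dist(centroid(g), grand) for g in groups)
    within = sum(sq_dist(p, centroid(g)) for g in groups for p in g)
    return (between / (k - 1)) / (within / (n - k))

# Toy 2-d analogue of the talk's data: a small ball inside a shell.
ball = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
shell = [(10.0, 0.0), (-10.0, 0.0), (0.0, 10.0), (0.0, -10.0)]

by_structure = [ball, shell]                      # the grouping the eye sees
by_halves = [[(10.0, 0.0), (0.0, 10.0), (1.0, 0.0), (0.0, 1.0)],
             [(-10.0, 0.0), (0.0, -10.0), (-1.0, 0.0), (0.0, -1.0)]]

print(f_like(by_structure), f_like(by_halves))
```

Because the ball and the shell share a centre, the visually obvious partition has no between-group variation at all, so any criterion of this kind prefers cutting the configuration in half.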
Example: VERI (Visual Empirical Regions of Influence)
join points if no third point falls in this region
Visual Empirical Regions of Influence
• psychophysical experiments of human visual perception to join data points
– very special circumstances (two lines of three equi-spaced points each)
• works well on demonstration 2-d cases
• extends to higher dimensions
– two points are joined or not depending on their joint configuration with a third point
– each third point examined forms a plane with the candidate pair and so VERI shape applies
– works in high-d with published demonstration cases
Example: VERI
Each colour is a different group found by VERI.
Central ball is lost.
There is no universal method, nor can there be.
VERI fails for this data configuration (also for small perturbations of demonstration cases).
Example: VERI (with parameters)
VERI algorithm, but parameterized now to shrink region size.
Becomes minimal spanning tree in the limit (MST gets 2 groups here).
Again, no universal method possible, but methods can be parameterized.
Integrating automatic methods:
Move about the space of partitions:
Pa --> Pb --> Pc --> ….
Which operators f
f(Pa) --> Pb
are of interest?
Refine
Reduce
Need not be nested.
Nesting produces hierarchy
Reassign
Refinement sequence: 1
Begin with partition containing all points in one group.
Refinement sequence: 1 -> 2
Refine partition to move to a new partition containing two groups.
Blue points are on the outer sphere.
This refinement was obtained by projecting all points onto the eigenvector corresponding to the largest eigenvalue of the sample variance-covariance matrix and splitting at the largest gap between the projected points.
Refinement sequence: 1 -> 2 -> 3
Refine partition (2) to move to a new partition containing three groups.
Green points are also on the outer sphere.
Refinement move:
• select group whose sample var-cov matrix has largest eigenvalue
• for that group, project and split as before.
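The refinement move described on these slides can be sketched as follows. This is an illustrative standard-library Python version, not the Quail implementation: a partition is assumed to be a list of groups, each a list of coordinate tuples, and the leading eigenpair is approximated by power iteration (a swapped-in technique chosen to avoid external libraries). All function names are hypothetical.

```python
import math

def covariance(points):
    """Sample variance-covariance matrix of a group with at least two points."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (n - 1)
             for j in range(d)] for i in range(d)]

def leading_eigen(matrix, iterations=200):
    """Power iteration; returns (approximate largest eigenvalue, unit eigenvector)."""
    d = len(matrix)
    v = [1.0 / (i + 1) for i in range(d)]          # arbitrary non-symmetric start
    value = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                            # zero matrix or unlucky start
            return 0.0, v
        v = [x / norm for x in w]
        value = norm
    return value, v

def project(point, direction):
    return sum(a * b for a, b in zip(point, direction))

def refine(groups):
    """Split the group with the largest leading eigenvalue at its largest projected gap."""
    best = None
    for index, group in enumerate(groups):
        if len(group) < 2:
            continue                               # singletons cannot be split
        value, vector = leading_eigen(covariance(group))
        if best is None or value > best[0]:
            best = (value, vector, index)
    if best is None:
        return groups
    _, vector, index = best
    ordered = sorted(groups[index], key=lambda p: project(p, vector))
    gaps = [project(ordered[i], vector) - project(ordered[i - 1], vector)
            for i in range(1, len(ordered))]
    cut = gaps.index(max(gaps)) + 1
    return groups[:index] + [ordered[:cut], ordered[cut:]] + groups[index + 1:]

points = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0), (10.0, 0.1), (11.0, 0.0), (12.0, 0.1)]
result = refine([points])      # start from the partition with all points in one group
print(result)
```

Calling `refine` repeatedly gives a nested refinement sequence like the one shown on the slides; nothing stops it from peeling off single isolated points, which is the behaviour the talk flags as producing poor partitions.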
Refinement sequence: 1 -> 2 -> 3 -> 4
Refine partition (3) to move to a new partition containing four groups.
New group contains a single (magenta) point on the outer sphere (middle right, up).
Refinement move as before, again splits red group.
Exploration of the data shows this to be a very poor partition with that single isolated point.
Refinement sequence: 1 -> 2 -> 3 -> 4 -> 5
Refine partition (4) to move to a new partition containing five groups.
New group contains a single (black) point on the outer sphere (bottom left).
Refinement move as before, again splits red group.
Again a poor partition; no further refinement step taken at this point.
Reassign, reduce sequence: 5 -> 5
A reassign move from one partition of five to another.
Reassignment move: k-means maximizing an F statistic.
Seems a better partition than before; explore to confirm.
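A reassign move keeps the number of groups and only moves points between them. The sketch below is a batch (Lloyd-style) k-means started from an existing partition; it is an assumption for illustration and differs from the one-point-at-a-time variant mentioned earlier in the talk. For fixed data and a fixed number of groups, lowering the within-group sum of squares raises the between/within F ratio, so the two criteria agree. The name `reassign` and the toy data are hypothetical.

```python
def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def reassign(groups, max_passes=100):
    """Repeatedly move every point to the group with the nearest centre."""
    groups = [list(g) for g in groups if g]
    points = [p for g in groups for p in g]
    for _ in range(max_passes):
        centres = [centroid(g) for g in groups]
        new_groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: sq_dist(p, centres[i]))
            new_groups[nearest].append(p)
        new_groups = [g for g in new_groups if g]   # drop any group that emptied
        if new_groups == groups:                    # no point changed group
            break
        groups = new_groups
    return groups

# A deliberately poor starting partition of two well-separated clumps.
start = [[(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)],
         [(1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]]
result = reassign(start)
print(result)
```

As on the slide, the output is only a candidate: it still needs to be explored and confirmed visually.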
Explore present partition: 5
Reassignment seems to have isolated central red ball.
Remaining groups distributed around a spherical surface.
Consider reduction moves from this partition to 'nearby' partitions with fewer groups.
Partition to be reduced: 5
Same partition - back in the original position to make subsequent reduction moves visually comparable with previous refinement and reassignment moves.
Choice of reduction move can be based on what we have learned from exploring this partition.
Reduce sequence: 5 -> 4
Reduction move: single-linkage between groups, i.e. join the closest two groups as measured by Euclidean distance between the nearest points in each group.
Seems reasonable choice given structure observed in previous exploration.
Reduce partition (5) to move to a new partition containing four groups.
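The single-linkage reduction move can be sketched like this, again with a partition assumed to be a list of groups of coordinate tuples. The names `single_linkage` and `reduce_partition` and the toy coordinates are illustrative only.

```python
import math
from itertools import combinations

def single_linkage(a, b):
    """Distance between the nearest pair of points, one from each group."""
    return min(math.dist(p, q) for p in a for q in b)

def reduce_partition(groups, between=single_linkage):
    """Join the two groups that are closest under the given between-group distance."""
    if len(groups) < 2:
        return groups
    i, j = min(combinations(range(len(groups)), 2),
               key=lambda pair: between(groups[pair[0]], groups[pair[1]]))
    rest = [g for k, g in enumerate(groups) if k not in (i, j)]
    return rest + [groups[i] + groups[j]]

# Toy analogue of the slides: a central ball and two pieces of the outer shell.
ball = [(0.0, 0.0), (1.0, 0.0)]
arc_a = [(10.0, 0.0), (7.0, 7.0)]
arc_b = [(0.0, 10.0), (-7.0, 7.0)]

result = reduce_partition([ball, arc_a, arc_b])
print(result)
```

The two shell pieces are joined first because their nearest points are closer to each other than either is to the ball, so the ball survives as its own group.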
Reduce sequence: 5 -> 4 -> 3
Reduction move: as before.
Exploration suggests one more reduction move.
Reduce partition (4) to move to a new partition containing three groups.
Red ball remains.
Reduce sequence: 5 -> 4 -> 3 -> 2
Reduction move: as before.
Interactive exploration important to choose type and details of potentially interesting moves from one partition to another.
Reduce partition (3) to move to a new partition containing two groups.
This partition seems best.
Moves (generic functions), with examples:
• refine (Pold) --> Pnew, e.g. break minimal spanning tree
• reduce (Pold) --> Pnew, e.g. join near centres
• reassign (Pold) --> Pnew, e.g. k-means maximize F
• partition (graphic) --> Pnew, e.g. colours from point cloud
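As one concrete instance of the "break minimal spanning tree" refinement listed above, here is a hedged sketch: build the Euclidean minimal spanning tree of each group with Prim's algorithm and cut the single longest edge found. Which group and which edge to break is an assumption made for illustration; the talk only names the idea. Function names are hypothetical.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (length, i, j) edges."""
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, len(points))}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])     # nearest point not yet in the tree
        length, i = best.pop(j)
        edges.append((length, i, j))
        for k in best:                              # update cheapest links via new point
            d = math.dist(points[j], points[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return edges

def split_at_longest_edge(points):
    """Remove the longest tree edge and return the two resulting components."""
    edges = mst_edges(points)
    longest = max(edges)
    kept = [e for e in edges if e is not longest]
    component = {0}
    changed = True
    while changed:                                  # grow the component holding point 0
        changed = False
        for _, i, j in kept:
            if (i in component) != (j in component):
                component.update((i, j))
                changed = True
    first = [p for k, p in enumerate(points) if k in component]
    second = [p for k, p in enumerate(points) if k not in component]
    return first, second

def refine(groups):
    """Refine by breaking the longest minimal-spanning-tree edge over all groups."""
    candidates = [k for k, g in enumerate(groups) if len(g) > 1]
    if not candidates:
        return groups
    k = max(candidates, key=lambda c: max(mst_edges(groups[c]))[0])
    first, second = split_at_longest_edge(groups[k])
    return groups[:k] + [first, second] + groups[k + 1:]

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (8.0, 0.0), (9.0, 0.0)]
result = refine([points])
print(result)
```

Because every move has the same shape (a partition in, a partition out), this `refine` can be swapped for the eigenvector-based one without changing anything else in the analysis.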
Challenges:
• varying focus
– subsets (selected manually and at random)
– merging new data into partition
• interface design
– control panels, options
– interaction
• exploring multiple partitions
– interactive display and comparison
– resolving many to one
A prototype interface
• cluster analysis hub
- an analysis hub (Oldford, 1997) created on demand for partition:
- having all points in one group for named data-set, or
- as defined by colours of all points in topmost plot, or
- as defined by colours of selected points in topmost plot
- new hub can always be created for any subset
- maintains list of saved partitions
- offers moves from current partition via one of:
- reduce, refine, or reassign
- manually from current colours (so as to capture interactive
modification of existing partition)
- Other operations on one or more partitions (e.g. cluster plot,
dendrogram, ...)
Interface illustration: details of moves
• Each move - refine, reduce, reassign - is an entire collection of possible
moves, each with many possible choices.
• The next few slides illustrate the prototype implementation where:
• Buttons for refine, reduce, and reassign are given at the topmost
level.
• Once selected, each button pops up its own control panel where
various different kinds of moves and parameter choices can be
made. E.g. the analyst might choose to reduce by any of:
• Join groups with closest centres using Euclidean distance
• Join groups whose farthest points are closest (i.e. “complete
linkage”)
• Choose group with greatest spread and disperse its points
among the remaining groups. …
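The three reduce choices listed above can be sketched as interchangeable functions. This is an illustrative Python version, not the prototype's code: "spread" is assumed to mean the mean squared distance to the group centre, dispersed points are assumed to go to the nearest remaining centre, and all names and toy data are hypothetical.

```python
import math
from itertools import combinations

def centre(group):
    return tuple(sum(c) / len(group) for c in zip(*group))

def centre_distance(a, b):
    """Euclidean distance between group centres."""
    return math.dist(centre(a), centre(b))

def complete_linkage(a, b):
    """Distance between the farthest pair of points, one from each group."""
    return max(math.dist(p, q) for p in a for q in b)

def join_closest(groups, between):
    """Reduce by joining the two groups that are closest under `between`."""
    i, j = min(combinations(range(len(groups)), 2),
               key=lambda ij: between(groups[ij[0]], groups[ij[1]]))
    rest = [g for k, g in enumerate(groups) if k not in (i, j)]
    return rest + [groups[i] + groups[j]]

def disperse_widest(groups):
    """Reduce by dissolving the most spread-out group into the remaining groups."""
    def spread(g):
        c = centre(g)
        return sum(math.dist(p, c) ** 2 for p in g) / len(g)
    widest = max(range(len(groups)), key=lambda k: spread(groups[k]))
    rest = [list(g) for k, g in enumerate(groups) if k != widest]
    centres = [centre(g) for g in rest]             # centres before any point is added
    for p in groups[widest]:
        nearest = min(range(len(rest)), key=lambda k: math.dist(p, centres[k]))
        rest[nearest].append(p)
    return rest

a = [(0.0, 0.0), (20.0, 0.0)]   # long, thin group
b = [(10.0, 1.0)]
c = [(10.0, 6.0)]

by_centres = join_closest([a, b, c], centre_distance)
by_complete = join_closest([a, b, c], complete_linkage)
by_dispersal = disperse_widest([a, b, c])
print(by_centres, by_complete, by_dispersal)
```

On this toy configuration the three choices give three different four-to-three style reductions, which is why the control panel leaves the choice to the analyst.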
Interface - reduce
Interface - refine
Interface - reassign
Interface illustration: example of use
• The next few slides illustrate the prototype implementation applied to a
“ball in a sphere” data-set (a different one from before).
• Moves are made about the partition space (refines and reassign)
• Partitions are saved (can be named, deleted, revisited, etc.)
• Nested partitions compared via a dendrogram
• Non-nested partition compared with nested ones
• N.B. at any time, the analyst could have interacted with any
graphic
• to create a new partition by colouring - using “manual button”
• focus on a subset to examine via a new cluster analysis hub
and subsequently incorporate that into the partition of the
whole data-set.
Interaction
Selecting refine pops up the refinement panel.
Start with partition having all points in a single group.
Refinement move:
• Choose group with var-cov having largest eigenvalue.
• Project these points onto corresponding eigenvector.
• Split this group where the projected gap is largest.
Choose refinement details.
Interaction
New partition appears as 'Refine Dataset' in panel at left.
Refine produces new partition having two groups as shown by different colours in all graphics.
Refinement details unchanged.
name and save partition
New partition is named and saved.
New partition has three groups.
Refinement details unchanged.
Saved partition list.
prototype - refine to 4
New partition has four groups.
Refinement details unchanged.
prototype - refine to 5
New partition has five groups. The fifth group contains a single point (blue, top right).
Refinement details unchanged.
No further refinement pursued beyond this one.
Select nested partitions and view dendrogram
Select nested partitions.
Dendrogram button.
Dendrogram shows 5 nested partitions:
• Each block is a group; a horizontal cut at each vertical level is a partition.
• Size and colour proportions vary with number of points.
• Colouring is as displayed in point cloud (here showing the current partition).
Reassign, dendrogram updated
New partition appears as 'Reassign Dataset' in panel at left.
Colours update in all graphics including the dendrogram:
• Reassignment partition can be explored as usual.
• This partition can be visually compared with previous partitions via the updated colours in the dendrogram.
Reassign move to new partition. Details:
• k-means
• max F statistic
Cluster plot + dendrogram interaction movie
Cluster plot button operates on selected partition
Nested and non-nested partitions can be visually compared simultaneously through interaction.
Cluster plot:
• groups as boxes
• close groups are visually close (via multi-dimensional scaling)
Other operators
• dissimilarity (Pi, Pj) --> di,j
• display (P1, ..., Pm)
– dendrogram if P1 < … < Pm
– mds plot of all clusters in P1, …, Pm
– mds plot of all partitions P1, …, Pm
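Both operators above need some way to compare partitions. The sketch below assumes a partition is stored as a list of group labels, one per case, and reads P1 < … < Pm as "each partition is a refinement of the one before it". The dissimilarity shown is one common pair-counting choice (the fraction of case pairs on which the two partitions disagree, i.e. one minus the Rand index); the talk does not say which measure is intended. All names are illustrative.

```python
from itertools import combinations

def is_refinement(fine, coarse):
    """True if every group of `fine` lies inside a single group of `coarse`."""
    parent = {}
    for f, c in zip(fine, coarse):
        if parent.setdefault(f, c) != c:    # a fine group straddles two coarse groups
            return False
    return True

def is_nested(partitions):
    """True if the sequence runs from coarse to fine with each step a refinement."""
    return all(is_refinement(later, earlier)
               for earlier, later in zip(partitions, partitions[1:]))

def dissimilarity(p, q):
    """Fraction of case pairs grouped together in one partition but apart in the other."""
    pairs = list(combinations(range(len(p)), 2))
    disagree = sum((p[i] == p[j]) != (q[i] == q[j]) for i, j in pairs)
    return disagree / len(pairs)

p1 = [0, 0, 0, 0, 0, 0]     # everything in one group
p2 = [0, 0, 0, 1, 1, 1]     # refine into two groups
p3 = [0, 0, 2, 1, 1, 1]     # refine again into three
alt = [0, 1, 0, 1, 0, 1]    # e.g. the result of a reassign move, not nested in p2

print(is_nested([p1, p2, p3]), is_nested([p1, p2, alt]))
print(dissimilarity(p2, p3), dissimilarity(p2, alt))
```

A nested sequence such as `[p1, p2, p3]` can be drawn as a dendrogram, while a matrix of pairwise `dissimilarity` values is the kind of input a multi-dimensional scaling plot of partitions would need.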
Creation:
• partition (Data ; ...) --> Pnew
– “manually” from colours
– k-means, random start, mst, veri, etc.
– from existing classifier
• partition-path (Data ; …) --> {P1, P2, …, Pn}
• partition-path (Pold ; ...) --> {Pold, P1, P2, …, Pn}
– e.g. nested sequence from hierarchical clustering
Composition:
• merge (Pa, Pb ; …) --> P+new
– combine non-overlapping partitions
• merge (Data, Pold ; …) --> P+new
– classify additional points
• resolve (P1, ..., Pm ; …) --> Pnew
– combine different partitions of the same data
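Two of the composition operators can be sketched under stated assumptions. For merge (Data, Pold), additional points are classified to the group with the nearest centre; for resolve, the partitions are combined by taking their common refinement, so that two cases stay together only if every partition keeps them together. Both rules are illustrative choices, since the talk leaves the details open, and the function names are hypothetical.

```python
import math

def centre(group):
    return tuple(sum(c) / len(group) for c in zip(*group))

def merge_points(new_points, groups):
    """Classify additional points into an existing partition (nearest-centre rule)."""
    centres = [centre(g) for g in groups]          # centres of the old partition only
    merged = [list(g) for g in groups]
    for p in new_points:
        nearest = min(range(len(groups)), key=lambda k: math.dist(p, centres[k]))
        merged[nearest].append(p)
    return merged

def resolve(*labelings):
    """Common refinement of several label vectors over the same cases."""
    ids = {}
    return [ids.setdefault(combo, len(ids)) for combo in zip(*labelings)]

old_partition = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (8.0, 10.0)]]
extended = merge_points([(1.0, 1.0), (9.0, 9.0)], old_partition)

resolved = resolve([0, 0, 1, 1], [0, 1, 1, 1])
print(extended, resolved)
```

The common refinement is the most cautious resolution, since it never joins cases that any partition separates; other rules (for example some form of voting across partitions) would fit the same `resolve` signature.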
Implications:
• Algorithms (re)cast in terms of moves:
– refine, reduce
– reassign
– partition, partition-path
– easily understandable (e.g. geometric structures)
– specify required data structures
• e.g. ms tree, triangulation, var-cov matrix, …
New problems:
• interface design
• multiple partitions
– comparison and/or resolution
– multiple display
• inference
Summary
• Cluster analysis is naturally exploratory and needs integration with modern interactive data analysis.
• Enlarging the problem to partitions:
– simplifies and gives structure
– encourages exploratory approach
– integrates naturally
– introduces new possibilities (analysis and research)
Related references:
• Interactive clustering CASI talk, Oldford (2001)
• Quail: Overview (Interface 1998), graphics (Hurley and Oldford, ISI 1999) and code.
• Design principles: Oldford (Interface 1999)
• Analysis hubs: Oldford (Interface 1997)
Acknowledgements:
• Catherine Hurley, Erin McLeish, Rayan Yahfoufi, Natasha Wiebe
• U(W) students in statistical computing
• Quail: Quantitative Analysis in Lisp
http://www.stats.uwaterloo.ca/Quail