June, 2003. Structuring Interactive Cluster Analysis. R.W. Oldford (Wayne Oldford), University of Waterloo.


This talk is about interactive cluster analysis, that is, about interactive tools for finding and identifying groups in data.

But more than that, it's about stepping back and understanding the structure of this process so that software tools can be organized to simplify and to aid the analysis.


Overview

Argument:
• ill-defined problem
• high-interaction desirable
• explore partitions
• recast algorithms

The problem of ‘cluster analysis’ or of ‘finding groups in data’ is ill defined. So there can be no universal solution, and any claimed solution must necessarily solve some other, suitably constrained problem and not the more general one.

What we need instead are highly interactive tools which allow us to adapt to the peculiarities of the data and the problem at hand.

These tools are usefully organized and integrated if we step back and consider the problem as one of exploratory data analysis, except that now, in addition to the data itself, the exploration is to take place as well on the space of partitions of the data.

Existing algorithms need to be recast, and new ones developed, in terms of exploring the space of partitions. The algorithms can then be easily integrated with other interactive tools so that jointly they provide a broadly useful and easily adapted tool-set for finding and identifying groups in data.

Develop by example:
• problems
• resources
• interactive clustering
• partition moves
• implications
• prototype interface


Problem … geometric/visual structure

Visual system easily identifies groups

… algorithms are often motivated and/or understood via visual intuition and geometric structure



Problem

Context matters

… each point is a document located by each word’s frequency within the document

… Consider visually grouping here:

[Scatterplot with axes Word 1 and Word 2]


Problem

… two similar documents of different lengths should be “closer”

… one of these has more text than the other.

[Scatterplot with axes Word 1 and Word 2]


Problem

… green “closer” to orange than to red?

… “distance” measured by angle?

[Scatterplot with axes Word 1 and Word 2]
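The question of measuring "distance" by angle can be made concrete with a small sketch. The word counts below are hypothetical, chosen only to illustrate the point: a short and a long document with the same mix of words are far apart in Euclidean distance but have (essentially) zero angle between them.

```python
import math

def euclidean(u, v):
    """Ordinary straight-line distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)

# Hypothetical (Word 1, Word 2) counts for three documents.
short_doc = (2, 1)
long_doc = (20, 10)   # same mix of words as short_doc, ten times the text
other_doc = (1, 3)    # a different mix of words, similar length to short_doc
```

Under Euclidean distance the short document is nearer to the unrelated one; under the angle it is nearer to its longer counterpart. Which is appropriate depends on the context, which is the point of the slide.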


Problem … structure in context

… image source

… segmentation in MRI

… groups are spatially contiguous in the plane of the image and nearby in the intensity.

… shape is not defined a priori


Problem … context specific structure

… image source

… aneurysm presents as intensity in blood vessels

… groups are spatially contiguous tubes of similar intensity

… shape is restricted a priori to be 3-d tubes


Problem … some specific some not

… image source

… same slice, five different measurements at each location

… spatial grouping as before, additional grouping possible across measurements


Problem … some specific some not

… image source

4 dimensional data from connected images:

… 2d spatial with clear biological grouping, connected to

… 2d intensity measures with abstract structure/grouping


Problem

• Find groups in data
 – Similar objects are together
 – Groups are separated

• Problem is ill defined:
 – What do you mean similar?
  • E.g. what is contiguous structure?
 – When are groups separate?
 – Can we believe it?


Computational resources

1. Processing

2. Memory

3. Display


Computational resources (and response)

1. Processing
• Gflops, Tflops, multiple processors
• problem constrained and optimized
• “computationally intensive” methods

2. Memory

3. Display


Computational resources (and response)

1. Processing

2. Memory
• GBs, TBs, disk and RAM
• try to analyze huge data-sets
• data-sets larger than necessary?

3. Display


Computational resources (and response)

1. Processing

2. Memory

3. Display
• high resolution, large
• graphics processors, digital video
• more data, more visual detail


Computational resources

1. Processing

2. Memory

3. Display

Exploit no one resource exclusively

Balance and integrate


High interaction (much overlooked by researchers)


• integrate computational resources

• assume multiple displays

• challenge is to design software to be simple, understandable, integrated and extensible


Example: image analysis … find groups via intensity (contours and two small unusual structures revealed)


Example: image analysis … other measurements may contain interesting structure


Example: image analysis … identify new structure location in the original image


Example: image analysis … mark new groups by colour (hue, preserving lightness in original image)


Example: image analysis … explore relation between old and new groups via contours in the image itself


Example: 8 dimensions from teeth measurements on species (+ sex)

[Figure labels: Gorillas, orangutans; humans; hominids; chimps; Proconsul africanus]


Example: apes, hominids, modern humans

• context helps
 - knowing the species encourages grouping
 - grouping based on context + the visual information

• multiple and very different views
 - 3-d point clouds (of first 3 discriminant co-ordinates)
 - cases identified in a list
 - each point represented as a smooth curve by projecting it on a direction vector smoothly moving around the surface of an 8-d sphere
 - all linked via colour by cases being displayed

• grouping is confirmed across different kinds of display


Example: mutual support and shapes

[Two displays: a 3-d projection; shape from all dimensions]

How many groups?


Example: mutual support and shapes

Groups found here

Same in all dimensions?

How many groups?


Example: mutual support and shapes

Observe effect here

Split black group by shape

How many groups?


Example: mutual support and shapes

Get new 3-d projection

Coloured by shape

Five groups corroborated


Example: exploratory data analysis


How many groups?


Example: exploratory data analysis


Choose data to cut away

Explore the rest

Distinguish groups


Example: exploratory data analysis


Bring data back

Explore all together

Some black with red?

Focus on centre


Example: exploratory data analysis


Explore separately

Mark group

Discard new view

Explore all together

Two groups


Interactive clustering

• visual grouping
 – location, motion, shape, texture, ...
 – linking across displays

• manual
 – selection
  • cases, variates, groups, ...
 – colouring
 – focus

• immediate and incremental
 – context can be used to form groups

• multiple partitions


Automated clustering: typical software

• resources dedicated to numerical computation
 – teletype interaction
 – runs to completion
 – graphical “output”

• don’t always work so well (no universal solution)

• confirm via exploratory data analysis

Must be integrated with interactive methods


Example: K-means clustering


K = 2 groups

Starting groups as shown have centre ball in one group

K-means moves one point at a time to “improve” 2 groups


Example: K-means clustering


K = 2 groups

Final groups shown maximize F-like statistic (between/within)

Central ball is lost

K-means poor for this data configuration


Example: VERI (Visual Empirical Regions of Influence)

join points if no third point falls in this region



Visual Empirical Regions of Influence

• psychophysical experiments of human visual perception to join data points
 – very special circumstances (two lines of three equi-spaced points each)

• works well on demonstration 2-d cases

• extends to higher dimensions
 – two points are joined or not depending on their joint configuration with a third point
 – each third point examined forms a plane with the candidate pair and so VERI shape applies
 – works in high-d with published demonstration cases


Example: VERI


Each colour is a different group found by VERI.

Central ball is lost.

There is no universal method, nor can there be.

VERI fails for this data configuration (also for small perturbations of demonstration cases).


Example: VERI (with parameters)

VERI algorithm, but parameterized now to shrink region size. Becomes minimal spanning tree in the limit (MST gets 2 groups here).

Again, no universal method is possible, but methods can be parameterized.
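The limiting case mentioned here, grouping by a minimal spanning tree, can be sketched as follows. This is not the VERI algorithm itself, only the MST grouping it is said to approach: build the tree, drop the longest edges, and call each connected piece a group. The function names and the use of Prim's algorithm are choices made for this illustration.

```python
import math

def mst_edges(points):
    """Prim's algorithm; returns the MST as a list of (length, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [(math.inf, -1)] * n   # (distance to tree, nearest tree vertex)
    best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        # Next vertex to add: the one outside the tree that is closest to it.
        i = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k][0])
        in_tree[i] = True
        if best[i][1] >= 0:
            edges.append((best[i][0], best[i][1], i))
        for k in range(n):
            if not in_tree[k]:
                d = math.dist(points[i], points[k])
                if d < best[k][0]:
                    best[k] = (d, i)
    return edges

def mst_groups(points, n_groups):
    """Drop the n_groups - 1 longest MST edges; label the connected pieces."""
    edges = sorted(mst_edges(points))
    keep = edges[: len(edges) - (n_groups - 1)]
    labels = list(range(len(points)))   # simple union-find forest

    def find(i):
        while labels[i] != i:
            i = labels[i]
        return i

    for _, i, j in keep:
        labels[find(i)] = find(j)
    return [find(i) for i in range(len(points))]
```

For example, `mst_groups([(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)], 2)` separates the first three points from the last two, since the longest tree edge is the one bridging them.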


Integrating automatic methods:

Move about the space of partitions:

Pa --> Pb --> Pc --> ….

Which operators f

f(Pa) --> Pb

are of interest?
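One way to picture the operators f is to represent a partition as a list of group labels, one per case, and a move as any function from one such list to another. The sketch below assumes that representation; the two toy moves are invented purely for illustration and are not the moves used in the talk.

```python
from typing import Callable, List

Partition = List[int]                     # group label for each case
Move = Callable[[Partition], Partition]   # an operator f with f(Pa) = Pb

def walk(start: Partition, moves: List[Move]) -> List[Partition]:
    """Apply the moves in turn, keeping every partition visited."""
    path = [start]
    for f in moves:
        path.append(f(path[-1]))
    return path

# Two toy moves, for illustration only.
def split_first_half(p: Partition) -> Partition:
    """Give the first half of the cases a new group label."""
    new_label = max(p) + 1
    half = len(p) // 2
    return [new_label if i < half else g for i, g in enumerate(p)]

def merge_all(p: Partition) -> Partition:
    """Put every case into a single group."""
    return [0] * len(p)
```

Keeping the whole path, rather than only the final partition, matches the idea of moving about the space of partitions and being able to revisit earlier ones.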


Refine

Reduce

Need not be nested.

Nesting produces hierarchy
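Whether two partitions are nested can be checked directly. A minimal sketch, assuming partitions are stored as lists of group labels (one label per case); the function name is hypothetical:

```python
def is_refinement(fine, coarse):
    """True if every group of `fine` lies wholly inside one group of `coarse`."""
    parent = {}
    for f, c in zip(fine, coarse):
        # The first time a fine group is seen, record its coarse group;
        # afterwards, any different coarse group means the pair is not nested.
        if parent.setdefault(f, c) != c:
            return False
    return True
```

A sequence of partitions in which each is a refinement of the one before it is what yields a hierarchy; a sequence that fails this check somewhere is still a legitimate path through the space of partitions.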


Reassign


Refinement sequence: 1

Begin with partition containing all points in one group.


Refinement sequence: 1 -> 2

Refine partition to move to a new partition containing two groups.

Blue points are on the outer sphere.

This refinement was obtained by projecting all points onto the eigenvector corresponding to the largest eigenvalue of the sample variance-covariance matrix and splitting at the largest gap between the projected points.


Refinement sequence: 1 -> 2 -> 3

Refine partition (2) to move to a new partition containing three groups.

Green points are also on the outer sphere.

Refinement move:
• select group whose sample var-cov matrix has largest eigenvalue
• for that group, project and split as before.
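A rough sketch of this refinement move, in plain Python. It assumes partitions are lists of integer group labels and uses power iteration to find the leading eigenvector; both are choices made for this illustration, not details taken from the prototype.

```python
import math

def covariance(rows):
    """Sample mean and variance-covariance matrix of a list of points."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
            for b in range(d)] for a in range(d)]
    return mean, cov

def top_eigen(cov, iters=200):
    """Largest eigenvalue and eigenvector by power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 + 0.1 * j for j in range(d)]
    lam = 0.0
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        lam = math.sqrt(sum(x * x for x in w))
        if lam == 0.0:
            break
        v = [x / lam for x in w]
    return lam, v

def refine(points, labels):
    """Split the group with the largest leading eigenvalue at its widest gap."""
    best = None
    for g in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == g]
        if len(idx) < 2:
            continue  # a singleton cannot be split
        mean, cov = covariance([points[i] for i in idx])
        lam, v = top_eigen(cov)
        if best is None or lam > best[0]:
            best = (lam, v, mean, idx)
    if best is None:
        return list(labels)
    _, v, mean, idx = best
    # Project the chosen group's points onto the eigenvector and sort.
    proj = sorted((sum((points[i][j] - mean[j]) * v[j] for j in range(len(v))), i)
                  for i in idx)
    gaps = [proj[k + 1][0] - proj[k][0] for k in range(len(proj) - 1)]
    cut = gaps.index(max(gaps))
    new = list(labels)
    new_label = max(labels) + 1
    for _, i in proj[cut + 1:]:
        new[i] = new_label
    return new
```

Applied repeatedly from the one-group partition, this produces a nested sequence. Nothing in it prevents a split that isolates a single point.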


Refinement sequence: 1 -> 2 -> 3 -> 4

Refine partition (3) to move to a new partition containing four groups.

New group contains a single (magenta) point on the outer sphere (middle right, up).

Refinement move as before, again splits red group.

Exploration of the data shows this to be a very poor partition with that single isolated point.


Refinement sequence: 1 -> 2 -> 3 -> 4 -> 5

Refine partition (4) to move to a new partition containing five groups.

New group contains a single (black) point on the outer sphere (bottom left).

Refinement move as before, again splits red group.

Again a poor partition; no further refinement step taken at this point.


Reassign, reduce sequence: 5 -> 5

A reassign move from one partition of five to another.

Reassignment move: k-means maximizing an F statistic.

Seems a better partition than before; explore to confirm.
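A reassignment move of this kind might be sketched as below. It moves one point at a time and accepts a move only if the within-group sum of squares drops. With the number of groups and the data fixed, the total sum of squares is constant, so lowering the within-group part raises a between/within F-like ratio. The function names and the stopping rule are assumptions made for this sketch.

```python
def within_ss(points, labels):
    """Total within-group sum of squared distances to the group centres."""
    total = 0.0
    for g in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == g]
        centre = [sum(c) / len(members) for c in zip(*members)]
        total += sum(sum((a - b) ** 2 for a, b in zip(p, centre)) for p in members)
    return total

def reassign(points, labels, max_sweeps=100):
    """Move single points between groups while the within-group SS decreases."""
    labels = list(labels)
    groups = sorted(set(labels))
    current = within_ss(points, labels)
    for _ in range(max_sweeps):
        moved = False
        for i in range(len(points)):
            home = labels[i]
            if labels.count(home) == 1:
                continue  # keep the number of groups fixed: never empty a group
            for g in groups:
                if g == home:
                    continue
                labels[i] = g
                trial = within_ss(points, labels)
                if trial < current - 1e-12:
                    current, home, moved = trial, g, True  # accept the move
                else:
                    labels[i] = home                       # undo the trial
            labels[i] = home
        if not moved:
            break
    return labels
```

The number of groups is unchanged by the move, so it goes from one partition of five to another partition of five, as in the 5 -> 5 step on this slide.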


Explore present partition: 5

Reassignment seems to have isolated central red ball.

Remaining groups distributed around a spherical surface.

Consider reduction moves from this partition to ‘nearby’ partitions with fewer groups.


Partition to be reduced: 5

Same partition - back in the original position to make subsequent reduction moves visually comparable with previous refinement and reassignment moves.

Choice of reduction move can be based on what we have learned from exploring this partition.


Reduce sequence: 5 -> 4

Reduction move: single linkage between groups, i.e. join the closest two groups as measured by Euclidean distance between the nearest points in each group.

Seems reasonable choice given structure observed in previous exploration.

Reduce partition (5) to move to a new partition containing four groups.
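The single-linkage reduction move can be sketched as a function from one partition to another with one fewer group. The label-list representation and the function name are assumptions for illustration.

```python
import math
from itertools import combinations

def reduce_single_linkage(points, labels):
    """Merge the two groups whose nearest points are closest (Euclidean)."""
    groups = sorted(set(labels))
    if len(groups) < 2:
        return list(labels)  # nothing to merge

    def gap(g, h):
        # Smallest distance between any point of group g and any point of group h.
        return min(math.dist(p, q)
                   for p, a in zip(points, labels) if a == g
                   for q, b in zip(points, labels) if b == h)

    g, h = min(combinations(groups, 2), key=lambda pair: gap(*pair))
    return [g if lab == h else lab for lab in labels]
```

Swapping `min` for `max` inside `gap` would give the complete-linkage variant instead; the move itself has the same shape.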


Reduce sequence: 5 -> 4 -> 3

Reduction move: as before.

Exploration suggests one more reduction move.

Reduce partition (4) to move to a new partition containing three groups.

Red ball remains.


Reduce sequence: 5 -> 4 -> 3 -> 2

Reduction move: as before.

Interactive exploration important to choose type and details of potentially interesting moves from one partition to another.

Reduce partition (3) to move to a new partition containing two groups.

This partition seems best.


Moves (generic functions), with examples:

• refine (Pold) --> Pnew: break minimal spanning tree

• reduce (Pold) --> Pnew: join near centres

• reassign (Pold) --> Pnew: k-means, maximize F

• partition (graphic) --> Pnew: colours from point cloud


Challenges:

• varying focus
 – subsets (selected manually and at random)
 – merging new data into partition

• interface design
 – control panels, options
 – interaction

• exploring multiple partitions
 – interactive display and comparison
 – resolving many to one


A prototype interface

• cluster analysis hub

- an analysis hub (Oldford, 1997) created on demand for partition:
 - having all points in one group for named data-set, or
 - as defined by colours of all points in topmost plot, or
 - as defined by colours of selected points in topmost plot

- new hub can always be created for any subset

- maintains list of saved partitions

- offers moves from current partition via one of:
 - reduce, refine, or reassign
 - manually from current colours (so as to capture interactive modification of existing partition)

- other operations on one or more partitions (e.g. cluster plot, dendrogram, ...)


Interface illustration: details of moves

• Each move - refine, reduce, reassign - is an entire collection of possible moves, each with many possible choices.

• The next few slides illustrate the prototype implementation where:
 • Buttons for refine, reduce, and reassign are given at the topmost level.
 • Once selected, each button pops up its own control panel where various different kinds of moves and parameter choices can be made. E.g. the analyst might choose to reduce by any of:
  • Join groups with closest centres using Euclidean distance
  • Join groups whose farthest points are closest (i.e. “complete linkage”)
  • Choose group with greatest spread and disperse its points among the remaining groups. …


Interface - reduce


Interface - refine


Interface - reassign


Interface illustration: example of use

• The next few slides illustrate the prototype implementation applied to a “ball in a sphere” data-set (a different one from before).
 • Moves are made about the partition space (refines and reassign)
 • Partitions are saved (can be named, deleted, revisited, etc.)
 • Nested partitions compared via a dendrogram
 • Non-nested partition compared with nested ones

• N.B. at any time, the analyst could have interacted with any graphic
 • to create a new partition by colouring - using “manual button”
 • focus on a subset to examine via a new cluster analysis hub and subsequently incorporate that into the partition of the whole data-set.


Interaction

Selecting refine pops up the refinement panel.

Start with partition having all points in a single group.

Refinement move:
• Choose group with var-cov having largest eigenvalue.
• Project these points onto corresponding eigenvector.
• Split this group where the projected gap is largest.

Choose refinement details.


Interaction

New partition appears as ‘Refine Dataset’ in panel at left.

Refine produces new partition having two groups as shown by different colours in all graphics.

Refinement details unchanged.


name and save partition


New partition is named and saved.

New partition has three groups.

Refinement details unchanged.

Saved partition list.

prototype - refine to 4

New partition has four groups.

Refinement details unchanged.

prototype - refine to 5

New partition has five groups. The fifth group contains a single point (blue, top right).

Refinement details unchanged.

No further refinement pursued beyond this one.

Select nested partitions and view dendrogram

Dendrogram button.

Dendrogram shows 5 nested partitions:

• Each block is a group; a horizontal cut at each vertical level is a partition.

• Size and colour proportions vary with number of points.

• Colouring is as displayed in point cloud (here showing the current partition).

Select nested partitions.

Reassign, dendrogram updated

New partition appears as `Reassign Dataset’ in panel at left.

Colours update in all graphics including the dendrogram:

• Reassignment partition can be explored as usual.

• This partition can be visually compared with previous partitions via the updated colours in the dendrogram.

Reassign move to new partition.

Details:

• k-means

• max F statistic
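A reassign move of this kind might look like the sketch below. It is one possible reading, not the prototype's implementation: `reassign` runs k-means style sweeps that start from the current partition and keep the number of groups fixed, and `pseudo_f` computes a between/within F-type ratio that could serve as the criterion. The slides do not define the "max F statistic", so treating it as this ratio is an assumption, as are all function names. Partitions are lists of groups of row indices that together cover every row of `data`.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def reassign(data, partition, max_sweeps=100):
    """Move each point to the group with the nearest centroid until stable."""
    groups = [list(m) for m in partition if m]
    for _ in range(max_sweeps):
        centres = [centroid([data[i] for i in g]) for g in groups]
        new_groups = [[] for _ in groups]
        for i, point in enumerate(data):
            nearest = min(range(len(centres)),
                          key=lambda k: sq_dist(point, centres[k]))
            new_groups[nearest].append(i)
        new_groups = [g for g in new_groups if g]   # drop emptied groups
        if new_groups == groups:
            break
        groups = new_groups
    return groups

def pseudo_f(data, partition):
    """Between-group over within-group mean square (an F-type ratio)."""
    n, k = len(data), len(partition)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 groups and more points than groups")
    overall = centroid(data)
    centres = [centroid([data[i] for i in g]) for g in partition]
    between = sum(len(g) * sq_dist(c, overall) for g, c in zip(partition, centres))
    within = sum(sq_dist(data[i], c) for g, c in zip(partition, centres) for i in g)
    if within == 0.0:
        return math.inf
    return (between / (k - 1)) / (within / (n - k))
```

Comparing `pseudo_f` before and after `reassign` gives a simple numerical check that the move improved group separation.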

Cluster plot + dendrogram interaction movie

Cluster plot button operates on selected partition

Nested and non-nested partitions can be visually compared simultaneously through interaction.

Cluster plot:

• groups as boxes

• close groups are visually close (via multi-dimensional scaling)

Other operators

• dissimilarity (Pi, Pj) --> di,j

• display (P1, ..., Pm)

– dendrogram if P1 < … < Pm

– mds plot of all clusters in P1, …, Pm

– mds plot of all partitions P1, …, Pm
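The two operators above need a dissimilarity between partitions and a test for nestedness. The sketch below makes both concrete under stated assumptions: the slides do not say which dissimilarity is used, so the fraction of point pairs on which two partitions disagree (one minus the Rand index) stands in here, and the ordering P1 < … < Pm is read as "each later partition refines the one before it". Function names and the list-of-groups representation are hypothetical.

```python
from itertools import combinations

def labels(partition):
    """Map each point index to the position of its group in the partition."""
    return {i: g for g, members in enumerate(partition) for i in members}

def dissimilarity(pa, pb):
    """Fraction of point pairs grouped together in one partition but not the other."""
    la, lb = labels(pa), labels(pb)
    if set(la) != set(lb):
        raise ValueError("partitions must cover the same points")
    pairs = list(combinations(sorted(la), 2))
    if not pairs:
        return 0.0
    disagreements = sum((la[i] == la[j]) != (lb[i] == lb[j]) for i, j in pairs)
    return disagreements / len(pairs)

def is_refinement(finer, coarser):
    """True if every group of `finer` lies inside some group of `coarser`."""
    coarse_sets = [set(g) for g in coarser]
    return all(any(set(g) <= c for c in coarse_sets) for g in finer)

def is_nested(partitions):
    """True if each partition in the sequence refines its predecessor."""
    return all(is_refinement(later, earlier)
               for earlier, later in zip(partitions, partitions[1:]))
```

A display operator could then choose a dendrogram when `is_nested` holds and fall back to an mds plot of the pairwise `dissimilarity` values otherwise.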

Creation:

• partition (Data ; ...) --> Pnew

• “manually” from colours

• k-means, random start, mst, veri, etc

• from existing classifier.

• partition-path (Data ; …) --> {P1 , P2 , …, Pn }

• partition-path (Pold ; ...) --> {Pold , P1 , P2 , …, Pn }

• e.g. nested sequence from hierarchical clustering
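As an illustration of a partition-path operator, the sketch below builds a nested sequence by single-linkage agglomerative clustering. The choice of single linkage, the name `single_linkage_path`, and the list-of-groups representation are assumptions made for the example; the slides only say that hierarchical clustering is one source of such a path.

```python
import math

def single_linkage_path(data):
    """Return the nested partitions visited while merging the two closest groups.

    The path starts with every point in its own group and ends with one group.
    Group distance is the smallest distance between any two of their points.
    """
    groups = [[i] for i in range(len(data))]
    path = [[list(g) for g in groups]]
    while len(groups) > 1:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = min(math.dist(data[i], data[j])
                        for i in groups[a] for j in groups[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = groups[a] + groups[b]
        groups = [g for k, g in enumerate(groups) if k not in (a, b)] + [merged]
        path.append([list(g) for g in groups])
    return path
```

The returned list runs from finest to coarsest; reversing it gives a sequence in which each partition refines the previous one, which is the form a dendrogram display would consume.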

Composition:

• merge (Pa , Pb ; …) --> P+new

• combine non-overlapping partitions

• merge (Data, Pold ; …) --> P+new

• classify additional points

• resolve (P1, ..., Pm; …) --> Pnew

• combine different partitions of the same data
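The composition operators can be sketched as follows. `merge` simply pools the groups of two partitions of disjoint point sets. For `resolve`, the slides leave the rule open, so this example uses the common refinement (two points stay together only if every input partition groups them together), which is one of many possible resolutions and not necessarily the one intended. Names and the list-of-groups representation are hypothetical; the variant of merge that classifies additional points is not shown.

```python
def merge(pa, pb):
    """Combine two partitions of non-overlapping point sets into one partition."""
    points_a = {i for g in pa for i in g}
    points_b = {i for g in pb for i in g}
    if points_a & points_b:
        raise ValueError("partitions overlap; use resolve instead")
    return [list(g) for g in pa] + [list(g) for g in pb]

def resolve(*partitions):
    """Common refinement of several partitions of the same points."""
    label_maps = [{i: g for g, members in enumerate(p) for i in members}
                  for p in partitions]
    points = set(label_maps[0])
    if any(set(m) != points for m in label_maps):
        raise ValueError("partitions must cover the same points")
    cells = {}
    for i in sorted(points):
        # Points sharing the same group in every partition share a key.
        key = tuple(m[i] for m in label_maps)
        cells.setdefault(key, []).append(i)
    return list(cells.values())
```

For example, `resolve([[0, 1], [2, 3]], [[0, 1, 2], [3]])` keeps points 0 and 1 together and separates 2 from 3, because the two partitions agree only on the first pair.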

Implications:

• Algorithms (re)cast in terms of moves:

– refine, reduce

– reassign

– partition, partition-path

– easily understandable (e.g. geometric structures)

– specify required data structures

• e.g. ms tree, triangulation, var-cov matrix, …

New problems:

• interface design

• multiple partitions

– comparison and/or resolution

– multiple display

• inference

Summary

• Cluster analysis is naturally exploratory and needs integration with modern interactive data analysis.

• Enlarging the problem to partitions:

– simplifies and gives structure

– encourages exploratory approach

– integrates naturally

– introduces new possibilities (analysis and research)

Related references:

• Interactive clustering CASI talk, Oldford (2001)

• Quail: Overview (Interface 1998), graphics (Hurley and Oldford, ISI 1999) and code.

• Design principles: Oldford (Interface 1999)

• Analysis hubs: Oldford (Interface 1997)

Acknowledgements:

• Catherine Hurley, Erin McLeish, Rayan Yahfoufi, Natasha Wiebe

• U(W) students in statistical computing

• Quail: Quantitative Analysis in Lisp

http://www.stats.uwaterloo.ca/Quail