
Clustering Methods: Part 2d

Pasi Fränti

31.3.2014

Speech & Image Processing Unit
School of Computing
University of Eastern Finland
Joensuu, FINLAND

Swap-based algorithms

Part I:

Random Swap algorithm

P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.

Pseudo code of Random Swap

RandomSwap(X) → C, P

  C ← SelectRandomRepresentatives(X);
  P ← OptimalPartition(X, C);

  REPEAT T times
    (Cnew, j) ← RandomSwap(X, C);
    Pnew ← LocalRepartition(X, Cnew, P, j);
    (Cnew, Pnew) ← Kmeans(X, Cnew, Pnew);
    IF f(Cnew, Pnew) < f(C, P) THEN
      (C, P) ← (Cnew, Pnew);

  RETURN (C, P);
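The pseudocode above can be turned into a small, self-contained Python sketch. Everything here is illustrative: the helper names are invented, the local repartition is simplified to a full repartition for brevity, and two K-means iterations fine-tune each trial solution.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mse(X, C, P):
    return sum(sq_dist(x, C[P[i]]) for i, x in enumerate(X)) / len(X)

def nearest(x, C):
    return min(range(len(C)), key=lambda j: sq_dist(x, C[j]))

def kmeans_iteration(X, C, P):
    # Update each centroid to the mean of its members, then repartition.
    for j in range(len(C)):
        members = [X[i] for i in range(len(X)) if P[i] == j]
        if members:
            C[j] = tuple(sum(col) / len(members) for col in zip(*members))
    for i, x in enumerate(X):
        P[i] = nearest(x, C)

def random_swap(X, M, T=200, seed=0):
    rng = random.Random(seed)
    C = [X[i] for i in rng.sample(range(len(X)), M)]
    P = [nearest(x, C) for x in X]
    best = mse(X, C, P)
    for _ in range(T):
        Cn = list(C)
        Cn[rng.randrange(M)] = X[rng.randrange(len(X))]  # random swap
        Pn = [nearest(x, Cn) for x in X]                 # full repartition
        for _ in range(2):                               # fine-tune by K-means
            kmeans_iteration(X, Cn, Pn)
        f = mse(X, Cn, Pn)
        if f < best:                                     # keep only improvements
            C, P, best = Cn, Pn, f
    return C, P, best

# Three well-separated Gaussian blobs; Random Swap should locate all three.
rng = random.Random(1)
X = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
     for cx, cy in [(0, 0), (5, 0), (0, 5)] for _ in range(30)]
C, P, err = random_swap(X, M=3)
```

Note the trial-and-error structure: the swap is always attempted, but the new solution is kept only if the objective function improves.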

Demonstration of the algorithm

Two centroids, but only one cluster.

One centroid, but two clusters.

Centroid swap

Swap is made from a centroid-rich area to a centroid-poor area.

Local repartition

Fine-tuning by K-means, 1st iteration

Fine-tuning by K-means, 2nd iteration

Fine-tuning by K-means, 3rd iteration

Fine-tuning by K-means, 16th iteration

Fine-tuning by K-means, 17th iteration

Fine-tuning by K-means, 18th iteration

Fine-tuning by K-means, 19th iteration

Fine-tuning by K-means, final result after 25 iterations

Implementation of the swap

1. Random swap:

   c_j ← x_i,   j = random(1, M),   i = random(1, N)

2. Re-partition vectors from the old cluster:

   p_i ← arg min_{1≤k≤M} d²(x_i, c_k),   ∀i : p_i = j

3. Create the new cluster:

   p_i ← arg min_{k ∈ {p_i, j}} d²(x_i, c_k),   ∀i = 1, …, N

Random swap as local search

Study neighbor solutions

Select one and move

Random swap as local search

Fine-tune solution by hill-climbing technique!

Role of K-means

Consider only local optima!

Role of K-means

Effective search space

Role of swap: reduce search space

Chain reaction by K-means after swap

[Figure: MSE on Bridge. K-means alone: 176.53. With RS: Random + RS 163.93, K-means + RS 163.63, Split + RS 163.51, Ward + RS 163.08.]

Independence of initialization: results for T = 5000 iterations

[Figure: worst and best initial solutions versus final results; the final quality is essentially independent of the initial solution.]

Part II:

Efficiency of Random Swap

Probability of good swap

• Select a proper centroid for removal:
  – There are M clusters in total: p_removal = 1/M.

• Select a proper new location:
  – There are N choices: p_add = 1/N
  – Only M are significantly different: p_add = 1/M

• In total:
  – M² significantly different swaps.
  – Probability of each different swap is p_swap = 1/M²
  – Open question: how many of these are good?
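Since only M removal choices and, in effect, M significantly different addition locations matter, one specific swap is hit with probability 1/M². A quick Monte Carlo sanity check; the target pair is an arbitrary stand-in for one "good" swap.

```python
import random

# With M clusters there are M significantly different removal choices and
# M significantly different addition locations, so a uniformly random swap
# hits one specific (remove, add) pair with probability 1/M^2.
M = 15                      # as in data sets S1-S4
target = (3, 7)             # arbitrary stand-in for one "good" swap
rng = random.Random(0)
trials = 200_000
hits = sum(1 for _ in range(trials)
           if (rng.randrange(M), rng.randrange(M)) == target)
print(hits / trials, 1 / M ** 2)  # both close to 0.0044
```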

Number of neighbors

Open question: what is the size of the neighborhood (α)?

[Figure: a centroid and its numbered neighbors; Voronoi neighbors vs. neighbors by distance.]

Observed number of neighbors (data set S2)

[Figure: histogram of the number of neighbours per cluster (1-9); average = 3.9.]

Average number of neighbors
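The neighborhood size α can be estimated empirically. The criterion below is an assumption chosen for illustration, a simple stand-in for Voronoi neighborhood: two clusters count as neighbors when some data point has them as its two nearest centroids.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def average_neighbors(X, C):
    """Average number of neighbors per cluster, where clusters a and b
    count as neighbors when some point has them as its two nearest
    centroids (an empirical stand-in for Voronoi neighborhood)."""
    M = len(C)
    neighbors = [set() for _ in range(M)]
    for x in X:
        a, b = sorted(range(M), key=lambda k: sq_dist(x, C[k]))[:2]
        neighbors[a].add(b)
        neighbors[b].add(a)
    return sum(len(s) for s in neighbors) / M

# Four centroids on a line: the two ends have one neighbor each, the two
# middle ones have two, so the average is 1.5.
rng = random.Random(0)
C = [(0.0,), (1.0,), (2.0,), (3.0,)]
X = [(rng.uniform(-0.5, 3.5),) for _ in range(1000)]
print(average_neighbors(X, C))  # 1.5
```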

• Probability of not finding a good swap in T iterations:

  q = (1 − α²/M²)^T

• Expected number of iterations for a given failure probability q:

  T = log q / log(1 − α²/M²)

• Estimated number of iterations:

             Observed q-values              Estimated iterations (T)
             S1      S2      S3      S4     S1     S2     S3     S4
  q=10%      19%     14%     22%     22%    53     47     39     37
  q=1%       3.1%    1.2%    1.0%    3.6%   106    93     78     74
  q=0.1%     0.1%    0.1%    0.2%    1.1%   159    140    117    111
  Expected   72      56      55      48     23     21     17     16
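The expected-iterations formula is easy to evaluate directly. Here M = 15 matches the S1-S4 data sets, while α = 4 is an assumed neighborhood size close to the observed average of 3.9.

```python
import math

def iterations_needed(q, M, alpha):
    """T (as a real number) with (1 - alpha^2/M^2)^T = q."""
    p_good = alpha ** 2 / M ** 2        # probability of one good swap
    return math.log(q) / math.log(1 - p_good)

# M = 15 as in S1-S4; alpha = 4 is an assumed neighborhood size.
for q in (0.10, 0.01, 0.001):
    print(f"q = {q}: T = {iterations_needed(q, M=15, alpha=4):.0f}")
```

The logarithmic dependency on q and the quadratic dependency on M are both visible in the formula: halving q only adds a constant number of iterations, while doubling M quadruples them.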

Estimated number of iterations depending on T

[Figure: observed vs. estimated iterations for S1-S4.]

Observed = number of iterations needed in practice.
Estimated = estimate of the number of iterations needed for given q.

Probability of success (p) depending on T

[Figure: p (0-100%) vs. iterations (0-300).]

Probability of failure (q) depending on T

[Figure: q on a log scale (10⁻⁹ to 1) vs. iterations (0-300).]

Observed probabilities depending on dimensionality

[Figure: observed q vs. dimensionality (16-1024) for target q = 0.1%, 1% and 10%.]

Bounds for the number of iterations

Upper limit:

  T = log q / log(1 − α²/M²) ≤ ln(1/q) · M²/α²

Lower limit similarly; resulting in:

  T = Θ( ln(1/q) · M²/α² )

Multiple swaps (w)

Probability for performing less than w swaps:

  q = sum_{i=0}^{w−1} C(T, i) · (α²/M²)^i · (1 − α²/M²)^(T−i)

Expected number of iterations:

  E[T] = w · M²/α²
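The probability of completing fewer than w swaps in T iterations is a binomial tail and can be computed directly (M = 15 as in S1-S4, α = 4 again an assumed neighborhood size):

```python
import math

def prob_fewer_than_w_swaps(T, M, alpha, w):
    """q = sum_{i=0}^{w-1} C(T, i) * p^i * (1-p)^(T-i), with p = alpha^2/M^2."""
    p = alpha ** 2 / M ** 2
    return sum(math.comb(T, i) * p ** i * (1 - p) ** (T - i)
               for i in range(w))

# With M = 15 and alpha = 4, a run of T = 200 iterations almost surely
# performs at least three good swaps:
for w in (1, 2, 3):
    print(w, prob_fewer_than_w_swaps(200, 15, 4, w))
```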

K-means clustering result (3 swaps needed)

Final clustering result

Number of swaps needed: example from image quantization

Efficiency of the random swap

Total time to find correct clustering:
– Time per iteration × number of iterations

Time complexity of a single step:
– Swap: O(1)
– Remove cluster: 2MN/M = O(N)
– Add cluster: 2N = O(N)
– Centroids: 2(2N/M) + 2 + 2 = O(N/M)
– (Fast) K-means iteration: 4N = O(N)*

*See Fast K-means for analysis.

Observed number of steps at iterations 50, 100 and 500:

Step                Time complexity   50        100       500
Centroid swap       2                 2         2         2
Cluster removal     2N                7,526     8,448     10,137
Cluster addition    2N                8,192     8,192     8,192
Update centroids    4N/M + 2 + 1      53        61        60
K-means iterations  4N                300,901   285,555   197,327
Total               O(N)              316,674   302,258   215,718

Time complexity and the observed number of steps

Time spent by K-means iterations (Bridge)

[Figure: number of processing steps vs. iteration (0-500), split into local repartition and the 1st and 2nd K-means iterations.]

Effect of K-means iterations

[Figure: error (MSE) vs. time (s) on Bridge for 1-5 K-means iterations per swap.]

The version with one iteration seems to be the weakest all the time. Versions with other amounts of iterations are pretty even.

Total time complexity

Time complexity of a single step (t):

  t = O(αN)

Number of iterations needed (T):

  T(M, q) = O( ln(1/q) · M²/α² )

Total time:

  T(N, M, q) = O( t · T ) = O( N · M² · ln(1/q) / α )

Time complexity: conclusions

1. Logarithmic dependency on q

2. Linear dependency on N

3. Quadratic dependency on M (with a large number of clusters, the method can be too slow)

4. Inverse dependency on α (worst case α = 2): the higher the dimensionality and the cluster overlap, the faster the method

  Total: T(N, M, q) = O( N · M² · ln(1/q) / α )

Time-distortion performance (Bridge)

[Figure: MSE vs. time; Random Swap vs. Repeated k-means.]

Time-distortion performance (Missa1)

[Figure: MSE vs. time; Random Swap vs. Repeated k-means.]

Time-distortion performance (Birch1)

[Figure: MSE (millions) vs. time; Random Swap vs. Repeated k-means.]

Time-distortion performance (Birch2)

[Figure: MSE (millions) vs. time; Random Swap vs. Repeated k-means.]

Time-distortion performance (Europe)

[Figure: MSE (millions) vs. time; Random Swap vs. Repeated k-means.]

Time-distortion performance (KDD-Cup04 Bio)

[Figure: MSE (millions) vs. time; Random Swap vs. Repeated k-means.]

References

Random swap algorithm:
• P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.
• P. Fränti, J. Kivijärvi and O. Nevalainen, "Tabu search algorithm for codebook generation in VQ", Pattern Recognition, 31 (8), 1139-1148, August 1998.

Pseudo code:
• http://cs.joensuu.fi/sipu/soft/

Efficiency of Random Swap algorithm:
• P. Fränti, O. Virmajoki and V. Hautamäki, "Efficiency of random swap based clustering", IAPR Int. Conf. on Pattern Recognition (ICPR'08), Tampa, FL, Dec 2008.

Part III:

Example when 4 swaps needed

1st swap: MSE = 4.2 × 10⁹ → 3.4 × 10⁹

2nd swap: MSE = 3.1 × 10⁹ → 3.0 × 10⁹

3rd swap: MSE = 2.3 × 10⁹ → 2.1 × 10⁹

4th swap: MSE = 1.9 × 10⁹ → 1.7 × 10⁹

Final result: MSE = 1.3 × 10⁹

Part IV:

Deterministic Swap

[Figure: example data set with 15 numbered clusters. Two centroids, but only one cluster; one centroid, but two clusters.]

Deterministic swap

Costs for the swap:

Cluster   Removal   Addition
1         0.80      0.39
2         1.04      0.64
3         5.48      1.09
4         5.66      0.92
5         6.50      0.76
6         7.67      1.01
7         8.47      0.45
8         9.10      0.75
9         9.90      1.42
10        11.09     1.26
11        11.47     0.61
12        12.17     4.70
13        14.61     0.94
14        16.41     0.93
15        16.68     1.41

From where to where?

• Merge two existing clusters [Frigui 1997, Kaukoranta 1998] following the spirit of agglomerative clustering.

• Local optimization: remove the prototype that increases the cost function value least [Fritzke 1997, Likas 2003, Fränti 2006].

• Smart swap: find two nearest prototypes, and remove one of them randomly [Chen, 2010].

• Pairwise swap: locate a pair of inconsistent prototypes in two solutions [Zhao, 2012].
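The local-optimization strategy above ("remove the prototype that increases the cost function value least") can be sketched as follows; the helper names are invented and the example data is synthetic:

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def removal_cost(X, C, P, j):
    """Increase in total squared error if prototype j is removed and its
    points move to their nearest remaining prototype."""
    cost = 0.0
    for i, x in enumerate(X):
        if P[i] == j:
            old = sq_dist(x, C[j])
            new = min(sq_dist(x, C[k]) for k in range(len(C)) if k != j)
            cost += new - old
    return cost

def best_removal(X, C, P):
    # Local optimization: remove the prototype whose loss is cheapest.
    return min(range(len(C)), key=lambda j: removal_cost(X, C, P, j))

# Two nearly coincident prototypes and one isolated: removing one of the
# overlapping pair is far cheaper than removing the isolated one.
C = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
X = [(0.05, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.1), (4.9, 5.0)]
P = [min(range(3), key=lambda k: sq_dist(x, C[k])) for x in X]
print(best_removal(X, C, P))
```

This is the O(N) per-candidate test mentioned on the next slide: evaluating all M candidates therefore costs O(MN).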

Cluster removal

1. Select an existing cluster
   – Depending on strategy: 1..M choices.
   – Each choice takes O(N) time to test.

2. Select a location within this cluster
   – Add a new prototype
   – Consider only existing points

Cluster addition

Select the cluster

• Cluster with the biggest MSE
  – Intuitive heuristic [Fritzke 1997, Chen 2010]
  – Computationally demanding

• Local optimization
  – Try all clusters for the addition [Likas et al, 2003]
  – Computationally demanding: O(NM)-O(N²)

Select the location

1. Current prototype + ε [Fritzke 1997]

2. Furthest vector [Fränti et al 1997]

3. Any other split heuristic [Fränti et al, 1997]

4. Random location

5. Every possible location [Likas et al, 2003]

Complexity of swaps

[Figure: deterministic swap example; the removed prototype is relocated to the furthest point of the cluster where it is added.]

• Initialization: O(MN)

• Swap iteration:
  – Finding nearest pair: O(M²)
  – Calculating distortion: O(N)
  – Sorting clusters: O(M·log M)
  – Evaluation of result: O(N)
  – Repartition and fine-tuning: O(N)
  Total: O(MN + M² + I·N)

• Expected number of iterations: < 2·M

• Estimated total time: O(2M²N)

Smart swap

[Figure: the two nearest prototypes and the cluster with the largest distortion.]

SmartSwap(X, M) → C, P

  C ← InitializeCentroids(X);
  P ← PartitionDataset(X, C);
  MaxOrder ← log₂ M;
  order ← 1;
  WHILE order < MaxOrder
    (ci, cj) ← FindNearestPair(C);
    S ← SortClustersByDistortion(P, C);
    cswap ← RandomSelect(ci, cj);
    clocation ← S[order];
    Cnew ← Swap(cswap, clocation);
    Pnew ← LocalRepartition(P, Cnew);
    KmeansIteration(Pnew, Cnew);
    IF f(Cnew) < f(C) THEN
      order ← 1; C ← Cnew;
    ELSE
      order ← order + 1;
      KmeansIteration(P, C);

Smart swap pseudo code
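A loose Python rendering of the pseudocode above. Local repartition and the K-means iterations are simplified to a full repartition, and the new location is a random point of the target cluster, so this is a sketch of the control flow rather than a faithful implementation:

```python
import math
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest_pair(C):
    """Indices of the two closest prototypes."""
    M = len(C)
    return min(((i, j) for i in range(M) for j in range(i + 1, M)),
               key=lambda ij: sq_dist(C[ij[0]], C[ij[1]]))

def clusters_by_distortion(X, C, P):
    """Cluster indices sorted by total squared error, largest first."""
    dist = [0.0] * len(C)
    for i, x in enumerate(X):
        dist[P[i]] += sq_dist(x, C[P[i]])
    return sorted(range(len(C)), key=lambda j: -dist[j])

def smart_swap(X, M, seed=0):
    rng = random.Random(seed)
    C = [X[i] for i in rng.sample(range(len(X)), M)]
    P = [min(range(M), key=lambda k: sq_dist(x, C[k])) for x in X]

    def f(Cc, Pp):
        return sum(sq_dist(x, Cc[Pp[i]]) for i, x in enumerate(X))

    order, max_order = 1, max(2, int(math.log2(M)))
    while order < max_order:
        i, j = nearest_pair(C)
        S = clusters_by_distortion(X, C, P)
        c_swap = rng.choice((i, j))       # remove one of the nearest pair
        target = S[order - 1]             # order-th most distorted cluster
        members = [X[k] for k in range(len(X)) if P[k] == target]
        if not members or target == c_swap:
            order += 1
            continue
        Cn = list(C)
        Cn[c_swap] = rng.choice(members)  # relocate into that cluster
        Pn = [min(range(M), key=lambda k: sq_dist(x, Cn[k])) for x in X]
        if f(Cn, Pn) < f(C, P):
            C, P, order = Cn, Pn, 1       # success: restart from order 1
        else:
            order += 1                    # failure: try the next cluster
    return C, P, f(C, P)

# Four Gaussian blobs as a smoke test:
rng = random.Random(1)
X_demo = [(rng.gauss(cx, 0.2), rng.gauss(cy, 0.2))
          for cx, cy in [(0, 0), (4, 0), (0, 4), (4, 4)] for _ in range(25)]
C2, P2, err = smart_swap(X_demo, 4)
```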

Pairwise swap

[Figure: pairing the prototypes of two solutions. Prototypes that are nearest neighbors of each other form pairs; a prototype whose nearest neighbor in the other set is further than in its own set remains unpaired → subject to swap.]

Combinations of random and deterministic swap

Variant   Removal                        Addition
RR        Random                         Random
RD        Random                         Deterministic
DR        Deterministic                  Random
DD        Deterministic                  Deterministic
D2R       Deterministic + data update    Random
D2D       Deterministic + data update    Deterministic

Summary of the time complexities

              RR       RD       DR       DD       D2R      D2D
Removal       O(1)     O(1)     O(MN)    O(MN)    O(αN)    O(αN)
Addition      O(1)     O(N)     O(1)     O(N)     O(1)     O(N)
Repartition   O(N)     O(N)     O(N)     O(N)     O(N)     O(N)
K-means       O(αN)    O(αN)    O(αN)    O(αN)    O(αN)    O(αN)
Total         O(αN)    O(αN)    O(MN)    O(MN)    O(αN)    O(αN)

(RR and RD use random removal; DR, DD, D2R and D2D use deterministic removal.)

Profiles of the processing time

[Figure: processing time (s) per iteration on Bridge and Birch2 for RR, RD, DR, DD, D2R and D2D, broken down into K-means, swap, repartition and other steps.]

Test data sets

Data set         Type of data set           Vectors (N)   Clusters (M)   Dimension (d)
Bridge           Gray-scale image           4086          256            16
House*           RGB image                  34112         256            3
Miss America     Residual vectors           6480          256            16
Europe           Differential coordinates   169673                       2
BIRCH1-BIRCH3    Synthetically generated    100000        100            2
S1-S4            Synthetically generated    5000          15             2
Dim32-1024       Synthetically generated    1000          256            32-1024

[Figure: data sets S1, S2, S3, S4.]

Birch data sets

[Figure: Birch1, Birch2, Birch3.]

Experiments: Bridge

[Figure: error (MSE) vs. time (s) on Bridge for Random Swap and the RR, RD, DR and DD variants.]

[Figure: MSE vs. time on Bridge for DR and D2R compared with Random Swap and Repeated k-means.]

Experiments: Birch2

[Figure: error (MSE, ×10⁶) vs. time (s) on Birch2 for Random Swap and the RR, RD, DR and DD variants.]

Experiments: Miss America

[Figure: MSE vs. time on Missa1 for DR and D2R compared with Random Swap and Repeated k-means.]

Quality comparisons (MSE) with a 10-second time constraint

                         Bridge    House    Miss America   Europe ×10⁷   Birch1 ×10⁸   Birch2 ×10⁶
Repeated Random          251.32    12.12    8.34           2.37          13.10         22.35
Repeated K-means         177.66    6.58     5.92           1.52          5.49          4.10
Random Swap              174.08    6.41     5.85           1.26          5.70          4.43
RD-variant               171.20    6.10     5.58           1.02          5.11          2.78
Speed-up from RR to RD   2:1       4:1      5:1            6:1           4:1           18:1

Literature

1. P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.

2. P. Fränti, J. Kivijärvi and O. Nevalainen, "Tabu search algorithm for codebook generation in VQ", Pattern Recognition, 31 (8), 1139-1148, August 1998.

3. P. Fränti, O. Virmajoki and V. Hautamäki, "Efficiency of random swap based clustering", IAPR Int. Conf. on Pattern Recognition (ICPR'08), Tampa, FL, Dec 2008.

4. P. Fränti, M. Tuononen and O. Virmajoki, "Deterministic and randomized local search algorithms for clustering", IEEE Int. Conf. on Multimedia and Expo (ICME'08), Hannover, Germany, 837-840, June 2008.

5. P. Fränti and O. Virmajoki, "On the efficiency of swap-based clustering", Int. Conf. on Adaptive and Natural Computing Algorithms (ICANNGA'09), Kuopio, Finland, LNCS 5495, 303-312, April 2009.

6. J. Chen, Q. Zhao and P. Fränti, "Smart swap for more efficient clustering", Int. Conf. on Green Circuits and Systems (ICGCS'10), Shanghai, China, 446-450, June 2010.

7. B. Fritzke, "The LBG-U method for vector quantization: an improvement over LBG inspired from neural networks", Neural Processing Letters, 5 (1), 35-45, 1997.

8. P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.

9. T. Kaukoranta, P. Fränti and O. Nevalainen, "Iterative split-and-merge algorithm for VQ codebook generation", Optical Engineering, 37 (10), 2726-2732, October 1998.

10. H. Frigui and R. Krishnapuram, "Clustering by competitive agglomeration", Pattern Recognition, 30 (7), 1109-1119, July 1997.

11. A. Likas, N. Vlassis and J.J. Verbeek, "The global k-means clustering algorithm", Pattern Recognition, 36, 451-461, 2003.

12. PAM (Kaufman and Rousseeuw, 1987)

13. CLARA (Kaufman and Rousseeuw, 1990)

14. CLARANS: A Clustering Algorithm based on Randomized Search (Ng and Han, 1994)

15. R.T. Ng and J. Han, "CLARANS: A method for clustering objects for spatial data mining", IEEE Transactions on Knowledge and Data Engineering, 14 (5), September/October 2002.