Page 1

Introduction to statistical learning
2.2 Unsupervised learning: Clustering

V. Lefieux

June 2018

Page 2

Table of contents

Introduction

Metrics

Inertia

K-means

Hierarchical clustering

Conclusion


Page 4

Introduction

- Grouping a set of n data objects (described by p variables) into clusters.
- A cluster is a set of elements (individuals) that are:
  - near (similar) to one another within the same cluster,
  - far from (dissimilar to) the elements of other clusters.
- Clustering is an unsupervised classification.

Page 5

Examples

- Customers.
- Climatic zones.
- Genes.
- Pictures.
- Textual information.
- Web pages.

Page 6

Many partitions of a set I

The total number of partitions of an n-element set increases rapidly with n. For example, for the 4 elements A, B, C and D, there are 15 possible partitions:

{ABCD}; {ABC}{D}; {ABD}{C}; {ACD}{B}; {BCD}{A};
{AB}{CD}; {AC}{BD}; {AD}{BC}; {AB}{C}{D}; {AC}{B}{D};
{AD}{B}{C}; {BC}{A}{D}; {BD}{A}{C}; {CD}{A}{B}; {A}{B}{C}{D}.

Page 7

Many partitions of a set II

The total number of partitions of an n-element set is given by the Bell number:

$$B_n = \frac{1}{e} \sum_{k=0}^{+\infty} \frac{k^n}{k!} .$$

So:

n:     4     6     10
B_n:   15    203   115 975

The number of partitions of an n-element set into exactly K nonempty parts is asymptotically equivalent to $\frac{K^n}{K!}$.
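As a quick numerical check on these values, here is a minimal Python sketch (not part of the original slides) that computes Bell numbers with the Bell-triangle recurrence:

```python
def bell(n: int) -> int:
    """Return the Bell number B_n (number of partitions of an n-element set)."""
    row = [1]  # first row of the Bell triangle
    for _ in range(n):
        # Each new row starts with the last element of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

print([bell(n) for n in (4, 6, 10)])  # [15, 203, 115975]
```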

Page 8

Aim

In practice one considers thousands of individuals, so it is impossible to compare all possible partitions (with a criterion).

This is why algorithmic methods have been developed to solve this problem.

Page 9

Methods

- Partitioning-based algorithms: build and evaluate partitions with a criterion.
- Hierarchical-based algorithms: build a hierarchical decomposition.
- Model-based algorithms: a probability distribution is assumed for each cluster (e.g. Expectation-Maximization (EM) with Gaussian mixture models).
- Grid-based algorithms: based on a multiple-level granularity structure.
- Density-based algorithms: based on density functions (e.g. DBSCAN: Density-Based Spatial Clustering of Applications with Noise).

Page 10

Partitioning-based and hierarchical-based algorithms

- For both families of methods, a dissimilarity (or similarity) metric between individuals is needed.
- In partitioning-based algorithms, a partitioning criterion is needed.
- In hierarchical-based algorithms, an agglomerative criterion (or a divisive criterion) is needed.


Page 12

Data

p variables measured on n individuals.

The data set is represented as an n × p table:

$$X = \left( x_i^j \right)_{i \in \{1,\dots,n\},\ j \in \{1,\dots,p\}} ,$$

where the n rows are the individuals and the p columns are the variables.

$x_i^j$: value of the variable $X^j$ measured on individual $i$.

Page 13

Dataset matrix

Rows are the individuals (1, …, i, …, n) and columns are the variables (1, …, j, …, p):

$$X = \begin{pmatrix}
x_1^1 & \cdots & x_1^j & \cdots & x_1^p \\
\vdots &        & \vdots &        & \vdots \\
x_i^1 & \cdots & x_i^j & \cdots & x_i^p \\
\vdots &        & \vdots &        & \vdots \\
x_n^1 & \cdots & x_n^j & \cdots & x_n^p
\end{pmatrix} .$$

Page 14

Individuals and variables

Commonly, individual i refers to:

$$X_i = \left( x_i^1, \dots, x_i^p \right)^\top$$

and variable j to:

$$X^j = \left( x_1^j, \dots, x_n^j \right)^\top .$$

Page 15

Weights

The sample should be representative: a miniature of the population it comes from. If not, one assigns to each individual i a weight $\omega_i$ (e.g. from a survey design) such that:

- $\forall i \in \{1, \dots, n\}: \omega_i > 0$,
- $\sum_{i=1}^{n} \omega_i = 1$.

One considers the matrix:

$$W = \operatorname{diag}(\omega_1, \dots, \omega_n) .$$

Usually the weights are uniform:

$$\forall i \in \{1, \dots, n\}: \omega_i = \frac{1}{n} ,$$

that is:

$$W = \frac{1}{n} I_n .$$

Page 16

Barycenter

The barycenter of the data set is:

$$G = X^\top W \mathbf{1}_n = \sum_{i=1}^{n} \omega_i X_i ,$$

where $\mathbf{1}_n$ is the n-dimensional all-ones vector.
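A minimal NumPy sketch (not from the slides; the data are illustrative) of the weighted barycenter defined above, checking that both expressions agree:

```python
import numpy as np

# Hypothetical data: n = 5 individuals described by p = 2 variables.
X = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, 2.0], [0.5, 0.0], [1.5, 1.0]])
n = X.shape[0]

omega = np.full(n, 1.0 / n)                 # uniform weights, summing to 1
W = np.diag(omega)

G_matrix = X.T @ W @ np.ones(n)             # G = X^T W 1_n
G_sum = (omega[:, None] * X).sum(axis=0)    # G = sum_i omega_i X_i

print(np.allclose(G_matrix, G_sum))         # True: both formulas agree
```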

Page 17

Metrics

One considers a metric d defined on $\Omega \times \Omega$, where $\Omega$ is the set of individuals.

One usually considers the following properties, for 3 individuals $(i_1, i_2, i_3) \in \Omega^3$:

1. $d(i_1, i_2) \geq 0$.
2. $d(i_1, i_2) = d(i_2, i_1)$.
3. $d(i_1, i_2) = 0 \Rightarrow i_1 = i_2$.
4. $d(i_1, i_2) \leq d(i_1, i_3) + d(i_3, i_2)$.
5. $d(i_1, i_2) \leq \max\left( d(i_1, i_3),\, d(i_3, i_2) \right)$.

Page 18

Vocabulary

- Dissimilarity: properties (1), (2), (3).
- Distance: properties (1), (2), (3), (4).
- Ultrametric distance: properties (1), (2), (3), (5).

Page 19

Similarity

A similarity is a function s defined on $\Omega \times \Omega$ such that, for 2 individuals $(i_1, i_2)$:

1. $s(i_1, i_2) \geq 0$.
2. $s(i_1, i_2) = s(i_2, i_1)$.
3. $s(i_1, i_2) \leq s(i_1, i_1)$.
4. $s(i_1, i_1) = s(i_2, i_2) := s_{\max}$.

Based on a similarity s, it is possible to build a dissimilarity d. For example:

$$d(i_1, i_2) = s_{\max} - s(i_1, i_2) .$$

Page 20

Metric choice

- Many metrics can be found in the literature.
- The choice of a metric depends essentially on the nature of the variables.

Page 21

Classic metrics: quantitative case I

- Minkowski distance:

$$d(i_1, i_2) = \left( \sum_{j=1}^{p} \left| x_{i_1}^j - x_{i_2}^j \right|^q \right)^{\frac{1}{q}} .$$

- Manhattan distance (Minkowski, q = 1):

$$d(i_1, i_2) = \sum_{j=1}^{p} \left| x_{i_1}^j - x_{i_2}^j \right| .$$

- Chebyshev distance:

$$d(i_1, i_2) = \max_{j \in \{1, \dots, p\}} \left| x_{i_1}^j - x_{i_2}^j \right| .$$
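A short NumPy sketch (not from the slides; the two individuals are illustrative) of the distances above:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 0.5])   # hypothetical individual i1 (p = 3 variables)
x2 = np.array([0.0, 3.5, 1.0])   # hypothetical individual i2

def minkowski(a, b, q):
    """Minkowski distance of order q."""
    return (np.abs(a - b) ** q).sum() ** (1.0 / q)

manhattan = minkowski(x1, x2, 1)      # q = 1
euclidean = minkowski(x1, x2, 2)      # q = 2
chebyshev = np.abs(x1 - x2).max()     # limit q -> infinity

print(manhattan, euclidean, chebyshev)
```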

Page 22

Classic metrics: quantitative case II

The Euclidean distance (Minkowski, q = 2) is one of the most used in practice:

$$d(i_1, i_2) = \left( \sum_{j=1}^{p} \left( x_{i_1}^j - x_{i_2}^j \right)^2 \right)^{\frac{1}{2}} .$$

One can also write:

$$d^2(i_1, i_2) = (X_{i_1} - X_{i_2})^\top (X_{i_1} - X_{i_2}) .$$

Page 23

Classic metrics: quantitative case III

We consider the M-Euclidean distance:

$$d_M^2(i_1, i_2) = (X_{i_1} - X_{i_2})^\top M \, (X_{i_1} - X_{i_2}) ,$$

where M is a symmetric positive-definite matrix.

The distance $d_M$ is induced by the norm:

$$\forall x \in \mathbb{R}^p : \ \|x\|_M^2 = x^\top M x .$$

The associated inner product is:

$$\forall (x, y) \in \mathbb{R}^p \times \mathbb{R}^p : \ \langle x, y \rangle_M = x^\top M y .$$

Page 24

Classic metrics: quantitative case IV

Variables are usually normalized:

- Sebestyen distance:

$$d^2(i_1, i_2) = (X_{i_1} - X_{i_2})^\top D \, (X_{i_1} - X_{i_2}) ,$$

where D is a positive diagonal matrix, for example:

$$D = D_{\frac{1}{s^2}} := \operatorname{diag}\left( \frac{1}{s_1^2}, \dots, \frac{1}{s_p^2} \right) .$$

- Mahalanobis distance:

$$d^2(i_1, i_2) = (X_{i_1} - X_{i_2})^\top S^{-1} (X_{i_1} - X_{i_2}) ,$$

where S is the variance-covariance matrix.
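A minimal NumPy sketch (not from the slides; the data matrix is illustrative) of the Sebestyen and Mahalanobis squared distances between the first two individuals:

```python
import numpy as np

# Hypothetical n x p data matrix (rows = individuals, columns = variables).
X = np.array([[1.0, 10.0], [2.0, 14.0], [0.5, 9.0], [1.5, 12.0]])
x1, x2 = X[0], X[1]
diff = x1 - x2

# Sebestyen distance: diagonal weighting by the inverse variances.
D = np.diag(1.0 / X.var(axis=0, ddof=1))
d2_sebestyen = diff @ D @ diff

# Mahalanobis distance: uses the inverse variance-covariance matrix S.
S = np.cov(X, rowvar=False)
d2_mahalanobis = diff @ np.linalg.inv(S) @ diff

print(d2_sebestyen, d2_mahalanobis)
```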

Page 25

Classic metrics: binary case I

In the case of binary variables, with values in {0, 1}, consider for 2 individuals $i_1$ and $i_2$:

- a: number of attributes that neither $i_1$ nor $i_2$ owns,
- b: number of attributes that $i_2$ owns but not $i_1$,
- c: number of attributes that $i_1$ owns but not $i_2$,
- d: number of attributes that both $i_1$ and $i_2$ own.

            i2 = 0    i2 = 1
  i1 = 0      a         b
  i1 = 1      c         d

Page 26

Classic metrics: binary case II

Classic similarities are:

- Jaccard:

$$s(i_1, i_2) = \frac{d}{b + c + d} .$$

- Sokal, Sneath and Anderberg:

$$s(i_1, i_2) = \frac{d}{2(b + c) + d} .$$

- Czekanowski, Sørensen and Dice:

$$s(i_1, i_2) = \frac{2d}{b + c + 2d} .$$

- Russel and Rao:

$$s(i_1, i_2) = \frac{d}{a + b + c + d} .$$

- Ochiai:

$$s(i_1, i_2) = \frac{d}{\sqrt{d + b}\,\sqrt{d + c}} .$$
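A small Python sketch (not from the slides; the two binary profiles are illustrative) computing the counts a, b, c, d and two of the similarities above:

```python
import numpy as np

# Hypothetical binary attribute vectors for two individuals.
i1 = np.array([1, 0, 1, 1, 0, 0, 1])
i2 = np.array([1, 1, 0, 1, 0, 0, 0])

a = int(((i1 == 0) & (i2 == 0)).sum())  # attributes owned by neither
b = int(((i1 == 0) & (i2 == 1)).sum())  # owned by i2 only
c = int(((i1 == 1) & (i2 == 0)).sum())  # owned by i1 only
d = int(((i1 == 1) & (i2 == 1)).sum())  # owned by both

jaccard = d / (b + c + d)
russel_rao = d / (a + b + c + d)
print(a, b, c, d, jaccard, russel_rao)
```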


Page 28

Assumptions

Consider a partition of the individuals into K clusters $C_1, \dots, C_K$ (in a partition, every individual belongs to exactly one cluster).

The cluster $C_k$ has weight:

$$\mu_k = \sum_{i \in C_k} \omega_i$$

and barycenter:

$$G_k = \frac{1}{\mu_k} \sum_{i \in C_k} \omega_i X_i .$$

Page 29

Inertia relative to a point A

The inertia relative to a point $A \in \mathbb{R}^p$ is:

$$I_A = \sum_{i=1}^{n} \omega_i \, d_M^2(i, A) .$$

Huygens' theorem gives:

$$I_A = I_G + d_M^2(A, G) .$$

Page 30

Total inertia

The total inertia is:

$$I_{tot} = \sum_{i=1}^{n} \omega_i \, d_M^2(i, G) .$$

Page 31

Intraclass inertia

The intraclass inertia (within-cluster inertia) is:

$$I_{intra} = \sum_{k=1}^{K} \sum_{i \in C_k} \omega_i \, d_M^2(i, G_k) .$$

Page 32

Interclass inertia

The interclass inertia (between-cluster inertia) is:

$$I_{inter} = \sum_{k=1}^{K} \mu_k \, d_M^2(G_k, G) .$$

Page 33

Inertia decomposition

Huygens' theorem gives:

$$I_{tot} = I_{intra} + I_{inter} .$$
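A small NumPy sketch (not from the slides; data and cluster labels are illustrative) checking the decomposition numerically with uniform weights and the Euclidean distance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))                              # hypothetical data: n = 12, p = 2
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])  # an arbitrary partition, K = 3
w = np.full(len(X), 1.0 / len(X))                         # uniform weights

G = (w[:, None] * X).sum(axis=0)                          # global barycenter
I_tot = (w * ((X - G) ** 2).sum(axis=1)).sum()

I_intra = I_inter = 0.0
for k in np.unique(labels):
    mask = labels == k
    mu_k = w[mask].sum()                                  # cluster weight
    G_k = (w[mask, None] * X[mask]).sum(axis=0) / mu_k    # cluster barycenter
    I_intra += (w[mask] * ((X[mask] - G_k) ** 2).sum(axis=1)).sum()
    I_inter += mu_k * ((G_k - G) ** 2).sum()

print(np.isclose(I_tot, I_intra + I_inter))               # True
```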

Page 34

A dataset

Page 35

Total inertia

Page 36

Intra-class and inter-class inertia (K = 3)

Page 37

Example

[Figure: scatter plot of the example data set, X2 against X1.]

Page 38

Example: partition 1 with K = 2 I

[Figure: scatter plot (X1, X2) of partition 1, K = 2, points labelled by cluster 1 or 2.]

Page 39

Example: partition 1 with K = 2 II

[Figure: scatter plot (X1, X2) of partition 1, K = 2 (second view).]

Page 40

Example: partition 2 with K = 2 I

[Figure: scatter plot (X1, X2) of partition 2, K = 2, points labelled by cluster 1 or 2.]

Page 41

Example: partition 2 with K = 2 II

[Figure: scatter plot (X1, X2) of partition 2, K = 2 (second view).]

Page 42

Clustering criterion

For K known, a good clustering minimizes the intra-class inertia and maximizes the inter-class inertia.

The quality of a clustering can be measured by the ratio $\frac{I_{inter}}{I_{tot}}$.


Page 44

Partition

$P = \{C_1, \dots, C_K\}$ is a partition of $\Omega$ if:

- $\forall k \in \{1, \dots, K\}: C_k \neq \emptyset$,
- $\forall (k, k') \in \{1, \dots, K\}^2,\ k \neq k': C_k \cap C_{k'} = \emptyset$,
- $\bigcup_{k \in \{1, \dots, K\}} C_k = \Omega$.

Page 45

Partitioning algorithms

- Principle of partitioning algorithms: build a first partition and then improve it until convergence.
- K is chosen a priori.
- K-means is the best known of these methods.

Page 46

K-means algorithm

To build a partition with K clusters:

1. Initialization: choose K seed points in the data set (a priori or at random).
2. Point assignment: assign each of the n individuals to the cluster whose seed point is nearest (for the chosen distance).
3. Seed point update: the K seed points are replaced by the K barycenters of the clusters of step 2.
4. Go back to step 2 and stop when there are no new assignments.
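A minimal NumPy sketch (not from the slides; function and variable names are illustrative) of these four steps with random initialization and the Euclidean distance:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: assign each point to the nearest seed, then replace seeds by barycenters."""
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=K, replace=False)].copy()   # step 1: initial seed points
    labels = None
    for _ in range(n_iter):
        dist2 = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)                          # step 2: nearest seed point
        if labels is not None and np.array_equal(new_labels, labels):
            break                                                  # step 4: no new assignment
        labels = new_labels
        for k in range(K):                                         # step 3: barycenters become seeds
            if np.any(labels == k):
                seeds[k] = X[labels == k].mean(axis=0)
    return labels, seeds

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(1, 0.2, (50, 2))])  # two synthetic blobs
labels, centers = kmeans(X, K=2)
print(centers)
```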

Page 47

Illustration (K = 2) I

[Figure: K-means with K = 2, illustration 1 of 7: scatter plot in the (X1, X2) plane.]

Page 48

Illustration (K = 2) II

[Figure: K-means with K = 2, illustration 2 of 7: scatter plot in the (X1, X2) plane.]

Page 49

Illustration (K = 2) III

[Figure: K-means with K = 2, illustration 3 of 7: scatter plot in the (X1, X2) plane.]

Page 50

Illustration (K = 2) IV

[Figure: K-means with K = 2, illustration 4 of 7: scatter plot in the (X1, X2) plane.]

Page 51

Illustration (K = 2) V

[Figure: K-means with K = 2, illustration 5 of 7: scatter plot in the (X1, X2) plane.]

Page 52

Illustration (K = 2) VI

[Figure: K-means with K = 2, illustration 6 of 7: scatter plot in the (X1, X2) plane.]

Page 53

Illustration (K = 2) VII

[Figure: K-means with K = 2, illustration 7 of 7: scatter plot in the (X1, X2) plane.]

Page 54

Properties

- In K-means, the intra-class variance decreases with the number of iterations.
- The K-means algorithm is sensitive to outliers.

Page 55

Some alternatives

- K-medoids (Partitioning Around Medoids, PAM): the barycenter is replaced by the medoid (the most centrally located point):

$$\operatorname{medoid}(C_k) = \arg\min_{i \in C_k} \sum_{j \in C_k,\ j \neq i} \omega_i \omega_j \, d^2(i, j) .$$

The PAM algorithm is less sensitive to outliers than the K-means algorithm.

- The “real” K-means: barycenters are recomputed after each single assignment (so the order of the individuals is not neutral).
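A short NumPy sketch (not from the slides; uniform weights and an illustrative cluster are assumed) of the medoid definition above:

```python
import numpy as np

def medoid(points):
    """Index of the most centrally located point: minimal sum of squared distances to the others."""
    dist2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    return int(dist2.sum(axis=1).argmin())

cluster = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0]])  # last point is an outlier
print(medoid(cluster))   # 2: a central point of the cluster; the outlier shifts the mean, not the medoid
```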

Page 56

Conclusion

Partitioning algorithms:

- converge quickly (towards a local minimum of the intra-class inertia),
- but:
  - need the number of clusters K to be known,
  - depend on the seed points (one remedy: run K-means with many different sets of seed points).


Page 58

Principle

- Agglomerative hierarchical clustering: initially each individual is a cluster and, iteratively, clusters are merged together.
- Divisive hierarchical clustering: initially all individuals are in one cluster and, iteratively, clusters are divided.

Page 59

Agglomerative hierarchical clustering

1. Initialization (k = n): the first partition has n clusters:

$$P_n = \{\{1\}, \dots, \{n\}\} .$$

2. Agglomeration (k ∈ {n − 1, …, 2}): $P_k$ is built by agglomerating the 2 closest clusters (in the sense of an ultrametric distance ∇) of $P_{k+1}$.

3. End (k = 1): the last partition has only 1 cluster:

$$P_1 = \{\{1, \dots, n\}\} .$$

We obtain a binary hierarchical tree.
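As a sketch (not from the slides), the same bottom-up procedure is available in SciPy; single linkage and illustrative data are assumed here, and other linkage strategies are discussed below:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])  # hypothetical data

# Agglomerative clustering: start from n singletons and merge the 2 closest clusters at each step.
Z = linkage(X, method="single", metric="euclidean")   # encodes the (n - 1) merges, i.e. the binary tree

labels = fcluster(Z, t=2, criterion="maxclust")       # cut the tree into 2 clusters
print(labels)
```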

Page 60

Illustration

Page 61

Hierarchy

$\mathcal{H}$ is a hierarchy on $\Omega$ if:

- $\Omega \in \mathcal{H}$,
- $\forall i \in \Omega: \{i\} \in \mathcal{H}$,
- $\forall (H_1, H_2) \in \mathcal{H}^2: H_1 \cap H_2 \in \{\emptyset, H_1, H_2\}$.

Page 62

Hierarchy height

A height (level) for the hierarchy $\mathcal{H}$ is a function $h: \mathcal{H} \to \mathbb{R}_+$ such that:

- $\forall H \in \mathcal{H}: h(H) = 0 \Leftrightarrow H \in \{\{1\}, \dots, \{n\}\}$,
- $\forall (H_1, H_2) \in \mathcal{H}^2: H_1 \subset H_2 \Rightarrow h(H_1) \leq h(H_2)$.

It is visualized with a dendrogram.

Page 63

Linkage strategy

- The choice of the linkage strategy ∇ is sensitive.
- ∇ is an ultrametric distance.

Page 64

Some linkage strategies

For 2 clusters A and B:

- Single linkage:

$$\nabla(A, B) = \min_{i \in A,\, j \in B} d(i, j) .$$

- Complete linkage:

$$\nabla(A, B) = \max_{i \in A,\, j \in B} d(i, j) .$$

- Average linkage:

$$\nabla(A, B) = \frac{1}{\operatorname{Card}(A)\,\operatorname{Card}(B)} \sum_{i \in A,\, j \in B} d(i, j) .$$
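A minimal NumPy sketch (not from the slides; the two clusters are illustrative) of the three linkage values between clusters A and B:

```python
import numpy as np

A = np.array([[0.0, 0.0], [0.3, 0.2], [0.1, 0.4]])   # hypothetical cluster A
B = np.array([[2.0, 2.0], [2.5, 1.8]])               # hypothetical cluster B

# Pairwise Euclidean distances d(i, j) for i in A, j in B.
d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

single   = d.min()    # nearest pair of points
complete = d.max()    # farthest pair of points
average  = d.mean()   # mean over Card(A) * Card(B) pairs

print(single, complete, average)
```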

Page 65

Example

Page 66

Ward linkage

The Ward linkage is:

$$\nabla(A, B) = \frac{\mu_A \, \mu_B}{\mu_A + \mu_B} \, d^2(G_A, G_B) ,$$

where:

- $\mu_A$ and $\mu_B$ are the weights of A and B,
- $G_A$ and $G_B$ are the barycenters of A and B,
- d is the Euclidean distance.

Page 67

Interpretation

If A and B are merged, the interclass inertia variation is:

$$\Delta I_{inter} = \mu_A \, d^2(G_A, G) + \mu_B \, d^2(G_B, G) - (\mu_A + \mu_B) \, d^2(G_{A \cup B}, G) .$$

Huygens' theorem gives:

$$\Delta I_{inter} = \frac{\mu_A \mu_B}{\mu_A + \mu_B} \, d^2(G_A, G_B) .$$

Page 68

Dendrogram

- The dendrogram is a tree representation of the hierarchical clustering.
- Each level (height) shows the clusters considered.
- Leaves: the individual clusters.
- Root: a single cluster.

Page 69

Example I

[Figure: 8 points A, B, C, D, E, F, G, H plotted in the (x, y) plane.]

Page 70

Example II

- Euclidean distance.
- Single linkage.

Distance matrix:

      A     B     C     D     E     F     G
B   0.50
C   0.25  0.56
D   5.00  4.72  4.80
E   5.78  5.55  5.57  1.00
F   4.32  4.23  4.07  2.01  2.06
G   4.92  4.84  4.68  2.06  1.81  0.61
H   5.00  5.02  4.75  3.16  2.90  1.28  1.12

Page 71

Example III

[Figure: first merge of the single-linkage clustering: A and C are joined at height 0.25, the smallest entry of the distance matrix; scatter plot, distance matrix and partial dendrogram (height axis).]

Page 72

Example IV

[Figure: second merge: the cluster {A, C} is joined with B at height 0.50; updated distance matrix and partial dendrogram.]

Page 73

Example V

Successive merges (single linkage) and the corresponding partitions; after each merge the distance matrix is updated with the single-linkage distances between the remaining clusters:

1st merge: {A, C} at 0.25 — partition AC, B, D, E, F, G, H.
2nd merge: {A, C} ∪ {B} at 0.50 — partition ABC, D, E, F, G, H.
3rd merge: {F, G} at 0.61 — partition ABC, D, E, FG, H.
4th merge: {D, E} at 1.00 — partition ABC, DE, FG, H.
5th merge: {F, G} ∪ {H} at 1.12 — partition ABC, DE, FGH.
6th merge: {D, E} ∪ {F, G, H} at 1.81 — partition ABC, DEFGH.
7th merge: {A, B, C} ∪ {D, E, F, G, H} at 4.07 — partition ABCDEFGH.

Page 74

Choice of the number of clusters

A classic method consists in finding an “elbow” in the plot of:

$$\forall k \in \{2, \dots, n\}: \quad \frac{I_{inter}(C_{k+1}) - I_{inter}(C_k)}{I_{tot}} ,$$

where $C_k$ denotes the clustering into k clusters.
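A sketch of the elbow idea (not from the slides; data and parameters are illustrative) using scikit-learn's K-means: track how the explained inter-class inertia ratio evolves with k and look for the bend:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0.0, 3.0, 6.0)])  # 3 hypothetical blobs

I_tot = ((X - X.mean(axis=0)) ** 2).sum()
ratios = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    I_intra = km.inertia_                     # within-cluster sum of squares
    ratios.append((I_tot - I_intra) / I_tot)  # I_inter / I_tot, since I_tot = I_intra + I_inter
print(np.round(ratios, 3))                    # the gains flatten after k = 3 (the "elbow")
```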

Page 75

Illustration

Page 76

Illustration

Page 77

Illustration

Page 78

Terminology

- AGNES: AGglomerative NESting.
- DIANA: DIvisive ANAlysis.

Page 79

Conclusion

Hierarchical clustering algorithms:

- do not depend on seed points,
- but:
  - are computationally costly,
  - can be far from the optimum if the number of merges n − K is too large.


Page 81

Conclusion

Interpret the clusters:

- using exploratory analysis of the variables (active or supplementary) by cluster,
- using the medoids of the clusters.

Page 82

References

Tuffery, S. (2011). Data Mining and Statistics for Decision Making. Wiley Series in Computational Statistics. Wiley.