Data Analysis, Statistics, Machine Learning
Transcript of Data Analysis, Statistics, Machine Learning
Leland Wilkinson, Adjunct Professor, UIC Computer Science; Chief Scientist, H2O.ai; [email protected]
Grouping
o We can create groups of variables or groups of cases
o These methods involve what we call Cluster Analysis
o Hierarchical methods make trees of nested clusters
o Non-hierarchical methods group cases into k clusters
o These k clusters may be discrete or overlapping
o Two considerations are especially important for hierarchical methods:
o Distance/dissimilarity measure
o Agglomeration or splitting rule
o The collection of clustering methods is huge
o Early applications were for numerical taxonomy in biology
Copyright © 2016 Leland Wilkinson
Grouping Non-hierarchical clustering
k-means
o We seek to minimize SSW
o 1. Choose k
o 2. Initialize k centroids
o 3. Assign every point y to the nearest centroid (squared Euclidean distance)
o 4. Compute SSW
o 5. Repeat 3 and 4 until SSW does not improve (get smaller)
Notice the similarity to ANOVA/MANOVA. k-means is a family of algorithms – many variations.
SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - m_j)^2    (sum of squares within clusters)
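One common variant of the loop above (Lloyd's algorithm) can be sketched in stdlib Python; this is a minimal illustration, not a production implementation, and the data and k below are invented:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's k-means: alternate assignment and centroid update until SSW stops improving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 2: initialize k centroids
    best_ssw = float("inf")
    for _ in range(iters):
        # step 3: assign each point to its nearest centroid (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # update each centroid to the mean of its cluster
        for j, c in enumerate(clusters):
            if c:
                centroids[j] = tuple(sum(x) / len(c) for x in zip(*c))
        # step 4: compute SSW
        ssw = sum(sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
                  for j, c in enumerate(clusters) for p in c)
        if ssw >= best_ssw:                    # step 5: stop when SSW no longer improves
            break
        best_ssw = ssw
    return centroids, clusters, best_ssw

# two obvious groups in the plane
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters, ssw = kmeans(pts, k=2)
```

On this toy data the loop converges to the two visible groups regardless of which points the seed picks as starting centroids.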
Grouping Non-hierarchical clustering
Geometry of k-means: Voronoi tessellation. Clusters are convex.
Grouping Aspects of the algorithm
Quantization: even if there are no clusters, k-means can be used to partition
Detection of outliers: compute outliers relative to cluster centroids
Clusters are convex
If a case is a weighted average of cases in the cluster, the case also lies in the cluster
k-means demo: https://www.cs.uic.edu/~wilkinson/Applets/cluster.html
Grouping Non-hierarchical clustering
Picking k
o Run k-means from 1 to kmax; plot SSW against k and look for an elbow
o Or (Hartigan, 1975): SSW is distributed approximately as chi-square, and so are the differences between SSW(k-1) and SSW(k); examine chi-square statistics for these differences (Hamerly and Elkan did something like this in 2003; they didn't seem to be aware of Hartigan)
o Or compute a measure of clumpiness from the minimum spanning tree (MST): Hartigan's RUNT statistic, or Stuetzle & Nugent
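The elbow heuristic can be sketched in stdlib Python: run a basic k-means for k = 1, 2, 3 on invented two-cluster data and compare the successive drops in SSW (this is only the plot-and-look version, not Hartigan's chi-square test):

```python
import random

def ssw_for_k(points, k, seed=0):
    """Run a basic k-means (fixed number of Lloyd iterations) and return SSW for this k."""
    rng = random.Random(seed)
    cent = rng.sample(points, k)
    for _ in range(50):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, cent[j])))
            groups[j].append(p)
        cent = [tuple(sum(x) / len(g) for x in zip(*g)) if g else cent[j]
                for j, g in enumerate(groups)]
    return sum(sum((a - b) ** 2 for a, b in zip(p, cent[j]))
               for j, g in enumerate(groups) for p in g)

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
curve = [ssw_for_k(pts, k) for k in (1, 2, 3)]
drops = [curve[i] - curve[i + 1] for i in range(2)]
# the first drop dwarfs the second: the elbow is at k = 2
```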
Grouping Non-hierarchical clustering
Picking starting centroids
o Random centroids (ugh!)
o Hartigan: run k-means from 1 to k, splitting on the variable with the largest variance from the previous solution
Evaluating k-means solutions
o Can do F-tests on each variable (forget about p values)
o Can do MANOVA on groups (you're kidding, right?)
o Use visualization (we'll talk about that later)
Grouping o Spectral clustering
o 1. Compute k-nearest-neighbor adjacency matrix A
o 2. Choose bandwidth t
o 3. Compute heat kernel matrix D
o 4. Compute Laplacian matrix L
o 5. Compute eigendecomposition of L
o 6. Cluster last two (or more) eigenvectors of L
o This algorithm is due to Belkin and Niyogi
o There are variations
o Spectral clustering works on a nonlinear projection of the points
o So it can deal with cluster shapes that are intertwined
d_{i,j} = \exp(-a_{i,j}^2 / t), \qquad L = D - A
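The graph-building steps can be sketched in stdlib Python. Note on notation: the slide writes the Laplacian as L = D - A; in Belkin and Niyogi's formulation the heat-kernel weights form a matrix W and D is the diagonal matrix of its row sums, with L = D - W, and that is the convention used below. Only the graph and Laplacian are built here; the eigendecomposition in the later steps would come from a numerical library. The data, k, and t are invented:

```python
import math

def heat_kernel_laplacian(points, k=2, t=1.0):
    """Build a symmetric k-nearest-neighbor graph, weight its edges with the
    heat kernel w_ij = exp(-||y_i - y_j||^2 / t), and return the graph
    Laplacian L = D - W (D = diagonal matrix of row sums of W)."""
    n = len(points)
    d2 = [[sum((a - b) ** 2 for a, b in zip(points[i], points[j])) for j in range(n)]
          for i in range(n)]
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        # connect i to its k nearest neighbors (symmetrized)
        for j in sorted((m for m in range(n) if m != i), key=lambda m: d2[i][m])[:k]:
            adj[i][j] = adj[j][i] = True
    W = [[math.exp(-d2[i][j] / t) if adj[i][j] else 0.0 for j in range(n)]
         for i in range(n)]
    L = [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    return L

L = heat_kernel_laplacian([(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)])
```

A quick sanity check on any graph Laplacian: it is symmetric, its rows sum to zero, and its diagonal (the degrees) is positive.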
Grouping o Hierarchical clustering (agglomerative)
o 1. Start with each point in a cluster of its own o 2. Until there is only one cluster
o (a) Find the closest pair of clusters o (b) Merge them
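The two-step loop above translates directly into stdlib Python, here with single linkage as the cluster-distance rule. This is a naive O(n^3) sketch with invented points, not an efficient implementation:

```python
def agglomerate(points, linkage):
    """Agglomerative clustering: start with singleton clusters, then repeatedly
    find the closest pair of clusters and merge them. Returns the cluster
    count after each merge."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    clusters = [[p] for p in points]              # 1. each point in its own cluster
    merges = []
    while len(clusters) > 1:                      # 2. until there is only one cluster
        pairs = [(linkage(clusters[i], clusters[j], dist), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)                      # (a) find the closest pair of clusters
        clusters[i] = clusters[i] + clusters[j]   # (b) merge them
        del clusters[j]
        merges.append(len(clusters))
    return merges

def single(A, B, dist):                           # single linkage: min pairwise distance
    return min(dist(a, b) for a in A for b in B)

merges = agglomerate([(0, 0), (0, 1), (5, 5), (5, 6)], single)
```

Four points produce three merges, with the cluster count dropping 3, 2, 1.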
Grouping o Hierarchical clustering
o Two essential components o 1. Distance method for computing the distance between points o 2. Linkage (amalgamation) method for computing the distance between clusters
o There are many versions of each, so the number of possible algorithms is enormous.
o Strengths o You can use any distance method o Works well with non-convex clusters o Hierarchical, so clusters do not have to be disjoint
o Weaknesses o A pig in both space and time o There are workarounds, but they are not pretty
Grouping o Hierarchical clustering
o Distance methods o Euclidean
o City block
o Chebyshev
o Cosine (1 – Pearson R)
o Jaccard
. . .
Euclidean: \sqrt{\sum_i (x_{ji} - x_{ki})^2}
City block: \sum_i |x_{ji} - x_{ki}|
Chebyshev: \max_i |x_{ji} - x_{ki}|
Cosine: 1 - \frac{x_j \cdot x_k}{\|x_j\| \, \|x_k\|}
Jaccard: 1 - \frac{|A \cap B|}{|A \cup B|}
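The five distance formulas above translate directly into stdlib Python (a minimal sketch; the Jaccard version takes sets rather than numeric vectors):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def jaccard(A, B):          # A and B are sets of attributes
    return 1 - len(A & B) / len(A | B)
```

For the vector (3, 4) against the origin these give 5, 7, and 4 respectively; orthogonal vectors have cosine distance 1.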
Grouping o Hierarchical clustering
o Linkage methods o Single
o Complete
o Average
o Centroid
. . .
Single: \min\{d_{a,b} : a \in A, b \in B\}
Complete: \max\{d_{a,b} : a \in A, b \in B\}
Average: \sum_{a \in A} \sum_{b \in B} d_{a,b} / (n_a n_b)
Centroid: \|c_A - c_B\|
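The four linkage rules above, as stdlib Python functions that each take two clusters and a point-distance function (a minimal sketch with invented clusters):

```python
import math

def single(A, B, d):
    return min(d(a, b) for a in A for b in B)

def complete(A, B, d):
    return max(d(a, b) for a in A for b in B)

def average(A, B, d):
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

def centroid(A, B, d):
    cA = tuple(sum(x) / len(A) for x in zip(*A))
    cB = tuple(sum(x) / len(B) for x in zip(*B))
    return d(cA, cB)

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

A = [(0, 0), (0, 2)]
B = [(4, 0), (4, 2)]
```

By construction single <= average <= complete for any pair of clusters.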
Grouping o Hierarchical clustering
o Linkage methods o Single: long, stringy clusters
o Complete: convex clusters having comparable diameters
o Average: clusters with small variance joined first
o Centroid: slightly more robust to outliers
o Boris Mirkin has developed a full continuum between Single and Complete
. . .
Grouping o Hierarchical clustering
o Ward's method o The distance between two clusters, A and B, is how much the sum of squares within clusters will increase when we merge them
o Choose the minimum merging distance for the next join
o Similar to k-means in that it favors convex clusters o Euclidean distances with Ward's method yield solutions similar to k-means o Sensitive to outliers and departures from normality
o Joe Ward has a high school named after him o He is passionate about helping secondary students learn statistics
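Ward's merging distance has a well-known closed form: merging A and B increases SSW by (n_A n_B)/(n_A + n_B) * ||c_A - c_B||^2. A small stdlib Python check of that identity on invented clusters:

```python
def ssw(cluster):
    """Sum of squared distances of points to their cluster centroid."""
    c = tuple(sum(x) / len(cluster) for x in zip(*cluster))
    return sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in cluster)

def ward_cost(A, B):
    """Increase in within-cluster sum of squares caused by merging A and B."""
    return ssw(A + B) - ssw(A) - ssw(B)

A = [(0, 0), (0, 2)]
B = [(4, 0), (4, 2), (6, 1)]
nA, nB = len(A), len(B)
cA = tuple(sum(x) / nA for x in zip(*A))
cB = tuple(sum(x) / nB for x in zip(*B))
# closed form: (nA*nB)/(nA+nB) * squared distance between centroids
closed_form = nA * nB / (nA + nB) * sum((a - b) ** 2 for a, b in zip(cA, cB))
```

The direct computation and the closed form agree, which is why Ward's method can be run efficiently without recomputing SSW from scratch at every merge.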
Grouping o Hierarchical clustering
o Choosing the number of clusters o Can't just cut the tree at some distance o Must consider separation of density modes
o Look at the length of the branch above the merged node
Grouping o Hierarchical clustering
o Choosing the number of clusters o Can compute margin to other clusters o Silhouette method of Rousseeuw works as well
o Includes tightness as well as separation
o Methods based on Hartigan's RUNT statistic work well too (Stuetzle) o Forget SS(Within) methods unless you have Gaussian clusters o Forget looking for an elbow in the SS plot
o You never see them with real data
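Rousseeuw's silhouette combines tightness (a, the mean distance from a point to its own cluster) and separation (b, the mean distance to the nearest other cluster) as s = (b - a) / max(a, b). A stdlib-only sketch, with invented points and labelings:

```python
def silhouette(points, labels):
    """Mean silhouette width over all points in non-singleton clusters."""
    def d(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    scores = []
    for i, p in enumerate(points):
        own = [d(p, q) for j, q in enumerate(points) if j != i and labels[j] == labels[i]]
        if not own:
            continue                      # singleton clusters get no score here
        a = sum(own) / len(own)           # tightness: mean distance within own cluster
        b = min(                          # separation: mean distance to nearest other cluster
            sum(d(p, q) for j, q in enumerate(points) if labels[j] == lab) / labels.count(lab)
            for lab in set(labels) if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette(pts, [0, 0, 0, 1, 1, 1])   # the two real groups
bad = silhouette(pts, [0, 1, 0, 1, 0, 1])    # groups deliberately scrambled
```

The correct labeling scores close to 1; the scrambled labeling scores much lower.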
Grouping Visualizing clustering
Trees
[Figure: hierarchical clustering tree (dendrogram) of countries, from Switzerland, Canada, and Sweden at one end to Afghanistan and Gambia at the other.]
The distance between two leaves is proportional to the position of the closest node joining them.
Grouping Visualizing clustering
Polar Trees
[Figure: polar tree clustering words from Moby-Dick (whale, sea, captain, harpoon, leviathan, ...).]
Useful when there are numerous leaves
Grouping Visualizing clustering
SPLOM
[Figure: scatterplot matrix (SPLOM) of crime variables – Murder, Rape, Assault, Robbery, Burglary, Larceny, Autotheft – with points colored by cluster (1-4).]
Coloring can reveal which projections have the greatest cluster separation
Grouping Visualizing clustering
Profiles
[Figure: profile plot of the crime variables (Murder, Rape, Assault, Robbery, Burglary, Larceny, Autotheft) by cluster (1-4).]
The bands represent confidence intervals on each variable in each group. More informative than parallel coordinates.
Grouping Visualizing clustering
Polar Profiles
[Figure: polar profile plot of the crime variables by cluster (1-4).]
Good for many variables
Grouping Visualizing clustering
Parallel Coordinates
[Figure: parallel-coordinates plot of the crime variables, colored by cluster (1-4).]
A poor cluster display method unless clusters are tight. Confidence bands are more useful.
Grouping Visualizing clustering
Wroclaw diagram (level curves onto MDS)
Napior (1972)
Effective for hierarchical clusterings. If contours are not tight, this reflects badness of fit in 2D.
Grouping Visualizing clustering
Projections
[Figure: four scatterplots of the clusters (1-4) on principal components 1 and 2, outlined in different ways.]
Convex hulls (upper right), alpha shapes (lower left), kernels (lower right). Alpha shapes are nonconvex hulls; they reveal the shape of clusters better. Best to project these on MDS, but principal components will do OK.
Grouping Visualizing clustering
Box Plots
[Figure: box plots of distance to centroid for Clusters 1-4.]
Hofmann, Kafadar, and Wickham's letter-value box plots reveal outliers better than ordinary box plots.
Grouping Visualizing clustering
Cluster Heatmap: http://www.datavis.ca/papers/HeatmapHistory-tas.2009.pdf
[Figure: cluster heatmap of hsa-miR microRNA expression, with dendrograms ordering both rows and columns.]
Grouping Visualizing clustering
Additive Trees
[Figure: additive tree with leaves labeled A-Z, edges shaded by stress contribution (0 to 0.016).]
The distance between two leaves is proportional to the shortest path joining them.
Grouping Visualizing clustering
Minimum Spanning Tree (MST)
[Figure: minimum spanning tree connecting the 50 US states.]
This is the shortest (or one of the shortest) trees spanning all the objects. Successively deleting the longest edges in the MST defines a single-linkage tree.
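The MST-to-single-linkage connection can be sketched in stdlib Python: build the MST with Prim's algorithm, then delete the k-1 longest edges to leave k single-linkage clusters. The points below are invented stand-ins for the state data:

```python
def mst_edges(points):
    """Prim's algorithm: grow the minimum spanning tree one vertex at a time."""
    def d(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    n = len(points)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        w, i, j = min((d(points[i], points[j]), i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        in_tree.add(j)
        edges.append((w, i, j))
    return edges

def cut_longest(points, edges, k):
    """Deleting the k-1 longest MST edges leaves k single-linkage clusters."""
    keep = sorted(edges)[:len(edges) - (k - 1)]
    # union-find over the surviving edges to label connected components
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, i, j in keep:
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = cut_longest(pts, mst_edges(pts), k=2)
```

Dropping the single long bridge edge separates the two tight groups, exactly as cutting a single-linkage tree at that height would.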
Grouping Visualizing clustering
Maps
[Figure: US map with states colored by cluster (1-2).]
Drawing clusters on maps is a natural way to display geographic clusters. Clustering by state, however, can be misleading: the level of aggregation conceals city-rural differences.
Grouping o References
o Fisher, L. and Van Ness, J.W. (1971). Admissible clustering procedures. Biometrika, 58, 91-104.
o Gower, J.C. (1967). A comparison of some methods of cluster analysis. Biometrics, 23, 623-637.
o Gruvaeus, G. and Wainer, H. (1972). Two additions to hierarchical cluster analysis. The British Journal of Mathematical and Statistical Psychology, 25, 200-206.
o Hartigan, J.A. (1975). Clustering Algorithms. New York: John Wiley & Sons.
o Hartigan, J.A. and Mohanty, S. (1992). The RUNT test for multimodality. Journal of Classification, 9, 63-70.
o Johnson, S.C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241-254.
o Lance, G.N. and Williams, W.T. (1967). A general theory of classificatory sorting strategies, I. Hierarchical systems. Computer Journal, 9, 373-380.
o Milligan, G.W. (1980). An examination of the effects of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325-342.
o Milligan, G.W. and Cooper, M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159-179.
o Milligan, G.W. and Cooper, M.C. (1988). A study of standardization of variables in cluster analysis. Journal of Classification, 45, 181-204.
o Sattath, S. and Tversky, A. (1977). Additive similarity trees. Psychometrika, 42, 319-345.
o Sokal, R.R. and Sneath, P.H.A. (1963). Principles of Numerical Taxonomy. San Francisco: W.H. Freeman and Company.
o Stuetzle, W. and Nugent, R. (2010). A generalized single linkage algorithm for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics, 19, 397-418.
o Ward, J.H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236-244.