Download - 01.04 Clustering Examples

Transcript
Page 1: 01.04 Clustering Examples

8/13/2019 01.04 Clustering Examples

http://slidepdf.com/reader/full/0104-clustering-examples 1/14

Cluster Analysis

Page 2: 01.04 Clustering Examples

8/13/2019 01.04 Clustering Examples

http://slidepdf.com/reader/full/0104-clustering-examples 2/14

Cluster Analysis

Page 3: 01.04 Clustering Examples

8/13/2019 01.04 Clustering Examples

http://slidepdf.com/reader/full/0104-clustering-examples 3/14

Cluster Analysis (“data segmentation” )

is an exploratory method for identifying

homogenous groups (“clusters”) of records

Similar records should belong to the same cluster

Dissimilar records should belong to different

clusters

Page 4: 01.04 Clustering Examples

8/13/2019 01.04 Clustering Examples

http://slidepdf.com/reader/full/0104-clustering-examples 4/14

case  sex  glasses  moustac

he  smile  hat 

2  f   n  n  y  n 3  m  y  n  n  n 4  m  n  n  n  n 5  m  n  n  y  n 6  m  n  y  n  y 7  m  y  n  y  n 8  m  n  n  y  n 9  m  y  y  y  n 

10  f   n  n  n  n 11  m  n  y  n  n 12  f   n  n  n  n 

From http://obelia.jde.aca.mmu.ac.uk/multivar/ca.htm (no longer)

Page 5: 01.04 Clustering Examples

8/13/2019 01.04 Clustering Examples

http://slidepdf.com/reader/full/0104-clustering-examples 5/14

Example 1: Exploring Stars(from Data Mining Techniques by Berry & Linoff)

Astronomers study the relationship between

brightness of stars and their temperatures

   L   u   m

   i   n   o   s   i   t   y

Temperature

Page 6: 01.04 Clustering Examples

8/13/2019 01.04 Clustering Examples

http://slidepdf.com/reader/full/0104-clustering-examples 6/14

Example 2: Fitting the Troops(from Data Mining Techniques by Berry & Linoff)

The US army recently commissioned a study on how toredesign the uniforms of female soldiers. The army’s goal

is to reduce the number of different uniform sizes that have

to be kept in inventory while still providing each soldier with

well-fitting khakis.

Researchers Ashdown and Paal @ Cornell University

designed a new set of sizes based on the actual shapes of

women in the army. Unlike traditional clothing size systems,

the new sizes are not an ordered set of graduated sizes

where all dimensions increase together.

Instead, they came up with sizes that fit particular body

types (e.g., short-legged, small-waisted, large-busted

women with long torsos, average arms, broad shoulders,

and skinny necks).

Page 7: 01.04 Clustering Examples

8/13/2019 01.04 Clustering Examples

http://slidepdf.com/reader/full/0104-clustering-examples 7/14

Market Segmentation

Segment customers based on demographic info and transaction

history in order to tailor marketing strategy for each segment

Page 8: 01.04 Clustering Examples

8/13/2019 01.04 Clustering Examples

http://slidepdf.com/reader/full/0104-clustering-examples 8/14

Investment: 

Cluster securities based on financial performance info(return, volatility, beta) and other info (industry andmarket capitalization) to create a balanced portfolio

Industry Analysis: For a given industry, cluster firms based on {growth rate,profitability, market size, product range, presence invarious international markets}, to understand industry

structure (determine competitors)

Lots more @ http://www.clustan.com/clustering_examples.html 

Page 9: 01.04 Clustering Examples

8/13/2019 01.04 Clustering Examples

http://slidepdf.com/reader/full/0104-clustering-examples 9/14

UG Business ProgramsUniversities Clustering.xls 

Data for 25 undergraduateprograms at business schools inUS universities in 1995.

Univ SAT Top10 Accept SFRatio Expenses GradRate

Brown 1310 89 22 13 22,704 94

CalTech 1415 100 25 6 63,575 81

CMU 1260 62 59 9 25,026 72

Columbia 1310 76 24 12 31,510 88

Cornell 1280 83 33 13 21,864 90

Dartmouth 1340 89 23 10 32,162 95

Duke 1315 90 30 12 31,585 95

Georgetown 1255 74 24 12 20,126 92

Harvard 1400 91 14 11 39,525 97

JohnsHopkins 1305 75 44 7 58,691 87

MIT 1380 94 30 10 34,870 91

Northwestern 1260 85 39 11 28,052 89

NotreDame 1255 81 42 13 15,122 94PennState 1081 38 54 18 10,185 80

Princeton 1375 91 14 8 30,220 95

Purdue 1005 28 90 19 9,066 69

Stanford 1360 90 20 12 36,450 93

TexasA&M 1075 49 67 25 8,704 67

UCBerkeley 1240 95 40 17 15,140 78

UChicago 1290 75 50 13 38,380 87

UMichigan 1180 65 68 16 15,470 85

UPenn 1285 80 36 11 27,553 90

UVA 1225 77 44 14 13,349 92UWisconsin 1085 40 69 15 11,857 71

Yale 1375 95 19 11 43,514 96

This dataset excludes image variables 

(student satisfaction, employersatisfaction, deans’ opinions)

Student

quality  Program Placement

Page 10: 01.04 Clustering Examples

8/13/2019 01.04 Clustering Examples

http://slidepdf.com/reader/full/0104-clustering-examples 10/14

Why cluster?

How can clustering help a prospective applicant?

How can clustering help a business school dean?

Page 11: 01.04 Clustering Examples

8/13/2019 01.04 Clustering Examples

http://slidepdf.com/reader/full/0104-clustering-examples 11/14

Cluster Analysis:

Backstage

Page 12: 01.04 Clustering Examples

8/13/2019 01.04 Clustering Examples

http://slidepdf.com/reader/full/0104-clustering-examples 12/14

Simple Clustering: 1-2 variables

Visual inspection of data (histogram, scatterplot)

   L   u   m

   i   n   o   s   i   t   y

Temperature

Page 13: 01.04 Clustering Examples

8/13/2019 01.04 Clustering Examples

http://slidepdf.com/reader/full/0104-clustering-examples 13/14

Clustering with >2 variables

Two approaches:

Distance: compute multivariate distance

between records, and group “close” records 

Homogeneity: group records to increase within-

group homogeneity 

Page 14: 01.04 Clustering Examples

8/13/2019 01.04 Clustering Examples

http://slidepdf.com/reader/full/0104-clustering-examples 14/14

Two Types of Clustering Algorithms

Hierarchical methods - agglomerative:Begin with n clusters; sequentially mergesimilar clusters until 1 cluster is left.

• Useful when goal is to arrange the

clusters into a natural hierarchy.• Requires specifying distance measure

Clustering music on Last.fm (from anthony.liekens.net)

Non-Hierarchical methods:

Pre-specified number of clusters,assign records to each of the clusters.• Requires specifying # clusters• Computationally cheap