Databases and Data Mining
Lecture 3: Descriptive Data Mining
Peter van der Putten (putten_at_liacs.nl)
Course Outline
• Objective
  – Understand the basics of data mining
  – Gain understanding of the potential for applying it in the bioinformatics domain
  – Hands-on experience
• Schedule
• Evaluation
  – Practical assignment (2nd) plus take-home exercise
• Website– http://www.liacs.nl/~putten/edu/dbdm05/
Date       Time          Room
4-Nov-05   13.45-15.30   174       Lecture
18-Nov-05  13.45-15.30   413       Lecture
           15.45-17.30   306/308   Practical Assignments
25-Nov-05  13.45-15.30   413       Lecture
2-Dec-05   13.45-15.30   413       Lecture
           15.45-17.30   306/308   Practical Assignments
Agenda Today: Descriptive Data Mining
• Before Starting to Mine….
• Descriptive Data Mining
  – Dimension Reduction & Projection
  – Clustering
    • Hierarchical clustering
    • K-means
    • Self-organizing maps
  – Association rules
    • Frequent item sets
    • Association rules
    • APRIORI
    • Bioinformatics case: FSG for frequent subgraph discovery
Before starting to mine….
• Pima Indians Diabetes Data
  – X = body mass index
  – Y = age
Before starting to mine….
• Attribute Selection
  – This example: InfoGain by attribute
  – Keep the most important ones
(Bar chart of InfoGain per attribute. Attributes shown: plasma glucose concentration at 2 hours in an oral glucose tolerance test, body mass index (weight in kg/(height in m)^2), age (years), 2-hour serum insulin (mu U/ml), triceps skin fold thickness (mm), number of times pregnant, diabetes pedigree function, diastolic blood pressure (mm Hg).)
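To make the ranking step concrete, here is a minimal sketch of attribute ranking in Python (not from the lecture). It assumes a local pima.csv with the eight attribute columns and a "class" column, and uses scikit-learn's mutual information estimate as a stand-in for Weka's InfoGain ranking.

```python
# Hypothetical sketch: rank the Pima attributes by estimated information gain.
# "pima.csv" and the "class" column name are illustrative assumptions.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

data = pd.read_csv("pima.csv")
X = data.drop(columns=["class"])
y = data["class"]

# mutual_info_classif estimates the mutual information between each attribute
# and the class, playing the same role as an InfoGain-by-attribute ranking.
scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(X.columns, scores), key=lambda t: t[1], reverse=True):
    print(f"{name}: {score:.3f}")
```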
Before starting to mine….
• Types of Attribute Selection
  – Univariate versus multivariate (subset selection)
    • The fact that attribute x is a strong univariate predictor does not necessarily mean it will add predictive power to a set of predictors already used by a model
  – Filter versus wrapper
    • Wrapper methods involve the subsequent learner (classifier or other)
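To contrast the two flavours, here is a minimal wrapper-style sketch (an illustration, not the course's tooling): greedy forward selection that scores each candidate attribute by the cross-validated accuracy of the learner itself. It reuses the X and y from the previous snippet; the classifier and stopping rule are assumptions.

```python
# Wrapper-style forward selection sketch: the subsequent learner (here a naive
# Bayes classifier) is used to score candidate attribute subsets.
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def forward_select(X, y, max_features=4):
    selected, remaining = [], list(X.columns)
    while remaining and len(selected) < max_features:
        # score each candidate by cross-validated accuracy of the learner
        # trained on the already selected attributes plus the candidate
        scores = {c: cross_val_score(GaussianNB(), X[selected + [c]], y, cv=5).mean()
                  for c in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_select(X, y))
```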
Dimension Reduction
• Projecting high-dimensional data into a lower dimension
  – Principal Component Analysis
  – Independent Component Analysis
  – Fisher Mapping, Sammon’s Mapping, etc.
  – Multi-Dimensional Scaling
• See Pattern Recognition Course (Duin)
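As a concrete illustration (not part of the original slides), a minimal PCA projection sketch; the placeholder data and the use of scikit-learn are assumptions for the example.

```python
# Minimal sketch: project high-dimensional data onto its first two principal
# components, e.g. to allow a 2-D scatter plot of the instances.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 8)                    # placeholder: (instances, attributes)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to attribute scale
X_2d = PCA(n_components=2).fit_transform(X_scaled)
print(X_2d.shape)                             # (100, 2): ready for plotting
```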
Data Mining Tasks: Clustering
(Scatter plot with axes e.g. age and weight.)
Clustering is the discovery of groups in a set of instances.
Groups are different; instances within a group are similar.
In 2- to 3-dimensional pattern space you could just visualise the data and leave the recognition to a human end user.
In >3 dimensions this is not possible
Clustering Techniques
• Hierarchical algorithms
  – Agglomerative
  – Divisive
• Partition-based clustering
  – K-means
  – Self-Organizing Maps / Kohonen Networks
• Probabilistic model-based
  – Expectation Maximization / Mixture Models
Hierarchical clustering
• Agglomerative / bottom-up
  – Start with single-instance clusters
  – At each step, join the two closest clusters
  – Methods to compute the distance between clusters x and y: single linkage (distance between the closest points in clusters x and y), average linkage (average distance between all points), complete linkage (distance between the furthest points), centroid
  – Distance measure: Euclidean, correlation, etc.
• Divisive / top-down
  – Start with all data in one cluster
  – Split into two clusters based on category utility
  – Proceed recursively on each subset
• Both methods produce a dendrogram
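A minimal sketch of agglomerative clustering with SciPy, illustrating the linkage and distance choices listed above; the placeholder data, the chosen linkage, and the cut into three clusters are illustrative assumptions (the lecture's own figures were produced with other tools).

```python
# Agglomerative clustering sketch: build a linkage tree and draw the dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(20, 5)   # placeholder instances

# method: 'single', 'average' or 'complete' linkage; metric: 'euclidean' or
# 'correlation' distance, as discussed above.
Z = linkage(X, method="average", metric="correlation")
dendrogram(Z)                                      # draws the dendrogram (needs matplotlib)
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
print(labels)
```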
Levels of Clustering
(Dendrogram figure: divisive clustering proceeds top-down, agglomerative clustering bottom-up; Dunham, 2003.)
Hierarchical Clustering Example
• Clustering microarray gene expression data
  – Gene expression measured using microarrays, studied under a variety of conditions
  – On budding yeast Saccharomyces cerevisiae
  – Efficiently groups together genes of known similar function
• Data taken from: Cluster analysis and display of genome-wide expression patterns. Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). PNAS, 95:14863-14868; picture generated with J-Express Pro
Hierarchical Clustering Example
• Method
  – Genes are the instances, samples the attributes!
  – Agglomerative
  – Distance measure = correlation
• Data taken from: Cluster analysis and display of genome-wide expression patterns. Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). PNAS, 95:14863-14868; Picture generated with J-Express Pro
Simple Clustering: K-means
• Pick a number (k) of cluster centers (at random)
  – Cluster centers are sometimes called codes, and the k codes a codebook
• Assign every item to its nearest cluster center
  – E.g. using Euclidean distance
• Move each cluster center to the mean of its assigned items
• Repeat until convergence
  – E.g. until the change in cluster assignments is less than a threshold
(A code sketch of this loop follows after this slide.)
KDnuggets
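A minimal NumPy sketch of the loop described above (an illustration, not code from the lecture): random initial codes, assignment to the nearest code by Euclidean distance, moving each code to the mean of its items, repeated until the assignments stop changing.

```python
# Minimal k-means sketch following the steps on the previous slide.
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    codes = X[rng.choice(len(X), size=k, replace=False)]   # step 1: k random codes
    assignment = None
    while True:
        # step 2: assign every item to its nearest code (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - codes[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            return codes, assignment                        # converged: no reassignments
        assignment = new_assignment
        # step 3: move each code to the mean of its assigned items
        for j in range(k):
            if np.any(assignment == j):
                codes[j] = X[assignment == j].mean(axis=0)

# usage: X is an (n_instances, n_attributes) array
X = np.random.default_rng(1).random((200, 2))
codes, labels = kmeans(X, k=3)
print(codes)
```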
K-means example, step 1
Initially distribute the codes k1, k2, k3 randomly in the X-Y pattern space.
KDnuggets
K-means example, step 2
Assign each point to the closest code.
KDnuggets
K-means example, step 3
Move each code to the mean of all its assigned points.
KDnuggets
K-means example, step 2
Repeat the process – reassign the data points to the codes.
Q: Which points are reassigned?
KDnuggets
K-means example
Repeat the process – reassign the data points to the codes.
Q: Which points are reassigned?
KDnuggets
K-means example
Re-compute the cluster means.
KDnuggets
K-means example
Move the cluster centers to the cluster means.
KDnuggets
K-means clustering summary
Advantages
• Simple, understandable
• Items automatically assigned to clusters

Disadvantages
• Must pick the number of clusters beforehand
• All items are forced into a cluster
• Sensitive to outliers

Extensions
• Adaptive k-means
• K-medoids (based on the median instead of the mean)
  – e.g. 1, 2, 3, 4, 100: mean 22, median 3
Biological Example
• Clustering of yeast cell images
  – Two clusters are found
  – Left cluster: primarily cells with a thick capsule; right cluster: thin capsule
    • Caused by media; a proxy for sick vs. healthy
Self Organizing Maps (Kohonen Maps)
• Claim to fame
  – Simplified models of cortical maps in the brain
  – Things that are near in the outside world link to areas that are near in the cortex
  – For a variety of modalities: touch, motor, …, up to echolocation
  – Nice visualization
• From a data mining perspective:
  – SOMs are simple extensions of k-means clustering
  – Codes are connected in a lattice
  – In each iteration, codes neighboring the winning code in the lattice are also allowed to move
(A minimal code sketch follows below.)
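A minimal sketch of the SOM idea as described above: k-means-style updates where lattice neighbours of the winning code also move, using a Gaussian neighbourhood that shrinks over time. Grid size, learning rate, decay factors and the toy Gaussian data are all illustrative assumptions.

```python
# SOM training sketch: codes live on a lattice; the winner and its lattice
# neighbours are pulled towards each presented input.
import numpy as np

def train_som(X, grid=(10, 10), epochs=20, lr=0.5, sigma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    codes = rng.random((rows, cols, X.shape[1]))          # one code per lattice node
    # lattice coordinates of every node, used for the neighbourhood function
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for _ in range(epochs):
        for x in rng.permutation(X):
            # find the winning code (closest to x in input space)
            dists = np.linalg.norm(codes - x, axis=2)
            winner = np.unravel_index(dists.argmin(), dists.shape)
            # Gaussian neighbourhood on the lattice: codes near the winner move too
            lattice_dist = np.linalg.norm(coords - np.array(winner), axis=2)
            h = np.exp(-(lattice_dist ** 2) / (2 * sigma ** 2))
            codes += lr * h[..., None] * (x - codes)
        lr *= 0.9       # decay the learning rate
        sigma *= 0.9    # shrink the neighbourhood over time
    return codes

# usage: map 2-D Gaussian data onto a 10x10 lattice
X = np.random.default_rng(1).normal(size=(500, 2))
print(train_som(X).shape)   # (10, 10, 2)
```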
SOM
(Figure slides: a 10x10 SOM trained on data from a Gaussian distribution.)
SOM example
Famous example: Phonetic Typewriter
• The SOM lattice (below left) is trained on spoken letters; after convergence the codes are labeled
• This creates a ‘phonotopic’ map
• A spoken word creates a sequence of labels
Famous example: Phonetic Typewriter
• Criticism
  – The topology-preserving property is not used, so why use SOMs and not, for instance, adaptive k-means?
    • K-means could also create a sequence
    • This is true for most SOM applications!
  – Is using clustering for classification optimal?
Bioinformatics Example: Clustering GPCRs
• Clustering G Protein Coupled Receptors (GPCRs) [Samsanova et al., 2003, 2004]
• Important drug target; function often unknown
Association Rules Outline
• What are frequent item sets & association rules?
• Quality measures
  – Support, confidence, lift
• How to find item sets efficiently?
  – APRIORI
• How to generate association rules from an item set?
• Biological examples
KDnuggets
Market Basket Example / Gene Expression Example
TID Produce
1 MILK, BREAD, EGGS
2 BREAD, SUGAR
3 BREAD, CEREAL
4 MILK, BREAD, SUGAR
5 MILK, CEREAL
6 BREAD, CEREAL
7 MILK, CEREAL
8 MILK, BREAD, CEREAL, EGGS
9 MILK, BREAD, CEREAL
ID Expressed Genes in Sample
1 GENE1, GENE2, GENE5
2 GENE1, GENE3, GENE5
3 GENE2
4 GENE8, GENE9
5 GENE8, GENE9, GENE10
6 GENE2, GENE8
7 GENE9, GENE10
8 GENE2
9 GENE11
• Frequent item set
  – {MILK, BREAD}: support count = 4
• Association rule
  – {MILK, BREAD} => {EGGS}
  – Frequency / importance = 2 (‘support’)
  – Quality = 50% (‘confidence’)
• What genes are expressed (‘active’) together?
  – Interaction / regulation
  – Similar function
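A small sketch that recomputes the numbers above from the market basket table: the support count of {MILK, BREAD}, the support of the rule {MILK, BREAD} => {EGGS}, and its confidence. The Python set representation and helper function are illustrative.

```python
# Support and confidence on the market basket transactions from the slide.
transactions = [
    {"MILK", "BREAD", "EGGS"}, {"BREAD", "SUGAR"}, {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"}, {"MILK", "CEREAL"}, {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"}, {"MILK", "BREAD", "CEREAL", "EGGS"}, {"MILK", "BREAD", "CEREAL"},
]

def support_count(itemset):
    """Number of transactions that contain the whole itemset."""
    return sum(1 for t in transactions if itemset <= t)

antecedent, consequent = {"MILK", "BREAD"}, {"EGGS"}
print(support_count(antecedent))                        # 4: frequency of the item set
print(support_count(antecedent | consequent))           # 2: support of the rule
print(support_count(antecedent | consequent)
      / support_count(antecedent))                      # 0.5: confidence of the rule
```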
Association Rule Definitions
• Set of items: I = {I1, I2, …, Im}
• Transactions: D = {t1, t2, …, tn}, tj ⊆ I
• Itemset: {Ii1, Ii2, …, Iik} ⊆ I
• Support of an itemset: percentage of transactions which contain that itemset
• Large (frequent) itemset: an itemset whose number of occurrences is above a threshold
Dunham, 2003
Frequent Item Set Example
I = { Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread,PeanutButter} is 60%
Dunham, 2003
Association Rule Definitions
• Association Rule (AR): an implication X => Y where X, Y ⊆ I and X, Y are disjoint
• Support of AR X => Y (s): percentage of transactions that contain X ∪ Y
• Confidence of AR X => Y (α): ratio of the number of transactions that contain X ∪ Y to the number that contain X
Dunham, 2003
Association Rule Problem
• Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X => Y with a minimum support and confidence.
• NOTE: the support of X => Y is the same as the support of X ∪ Y.
Dunham, 2003
Association Rules Example
• Q: Given the frequent set {A, B, E}, which association rules have minsup = 2 and minconf = 50%?
A, B => E: conf = 2/4 = 50%
A, E => B: conf = 2/2 = 100%
B, E => A: conf = 2/2 = 100%
E => A, B: conf = 2/2 = 100%

Don’t qualify:
A => B, E: conf = 2/6 = 33% < 50%
B => A, E: conf = 2/7 = 28% < 50%
__ => A, B, E: conf = 2/9 = 22% < 50%
TID List of items
1 A, B, E
2 B, D
3 B, C
4 A, B, D
5 A, C
6 B, C
7 A, C
8 A, B, C, E
9 A, B, C
KDnuggets
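A small sketch that reproduces the confidences above (up to rounding) by enumerating every rule derivable from the frequent set {A, B, E} over the transaction table and checking it against minconf = 50%. The helper names are illustrative.

```python
# Enumerate rules from the frequent set {A, B, E} and compute their confidence
# against the nine transactions listed on the slide.
from itertools import combinations

transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]
itemset = {"A", "B", "E"}

def count(s):
    return sum(1 for t in transactions if s <= t)

for r in range(len(itemset)):                              # antecedent sizes 0, 1 and 2
    for antecedent in map(set, combinations(sorted(itemset), r)):
        conf = count(itemset) / count(antecedent)          # count({}) = all 9 transactions
        status = "ok" if conf >= 0.5 else "below minconf"
        print(f"{sorted(antecedent)} => {sorted(itemset - antecedent)}: {conf:.0%} ({status})")
```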
Solution Association Rule Problem
• First, find all frequent itemsets with support >= minsup
  – Exhaustive search won’t work
    • Assume we have a set of m items: 2^m subsets!
  – Exploit the subset property (APRIORI algorithm)
• For every frequent item set, derive rules with confidence >= minconf
KDnuggets
Finding itemsets: next level
• Apriori algorithm (Agrawal & Srikant)
• Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, …
  – Subset property: if (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
  – In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent
  – Compute candidate k-item sets by merging (k-1)-item sets
KDnuggets
An example
• Given: five frequent three-item sets
  (A B C), (A B D), (A C D), (A C E), (B C D)
• Candidate four-item sets:
  (A B C D)  Q: OK?  A: yes, because all 3-item subsets are frequent
  (A C D E)  Q: OK?  A: no, because (C D E) is not frequent
KDnuggets
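A minimal sketch of the Apriori join-and-prune (candidate generation) step that this example illustrates; the function name and the tuple representation of item sets are illustrative, not taken from a particular library.

```python
# Apriori candidate generation: merge frequent (k-1)-item sets into k-item
# candidates and drop any candidate with an infrequent (k-1)-subset.
from itertools import combinations

def apriori_gen(frequent_kminus1):
    frequent = set(frequent_kminus1)              # sorted tuples, all of size k-1
    k = len(next(iter(frequent))) + 1
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = tuple(sorted(set(a) | set(b)))
            if len(union) == k:
                # subset property: every (k-1)-subset must itself be frequent
                if all(sub in frequent for sub in combinations(union, k - 1)):
                    candidates.add(union)
    return candidates

# the five frequent three-item sets from the slide
threes = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"), ("A", "C", "E"), ("B", "C", "D")]
print(apriori_gen(threes))   # {('A','B','C','D')}; (A C D E) is pruned since (C D E) is not frequent
```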
From Frequent Itemsets to Association Rules
• Q: Given the frequent set {A, B, E}, what are the possible association rules?
  – A => B, E
  – A, B => E
  – A, E => B
  – B => A, E
  – B, E => A
  – E => A, B
  – __ => A, B, E (empty rule), or true => A, B, E
KDnuggets
Example: Generating Rules from an Itemset
• Frequent itemset from the golf data: Humidity = Normal, Windy = False, Play = Yes (4)
• Seven potential rules (with confidence):
  If Humidity = Normal and Windy = False then Play = Yes (4/4)
  If Humidity = Normal and Play = Yes then Windy = False (4/6)
  If Windy = False and Play = Yes then Humidity = Normal (4/6)
  If Humidity = Normal then Windy = False and Play = Yes (4/7)
  If Windy = False then Humidity = Normal and Play = Yes (4/8)
  If Play = Yes then Humidity = Normal and Windy = False (4/9)
  If True then Humidity = Normal and Windy = False and Play = Yes (4/12)
KDnuggets
Example: Generating Rules
• Rules with support > 1 and confidence = 100%:
• In total: 3 rules with support four, 5 with support three, and 50 with support two
#    Association rule                                    Sup.  Conf.
1    Humidity=Normal, Windy=False => Play=Yes             4    100%
2    Temperature=Cool => Humidity=Normal                  4    100%
3    Outlook=Overcast => Play=Yes                         4    100%
4    Temperature=Cool, Play=Yes => Humidity=Normal        3    100%
...  ...                                                  ...  ...
58   Outlook=Sunny, Temperature=Hot => Humidity=High      2    100%
KDnuggets
Weka associations: output
KDnuggets
Extensions and Challenges
• Extra quality measure: lift
  – The lift of an association rule I => J is defined as:
    • lift = P(J|I) / P(J)
    • Note: P(I) = (support of I) / (no. of transactions)
    • I.e. the ratio of confidence to expected confidence
  – Interpretation:
    • if lift > 1, then I and J are positively correlated
    • if lift < 1, then I and J are negatively correlated
    • if lift = 1, then I and J are independent
  – (A worked example on the market basket data follows below.)
• Other measures of interestingness
  – E.g. A => B, B => C, but not A => C
• Efficient algorithms
• Known problem
  – What to do with all these rules? How to exploit them / make them useful / actionable?
KDnuggets
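Worked example using the market basket table from earlier: for the rule {MILK, BREAD} => {EGGS}, the confidence is 2/4 = 0.50, while P(EGGS) = 2/9 ≈ 0.22, so lift ≈ 0.50 / 0.22 ≈ 2.25. Since the lift is well above 1, MILK/BREAD and EGGS are positively correlated in this toy data.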
Biomedical Application: Head and Neck Cancer Example
1.  ace27=0 fiveyr=alive 381 ==> tumorbefore=0 372    conf:(0.98)
2.  gender=M ace27=0 467 ==> tumorbefore=0 455    conf:(0.97)
3.  ace27=0 588 ==> tumorbefore=0 572    conf:(0.97)
4.  tnm=T0N0M0 ace27=0 405 ==> tumorbefore=0 391    conf:(0.97)
5.  loc=LOC7 tumorbefore=0 409 ==> tnm=T0N0M0 391    conf:(0.96)
6.  loc=LOC7 442 ==> tnm=T0N0M0 422    conf:(0.95)
7.  loc=LOC7 gender=M tumorbefore=0 374 ==> tnm=T0N0M0 357    conf:(0.95)
8.  loc=LOC7 gender=M 406 ==> tnm=T0N0M0 387    conf:(0.95)
9.  gender=M fiveyr=alive 633 ==> tumorbefore=0 595    conf:(0.94)
10. fiveyr=alive 778 ==> tumorbefore=0 726    conf:(0.93)
Bioinformatics Application
• The idea of association rules has been customized for bioinformatics applications
• In biology it is often interesting to find frequent structures rather than frequent items
  – For instance protein or other chemical structures
• Solution: mining frequent patterns (subgraphs)
  – FSG (Kuramochi and Karypis, ICDM 2001)
  – gSpan (Yan and Han, ICDM 2002)
  – CloseGraph (Yan and Han, KDD 2002)
FSG: Mining Frequent Patterns
FSG Algorithm for finding frequent subgraphs
Frequent Subgraph Examples: AIDS Data
• Compounds are active, inactive or moderately active (CA, CI, CM)
Predictive Subgraphs
• The three most discriminating sub-structures for the PTC, AIDS, and Anthrax datasets
FSG References
• Frequent Sub-structure Based Approaches for Classifying Chemical Compounds. Mukund Deshpande, Michihiro Kuramochi, and George Karypis. ICDM 2003.
• An Efficient Algorithm for Discovering Frequent Subgraphs. Michihiro Kuramochi and George Karypis. IEEE TKDE.
• Automated Approaches for Classifying Structures. Mukund Deshpande, Michihiro Kuramochi, and George Karypis. BIOKDD 2002.
• Discovering Frequent Geometric Subgraphs. Michihiro Kuramochi and George Karypis. ICDM 2002.
• Frequent Subgraph Discovery. Michihiro Kuramochi and George Karypis. 1st IEEE Conference on Data Mining, 2001.
Recap
• Before Starting to Mine….
• Descriptive Data Mining
  – Dimension Reduction & Projection
  – Clustering
    • Hierarchical clustering
    • K-means
    • Self-organizing maps
  – Association rules
    • Frequent item sets
    • Association rules
    • APRIORI
    • Bioinformatics case: FSG for frequent subgraph discovery
• Next week
  – Bioinformatics Data Mining Cases / Lab Session / Take-Home Exercise