Association Pattern Analysis – Applications in Bioinformatics

Vipin Kumar
University of Minnesota

kumar@cs.umn.edu

www.cs.umn.edu/~kumar

Team Members: Michael Steinbach, Rohit Gupta, Blayne Field, Meenal Chhabra, Beth Zirbes

Work done in collaboration with Hui Xiong, X. He, Chris Ding, Ya Zhang, Stephen R. Holbrook

Research supported by NSF, IBM


June 28, 2006 Association Pattern Analysis – Applications in Bioinformatics 2

Data Mining for Bioinformatics

Recent technological advances are helping to generate large amounts of both medical and genomic data

• High-throughput experiments/techniques
  - Gene and protein sequences
  - Gene-expression data
  - Biological networks and phylogenetic profiles

• Electronic Medical Records
  - IBM-Mayo Clinic partnership has created a DB of 5 million patients
  - Single Nucleotide Polymorphisms (SNPs)

Data mining offers a potential solution for the analysis of large-scale data

• Automated analysis of patients' histories for customized treatment
• Prediction of the functions of anonymous genes
• Identification of putative binding sites in protein structures for drug/chemical discovery

[Figure: Protein Interaction Network]


Association Analysis

• Association analysis: Analyzes relationships among items (attributes) in binary transaction data
  – Example data: market basket data
  – Applications in business and science
    • Identification of functional modules from protein complexes
    • Marketing and Sales Promotion

• Two types of patterns
  – Itemsets: Collection of items
    • Example: {Milk, Diaper}
  – Association Rules: X → Y, where X and Y are itemsets.
    • Example: Milk → Diaper

Set-Based Representation of Data

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Support, $s = \frac{\#\,\text{transactions that contain } X \text{ and } Y}{\text{Total transactions}}$

Confidence, $c = \frac{\#\,\text{transactions that contain } X \text{ and } Y}{\#\,\text{transactions that contain } X}$
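The support and confidence definitions can be checked against the market basket table with a few lines of Python. This is a minimal sketch; the function names `support_count`, `support` and `confidence` are illustrative choices, not taken from the slides.

```python
# Market basket transactions from the slide's table (TID -> items).
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(x, y, transactions):
    """s = (# transactions containing X and Y) / (total transactions)."""
    return support_count(x | y, transactions) / len(transactions)

def confidence(x, y, transactions):
    """c = (# transactions containing X and Y) / (# transactions containing X)."""
    return support_count(x | y, transactions) / support_count(x, transactions)

# Rule from the slide: Milk -> Diaper
s = support({"Milk"}, {"Diaper"}, transactions)
c = confidence({"Milk"}, {"Diaper"}, transactions)
print(f"s = {s:.2f}, c = {c:.2f}")  # prints "s = 0.60, c = 0.75"
```

Milk and Diaper occur together in transactions 3, 4 and 5, and Milk occurs in transactions 1, 3, 4 and 5, which gives the 3/5 and 3/4 above.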


I. Identification of Protein Function Modules

Proteins usually do not function in isolation. They interact with other proteins either in pairs or as components of large complexes

Protein complexes can be determined using large scale experimental studies

A functional module is a group of proteins involved in a common elementary biological function

Association analysis techniques can be used for identification of functional modules from a collection of protein complexes

Protein Complex Data

Protein Complex   Proteins
c1                p1, p2
c2                p1, p3, p4, p5
c3                p2, p3, p4, p6


II. Personalized Medicine

• Given: Patient data set that records
  – Phenotypic characteristics
  – Genetic characteristics (SNPs)
  – Disease

• Objective: Find relationships between disease and medical and genomic characteristics

• Association analysis can be used to find groups of phenotypic and genetic characteristics that are highly associated with disease


III. Protein Function Prediction Using Phylogenetic Profiles

• Phylogenetic profiles:
  – For a given protein, BLAST its sequence against N sequenced genomes
  – Construct a vector with N coordinates such that if the protein has a homolog in organism n, coordinate n is set to 1; otherwise it is set to 0

• Basic Idea: If two proteins, P1 and P2, function/interact together, they must co-evolve. So every organism that has a homolog of P1 must also have a homolog of P2

• Association techniques can be used to identify the protein groups and the functional linkages among them with the help of phylogenetic profiles
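A small sketch of the profile construction and of the "every organism with P1 also has P2" check follows. The homolog hits are made up for illustration (no BLAST run is performed), and the names `homolog_hits`, `profile` and `confidence` are hypothetical.

```python
from itertools import combinations

# Hypothetical BLAST outcome: for each protein, the genomes (indexed 0..N-1)
# in which a homolog was found. These values are invented for illustration.
N = 6
homolog_hits = {
    "P1": {0, 2, 3, 5},
    "P2": {0, 2, 3, 5},
    "P3": {1, 2, 4},
    "P4": {0, 2, 3},
}

def profile(hits, n):
    """Binary phylogenetic profile: coordinate g is 1 if genome g has a homolog."""
    return tuple(1 if g in hits else 0 for g in range(n))

profiles = {p: profile(h, N) for p, h in homolog_hits.items()}

def confidence(a, b):
    """Fraction of genomes with a homolog of `a` that also have a homolog of `b`."""
    both = sum(x & y for x, y in zip(profiles[a], profiles[b]))
    return both / sum(profiles[a])

# Candidate functional linkages: pairs whose presence implies each other.
linked = [(a, b) for a, b in combinations(profiles, 2)
          if confidence(a, b) == 1.0 and confidence(b, a) == 1.0]
print(linked)  # prints [('P1', 'P2')]
```

Treating each genome as a transaction and each protein as an item, this is the confidence measure of association analysis applied to the profile matrix. In practice a threshold below 1.0 would probably be used to tolerate missed homologs.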


Association Analysis

Process of finding interesting patterns:
• Find frequent itemsets using a support threshold
• Find association rules for frequent itemsets
• Sort association rules according to confidence

Support filtering is necessary
• To eliminate spurious patterns
• To avoid exponential search
  - Support has the anti-monotone property: $X \subseteq Y$ implies $s(Y) \le s(X)$

Confidence is used because of its interpretation as conditional probability

[Figure: itemset lattice over the items A, B, C, D, E, from the null set at the top to ABCDE at the bottom]

Given d items, there are $2^d$ possible candidate itemsets
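The role of the anti-monotone property in avoiding the exponential search can be sketched with a level-wise (Apriori-style) search over the market basket data from the earlier slide. This is an illustrative sketch of the first step only (finding frequent itemsets), not the authors' implementation; `frequent_itemsets` and `min_count` are names chosen here.

```python
from itertools import combinations

# Market basket transactions from the earlier slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(min_count):
    """Level-wise search: size-k candidates come only from frequent (k-1)-itemsets."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i])) >= min_count]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support_count(itemset)
        k += 1
        prev = set(level)
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Anti-monotone pruning: a candidate survives only if every
        # (k-1)-subset is itself frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        level = [c for c in candidates if support_count(c) >= min_count]
    return frequent

freq = frequent_itemsets(min_count=3)
for itemset, count in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a support count threshold of 3, only four single items and four pairs survive, and most of the $2^d$ lattice is never counted because supersets of infrequent itemsets are skipped.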


Drawback of Confidence

            Coffee   ¬Coffee   Total
Tea           15        5        20
¬Tea          75        5        80
Total         90       10       100

Association Rule: Tea → Coffee

Confidence = P(Coffee | Tea) = 15/20 = 0.75

but P(Coffee) = 0.9

Although confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 75/80 = 0.9375

Ref: Brin, Motwani
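The numbers on this slide can be reproduced directly from the four cell counts. The sketch below also computes lift (also called interest), which is not named on the slide; it is included here as one common way to expose the problem, since a value below 1 indicates a negative association.

```python
# Cell counts from the slide's Tea/Coffee contingency table.
tea_coffee, tea_no_coffee = 15, 5
no_tea_coffee, no_tea_no_coffee = 75, 5
n = tea_coffee + tea_no_coffee + no_tea_coffee + no_tea_no_coffee

# Confidence of Tea -> Coffee, i.e. P(Coffee | Tea).
p_coffee_given_tea = tea_coffee / (tea_coffee + tea_no_coffee)
# Baseline probability of Coffee.
p_coffee = (tea_coffee + no_tea_coffee) / n
# P(Coffee | not Tea).
p_coffee_given_no_tea = no_tea_coffee / (no_tea_coffee + no_tea_no_coffee)
# Lift (assumption: added for illustration, not on the slide).
lift = p_coffee_given_tea / p_coffee

print(p_coffee_given_tea, p_coffee, p_coffee_given_no_tea)  # 0.75 0.9 0.9375
print(f"lift = {lift:.3f}")  # lift = 0.833
```

Tea drinkers are less likely to drink coffee than the population as a whole, even though the rule's confidence of 0.75 looks high in isolation.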


There are lots of measures proposed in the literature


Comparing Different Measures

10 examples of contingency tables:

Example   f11    f10    f01    f00
E1        8123   83     424    1370
E2        8330   2      622    1046
E3        9481   94     127    298
E4        3954   3080   5      2961
E5        2886   1363   1320   4431
E6        1500   2000   500    6000
E7        4000   2000   1000   3000
E8        4000   2000   2000   2000
E9        1720   7121   5      1154
E10       61     2483   4      7452

[Table: rankings of the contingency tables using various measures, Tan et al. [4]]
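To see how different measures can order the same tables differently, the sketch below ranks E1 to E10 by two measures: interest (lift) and the φ-coefficient. Choosing these two is an assumption made for illustration; the slide's own ranking table covers more measures and is not reproduced here.

```python
from math import sqrt

# (f11, f10, f01, f00) for the ten contingency tables on the slide.
tables = {
    "E1": (8123, 83, 424, 1370),
    "E2": (8330, 2, 622, 1046),
    "E3": (9481, 94, 127, 298),
    "E4": (3954, 3080, 5, 2961),
    "E5": (2886, 1363, 1320, 4431),
    "E6": (1500, 2000, 500, 6000),
    "E7": (4000, 2000, 1000, 3000),
    "E8": (4000, 2000, 2000, 2000),
    "E9": (1720, 7121, 5, 1154),
    "E10": (61, 2483, 4, 7452),
}

def interest(f11, f10, f01, f00):
    """Interest (lift): N * f11 / (f1+ * f+1)."""
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

def phi(f11, f10, f01, f00):
    """Phi-coefficient of the 2x2 table."""
    num = f11 * f00 - f10 * f01
    den = sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return num / den

def rank(measure):
    """Table names ordered from the highest to the lowest value of `measure`."""
    return sorted(tables, key=lambda name: measure(*tables[name]), reverse=True)

print("interest:", rank(interest))
print("phi:     ", rank(phi))
```

On these tables the two measures do not produce the same ordering, which is the point of comparing them: the choice of measure changes which patterns look most interesting.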


h-Confidence

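• h-confidence(i1, i2, ..., ik): the minimum confidence over the rules that have a single item of the pattern as antecedent (Xiong et al. [5,6]):

  $hconf(\{i_1, \ldots, i_k\}) = \min_{j} \, conf(i_j \rightarrow \{i_1, \ldots, i_k\} \setminus \{i_j\}) = \frac{s(\{i_1, \ldots, i_k\})}{\max_{j} s(\{i_j\})}$

Advantages of h-confidence:

1. High h-confidence implies tight coupling amongst all items in the pattern
2. Eliminates cross-support patterns such as {caviar, milk}
3. The min function has the anti-monotone property
   • Low-support, high h-confidence patterns can be discovered efficiently

[Figure labels: milk, caviar]

[5,6] Xiong et al.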
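A minimal sketch of h-confidence follows, assuming the definition from Xiong et al. [5,6]: the support of the pattern divided by the largest support of any single item in it, which equals the smallest confidence among the rules with one item as antecedent. The data is the toy protein complex table from the earlier slide, with complexes as transactions and proteins as items; the function names are illustrative.

```python
# Toy protein complex data from the earlier slide (complex -> proteins).
complexes = {
    "c1": {"p1", "p2"},
    "c2": {"p1", "p3", "p4", "p5"},
    "c3": {"p2", "p3", "p4", "p6"},
}

def support_count(itemset):
    """Number of complexes that contain every protein in `itemset`."""
    return sum(1 for proteins in complexes.values() if set(itemset) <= proteins)

def h_confidence(itemset):
    """supp(itemset) / max single-item support (the minimum rule confidence)."""
    return support_count(itemset) / max(support_count({i}) for i in itemset)

print(h_confidence({"p3", "p4"}))  # prints 1.0
print(h_confidence({"p1", "p2"}))  # prints 0.5
```

Proteins p3 and p4 always appear together, so the pair reaches the maximum h-confidence of 1.0, while p1 and p2 each appear in two complexes but share only one.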


Protein Complex Data

[Figure: Protein Complexes (Higher-order Functions) and Functional Modules (Elementary Functions)]

The TAP-MS dataset by Gavin et al. 2002 [7]: Tandem affinity purification (TAP) – mass spectrometry (MS)

Contains 232 multi-protein complexes formed using over 1300 proteins

Protein Complex   Proteins
c1                p1, p2
c2                p1, p3, p4, p5
c3                p2, p3, p4, p6


Hyperclique Patterns from Protein Complex Data

2 Tif4632 Tif4631
2 Cdc33 Snp1
2 YHR020W Mir1
2 Cka1 Ckb1
2 Ckb2 Cka2
2 Cop1 Sec27
2 Erb1 YER006W
2 Ilv1 YGL245W
2 Ilv1 Sec27
2 Ioc3 Rsc8
2 Isw2 Itc1
2 Kre33 YJL109C
2 Kre33 YPL012W
2 Mot1 Isw1
2 Npl3 Smd3
2 Npl6 Isw2
2 Npl6 Mot1
2 Rad52 Rfa1
2 Rpc40 Rsc8
2 Rrp4 Dis3
2 Rrp40 Rrp46
2 Cbf5 Kre33
3 YGL128C Clf1 YLR424W
3 Cka2 Cka1 Ckb1
3 Has1 Nop12 Sik1
3 Hrr25 Enp1 YDL060W
3 Hrr25 Swi3 Snf2

3 Kre35 Nog1 YGR103W 3 Krr1 Cbf5 Kre33

3 Nab3 Nrd1 YML117W

3 Nog1 YGR103W YER006W

3 Bms1 Sik1 Rpp2b

3 Rpn10 Rpt3 Rpt6

3 Rpn11 Rpn12 Rpn8

3 Rpn12 Rpn8 Rpn10

3 Rpn9 Rpt3 Rpt5

3 Rpn9 Rpt3 Rpt6

3 Brx1 Sik1 YOR206W

3 Sik1 Kre33 YJL109C

3 Taf145 Taf90 Taf60

4 Fyv14 Krr1 Sik1 YLR409C

4 Mrpl35 Mrpl8 YML025C Mrpl3

4 Rpn12 Rpn8 Rpt3 Rpt6

5 Rpn6 Rpt2 Rpn12 Rpn3 Rpn8

5 Ada2 Gcn5 Rpo21 Spt7 Taf60

6 YLR033W Ioc3 Npl6 Rsc2 Itc1 Rpc40

6 Dim1 Ltv1 YOR056C YOR145C Enp1 YDL060W

6 Luc7 Rse1 Smd3 Snp1 Snu71 Smd2

6 Pre3 Pre2 Pre4 Pre5 Pre8 Pup3

7 Clf1 Lea1 Rse1 YLR424W Prp46 Smd2 Snu114

7 Pre1 Pre7 Pre2 Pre4 Pre5 Pre8 Pup3

7 Blm3 Pre10 Pre2 Pre4 Pre5 Pre8 Pup3

8 Clf1 Prp4 Smb1 Snu66 YLR424W Prp46 Smd2 Snu114

8 Pre2 Pre4 Pre5 Pre8 Pup3 Pre6 Pre9 Scl1

10 Cdc33 Dib1 Lsm4 Prp31 Prp6 Clf1 Prp4 Smb1 Snu66 YLR424W

12 Dib1 Lsm4 Prp31 Prp6 Clf1 Prp4 Smb1 Snu66 YLR424W Prp46 Smd2 Snu114

12 Emg1 Imp3 Imp4 Kre31 Mpp10 Nop14 Sof1 YMR093W YPR144C Krr1 YDR449C Enp1

13 Ecm2 Hsh155 Prp19 Prp21 Snt309 YDL209C Clf1 Lea1 Rse1 YLR424W Prp46 Smd2 Snu114

13 Brr1 Mud1 Prp39 Prp40 Prp42 Smd1 Snu56 Luc7 Rse1 Smd3 Snp1 Snu71 Smd2

39 Cus1 Msl1 Prp3 Prp9 Sme1 Smx2 Smx3 Yhc1 YJR084W Brr1 Dib1 Ecm2 Hsh155 Lsm4 Mud1 Prp11 Prp19 Prp21 Prp31 Prp39 Prp40 Prp42 Prp6 Smd1 Snt309 Snu56 Srb2 YDL209C Clf1 Lea1 Luc7 Prp4 Rse1 Smb1 Smd3 Snp1 Snu66 Snu71 YLR424W

List of maximal hyperclique patterns at a support threshold of 0 and an h-confidence threshold of 60%; the first number on each line is the pattern size. [1] Xiong et al. (Detailed results are at http://cimic.rutgers.edu/~hui/pfm/pfm.html)


Functional Group Verification Using Gene Ontology

Hypothesis: Proteins within the same pattern are more likely to perform the same function and participate in the same biological process

Gene Ontology
• Three separate ontologies: Biological Process, Molecular Function, Cellular Component

• Organized as a DAG describing gene products (proteins and functional RNA)

• Collaborative effort between major genome databases

http://www.geneontology.org


Function Annotation for Hyperclique {PRE2 PRE4 PRE5 PRE6 PRE8 PRE9 PUP3 SCL1}

The GO hierarchy shows that the proteins identified in the hyperclique perform the same function and are involved in the same biological process


More Hyperclique Examples

# distinct proteins in cluster = 13

# proteins in one group = 10

(rest denoted as )

# distinct proteins in cluster = 13

# proteins in one group = 12

(rest denoted as )


More Hyperclique Examples..

# distinct proteins in cluster = 12

# proteins in one group = 12

# distinct proteins in cluster = 8

# proteins in one group = 8


More Hyperclique Examples..

# distinct proteins in cluster = 12

# proteins in one group = 10

(rest denoted by )


More Hyperclique Examples..

# distinct proteins in cluster = 10

# proteins in one group = 9

(rest denoted as )


More Hyperclique Examples..

Only two proteins, SRB2 and CSN12, involved in cellular process and development, got clustered together with a group of proteins involved in physiological process

It is observed that 37 of the 39 annotated proteins are responsible for the same molecular function, mRNA splicing via spliceosome

# distinct proteins in cluster = 39

# proteins in one group = 32

# proteins at node ‘mRNA splicing’ = 37


Clustering of Protein Complex Data

• Clustering software CLUTO (http://glaros.dtc.umn.edu/gkhome/views/cluto) is used to cluster the proteins into groups
  - Repeated bisection method is used as the base method for clustering
  - Cosine similarity measure is used to find similarity between proteins
• Parameter to define the maximum number of clusters that could be obtained is set to 100
• Best 17 clusters (as measured by internal similarity) are analyzed as candidates for functional modules
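The cosine similarity between two proteins can be sketched on the toy protein complex table from the earlier slides, representing each protein as a binary vector of complex memberships. This illustrates only the similarity measure; it does not call CLUTO or reproduce its repeated bisection method, and the function name `cosine` is a choice made here.

```python
from math import sqrt

# Toy protein complex data from the earlier slide (complex -> proteins).
complexes = {
    "c1": {"p1", "p2"},
    "c2": {"p1", "p3", "p4", "p5"},
    "c3": {"p2", "p3", "p4", "p6"},
}

def cosine(a, b):
    """Cosine similarity of two proteins' binary complex-membership vectors."""
    in_a = {c for c, members in complexes.items() if a in members}
    in_b = {c for c, members in complexes.items() if b in members}
    if not in_a or not in_b:
        return 0.0
    # For 0/1 vectors the dot product is the number of shared complexes
    # and each norm is the square root of the membership count.
    return len(in_a & in_b) / sqrt(len(in_a) * len(in_b))

print(cosine("p3", "p4"))  # prints 1.0
print(cosine("p1", "p2"))  # prints 0.5
print(cosine("p5", "p6"))  # prints 0.0
```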


Clustering Results – GO Hierarchies

# distinct proteins in cluster = 3

# proteins in one group = 2

(Protein Pho12 not annotated)

# distinct proteins in cluster = 4

# proteins in one group = 4


Clustering Results – GO Hierarchies

# distinct proteins in cluster = 8

# proteins in one group = 8

# distinct proteins in cluster = 7

# proteins in one group = 6

(rest denoted by )


Clustering Results – GO Hierarchies

# distinct proteins in cluster = 5

# proteins in one group = 5

# distinct proteins in cluster = 2

# proteins in one group = 2


Clustering Results – GO Hierarchies

# distinct proteins in cluster = 6

# proteins in one group = 6

# distinct proteins in cluster = 5

# proteins in one group = 5


Clustering Results – GO Hierarchies

Proteins MNN10 and ANP1 (denoted by ), involved in metabolism, got clustered together with a group of proteins involved in physiological process

# distinct proteins in cluster = 6

# proteins in one group = 4


Clustering Results – GO Hierarchies

# distinct proteins in cluster = 11

# proteins in one group = 10

Protein SKN1 (denoted by ), involved in metabolism, got clustered together with proteins involved in cellular physiological process


Clustering Results – GO Hierarchies

# distinct proteins in cluster = 30

# proteins in one group = 22

Group of 22 proteins


Clustering Results – GO Hierarchies

# distinct proteins in cluster = 7

# proteins in one group = 4

(Rest of the 3 proteins are marked as )


Clustering Results – GO Hierarchies

# distinct proteins in cluster = 8

# proteins in one group = 8

# distinct proteins in cluster = 5

# proteins in one group = 5


Clustering Results – GO Hierarchies

# distinct proteins in cluster = 8

# proteins in one group = 6

(rest denoted by )

Proteins AAP1 and VAM6 (denoted by ) got clustered together with proteins involved in the biological process of membrane fusion


Clustering Results – GO Hierarchies

Proteins AAP1 and VAM6 (denoted by ) got clustered together with a group of proteins involved in the biological process of membrane fusion

# distinct proteins in cluster = 8

# proteins in one group = 4

(rest denoted by )


Clustering Results – GO Hierarchies

# distinct proteins in cluster = 7

# proteins in one group = 5

(rest denoted by )


Error Tolerant Itemsets (ETIs)

• An error-tolerant itemset (ETI) can have a fraction ε of the items missing in each transaction.
• Example: see the data in the table
  – Let ε = 1/4. In other words, each transaction needs to have 3/4 (75%) of the items.
  – X = {i1, i2, i3, i4} and Y = {i5, i6, i7, i8} are both ETIs with a support of 4
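The per-transaction tolerance can be sketched as follows: a transaction counts toward the support of an itemset if it contains at least a fraction 1 - ε of its items. The transactions below are made up for illustration and are not the slide's table; they are arranged so that each one misses exactly one item of X or of Y. The function name `eti_support` is hypothetical.

```python
from fractions import Fraction
from math import ceil

# Hypothetical transactions (not the slide's table): each misses one item
# of X = {i1..i4} or of Y = {i5..i8}.
transactions = [
    {"i2", "i3", "i4"},
    {"i1", "i3", "i4"},
    {"i1", "i2", "i4"},
    {"i1", "i2", "i3"},
    {"i6", "i7", "i8"},
    {"i5", "i7", "i8"},
    {"i5", "i6", "i8"},
    {"i5", "i6", "i7"},
]

def eti_support(itemset, transactions, eps):
    """Count transactions holding at least (1 - eps) of the items in `itemset`."""
    needed = ceil((1 - eps) * len(itemset))
    return sum(1 for t in transactions if len(itemset & t) >= needed)

X = {"i1", "i2", "i3", "i4"}
Y = {"i5", "i6", "i7", "i8"}
eps = Fraction(1, 4)  # exact arithmetic avoids float rounding in the threshold

print(eti_support(X, transactions, eps))  # prints 4
print(eti_support(Y, transactions, eps))  # prints 4
print(eti_support(X, transactions, 0))    # prints 0 (exact matching finds nothing)
```

With ε = 0 the check reduces to ordinary support counting, under which neither X nor Y would be found in this data.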


ETIs to Identify Protein Functional Modules

• Groups of proteins are identified as error tolerant itemsets (ETIs)
• ETI relaxes the density constraints of the pattern in both dimensions
• Maximum sparseness allowed: 0.2 (along row) and 0.25 (along column)
• Minimum support: 5 protein complexes
• Gene Ontology is used to validate the following three identified ETIs
  - {CLF1, LEA1, PRP4, PRP46, RSE1, SMB1, SMD2, SNU114, SPP382}
  - {Pre2, Pre4, Pre5, Pre6, Pre8, Pre9, Pup3, Rpt3, Scl1}
  - {Rpn10, Rpn12, Rpn3, Rpn6, Rpn8, Rpn9, Rpt2, Rpt3, Rpt6}


ETI Pattern validated using GO

Pattern: {CLF1, LEA1, PRP4, PRP46, RSE1, SMB1, SMD2, SNU114, SPP382}

Almost all proteins involved in one biological process (mRNA splicing)


More ETI Patterns..

Pattern: {Pre2, Pre4, Pre5, Pre6, Pre8, Pre9, Pup3, Rpt3, Scl1}

All proteins involved in one biological process, ubiquitin-dependent protein catabolism

The hyperclique technique identified the same pattern except for protein RPT3, which is found to have the same function. Relaxing the constraints using the ETI technique helped identify a bigger group


More ETI Patterns..

Pattern: {Rpn10, Rpn12, Rpn3, Rpn6, Rpn8, Rpn9, Rpt2, Rpt3, Rpt6}

All proteins involved in one biological process, ubiquitin-dependent protein catabolism


Concluding Remarks

Hyperclique and ETI patterns show great promise for identifying protein modules and for annotating uncharacterized proteins

Clustering does not perform as well as hypercliques and ETIs for a variety of reasons:
• Each protein gets assigned to some cluster even if there is no right cluster for it
• Modules can be overlapping
• Modules can be of different sizes
• Data is high-dimensional


References

[1] Hui Xiong, X. He, Chris Ding, Ya Zhang, Vipin Kumar, Stephen R. Holbrook, Identification of Functional Modules in Protein Complexes via Hyperclique Pattern Discovery, in Proc. of the Pacific Symposium on Biocomputing (PSB 2005), 2005

[2] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley April 2005

[3] Jinze Liu, Susan Paulsen, Xing Xu, Wei Wang, Andrew Nobel, Jan Prins, Mining Approximate Frequent Item sets in the Presence of Noise: Algorithms and Analysis, SIAM 2006

[4] Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava, Selecting the Right Interestingness Measure for Association Patterns, Proc of the Eighth ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD-2002)

[5] Hui Xiong, Pang-Ning Tan, and Vipin Kumar, Mining Strong Affinity Association Patterns in Data Sets with Skewed Support Distribution, In Proc. of the Third IEEE International Conference on Data Mining (ICDM 2003)

[6] Hui Xiong, Pang-Ning Tan, and Vipin Kumar, Hyperclique Pattern Discovery, Data Mining and Knowledge Discovery Journal, accepted for publication as a regular paper, 2006

[7] A. Gavin et al. Functional organization of the yeast proteome by systematic analysis of protein complexes, Nature,  415:141-147, 2002

[8] Matteo Pellegrini et al., Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles, Proc. Natl. Acad. Sci. USA Vol. 96, pp. 4285–4288, April 1999, Biochemistry


Organizing Committee

Steering Committee Chair
  Vipin Kumar, University of Minnesota
Conference Co-Chairs
  Chid Apte, IBM Research
  David Skillicorn, Queen’s University
Program Co-Chairs
  Srinivasan Parthasarathy, Ohio State University
  Bing Liu, University of Illinois at Chicago
Tutorial Chair
  Pang-Ning Tan, Michigan State University
Workshop Co-Chairs
  Michael Berry, University of Tennessee
  Philip Chan, Florida Institute of Technology
Publicity Chair
  Hui Xiong, Rutgers University

http://www.siam.org/meetings/sdm07/
